FinalProject-Introduction_to_Algorithms-v2.pdf

Final Project: Putting it All Together and Then Some

Part 1: Modeling with Decision Trees

You are a data scientist working at corporation ‘xyz’, you want to predict whether users will sign up for a

subscription based service. You are to see if an online application and see if users sign up for the trial and

use the site for a certain number of days, after which they can upgrade to a basic or premium service. As

users sign up for the free trial, information about them is collected, and at the end of the trial, the site

owners note which users chose to become paying customers.

To minimize noise, the site collects some key information,

1. The site that referred them.

2. geographical location, how many page views, and service chosen.

You are given the following data set -- let’s pretend it comes from a server, and are asked to build a

predictive tree and show its graphical form. This is a machine learning technique, in which, you will gain

more exposure later. In order to complete this project, I will introduce several concepts and code that will

get you running. Your ultimate goal, however, should be to build a tree and print it out.

You are given the following data:

Page 2

• Col is the column index of the criteria to be tested.

• Value is the value that the column must match to get a true result.

• T and fb are decisionnodes, True or False

• Results store a dictionary of results for this branch.

The functions below, divide the set based on criteria that has been supplied. So for example, if you use

divideset(my_data,2,’yes’). It will divide the set into ‘yes’ and ‘no’. uniquecounts presents the mixture of

none who have signed up and those who signed up for premium and basic.

my_data=[['slashdot','USA','yes',18,'None'],

['google','France','yes',23,'Premium'],

['digg','USA','yes',24,'Basic'],

['kiwitobes','France','yes',23,'Basic'],

['google','UK','no',21,'Premium'],

['(direct)','New Zealand','no',12,'None'],

['(direct)','UK','no',21,'Basic'],

['google','USA','no',24,'Premium'],

['slashdot','France','yes',19,'None'],

['digg','USA','no',18,'None'],

['google','UK','no',18,'None'],

['kiwitobes','UK','no',19,'None'],

['digg','New Zealand','yes',12,'Basic'],

['slashdot','UK','no',21,'None'],

['google','UK','yes',18,'Basic'],

['kiwitobes','France','yes',19,'Basic']]

class decisionnode:

def __init__(self,col=-

1,value=None,results=None,tb=None,fb=None):

self.col=col

self.value=value

self.results=results

self.tb=tb

self.fb=fb

Page 3

The question, you should be asking is how to get the first node? The data as you can see is mixed and we

don’t know how good a particular attribute is. Does the site referred, or geographical location determine

sign up for a subscription? What exactly do we split our tree on?

In order to continue with our journey towards prediction, we can use information theory, to calculate

ENTROPY in the set. You will need to use this function to calculate the disorder, (which Entropy

measures), in each set of attributes, say for example.

# Divides a set on a specific column. Can handle numeric

# or nominal values

def divideset(rows,column,value):

# Make a function that tells us if a row is in

# the first group (true) or the second group (false)

split_function=None

if isinstance(value,int) or isinstance(value,float):

split_function=lambda row:row[column]>=value

else:

split_function=lambda row:row[column]==value

# Divide the rows into two sets and return them

set1=[row for row in rows if split_function(row)]

set2=[row for row in rows if not split_function(row)]

return (set1,set2)

def uniquecounts(rows):

results={}

for row in rows:

# The result is the last column

r=row[len(row)-1]

if r not in results: results[r]=0

results[r]+=1 return results

Page 4

We want the best attribute to build our tree upon. In order to see how good an attribute is, the algorithm

first calculates the entropy of the whole group. It then divides up the group by the possible values of each

attribute and calculates the entropy of the two new groups. To determine which attribute is the best to

divide on, the information gain is calculated. Hint: The information gain is the difference between the

current entropy and the weighted average entropy of the new groups. The algorithm calculates the

information gain for every attribute and chooses the one with the highest information gain.

Your function should start off with the following:

Part 2: Building the Graph

Once you have the buildtree, use the following code:

# Entropy is the sum of p(x)log(p(x)) across all

# the different possible results

def entropy(rows):

from math import log

log2=lambda x:log(x)/log(2)

results=uniquecounts(rows)

# Now calculate the entropy

ent=0.0

for r in results.keys():

p=float(results[r])/len(rows)

ent=ent-p*log2(p)

return ent

De f buildtree(rows,scoref=entropy):

If len(rows)==0: return decisionnode()

Current_score=scoref(rows)

Best_gain=0

Best_criteria=None Best_sets=None

Page 5

from PIL import Image,ImageDraw

def drawtree(tree,jpeg='tree.jpg'):

w=getwidth(tree)*100

h=getdepth(tree)*100+120

img=Image.new('RGB',(w,h),(255,255,255))

draw=ImageDraw.Draw(img)

drawnode(draw,tree,w/2,20)

img.save(jpeg,'JPEG')

def drawnode(draw,tree,x,y):

if tree.results==None:

# Get the width of each branch

w1=getwidth(tree.fb)*100

w2=getwidth(tree.tb)*100

# Determine the total space required by this node

left=x-(w1+w2)/2

right=x+(w1+w2)/2

# Draw the condition string

draw.text((x-20,y-

10),str(tree.col)+':'+str(tree.value),(0,0,0))

# Draw links to the branches

draw.line((x,y,left+w1/2,y+100),fill=(255,0,0))

draw.line((x,y,right-w2/2,y+100),fill=(255,0,0))

# Draw the branch nodes

drawnode(draw,tree.fb,left+w1/2,y+100)

drawnode(draw,tree.tb,right-w2/2,y+100)

else:

txt=' \n'.join(['%s:%d'%v for v in

tree.results.items()])

draw.text((x-20,y),txt,(0,0,0))