python
Final Project: Putting it All Together and Then Some
Part 1: Modeling with Decision Trees
You are a data scientist working at corporation ‘xyz’, you want to predict whether users will sign up for a
subscription based service. You are to see if an online application and see if users sign up for the trial and
use the site for a certain number of days, after which they can upgrade to a basic or premium service. As
users sign up for the free trial, information about them is collected, and at the end of the trial, the site
owners note which users chose to become paying customers.
To minimize noise, the site collects some key information,
1. The site that referred them.
2. geographical location, how many page views, and service chosen.
You are given the following data set -- let’s pretend it comes from a server, and are asked to build a
predictive tree and show its graphical form. This is a machine learning technique, in which, you will gain
more exposure later. In order to complete this project, I will introduce several concepts and code that will
get you running. Your ultimate goal, however, should be to build a tree and print it out.
You are given the following data:
Page 2
• Col is the column index of the criteria to be tested.
• Value is the value that the column must match to get a true result.
• T and fb are decisionnodes, True or False
• Results store a dictionary of results for this branch.
The functions below, divide the set based on criteria that has been supplied. So for example, if you use
divideset(my_data,2,’yes’). It will divide the set into ‘yes’ and ‘no’. uniquecounts presents the mixture of
none who have signed up and those who signed up for premium and basic.
my_data=[['slashdot','USA','yes',18,'None'],
['google','France','yes',23,'Premium'],
['digg','USA','yes',24,'Basic'],
['kiwitobes','France','yes',23,'Basic'],
['google','UK','no',21,'Premium'],
['(direct)','New Zealand','no',12,'None'],
['(direct)','UK','no',21,'Basic'],
['google','USA','no',24,'Premium'],
['slashdot','France','yes',19,'None'],
['digg','USA','no',18,'None'],
['google','UK','no',18,'None'],
['kiwitobes','UK','no',19,'None'],
['digg','New Zealand','yes',12,'Basic'],
['slashdot','UK','no',21,'None'],
['google','UK','yes',18,'Basic'],
['kiwitobes','France','yes',19,'Basic']]
class decisionnode:
def __init__(self,col=-
1,value=None,results=None,tb=None,fb=None):
self.col=col
self.value=value
self.results=results
self.tb=tb
self.fb=fb
Page 3
The question, you should be asking is how to get the first node? The data as you can see is mixed and we
don’t know how good a particular attribute is. Does the site referred, or geographical location determine
sign up for a subscription? What exactly do we split our tree on?
In order to continue with our journey towards prediction, we can use information theory, to calculate
ENTROPY in the set. You will need to use this function to calculate the disorder, (which Entropy
measures), in each set of attributes, say for example.
# Divides a set on a specific column. Can handle numeric
# or nominal values
def divideset(rows,column,value):
# Make a function that tells us if a row is in
# the first group (true) or the second group (false)
split_function=None
if isinstance(value,int) or isinstance(value,float):
split_function=lambda row:row[column]>=value
else:
split_function=lambda row:row[column]==value
# Divide the rows into two sets and return them
set1=[row for row in rows if split_function(row)]
set2=[row for row in rows if not split_function(row)]
return (set1,set2)
def uniquecounts(rows):
results={}
for row in rows:
# The result is the last column
r=row[len(row)-1]
if r not in results: results[r]=0
results[r]+=1 return results
Page 4
We want the best attribute to build our tree upon. In order to see how good an attribute is, the algorithm
first calculates the entropy of the whole group. It then divides up the group by the possible values of each
attribute and calculates the entropy of the two new groups. To determine which attribute is the best to
divide on, the information gain is calculated. Hint: The information gain is the difference between the
current entropy and the weighted average entropy of the new groups. The algorithm calculates the
information gain for every attribute and chooses the one with the highest information gain.
Your function should start off with the following:
Part 2: Building the Graph
Once you have the buildtree, use the following code:
# Entropy is the sum of p(x)log(p(x)) across all
# the different possible results
def entropy(rows):
from math import log
log2=lambda x:log(x)/log(2)
results=uniquecounts(rows)
# Now calculate the entropy
ent=0.0
for r in results.keys():
p=float(results[r])/len(rows)
ent=ent-p*log2(p)
return ent
De f buildtree(rows,scoref=entropy):
If len(rows)==0: return decisionnode()
Current_score=scoref(rows)
Best_gain=0
Best_criteria=None Best_sets=None
Page 5
from PIL import Image,ImageDraw
def drawtree(tree,jpeg='tree.jpg'):
w=getwidth(tree)*100
h=getdepth(tree)*100+120
img=Image.new('RGB',(w,h),(255,255,255))
draw=ImageDraw.Draw(img)
drawnode(draw,tree,w/2,20)
img.save(jpeg,'JPEG')
def drawnode(draw,tree,x,y):
if tree.results==None:
# Get the width of each branch
w1=getwidth(tree.fb)*100
w2=getwidth(tree.tb)*100
# Determine the total space required by this node
left=x-(w1+w2)/2
right=x+(w1+w2)/2
# Draw the condition string
draw.text((x-20,y-
10),str(tree.col)+':'+str(tree.value),(0,0,0))
# Draw links to the branches
draw.line((x,y,left+w1/2,y+100),fill=(255,0,0))
draw.line((x,y,right-w2/2,y+100),fill=(255,0,0))
# Draw the branch nodes
drawnode(draw,tree.fb,left+w1/2,y+100)
drawnode(draw,tree.tb,right-w2/2,y+100)
else:
txt=' \n'.join(['%s:%d'%v for v in
tree.results.items()])
draw.text((x-20,y),txt,(0,0,0))