ECON-575-Data Analytics-Problem Set 2 – Peter Dorner

Problem 1

I was working as an intern at an insurance firm where all staff worked in an open office, so everyone could see what happened in the office. The secretary was asked to provide a list of all clients with arrears of more than $200. In addition, the organization's regional manager needed the list classified by the client's gender and location. The secretary instructed the IT intern to provide the list within an hour. The IT intern searched for the clients in the company's data files but could not produce the list within the required time. The secretary was angry with the IT intern and blamed him for incompetence. The manager summoned the intern, who explained the problem: he could not produce the list because the clients' data had been uploaded to the system without following any format.

This is an appropriate example of a classification problem, because the data available in the organization at the time had been uploaded to the system without following the required format or procedure, such as organizing it by gender, by the client's geographical location, or even by the payments clients had made. A supervised segmentation method can help by classifying the data into predetermined classes or groups.

Target variable:

Arrears over $200: whether the client's arrears are over $200

Male or female: the gender of the client

From New York: whether the client is inside or outside the expected New York location or region
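
The arrears target above can be sketched in code. The following is a minimal Python illustration, not the firm's actual system: the client records, the field names, and the `add_target` and `group_flagged` helpers are all hypothetical. It flags clients whose arrears exceed $200 and then groups the flagged clients by gender and location, as the regional manager requested.

```python
from collections import defaultdict

# Hypothetical client records; field names and values are invented for illustration.
clients = [
    {"name": "Client A", "gender": "F", "location": "New York", "arrears": 350.0},
    {"name": "Client B", "gender": "M", "location": "New York", "arrears": 120.0},
    {"name": "Client C", "gender": "M", "location": "Boston", "arrears": 200.0},
    {"name": "Client D", "gender": "F", "location": "Boston", "arrears": 510.0},
]

ARREARS_THRESHOLD = 200.0  # "more than $200", so exactly $200 does not qualify


def add_target(records, threshold=ARREARS_THRESHOLD):
    """Return copies of the records with a binary arrears flag added."""
    return [{**r, "arrears_over_200": r["arrears"] > threshold} for r in records]


def group_flagged(records):
    """Group the names of flagged clients by (gender, location)."""
    groups = defaultdict(list)
    for r in records:
        if r["arrears_over_200"]:
            groups[(r["gender"], r["location"])].append(r["name"])
    return dict(groups)


labeled = add_target(clients)
report = group_flagged(labeled)
for key, names in sorted(report.items()):
    print(key, names)
```

With records stored in a consistent format like this, the list the secretary needed reduces to one filter and one grouping step.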

David Zirkle
[1/2 point deduction] – Your project needs 1 target not three.

The use that supports this solution is easier retrieval of data from the mass of client records. When the target variables are well defined, it becomes easier to retrieve data that share common features.

Three attributes can help in setting the target variables. The first is the nature of the distribution: data with a similar distribution, such as a uniform distribution, are classified under a single target variable. The second is the purpose of the data, that is, what the retrieved data are meant to achieve. The third is the level of independence of the variable, that is, whether it is independent or dependent.

Problem 2

One probability that can be taken from the decision tree concerns the dependence of future occurrences on past occurrences or data. A good example is the following target variable from the list provided: ‘previous_cancelations: if the person who made the reservation has canceled before.’

The hotel can use the decision tree to determine its current status and the ways in which it can be improved. For instance, by identifying the pattern in guest cancellations, it is possible to determine whether past data on guests who canceled their visits has anything to do with future cancellations. This would enable the hotel's management to determine the areas that need improvement and the strategies that can be applied to prevent future guest cancellations.
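
As a rough illustration of checking whether past cancellations relate to current ones, the sketch below compares cancellation rates on either side of a single split on previous cancellations. It is only a sketch under stated assumptions: the booking rows are invented, the `cancellation_rate` helper is hypothetical, and none of it reflects the actual hotel data or the tree built for this assignment.

```python
# Hypothetical bookings: (previous_cancellations, canceled_this_time).
bookings = [
    (0, False), (0, False), (0, True), (0, False),
    (1, True), (2, True), (1, False), (0, False),
]


def cancellation_rate(rows):
    """Share of bookings in rows that were canceled; None if rows is empty."""
    if not rows:
        return None
    return sum(1 for _, canceled in rows if canceled) / len(rows)


# Split the bookings the way one decision-tree node on this attribute would.
repeat = [b for b in bookings if b[0] > 0]
first_time = [b for b in bookings if b[0] == 0]

rate_repeat = cancellation_rate(repeat)
rate_first_time = cancellation_rate(first_time)
print(f"P(cancel | canceled before) = {rate_repeat:.2f}")
print(f"P(cancel | never canceled)  = {rate_first_time:.2f}")
```

A large gap between the two rates would suggest that a guest's cancellation history is worth attention when management looks for ways to reduce cancellations.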

David Zirkle
[1/2 point deduction] – Your answer did not (or did not correctly) use the concept of information gain.
David Zirkle
[1/2 point deduction] – Your answer did not clearly explain how you would use the model. This just pretty much restates part A.
David Zirkle
[1/2 point deduction] – Your answer did not (or did not correctly) use the concept of entropy.

The results of the analysis are limited by the need for clearly defined target variables. The presence of multiple target variables would make it much more complex to define the dataset needed.

David Zirkle
[2 point deduction] – Your answer did not discuss anything that indicates that you completed the BigML assignment.