ps7.docx

Problem 1

Part of the challenge of data mining text is that the sequence and context of words matter in communication. Consider the use of the word “good” in a movie review. Briefly explain how the word “good” could be used to convey both positive and negative feelings about a movie, why this highlights the importance of context, and whether you believe there is a way to work around this problem.
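To make the setup concrete, here is a minimal Python sketch of how a word-count representation can lose context. The two reviews are invented for illustration, and the helper names `tokens` and `ngrams` are hypothetical, not taken from the course materials. It is only a starting point for thinking about the question, not a complete answer.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(text, n):
    """Count every run of n consecutive words in the text."""
    words = tokens(text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Hypothetical reviews, invented for illustration only.
positive = "The acting was good"
negative = "The acting was not good"

# Counted as single words, both reviews contain "good" the same number of times.
print(ngrams(positive, 1)[("good",)], ngrams(negative, 1)[("good",)])

# Counted as word pairs, only the second review contains "not good".
print(ngrams(positive, 2)[("not", "good")], ngrams(negative, 2)[("not", "good")])
```

Word pairs capture this one negation, but they would still miss sarcasm or a negation placed several words away, so the sketch shows only one possible partial workaround.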

Problem 2

This module provided an overview of a handful of other commonly used data mining techniques.

Consider a problem from your current or a past job, a hobby, or an interest that would make for a good application of one of the following techniques:

• Text-based data mining

• Co-occurrence grouping and associations

• Profiling

• Link prediction

Describe why this would be an appropriate example of a problem that can be solved with one of the methods above and what the use of the results of this analysis would be.

Please do not choose a hypothetical example, such as something from the textbook or the slides. It should be something with which you have personal experience (yes, this problem is like problem 2 from problem set 2).

Problem 3

You have been hired by a hotel chain to take another crack at improving their booking and profitability. Armed with more data mining knowledge than ever before, you decide to once again create a classification decision tree model to predict cancelations, only this time you bring in the big guns: ensemble methods.

Target variable:

· is_canceled: whether the reservation was canceled

Attributes:

· hotel_type: whether the hotel is a “resort” or “city” hotel

· summer: whether the reservation was made for the summer season or not

· children: whether children are listed on the reservation

· previous_cancelations: whether the person who made the reservation has canceled before

We have three different tree induction models. Evaluate each model on the test set:

1. A regular single decision tree

https://bigml.com/shared/evaluation/xTXf88MOhwF3cLqmAOBkqOTh9rA

2. An ensemble of trees using random forests (which BigML calls “decision forests”)

https://bigml.com/shared/evaluation/iDLqmKeWNuwr6kDGBK2XFM3ZarD

3. An ensemble of trees using boosting (which BigML calls “boosted trees”)

https://bigml.com/shared/evaluation/uMi3GEWbLih6L5f1q1soFA08kiX

Finally, describe and compare the performance of each model and comment on whether their relative performance met your expectations.
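When comparing the three evaluations, it can help to recompute the standard metrics from each confusion matrix yourself. The sketch below is one way to do that in Python. The function name `evaluate` is hypothetical, and the counts in the usage example are placeholders invented for illustration. They are not the values from the BigML evaluations linked above, so substitute the counts you read from each evaluation page.

```python
def evaluate(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts.

    tp: canceled reservations predicted as canceled
    fp: kept reservations predicted as canceled
    fn: canceled reservations predicted as kept
    tn: kept reservations predicted as kept
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a model never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Placeholder counts, NOT taken from the linked evaluations.
placeholder_counts = {
    "single tree": (50, 25, 50, 75),
    "random forest": (60, 20, 40, 80),
    "boosted trees": (55, 20, 45, 80),
}

for name, counts in placeholder_counts.items():
    metrics = evaluate(*counts)
    summary = ", ".join(f"{key}={value:.3f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Looking at precision and recall alongside accuracy matters here because missing a cancelation and wrongly predicting one may carry different costs for the hotel.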