Big Data Analytics Tools Exam and project
Big Data Analytics Tools./ Final Exam/FINAL EXAM - 2018.docx
FINAL EXAM NAME: _________________________
1. When we evaluate models we often discuss things like predictive accuracy, speed, robustness, scalability, and interpretability. Briefly discuss what is meant by “interpretability” and why it is important.
2. You have been hired by the county government to help automate a system to detect fraudulent spending by government employees.
You have been given a database of transactions for the past 10 years to work with. Each record in the database contains all of the details of each transaction as well as information related to the particular employee. In this database accountants have manually gone through the data and marked each transaction as either “Good” or “Fraudulent”.
The goal – build a model based on the historical data that will flag future transactions as either “Good” or “Fraudulent”. This will eliminate the need for the accountants to have to go through each transaction manually in the future. What type of modeling technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why?
3. AT&T has been losing customers to Verizon. They want to try to understand why this is the case. They have customer records for the past 5 years that contain demographic information (age, gender, etc.) for the customers, the type of plan that they have, the number of interactions they have had with customer support and whether or not those customers left AT&T.
AT&T wants you to build a model that can be used to predict whether or not a customer is going to leave and switch to another provider. What type of technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why?
4. Kroger is trying to find ways to improve sales. They have all of their receipts for the past 5 years. The receipts contain information about what was purchased, who purchased, and the date and time of the transaction.
Your task is to analyze sales patterns and make recommendations with respect to store layouts that, you hope, will increase sales. What type of modeling technique (e.g., decision trees, association analysis, clustering, etc.) would you use and why?
5. You work for a cable service provider. You provide a variety of services for your customers. Your company provides cable TV, home phone, security systems, and internet services.
Your customer base is very diverse. Your customers could be male/female, young/old, single/married/married with children, etc. You have a wide range of ethnic backgrounds and income levels.
You want to make your marketing campaigns more effective. This means targeting the right groups with the right messages using the right media. (For example, marketing via social media may be more or less effective for 18 year olds as compared to 80 year olds).
You have been tasked to use the customer database and determine what the different customer segments are and what they look like. Then, once you figure out what the unique groups are, you can go through and develop a targeted campaign for each group. What type of modeling technique (e.g., decision trees, association analysis, clustering, etc.) would you use to determine the different market sectors and why?
Text and Web Analytics
6. When we do text analytics, we read in the data, we transform the data into documents, and then we must generate a term/document matrix. This term/document matrix is what we use to perform analysis.
Generation of the term/document matrix involves some processing of the document (see figure on the right).
Briefly describe each step and what it does.
a. Tokenize:
b. Transform Cases:
c. Filter Stopwords (English):
d. Filter Tokens by Length:
e. Stem (Porter):
7. Briefly describe what “Sentiment Analysis” is and how it might be used by a company.
8. What is the difference between text mining and data mining?
9. FINAL EXAM – 2018 – BETTER UNDERSTAND ATTRITION “projects”
Write me a short report that tells me the following (I’d like for this report to be uploaded in a separate standalone word file and look like something you would give an employer):
Business Scenario – write this like you worked for the company. Tell me what the issue is you are exploring and why.
What you did and why you did it – just discuss the technique you used, why it was appropriate and what you did. If you did several iterations, let me know what the final configuration was. I don’t need to know everything that went on – just what you did to get the final results.
What you found – tell me everything you found/learned. Include screen shots, graphs, etc. Anything appropriate to communicate what you found. Do NOT show me everything that was generated – just those things that support your “findings”.
Recommendations - What impact this would have to the business AND what your recommendations are for the business.
Big Data Analytics Tools./ Final Exam/HR-BalancedSheet.sta
Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Association Analysis.pptx
Association Analysis BINS 4352
Learning Objectives
Gain an understanding of how Association Analysis is used
Understand how Associations are created and how to interpret/evaluate those Associations
Discuss and understand Association metrics – Lift, Support, and Confidence
Gain familiarity with RapidMiner
Association Analysis (Market Basket Analysis)
This is a widely used and, in many ways, one of the most successful data mining algorithms.
It can be used to determine what products people purchase together.
Uses
Stores can use this information to determine store layout and product placement
Direct marketers can use this information to determine which new products to offer to their current customers.
Inventory policies can be improved if reorder points reflect the demand for the complementary products.
Any application where you are looking to see if there is a pattern where strong associations are present
Parable Of “Beer And Diapers”
Customers who bought diapers at a grocery store between 5 and 7 pm also tended to buy beer.
This is a good example of the business value present in big data analytics.
More than a parable: it was the result of a study commissioned by Osco in the 1990s and represented a starting point in big data analytics
The finding led to the notion that discovering uncommon relationships in data can be used to drive business value.
Association Rules for Market Basket Analysis
Rules are written in the form “left-hand side implies right-hand side” and an example is:
Yellow Peppers IMPLIES Red Peppers, Bananas
To make effective use of a rule, three numeric measures about that rule must be considered:
(1) support
(2) confidence
(3) lift
Measures of Predictive Ability
Support and Confidence: An Illustration
Transaction 1: A, B, C
Transaction 2: A, C, D
Transaction 3: B, C, D
Transaction 4: A, D, E
Transaction 5: B, C, E
| RULE | SUPPORT | CONFIDENCE | LIFT |
| A => D | 2/5 | 2/3 | (2/3)/(2/5) = 1.67 |
| C => A | 2/5 | 2/4 | (2/4)/(2/5) = 1.25 |
| A => C | 2/5 | 2/3 | (2/3)/(2/5) = 1.67 |
| B & C => D | 1/5 | 1/3 | (1/3)/(1/5) = 1.67 |
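The figures in this table can be reproduced with a short Python sketch. It assumes the five transactions in the illustration are {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, and {B, C, E}, and it uses the definition of lift applied in the table (confidence divided by the support of the rule). The function names are illustrative and do not come from RapidMiner or any other tool.

```python
from fractions import Fraction

# The five transactions from the illustration (assumed grouping: three items each).
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]


def support(lhs, rhs, transactions=TRANSACTIONS):
    """Fraction of all transactions that contain every item in lhs and rhs."""
    both = set(lhs) | set(rhs)
    hits = sum(1 for t in transactions if both <= t)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs, transactions=TRANSACTIONS):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    lhs_items = set(lhs)
    both = lhs_items | set(rhs)
    lhs_hits = sum(1 for t in transactions if lhs_items <= t)
    both_hits = sum(1 for t in transactions if both <= t)
    return Fraction(both_hits, lhs_hits)


def lift(lhs, rhs, transactions=TRANSACTIONS):
    """Lift as used in the table above: confidence / support of the rule."""
    return confidence(lhs, rhs, transactions) / support(lhs, rhs, transactions)


if __name__ == "__main__":
    rules = [(["A"], ["D"]), (["C"], ["A"]), (["A"], ["C"]), (["B", "C"], ["D"])]
    for lhs, rhs in rules:
        name = " & ".join(lhs) + " => " + " & ".join(rhs)
        print(name, support(lhs, rhs), confidence(lhs, rhs),
              round(float(lift(lhs, rhs)), 2))
```

Fraction keeps the values exact and reduces them automatically, so the confidence of C => A prints as 1/2 rather than 2/4.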
A Note On Lift
Lift is an interesting measurement and one that has undergone a great deal of scrutiny
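For our purposes we define Lift as Confidence/Support, where Support is the support of the rule itself
However, there are other ways to calculate this measure; the most common alternative divides Confidence by the support of the right-hand side alone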
Some have argued that one must take into account the frequency of the observation
You don’t necessarily want a product that is in 100,000 transactions to be penalized over a product that is involved in 10 transactions simply due to the number of occurrences (or vice versa)
As such, when looking at this value in a tool, keep in mind that it is the “relative” value that is important and not the “absolute” value.
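How much the number depends on the formula can be seen in a small Python sketch. It compares the Confidence/Support calculation used in these slides with one common alternative, confidence divided by the support of the right-hand side. The five transactions are assumed to be those from the earlier illustration, and the function names are made up for this example; a given tool may use yet another formula.

```python
from fractions import Fraction

# Assumed to be the five transactions from the Support and Confidence illustration.
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]


def share_containing(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    return Fraction(sum(1 for t in transactions if wanted <= t), len(transactions))


def lift_rule_support(lhs, rhs, transactions=TRANSACTIONS):
    """Lift as defined in these slides: confidence / support of the whole rule."""
    rule_support = share_containing(set(lhs) | set(rhs), transactions)
    conf = rule_support / share_containing(lhs, transactions)
    return conf / rule_support


def lift_rhs_support(lhs, rhs, transactions=TRANSACTIONS):
    """Common alternative: confidence / support of the right-hand side only."""
    rule_support = share_containing(set(lhs) | set(rhs), transactions)
    conf = rule_support / share_containing(lhs, transactions)
    return conf / share_containing(rhs, transactions)


if __name__ == "__main__":
    for lhs, rhs in [(["A"], ["D"]), (["C"], ["A"]), (["B", "C"], ["D"])]:
        print(lhs, "=>", rhs,
              round(float(lift_rule_support(lhs, rhs)), 2),
              round(float(lift_rhs_support(lhs, rhs)), 2))
```

The two formulas give different absolute values for the same rule, which is why the values a tool reports are best compared with one another under the same formula.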
Market Basket Analysis Methodology
We first need a list of transactions and what was purchased.
Receipts from stores
This may have to be “reformatted” depending on the tool that you’re using
Next, we choose a list of products to analyze, and tabulate how many times each was purchased with the others.
The diagonal of the table shows how often each product is purchased in any combination, and the off-diagonal cells show how often each pair of products was bought together.
A Convenience Store Example
Consider the following simple example about five transactions at a convenience store:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
These need to be cross tabulated and displayed in a table.
A Convenience Store Example (cont)
The diagonal shows how many times a product was purchased (in any combination)
Pizza and Cola sell together more often than any other combo; a cross-marketing opportunity?
Milk sells well with everything – people probably come here specifically to buy it.
| Product Bought | Pizza also | Milk also | Cola also | Chips also | Pretzels also |
| Pizza | 2 | 1 | 2 | 0 | 0 |
| Milk | 1 | 3 | 1 | 1 | 1 |
| Cola | 2 | 1 | 3 | 0 | 1 |
| Chips | 0 | 1 | 0 | 1 | 0 |
| Pretzels | 0 | 1 | 1 | 0 | 2 |
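The cross tabulation for the convenience store example can be sketched in Python as follows. The product names are shortened to match the table headings ("Pizza" for frozen pizza, "Chips" for potato chips), and `cross_tabulate` is an illustrative helper, not part of any particular tool.

```python
from itertools import product

PRODUCTS = ["Pizza", "Milk", "Cola", "Chips", "Pretzels"]

# The five convenience store transactions, using the short names from the table.
TRANSACTIONS = [
    {"Pizza", "Cola", "Milk"},
    {"Milk", "Chips"},
    {"Cola", "Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]


def cross_tabulate(transactions, products):
    """Return counts[row][col] = number of transactions containing both products.

    Pairing a product with itself fills the diagonal, i.e. how many
    transactions contain that product in any combination.
    """
    counts = {row: {col: 0 for col in products} for row in products}
    for basket in transactions:
        for row, col in product(basket, repeat=2):
            counts[row][col] += 1
    return counts


if __name__ == "__main__":
    table = cross_tabulate(TRANSACTIONS, PRODUCTS)
    print("Product".ljust(10) + "".join(p.ljust(10) for p in PRODUCTS))
    for row in PRODUCTS:
        print(row.ljust(10) + "".join(str(table[row][col]).ljust(10) for col in PRODUCTS))
```

Because each pair is counted in both orders, the resulting table is symmetric, just like the one above.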
Using The Results
The tabulations can immediately be translated into association rules and the numerical measures computed.
Comparing this week’s table to last week’s table can immediately show the effect of this week’s promotional activities.
But you need to be careful that the results were not impacted by some external event (e.g., bad weather)
Some rules are going to be trivial (hot dogs and buns sell together) or inexplicable (toilet rings sell only when a new hardware store is opened).
Using The Results Barbie® => Candy
Forbes (Palmeri 1997) reported that a major retailer had determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars. The retailer was unsure what to do with this nugget. The online newsletter Knowledge Discovery Nuggets invited suggestions (Piatetsky-Shapiro 1998).
Put them closer together in the store.
Put them far apart in the store.
Package candy bars with the dolls.
Package Barbie + candy + poorly selling item.
Raise the price on one, lower it on the other.
Barbie accessories for proofs of purchase.
Do not advertise candy and Barbie together.
Offer candies in the shape of a Barbie Doll.
Augmenting Data to Yield More Insights
The sales data can be augmented with the addition of virtual items.
For example, we could record that the customer was new to us or had children.
The transaction record might look like:
Item 1: Sweater Item 2: Jacket Item 3: New
This might allow us to see what patterns new customers have versus existing customers.
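A minimal Python sketch of this idea follows. The helper names and the "Has children" label are hypothetical, chosen only for illustration; the virtual item "New" matches the transaction record shown above. Once a virtual item is in the basket, it is treated like any other item when the associations are computed.

```python
from collections import Counter


def add_virtual_items(items, is_new_customer=False, has_children=False):
    """Return a copy of the basket with virtual items appended.

    "New" and "Has children" are illustrative labels, not a fixed standard.
    """
    augmented = list(items)
    if is_new_customer:
        augmented.append("New")
    if has_children:
        augmented.append("Has children")
    return augmented


def items_bought_with(virtual_item, baskets):
    """Count the real items in baskets that carry the given virtual item."""
    counts = Counter()
    for basket in baskets:
        if virtual_item in basket:
            counts.update(item for item in basket if item != virtual_item)
    return counts


if __name__ == "__main__":
    # The transaction record from the slide: a new customer buying a sweater and a jacket.
    record = add_virtual_items(["Sweater", "Jacket"], is_new_customer=True)
    print(record)
    print(items_bought_with("New", [record]))
```

Comparing the counts for baskets that carry "New" with those that do not is one simple way to see how new customers differ from existing ones.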