Approaches to Data Mining


CHAPTER 21 Cases

21.1 Charles Book Club [1]

CharlesBookClub.csv is the dataset for this case study.

The Book Industry

Approximately 50,000 new titles, including new editions, are published each year in the United States, giving rise to a $25 billion industry in 2001. In terms of percentage of sales, this industry may be segmented as follows:

16% Textbooks

16% Trade books sold in bookstores

21% Technical, scientific, and professional books

10% Book clubs and other mail-order books

17% Mass-market paperbound books

20% All other books

Book retailing in the United States in the 1970s was characterized by the growth of bookstore chains located in shopping malls. The 1980s saw increased purchases in bookstores stimulated through the widespread practice of discounting. By the 1990s, the superstore concept of book retailing gained acceptance and contributed to double-digit growth of the book industry. Conveniently situated near large shopping centers, superstores maintain large inventories of 30,000–80,000 titles and employ well-informed sales personnel. Book retailing changed fundamentally with the arrival of Amazon, which started out as an online bookseller and, as of 2015, was the world’s largest online retailer of any kind. Amazon’s margins were small and the convenience factor high, putting intense competitive pressure on all other book retailers. Borders, one of the two major superstore chains, discontinued operations in 2011.

[1] The Charles Book Club case was derived, with the assistance of Ms. Vinni Bhandari, from The Bookbinders Club, a Case Study in Database Marketing, prepared by Nissan Levin and Jacob Zahavi, Tel Aviv University; used with permission.

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition. Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr. © 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.

Subscription-based book clubs offer an alternative model that has persisted, though it too has suffered from the dominance of Amazon.

Historically, book clubs offered their readers different types of membership programs. Two common membership programs are the continuity and negative option programs, which are both extended contractual relationships between the club and its members. Under a continuity program, a reader signs up by accepting an offer of several books for just a few dollars (plus shipping and handling) and an agreement to receive a shipment of one or two books each month thereafter at more-standard pricing. The continuity program is most common in the children’s book market, where parents are willing to delegate the rights to the book club to make a selection, and much of the club’s prestige depends on the quality of its selections.

In a negative option program, readers get to select how many and which additional books they would like to receive. However, the club’s selection of the month is delivered to them automatically unless they specifically mark “no” on their order form by a deadline date. Negative option programs sometimes result in customer dissatisfaction and always give rise to significant mailing and processing costs.

In an attempt to combat these trends, some book clubs have begun to offer books on a positive option basis, but only to specific segments of their customer base that are likely to be receptive to specific offers. Rather than expanding the volume and coverage of mailings, some book clubs are beginning to use database-marketing techniques to target customers more accurately. Information contained in their databases is used to identify who is most likely to be interested in a specific offer. This information enables clubs to design special programs carefully tailored to meet their customer segments’ varying needs.

Database Marketing at Charles

The Club. The Charles Book Club (CBC) was established in December 1986 on the premise that a book club could differentiate itself through a deep understanding of its customer base and by delivering uniquely tailored offerings. CBC focused on selling specialty books by direct marketing through a variety of channels, including media advertising (TV, magazines, newspapers) and mailing. CBC is strictly a distributor and does not publish any of the books that it sells. In line with its commitment to understanding its customer base, CBC built and maintained a detailed database about its club members. Upon enrollment, readers were required to fill out an insert and mail it to CBC. Through this process, CBC created an active database of 500,000 readers; most were acquired through advertising in specialty magazines.

The Problem. CBC sent mailings to its club members each month containing the latest offerings. On the surface, CBC appeared very successful: mailing volume was increasing, book selection was diversifying and growing, and their customer database was increasing. However, their bottom-line profits were falling. The decreasing profits led CBC to revisit their original plan of using database marketing to improve mailing yields and to stay profitable.

A Possible Solution. CBC embraced the idea of deriving intelligence from their data to allow them to know their customers better and enable multiple targeted campaigns where each target audience would receive appropriate mailings. CBC’s management decided to focus its efforts on the most profitable customers and prospects, and to design targeted marketing strategies to best reach them. The two processes they had in place were:

1. Customer acquisition:

◦ New members would be acquired by advertising in specialty magazines, newspapers, and on TV.

◦ Direct mailing and telemarketing would contact existing club members.

◦ Every new book would be offered to club members before general advertising.

2. Data collection:

◦ All customer responses would be recorded and maintained in the database.

◦ Any information not being collected that is critical would be requested from the customer.

For each new title, they decided to use a two-step approach:

1. Conduct a market test involving a random sample of 4000 customers from the database to enable analysis of customer responses. The analysis would create and calibrate response models for the current book offering.

2. Based on the response models, compute a score for each customer in the database. Use this score and a cutoff value to extract a target customer list for direct-mail promotion.
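The second step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer IDs, the scores, and the cutoff of 0.30 are invented values, and `select_targets` is a hypothetical helper rather than anything defined in the case.

```python
def select_targets(scores, cutoff):
    """Return customer IDs whose score is at or above the cutoff, best first.

    `scores` maps a customer ID to a model score (for example, an estimated
    probability of purchase). The IDs and scores used below are hypothetical.
    """
    chosen = [(cid, s) for cid, s in scores.items() if s >= cutoff]
    chosen.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in chosen]


# Hypothetical scores, as a response model fitted on the market test might produce.
scores = {"C001": 0.42, "C002": 0.08, "C003": 0.31, "C004": 0.19}

# Assumed cutoff; in practice it would be chosen from the validation results.
target_list = select_targets(scores, cutoff=0.30)
print(target_list)
```

Raising the cutoff shortens the mailing list and lowering it lengthens the list, so the cutoff is the lever that trades mailing cost against the number of expected buyers reached.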

Targeting promotions was considered to be of prime importance. Other opportunities to create successful marketing campaigns based on customer behavior data (returns, inactivity, complaints, compliments, etc.) would be addressed by CBC at a later stage.


Art History of Florence. A new title, The Art History of Florence, is ready for release. CBC sent a test mailing to a random sample of 4000 customers from its customer base. The customer responses have been collated with past purchase data. The dataset was randomly partitioned into three parts:

Training Data (1800 customers): initial data to be used to fit models

Validation Data (1400 customers): holdout data used to compare the performance of different models

Test Data (800 customers): data to be used only after a final model has been selected, to estimate the probable performance of the model when it is deployed

Each row (or case) in the spreadsheet (other than the header) corresponds to one market test customer. Each column is a variable, with the header row giving the name of the variable. The variable names and descriptions are given in Table 21.1.

Data Mining Techniques

Various data mining techniques can be used to mine the data collected from the market test. No one technique is universally better than another. The particular context and the particular characteristics of the data are the major factors in determining which techniques perform better in an application. For this assignment, we focus on two fundamental techniques: k-nearest neighbors and logistic regression. We compare them with each other as well as with a standard industry practice known as RFM (recency, frequency, monetary) segmentation.

TABLE 21.1 LIST OF VARIABLES IN CHARLES BOOK CLUB DATASET

Variable Name      Description
Seq#               Sequence number in the partition
ID#                Identification number in the full (unpartitioned) market test dataset
Gender             0 = Male, 1 = Female
M                  Monetary: total money spent on books
R                  Recency: months since last purchase
F                  Frequency: total number of purchases
FirstPurch         Months since first purchase
ChildBks           Number of purchases from the category child books
YouthBks           Number of purchases from the category youth books
CookBks            Number of purchases from the category cookbooks
DoItYBks           Number of purchases from the category do-it-yourself books
RefBks             Number of purchases from the category reference books (atlases, encyclopedias, dictionaries)
ArtBks             Number of purchases from the category art books
GeoBks             Number of purchases from the category geography books
ItalCook           Number of purchases of book title Secrets of Italian Cooking
ItalAtlas          Number of purchases of book title Historical Atlas of Italy
ItalArt            Number of purchases of book title Italian Art
Florence           = 1 if The Art History of Florence was bought; = 0 if not
Related Purchase   Number of related books purchased

RFM Segmentation. The segmentation process in database marketing aims to partition customers in a list of prospects into homogeneous groups (segments) that are similar with respect to buying behavior. The homogeneity criterion we need for segmentation is the propensity to purchase the offering. However, since we cannot measure this attribute, we use variables that are plausible indicators of this propensity.

In the direct marketing business, the most commonly used variables are the RFM variables:

R = recency, time since last purchase

F = frequency, number of previous purchases from the company over a period

M = monetary, amount of money spent on the company’s products over a period

The assumption is that the more recent the last purchase, the more products bought from the company in the past, and the more money spent in the past buying the company’s products, the more likely the customer is to purchase the product offered.

The 1800 observations in the dataset were divided into recency, frequency, and monetary categories as follows:

Recency:

0–2 months (Rcode = 1)

3–6 months (Rcode = 2)

7–12 months (Rcode = 3)

13 months and up (Rcode = 4)

Frequency:

1 book (Fcode = 1)

2 books (Fcode = 2)

3 books and up (Fcode = 3)

Monetary:

$0–$25 (Mcode = 1)

$26–$50 (Mcode = 2)

$51–$100 (Mcode = 3)

$101–$200 (Mcode = 4)

$201 and up (Mcode = 5)
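The coding scheme above can be expressed as three small functions. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes that the upper bound of each range is inclusive (so a non-integer amount such as $25.50 would fall into Mcode = 2), a detail the case does not specify.

```python
def r_code(months_since_last_purchase):
    """Recency code: 1 for 0-2 months, 2 for 3-6, 3 for 7-12, 4 for 13 and up."""
    if months_since_last_purchase <= 2:
        return 1
    if months_since_last_purchase <= 6:
        return 2
    if months_since_last_purchase <= 12:
        return 3
    return 4


def f_code(number_of_purchases):
    """Frequency code: 1 for one book, 2 for two books, 3 for three or more."""
    if number_of_purchases <= 1:
        return 1
    if number_of_purchases == 2:
        return 2
    return 3


def m_code(dollars_spent):
    """Monetary code: 1 up to $25, 2 up to $50, 3 up to $100, 4 up to $200, else 5."""
    for code, upper in enumerate((25, 50, 100, 200), start=1):
        if dollars_spent <= upper:  # assumption: upper bounds are inclusive
            return code
    return 5


def rfm_codes(r, f, m):
    """Return the (Rcode, Fcode, Mcode) combination for one customer."""
    return r_code(r), f_code(f), m_code(m)


# Hypothetical customer: last purchase 4 months ago, 2 purchases, $120 spent.
print(rfm_codes(4, 2, 120))
```

With 4 recency codes, 3 frequency codes, and 5 monetary codes, every customer falls into exactly one of 4 × 3 × 5 = 60 combinations.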


Assignment

Partition the data into training (60%) and validation (40%). Use seed = 1.

1. What is the response rate for the training data customers taken as a whole? What is the response rate for each of the 4×5×3 = 60 combinations of RFM categories? Which combinations have response rates in the training data that are above the overall response in the training data?

2. Suppose that we decide to send promotional mail only to the “above- average” RFM combinations identified in part 1. Compute the response rate in the validation data using these combinations.

3. Rework parts 1 and 2 with three segments:

Segment 1: RFM combinations that have response rates that exceed twice the overall response rate

Segment 2: RFM combinations that exceed the overall response rate but do not exceed twice that rate

Segment 3: the remaining RFM combinations

Draw the lift curve (consisting of three points for these three segments) showing the number of customers in the validation dataset on the x-axis and cumulative number of buyers in the validation dataset on the y-axis.
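One possible way to organize the calculations in parts 1 to 3 is sketched below in Python. It is a sketch under assumptions, not a solution to the case: each row is assumed to be a tuple (Rcode, Fcode, Mcode, bought), the toy rows are invented, and RFM combinations that never appear in the training data are assumed to belong to Segment 3.

```python
from collections import defaultdict


def response_rates(rows):
    """Return the overall response rate and the rate for each RFM combination."""
    counts = defaultdict(lambda: [0, 0])  # combination -> [buyers, customers]
    for r, f, m, bought in rows:
        counts[(r, f, m)][0] += bought
        counts[(r, f, m)][1] += 1
    overall = sum(b for b, _ in counts.values()) / sum(n for _, n in counts.values())
    by_combo = {combo: b / n for combo, (b, n) in counts.items()}
    return overall, by_combo


def segment_of(combo, overall, by_combo):
    """Segment 1: above twice the overall rate; 2: above the overall rate; 3: the rest."""
    rate = by_combo.get(combo, 0.0)  # assumption: unseen combinations go to Segment 3
    if rate > 2 * overall:
        return 1
    if rate > overall:
        return 2
    return 3


def lift_points(valid_rows, overall, by_combo):
    """Cumulative (customers, buyers) in the validation data after Segments 1, 2, 3."""
    totals = {1: [0, 0], 2: [0, 0], 3: [0, 0]}
    for r, f, m, bought in valid_rows:
        seg = segment_of((r, f, m), overall, by_combo)
        totals[seg][0] += 1
        totals[seg][1] += bought
    points, customers, buyers = [], 0, 0
    for seg in (1, 2, 3):
        customers += totals[seg][0]
        buyers += totals[seg][1]
        points.append((customers, buyers))
    return points


# Invented toy data: (Rcode, Fcode, Mcode, bought).
train = ([(1, 3, 5, 1), (1, 3, 5, 1), (1, 3, 5, 0)]
         + [(2, 2, 3, 1), (2, 2, 3, 0), (2, 2, 3, 0)]
         + [(4, 1, 1, 0)] * 6)
valid = [(1, 3, 5, 1), (1, 3, 5, 0), (2, 2, 3, 1), (2, 2, 3, 0),
         (2, 2, 3, 0), (4, 1, 1, 0), (4, 1, 1, 0), (4, 1, 1, 1)]

overall, by_combo = response_rates(train)
points = lift_points(valid, overall, by_combo)
print(overall, points)
```

The three pairs in `points` are the x and y coordinates of the lift curve: the first pair counts only Segment 1 customers, the second adds Segment 2, and the third covers the whole validation set.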

k-Nearest Neighbors. The k-nearest-neighbors technique can be used to create segments based on product proximity (how close a customer’s past purchases are to the product being offered) as well as on the propensity to purchase (as measured by the RFM variables). For The Art History of Florence, a possible segmentation by product proximity could be created using the following variables:

R: recency—months since last purchase

F: frequency—total number of past purchases

M: monetary—total money (in dollars) spent on books

FirstPurch: months since first purchase

RelatedPurch: total number of past purchases of related books (i.e., sum of purchases from the art and geography categories and of titles Secrets of Italian Cooking, Historical Atlas of Italy, and Italian Art)

4. Use the k-nearest-neighbor approach to classify cases with k = 1, 2, ..., 11, using Florence as the outcome variable. Based on the validation set, find the best k. Remember to normalize all five variables. Create a lift curve for the best k model, and report the expected lift for an equal number of customers from the validation dataset.
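The search over k can be sketched in plain Python as follows. Several choices here are assumptions rather than requirements of the case: the variables are normalized as z-scores using training-set means and standard deviations, a case is classified as a buyer when at least half of its k neighbors are buyers, validation accuracy is the selection criterion, and ties between values of k are broken in favor of the smallest k. The toy rows are invented and deliberately easy to separate.

```python
import math


def fit_scaler(rows):
    """Column means and standard deviations computed from the training rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - mu) ** 2 for x in c) / n) or 1.0
            for c, mu in zip(cols, means)]
    return means, stds


def scale(row, means, stds):
    """Convert one row to z-scores."""
    return [(x - mu) / sd for x, mu, sd in zip(row, means, stds)]


def knn_predict(train_x, train_y, query, k):
    """Return 1 when at least half of the k nearest training cases are buyers."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    share = sum(train_y[i] for i in order[:k]) / k
    return 1 if share >= 0.5 else 0


def best_k(train_x, train_y, valid_x, valid_y, ks=range(1, 12)):
    """Return the k with the highest validation accuracy and all accuracies."""
    means, stds = fit_scaler(train_x)
    tx = [scale(r, means, stds) for r in train_x]
    vx = [scale(r, means, stds) for r in valid_x]
    accuracy = {}
    for k in ks:
        hits = sum(knn_predict(tx, train_y, q, k) == y for q, y in zip(vx, valid_y))
        accuracy[k] = hits / len(vx)
    return max(accuracy, key=accuracy.get), accuracy


# Invented rows: (R, F, M, FirstPurch, RelatedPurch); first six are buyers.
train_x = [[2, 8, 300, 40, 4], [2, 9, 310, 42, 5], [3, 8, 290, 41, 4],
           [2, 10, 320, 44, 5], [3, 9, 305, 43, 4], [2, 8, 315, 40, 5],
           [14, 1, 30, 6, 0], [16, 2, 40, 8, 0], [15, 1, 35, 7, 1],
           [14, 2, 45, 6, 0], [16, 1, 25, 9, 1], [15, 2, 50, 8, 0]]
train_y = [1] * 6 + [0] * 6
valid_x = [[2, 9, 300, 41, 5], [3, 8, 310, 42, 4],
           [15, 1, 40, 7, 0], [14, 2, 30, 8, 1]]
valid_y = [1, 1, 0, 0]

k_star, accuracy = best_k(train_x, train_y, valid_x, valid_y)
print(k_star, accuracy)
```

On real data the accuracies will differ across k, and a criterion other than accuracy (for example, lift in the top deciles) may be more appropriate when buyers are rare.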


f. Using this lift curve, estimate the gross profit that would result from mailing to the 180,000 names on the basis of your data mining models.

Note: Although Tayko is a hypothetical company, the data in this case (modified slightly for illustrative purposes) were supplied by a real company that sells software through direct sales. The concept of a catalog consortium is based on the Abacus Catalog Alliance.

21.4 Political Persuasion

Voter-Persuasion.csv is the dataset for this case study. Note: Our thanks to Ken Strasma, President of HaystaqDNA and director of targeting for the 2004 Kerry campaign and the 2008 Obama campaign, for the data used in this case, and for sharing the information in the following writeup.

Background

When you think of political persuasion, you may think of the efforts that political campaigns undertake to persuade you that their candidate is better than the other candidate. In truth, campaigns are less about persuading people to change their minds, and more about persuading those who agree with you to actually go out and vote. Predictive analytics now plays a big role in this effort, but in 2004, it was a new arrival in the political toolbox.

Predictive Analytics Arrives in US Politics

In January of 2004, candidates in the US presidential campaign were competing in the Iowa caucuses, part of a lengthy state-by-state primary campaign that culminates in the selection of the Republican and Democratic candidates for president. Among the Democrats, Howard Dean was leading in national polls. The Iowa caucuses, however, are a complex and intensive process attracting only the most committed and interested voters. Those participating are not a representative sample of voters nationwide. Surveys of those planning to take part showed a close race between Dean and three other candidates, including John Kerry.

Kerry ended up winning by a surprisingly large margin, and the better than expected performance was due to his campaign’s innovative and successful use of predictive analytics to learn more about the likely actions of individual voters. This allowed the campaign to target voters in such a way as to optimize performance in the caucuses. For example, once the model showed sufficient support in a precinct to win that precinct’s delegate to the caucus, money and time could be redirected to other precincts where the race was closer.

Political Targeting

Targeting of voters is not new in politics. It has traditionally taken three forms:

• Geographic

• Demographic

• Individual

In geographic targeting, resources are directed to a geographic unit—state, city, county, etc.—on the basis of prior voting patterns or surveys that reveal the political tendency in that geographic unit. It has significant limitations, though. If a county is only, say, 52% in your favor, it may be in the greatest need of attention, but if messaging is directed to everyone in the county, nearly half of it is reaching the wrong people.

In demographic targeting, the messaging is intended for demographic groups, for example, older voters, younger women voters, Hispanic voters, etc. The limitation of this method is that it is often not easy to implement: messaging is hard to deliver just to single demographic groups.

Traditional individual targeting, the most effective form of targeting, was done on the basis of surveys asking voters how they plan to vote. The big limitation of this method is, of course, the cost. The expense of reaching all voters in a phone or door-to-door survey can be prohibitive.

The use of predictive analytics adds power to the individual targeting method, and reduces cost. A model allows prediction to be rolled out to the entire voter base, not just those surveyed, and brings to bear a wealth of information. Geographic and demographic data remain part of the picture, but they are used at an individual level.

Uplift

In a classical predictive modeling application for marketing, a sample of data is selected and an offer is made (e.g., on the web) or a message is sent (e.g., by mail), and a predictive model is developed to classify individuals as responding or not-responding. The model is then applied to new data, propensities to respond are calculated, individuals are ranked by their propensity to respond, and the marketer can then select those most likely to respond to mailings or offers.


Some key information is missing from this classical approach: how would the individual respond in the absence of the offer or mailing? Might a high-propensity customer be inclined to purchase irrespective of the offer? Might a person’s propensity to buy actually be diminished by the offer? Uplift modeling (see Chapter 13) allows us to estimate the effect of “offer vs. no offer” or “mailing vs. no mailing” at the individual level.

In this case, we will apply uplift modeling to actual voter data that were augmented with the results of a hypothetical experiment. The experiment consisted of the following steps:

1. Conduct a pre-survey of the voters to determine their inclination to vote Democratic.

2. Randomly split the voters into two samples—control and treatment.

3. Send a flyer promoting the Democratic candidate to the treatment group.

4. Conduct another survey of the voters to determine their inclination to vote Democratic.
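Because the voters are split at random, the average effect of the flyer can be estimated by comparing the share of voters whose opinion moved in the treatment group with the same share in the control group. The Python sketch below illustrates that comparison; the (flyer, moved) pairs are invented and `moved_rate` is a hypothetical helper.

```python
def moved_rate(records, flyer_value):
    """Share of voters with the given flyer status whose opinion moved (moved = 1)."""
    group = [moved for flyer, moved in records if flyer == flyer_value]
    return sum(group) / len(group)


# Invented (flyer, moved) pairs: flyer = 1 for treatment, 0 for control.
records = [(1, 1), (1, 0), (1, 1), (1, 0),
           (0, 0), (0, 1), (0, 0), (0, 0)]

treatment_rate = moved_rate(records, 1)
control_rate = moved_rate(records, 0)
average_uplift = treatment_rate - control_rate
print(treatment_rate, control_rate, average_uplift)
```

This difference is a single number for the whole sample; it says nothing about which individual voters are most persuadable, which is the question uplift modeling addresses.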

Data

The data in this case are in the file Voter-Persuasion.csv. The target variable is MOVED_AD, where a 1 = “opinion moved in favor of the Democratic candidate” and 0 = “opinion did not move in favor of the Democratic candidate.” This variable encapsulates the information from the pre- and post-surveys. The important predictor variable is Flyer, a binary variable that indicates whether or not a voter received the flyer. In addition, there are numerous other predictor variables from these sources:

1. Government voter files

2. Political party data

3. Commercial consumer and demographic data

4. Census neighborhood data

Government voter files are maintained, and made public, to assure the integrity of the voting process. They contain essential data for identification purposes such as name, address and date of birth. The file used in this case also contains party identification (needed if a state limits participation in party primaries to voters in that party). Parties also staff elections with their own poll-watchers, who record whether an individual votes in an election. These data (termed “derived” in the case data) are maintained and curated by each party, and can be readily matched to the voter data by name. Demographic data at the neighborhood level are available from the census, and can be appended to the voter data by address matching. Consumer and additional demographic data (buying habits, education) can be purchased from marketing firms and appended to the voter data (matching by name and address).

Assignment

The task in this case is to develop an uplift model that predicts the uplift for each voter. Uplift is defined as the increase in propensity to move one’s opinion in a Democratic direction. First, review the variables in Voter-Persuasion.csv and understand which data source they are probably coming from. Then, answer the following questions and perform the tasks indicated:

1. Overall, how well did the flyer do in moving voters in a Democratic direction? (Look at the target variable among those who got the flyer, compared to those who did not.)

2. Explore the data to learn more about the relationships between the predictor variables and MOVED_AD using data visualization. Which of the predictors seem to have good predictive potential? Show supporting charts and/or tables.

3. Partition the data using the partition variable that is in the dataset, make decisions about predictor inclusion, and fit three predictive models accordingly. For each model, give sufficient detail about the method used, its parameters, and the predictors used, so that your results can be replicated.

4. Among your three models, choose the best one in terms of predictive power. Which one is it? Why did you choose it?

5. Using your chosen model, report the propensities for the first three records in the validation set.

6. Create a derived variable that is the opposite of Flyer. Call it Flyer-reversed. Using your chosen model, re-score the validation data using the Flyer-reversed variable as a predictor, instead of Flyer. Report the propensities for the first three records in the validation set.

7. For each record, uplift is computed based on the following difference:

P(success | Flyer = 1) − P(success | Flyer = 0)

Compute the uplift for each of the voters in the validation set, and report the uplift for the first three records.

8. If a campaign has the resources to mail the flyer only to 10% of the voters, what uplift cutoff should be used?
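The bookkeeping in tasks 5 through 8 can be sketched in Python as follows. The sketch assumes the two sets of propensities have already been produced by some fitted model; the numbers are invented, the function names are hypothetical, and the rule "mail to the top 10% by uplift, rounding the count down but mailing to at least one voter" is one reasonable reading of task 8 rather than the only one.

```python
import math


def uplift_per_record(flyer, p_actual, p_reversed):
    """Uplift = P(success | Flyer = 1) - P(success | Flyer = 0) for each record.

    p_actual[i] is the propensity scored with the record's real Flyer value;
    p_reversed[i] is the propensity scored with Flyer flipped (Flyer-reversed).
    """
    result = []
    for f, pa, pr in zip(flyer, p_actual, p_reversed):
        # For a treated voter the actual score is the "with flyer" propensity;
        # for a control voter it is the "without flyer" propensity.
        p_with, p_without = (pa, pr) if f == 1 else (pr, pa)
        result.append(p_with - p_without)
    return result


def cutoff_for_top_share(uplifts, share):
    """Smallest uplift among the top `share` of records, ranked by uplift."""
    n_top = max(1, math.floor(share * len(uplifts)))
    return sorted(uplifts, reverse=True)[n_top - 1]


# Invented validation records.
flyer = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
p_actual = [0.60, 0.20, 0.35, 0.40, 0.50, 0.10, 0.30, 0.25, 0.45, 0.30]
p_reversed = [0.40, 0.50, 0.30, 0.38, 0.45, 0.35, 0.32, 0.30, 0.20, 0.31]

uplifts = uplift_per_record(flyer, p_actual, p_reversed)
cutoff = cutoff_for_top_share(uplifts, share=0.10)
print([round(u, 2) for u in uplifts[:3]], round(cutoff, 2))
```

Note that some uplifts can be negative: for those voters the model suggests the flyer would reduce the propensity to move in a Democratic direction, so they would be excluded at any sensible cutoff.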