Read the section entitled "Case Study: Spanish or English" on pages 543-547 in the supplementary textbook, and answer the following questions (over the course of the week, NOT all in one post):
· How would you summarize the results and recommendations from this analysis for the president of this supermarket chain?
· What other measures might have been made with the data used here, and what insights could they provide?
· What is the "excluded middle approach" and why was it used in this study?
· This study was done for a supermarket chain in Texas. Can you think of other situations where a similar approach would be useful? Why?
Reference
Case Study: Spanish or English
Data mining usually focuses on customers, and hence on customer-level data. This case study instead uses summarized data that contains no information about individual customers, and this data is used to discern differences in shopping patterns between Hispanics and non-Hispanics at a chain of supermarkets in Texas. Because the demographic characteristics of individual customers are not known, this case study uses the demographic characteristics of the area served by each supermarket instead. Interesting patterns emerge when extrapolating from the stores to learn about the customers.
The Business Problem
The business problem is quite simple: should this chain of supermarkets advertise the same products in Spanish as in English? This chain is located in Texas, which, as the map in Figure 15-5 shows, has areas with very high concentrations of Hispanics. Darker shading indicates a higher concentration of this group, so the counties that border Mexico are at least two-thirds Hispanic. The proportion declines for counties farther from the Mexican border.
Figure 15-5: The proportion of Hispanics by county in Texas is quite high near the Mexican border, and then declines throughout the rest of the state.
Of course, not all Hispanics speak Spanish as their first language, so the demographic group “Hispanic” is not a perfect match for the advertising goal “Spanish-language.” But it is good enough.
The supermarket chain did not begin this effort with a data mining project. The Spanish-language advertising was already part of its business, and the company did the easy thing: advertise the same items in Spanish as in English. Preliminary analysis at the department level showed that Hispanics and non-Hispanics purchase about the same amount in the meat department, and in the frozen food department, and in the snacks department, and so on. Perhaps out of desperation, the chain launched a data mining project to understand the differences that might distinguish these demographic groups.
The Data
The data for the project was basically a fact table from an online analytical processing (OLAP) system, with three dimensions:
· Week of the year
· Product
· Store
Each row in the data described the sales of a single product at a single store for one week. The important measurements on the row included the number of units sold and total revenue for the particular product at the particular store during the particular week.
The data was also augmented with some ethnic variables describing the catchment area (also called micromarket) of each store. A catchment area is the neighborhood the majority of shoppers come from. Of course, not all shoppers come from the catchment area, and someone can live in the catchment area and never visit the store, but it is a good approximation of the local market. Catchment areas for retailers are defined by third parties such as Experian and IRI. Larger retailers often have the information for each store in the chain.
Ethnic variables in the data described the proportion of each store's neighborhood that was Hispanic and the proportion that was African-American. Actually, the data contained deciles, so “0” means that between 0 percent and 10 percent of the neighborhood is in that ethnic group.
The first analysis revealed a very uninteresting pattern: the higher the African-American decile, the lower the Hispanic decile, and vice versa. Far from saying something profound about ethnic groups in the great state of Texas, this relationship is simply a corollary of the following fact: The proportion of African Americans plus the proportion of Hispanics plus the proportion of others adds up to 1. In general, when one of the groups increases its representation, the others will go down—simply because the sum is 1. As a data miner, you must be careful about such patterns that appear because of the definition of variables.
WARNING
Spurious correlations can arise when variables have relationships among themselves, such as a group of demographic variables that represent proportions whose sum is 1. When one variable increases (or decreases) the others are likely to go in the opposite direction.
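The warning can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the neighborhoods are synthetic, the three raw weights are drawn independently, and the variable names are made up for this example. Turning the weights into proportions that sum to 1 is enough to make two of the proportions negatively correlated.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
hispanic, african_american = [], []
for _ in range(1000):
    # Three independent raw weights for a synthetic neighborhood ...
    raw = [rng.random() for _ in range(3)]
    total = sum(raw)
    # ... turned into proportions that sum to 1 (the third is "others").
    hispanic.append(raw[0] / total)
    african_american.append(raw[1] / total)

r = pearson(hispanic, african_american)
print(f"correlation between the two proportions: {r:.2f}")
```

The raw weights are unrelated by construction, so the negative correlation comes entirely from forcing the proportions to add up to 1.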
Defining “Hispanicity” Preference
The original business question was: “Should Spanish-language advertisements be for the same products as English-language advertisements?” After looking at the available data, the business question was rephrased as: “What are the differences in products sold in stores with a high Hispanic catchment area versus in a low Hispanic catchment area?”
Answering this question is much more feasible with the available data. The idea is to create a Hispanic preference score for each product, which is the popularity of the product in stores with highly Hispanic catchment areas minus the popularity in stores whose catchment areas have few Hispanics.
This approach divides stores into three groups:
· Highly Hispanic
· Mixed
· Not very Hispanic
The Hispanic preference score ignores the stores in the middle (so this is sometimes called the excluded middle approach). The assumption is that if Hispanics have preferences for or against certain products, then these preferences will show up more clearly by comparing the extremes rather than the stores in the middle.
TIP
When looking for patterns about a particular group of customers, using the excluded middle approach can be helpful. The difference in behavior between stores in areas where the group predominates and stores in areas where it is underrepresented may provide insight about the group in question.
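A minimal sketch of the preference score might look like the following. Several things here are assumptions rather than details from the case study: the store and sales figures are invented, "popularity" is taken to mean a product's share of units sold, and the decile cut-offs for "highly Hispanic" and "not very Hispanic" are arbitrary.

```python
from collections import defaultdict

# Hypothetical data: Hispanic decile (0-9) of each store's catchment area.
store_decile = {"S1": 9, "S2": 8, "S3": 5, "S4": 1, "S5": 0}

# Hypothetical sales rows: (store, product, units sold).
sales = [
    ("S1", "pork", 60), ("S1", "beef", 40),
    ("S2", "pork", 50), ("S2", "beef", 50),
    ("S3", "pork", 45), ("S3", "beef", 55),   # mixed store, ignored
    ("S4", "pork", 30), ("S4", "beef", 70),
    ("S5", "pork", 20), ("S5", "beef", 80),
]

HIGH_MIN, LOW_MAX = 7, 2  # assumed cut-offs, not taken from the study

def group_of(store):
    """Classify a store as 'high', 'low', or None (the excluded middle)."""
    decile = store_decile[store]
    if decile >= HIGH_MIN:
        return "high"
    if decile <= LOW_MAX:
        return "low"
    return None

# Total units per product within each of the two extreme groups.
units = {"high": defaultdict(int), "low": defaultdict(int)}
for store, product, n in sales:
    group = group_of(store)
    if group is not None:
        units[group][product] += n

def share(group, product):
    """Product's share of all units sold in the group's stores."""
    return units[group][product] / sum(units[group].values())

products = sorted({product for _, product, _ in sales})
preference = {p: share("high", p) - share("low", p) for p in products}
for p, score in preference.items():
    print(f"{p}: {score:+.2f}")
```

With these invented numbers, pork scores +0.30 and beef -0.30. Store S3 never enters the calculation, which is the excluded middle at work.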
The Solution
Figure 15-6 is a scatter plot, showing the Hispanic preference score on the vertical axis and products on the horizontal axis. The size of the marker shows the overall sales of the product. Points above the axis have a positive preference, meaning that highly Hispanic stores purchase more. Points below the axis mean that highly Hispanic stores purchase less. Perhaps the most interesting feature of this chart is that the business users discussed it for more than an hour—even though the products themselves are not labeled (the business users were quite familiar with the numbering system for product codes and quickly deciphered the interesting products).
What makes this chart interesting is the large marker at the top of the chart. This product is much more popular in Hispanic stores than in stores with few Hispanics in their catchment areas. The small markers at the bottom are the opposite, less popular in Hispanic catchment areas. Even without knowing the details, the chart shows that many products are equally popular in Hispanic and non-Hispanic areas—but not all of them. That is what makes the chart valuable.
What are the differences in purchasing patterns? It is true that both Hispanics and non-Hispanics purchase meat, for instance. However, non-Hispanics tend to prefer beef and Hispanics have a preference for pork. Similarly, non-Hispanics prefer to snack on potato chips and French fries, whereas Hispanics prefer corn chips as snacks. What does this mean? Ads for Fourth of July picnics should be for hamburgers and potato chips in English, and perhaps for sausages and Doritos corn chips in Spanish.
Figure 15-6: This chart shows that one product is both popular (because the cube is big) and has a high preference in Hispanic stores.
Association Analysis
One appeal of association analysis is the clarity of the results, which are in the form of rules about groups of products that appear together in orders. An association rule has an intuitive appeal because it expresses how tangible products and services group together. A rule like, “if a customer purchases marshmallows and graham crackers, then that customer will also purchase chocolate bars,” is clear. Even better, it might suggest a specific course of action, such as placing the items close to each other in a grocery store or providing customers with the recipe for that time-honored snack made with the aforementioned ingredients melted over an open flame.
Rules Are Not Always Useful
Although association rules are easy to understand, they are not always useful. The following three rules are examples of real rules generated from real data:
· Wal-Mart customers who purchase Barbie dolls have a 60 percent likelihood of also purchasing one of three types of candy bars.
· Customers who purchase maintenance agreements are very likely to purchase large appliances.
· When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners.
The first example was quoted in Forbes on September 8, 1997. These three examples illustrate the three common types of rules produced by association analysis: the actionable, the trivial, and the inexplicable. In addition to these types of rules, the sidebar “A Famous Rule: Beer and Diapers” talks about one other category.
Actionable Rules
Useful rules contain high-quality, actionable information. After the pattern is found, justifying it by telling a story can lead to insights and action. Wal-Mart's discovery about Barbie dolls and candy bars does not suggest that Barbie prefers chocolate bars to other forms of food. Clearly this is not a likely story, because Barbie could not maintain her inhuman figure on a diet of chocolate bars. Instead, imagine Mom going shopping with her two pre-tweens. The purpose: finding a gift for little Susie's friend, Emily, for Emily's upcoming birthday party. A Barbie doll is the perfect gift. At checkout, little Jacob starts crying. He wants something, too, and a candy bar fits the bill and is conveniently at eye level for a five-year-old. Or maybe the candy bar is for Mom, because shopping for birthday gifts is exhausting and Mom needs some energy. These scenarios suggest that the candy bar is an impulse purchase added on to that of the Barbie doll.
Whether Wal-Mart can make use of this information is not clear. This rule might suggest more prominent product placement, such as ensuring that customers must walk through candy aisles on their way back from Barbie-land. It might suggest product tie-ins and promotions offering candy bars and dolls together. It might suggest particular ways to advertise the products. It might demonstrate that having candy bars at eye level for five-year-old children is a good idea. Because the rule is easily understood, it suggests plausible causes and possible actions.
A Famous Rule: Beer and Diapers
Perhaps the most talked-about association rule ever “found” is the association between beer and diapers. This is a famous story dating back to the late 1980s or early 1990s, when computers were just getting powerful enough to analyze reasonably large volumes of data. The setting is somewhere in the United States, where a retailer is analyzing point-of-sale data to find interesting patterns.
Lo and behold, lurking in all the data is the remarkable fact that beer and diapers sell together. This immediately sets marketing minds in motion to explain what is happening. A flash of insight provides the explanation: beer drinkers do not want to interrupt their enjoyment of televised sports, so they buy diapers to reduce trips to the bathroom. No, that's not it! The more likely story is that families with young children are preparing for the weekend—diapers for the kids and beer for Daddy. Daddy probably knows that after he has a few beers, Mommy will change the diapers.
This is a powerful story. Setting aside the analytics, what can a retailer do with this information? There are two competing views. One says to put the beer and diapers close together, so when one is purchased, customers remember to buy the other one. The other says to put them as far apart as possible, so the customer must walk by as many stocked shelves as possible, having the opportunity to buy yet more items. The store could also put higher-margin diapers a bit closer to the beer, although mixing baby products and alcohol might be unseemly in some neighborhoods.
The story is so powerful that the authors have noticed at least five companies using the story—IBM, Tandem (now part of HP), Oracle, Teradata, and SAS. The actual story was debunked on April 6, 1998, in an article in Forbes magazine called “Beer-Diaper Syndrome.”
The debunked story still has a lesson. Apparently, the sales of beer and diapers were known to be correlated, at least in some stores. The correlations could be seen in inventory curves, which show how much of each item is available in each store by day or by week. For managers working in the stores, the hypothesis is quite obvious: a customer carrying both beer and diapers stands out when he or she finishes paying because these are such bulky items.
While doing a demonstration project for Teradata for a chain of drug stores in Wisconsin, Thomas Blischok, a sales manager, suggested that the demo show something interesting, like “beer and diapers being sold together.” With this small hint, analysts were able to find evidence in the data. Actually, the moral of the story is not about the power of association rules to find unexpected patterns. The moral is that hypothesis testing and exploratory data analysis, even using simple query tools, can be very persuasive and actionable. With such tools, you, too, may discover a pattern that becomes data mining legend.
Trivial Rules
Trivial results are already known by anyone at all familiar with the business. The second example (“Customers who purchase maintenance agreements are very likely to purchase large appliances”) is an example of a trivial rule. In fact, customers typically purchase maintenance agreements and large appliances at the same time. Why else would someone purchase a maintenance agreement? The two are advertised together, and rarely sold separately (although when sold separately, it is almost always the large appliance sold without the maintenance agreement rather than the other way around). This rule, though, was found after analyzing millions of point-of-sale transactions from Sears, and published at a conference once upon a time.
As a rule, the result is valid and well-supported by the data. It demonstrates the power of the algorithms that find association rules. However, it is a lousy example of data mining. As Chapter 1 emphasizes, the purpose of data mining is to find patterns that are meaningful, and this rule is useless. Similar results abound: people who buy two-by-fours also purchase nails; customers who purchase paint buy paint brushes; oil and oil filters are purchased together, as are hamburgers and hamburger buns, and charcoal and lighter fluid.
A subtler problem falls into the same category. A seemingly interesting result—such as the fact that people who buy cable service from their local telephone service provider almost always buy Internet access as well—may be the result of marketing programs and product bundles. In this case, cable, telephone, and Internet are often bundled together (as the “triple-play option”). The analysis does not produce actionable results; it produces already acted-upon results. Although a danger for any data mining technique, association analysis is particularly susceptible to reproducing the success of previous marketing campaigns because of its dependence on unsummarized point-of-sale data—exactly the same data that defines the success of the campaign. Results from association analysis may simply be measuring the success of previous marketing campaigns.
Trivial rules do have one use, although it is not directly a data mining use. Some rules may be almost 100 percent true. The few cases where these rules do not hold provide a lot of information about data quality. That is, the exceptions to very high-confidence rules point to areas where business operations, data collection, and processing may need to be further refined.
Inexplicable Rules
Inexplicable results seem to have no explanation and do not suggest a course of action. The third pattern (“When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner”) is intriguing, tempting us with a new fact but providing information that does not give insight into consumer behavior or the merchandise or suggest further actions. In this case, a large hardware company discovered the pattern for new store openings, but could not figure out how to profit from it. Many items are on sale during the store openings, but the toilet bowl cleaners stood out. More investigation might give some explanation. One possibility is that new hardware stores often open near new subdivisions, and new home owners need to stock up on toilet bowl cleaner. On the other hand, the result could be just an anomaly from a handful of stores. Whatever the cause, it is doubtful that further analysis of just the market basket data can give a credible explanation.
WARNING
When you're applying market basket analysis, many of the results are often either trivial or inexplicable. Trivial rules reproduce common knowledge about the business, wasting the effort used to apply sophisticated analysis techniques. Inexplicable rules are flukes in the data and are not actionable.
Item Sets to Association Rules
Association rules start with transactions containing one or more product or service offerings and some rudimentary information about the transaction. For the purpose of analysis, the products and service offerings are referred to as items. Table 15-1 illustrates five transactions from a grocery store. These transactions have been simplified to include only the items purchased. How to use information like the date and time and whether the customer paid with cash or a credit card is discussed later in this chapter.
Table 15-1: Grocery Point-of-Sale Transactions
Each of these transactions gives information about which products are purchased with which other products. This information is shown in a co-occurrence table that tells the number of times that any pair of products was purchased together (see Table 15-2). For instance, the box where the “Soda” row intersects the “OJ” column has a value of “2,” indicating that two transactions contain both items. This is easily verified against the original transaction data, where customers 1 and 4 purchased both these items. The values along the diagonal (for instance, the value in the “OJ” column and the “OJ” row) represent the number of transactions containing that item.
Table 15-2: Co-Occurrence of Products
This co-occurrence table already highlights some simple patterns:
· Orange juice and soda are more likely to be purchased together than any other two items.
· Detergent is never purchased with window cleaner or milk.
· Milk is never purchased with soda or detergent.
These observations are examples of associations. A combination of items is called an item set, so the combination of soda and orange juice is an item set that appears in two transactions. In this simple example, it is the only pair of items that appears more than once. Several item sets containing two products, three products, and four products appear in this data.
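Counting co-occurrences is straightforward to sketch in code. The transactions below are made up: they are chosen only to agree with the counts and observations quoted in the text and are not a copy of Table 15-1. The names `item_counts` and `pair_counts` are likewise illustrative.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions keyed by customer number; consistent with the counts
# quoted in the text, but not the contents of Table 15-1.
transactions = {
    1: {"OJ", "soda"},
    2: {"milk", "OJ", "window cleaner"},
    3: {"OJ", "detergent"},
    4: {"OJ", "soda"},
    5: {"window cleaner", "soda"},
}

item_counts = Counter()   # diagonal of the co-occurrence table
pair_counts = Counter()   # off-diagonal cells, keyed by a sorted pair
for items in transactions.values():
    item_counts.update(items)
    # Sorting gives each pair one canonical key, e.g. ("OJ", "soda").
    pair_counts.update(combinations(sorted(items), 2))

print(item_counts["OJ"])             # transactions containing OJ
print(pair_counts[("OJ", "soda")])   # transactions containing both items

# Item sets of size two that appear in more than one transaction.
repeated = [pair for pair, n in pair_counts.items() if n > 1]
print(repeated)
```

A pair that never occurs, such as detergent with milk, simply has a count of zero in `pair_counts`.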
Item sets are often interesting themselves. In the beer-and-diapers example, for instance, the interesting fact is that the two are purchased together; there is no need to consider rules such as, “If a customer purchases beer, then the customer purchases diapers,” or the inverse, “If a customer purchases diapers, then the customer purchases beer.” In other cases, the goal is to turn the item set into an association rule, such as, “If a customer purchases soda, then the customer also purchases orange juice.”
Such rules are derived from item sets. A single item set has several potential rules. The most interesting are usually the rules that have one product on the right-hand side (after the “then”) and zero or more items on the left-hand side (after the “if”). The rule states that the presence of the items on the left implies the presence of the item on the right. Let's defer discussion of how to find item sets and rules, and instead ask another question: how good is a particular rule?
How Good Is an Association Rule?
Any given set of transaction data is going to have many possible item sets, and these item sets are going to suggest even more rules. For a few hundred products, the number of combinations quickly climbs into the millions. The association rules technique has to separate the strong from the weak, so from the vast number of possible rules, only the best are chosen. The three traditional methods for measuring the goodness of an association rule are support, confidence, and lift. These measures come from the original work done in this area in the machine learning community. In addition to these measures, this section also explains an alternative one, the chi-square test, which is described in Chapter 4, because it often produces a better set of rules.
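To get a feel for how fast the combinations grow, the following sketch counts the possible item sets of two, three, and four items. The figure of 300 products is an assumption standing in for "a few hundred."

```python
from math import comb

n_products = 300  # assumed stand-in for "a few hundred products"

# Number of distinct item sets of each size that could be formed.
item_sets = {size: comb(n_products, size) for size in (2, 3, 4)}
for size, count in item_sets.items():
    print(f"{size}-item sets: {count:,}")
```

Item sets of three already number in the millions, and each item set can give rise to several rules.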
Support
Support measures the number or proportion of transactions that contain all the items in the rule. Support is typically measured in one of three ways:
· As a count of the number of transactions containing all the items in the rule.
· The count divided by the total number of transactions.
· Less commonly, the count divided by the total number of transactions that have enough items for the rule to apply. This takes into account, for instance, that transactions with only one item are not considered for a rule that contains two or more items.
All rules using the same items—regardless of which items appear after the “then”—have the same support.
In the earlier data, two of the five transactions include both soda and orange juice. These two transactions support the rule, “if soda, then orange juice.” The support for the rule is two out of five, or 40 percent of all transactions. Because all five transactions have at least two items, the support is still 40 percent, even taking into account the number of items in the rule.
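The three ways of expressing support can be sketched as follows. The baskets are made up to agree with the counts quoted in the text (soda in three transactions, orange juice in four, both in two of five); the function and variable names are illustrative.

```python
# Made-up baskets that agree with the counts quoted in the text.
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "soda"},
    {"window cleaner", "soda"},
]

def support_count(items):
    """Number of transactions containing every item in `items`."""
    return sum(1 for basket in transactions if items <= basket)

# The same item set backs both "if soda, then OJ" and "if OJ, then soda".
rule_items = {"soda", "OJ"}

count = support_count(rule_items)          # first variant: a raw count
proportion = count / len(transactions)     # second variant: share of all transactions

# Third variant: only transactions with enough items for the rule to apply.
eligible = [b for b in transactions if len(b) >= len(rule_items)]
adjusted = count / len(eligible)

print(count, proportion, adjusted)
```

Because every basket here has at least two items, the second and third variants give the same value.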
Confidence
Confidence measures how good a rule is at predicting the right-hand side (after the “then” clause of the rule), by comparing how often the right-hand side appears when the condition on the left-hand side (after the “if” clause of the rule) is true. For example, two of the three transactions that contain soda also contain orange juice, implying a high degree of confidence in the rule, “if soda, then orange juice.” In fact, the confidence is 67 percent.
The inverse rule, “if orange juice, then soda,” has a lower confidence. Of the four transactions with orange juice, only two also have soda. Its confidence, then, is 50 percent. Confidence is calculated as the ratio of the number of the transactions supporting the entire rule to the number of transactions supporting the left-hand side of the rule. Another way of saying this is that confidence is the ratio of the number of transactions with all the items to the number of transactions with just the “if” items.
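Confidence follows directly from two support counts. In this sketch the baskets are again made up to match the counts quoted in the text, and `confidence` is an illustrative helper rather than a function from any particular library.

```python
# Made-up baskets that agree with the counts quoted in the text.
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "soda"},
    {"window cleaner", "soda"},
]

def count(items):
    """Number of transactions containing every item in `items`."""
    return sum(1 for basket in transactions if items <= basket)

def confidence(lhs, rhs):
    """Transactions with all the items divided by transactions with the "if" items."""
    return count(lhs | rhs) / count(lhs)

soda_then_oj = confidence({"soda"}, {"OJ"})   # 2 of 3 soda baskets
oj_then_soda = confidence({"OJ"}, {"soda"})   # 2 of 4 OJ baskets

print(f"if soda, then OJ: {soda_then_oj:.2f}")
print(f"if OJ, then soda: {oj_then_soda:.2f}")
```

Both rules share the same numerator; only the denominator changes with the direction of the rule.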
Lift
Confidence tells you how good a rule is at predicting what is on the right-hand side. However, the items on the right-hand side might already be very common, so the rule may not be telling us anything. Lift (also called improvement) measures the power of the rule by comparing the full rule to randomly guessing the right-hand side. Here, randomly guessing means comparing the confidence of the rule to the confidence of the null rule. The null rule has the same item on the right-hand side, but the left-hand side is empty.
Lift is calculated as the ratio of the confidence of the rule to the prevalence of the right-hand side. In the example rule, “if soda, then orange juice,” the confidence is 67 percent. However, 80 percent of the transactions contain orange juice, so the lift is 67 / 80, or about 0.83. Because the lift is less than 1, the rule actually does worse than simply guessing!
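The lift calculation for this rule can be written out using only the counts quoted in the text. The variable names are illustrative.

```python
# Counts quoted in the text for "if soda, then orange juice".
n_transactions = 5
n_soda = 3   # transactions with the left-hand side
n_oj = 4     # transactions with the right-hand side
n_both = 2   # transactions with the whole rule

confidence = n_both / n_soda          # about 0.67
prevalence = n_oj / n_transactions    # 0.80, the confidence of the null rule
lift = confidence / prevalence

print(f"lift = {lift:.2f}")
```

A lift below 1 means the left-hand side makes the right-hand side less likely than it is in the data as a whole.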
Chi-Square Value
The chi-square test produces a useful measure that can be applied to association rules. The chi-square test, as Chapter 4 explains, is used to determine whether the cells of a contingency table are produced randomly or whether something more interesting is going on. A contingency table is used to compare two dimensions that take on categorical values. The table itself is a record of counts, where each record in the data is counted in exactly one cell of the table (which particular cell depends on the values in the two dimensions). The chi-square value is used to gauge the probability that the contingency table could be produced by chance. When the probability is high (and hence, the chi-square value is low), the table is similar to a table produced randomly, so the two dimensions that define the table are probably not interacting with each other. When the probability is low (and hence, the chi-square value is high), the two dimensions are probably interrelated: something is going on.
It may not be immediately obvious, but an association rule has an associated contingency table, as shown in Figure 15-7. The two dimensions are:
· Whether or not the items in the left-hand side of the rule are present in the transaction
· Whether or not the items in the right-hand side of the rule are present in the transaction
The counts of the transactions are in the cells of the table.
Figure 15-7: An association rule has a corresponding contingency table, where the two dimensions are based on the two sides of the rule. The cells in the table contain counts of the number of transactions that appear or do not appear on either side.
The chi-square value, as calculated from the contingency table, measures the probability that the table is produced by a random splitting of the data. An interesting rule would not split the data randomly, so the higher the chi-square value (and hence, the less likely the split is due to chance), then the better the rule. In practice, the authors have found the chi-square value to be very useful for selecting rules.
Working with the chi-square value does have one caveat. Consider the two rules:
· IF <LHS> THEN <RHS>
· IF <LHS> THEN NOT <RHS>
Both these rules produce the same contingency table, and hence have the same chi-square value. An additional test is needed (lift is a good measure) to determine which is the better rule.
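The caveat can be illustrated with a short sketch that computes the Pearson chi-square statistic for a 2x2 table and for the same table with its columns swapped. The counts are hypothetical, and the plain-Python `chi_square` and `lift` helpers are written for this example rather than taken from a statistics library.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def lift(table):
    """Lift of the rule whose RHS is the first column of a 2x2 table."""
    (both, lhs_only), (rhs_only, neither) = table
    n = both + lhs_only + rhs_only + neither
    confidence = both / (both + lhs_only)
    prevalence = (both + rhs_only) / n
    return confidence / prevalence

# Hypothetical counts: rows are LHS present / absent,
# columns are RHS present / absent.
rule = [[200, 100],
        [200, 500]]

# "IF LHS THEN NOT RHS" swaps the two columns of the same table.
negated = [row[::-1] for row in rule]

print(f"chi-square: {chi_square(rule):.2f} vs {chi_square(negated):.2f}")
print(f"lift:       {lift(rule):.2f} vs {lift(negated):.2f}")
```

The two chi-square values are identical, while the lift is above 1 for one rule and below 1 for the other, which is how lift breaks the tie.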
Building Association Rules
The basic process for finding association rules is illustrated in Figure 15-8. Three important concerns when creating association rules are:
· Choosing the right set of items.
· Generating rules by deciphering the counts in the co-occurrence matrix.
· Overcoming the practical limits imposed by thousands or tens of thousands of items.
The next three sections delve into these concerns in more detail.
Figure 15-8: Finding association rules has these basic steps.
Choosing the Right Set of Items
Traditionally, the data used for finding association rules is the detailed transaction data captured at the point of sale. Gathering and using this data is a critical part of applying market basket analysis, and the results depend crucially on the items chosen for analysis. As mentioned in the case study on Spanish language marketing, the grocery store chain had already looked at preferences at the department level, and found no interesting patterns, although interesting patterns were found lower down the product hierarchy.
What constitutes a particular item depends on the business need. Within a grocery store where tens of thousands of products are on the shelves, a frozen pizza might be considered an item for analysis purposes, regardless of its toppings (extra cheese, pepperoni, or mushrooms), its crust (extra thick, whole wheat, or white), or its size. So, the purchase of a large whole-wheat vegetarian pizza contains the same “frozen pizza” item as the purchase of a single-serving pepperoni pizza with extra cheese. A sample of such transactions at this summarized level might look like Table 15-3.
Table 15-3: Transactions with More Summarized Items
On the other hand, the manager of frozen foods or a chain of pizza restaurants may be very interested in the particular combinations of toppings that are ordered. He or she might decompose a pizza order into constituent parts, with items such as those in Table 15-4.
Table 15-4: Transactions with More Detailed Items
At some later point in time, the grocery store may become interested in having more detail in its transactions, so the single “frozen pizza” item would no longer be sufficient. Or the pizza restaurants might broaden their menu choices and become less interested in all the different toppings. The items of interest may vary over time or even at the same time depending on the analysis. This can pose a problem when trying to use historical data if different levels of detail have been removed or if product hierarchies are no longer available or frequently change.
Choosing the right level of detail is a critical consideration for the analysis. If the transaction data in the grocery store keeps track of every type, brand, and size of frozen pizza—which probably account for several dozen products—then all these items need to map up to the “frozen pizza” item for analysis.
Product Hierarchies Help to Generalize Items
In the real world, items have product codes and stock-keeping unit codes (SKUs) that fall into hierarchical categories, called a product hierarchy or taxonomy, illustrated in Figure 15-9. What level of the product hierarchy is the right one to use? This brings up issues such as:
· Are large fries and small fries the same product?
· Is the brand of ice cream more relevant than its flavor?
· Which is more important: the size, style, pattern, or designer of clothing?
· Is the energy-saving option on a large appliance indicative of customer behavior?
Figure 15-9: Product hierarchies start with the most general and move to increasing detail.
The number of combinations to consider grows very fast as the number of items used in the analysis increases. This suggests using items from higher levels of the product hierarchy, such as “frozen desserts” instead of “Ben & Jerry's Cherry Garcia” ice cream. On the other hand, the more specific the items are, the more likely the results are to be actionable. Knowing what sells with a particular brand of frozen pizza, for instance, can help in managing the relationship with the manufacturer. One compromise is to use more general items initially, then to repeat the rule generation to home in on more specific items. As the analysis focuses on more specific items, use only the subset of transactions containing those items.
The complexity of a rule refers to the number of items it contains. The more different types of items in the transactions, the longer it takes to generate rules of a given complexity. The desired complexity of the rules has an impact on how specific or general the items should be. In some circumstances, customers do not purchase many different items. For instance, customers purchase relatively few items at any one time at a convenience store or through some catalogs, so looking for rules containing four or more items may apply to very few transactions and be a wasted effort. In other cases, such as in supermarkets, the average transaction is larger, so more complex rules are useful.
Moving up the product hierarchy reduces the number of items. Dozens or hundreds of items may be reduced to a single generalized item, often corresponding to a single department or product line. An item like a pint of Ben & Jerry's Cherry Garcia gets generalized to “ice cream” or “frozen foods.” Instead of investigating “orange juice,” investigate “fruit juices,” and so on. Often, the appropriate level of the hierarchy ends up matching a department with a product-line manager; so using categories has the practical effect of finding interdepartmental relationships. Generalized items also help find rules with sufficient support. There will be many more transactions supporting a given rule at higher levels of the taxonomy than at lower levels.
TIP
Market basket analysis produces the best results when the items occur in roughly the same number of transactions in the data. This helps prevent rules from being dominated by the most common items. Product hierarchies can help here. Roll up rare items to higher levels in the hierarchy, so they become more frequent. More common items may not have to be rolled up at all.
Just because some items are generalized does not mean that all items need to move up to the same level. The appropriate level depends on the item, on its importance for producing actionable results, and on its frequency in the data. For instance, in a department store, big-ticket items (such as appliances) might stay at a low level in the hierarchy, whereas less expensive items (such as books) might be higher. This hybrid approach is also useful when looking at individual products. Because thousands of products often are in the data, generalize everything other than the product or products of interest.
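The hybrid roll-up described above can be sketched in a few lines of Python. This is only an illustration: the hierarchy in PARENT, the function name roll_up, and the min_count threshold are all hypothetical, and a real product hierarchy would come from the retailer's own systems.

```python
from collections import Counter

# Hypothetical product hierarchy: each item maps to its parent category.
PARENT = {
    "Cherry Garcia pint": "ice cream",
    "vanilla pint": "ice cream",
    "ice cream": "frozen foods",
    "frozen pizza": "frozen foods",
}

def roll_up(transactions, parent, min_count):
    """Replace items seen in fewer than min_count transactions by their parent.

    Repeats until every item is frequent enough or has no parent left, so
    common items stay specific while rare items move up the hierarchy.
    """
    current = [set(t) for t in transactions]
    while True:
        counts = Counter(item for t in current for item in t)
        rare = {i for i, c in counts.items() if c < min_count and i in parent}
        if not rare:
            return current
        current = [{parent[i] if i in rare else i for i in t} for t in current]
```

With a threshold of two transactions, the two rare ice cream flavors would be generalized to “ice cream” while a frequently purchased frozen pizza stays at its original level.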
Virtual Items Go Beyond the Product Hierarchy
The purpose of virtual items is to enable the analysis to take advantage of information that goes beyond the product hierarchy. Virtual items do not appear in the product hierarchy of the original items, because they cross product boundaries. Examples of virtual items might be designer labels, such as Calvin Klein, that appear in both apparel departments and perfumes; low-fat and no-fat products in a grocery store; and energy-saving options on appliances.
Virtual items may even include information about the transactions themselves, such as whether the purchase was made with cash, a credit card, or check, and the day of the week or the time of day the transaction occurred. However, crowding the data with too many virtual items is not a good idea. Only include virtual items when you have some idea of how they could result in actionable information if found in well-supported, high-confidence association rules.
There is a danger, though. Virtual items can cause trivial rules. For instance, imagine that there is a virtual item for “diet product” and one for “Coke product”; then a rule might appear like:
If “Coke product” and “diet product,” then “diet Coke”
That is, everywhere that “Coke” appears in a basket and “diet product” appears in a basket, then “diet Coke” also appears. Every basket that has Diet Coke satisfies this rule. Although some baskets may have regular Coke and other diet products, the rule will have high lift because it defines Diet Coke. When using virtual items, checking and rechecking the rules to be sure that such trivial rules are not arising is worth your time. One way to avoid creating such rules is to not consider rules when the virtual items come from the same product. With this restriction, only virtual items that appear in different products are included in rules, and such trivial rules are not produced. To implement such a method requires writing your own association rule code, which you can accomplish in SQL as described in Data Analysis Using SQL and Excel.
A similar but more subtle danger occurs when the right-hand side of a rule does not include the associated item. So, a rule like:
If “Coke product” and “diet product,” then “pretzels”
probably means,
If “diet Coke,” then “pretzels”
The danger from having such rules is that they can obscure what is happening.
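One way to picture the restriction on virtual items from the same product is the following Python sketch. The VIRTUAL mapping and the function name are hypothetical; the check simply asks whether two virtual items in a basket are contributed by two different products before the pair is allowed on the left-hand side of a rule.

```python
# Hypothetical mapping from each product to the virtual items it carries.
VIRTUAL = {
    "Diet Coke": {"Coke product", "diet product"},
    "Coke": {"Coke product"},
    "diet cookies": {"diet product"},
    "pretzels": set(),
}

def has_pair_from_different_products(basket, v1, v2, virtual=VIRTUAL):
    """True if v1 and v2 are contributed by two different products in the basket."""
    carriers1 = {p for p in basket if v1 in virtual.get(p, set())}
    carriers2 = {p for p in basket if v2 in virtual.get(p, set())}
    return any(p1 != p2 for p1 in carriers1 for p2 in carriers2)
```

Under this check, a basket holding only Diet Coke and pretzels does not count toward the condition “Coke product” and “diet product,” whereas a basket holding regular Coke and a diet cookie does.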
TIP
When applying market basket analysis, you should have a hierarchical taxonomy of the items being considered for analysis. By carefully choosing the levels of the hierarchy, these generalized items should occur about the same number of times in the data, improving the results of the analysis. For specific lifestyle-related choices that provide insight into customer behavior, such as sugar-free items and specific brands, augment the data with virtual items.
Data Quality
The data used for market basket analysis is not necessarily of very high quality. It is gathered directly at the point of customer contact and used mainly for operational purposes such as inventory control. The data is likely to have multiple formats, corrections, incompatible code types, and so on. Much of the explanation of various code values is likely to be buried deep in programming code running in legacy systems and may be difficult to extract. Different stores within a single chain sometimes have different product hierarchies or different ways of handling situations such as discounts.
Here is an example. The authors were once curious about the approximately 80 department codes present in a large set of transaction data. The client assured us that their stores had only 40 departments and provided a nice description of each one. More careful inspection revealed the problem. Some stores had point-of-sale devices provided by IBM and others used devices provided by NCR. The two types of equipment had different ways of representing department codes—hence, two incompatible sets of codes for the 40 departments.
These kinds of problems are typical when analyzing any sort of data. However, they are exacerbated for market basket analysis because this type of analysis depends heavily on the unsummarized point-of-sale transactions.
Anonymous Versus Identified
Market basket analysis has proven useful for mass-market retail, such as supermarkets, convenience stores, drug stores, and fast food chains, where many of the purchases have been made with cash. Cash transactions are anonymous, meaning that the store has no knowledge about specific customers because there is no information identifying the customer in the transaction. For anonymous transactions, the only information is the date and time, the location of the store, the cashier, the items purchased, any coupons redeemed, and the amount of change. With market basket analysis, even this limited data can yield interesting and actionable results.
The increasing prevalence of registration on websites, loyalty programs, and purchasing clubs has resulted in more and more identified transactions, providing analysts with more possibilities for information about customers and their behavior over time. Demographic and trending information is available on individuals and households to further augment customer profiles. This additional information can be incorporated into association rule analysis using virtual items.
Generating Rules from All This Data
Calculating the number of times that an item set appears in the transaction data is well and good, but a combination of items is not a rule. Sometimes, just the combination is interesting in itself, as in the Barbie doll and candy bar example. But in other circumstances, finding an underlying rule of the form makes more sense:
if condition, then result
As an example, the rule:
if Barbie doll, then candy bar
is read as, “If a customer purchases a Barbie doll, then the customer is also expected to purchase a candy bar at the same time.” The general practice is to consider rules where just one item is on the right-hand side. Table 15-5 shows some summaries of items in transactions. These summaries are useful for showing the calculations needed to turn an item set into rules. In this example, there are only three items, A, B, and C.
Table 15-5: Probabilities of Three Items and Their Combinations
Calculating Support
There are three rules with all three items in the rule and one item on the right-hand side:
· If A and B, then C
· If A and C, then B
· If B and C, then A
Because these three rules contain the same items, they have the same support, 5 percent (of all transactions). Only counting transactions that have at least three items (where the rule could apply), these rules have 100 percent support. In this case, counting only transactions where the rule could apply does not make sense.
Calculating Confidence
Confidence is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items on the left-hand side. The confidence for the three rules is shown in Table 15-6.
Table 15-6: Confidence in Rules
What is confidence really saying? The rule, “if B and C, then A,” has confidence of 33 percent. In other words, when B and C appear in a transaction, there is a 33 percent chance that A also appears in it. That is, one time in three, A occurs with B and C, and the other two times, B and C appear without A.
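The confidence arithmetic can be checked with a short Python sketch. The table of probabilities is not reproduced here, so the pair frequencies below are assumptions: 5 percent for all three items comes from the text, 15 percent for B and C follows from the 33 percent confidence quoted above, 25 percent for A and B follows from the 20 percent confidence used later in this section, and 20 percent for A and C is purely illustrative.

```python
# Share of all transactions containing each item set (assumed values, see above).
support = {
    frozenset("AB"): 0.25,
    frozenset("AC"): 0.20,   # illustrative assumption
    frozenset("BC"): 0.15,
    frozenset("ABC"): 0.05,  # stated in the text
}

def confidence(lhs, rhs):
    """Support of the whole rule divided by support of its left-hand side."""
    lhs = frozenset(lhs)
    return support[lhs | frozenset(rhs)] / support[lhs]

for lhs, rhs in [("AB", "C"), ("AC", "B"), ("BC", "A")]:
    print(f"if {lhs[0]} and {lhs[1]}, then {rhs}: {confidence(lhs, rhs):.0%}")
```

All three rules share the same support, but their confidences differ because the left-hand sides occur with different frequencies.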
Calculating Lift
As described earlier, lift is a good measure of how much better the rule is doing. It is the ratio of the density of the target (using the left-hand side of the rule) to the density of the target overall. So the formula is:
lift = confidence / p(result) = p(condition and result) / (p(condition) × p(result))
When lift is greater than 1, then the resulting rule is better at predicting the result than guessing whether the right-hand side of the rule is present based on item frequencies in the data. When lift is less than 1, the rule is doing worse than informed guessing. Table 15-7 shows the lift for the three rules and for the rule with the best lift.
Table 15-7: Lift Measurements for Four Rules
None of the rules with three items shows an improvement over just guessing. The best rule in the data actually only has two items. When “A” is purchased, “B” is 31 percent more likely to be in the transaction than it is in transactions overall. In this case, as in many cases, the best rule actually contains fewer items than other rules under consideration.
The Negative Rule
When lift is less than 1, negating the result produces a better rule. If the rule:
if A and B, then C
has confidence of 20 percent, then the rule:
if A and B, then NOT C
has confidence of 80 percent. Because C appears in 40 percent of the transactions, it does not appear in 60 percent of them. Applying the same lift measure shows that the lift of this new rule is 1.33 (0.80/0.60), better than any of the other rules with three items.
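The comparison between the positive and negative rules can be reproduced with a small Python sketch. The function name lift is illustrative; the inputs are the figures used in the text (C in 40 percent of transactions, the rule supported by 5 percent), plus 25 percent for A and B together, which is implied by the 20 percent confidence.

```python
def lift(p_lhs_and_rhs, p_lhs, p_rhs):
    """Confidence of the rule divided by the overall frequency of the result."""
    return (p_lhs_and_rhs / p_lhs) / p_rhs

p_ab = 0.25    # A and B together (implied by 20 percent confidence and 5 percent support)
p_abc = 0.05   # A, B, and C together
p_c = 0.40     # C overall

positive = lift(p_abc, p_ab, p_c)             # if A and B, then C
negative = lift(p_ab - p_abc, p_ab, 1 - p_c)  # if A and B, then NOT C

print(f"lift of positive rule: {positive:.2f}")
print(f"lift of negative rule: {negative:.2f}")
```

The negative rule uses the transactions with A and B but without C on top, and the share of transactions without C on the bottom, which is why its lift comes out above 1 when the positive rule's lift is below 1.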
Calculating Chi-Square
The chi-square value is easy to calculate. However, not enough information appears in Table 15-5 to make the calculation, because the chi-square calculation requires counts, rather than percentages. Also, more information is needed about where items do not occur. Table 15-8 shows a summary of the transactions using counts. This data results in the same percentages as Table 15-5.
Table 15-8: Transaction Counts for Data in Table 15-5
Consider the rule, “if A and B, then C.” To turn this into a contingency table, each transaction needs to be placed into one of the following four categories:
· Not both A and B, and Not C
· Not both A and B, and C
· A and B and Not C
· A and B and C
These groups, in turn, can be sized based on the information in Table 15-8:
· Not both A and B, and Not C: A only, B only, and none
· Not both A and B, and C: C only, AC only, and BC only
· A and B and Not C: AB only
· A and B and C: ABC only
Notice that each of the eight groups is counted exactly once.
This is enough information to calculate the chi-square value, as shown in Table 15-9. The overall chi-square is the sum of the cell chi-square values, which is 111.1. In this case, the splits associated with the rule are significant.
Table 15-9: Chi-Square Calculation for the Rule, “If A and B, then C”
The chi-square value has one catch. The rules, “if A and B, then C” and “if A and B, then not C,” produce the same chi-square value. The chi-square calculation has determined that these rules are significant, but which one? Answering that question requires returning to the concept of lift. The lift for the positive rule is 0.5. The lift for the negative rule is 1.33. Both are significant, but the negative rule is useful because it has a lift greater than 1. The rule chosen by the chi-square metric is, “if A and B, then not C.”
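The contingency-table calculation can be sketched in Python as follows. The counts table is not reproduced here, so the eight counts below are assumptions: a hypothetical set of 2,000 transactions chosen to agree with the percentages quoted in the text (for example, C in 40 percent of transactions and all three items in 5 percent). They stand in for the real table and should not be read as the authors' data.

```python
# Assumed counts for each mutually exclusive group of transactions.
counts = {
    "A": 100, "B": 150, "C": 200,
    "AB": 400, "AC": 300, "BC": 200,
    "ABC": 100, "none": 550,
}

def chi_square_for_ab_then_c(counts):
    """Chi-square for the 2x2 table of (A and B present?) by (C present?)."""
    observed = [
        # A and B present: with C, without C
        [counts["ABC"], counts["AB"]],
        # Not both A and B: with C, without C
        [counts["C"] + counts["AC"] + counts["BC"],
         counts["A"] + counts["B"] + counts["none"]],
    ]
    total = sum(map(sum, observed))
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

print(round(chi_square_for_ab_then_c(counts), 1))
```

Swapping the two columns (C and not C) leaves the sum unchanged, which is why the positive and negative rules share one chi-square value and lift is needed to choose between them.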
Overcoming Practical Limits
Generating association rules is a multi-step process. The general algorithm is:
1. Generate counts for single items.
2. Generate the co-occurrence matrix for two items. Use this to find rules with two items.
3. Generate the co-occurrence matrix for three items. Use this to find rules with three items.
4. And so on.
In other words, the process requires counting the frequencies of many item sets.
In the grocery store that sells orange juice, milk, detergent, soda, and window cleaner, the first step calculates the counts for each of these items. During the second step, the following counts are created:
· Orange juice and milk, orange juice and detergent, orange juice and soda, and orange juice and cleaner
· Milk and detergent, milk and soda, milk and cleaner
· Detergent and soda, detergent and cleaner
· Soda and cleaner
This is a total of 10 pairs of items. The third pass takes all combinations of three items and so on. Of course, each of these stages may require a separate pass through the data, or multiple stages can be combined into a single pass by considering different numbers of combinations at the same time.
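The counting passes can be sketched with Python's standard library. The five items come from the grocery example; the baskets themselves are hypothetical and exist only to exercise the function, whose name is also illustrative.

```python
from collections import Counter
from itertools import combinations

ITEMS = ["orange juice", "milk", "detergent", "soda", "window cleaner"]

def count_item_sets(transactions, size):
    """Count how many transactions contain each combination of `size` items."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each combination one canonical ordering.
        counts.update(combinations(sorted(set(basket)), size))
    return counts

possible_pairs = list(combinations(ITEMS, 2))  # every pair the second pass may count

# Hypothetical baskets, for illustration only.
baskets = [
    ["orange juice", "soda"],
    ["milk", "orange juice", "window cleaner"],
    ["orange juice", "detergent"],
    ["orange juice", "detergent", "soda"],
    ["window cleaner", "soda"],
]
pair_counts = count_item_sets(baskets, 2)
print(len(possible_pairs), pair_counts[("orange juice", "soda")])
```

Calling count_item_sets with size 3 would perform the third pass in the same way.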
Although it is not obvious with just five items, increasing the number of items in the combinations requires exponentially more computation, and exponentially growing runtimes—and long, long waits when considering combinations with more than three or four items. The solution is pruning. Pruning item sets is a technique for reducing the number of items and combinations of items considered at each step. At each stage, the algorithm throws out a certain number of combinations that do not meet some threshold criterion.
The most common pruning method is called minimum support pruning, which requires that a rule have the support of a minimum number of transactions. For instance, if you have one million transactions and the minimum support is 1 percent, then only rules supported by 10,000 transactions are of interest. This makes sense, because the purpose of generating these rules is to pursue some sort of action—such as striking a deal with Mattel (the makers of Barbie dolls) to make a candy bar–eating doll—and the action must affect enough transactions to be worthwhile.
The minimum support constraint has a cascading effect. Consider a rule with four items in it:
if A, B, and C, then D
Using minimum support pruning, this rule has to be true on at least 10,000 transactions in the data. It follows that:
· A must appear in at least 10,000 transactions
· B must appear in at least 10,000 transactions
· C must appear in at least 10,000 transactions
· D must appear in at least 10,000 transactions
In other words, minimum support pruning eliminates items that do not appear in enough transactions. The threshold criterion applies to each step in the algorithm. The minimum threshold also implies that:
· A and B must appear together in at least 10,000 transactions
· A and C must appear together in at least 10,000 transactions
· A and D must appear together in at least 10,000 transactions
· And so on
Each step of the calculation of the co-occurrence table can eliminate combinations of items that do not meet the threshold, reducing its size and the number of combinations to consider during the next pass.
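The cascading effect of minimum support pruning can be sketched as a level-by-level search in Python. This is one common way to implement the idea rather than a description of any particular tool; the function name and parameters (min_count, max_size) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_item_sets(transactions, min_count, max_size):
    """Return {item set: count} for item sets found in at least min_count transactions."""
    baskets = [frozenset(t) for t in transactions]

    # Pass 1: count single items and drop the ones below the threshold.
    item_counts = Counter(item for b in baskets for item in b)
    frequent = {frozenset([i]): c for i, c in item_counts.items() if c >= min_count}
    result = dict(frequent)

    for size in range(2, max_size + 1):
        # Only items that survived the previous pass can appear in larger sets.
        survivors = frozenset().union(*frequent)
        counts = Counter()
        for b in baskets:
            for combo in combinations(sorted(b & survivors), size):
                # Every smaller subset must itself have met the threshold.
                if all(frozenset(sub) in frequent
                       for sub in combinations(combo, size - 1)):
                    counts[frozenset(combo)] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        if not frequent:
            break
        result.update(frequent)
    return result
```

Because a combination is counted only when all of its smaller subsets survived the previous pass, each pass has fewer candidates to consider than an exhaustive count would.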
Figure 15-10 shows an example of calculating rules. In this example, choosing a minimum support level of 10 percent would eliminate all the combinations with three items—and their associated rules—from consideration. This is an example where minimum support pruning does not have an effect on the best rule because the best rule has only two items. In the case of pizza, these toppings are all fairly common, so are not pruned individually. If anchovies were included in the analysis—and only 15 pizzas contain them out of the 2,000—then a minimum support of 10 percent, or even 1 percent, would eliminate anchovies during the first pass.
The best choice for minimum support depends on the data and the situation. Another possibility is to vary the minimum support as the algorithm progresses. For instance, using different levels at different stages, you can find uncommon combinations of common items (by decreasing the support level for successive steps) or relatively common combinations of uncommon items (by increasing the support level).
The Problem of Big Data
A typical fast food restaurant offers several dozen items on its menu, say 100. To use probabilities to generate association rules, counts have to be calculated for each combination of items. The number of combinations of a given size grows exponentially. A combination with three items might be a small fries, cheeseburger, and medium Diet Coke. On a menu with 100 items, how many combinations are there with three different menu items? There are 161,700! (This calculation is based on the binomial formula.) On the other hand, a typical supermarket has at least 10,000 different items in stock, and more typically 20,000 or 30,000.
Calculating the support, confidence, and lift quickly gets out of hand as the number of items in the combinations grows. Such a grocery store has almost 50 million possible combinations of two items and more than 100 billion combinations of three items. Although computers are getting more powerful and processing power cheaper, calculating the counts for this number of combinations is still very time-consuming. Calculating the counts for five or more items is prohibitively expensive. The use of product hierarchies reduces the number of items to a manageable size.
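The combination counts quoted above can be verified with the binomial coefficient in Python's standard library (math.comb, available from Python 3.8). The 10,000-item figure is the lower bound the text gives for a typical supermarket.

```python
from math import comb

menu_triples = comb(100, 3)         # three-item combinations on a 100-item menu
store_pairs = comb(10_000, 2)       # two-item combinations in a 10,000-item store
store_triples = comb(10_000, 3)     # three-item combinations in the same store

print(f"{menu_triples:,}")
print(f"{store_pairs:,}")
print(f"{store_triples:,}")
```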
The number of transactions is also very large. In the course of a year, a decent-size chain of supermarkets can generate hundreds of millions of transactions. Each of these transactions consists of one or more items, often several dozen at a time. So, determining whether a particular combination of items is present in a particular transaction may require a bit of effort—multiplied a million-fold for all the transactions.
Figure 15-10: This example shows how to count up the frequencies on pizza sales for market basket analysis.
Extending the Ideas
The basic ideas of association rules can be applied to different areas, such as comparing different stores and making some enhancements to the definition of the rules. These are discussed in this section.
Different Items on the Right- and Left-Hand Sides
The examples for association rules, so far, have assumed that the same items can appear on the left- and right-hand side of the rules, leading to an undirected data mining technique. Association rules can also be used in a more directed fashion by having the left- and right-hand sides have different types of items. A case study describing this technique will clarify this approach.
The case study involves a company that does e-mail marketing. It sends out offers, on behalf of other companies, to e-mail lists. For the purposes of this example, a recipient can take three actions:
· Do nothing, which is by far the most common action.
· Click on the e-mail, indicating interest in the e-mail.
· Complain about the e-mail as unwanted spam.
The first of these actions has no benefit and basically no cost (because e-mail is quite inexpensive to send). The second generates revenue for the company, so it is quite important. The third is a cost, and a big cost. If too many customers complain at a particular ISP (Internet service provider), then the ISP might reject all e-mail offers from the company.
Over time, many people on the company's e-mail list receive multiple e-mails, and each e-mail has an offer-type classification. The company's historical data contained numerous examples of customers clicking on e-mail, and far fewer examples of customers complaining. The very first time a customer complains, the customer is removed from all e-mail lists.
How could association rules help this company? Clicks and complaints are different events, even though both are driven by the same content in the e-mail offer. Traditional association rules cannot help in this case. Instead, the solution was to create association rules with clicks on the left-hand side and complaints on the right-hand side. In particular, the analysis asked the question: what offer types lead to complaints when customers have already clicked on other offers? The resulting rules were of the form:
If a customer clicks on offer type A and the customer clicks on offer type B, then the customer is likely to complain on offer type C.
Figure 15-11 shows some examples of combinations of click offers that lead to complaints on other offers. Two things stand out from this figure. The first is that offers of credit reports are associated with complaints, presumably because such offers look a lot like spam. The second is that offers in categories that are quite different from those a customer has clicked on previously are likely to lead to complaints.
Figure 15-11: Some combinations of clicks on e-mail offer types are more likely to lead to complaints on subsequent offers.
Creating these association rules required writing special-purpose code, using the SQL query language. The authors are not familiar with any data mining tools that allow the items on the left-hand side and right-hand side of rules to be different. Also, this example used the chi-square test to find the most important rules. Most data mining tools that implement association rules use support, confidence, and lift for rule selection.
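The authors' implementation was written in SQL and is not shown, so the following Python sketch only illustrates the general shape of rules with clicks on the left-hand side and a complaint on the right-hand side. The data layout (one record per customer, holding the set of clicked offer types and the offer type complained about, if any) is an assumption, and the sketch ranks rules by confidence for brevity rather than by the chi-square test used in the case study.

```python
from collections import Counter
from itertools import combinations

def click_to_complaint_rules(customers, min_support=1):
    """customers: iterable of (clicked_offer_types, complaint_offer_type or None).

    Returns {((click_a, click_b), complaint): confidence}.
    """
    lhs_counts = Counter()   # customers who clicked on both offer types
    rule_counts = Counter()  # of those, customers who then complained on a type
    for clicks, complaint in customers:
        for pair in combinations(sorted(set(clicks)), 2):
            lhs_counts[pair] += 1
            if complaint is not None:
                rule_counts[(pair, complaint)] += 1
    return {
        (pair, complaint): n / lhs_counts[pair]
        for (pair, complaint), n in rule_counts.items()
        if n >= min_support
    }
```

Keeping clicks and complaints in separate fields is what lets the two sides of the rule draw on different types of items.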
Using Association Rules to Compare Stores
Market basket analysis can be used to make comparisons between locations within a single chain. The rule about toilet bowl cleaner sales in hardware stores is an example where sales at new stores are compared to sales at existing stores. Different stores exhibit different selling patterns for many reasons: regional trends, the effectiveness of management, dissimilar advertising, and varying demographic patterns among their customers, for example. Air conditioners and fans are often purchased during heat waves, but heat waves affect only a limited region. Within smaller areas, demographics of the catchment area can have a large impact; stores in wealthy areas typically exhibit different sales patterns from those in poorer neighborhoods. These are examples where market basket analysis can help to describe the differences and serve as an example of using market basket analysis for directed data mining.
How can association rules be used to make these comparisons? The first step is augmenting the transactions with virtual items that specify which group, such as an existing location or a new location, generates the transaction. Virtual items help describe the transaction, although the virtual item is not a product or service. For instance, a sale at an existing hardware store might include the following products:
· A hammer
· A box of nails
· Extra-fine sandpaper
After augmenting the data to specify where it came from, the transaction looks like:
· A hammer
· A box of nails
· Extra-fine sandpaper
· “At existing hardware store”
The virtual item becomes a new item in the transaction for use by association analysis.
TIP
Adding virtual items to the market basket data enables you to find rules that include store characteristics and customer characteristics.
To compare sales at store openings versus existing stores, the process is:
1. Gather data for a specific period (such as two weeks) from store openings. Augment each of the transactions in this data with a virtual item saying that the transaction is from a store opening.
2. Gather about the same amount of data from existing stores. Here you might use a sample across all existing stores, or you might take all the data from stores in comparable locations. Augment the transactions in this data with a virtual item saying that the transaction is from an existing store.
3. Find association rules.
4. Pay particular attention to association rules containing the virtual items.
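The augmentation and filtering steps in this process can be sketched in Python. The function names, the representation of a rule as a (left-hand side, result) pair, and the sample baskets are all hypothetical; the rule-finding step itself is left to whatever association rule tool is in use.

```python
def tag(transactions, virtual_item):
    """Return copies of the transactions with a virtual item added to each."""
    return [frozenset(t) | {virtual_item} for t in transactions]

def rules_mentioning(rules, virtual_items):
    """Keep rules whose left-hand side or result contains a virtual item."""
    wanted = set(virtual_items)
    return [(lhs, rhs) for lhs, rhs in rules if (set(lhs) | {rhs}) & wanted]

# Steps 1 and 2: tag each group of transactions with where it came from.
opening = tag([{"hammer", "toilet bowl cleaner"}], "at store opening")
existing = tag([{"hammer", "box of nails", "extra-fine sandpaper"}],
               "at existing hardware store")
baskets = opening + existing  # input to step 3, the rule-finding step
```

After rules are generated from the combined baskets, rules_mentioning picks out the ones that involve either group, which are the rules of interest in step 4.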
Because association rules are undirected data mining, the rules act as starting points for further hypothesis testing. Why does one pattern exist at existing stores and another at new stores? The rule about toilet bowl cleaners and store openings, for instance, suggests looking more closely at toilet bowl cleaner sales in existing stores.
Using this technique, association analysis can be used for many other types of comparisons:
· Sales during promotions versus sales at other times
· Sales in various geographic areas, by county, standard metropolitan statistical area (SMSA), designated market area (DMA), or country
· Urban versus suburban sales
· Seasonal differences in sales patterns
Adding virtual items to each basket of goods enables the standard association rule techniques to make these comparisons.
Association Rules and Cross-Selling
Association rules seem, at first sight, to be ideally suited to the problem of cross-selling beyond the world of retailing. A bank, for instance, might have several dozen products (such as checking account, mortgage, auto loan, and so on) and be interested in what sells together. These would seem quite amenable to association analysis.
WARNING
Association rules by themselves often produce poor results for cross-selling in domains such as financial services that have only a few dozen products.
Unfortunately, the results from such analyses can be quite disappointing. One reason is that many financial products are bundled together—products such as a checking account and an overdraft line of credit, or a checking account and a debit card. These rules tend to dominate associations. A more important reason for the disappointment, though, is that association rules skip over a vast repertoire of data describing customers. They only use one set of features—the items purchased in the past. This section discusses an alternative approach for incorporating information from association rules in cross-sell models.
A Typical Cross-Sell Model
Figure 15-12 shows a typical cross-sell model for a company that has, at most, a few dozen products. In this case, the best approach is to develop a separate propensity model for each product. The propensity model for a product estimates the likelihood of a customer purchasing that product. Typically, when the customer already has the product, the propensity is set to 0, so the company does not look stupid, trying to sell customers products that they already have.
Figure 15-12: A typical cross-sell model builds propensities for each product and then has a decisioning algorithm to choose the best product for each customer.
The propensities for each product are combined using a decisioning algorithm, to allow business users to incorporate other factors into the recommendation. The simplest decision is to choose the product with the highest propensity, because this is the product for which the customer has the most affinity. More sophisticated approaches take into account the revenue generated by a product or the net revenue, as well as the propensity. Using this information, each customer can be offered the most profitable product. The decisioning algorithm, in this case, would calculate the expected revenue using the propensity score generated by the product model.
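A revenue-weighted decision of this kind can be sketched in a few lines of Python. The function name, the product names, and the scores and revenue figures in the usage example are all hypothetical; the sketch assumes the propensity models have already produced one score per product for the customer.

```python
def best_offer(propensities, revenue, owned):
    """Pick the product with the highest expected revenue for one customer.

    propensities: {product: estimated probability of purchase}
    revenue:      {product: revenue if purchased}
    owned:        products the customer already has (propensity treated as 0)
    """
    expected = {
        product: (0.0 if product in owned else score) * revenue[product]
        for product, score in propensities.items()
    }
    return max(expected, key=expected.get)

choice = best_offer(
    propensities={"checking": 0.90, "mortgage": 0.05, "auto loan": 0.20},
    revenue={"checking": 50, "mortgage": 3000, "auto loan": 500},
    owned={"checking"},
)
print(choice)
```

In this made-up example the mortgage wins despite its low propensity, because its expected revenue is the largest of the products the customer does not already have.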
This style of developing product propensity models does not mandate a particular form for each propensity model. Often, logistic regression is the technique of choice. However, decision trees, neural networks, and MBR approaches are also very reasonable choices.
The best way to train these models is to use information known about customers just before they purchase (or do not purchase) each product. Although existing products may play a role in the models, other variables would be likely to dominate the models. One downside to the propensity models is that including associations among products is very difficult.
A More Confident Approach to Product Propensities
As discussed earlier, association rules find associations among products, but other information is difficult to include. Propensity models include lots of rich information, but associations are difficult to include. Perhaps this is a Reese's moment: “you got peanut butter in my chocolate” and “you got chocolate in my peanut butter.” By combining the two approaches, something better emerges.
The combination works by augmenting the input data for each product propensity model. One additional variable is added to the data used for each propensity model. Remember, each propensity model is for a specific product. At the same time, you can develop association rules that predict the same product on the right-hand side. Different rules are going to apply to each customer, even for the same product, because different customers have different products for the associations. Of the rules that apply to a given customer, one is more confident than the others, and that is the rule selected.
The confidence for the best rule for each customer is then added as an input variable in the model set used to develop each product propensity model. By including this confidence, the product propensity models can leverage the information from associations, while still using the demographic and behavioral data available about each customer.
Results from Using Confidence
This approach for adding the confidence to product propensity models was developed by a friend of the authors, Frank Travisano, who works for a large insurance company. Although we are not allowed to use results from his endeavors, the resulting propensities were much improved using the confidence value, leading to better cross-sell results.
The biggest challenge in implementing this as a solution is the challenge of scoring the association rules. Finding the rule with the maximum confidence requires scanning through all the rules that a given customer qualifies for to identify the most confident one. Using association rules for this purpose requires special purpose coding.
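The scoring step might look something like the following Python sketch. It is an illustration, not the implementation referred to above: the rule format (left-hand side, result, confidence), the function name, and the sample rules are assumptions.

```python
def best_rule_confidence(owned, rules, target):
    """Highest confidence among rules for `target` whose left-hand side the customer satisfies.

    rules: iterable of (lhs_products, rhs_product, confidence).
    Returns 0.0 when no rule applies to this customer.
    """
    owned = frozenset(owned)
    applicable = [
        conf for lhs, rhs, conf in rules
        if rhs == target and frozenset(lhs) <= owned
    ]
    return max(applicable, default=0.0)

# Hypothetical rules predicting a credit card.
rules = [
    ({"checking"}, "credit card", 0.30),
    ({"checking", "mortgage"}, "credit card", 0.55),
    ({"auto loan"}, "credit card", 0.70),
]
feature = best_rule_confidence({"checking", "mortgage"}, rules, "credit card")
```

The returned value is what would be added as the extra input variable for that customer in the propensity model for the target product.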
Sequential Pattern Analysis
Sequential pattern analysis (also called longitudinal analysis) adds the element of time to market basket analysis. The analysis considers not only associations among items, but also sequences of items where ordering in time is important. Here are some examples:
· New homeowners purchase shower curtains before purchasing furniture.
· Customers who purchase new lawnmowers are quite likely to purchase a new garden hose in the following six weeks.
· When a customer goes into a bank branch and asks for an account reconciliation, chances are good that he or she will close all his or her accounts within two weeks.
All of these involve actions that occur in a particular order, timewise. All of these—and sequential pattern analysis in general—require identifying a person over time. In particular, sequential pattern analysis does not make sense with anonymous transactions, because these cannot be tied together over time.
WARNING
To consider time-series analyses on your customers, you must have some way of identifying customers. Without a way of tracking individual customers, you cannot analyze their behavior over time.
Finding the Sequences
Sequential pattern analysis starts by looking at what sequences are in the data. The basic data structure for market basket analysis introduced earlier is fine, as long as the order record (or line item) contains a time stamp or some sort of sequential marker.
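As a rough illustration of extracting sequences from time-stamped records, the Python sketch below groups records by customer, orders them by date, and counts the resulting sequences. The record layout (customer identifier, date, one-letter item code) and the function name are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date

def sequence_counts(records):
    """records: iterable of (customer_id, event_date, item_code).

    Returns a Counter mapping each ordered sequence of codes to the
    number of customers who exhibit it.
    """
    by_customer = defaultdict(list)
    for customer_id, event_date, code in records:
        by_customer[customer_id].append((event_date, code))
    return Counter(
        "".join(code for _, code in sorted(events))
        for events in by_customer.values()
    )

# Hypothetical records, deliberately out of order.
records = [
    ("p1", date(2010, 4, 1), "L"),
    ("p1", date(2010, 1, 1), "L"),
    ("p2", date(2010, 2, 15), "L"),
    ("p1", date(2010, 7, 1), "L"),
]
print(sequence_counts(records))
```

The grouping step is where identification matters: without a customer identifier there is nothing to tie the records together.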
The example used in this section comes from the world of health care, where such analysis is very important because of the complexity of medical care. The simple example here looks at particular drugs in a particular market: cholesterol-lowering drugs (technically, the lipid-lowering market).
These results are useful for several reasons. One is that health care providers and pharmaceutical companies can only measure patients' usage by their prescriptions. Most prescriptions are for 30 or 90 days of therapy. So, unlike services such as banking, cable television, and telephone service, customers do not “subscribe” to a therapy; instead, their usage has to be inferred from individual transactions.
Another reason this is useful is that only a handful of drugs in any market compete against each other. The number of different options is relatively small, making it more feasible to understand the sequencing. Of course, the ideas here go beyond pharmaceutical usage and health care; similar ideas can be applied in many other domains.
Sequential Patterns with Just One Item
In some cases, even one item can be interesting over time. For instance, many pharmaceutical products are for chronic conditions, and should be taken consistently over time. Often, any particular product could be replaced by a competing product.
Consider the situation in the lipid-lowering market, which has been a highly competitive market because lowering cholesterol levels is widely deemed to be beneficial. Pharmaceutical companies want patients to use their product, because such companies want to sell more pills. Health care insurers are interested in these therapies, because patients who control their blood cholesterol levels are at reduced risk for more expensive treatments. And health care providers want to keep patients on the therapy, because it is in the best interest of the patients. A business question that might be asked by any of these is: what sequences of usage patterns do patients exhibit?
Table 15-10 shows an example of prescription sequences over the course of one year for just one product, Lipitor. Each “L” in the sequence represents one prescription filled by a patient. A prescription is typically for 30 or 90 days of therapy. Interestingly, the most common sequence has a length of four, which probably corresponds to four 90-day prescriptions. These patients are on the therapy, happily filling their prescriptions every three months, quite possibly by mail order. The next most common sequence has 12 prescriptions, corresponding to a similar group of patients who fill their prescriptions once per month.
Table 15-10: Prescription Sequences for One Calendar Year
Such a table can be used to answer questions such as: if a patient has six prescriptions in one year, what is the probability that the patient will have seven? In this data, there are 51,446 patients with six or more prescriptions (all the “OTHER” values are sequences longer than six). Of these, 6,013 have exactly six. Another way of saying this is that 11.7 percent of these patients stop at six prescriptions, so 88.3 percent continue to have more prescriptions. When doing such analyses, remember that patients with seven, eight, or more prescriptions also have six prescriptions.
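The arithmetic behind these percentages is a simple ratio. The short Python sketch below reproduces it using only the two counts quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the text for the one-year Lipitor sequences.
patients_six_or_more = 51_446   # patients with at least six prescriptions
patients_exactly_six = 6_013    # patients whose sequence stops at six

# Share of patients who stop at six, and the complement who go on to a seventh.
stop_rate = patients_exactly_six / patients_six_or_more
continue_rate = 1 - stop_rate

print(f"{stop_rate:.1%} stop at six; {continue_rate:.1%} continue")
```

The denominator includes everyone with six or more prescriptions, because each of those patients also reached a sixth prescription.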
This data uses prescriptions for all patients who filled one during the year. Most patients who take cholesterol-lowering drugs do so for long periods of time; some patients would have started during the year. The preceding table cannot be used to answer the question, “What is the probability that a patient will continue to a seventh prescription after his or her sixth prescription?” To answer this question, the data would have to track patients on the tenure timeline, relative to when the patient started therapy, rather than on the calendar timeline. Such an analysis would use ideas from survival analysis.
Sequential Patterns to Visualize Switching Behavior
Homogeneous items are interesting, but switching behavior is usually more interesting. Of course, additional items mean more possible sequences, too many to publish in a small table in a book.
Table 15-11 shows a small subset of possible sequences for the same market. The sequences in this table all start with 11 “Zs,” representing 11 Zocor prescriptions. Additional letters represent other products, such as “L” for Lipitor, “V” for Vytorin, “C” for Crestor, and “M” for Mevacor.
Table 15-11: Patients with 11 Zocor Prescriptions
Switching among this group is quite uncommon, which is not surprising because a patient (and his or her doctor) who has 11 prescriptions in a row is probably content with the product. By far, the most common product that such patients switch to is Vytorin. This, too, is not surprising. Vytorin is actually a combination drug, consisting of Zocor and another drug in a single pill (and for some patients, a single copayment means that they save money every month).
These sequences make it possible to quantify the answers to questions such as the following:
· ▪What proportion of patients who have 11 Zocor prescriptions continue to a twelfth?
· ▪What proportion stop at 11 prescriptions?
· ▪What proportion switch to another product?
These questions are actually similar to questions encountered in Chapter 10 on survival analysis. In particular, the last question incorporates the idea of competing risks.
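One possible way to quantify the three questions above is to classify each sequence by what follows its first 11 Zocor prescriptions. The Python sketch below assumes sequences are stored as strings of one-letter product codes; the function name classify_after_prefix and the sample sequences are hypothetical, so the resulting proportions say nothing about the real data.

```python
from collections import Counter

def classify_after_prefix(sequence, product="Z", n=11):
    """Classify a sequence that starts with n fills of `product`.

    Returns "stop", "continue", or "switch:<code>", or None when the
    sequence does not start with the required prefix.
    """
    if not sequence.startswith(product * n):
        return None
    if len(sequence) == n:
        return "stop"
    next_item = sequence[n]
    return "continue" if next_item == product else f"switch:{next_item}"

# Hypothetical sequences; the last one never qualifies and is ignored.
sequences = ["Z" * 11, "Z" * 12, "Z" * 11 + "V", "Z" * 11 + "L", "Z" * 12, "LLLL"]

outcomes = Counter(
    outcome
    for outcome in (classify_after_prefix(s) for s in sequences)
    if outcome is not None
)
total = sum(outcomes.values())
proportions = {outcome: count / total for outcome, count in outcomes.items()}
print(proportions)
```

Treating each switch target as its own outcome mirrors the competing-risks idea: a patient leaves the "11 Zocor prescriptions" state in exactly one of several mutually exclusive ways.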
Working with Sequential Patterns
As the number of items increases, the number of potential paths increases dramatically. Unfortunately, not many tools handle such sequences very well, so such analysis often requires custom coding. The earlier examples were developed using programs written in the SAS programming language.
This is just the beginning of what could be accomplished. For instance, stopping a sequence when a customer has had no activity for a period of time might be desirable. A prescription more than three months after the previous one ends, for instance, might start a new “episode of treatment.”
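The "episode of treatment" idea can be sketched as follows. This is a minimal Python illustration, assuming each fill carries a days-of-supply value and that "three months" is approximated as 90 days; the function name split_episodes, the record layout, and the sample fills are hypothetical.

```python
from datetime import date, timedelta

def split_episodes(fills, max_gap_days=90):
    """Split one patient's fills into episodes of treatment.

    fills: iterable of (fill_date, days_supply, product_code).
    A new episode starts when a fill occurs more than max_gap_days
    after the supply from the earlier fills has run out.
    """
    episodes = []
    prev_end = None
    for fill_date, days_supply, product in sorted(fills):
        if prev_end is None or (fill_date - prev_end).days > max_gap_days:
            episodes.append("")          # start a new episode
        episodes[-1] += product
        new_end = fill_date + timedelta(days=days_supply)
        prev_end = new_end if prev_end is None else max(prev_end, new_end)
    return episodes

# Hypothetical fills for one patient.
fills = [
    (date(2010, 1, 1), 30, "L"),
    (date(2010, 2, 1), 30, "L"),
    (date(2010, 9, 1), 90, "L"),   # long gap after the previous supply ends
    (date(2010, 12, 1), 90, "C"),
]

episodes = split_episodes(fills)
print(episodes)
```

The gap is measured from the end of the previous supply rather than from the previous fill date, which matches the wording "after the previous one ends."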
These examples are focused on the pharmaceutical industry. However, such paths can occur in many areas. Customers in a bank take a “path” as they acquire and use different bank products. Visitors to a website take paths through the website. Corporate customers take paths as they acquire different hardware products, and so on.
Sequential Association Rules
You can also use the ideas behind association rules for sequential pattern analysis. To handle sequential data, the transaction data must have two additional features:
· ▪A timestamp or sequencing information to determine when transactions occurred relative to each other
· ▪Identifying information, such as account number, household ID, or customer ID that identifies different transactions as belonging to the same customer or household (sometimes called an economic marketing unit)
Building sequential rules is similar to the process of building association rules:
· 1.All items purchased by a customer are treated as a single order, and each item retains the timestamp indicating its position in the sequence.
· 2.The process is the same for item sets, groups of items being considered at one time.
· 3.The item sets contain timestamp (or sequence) information as well as items.
· 4.To develop the rules, only rules where the items on the left-hand side of the rule appeared before items on the right-hand side are considered.
The result is a set of association rules that can reveal sequential patterns.
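The steps above can be sketched in Python for the simplest case, where each rule has a single item on each side. This is a simplification for illustration, not a full sequential rule miner: the function name sequential_pair_rules, the min_support threshold, and the bank-product histories are hypothetical. A rule "A then B" is counted once per customer when some occurrence of A precedes some occurrence of B.

```python
from collections import Counter

def sequential_pair_rules(histories, min_support=0.5):
    """Find rules 'a then b' where a occurs before b for the same customer.

    histories: dict mapping customer_id -> list of (time, item).
    Returns {(a, b): {"support": ..., "confidence": ...}}.
    """
    n = len(histories)
    item_counts = Counter()   # customers having each item at least once
    pair_counts = Counter()   # customers having a before b
    for events in histories.values():
        first_seen, last_seen = {}, {}
        for time, item in sorted(events):
            first_seen.setdefault(item, time)
            last_seen[item] = time
        item_counts.update(first_seen.keys())
        for a in first_seen:
            for b in first_seen:
                # Only keep orderings where the left-hand item came first.
                if a != b and first_seen[a] < last_seen[b]:
                    pair_counts[(a, b)] += 1
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = {
                "support": support,
                "confidence": count / item_counts[a],
            }
    return rules

# Hypothetical customer histories: (sequence position, product acquired).
histories = {
    "c1": [(1, "checking"), (2, "savings"), (3, "mortgage")],
    "c2": [(1, "checking"), (2, "mortgage")],
    "c3": [(1, "savings"), (2, "checking")],
    "c4": [(1, "checking"), (3, "savings")],
}

rules = sequential_pair_rules(histories)
print(rules)
```

The customer ID groups the transactions and the time value orders them, which are the two additional features the text requires of the data.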
Sequential Analysis Using Other Data Mining Techniques
Sequential analysis is related to other data mining techniques. As mentioned earlier in this section, the language for understanding sequences can become similar to the language used for survival analysis. In fact, the two have many similarities. Survival analysis focuses on timing; however, competing risks in combination with repeated events is a type of sequential pattern analysis. Consider a survival analysis of patient therapies. After a patient fills a particular prescription, how long until the next prescription is filled? And will the next prescription be for the same drug or for another one? Even more interesting is the fact that the previous sequence of events—at the time the first prescription is filled—can be incorporated as a covariate.
Sequential pattern analysis also analyzes sequences of customer behavior. The major difference is the definition of time. For survival analysis, time is measured in common units, such as days, weeks, and months. For sequential pattern analysis, the ordering is more important than the particular durations.
From the perspective of statistics, sequential pattern analysis is an example of path analysis. However, path analysis is also often used in the specific context of link analysis for analyzing websites. Regardless of the terminology, there is a relationship between sequential pattern analysis and link analysis, the topic of the next chapter.
Lessons Learned
Market basket data describes what customers purchase. Analyzing this data is complex, and no single technique is powerful enough to provide all the answers. The data itself typically describes the market basket at three different levels: the order is the event of the purchase, the line items are the items in the purchase, and the customer connects orders together over time.
Many important questions about customer behavior can be answered by looking at product sales over time. Which are the best-selling items? Which items that sold well last year are no longer selling well this year? Inventory curves do not require transaction-level data. Perhaps the most important insight they provide is the effect of marketing interventions—did sales go up or down after a particular event?
However, inventory curves are not sufficient for understanding relationships among items in a single basket. One technique that is quite powerful is association analysis. This technique finds products that tend to sell together in groups. Sometimes the groups are sufficient for insight. Other times, the groups are turned into explicit rules—when certain items are present, then you expect to find certain other items in the basket.
There are four measures of association rules. Support tells how often the rule is found in the transaction data. Confidence says how often the “then” part is true when the “if” part is true. Lift tells how much better the rule is at predicting the “then” part as compared with having no rule at all. The chi-square measure from statistics can also be adapted to finding the best association rules.
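The first three measures can be computed directly from a list of baskets. The Python sketch below follows the definitions in the paragraph above, taking "no rule at all" to mean the overall proportion of baskets containing the “then” part; the function name rule_measures and the sample baskets are hypothetical, and the chi-square measure is left out.

```python
def rule_measures(baskets, lhs, rhs):
    """Return (support, confidence, lift) for the rule 'if lhs then rhs'.

    baskets: list of sets of items; lhs and rhs: iterables of items.
    """
    lhs, rhs = set(lhs), set(rhs)
    n = len(baskets)
    n_lhs = sum(1 for basket in baskets if lhs <= basket)
    n_rhs = sum(1 for basket in baskets if rhs <= basket)
    n_both = sum(1 for basket in baskets if (lhs | rhs) <= basket)
    if n_lhs == 0 or n_rhs == 0:
        raise ValueError("both sides of the rule must occur in the data")
    support = n_both / n              # how often the whole rule appears
    confidence = n_both / n_lhs       # P(then | if)
    lift = confidence / (n_rhs / n)   # improvement over P(then) alone
    return support, confidence, lift

# Hypothetical baskets.
baskets = [
    {"beer", "diapers"},
    {"beer", "diapers", "milk"},
    {"milk"},
    {"beer"},
    {"milk", "bread"},
]

support, confidence, lift = rule_measures(baskets, {"beer"}, {"diapers"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the “then” part shows up more often in baskets containing the “if” part than in baskets overall; a lift below 1 means the rule does worse than guessing from the overall rate.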
The rules so generated fall into three categories. Useful rules explain a relationship that was perhaps unexpected. Trivial rules explain relationships that are known (or should be known) to exist. And inexplicable rules simply do not make sense. Inexplicable rules often have weak support.
Market basket analysis and association analysis provide ways to analyze item-level detail, where the relationships between items are determined by the baskets they fall into. The ideas can be extended to sequential pattern analysis, which takes into consideration the order in which purchases are made, in addition to their contents. The next chapter turns to link analysis, which generalizes the ideas of “items” linked by “relationships,” using the background of an area of mathematics called graph theory.