Data Visualize

Copyright © 2014 EMC Corporation. All rights reserved.

The Endgame, or Putting it All Together

Module 6: The Endgame

During this lesson the following topics are covered:

• Survey of data visualization tools

• Creating different visualizations for sponsors and analysts

• Developing visuals to support your key points

• How to clean up a chart or visualization

• Tips and tricks

Data Visualization Techniques

This lesson covers tools and recommendations for creating visuals for sponsors and analysts.

Key Points Supported With Data: Overview of Visualization Tools

Open Source

• R
  - Base package
  - ggplot
  - Lattice
• Ggobi/Rggobi
• Inkscape
• Processing
• Modest Maps
• GnuPlot

Commercial Tools

• Tableau
• Spotfire
• Qlikview
• Adobe Illustrator

Many good visualization tools are available to help you create clear graphics for presentations and applications; listed here are some of the more popular ones. As the volume and complexity of data have grown, users have come to rely more on crisp visuals to illustrate key ideas and to portray rich data in a digestible way.

In this course, we use R as our main tool for data analysis and visualization. Over time, the open source community has developed many additional libraries that give you more options for portraying data visually. We focus on the base package of R in this course, although ggplot provides many more options for creating professional-looking graphics, as does the Lattice library. For good examples of using open source visualization tools, you may refer to Nathan Yau’s website, flowingdata.com, or his book Visualize This, which provides additional methods for developing visualizations with many more open source tools.

Regarding the commercial tools listed above, Tableau, Spotfire (by Tibco), and Qlikview function both as data visualization tools and as business intelligence tools. With the growth of data in the last few years, organizations are beginning to favor ease of use and visualization in business intelligence over more traditional BI tools and databases. These tools make visualization easy through good user interfaces. Adobe Illustrator is listed because some professionals use it to enhance visualizations made in other tools. Inkscape is an open source tool used for similar purposes, with much of Illustrator’s functionality.

Key Points Supported With Data: Tables of Information

Store data by year (years run across the columns on the slide; blank BigBox cells are empty in the source table):

Year        SuperBox  BigBox  Grand Total
1962               1                    1
1964               1                    1
1965               1                    1
1967               1       1            2
1968               5                    5
1969               4       1            5
1970               4       1            5
1971              14       1           15
1972              13       4           17
1973              14       5           19
1974              20       5           25
1975              14       5           19
1976              17      10           27
1977              29      10           39
1978              24      10           34
1979              37       6           43
1980              33      21           54
1981             117      33          150
1982              42      21           63
1983              65      22           87
1984              79      20           99
1985              81      29          110
1986              90      31          121
1987              92      50          142
1988              82      43          125
1989              86      45          131
1990             106      72          178
1991              72      91          163
1992              62      76          138
1993              62      94          156
1994              40      67          107
1995              49      80          129
1996              22      31           53
1997              26      34           60
1998              33      33           66
1999              47      33           80
2000              78      27          105
2001              71      35          106
2002              67      47          114
2003              64      32           96
2004              91      39          130
2005              91      27          118
2006              33       4           37
Grand Total     1980    1196         3176

• What do you observe from this data?

• What’s the main message?

• What is the author trying to emphasize with the data?

• Tailor outputs to the audience

The slide shows this data twice: once as 44 years of BigBox stores data (the table above, 1962-2006) and once as 34 years of BigBox stores data, which keeps only the 1972-2006 columns with the same values and the same Grand Total column (SuperBox 1980, BigBox 1196, Grand Total 3176).

It is more difficult for people to pick out key insights when data is in tables than when it is in charts. To underscore this point, in “Say it with Charts”, Gene Zelazny notes that to highlight data you should create a visual out of it, such as a chart, graph or other data visualization. The converse is also true: if for some reason you choose to downplay the data, leaving it in a table will draw less attention to it and make it more difficult for people to digest.

The way you organize the visual, in terms of color scheme, labels and sequence of information, will also influence how the viewer processes the information and what they believe your key message is. The table shows many data points, and given its layout it is difficult to take away the key points at a glance. There are several observations in the data (if you look closely), such as:

1) BigBox experienced strong growth in the 1980s and 1990s

2) By the 1980s, BigBox began adding more SuperBox stores to its mix of chain stores

3) SuperBox stores outnumber regular stores by roughly 1.7 to 1 (1980 vs. 1196 in the Grand Total column)

Depending on the point you wish to make, take care to organize the information so that the viewer intuitively takes away the main point you intend. Otherwise, they will guess at your main point and may take away something different from what you intended.
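The observations above can be checked directly against the table. The following is a minimal sketch in Python, assuming the yearly values transcribed from the table; BigBox is derived as Grand Total minus SuperBox, and the function name `totals_by_decade` and the text bars are illustrative choices, not part of the course material.

```python
from collections import defaultdict

# Yearly values transcribed from the table (1962-2006).
# Years that have no column in the table (1963, 1966) are absent here too.
YEARS = [1962, 1964, 1965, 1967, 1968, 1969, 1970, 1971] + list(range(1972, 2007))
SUPERBOX = [1, 1, 1, 1, 5, 4, 4, 14, 13, 14, 20, 14, 17, 29, 24, 37, 33, 117,
            42, 65, 79, 81, 90, 92, 82, 86, 106, 72, 62, 62, 40, 49, 22, 26,
            33, 47, 78, 71, 67, 64, 91, 91, 33]
GRAND_TOTAL = [1, 1, 1, 2, 5, 5, 5, 15, 17, 19, 25, 19, 27, 39, 34, 43, 54, 150,
               63, 87, 99, 110, 121, 142, 125, 131, 178, 163, 138, 156, 107, 129,
               53, 60, 66, 80, 105, 106, 114, 96, 130, 118, 37]

# Assumption: BigBox is whatever remains after subtracting SuperBox.
BIGBOX = [g - s for g, s in zip(GRAND_TOTAL, SUPERBOX)]


def totals_by_decade(years, counts):
    """Sum yearly counts into decades, e.g. 1987 is counted under 1980."""
    decades = defaultdict(int)
    for year, count in zip(years, counts):
        decades[year // 10 * 10] += count
    return dict(sorted(decades.items()))


decade_totals = totals_by_decade(YEARS, GRAND_TOTAL)
superbox_to_bigbox = sum(SUPERBOX) / sum(BIGBOX)

# A crude text "chart": one '#' per 25 stores.
for decade, total in decade_totals.items():
    print(f"{decade}s: {total:5d} " + "#" * (total // 25))
print(f"SuperBox : BigBox = {superbox_to_bigbox:.2f} : 1")
```

Even this rough summary makes the growth in the 1980s and 1990s visible at a glance, which is hard to see in the raw table.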

Key Points Supported With Data: Using Visuals to Illustrate Key Points

Example of a Visual to help tell a story to a Sponsor

Above is a map of the U.S., showing the geographic location of BigBox stores. This is an example of a much more powerful way to depict data than in small tables, and this would be well suited to a sponsor audience. For a sponsor audience, you could also use simpler techniques, such as bar charts or line charts.

Evolution of a Graph, Hypothetical Example: Exploring Pricing Data

Example of exploring customer price data, price distributions and stability over time

Example of exploring price tiering for most and least loyal customers

The graphics above portray a hypothetical example of some of the steps a Data Scientist may go through in analyzing customer pricing data. Data Scientists typically iterate and view the data in many different ways, framing hypotheses, testing them and exploring the implications of a given model. In this case, we are looking at visual examples of pricing distributions and fluctuations in pricing, and exploring the differences in price tiers before and after implementing a new model to optimize price. The first row of visualizations depicts distributions of the data, in raw and log form, as well as a scatterplot of the prices over time to gauge the variability and consistency of the data. The second row shows price tiering and how this can change depending on the optimization methods. These visualizations illustrate how the data may look as the result of the model, and help a data scientist understand the relationships within the data at a glance.

Evolution of a Graph, Analyst Example

• Implementing new price tiering approach increases the precision of price promotions by 23%

• Price optimization model explains 92% of customer behavior

• Model can be run in a production environment on a daily basis, if needed, to tailor changes to direct mail campaigns and web promotional offers

Above is an example of the output from the price optimization project scenario, showing how one may present this to an audience of other Data Scientists or data analysts. It shows a curvilinear relationship between price tiers and customer loyalty, when expressed as an index. Note that the comments at the right of the graph relate to the precision of the price targeting, the share of customer behavior the model explains, and the expectation that the model can run in a production environment as often as daily.

Evolution of a Graph, Sponsor Example

• Before the project, pricing promotions were offered to all customers equally
• With the new approach:
  - Highly loyal customers do not receive as many price promotions, since their loyalty is not strongly influenced by price
  - Customers with low loyalty are influenced by price, and we can now target them for this purpose better
• We project multiple cost savings with this approach
  - $2M in lost customers
  - $1.5M in new customer acquisition costs
  - $1M in reductions for pricing promotions

Above is an example of the output from the price optimization project scenario, showing how one may present this to an audience of project sponsors. This shows a simple bar chart to depict the average price per customer or user segment. This is a much simpler looking visual than the prior slide, and this one clearly shows that customers with lower loyalty scores tend to get lower prices, due to targeting from price promotions.

Note that the comments at the right of the graphic relate to explaining the impact of the model at a high level and the cost savings of implementing this approach to price optimization.

Key Points Supported With Data: Common Representation Methods

If you want to compare this kind of information...   ...consider this kind of chart
Components                                           Pie chart
Item                                                 Bar chart
Time Series                                          Line chart
Frequency                                            Line charts, histograms
Correlation                                          Scatterplot, side-by-side bar charts

Shown are some basic chart types. Different types of charts suit different situations, depending on the data you have and the message you are trying to convey.

The table is by no means exhaustive; it illustrates the most basic data representations, which can be combined, embellished and made more sophisticated depending on the situation and the audience. Consider the message you are trying to communicate, then choose an appropriate visual to support the point. Misusing charts tends to confuse the audience, so be sure to take into account the data type and message when choosing a chart.

Pie charts are designed to show components, or parts relative to the whole. The pie chart is also the most overused chart. If you are going to use one, use it to show only 2-3 items and only for sponsor audiences. Bar charts and line charts are used much more often, and are very useful for showing comparisons and trends over time. Although many people tend to use vertical bar charts, horizontal bar charts give you more room to fit text labels next to the bars. Vertical bar charts tend to work well when the labels are short, such as years in a comparison over time.

For frequency, histograms show the distribution of data and are useful for presenting information to an analyst audience or to data scientists. Examining data distributions is typically one of the first steps in visualizing data to prepare for model planning. For correlation, scatterplots are useful for comparing relationships among variables.

As with any presentation, consider the audience and their level of sophistication when selecting the chart to convey your message. These charts are simple examples, but can easily become more complex with additional data variables, combining charts together, or adding animation where appropriate.

For additional reference on data types and their related charts, you may want to look at the URL: http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html
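The chart-selection table can be expressed as a simple lookup. This is a minimal sketch in Python, not a definitive rule set: the function name `suggest_charts` and its `n_items` and `audience` parameters are hypothetical, and the fallback from a pie chart to a bar chart is an assumption made here to reflect the advice about limiting pie charts to 2-3 items for sponsor audiences.

```python
# Lookup transcribed from the "Common Representation Methods" table.
CHART_GUIDE = {
    "components": ["pie chart"],
    "item": ["bar chart"],
    "time series": ["line chart"],
    "frequency": ["line chart", "histogram"],
    "correlation": ["scatterplot", "side-by-side bar chart"],
}


def suggest_charts(comparison, n_items=None, audience=None):
    """Return candidate chart types for a kind of comparison.

    n_items and audience are hypothetical parameters, used only to apply
    the pie-chart advice from the notes (2-3 items, sponsor audiences).
    """
    key = comparison.strip().lower()
    if key not in CHART_GUIDE:
        raise ValueError(f"no guidance for comparison type: {comparison!r}")
    charts = list(CHART_GUIDE[key])
    if key == "components":
        pie_ok = n_items is not None and n_items <= 3 and audience == "sponsor"
        if not pie_ok:
            # Assumption: fall back to a bar chart when a pie chart is a poor fit.
            charts = ["bar chart"]
    return charts


print(suggest_charts("Time Series"))
print(suggest_charts("components", n_items=3, audience="sponsor"))
print(suggest_charts("components", n_items=8, audience="analyst"))
```

A lookup like this is only a starting point; the message and the audience still decide the final choice.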

How to Clean Up a Graphic, Example 1: The Before Picture

[Figure: line chart comparing two trends over time; y-axis ticks every 5 from 0 to 125, x-axis from 1962 to 2006; no title or legend]

Chart junk

1. Horizontal Grid Lines

2. Chunky data points

3. Overuse of emphasis colors; lines & border

4. No context or labels

5. Crowded axis labels

• What are the main messages here? What is the author trying to emphasize?

• What’s wrong with this picture?

Shown is an example of a line chart comparing two trends over time. It is a busy-looking chart and contains a lot of “chart junk”, which distracts the viewer from the main message. Listed at the left of the chart are some of the chart junk problems this visual suffers from, which are easily addressed as shown in the next slide. Note that there is no clear message associated with the chart and no legend to provide context for what is shown.

How to Clean Up a Graphic, Example 1: After

• What are the main messages here?

• What is the author trying to emphasize?

[Figure, left: line chart titled "Growth of SuperBox Stores (Count of Stores)"; x-axis 1962 to 2002, y-axis 0 to 140; series: SuperBox, BigBox]

[Figure, right: line chart titled "Difference in Store Openings (Count of SuperBox - Count of BigBox Stores)"; x-axis 1962 to 2002, y-axis -40 to 100; series: Diff in SuperBox vs. BigBox]

These are two examples of cleaned-up versions of the chart on the previous page. Note that the problems with chart junk have been addressed, there is a clear label and title for each chart to reinforce the message, and color has been used to highlight the point the author is trying to make.

Note the amount of white space in each of the two charts. Removing grid lines, excessive axes, and the visual noise within the chart allows you to create very clear contrast between emphasis colors (the green line charts) and the standard colors (the light gray of the BigBox stores). When creating charts, it is best to do most of your main visuals in standard colors, light tones or color shades, so that you can add stronger emphasis colors to draw attention to the parts of the graphic that demonstrate your main points. In this case, we have drawn the trend of BigBox stores in light gray so that it fades into the background without disappearing, and the SuperBox stores trend in a bright green so that it is prominent and supports the message the author is making about the growth of the SuperBox stores.

An alternative is shown at right. If the main message is the difference in the growth of new stores, you could simplify the chart further and graph only the difference between SuperBox stores and regular BigBox stores. The two examples illustrate different ways to convey your message, depending on what you would like to show.
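Both cleaned-up options can be sketched in a few lines of Python. This is a hedged illustration rather than the method used to produce the slides: it uses only the 1996-2006 values from the store table to stay short, the function names `difference_series` and `plot_emphasis` are hypothetical, and the plotting function assumes matplotlib is installed (it is imported only when the function is called).

```python
# 1996-2006 values transcribed from the store table.
YEARS = list(range(1996, 2007))
SUPERBOX = [22, 26, 33, 47, 78, 71, 67, 64, 91, 91, 33]
BIGBOX = [31, 34, 33, 33, 27, 35, 47, 32, 39, 27, 4]


def difference_series(emphasis, baseline):
    """Option 2: reduce two series to one by plotting only their difference."""
    if len(emphasis) != len(baseline):
        raise ValueError("series must have the same length")
    return [e - b for e, b in zip(emphasis, baseline)]


def plot_emphasis(years, emphasis, baseline, title, path):
    """Option 1: baseline in light gray, emphasis series in green.

    Assumes matplotlib is available; it is imported lazily so the rest of
    this sketch runs without it.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(years, baseline, color="lightgray", linewidth=2, label="BigBox")
    ax.plot(years, emphasis, color="green", linewidth=2.5, label="SuperBox")
    ax.grid(False)                       # no grid lines
    for side in ("top", "right"):        # drop the extra border lines
        ax.spines[side].set_visible(False)
    ax.set_xticks(years[::2])            # fewer, less crowded axis labels
    ax.set_title(title)                  # the title states the message
    ax.legend(frameon=False)
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)


DIFF = difference_series(SUPERBOX, BIGBOX)
print(list(zip(YEARS, DIFF)))
```

Calling `plot_emphasis(YEARS, SUPERBOX, BIGBOX, "Growth of SuperBox Stores (Count of Stores)", "superbox.png")` would write the first option to a file; the file name is arbitrary.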

How to Clean Up a Graphic, Example 2: The Before Picture

Chart junk

1. Vertical Grid Lines

2. Too much emphasis colors

3. No chart title

4. Legend at right restricts chart space

5. Labels are too small

• What are the main messages here? What is the author trying to emphasize?

• What’s wrong with this picture?

[Figure: chart of store counts by year; x-axis 1996 to 2006, y-axis 0 to 140; series: SuperBox, BigBox, Grand Total; no chart title]

Here is a sample graphic with typical problems related to chart junk, including misuse of color schemes and lack of context. Shown at left are the main problems with the graphic, with cleaned-up alternatives to this visual on the subsequent pages.

How to Clean Up a Graphic, Example 2: After

• What are the main messages here?

• What is the author trying to emphasize?

[Figure, left: chart titled "Growth of SuperBox Stores (Count of Stores)"; x-axis 1996 to 2006, y-axis 0 to 100; series: SuperBox, BigBox]

[Figure, right: line chart titled "Total Growth of Stores, Over Time"; x-axis 1996 to 2006, y-axis 0 to 140; series: SuperBox, BigBox, Grand Total]

Shown are some simplified and cleaned up versions of the previous slide’s graphic. These show two options for modifying the graphic, depending on the main point the presenter is trying to make.

The chart on the left of the slide uses a strong emphasis color (dark blue) for the SuperBox stores, to support the chart title about the Growth of SuperBox Stores. If the presenter wanted instead to talk about the total growth of BigBox stores, a line chart (shown on the right) showing the trends over time would be a better choice. In both cases, we have removed the noise and distractions within the chart, de-emphasized data we do not wish to speak to, and made prominent the data that reinforces our key point as stated in the chart’s title.

A Quick Word About Using 3D Charts: Avoid Them!

Chart A: 2-Dimensional

[Figure: chart titled "Growth of SuperBox Stores (Count of Stores)"; x-axis 1996 to 2006, y-axis 0 to 100; series: SuperBox, BigBox]

Chart B: 3-Dimensional

[Figure: the same chart, "Growth of SuperBox Stores (Count of Stores)", drawn in 3D; y-axis 0 to 100; series: SuperBox, BigBox]

2-Dimensional Charts

• Simple

• Easy to understand

• Focus on the data, not the graphics

3-Dimensional Charts

• Difficult to gauge actual data

• Scaling becomes deceptive

• Does not make graphic fancier, just harder to understand

Shown is a side-by-side comparison of two charts. As mentioned, 3-dimensional charts often distort scales and axes, and impede viewer cognition. The charts on the left and right portray the same data, although when looking at Chart B it is more difficult to judge the actual height of the bars. In addition, the shadowing and shape of the chart cause most viewers to spend time looking at the perspective of the chart, rather than the height of the bars, which is the key message and purpose of this visual.

Key Points with Data Visualizations

• Remove distractions
  - Minimize “chart junk”
  - Data-Ink Ratio
• Choose the simplest, clearest visual for the situation
  - Strive to illustrate your points
  - Charts should serve to reinforce your key points
  - Charts vs. Data Art
• Use color deliberately
  - Emphasis Colors vs. Standard Colors
  - In most cases, less is more
  - Focus on the contrast
• Context
  - Consistent scales, labels, axes
  - Using logs vs. raw values to show differences

The points listed summarize many of the key ideas from the preceding examples. Following these few ideas (minimizing distractions in slides and visualizations, communicating clearly and simply, using color in a deliberate way and taking time to provide context) will address most of the common problems in charts and slides. These few guidelines will enable you to have crisp, clear visuals that support your story, without needing to become a data artist.

Related to the idea of removing chart junk is being cognizant of the data-ink ratio. Data-ink refers to the portion of a graphic that is used to portray the data itself, while non-data ink is used for labels, edges, colors, and other decoration. The ratio = (data-ink)/(total ink used to print the graphic). In other words, the greater the ratio of data-ink in your visual, the more data rich it is and the fewer distractions it has. For more information and further examples, see http://www.infovis-wiki.net/index.php/Data-Ink_Ratio.
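The ratio can be written as a one-line calculation. The sketch below is illustrative only: the function name `data_ink_ratio` is hypothetical, and the "ink" amounts are made-up numbers (for example, pixel counts), not measurements of any chart in this lesson.

```python
def data_ink_ratio(data_ink, non_data_ink):
    """Return data-ink / total ink, where total = data-ink + non-data ink."""
    total_ink = data_ink + non_data_ink
    if total_ink <= 0:
        raise ValueError("total ink must be positive")
    return data_ink / total_ink


# Hypothetical ink amounts for a chart before and after clean-up.
before = data_ink_ratio(data_ink=300, non_data_ink=700)  # grid lines, borders, heavy fills
after = data_ink_ratio(data_ink=300, non_data_ink=200)   # same data, less decoration

print(f"before clean-up: {before:.2f}")
print(f"after clean-up:  {after:.2f}")
```

The data is unchanged between the two calls; only the decoration shrinks, so the ratio rises.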

In most cases, the best way to make your point is with the simplest, clearest visual that illustrates it. To do this, remove distractions and avoid unnecessary embellishment. Keep in mind that you are trying to find the best, simplest method for transmitting your message. This differs from data art, which is about how data can be represented in a creative way and can be an end in and of itself.

Context is critical to orient the viewer of a visualization, as people have immediate reactions to imagery on a pre-cognitive level. To this end, use color thoughtfully, and orient the viewer with scales, legends, and axes. Using logarithms to normalize data is a useful way to fit data into a visualization, as we showed earlier in this lesson in the chart for analysts, which showed account values distributed lognormally. This can be a very useful technique when you want to show a wide range of data, such as a broad range of income values or population sizes.
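To see why logarithms help with a wide range of values, compare where each value would fall along an axis in raw and in log form. This is a minimal sketch with hypothetical income values; the helper names `log10_values` and `relative_positions` are illustrative.

```python
import math


def log10_values(values):
    """Take base-10 logs; values must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log scale requires positive values")
    return [math.log10(v) for v in values]


def relative_positions(values):
    """Map values to positions between 0 and 1 along an axis."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical incomes spanning four orders of magnitude.
incomes = [1_000, 10_000, 100_000, 10_000_000]

raw_pos = relative_positions(incomes)
log_pos = relative_positions(log10_values(incomes))

for income, raw, logp in zip(incomes, raw_pos, log_pos):
    print(f"{income:>10,d}  raw axis: {raw:.4f}  log axis: {logp:.4f}")
```

On the raw axis the three smaller values are squeezed into the bottom 1% of the range, while on the log axis they spread out and each remains distinguishable.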

The Endgame, or Putting it All Together

During this lesson the following topics were covered:

• Survey of data visualization tools.

• Creating different visualizations for sponsors and analysts.

• Developing visuals to support your key points.

• How to clean up a chart or visualization.

• Tips and tricks

Summary

This lesson covered tools and recommendations for creating visuals for sponsors and analysts.

Check Your Knowledge

• What are the four main deliverables for operationalizing a data analytics project?

• Give an example of an appropriate data visualization for an analyst presentation and one for a sponsor.

• Name three considerations for delivering your code or technical documentation.

• What is chart junk? What are some ways to address it in a visualization?

The Endgame, or Putting it All Together

During this module the following topics were covered:

• Three tasks needed to operationalize an analytics project

• Four common deliverables of an analytics lifecycle project that meet the needs of key stakeholders

• A framework for creating final presentations for sponsors and analysts

• Evaluation and improvement of data visualizations

Summary

This module covered these topics.
