Analyzing and Visualization


The Purpose Map in Practice

A simple illustration of the role of the purpose map involves momentarily focusing on a rather grave subject: data about offender executions. In 2013, the State of Texas reached the unenviable milestone of having executed its 500th death-row prisoner since the resumption of capital punishment in 1982. At the time of this landmark I came across a dataset curated by the Texas Department of Criminal Justice and published on its website. This simply structured table of data (Figure 3.9) included striking information about the offenders, their offences and their final statements – a genuinely compelling source of data. Thinking about this subject and the dataset helps to frame the essence of what role this purpose map can play, especially in the tone dimension.

Figure 3.9 Image taken from Texas Department of Criminal Justice Website


Since the milestone of 500 executions in 2013 the number has grown significantly. For the purpose of this illustration, for now we will consider the nature of this data as it was at the moment of this milestone.


Imagine viewing this data from a high vantage point, like in a hot-air balloon. The big picture is that there are 500 prisoners who have been executed. That is the whole. Lowering the viewpoint, as you get a little closer, you might see a breakdown of race, showing 225 offenders were white, 187 black, 86 Hispanic and 2 defined as other. Lower still and you see that 4 offenders originated from Anderson County. Lower again reveals that 112 offenders referred to God in their last statement. Down to the lowest level – the closest vantage point – you see individuals and individual items of data, such as Charles Milton, convicted in Tarrant County, who was aged 34 when executed on 25 June 1985.
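The descent through these vantage points can be thought of as moving through successive levels of aggregation. The following is a minimal Python sketch, using only the figures quoted above; the variable names and the record layout are illustrative assumptions rather than the structure of the published table.

```python
# Counts quoted in the text for the 2013 milestone.
race_breakdown = {"White": 225, "Black": 187, "Hispanic": 86, "Other": 2}

# Highest vantage point: the whole.
total = sum(race_breakdown.values())

# One level lower: each group's share of the whole, as a percentage.
shares = {
    group: round(100 * count / total, 1)
    for group, count in race_breakdown.items()
}

# Lowest vantage point: a single individual (only details given in the text).
record = {
    "name": "Charles Milton",
    "county": "Tarrant",
    "age": 34,
    "date": "1985-06-25",
}

print(total)
print(shares)
print(record["name"], record["age"])
```

The first two values are abstractions (people into numbers); the final record is the closest the data gets to the individual.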

The view of the data has travelled from a figurative perspective to a non-figurative one. The former is an abstraction of the data that effectively suppresses the underlying phenomena being about people and translates – and maybe reduces – it into statistical quantities. People into numbers. The latter perspective concerns a more literal and realistic expression of what the data actually represents.

Going back to the discussion about judging tone, there are several different potential ways of portraying this executed offenders data depending on the purpose that has been defined.

Suppose you worked at the Texas Department of Criminal Justice as a member of staff responsible for conducting and reporting data analysis. You might be asked to analyse the resource implications of all offenders currently on death row, looking at issues around their cost ‘footprint’. In this case you might seek to strip away all the emotive qualities of the data and focus only on its statistical attributes. You would likely aim for a figurative or abstracted representation of the subject, reducing it to fundamental statistical quantities and high-level relationships. Your approach to achieve this would probably fit with the upper end of the tonal dimension, portraying your work with a utilitarian style that facilitates an efficient and precise reading of the data.

A different scenario may now involve your doing some visual work for a campaign group with a pro-capital-punishment stance. The approach might be to demonise the individuals, putting a human face to the offenders and their offences. The motive is to evoke sensation, shock and anger to get people to support this cause. Would a bar chart breakdown of the key statistics accomplish this in tone? Possibly not.

‘Data is a simplification – an abstraction – of the real world. So when you visualize data, you visualize an abstraction of the world.’ Dr Nathan Yau, Statistician and Author of Data Points


Another situation could see you working for a newspaper that had a particularly liberal viewpoint and was looking to publish a graphic to mark this sober milestone of 500 executions. You might avoid using the stern imagery of the offenders’ mug shots and instead focus on some of the human sentiments expressed in their last statements or on case studies of some of the extremely young offenders for whom life was perhaps never going to follow a positive path. To humanise or demonise the individuals involved in this dataset is possible because there is such richness and intimate levels of detail available from the data.

It is worth reinforcing again that a figurative approach (reading) is typically what most of your work will involve and require. Only a small proportion will require a non-figurative (feeling) approach even with emotive subjects. The whole point about introducing you to the alternative perspective of the feeling tone is to prepare you for those occasions when the desired purpose of your work requires more of a higher-level grasp of data values or a deeper connection with subject matter through its data.

To complete this discussion, here are some final points to make about the purpose map to further clarify and frame its scope.

Format: Firstly, it is important to stress that this map does not define format in terms of print, digital or physical. Exploratory visualisations will almost entirely be digital but exhibitory or explanatory projects could be print or digital.

First thoughts not final commitment: Considering the definitions of experience and tone now simply represents the beginning of this kind of design thinking. As the workflow progresses you might change (or need to change) your mind and pursue an alternative course, especially when you get deeper into data work, the nature of which may reveal a better fit with a completely different type of solution. I will state again that these early stages are the first occasion on which you will think about these things, but not the last. The benefit of starting this kind of thinking now is the increased focus it affords: you can begin to eliminate from your concern those types of visualisation that will have no relevance to your context.

‘I have this fear that we aren’t feeling enough.’ Chris Jordan, Visual Artist and Cultural Activist


Collective visual quality: Decisions around tone may not be solely isolated to how data should be represented. There may be a broader sense of overall visual mood or ‘quality’ that you are trying to convey across the presentation design choices as well. As you will see, there are other media assets (photos, videos, illustrations, text) that could go towards achieving a certain tone for the project that does not necessarily directly influence the tone of the data.

Not about a singular location: Some projects will involve just a single chart and this makes it a far more straightforward prospect to inform your definition of its best-fit location on this purpose map. However, there will be other projects that you work on involving multiple chart assets, multiple interactions, different pages and deeper layers. So, when it comes to considering your initial vision through the purpose map dimensions, you may recognise separate definitions for each major element. This will become much clearer as you get deeper into the project – and can actually identify the need for multiple assets.

The mantra proposed by Ben Shneiderman, one of the most esteemed academics in this field – ‘Overview first, details on demand’ – informs the idea of thinking about different layers of readability and depth in your visualisation work accessed through interactivity. Some of the chart types that you will meet in Chapter 6 can only ever hope to deliver a gist of the general magnitude of values (the big, the small and the medium) and not their precise details. A treemap, for example, is never going to facilitate the detailed perceiving of values because it uses rectangular areas to represent data values and our perceptual system is generally quite poor at judging different area scales. Additionally, a treemap often comprises a breakdown of many categorical values within the same chart display, so it is very busy and densely packed. However, if you have the capability to incorporate interactive features that allow the user to enter via this first overview layer and then explore beneath the surface, maybe clicking on a shape to reveal a pop-up with precise value labels, you are opening up additional details.

In effect you have moved your viewer’s readability up the tonal spectrum that began with more of a general feeling of data and then moved towards the reading of data as a result of the interactive operation. Sometimes a ‘gateway’ layer is required for your primary view, to seduce your audience or to provide a big-picture overview (feeling), and then you can let the audience move on to more perceptually precise displays of the data (reading) either through interaction or perhaps by advancing through pages in a report or slide-deck sequence.
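The two layers described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category values are hypothetical, and the function names `overview` and `details_on_demand` are invented for the example rather than taken from any visualisation tool.

```python
# Hypothetical category values; not taken from any real dataset.
values = {"A": 120, "B": 45, "C": 8, "D": 300, "E": 60, "F": 15}


def overview(values):
    """First layer: a rank-based gist (big / medium / small), no exact figures."""
    ranked = sorted(values, key=values.get)  # smallest value first
    third = len(ranked) / 3
    labels = {}
    for position, name in enumerate(ranked):
        if position < third:
            labels[name] = "small"
        elif position < 2 * third:
            labels[name] = "medium"
        else:
            labels[name] = "big"
    return labels


def details_on_demand(values, name):
    """Second layer: the precise value, revealed only when requested."""
    return f"{name}: {values[name]}"


print(overview(values))
print(details_on_demand(values, "D"))
```

The overview deliberately discards precision, much as area-based shapes do perceptually; the second function stands in for the click that reveals a value label.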

In the Better Life Index, shown in Figure 3.10, the opening layer is based around a series of charts that look like flowers. This is attractive, intriguing and offers a nice, single-page, at-a-glance summary. The task of reading the petal sizes with any degree of precision is hard but that is not the intent of this first layer. The purpose is to get a balance between a form that attracts the user and a function that offers a general sense of where the big, medium and small values sit within the data. For those who want to read the values with more precision, they are only a click away (on the flowers) from viewing an alternative display using a bar chart to represent the same values.

Figure 3.10 OECD Better Life Index

Figure 3.11 Losing Ground


Increasingly there is a trend for projects to incorporate both explanatory and exploratory experiences into the same overall project – the term ‘explorable explanations’ has been coined to describe them. A project like ‘Losing Ground’ by ProPublica (Figure 3.11) is an example of this as it moves between telling a story about the disappearing coastline of Louisiana and enabling users to interrogate and adjust their view of the data at various milestone stages in the sequence.

Harnessing Ideas

The discussions so far in this chapter have involved practical reasoning. Before you move on to the immediate next stage of the design process – working with data – it can be valuable to briefly allow yourself the opportunity to harness your instinctive imagination.

Alongside your consideration of the purpose map, the other strand of thinking about ‘vision’ concerns the earliest seeds of any ideas you may have in mind for what this solution might comprise or even look like. These might be mental manifestations of ideas you have formed yourself or influenced or inspired by what you have seen elsewhere.


‘I focus on structural exploration on one side and on the reality and the landscape of opportunities in the other ... I try not to impose any early ideas of what the result will look like because that will emerge from the process. In a nutshell I first activate data curiosity, client curiosity, and then visual imagination in parallel with experimentation.’ Santiago Ortiz, founder and Chief Data Officer at DrumWave, discussing the role – and timing – of forming ideas and mental concepts

There are limits to the value of ideas and also to the role they are allowed to play, as I will mention shortly, but your instincts can offer a unique perspective if you choose to allow them to surface. If you have a naturally analytical approach to visualisation this activity might seem to be the wrong way round: how can legitimate ideas be formed until the data has been explored? I understand that, and it is a step that some readers will choose not to entertain until later in the process. However, do not rule it out: see if liberating your imagination now adds value to your analytical thinking later. There are several aspects to the concept and role of harnessing ideas that I feel are valuable to consider at this primary stage:

Mental visualisation: This concerns the other meaning of visualisation and is about embracing what we instinctively ‘see’ in our mind’s eye when we consider the emerging brief for our task. In Thinking, Fast and Slow, by Daniel Kahneman, the author describes two modes of thought that control our thinking activities. He calls these System 1 and System 2 thinking: the former is responsible for our instinctive, intuitive and metaphorical thoughts; the latter is much more ponderous, by contrast, much slower, and requiring of more mental effort when being called upon. System 1 thinking is what you want to harness right now: what are the mental impressions that form quickly and automatically in your mind when you first think about the challenge you’re facing? You cannot switch off System 1 thoughts. You will not be able to stop mental images formulating about what your mind’s eye sees when thinking about this problem instinctively. So, rather than stifling your natural mental habits, this earliest stage of the workflow process presents the best possible opportunity to allow yourself space to begin imagining. What colours do you see? Sometimes instinctive ideas are reflections of our culture or society, especially the connotations of colour usage. What shapes and patterns strike you as being semantically aligned with the subject? This can be useful not just to inspire but also possibly to obtain a glimpse into the similarly impulsive way the minds of your audience might connect with a subject when consuming the solution.

For example, Figure 3.12 shows the size of production for different grape varieties across the wine industry. It uses a bubble chart to create the impression of a bunch of grapes. You can clearly see how this concept might have been formed in early sketches before the data even arrived, based on the mental visualisation of what the shape of a bunch of grapes looks like. It is consistent with the subject and offers an immediate metaphor that means any viewer looking at the work will immediately spot the connection between form and subject.

Figure 3.12 Grape expectations


Keywords: What terms of language come to mind when thinking about the subject or the phenomena of your data? Figure 3.13 shows some notes I made in capturing the instinctive keywords and colours that came to mind when I was forming early thoughts and ideas about a project to do with psychotherapy treatment in the Arctic.

The words reflected the type of language I felt would be important to frame my design thinking, establishing a reference that could inform the tone of voice of my work. The colours were somewhat arbitrary and in the end I did not actually use them all, but they were indicative of the tones I was seeking. I did, however, see through my intention to avoid the blacks and blues (as they would carry unwelcome and clichéd connotations in this subject’s context).

Figure 3.13 Example of Keywords and Colour Swatch Ideas


Sketching: As well as taking notes, sketching ideas is of great value to us here. I mentioned earlier that this is not about being a gifted artist but recognising the freedom and speed when extracting ideas from your mind onto paper. This is particularly helpful if you are working with collaborators and want a low-fidelity sketch for discussing plans, as well as in early discussions with stakeholders to understand better each other’s take on the brief. For some people, the most fluent and efficient way to ‘sketch’ is through their software application of choice rather than on paper.

‘I draw to freely explore possibilities. I draw to visually understand what I am thinking. I draw to evaluate my ideas and intuitions by seeing them coming to life on paper. I draw to help my mind think without limitations, without boundaries. The act of drawing, and the very fact we choose to stop and draw, demands focus and attention. I use drawing as my primary expression, as a sort of functional tool for capturing and exploring thoughts.’ Giorgia Lupi, Co-founder and Design Director at Accurat


Regardless of whether your tool is the pen or the computer, just sketch your ideas with whatever is the most efficient and effective option given your time and confidence (see Figure 3.14). You will likely refine your sketches later on and, indeed, eventually you will move your attention completely away from pen and paper and onto the tools you are using to create the final work.

Figure 3.14 Example of a Concept Sketch, by Giorgia Lupi

Research and inspiration: It is important to be sufficiently open to influence and inspiration from the world around you. Exposing your senses to different sources of reference both within and outside of visualisation can only help to broaden the range of solutions you might be able to conceive. Research the techniques that are being used around the visualisation field, look through books and see how others might have tackled similar subjects or curiosities (e.g. how they have shown changes over time on a map).

Beyond visualisation consider any source of imagery that inspires you: colours, patterns, shapes, and metaphors from everyday life whose aesthetic qualities you just like. In addition to your notebook and sketch pad, start a scrapbook or project mood board that compiles the sources of inspiration you come across and helps you form ideas about the style, tone or essence of your project. They might not have immediate value for the current project you are working on but may materialise as useful for future work.

‘Recently taking up drawing has helped me better articulate the images I see in my mind, otherwise I still follow up on all different types of design and art outside information design/data visualisation. I try to look at things outside my field as often as I can to keep my mind fresh as opposed to only looking at projects from my field for inspiration.’ Stefanie Posavec, Information Designer

‘Look at how other designers solve visual problems (but don’t copy the look of their solutions). Look at art to see how great painters use space, and organise the elements of their pictures. Look back at the history of infographics. It’s all been done before, and usually by hand! Draw something with a pencil (or pen ... but NOT a computer!) Sketch often: The cat asleep. The view from the bus. The bus. Personally, I listen to music – mostly jazz – a lot.’ Nigel Holmes, Explanation Graphic Designer, on inspirations that feed his approach

‘It is easy to immerse yourself in a certain idea, but I think it is important to step back regularly and recognise that other people have different ways of interpreting things. I am very fortunate to work with people whom I greatly admire and who also see things from a different perspective. Their feedback is invaluable in the process.’ Jane Pong, Data Visualisation Designer

Limitation of your ideas: There are important limitations to acknowledge around the role of ideas. Influence and inspiration are healthy: the desire to emulate what others have done is understandable. Plagiarism, copying and stealing uncredited ideas are wrong. There are ambiguities in any creative discipline about the boundaries between influence and plagiarism, and the worlds of visualisation and infographic design are not spared that challenge. Being influenced by the research you do and the great work you see around the field is not stealing, but if you do incorporate explicit ideas influenced by others in your work, at the very least you should do the noble thing and credit the authors, or even better seek them out and ask them to grant you their approval. You do not have to credit William Playfair every time you use the bar chart, but there are certain unique visual devices that will be unquestionably deserving of attribution.

Secondly, data is your raw material, your ideas are not. As you will see later, it is vital that you leave the main influence for your thinking to emerge from the type, size and meaning of your data. It may be that your ideas are ultimately incompatible with these properties of the data, in which case you will need to set these aside, and perhaps form new ones. Eventually you will need to evolve from ideas and sketched concepts to starting to develop a solution in your tool of choice. These early ideas and sparks of creativity are vital and they should be embraced, but do not be precious or stubborn: always maintain an open mind and recognise that they have a limited role. Try to ignore the voices in your head after a certain period!

Limitation of others’ ideas: Finally, there is the diplomatic challenge of being faced with the prospect of taking on board other people’s ideas. One of the greatest anxieties I face comes from working with stakeholders who are unequivocally and emphatically clear about what they think a solution should look like. Often your involvement in a project may arrive after these ideas have already been formed and have become the basis of the brief issued by the stakeholders to you (‘Can you make this, please?’). This is where your tactful but assured communicator’s skill set comes to the fore. The ideas presented may be reasonable and well intended but it is your responsibility to lead on the creation process and guide it away from an early concept that simply may not work out. You can take these ideas on board but, as with the limitations of your own ideas, there will be other factors with a greater influence – the nature of the data, the type of curiosities you are pursuing, the essence of the subject matter and the nature of the audience, among many other things. These will be the factors that ultimately dictate whether any early vision of potential ideas ends up being of value.

Summary: Formulating Your Brief

Establishing Your Project’s Context

Defining Your Origin Curiosity

Why are we doing it: what type of curiosity has motivated the decision/desire to undertake this visualisation project?

Personal intrigue: ‘I wonder what ...’
Stakeholder intrigue: ‘He/she needs to know ...’
Audience intrigue: ‘They need to know ...’
Anticipated intrigue: ‘They might be interested in knowing ...’
Potential intrigue: ‘There should be something interesting ...’

Circumstances

The key factors that will impact on your critical thinking and shape your ambitions:

People: stakeholders, audience.
Constraints: pressures, rules.
Consumption: frequency, setting.
Deliverables: quantity, format.
Resources: skills, technology.

Defining Your Purpose

The ‘so what?’: what are we trying to accomplish with this visualisation? What is a successful ‘outcome’?

Establishing Your Project’s Vision

‘Purpose Map’

Plotting your expectation of what will be the best-fit type of solution to facilitate the desired purpose:

What kind of experience? Explanatory, exhibitory or exploratory?
What tone of voice will it offer? The efficiency and perceptibility of reading data vs the high-level, affective nature of feeling data?

Harnessing Ideas

What mental images, ideas and keywords instinctively come to mind when thinking about the subject matter of this challenge?
What influence and inspiration can you source from elsewhere that might start to shape your thinking?

Tips and Tactics

Do not get hung up if you are struggling with some circumstantial factors. Certain things may change in definition, some undefined things will emerge, some defined things will need to be reconsidered, some things are just always open.

Notes are so important to keep about any thoughts you have had that express the nature of your curiosity, articulation of purpose, any assumptions, things you know and do not know, where you might need to get data from, who are the experts, questions, things to do, issues/problems, wish lists ...

Keep a ‘scrapbook’ (digital bookmarks, print clippings) of anything and everything that inspires and influences you – not just data visualisations. Log your ideas and inspire yourself.

This stage is about ambition management/skills – it is to your benefit that you treat it with the thoroughness it needs. The negative impact of any corners being cut here will be amplified later on.


4 Working With Data

In Chapter 3 the workflow process was initiated by exploring the defining matters around context and vision. The discussion about curiosity, framing not just the subject matter of interest but also a specific enquiry that you are seeking an answer to, in particular leads your thinking towards this second stage of the process: working with data.

In this chapter I will start by covering some of the most salient aspects of data and statistical literacy. This section will be helpful for those readers without any – or at least with no extensive – prior data experience. For those who have more experience and confidence with this topic, maybe through their previous studies, it might merely offer a reminder of some of the things you will need to focus on when working with data on a visualisation project.

There is a lot of hard work that goes into the activities encapsulated by ‘working with data’. I have broken these down into four different groups of action, each creating substantial demands on your time:

Data acquisition: Gathering the raw material.
Data examination: Identifying physical properties and meaning.
Data transformation: Enhancing your data through modification and consolidation.
Data exploration: Using exploratory analysis and research techniques to learn.

You will find that there are overlapping concerns between this chapter and the nature of Chapter 5, where you will establish your editorial thinking. The present chapter generally focuses more on the mechanics of familiarisation with the characteristics and qualities of your data; the next chapter will build on this to shape what you will actually do with it.

As you might expect, the activities covered in this chapter are associated with the assistance of relevant tools and technology. However, the focus for the book will remain concentrated on identifying which tasks you have to undertake and look less at exactly how you will undertake these. There will be tool-specific references in the curated collection of resources that are published in the digital companion.

4.1 Data Literacy: Love, Fear and Loathing

I frequently come across people in the field who declare their love for data. I don’t love data. For me it would be like claiming ‘I love food’ when, realistically, that would be misleading. I like sprouts but hate carrots. And don’t get me started on mushrooms.

At the very start of the book, I mentioned that data might occasionally prove to be a villain in your quest for developing confidence with data visualisation. If data were an animal it would almost certainly be a cat: it has a capacity to earn and merit love but it demands a lot of attention and always seems to be conspiring against you.

I love data that gives me something interesting to do analysis-wise and then, subsequently, also visually. Sometimes that just does not happen.

I love data that is neatly structured, clean and complete. This rarely exists. Location data will have inconsistent place-name spellings, there will be dates that have a mixture of US and UK formats, and aggregated data that does not let me get to the underlying components.
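Two of the problems mentioned here, mixed date formats and inconsistent place-name spellings, are easy to demonstrate. The sketch below is in Python using only the standard library; the date string and the place-name spellings are hypothetical examples, and `possible_formats` is a helper invented for this illustration.

```python
from datetime import datetime

raw_date = "03/04/2013"  # hypothetical value with no format stated

as_uk = datetime.strptime(raw_date, "%d/%m/%Y").date()  # read as 3 April 2013
as_us = datetime.strptime(raw_date, "%m/%d/%Y").date()  # read as 4 March 2013


def possible_formats(text):
    """Return which of the two conventions can parse the text at all."""
    matches = []
    for label, fmt in (("UK", "%d/%m/%Y"), ("US", "%m/%d/%Y")):
        try:
            datetime.strptime(text, fmt)
            matches.append(label)
        except ValueError:
            pass
    return matches


# Hypothetical spellings that all refer to the same place.
places = ["New York", "new york ", "NEW YORK", "New  York"]

# Collapse stray whitespace and unify capitalisation.
cleaned = {" ".join(p.split()).title() for p in places}

print(as_uk, as_us)
print(possible_formats("03/04/2013"), possible_formats("25/06/1985"))
print(cleaned)
```

A date such as 25/06/1985 can only be read one way, but a date whose day is 12 or lower parses under both conventions, which is why a column mixing the two formats needs care rather than a single automatic conversion.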

You don’t need to love data but, equally, you shouldn’t fear data. You should simply respect it by appreciating that it will potentially need lots of care and attention and a shift in your thinking about its role in the creative process. Just look to develop a rapport with it, embracing its role as the absolutely critical raw material of this process, and learn how to nurture its potential.

For some of you reading this book, you might have interest in data but possibly not much knowledge of the specific activities involving data as you work on a visualisation design solution. An assumed prerequisite for anyone working in data visualisation is an appreciation of data and statistical literacy. However, this is not always the case. One of the biggest causes of failure in data visualisations – especially in relation to the principle I introduced about ‘trustworthy design’ – comes from a poor understanding of these numerate literacies. This can be overcome, though.


‘When I first started learning about visualisation, I naively assumed that datasets arrived at your doorstep ready to roll. Begrudgingly I accepted that before you can plot or graph anything, you have to find the data, understand it, evaluate it, clean it, and perhaps restructure it.’ Marcia Gray, Graphic Designer

I discussed in the Introduction the different entry points from which people doing data visualisation work come. Typically – but absolutely not universally – those who join from the more creative backgrounds of graphic design and development might not be expected to have developed the same level of data and statistical knowledge as somebody from the more numerate disciplines. If you are part of this creative cohort and can identify with this generalisation, then this chapter will ease you through the learning process (and in doing so hopefully dispel any myth that it is especially complicated).

Conversely, many others may think they do not know enough about data but in reality they already do ‘get’ it – they just need to learn more about its role in visualisation and possibly realign their understanding of some of the terminology. Therefore, before delving further into this chapter’s tasks, there are a few ‘defining’ matters I need to address to cover the basics in both data and statistical literacy.

Data Assets and Tabulation Types

Firstly, let’s consider some of the fundamentals about what a dataset is as well as what shape and form it comes in.

When working on a visualisation I generally find there are two main categories of data ‘assets’: data that exists in tables, known as datasets; and data that exists as isolated values.

Tabulated datasets are what we are mainly interested in at this point. Data as isolated values refers to data that exists as individual facts and statistical figures. These do not necessarily belong in, nor are they normally collected in, a table. They are just potentially useful values that are dispersed around the Web or across reports: individual facts or figures that you might come across during your data gathering or research stages. Later on in your work you might use these to inform calculations (e.g. applying a currency conversion) or to incorporate a fact into a title or caption (e.g. 78% of staff participated in the survey), but they are not your main focus for now.

For the purpose of this book I describe this type of data as being raw because it has not yet been statistically or mathematically manipulated and it has not been modified in any other way from its original state.

Tabulated data is unquestionably the most common form of data asset that you will work with, but it too can exist in slightly different shapes and sizes. A primary difference lies between what can be termed normalised datasets (Figure 4.1) and cross-tabulated datasets (Figure 4.2).

A normalised dataset might loosely be described as looking like lists of data values. In spreadsheet parlance, you would see this as a series of columns and rows of data, while in database parlance it is the arrangement of fields and records. This form of tabulated data is generally the most detailed form of data available for you to work with. The table in Figure 4.1 is an example of normalised data where the columns of variables provide different descriptive values for each movie (or record) held in the table.

Figure 4.1 Example of a Normalised Dataset

Cross-tabulated data is presented in a reconfigured form where, instead of displaying raw data values, the cells of the table contain the results of statistical operations (like summed totals, maximums, averages). These values are aggregated calculations formed from the relationship between two variables held in the normalised form of the data. In Figure 4.2, you will see the cross-tabulated result of the normalised table of movie data, now showing a statistical summary for each movie category. The statistic under ‘Max Critic Rating’ is formed from an aggregating calculation based on the ‘Critic Rating’ and ‘Category’ variables seen in Figure 4.1.

Figure 4.2 Example of a Cross-tabulated Dataset

Typically, if you receive data in an already cross-tabulated form, you do not have access to the original data. This means you will not be able to ‘reverse-engineer’ it back into its raw form, which, in turn, means you have reduced the scope of your potential analysis. In contrast, normalised data gives you complete freedom to explore, manipulate and aggregate across multiple dimensions. You may choose to convert the data into ‘cross-tabulated’ form but that is merely an option that comes with the luxury of having access to the detailed form of your data. In summary, it is always preferable, where possible, to work with normalised data.
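The relationship between the two forms can be sketched in a few lines of Python. This is only an illustration: the movie records and field names below are invented and are not the values shown in Figure 4.1.

```python
from collections import defaultdict

# Hypothetical normalised records: one row per movie (illustrative values only).
movies = [
    {"title": "Movie A", "category": "Drama", "critic_rating": 7.5},
    {"title": "Movie B", "category": "Drama", "critic_rating": 8.5},
    {"title": "Movie C", "category": "Comedy", "critic_rating": 6.0},
    {"title": "Movie D", "category": "Comedy", "critic_rating": 5.5},
    {"title": "Movie E", "category": "Action", "critic_rating": 7.0},
]

# Group the raw ratings by category.
ratings_by_category = defaultdict(list)
for movie in movies:
    ratings_by_category[movie["category"]].append(movie["critic_rating"])

# Cross-tabulated summary: one row per category holding aggregated statistics.
cross_tab = {
    category: {
        "movie_count": len(ratings),
        "max_critic_rating": max(ratings),
        "avg_critic_rating": sum(ratings) / len(ratings),
    }
    for category, ratings in ratings_by_category.items()
}

for category, summary in cross_tab.items():
    print(category, summary)
```

The aggregation only runs in one direction: given `cross_tab` alone, the individual movie rows cannot be recovered.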

Data Types

One of the key parts of the design process concerns understanding the different types of data (sometimes known as levels of data or scales of measurement). Defining the types of data will have a huge influence on so many aspects of this workflow, such as determining:

the type of exploratory data analysis you can undertake; the editorial thinking you establish; the specific chart types you might use; the colour choices and layout decisions around composition.

In the simplest sense, data types are distinguished by being either qualitative or quantitative in nature. Beneath this distinction there are several further separations that need to be understood. The most useful taxonomy I have found to describe these different types of data is based on an approach devised by the psychologist researcher Stanley Stevens. He developed the acronym NOIR as a mnemonic device to cover the different types of data you may come to work with, particularly in social research: Nominal, Ordinal, Interval, and Ratio. I have extended this, adding onto the front a ‘T’ – for Textual – which, admittedly, somewhat undermines the grace of the original acronym but better reflects the experiences of handling data today. It is important to describe, define and compare these different types of data.

Textual (Qualitative)

Textual data is qualitative data and generally exists as unstructured streams of words. Examples of textual data might include:

‘Any other comments?’ data submitted in a survey. Descriptive details of a weather forecast for a given city. The full title of an academic research project. The description of a product on Amazon. The URL of an image of Usain Bolt’s victory in the 100m at the 2012 Olympics.

Figure 4.3 Graphic Language: The Curse of the CEO


In its native form, textual data is likely to offer rich potential but it can prove quite demanding to unlock this. To work with textual data in an analysis and visualisation context will generally require certain natural language processing techniques to derive or extract classifications, sentiments, quantitative properties and relational characteristics.


An example of how you can use textual data is seen in the graphic of CEO swear word usage shown in Figure 4.3. This analysis provides a breakdown of the profanities used by CEOs from a review of recorded conference calls over a period of 10 years. This work shows the two ways of utilising textual data in visualisation. Firstly, you can derive categorical classifications and quantitative measurements to count the use of certain words compared to others and track their usage over time. Secondly, the original form of the textual data can be of direct value for annotation purposes, without the need for any analytical treatment, to include as captions.

Working with textual data will always involve a judgement of reward vs effort: how much effort will I need to expend in order to extract usable, valuable content from the text? There is an increasing array of tools and algorithmic techniques to help with this transformational approach but, whether you conduct it manually or with some degree of automation, it can be quite a significant undertaking. However, the value of the insights you are able to extract may entirely justify the commitment. As ever, your judgement of the aims of your work, the nature of your subject and the interests of your audience will influence your decision.

Nominal (Qualitative)

Nominal data is the next form of qualitative data in the list of distinct data types. This type of data exists in categorical form, offering a means of distinguishing, labelling and organising values. Examples of nominal data might include:

The ‘gender’ selected by a survey participant. The regional identifier (location name) shown in a weather forecast. The university department of an academic member of staff. The language of a book on Amazon. An athletic event at the Olympics.

Often a dataset will hold multiple nominal variables, maybe offering different organising and naming perspectives, for example the gender, eye colour and hair colour of a class of school kids.

Additionally, there might be a hierarchical relationship existing between two or more nominal variables, representing major and sub-categorical values: for example, a major category holding details of ‘Country’ and a sub-category holding ‘Airport’; or a major category holding details of ‘Industry’ and a sub-category holding details of ‘Company Names’. Recognising this type of relationship will become important when considering the options for which angles of analysis you might decide to focus on and how you may portray them visually using certain chart types.

Nominal data does not necessarily mean text-based data; nominal values can be numeric. For example, a student ID number is a categorical device used to uniquely identify each student. The shirt number of a footballer is a way of helping teammates, spectators and officials to recognise each player. It is important to be aware of occasions when any categorical values are shown as numbers in your data, especially in order to understand that these cannot have (meaningful) arithmetic operations applied to them. You might find logic statements like TRUE or FALSE stated as a 1 and a 0, or data captured about gender may exist as a 1 (male), 2 (female) and 3 (other), but these numeric values should not be considered quantitative values – adding ‘1’ to ‘2’ does not equal ‘3’ (other) for gender.
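A small Python sketch makes the point concrete. The list of responses is invented for illustration; the code-to-label mapping follows the gender example in the text.

```python
from collections import Counter

# Hypothetical survey responses stored as numeric codes.
gender_labels = {1: "male", 2: "female", 3: "other"}
responses = [1, 2, 2, 3, 1, 2]

# Arithmetic runs without error, but the result has no meaning for nominal codes.
meaningless_mean = sum(responses) / len(responses)

# Counting how often each category occurs is a meaningful operation.
counts = Counter(gender_labels[code] for code in responses)

print(meaningless_mean)
print(counts)
```

Nothing in the software stops you from averaging the codes, which is why it is worth identifying numeric categories before any analysis begins.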

Ordinal (Qualitative)

Ordinal data is still categorical and qualitative in nature but, instead of there being an arbitrary relationship between the categorical values, there are now characteristics of order. Examples of ordinal data might include:

The response to a survey question: based on a scale of 1 (unhappy) to 5 (very happy). The general weather forecast: expressed as Very Hot, Hot, Mild, Cold, Freezing. The academic rank of a member of staff. The delivery options for an Amazon order: Express, Next Day, Super Saver. The medal category for an athletic event: Gold, Silver, Bronze.

Whereas nominal data is a categorical device to help distinguish values, ordinal data is also a means of classifying values, usually in some kind of ranking. The hierarchical order of some ordinal values goes through a single ascending/descending rank from high or good values to low or bad values. Other ordinal values have a natural ‘pivot’ where the direction changes around a recognisable mid-point, such as the happiness scale which might pivot about ‘no feeling’ or weather forecast data that pivots about ‘Mild’. Awareness of these different approaches to ‘order’ will become relevant when you reach the design stages involving the classifying of data through colour scales.
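As a minimal sketch of what ‘order’ means in practice, the Python below ranks the weather categories from the example above. The list of observations is invented, and the signed offset is just one possible way to express a pivot about ‘Mild’.

```python
# Ordered categories from the weather example; list position encodes the rank.
forecast_order = ["Freezing", "Cold", "Mild", "Hot", "Very Hot"]
rank = {label: position for position, label in enumerate(forecast_order)}

# Hypothetical observations.
observations = ["Hot", "Mild", "Freezing", "Very Hot", "Cold", "Mild"]

# Alphabetical sorting ignores the inherent order of the categories.
alphabetical = sorted(observations)

# Sorting by rank respects it.
by_rank = sorted(observations, key=rank.__getitem__)

# For a scale that pivots about 'Mild', a signed offset from the mid-point
# indicates direction (negative = colder, positive = hotter).
pivot = rank["Mild"]
offsets = {label: rank[label] - pivot for label in forecast_order}

print(alphabetical)
print(by_rank)
print(offsets)
```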

Interval (Quantitative)

Interval data is the less common form of quantitative data, but it is still important to be aware of and to understand its unique characteristics. An interval variable is a quantitative and numeric measurement defined by difference on a scale but not by relative scale. This means the difference between two values is meaningful but an arithmetic operation such as multiplication is not.

The most common example is the measure for temperature in a weather forecast, presented in units of Celsius. The absolute difference between 15°C and 20°C is the same difference as between 5°C and 10°C. However, the relative difference between 5°C and 10°C is not the same as that between 10°C and 20°C, even though in both cases the number doubles (an increase of 100%). This is because a zero value is arbitrary and often means very little or indeed is impossible. A temperature reading of 0°C does not mean there is no temperature; it is simply a point on a quantitative scale for measuring relative temperature. You cannot have a shoe size or Body Mass Index of zero.
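The behaviour of an interval scale can be checked numerically. The sketch below converts the Celsius readings to kelvin (which does have a true zero) and compares differences and ratios before and after the conversion; the variable names are illustrative.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius to kelvin (0 K is absolute zero)."""
    return celsius + 273.15

low_c, high_c = 5.0, 10.0

# Differences are meaningful on an interval scale and survive a change of unit.
difference_c = high_c - low_c
difference_k = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)

# Ratios are not: the apparent doubling in Celsius disappears once the
# zero point moves.
ratio_c = high_c / low_c
ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

print(difference_c, round(difference_k, 2))
print(ratio_c, round(ratio_k, 3))
```

The difference is five degrees in both units, whereas the ratio of 2 in Celsius shrinks to roughly 1.018 in kelvin.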

Ratio (Quantitative)

Ratio data is the most common quantitative variable you are likely to come across. It comprises numeric measurements that have properties of difference and scale. Examples of ratio data might include:

The age of a survey participant in years. The forecasted amount of rainfall in millimetres. The estimated budget for a research grant proposal in GBP (£). The number of sales of a book on Amazon. The distance of the winning long jump at the 2012 Olympics in metres.

Unlike interval data, for ratio data variables zero means something. The absolute difference in age between a 10 and 20 year old is the same as the difference between a 40 and 50 year old. The relative difference between a 10 and a 20 year old is the same as the difference between a 40 and an 80 year old (‘twice as old’).

Whereas most of the quantitative measurements you will deal with are based on a linear scale, there are exceptions. Variables about the strength of sound (decibels) and magnitude of earthquakes (Richter) are actually based on a logarithmic scale. An earthquake with a magnitude of 4.0 on the Richter scale is 1000 times stronger based on the amount of energy released than an earthquake of magnitude 2.0. Some consider these as types of data that are different from ratio variables. Most still define them as ratio variables but separate them as non-linear scaled variables.
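The earthquake figure can be reproduced with a short calculation. This sketch assumes the commonly used approximation that released energy grows by a factor of 10^1.5 (about 31.6) per whole unit of magnitude, while measured wave amplitude grows by a factor of 10; the function names are illustrative.

```python
def energy_ratio(magnitude_a: float, magnitude_b: float) -> float:
    """How many times more energy event A releases than event B.

    Assumes energy scales with 10 ** (1.5 * magnitude).
    """
    return 10 ** (1.5 * (magnitude_a - magnitude_b))

def amplitude_ratio(magnitude_a: float, magnitude_b: float) -> float:
    """How many times larger the measured amplitude of A is than that of B."""
    return 10 ** (magnitude_a - magnitude_b)

print(energy_ratio(4.0, 2.0))     # energy comparison from the text
print(amplitude_ratio(4.0, 2.0))  # amplitude comparison for the same pair
```

A two-unit gap in magnitude therefore corresponds to about 1000 times the energy but 100 times the amplitude, so it matters which quantity a comparison refers to.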

Temporal Data

Time-based data is worth mentioning separately because it can be a frustrating type of data to deal with, especially in attempting to define its place within the TNOIR classification. The reason for this is that different components of time can be positioned against almost all data types, depending simply on what form your time data takes:

Textual: ‘Four o’clock in the afternoon on Monday, 12 March 2016’
Ordinal: ‘PM’, ‘Afternoon’, ‘March’, ‘Q1’
Interval: ‘12’, ‘12/03/2016’, ‘2016’
Ratio: ‘16:00’

Note that time-based data is distinct from duration data, which, while often formatted in structures such as hh:mm:ss, should be seen as a ratio measure. To work with duration data it is often useful to transform it into single units of time, such as total seconds or minutes.
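Transforming durations into a single unit is straightforward. The following is a minimal sketch that assumes every value is a well-formed hh:mm:ss string; the sample durations are invented.

```python
def duration_to_seconds(duration: str) -> int:
    """Convert an 'hh:mm:ss' duration string into total seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Hypothetical durations.
durations = ["00:09:58", "01:30:00", "02:05:30"]
total_seconds = [duration_to_seconds(d) for d in durations]
total_minutes = [seconds / 60 for seconds in total_seconds]

print(total_seconds)
```

Once expressed as plain numbers, durations can be summed, averaged and compared like any other ratio variable.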

If temperature values were measured in kelvin, where there is an absolute zero, this would be considered a ratio scale, not an interval one.


Discrete vs Continuous

Another important distinction to make about your data, and something that cuts across the TNOIR classification, is whether the data is discrete or continuous. This distinction is influential in how you might analyse it statistically and visually.

The relatively simple explanation is that discrete data is associated with all classifying variables that have no ‘in-between’ state. This applies to all qualitative data types and any quantitative values for which only a whole is possible. Examples might be:

Heads or tails for a coin toss. Days of the week. The size of shoes. Numbers of seats in a theatre.

In contrast, continuous variables can hold the value of an in-between state and, in theory, could take on any value between the natural upper and lower limits if it was possible to take measurements in fine degrees of detail, such as:

Height and weight. Temperature. Time.

One of the classifications that is hard to nail down involves data that could, on the TNOIR scale, arguably fall under both ordinal and ratio definitions based on its usage. This makes it hard to determine if it should be considered discrete or continuous. An example would be the star system used for rating a movie or the happiness rating. When a star rating value is originally captured, the likelihood is that the input data was discrete in nature. However, for analysis purposes, data based on star ratings could reasonably be treated either as discrete classifications or, feasibly, as continuous numeric values. For both star ratings and happiness ratings, decimal averages could be calculated as a way of formulating an average score. (The median and mode would still be discrete.) The suitability of this approach will depend on whether the absolute difference between classifying values can be considered equal.
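The difference between the two treatments shows up as soon as summary statistics are calculated. The ratings below are invented, and an odd number of values is used so that the median falls on an actual rating.

```python
from statistics import mean, median, mode

# Hypothetical star ratings captured as whole numbers from 1 to 5.
star_ratings = [5, 4, 4, 3, 5, 4, 2, 4, 4]

average_rating = mean(star_ratings)     # treats the stars as continuous values
middle_rating = median(star_ratings)    # stays on the original discrete scale
most_common_rating = mode(star_ratings)

print(round(average_rating, 2), middle_rating, most_common_rating)
```

The mean comes out at about 3.89, a value no reviewer could have entered, while the median and mode are both 4. Reporting the mean implicitly assumes that the step from 2 to 3 stars is the same size as the step from 4 to 5.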


4.2 Statistical Literacy

If the fear of data is misplaced, I can sympathise with anybody’s trepidation towards statistics. For many, statistics can feel complicated to understand and too difficult a prospect to master. Even for those relatively comfortable with stats, it is unquestionably a discipline that can easily become rusty without practice, which can also undermine your confidence. Furthermore, the fear of making mistakes with delicate and rule-based statistical calculations pushes confidence levels lower than they need to be.

The problem is that you cannot avoid the need to use some statistical techniques if you are going to work with data. It is therefore important to better understand statistics and its role in visualisation, as you must do with data. Perhaps you can make the problem more surmountable by packaging the whole of statistics into smaller, manageable elements that will dispel the perception of overwhelming complexity.

I do believe that it is possible to overstate the range and level of statistical techniques most people will need to employ on most of their visualisation tasks. The caveats are important as I know there will be people with visualisation experience who are exposed to a tremendous amount of statistical thinking in their work, but it is a relevant point.

It all depends, of course. From my experience, however, the majority of data visualisation challenges will generally involve relatively straightforward univariate and multivariate statistical techniques. Univariate techniques help you to understand the shape, size and range of quantitative values. Multivariate techniques help you to explore the possible relationships between different combinations of variables and variable types. I will describe some of the most relevant statistical operations associated with these techniques later in this chapter, at the point in your thinking where they are most applicable.
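To give a sense of how modest these techniques can be, the sketch below computes a univariate summary of one variable and a Pearson correlation between two variables (the simplest multivariate case). The ratings are invented, and the `pearson` helper is written out by hand purely for illustration.

```python
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical paired ratings for five movies.
critic = [6.0, 7.0, 8.0, 5.0, 9.0]
audience = [6.5, 7.0, 8.5, 5.5, 8.5]

# Univariate: the shape, size and range of a single quantitative variable.
summary = {
    "min": min(critic),
    "max": max(critic),
    "range": max(critic) - min(critic),
    "mean": mean(critic),
    "median": median(critic),
    "stdev": stdev(critic),
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

# Bivariate: how strongly two variables move together.
correlation = pearson(critic, audience)

print(summary)
print(round(correlation, 2))
```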

As you get more advanced in your work (and your confidence increases) you might have occasion to employ inference techniques. These include concepts such as data modelling and the use of regression analysis: attempting to measure the relationships between variables to explore correlations and (the holy grail) causations. Many of you will likely experience visualisation challenges that require an understanding of probabilities, testing hypotheses and becoming acquainted with terms like confidence intervals. You might use these techniques to assist with forecasting or modelling risk and uncertainty. Above and beyond that, you are moving towards more advanced statistical modelling and algorithm design.

It is somewhat unsatisfactory to allocate only a small part of this text to discussing the role of descriptive and exploratory statistics. However, for the scope of this book, and seeking to achieve a pragmatic balance, the most sensible compromise is just to flag up which statistical activities you might need to consider and where these apply. It can take years to learn about the myriad advanced techniques that exist and it takes experience to know when and how to deploy all the different methods.

There are hundreds of books better placed to offer the depth of detail you truly need to fulfil these activities and there is no real need to reinvent the wheel – and indeed reinvent an inferior wheel. That statistics is just one part of the visualisation challenge, and is in itself such a prolific field, further demonstrates the variety and depth of this subject.

4.3 Data Acquisition

The first step in working with data naturally involves getting it. As I outlined in the contextual discussion about the different types of trigger curiosities, you will only have data in place before now if the opportunity presented by the data was the factor that triggered this work. You will recall this scenario was described as pursuing a curiosity born out of ‘potential intrigue’. Otherwise, you will only be in a position to know what data you need after having established your specific or general motivating curiosity. In these situations, once you have sufficiently progressed your thinking around ‘formulating your brief’, you will need to switch your thinking onto the task of acquiring your data:

What data do you need and why? From where, how, and by whom will the data be acquired? When can you obtain it?

What Data Do You Need?


Your primary concern is to ensure you can gather sufficient data about the subject in which you are interested to pursue your identified curiosity. By ‘sufficient’, I mean you will need to establish some general criteria in your mind for what data you do need and what data you do not need. There is no harm in getting more than you need at this stage but it can result in wasted efforts, waste that you would do well to avoid.

Let’s propose you have defined your curiosity to be ‘I wonder what a map of McDonald’s restaurant openings looks like over time?’. In this scenario you are going to try to find a source of data that will provide you with details of all the McDonald’s restaurants that have ever opened. A shopping list of data items would probably include the date of opening, the location details (as specific as possible) and maybe even a closing date to ensure you can distinguish between still operating and closed-down restaurants.

You will need to conduct some research, a perpetual strand of activity that runs throughout the workflow, as I explained earlier. In this scenario you might need first to research a bit of the history of McDonald’s restaurants to discover, for instance, when the first one opened, how many there are, and in which countries they are located. This will establish an initial sense of the timeframe (number of years) and scale (outlets, global spread) of your potential data. You might also discover significant differences between what is considered a restaurant and what is just a franchise positioned in shopping malls or transit hubs. Sensitivities around the qualifying criteria or general counting rules of a subject are important to discover, as they will help significantly to substantiate the integrity and accuracy of your work.

Unless you know or have been told where to find this restaurant data, you will then need to research from where the data might be obtainable. Will this type of information be published on the Web, perhaps on the commercial pages of McDonald’s own site? You might have to get in touch with somebody (yes, a human) in the commercial or PR department to access some advice. Perhaps there will be some fast-food enthusiast in some niche corner of the Web who has already gathered and made available data like this?

Suppose you locate a dataset that includes not just McDonald’s restaurants but all fast-food outlets. This could potentially broaden the scope of your curiosity, enabling broader analysis about the growth of the fast-food industry at large to contextualise McDonald’s contribution to this. Naturally, if you have any stakeholders involved in your project, you might need to discuss with them the merits of this wider perspective.

Another judgement to make concerns the resolution of the data you anticipate needing. This is especially relevant if you are working with big, heavy datasets. You might genuinely want and need all available data. This would be considered full resolution – down to the most detailed grain (e.g. all details about all McDonald’s restaurants, not just totals per city or country). Sometimes, in this initial gathering activity, it may be more practical just to obtain a sample of your data. If this is the case, what will be the criteria used to identify a sufficient sample and how will you select or exclude records? What percentage of your data will be sufficient to be representative of the range and diversity (an important feature we will need to examine next)? Perhaps you only need a statistical, high-level summary (total number of restaurants opened by year)?
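The three resolutions described above (full detail, a sample, a high-level summary) can be sketched as follows. The records are generated placeholders rather than real restaurant data, and the 10% sample size and fixed seed are arbitrary choices made for illustration.

```python
import random
from collections import Counter

# Hypothetical full-resolution records: one row per restaurant opening.
openings = [
    {"restaurant_id": i, "year_opened": 1990 + i % 5}
    for i in range(1000)
]

# A simple random sample of 10% of the records; the fixed seed makes the
# selection repeatable.
rng = random.Random(42)
sample = rng.sample(openings, k=100)

# A high-level summary: total number of openings per year.
openings_per_year = Counter(row["year_opened"] for row in openings)

print(len(openings), len(sample))
print(sorted(openings_per_year.items()))
```

A simple random sample is only one option; whether it is representative enough depends on the range and diversity of the underlying records.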

The chances are that you will not truly know what data you want or need until you at least get something to start with and learn from there. You might have to revisit or repeat the gathering of your data, so an attitude of ‘what I have is good enough to start with’ is often sensible.

From Where, How and By Whom Will the Data Be Acquired?

There are several different origins and methods involved in acquiring data, depending on whether it will involve your doing the heavy work to curate the data or if this will be the main responsibility of others.

Curated by You

This group of data-gathering tasks or methods is characterised by your having to do most of the work to bring the data together into a convenient digital form.

Primary data collection: If the data you need does not exist or you need to have full control over its provenance and collection, you will have to consider embarking on gathering ‘primary’ data. In contrast to secondary data, primary data involves you measuring and collecting the raw data yourself. Typically, this relates to situations where you gather quite small, bespoke datasets about phenomena that are specific to your needs. It might be a research experiment you have designed and launched for participants to submit responses. You may manually record data from other measurement devices, such as your daily weight as measured by your bathroom scales, or the number of times you interacted face-to-face with friends and family. Some people take daily photographs of themselves, their family members or their gardens, in order to stitch these back together eventually to portray stories of change. This data-gathering activity can be expensive in terms of both time and cost. The benefit, however, is that you have carefully controlled the collection of the data to optimise its value for your needs.

Manual collection and data foraging: If the data you need does not exist digitally or in a convenient singular location, you will need to forage for it. This again might typically relate to situations where you are sourcing relatively small datasets. An example might be researching historical data from archived newspapers that were only published in print form and not available digitally. You might look to pull data from multiple sources to create a single dataset: for example, if you were comparing the attributes of a range of different cars and weighing up which to buy. To achieve this you would probably need to source different parts of the data you need from several different places. Often, data foraging is something you undertake in order to finish off data collected by other means that might have a few missing values. It is sometimes more efficient to find the remaining data items yourself by hand to complete the dataset. This can be somewhat time-consuming depending on the extent of the manual gathering required, but it does provide you with greater assurance over the final condition of the data you have collected.

Extracted from pdf files: A special subset of data foraging – or a variation at least – involves those occasions when your data is digital but essentially locked away in a pdf file. For many years now reports containing valuable data have been published on the Web in pdf form. Increasingly, movements like ‘open data’ are helping to shift the attitudes of organisations towards providing additional, fully accessible digital versions of data. Progress is being made but it will take time before all industries and government bodies adopt this as a common standard. In the meantime, there are several tools on the market (free and proprietary) that will assist you in extracting tables of data from pdf files and converting these to more usable Excel or CSV formats.

Some data acquisition tasks may be repetitive and, should you possess the skills and have access to the necessary resources, there will be scope for exploring ways to automate these. However, you always have to consider the respective effort and ongoing worth of your approach. If you do go to the trouble of authoring an automation routine (of any description) you could end up spending more time on that than you would otherwise collecting by more manual methods. If it is going to be a regular piece of analysis the efficiency gains from your automation will unquestionably prove valuable going forward, but, for any one-off projects, it may not be ultimately worth it.

Web scraping (also known as web harvesting): This involves using special tools or programs to extract structured and unstructured items of data published in web pages and convert these into tabulated form for analysis. For example, you may wish to extract several years’ worth of test cricket results from a sports website. Depending on the tools used, you can often set routines in motion to extract data across multiple pages of a site based on the connected links that exist within it. This is known as web crawling. Using the same example (let’s imagine), you could further your gathering of test cricket data by programmatically fetching data back from the associated links pointing to the team line-ups. An important consideration to bear in mind with any web scraping or crawling activity concerns rules of access and the legalities of extracting the data held on certain sites. Always check – and respect – the terms of use before undertaking this.
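The core of a scraping routine is turning markup into rows and columns. The sketch below uses only Python's standard library and works on an invented HTML fragment held in a string, so nothing is fetched from a real site; the team names, dates and the `TableExtractor` class are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page fragment. In practice this HTML would come from a site
# whose terms of use permit extraction.
PAGE = """
<table>
  <tr><th>Date</th><th>Home</th><th>Away</th><th>Result</th></tr>
  <tr><td>2015-07-08</td><td>Team A</td><td>Team B</td><td>Team A won</td></tr>
  <tr><td>2015-07-16</td><td>Team A</td><td>Team B</td><td>Team B won</td></tr>
</table>
"""

class TableExtractor(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()

extractor = TableExtractor()
extractor.feed(PAGE)

# The first row holds the column names; the rest become tabulated records.
header, *records = extractor.rows
results = [dict(zip(header, record)) for record in records]

for result in results:
    print(result)
```

Real pages are rarely this tidy, which is why dedicated scraping tools exist, but the principle of converting published markup into tabulated form is the same.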

Curated by Others

In contrast to the list of methods I have profiled, this next set of data-gathering approaches is characterised by other people having done most of the work to source and compile the data. They will make it available for you to access in different ways without needing the extent of manual efforts often required with the methods presented already. You might occasionally still have to intervene by hand to fine-tune your data, but others would generally have put in the core effort.

Issued to you: On the occasions when you are commissioned by a stakeholder (client, colleague) you will often be provided with the data you need (and probably much more besides), most commonly in a spreadsheet format. The main task for you is therefore less about collection and more about familiarisation with the contents of the data file(s) you are set to work with.

Download from the Web: Earlier I bemoaned the fact that there are still organisations publishing data (through, for example, annual reports) in pdf form. To be fair, increasingly there are facilities being developed that enable interested users to extract data in a more structured form. More sophisticated reporting interfaces may offer users the opportunity to construct detailed queries to extract and download data that is highly customised to their needs.

System report or export: This is related more to an internal context in organisations where there are opportunities to extract data from corporate systems and databases. You might, for example, wish to conduct some analysis about staff costs and so the personnel database may be where you can access the data about the workforce and their salaries.

‘Don’t underestimate the importance of domain expertise. At the Office for National Statistics (ONS), I was lucky in that I was very often working with the people who created the data – obviously, not everyone will have that luxury. But most credible data producers will now produce something to accompany the data they publish and help users interpret it – make sure you read it, as it will often include key findings as well as notes on reliability and limitations of the data.’ Alan Smith OBE, Data Visualisation Editor, Financial Times

Third-party services: There is an ever-increasing marketplace for data and many commercial services out there now offer extensive sources of curated and customised data that would otherwise be impossible to obtain or very complex to gather. Such requests might include very large, customised extracts from social media platforms like Twitter based on specific keywords and geo-locations.

API: An API (Application Programming Interface) offers the means to create applications that programmatically access streams of data from sites or services, such as accessing a live feed from Transport for London (TfL) to track the current status of trains on the London Underground system.
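Many APIs return their data as JSON text, which then needs converting into records you can analyse. The sketch below parses an invented response body; the field names and line names are hypothetical and do not reflect the format of the TfL service or any other real API.

```python
import json

# Hypothetical response body. A real service defines its own fields and format.
response_body = """
[
  {"line": "Line A", "status": "Good Service"},
  {"line": "Line B", "status": "Minor Delays"},
  {"line": "Line C", "status": "Good Service"}
]
"""

# Convert the JSON text into a list of Python dictionaries (one per record).
records = json.loads(response_body)

# Pick out the lines that are not running normally.
disrupted = [r["line"] for r in records if r["status"] != "Good Service"]

print(disrupted)
```

In a working application the response body would come from a request to the service, subject to its documentation, access keys and usage limits.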


When Can the Data Be Acquired?

The issue of when data is ready and available for acquisition is a delicate one. If you are conducting analysis of some survey results, naturally you will not have the full dataset of responses to work with until the survey is closed. However, you could reasonably begin some of your analysis work early by using an initial sample of what had been submitted so far. Ideally you will always work with data that is as complete as possible, but on occasions it may be advantageous to take the opportunity to get an early sense of the nature of the submitted responses in order to begin preparing your final analysis routines. Working on any dataset that may not yet be complete is a risk. You do not want to progress too far ahead with your visualisation workflow if there is the real prospect that any further data that emerges could offer new insights or even trigger different, more interesting curiosities.

4.4 Data Examination

After acquiring your data your next step is to thoroughly examine it. As I have remarked, your data is your key raw material from which the eventual visualisation output will be formed. Before you choose what meal to cook, you need to know what ingredients you have and what you need to do to prepare them.

It may be that, in the act of acquiring the data, you have already achieved a certain degree of familiarity about its status, characteristics and qualities, especially if you curated the data yourself. However, there is a definite need to go much further than you have likely achieved before now. To do this you need to conduct an examination of the physical properties and the meaning of your data.

As you progress through the stages of this workflow, your data will likely change considerably: you will bring more of it in, you will remove some of it, and you will refine it to suit your needs. All these modifications will alter the physical makeup of your data so you will need to keep revisiting this step to preserve your critical familiarity.

Data Properties


The first part of familiarising yourself with your data is to undertake an examination of its physical properties. Specifically you need to ascertain its type, size and condition. This task is quite mechanical in many ways because you are in effect just ‘looking’ at the data, establishing its surface characteristics through visual and/or statistical observations.

What To Look For?

The type and size of your data involve assessing the characteristics and amount of data you have to work with. As you examine the data you also need to determine its condition: how good is its quality and is it fit for purpose?

Data types: Firstly, you need to identify what data types you have. In gathering this data in the first place you might already have a solid appreciation about what you have before you, but doing this thoroughly helps to establish the attention to detail you will need to demonstrate throughout this stage. Here you will need to refer to the definitions from earlier in the chapter about the different types of data (TNOIR). Specifically you are looking to define each column or field of data based on whether it is qualitative (text, nominal, ordinal) or quantitative (interval, ratio) and whether it is discrete or continuous in nature.

Size: Within each column or field you next need to know what range of values exist and what are the specific attributes/formats of the values held. For example, if you have a quantitative variable (interval or ratio), what is the lowest and the highest value? In what number format is it presented (i.e. how many decimal points or comma formatted)? If it is a categorical variable (nominal or ordinal), how many different values are held? If you have textual data, what is the maximum character length or word count?

Condition: This is the best moment to identify any data quality and completeness issues. Naturally, unidentified and unresolved issues around data quality will come to bite hard later, undermining the scope and, crucially, trust in the accuracy of your work. You will address these issues next in the ‘transformation’ step, but for now the focus is on identifying any problems. Things to look out for may include the following:

Missing values, records or variables – Are empty cells assumed as being of no value (zero/nothing) or no measurement (n/a, null)? This is a subtle but important difference.

Erroneous values – Typos and any value that clearly looks out of place (such as a gender value in the age column).

Inconsistencies – Capitalisation, units of measurement, value formatting.

Duplicate records.

Out of date – Values that might have expired in accuracy, like someone’s age or any statistic that would be reasonably expected to have subsequently changed.

Uncommon system characters or line breaks.

Leading or trailing spaces – the invisible evil!

Date issues around format (dd/mm/yy or mm/dd/yy) and basis (systems like Excel’s base dates on daily counts since 1 January 1900, but not all do that).
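For readers working outside Excel, several of the condition checks listed above could be sketched in Python along the following lines. This is only an illustration under stated assumptions: the sample rows, the field names and the `condition_report` helper are all hypothetical, and a real dataset would need checks tailored to its own columns.

```python
from collections import Counter

# Hypothetical sample rows standing in for a table loaded from a spreadsheet.
sample_rows = [
    {"name": "Ann", "gender": "Female", "age": "34"},
    {"name": "Bob ", "gender": "male", "age": ""},
    {"name": "Cy", "gender": "M", "age": "Male"},
    {"name": "Ann", "gender": "Female", "age": "34"},
]

def condition_report(rows, numeric_fields=("age",)):
    """Count a few common data-quality problems by issue type."""
    issues = Counter()
    seen = set()
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            issues["duplicate records"] += 1
        seen.add(key)
        for field, value in row.items():
            if value.strip() == "":
                issues["missing values"] += 1
            elif value != value.strip():
                issues["leading/trailing spaces"] += 1
            # A non-empty value in a numeric field that is not a number looks erroneous.
            if field in numeric_fields and value.strip():
                try:
                    float(value)
                except ValueError:
                    issues["erroneous values"] += 1
    return dict(issues)

report = condition_report(sample_rows)

# Listing the distinct values of a categorical field exposes inconsistencies
# in capitalisation and coding ("Female", "male", "M").
gender_values = Counter(row["gender"] for row in sample_rows)

print(report)
print(gender_values)
```

The report only counts problems; deciding whether an empty cell means zero or ‘not measured’ remains a judgement that the code cannot make for you.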

How to Approach This?

I explained in the earlier ‘Data literacy’ section the difference in asset types (data that exists in tables and data that exists as isolated values) and also the difference in form (normalised data or cross-tabulated). Depending on the asset and form of data, your examination of data types may involve slightly different approaches, but the general task is the same. Performing this examination process will vary, though, based on the tools you are using. The simplest approach, relevant to most, is to describe the task as you would undertake it using Excel, given that this continues to be the common tool most people use or have the skills to use. Also, it is likely that most visualisation tasks you undertake will involve data of a size that can be comfortably handled in Excel.

As you go through this task, it is good practice to note down a detailed overview of what data you have, perhaps in the form of a table of data descriptions. This is not as technical a duty as would be associated with the creation of a data dictionary but its role and value are similar, offering a convenient means to capture all the descriptive properties of your various data assets.

‘Data inspires me. I always open the data in its native format and look at the raw data just to get the lay of the land. It’s much like looking at a map to begin a journey.’ Kim Rees, Co-founder, Periscopic


Inspect and scan: Your first task is just to scan your table of data visually. Navigate around it using the mouse/trackpad, use the arrow keys to move up or down and left or right, and just look at all the data. Gain a sense of its overall dimension. How many columns and how many rows does it occupy? How big a prospect might working with this be?

Data operations: Inspecting your data more closely might require the use of interrogation features such as sorting columns and doing basic filters. This can be a quick and simple way to acquaint yourself with the type of data and range of values.

Going further, once again depending on the technology (and assuming you have normalised data to start with), you might apply a cross-tabulation or pivot table to create aggregated, summary views of different angles and combinations of your data. This can be a useful approach to also check out the unique range of values that exist under different categories as well as helping to establish how sub-categories may relate to other categories hierarchically. This type of inspection will be furthered in the next step of the ‘working with data’ process when you will undertake deeper visual interrogations of the type, size and condition of your data.

If you have multiple tables, you will need to repeat this approach for each one as well as determine how they are related collectively and on what basis. It could be that just considering one table as the standard template, representative of each instance, is sufficient: for example, if each subsequent table is just a different monthly view of the same activity.

For so-called ‘Big Data’ (see the glossary definition earlier), it is less likely that you can conduct this examination work through relatively quick, visual observations using Excel. Instead it will need tools based around statistical language that will describe for you what is there rather than let you look at what is there.

Statistical methods: The role of statistics in this examination stage generally involves relatively basic quantitative analysis methods to help describe and understand the characteristics of each data variable. The common term applied to this type of statistical approach is univariate, because it involves just looking at one variable at a time (the best opportunity to perform the analysis of multiple variables comes later). Here are some different types of statistical analyses you might find useful at this stage. These are not the only methods you will ever need to use, but will likely prove to be among the most common:

Frequency counts: applied to categorical values to understand the frequency of different instances.

Frequency distribution: applied to quantitative values to learn about the type and shape of the distribution of values.

Measurements of central tendency describe the summary attributes of a group of quantitative values, including:

the mean (the average value); the median (the middle value if all quantities were arranged from smallest to largest); the mode (the most common value).

Measurements of spread are used to describe the dispersion of values above and below the mean:

Maximum, minimum and range: the highest and lowest and magnitude of spread of values.

Percentiles: the value below which x% of values fall (e.g. the 20th percentile is the value below which 20% of all quantitative values fall).

Standard deviation: a calculated measure used to determine how spread out a series of quantitative values are.
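The univariate measures listed above can be computed with Python's built-in `statistics` module, shown here as a minimal sketch. The two variables and their values are invented purely for illustration, and the ten-year bin width used for the frequency distribution is an arbitrary choice.

```python
import statistics
from collections import Counter

# Hypothetical values for one categorical and one quantitative variable.
categories = ["white", "black", "white", "hispanic", "white", "black"]
ages = [34, 29, 41, 38, 34, 52, 45, 31]

# Frequency counts for the categorical variable.
frequency_counts = Counter(categories)

# A crude frequency distribution: count ages falling into ten-year bins.
age_distribution = Counter((age // 10) * 10 for age in ages)

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "minimum": min(ages),
    "maximum": max(ages),
    "range": max(ages) - min(ages),
    "std_dev": statistics.stdev(ages),  # sample standard deviation
}

# quantiles(n=5) returns the four cut points at the 20th, 40th, 60th and 80th percentiles.
percentiles = statistics.quantiles(ages, n=5)

print(frequency_counts, age_distribution, summary, percentiles)
```

Note that `statistics.stdev` is the sample standard deviation; `statistics.pstdev` would be the choice if the values represent the whole population.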

Data Meaning

Irrespective of whether you or others have curated the data, you need to be discerning about how much trust you place in it, at least to begin with. As discussed in the ‘trustworthy design’ principle, there are provenance issues, inaccuracies and biases that will affect its status on the journey from being created to being acquired. These are matters you need to be concerned with in order to resolve or at least compensate for potential shortcomings.

Knowing more about the physical properties of your data does not yet achieve full familiarity with its content nor give you sufficient acquaintance with its qualities. You will have examined the data in a largely mechanical and probably quite detached way from the underlying subject matter. You now need to think a little deeper about its meaning, specifically what it does – and does not – truly represent.

‘A visualization is always a model (authored), never a mould (replica), of the real. That’s a huge responsibility.’ Paolo Ciuccarelli, Scientific Director of DensityDesign Research Lab at Politecnico di Milano

What Phenomenon?

Determining the meaning of your data requires that you recognise this is more than just a bunch of numbers and text values held in the cells of a table. Ask yourself, ‘What is it about? What activity, entity, instance or phenomenon does it represent?’.

One of the most valuable pieces of advice I have seen regarding this task came from Kim Rees, co-founder of Periscopic. Kim describes the process of taking one single row of data and using that as an entry point to learn carefully about what each value means individually and then collectively. Breaking down the separation between values created by the table’s cells, and then sticking the pieces back together, helps you appreciate the parts and the whole far better.

You saw the various macro- and micro-level views applied to the context of the Texas Department of Criminal Justice executed offenders information in the previous chapter. The underlying meaning of this data – its phenomenon – was offenders who had been judged guilty of committing heinous crimes and had faced the ultimate consequence. The availability of textual data describing the offenders’ last statements and details of their crimes heightened the emotive potential of this data. It was heavy stuff. However, it was still just a collection of values detailing dates, names, locations, categories. All datasets, whether on executed offenders or the locations of McDonald’s restaurants, share the same properties as outlined by the TNOIR data-type mnemonic. What distinguishes them is what these values mean.

What you are developing here is a more semantic appreciation of your data to substantiate the physical definitions. You are then taking that collective appreciation of what your data stands for to influence how you might decide to amplify or suppress the influence of this semantic meaning. This builds on the discussion in the last chapter about the tonal dimension, specifically the difference between figurative and non-figurative portrayals.

‘Absorb the data. Read it, re-read it, read it backwards and understand the lyrical and human-centred contribution.’ Kate McLean, Smellscape Mapper and Senior Lecturer Graphic Design

A bar chart (Figure 4.4) comprising two bars, one of height 43 and the other of height 1, arguably does not quite encapsulate the emotive significance of Barack Obama becoming the first black US president, succeeding the 43 white presidents who served before him. Perhaps a more potent approach may be to present a chronological display of 44 photographs of each president in order to visually contrast Mr Obama’s headshot in the final image in the sequence with the previous 43. Essentially, the value of 43 is almost irrelevant in its detail – it could be 25 or 55 – it is about there being ‘many’ of the same thing followed by the ‘one’ that is ‘different’. That’s what creates the impact. (What will image number 45 bring? A further striking ‘difference’ or a return to the standard mould?)

Figure 4.4 US Presidents by Ethnicity (1789 to 2015)

Learning about the underlying phenomena of your data helps you feel its spirit more strongly than just looking at the rather agnostic physical properties. It also helps you in knowing what potential sits inside the data – the qualities it possesses – so you are then equipped with the best understanding of how you might want to portray it. Likewise it prepares you for the level of responsibility and potential sensitivity you will face in curating a visual representation of this subject matter. As you saw with the case study of the ‘Florida Gun Crimes’ graphic, some subjects are inherently more emotive than others, so we have to demonstrate a certain amount of courage and conviction in deciding how to undertake such challenges.

‘Find loveliness in the unlovely. That is my guiding principle. Often, topics are disturbing or difficult; inherently ugly. But if they are illustrated elegantly there is a special sort of beauty in the truthful communication of something. Secondly, Kirk Goldsberry stresses that data visualization should ultimately be true to a phenomenon, rather than a technique or the format of data. This has had a huge impact on how I think about the creative process and its results.’ John Nelson, Cartographer

Completeness

Another aspect of examining the meaning of data is to determine how representative it is. I have touched on data quality already, but inaccuracies in conclusions about what data is saying have arguably a greater impact on trust and are more damaging than any individual missing elements of data.

The questions you need to ask of your data are: does it represent genuine observations about a given phenomenon or is it influenced by the collection method? Does your data reflect the entirety of a particular phenomenon, a recognised sample, or maybe even an obstructed view caused by hidden limitations in the availability of data about that phenomenon?

Reflecting on the published executed offenders data, there would be a certain confidence that it is representative of the total population of executions but with a specific caveat: it is all the executed offenders under the jurisdiction of the Texas Department of Criminal Justice since 1982. It is not the whole of the executions conducted across the entire USA nor is it representative of all the executions that have taken place throughout the history of Texas. Any conclusions drawn from this data must be boxed within those parameters.


The matter of judging completeness can be less about the number of records and more a question of the integrity of the data content. This executed offenders dataset would appear to be a trusted and reliable record of each offender but would there/could there be an incentive for the curators of this data not to capture, for example, the last statements as they were explicitly expressed? Could they have possibly been in any way sanitised or edited, for example? These are the types of questions you need to pose. This is not aimless cynicism, it is about seeking assurances of quality and condition so you can be confident about what you can legitimately present and conclude from it (as well as what you should not).

Consider a different scenario. If you are looking to assess the political mood of a nation during a televised election debate, you might consider analysing Twitter data by looking at the sentiments for and against the candidates involved. Although this would offer an accessible source of rich data, it would not provide an entirely reliable view of the national mood. It could only offer algorithmically determined insights (i.e. through the process of determining the sentiment from natural language) of the people who have a Twitter account, are watching the debate and have chosen to tweet about it during a given timeframe.

Now, just because you might not have access to a ‘whole’ population of political opinion data does not mean it is not legitimate to work on a sample. Sometimes samples are astutely reflective of the population. And in truth, if samples were not viable then most of the world’s analyses would need to cease immediately.

A final point is to encourage you to probe any absence of data. Sometimes you might choose to switch the focus away from the data you have got towards the data you have not got. If the data you have is literally as much as you can acquire but you know the subject should have more data about it, then perhaps shine a light on the gaps, making that your story. Maybe you will unearth a discovery about the lack of intent or will to make the data available, which in itself may be a fascinating discovery. As transparency increases, those who are not transparent stand out the most.

‘This is one of the first questions we should ask about any dataset: what is missing? What can we learn from the gaps?’ Jer Thorp, Founder of The Office for Creative Research


Any identified lack of completeness or full representativeness is not an obstacle to progress, it just means you need to tread carefully with regard to how you might represent and present any work that emerges from it. It is about caution not cessation.

Influence on Process

This extensive examination work gives you an initial – but thorough – appreciation of the potential of your data, the things it will offer and the things it will not. Of course this potential is as yet unrealised. Furthering this examination will be the focus of the next activity, as you look to employ more visual techniques to help unearth the as-yet-hidden qualities of understanding locked away in the data. For now, this examination work takes your analytical and creative thinking forward another step.

Purpose map ‘tone’: Through deeper acquaintance with your data, you will have been able to further consider the suitability of the potential tone of your work. Learning more about the inherent characteristics of the subject might help to confirm or redefine your intentions for adopting a utilitarian (reading) or sensation-based (feeling) tone.

Editorial angles: The main benefit of exploring the data types is to arrive at an understanding of what you have and have not got to work with. More specifically, it guides your thinking towards what possible angles of analysis may be viable and relevant, and which can be eliminated. For example, if you do not have any location or spatial data, this rules out the immediate possibility of being able to map your data. This is not something you could pursue with the current scope of your dataset. If you do have time-based data then the prospect of conducting analysis that might show changes over time is viable. You will learn more about this idea of editorial ‘angle’ in the next chapter but let me state now it is one of the most important components of visualisation thinking.

Physical properties influence scale: Data is your raw material, your ideas are not. I stated towards the end of Chapter 3 that you should embrace the instinctive manifestations of ideas and seek influence and inspiration from other sources. However, with the shape and size of your data having such an impact on any eventual designs, you must respect the need to be led by your data’s physical properties and not just your ideas.


Figure 4.5 OECD Better Life Index


In particular, the range of values in your data will shape things significantly. The shape of data in the ‘Better Life Index’ project you saw earlier is a good example. Figure 4.5 presents an analysis of the quality of life across the 36 OECD member states. Each country is a flower comprising 11 petals with each representing a different quality of life indicator (the larger the petal, the better the measured quality of life). Consider this. Would this design concept still be viable if there were 20 indicators? Or just 3? How about if the analysis was for 150 countries? The connection between data range and chart design involves a discerning judgement about ‘fit’. You need to identify carefully the underlying shape of the data to be displayed and what tolerances this might test in the shape of the possible design concepts used.


‘My design approach requires that I immerse myself deeply in the problem domain and available data very early in the project, to get a feel for the unique characteristics of the data, its “texture” and the affordances it brings. It is very important that the results from these explorations, which I also discuss in detail with my clients, can influence the basic concept and main direction of the project. To put it in Hans Rosling’s words, you need to “let the data set change your mind set”.’ Moritz Stefaner, Truth & Beauty Operator

Another relevant concern involves the challenge of elegantly handling quantitative measures that have hugely varied value ranges and contain (legitimate) outliers. Accommodating all the values into a single display can have a hugely distorting impact on the space it occupies. For example, note the exceptional size of the shape for Avatar in Figure 4.6, from the ‘Spotlight on profitability’ graphic you saw earlier. It is the one movie included that bursts through the ceiling, far beyond the otherwise entirely suitable 1000 million maximum scale value. As a single outlier, in this case, it was treated with a rather unique approach. As you can see, its striking shape conveniently trespasses onto the space offered by the two empty rows above. The result emphasises this value’s exceptional quality. You might seldom have the luxury of this type of effective resolution, so the key point to stress is always be acutely aware of the existence of ‘Avatars’ in your data.
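One way to become aware of the ‘Avatars’ in a quantitative variable before committing to a scale is to flag values that sit far outside the bulk of the data. The sketch below uses the common interquartile-range rule of thumb, which is an assumption introduced here rather than a method named in this chapter; the figures and the `find_outliers` helper are hypothetical.

```python
import statistics

# Hypothetical figures in millions; the final value plays the role of an 'Avatar'.
gross_millions = [120, 340, 95, 610, 480, 275, 820, 150, 2780]

def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

outliers = find_outliers(gross_millions)
print(outliers)
```

A flagged value is not necessarily an error. As with Avatar, it may be entirely legitimate and simply need special handling in the design.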

Figure 4.6 Spotlight on Profitability


4.5 Data Transformation

Having undertaken an examination of your data you will have a good idea about what needs to be done to ensure it is entirely fit for purpose. The next activity is to work on transforming the data so it is in optimum condition for your needs.

At this juncture, the linearity of a book becomes rather unsatisfactory. Transforming your data is something that will take place before, during and after both the examination and (upcoming) exploration steps. It will also continue beyond the boundaries of this stage of the workflow. For example, the need to transform data may only emerge once you begin your ‘editorial thinking’, as covered by the next chapter (indeed you will likely find yourself bouncing forwards and backwards between these sections of the book on a regular basis). As you get into the design stage you will constantly stumble upon additional reasons to tweak the shape and size of your data assets. The main point here is that your needs will evolve. This moment in the workflow is not going to be the only or final occasion when you look to refine your data.

Two important notes to share upfront at this stage. Firstly, in accordance with the desire for trustworthy design, any treatments you apply to your data need to be recorded and potentially shared with your audience. You must be able to reveal the thinking behind any significant assumptions, calculations and modifications you have made to your data.

Secondly, I must emphasise the critical value of keeping backups. Before you undertake any transformation, make a copy of your dataset. After each major iteration remember to save a milestone version for backup purposes. Additionally, when making changes, it is useful to preserve original (unaltered) data items nearby for easy rollback should you need them. For example, suppose you are cleaning up a column of messy data to do with ‘Gender’ that has a variety of inconsistent values (such as “M”, “Male”, “male”, “FEMALE”, “F”, “Female”). Normally I would keep the original data, duplicate the column, and then tidy up this second column of values. I have then gained access to both original and modified versions. If you are going to do any transformation work that might involve a significant investment of time and (manual) effort, having an opportunity to refer to a previous state is always useful in my experience.
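The ‘duplicate the column, then tidy the copy’ habit described above might look like this outside a spreadsheet. It is a minimal sketch: the `GENDER_LABELS` mapping, the ‘Gender (clean)’ field name and the `add_clean_gender` helper are assumptions made for illustration.

```python
# Hypothetical mapping from messy values (stripped and lower-cased) to tidy labels.
GENDER_LABELS = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

records = [{"Gender": v} for v in ["M", "Male", "male", "FEMALE", "F", "Female"]]

def add_clean_gender(rows):
    """Return new records with a 'Gender (clean)' field; the original 'Gender' is untouched."""
    cleaned = []
    for rec in rows:
        key = rec["Gender"].strip().lower()
        # Unrecognised values are kept as-is so they can be reviewed by hand.
        cleaned.append({**rec, "Gender (clean)": GENDER_LABELS.get(key, rec["Gender"])})
    return cleaned

cleaned_records = add_clean_gender(records)
for rec in cleaned_records:
    print(rec["Gender"], "->", rec["Gender (clean)"])
```

Because the original values sit alongside the tidied ones, any surprising result can be traced back and rolled back without returning to the source file.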

There are four different types of potential activity involved in transforming your data: cleaning, converting, creating and consolidating.

Transform to clean: I spoke about the importance of data quality (better quality in, better quality out, etc.) in the examination section when looking at the physical condition of the data. There’s no need to revisit the list of potential observations you might need to consider looking out for but this is the point where you will need to begin to address these.

There is no single or best approach for how to conduct this task. Some issues can be addressed through a straightforward ‘find and replace’ (or remove) operation. Some treatments will be possible using simple functions to convert data into new states, such as using logic formulae that state ‘if this, do this, otherwise do that’. For example, if the value in the ‘Gender’ column is “M” make it “Male”, if the value is “MALE” make it “Male” etc. Other tasks might be much more intricate, requiring manual intervention, often in combination with inspection features like ‘sort’ or ‘filter’, to find, isolate and then modify problem values.

Part of cleaning up your data involves the elimination of junk. Going back to the earlier scenario about gathering data about McDonald’s restaurants, you probably would not need the name of the restaurant manager, details of the opening times or the contact telephone number. It is down to your judgement at the time of gathering the data to decide whether these extra items of detail – if they were as easily acquirable as the other items of data that you really did need – may potentially provide value for your analysis later in the process. My tactic is usually to gather as much data as I can and then reject/trim later; later has arrived and now is the time to consider what to remove. Any fields or rows of data that you know serve no ongoing value will take up space and attention, so get rid of these. You will need to separate the wheat from the chaff to help reduce your problem.

Transform to convert: Often you will seek to create new data values out of existing ones. In the illustration in Figure 4.7, it might be useful to extract the constituent parts of a ‘Release Date’ field in order to group, analyse and use the data in different ways. You might use the ‘Month’ and ‘Year’ fields to aggregate your analysis at these respective levels in order to explore within-year and across-year seasonality. You could also create a ‘Full Release Date’ formatted version of the date to offer a more presentable form of the release date value possibly for labelling purposes.
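As a rough sketch of the ‘Release Date’ conversion just described, the following Python splits a date into ‘Month’ and ‘Year’ parts and builds a presentable ‘Full Release Date’ label. The example date, the output format and the `date_parts` helper are assumptions for illustration and are not taken from Figure 4.7.

```python
from datetime import date

MONTH_NAMES = ("January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December")

def date_parts(d):
    """Derive grouping fields and a label-friendly form from a single date value."""
    return {
        "Release Date": d.isoformat(),
        "Month": d.month,
        "Year": d.year,
        "Full Release Date": f"{d.day} {MONTH_NAMES[d.month - 1]} {d.year}",
    }

# Hypothetical release date.
parts = date_parts(date(2009, 12, 18))
print(parts)
```

Keeping ‘Month’ and ‘Year’ as numbers makes them easy to sort and aggregate, while the formatted string is reserved for labels.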

Figure 4.7 Example of Converted Data Transformation


Extracting or deriving new forms of data will be necessary when it comes to handling qualitative ‘textual’ data. As stated in the ‘Data literacy’ section, if you have textual data you will generally always need to transform this into various categorical or quantitative forms, unless its role is simply to provide value as an annotation (such as a quoted caption or label). Some would argue that qualitative visualisation involves special methods for the representation of data. I would disagree. I believe the unique challenge of working with textual data lies with the task of transforming the data: visually representing the extracted and derived properties from textual data involves the same suite of representation options (i.e. chart types) that would be useful for portraying analysis of any other data types.

Here is a breakdown of some of the conversions, calculations and extractions you could apply to textual data. Some of these tasks can be quite straightforward (e.g. using the LEN function in Excel to determine the number of characters) while others are more technical and will require more sophisticated tools or programmes dedicated to handling textual data.

Categorical conversions:

Identify keywords or summary themes from text and convert these into categorical classifications.

Identify and flag up instances of certain cases existing or otherwise (e.g. X is mentioned in this passage).

Identify and flag up the existence of certain relationships (e.g. A and B were both mentioned in the same passage, C was always mentioned before D).

Use natural language-processing techniques to determine sentiments, to identify specific word types (nouns, verbs, adjectives) or sentence structures (around clauses and punctuation marks).

With URLs, isolate and extract the different components of a website address and sub-folder locations.

Quantitative conversions:

Calculate the frequency of certain words being used.

Analyse the attributes of text, such as total word count, physical length, potential reading duration.

Count the number of sentences or paragraphs, derived from the frequency of different punctuation marks.

Position the temporal location of certain words/phrases in relation to other words/phrases or compared to the whole (e.g. X was mentioned at 1m51s).

Position the spatial location of certain words/phrases in relation to other words/phrases or compared to the whole.
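A few of the simpler conversions in the lists above can be sketched with Python's standard library. The passage of text, the keyword being flagged and the URL are all invented for illustration, and the word and sentence splitting is deliberately naive; sentiment or part-of-speech analysis would need dedicated language-processing tools.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical passage of text.
passage = "The data was late. The data was messy. We cleaned it!"

words = re.findall(r"[a-z']+", passage.lower())

text_profile = {
    "character_length": len(passage),                       # like Excel's LEN
    "word_count": len(words),
    "sentence_count": len(re.findall(r"[.!?]+", passage)),  # derived from punctuation
    "mentions_messy": "messy" in words,                     # categorical flag
    "word_frequency": Counter(words),
}

# Hypothetical URL, split into its domain and sub-folder components.
url = urlparse("https://www.example.com/reports/2013/index.html")
url_parts = {
    "domain": url.netloc,
    "folders": url.path.strip("/").split("/")[:-1],
}

print(text_profile)
print(url_parts)
```

Each derived value is now categorical or quantitative, so it can be charted with the same options as any other data.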

A further challenge that falls under this ‘converting’ heading will sometimes emerge when you are working with data supplied by others in spreadsheets. This concerns the obstacles created when trying to analyse data that has been formatted visually, perhaps in readiness for printing. If you receive data in this form you will need to unpack and reconstruct it into the normalised form described earlier, comprising all records and fields included in a single table.

Any merged cells need unmerging or removing. You might have a heading that is common to a series of columns. If you see this, unmerge it and replicate the same heading across each of the relevant columns (perhaps appending an index number to each header to maintain some differentiation). Cells that have visual formatting like background shading or font attributes (bold, coloured) to indicate a value or status are useful when observing and reading the data, but for analysis operations these properties are largely invisible. You will need to create new values in actual data form that are not visual (creating categorical values, say, or status flags like ‘yes’ or ‘no’) to recreate the meaning of the formats. The data provided to you – or that you create – via a spreadsheet does not need to be elegant in appearance, it needs to be functional.
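The advice on merged headings can be illustrated with a small sketch. It assumes the header row has already been read into a list in which the cells hidden by a merge appear as empty strings, and that the first cell is not blank; the `unmerge_headers` helper and the sample headings are hypothetical.

```python
def unmerge_headers(header_cells):
    """Fill blank header cells from the last non-blank heading and number the repeats."""
    filled, current, counts = [], None, {}
    for cell in header_cells:
        if cell not in (None, ""):
            current = cell
        filled.append(current)
        counts[current] = counts.get(current, 0) + 1

    result, seen = [], {}
    for name in filled:
        if counts[name] > 1:
            # Append an index so repeated headings remain distinguishable.
            seen[name] = seen.get(name, 0) + 1
            result.append(f"{name} {seen[name]}")
        else:
            result.append(name)
    return result

# 'Sales' was merged across three columns in the hypothetical source sheet.
headers = unmerge_headers(["Country", "Sales", "", "", "Notes"])
print(headers)
```

Meaning carried by shading or bold text cannot be recovered this way; it still has to be re-entered as explicit values such as ‘yes’ or ‘no’ flags.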


Transform to create: This task is something I refer to as the hidden cleverness, where you are doing background thinking to form new calculations, values, groupings and any other mathematical or manual treatments that really expand the variety of data available.

A simple example might involve the need to create some percentage calculations in a new field, based on related quantities elsewhere within your existing data. Perhaps you have pairs of ‘start date’ and ‘end date’ values and you need to calculate the duration in days for all your records. You might use a logic formula to assist in creating a new variable that summarises another – maybe something like (in language terms) IF Age < 18 THEN status = “Child”, ELSE status = “Adult”. Alternatively, you might want to create a calculation that standardises some quantities: you may need to source base population figures for all the relevant locations in your data in order to convert some quantities into ‘per capita’ values. This would be particularly necessary if you anticipate wanting to map the data as this will ensure you are facilitating legitimate comparisons.
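The four kinds of created value mentioned above (a percentage, a duration in days, a logic-based status and a population-standardised rate) could be derived as follows. The record, its field names and the `derive` helper are hypothetical, and expressing the rate per 100,000 people rather than strictly per capita is simply one convenient scaling.

```python
from datetime import date

# Hypothetical record holding the raw ingredients for the derived values.
record = {
    "part": 45, "whole": 180,
    "start_date": date(2015, 3, 1), "end_date": date(2015, 3, 15),
    "age": 16,
    "incidents": 250, "population": 50_000,
}

def derive(rec):
    """Create new values from existing ones without altering the original record."""
    return {
        "percentage": 100 * rec["part"] / rec["whole"],
        "duration_days": (rec["end_date"] - rec["start_date"]).days,
        # IF Age < 18 THEN status = "Child", ELSE status = "Adult"
        "status": "Child" if rec["age"] < 18 else "Adult",
        "per_100k": rec["incidents"] * 100_000 / rec["population"],
    }

derived = derive(record)
print(derived)
```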

Transform to consolidate: This involves bringing in additional data to help expand (more variables) or append (more records) to enhance the editorial and representation potential of your project. An example of a need to expand your data would be if you had details about locations only at country level but you wanted to be able to group and aggregate your analysis at continent level. You could gather a dataset that holds values showing the relationships between country and continent and then add a new variable to your dataset against which you would perform a simple lookup operation to fill in the associated continent values.
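The country-to-continent lookup described above is, in spreadsheet terms, a simple lookup formula. A minimal Python equivalent is sketched below; the lookup table, the sample rows and the ‘Unknown’ fallback label are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical lookup dataset relating country to continent.
COUNTRY_TO_CONTINENT = {"France": "Europe", "Germany": "Europe", "Japan": "Asia"}

country_rows = [
    {"country": "France", "value": 10},
    {"country": "Germany", "value": 5},
    {"country": "Japan", "value": 7},
    {"country": "Atlantis", "value": 1},  # no match in the lookup
]

def add_continent(rows, lookup):
    """Expand each row with a 'continent' variable; unmatched countries are labelled 'Unknown'."""
    return [{**row, "continent": lookup.get(row["country"], "Unknown")} for row in rows]

expanded = add_continent(country_rows, COUNTRY_TO_CONTINENT)

# Aggregate the analysis at continent level.
totals = defaultdict(int)
for row in expanded:
    totals[row["continent"]] += row["value"]

print(dict(totals))
```

Labelling unmatched countries explicitly, rather than dropping them, keeps gaps in the lookup visible so they can be fixed.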

Consolidating by appending data might occur if you had previously acquired a dataset that now had more or newer data (specifically, additional records) available to bring it up to date. For instance, you might have started some analysis on music record sales up to a certain point in time, but once you’d actually started working on the task another week had elapsed and more data had become available.

Additionally, you may start to think about sourcing other media assets to enhance your presentation options, beyond just gathering extra data. You might anticipate the potential value for gathering photos (headshots of the people in your data), icons/symbols (country flags), links to articles (URLs), or videos (clips of goals scored). All of these would contribute to broadening the scope of your annotation options. Even though there is a while yet until we reach that particular layer of design thinking, it is useful to start contemplating this as early as possible in case the collection of these additional assets requires significant time and effort. It might also reveal any obstacles around having to obtain permissions for usage or sufficiently high quality media. If you know you are going to have to do something, don’t leave it too late – reduce the possibility of such stresses by acting early.