Analyzing and visualizing Data - Overview with 12 slides
School of Computer & Information Sciences
ITS530 Analyzing and Visualizing Data Chapter 4 Working with Data
9/9/18
ITS530 Chapter 04
1
Data Assets and Tabulation Types
Two main categories
Data that exist in tables; Datasets
Data that exist as isolated values
Data Types
Levels of data or scales of measurement
Type of exploratory data analysis you can undertake
Editorial thinking you establish
Specific chart types you might use
Color choices and layout decisions around composition
Normalized vs Tabulated Data
Normalized
Data presented in a table with different fields
Tabulated
The data is aggregated
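As a minimal sketch of the difference, the snippet below aggregates a few normalized rows into a tabulated count. The cities and forecast values are hypothetical and only illustrate the two shapes.

```python
from collections import Counter

# Hypothetical normalized data: one row per observation, one field per column.
normalized = [
    {"city": "Leeds", "forecast": "Hot"},
    {"city": "Leeds", "forecast": "Mild"},
    {"city": "York", "forecast": "Hot"},
    {"city": "York", "forecast": "Hot"},
]

# Tabulated form: the rows are aggregated into counts per (city, forecast).
tabulated = Counter((row["city"], row["forecast"]) for row in normalized)

for (city, forecast), count in sorted(tabulated.items()):
    print(f"{city:6} {forecast:5} {count}")
```

The normalized rows can always be re-aggregated in a different way; the tabulated counts cannot be turned back into individual rows.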
Data Assets and Tabulation Types cont.
Textual (Qualitative)
Unstructured streams of words
Descriptive details of a weather forecast for a given city
The full title of an academic research project
The description of a product on Amazon
Nominal (Qualitative)
Ordinal (Qualitative)
Ordinal data is still categorical and qualitative in nature
Characteristics of order
The response to a survey question: based on a scale of 1 (unhappy) to 5 (very happy)
The general weather forecast: expressed as Very Hot, Hot, Mild, Cold, Freezing
Graphic Language: The Curse of the CEO
https://www.bloomberg.com/graphics/infographics/graphic-language-the-curse-of-the-ceo.html
Data Assets and Tabulation Types cont.
Interval (Quantitative)
Interval data is the less common form of quantitative data
Quantitative and numeric measurement
Measure for temperature
Ratio (Quantitative)
Most common quantitative variable
Age of a survey participant in years
Forecasted amount of rainfall in millimetres
Unlike interval data, ratio data has a meaningful zero: a value of zero means none of the quantity is present
Temporal Data
Time-based data
Textual: ‘Four o’clock in the afternoon on Monday, 12 March 2016’
Ordinal: ‘PM’, ‘Afternoon’, ‘March’, ‘Q1’
Interval: ‘12’, ‘12/03/2016’, ‘2016’
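A rough sketch of how one timestamp can yield several data types, using Python's standard `datetime` module. The variable names are illustrative assumptions, and the month name depends on the system locale.

```python
from datetime import datetime

# The moment used on the slide: four o'clock in the afternoon, 12 March 2016.
moment = datetime(2016, 3, 12, 16, 0)

# Ordinal values: ordered categories derived from the timestamp.
month_name = moment.strftime("%B")               # locale-dependent month name
quarter = f"Q{(moment.month - 1) // 3 + 1}"      # calendar quarter
half_of_day = "PM" if moment.hour >= 12 else "AM"

# Interval-style numeric values taken from the same timestamp.
day_of_month = moment.day
numeric_date = moment.strftime("%d/%m/%Y")
year = moment.year

print(quarter, half_of_day, day_of_month, numeric_date, year)
```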
Data Assets and Tabulation Types cont.
Discrete
No ‘in-between’ state
Days of the week
Heads or tails for a coin toss
1,2,3,4,5,6,etc.
Continuous
Has in-between state
Height and weight
Temperature
Time
1.1,1.2,1.3,1.4,1.5,etc.
Data Acquisition
What data do you need and why?
From where, how, and by whom will the data be acquired?
When can you obtain it?
Curated by You
Primary data collection
Manual collection and data foraging
Extracted from pdf files
Web scraping (also known as web harvesting)
Curated by Others
Issued to you
Download from the Web
System report or export
Third-party services
API
Great R packages for data import, wrangling & visualization
Data Examination
Data Properties
Data types
Size
Condition
Missing values
Erroneous values
Inconsistencies
Duplicate records
Out of date
Uncommon system characters or line breaks
Leading or trailing spaces
How to Approach This?
Inspect and scan
Data operations
Statistical methods
Frequency counts
Frequency distribution
Measurements of central tendency
Measurements of spread
Maximum, minimum and range
Percentiles
Standard deviation
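The checks above could be sketched as follows with the Python standard library only. The records are hypothetical, chosen so that each condition problem appears once.

```python
import statistics
from collections import Counter

# Hypothetical raw records: (city, rainfall in mm). None marks a missing value.
records = [
    ("Leeds", 12.0),
    ("York ", 8.5),    # trailing space
    ("Leeds", 12.0),   # duplicate record
    ("Hull", None),    # missing value
    ("Bath", 3.5),
    ("Bath", 20.0),
]

# Condition checks: missing values, stray spaces, duplicate records.
missing = [r for r in records if r[1] is None]
untrimmed = [r for r in records if r[0] != r[0].strip()]
duplicates = [r for r, n in Counter(records).items() if n > 1]

# Statistical methods on the values that are present.
values = [v for _, v in records if v is not None]
summary = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "range": max(values) - min(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),               # sample standard deviation
    "quartiles": statistics.quantiles(values, n=4),  # 25th, 50th, 75th percentiles
}

# Frequency counts per city, after trimming spaces.
frequency = Counter(city.strip() for city, _ in records)

print(missing, untrimmed, duplicates)
print(summary)
print(frequency)
```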
Influence on Process
Moving forward
Purpose map ‘tone’
Editorial angles
Physical properties influence scale
Potential Activities
Transform to clean
Transform to convert
Transform to create
Transform to consolidate
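One possible illustration of the four transformation activities on a tiny data set. The rows, field names and region lookup are hypothetical.

```python
# Hypothetical raw rows, as they might arrive from a system export.
raw = [
    {"city": " leeds ", "temp_f": "68"},
    {"city": "YORK", "temp_f": "50"},
]

# Hypothetical second source to consolidate with.
regions = {"Leeds": "West Yorkshire", "York": "North Yorkshire"}

prepared = []
for row in raw:
    city = row["city"].strip().title()         # clean: trim spaces, fix casing
    temp_f = float(row["temp_f"])              # convert: text to number
    temp_c = round((temp_f - 32) * 5 / 9, 1)   # create: derive a new field
    region = regions.get(city, "Unknown")      # consolidate: add a second source
    prepared.append(
        {"city": city, "temp_f": temp_f, "temp_c": temp_c, "region": region}
    )

for row in prepared:
    print(row)
```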
Data Exploration
Exploratory Data Analysis
Instinct of the analyst
Reasoning
Deductive
Inductive
Chart types
Research
Statistical methods
Nothings
Not always needed
Data Exploration
What Good Marathons and Bad Investments Have in Common, by Justin Wolfers, April 22, 2014, The New York Times
Case Study 1
Background:
Globally, 529,000 maternal deaths occur every year, 99% of them in developing countries
ANC (Antenatal Care) monitors the health of pregnant women and aims to reduce maternal deaths
Intermittent preventive therapy or intermittent preventive treatment (IPT) is a public health intervention aimed at treating and preventing malaria episodes in infants (IPTi), children (IPTc), schoolchildren (IPTsc) and pregnant women (IPTp).
Question:
Are ANC clinics in country X reaching their coverage targets for IPTp?
Data Source:
Routine health information
Speaker notes
Now we are going to consider how we could answer the following question:
Are ANC clinics in country X reaching their coverage targets for IPTp? We will answer this question using routine health information.
| Code | Variables |
| 1. | New ANC clients |
| 2. | Group pre-test counseled |
| 3. | Individual pre-test counseled |
| 4. | Accepted HIV test |
| 5A. | HIV test result - Positive |
| 5B. | HIV test result - Negative |
| 5C. | HIV test result - Indeterminate |
| 6A. | Post-test counseled - Positive |
| 6B. | Post-test counseled - Negative |
| 8A. | ARV therapy received - Current NVP |
| 9. | IPTp-1 |
| 10. | IPTp-2 |
Data Source
General ANC Registers
Which of these variables are relevant to answer your question?
Which elements will be included in your numerator and which in your denominator?
Answers:
1) New ANC clients, IPTp-1 and IPTp-2
2) Denominator: New ANC clients
Numerators: IPTp-1 and IPTp-2 (one indicator each)
Speaker notes
Which of these variables are relevant to answer your question? We’re going to focus on elements 1, 9 and 10. Which elements will be included in your numerator and which in your denominator?
IPTp Coverage-Facility Performance
| Code | Variables | Facility 1 | Facility 2 | Facility 3 | Facility 4 | Facility 5 |
| 9. | IPTp-1 | 536 | 1435 | 39 | 969 | 862 |
| 10. | IPTp-2 | 372 | 542 | 38 | 452 | 780 |
Number of ANC clients receiving IPTp
Question:
Among the five facilities, which one performed best?
Answer:
Cannot tell because we don’t know the denominators
Speaker notes
Here we have the data on IPTp-1 and IPTp-2 to assess facility performance. Among the five facilities, which one performed best?
IPTp Coverage-Facility Performance
| Code | Variables | Facility 1 | Facility 2 | Facility 3 | Facility 4 | Facility 5 |
| 1 | New ANC Clients | 744 | 2708 | 105 | 1077 | 908 |
| 9. | IPTp-1 | 536 | 1435 | 39 | 969 | 862 |
| 10. | IPTp-2 | 372 | 542 | 38 | 452 | 780 |
Number of ANC clients receiving IPTp
Question: Now that you have the denominators, which of these facilities performed best?
| Indicator | Facility 1 | Facility 2 | Facility 3 | Facility 4 | Facility 5 |
| % of new ANC clients who received IPTp-1 in the past year | 72% | 53% | 37% | 90% | 95% |
| % of new ANC clients who received IPTp-2 in the past year | 50% | 20% | 36% | 42% | 86% |
Response: Facility 5
Speaker notes
Now that you have the denominators, which of these facilities performed best? We can see that it was actually Facility 5.
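The coverage percentages can be reproduced from the counts in the table. This is a small sketch: the counts are the ones shown on the slide, while the function and variable names are illustrative.

```python
# Counts per facility (Facility 1 to Facility 5), as shown in the table.
new_anc_clients = [744, 2708, 105, 1077, 908]   # denominator
iptp1 = [536, 1435, 39, 969, 862]               # numerator for IPTp-1
iptp2 = [372, 542, 38, 452, 780]                # numerator for IPTp-2


def coverage(numerators, denominators):
    """Return coverage as whole percentages, one per facility."""
    return [round(100 * n / d) for n, d in zip(numerators, denominators)]


iptp1_pct = coverage(iptp1, new_anc_clients)
iptp2_pct = coverage(iptp2, new_anc_clients)

# Facility with the highest IPTp-2 coverage (facilities are numbered from 1).
best_facility = iptp2_pct.index(max(iptp2_pct)) + 1

print(iptp1_pct)
print(iptp2_pct)
print("Best IPTp-2 coverage: Facility", best_facility)
```

Facility 2 has the largest raw counts but low coverage, which is why the raw counts alone could not answer the question.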
Are facilities reaching coverage targets?
Target-80%
* National coverage target for pregnant women receiving IPTp-2 is 80%.
Speaker notes
Here is the same information presented as a chart. We need to use this information to determine, or interpret, whether or not facilities are reaching their coverage targets. Let’s assume that the national coverage target for pregnant women receiving IPTp-2 is 80%. Are the facilities reaching the coverage target? What else can we interpret from this information?
Possible answers
Facility 1 needs to do a better job following up and increase IPTp coverage a bit.
Facility 2 does a better job with IPTp-1 coverage than IPTp-2, but needs to increase coverage of both.
Facility 3 does a good job administering IPTp-2 to patients that receive the first round, but they need to increase initial coverage and maintain follow-up.
Facility 4 does a good job with IPTp-1 coverage, but this falls off with IPTp-2. Is this loss to follow-up, or are they not administering IPTp-2 when patients return?
Facility 5 can be seen as a model and we could investigate their best practices for use in other programs.
This information does not tell you why coverage is at these levels. You would have to investigate further, but you can see which facilities you need to work with.
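The readings above can be checked with a short sketch that compares each facility's IPTp-2 coverage with the 80% target and measures the drop-off between the first and second dose. The percentages come from the coverage table; the names are illustrative.

```python
TARGET = 80  # national coverage target for IPTp-2, in percent

# Facility number -> (IPTp-1 %, IPTp-2 %), from the coverage table.
coverage = {1: (72, 50), 2: (53, 20), 3: (37, 36), 4: (90, 42), 5: (95, 86)}

# Facilities whose IPTp-2 coverage reaches the target.
meets_target = [f for f, (_, second) in coverage.items() if second >= TARGET]

# Percentage-point drop between IPTp-1 and IPTp-2, a rough follow-up indicator.
drop_off = {f: first - second for f, (first, second) in coverage.items()}

print("Reaching target:", meets_target)
print("Drop-off (points):", drop_off)
```

A small drop-off with low coverage (Facility 3) and a large drop-off with high initial coverage (Facility 4) point to different problems, which matches the interpretations listed above.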
Additional Questions
Which facility is performing better/worse than expected?
What is the trend over time for these facilities?
How would you assess each facility’s performance based on the data?
What other data or information should you consider in providing recommendations or guidance to the facilities?
Speaker notes
Here are some other questions that we might want to ask to help interpret this information and identify how to improve performance.
Case Study 2
Olympic Games Analysis 1896 - 2012
The Olympic Games have been the world's biggest sporting event for over 100 years.
Here we have a dataset for the Olympic Games. We have information about all the athletes who have won medals for every Olympic Games since the inaugural games of 1896.
Here are the columns in the dataset
Year
Host Venue
Sport
Discipline
Athlete
Country
Gender
Event
Medal
https://www.kaggle.com/arpitsolanki14/olympic-games-data-analysis
Top Overall Countries
The countries that win the most medals at the Olympic Games are the developed countries in North America, Asia and Europe.
Poorer countries in Africa and South America do not win many medals.
To keep the visualization readable, we will filter the data down to the top 5 countries overall.
We will identify the top 10 countries with the highest number of medals and then plot various charts to better understand their performance over the years.
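A minimal sketch of this filtering step, using only the standard library. The rows and country codes are hypothetical placeholders (only the column names come from the dataset description), and `TOP_N` is an assumed parameter that would be set to 5 or 10 on the real data.

```python
from collections import Counter

# Hypothetical sample rows; only the column names come from the slide.
rows = [
    {"Year": 2008, "Country": "AAA", "Medal": "Gold"},
    {"Year": 2008, "Country": "BBB", "Medal": "Silver"},
    {"Year": 2008, "Country": "AAA", "Medal": "Bronze"},
    {"Year": 2012, "Country": "AAA", "Medal": "Gold"},
    {"Year": 2012, "Country": "BBB", "Medal": "Gold"},
    {"Year": 2012, "Country": "CCC", "Medal": "Silver"},
]

TOP_N = 2  # assumed cut-off for this tiny sample

# Total medals per country, then the TOP_N countries by medal count.
totals = Counter(row["Country"] for row in rows)
top_countries = [country for country, _ in totals.most_common(TOP_N)]

# Keep only rows for the top countries, then count medals per country and year.
top_rows = [row for row in rows if row["Country"] in top_countries]
per_year = Counter((row["Country"], row["Year"]) for row in top_rows)

print(top_countries)
for (country, year), count in sorted(per_year.items()):
    print(country, year, count)
```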
Performance of Top Countries over the years
Highest number of medals overall - by Country
The USA has the highest overall number of medals, with a total above 4,000.
Even though the Soviet Union dissolved in 1991, it is still the country with the second highest number of overall medals.
Total number of Medals per Sport
As can be seen in the Top athletes charts, the sports of Aquatics, Athletics, Rowing and Gymnastics offer the highest number of medals. It is therefore no surprise that these sports produce the most successful athletes in terms of number of medals won.
Questions?
[Chart: Percent of ANC Clients Receiving IPTp in Select Facilities. Series: IPTp-1, IPTp-2. Axes: Percent (0 to 100) and Facility (1 to 5).]