Analyzing and visualizing Data - Overview with 12 slides

School of Computer & Information Sciences

ITS530 Analyzing and Visualizing Data Chapter 4 Working with Data

9/9/18

ITS530 Chapter 04

1

Data Assets and Tabulation Types

Two main categories

Data that exist in tables; Datasets

Data that exist as isolated values

Data Types

Levels of data or scales of measurement

Type of exploratory data analysis you can undertake

Editorial thinking you establish

Specific chart types you might use

Color choices and layout decisions around composition

Normalized vs Tabulated Data

Normalized

Data presented in a table with different fields

Tabulated

The data is aggregated
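
The difference between the two forms can be sketched in a few lines of Python. This is only an illustration: the rows, the field names `country` and `medal`, and the helper `tabulate` are made up for the example.

```python
from collections import defaultdict

# Hypothetical normalized dataset: one row per record, one field per column.
normalized = [
    {"country": "A", "medal": "Gold"},
    {"country": "A", "medal": "Silver"},
    {"country": "A", "medal": "Gold"},
    {"country": "B", "medal": "Bronze"},
    {"country": "B", "medal": "Gold"},
]

def tabulate(rows, row_key, col_key):
    """Aggregate normalized rows into a cross-tabulation of counts."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_key]][row[col_key]] += 1
    # Convert to plain dicts so the result prints cleanly.
    return {r: dict(cols) for r, cols in table.items()}

# Tabulated form: the individual records are gone, only the counts remain.
tabulated = tabulate(normalized, "country", "medal")
print(tabulated)
```

The normalized form keeps every record and can always be aggregated later; the tabulated form cannot be turned back into individual records.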

Data Assets and Tabulation Types cont.

Textual (Qualitative)

Unstructured streams of words

Descriptive details of a weather forecast for a given city

The full title of an academic research project

The description of a product on Amazon

Nominal (Qualitative)

Ordinal (Qualitative)

Ordinal data is still categorical and qualitative in nature

Characteristics of order

The response to a survey question: based on a scale of 1 (unhappy) to 5 (very happy)

The general weather forecast: expressed as Very Hot, Hot, Mild, Cold, Freezing

Graphic Language: The Curse of the CEO

https://www.bloomberg.com/graphics/infographics/graphic-language-the-curse-of-the-ceo.html

Data Assets and Tabulation Types cont.

Interval (Quantitative)

Interval data is the less common form of quantitative data

Quantitative and numeric measurement

Measure for temperature

Ratio (Quantitative)

Most common quantitative variable

Age of a survey participant in years

Forecasted amount of rainfall in millimetres

Unlike interval data, ratio data has a true zero: a value of zero means none of the quantity is present

Temporal Data

Time-based data

Textual: ‘Four o’clock in the afternoon on Monday, 12 March 2016’

Ordinal: ‘PM’, ‘Afternoon’, ‘March’, ‘Q1’

Interval: ‘12’, ‘12/03/2016’, ‘2016’
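
As a rough sketch, the ordinal and interval forms listed above can be derived from a single stored timestamp. The function name `temporal_forms` and the dictionary keys are assumptions made for this example, not part of the course material.

```python
from datetime import datetime

# The date and time from the slide's example, stored as one timestamp.
moment = datetime(2016, 3, 12, 16, 0)

def temporal_forms(ts):
    """Derive ordinal and interval representations from a timestamp."""
    quarter = (ts.month - 1) // 3 + 1
    return {
        # Ordinal forms: ordered categories
        "period": "PM" if ts.hour >= 12 else "AM",
        "quarter": f"Q{quarter}",
        # Interval forms: numeric positions on a calendar scale
        "day": ts.day,
        "date": ts.strftime("%d/%m/%Y"),
        "year": ts.year,
    }

forms = temporal_forms(moment)
print(forms)
```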

Data Assets and Tabulation Types cont.

Discrete

No ‘in-between’ state

Days of the week

Heads or tails for a coin toss

1,2,3,4,5,6,etc.

Continuous

Has in-between state

Height and weight

Temperature

Time

1.1,1.2,1.3,1.4,1.5,etc.

Data Acquisition

What data do you need and why?

From where, how, and by whom will the data be acquired?

When can you obtain it?

Curated by You

Primary data collection

Manual collection and data foraging

Extracted from pdf files

Web scraping (also known as web harvesting)

Curated by Others

Issued to you

Download from the Web

System report or export

Third-party services

API

https://www.computerworld.com/article/2921176/business-intelligence/great-r-packages-for-data-import-wrangling-visualization.html

Great R packages for data import, wrangling & visualization

Data Examination

Data Properties

Data types

Size

Condition

Missing values

Erroneous values

Inconsistencies

Duplicate records

Out of date

Uncommon system characters or line breaks

Leading or trailing spaces
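
Several of these condition checks are easy to automate. The sketch below counts missing values, values with leading or trailing spaces, and duplicate records in a column of raw text; the sample values and the helper name `examine_condition` are hypothetical.

```python
def examine_condition(values):
    """Count common quality problems in a list of raw string values."""
    def is_missing(v):
        return v is None or v.strip() == ""

    missing = sum(1 for v in values if is_missing(v))
    # Values that carry leading or trailing spaces
    padded = sum(1 for v in values if not is_missing(v) and v != v.strip())
    # Duplicates are counted after trimming, so " x" and "x" match
    present = [v.strip() for v in values if not is_missing(v)]
    duplicates = len(present) - len(set(present))
    return {"missing": missing, "padded": padded, "duplicates": duplicates}

# Hypothetical raw column, as it might arrive from a system export
raw = ["Facility 1", " Facility 2", "Facility 2", "", None, "Facility 3 "]
report = examine_condition(raw)
print(report)
```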

How to Approach This?

Inspect and scan

Data operations

Statistical methods

Frequency counts

Frequency distribution

Measurements of central tendency

Measurements of spread

Maximum, minimum and range

Percentiles

Standard deviation
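
The statistical methods listed above are all available in Python's standard library. The following is a minimal sketch on made-up values (hypothetical daily rainfall readings in millimetres); `summarize` is a name chosen for this example.

```python
import statistics
from collections import Counter

# Hypothetical daily rainfall readings, in millimetres
readings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

def summarize(values):
    """Return the descriptive statistics named on the slide."""
    return {
        "frequency": dict(Counter(values)),      # frequency counts
        "mean": statistics.mean(values),         # central tendency
        "median": statistics.median(values),
        "minimum": min(values),                  # spread
        "maximum": max(values),
        "range": max(values) - min(values),
        "quartiles": statistics.quantiles(values, n=4),  # 25th, 50th, 75th percentiles
        "stdev": statistics.stdev(values),       # sample standard deviation
    }

summary = summarize(readings)
for name, value in summary.items():
    print(name, value)
```

`statistics.quantiles` requires Python 3.8 or later.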

Influence on Process

Moving forward

Purpose map ‘tone’

Editorial angles

Physical properties influence scale

Potential Activities

Transform to clean

Transform to convert

Transform to create

Transform to consolidate
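
One way to picture the four kinds of transformation is to apply each of them to a single record. In this sketch the counts for Facility 1 and Facility 2 come from the case study later in the deck, while the stray spaces, the `regions` lookup table, and the field names are assumptions added for illustration.

```python
# Raw export: every value arrives as text, some with stray spaces.
raw_rows = [
    {"facility": " Facility 1 ", "new_anc": "744", "iptp1": "536"},
    {"facility": "Facility 2", "new_anc": "2708", "iptp1": "1435"},
]

# Hypothetical second source to consolidate with
regions = {"Facility 1": "North", "Facility 2": "South"}

def transform(row, lookup):
    """Apply the four transformation types to one raw record."""
    facility = row["facility"].strip()          # clean: remove stray spaces
    new_anc = int(row["new_anc"])               # convert: text to number
    iptp1 = int(row["iptp1"])
    not_reached = new_anc - iptp1               # create: a new derived field
    region = lookup.get(facility, "Unknown")    # consolidate: add another source
    return {
        "facility": facility,
        "new_anc": new_anc,
        "iptp1": iptp1,
        "not_reached": not_reached,
        "region": region,
    }

prepared = [transform(row, regions) for row in raw_rows]
print(prepared)
```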

Data Exploration

Exploratory Data Analysis

Instinct of the analyst

Reasoning

Deductive

Inductive

Chart types

Research

Statistical methods

Nothings

Not always needed

Data Exploration

"What Good Marathons and Bad Investments Have in Common", by Justin Wolfers, The New York Times, April 22, 2014

Case Study 1

Background:

Globally, 529,000 maternal deaths occur every year, 99% of them in developing countries

ANC (antenatal care) monitors the health of pregnant women and aims to reduce maternal deaths

Intermittent preventive therapy or intermittent preventive treatment (IPT) is a public health intervention aimed at treating and preventing malaria episodes in infants (IPTi), children (IPTc), schoolchildren (IPTsc) and pregnant women (IPTp).

Question:

Are ANC clinics in country X reaching their coverage targets for IPTp?

Data Source:

Routine health information

Speaker notes

Now we are going to consider how we could answer the following question: Are ANC clinics in country X reaching their coverage targets for IPTp? We will answer this question using routine health information.

Code  Variable
1     New ANC clients
2     Group pre-test counseled
3     Individual pre-test counseled
4     Accepted HIV test
5A    HIV test result - Positive
5B    HIV test result - Negative
5C    HIV test result - Indeterminate
6A    Post-test counseled - Positive
6B    Post-test counseled - Negative
8A    ARV therapy received - Current NVP
9     IPTp-1
10    IPTp-2

Data Source

General ANC Registers

Which of these variables are relevant to answer your question?

Which elements will be included in your numerator and which in your denominator?

Answers:

1) New ANC clients, IPTp-1 and IPTp-2

2) Denominator: New ANC clients

Numerators: IPTp-1 and IPTp-2 (one indicator each)

Speaker notes

Which of these variables are relevant to answer your question? We’re going to focus on elements 1, 9 and 10. Which elements will be included in your numerator and which in your denominator?

IPTp Coverage-Facility Performance

Code  Variable  Facility 1  Facility 2  Facility 3  Facility 4  Facility 5
9     IPTp-1    536         1435        39          969         862
10    IPTp-2    372         542         38          452         780

Number of ANC clients receiving IPTp

Question:

Among the five facilities, which one performed best?

Answer:

Cannot tell because we don’t know the denominators

Speaker notes

Here we have the data on IPTp-1 and IPTp-2 to assess facility performance. Among the five facilities, which one performed best?

IPTp Coverage-Facility Performance

Code  Variable         Facility 1  Facility 2  Facility 3  Facility 4  Facility 5
1     New ANC clients  744         2708        105         1077        908
9     IPTp-1           536         1435        39          969         862
10    IPTp-2           372         542         38          452         780

Number of ANC clients receiving IPTp

Question: Now that you have the denominators, which of these facilities performed best?

Indicator                                                 Facility 1  Facility 2  Facility 3  Facility 4  Facility 5
% of new ANC clients who receive IPTp-1 in the past year  72%         53%         37%         90%         95%
% of new ANC clients who receive IPTp-2 in the past year  50%         20%         36%         42%         86%

Response: Facility 5
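
The percentages in the indicator table follow directly from the counts: each numerator (IPTp-1 or IPTp-2) is divided by the denominator (new ANC clients). A short sketch that reproduces them is given below; the counts are taken from the table above, while the variable and function names are chosen for this example.

```python
# Counts from the facility table, in order Facility 1 to Facility 5
new_anc = [744, 2708, 105, 1077, 908]   # denominator
iptp1 = [536, 1435, 39, 969, 862]       # numerator for the IPTp-1 indicator
iptp2 = [372, 542, 38, 452, 780]        # numerator for the IPTp-2 indicator

def coverage(numerators, denominators):
    """Return coverage per facility as a whole-number percentage."""
    return [round(100 * n / d) for n, d in zip(numerators, denominators)]

iptp1_pct = coverage(iptp1, new_anc)
iptp2_pct = coverage(iptp2, new_anc)

for i, (p1, p2) in enumerate(zip(iptp1_pct, iptp2_pct), start=1):
    print(f"Facility {i}: IPTp-1 {p1}%, IPTp-2 {p2}%")
```

Facility 2 has the largest raw counts but only middling coverage, which is why the counts alone could not answer the question.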

Speaker notes

Now that you have the denominators, which of these facilities performed best? We can see that it was actually Facility 5.

Are facilities reaching coverage targets?

Target-80%

* National coverage target for pregnant women receiving IPTp-2 is 80%.
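
Checking the coverage figures against the target can be sketched as follows. The percentages are the ones from the indicator table; treating "reaching the target" as coverage of at least 80% is an assumption, as are the helper names.

```python
TARGET = 80  # national IPTp-2 coverage target, in percent

# Coverage percentages from the indicator table
coverage = {
    "Facility 1": {"IPTp-1": 72, "IPTp-2": 50},
    "Facility 2": {"IPTp-1": 53, "IPTp-2": 20},
    "Facility 3": {"IPTp-1": 37, "IPTp-2": 36},
    "Facility 4": {"IPTp-1": 90, "IPTp-2": 42},
    "Facility 5": {"IPTp-1": 95, "IPTp-2": 86},
}

def reaches_target(table, indicator, target):
    """List the facilities whose coverage is at or above the target."""
    return [name for name, pct in table.items() if pct[indicator] >= target]

def drop_off(table):
    """Percentage-point fall between the first and second dose."""
    return {name: pct["IPTp-1"] - pct["IPTp-2"] for name, pct in table.items()}

on_target = reaches_target(coverage, "IPTp-2", TARGET)
gaps = drop_off(coverage)
print(on_target)
print(gaps)
```

The drop-off between doses is a useful second reading of the same data: a large gap points to a follow-up problem rather than an initial coverage problem.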

Speaker notes

Here is the same information presented as a chart. We need to use this information to determine, or interpret, whether or not facilities are reaching their coverage targets. Let’s assume that the national coverage target for pregnant women receiving IPTp is 80%. Are the facilities reaching the coverage target? What else can we interpret from this information?

Possible answers

Facility 1 needs to do a better job following up and increase IPTp coverage a bit.

Facility 2 does a better job with IPTp-1 coverage than IPTp-2, but needs to increase coverage of both.

Facility 3 does a good job administering IPTp-2 to patients that receive the first round, but they need to increase initial coverage and maintain follow-up.

Facility 4 does a good job with IPTp-1 coverage, but this falls off with IPTp-2. Is this loss to follow-up, or are they not administering IPTp-2 when patients return?

Facility 5 can be seen as a model, and we could investigate their best practices for use in other programs.

This information does not tell you why coverage is at these levels. You would have to investigate further, but you can see which facilities you need to work with.

Additional Questions

Which facility is performing better/worse than expected?

What is the trend over time for these facilities?

How would you assess each facility’s performance based on the data?

What other data or information should you consider in providing recommendations or guidance to the facilities?

Speaker notes

Here are some other questions that we might want to ask to help interpret this information and identify how to improve performance.

Case Study 2

Olympic Games Analysis 1896 - 2012

The Olympic Games have been the world's biggest sporting event for over 100 years.

Here we have a dataset for the Olympic Games. We have information about all the athletes who have won medals for every Olympic Games since the inaugural games of 1896.

Here are the columns in the dataset

Year

Host Venue

Sport

Discipline

Athlete

Country

Gender

Event

Medal

https://www.kaggle.com/arpitsolanki14/olympic-games-data-analysis

Top Overall Countries

The countries that win the most medals at the Olympic Games are the developed countries in North America, Asia and Europe.

Poorer countries in Africa and South America do not win many medals.

To enable effective visualization of data, we will filter out the data for the top 5 countries overall.

We will identify the top 10 countries with the highest number of medals and then plot various charts to better understand their performance over the years.
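
The filtering step described above can be sketched with a simple counter. The rows below are invented placeholders, not real Olympic results; only the column names `Country` and `Medal` come from the dataset description, and `top_countries` is a name chosen for this example.

```python
from collections import Counter

# Placeholder rows: one row per medal won, as in the dataset's layout
medals = [
    {"Country": "Country A", "Medal": "Gold"},
    {"Country": "Country A", "Medal": "Silver"},
    {"Country": "Country A", "Medal": "Bronze"},
    {"Country": "Country B", "Medal": "Gold"},
    {"Country": "Country B", "Medal": "Gold"},
    {"Country": "Country C", "Medal": "Bronze"},
]

def top_countries(rows, n):
    """Return the n countries with the most medals as (country, count) pairs."""
    return Counter(row["Country"] for row in rows).most_common(n)

# Identify the leading countries, then keep only their rows for plotting
leaders = top_countries(medals, 2)
leader_names = {country for country, _ in leaders}
filtered = [row for row in medals if row["Country"] in leader_names]
print(leaders)
```

With the full dataset, `n` would be set to the number of countries to be charted.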

Performance of Top Countries over the years

Highest number of medals overall - by Country

The USA has the highest overall number of medals, with a total above 4,000.

Even though the Soviet Union dissolved in 1991, it is still the country with the second-highest number of overall medals.

Total number of Medals per Sport

As can be seen in the top athletes charts, the sports of Aquatics, Athletics, Rowing and Gymnastics offer the highest number of medals. It is therefore no surprise that these sports produce the most successful athletes in terms of number of medals won.

Questions?

Figure: Percent of ANC Clients Receiving IPTp in Select Facilities (x-axis: Facility, 1 to 5; y-axis: Percent, 0 to 100; series: IPTp-1, IPTp-2)
