Decision Science Assignment
Decision Science
INTERNAL ASSIGNMENT- JUNE 2020
UNDER THE GUIDANCE OF
Prof. Dr. N. Palaniappan
SUBMITTED BY:
DIKSHA GUPTA
M.B.A. 2ND YEAR
DATE OF SUBMISSION:
June 23, 2020.
1. Identify the types of the variable.
In Decision Science or Statistics, a ‘variable’ is an attribute of an object that is to be studied. Choosing which attributes to measure is an important part of designing a good experiment.
Types of data: Quantitative and Categorical variables
Data contains a specific measurement of the set of variables. These variables are generally divided
into 2 types:
● Quantitative variables:
Quantitative data represents quantities or amounts. The recorded numbers support arithmetic operations such as addition, subtraction and division.
● Categorical variables:
Categorical data represents groups.
*Sometimes a categorical variable can also be treated as a quantitative variable, so a variable may fall under more than one type. If the scale is numeric, as with material quality or survey ratings, and fractional values are allowed, the scale can be treated as continuous quantitative data.
Sr. No. | Variable | Data Type | Values
a | Gender | Nominal Categorical variable | Male, Female and Others.
b | Educational background | Ordinal Categorical variable | None (0), Metric pass (12), Graduate (15), Post-graduate (17), Doctorate (21) etc.
c | Satisfaction | Ordinal Categorical variable | Low, Medium, High, or measured on a 1 to 5 level scale.
d | Motivation | Ordinal Categorical variable | Low, Medium, High, or measured on a 1 to 5 level scale.
e | Exchange rate | Continuous Quantitative variable | Any real number value such as 0.5, 78 etc.
f | Gold price | Continuous Quantitative variable | Any real number value such as 0.5, 78 etc.
g | Preference of cars | Nominal Categorical variable | SUV, Sedan or Race car.
h | Teacher’s feedback | Ordinal Categorical variable | Bad (1), Good (2), Extraordinary (3) etc.
i | Grades in Post-Graduation | Ordinal Categorical variable | Grade O, Grade A, Grade B, Grade F etc.
j | Marital status | Nominal Categorical variable | Married, Unmarried, Divorced, Widow/Widower.
k | Quality of services | Ordinal Categorical variable | Bad (1), Good (2), Extraordinary (3) etc.
l | Age group | Ordinal Categorical variable | Young (0-12), Teenage (13-19), Adult (20-50), Senior (beyond 50).
m | GDP | Continuous Quantitative variable | Any real number such as $3000 billion.
n | Interest rate | Continuous Quantitative variable | Any real number such as 10%, 5.8%, -1.3% etc.
o | Twitter comments | Nominal Categorical variable | Exclamatory, Liked, Positive, Disliked, Joyful etc.
p | Facebook pictures | Nominal Categorical variable or Discrete Quantitative variable | No ordered ranking among the images, OR images with pixel values from 0 to 255 (integers).
a. Gender: (Nominal)
When gender has just two categories, Male and Female, it can be treated as a binary variable (‘binary’ means relating to two). It can also be coded numerically by writing Male as 1 and Female as 0, or vice versa; the codes are only labels and carry no order.
When gender has more than two categories, such as Male, Female and Others, it is treated as a nominal variable. There is no ordered ranking among the three categories.
b. Educational background: (Ordinal)
Educational background can be recorded as the highest level of education a person has completed. These categories can be ranked in order, from the lowest level to the highest.
c. Satisfaction: (Ordinal)
The level of satisfaction cannot be counted directly, as it is not a physical quantity. However, it can be graded between 1 and 5 and compared across people. Hence, it can be treated as an ordinal variable.
d. Motivation: (Ordinal)
Similar to the level of satisfaction, the impact that a source of motivation has on a person can be compared with its impact on others. For example, the impact of a motivational speech on one’s life can be scaled as Low, Medium or High.
e. Exchange rate: (Continuous)
$1 = ₹75 or ₹1 = $0.013.
The rate of exchange can take any positive real value; the example above gives exchange rates of 75 and 0.013. Because an exchange rate is a ratio of two currency values, it cannot be negative.
f. Gold price: (Continuous)
Similar to the exchange rate, the gold price can take any real value, such as ₹80/gm. Crude oil prices in the USA recently went negative; something similar is conceivable for gold, though it would be rare.
g. Preference of cars: (Nominal)
A person with a large family might prefer a Sedan or an SUV, while a rich person might prefer a race car such as a Ferrari. These preferences cannot be expressed as an ordered ranking.
h. Teacher’s feedback: (Ordinal)
If feedback is expressed as an overall rating, it can be ordered. Good feedback ranks higher than bad feedback.
i. Grades in Post- Graduation: (Ordinal)
Grade O, Grade A and so on can be ranked in order. Grade A is better than Grade B, and hence is given a higher rank. This differs from a percentage (0.0 – 100.0) or a CGPA (1.0 – 8.0), which would be continuous variables.
j. Marital status: (Nominal)
Marital status has no natural order. For example, Married and Widower cannot be ranked against each other.
k. Quality of services: (Ordinal)
Similar to the teacher’s feedback, quality can be measured on an ordered scale. Good quality ranks higher than bad quality.
l. Age group: (Ordinal)
Young people, teenagers, adults and senior citizens form categories that can be arranged in order of age.
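As a small illustration of an ordinal variable, the sketch below maps an age in years to the age groups listed in the table above and compares two groups by rank. The function name `age_group` and the list `AGE_GROUPS` are hypothetical names chosen for this example only.

```python
# Age groups in ascending order, with upper bounds taken from the table in this report.
# The names AGE_GROUPS and age_group are illustrative, not part of the assignment.
AGE_GROUPS = ["Young", "Teenage", "Adult", "Senior"]
UPPER_BOUNDS = [12, 19, 50]  # Young: 0-12, Teenage: 13-19, Adult: 20-50, Senior: beyond 50


def age_group(age):
    """Return the ordinal age group for a non-negative age in years."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for label, upper in zip(AGE_GROUPS, UPPER_BOUNDS):
        if age <= upper:
            return label
    return "Senior"


def rank(label):
    """Position of a group in the ordering; a higher rank means an older group."""
    return AGE_GROUPS.index(label)


print(age_group(15), age_group(64))
print(rank(age_group(15)) < rank(age_group(64)))
```

The ranks can be compared (Teenage comes before Senior), but differences between ranks have no numerical meaning, which is what separates an ordinal variable from a quantitative one.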
m. GDP: (Continuous)
The GDP of India was 2718.73 and 2800 billion USD in 2018 and 2019, respectively. The value can be any real number, depending upon economic progress and several other factors.
n. Interest rate: (Continuous)
Japanese banks have negative interest rates for current accounts, whereas Indian banks have rates around 5.0%. The value can be any real number.
o. Twitter comments: (Nominal)
Comments can range from support to protest. They might be joyful or sad. There is no ordered ranking.
p. Facebook pictures:
(Nominal)- If there is no quantitative or ordered ranking among the Facebook pictures/ images, they can be classified as unordered categorical data.
(Discrete)- A 1-bit monochrome image has pixel values of either 0 or 1 only, i.e. binary values. An 8-bit grayscale/ coloured image has pixel values from 0 to 255 = $2^8 - 1$, i.e. discrete (integer) values.
Example- This is a 1-bit monochrome image (binary). Among the four pixels of the 2x2 image shown, white represents 1 and black represents 0.
2. Following data of performance scores is available of employees working with
a company. You are required to perform the following:
a. Make the frequency distribution. Calculate the frequency and the Cumulative
frequency.
Procedure to find Frequency:
● The ‘frequency’ (f) of a data value is the number of times that value occurs in the data.
● For example, the performance score 31 occurs three times in the given data set. Hence, the frequency of score 31 is 3, written as $f_{31} = 3$.
● Similarly, the frequency of each data value (performance scores ranging from 0 to 100) is counted.
● Note: The frequency of a score that is not observed in the dataset is written as 0. For example, $f_{18} = 0$.
Procedure to find Cumulative Frequency:
● ‘Cumulative frequency’ (CF) gives the number of observations that lie at or below a particular value in a data set.
$CF_n = \sum_{k \le n} f_k$
$CF_{48} = \sum_{k \le 48} f_k = f_0 + f_1 + f_2 + \dots + f_{48}$ (for discrete values of observations).
● The cumulative frequency is calculated by adding each frequency in the frequency distribution table to the sum of its predecessors.
● Equivalently, it is the total of the frequencies of all the observations up to a particular value.
● The last cumulative frequency is always equal to the total number of observations.
● For example, $CF_{30} = f_0 + f_1 + \dots + f_{29} + f_{30} = 0 + 0 + \dots + 0 + 1 = 1$.
$CF_{32} = CF_{29} + f_{30} + f_{31} + f_{32} = 0 + 1 + 3 + 2 = 6$.
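The counting procedure can be sketched in Python as below. The full list of 208 scores is not reproduced in this report, so the sketch uses only the six lowest scores quoted in the sorted dataset (30, 31, 31, 31, 32, 32); the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# The six lowest scores quoted from the sorted dataset in this report.
scores = [30, 31, 31, 31, 32, 32]

freq = Counter(scores)              # f_k: how many times each score occurs
values = sorted(freq)               # distinct observed scores in ascending order
running = accumulate(freq[v] for v in values)
cum = dict(zip(values, running))    # CF_k: observations at or below score k

for v in values:
    print(v, freq[v], cum[v])

# A Counter reports 0 for a score that never occurs, e.g. f_18.
print(freq[18])
```

On this subset the result agrees with the worked example: the cumulative frequency at score 32 is 6, and the last cumulative frequency equals the number of observations used.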
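For the given data, grouped into class intervals of 10 scores, the frequency and cumulative frequency (using the class frequencies labelled on the histogram and ogive diagram at the end of this report) are:
Scores | Frequency | Cumulative frequency | Cumulative frequency %
30-39 | 36 | 36 | 17.31
40-49 | 27 | 63 | 30.29
50-59 | 32 | 95 | 45.67
60-69 | 33 | 128 | 61.54
70-79 | 21 | 149 | 71.63
80-89 | 26 | 175 | 84.13
90-98 | 33 | 208 | 100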
Observations:
● It is observed that the performance scores 0 to 29, 99 and 100 are not obtained by any of the 208 employees.
$f_0 = \dots = f_{29} = 0$.
● According to the given data, 30 and 98 are the lowest and highest scores obtained by any employee, respectively.
● Additionally, the performance scores 65 and 81 are not obtained by anybody. Hence, it can be said that:
$f_{99} = f_{100} = f_{65} = f_{81} = 0$.
● Performance score 53 is the most frequent score. $f_{53} = 8$.
b. Calculate the mean, median, mode and quartiles.
Arithmetic Mean (Average):
● An ‘arithmetic mean’ (µ) is a single number that summarises a list of numbers. It is the balance point of the observed data.
For example, the set {3,4,5,6,7} has its average at 5.
● It is the sum of all the observation values divided by the total number of observations.
$\text{Arithmetic Mean (A.M.)} = \dfrac{\text{Sum of all the observation values}}{\text{Number of observations}}$
$\mu = \dfrac{\sum x}{n}$
● In the given dataset, the number of employees (observations) is 208. Their performance scores (observation values) range from 30 to 98.
● According to the given data set:
$\mu = \dfrac{52 + 57 + 50 + 68 + 74 + \dots + 66}{208} = 63.35096$.
Median:
● The ‘median’ is the middle number in a list sorted in ascending or descending order. It can be more descriptive of a data set than the average.
● The median is generally preferred when outliers in the dataset cause skewness, because it is much less affected by extreme values than the mean.
● For example, the set {5,2,6,3,7} is sorted in ascending order as {2,3,5,6,7} and hence the median is 5.
● The median (the 50th percentile) separates the higher half of a data sample from the lower half.
Median = {(n + 1) ÷ 2}th value in a sorted dataset. When n is even, this position falls between two values and the median is the average of the (n ÷ 2)th and (n ÷ 2 + 1)th values.
● In the given dataset, the sorted data is {30,31,31,31,32,32,33,…,98}. With n = 208, the median is the average of the 104th and 105th values.
Median = 62.
Mode:
● The observation with the highest frequency is known as the ‘mode’ of the data. It indicates the value around which most of the data is crowded.
● In other words, the mode of a set of data is the value that occurs most often.
● In the given dataset, the performance score 53 has been obtained by more employees than any other score.
$f_{max} = f_{53} = 8$.
Mode = 53.
● Note: For the given data, Mode < Median < Mean.
Hence, the data may be positively skewed, with a longer tail to the right of the mode. To confirm this, the skewness must also be measured.
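The three measures of central tendency can be checked with Python's `statistics` module. This is a minimal sketch: the first two lists are the example sets from the text, while `mode_example` is a hypothetical list invented only to illustrate the mode and the even-sized median.

```python
import statistics

# Example sets taken from the text above.
mean_example = [3, 4, 5, 6, 7]
median_example = [5, 2, 6, 3, 7]

# Hypothetical scores, invented for illustration (six values, so n is even).
mode_example = [53, 47, 53, 61, 53, 47]

mean_value = statistics.mean(mean_example)       # sum of values / number of values
median_odd = statistics.median(median_example)   # middle value of the sorted list
median_even = statistics.median(mode_example)    # average of the two middle values
mode_value = statistics.mode(mode_example)       # most frequent value

print(mean_value, median_odd, median_even, mode_value)
```

`statistics.median` sorts the data internally, so the input does not need to be sorted first.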
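Geometric Mean (G.M.):
● It is the n-th root of the product of n positive values. It is the middle term of a three-term G.P. (geometric progression); for example, 3, 9, 27 has a G.M. of 9.
$\text{G.M.} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = (x_1 \cdot x_2 \cdots x_n)^{1/n}$
$\text{G.M.} = (52 \times 57 \times 50 \times 68 \times 74 \times \dots \times 66)^{1/208} = 59.8147$
Harmonic Mean (H.M.):
● The ‘harmonic mean’ is the reciprocal of the arithmetic mean of the reciprocals of the values. Together with the A.M. and the G.M., it is one of the three Pythagorean means.
$\text{H.M.} = \dfrac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
$\text{H.M.} = \dfrac{208}{\frac{1}{52} + \frac{1}{57} + \frac{1}{50} + \dots + \frac{1}{66}} = 56.25897$
● Note: For positive values, it always holds that H.M. ≤ G.M. ≤ A.M.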
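The ordering H.M. ≤ G.M. ≤ A.M. can be illustrated on the geometric progression 3, 9, 27 from the text. The sketch assumes Python 3.8 or later, where `statistics.geometric_mean` is available.

```python
import statistics

progression = [3, 9, 27]   # the geometric progression used as an example in the text

am = statistics.mean(progression)             # (3 + 9 + 27) / 3 = 13
gm = statistics.geometric_mean(progression)   # cube root of 3 * 9 * 27, i.e. 9 up to rounding
hm = statistics.harmonic_mean(progression)    # 3 / (1/3 + 1/9 + 1/27) = 81/13

print(hm, gm, am)

# The three means reported above for the 208 performance scores follow the same ordering.
reported_hm, reported_gm, reported_am = 56.25897, 59.8147, 63.35096
print(reported_hm <= reported_gm <= reported_am)
```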
Quartiles:
● A ‘quartile’ is a type of quantile. The quartiles divide the ordered data points into 4 parts containing equal numbers of observations, known as ‘quarters’.
● The ‘first quartile’ (Q1) is the middle number between the lowest number and the median of the data set. Hence, in an ordered dataset, Q1 stands at the boundary between the lowest 25% and the remaining 75% of the data.
● In the given dataset, Q1 = 45, i.e. 25% of the data (208/4 = 52 observations) lie at or to the left of this number and the rest lie to its right.
● The ‘second quartile’ (Q2) is the point at or below which 50% of the ordered observations lie. It is also known as the ‘median’ of the data.
● In the given dataset, Q2 = 62, i.e. 50% of the data (208/2 = 104 observations) lie at or to the left of this number.
● The ‘third quartile’ (Q3) is the middle number between the median and the largest number of the data set. Hence, in an ordered dataset, Q3 stands at the boundary between the lowest 75% and the remaining 25% of the data.
● In the given dataset, Q3 = 83, i.e. 75% of the data (208 x 75% = 156 observations) lie at or to the left of this number and the rest lie to its right.
● The ‘fourth quartile’ (Q4) is the largest observed data value, beyond which no more of the ordered observations exist. It is also known as the ‘largest observed value’ of the data.
● In the given dataset, Q4 = 98, i.e. 100% of the data (208 x 100% = 208 observations) lie at or to the left of this number.
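A short sketch of the quartile calculation, using `statistics.quantiles` on a hypothetical seven-value sample invented for illustration. Software packages use slightly different quartile conventions, so the method chosen here ("exclusive", which places the cut points at positions (n+1)/4, (n+1)/2 and 3(n+1)/4 of the sorted data) is an assumption and may not match the convention used in the attached Excel sheet.

```python
import statistics

# Hypothetical sample, invented for illustration only.
sample = [7, 1, 5, 3, 6, 2, 4]

# quantiles() sorts the data itself; n=4 asks for the three quartile cut points.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="exclusive")
q4 = max(sample)   # 'fourth quartile' in the sense used above: the largest value

print(q1, q2, q3, q4)
```

For this sample the second quartile coincides with the median, as described above.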
c. Calculate the variance and the standard deviation.
Variance (σ²):
● ‘Variance’ measures how far the dataset values are spread out from their average value. In everyday usage, the word means the fact or quality of being inconsistent, divergent or different.
● A lower variance means a narrower spread: the data is more tightly clumped around the mean.
● Similarly, a high variance indicates a loose, wide spread of the data around the mean.
● Variance (σ²) is always a non-negative number. σ² ≥ 0.
● ‘Population variance’ is the variance calculated from the complete population data (all the observations).
● ‘Sample variance’ is the variance calculated from sample data (with n − 1 instead of n in the denominator).
● Complete steps to calculate the variance have been shown in the Excel sheet attached.
● Formulae for Sample variance ($\sigma_s^2$) and Population variance ($\sigma_p^2$) are given by,
$\sigma_s^2 = \dfrac{\sum (x - \mu)^2}{n - 1}$ and $\sigma_p^2 = \dfrac{\sum (x - \mu)^2}{n} = E[X^2] - \mu^2$
where, µ = average value or arithmetic mean
n = Number of observations
x = the value of one observation at a time
E = Expected value.
● In the given dataset,
$\sigma_s^2 = 428.8859$ and $\sigma_p^2 = 426.824$
● Note: Since 1/n < 1/(n-1), hence $\sigma_p^2 < \sigma_s^2$.
● If the observed data values are in kg, the variance will be in kg², i.e. in the squared unit of the original data.
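The two variance formulae can be sketched on a small illustrative set, {1, 2, 3, 4, 5}, which is not part of the employees' data. The last line is only a consistency check on the two values reported above, using the relation σp² = σs² × (n − 1)/n with n = 208.

```python
import statistics

data = [1, 2, 3, 4, 5]   # small illustrative set, not the employees' scores
n = len(data)
mu = statistics.mean(data)

sum_sq_dev = sum((x - mu) ** 2 for x in data)   # sum of squared deviations from the mean
sample_var = sum_sq_dev / (n - 1)               # divides by n - 1
population_var = sum_sq_dev / n                 # divides by n
shortcut = sum(x * x for x in data) / n - mu ** 2   # E[X^2] - mu^2

print(sample_var, population_var, shortcut)

# Consistency check on the values reported above for the 208 scores.
reported_population = 428.8859 * 207 / 208
print(round(reported_population, 3))
```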
Standard Deviation (σ):
● Like the variance, the standard deviation is a measure of how spread out the numbers are.
● It is the positive square root of the variance.
● σ is always a non-negative number. σ ≥ 0.
● A low value indicates a narrow spread of the data around the mean, and a high value indicates a wide spread.
● Formulae for Sample Standard Deviation ($\sigma_s$) and Population Standard Deviation ($\sigma_p$) are given by,
$\sigma_s = \sqrt{\dfrac{\sum (x - \mu)^2}{n - 1}}$ and $\sigma_p = \sqrt{\dfrac{\sum (x - \mu)^2}{n}}$
$\text{Standard Deviation} = \sqrt{\text{Variance}}$
where, µ = average value or arithmetic mean
n = Number of observations
x = the value of one observation at a time
$\sigma_s$ = Sample Standard Deviation
$\sigma_p$ = Population or Universal Standard Deviation.
● Complete steps to calculate the SD have been shown in the Excel sheet attached.
● In the given dataset,
$\sigma_s = 20.70956$ and $\sigma_p = 20.65972$.
● Note: Since 1/n < 1/(n-1), hence $\sigma_p < \sigma_s$.
● If the observed data values are in kg, the standard deviation will also be in kg, i.e. in the same unit as the original data.
● Another important aspect of the SD is that it indicates the most likely range of the data. Most of the given data lies in the range µ ± σs, i.e. approximately between 43 and 84.
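The reported standard deviations and the range µ ± σs can be re-derived from the reported variances and mean. This sketch uses only the summary values stated in this report, not the raw scores.

```python
import math

sample_variance = 428.8859      # reported sample variance
population_variance = 426.824   # reported population variance
mean_score = 63.35096           # reported arithmetic mean

sample_sd = math.sqrt(sample_variance)          # positive square root of the variance
population_sd = math.sqrt(population_variance)

low = mean_score - sample_sd    # lower end of the range mu - sigma_s
high = mean_score + sample_sd   # upper end of the range mu + sigma_s

print(round(sample_sd, 5), round(population_sd, 5))
print(round(low), round(high))
```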
Results in R:
3. A. In continuation with the data of performance scores of employees in
previous example, perform the following:
a. Calculate the range and interquartile range.
b. Calculate the z- scores.
c. Calculate the skewness and Kurtosis (using excel).
d. Comment on the distribution of the data.
Range:
● The ‘range’ of a dataset is the difference between its highest and lowest values.
Range = Highest value – Lowest value
Range = 98 – 30 = 68.
● Disadvantage– The range can be misleading as a guide to the values that are likely to be seen in the data.
● In the given data, the possible values that have not been observed at either end are 0-29, 99 and 100. Here, the lowest possible value = 0 and the highest possible value = 100. Hence,
maximum possible range = 100 – 0 = 100.
● For example, for the set {8,11,5,9,1,6,3616}, the range = 3616 – 1 = 3615.
However, all the data values except one are around 10 or less. Hence, the IQR and σ are also to be calculated.
Inter-Quartile Range (IQR):
● The ‘inter-quartile range’ (IQR or midspread) is a measure of variability based on dividing the dataset into quartiles.
● IQR is the difference between the largest and smallest values in the middle 50% of the dataset.
Inter-Quartile Range = Third Quartile – First Quartile.
IQR = Q3 – Q1.
Here, Q1 = 45 and Q3 = 83. Therefore,
IQR = 83 – 45 = 38.
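The contrast between the range and the IQR can be seen on the outlier example {8, 11, 5, 9, 1, 6, 3616} from the text. The quartile convention ("exclusive" in `statistics.quantiles`) is an assumption made for this sketch; other conventions give slightly different quartiles.

```python
import statistics

values = [8, 11, 5, 9, 1, 6, 3616]   # example set from the text, with one outlier

data_range = max(values) - min(values)   # dominated by the single outlier
q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
iqr = q3 - q1                            # spread of the middle 50% only

print(data_range, q1, q3, iqr)

# The same two measures for the employees' scores, from the values reported above.
scores_range = 98 - 30
scores_iqr = 83 - 45
print(scores_range, scores_iqr)
```

A single extreme value inflates the range to 3615, while the IQR stays small because it ignores the lowest and highest quarters of the data.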
Z- scores (Standardised values):
● A ‘z-score’ describes the standardised position of a data value, as its distance from the mean value.
● A data value has a positive z-score if it lies above the mean. Similarly, the z-score is negative if the data value lies below the mean.
● The z-score (standard score) is the number of standard deviations by which a data value lies below or above the mean (µ).
● Note: The average of all the z-scores is always 0. When σ is the sample SD, the sample SD of all the z-scores is always 1.
● Formula for the z-score of a data value (x) is given by:
$z = \dfrac{x - \mu}{\sigma}$
where, σ = sample standard deviation = 20.70956
µ = Arithmetic mean = 63.35096.
● For example, the dataset {1,2,3,4,5} has µ = 3 and σs = 1.58. Hence, the z-scores will be: {-1.265, -0.632, 0, 0.632, 1.265}.
● The z-scores for the provided dataset of employees’ performances have been displayed in the corresponding Excel sheet.
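The small example {1, 2, 3, 4, 5} can be reproduced as follows. This is a minimal sketch using the sample standard deviation, as in the formula above; the variable names are illustrative.

```python
import statistics

data = [1, 2, 3, 4, 5]        # example set from the text
mu = statistics.mean(data)
sd = statistics.stdev(data)   # sample SD (n - 1 in the denominator)

z_scores = [(x - mu) / sd for x in data]   # distance from the mean in SD units
rounded = [round(z, 3) for z in z_scores]

print(rounded)
print(statistics.mean(z_scores), statistics.stdev(z_scores))
```

The printed mean and sample SD of the z-scores come out as 0 and 1 (up to floating-point rounding), in line with the note above.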
Skewness:
● ‘Skewness’ measures the asymmetry of a distribution, i.e. how far its shape is distorted from a symmetric bell curve. It represents the extent to which a given dataset departs from symmetry.
● A normal distribution (bell curve) has a skewness of zero.
● A negative skewness indicates that the longer tail of the distribution is on the left side. A positive skewness indicates that the longer tail is on the right.
● Application- Investors note skewness while judging a return distribution, because it takes the extremes (outliers) of the dataset into account instead of focusing only on the mean. Kurtosis also reflects the extremes and hence is also used.
● Some formulae for measuring skewness are:
$\text{Pearson's Mode Skewness} = \dfrac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
$S_1 = \dfrac{\mu - \text{Mode}}{\sigma} = \dfrac{63.35096 - 53}{20.70956}$
$S_1 = 0.5$
$\text{Pearson's Median Skewness} = \dfrac{3\,(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$
$S_2 = \dfrac{3\,(\mu - \text{Median})}{\sigma} = \dfrac{3\,(63.35096 - 62)}{20.70956}$
$S_2 = 0.2$
$\text{Co-efficient of Skewness} = \dfrac{\sum (x - \mu)^3}{(n - 1)\,\sigma^3} = \dfrac{\sum (x - 63.35096)^3}{207 \times 20.70956^3}$
Co-efficient of Skewness = 0.09767 (using MS Excel).
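The two Pearson coefficients can be recomputed from the summary statistics reported in this document. The function names below are hypothetical, chosen for this sketch only.

```python
def pearson_mode_skewness(mean, mode, sd):
    """Pearson's first coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / sd


def pearson_median_skewness(mean, median, sd):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / sd


# Summary statistics reported above for the 208 performance scores.
mean_score, median_score, mode_score = 63.35096, 62, 53
sample_sd = 20.70956

s1 = pearson_mode_skewness(mean_score, mode_score, sample_sd)
s2 = pearson_median_skewness(mean_score, median_score, sample_sd)

print(round(s1, 1), round(s2, 1))
```

Both coefficients are positive, which matches the observation that the mode and the median lie below the mean.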
Kurtosis:
● Kurtosis describes the height and sharpness of the central peak, compared to that of a standard bell curve.
● ‘Kurtosis’ indicates how heavily or mildly the tails of the data distribution differ from those of a bell-shaped (normal) distribution, i.e. whether the tails contain extreme values and how many.
● Kurtosis is a measure of the ‘tailedness’ (outliers/ extremes) of a probability distribution.
● Formula for measuring (excess) kurtosis is:
$\text{Kurtosis} = \dfrac{\sum (x - \mu)^4}{(n - 1)\,\sigma^4} - 3 = \dfrac{\sum (x - 63.35096)^4}{207 \times 20.70956^4} - 3$
Kurtosis = -1.2854 (using MS Excel).
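The formula above can be sketched as a small function. It follows the expression as written in the text, with n − 1 in the denominator and the sample SD; spreadsheet functions may apply additional small-sample corrections, so their output can differ slightly from this sketch. The data set `flat` is hypothetical, and the function name is illustrative.

```python
import statistics


def excess_kurtosis(values):
    """Excess kurtosis following the formula in the text (n - 1 in the denominator)."""
    n = len(values)
    mu = statistics.mean(values)
    sd = statistics.stdev(values)                       # sample standard deviation
    fourth_powers = sum((x - mu) ** 4 for x in values)  # sum of (x - mu)^4
    return fourth_powers / ((n - 1) * sd ** 4) - 3


flat = [1, 2, 3, 4, 5]   # hypothetical, evenly spread data with no pronounced peak
k_flat = excess_kurtosis(flat)

print(round(k_flat, 2))
```

Evenly spread data gives a negative value, the same sign as the kurtosis reported for the performance scores.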
Data distribution:
● Mean = 63.35096, Median = 62, Mode = 53.
Mode < Median < Mean.
Hence, it appears that the data may be positively skewed (towards the right).
● If Skewness ∉ (-1, 1); highly skewed distribution.
If Skewness ∈ (-1, -0.5) or Skewness ∈ (0.5, 1); moderately skewed distribution.
If Skewness ∈ [-0.5, 0.5]; approximately symmetric distribution.
● However, Co-efficient of Skewness = 0.09767.
Hence, the distribution is approximately symmetric.
● If Kurtosis ≈ 0; Mesokurtic distribution ⇒ Close to Normal distribution.
If Kurtosis > 0; Leptokurtic distribution ⇒ More outliers, heavy tails, riskier financial investments.
If Kurtosis < 0; Platykurtic distribution ⇒ Fewer outliers, light tails, more desirable financial investments.
● Since Kurtosis = -1.2854, the data follows a Platykurtic distribution.
● First Quartile = 45 and Third Quartile = 83. IQR = 38.
● Hence, on the performance score scale between 0 and 100, the middle 50% of the data lies between 45 and 83, i.e. within just 38% (nearly 3/8th) of all possible performance scores.
● Range = 68.
Out of all the possible scores between 0 and 100, every employee has a score within a 68% range (nearly 2/3rd).
About 1/3rd of the possible scores are not obtained by anybody.
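The classification rules above can be written as two small functions. The function names are hypothetical, and the tolerance used to decide when kurtosis counts as "≈ 0" is an assumption, since the text does not state a cut-off.

```python
def classify_skewness(skew):
    """Apply the skewness intervals listed above."""
    if -0.5 <= skew <= 0.5:
        return "approximately symmetric"
    if -1 < skew < 1:
        return "moderately skewed"
    return "highly skewed"


def classify_kurtosis(kurt, tolerance=0.1):
    """Apply the kurtosis rules; `tolerance` for 'close to 0' is an assumed cut-off."""
    if abs(kurt) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if kurt > 0 else "platykurtic"


# Values reported above for the performance scores.
print(classify_skewness(0.09767))
print(classify_kurtosis(-1.2854))
```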
Results in R:
3. B. In continuation with the data of performance scores of employees in
previous example, perform the following:
a. Make the histogram
b. Plot the box-plot diagram
c. Plot the frequency polygon
d. Plot the Ogive diagram.
Histogram:
A ‘histogram’ is a graphical display of the given data, using bars whose heights represent frequencies. Taller bars indicate that more data falls into the particular range. A histogram also displays the spread and shape of continuous sample data.
Box-Plot:
A box-plot graphically depicts groups of given data with the help of their quartiles. Box-plots also have lines extending from the box (whiskers), indicating variability outside the first and third quartiles. Hence, it is also known as a ‘box-and-whisker plot’. It shows the minimum, Q1, median, Q3 and maximum, in this sequence.
Frequency Polygon:
A ‘frequency polygon’ is constructed from the histogram bars/ bins/ intervals. The height of each bin is converted into a point at the midpoint of the bin, representing its frequency, and these points are joined by lines. A frequency polygon is generally created from the histogram. It can also be created from the frequency distribution table, by calculating the midpoints of the intervals.
Ogive Diagram:
An ‘ogive diagram’ (cumulative frequency polygon) shows cumulative frequencies. Here, the cumulative frequencies (or cumulative percentages) are plotted on the graph from left to right (lower range to higher range).
An ogive graph has cumulative frequencies on the y-axis and class ranges (boundaries) on the x-axis.
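The points plotted in the frequency polygon and the ogive can be derived from the class frequencies. The sketch below uses the class intervals and frequencies labelled on the histogram in this report; the variable names are illustrative, and plotting itself is left out.

```python
from itertools import accumulate

# Class intervals and frequencies as labelled on the histogram in this report.
bins = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 98)]
frequencies = [36, 27, 32, 33, 21, 26, 33]

# Frequency polygon: one point per class, at the class midpoint.
midpoints = [(low + high) / 2 for low, high in bins]

# Ogive: running totals, also expressed as a percentage of all observations.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]
cumulative_pct = [round(100 * c / total, 2) for c in cumulative]

for (low, high), f, c, pct in zip(bins, frequencies, cumulative, cumulative_pct):
    print(f"{low}-{high}", f, c, pct)
```

The cumulative percentages obtained this way agree with the point labels on the ogive diagram, and the final cumulative frequency equals the 208 employees.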
Conclusion:
● The histogram does not follow a normal distribution.
● Instead, the ogive diagram suggests that the distribution across the bins is closer to flat than to bell-shaped, i.e. a nearly equal number of employees fall under each bin.
● The middle 50% of the distribution has scores between 45 and 83 (Q1 and Q3).
● On the scale of 0 to 100, about 1/3rd of the possible scores have not been obtained by anybody.
● The data is approximately symmetric.
[Figure: Histogram. x-axis: Employees' Performance Scores (classes 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-98); y-axis: Frequency (0 to 40). Bar labels: 36, 27, 32, 33, 21, 26, 33.]
[Figure: Box-Plot. Axis: 0 to 100.]
[Figure: Frequency Polygon. x-axis: Employees' Performance Scores (classes 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-98); y-axis: Frequency (0 to 40). Point labels: 36, 27, 32, 33, 21, 26, 33.]
[Figure: Ogive Diagram. x-axis: Employees' Performance Scores (30-39, 30-49, 30-59, 30-69, 30-79, 30-89, 30-98); y-axis: Cumulative Frequency % (0 to 100). Point labels: 17.31, 30.29, 45.67, 61.54, 71.63, 84.13, 100.]