Decision Science Assignment


Decision Science

INTERNAL ASSIGNMENT- JUNE 2020

UNDER THE GUIDANCE OF

Prof. Dr. N. Palaniappan

SUBMITTED BY:

DIKSHA GUPTA

M.B.A. 2ND YEAR

DATE OF SUBMISSION:

June 23, 2020.


1. Identify the types of the variable.

In Decision Science or Statistics, a ‘variable’ is an attribute of an object that is to be studied. Deciding which attributes to measure is an important part of designing an experiment.

Types of data: Quantitative and Categorical variables

Data consists of specific measurements of a set of variables. These variables are generally divided into 2 types:

● Quantitative variables:

Quantitative data represents quantities or amounts. The recorded numbers support arithmetic operations such as addition, subtraction and division.

● Categorical variables:

Categorical data represents groups. Nominal categories have no natural order, whereas ordinal categories can be ranked.

*A variable may fall under more than one type. Sometimes a categorical variable can also be treated as quantitative: if its scale is numeric, as with material-quality or survey ratings that can take fractional values, it can be treated as continuous quantitative data.

Sr. No. | Variable | Data Type | Values
a | Gender | Nominal Categorical variable | Male, Female and Others.
b | Educational background | Ordinal Categorical variable | None(0), Metric pass(12), Graduate(15), Post-graduate(17), Doctorate(21) etc.
c | Satisfaction | Ordinal Categorical variable | Low, Medium, High or measured on a 1 to 5 level scale.
d | Motivation | Ordinal Categorical variable | Low, Medium, High or measured on a 1 to 5 level scale.
e | Exchange rate | Continuous Quantitative variable | Any real number value such as 0.5, 78 etc.
f | Gold price | Continuous Quantitative variable | Any real number value such as 0.5, 78 etc.
g | Preference of cars | Nominal Categorical variable | SUV, Sedan or Race car.
h | Teacher’s feedback | Ordinal Categorical variable | Bad(1), Good(2), Extraordinary(3) etc.
i | Grades in Post-Graduation | Ordinal Categorical variable | Grade O, Grade A, Grade B, Grade F etc.
j | Marital status | Nominal Categorical variable | Married, Unmarried, Divorced, Widow/Widower.
k | Quality of services | Ordinal Categorical variable | Bad(1), Good(2), Extraordinary(3) etc.
l | Age group | Ordinal Categorical variable | Young (0-12), Teenage (13-19), Adult (20-50), Senior (beyond 50).
m | GDP | Continuous Quantitative variable | Any real numbers such as $3000 billion.
n | Interest rate | Continuous Quantitative variable | Any real numbers such as 10%, 5.8%, -1.3% etc.
o | Twitter comments | Nominal Categorical variable | Exclamatory, Liked, Positive, Disliked, Joyful etc.
p | Facebook pictures | Nominal Categorical variable or Discrete Quantitative variable | No ordered ranking among the images, OR images with pixel values from 0 to 255 (integers).

a. Gender: (Nominal)

When gender has just two categories, Male and Female, it can be treated as a binary variable (‘binary’ means relating to two). It can also be coded as a discrete variable by writing Male as 1 and Female as 0, or vice versa.

When gender has more than two categories, such as Male, Female and Others, it is treated as a nominal variable. There is no ordered ranking among the three categories.

b. Educational background: (Ordinal)

Educational background can be treated as a variable recording the highest level of education a person has completed. These categories can be ranked in order, from the lowest level to the highest.

c. Satisfaction: (Ordinal)

The level of satisfaction cannot be counted directly, as it is not a physical object. However, one can grade it between 1 and 5 and compare it with the grades of others. Hence, it can be treated as an ordinal variable.

d. Motivation: (Ordinal)

Like the level of satisfaction, the impact that a source of motivation has had on a person can be compared with its impact on others. For example, the impact of motivational speeches on one’s life can be scaled as Low, Medium or High.

e. Exchange rate: (Continuous)

$1 = ₹75 or ₹1 = $0.013.

The rate of exchange can take any positive real value. The example above illustrates an exchange rate of 75 or 0.013 (both real numbers). Because it is a ratio of two currency values, an exchange rate cannot be negative.

f. Gold price: (Continuous)

Similar to the exchange rate, the gold price can take any real value, such as ₹80/gm. Recently, crude oil prices in the USA went negative; a similar case for gold would be very rare.

g. Preference of cars: (Nominal)

A person with a large family might prefer a Sedan or an SUV, and a rich person might prefer a race car such as a Ferrari. These preferences cannot be expressed as an ordered ranking.

h. Teacher’s feedback: (Ordinal)

If feedback is represented as an overall rating, it can be ordered: good feedback is ranked higher than bad feedback.

i. Grades in Post-Graduation: (Ordinal)

Grades such as Grade O and Grade A can be ranked in order. Grade A is better than Grade B, hence it is given a higher ranking. However, letter grades differ from percentages (0.0 – 100.0) and CGPA (1.0 – 8.0), which would be continuous variables.

j. Marital status: (Nominal)

Marital status cannot be ranked in any order. For example, there is no meaningful ordering between a married person and a widower.

k. Quality of services: (Ordinal)

Similar to the teacher’s feedback, quality can be measured in an ordered manner. Good quality is ranked higher than bad quality.

l. Age group: (Ordinal)

Youngsters, adults and senior citizens are categorical values that can be arranged in order.

m. GDP: (Continuous)

The GDP of India was 2718.73 billion USD in 2018 and 2800 billion USD in 2019. The value can be any real number, depending upon economic progress and several other factors.

n. Interest rate: (Continuous)

Japanese banks have negative interest rates for current accounts, whereas Indian banks have rates around 5.0%. The value can be any real number.

o. Twitter comments: (Nominal)

Comments can express support or protest. They might be joyful or sad. There is no ordered ranking among them.

p. Facebook pictures:

(Nominal)- If there is no quantitative or ordered ranking among the Facebook pictures/images, they can be classified as unordered categorical data.

(Discrete)- A 1-bit monochrome image has pixel values of either 0 or 1 only, i.e. binary values. An 8-bit grayscale/coloured image has pixel values from 0 to 255 = $2^8 - 1$, i.e. discrete whole-number values.

Example- In a 1-bit monochrome (binary) image, such as the 2x2 image of four pixels shown, white represents 1 and black represents 0.
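As a small illustration of the discrete view of an image, the sketch below stores a hypothetical 2x2 1-bit image as nested lists and computes the largest value a pixel of a given bit depth can hold. The pixel layout and the helper name `max_pixel_value` are assumptions made for this example only.

```python
def max_pixel_value(bits):
    """Largest integer a pixel with the given bit depth can hold (2**bits - 1)."""
    return 2 ** bits - 1

# Hypothetical 2x2 monochrome image: 1 = white, 0 = black.
image = [[1, 0],
         [0, 1]]

# Flatten the image and check that every pixel is a valid 1-bit value.
pixels = [p for row in image for p in row]
all_binary = all(0 <= p <= max_pixel_value(1) for p in pixels)

print(max_pixel_value(1))  # 1
print(max_pixel_value(8))  # 255
print(all_binary)          # True
```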


2. Following data of performance scores is available of employees working with

a company. You are required to perform the following:

a. Make the frequency distribution. Calculate the frequency and the Cumulative

frequency.

Procedure to find Frequency:

● The ‘frequency’ (f) of a data value is the number of times that value occurs in the dataset.

● For example, the performance score 31 occurs three times in the given dataset. Hence, the frequency of score 31 is 3, written as $f_{31} = 3$.

● Similarly, the frequency of each data value (performance scores ranging from 0 to 100) is counted.

● Note: The frequency of a score that is not observed in the dataset is written as 0. For example, $f_{18} = 0$.

Procedure to find Cumulative Frequency:

● ‘Cumulative frequency’ (CF) gives the number of observations that lie at or below a particular value in a dataset.

$CF_n = \sum_{i \le n} f_i$

$CF_{48} = \sum_{i \le 48} f_i = f_0 + f_1 + f_2 + \dots + f_{48}$ (for discrete values of observations).

● The cumulative frequency is calculated by adding each frequency in the frequency distribution table to the sum of its predecessors.

● Equivalently, it is the total of the frequencies of all the observations up to a particular value.

● The last value of the cumulative frequency is always equal to the total number of observations.

● For example, $CF_{30} = f_0 + f_1 + \dots + f_{29} + f_{30} = 0 + 0 + \dots + 0 + 1 = 1$.

$CF_{32} = CF_{30} + f_{31} + f_{32} = 1 + 3 + 2 = 6$.
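The counting procedure can be sketched in Python as follows. This is only an illustration: it uses the values 30, 31, 31, 31, 32, 32, 33 that open the sorted dataset quoted in this document, not the full list of 208 scores, and the variable names are chosen for the example.

```python
from collections import Counter
from itertools import accumulate

# First seven values of the sorted dataset; the full data is not reproduced here.
scores = [30, 31, 31, 31, 32, 32, 33]

freq = Counter(scores)   # frequency of each observed score
values = sorted(freq)    # distinct scores in ascending order

# Running total of the frequencies gives the cumulative frequency.
cum_freq = dict(zip(values, accumulate(freq[v] for v in values)))

print(freq[31])      # 3
print(freq[18])      # 0 (a Counter returns 0 for unobserved scores)
print(cum_freq[32])  # 6
```

The last cumulative frequency, `cum_freq[33]`, equals `len(scores)`, matching the rule that the final value is the total number of observations.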

For the given data, the frequency and cumulative frequency, taken in sets of 10 scores at a time, are observed as:

Observations:

● The performance scores from 0 to 29, 99 and 100 are not obtained by any of the 208 employees.

$f_0 = \dots = f_{29} = 0$.

● According to the given data, 30 and 98 are respectively the lowest and highest scores obtained by any employee.

● Additionally, the performance scores 65 and 81 are not obtained by anybody. Hence:

$f_{99} = f_{100} = f_{65} = f_{81} = 0$.

● Performance score 53 is the most frequent score: $f_{53} = 8$.


b. Calculate the mean, median, mode and quartiles.

Arithmetic Mean (Average):

● An ‘arithmetic mean’ (µ) is a single number that summarises a list of numbers. It is the balance point (centre) of all the observed values.

For example, the set {3,4,5,6,7} has its average at 5.

● It is the sum of the given observation values divided by the total number of observations.

$$\text{Arithmetic Mean (A.M.)} = \frac{\text{Sum of all the observation values}}{\text{Number of observations}}$$

$$\mu = \frac{\sum x}{n}$$

● In the given dataset, the number of employees (observations) is 208. Their performance scores (observation values) range from 30 to 98.

● According to the given dataset:

$$\mu = \frac{52 + 57 + 50 + 68 + 74 + \dots + 66}{208} = 63.35096$$

Median:

● The ‘median’ is the middle number in a list sorted in ascending or descending order. It can be more descriptive of a dataset than the average.

● The median is generally used when there are outliers in the dataset that cause skewness, because it is much less affected by extreme values than the mean.

● For example, the set {5,2,6,3,7} can be sorted in ascending order as {2,3,5,6,7}, and hence the median is 5.

● A median separates the higher half from the lower half of a data sample (the 50th percentile).

Median = {(n + 1) ÷ 2}th value in a sorted dataset. When n is even, this position falls between two values and the median is their average.

● In the given dataset, the sorted dataset would be: {30,31,31,31,32,32,33,…,98}. With n = 208, the median is the average of the 104th and 105th values.

Median = 62.

Mode:

● The observation with the highest frequency is known as the ‘mode’ of the data. It indicates the value around which most of the data is crowded.

● In other words, the mode of a set of data is the value that occurs most often.

● In the given dataset, the performance score of 53 has been obtained by the largest number of employees.

$f_{max} = f_{53} = 8$.

Mode = 53.

● Note: For the given data, Mode < Mean and Median < Mean.

Hence, the data may be slightly positively skewed, with a longer tail to the right of the mode. To confirm this, the skewness must also be measured.
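The three measures of central tendency described above can be checked on the small example sets from the text with Python's standard `statistics` module. The list used for the mode is hypothetical and only mimics a dataset in which 53 occurs most often.

```python
import statistics

mean_example = statistics.mean([3, 4, 5, 6, 7])      # 5
median_example = statistics.median([5, 2, 6, 3, 7])  # 5 (sorting is done internally)

# With an even number of values, the median is the average of the two middle ones.
median_even = statistics.median([2, 3, 5, 6])        # 4.0

# Hypothetical scores in which 53 occurs most often.
mode_example = statistics.mode([45, 53, 53, 62, 53, 70])  # 53

print(mean_example, median_example, median_even, mode_example)
```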

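Geometric Mean (G.M.):

● It is the central number in a G.P. (geometric progression); for example, 3, 9, 27 has a G.M. of 9.

$$\text{G.M.} = \left( \prod_{i=1}^{n} x_i \right)^{1/n} = (x_1 \cdot x_2 \cdots x_n)^{1/n}$$

$$\text{G.M.} = (52 \times 57 \times 50 \times 68 \times 74 \times \dots \times 66)^{1/208} = 59.8147$$

Harmonic Mean (H.M.):

● The ‘harmonic mean’ is the reciprocal of the arithmetic mean of the reciprocals of the values. Together with the A.M. and G.M., it is one of the Pythagorean means.

$$\text{H.M.} = \frac{208}{\frac{1}{52} + \frac{1}{57} + \frac{1}{50} + \dots + \frac{1}{66}} = 56.25897$$

● Note: For positive values, H.M. ≤ G.M. ≤ A.M. always holds.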
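The ordering of the three means can be illustrated on the progression 3, 9, 27 mentioned above. This is a minimal sketch using the standard `statistics` module (`geometric_mean` needs Python 3.8 or later); it does not reproduce the 208-score calculation.

```python
import statistics

values = [3, 9, 27]  # the geometric progression used as an example above

am = statistics.mean(values)            # arithmetic mean: 13
gm = statistics.geometric_mean(values)  # geometric mean: approximately 9
hm = statistics.harmonic_mean(values)   # harmonic mean: 81/13, about 6.23

# For positive values the three means are always ordered this way.
ordered = hm <= gm <= am

print(round(hm, 4), round(gm, 4), am, ordered)
```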

Quartiles:

● A ‘quartile’ is a type of quantile that divides the ordered data points into 4 parts of equal size, known as ‘quarters’.

● The ‘first quartile’ (Q1) is the middle number between the lowest number and the median of the dataset. Hence, in an ordered dataset, Q1 stands at the boundary between the first 25% and the remaining 75% of the data.

● In the given dataset, Q1 = 45, i.e. 25% of the data (208/4 = 52 observations) lie at or to the left of this number and the rest lie to its right.

● The ‘second quartile’ (Q2) is the point beyond which just 50% of the ordered observations exist. It is also known as the ‘median’ of the data.

● In the given dataset, Q2 = 62, i.e. 50% of the data (208/2 = 104 observations) lie at or to the left of this number.

● The ‘third quartile’ (Q3) is the middle number between the median and the largest number of the dataset. Hence, in an ordered dataset, Q3 stands at the boundary between the first 75% and the remaining 25% of the data.

● In the given dataset, Q3 = 83, i.e. 75% of the data (208 x 75% = 156 observations) lie at or to the left of this number and the rest lie to its right.

● The ‘fourth quartile’ (Q4) is the largest observed data value, beyond which no more values of the ordered observations exist.

● In the given dataset, Q4 = 98, i.e. 100% of the data (208 x 100% = 208 observations) lie at or to the left of this number.
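Quartile cut points can be sketched with `statistics.quantiles`. The nine-value dataset below is hypothetical, and `method="inclusive"` is only one of several quartile conventions; the values Q1 = 45, Q2 = 62 and Q3 = 83 reported above came from Excel/R and may rest on a different convention.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical ordered observations

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)                     # 3.0 5.0 7.0
print(q2 == statistics.median(data))  # True: Q2 is the median
print(max(data))                      # 9, the 'fourth quartile' in the text's terms
```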


c. Calculate the variance and the standard deviation.

Variance (σ²):

● ‘Variance’ measures how far the dataset values are spread out from their average value. In everyday usage, the word means the fact or quality of being inconsistent, divergent or different.

● A lower variance means a narrower spread: the data is more tightly clumped around the mean value.

● Similarly, a high variance indicates a loose, wide spread of the data around the mean value.

● Variance (σ²) is always a non-negative number: σ² ≥ 0.

● ‘Population variance’ is the variance calculated from the complete population data (all n observations, with n in the denominator).

● ‘Sample variance’ is the variance calculated from sample data (with n − 1 in the denominator).

● Complete steps to calculate variance have been shown in the Excel sheet attached.

● Formulae for the sample variance ($\sigma_s^2$) and the population variance ($\sigma_p^2$) are given by,

$$\sigma_s^2 = \frac{\sum (x - \mu)^2}{n - 1} \quad \text{and} \quad \sigma_p^2 = \frac{\sum (x - \mu)^2}{n} = E[X^2] - \mu^2$$

where, µ = average value or arithmetic mean

n = number of observations

x = the value of one observation at a time

E = expected value.

● In the given dataset,

$\sigma_s^2 = 428.8859$ and $\sigma_p^2 = 426.824$

● Note: Since 1/n < 1/(n−1), it follows that $\sigma_p^2 < \sigma_s^2$.

● If the observed data values are in kg, the variance will be in kg², i.e. in the square of the unit of the original data.

Standard Deviation (σ):

● The standard deviation is also a measure of how spread out the numbers are, similar to the variance.

● It is the positive square root of the variance.

● σ is always a non-negative number: σ ≥ 0.

● A low value indicates a narrow spread of the data around the mean value, and vice versa.

● Formulae for the sample standard deviation ($\sigma_s$) and the population standard deviation ($\sigma_p$) are given by,

$$\sigma_s = \sqrt{\frac{\sum (x - \mu)^2}{n - 1}} \quad \text{and} \quad \sigma_p = \sqrt{\frac{\sum (x - \mu)^2}{n}}$$

$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$

where, µ = average value or arithmetic mean

n = number of observations

x = the value of one observation at a time

$\sigma_s$ = sample standard deviation

$\sigma_p$ = population or universal standard deviation.

● Complete steps to calculate SD have been shown in the Excel sheet attached.

● In the given dataset,

$\sigma_s = 20.70956$ and $\sigma_p = 20.65972$.

● Note: Since 1/n < 1/(n−1), it follows that $\sigma_p < \sigma_s$.

● If the observed data values are in kg, the standard deviation will also be in kg, i.e. in the same unit as the original data.

● Another important aspect of the SD is that it indicates the most likely range of the data. Most of the given data lies in the range µ ± σs, i.e. approximately between 43 and 84.
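The difference between the sample and population versions can be seen on a small illustrative dataset, {1,2,3,4,5}, using the standard `statistics` module. The last lines are a consistency check, not a recomputation: they convert the sample variance reported above into the population variance by multiplying by (n − 1)/n.

```python
import statistics

data = [1, 2, 3, 4, 5]  # small illustrative dataset

sample_var = statistics.variance(data)       # divides by n - 1, gives 2.5
population_var = statistics.pvariance(data)  # divides by n, gives 2
sample_sd = statistics.stdev(data)           # square root of 2.5, about 1.58
population_sd = statistics.pstdev(data)      # square root of 2, about 1.41

# Consistency check on the values reported for the 208 employees.
n = 208
reported_sample_var = 428.8859
derived_population_var = reported_sample_var * (n - 1) / n  # about 426.824

print(sample_var, population_var)
print(round(sample_sd, 4), round(population_sd, 4))
print(round(derived_population_var, 3))
```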

Results in R:

3. A. In continuation with the data of performance scores of employees in

previous example, perform the following:

a. Calculate the range and interquartile range.

b. Calculate the z- scores.

c. Calculate the skewness and Kurtosis (using excel).

d. Comment on the distribution of the data.

Range:

● The ‘range’ of a dataset is the difference between its highest and lowest values.

Range = Highest value – Lowest value

Range = 98 – 30 = 68.

● Disadvantage– The range can be misleading as a guide to the values that are likely to be seen in the data.

● In the given data, the possible values that have not been observed are 0-29, 99 and 100. Here, the lowest possible value = 0 and the highest possible value = 100. Hence,

maximum possible range = 100 – 0 = 100.

● For example, for the set {8,11,5,9,1,6,3616}, the range = 3616 – 1 = 3615.

However, except for one data value, all the other values are at most 11. Because a single outlier can dominate the range, the IQR and σ should also be calculated.

Inter-Quartile Range (IQR):

● The ‘inter-quartile range’ (IQR or midspread) is a measure of variability based on dividing the dataset into quartiles.

● IQR is the difference between the largest and smallest values in the middle 50% of the dataset.

Inter-Quartile Range = Third Quartile – First Quartile.

IQR = Q3 – Q1.

Here, Q1 = 45 and Q3 = 83. Therefore,

IQR = 83 – 45 = 38.
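Both measures are simple subtractions, sketched below. The range uses the outlier example from the text; the IQR uses the quartiles already reported for the employee data rather than recomputing them.

```python
data = [8, 11, 5, 9, 1, 6, 3616]  # example set from the text, with one outlier

data_range = max(data) - min(data)  # 3616 - 1 = 3615

# Quartiles reported for the employee dataset in this document.
q1, q3 = 45, 83
iqr = q3 - q1                       # 38

print(data_range, iqr)
```

The single value 3616 drives the range, whereas the IQR depends only on the middle half of the data.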

Z-scores (Standardised values):

● A ‘z-score’ describes the standardised position of a data value, expressed as its distance from the mean value.

● A data value has a positive z-score if it lies above the mean. Similarly, the z-score is negative if the data value lies below the mean.

● The z-score (standard score) is the number of standard deviations by which a data value is above or below the mean (µ).

● Note: The average of all the z-scores must be equal to 0, and the sample SD of all the z-scores must be 1.

● Formula for the z-score of a data value (x) is given by:

$$z = \frac{x - \mu}{\sigma}$$

where, σ = sample standard deviation = 20.70956

µ = arithmetic mean = 63.35096.

● For example, the dataset {1,2,3,4,5} has µ = 3 and σs = 1.58. Hence, the z-scores are: {-1.265, -0.632, 0, 0.632, 1.265}.

● The z-scores have been displayed in the corresponding Excel sheet. For the provided dataset of employees’ performances, some of the z-scores are:
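Separately from the employee z-scores in the Excel sheet, the small {1,2,3,4,5} example can be checked with a short sketch. The helper name `z_scores` is chosen for this illustration; it uses the sample standard deviation, as in the formula above.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / sample SD for each value."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return [(x - mu) / sd for x in values]

z = z_scores([1, 2, 3, 4, 5])
print([round(v, 3) for v in z])  # [-1.265, -0.632, 0.0, 0.632, 1.265]

# The z-scores themselves have mean 0 and sample SD 1 (up to rounding error).
z_mean = statistics.mean(z)
z_sd = statistics.stdev(z)
```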


Skewness:

● ‘Skewness’ refers to the amount of distortion from a symmetric, normal bell curve. It represents the extent to which the distribution of a dataset is asymmetric.

● A normal distribution (bell curve) has a skewness of zero.

● A negative skewness indicates that the tail of the distribution is on the left side. A positive skewness indicates that the tail is on the right.

● Application- Investors note skewness while judging a return distribution, because it considers the extremes (outliers) of the dataset instead of focusing only on the mean. Kurtosis also reflects the extremes and is hence used as an alternative.

● Some formulae for measuring skewness are:

$$\text{Pearson's Mode Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$$

$$S_1 = \frac{\mu - \text{Mode}}{\sigma} = \frac{63.35096 - 53}{20.70956}$$

$S_1 = 0.5$

$$\text{Pearson's Median Skewness} = \frac{3\,(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$

$$S_2 = \frac{3\,(\mu - \text{Median})}{\sigma} = \frac{3\,(63.35096 - 62)}{20.70956}$$

$S_2 = 0.2$

$$\text{Co-efficient of Skewness} = \frac{\sum (x - \mu)^3}{(n - 1)\,\sigma^3} = \frac{\sum (x - 63.35096)^3}{207 \times 20.70956^3}$$

Co-efficient of Skewness = 0.09767 (using MS Excel).

Kurtosis:

● Kurtosis describes the height and sharpness of the central peak, compared to that of a standard bell curve.

● ‘Kurtosis’ describes how heavily or mildly the tails of the data distribution differ from those of a normal (bell) distribution. It indicates whether the tails contain extreme values, and how many.

● Kurtosis is a measure of the ‘tailedness’ (outliers/extremes) of a probability distribution.

● Formula for measuring kurtosis is:

$$\text{Kurtosis} = \frac{\sum (x - \mu)^4}{(n - 1)\,\sigma^4} - 3 = \frac{\sum (x - 63.35096)^4}{207 \times 20.70956^4} - 3$$

Kurtosis = -1.2854 (using MS Excel).
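The moment-based formulas for skewness and kurtosis, exactly as written above, can be sketched in Python on a small symmetric dataset. The function names are chosen for this illustration. Spreadsheet functions may apply additional small-sample corrections, so their output can differ from these plain formulas. The last lines recompute Pearson's two coefficients from the summary statistics reported in this document.

```python
import statistics

def moment_skewness(values):
    """Sum((x - mean)^3) / ((n - 1) * s^3), with s the sample SD."""
    n = len(values)
    mu = statistics.mean(values)
    s = statistics.stdev(values)
    return sum((x - mu) ** 3 for x in values) / ((n - 1) * s ** 3)

def excess_kurtosis(values):
    """Sum((x - mean)^4) / ((n - 1) * s^4) - 3, with s the sample SD."""
    n = len(values)
    mu = statistics.mean(values)
    s = statistics.stdev(values)
    return sum((x - mu) ** 4 for x in values) / ((n - 1) * s ** 4) - 3

data = [1, 2, 3, 4, 5]                  # symmetric illustrative dataset
print(moment_skewness(data))            # 0.0 (symmetric data)
print(round(excess_kurtosis(data), 2))  # -1.64

# Pearson's coefficients from the reported mean, median, mode and sample SD.
mean, median, mode, sd = 63.35096, 62, 53, 20.70956
s1 = (mean - mode) / sd        # about 0.5
s2 = 3 * (mean - median) / sd  # about 0.2
print(round(s1, 1), round(s2, 1))
```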


Data distribution:

● Mean = 63.35096, Median = 62, Mode = 53.

Median < Mean and Mode < Mean.

Hence, it appears that the data may be positively skewed (towards the right).

● If Skewness ∉ (-1, 1): highly skewed distribution.

If Skewness ∈ (-1, -0.5) or Skewness ∈ (0.5, 1): moderately skewed distribution.

If Skewness ∈ [-0.5, 0.5]: approximately symmetric distribution.

● However, Co-efficient of Skewness = 0.097, which lies in [-0.5, 0.5].

Hence, the distribution is approximately symmetric.

● If Kurtosis ≈ 0: Mesokurtic distribution ⇒ Close to Normal distribution.

If Kurtosis > 0: Leptokurtic distribution ⇒ More outliers, heavy tails, risky financial investments.

If Kurtosis < 0: Platykurtic distribution ⇒ Fewer outliers, light tails, desirable financial investments.

● Since Kurtosis = -1.2854 < 0, the data follows a Platykurtic distribution.

● First Quartile = 45 and Third Quartile = 83. IQR = 38.

● Hence, on the performance score scale from 0 to 100, the middle 50% of the data lies between 45 and 83, a span of 38 points, i.e. 38% (nearly 3/8th) of all possible performance scores.

● Range = 68.

Of all the possible scores between 0 and 100, every employee has a score within a 68% range (nearly 2/3rd).

About 1/3rd of the possible scores are not obtained by anybody.
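The classification rules for skewness and kurtosis listed above can be written as two small functions. The function names and the tolerance `tol` used for "approximately zero" kurtosis are assumptions made for this sketch; the document does not specify a tolerance.

```python
def classify_skewness(skew):
    """Apply the skewness bands listed above."""
    if skew <= -1 or skew >= 1:
        return "highly skewed"
    if -0.5 <= skew <= 0.5:
        return "approximately symmetric"
    return "moderately skewed"

def classify_kurtosis(excess, tol=0.1):
    """tol is a hypothetical tolerance for 'approximately zero'."""
    if abs(excess) <= tol:
        return "mesokurtic"
    return "leptokurtic" if excess > 0 else "platykurtic"

print(classify_skewness(0.09767))  # approximately symmetric
print(classify_kurtosis(-1.2854))  # platykurtic
```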

Results in R:

3. B. In continuation with the data of performance scores of employees in

previous example, perform the following:

a. Make the histogram

b. Plot the box-plot diagram

c. Plot the frequency polygon

d. Plot the Ogive diagram.

Histogram:

A ‘histogram’ is a graphical display of the given data using bars whose heights represent frequencies. Taller bars indicate that more data falls into that particular range. A histogram also displays the spread and shape of continuous sampled data.

Box-Plot:

A box-plot graphically depicts groups of data with the help of their quartiles. Box-plots also have extended lines (whiskers) indicating variability outside the first and third quartiles. Hence, it is also known as a ‘box-and-whisker plot’. It shows the minimum, Q1, median, Q3 and maximum, in that order.

Frequency Polygon:

A ‘frequency polygon’ is constructed from the histogram bars (bins/intervals). The height of each bin is converted into a point at the midpoint of the bin, representing its frequency, and these points are joined by lines. A frequency polygon is generally created from the histogram, but it can also be created from the frequency distribution table by calculating the midpoints of the intervals.

Ogive Diagram:

An ‘ogive diagram’ (cumulative frequency polygon) shows cumulative frequencies. The cumulative percentages are added on the graph from left to right (lower range to higher range). An ogive graph has cumulative frequencies on the y-axis and class ranges (boundaries) on the x-axis.
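The cumulative percentages plotted on an ogive can be derived from the bin frequencies. The sketch below uses the seven bin frequencies read from the histogram in this document; the variable names are chosen for the illustration.

```python
from itertools import accumulate

# Bin frequencies as read from the histogram in this document.
bins = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90-98"]
freqs = [36, 27, 32, 33, 21, 26, 33]

total = sum(freqs)                   # 208 employees
cum_freqs = list(accumulate(freqs))  # running totals per bin
cum_pct = [round(100 * c / total, 2) for c in cum_freqs]

print(total)    # 208
print(cum_pct)  # [17.31, 30.29, 45.67, 61.54, 71.63, 84.13, 100.0]
```

These percentages match the values shown on the ogive diagram, and the last one is always 100.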

Conclusion:

● The histogram does not follow a normal distribution.

● Instead, the ogive diagram rises almost linearly, which suggests that the histogram bins are nearly flat rather than bell-shaped, i.e. a nearly equal number of employees falls under each bin.

● The middle 50% of the distribution has scores between 45 and 83 (Q1 and Q3).

● On the scale of 0 to 100, about 1/3rd of the scores have not been obtained by anybody.

● The data is approximately symmetric.

[Figure: Histogram. x-axis: Employees' Performance Scores in bins 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-98; y-axis: Frequency (0 to 40). Bar heights: 36, 27, 32, 33, 21, 26, 33.]

[Figure: Box-Plot, on a scale of 0 to 100.]

[Figure: Frequency Polygon. x-axis: Employees' Performance Scores in bins 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-98; y-axis: Frequency (0 to 40). Point heights: 36, 27, 32, 33, 21, 26, 33.]

[Figure: Ogive Diagram. x-axis: Employees' Performance Scores in cumulative ranges 30-39, 30-49, 30-59, 30-69, 30-79, 30-89, 30-98; y-axis: Cumulative Frequency % (0 to 100). Values: 17.31, 30.29, 45.67, 61.54, 71.63, 84.13, 100.]