Analyzing & Visualizing Data - PPT
PREFERENCES FOR CAR CHOICE IN THE UNITED STATES
Introduction
The most common applications of statistics are describing a data set, regression, and hypothesis testing. The two main branches are descriptive and inferential statistics. People without formal training in statistics are generally more familiar with descriptive statistics than with inferential statistics. In this paper, the data are analyzed using descriptive statistics, so we focus on the descriptive branch.
Descriptive Statistics Definition
Descriptive statistics are the type of statistical analysis that describes data in a meaningful way. They quantify the essential features of the data and summarize both the sample and the observations made. These summaries can be either graphical or quantitative.
Background
This study analyzes and visualizes a data set on preferences for car choice in the United States. The data set contains 4,654 observations and 71 columns. Several types of graphs help describe statistical data, including the histogram, bar graph, box-and-whisker plot, line graph, scatter plot, ogive, and pie chart. Generally, the kinds of measurements that can be used with descriptive statistics are:
Measures of central tendency describe the center of a frequency distribution. The main measures of central tendency are the mean, median, and mode (Nick, 2020).
Measures of spread describe how the scores are distributed across the entire distribution. They include the standard deviation, variance, quartiles, range, and absolute difference.
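The measures listed above can be illustrated with a minimal Python sketch. It is not the code used in this paper (the analysis here was done in R), and the sample values are hypothetical, chosen only to show how each measure is computed with Python's standard `statistics` module.

```python
import statistics

# Hypothetical sample (for illustration only; not taken from the data set).
sample = [50, 75, 100, 125, 125, 150, 200, 250, 300]

# Measures of central tendency
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)

# Measures of spread
value_range = max(sample) - min(sample)
variance = statistics.variance(sample)           # sample variance (n - 1 denominator)
std_dev = statistics.stdev(sample)               # sample standard deviation
q1, q2, q3 = statistics.quantiles(sample, n=4)   # quartile cut points

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} variance={variance:.2f} std_dev={std_dev:.2f}")
print(f"quartiles: Q1={q1} Q2={q2} Q3={q3}")
```

The sample variance and standard deviation divide by n - 1; `statistics.pvariance` and `statistics.pstdev` would be the population versions.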
Data Analysis
One of the essential concepts of statistics is data analysis: the process of inspecting, analyzing, and modeling data. Its purpose is to extract useful information and draw conclusions that support decision-making. Data analysis can be performed with several techniques and approaches, using analytical and logical reasoning to examine each component of the data. Data from various sources are collected, reviewed, and then interpreted to support decisions or conclusions. Data mining, text analytics, business intelligence, and data visualization are some of the most commonly used techniques.
Data analysis aims to collect raw data and convert it into information useful for decision-making. The stages of data analysis are as follows:
i) To make sense of each data collection,
ii) To look for patterns and relationships both within a collection and across groups,
iii) To make general discoveries about the phenomena being researched.
Before further analysis, I compactly display the structure of the given data set.
The below list describes the data contents:
Descriptive summary of the data set, produced with R:
Figure 1.1 : Car Frame
Figure 1.2 : Price Range
Figure 1.3 : Pollution and Speed
Figure 1.4 : Pollution and Size
Figure 1.5: descriptive table for summary.data.frame(Car)
Table 1.1: Abstract table for price
Table 1.2: Abstract table for acceleration
From the descriptive summary table, the mean price of the vehicle divided by the logarithm of income is 4.296 for price1, 4.173 for price3, and 4.150 for price5. I excluded price2, price4, and price6 because they have the same mean and median as price1, price3, and price5, respectively. The range variables give the hundreds of miles a vehicle can travel between refueling or recharging. The mean is 160.49 for range1, 240.38 for range3, and 312.03 for range5.
Data Visualization
Data visualization is the representation of data or information in a chart, graph, or other visual format. It communicates relationships in the data through images. We need data visualization because a visual summary makes it easier to identify trends and patterns than scanning thousands of rows in a spreadsheet. It matches how the human brain works. Since the purpose of data analysis is to gain insights, data are much more valuable when visualized. Even if an analyst can extract insights from data without visualization, it is harder to communicate their meaning without it. Charts and graphs make communicating findings easier even when the patterns can be identified without them (Sheskin, 2017).
Visualization matters because it allows trends and patterns to be seen more easily. With the rise of big data, we need to be able to interpret increasingly large batches of data. Machine learning makes it easier to conduct analyses such as predictive analysis, whose results can then be presented as helpful visualizations.
Visualizing and Analyzing Categorical Variables
Figure 2.1: Choice of a vehicle among six propositions
From the pie chart, we can create a table for better understanding.
Choice 5 has the highest percentage, followed by choice 3, while choice 2 has the lowest.
Table 1.3 : Choice of a vehicle among six propositions
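As a check on the percentages, the short Python sketch below recomputes them from the counts reported in the choice table. It is an illustration only, not the code used to produce the table, and the variable names are assumptions made for this example.

```python
# Counts per choice as reported in the paper's choice table.
counts = {
    "choice1": 887,
    "choice2": 269,
    "choice3": 1345,
    "choice4": 349,
    "choice5": 1499,
    "choice6": 305,
}

total = sum(counts.values())  # number of observations

# Share of each choice, rounded to the nearest whole percent.
percentages = {name: round(100 * n / total) for name, n in counts.items()}

most_chosen = max(counts, key=counts.get)
least_chosen = min(counts, key=counts.get)

for name, n in counts.items():
    print(f"{name}: {n:5d}  {percentages[name]:3d}%")
print(f"most chosen: {most_chosen}, least chosen: {least_chosen}")
```

Because each percentage is rounded separately, the rounded values need not sum to exactly 100.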
Variables: college education, household size greater than 2, and commute shorter than 5 miles a day.
Here 0 represents No, and 1 represents Yes.
Figure 2.2: College Figure 2.3: Households
Figure 2.4 : column5
The table below summarizes the three charts:
Table 1.3: column5
Variable types
Type: body type, one of regular car, sport utility vehicle, sports car, station wagon, truck, or van, for each proposition z from 1 to 6.
Figure 2.5 :Type 1 Figure 2.6 :Type 2
Figure 2.7 :Type 3 Figure 2.8 :Type 4
Figure 2.9 :Type 5 Figure 3.0 :Type 6
The summary table for the type variables is given below.
Table 1.4: Summary Variable
The most preferred body type in the United States is the regular car, followed by the truck.
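The ranking can be reproduced from the body-type counts with a few lines of Python. This is a sketch for illustration, not the author's code. The counts are those read from the paper's body-type summary table, and the variable names are assumptions.

```python
# Counts per body type for propositions 1 to 6, as read from the paper's
# body-type summary table.
type_counts = {
    "van":      [410, 928, 410, 1862, 410, 972],
    "regcar":   [3138, 769, 3138, 362, 3138, 385],
    "truck":    [487, 1851, 487, 1175, 487, 1141],
    "sportuv":  [283, 35, 283, 57, 283, 107],
    "stwagon":  [137, 991, 137, 1124, 137, 1920],
    "sportcar": [199, 80, 199, 74, 199, 129],
}

# Row totals: how often each body type appears across all six propositions.
row_totals = {body: sum(counts) for body, counts in type_counts.items()}

# Column totals: each proposition should account for every observation.
column_totals = [sum(col) for col in zip(*type_counts.values())]

# Body types ordered from most to least frequent.
ranking = sorted(row_totals, key=row_totals.get, reverse=True)

print(row_totals)
print(column_totals)
print(ranking)
```

Each column total equals the number of observations (4,654), which is a quick consistency check on the table.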
Figure 3.1 :Type Fuel 1 Figure 3.2 :Type Fuel 2
Figure 3.3 :Type Fuel 3 Figure 3.4 :Type Fuel 4
Figure 3.5 :Type Fuel 5 Figure 3.6 :Type Fuel 6
The table below, derived from the charts, summarizes the fuel variables.
Table 1.5: Summary Variable
CNG is the most common fuel, while gasoline is the least common.
Variables: acceleration, the tens of seconds required to reach 30 mph from a stop, and speed, the highest attainable speed in hundreds of mph.
Figure 3.7 : Car Data Figure 3.8 :Car speed
Figure 3.9 :Car vs speed
From the summary table, we can conclude that.
Table 1.6: Summary Pollution
Sizes: 0 for a mini, 1 for a subcompact, 2 for a compact, and 3 for a mid-size or large vehicle.
Figure 4.0 :Car vs speed
A bar chart shows the relationship between discrete categories. One axis represents the categories being compared, and the other represents a measured value. The chart above shows that, for the size variable, the most preferred option is a mid-size or large vehicle, while the least preferred is the mini.
Space: Fraction of luggage space in a comparable new gas vehicle.
Table 1.7: Luggage space
Costs: cost per mile of travel (tens of cents): home recharging for an electric vehicle, station refueling otherwise
Stations: A fraction of stations that can refuel/recharge the vehicle
Table 1.8: Station refuel or recharge
A scatter plot, or scatter graph, is a visual representation of two variables (here, cost and speed) in a data set. The plot uses Cartesian coordinates, with the independent variable x (speed) on the horizontal axis and the dependent variable y (cost) on the vertical axis. The correlation coefficient (r) measures the strength of the linear relationship between two variables and ranges from -1 to 1. The correlation coefficient between cost and speed is 0.145011, which indicates a weak positive relationship between them, as the scatter plot also suggests.
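For reference, the Pearson correlation coefficient can be computed as in the Python sketch below. This is not the procedure used to obtain the value reported in this paper. The function name `pearson_r` and the sample values are assumptions made for this example, and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical speed and cost values (not taken from the data set).
speed = [55, 85, 95, 110, 140]
cost = [4, 2, 6, 4, 8]

r = pearson_r(speed, cost)
print(f"r = {r:.4f}")
```

A value near 0 indicates a weak linear relationship, and values near -1 or 1 indicate a strong one. The sign gives the direction.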
Conclusion
Based on the analysis, we can conclude that the mean price of the vehicle divided by the logarithm of income is 4.296 for price1, 4.173 for price3, and 4.150 for price5. We excluded price2, price4, and price6 because they have the same mean and median as price1, price3, and price5, respectively.
The most preferred choice is choice 5, and the least preferred is choice 2. Of the respondents, 23% are not college-educated, while 77% are. 22% have a household size greater than 2, and 78% have a household of 2 or fewer. In the sample, 36% commute less than 5 miles a day, while 64% commute 5 miles or more. The most preferred vehicle is the regular car, the most common fuel is CNG, and the least common fuel is gasoline. The correlation coefficient (r) between cost and speed is 0.145011, which indicates a weak positive relationship between cost and speed.
References
Reid, H. (2013, August). Introduction to Statistics. SAGE Publication.
Jackson, S. L. (2017). Statistics plain and simple. Boston, MA: Cengage Learning.
Alan, J. (2018). Ohio touts successes against human trafficking. Ohio: The Columbus Dispatch.
Erik, M. (2017). Regression Analysis. Market Research, 12(7), 31.
Fishe, R. (2016). The social relationship between the teenager's psychological changes and physiological changes. Journal of medical statistics, 11(2), 32.
Sheskin, D. J. (2017). Handbook of parametric and nonparametric statistical procedures. New York: CRC Press.
Jackson, S. (2017). Statistics plain and simple. Cengage Learning. Retrieved from phoenix.vitalsource.com/#/books/9781337681728/cfi/6/8!/4/4@0:5.88
Statistic  price1    price3    price5    range1    range3    range5
Minimum    0.598726  0.598726  0.635196  50        75        250
Mean       4.296262  4.173241  4.149952  160.4856  240.3792  312.0327
Median     4.138684  4.039575  4.039575  125       250       300
Maximum    17.37056  17.37056  17.37056  300       400       400
Statistic  acc1     acc3      acc5      speed1    speed3    speed5
Minimum    2.5      2.5       2.5       55        85        85
Mean       4.17254  4.272991  4.054469  84.66695  107.3055  107.3421
Median     4        4         4         85        95        95
Maximum    6        6         6         140       140       140
Choice   Count  Percent
choice1  887    19%
choice2  269    6%
choice3  1345   29%
choice4  349    7%
choice5  1499   32%
choice6  305    7%
Variable  Value  Count  Percent
college   0      1079   23%
college   1      3575   77%
hsg2      0      3621   78%
hsg2      1      1033   22%
coml5     0      2989   64%
coml5     1      1665   36%
Body type  type1  type2  type3  type4  type5  type6  Total
van        410    928    410    1862   410    972    4992
regcar     3138   769    3138   362    3138   385    10930
truck      487    1851   487    1175   487    1141   5628
sportuv    283    35     283    57     283    107    1048
stwagon    137    991    137    1124   137    1920   4446
sportcar   199    80     199    74     199    129    880
Total      4654   4654   4654   4654   4654   4654   27924
Fuel      fuel1  fuel2  fuel3  fuel4  fuel5  fuel6
cng       1178   1178   2330   2330   -      -
methanol  3476   3476   -      -      -      -
electric  -      -      2324   2324   1175   1175
gasoline  -      -      -      -      3479   3479
Statistic  pollution1  pollution2  pollution3  pollution4  pollution5  pollution6
Mean       0.0853      0.0853      0.4137      0.4137      0.5941      0.5941
Median     0           0           0.4         0.4         0.6         0.6
Mode       0           0           0.4         0.4         0.25        0.25
Minimum    0           0           0.1         0.1         0.25        0.25
Maximum    0.6         0.6         0.75        0.75        1           1
Statistic  space1    space2    space3    space4    space5  space6
Mean       0.850774  0.850774  0.925677  0.925677  1       1
Median     1         1         1         1         1       1
Minimum    0.7       0.7       0.7       0.7       1       1
Maximum    1         1         1         1         1       1
Statistic  station1  station2  station3  station4  station5  station6
Mean       0.089514  0.089514  0.382768  0.382768  0.823915  0.823915
Median     0         0         0.3       0.3       1         1
Minimum    0         0         0.1       0.1       0.1       0.1
Maximum    0.7       0.7       0.7       0.7       1         1