Big Data Analysis
Big Data Analytics Assignment 1
1 Description
The diamonds dataset contains the prices and other attributes of almost 54,000 diamonds. Your task is to perform an Exploratory Data Analysis (EDA) on the dataset. Submit a report summarising your findings, together with the source code, by the deadline: 05-11-2017. This assignment is part of the continuous assessment and is worth 30% of the module grade.
2 Dataset
To load the data set, run the following commands:
library(ggplot2)
data(diamonds)
The dataset is available under the name diamonds. A basic summary of the data can be accessed using ?diamonds. To see the structure of the data, use str(diamonds). Because the dataset is large, generating the plots may take a few moments. We'll take a sample of 10,000 observations (without replacement) using the sample function. Run the following code to do so:
s <- sample(nrow(diamonds), size=10000, replace = FALSE, prob = NULL)
diamonds.subset <- diamonds[s, ]
The code above creates a new dataset named diamonds.subset containing 10,000 observations from the diamonds dataset. You can use either the sampled dataset (diamonds.subset) or the full dataset (diamonds) to complete the task given below.
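Because sample draws rows at random, each run produces a different subset, so figures in the report may not match a later re-run of the source code. A minimal sketch of a reproducible version follows; the seed value 42 is an arbitrary assumption and is not part of the assignment.

```r
library(ggplot2)  # provides the diamonds dataset
data(diamonds)

# Fix the random number generator state so the same rows are drawn every run.
# The seed value is an arbitrary choice (assumption); any fixed integer works.
set.seed(42)

s <- sample(nrow(diamonds), size = 10000, replace = FALSE)
diamonds.subset <- diamonds[s, ]

# Sanity checks: the subset has 10,000 rows and no row index was drawn twice
nrow(diamonds.subset)   # 10000
anyDuplicated(s)        # 0 means no duplicates
```

If you use the sampled dataset, stating the seed in the report lets a reader regenerate exactly the same subset.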
3 Task
Your task is to perform EDA and calculate the strength of relationships between the variables of the dataset. Use the following as a guideline:
1. Begin your analysis with a summary of the variables (use basic statistical methods). Briefly describe your observations. (10 points)
2. Prepare 4 plots: pie chart, bar chart, histogram, scatter plot. Each plot should display different variables (do not use the price variable yet). Each plot must have a title and meaningful labels. (20 points)
3. Focus your analysis on the price variable: (20 points)
(a) Show the histogram of the price variable. Describe it briefly. Include summary statistics like mean, median, and variance.
(b) Group diamonds by some price ranges (like low, medium, high, etc.) and summarise those groups separately.
(c) Explore prices for different cut types. You might want to use a boxplot.
(d) How are the different attributes correlated with the price? Which 2 are correlated the most?
4. List the frequencies of diamonds for the various cuts and clarity levels. Create 2 scatter plots and colour the diamond prices by clarity and by cut. (10 points)
5. Now focus your analysis on the carat, depth, table and dimensions (x, y, z) variables: (30 points)
(a) Compute a volume variable from x, y, z - add it to the dataset. Plot it against the price. Describe your findings.
(b) Are the carat and volume attributes correlated? Is that a strong relationship? Draw a plot with regression line.
(c) Explore the relationship between the table and depth variables. Now explore the relationships between table and the remaining variables. Compute correlations and describe your findings.
6. You can get up to 10 points for clarity and quality of the report and the source code.
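As a starting point for items 3(b), 5(a) and 5(b), the sketch below shows one possible way to bin prices, derive a volume variable and measure its correlation with carat in base R. The price break points (1,000 and 5,000), the data frame name d, and the column names price.group and volume are illustrative assumptions, not requirements of the assignment; volume is approximated here as the simple product of the three dimensions.

```r
library(ggplot2)  # used here only for the diamonds dataset
data(diamonds)

d <- as.data.frame(diamonds)  # working copy (hypothetical name)

# 3(b): group diamonds into price ranges.
# The break points are illustrative assumptions; choose your own and justify them.
d$price.group <- cut(d$price,
                     breaks = c(0, 1000, 5000, Inf),
                     labels = c("low", "medium", "high"))
table(d$price.group)                                     # group sizes
aggregate(price ~ price.group, data = d, FUN = median)   # median price per group

# 5(a): volume approximated as the product of the dimensions x, y and z
d$volume <- d$x * d$y * d$z

# 5(b): Pearson correlation, then a scatter plot with a fitted regression line
cor(d$carat, d$volume)
fit <- lm(volume ~ carat, data = d)
plot(d$carat, d$volume,
     main = "Volume against carat",
     xlab = "Carat", ylab = "Volume (x * y * z)")
abline(fit, col = "red")
```

Before interpreting the correlation, it is worth checking the x, y and z columns for zero or implausible values, since such rows would distort both the computed volume and the fitted line.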