RStudio
MidTermProject
GlobalMusicData
5/27/2022
Introduction
Music is a language of expression in today’s world, and many musicians make music to send specific messages to politicians and public figures.
The purpose of this project is to perform data analysis and visualization on the GlobalMusicData data set. The data set provides detailed information about artists and their tracks, including track names, albums, playlists, genres, and more, for different artists since the year 1993. It is interesting to explore the music that everyone listens to, and the data set, which comes from the Spotify API and which we downloaded from Canvas, contains very useful data for working out the popularity of that music. The goal is to provide analytical statistics that show music lovers just how useful music is.
We want to see how music becomes publicly famous and which types of music, such as pop, rap, and country, gain the most popularity around the world.
We used R to perform data analysis and visualization in order to explore and identify trends in the artists' tracks and to uncover insights, through the following steps:
- Load Required Packages
- Clean Up and Prepare Data for Analysis
- Perform Exploratory Data Analysis
- Data Visualization
More Data
For more information about the Spotify API, click here: API
Packages Required
library(readr)     # used to read CSV files
library(plotly)    # used to make interactive, publication-quality graphs
library(tidyr)     # used to tidy up data
library(GGally)    # extension of ggplot2 with additional functions
library(prettydoc) # document themes for R Markdown
library(DT)        # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) # used for date/time functions
library(magrittr)  # used for piping
library(ggplot2)   # used for data visualization
library(dplyr)     # used for data manipulation
Data Preparation
The following code was used to evaluate the variables in the source data. The data set contains a total of 32,833 observations and 23 variables, which are listed below.
# Importing the data
data <- read.csv("Global Music Data.csv")
Data Cleaning
# Computing summary statistics for the variables
datatable(summary(data))
# Identifying the data types of each variable
# (str() prints its result and returns NULL, so it is not wrapped in datatable())
str(data)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "14/6/2019" "13/12/2019" "5/7/2019" "19/7/2019" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
# Identifying missing data
# Number of missing values in this data frame
sum(is.na(data))
## [1] 15
# Count the number of missing values per column
colSums(is.na(data))
##                 track_id               track_name             track_artist
##                        0                        5                        5
##         track_popularity           track_album_id         track_album_name
##                        0                        0                        5
## track_album_release_date            playlist_name              playlist_id
##                        0                        0                        0
##           playlist_genre        playlist_subgenre             danceability
##                        0                        0                        0
##                   energy                      key                 loudness
##                        0                        0                        0
##                     mode              speechiness             acousticness
##                        0                        0                        0
##         instrumentalness                 liveness                  valence
##                        0                        0                        0
##                    tempo              duration_ms
##                        0                        0
# Return the column names without missing values
# (names() applied to the logical vector alone would return every column name,
# so which() is used to keep only the columns with zero missing values)
names(which(colSums(is.na(data)) == 0))
##  [1] "track_id"                 "track_popularity"
##  [3] "track_album_id"           "track_album_release_date"
##  [5] "playlist_name"            "playlist_id"
##  [7] "playlist_genre"           "playlist_subgenre"
##  [9] "danceability"             "energy"
## [11] "key"                      "loudness"
## [13] "mode"                     "speechiness"
## [15] "acousticness"             "instrumentalness"
## [17] "liveness"                 "valence"
## [19] "tempo"                    "duration_ms"
# Read first 10 rows of the cleaned data set
datatable(head(data, 10), options = list(scrollX = TRUE, pageLength = 5))
# Read last 10 rows of the cleaned data set
datatable(tail(data, 10), options = list(scrollX = TRUE, pageLength = 5))
We used the following code to tidy up our data:
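The tidy-up code itself is not reproduced in this section. As a minimal sketch of what such a step might look like, the example below drops the rows with missing track names, artists, or album names, parses the release date, and converts the genre to a factor. The data frame `toy`, the object `clean`, and the column `release_year` are hypothetical names introduced for illustration, and the sketch assumes that release dates are written as day/month/year, as in the `str()` output above.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy rows that mimic a few columns of the real data set
toy <- data.frame(
  track_name = c("Song A", NA, "Song C"),
  track_artist = c("Artist A", NA, "Artist C"),
  track_album_name = c("Album A", NA, "Album C"),
  track_album_release_date = c("14/6/2019", "13/12/2019", "5/7/2019"),
  playlist_genre = c("pop", "pop", "rap"),
  stringsAsFactors = FALSE
)

clean <- toy %>%
  # Remove rows where any of the three text columns is missing
  drop_na(track_name, track_artist, track_album_name) %>%
  mutate(
    # Assumption: dates are day/month/year; unparseable values become NA
    track_album_release_date = dmy(track_album_release_date, quiet = TRUE),
    # Hypothetical helper column for year-based summaries
    release_year = year(track_album_release_date),
    playlist_genre = factor(playlist_genre)
  )

print(clean)
```

On the real data, `toy` would be replaced by `data`. Any release dates that do not follow the day/month/year pattern would come back as `NA` and would need separate handling.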
Proposed Exploratory Data Analysis and Data Visualization
pairs(~danceability+energy+key+loudness,data = data, main = "Scatterplot Matrix For GlobalMusicData")
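As a numeric companion to the scatterplot matrix, the pairwise correlations among the same four variables can be computed with base R's `cor()`. The sketch below uses only the first five values of each variable as printed in the `str()` output, not the full CSV, so the resulting numbers are illustrative only. The names `audio_sample` and `cor_mat` are hypothetical.

```r
# First five values of each variable, copied from the str() output
audio_sample <- data.frame(
  danceability = c(0.748, 0.726, 0.675, 0.718, 0.65),
  energy       = c(0.916, 0.815, 0.931, 0.93, 0.833),
  key          = c(6, 11, 1, 7, 1),
  loudness     = c(-2.63, -4.97, -3.43, -3.78, -4.67)
)

# Pearson correlation matrix for the four audio features
cor_mat <- cor(audio_sample)
print(round(cor_mat, 2))
```

On the full data set, the equivalent call would be `cor(data[, c("danceability", "energy", "key", "loudness")])`.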
ggplot(data, aes(x = playlist_genre,y=track_popularity)) +
#customize bars
geom_bar(color="black",
fill = "pink",
width= 0.5,
stat='identity') +
#adding values numbers
geom_text(aes(label = track_popularity),
vjust = -0.25) +
#customize x,y axes and title
ggtitle("Track Popularity by Playlist Genre") +
xlab("Playlist genre") +
ylab("Popularity of the Track") +
#change font
theme(plot.title = element_text(color="black", size=14, face="bold", hjust = 0.5 ),
axis.title.x = element_text(color="black", size=11, face="bold"),
axis.title.y = element_text(color="black", size=11, face="bold"))
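With `stat = 'identity'`, `geom_bar()` stacks every row, so each bar's height is the sum of `track_popularity` over all tracks in a genre, and `geom_text()` draws one label per track. If the aim is to compare genres, one option is to aggregate first and plot one value per genre. The sketch below shows that alternative on a small made-up data frame. `toy`, `genre_means`, and `mean_popularity` are hypothetical names, the popularity values are invented, and the choice of the mean as the summary is an assumption, not the method used in the report.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same two columns used in the plot
toy <- data.frame(
  playlist_genre = c("pop", "pop", "rap", "rap", "rock"),
  track_popularity = c(66, 70, 40, 60, 55)
)

# One row per genre holding the mean track popularity
genre_means <- toy %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity), .groups = "drop")

# geom_col() plots the values as given, one bar and one label per genre
p <- ggplot(genre_means, aes(x = playlist_genre, y = mean_popularity)) +
  geom_col(color = "black", fill = "pink", width = 0.5) +
  geom_text(aes(label = round(mean_popularity, 1)), vjust = -0.25) +
  labs(title = "Mean Track Popularity by Playlist Genre",
       x = "Playlist genre",
       y = "Mean popularity of the track")

print(p)
```

On the real data, `toy` would be replaced by `data`. A median could be substituted for the mean if the popularity distribution is strongly skewed.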
# Bar chart of the number of tracks in each playlist genre
ggplot(data, aes(x = playlist_genre)) +
geom_bar()
# Box plots
bp <- ggplot(data, aes(x = duration_ms, y = playlist_genre, fill = playlist_genre)) +
geom_boxplot() +
labs(title = "Track Duration by Playlist Genre", x = "Duration (ms)", y = "Playlist genre")
bp + theme_classic()