zee14
MidTermProject

GlobalMusicData
5/27/2022

Introduction

Music is a language of expression in today's world, and many musicians make music to send specific messages to politicians and public figures.

The purpose of this project is to perform data analysis and visualization on the GlobalMusicData data set. The data set provides detailed information about artists and their tracks, genres, and playlists, including track names, albums, playlists, genres, and more, for different artists since 1993. It is interesting to see which music everyone listens to, based on playlists. The data comes from the Spotify API and was downloaded from Canvas; it includes a popularity score for each track, so we can examine how popular the music is. The goal is to provide analytical statistics that help music lovers see just how useful music is.

We want to see how music becomes publicly famous and which types of music, such as pop, rap, country, and many others, gain the most popularity around the world.

We used R to perform data analysis and visualization, to explore and identify trends in the artists' tracks, and to uncover insights through the following steps:

  • Load Required Packages
  • Clean Up and Prepare Data for Analysis
  • Perform Exploratory Data Analysis
  • Data Visualization

More Data

For more information about the Spotify API, click here: API

Packages Required

The following packages are required for this analysis:

library(readr)  #used to read csv file
library(plotly) #used to make interactive, publication-quality graphs.
library(tidyr) # used to tidy up data
library(GGally) #extension of ggplot2 with functions
library(prettydoc) # document themes for R Markdown
library(DT) # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) # used for date/time functions
library(magrittr) # used for piping
library(ggplot2) # used for data visualization
library(dplyr) # used for data manipulation

Data Preparation

The following code is used to evaluate the variables in the source data. The data set contains a total of 32,833 observations and 23 variables, which are listed below.

# Importing the data
data <- read.csv("Global Music Data.csv")
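
As a quick check that an import produced the expected shape, the dimensions and column classes can be inspected right after reading the file. The sketch below is only illustrative: it writes a tiny, made-up CSV to a temporary file (a hypothetical stand-in for "Global Music Data.csv") and assumes that empty fields should be treated as missing.

```r
# Hypothetical stand-in for the real file: a tiny CSV written to a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "track_id,track_name,track_popularity,playlist_genre",
  "a1,Song A,66,pop",
  "b2,Song B,70,rap",
  "c3,,55,rock"
), tmp)

# na.strings = c("", "NA") treats empty fields as missing values (an assumption)
toy <- read.csv(tmp, na.strings = c("", "NA"), stringsAsFactors = FALSE)

dims <- dim(toy)                 # c(number of rows, number of columns)
col_types <- sapply(toy, class)  # one class name per column

print(dims)
print(col_types)
```

Applied to the real data, `dim(data)` should agree with the observation and variable counts reported by `str(data)`.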

Data Cleaning

#Computing summary statistics for the variables
datatable(
  summary(data)
)
#Identifying the data types of each variable
#str() prints its report and returns NULL, so it is called directly instead of being wrapped in datatable()
str(data)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "14/6/2019" "13/12/2019" "5/7/2019" "19/7/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
#Identifying missing data

#number of missing values in this data frame.
sum(is.na(data))
## [1] 15
#Count the number of missing values per column
colSums(is.na(data))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
#Return the names of the columns that contain missing values
#which() keeps only the TRUE entries; names() on the full logical vector would return every column
names(which(colSums(is.na(data)) > 0))
## [1] "track_name"       "track_artist"     "track_album_name"
# Display the first 10 rows of the data set

datatable(head(data, 10),options = list(scrollX=TRUE, pageLength=5))
# Display the last 10 rows of the data set

datatable(tail(data, 10),options = list(scrollX=TRUE, pageLength=5))

We used the following code to tidy up our data:
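
The tidying code itself does not appear in this report. As a rough sketch of what such a step might look like, the example below works on a small, made-up data frame (`toy`) whose column names mirror those in the `str()` output above. The specific steps (dropping incomplete rows, parsing day/month/year dates, converting genre to a factor, adding a `duration_min` column) are assumptions for illustration, not the original code.

```r
# Made-up rows using column names from the str() output above
toy <- data.frame(
  track_name = c("Song A", NA, "Song C"),
  track_artist = c("Artist A", NA, "Artist C"),
  track_album_release_date = c("14/6/2019", "13/12/2019", "5/7/2019"),
  playlist_genre = c("pop", "pop", "rap"),
  duration_ms = c(194754L, 162600L, 176616L),
  stringsAsFactors = FALSE
)

# 1. Drop rows that contain any missing value
clean <- toy[complete.cases(toy), ]

# 2. Parse day/month/year strings into Date objects
clean$track_album_release_date <- as.Date(clean$track_album_release_date,
                                          format = "%d/%m/%Y")

# 3. Treat the genre as a categorical variable
clean$playlist_genre <- factor(clean$playlist_genre)

# 4. Add a duration column in minutes (hypothetical helper column)
clean$duration_min <- clean$duration_ms / 60000

str(clean)
```

Since lubridate is already loaded, `dmy()` would be an alternative to `as.Date()` for the date step. Any release dates that are not in day/month/year form would become `NA` and would need separate handling.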

Proposed Exploratory Data Analysis and Data Visualization

The plots below explore the relationships among the audio features, track popularity, and playlist genres:

pairs(~danceability+energy+key+loudness,data = data,
   main = "Scatterplot Matrix For GlobalMusicData")
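
A scatterplot matrix can be hard to read with tens of thousands of points, so a numeric correlation matrix is a common companion. The sketch below uses randomly generated stand-in values (not the real data) for the same four variables. With the real data frame, the equivalent call would presumably be `cor(data[, c("danceability", "energy", "key", "loudness")])`.

```r
# Randomly generated stand-in values; only the column names match the data set
set.seed(1)
toy <- data.frame(
  danceability = runif(50),
  energy = runif(50),
  key = sample(0:11, 50, replace = TRUE),
  loudness = -runif(50, min = 0, max = 20)
)

# Pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(toy), 2)
print(cor_mat)
```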

#stat = 'identity' would stack every track's popularity and print one label per track,
#so both layers plot the mean popularity per genre instead
ggplot(data, aes(x = playlist_genre, y = track_popularity)) +
#customize bars
  geom_bar(color = "black",
           fill = "pink",
           width = 0.5,
           stat = "summary",
           fun = "mean") +
#add the mean value above each bar
  geom_text(aes(label = round(after_stat(y), 1)),
            stat = "summary",
            fun = "mean",
            vjust = -0.25) +
#customize x,y axes and title
  ggtitle("Mean Track Popularity by Playlist Genre") +
  xlab("Playlist genre") +
  ylab("Mean popularity of the track") +
#change font
  theme(plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        axis.title.x = element_text(color = "black", size = 11, face = "bold"),
        axis.title.y = element_text(color = "black", size = 11, face = "bold"))

Bar Chart of Track Counts by Playlist Genre


ggplot(data, aes(x=playlist_genre)) +geom_bar()
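
The counts behind a bar chart like this can also be tabulated directly, which makes it easier to quote exact numbers. The sketch below uses a short, made-up genre vector. With the real data, the input would presumably be `data$playlist_genre`.

```r
# Made-up genre labels standing in for data$playlist_genre
toy_genre <- c("pop", "rap", "pop", "rock", "rap", "pop")

# Count tracks per genre and sort from most to least frequent
genre_counts <- sort(table(toy_genre), decreasing = TRUE)
print(genre_counts)
```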

# Box plots


bp <- ggplot(data, aes(x=duration_ms, y=playlist_genre, fill=playlist_genre)) + 
  geom_boxplot()+
  labs(title="Track Duration by Playlist Genre",x="Duration (ms)", y = "Playlist genre")
bp + theme_classic()
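
The centre line of each box is the median duration for that genre. A small table of medians, converted to minutes, can make the comparison easier to read. The sketch below uses made-up durations, and `duration_min` is a hypothetical helper column that is not part of the data set.

```r
# Made-up durations in milliseconds for three genres
toy <- data.frame(
  playlist_genre = c("pop", "pop", "rap", "rap", "rock"),
  duration_ms = c(180000, 200000, 210000, 240000, 300000)
)

# Median duration per genre
duration_summary <- aggregate(duration_ms ~ playlist_genre, data = toy, FUN = median)

# Convert milliseconds to minutes (60,000 ms per minute)
duration_summary$duration_min <- duration_summary$duration_ms / 60000
print(duration_summary)
```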