---
title: "MidTermProject"
author: "GlobalMusicData"
date: "5/27/2022"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(knitr.duplicate.label = "allow")
```

## **Introduction**

Music is a language of expression in today's world, and many musicians make music to send specific messages to politicians and public figures. The purpose of this project is to perform data analysis and visualization for the GlobalMusicData data set. The data set describes artists in detail, together with their tracks, genres, and playlists. It contains track names, albums, playlists, genres, and more for different artists since 1993. It is interesting to see what everyone listens to, and the Spotify API data we downloaded from Canvas is useful for working out how popular each piece of music is.

The goal is to give music lovers analytical statistics about the music they listen to. We want to see how music becomes publicly famous and which types of music, such as pop, rap, and country, gain the most popularity around the world. We used R to perform data analysis and visualization, to explore and identify trends in the artists' tracks, and to uncover insights through the following steps:

* Load Required Packages
* Clean Up and Prepare Data for Analysis
* Perform Exploratory Data Analysis
* Data Visualization

## **More Data**

For more information about the Spotify Web API, see the [API reference](https://developer.spotify.com/documentation/web-api/reference/#/).

## **Packages Required**

```{r, echo=TRUE, warning=FALSE, message=FALSE}
library(readr)     # used to read CSV files
library(plotly)    # used to make interactive, publication-quality graphs
library(tidyr)     # used to tidy up data
library(GGally)    # extension of ggplot2 with additional functions
library(prettydoc) # document themes for R Markdown
library(DT)        # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) # used for date/time functions
library(magrittr)  # used for piping
library(ggplot2)   # used for data visualization
library(dplyr)     # used for data manipulation
```

## **Data Preparation**

The following is the code used to evaluate the variables in the source data. We noted that there is a total of 32,833 observations in the data set, and 23 variables, which are listed below.

```{r, echo=TRUE, warning=TRUE, message=FALSE}
# Import the data
data <- read.csv("Global Music Data.csv")
```

```{r, echo=FALSE, warning=TRUE, message=FALSE}
# Find total number of observations
# nrow(data)

# Variable names and their descriptions
values_table1 <- rbind(
  c("track_id", "track_name", "track_artist", "track_popularity",
    "track_album_id", "track_album_name", "track_album_release_date",
    "playlist_name", "playlist_id", "playlist_genre", "playlist_subgenre",
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
    "duration_ms"),
  c("Unique ID for track",
    "Name of the track",
    "The artist for each track",
    "Song popularity (0-100) where higher is better",
    "Album unique ID",
    "Song album name",
    "Date when album released",
    "Name of playlist",
    "Playlist ID",
    "Playlist genre",
    "Playlist subgenre",
    "Describes how suitable a track is for dancing based on a combination of musical elements",
    "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.",
    "The estimated overall key of the track.",
    "The overall loudness of a track in decibels (dB).",
    "Modality (major or minor) of a track; major is represented by 1 and minor by 0.",
    "Detects the presence of spoken words in a track.",
    "A confidence measure from 0.0 to 1.0 of whether the track is acoustic.",
    "Predicts whether a track contains no vocals.",
    "Detects the presence of an audience in the recording.",
    "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.",
    "The overall estimated tempo of a track in beats per minute (BPM).",
    "Duration of song in milliseconds")
)

fig_table1 <- plot_ly(
  type = 'table',
  columnorder = c(1,2),
  columnwidth = c(12,12),
  header = list(
    values = c('<b>VARIABLES</b><br>', '<b>DESCRIPTION</b>'),
    line = list(color = '#506784'),
    fill = list(color = '#119DFF'),
    align = c('left','center'),
    font = list(color = 'white', size = 12),
    height = 40
  ),
  cells = list(
    values = values_table1,
    line = list(color = '#506784'),
    fill = list(color = c('#25FEFD', 'white')),
    align = c('left', 'left'),
    font = list(color = c('#506784'), size = 12),
    height = 30
  ))

fig_table1
```

## **Data Cleaning**

```{r}
# Compute summary statistics for the variables
datatable( summary(data) )

# Identify the data type of each variable
# (str() prints its result and returns NULL, so it is not wrapped in datatable())
str(data)
```

```{r}
# Identifying missing data
# Number of missing values in this data frame
sum(is.na(data))
```

```{r}
# Count the number of missing values per column
colSums(is.na(data))
```

```{r}
# Return the names of the columns with missing values
names(which(colSums(is.na(data)) > 0))
```

```{r}
# Display the first 10 rows of the data set
datatable(head(data, 10), options = list(scrollX=TRUE, pageLength=5))
```

```{r}
# Display the last 10 rows of the data set
datatable(tail(data, 10), options = list(scrollX=TRUE, pageLength=5))
```

We used the following code to tidy up our data:

```{r, echo=TRUE}
# # Convert the album release date from string to date/time format
# data$track_album_release_date <- ymd_hms(data$track_album_release_date)
#
# # Check for duplicate rows
# data[duplicated(data$track_id),]
#
# # Check for duplicate columns
# data[!duplicated(lapply(data, summary))]
#
# # Data frame with each track_id and the number of times it occurs
# n_occur <- data.frame(table(data$track_id))
#
# # Shows which track_ids occur more than once
# n_occur[n_occur$Freq > 1,]
#
# # Shows the rows whose track_id occurs more than once
# data[data$track_id %in% n_occur$Var1[n_occur$Freq > 1],]
#
# # Identifying missing data
# # Number of missing values in this data frame
# sum(is.na(data))
#
# # Count the number of missing values per column
# colSums(is.na(data))
#
# # Identify the position of the columns with at least one missing value
# which(colSums(is.na(data))>0)
#
# # Return the column names with missing values
# names(which(colSums(is.na(data))>0))
```

## **Proposed Exploratory Data Analysis and Data Visualization**

```{r}
# Scatterplot matrix of four audio features
pairs(~danceability+energy+key+loudness, data = data,
      main = "Scatterplot Matrix For GlobalMusicData")
```

```{r}
ggplot(data, aes(x = playlist_genre, y = track_popularity)) +
  # customize bars
  geom_bar(color="black", fill = "pink", width= 0.5, stat='identity') +
  # add value labels
  geom_text(aes(label = track_popularity), vjust = -0.25) +
  # customize x and y axes and title
  ggtitle("Track Popularity by Playlist Genre") +
  xlab("Playlist genre") +
  ylab("Popularity of the Track") +
  # change font
  theme(plot.title = element_text(color="black", size=14, face="bold", hjust = 0.5),
        axis.title.x = element_text(color="black", size=11, face="bold"),
        axis.title.y = element_text(color="black", size=11, face="bold"))
```

```{r}
# Bar chart of the number of tracks per playlist genre
ggplot(data, aes(x=playlist_genre)) +
  geom_bar()
```

```{r}
# Box plots of track duration by playlist genre
bp <- ggplot(data, aes(x=duration_ms, y=playlist_genre, fill=playlist_genre)) +
  geom_boxplot() +
  labs(title="Plot of Duration against Playlist Genre",
       x="Duration (ms)", y = "Playlist genre")
bp + theme_classic()
```
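
The popularity bar chart above uses `stat='identity'` on the raw rows, so each bar stacks the popularity of every track in a genre, and its height partly reflects how many tracks that genre has. One possible alternative is to summarise mean popularity per genre before plotting. The sketch below illustrates this on a small made-up data frame; `toy` and `genre_means` are hypothetical names, and the values are invented for illustration, not taken from the GlobalMusicData data set.

```{r}
# Toy stand-in for `data`; the values are made up for illustration only
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "rap", "rap", "rock"),
  track_popularity = c(60, 80, 50, 70, 40)
)

# Mean popularity per genre (base R, no extra packages needed)
genre_means <- aggregate(track_popularity ~ playlist_genre,
                         data = toy, FUN = mean)

# Sort genres from most to least popular on average
genre_means <- genre_means[order(-genre_means$track_popularity), ]

print(genre_means)
```

If this approach suits the analysis, the same `aggregate()` call could be run on `data`, and the summary passed to `ggplot()` with `geom_col()` so that each genre is drawn as a single bar.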