Data Algorithm homework: K-means clustering in RStudio

BIFS614Homework42.R

# Clustering for Genetic Data
#
# A Simple R-script for UMUC BIFS614
# Dr. R. Wolfgang Rumpf
#
# version 1.0
# July 27 2016
# Based on the Basic Cluster Analysis Script at
# http://rstudio-pubs-static.s3.amazonaws.com/3773_0afaead59a02436889abc68753e6c20a.html

# This script will load a sample data set and allow you to
# perform cluster analysis.

# First we must install and load all necessary libraries. To execute these
# lines, place the cursor at the end of the line and
# press the RUN button at the top of the top left window in RStudio.
# For this example we'll start by installing the golubEsets data package from
# BioConductor, which has many great plug-ins for bioinformatics.
# Current BioConductor releases install packages through BiocManager
# (the older biocLite installer is no longer supported).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("golubEsets")

# Now we will load the required package and dataset
require(golubEsets)
data(Golub_Merge)

# We won't use the entire Golub dataset, so we select a subset and load it into a
# data frame called golub (rows are samples, columns are genes):
golub <- data.frame(Golub_Merge)[1:7129]

# Some quick data manipulation - calculating the variance of each column and
# sorting the columns by decreasing variance, then keeping the top 150:
golub.rearrange <- golub[, order(apply(golub, 2, var), decreasing = TRUE)]
golub <- golub.rearrange[, 1:150]

# Now we're ready to do some hierarchical clustering analysis!
# We calculate a distance matrix first:
d <- dist(golub, method = "euclidean")

# Using that matrix we can now do the clustering and put the results in a variable
# called hclustering
hclustering <- hclust(d, method = "ward.D")

# The data appeared in the environment window, but let's show it as a visual
# dendrogram:
plot(hclustering)

# We can "cut" the dendrogram (tree) at the 3rd clade level and draw boxes around those
# groups to make it easier to see the clusters
groups <- cutree(hclustering, k = 8)
rect.hclust(hclustering, k = 8, border = "red")

# Notice that there are 8 groups - can you guess why? :)

#-------------------------------------------------------------------------------------
# Kmeans Clustering
# For this particular dataset a visual plot of a Kmeans clustering would be very
# "messy", as there are too many dimensions. But we can still run a Kmeans
# analysis and look at the statistics!
fit2 <- kmeans(x = golub, centers = 8)
fit2$cluster  # get cluster assignments
fit2$centers  # get cluster centers

# get cluster means
aggregate(golub, by = list(fit2$cluster), FUN = mean)
summary(fit2)

#-------------------------------------------------------------------------------------
# Hey, since you haven't quit yet let's do some fun stuff like a correlation
# analysis on the dataset.
# One of the powerful things about R is how easy it is to add new functionality:
# lots of people write modules and plugins for R and make them available,
# and then you can just install them like this:

# Install the corrplot package
install.packages("corrplot")

# Once you've done this on your local machine you can comment out the line - it'll
# be ready for you the next time you run the script!
# But we have to tell R to use this package after it's installed - we update our
# script with this line:
require(corrplot)

# Now we can do a correlation plot on the original rearranged data using corrplot:
corrplot(cor(golub.rearrange[, 1:20]))

# The plot may seem a bit small - notice that you can ZOOM and even export it
# from the window in the lower right of RStudio.
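
The script produces two sets of cluster labels, one from cutree() and one from kmeans(), but never compares them. Because kmeans() starts from randomly chosen centers, its labels can also change from run to run. The following is a minimal sketch of one way to make the k-means step repeatable and to cross-tabulate it against the hierarchical result. It runs on simulated data rather than the Golub set so that it works without BioConductor. The names toy, hc_groups, km and agreement are made up for this illustration, and the seed and nstart values are arbitrary choices.

```r
# Illustrative sketch on simulated data (an assumption, not the Golub set).
set.seed(42)  # k-means uses random starting centers; fixing the seed makes runs repeatable

# 60 hypothetical samples x 5 features, drawn around three different means
toy <- rbind(
  matrix(rnorm(20 * 5, mean = 0),  ncol = 5),
  matrix(rnorm(20 * 5, mean = 5),  ncol = 5),
  matrix(rnorm(20 * 5, mean = 10), ncol = 5)
)

# Hierarchical clustering with the same distance and linkage as the script,
# cut into k groups
k <- 3
hc_groups <- cutree(hclust(dist(toy, method = "euclidean"), method = "ward.D"), k = k)

# K-means with several random restarts; kmeans() keeps the best of the nstart runs
km <- kmeans(toy, centers = k, nstart = 25)

# Cross-tabulate the two label sets. Cluster numbers are arbitrary, so agreement
# shows up as one dominant cell per row, not necessarily on the diagonal.
agreement <- table(hierarchical = hc_groups, kmeans = km$cluster)
print(agreement)

# Fraction of the total sum of squares that lies between clusters
print(km$betweenss / km$totss)
```

To try the same comparison on the homework data, the equivalent call would be table(groups, fit2$cluster), using the groups and fit2 objects created earlier in the script.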