Clustering, especially K-Means, using Python programming and Anaconda

HW: Clustering MCIS-6273: Data Mining

This assignment gives you a basic understanding of clustering, especially K-Means, using Python programming and Anaconda.

Please download the dataset Mall_Customers.csv from Blackboard. It will be used throughout this assignment.

Using K-Means

Part 1: [10 points] First, read the dataset into your code and save two features to X. Pick the fourth feature (Annual Income (k$)) and the fifth feature (Spending Score (1-100)); with only two features, the clusters can be visualized in a 2D plot. Then do the following:

1. Use the elbow method to find the optimal number of clusters.
2. Fit K-Means to the dataset using the optimal number of clusters found by the elbow method.
3. Predict the clustering results y for dataset X.
4. Visualize the clustering results, using a different color for each cluster.
   1. The title, x label, and y label should be specified.
   2. The legend should be included.
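The four steps above could be sketched as follows. This is only one possible approach, assuming scikit-learn, pandas, and matplotlib are installed in your Anaconda environment. The helper names (`load_features`, `select_features`, `elbow_wcss`, `plot_elbow`, `fit_predict_plot`) are hypothetical and do not come from the provided example files. Columns are selected by position (fourth and fifth), so no exact header spelling is assumed. The elbow is read off the plot by eye, which is why k is left as an argument.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans


def select_features(df, positions=(3, 4)):
    """Return the columns at the given 0-based positions as a float array X."""
    return df.iloc[:, list(positions)].to_numpy(dtype=float)


def load_features(path, positions=(3, 4)):
    """Read a CSV file (e.g. Mall_Customers.csv) and extract the two features."""
    return select_features(pd.read_csv(path), positions)


def elbow_wcss(X, k_max=10, seed=0):
    """Within-cluster sum of squares (KMeans inertia) for k = 1..k_max."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in range(1, k_max + 1)
    ]


def plot_elbow(wcss):
    """Plot WCSS against k; the bend in the curve suggests the number of clusters."""
    fig, ax = plt.subplots()
    ax.plot(range(1, len(wcss) + 1), wcss, marker="o")
    ax.set_title("Elbow method")
    ax.set_xlabel("Number of clusters k")
    ax.set_ylabel("WCSS (inertia)")
    return ax


def fit_predict_plot(X, k, xlabel, ylabel, seed=0):
    """Fit K-Means with k clusters, predict labels y, and draw one color per cluster."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    y = model.fit_predict(X)
    fig, ax = plt.subplots()
    for label in range(k):
        pts = X[y == label]
        # Each scatter call takes the next color from matplotlib's color cycle.
        ax.scatter(pts[:, 0], pts[:, 1], label=f"Cluster {label + 1}")
    ax.scatter(*model.cluster_centers_.T, c="black", marker="x", label="Centroids")
    ax.set_title(f"K-Means clusters (k={k})")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.legend()
    return y, ax
```

A possible usage, assuming the CSV file sits in the working directory: call `X = load_features("Mall_Customers.csv")`, then `plot_elbow(elbow_wcss(X))`, choose k from the bend in the curve, call `fit_predict_plot(X, k, "Annual Income (k$)", "Spending Score (1-100)")`, and finish with `plt.show()`.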

Part 2: [10 points] Repeat the steps in Part 1, but now pick the second feature (Gender) and the third feature (Age) to visualize the clusters. [This part may be trickier.]
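One reason this part may be trickier is that Gender is a categorical column, while K-Means works on numeric distances. A minimal sketch of one way to prepare X is shown below. It assumes the Gender column holds text labels and the Age column holds numbers; the helper name `encode_gender_age` is hypothetical. Standardizing the two columns is a design choice made here so that Age does not dominate the distance computation; the assignment does not require it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def encode_gender_age(df, gender_pos=1, age_pos=2, scale=True):
    """Build X from the Gender and Age columns (0-based positions).

    Gender labels are turned into integer codes (0, 1, ...) in sorted label
    order. Returns X and the code-to-label mapping for use in plot labels.
    """
    gender = df.iloc[:, gender_pos].astype("category")
    codes = gender.cat.codes.to_numpy(dtype=float)
    age = df.iloc[:, age_pos].to_numpy(dtype=float)
    X = np.column_stack([codes, age])
    if scale:
        # Assumption: put both features on a comparable scale before clustering.
        X = StandardScaler().fit_transform(X)
    mapping = dict(enumerate(gender.cat.categories))
    return X, mapping
```

The returned X can then go through the same elbow, fit, predict, and plot steps as in Part 1. Because Gender takes only two values, the points in the scatter plot fall on two vertical lines, which is expected.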

Guidelines:
• This assignment is to be solved in groups of two students, not more.
• You only need to deliver a PDF report that is nicely formatted with: [5 points]
  ◦ Title page: Title and Group Names
  ◦ ToC page
  ◦ Pages should be numbered and numbers show in the ToC
  ◦ A snapshot of each of the figures as described below, please see the Notes.
    ▪ Each snapshot has to have a caption, 10 words, describing the picture.
  ◦ Only one report per group should be submitted
  ◦ No need to submit any code

Notes:
• To help you read and handle the data and to guide your work, you will be given the code example_3D.py and the data file 3D_network.csv.
  ◦ You should run the code and understand what it does first.
  ◦ You will also be given a code file named practice_blobs.py. You can run it in Anaconda to see the output and how the different steps should be performed, so you know what to do.
  ◦ The code runs with no issues, so resolving any issues you encounter when running it is your responsibility.

• To know more about the Elbow Method mentioned above for choosing the right number of clusters, please check: https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/

• The report you will submit should have the figures below.
  ◦ To give you an idea, running the practice_blobs.py gives the following output: [arrows for output order]

predicted group: 2
distance from center 0 is: 3.731771999479638
distance from center 1 is: 6.290334770382815
distance from center 2 is: 3.382224740457218
distance from center 3 is: 7.132308122920062
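In the output above, the predicted group (2) is the center with the smallest distance, which is how K-Means assigns a point to a cluster. The contents of practice_blobs.py are not reproduced here, so the following is only a sketch of how output of this shape could be produced. It assumes scikit-learn's `make_blobs` for synthetic data, four clusters, and an arbitrary query point; the helper name `describe_point` is hypothetical. The numbers it prints will differ from those shown above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def describe_point(model, point):
    """Return the predicted cluster and the Euclidean distance to every center."""
    point = np.asarray(point, dtype=float).reshape(1, -1)
    group = int(model.predict(point)[0])
    distances = np.linalg.norm(model.cluster_centers_ - point, axis=1)
    return group, distances


# Synthetic stand-in data: 200 two-dimensional points around 4 blob centers.
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# The query point is arbitrary and chosen only for illustration.
group, distances = describe_point(model, [0.0, 0.0])
print("predicted group:", group)
for i, d in enumerate(distances):
    print(f"distance from center {i} is: {d}")
```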