scikit-learn: K-Means Clustering In Practice

In this machine learning project, we take a look at applying an unsupervised clustering algorithm, k-means, to two different problems.

First, we apply k-means on the MNIST dataset. We will use k-means to try to identify similar digits without using the original label information. This might be similar to a first step in extracting meaning from a new dataset about which you don’t have any prior label information.

In the second half of the project, we will use k-means clustering for color compression within images. Imagine you have an image with millions of colors. In most images, a large number of the colors will be unused, and many of the pixels in the image will have similar or even identical colors. We will reduce these 16 million colors to just 16 colors, using a k-means clustering across the pixel space!

First 2 tasks free. Then, decide to pay $9.99 for the rest

5.0 / 5

Rating

Task List


We will cover the following tasks in 1 hour and 1 minute:


Loading the Data and Performing K-Means Clustering

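We begin by loading the handwritten digits data and then finding the k-means clusters. The digits consist of 1,797 samples with 64 features, where each of the 64 features is the brightness of one pixel in an 8×8 image. (These figures match the small digits dataset bundled with scikit-learn, an MNIST-like set, rather than the full 70,000-image MNIST.)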
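
A minimal sketch of this step, assuming the data come from scikit-learn's `load_digits` (the 1,797 × 64 shape matches that dataset) and that we ask for 10 clusters, one per digit. The `n_init` and `random_state` values are illustrative choices, not taken from the project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Assumption: the 1,797-sample, 64-feature digits come from load_digits().
digits = load_digits()
X = digits.data
print(X.shape)

# Ask for 10 clusters, one per digit; the labels in digits.target are not used.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# Each cluster center is itself a 64-dimensional point, i.e. an 8x8 "image".
print(kmeans.cluster_centers_.shape)
```

Because each center has the same 64 dimensions as a sample, it can be read as the "typical" digit of its cluster.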


Plotting the Cluster Centers

In the previous task, we noticed that each cluster center can be interpreted as a typical digit within its cluster. Here, we plot the cluster centers to see what they look like. We will find that even without any label information, k-means is able to find clusters whose centers are recognizable digits.
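
One way this plot might be produced is sketched below, assuming matplotlib is available and the digits come from `load_digits`. The 2 × 5 grid layout and the `binary` colormap are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)

# Reshape each 64-dimensional center back into an 8x8 image.
centers = kmeans.cluster_centers_.reshape(10, 8, 8)

# Draw the 10 centers on a 2x5 grid with the axis ticks hidden.
fig, axes = plt.subplots(2, 5, figsize=(8, 3))
for ax, center in zip(axes.flat, centers):
    ax.imshow(center, cmap="binary", interpolation="nearest")
    ax.set(xticks=[], yticks=[])

# In a notebook the figure renders inline; in a script, call plt.show().
```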


Model Evaluation

The k-means algorithm is blind to the true cluster assignment, so the cluster indices 0-9 it returns are an arbitrary permutation of the digit labels. Using them directly would result in incorrect labeling of the digits.

In the first half of this task, we solve the above issue by matching each learned cluster with the most common true label among its members.

Next, we evaluate our model using the accuracy score. This metric tells us how accurate our k-means clustering is in finding similar digits within the data. You’d be surprised to find that running a simple k-means on the data is sufficient to discover almost 80% of the correct grouping of the input.
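
The matching and scoring steps could look roughly like this. It is a sketch under the assumption that each cluster is relabeled with the most frequent true digit among its members; the exact score depends on the random initialization, so none is hard-coded here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

digits = load_digits()
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# Relabel every cluster with the most common true digit found inside it.
labels = np.zeros_like(clusters)
for k in range(10):
    mask = clusters == k
    if mask.any():  # guard against an (unlikely) empty cluster
        labels[mask] = np.bincount(digits.target[mask]).argmax()

# Fraction of samples whose relabeled cluster matches the true digit.
accuracy = accuracy_score(digits.target, labels)
print(f"accuracy: {accuracy:.3f}")
```

Note that the true labels are used only to name and score the clusters, never to fit them.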


Interpreting the Confusion Matrix

We plot the confusion matrix comparing the true digits with the labels assigned to our clusters. Consistent with the cluster centers we visualized before, we observe that our model is confused between 8 and 1.
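
As a rough illustration, the confusion matrix can be computed with `sklearn.metrics.confusion_matrix` and inspected for its largest off-diagonal entry. The relabeling step repeats the majority-vote assumption, and the matrix is printed rather than plotted to keep the sketch short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix

digits = load_digits()
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# Majority-vote relabeling of the clusters (an assumed matching scheme).
labels = np.zeros_like(clusters)
for k in range(10):
    mask = clusters == k
    if mask.any():
        labels[mask] = np.bincount(digits.target[mask]).argmax()

# Rows are true digits, columns are the labels assigned through clustering.
cm = confusion_matrix(digits.target, labels)
print(cm)

# Find the most frequent mistake by ignoring the (correct) diagonal.
mistakes = cm.copy()
np.fill_diagonal(mistakes, 0)
true_digit, predicted = np.unravel_index(mistakes.argmax(), mistakes.shape)
print(f"most common confusion: true {true_digit} labeled as {predicted}")
```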

Even with its limitations, we will have shown that we can build a good digit classifier, using k-means, without using any known class labels!


Loading a Sample Image for Color Compression

With this task, we begin our journey into applying k-means for color compression within images.

We use Scikit-Learn’s datasets module to load a sample image and explore its attributes. Throughout the rest of this project, we will work with the same image and compress the original 16 million colors to just 16 colors!
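
A short sketch of loading and inspecting a sample image follows. The text does not say which image is used, so `"china.jpg"` (one of the two images shipped with scikit-learn) is an assumption; `load_sample_image` also needs Pillow installed.

```python
import numpy as np
from sklearn.datasets import load_sample_image

# Assumption: the project's image is not named; "china.jpg" is one bundled option.
image = load_sample_image("china.jpg")

# The image is a (height, width, 3) array of 8-bit RGB values.
print(image.shape, image.dtype)

# Flatten to one row per pixel and count the distinct colors actually present.
pixels = image.reshape(-1, 3)
n_unique = len(np.unique(pixels, axis=0))
print(f"{n_unique} distinct colors out of {256 ** 3} possible")
```

The count of distinct colors is bounded by the number of pixels, which is why most of the 16 million possible colors go unused in any one image.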


From 16 Million to 16 Colors

In this task, we first normalize the data. We then use k-means across the pixel space to reduce the 16 million colors in our sample image to just 16 colors.

After visualizing these pixels in the color space, and comparing the original to the reduced representation, we find that the result is a recoloring of the original pixels, where each pixel from the sample image is assigned the color of its closest cluster center.
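
The recoloring described above might be sketched as follows. Two assumptions are worth flagging: the sample image is `"china.jpg"`, and `MiniBatchKMeans` is swapped in for plain `KMeans` purely to keep the fit fast on a few hundred thousand pixels.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_sample_image

image = load_sample_image("china.jpg")  # assumed sample image

# Normalize 0-255 integers to floats in [0, 1], one row per pixel.
data = image.reshape(-1, 3) / 255.0

# Swapped-in technique: mini-batch k-means instead of full k-means, for speed.
kmeans = MiniBatchKMeans(n_clusters=16, n_init=3, random_state=0)
kmeans.fit(data)

# Replace every pixel with the color of its closest cluster center.
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]

# Restore the original (height, width, 3) layout for display.
recolored = new_colors.reshape(image.shape)
print(len(np.unique(new_colors, axis=0)), "colors in the recolored image")
```

Passing `recolored` to matplotlib's `imshow` next to the original `image` gives the side-by-side comparison in image space.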


Plotting the Results

In the last task, we visualized the pixels in the color space. Given the abstract nature of color space, let us now plot our result from k-means in the image space. This lets us compare our sample image of 16 million colors to our compressed image of just 16 colors, reducing the number of available colors by a factor of around 1 million!
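
The "factor of around 1 million" can be checked with a little arithmetic. The sketch below assumes it refers to the number of available colors (24-bit RGB versus a 16-color palette); the per-pixel storage saving is much smaller, as the second half shows.

```python
import math

# 24-bit RGB: 256 levels for each of the three channels.
possible_colors = 256 ** 3
palette_size = 16

# Reduction in the number of available colors.
color_reduction = possible_colors // palette_size
print(f"{possible_colors:,} colors -> {palette_size} colors: factor {color_reduction:,}")

# Storage per pixel: 3 bytes originally, versus an index into the palette.
bits_original = 3 * 8
bits_indexed = math.ceil(math.log2(palette_size))
print(f"{bits_original} bits -> {bits_indexed} bits per pixel (palette itself not counted)")
```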



About the Host (Snehan Kekre)


Snehan hosts Machine Learning courses at Rhyme. He is in his senior year of university at the Minerva Schools at KGI, pursuing a double major in the Natural Sciences and Computational Sciences, with a focus on physics and machine learning. When not applying computational and quantitative methods to identify the structures shaping the world around him, he can sometimes be seen trekking in the mountains of Nepal.



Frequently Asked Questions


In Rhyme, all projects are completely hands-on. You don't just passively watch someone else. You use the software directly while following the instructions of the host (Snehan Kekre). Using the software is the only way to achieve mastery. With the "Live Guide" option, you can ask for help and get an immediate response.
Nothing! Just join through your web browser. Your host (Snehan Kekre) has already installed all required software and configured all data.
You can go to https://rhyme.com/for-companies, sign up for free, and follow the visual guide "How to use Rhyme to create your own sessions". If you have custom needs or a company-specific environment, please email us at help@rhyme.com.
Absolutely. We offer Rhyme for workgroups as well as larger departments and companies. Universities, academies, and bootcamps can also buy Rhyme for their settings. You can select sessions and trainings that are mission critical for you, and you can also author your own to reflect your needs and tech environments. Please email us at help@rhyme.com.
Please email us at help@rhyme.com and we'll respond to you within one business day.
