We will cover the following tasks in 1 hour and 15 minutes:
Introduction and Importing Libraries
We will get familiar with the Rhyme interface and our learning environment. You will be provided with a cloud desktop with Jupyter Notebooks and all the software you will need to complete the project. Jupyter Notebooks are very popular with data scientists and machine learning engineers, as one can write code in some cells and use other cells for documentation.
Lastly, we clearly define the steps of a general machine learning problem and then import the libraries and helper functions that will be essential later in the project.
Anscombe's Quartet
To understand why visual diagnostics are vital to machine learning, we compute the summary statistics of four datasets and plot them. The surprising result is that while their means, standard deviations, and correlation coefficients are nearly identical, the datasets look drastically different when plotted.
This illustrative example, known as Anscombe's quartet, was constructed in 1973 by the English statistician Francis Anscombe. He wanted to dispel the ever-pervasive notion that “numerical calculations are exact, but graphs are rough”.
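As a rough illustration of the numerical half of this exercise, the sketch below computes the summary statistics for the four datasets using only the Python standard library. The values are the ones published in Anscombe's 1973 paper; the variable and function names (`quartet`, `pearson`, `summary`) are chosen here for illustration and are not taken from the project notebook.

```python
from math import sqrt
from statistics import mean, stdev

# Anscombe's quartet; datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (mean x, mean y, std x, std y, correlation), rounded to two decimals.
summary = {
    name: (
        round(mean(xs), 2),
        round(mean(ys), 2),
        round(stdev(xs), 2),
        round(stdev(ys), 2),
        round(pearson(xs, ys), 2),
    )
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, stats)
```

Rounded to two decimal places, the four rows come out the same, which is exactly why the plots are needed to tell the datasets apart.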
Feature Analysis: Loading the Classification Data
Feature Analysis can be generalized to the following three steps:
Define a bounded, high-dimensional feature space that can be effectively modeled.
Transform and manipulate the space to make modeling easier.
Extract a feature representation of each instance in the space.
Our goal in this task will be to load the room occupancy data, specify the features of interest, and extract the instances and the target.
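The loading step can be sketched roughly as follows, using only the standard library. Everything concrete here is an assumption made for illustration: the column names, the `load` helper, and above all the three data rows, which are made-up placeholders rather than values from the real room occupancy file.

```python
import csv
import io

# ASSUMPTION: placeholder rows and column names, not the real dataset.
RAW = """\
temperature,relative humidity,light,CO2,humidity,occupancy
23.1,27.2,426.0,721.0,0.0048,1
21.0,25.7,0.0,440.0,0.0039,0
22.4,26.1,433.0,690.0,0.0044,1
"""

# Features of interest and the target column (assumed names).
FEATURES = ["temperature", "relative humidity", "light", "CO2", "humidity"]
TARGET = "occupancy"


def load(text, features, target):
    """Split CSV text into a feature matrix X and a target vector y."""
    rows = list(csv.DictReader(io.StringIO(text)))
    X = [[float(row[name]) for name in features] for row in rows]
    y = [int(row[target]) for row in rows]
    return X, y


X, y = load(RAW, FEATURES, TARGET)
print(len(X), "instances with", len(X[0]), "features each")
```

In the notebook itself the same split would more likely be done on a data frame read from the provided file; the point of the sketch is only the separation into instances `X` and target `y`.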
Feature Analysis: Scatter Plot
In data science and machine learning we can use scatter plots to quickly graph data during analysis. Oftentimes, they are used as an informative base for more complex and higher dimensional visualizations.
In this task, we are going to simply plot instances of two features against each other to assess the relationship between the pair. Can we learn something novel that we would have otherwise missed? Let’s find out!
Feature Analysis: RadViz
Another very important feature visualization algorithm is RadViz. Machine learning engineers and data scientists often use radial visualizations in their workflow to ascertain class separability and feature importance.
In this task, we will use RadViz to plot our features on the unit circle, drop our instances as points within this circle, and let the features pull on the points according to their normalized values.
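The geometry behind a radial visualization can be sketched in a few lines, independent of whatever plotting library the project uses. The sketch assumes the common formulation: each feature gets an anchor spaced evenly on the unit circle, each column is min-max normalized to [0, 1], and every instance is placed at the weighted average of the anchors. The function name `radviz_points` and the toy input are illustrative only.

```python
from math import cos, pi, sin


def radviz_points(X):
    """Return (anchors, points) for a RadViz-style projection of X."""
    n_features = len(X[0])
    # One anchor per feature, evenly spaced on the unit circle.
    anchors = [
        (cos(2 * pi * j / n_features), sin(2 * pi * j / n_features))
        for j in range(n_features)
    ]
    # Min-max normalize each column; guard against constant columns.
    columns = list(zip(*X))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]

    points = []
    for row in X:
        weights = [(v - lo) / span for v, lo, span in zip(row, lows, spans)]
        total = sum(weights) or 1.0  # an all-zero row stays at the center
        px = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
        py = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
        points.append((px, py))
    return anchors, points


# Toy input: three instances, three features.
anchors, points = radviz_points([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
for p in points:
    print(p)
```

An instance dominated by a single feature lands on that feature's anchor, while an instance with equal normalized values is pulled equally in all directions and sits at the center.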
Feature Analysis: Parallel Coordinates Plot
Like RadViz, parallel coordinates plots visualize multi-dimensional features. We will use parallel coordinates to get a much better sense of the distribution of the features and of whether any features are highly variable with respect to any one class in the room occupancy dataset.
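To make the idea concrete, here is a minimal sketch of the data preparation behind a parallel coordinates plot. It assumes min-max normalization so that all features share one vertical scale, which is one common choice rather than the only one. Each instance becomes a polyline of `(axis index, normalized value)` pairs, grouped by class label. The function name and the toy values are hypothetical.

```python
def parallel_coordinates_lines(X, y):
    """Group one polyline per instance by class label."""
    columns = list(zip(*X))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]

    lines = {}
    for row, label in zip(X, y):
        # One vertex per feature axis, scaled to the range [0, 1].
        polyline = [
            (axis, (value - lo) / span)
            for axis, (value, lo, span) in enumerate(zip(row, lows, spans))
        ]
        lines.setdefault(label, []).append(polyline)
    return lines


# Toy input: three instances, two features, two classes.
lines = parallel_coordinates_lines(
    [[1.0, 10.0], [3.0, 30.0], [2.0, 10.0]],
    [0, 1, 0],
)
for label, polylines in lines.items():
    print(label, polylines)
```

Drawing each class's polylines in its own color is what reveals features whose values spread widely within a single class.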
Feature Analysis: Rank Features
Are the features predictive? What is the smallest set of features I can feed into my model to maximize predictive performance?
These questions are bound to come up in any machine learning problem. In this task, we will use Rank2D to score and visualize pairs of features according to various metrics so that we can make well-informed qualitative and quantitative decisions about which features to include and why.
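The pairwise scoring can be sketched as below, assuming Pearson correlation as the metric (one of several a pairwise ranking might use). The helper names `pearson` and `rank_pairs`, the feature names, and the toy values are illustrative assumptions, not part of the project code.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_pairs(X, names):
    """Score every feature pair and sort by absolute correlation."""
    columns = list(zip(*X))
    scores = {
        (names[i], names[j]): pearson(columns[i], columns[j])
        for i, j in combinations(range(len(names)), 2)
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy input: feature "b" is exactly twice feature "a".
ranked = rank_pairs(
    [[1, 2, 4], [2, 4, 1], [3, 6, 3], [4, 8, 2]],
    ["a", "b", "c"],
)
for pair, score in ranked:
    print(pair, round(score, 2))
```

A pair with a score close to 1 or -1 carries largely redundant information, which makes one of the two features a candidate for removal.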
About the Host (Snehan Kekre)
Snehan hosts Machine Learning courses at Rhyme. He is in his senior year of university at the Minerva Schools at KGI, pursuing a double major in the Natural Sciences and Computational Sciences, with a focus on physics and machine learning. When not applying computational and quantitative methods to identify the structures shaping the world around him, he can sometimes be seen trekking in the mountains of Nepal.