We will cover the following tasks in 57 minutes:
Introduction and Importing the Data
In this task, we are introduced to the project and learning outcomes. Once we are familiarized with the Rhyme interface, we begin working in Jupyter Notebook, a web-based interactive computational environment for creating notebook documents.
Next, we will import the essential libraries.
Lastly, we use pandas to load the Breast Cancer Wisconsin (Diagnostic) Data Set.
Separate Target from Features
Now that the data set is in memory, we can explore the characteristics of its attributes and instances.
We will drop columns that cannot be used for analysis and classification. Note that this does not constitute feature selection: we are dropping columns that have no bearing on the analysis we will be conducting and would only clutter it.
After producing descriptive statistics about the data, we will separate the target from the features. The target contains the diagnosis with binary class labels, M and B, for malignant and benign tumors respectively.
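The steps in this task can be sketched as follows. This is a minimal illustration, not the project's notebook: the inline sample, its values, and the column names (id, diagnosis, radius_mean, texture_mean, and an empty trailing column called "Unnamed: 32") are assumptions made for the example.

```python
import io
import pandas as pd

# Tiny inline stand-in for the real CSV file.
# Column names and values are illustrative assumptions.
csv_text = """id,diagnosis,radius_mean,texture_mean,Unnamed: 32
1,M,18.0,10.4,
2,M,20.6,17.8,
3,B,13.5,14.4,
4,B,13.1,15.7,
"""
data = pd.read_csv(io.StringIO(csv_text))

# Columns with no bearing on the analysis: an identifier and an empty column.
drop_cols = ["id", "Unnamed: 32"]

# The target holds the class labels; the features are everything else.
y = data["diagnosis"]
X = data.drop(columns=drop_cols + ["diagnosis"])

print(X.describe())      # descriptive statistics of the features
print(y.value_counts())  # number of instances per class
```

With the real file, only the read_csv argument and the list of dropped columns would need to change.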
Diagnosis Distribution Visualization
A very common question during model evaluation is, “Why isn’t the model I’ve picked predictive?” Often, the cause is class imbalance.
In this task, we will use Seaborn’s countplot() method to visualize the target distributions. We will also generate descriptive statistics about the features that summarize the central tendency, dispersion and shape of the data set’s distribution.
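A short sketch of checking the class balance, using a made-up target column (the counts below are hypothetical and do not reflect the real data set):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical target: 'M' = malignant, 'B' = benign (made-up counts).
df = pd.DataFrame({"diagnosis": ["B"] * 6 + ["M"] * 4})

# Bar chart of how many instances fall into each class.
ax = sns.countplot(x="diagnosis", data=df)

# The same information as numbers, plus the class proportions.
counts = df["diagnosis"].value_counts()
proportions = df["diagnosis"].value_counts(normalize=True)
print(counts)
print(proportions)
```

Comparing the proportions makes it easy to judge how far the classes are from an even split.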
Visualizing Standardized Data with Seaborn
As the columns in the data set take on values of varying range, we need to standardize the data before proceeding with further analysis and visualization.
To begin feature analysis, we use Seaborn’s violinplot() method. Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
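One way to standardize the features and draw the violin plot is sketched below. The data is synthetic, the feature names are placeholders, and z-score scaling is an assumed choice of standardization rather than something the project states.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: three features on very different scales.
rng = np.random.default_rng(0)
n = 40
y = pd.Series(["M", "B"] * (n // 2), name="diagnosis")
X = pd.DataFrame({
    "feature_a": rng.normal(15.0, 3.0, n),
    "feature_b": rng.normal(700.0, 300.0, n),
    "feature_c": rng.normal(0.1, 0.02, n),
})

# Standardize each column to zero mean and unit standard deviation.
X_std = (X - X.mean()) / X.std()

# Reshape to long form: one row per (instance, feature) pair.
long_df = pd.concat([y, X_std], axis=1).melt(
    id_vars="diagnosis", var_name="feature", value_name="value"
)

# One violin per feature, split by diagnosis, with quartile lines inside.
ax = sns.violinplot(x="feature", y="value", hue="diagnosis",
                    data=long_df, split=True, inner="quart")
```

After standardization all three features share one vertical axis, which is what makes the side-by-side comparison readable.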
Violin Plots and Box Plots
We are using violin plots and box plots to identify features that best separate the data for classification. Box plots are especially useful in identifying outliers in the data. Using violin plots, we are also able to infer whether certain features are correlated.
To minimize clutter in our visualizations, we divide the features into three batches of ten features and produce separate plots for them.
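Splitting the columns into batches can be done with plain slicing. The sketch below assumes a frame with 30 feature columns; the names f0 to f29 are placeholders.

```python
import pandas as pd

# Hypothetical frame with 30 feature columns named f0 ... f29.
X = pd.DataFrame({f"f{i}": [0.0, 1.0] for i in range(30)})

batch_size = 10
# Consecutive groups of ten column labels.
batches = [X.columns[i:i + batch_size] for i in range(0, X.shape[1], batch_size)]

for cols in batches:
    subset = X[cols]  # each subset would get its own violin or box plot
    print(list(cols))
```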
Using Joint Plots for Feature Comparison
Joint plots come in handy to illustrate the relationship between two features. We will use Seaborn’s jointplot() method to draw a scatter plot with marginal histograms and kernel density fits. We can quantify the linear relationship between any two features with the Pearson correlation coefficient, alongside the regression line drawn through the scatter plot.
Uncovering Correlated Features with Pair Grids
In this task, we will use Seaborn’s PairGrid class to plot pairwise relationships in the data set. However, we will limit ourselves to three features.
We will use these results to inform our feature selection process in the next project.
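A pair grid over three features might look like the sketch below. The data is synthetic, and the choice of histograms on the diagonal and scatter plots elsewhere is an assumption for illustration; the project may map different plot types.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in with three placeholder features.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(50, 3)),
                  columns=["feature_a", "feature_b", "feature_c"])

g = sns.PairGrid(df)
g.map_diag(plt.hist)        # distribution of each feature on the diagonal
g.map_offdiag(plt.scatter)  # pairwise scatter plots everywhere else
```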
Observing the Distribution of Values and their Variance with Swarm Plots
We have learned that violin plots are a great tool for visualizing sparse distributions. As our data set contains close to 600 rows, we might want to simply display each point in the same visualization. This need is satisfied by Seaborn’s swarmplot() method. A swarm plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.
The points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, but it does not scale well to large numbers of observations. This style of plot is sometimes called a “beeswarm”. It can be used to more clearly observe the variance in the data.
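As a sketch of the complementary use described above, a swarm plot can be layered over a violin plot on the same axes. The long-form data here is synthetic and the column names are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic long-form data: one row per (instance, feature) pair.
rng = np.random.default_rng(3)
long_df = pd.DataFrame({
    "feature": ["feature_a"] * 30 + ["feature_b"] * 30,
    "value": rng.normal(size=60),
    "diagnosis": ["M", "B"] * 30,
})

# Light violin in the background to show the overall shape.
ax = sns.violinplot(x="feature", y="value", data=long_df,
                    inner=None, color="lightgray")

# Every observation drawn on top, colored by diagnosis.
sns.swarmplot(x="feature", y="value", hue="diagnosis", data=long_df, ax=ax)
```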
Observing all Pairwise Correlations
A good way to identify correlations between features is to visualize the correlation matrix as a heatmap. We will make a note of the correlated features so that we can drop them from our data set before building a predictive model in the next project.
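The heatmap and the note-taking step can be sketched as follows. The data is synthetic, and the 0.9 cutoff for calling two features "correlated" is an assumed value chosen for the example, not one given by the project.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)
base = rng.normal(size=200)
X = pd.DataFrame({
    "feature_a": base,
    "feature_b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of feature_a
    "feature_c": rng.normal(size=200),                     # independent of the others
})

# Pairwise Pearson correlations, drawn as an annotated heatmap.
corr = X.corr()
ax = sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")

# List each strongly correlated pair once by keeping the upper triangle only.
threshold = 0.9  # assumed cutoff
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs.abs() > threshold]
print(high_pairs)
```

The resulting list is a convenient record of which features are candidates for removal.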
In the next project, we will remove these correlated features and analyze the classification accuracy we get using XGBoost, a boosted decision tree classifier. We will then employ various feature selection and feature extraction methods to get the most predictive features and improve our classification accuracy.
About the Host (Snehan Kekre)
Snehan hosts Machine Learning and Data Sciences projects at Rhyme. He is in his senior year of university at the Minerva Schools at KGI, studying Computer Science and Artificial Intelligence. When not applying computational and quantitative methods to identify the structures shaping the world around him, he can sometimes be seen trekking in the mountains of Nepal.