Project: Statistical Data Visualization with Seaborn

Producing visualizations is an important first step in exploring and analyzing real-world data sets. As such, visualization is an indispensable method in any data scientist’s toolbox. It is also a powerful tool to identify problems in analyses and for illustrating results.

In this project, we will employ the statistical data visualization library, Seaborn, to discover and explore the relationships in the Breast Cancer Wisconsin (Diagnostic) Data Set.

We will use the results from our exploratory data analysis (EDA) in the previous project, Breast Cancer Diagnosis – Exploratory Data Analysis, to:

  • Drop correlated features
  • Implement feature selection and feature extraction methods, including feature selection with correlation, univariate feature selection, recursive feature elimination (RFE), RFE with cross-validation, principal component analysis (PCA), and tree-based feature selection
  • Build a boosted decision tree classifier with XGBoost to classify tumors as either malignant or benign.

Task List

We will cover the following tasks in 52 minutes:

Project Overview

Importing Libraries and Data

In this task, we will briefly recap our exploratory data analysis of the Breast Cancer Wisconsin (Diagnostic) Data Set in the previous project. To summarize, we:

  • Imported the data set
  • Separated the target column from the features
  • Visualized the target distribution
  • Standardized the data
  • Explored the relationship between the features using violin plots, joint plots, pair grids, and swarm plots
  • Identified the columns to be dropped by calculating the pairwise correlation coefficients
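The recap steps above can be sketched in a few lines. The original project loads the Kaggle CSV of the Breast Cancer Wisconsin (Diagnostic) Data Set; for a self-contained example, this sketch uses scikit-learn's bundled copy of the same data (column names differ slightly from the CSV):

```python
# Minimal recap sketch: load the data, separate the target,
# and standardize the features. Uses scikit-learn's bundled copy
# of the data set so the example runs as-is.
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer(as_frame=True)
X = raw.data    # 30 numeric features as a DataFrame
y = raw.target  # 0 = malignant, 1 = benign in sklearn's encoding

# Standardize each feature to zero mean and unit variance
X_std = (X - X.mean()) / X.std()

print(X_std.shape)  # (569, 30)
```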

Dropping Correlated Columns from Feature List

Using the heatmap of the correlation matrix, we were able to identify columns to be dropped. Out of a set of correlated features, we will preserve the one that best separates the data. We will identify these salient features using the violin plots and swarm plots produced in the previous project.
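One way to turn the heatmap inspection into code is to scan the upper triangle of the absolute correlation matrix for highly correlated pairs. The 0.9 threshold below is illustrative, not the project's exact cutoff, and this sketch drops one member of each pair mechanically, whereas the project chooses which member to keep from the violin and swarm plots:

```python
# Identify drop candidates from the pairwise correlation matrix.
# Threshold 0.9 is an illustrative assumption.
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data
corr = X.corr().abs()

# Keep only the upper triangle (k=1) so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Any column correlated above the threshold with another is a candidate
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)
print(len(to_drop), X_reduced.shape)
```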

Classification using XGBoost (minimal feature selection)

Of the 33 columns in the data set, we drop the 15 that are highly correlated with other features. To verify that no correlated features remain, we plot another correlation matrix and inspect the Pearson correlation coefficients.

Next, we use a helper function from scikit-learn to split our data into training and test sets. Using the default parameters, we will fit the XGBClassifier estimator to the training set and use the model to predict values in the test set.

We can evaluate the performance of our classifier using the accuracy score, F1 score, and confusion matrix from sklearn.metrics.

Univariate Feature Selection

In univariate feature selection, we will use the SelectKBest() function, which selects the k features with the highest chi-squared test statistic values from the data.

Recall that the chi-squared test measures the dependence between stochastic variables, so this function weeds out the features that are most likely to be independent of the class label and therefore irrelevant for classification.
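A minimal sketch of this step follows. Note that the chi-squared score requires non-negative inputs, so it is applied to the raw (or min-max scaled) features rather than the standardized ones; k=5 is an illustrative choice, not the project's:

```python
# Univariate feature selection with SelectKBest and the chi-squared score.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

raw = load_breast_cancer(as_frame=True)
X, y = raw.data, raw.target  # raw features are non-negative, as chi2 requires

selector = SelectKBest(score_func=chi2, k=5)
X_new = selector.fit_transform(X, y)

# Names of the selected features
selected = X.columns[selector.get_support()]
print(list(selected))
print(X_new.shape)  # (569, 5)
```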

Recursive Feature Elimination with Cross-Validation

In this task, we will not only find the best features but also the optimal number of features needed for the best classification accuracy.
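Scikit-learn's RFECV wraps this procedure: it recursively eliminates features while cross-validating, and reports the feature count that maximizes the chosen score. The random forest below is an illustrative base estimator, and the fold count and tree count are kept small for speed:

```python
# Recursive feature elimination with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = load_breast_cancer(return_X_y=True)

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    step=1,            # remove one feature per iteration
    cv=3,              # 3-fold cross-validation
    scoring="accuracy",
)
rfecv.fit(X, y)

print("optimal number of features:", rfecv.n_features_)
```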

Feature Extraction using Principal Component Analysis

We will use principal component analysis (PCA) for feature extraction. We will first need to normalize the data for better performance.

A plot of the cumulative explained variance against the number of components will give us the percentage of variance explained by each of the selected components. This curve quantifies how much of the total variance is contained within the first N components.
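The calculation behind that curve can be sketched as below; StandardScaler is used as an assumed normalization step, and the project's plot is replaced here by a printout of how many components reach 95% of the variance:

```python
# PCA feature extraction: normalize, fit, inspect cumulative variance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_norm = StandardScaler().fit_transform(X)

pca = PCA().fit(X_norm)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# How many components explain at least 95% of the total variance?
n_95 = int(np.searchsorted(cumvar, 0.95) + 1)
print("components for 95% variance:", n_95)
```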


About the Host (Snehan Kekre)

Snehan Kekre is a Machine Learning and Data Science Instructor at Coursera. He studied Computer Science and Artificial Intelligence at Minerva Schools at KGI, based in San Francisco. His interests include AI safety, EdTech, and instructional design. He recognizes that building a deep, technical understanding of machine learning and AI among students and engineers is necessary in order to grow the AI safety community. This passion drives him to design hands-on, project-based machine learning courses on Rhyme.

Frequently Asked Questions

In Rhyme, all projects are completely hands-on. You don't just passively watch someone else; you use the software directly while following the host's (Snehan Kekre) instructions. Using the software is the only way to achieve mastery. With the "Live Guide" option, you can ask for help and get an immediate response.
Nothing! Just join through your web browser. Your host (Snehan Kekre) has already installed all required software and configured all data.
Absolutely! Your host (Snehan Kekre) has provided this session completely free of cost!
You can go to, sign up for free, and follow this visual guide How to use Rhyme to create your own projects. If you have custom needs or company-specific environment, please email us at
Absolutely. We offer Rhyme for workgroups as well as larger departments and companies. Universities, academies, and bootcamps can also buy Rhyme for their settings. You can select projects and trainings that are mission-critical for you, and you can also author your own that reflect your needs and tech environments. Please email us at
Rhyme strives to ensure that visual instructions are helpful for learners with reading impairments. The Rhyme interface has features like resolution and zoom adjustment that help with visual impairments, and we are currently developing closed-captioning functionality to help with hearing impairments. Most of the accessibility options of the cloud desktop's operating system, or of the specific application, can also be used in Rhyme. If you have questions related to accessibility, please email us at
We started with Windows and Linux cloud desktops because they offer the most flexibility in teaching any software (desktop or web). However, web applications like Salesforce can run directly in a virtual browser, and others like Jupyter and RStudio can run in containers accessed through virtual browsers. We are currently working on features so that such web applications won't need to run through cloud desktops. The rest of the Rhyme learning, authoring, and monitoring interfaces will remain the same.
Please email us at and we'll respond to you within one business day.

Ready to join this 52-minute session for free?

More Projects by Snehan Kekre