Data Visualization with Plotly and Seaborn: Breast Cancer Diagnosis – Exploratory Data Analysis

Producing visualizations is an important first step in exploring and analyzing real-world data sets. As such, visualization is an indispensable method in any data scientist’s toolbox. It is also a powerful tool for identifying problems in analyses and illustrating results.

In this project, we will employ the statistical data visualization library, Seaborn, to discover and explore the relationships in the Breast Cancer Wisconsin (Diagnostic) Data Set.

We will cover key concepts in exploratory data analysis (EDA) using visualizations to:

  • Identify and interpret inherent relationships in the data set
  • Produce various chart types including histograms, violin plots, box plots, joint plots, pair grids, and heatmaps
  • Customize plot aesthetics
  • Apply faceting methods to visualize higher dimensional data

Task List


We will cover the following tasks in 57 minutes:


Introduction and Importing the Data

In this task, we are introduced to the project and learning outcomes. Once we are familiarized with the Rhyme interface, we begin working in Jupyter Notebook, a web-based interactive computational environment for creating notebook documents.

Next, we will import essential libraries such as NumPy, pandas, Seaborn, and matplotlib.

Lastly, we use pandas to load the Breast Cancer Wisconsin (Diagnostic) Data Set.
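As a minimal sketch of this task, the snippet below imports the libraries named above and reads a CSV with pandas. The in-memory sample, its column names, and its values are illustrative assumptions standing in for the real data file, whose name and location are not given here.

```python
import io

import matplotlib.pyplot as plt  # used by the plotting tasks that follow
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical two-row excerpt standing in for the real CSV file.
# Column names and values are assumptions made for illustration only.
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean\n"
    "101,M,17.99,10.38\n"
    "102,B,13.54,14.36\n"
)

# In the notebook this would be a file path passed to pd.read_csv().
data = pd.read_csv(sample_csv)

print(data.head())
print(data.shape)
```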


Separate Target from Features

Now that the data set is in memory, we can explore the characteristics of its attributes and instances.

We will drop columns that cannot be used for analysis and classification. Note that this does not constitute feature selection. We are dropping columns that have no bearing on the analysis we will be conducting and would otherwise clutter it.

After producing descriptive statistics about the data, we will separate the target from the features. The target contains the diagnosis with binary class labels, M or B, for malignant and benign tumors respectively.
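One way this step could look is sketched below on a toy frame. The column names, including the dropped `id` and `Unnamed: 32` columns, are assumptions about how the CSV is laid out and may differ in the actual file.

```python
import pandas as pd

# Toy stand-in for the loaded data set (hypothetical columns and values).
data = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "texture_mean": [10.38, 14.36, 15.70],
    "Unnamed: 32": [None, None, None],
})

# Columns assumed to carry no information for the analysis.
drop_cols = ["id", "Unnamed: 32"]

# Target: the diagnosis labels (M = malignant, B = benign).
y = data["diagnosis"]

# Features: everything except the target and the dropped columns.
X = data.drop(columns=drop_cols + ["diagnosis"])

# Descriptive statistics for the remaining numeric features.
print(X.describe())
```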


Diagnosis Distribution Visualization

A very common question during model evaluation is, “Why isn’t the model I’ve picked predictive?” One frequent cause is class imbalance, so it is worth checking how the classes are distributed before modeling.

In this task, we will use Seaborn’s countplot() method to visualize the target distributions. We will also generate descriptive statistics about the features that summarize the central tendency, dispersion and shape of the data set’s distribution.
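A small sketch of the class-balance check follows. The labels are made up for illustration and do not reflect the real class counts.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy labels (assumption): 6 benign and 4 malignant cases.
target = pd.DataFrame({"diagnosis": ["B"] * 6 + ["M"] * 4})

# One bar per class; bar height is the number of rows with that label.
ax = sns.countplot(x="diagnosis", data=target)
ax.set_title("Diagnosis counts")

# The same information as numbers.
counts = target["diagnosis"].value_counts()
print(counts)
```

Outside a notebook, a call to `plt.show()` would be needed to display the figure.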


Visualizing Standardized Data with Seaborn

As the columns in the data set take on values of varying range, we need to standardize the data before proceeding with further analysis and visualization.

To begin feature analysis, we use Seaborn’s violinplot() method. Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
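The sketch below shows one possible way to standardize the features and draw split violins per class. The data is randomly generated, the feature names are placeholders, and z-score standardization is an assumption about which scaling is meant.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
n = 100
y = pd.Series(rng.choice(["B", "M"], size=n), name="diagnosis")
X = pd.DataFrame({
    "radius_mean": rng.normal(14, 3, n),
    "area_mean": rng.normal(650, 350, n),
})

# Z-score standardization: zero mean and unit standard deviation per column.
X_std = (X - X.mean()) / X.std()

# Long format: one row per (observation, feature) pair, as Seaborn expects.
long_df = pd.concat([y, X_std], axis=1).melt(
    id_vars="diagnosis", var_name="feature", value_name="value"
)

plt.figure(figsize=(8, 5))
ax = sns.violinplot(
    x="feature", y="value", hue="diagnosis",
    data=long_df, split=True, inner="quart",
)
```

With `split=True`, each violin shows one class on each side, which makes it easier to see whether a feature separates the two diagnoses.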


Violin Plots and Box Plots

We use violin plots and box plots to identify features that best separate the data for classification. Box plots are especially useful for identifying outliers in the data. Violin plots can also hint that certain features are correlated: features whose per-class distributions look very similar are candidates, which a direct pairwise comparison can then confirm.

To minimize clutter in our visualizations, we divide the features into three batches of ten features and produce separate plots for them.
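The batching idea can be sketched as follows, again on synthetic data. The generic `feature_0` to `feature_29` names are placeholders, not the real column names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in (assumption): 30 already-standardized features.
rng = np.random.default_rng(1)
n = 60
y = pd.Series(["B", "M"] * (n // 2), name="diagnosis")
X_std = pd.DataFrame(
    rng.normal(size=(n, 30)),
    columns=[f"feature_{i}" for i in range(30)],
)

# Split the 30 column names into three consecutive batches of ten.
batches = [X_std.columns[i:i + 10] for i in range(0, X_std.shape[1], 10)]

for cols in batches:
    long_df = pd.concat([y, X_std[cols]], axis=1).melt(
        id_vars="diagnosis", var_name="feature", value_name="value"
    )
    fig, ax = plt.subplots(figsize=(10, 5))
    # One box per feature and class; points beyond the whiskers are outliers.
    sns.boxplot(x="feature", y="value", hue="diagnosis", data=long_df, ax=ax)
    ax.tick_params(axis="x", labelrotation=45)
```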


Using Joint Plots for Feature Comparison

Joint plots come in handy to illustrate the relationship between two features. We will use Seaborn’s jointplot() method to draw a scatter plot with marginal histograms and kernel density fits. We can quantify the linear relationship between any two features with the Pearson correlation coefficient and inspect it visually through the regression line drawn over the scatter plot.


Uncovering Correlated Features with Pair Grids

In this task, we will use Seaborn’s PairGrid class to plot pairwise relationships in the data set. To keep the grid readable, we will limit ourselves to three features.

We will use these results to inform our feature selection process in the next project.


Observing the Distribution of Values and their Variance with Swarm Plots

We have learned that violin plots are a great tool for visualizing the shape of a distribution. As our data set contains close to 600 rows, it is small enough that we might want to simply display each point in the same visualization. This need is satisfied by Seaborn’s swarmplot() method. A swarm plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

The points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, but it does not scale well to large numbers of observations. This style of plot is sometimes called a “beeswarm”. It can be used to more clearly observe the variance in the data.
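A brief sketch of a swarm plot on standardized features follows. The data is synthetic and the marker size is an arbitrary choice to reduce crowding.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic standardized features (assumption, for illustration only).
rng = np.random.default_rng(4)
n = 80
y = pd.Series(["B", "M"] * (n // 2), name="diagnosis")
X_std = pd.DataFrame({
    "radius_mean": rng.normal(size=n),
    "texture_mean": rng.normal(size=n),
})

long_df = pd.concat([y, X_std], axis=1).melt(
    id_vars="diagnosis", var_name="feature", value_name="value"
)

plt.figure(figsize=(8, 5))
# Every observation is drawn; points are nudged sideways to avoid overlap.
ax = sns.swarmplot(
    x="feature", y="value", hue="diagnosis", data=long_df, size=3
)
```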


Observing all Pairwise Correlations

A good way to identify correlations between features is to visualize the correlation matrix as a heatmap. We will make a note of the correlated features so that we can drop them from our data set before building a predictive model in the next project.
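The heatmap step, together with a simple way to list highly correlated pairs, could be sketched as below. The synthetic data and the 0.9 cutoff are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic features (assumption): the first two are nearly collinear.
rng = np.random.default_rng(5)
base = rng.normal(size=200)
X = pd.DataFrame({
    "radius_mean": base,
    "perimeter_mean": base + rng.normal(scale=0.1, size=200),
    "texture_mean": rng.normal(size=200),
})

# Pairwise Pearson correlations between all features.
corr = X.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)

# Keep only the upper triangle so each pair is considered once.
threshold = 0.9  # assumed cutoff for "highly correlated"
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
correlated_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and abs(upper.loc[row, col]) > threshold
]
print(correlated_pairs)
```

For each listed pair, one of the two features is a candidate for removal before modeling.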

In the next project, we will remove these correlated features and analyze the classification accuracy we get using XGBoost, a boosted decision tree classifier. We will then employ various feature selection and feature extraction methods to get the most predictive features and improve our classification accuracy.

About the Host (Snehan Kekre)


Snehan hosts Machine Learning and Data Sciences projects at Rhyme. He is in his senior year of university at the Minerva Schools at KGI, studying Computer Science and Artificial Intelligence. When not applying computational and quantitative methods to identify the structures shaping the world around him, he can sometimes be seen trekking in the mountains of Nepal.



Frequently Asked Questions


In Rhyme, all projects are completely hands-on. You don't just passively watch someone else. You use the software directly while following the host's (Snehan Kekre) instructions. Using the software is the only way to achieve mastery. With the "Live Guide" option, you can ask for help and get an immediate response.
Nothing! Just join through your web browser. Your host (Snehan Kekre) has already installed all required software and configured all data.
You can go to https://rhyme.com/for-companies, sign up for free, and follow the visual guide "How to use Rhyme to create your own projects". If you have custom needs or a company-specific environment, please email us at help@rhyme.com
Absolutely. We offer Rhyme for workgroups as well as larger departments and companies. Universities, academies, and bootcamps can also buy Rhyme for their settings. You can select projects and trainings that are mission critical for you and also author your own that reflect your own needs and tech environments. Please email us at help@rhyme.com
Rhyme's visual instructions are somewhat helpful for reading impairments. The Rhyme interface has features like resolution and zoom that are slightly helpful for visual impairment. We are also currently developing a closed-caption functionality to help with hearing impairment. Most of the accessibility options of the cloud desktop's operating system or the specific application can also be used in Rhyme. However, we still have a lot of work to do. If you have suggestions for accessibility, please email us at accessibility@rhyme.com
We started with Windows and Linux cloud desktops because they have the most flexibility in teaching any software (desktop or web). However, web applications like Salesforce can run directly through a virtual browser, and others like Jupyter and RStudio can run on containers and be accessed by virtual browsers. We are currently working on features that let such web applications run without cloud desktops. The rest of the Rhyme learning, authoring, and monitoring interfaces will remain the same.
Please email us at help@rhyme.com and we'll respond to you within one business day.
