Project: Regression Analysis with Yellowbrick

In this project, we will build a machine learning model to predict the compressive strength of high-performance concrete (HPC). Although we will use linear regression, the emphasis of this project will be on using visualization techniques to steer our machine learning workflow.

Visualization plays a crucial role throughout the analytical process. It is indispensable for any effective analysis, model selection, and evaluation. This project will make use of a diagnostic platform called Yellowbrick. It allows data scientists and machine learning practitioners to visualize the entire model selection process to steer towards better, more explainable models.

Yellowbrick hosts several datasets from the UCI Machine Learning Repository. We’ll be working with the concrete dataset, which is well suited for regression tasks. The dataset contains 1030 instances and 8 real-valued attributes with a continuous target.

Task List

We will cover the following tasks in 1 hour and 10 minutes:


We will familiarize ourselves with the Rhyme interface and our learning environment. You will be provided with a cloud desktop with Jupyter Notebooks and all the software you will need to complete the project. Jupyter Notebooks are very popular with data scientists and machine learning engineers, as one can write code in some cells and use other cells for documentation.

We will also introduce the model we will be building as well as the dataset for this project.

Data Exploration

In this task, we use the pandas library to load our data file. Next, we explore its attributes as well as the descriptive summary statistics associated with each attribute.

The concrete dataset contains 1030 instances and 9 attributes. Eight of the attributes are explanatory variables, including the age of the concrete and the materials used to create it, while the target variable strength is a measure of the concrete’s compressive strength (MPa).
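As a rough sketch of this exploration step, the snippet below loads a tiny CSV with pandas and prints its shape, dtypes, and summary statistics. The rows are invented for illustration and are not taken from the concrete dataset, and the column names other than strength and age are assumptions.

```python
import io

import pandas as pd

# Invented stand-in for the data file: values are made up for illustration,
# and the column names are assumptions rather than the project's actual file.
csv_text = """cement,water,age,strength
300.0,180.0,28,35.0
250.0,190.0,7,20.0
400.0,170.0,90,55.0
350.0,175.0,28,42.0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of instances, number of attributes)
print(df.dtypes)  # data type of each attribute

# Count, mean, standard deviation, min, quartiles, and max per attribute
summary = df.describe()
print(summary)
```

With the real data file, the same calls would be applied to the DataFrame returned by reading that file.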

Preprocessing the Data

The preprocessing steps will involve specifying the features and target of interest, followed by creating the matrix of features and the target vector.
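A minimal sketch of this step, assuming the data is already in a pandas DataFrame. The values and most column names below are hypothetical; only strength (the target) and age are named in the project description.

```python
import pandas as pd

# Hypothetical miniature of the data; values are invented for illustration.
df = pd.DataFrame(
    {
        "cement": [300.0, 250.0, 400.0, 350.0],
        "water": [180.0, 190.0, 170.0, 175.0],
        "age": [28, 7, 90, 28],
        "strength": [35.0, 20.0, 55.0, 42.0],
    }
)

# Specify the features and target of interest
features = ["cement", "water", "age"]
target = "strength"

# Matrix of features (one row per instance) and target vector
X = df[features]
y = df[target]

print(X.shape, y.shape)
```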

Pairwise Scatterplot

In this task, we continue exploring the basic properties of our data. We leverage the fantastic plot-styling library seaborn to create a pairwise scatterplot of the attributes. We might gain some insight as to which attributes, if any, are less evenly distributed across our data.

Feature Importances

A question that crops up before any of the machine learning begins is: How do I select the right features?

In this task, we answer just that. A common approach to eliminating features is to describe their relative importance to a model, then eliminate weak features or combinations of features and re-evaluate to see if the model fares better during cross-validation.
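The approach described above can be sketched with scikit-learn alone: rank features by the magnitude of their Lasso coefficients, drop the weak ones, and compare cross-validated scores. The data is synthetic, the alpha value is an arbitrary assumption, and this is not necessarily how the project's notebook does it. Comparing coefficient magnitudes also presumes the features are on a comparable scale, which holds here by construction.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first two of four features drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=200)

# Relative importance taken as the absolute value of each coefficient
model = Lasso(alpha=0.1).fit(X, y)
importance = np.abs(model.coef_)
ranking = np.argsort(importance)[::-1]  # most important first

# Keep the two strongest features and re-evaluate with cross-validation
keep = ranking[:2]
score_all = cross_val_score(Lasso(alpha=0.1), X, y, cv=5, scoring="r2").mean()
score_kept = cross_val_score(
    Lasso(alpha=0.1), X[:, keep], y, cv=5, scoring="r2"
).mean()

print("ranking:", ranking)
print("R^2 with all features:", round(score_all, 3))
print("R^2 with kept features:", round(score_kept, 3))
```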

Target Visualization

Real-world machine learning problems often suffer from the curse of dimensionality, which makes it hard to acquire sufficient training data. Other times, there simply aren’t enough data to train regression models to the precision required. In these cases, we may be able to transform the regression problem into a classification problem by binning the target instances into dummy classes.

How do we select the optimal number of bins and ensure that our data is evenly distributed across them? This is precisely the focus of this task!
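One simple way to get evenly populated bins is to place the bin edges at quantiles of the target. The sketch below does this with NumPy on a synthetic, skewed target. It illustrates the idea only and is an assumption on our part, not necessarily the method used by the visualizer in this task.

```python
import numpy as np

# Synthetic, right-skewed target values (not the real strength column)
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=15.0, size=1000)

n_bins = 4

# Bin edges at the 0%, 25%, 50%, 75%, and 100% quantiles of the target
edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))

# Assign each instance a class label 0..n_bins-1 using the interior edges
labels = np.digitize(y, edges[1:-1])

# Number of instances per class; quantile edges keep these balanced
counts = np.bincount(labels, minlength=n_bins)
print(edges.round(2))
print(counts)
```

Equal-width bins on the same skewed target would leave the upper classes nearly empty, which is why the bin edges matter as much as the number of bins.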

Evaluating Lasso Regression

In this task, we first divide the data into training and test splits. Next, we fit the model on the training set and predict on the test set. A prediction error plot shows the actual targets from the dataset against the predicted values generated by our model. This allows us to see how much variance is in the model. Machine learning practitioners can diagnose regression models using this plot by comparing against the 45-degree line, where the prediction exactly matches the actual value.
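The quantities behind a prediction error plot can be sketched with scikit-learn as follows. The data is synthetic, and the split size, random seed, and alpha are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the concrete dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Divide the data into training and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set, predict on the test set
model = Lasso(alpha=0.01).fit(X_train, y_train)
y_pred = model.predict(X_test)

# A prediction error plot scatters y_test against y_pred; points on the
# 45-degree line have zero vertical distance from it
distance_from_identity = y_pred - y_test
r2 = r2_score(y_test, y_pred)

print("test R^2:", round(r2, 3))
print("largest miss:", round(np.abs(distance_from_identity).max(), 3))
```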

Visualizing Test-set Errors

We can visualize the error on both the training and test sets to diagnose heteroscedasticity.

Residuals, in the context of regression models, are the difference between the observed value of the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. The ResidualsPlot Visualizer plots the residuals on the vertical axis against the dependent variable on the horizontal axis, allowing us to detect regions within the target that may be susceptible to more or less error.
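To make the idea concrete, the sketch below computes residuals for a straight-line fit on synthetic data whose noise grows with the input, then compares the residual spread in the lower and upper halves of the predictions. This crude comparison is our own illustration of heteroscedasticity, not a formal test and not part of the visualizer.

```python
import numpy as np

# Synthetic data with noise whose standard deviation grows with x
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=500)
y = 3.0 * x + rng.normal(scale=0.5 * x)

# Ordinary least-squares line and its predictions
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Residual = observed value minus predicted value
residuals = y - y_hat

# Compare residual spread for small versus large predictions
order = np.argsort(y_hat)
low, high = np.array_split(residuals[order], 2)
spread_ratio = high.std() / low.std()

print("residual std (low predictions):", round(low.std(), 3))
print("residual std (high predictions):", round(high.std(), 3))
```

A ratio well above 1 corresponds to the fan shape that a residuals plot would show for such data.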

Cross Validation Scores

We generally determine whether a given model is optimal by looking at its F1, precision, recall, and accuracy scores (for classification), or its coefficient of determination (R²) and error (for regression). However, real-world data is often distributed somewhat unevenly, meaning that the fitted model is likely to perform better on some sections of the data than on others. Yellowbrick’s CVScores visualizer enables us to visually explore these variations in performance using different cross-validation strategies.
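The per-fold scores that such a visualization displays can be computed directly with scikit-learn. This is a minimal sketch on synthetic data; the number of folds, the shuffling, and the alpha value are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the concrete dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# One cross-validation strategy: six shuffled folds
cv = KFold(n_splits=6, shuffle=True, random_state=42)

# One R^2 score per fold; the spread shows how performance varies
# across sections of the data
scores = cross_val_score(Lasso(alpha=0.01), X, y, cv=cv, scoring="r2")

print("fold scores:", scores.round(3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```

Swapping KFold for another splitter changes the strategy without touching the rest of the code.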

Learning Curves

A learning curve shows the relationship between the training score and the cross-validated test score for an estimator as the number of training samples varies. It can be used to show how much the estimator benefits from more data, and whether our model is more sensitive to error due to variance or error due to bias.
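The numbers behind a learning curve can be obtained with scikit-learn's learning_curve function. The sketch below uses synthetic data, and the training sizes, fold count, and alpha are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import learning_curve

# Synthetic regression data standing in for the concrete dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Score the estimator at five training-set sizes with 5-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(
    Lasso(alpha=0.01),
    X,
    y,
    cv=5,
    scoring="r2",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Average over folds: one training score and one test score per size
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print(f"{n:4d} samples  train R^2={tr:.3f}  cv R^2={te:.3f}")
```

A large, persistent gap between the two curves points to variance, while two curves that converge at a low score point to bias.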

Hyperparameter Tuning - Alpha Selection

Tuning a model is as important as model selection. Regularization is designed to penalize model complexity. Alphas that are too high increase the error due to bias (underfit), while alphas that are too low increase the error due to variance (overfit). So in this task, we are going to learn how to choose an optimal alpha such that the error is minimized in both directions.
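One way to search for such an alpha is scikit-learn's LassoCV, which cross-validates the model over a grid of candidate alphas and keeps the one with the lowest average error. The sketch below uses synthetic data, and the grid bounds and fold count are assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic regression data standing in for the concrete dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Candidate alphas on a log scale, from weak to strong regularization
alphas = np.logspace(-3, 1, 50)

# Cross-validate every alpha and keep the one with the lowest mean error
model = LassoCV(alphas=alphas, cv=5).fit(X, y)
best_alpha = model.alpha_

print("selected alpha:", best_alpha)
print("coefficients:", model.coef_.round(3))
```

Plotting the mean cross-validated error against the candidate alphas typically gives a curve whose minimum marks the selected value.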

About the Host (Snehan Kekre)

Snehan Kekre is a Machine Learning and Data Science Instructor at Coursera. He studied Computer Science and Artificial Intelligence at Minerva Schools at KGI, based in San Francisco. His interests include AI safety, EdTech, and instructional design. He recognizes that building a deep, technical understanding of machine learning and AI among students and engineers is necessary in order to grow the AI safety community. This passion drives him to design hands-on, project-based machine learning courses on Rhyme.

Frequently Asked Questions

In Rhyme, all projects are completely hands-on. You don't just passively watch someone else. You use the software directly while following the host's (Snehan Kekre) instructions. Using the software is the only way to achieve mastery. With the "Live Guide" option, you can ask for help and get an immediate response.
Nothing! Just join through your web browser. Your host (Snehan Kekre) has already installed all required software and configured all data.
Absolutely! Your host (Snehan Kekre) has provided this session completely free of cost!
You can go to, sign up for free, and follow this visual guide How to use Rhyme to create your own projects. If you have custom needs or a company-specific environment, please email us at
Absolutely. We offer Rhyme for workgroups as well as larger departments and companies. Universities, academies, and bootcamps can also buy Rhyme for their settings. You can select projects and trainings that are mission critical for you and also author your own that reflect your own needs and tech environments. Please email us at
Rhyme strives to ensure that visual instructions are helpful for reading impairments. The Rhyme interface has features like resolution and zoom that will be helpful for visual impairments. We are also currently developing a closed-caption functionality to help with hearing impairments. Most of the accessibility options of the cloud desktop's operating system or the specific application can also be used in Rhyme. If you have questions related to accessibility, please email us at
We started with Windows and Linux cloud desktops because they have the most flexibility in teaching any software (desktop or web). However, web applications like Salesforce can run directly through a virtual browser, and others like Jupyter and RStudio can run on containers and be accessed by virtual browsers. We are currently working on features that will let such web applications run without cloud desktops. The rest of the Rhyme learning, authoring, and monitoring interfaces will remain the same.
Please email us at and we'll respond to you within one business day.
