We will cover the following tasks in 1 hour and 10 minutes:
We will familiarize ourselves with the Rhyme interface and our learning environment. You will be provided with a cloud desktop with Jupyter Notebooks and all the software you will need to complete the project. Jupyter Notebooks are very popular with Data Scientists and Machine Learning Engineers, as one can write code in some cells and documentation in others.
We will also introduce the model we will be building as well as the dataset for this project.
In this task, we use the pandas library to load our data file. Next, we explore its attributes as well as the descriptive summary statistics associated with each attribute.
The concrete dataset contains 1030 instances and 9 attributes. Eight of the attributes are explanatory variables, including the age of the concrete and the materials used to create it, while the target variable strength is a measure of the concrete’s compressive strength (MPa).
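A minimal sketch of the loading and exploration step is shown below. The column names and the inline sample rows are illustrative stand-ins for the real CSV file, which you would load by passing its path to read_csv:

```python
import io
import pandas as pd

# Inline sample standing in for the real CSV; with the actual file you would
# call pd.read_csv("path/to/concrete.csv") instead.
csv_data = io.StringIO(
    "cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength\n"
    "540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99\n"
    "332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27\n"
)
df = pd.read_csv(csv_data)

print(df.shape)       # (number of instances, number of attributes)
print(df.dtypes)      # type of each attribute
print(df.describe())  # descriptive summary statistics per attribute
```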
Preprocessing the Data
The preprocessing steps will involve specifying the features and target of interest, followed by creating the matrix of features and the target vector.
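As a sketch, with a toy DataFrame and assumed column names standing in for the concrete data, creating the matrix of features and the target vector might look like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the concrete data; column names are assumed.
df = pd.DataFrame(rng.random((5, 3)), columns=["cement", "age", "strength"])

features = ["cement", "age"]  # explanatory variables of interest
target = "strength"           # compressive strength (MPa)

X = df[features].values       # matrix of features, shape (n_samples, n_features)
y = df[target].values         # target vector, shape (n_samples,)
```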
In this task, we continue exploring the basic properties of our data. We leverage the fantastic plot-styling library seaborn to create a pairwise scatterplot of the attributes. We might gain some insight as to which attributes, if any, are less evenly distributed across our data.
A question that crops up before any of the machine learning begins is: How do I select the right features?
In this task, we answer just that. A common approach to eliminating features is to describe their relative importance to a model, then eliminate weak features or combinations of features and re-evaluate to see if the model fares better during cross-validation.
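One concrete way to rank features, sketched below on synthetic data, is to read the magnitudes of a Lasso model's coefficients (after standardizing the features), drop the ones the L1 penalty zeroed out, and compare cross-validation scores before and after. The project may use a different importance measure; this is one illustrative possibility:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 samples, 8 features, only 3 of them informative.
X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # put coefficients on a comparable scale

lasso = Lasso(alpha=1.0).fit(X, y)
importance = np.abs(lasso.coef_)       # relative importance per feature

# Keep only features whose coefficient survived the L1 penalty,
# then re-evaluate with cross-validation.
keep = importance > 0
score_full = cross_val_score(Lasso(alpha=1.0), X, y, cv=5).mean()
score_reduced = cross_val_score(Lasso(alpha=1.0), X[:, keep], y, cv=5).mean()
print(score_full, score_reduced)
```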
Often in real-world machine learning problems, we suffer from the curse of dimensionality. Sometimes acquiring sufficient training data is a problem; other times, there aren’t enough data to train regression models to the precision required. In these cases, we may be able to transform the regression problem into a classification problem by binning the target instances into dummy classes.
How do we select the optimal number of bins and ensure that our data is evenly distributed across them? This is precisely the focus of this task!
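One way to get evenly populated bins, sketched here on a synthetic target, is quantile-based binning: pandas' qcut chooses the bin edges from quantiles of the data, so each class ends up with roughly the same number of instances. (The labels below are illustrative.)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the strength target.
strength = pd.Series(rng.normal(35, 15, size=200))

# qcut picks bin edges from quantiles, so membership is balanced across bins.
bins = pd.qcut(strength, q=4, labels=["low", "mid-low", "mid-high", "high"])
print(bins.value_counts())
```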
Evaluating Lasso Regression
In this task, we first divide the data into training and test splits. Next, we fit the model on the training set and predict on the test set. A prediction error plot shows the actual targets from the dataset against the predicted values generated by our model. This allows us to see how much variance is in the model. Machine learning practitioners can diagnose regression models using this plot by comparing against the 45 degree line, where the prediction exactly matches the actual value.
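A sketch of this split-fit-predict workflow on synthetic data; the pairs (y_test, y_pred) printed here are exactly the points a prediction error plot draws against the 45 degree line:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Lasso(alpha=1.0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# A prediction error plot scatters y_test against y_pred; points on the
# line y_pred == y_test are perfect predictions.
r2 = model.score(X_test, y_test)  # coefficient of determination on the test set
print(r2)
```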
Visualizing Test-set Errors
We can visualize the error on both the training and test sets to diagnose heteroscedasticity.
Residuals, in the context of regression models, are the difference between the observed value of the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. The ResidualsPlot Visualizer shows the difference between residuals on the vertical axis and the dependent variable on the horizontal axis, allowing us to detect regions within the target that may be susceptible to more or less error.
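The quantities behind that plot can be computed directly, as in this sketch on synthetic data (the ResidualsPlot Visualizer draws them for you; here we only compute the residuals themselves):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data.
X, y = make_regression(n_samples=200, n_features=6, noise=8.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = Lasso(alpha=1.0).fit(X_train, y_train)

# Residual = observed minus predicted; a well-behaved model scatters these
# symmetrically around zero with no visible trend (homoscedasticity).
residuals = y_test - model.predict(X_test)
print(residuals.mean(), residuals.std())
```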
Cross Validation Scores
We generally determine whether a given model is optimal by looking at its F1, precision, recall, and accuracy scores (for classification), or its coefficient of determination (R2) and error (for regression). However, real-world data is often distributed somewhat unevenly, meaning that the fitted model is likely to perform better on some sections of the data than on others. Yellowbrick’s CVScores visualizer enables us to visually explore these variations in performance using different cross validation strategies.
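The per-fold scores that CVScores visualizes come from ordinary cross-validation, sketched here on synthetic data with scikit-learn's cross_val_score:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the concrete data.
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=2)

# One possible cross validation strategy; others (e.g. more splits, no
# shuffling) would expose different slices of the data per fold.
cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(Lasso(alpha=1.0), X, y, cv=cv)  # R^2 per fold
print(scores, scores.mean())
```

Comparing the spread of the per-fold scores against their mean is what reveals the unevenness the paragraph above describes.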
A learning curve shows the relationship of the training score vs. the cross-validated test score for an estimator with a varying number of training samples. It can be used to show how much the estimator benefits from more data, and whether our model is more sensitive to error due to variance or error due to bias.
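The underlying numbers can be produced with scikit-learn's learning_curve utility, sketched here on synthetic data; plotting the two mean-score arrays against the training sizes yields the learning curve:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the concrete data.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=3)

sizes, train_scores, test_scores = learning_curve(
    Lasso(alpha=1.0), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 4))  # 4 increasing training-set sizes

print(sizes)                       # number of training samples per point
print(train_scores.mean(axis=1))   # score on the data the model saw
print(test_scores.mean(axis=1))    # cross-validated score on held-out folds
```

A wide, persistent gap between the two curves suggests error due to variance; two low curves that converge suggest error due to bias.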
Hyperparameter Tuning - Alpha Selection
Tuning a model is as important as model selection. Regularization is designed to penalize model complexity. Alphas that are too high increase the error due to bias (underfit), while alphas that are too low increase the error due to variance (overfit). So in this task, we are going to learn how to choose an optimal alpha such that the error is minimized in both directions.
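One way to search for that optimal alpha, sketched below on synthetic data, is scikit-learn's LassoCV, which cross-validates over a grid of alphas and keeps the one minimizing the mean squared error across folds (a visualizer such as Yellowbrick's AlphaSelection plots the same search):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the concrete data.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=4)

# Candidate regularization strengths spanning several orders of magnitude.
alphas = np.logspace(-3, 1, 50)
model = LassoCV(alphas=alphas, cv=5).fit(X, y)
print(model.alpha_)  # the alpha with the lowest cross-validated error
```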
About the Host (Snehan Kekre)
Snehan hosts Machine Learning and Data Sciences projects at Rhyme. He is in his senior year of university at the Minerva Schools at KGI, studying Computer Science and Artificial Intelligence. When not applying computational and quantitative methods to identify the structures shaping the world around him, he can sometimes be seen trekking in the mountains of Nepal.