We will cover the following tasks in 52 minutes:
Introduction and Overview
Azure ML Studio has a fair number of built-in machine learning algorithms that we can use as modules. But what if we need an algorithm that is not included by default? Azure can leverage the entire open-source R and Python communities! We can use the Create R Model module to bring in any R machine learning library and its associated algorithms.
The Bike Sharing dataset has 10,886 observations, each one pertaining to a specific hour from the first 19 days of each month from 2011 to 2012. The dataset consists of 12 columns that record information about bike rentals: date-time, season, holiday, working day, weather, temp, “feels like” temp, humidity, wind speed, casual rentals, registered rentals, and total rentals.
Feature Engineering and Preprocessing
There is an untapped wealth of predictive power hidden in the “datetime” column, but the column must first be converted from its current string form. Conveniently, Azure ML has a module for running R scripts, which can take advantage of R’s built-in functionality for extracting features from date-time data.
We now select an Execute R Script module to run our feature engineering script. This module allows us to import our dataset from Azure ML, add new features, and then export the improved dataset. Beyond this project, the module is also useful for tasks such as cleaning data and creating graphs.
Our goal is to convert the datetime column of strings into date-time objects in R, so we can take advantage of R’s built-in date-time functionality. R has two internal implementations of date-times: POSIXlt and POSIXct. We found that Azure ML had problems dealing with POSIXlt, so we recommend using POSIXct for any date-time feature engineering.
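The idea of turning a date-time string into model features can be sketched in a few lines. The sketch below uses Python rather than R, purely to illustrate the logic; the project itself does this in an R script. The function name `extract_datetime_features` is hypothetical, and the `YYYY-MM-DD HH:MM:SS` timestamp format is an assumption about how the datetime column is stored.

```python
from datetime import datetime


def extract_datetime_features(timestamp: str) -> dict:
    """Parse a date-time string and derive calendar features from it.

    Assumes the 'YYYY-MM-DD HH:MM:SS' format (an assumption, not confirmed
    by the project description).
    """
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": parsed.hour,
        "weekday": parsed.weekday(),  # Monday = 0, ..., Sunday = 6
        "month": parsed.month,
        "year": parsed.year,
    }


# Example: 5 a.m. on Saturday, 1 January 2011.
features = extract_datetime_features("2011-01-01 05:00:00")
```

The derived names (hour, weekday, month, year) match the columns that are later marked as categorical.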
This dataset has only one observation where weather = 4. Because weather is a categorical variable, R will throw an error if that observation ends up in the test data split: R expects the levels of each categorical variable in the test data to match the levels found in the training data split. The observation must therefore be removed. We write a custom R script to remove the outlier.
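The filtering step can be sketched as follows. This is an illustration in Python, not the R script used in the project. The rows are toy values invented for the example, and the helper names `rare_levels` and `drop_levels` are hypothetical.

```python
from collections import Counter

# Toy rows invented for illustration; they are not taken from the real dataset.
rows = [
    {"weather": 1, "count": 16},
    {"weather": 2, "count": 40},
    {"weather": 1, "count": 32},
    {"weather": 4, "count": 164},
    {"weather": 3, "count": 13},
    {"weather": 2, "count": 8},
    {"weather": 3, "count": 5},
]


def rare_levels(rows, column, min_count=2):
    """Return the levels of `column` that occur fewer than `min_count` times."""
    counts = Counter(row[column] for row in rows)
    return {level for level, n in counts.items() if n < min_count}


def drop_levels(rows, column, levels):
    """Keep only the rows whose value in `column` is not in `levels`."""
    return [row for row in rows if row[column] not in levels]


to_drop = rare_levels(rows, "weather")           # levels seen only once
cleaned = drop_levels(rows, "weather", to_drop)  # rows safe to split
```

With a level that appears only once, no split can place it in both the training and test data, which is why dropping the row is simpler than trying to stratify it.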
Creating Training and Test Sets
Before training our model, we must tell Azure ML which variables are categorical. To do this, we use the Metadata Editor. We use the column selector to choose the hour, weekday, month, year, season, weather, holiday, and workingday columns, and then select “Make categorical” under the “Categorical” dropdown.
Before creating our random forest, we must identify columns that add little to no value for predictive modeling. These columns will be dropped. Since we are predicting total count, the registered bike rental and casual bike rental columns must be dropped. Together, these two values add up to the total count, so a model given both would score well but teach us nothing: it would simply sum them to recover the total. One could instead train separate models to predict casual and registered bike rentals independently. Azure ML would make it very easy to include these models in our experiment after creating one for total count.
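A small sketch shows why these two columns leak the answer and how they might be dropped. It is written in Python for illustration only; in the project this step is done inside Azure ML. The row values are invented, and the column names `casual`, `registered`, and `count`, as well as the helper `drop_columns`, are assumptions made for the example.

```python
# Columns that together reveal the label and so must not be used as features.
LEAKY_COLUMNS = ("casual", "registered")


def drop_columns(row, columns):
    """Return a copy of `row` without the named columns."""
    return {key: value for key, value in row.items() if key not in columns}


# Toy observation invented for illustration.
row = {"hour": 5, "temp": 9.84, "casual": 3, "registered": 13, "count": 16}

# The two rental columns sum exactly to the label, which is the leak.
leaks_label = row["casual"] + row["registered"] == row["count"]

features = drop_columns(row, LEAKY_COLUMNS)
```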
We must now tell Azure ML explicitly which attribute the algorithm should learn to predict, by casting that attribute as a “label”.
Model Building and Training
Here is where we take advantage of Azure ML’s newest feature: the Create R Model module. We can now use R’s randomForest library and its large number of adjustable parameters directly inside Azure ML Studio. The model can then be deployed as a web service. Previously, R models were nearly impossible to deploy to the web.
Evaluating the Model
Unfortunately, Azure ML’s Evaluate Model module does not yet support models built with the Create R Model module. We assume this feature will be added in the near future. In the meantime, we can import the results from the scored model (Score Model module) into an Execute R Script module and compute an evaluation using R. We calculate the mean squared error (MSE), then export our result back to Azure ML as a data frame.
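The MSE is the average of the squared differences between the actual and predicted values. A minimal sketch of the calculation follows, written in Python for illustration rather than in the R used in the project. The function name and the sample numbers are invented for the example.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Toy values: the errors are -2, 2, and -3, so the MSE is (4 + 4 + 9) / 3.
mse = mean_squared_error([10, 20, 30], [12, 18, 33])
```

Because the errors are squared, a few large misses raise the MSE far more than many small ones.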
About the Host (Snehan Kekre)
Snehan Kekre is a Machine Learning and Data Science Instructor at Coursera. He studied Computer Science and Artificial Intelligence at Minerva Schools at KGI, based in San Francisco. His interests include AI safety, EdTech, and instructional design. He recognizes that building a deep, technical understanding of machine learning and AI among students and engineers is necessary in order to grow the AI safety community. This passion drives him to design hands-on, project-based machine learning courses on Rhyme.