We will cover the following tasks in 1 hour and 6 minutes:
Introduction and Setup Instructions
Azure Machine Learning Studio is a GUI-based integrated development environment for constructing and operationalizing machine learning workflows on Azure. In this task, we will sign up for an ML Studio account and claim $200 worth of free credits to create our machine learning experiments!
Importing the Data Sets
We start by signing in to our Azure ML Studio account. Next, we create a blank experiment and use the drag and drop module to import our two data sets:
- Flight On-Time Data from the US Bureau of Transportation Statistics
- Weather Data from the National Oceanic and Atmospheric Administration
Lastly, we convert our input data, Flight Delays Data.csv and Weather Dataset.csv, to the internal Dataset format used by Azure ML Studio. This is done using the Convert to Dataset module.
Scrubbing Missing Values
Now that we have imported the data, it’s time to get a sense of its properties so that we can make informed decisions when pre-processing.
We use the Summarize Data module to generate and visualize descriptive statistics for the columns in the Flight Delays data.
Dealing with missing data is an essential pre-processing step. There are a number of ways to impute data. We will use the Clean Missing Data module to substitute all missing values in our data with 0.
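As a rough illustration of what substituting missing values with 0 amounts to, here is a minimal sketch in plain Python. It is not the Clean Missing Data module itself. The helper name fill_missing and the DepDelay column are assumptions made for the example, and missing cells are assumed to appear as None or an empty string.

```python
def fill_missing(rows, fill_value=0):
    """Return new rows in which missing cells (None or "") become fill_value."""
    return [
        {col: (fill_value if val is None or val == "" else val)
         for col, val in row.items()}
        for row in rows
    ]

# Toy rows; "DepDelay" is a hypothetical column name used only for illustration.
flights = [
    {"Carrier": "DL", "DepDelay": 4, "ArrDel15": 0},
    {"Carrier": "AA", "DepDelay": None, "ArrDel15": ""},
]

cleaned = fill_missing(flights)
print(cleaned[1])
```

Filling with a constant is the simplest imputation strategy. It keeps every row, at the cost of treating "unknown" as a real value of 0.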
Eliminating Target Leaks
ArrDel15 is the column containing the labels we are trying to predict. Its values indicate whether the flight was delayed or not. Since the data set is made available to the general public and not specifically created for our machine learning problem, it contains additional data that we don’t want going into our model.
Some of these columns are known as target leaks and need to be removed. When the model is used in real life, we will not have access to information such as the number of minutes the flight was delayed, since that is only known after the flight has landed and directly reveals the label we are trying to predict. So, in this task, we will use the Select Columns in Dataset module to exclude the target leaks from our input.
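Excluding leak columns can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Select Columns in Dataset module. The leak column names DepDelay and ArrDelay are hypothetical placeholders for whichever columns record the delay in minutes, and only ArrDel15 is taken from the description above.

```python
# Hypothetical names for columns that are only known after the flight.
LEAK_COLUMNS = {"DepDelay", "ArrDelay"}

def drop_columns(rows, excluded):
    """Return new rows without the excluded columns."""
    return [
        {col: val for col, val in row.items() if col not in excluded}
        for row in rows
    ]

flights = [
    {"Carrier": "DL", "DepDelay": 42, "ArrDelay": 37, "ArrDel15": 1},
]

features = drop_columns(flights, LEAK_COLUMNS)
print(features)
```

The label column ArrDel15 stays in the data because the model still needs it for training. Only the columns that give the answer away are dropped.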
Conversion to Categorical Features
In this task, we use the Edit Metadata module to convert three columns to categorical feature types.
Preparing Features to be Joined with Weather Data
We observe that CRSDepTime and CRSArrTime, the scheduled departure and arrival times, are in hours and minutes represented by a three- or four-digit number. Our strategy is to join this dataset with the Weather dataset by time. Since we don’t have weather information by the minute, we will join by the hour of the day.
To extract the hour from this value, we will use the Apply Math Operation module to divide the columns of interest by 100 in-place. Since we want just the hour, we will use another Apply Math Operation module to round the number down.
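The two math operations above can be checked with a short Python sketch. The function name scheduled_hour is an assumption for the example, and the input is assumed to be a time stored as an integer such as 1435 for 14:35.

```python
import math

def scheduled_hour(hhmm):
    """Divide an HHMM integer by 100, then round down to keep only the hour."""
    return math.floor(hhmm / 100)

print(scheduled_hour(1435))  # 14
print(scheduled_hour(905))   # 9
print(scheduled_hour(45))    # 0
```

A three-digit value such as 905 works the same way as a four-digit one, which is why no special handling is needed for early-morning times.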
Preprocessing the Weather Dataset
For the Weather Data, we will have to do a lot of the same things as we did earlier. We scrub any missing data, convert a few columns to categorical feature types, and extract the hour values adjusted for timezones.
Joining Both Datasets
Once we have completed pre-processing the Flight Delay and Weather data, we can use the Join Data module to join them together. For both datasets, we need to specify the same columns which we want to use to join. Namely, we want to specify the CRSDepTime at the departure airport.
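To make the join concrete, here is a minimal sketch of an inner join on an airport and hour key in plain Python. It is an illustration under assumptions, not the Join Data module: the join type is assumed to be an inner join, and the column names OriginAirportID, AirportID, Hour, and WindSpeed are hypothetical. CRSDepTime is assumed to already hold the hour extracted earlier.

```python
def inner_join(left, right, left_keys, right_keys):
    """Join two lists of dict rows where the key columns match."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in right_keys), []).append(row)

    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in left_keys), []):
            merged = dict(match)
            merged.update(row)  # left-hand values win on a name clash
            joined.append(merged)
    return joined

flights = [
    {"OriginAirportID": 1, "CRSDepTime": 14, "ArrDel15": 1},
    {"OriginAirportID": 2, "CRSDepTime": 9, "ArrDel15": 0},
]
weather = [
    {"AirportID": 1, "Hour": 14, "WindSpeed": 22},
    {"AirportID": 1, "Hour": 15, "WindSpeed": 5},
]

joined = inner_join(
    flights, weather,
    left_keys=("OriginAirportID", "CRSDepTime"),
    right_keys=("AirportID", "Hour"),
)
print(joined)
```

With an inner join, a flight that has no matching weather record (the second flight here) is dropped from the result.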
Training and Evaluating the Model
In this final task, our goal is to train and evaluate a binary logistic regression model. Using the Split Data module, we partition our data into 80/20 train/test splits. We train the Two-Class Logistic Regression module on the train split and evaluate its performance on the test split using the Score Model and Evaluate Model modules.
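The split, train, score, and evaluate steps can be sketched end to end in plain Python. This is a toy illustration and not Azure ML Studio's implementation: the model is fitted with simple stochastic gradient descent, the data is synthetic (one made-up feature in the range 0 to 1), and all function names and hyperparameters are assumptions chosen for the example.

```python
import math
import random

def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Score one row: probability that the label is 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train_logistic(rows, lr=0.1, epochs=200):
    """Fit weights and bias with per-row gradient steps on the log loss."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            error = predict_proba(features, weights, bias) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def accuracy(rows, weights, bias):
    """Fraction of rows whose thresholded score matches the label."""
    hits = sum(
        int((predict_proba(f, weights, bias) >= 0.5) == bool(y))
        for f, y in rows
    )
    return hits / len(rows)

# Synthetic data: one feature, label 1 when the feature is at least 0.5.
data = [([i / 100], 1 if i >= 50 else 0) for i in range(100)]

train, test = split_rows(data, train_fraction=0.8)
weights, bias = train_logistic(train)
test_accuracy = accuracy(test, weights, bias)
print(len(train), len(test), test_accuracy)
```

Evaluating only on the held-out 20% is what makes the accuracy figure an estimate of performance on unseen flights rather than a measure of how well the model memorized its training rows.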
About the Host (Snehan Kekre)
Snehan Kekre is a Machine Learning and Data Science Instructor at Coursera. He studied Computer Science and Artificial Intelligence at Minerva Schools at KGI, based in San Francisco. His interests include AI safety, EdTech, and instructional design. He recognizes that building a deep, technical understanding of machine learning and AI among students and engineers is necessary in order to grow the AI safety community. This passion drives him to design hands-on, project-based machine learning courses on Rhyme.