We will cover the following tasks in 47 minutes:
Importing Libraries and Data
In this task, we will briefly recap our exploratory data analysis of the Breast Cancer Wisconsin (Diagnostic) Data Set from the previous project (a short code sketch follows the list below). To summarize, we:
- Imported the data set
- Separated the target column from the features
- Visualized the target distribution
- Standardized the data
- Explored the relationship between the features using violin plots, joint plots, pair grids, and swarm plots
- Identified the columns to be dropped by calculating the pairwise correlation coefficients
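A minimal sketch of those recap steps might look like the following (the file name data.csv and the diagnosis/id column names are assumptions carried over from the previous project):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Breast Cancer Wisconsin (Diagnostic) data set
# (file name and column names are assumptions from the previous project)
data = pd.read_csv("data.csv")

# Separate the target column from the features
y = data["diagnosis"].map({"M": 1, "B": 0})  # malignant = 1, benign = 0
X = data.drop(columns=["diagnosis", "id", "Unnamed: 32"], errors="ignore")

# Standardize the features (zero mean, unit variance)
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
```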
Dropping Correlated Columns from Feature List
Using the heatmap of the correlation matrix, we were able to identify columns to be dropped. From each set of correlated features, we will preserve the one that best separates the data. We will identify these salient features using the violin plots and swarm plots produced in the previous project.
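As a rough illustration of this step (the column names in the drop list below are placeholders only; the actual list comes from the violin and swarm plots in the previous project):

```python
# Illustrative drop list: from each group of correlated features, keep the one
# that separated the classes best in the violin/swarm plots.
# These three names are examples, not the project's actual selection.
cols_to_drop = ["perimeter_mean", "radius_worst", "area_worst"]
X_reduced = X_std.drop(columns=cols_to_drop)
```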
Classification using XGBoost (minimal feature selection)
We drop 15 correlated columns out of a total of 33. To verify that no correlated features remain, we plot another correlation matrix and inspect the Pearson correlation coefficients.
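A short sketch of the verification step, assuming the X_reduced frame from the snippet above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Re-plot the correlation matrix on the reduced feature set to confirm that
# no strongly correlated pairs remain (pandas .corr() uses Pearson by default)
plt.figure(figsize=(12, 10))
sns.heatmap(X_reduced.corr(), annot=True, fmt=".1f", cmap="coolwarm")
plt.title("Correlation matrix after dropping correlated columns")
plt.show()
```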
Next, we use a helper function from scikit-learn to split our data into training and test sets. Using the default parameters, we will fit the XGBClassifier estimator to the training set and use the model to predict values in the test set. We can evaluate the performance of our classifier using the accuracy score, F1 score, and confusion matrix from scikit-learn.
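A sketch of the classification and evaluation steps might look like this (the 70/30 split and random_state are assumptions, not values prescribed by the project):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from xgboost import XGBClassifier

# Split the reduced feature set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.3, random_state=42, stratify=y
)

# Fit an XGBoost classifier with default parameters and predict on the test set
clf = XGBClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluate the classifier
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```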
Univariate Feature Selection
In univariate feature selection, we will use the SelectKBest() function. The scores it returns can be used to select the n_features features with the highest values of the chi-squared test statistic from the data.
Recall that the chi-square test measures the dependence between stochastic variables. Using this function weeds out the features that are the most likely to be independent of class and therefore irrelevant for classification.
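A minimal sketch of this step, assuming the training split from above (k=5 is an illustrative choice; note that the chi-squared test requires non-negative inputs, so the features are min-max scaled first):

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# chi2 requires non-negative values, so scale the training features to [0, 1]
X_train_minmax = MinMaxScaler().fit_transform(X_train)

# Keep the k features with the highest chi-squared scores (k=5 is illustrative)
selector = SelectKBest(score_func=chi2, k=5)
X_train_kbest = selector.fit_transform(X_train_minmax, y_train)

# Names of the selected columns
print(X_train.columns[selector.get_support()])
```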
Recursive Feature Elimination with Cross-Validation
In this task, we will not only find the best features but also the optimal number of features needed for the best classification accuracy.
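One possible sketch, reusing the XGBClassifier from the classification step and assuming 5-fold cross-validation with accuracy as the scoring metric:

```python
from sklearn.feature_selection import RFECV
from xgboost import XGBClassifier

# Recursively eliminate features one at a time, scoring each subset with
# 5-fold cross-validated accuracy
rfecv = RFECV(
    estimator=XGBClassifier(),
    step=1,
    cv=5,
    scoring="accuracy",
)
rfecv.fit(X_train, y_train)

print("Optimal number of features:", rfecv.n_features_)
print("Selected features:", list(X_train.columns[rfecv.support_]))
```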
Plot CV Scores vs Number of Features Selected
We will evaluate the optimal number of features needed for the highest classification accuracy by plotting the cross-validation (CV) scores of the selected features on the y-axis against the number of selected features on the x-axis.
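A plotting sketch, assuming the fitted rfecv object from the previous step and scikit-learn ≥ 1.0 (older versions expose the same scores through the deprecated grid_scores_ attribute):

```python
import matplotlib.pyplot as plt

# Mean CV score for each number of features tried by RFECV
mean_scores = rfecv.cv_results_["mean_test_score"]
n_features_range = range(1, len(mean_scores) + 1)

plt.plot(n_features_range, mean_scores, marker="o")
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validation accuracy")
plt.title("CV score vs. number of features selected")
plt.show()
```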
Feature Extraction using Principal Component Analysis
We will use principal component analysis (PCA) for feature extraction. We will first need to normalize the data for better performance.
A plot of the cumulative explained variance against the number of components will give us the percentage of variance explained by each of the selected components. This curve quantifies how much of the total variance is contained within the first N components.
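A minimal PCA sketch, assuming the full feature matrix X and target-free preprocessing from the first snippet:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Normalize the features, then fit PCA on all components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative explained variance vs. number of components
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```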
About the Host (Snehan Kekre)
Snehan hosts Machine Learning and Data Sciences projects at Rhyme. He is in his senior year of university at the Minerva Schools at KGI, studying Computer Science and Artificial Intelligence. When not applying computational and quantitative methods to identify the structures shaping the world around him, he can sometimes be seen trekking in the mountains of Nepal.