Introduction
One way to perform dimensionality reduction is through feature selection. In this post, we’ll explore how to create a low-variance feature mask (or filter). We’ll cover both a manual approach with dplyr functions and a more automated, production-ready approach with recipes functions. Enjoy!
Why is variance important in feature selection?
Variance is important in feature selection because of a concept called information gain. Information gain is what we know about one (usually unknown) variable because we can observe another known variable. In supervised learning, we use information from predictor variables to learn something about the target variable.
To make this concrete, imagine we want to estimate loan applicants’ creditworthiness. Though we don’t have their credit scores, we do have their monthly income, age, and number of outstanding loans. If every applicant in our data set has three outstanding loans – that is, the number of outstanding loans didn’t vary – then “outstanding loans” does not differentiate loan applicants and, therefore, does not provide any information about the applicants that we can use to estimate their creditworthiness. Therefore, we’d say that number of outstanding loans provides no information gain about creditworthiness.
We can remove features with little to no variance because they are not informative. Consider them useless fluff.
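To see this in code, here is a tiny sketch of the loan example above; a feature that never varies has a variance of exactly zero:

# Every applicant has exactly three outstanding loans, so the feature carries no information
outstanding_loans <- rep(3, 100)
var(outstanding_loans)
# 0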
A manual method
We’ll start by creating a low-variance filter manually with dplyr functions.
Setup
For this example, we’ll use a credit score classification data set from Kaggle, provided by Rohan Paris. The data set is large, so I randomly sampled 20% of it. I also did a little data cleaning. As you’ll see, to make this exercise interesting, we’ll add a few low-variance features to demonstrate feature removal.
You can download the cleaned data set here.
Since variance is conceptually simpler with continuous variables, let’s only load the continuous variables into credit_df.
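Here’s a minimal sketch of that loading step, assuming the cleaned file is saved locally as credit_cleaned.csv (the file name and path are placeholders for wherever you saved the download):

library(dplyr)
library(tidyr)
library(readr)
library(knitr)

# Read the cleaned 20% sample and keep only the numeric (continuous) columns
credit_df <- read_csv("credit_cleaned.csv") %>%
  select(where(is.numeric))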
Here’s a peek at the data.
age | annual_income | monthly_inhand_salary | num_bank_accounts | num_credit_card | interest_rate | num_of_loan | delay_from_due_date | num_of_delayed_payment | changed_credit_limit | num_credit_inquiries | outstanding_debt | credit_utilization_ratio | total_emi_per_month | amount_invested_monthly | monthly_balance | credit_history_months |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | 16176.83 | 1585.070 | 8 | 3 | 10 | 3 | 17 | 14 | 0.56 | 2445 | 1200.70 | 29.94733 | 21.96219 | 38.32601 | 358.2188 | 241 |
36 | 82383.04 | 6661.253 | 3 | 5 | 8 | 4 | 17 | 14 | 1.71 | 2 | 1218.57 | 33.38963 | 162.10896 | 123.97997 | 630.0364 | 392 |
44 | 28805.34 | 2309.445 | 8 | 3 | 5 | 2 | 13 | 19 | 7.02 | 0 | 796.45 | 26.83209 | 47.19512 | 139.90769 | 303.8417 | 385 |
28 | 45412.95 | 3520.412 | 8 | 6 | 30 | 6 | 28 | 21 | 23.07 | 6 | 4601.39 | 23.41958 | 113.69155 | 199.56341 | 318.7863 | 55 |
45 | 17296.38 | 1480.365 | 6 | 10 | 30 | 5 | 21 | 17 | 17.34 | 7 | 4624.73 | 38.39057 | 49.36071 | 78.16123 | 280.5146 | 92 |
Calculate feature variances
To begin, let’s calculate the variance of each column. With dplyr, we’ll use across() to apply var() to all columns with the everything() selector. Notice that we scale() the data before passing it to var(). It is important to normalize the data so the features are comparable with each other. Unnormalized, num_credit_card and monthly_inhand_salary have very different variances. To compare their influence on creditworthiness fairly, we normalize them.
We use tidyr’s pivot_longer() to pivot the scaled variances from one wide row into a long, two-column format, then sort them from largest to smallest. For reference, the unpivoted output of summarize() looks like this:
# A tibble: 1 × 17
age[,1] annual_income[,1] monthly_inhand_salary[,1] num_bank_accounts[,1]
<dbl> <dbl> <dbl> <dbl>
1 0.0787 0.974 0.353 0.978
# ℹ 13 more variables: num_credit_card <dbl[,1]>, interest_rate <dbl[,1]>,
# num_of_loan <dbl[,1]>, delay_from_due_date <dbl[,1]>,
# num_of_delayed_payment <dbl[,1]>, changed_credit_limit <dbl[,1]>,
# num_credit_inquiries <dbl[,1]>, outstanding_debt <dbl[,1]>,
# credit_utilization_ratio <dbl[,1]>, total_emi_per_month <dbl[,1]>,
# amount_invested_monthly <dbl[,1]>, monthly_balance <dbl[,1]>,
# credit_history_months <dbl[,1]>
Putting the steps together:

credit_variances <- credit_df %>%
  summarize(
    across(
      everything(),
      ~ var(scale(., center = FALSE), na.rm = TRUE))) %>%
  pivot_longer(
    everything(),
    names_to = "feature",
    values_to = "variance") %>%
  arrange(desc(variance))

kable(credit_variances)
feature | variance |
---|---|
num_of_loan | 0.98558800 |
interest_rate | 0.98258481 |
num_of_delayed_payment | 0.98150878 |
num_credit_inquiries | 0.97843457 |
num_bank_accounts | 0.97841843 |
total_emi_per_month | 0.97569764 |
annual_income | 0.97397017 |
num_credit_card | 0.97044729 |
amount_invested_monthly | 0.90345646 |
outstanding_debt | 0.41753915 |
monthly_inhand_salary | 0.35255863 |
delay_from_due_date | 0.33223435 |
changed_credit_limit | 0.30855481 |
monthly_balance | 0.22081631 |
credit_history_months | 0.15974632 |
age | 0.07874258 |
credit_utilization_ratio | 0.02527138 |
Set variance threshold and create a mask
We can scan down the variances and identify a natural cut-off between amount_invested_monthly and outstanding_debt; however, that would remove too many features. The next cut-off, between credit_history_months and age, seems more appropriate. So, we can create a variance filter with a threshold of 0.1.
We use pull() to get a vector of feature names that we can use as a mask.
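Here’s a minimal sketch of that filter, using the 0.1 threshold and the low_var_filter name that the next step expects:

# The variance column can come through as a one-column matrix (a side effect of
# scale()), so coerce it to a plain numeric vector before filtering.
low_var_filter <- credit_variances %>%
  mutate(variance = as.numeric(variance)) %>%
  filter(variance < 0.1) %>%
  pull(feature)

low_var_filter
# Should contain "age" and "credit_utilization_ratio"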
Apply the mask
We then apply the mask to the data frame. Notice that it removes age and credit_utilization_ratio.
filtered_credit_df <- credit_df %>%
select(-all_of(low_var_filter))
kable(filtered_credit_df %>% head(5))
annual_income | monthly_inhand_salary | num_bank_accounts | num_credit_card | interest_rate | num_of_loan | delay_from_due_date | num_of_delayed_payment | changed_credit_limit | num_credit_inquiries | outstanding_debt | total_emi_per_month | amount_invested_monthly | monthly_balance | credit_history_months |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16176.83 | 1585.070 | 8 | 3 | 10 | 3 | 17 | 14 | 0.56 | 2445 | 1200.70 | 21.96219 | 38.32601 | 358.2188 | 241 |
82383.04 | 6661.253 | 3 | 5 | 8 | 4 | 17 | 14 | 1.71 | 2 | 1218.57 | 162.10896 | 123.97997 | 630.0364 | 392 |
28805.34 | 2309.445 | 8 | 3 | 5 | 2 | 13 | 19 | 7.02 | 0 | 796.45 | 47.19512 | 139.90769 | 303.8417 | 385 |
45412.95 | 3520.412 | 8 | 6 | 30 | 6 | 28 | 21 | 23.07 | 6 | 4601.39 | 113.69155 | 199.56341 | 318.7863 | 55 |
17296.38 | 1480.365 | 6 | 10 | 30 | 5 | 21 | 17 | 17.34 | 7 | 4624.73 | 49.36071 | 78.16123 | 280.5146 | 92 |
A better way – tidymodels
Creating a low-variance mask manually is good for learning purposes, but in real-world scenarios it’s nice to have something more automated (so it works in a production pipeline) and more intelligent (so it can handle categorical variables). The recipes package in tidymodels provides step_zv() and step_nzv() to remove features with zero or near-zero variance, respectively.
Load data with categorical variables
To demonstrate the recipes approach, we’ll load all the features so that we have both continuous and categorical variables.
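A sketch of that reload, again assuming the cleaned file is named credit_cleaned.csv; converting character columns to factors is optional but keeps the categorical variables tidy for the recipe later:

# Reload every column, both continuous and categorical
credit_df <- read_csv("credit_cleaned.csv") %>%
  mutate(across(where(is.character), as.factor))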
To make this example interesting, we’ll turn a couple of existing features, num_credit_card and num_credit_inquiries, into low-variance features by overwriting roughly 95% of their values with a constant.
credit_df <- credit_df %>%
mutate(
num_credit_card_rand = runif(n()),
num_credit_inquiries_rand = runif(n()),
num_credit_card = if_else(
num_credit_card_rand < .95, 5, num_credit_card),
num_credit_inquiries = if_else(
num_credit_inquiries_rand < .95, 3, num_credit_inquiries)
) %>%
select(-num_credit_card_rand, -num_credit_inquiries_rand)
Define a recipe object
Then we define a recipe object. Notice that the first parameter is a formula: we define credit_score as the target variable and all other features as predictor variables.
We add the step_zv() step first, since zero-variance features will cause problems when we normalize the data with step_scale(). We apply the zero-variance step to all predictors and the scale step to only the numeric predictors. Then, we apply step_nzv() to remove low-variance features. Finally, prep() “fits” the recipe to the data.
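Here is a minimal sketch of that recipe, assuming recipes is attached and using credit_recipe as an assumed object name (reused in the steps below):

library(recipes)

credit_recipe <- recipe(credit_score ~ ., data = credit_df) %>%
  step_zv(all_predictors()) %>%              # remove zero-variance predictors first
  step_scale(all_numeric_predictors()) %>%   # normalize the numeric predictors
  step_nzv(all_predictors()) %>%             # then remove near-zero-variance predictors
  prep()                                     # "fit" the recipe to credit_df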
We can use tidy() to peek into the recipe and see the effect it will have on the data set it was trained on. Here we look at the third step of the recipe, step_nzv(). We can see that it will remove our two low-variance features, num_credit_card and num_credit_inquiries.
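With the assumed credit_recipe object from above, that peek looks something like this:

# Step 3 of the recipe is step_nzv(); its tidy output lists the columns it will drop
tidy(credit_recipe, number = 3)
# The terms column should include num_credit_card and num_credit_inquiries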
Apply the recipe to credit_df
In recipes, to apply a trained recipe to a data set, we can use bake(). The new_data parameter allows us to specify the data set to “bake” the recipe with. If we pass NULL, it will bake the same data that the recipe was trained on.
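A sketch of that call, again using the assumed credit_recipe name and the filtered_credit_df name referenced below:

# Bake the prepped recipe on its own training data and list the remaining columns
filtered_credit_df <- bake(credit_recipe, new_data = NULL)
names(filtered_credit_df)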
[1] "month" "age"
[3] "occupation" "annual_income"
[5] "monthly_inhand_salary" "num_bank_accounts"
[7] "interest_rate" "num_of_loan"
[9] "delay_from_due_date" "num_of_delayed_payment"
[11] "changed_credit_limit" "outstanding_debt"
[13] "credit_utilization_ratio" "payment_of_min_amount"
[15] "total_emi_per_month" "amount_invested_monthly"
[17] "payment_behaviour" "monthly_balance"
[19] "credit_history_months" "credit_score"
If we compare the names in the original credit_df to the names in the filtered_credit_df, we can see that num_credit_card and num_credit_inquiries were removed.
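One quick way to confirm this is setdiff() on the column names (assuming both data frames are still in your session):

# Columns present in the original data but absent after baking the recipe
setdiff(names(credit_df), names(filtered_credit_df))
# Should return "num_credit_card" and "num_credit_inquiries"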