Simple Feature Selection Using a Low-to-No Variance Mask

Demonstrates how to remove features with low to no variance.
Author: Matt Pickard
Published: January 5, 2023
Modified: September 1, 2023

Introduction

One way to perform dimensionality reduction is through feature selection. In this post, we’ll explore how to create a low-variance feature mask (or filter). We’ll cover both a manual approach with dplyr functions and a more automated, production-ready approach with recipes functions. Enjoy!

Why is variance important in feature selection?

Variance is important in feature selection because of a concept called information gain. Information gain is what we learn about one (usually unknown) variable by observing another, known variable. In supervised learning, we use information from predictor variables to learn something about the target variable.

To make this concrete, imagine we want to estimate loan applicants’ creditworthiness. Though we don’t have their credit scores, we do have their monthly income, age, and number of outstanding loans. If every applicant in our data set has three outstanding loans – that is, the number of outstanding loans doesn’t vary – then “outstanding loans” does not differentiate loan applicants and, therefore, does not provide any information we can use to estimate their creditworthiness. In other words, the number of outstanding loans provides no information gain about creditworthiness.

We can remove features with little to no variance because they are not informative. Consider them useless fluff.
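To see this in miniature, compare a constant feature to a varying one. The numbers below are made up purely for illustration:

# Every applicant has exactly three outstanding loans -- a constant feature
outstanding_loans <- rep(3, 10)
var(outstanding_loans)  # 0: no variance, no information gain

# A varying feature can at least distinguish applicants
monthly_income <- c(2100, 3400, 1800, 5200, 2900)
var(monthly_income)     # > 0: this feature differentiates applicants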

A manual method

We’ll start by creating a low-variance filter manually with dplyr functions.

Setup

For this example, we’ll use a credit score classification data set from Kaggle, provided by Rohan Paris. The data set is large, so I randomly sampled 20% of it. I also did a little data cleaning. As you’ll see, to make this exercise interesting, we’ll add a few low-variance features to demonstrate feature removal.

You can download the cleaned data set here.

Since variance is conceptually simpler with continuous variables, let’s only load the continuous variables into credit_df.

library(tidyverse)
library(knitr)

credit_df <- 
  read_csv("data/credit_data.csv") %>% 
  select(where(is.numeric))  # keep only the numeric (continuous) columns

Here’s a peek at the data.

kable(credit_df %>% head(5))
age annual_income monthly_inhand_salary num_bank_accounts num_credit_card interest_rate num_of_loan delay_from_due_date num_of_delayed_payment changed_credit_limit num_credit_inquiries outstanding_debt credit_utilization_ratio total_emi_per_month amount_invested_monthly monthly_balance credit_history_months
41 16176.83 1585.070 8 3 10 3 17 14 0.56 2445 1200.70 29.94733 21.96219 38.32601 358.2188 241
36 82383.04 6661.253 3 5 8 4 17 14 1.71 2 1218.57 33.38963 162.10896 123.97997 630.0364 392
44 28805.34 2309.445 8 3 5 2 13 19 7.02 0 796.45 26.83209 47.19512 139.90769 303.8417 385
28 45412.95 3520.412 8 6 30 6 28 21 23.07 6 4601.39 23.41958 113.69155 199.56341 318.7863 55
45 17296.38 1480.365 6 10 30 5 21 17 17.34 7 4624.73 38.39057 49.36071 78.16123 280.5146 92

Calculate feature variances

To begin, let’s calculate the variance of each column. With dplyr, we’ll use across() with the everything() selector to apply var() to every column. Notice that we scale() the data before passing it to var(). It is important to normalize the data so the features are comparable with each other; because scale(center = FALSE) divides each column by its root mean square, the resulting variances are unit-free. Unnormalized, num_credit_card and monthly_inhand_salary have very different variances simply because they live on very different scales. To compare the features fairly, we normalize them.
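To see the problem concretely, here’s a quick look at the raw, unscaled variances of those two features (your exact values will depend on the sample, but the gap is enormous):

credit_df %>% 
  summarize(
    var_salary = var(monthly_inhand_salary, na.rm = TRUE),
    var_cards  = var(num_credit_card, na.rm = TRUE))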

We use tidyr’s pivot_longer() to pivot the scaled variances into a long format – one row per feature. We’ll sort them from largest to smallest.

credit_df %>% 
  summarize(
    across(everything(), ~ var(scale(., center = FALSE), na.rm = TRUE))) 
# A tibble: 1 × 17
  age[,1] annual_income[,1] monthly_inhand_salary[,1] num_bank_accounts[,1]
    <dbl>             <dbl>                     <dbl>                 <dbl>
1  0.0787             0.974                     0.353                 0.978
# ℹ 13 more variables: num_credit_card <dbl[,1]>, interest_rate <dbl[,1]>,
#   num_of_loan <dbl[,1]>, delay_from_due_date <dbl[,1]>,
#   num_of_delayed_payment <dbl[,1]>, changed_credit_limit <dbl[,1]>,
#   num_credit_inquiries <dbl[,1]>, outstanding_debt <dbl[,1]>,
#   credit_utilization_ratio <dbl[,1]>, total_emi_per_month <dbl[,1]>,
#   amount_invested_monthly <dbl[,1]>, monthly_balance <dbl[,1]>,
#   credit_history_months <dbl[,1]>
credit_variances <- credit_df %>% 
  summarize(
    across(
        everything(), 
        ~ var(scale(., center = FALSE), na.rm = TRUE))) %>% 
  pivot_longer(
    everything(), 
    names_to = "feature", 
    values_to = "variance") %>% 
  arrange(desc(variance)) 

kable(credit_variances)
feature variance
num_of_loan 0.98558800
interest_rate 0.98258481
num_of_delayed_payment 0.98150878
num_credit_inquiries 0.97843457
num_bank_accounts 0.97841843
total_emi_per_month 0.97569764
annual_income 0.97397017
num_credit_card 0.97044729
amount_invested_monthly 0.90345646
outstanding_debt 0.41753915
monthly_inhand_salary 0.35255863
delay_from_due_date 0.33223435
changed_credit_limit 0.30855481
monthly_balance 0.22081631
credit_history_months 0.15974632
age 0.07874258
credit_utilization_ratio 0.02527138

Set variance threshold and create a mask

We can scan down the variances and identify a natural cut-off between amount_invested_monthly and outstanding_debt; however, that would remove too many features. The next cut-off, between credit_history_months and age, seems more appropriate. So, we can create a variance filter with a threshold of 0.1.
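If the cut-offs are hard to spot in the table, a quick bar chart of the sorted variances makes the gaps easier to see. This is an optional sketch using ggplot2 and forcats (both loaded with the tidyverse); the dashed line marks the 0.1 threshold we just chose:

credit_variances %>% 
  mutate(feature = fct_reorder(feature, variance)) %>% 
  ggplot(aes(x = variance, y = feature)) +
  geom_col() +
  geom_vline(xintercept = 0.1, linetype = "dashed")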

We use pull() to extract a character vector of feature names that we can use as a mask.

low_var_filter <- credit_variances %>% 
  filter(variance < 0.1) %>% 
  pull(feature)

low_var_filter
[1] "age"                      "credit_utilization_ratio"

Apply the mask

We then apply the mask to the data frame. Notice that it removes age and credit_utilization_ratio.

filtered_credit_df <- credit_df %>% 
  select(-all_of(low_var_filter))

kable(filtered_credit_df %>% head(5))
annual_income monthly_inhand_salary num_bank_accounts num_credit_card interest_rate num_of_loan delay_from_due_date num_of_delayed_payment changed_credit_limit num_credit_inquiries outstanding_debt total_emi_per_month amount_invested_monthly monthly_balance credit_history_months
16176.83 1585.070 8 3 10 3 17 14 0.56 2445 1200.70 21.96219 38.32601 358.2188 241
82383.04 6661.253 3 5 8 4 17 14 1.71 2 1218.57 162.10896 123.97997 630.0364 392
28805.34 2309.445 8 3 5 2 13 19 7.02 0 796.45 47.19512 139.90769 303.8417 385
45412.95 3520.412 8 6 30 6 28 21 23.07 6 4601.39 113.69155 199.56341 318.7863 55
17296.38 1480.365 6 10 30 5 21 17 17.34 7 4624.73 49.36071 78.16123 280.5146 92

A better way – tidymodels

Creating a low-variance mask manually is good for learning purposes, but in real-world scenarios, it’s nice to have something more automated – that will work in a production pipeline – and more intelligent – that can handle categorical variables. The recipes package in tidymodels provides step_zv() and step_nzv() to remove features with zero or near-zero variance, respectively.

Load data with categorical variables

To demonstrate the recipes approach, we’ll load all the features, so we have both continuous and categorical variables.

credit_df <- 
  read_csv("data/credit_data.csv")

To make this example interesting, we’ll manufacture a couple of low-variance features by overwriting roughly 95% of the values in num_credit_card and num_credit_inquiries with a constant.

credit_df <- credit_df %>% 
  mutate(
    # draw a uniform random number per row for each feature
    num_credit_card_rand = runif(n()),
    num_credit_inquiries_rand = runif(n()),
    # overwrite ~95% of each feature's values with a constant, leaving
    # ~5% untouched so the variance is near zero (not exactly zero)
    num_credit_card = if_else(
      num_credit_card_rand < .95, 5, num_credit_card),
    num_credit_inquiries = if_else(
      num_credit_inquiries_rand < .95, 3, num_credit_inquiries)
  ) %>% 
  select(-num_credit_card_rand, -num_credit_inquiries_rand)

Define a recipe object

Then we define a recipe object. Notice the first parameter is a formula. We define credit_score as the target variable and all other features as predictor variables.

We add the step_zv() step first because no-variance features will cause problems when we normalize the data with step_scale(). We apply the zero-variance step to all predictors and the scale step only to the numeric predictors. Then, we apply step_nzv() to remove low-variance features. Finally, prep() “fits” the recipe to the data.
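To see why step_zv() needs to come before the scaling step, remember that scaling divides each column by its standard deviation – and a zero-variance column has a standard deviation of zero. Here’s a one-line illustration with base R’s scale():

scale(c(3, 3, 3))  # sd is 0, so every scaled value becomes NaN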

library(recipes) 

low_variance_recipe <- recipe(credit_score ~ ., data = credit_df) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  prep()
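By default, step_nzv() flags a predictor as near-zero variance when the frequency ratio of its most common value to its second most common value exceeds freq_cut (default 95/5) and its percentage of unique values falls below unique_cut (default 10) – the same heuristic as caret::nearZeroVar(). Both thresholds can be adjusted; a hypothetical, stricter variant might look like this:

# Hypothetical tweak: only drop predictors with a 99-to-1 value imbalance
# and fewer than 5% unique values
recipe(credit_score ~ ., data = credit_df) %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors(), freq_cut = 99/1, unique_cut = 5) %>%
  prep()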

We can use tidy() to peek into the recipe and see the effect it will have on the data set it was trained on. Here we look at the third step of the recipe – step_nzv(). We can see that it will remove our two low-variance features – num_credit_card and num_credit_inquiries.

tidy(low_variance_recipe, number = 3)
# A tibble: 2 × 2
  terms                id       
  <chr>                <chr>    
1 num_credit_card      nzv_nAlyV
2 num_credit_inquiries nzv_nAlyV

Apply the recipe to credit_df

In recipes, we apply a trained recipe to a data set with bake(). The new_data parameter allows us to specify the data set to “bake” the recipe with. If we pass NULL, it will bake the same data that the recipe was trained on.

filtered_credit_df <- low_variance_recipe %>% bake(new_data = NULL)

names(filtered_credit_df)
 [1] "month"                    "age"                     
 [3] "occupation"               "annual_income"           
 [5] "monthly_inhand_salary"    "num_bank_accounts"       
 [7] "interest_rate"            "num_of_loan"             
 [9] "delay_from_due_date"      "num_of_delayed_payment"  
[11] "changed_credit_limit"     "outstanding_debt"        
[13] "credit_utilization_ratio" "payment_of_min_amount"   
[15] "total_emi_per_month"      "amount_invested_monthly" 
[17] "payment_behaviour"        "monthly_balance"         
[19] "credit_history_months"    "credit_score"            

If we compare the names in the original credit_df to the names in the filtered_credit_df, we can see that num_credit_card and num_credit_inquiries were removed.

setdiff(names(credit_df), names(filtered_credit_df))
[1] "num_credit_card"      "num_credit_inquiries"