caret & bigstatsR Workshop

Required Libraries

# "bigstatsR" and related packages.
install.packages("bigstatsr")
install.packages("bigreadr")
install.packages("data.table")

# "impute" library.
install.packages("BiocManager")
BiocManager::install("impute")

# "caret" library and sample models.
install.packages("caret")
install.packages("glmnet")

Getting Started

  • All files used in the workshop can be downloaded here:

  • Throughout the workshop, you will need to fill out the missing code in the scripts, marked as (...).

bigstatsR Package

Provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory thanks to memory-mapping to binary files on disk.

Reading Files with bigstatsR [1]

# Read only the first row of the file.
# This allows us to obtain the number of columns.
first_row <- fread2("Test_Genotype_Data.csv",
                    nrows = 1)
col_num <- ncol(first_row)

# Read the entire file using big_read.
gen_data <- big_read("Test_Genotype_Data.csv", # Our large data file.
                     select = 1:col_num,       # The number of columns we want to read.
                     backingfile = "Data",     # Path to the backing file.
                     progress = TRUE,          # Show progress.
                     type = "double")          # Data type of the contents.

Reading Files with bigstatsR [2]

  • Our file will be read as a Filebacked Big Matrix (FBM).
  • After executing the code, we should see the following:
> gen_data

A Filebacked Big Matrix of type 'double' with 1000 rows and 1000 columns.
  • Be careful to remove the backing files before trying to re-read the same file.

Working with FBMs [1]

  • We can interact with FBMs in a similar manner as we do with normal dataframes.
# Examine the data.
gen_data
gen_data[1:10, 1:10]
str(gen_data)

Working with FBMs [2]

  • Functions need to be applied differently.
  • Say we wanted to find out the number of missing values in our data.
  • The following approach does not work anymore:
# Will not work!
sum(is.na(gen_data))

Applying Functions [1]

  • In order for the function to be applied, we need to define it separately and use big_apply.
  • Our function must take the following as arguments:
    • An FBM object: X
    • A vector of indices: ind
# Function for checking the number of missing values in the dataframe.
check_na <- function(X, ind) {
  sum(is.na(X[, ind]))
}
  • big_apply will take care of splitting our data into subsets, applying our function to each subset, and combining the results.

Applying Functions [2]

  • Now, we can find the number of missing values in our dataset.
# Check how many missing values we have.
big_apply(gen_data,         # FBM to which we want to apply the function.
          a.FUN = check_na, # The function that we want to apply.
          a.combine = "+",  # How the results of each subset should be combined.
          ncores = 2)       # Number of CPU cores to use.

Applying Functions [3]

  • If we really want to, we can manually create subsets and apply functions using a for loop.
  • First, we must get the indices for each subset:
# Subset the dataframe into sets of 100 columns.
sub_idx <- split(1:col_num, ceiling(seq_along(1:col_num) / 100))

Applying Functions [4]

  • Now, we can extract our subsets and apply our function:
for (i in 1:length(sub_idx)) {

  # Display current subset being evaluated.
  print(i)

  # Get the Genotype Data for the current subset.
  gen_subset <- gen_data[, sub_idx[[i]]]

  # Impute missing data using KNN clustering.
  gen_imputed <- round(
    t(impute::impute.knn(t(gen_subset), k = 50, rowmax = 1, colmax = 1)$data), 0)

  # "Write back" imputed data to the FBM.
  gen_data[, sub_idx[[i]]] <- gen_imputed
}

Applying Functions [5]

  • Check that the impute function worked:
# Now, verify there are no missing values anymore.
big_apply(gen_data,
          a.FUN = check_na,
          a.combine = "+",
          ncores = 2)

Saving FBMs

  • How do we save our modified data back to a .csv file? Easy, use big_write:
# Write the newly-imputed data to a new file.
big_write(gen_data,                         # Our FBM object.
          "Test_Genotype_Data_Imputed.csv", # Name of the file.
          every_nrow = 100,                 # How many rows do we want to save at once.
          progress = interactive())         # Show a fancy progress bar.

Removing Backing Files

  • big_read will create backing files that are of no use after we saved our file.
  • They can be safely removed:
# Remove backing files.
system("rm *.bk")
system("rm *.rds")

caret Package

Short for Classification And REgression Training. It is a set of functions that attempt to streamline the process for creating predictive models. The package started off as a way to provide a uniform interface for modeling functions, as well as a way to standardize common tasks (such parameter tuning and variable importance).

Loading Data

  • First, let's load our newly-imputed Genotype Data:
## Data Loading [Predictors]

# Genotype Data
gen_path <- "Test_Genotype_Data_Imputed.csv"
gen_data <- fread(gen_path, sep = ",", header = FALSE)
  • Let's load our Phenotype Data as well:
## Data Loading [Traits]

# Phenotype Data
phn_path <- "Test_Phenotype_Data.csv"
phn_data <- fread(phn_path, sep = ",", header = FALSE)

Data Overview

  • Our Genotype Data is a large matrix of 00s and 11s:
# Preview of Genotype Data.
str(gen_data)
gen_data[1:10, 1:10]
  • Our Phenotype Data, which we want to predict, is a single column of Plant Height (cm) measurements:
# Preview of Phenotype Data.
str(phn_data)
phn_data[1:10, 1]

The createDataPartition Function [1]

  • We can easily split our data into training and testing sets as follows:
# We want 80% of our data for training and 20% for testing.

# Get the indices for the rows to be used in the training data.
train_index <- createDataPartition(
  phn_data$V1,  # Our Phenotype Data.
  p = 0.8,      # Percentage of data to be used for training.
  list = FALSE) # To return a matrix instead of a list.

# Check out which rows will be used for training.
train_index

The createDataPartition Function [2]

  • Now that we know which indices will be used for training, we can extract the corresponding data:
# The values used as predictors [x].
x_train <- gen_data[train_index,]
x_test <- gen_data[-train_index,]

# The values to be predicted [y].
y_train <- phn_data[train_index,]
y_test <- phn_data[-train_index,]

Model Training and Tuning [1]

glmnet


Fits generalized linear and similar models via penalized maximum likelihood. The regularization path is computed for the lasso or elastic net penalty at a grid of values (on the log scale) for the regularization parameter lambda.


  • We can perform this tuning automatically with caret. Let's first create a custom parameter grid:
# Define a custom tuning grid.
tune_grid <- expand.grid(alpha = seq(0.0001, 1, length = 5), # Values to try for "alpha".
                         lambda = 5)                       # Values to try for "lambda".

Model Training and Tuning [2]

  • The trainControl function allows us to further customize how our model is made:
# Parameter tuning.
param_tune <- trainControl(method = "repeatedcv", # Method to be used for resampling.
                           number = 2,            # Numebr of folds (subsets) for cross-validation.
                           repeats = 5,           # Number of iterations for cross-validation.
                           trim = TRUE,           # Reduces memory consumption.
                           search = "grid",       # How the tuning grid should be made.
                           verboseIter = TRUE)    # To display additional information.
  • Other methods are available, such as a single cross-validation (cv), a leave-one-out cross-validation (LOOCV), or none, which will only train a single model.
  • It is also possible to specify a random tuning grid.

Model Training and Tuning [3]

  • Now, we can finally train our model:
# Train a model.
glmnet_model <- train(x_train,                # Our predictors.
                      y_train$V1,             # What we want to predict.
                      method = "glmnet",      # The model we want to use.
                      metric = "MAE",         # What metric should we use to pick the best model.
                      tuneGrid = tune_grid,   # Our custom tuning grid.
                      trControl = param_tune) # Our custom tuning method.

Model Evaluation [1]

> glmnet

 800 samples
1000 predictors

No pre-processing
Resampling: Cross-Validated (2 fold, repeated 5 times)
Summary of sample sizes: 400, 400, 400, 400, 400, 400, ...
Resampling results across tuning parameters:

  alpha     RMSE      Rsquared     MAE
  0.000100  61.23760  0.002033470  52.85568
  0.250075  61.83544  0.002543141  52.85747
  0.500050  60.02843  0.002118089  52.03982
  0.750025  59.16487  0.001651774  51.61048
  1.000000  58.81907  0.002781994  51.40225

Tuning parameter 'lambda' was held constant at a value of 5
MAE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 5.

Model Evaluation [2]

  • Now, let's predict on our testing data:
# Predict on testing data.
glmnet_prediction <- predict(glmnet_model, x_test)

# Check prediction accuracy.
postResample(pred = glmnet_prediction, # Our predicted values.
             obs = y_test$V1)          # Our expected vaues.
  • This gives us a nice summary of the prediction:
        RMSE     Rsquared          MAE
5.649786e+01 6.306793e-06 4.883650e+01

More Models

  • caret supports a lot of models (last time I checked, 238238).
  • More examples are given in the file for you to try:
    • earth
    • keras
    • xgboost
  • You can also define your own models as well.

References

Thank You!