| Type: | Package | 
| Title: | Hierarchical GxE Interactions in a Regularized Regression Model | 
| Version: | 1.0.2 | 
| Date: | 2021-11-28 | 
| Author: | Natalia Zemlianskaia | 
| Maintainer: | Natalia Zemlianskaia <natasha.zemlianskaia@gmail.com> | 
| Description: | The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <doi:10.48550/arXiv.2103.13510>. | 
| License: | MIT + file LICENSE | 
| Imports: | Rcpp (≥ 1.0.3), Matrix, bigmemory, methods | 
| Depends: | dplyr, R (≥ 3.5) | 
| Suggests: | glmnet, testthat, knitr, rmarkdown, ggplot2 | 
| LinkingTo: | Rcpp, RcppEigen, RcppThread, BH, bigmemory | 
| VignetteBuilder: | knitr | 
| NeedsCompilation: | yes | 
| Packaged: | 2021-11-28 17:20:02 UTC; nataliazemlianskaia | 
| Repository: | CRAN | 
| Date/Publication: | 2021-11-30 07:30:02 UTC | 
Hierarchical GxE Interactions in a Regularized Regression Model
Description
The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <arxiv:2103.13510>.
Author(s)
Natalia Zemlianskaia
Maintainer: Natalia Zemlianskaia <natasha.zemlianskaia@gmail.com>
References
"A Scalable Hierarchical Lasso for Gene-Environment Interactions", Natalia Zemlianskaia, W.James Gauderman, Juan Pablo Lewinger https://arxiv.org/abs/2103.13510
Data Generation
Description
Generates genotypes data matrix G (sample_size by p), vector of environmental measurments E, and an outcome vector Y of size sample_size. Simulates training, validation, and test datasets.
Usage
data.gen(sample_size = 100, p = 20, n_g_non_zero = 15, n_gxe_non_zero = 10, 
         family = "gaussian", mode = "strong_hierarchical", 
         normalize = FALSE, normalize_response = FALSE, 
         seed = 1, pG = 0.2, pE = 0.3,
         n_confounders = NULL)
Arguments
sample_size | 
 sample size of the data  | 
p | 
 total number of main effects  | 
n_g_non_zero | 
 number of non-zero main effects to generate  | 
n_gxe_non_zero | 
 number of non-zero interaction effects to generate  | 
family | 
 "gaussian" for continous outcome Y and "binomial" for binary 0/1 outcome  | 
mode | 
 either "strong_hierarchical", "hierarchical", or "anti_hierarchical". In the strong hierarchical mode the hierarchical structure is maintained (beta_g = 0 then beta_gxe = 0) and also |beta_g| >= |beta_gxe|. In the hierarchical mode the hierarchical structure is maintained, but |beta_G| < |beta_gxe|. In the anti_hierarchical mode the hierarchical structure is violated (beta_g = 0 then beta_gxe != 0).  | 
normalize | 
 
  | 
normalize_response | 
 
  | 
pG | 
 genotypes prevalence, value from 0 to 1  | 
pE | 
 environment prevalence, value from 0 to 1  | 
seed | 
 random seed  | 
n_confounders | 
 number of confounders to generate, either   | 
Value
A list of simulated datasets and generating coefficients
G_train, G_valid, G_test | 
 generated genotypes matrices  | 
E_train, E_valid, E_test | 
 generated vectors of environmental values  | 
Y_train, Y_valid, Y_test | 
 generated outcome vectors  | 
C_train, C_valid, C_test | 
 generated confounders matrices  | 
GxE_train, GxE_valid, GxE_test | 
 generated GxE matrix  | 
Beta_G | 
 main effect coefficients vector  | 
Beta_GxE | 
 interaction coefficients vector  | 
beta_0 | 
 intercept coefficient value  | 
beta_E | 
 environment coefficient value  | 
Beta_C | 
 confounders coefficient values  | 
index_beta_non_zero, index_beta_gxe_non_zero, index_beta_zero, index_beta_gxe_zero | 
 inner data generation variables  | 
n_g_non_zero | 
 number of non-zero main effects generated  | 
n_gxe_non_zero | 
 number of non-zero interactions generated  | 
n_total_non_zero | 
 total number of non-zero variables  | 
SNR_g | 
 signal-to-noise ratio for the main effects  | 
SNR_gxe | 
 signal-to-noise ratio for the interactions  | 
family, p, sample_size, mode, seed | 
 input simulation parameters  | 
Examples
data = data.gen(sample_size=100, p=100)
G = data$G_train; GxE = data$GxE_train
E = data$E_train; Y = data$Y_train
Get model coefficients
Description
A function to obtain coefficients from the model fit object corresponding to the desired pair of tuning parameters lambda = (lambda_1, lambda_2).
Usage
gesso.coef(fit, lambda)
Arguments
fit | 
 model fit object obtained either by using function   | 
lambda | 
 a pair of tuning parameters organized in a tibble (ex:   | 
Value
A list of model coefficients corresponding to lambda values of tuning parameters
beta_0 | 
 estimated intercept value  | 
beta_e | 
 estimated environmental coefficient value  | 
beta_g | 
 a vector of estimated main effect coefficients  | 
beta_c | 
 a vector of estimated confounders coefficients  | 
beta_gxe | 
 a vector of estimated interaction coefficients  | 
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20, 
        parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe                
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g     
Get model coefficients with specified number of non-zero interactions
Description
A function to obtain coefficients with target_b_gxe_non_zero specified to control the desired sparsity of interactions in the model.
Usage
gesso.coefnum(cv_model, target_b_gxe_non_zero, less_than = TRUE)
Arguments
cv_model | 
 cross-validated model fit object obtained by using function   | 
target_b_gxe_non_zero | 
 number of non-zero interactions we want to inlcude in the model  | 
less_than | 
 
  | 
Value
A list of model coefficients corresponding to the best model that contains at most or at least target_b_gxe_non_zero non-zero interaction terms. 
The target model is selected based on the averaged cross-validation (cv) results: for each pair of  parameters lambda=(lambda_1, lambda_2) in the grid and each cv fold we obtain a number of non-zero estimated interaction terms, then average cv results by lambda and choose the tuning parameters corresponding to the minimum average cv loss that have at most or at least target_b_gxe_non_zero non-zero interaction terms. Returned coefficients are obtained by fitting the model on the full data with the selected tuning parameters. 
Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on cv datasets.
beta_0 | 
 estimated intercept value  | 
beta_e | 
 estimated environmental coefficient value  | 
beta_g | 
 a vector of estimated main effect coefficients  | 
beta_gxe | 
 a vector of estimated interaction coefficients  | 
beta_c | 
 a vector of estimated confounders coefficients  | 
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
model_coefficients = gesso.coefnum(model, 5)
gxe_coefficients = model_coefficients$beta_gxe; sum(gxe_coefficients!=0)              
Cross-Validation
Description
Performs nfolds-fold cross-validation to tune hyperparmeters lambda_1 and lambda_2 for the gesso model.
Usage
gesso.cv(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL,
         grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", 
         type_measure = "loss", fold_ids = NULL, nfolds = 4, 
         parallel = TRUE, seed = 42, tolerance = 1e-3, max_iterations = 5000, 
         min_working_set_size = 100, verbose = TRUE)
Arguments
G | 
 matrix of main effects of size   | 
E | 
 vector of environmental measurments  | 
Y | 
 outcome vector. Set   | 
C | 
 matrix of confounders of size   | 
normalize | 
 
  | 
normalize_response | 
 
  | 
grid | 
 grid sequence for tuning hyperparameters, we use the same grid for   | 
grid_size | 
 specify   | 
grid_min_ratio | 
 parameter to determine   | 
alpha | 
 if   | 
family | 
 
  | 
type_measure | 
 loss to use for cross-validation. Specity   | 
fold_ids | 
 option to input custom folds assignments  | 
tolerance | 
 tolerance for the dual gap convergence criterion  | 
max_iterations | 
 maximum number of iterations  | 
min_working_set_size | 
 minimum size of the working set  | 
nfolds | 
 number of cross-validation splits  | 
parallel | 
 
  | 
seed | 
 set random seed to control random folds assignments  | 
verbose | 
 
  | 
Value
A list of objects
cv_result | 
 a tibble with cross-validation results: averaged across folds loss and the number of non-zero coefficients for each value of ( 
  | 
lambda_min | 
 a tibble of optimal (  | 
fit | 
 list, return of the function gesso.fit on the full data  | 
grid | 
 vector of values used for hyperparameters tuning  | 
full_cv_result | 
 inner variables  | 
Examples
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train, 
                      grid_size=20, parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_gxe        
g_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_g          
gesso fit
Description
Fits gesso model over the two dimentional grid of hyperparmeters lambda_1 and lambda_2, returns estimated coefficients for each pair of hyperparameters.
Usage
gesso.fit(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE,
          grid = NULL, grid_size = 20, grid_min_ratio = NULL, 
          alpha = NULL, family = "gaussian", weights = NULL,
          tolerance = 1e-3, max_iterations = 5000, 
          min_working_set_size = 100,
          verbose = FALSE)
Arguments
G | 
 matrix of main effects of size   | 
E | 
 vector of environmental measurments  | 
Y | 
 outcome vector. Set   | 
C | 
 matrix of confounders of size   | 
normalize | 
 
  | 
normalize_response | 
 
  | 
grid | 
 grid sequence for tuning hyperparameters, we use the same grid for   | 
grid_size | 
 specify   | 
grid_min_ratio | 
 parameter to determine   | 
alpha | 
 if   | 
family | 
 
  | 
tolerance | 
 tolerance for the dual gap convergence criterion  | 
max_iterations | 
 maximum number of iterations  | 
min_working_set_size | 
 minimum size of the working set  | 
weights | 
 inner fitting parameter  | 
verbose | 
 
  | 
Value
A list of estimated coefficients and other model fit metrics for each pair of hyperparameters (lambda_1, lambda_2)
beta_0 | 
 vector of estimated intercept values of size   | 
beta_e | 
 vector of estimated environment coefficients of size   | 
beta_g | 
 matrix of estimated main effects coefficients organized by rows, size (  | 
beta_gxe | 
 matrix of estimated interactions coefficients organized by rows, size (  | 
beta_c | 
 matrix of estimated confounders coefficients organized by rows, size (  | 
num_iterations | 
 number of iterations until convergence for each fit  | 
working_set_size | 
 maximum number of variables in the working set for each fit  | 
has_converged | 
 1 if the model converged within given   | 
objective_value | 
 objective function (loss) value for each fit  | 
beta_g_nonzero | 
 number of estimated non-zero main effects for each fit  | 
beta_gxe_nonzero | 
 number of estimated non-zero interactions for each fit  | 
lambda_1 | 
 
  | 
lambda_2 | 
 
  | 
grid | 
 vector of values used for hyperparameters tuning  | 
Examples
data = data.gen()
fit = gesso.fit(G=data$G_train, E=data$E_train, Y=data$Y_train, normalize=TRUE)
plot(fit$beta_g_nonzero, pch=19, cex=0.4, 
     ylab="num of non-zero features", xlab="lambdas path")
points(fit$beta_gxe_nonzero, pch=19, cex=0.4, col="red")
Predict new outcome vector
Description
Predict new outcome vector based on the new data and estimated model coefficients.
Usage
gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E, 
                   beta_c=NULL, new_C=NULL, family = "gaussian")
Arguments
beta_0 | 
 estimated intercept value  | 
beta_e | 
 estimated environmental coefficient value  | 
beta_g | 
 a vector of estimated main effect coefficients  | 
beta_gxe | 
 a vector of estimated interaction coefficients  | 
new_G | 
 matrix of main effects, variables organized by columns  | 
new_E | 
 vector of environmental measurments  | 
beta_c | 
 a vector of estimated confounders coefficients  | 
new_C | 
 matrix of confounders, variables organized by columns  | 
family | 
 set   | 
Value
Returns a vector of predicted values
Examples
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train)
coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e                   
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe     
new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
cor(new_Y, data$Y_test)^2
Selection metrics
Description
Calculates principal selection metrics for the binary zero/non-zero classification problem (sensitivity, specificity, precision, auc).
Usage
selection.metrics(true_b_g, true_b_gxe, estimated_b_g, estimated_b_gxe)
Arguments
true_b_g | 
 vector of true main effect coefficients  | 
true_b_gxe | 
 vector of true interaction coefficients  | 
estimated_b_g | 
 vector of estimated main effect coefficients  | 
estimated_b_gxe | 
 vector of estimated interaction coefficients  | 
Value
A list of principal selection metrics
b_g_non_zero | 
 number of non-zero main effects  | 
b_gxe_non_zero | 
 number of non-zero interactions  | 
mse_b_g | 
 mean squared error for estimation of main effects effect sizes  | 
mse_b_gxe | 
 mean squared error for estimation of interactions effect sizes  | 
sensitivity_g | 
 recall of the non-zero main effects  | 
specificity_g | 
 recall of the zero main effects  | 
precision_g | 
 precision with respect to non-zero main effects  | 
sensitivity_gxe | 
 recall of the non-zero interactions  | 
specificity_gxe | 
 recall of the zero interactions  | 
precision_gxe | 
 precision with respect to non-zero interactions  | 
auc_g | 
 area under the curve for zero/non-zero binary classification problem for main effects  | 
auc_gxe | 
 area under the curve for zero/non-zero binary classification problem for interactions  | 
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe                
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g  
selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients, gxe_coefficients)