Image Credits: MidJourney Generative Diffusion model

Motivating Problem

Biological clocks are a popular concept these days. For the uninitiated, a biological clock captures the aggregate effect of the cellular and biochemical processes in our body, shaped by genetic and environmental factors, that translate to physiological impairment. Most testing kits use DNA methylation because of a seminal paper by Horvath back in 2013. Whilst DNA methylation correlates very well with chronological age, it has proven less informative about specific risk factors and adverse disease outcomes. More recently, multi-omics assays such as metabolomics, proteomics and lipidomics have become available. The running questions are: a) what are the best modelling approaches for this new data, and b) can these assays tell us about risk factors associated with aging?

Datasets

Here, I used a cross-sectional multi-omics dataset that enrolled patients who had undergone a CT angiogram. These are typically patients with suspected, or at greater risk of, coronary artery disease. For privacy reasons this dataset has not been made freely available yet. Omics data is notorious for unwanted within-sample and between-sample variation. To correct for that, hierarchical normalisation (hRUV) was used.

Distribution of available data on metabolites, lipids, proteins and cell proportions amongst participants.

Calculating Biological Age

There are broadly two categories of biological age.

  1. First generation clocks which effectively regress subjects' chronological age against a series of biomarkers.
  2. Second generation clocks which regress time-to-event due to all-cause mortality against a series of biomarkers.

In both cases, the predicted value of the regression model becomes the biological age. The residual is the biological age acceleration.
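To make the definitions concrete, here is a minimal sketch of a first generation clock in base R. The data are simulated and the biomarker names are made up. I've taken acceleration as predicted minus chronological age, which is an assumed sign convention (positive meaning "older than expected").

```r
set.seed(42)

# Simulated cohort: 200 subjects, 3 hypothetical biomarkers
n   <- 200
age <- runif(n, 40, 80)
biomarkers <- data.frame(
  marker_a = 0.05 * age + rnorm(n),   # drifts upwards with age
  marker_b = -0.03 * age + rnorm(n),  # drifts downwards with age
  marker_c = rnorm(n)                 # pure noise, unrelated to age
)

# First generation clock: regress chronological age on the biomarkers
clock <- lm(age ~ ., data = data.frame(age = age, biomarkers))

# Predicted value = biological age; gap to chronological age = acceleration
bio_age      <- unname(fitted(clock))
acceleration <- bio_age - age

head(round(data.frame(age, bio_age, acceleration), 1))
```

In a real analysis the clock would be fitted on a training split and the biological age predicted on held-out subjects, rather than read off the fitted values as done here for brevity.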

The 'calculation' is deceptively straightforward but there are some important points to note.

  • Second generation clocks probably capture mortality risk better and might be practical as a surrogate/proxy measurement of longevity. However, they also capture more of the late-stage presentation of disease.
  • First generation clocks don't correlate well with mortality risk but may give us more of an insight into the underlying drivers of the aging process.

I didn't have longitudinal outcome data here, so I stuck to first generation clocks.

Modelling

For clarity's sake: I did a 70:30 train-test split and ran elastic net, PCA regression, random forest, deep neural network (DNN) and autoencoder regression on an aggregate of metabolomics, lipidomics and proteomics biomarkers. My instinct (and maybe yours too) was that DNNs should do very well due to the interwoven relationship between metabolites and upstream proteins.

Some interesting insights that might be useful for anyone calculating biological age in the future:

  • Elastic net regression performed best by far.
  • PCA regression (i.e. regressing on the principal components before the elbow on the scree plot) performed the worst. My guess is that latent variables which give more weight to the predictors with the greatest variance do not necessarily pick out the predictors with the most explanatory relevance. This is particularly important in multi-omics data, which typically has a lot of uninformative between-sample and between-platform variance.
  • DNN was lacklustre too. I suspect it's because it was over-fitting quite heavily on the noise.
  • Autoencoder regression (i.e. regressing on latent variables in the bottleneck layer) actually performed reasonably well. This points towards some significant non-linearity in the underlying relationships between biomarkers. It simultaneously performs dimensionality reduction, which I think probably put it just ahead of the DNN.
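
The PCA regression point can be illustrated with a toy simulation in base R. Everything here is made up for illustration (the "batch" factor, the variable names and the effect sizes), not taken from the study data: ten high-variance predictors share an uninformative batch effect, while one low-variance predictor actually tracks age.

```r
set.seed(1)
n <- 300

# Ten predictors dominated by a shared, uninformative batch effect
batch <- rnorm(n, sd = 5)
noise_block <- sapply(1:10, function(j) batch + rnorm(n))

# One low-variance predictor that genuinely tracks age
signal <- rnorm(n)
age <- 60 + 8 * signal + rnorm(n, sd = 2)

X <- cbind(noise_block, signal)
colnames(X) <- c(paste0("noise", 1:10), "signal")

# 70:30 train-test split
train_idx <- sample(n, size = round(0.7 * n))
X_train <- X[train_idx, ];  age_train <- age[train_idx]
X_test  <- X[-train_idx, ]; age_test  <- age[-train_idx]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# PCA regression: keep only the first principal component.
# Left unscaled so that raw variance decides which direction comes first.
pca <- prcomp(X_train, center = TRUE, scale. = FALSE)
fit_pcr <- lm(age ~ PC1, data = data.frame(age = age_train, PC1 = pca$x[, 1]))
pc1_test <- predict(pca, newdata = X_test)[, 1]
rmse_pcr <- rmse(age_test, predict(fit_pcr, newdata = data.frame(PC1 = pc1_test)))

# Reference: ordinary least squares on all eleven predictors
fit_ols <- lm(age ~ ., data = data.frame(age = age_train, X_train))
rmse_ols <- rmse(age_test, predict(fit_ols, newdata = data.frame(X_test)))

c(pcr = rmse_pcr, ols = rmse_ols)
```

In this toy setup the first component is essentially the batch effect, so the PCA regression error comes out much larger than that of a plain linear model that can see the low-variance signal. Real multi-omics data will be messier, but the mechanism is the one guessed at above.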

I think caret and glm in R are straightforward, so you can find that code here on GitHub. I've put my autoencoder snippet here as it uses a lesser-known external library (h2o).

```r
library(h2o)
h2o.init()

# Convert the training predictors to an H2OFrame
features <- as.h2o(X_train)

# Train an autoencoder on the features (one hidden layer of 300 units)
ae1 <- h2o.deeplearning(
  x = seq_along(features),
  training_frame = features,
  autoencoder = TRUE,
  hidden = 300,
  activation = "Tanh"
)

# Pull the deep features (codings) from the first hidden layer
ae1_codings <- h2o.deepfeatures(ae1, features, layer = 1)
X_codings <- as.data.frame(ae1_codings)

# Linear regression of the training response on the deep features
slm <- lm(y_train ~ ., data = cbind(y_train, X_codings))
summary(slm)
```

Risk Factor Analysis

Among the adjustment factors, non-smoking status is associated with reduced biological age (surprise, surprise). Chronological age is a significant adjustment factor, with older individuals having increased biological age. Interestingly, Asian ethnicity was the only ethnicity significantly associated with decreased biological age. Could it be true that Asians don't raisin?

High total cholesterol (TC) was associated with decreased biological age, although the interpretation is not all that useful given TC is an amalgamation of many categories of cholesterol of varying density and function.

What’s ostensibly surprising is that increased HDL is associated with increased biological age. HDL has traditionally been perceived as ‘good cholesterol’; however, more recent work has highlighted how misleading it is to fixate on HDL serum concentration. Importantly, not all HDL particles are ‘equal’: there are functional differences between particles in the cholesterol molecules they hold, and these can become pro-atherogenic during the onset of plaque development. Given plaque development is heavily tied to chronological age, HDL function or dysfunction evolves with age, a factor that confounds the interpretation here.

NT-proBNP is associated with increased biological age, which is expected given it is the gold standard for long-term independent prediction of mortality due to heart failure.

Despite the well-established links of triglycerides, CRP and Lp(a) to cardiovascular risk, none of them showed a significant association here. This negative result likely reflects the nature of this biological clock, which is trained on chronological age rather than time-to-event for cardiovascular mortality; the component of aging measured here likely doesn’t overlap perfectly with mortality outcome.

Among the late observable signs of disease, diabetes mellitus, stroke, DVT and kidney disease were all associated with increased biological age.

This graph shows the effect size of the beta coefficients regressing biological age against a gauntlet of risk factors.
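
For anyone wanting to build this kind of plot, a minimal sketch of the per-risk-factor regressions follows. The data frame, column names and adjustment set (chronological age and smoking status) are illustrative assumptions on simulated data, not the study's actual variables or model specification.

```r
set.seed(7)
n <- 250

# Simulated cohort with hypothetical risk factors and adjustment variables
cohort <- data.frame(
  chrono_age    = runif(n, 40, 80),
  smoker        = rbinom(n, 1, 0.3),
  hdl           = rnorm(n, 1.3, 0.3),
  log_nt_probnp = rnorm(n, 4, 1)
)
cohort$bio_age <- with(cohort,
  0.8 * chrono_age + 2 * smoker + 1.5 * log_nt_probnp + rnorm(n, sd = 4))

risk_factors <- c("hdl", "log_nt_probnp")

# One adjusted linear model per risk factor. Each risk factor is standardised
# so the betas are comparable (years of biological age per 1 SD).
effects <- do.call(rbind, lapply(risk_factors, function(rf) {
  dat <- cohort
  dat$rf_z <- as.numeric(scale(dat[[rf]]))
  fit <- lm(bio_age ~ rf_z + chrono_age + smoker, data = dat)
  ci  <- confint(fit)["rf_z", ]
  data.frame(risk_factor = rf,
             beta  = unname(coef(fit)["rf_z"]),
             lower = unname(ci[1]),
             upper = unname(ci[2]))
}))

effects
```

Each row of `effects` is one point with its 95% confidence interval, which is what a forest-style effect size plot needs.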

What's worth trying?

What could be worth trying is sparse partial least squares (SPLS), as it works well on unbalanced sample sizes and on datasets with high collinearity such as multi-omics. SPLS constructs linear combinations of the predictors, i.e. latent variables, in a supervised manner with respect to the response variable, whilst simultaneously applying regularisation penalties to reduce the erroneous inclusion of noisy but uninformative predictors. Importantly, this approach applies dimensionality reduction but also considers the informativeness of the predictors it keeps, which may overcome the limitations of PCA regression and autoencoders. This could be implemented with mixOmics in R. If anyone ever wants to combine multiple omics platforms, this might be worth a try as a step above simple elastic net regression on a concatenation of the biomarkers.
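
As a starting point, here is a rough, untuned sketch of SPLS with the `spls()` function from mixOmics on simulated data. The biomarker names, the number of components and the `keepX` values are arbitrary assumptions for illustration. In practice they would need tuning (for example by cross-validation), and I haven't run this on the dataset above.

```r
library(mixOmics)  # Bioconductor package, assumed to be installed

set.seed(3)
n <- 120; p <- 50

# Simulated biomarkers; only the first five carry any age signal
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("biomarker", 1:p)))
age <- 60 + as.numeric(X[, 1:5] %*% rep(3, 5)) + rnorm(n, sd = 2)

# 70:30 train-test split
train_idx <- sample(n, size = round(0.7 * n))
Y_train <- matrix(age[train_idx], ncol = 1, dimnames = list(NULL, "age"))

# Sparse PLS: 2 latent components, each built from 5 biomarkers
fit <- spls(X[train_idx, ], Y_train, ncomp = 2, keepX = c(5, 5),
            mode = "regression")

# Biomarkers retained on the first component
selected <- selectVar(fit, comp = 1)$X$name

# Test-set predictions using both components
pred <- predict(fit, newdata = X[-train_idx, ])$predict[, 1, 2]
sqrt(mean((age[-train_idx] - pred)^2))  # test RMSE
```

This treats the concatenated biomarkers as a single block. mixOmics also has multi-block functions such as `block.spls()` that keep each omics platform as its own block, which is probably the more natural fit for combining platforms.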

What would be very cool is if we could get our hands on untargeted, longitudinal, multi-omics data. We could then do time-to-event survival analysis with a large, unbiased feature set of potential predictors.