# Reconstructing a latent representation of gene expression from genomic alterations to improve clinical utility of real-world clinicogenomics data

**Sunil Kumar, Felicia Kuperwaser, Dillon Tracy, Jeff Sherman, Emily Vucic and Maayan Baron**  
Zephyr AI: 1800 Tysons Blvd Suite 901, McLean VA 22102  
[www.zephyrai.bio](http://www.zephyrai.bio/)

## Abstract # 3519

### BACKGROUND

- Patient datasets that combine clinical and molecular information are ideal for studying tumor biology and developing robust machine learning (ML) models for predicting outcome and treatment response. However, such data rarely exist in real-world settings, or in sufficient quantities within research contexts.

- Large publicly available datasets like The Cancer Genome Atlas (TCGA), which provide multi-omic profiles for diverse cancer types, have greatly facilitated development of novel therapies and personalized medicines. However, the absence of patient outcome data tied to treatment limits the applicability of these data for understanding and modeling treatment response.

- Real-world clinicogenomics cohorts, such as AACR Project GENIE, are by contrast typically rich in clinical annotations, including treatment regimens and outcome measures. Their tumor molecular profiles are sparse, however, rarely covering more than a few hundred genes.

### METHODS

We developed an ML model (Mut2Ex) that reconstructs tumor gene expression profiles from the genetic information available on commercial next-generation sequencing (NGS) panels. Mut2Ex uses a regression-adapted Principal Label Space Transformation (PLST), together with language-model embeddings of minimal clinical information (OncoTree code, sex and stage). It was trained on ~1200 DepMap cell lines across 26 cancer types to reconstruct whole-transcriptome mRNA expression profiles. Reconstructed profiles were then generated for ~10,000 tumors from TCGA and ~180,000 tumors from AACR Project GENIE and applied to a variety of clinical tasks.
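
The general idea behind a regression-adapted PLST can be sketched as follows. This is a minimal illustration under stated assumptions, not the Mut2Ex implementation: the feature matrix `X` stands in for alteration features concatenated with clinical embeddings, the ridge regressor and its penalty are assumptions, the helper names `fit_plst_regressor` and `predict_plst` are hypothetical, and the data are synthetic.

```python
import numpy as np

def fit_plst_regressor(X, Y, n_components, ridge=1.0):
    """Fit a ridge map from features X to the top principal directions of Y."""
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    # SVD of the centered target matrix; rows of Vt are directions in gene space.
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    V = Vt[:n_components].T                    # (n_genes, k) projection matrix
    Z = Yc @ V                                 # (n_samples, k) latent targets
    # Closed-form ridge regression from features to the latent targets (assumed regressor).
    A = Xc.T @ Xc + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(A, Xc.T @ Z)           # (n_features, k)
    return {"x_mean": x_mean, "y_mean": y_mean, "V": V, "W": W}

def predict_plst(model, X):
    """Predict latent coordinates, then decode back to the full gene space."""
    Z_hat = (X - model["x_mean"]) @ model["W"]
    return Z_hat @ model["V"].T + model["y_mean"]

# Synthetic stand-in data (not DepMap): binary features and low-rank "expression".
rng = np.random.default_rng(0)
n_samples, n_features, n_genes, k = 200, 30, 50, 5
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)
Y = X @ rng.normal(size=(n_features, k)) @ rng.normal(size=(k, n_genes))
Y = Y + 0.1 * rng.normal(size=(n_samples, n_genes))

model = fit_plst_regressor(X[:150], Y[:150], n_components=k, ridge=1.0)
Y_hat = predict_plst(model, X[150:])           # reconstructed profiles for held-out rows
```

Regressing onto a small number of latent coordinates rather than onto every gene keeps the number of fitted targets small, which is the usual motivation for label space transformations when the output dimension is very large.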

### RESULTS

- mRNA expression reconstructed by Mut2Ex was highly correlated with true expression in cell lines (r = 0.9342; 95% CI 0.9328-0.9357; N = 164). When compared with true expression, reconstructed profiles recapitulate sub-clusters within cancer types, PAM50 subtyping in breast tumors, survival signatures in colorectal tumors and multiple oncogenic signatures in a pan-cancer manner.

- Analysis of reconstructed expression for AACR Project GENIE tumors revealed expected enrichment of known driver genes within expression subtypes and enrichment of oncogenic signatures associated with distinct clinical outcomes in a cancer-type-specific manner.
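
A sample-level correlation summary of the kind reported above could be computed along these lines. The poster does not state how its confidence interval was derived, so the normal-approximation interval on the mean per-sample Pearson correlation used here is an assumption, as are the helper names `pearson` and `mean_with_ci` and the toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_with_ci(values, z=1.96):
    """Mean of values with a normal-approximation 95% interval on the mean."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return m, m - half_width, m + half_width

# Toy data: 3 samples x 4 genes (illustrative values only).
true_expr = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0], [0.5, 1.5, 2.5, 3.0]]
recon_expr = [[1.1, 1.9, 3.2, 3.8], [2.2, 1.3, 3.7, 3.1], [0.4, 1.7, 2.4, 3.2]]

# One correlation per sample, computed across genes, then summarized.
per_sample_r = [pearson(t, r) for t, r in zip(true_expr, recon_expr)]
mean_r, ci_low, ci_high = mean_with_ci(per_sample_r)
```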

### Zephyr AI’s Machine Learning (ML) Method Reconstructs Transcriptomes with High Accuracy Across Multiple Tumor Types

Our expression reconstruction model was trained on 720 DepMap cell lines and tested on 164. We compared the model's reconstructed expression profiles (18,969 genes) with true expression across 26 cancer subtypes.

### PAM50 subtypes and other clinical features derived from reconstructed breast cancer expression profiles are comparable to those from real expression

Including clinical features significantly improved accuracy at sample and gene levels. There was a strong positive correlation between reconstructed and true expression, especially for highly variable genes.
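
Gene-level agreement, and its restriction to highly variable genes, can be computed as in the sketch below. The data are synthetic and the constant-size reconstruction error is an illustrative assumption, so the numbers say nothing about Mut2Ex itself; the helper names `gene_level_correlation` and `top_variable_genes` are hypothetical.

```python
import numpy as np

def gene_level_correlation(true_expr, recon_expr):
    """Pearson r for each gene (column), computed across samples (rows)."""
    t = true_expr - true_expr.mean(axis=0)
    r = recon_expr - recon_expr.mean(axis=0)
    num = (t * r).sum(axis=0)
    den = np.sqrt((t ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    return num / den

def top_variable_genes(expr, n_top):
    """Indices of the n_top genes with the largest variance across samples."""
    variances = expr.var(axis=0)
    return np.argsort(variances)[::-1][:n_top]

# Synthetic data: genes differ in variability; reconstruction error has fixed size.
rng = np.random.default_rng(1)
n_samples, n_genes = 100, 40
gene_scale = np.linspace(0.1, 3.0, n_genes)
true_expr = rng.normal(size=(n_samples, n_genes)) * gene_scale
recon_expr = true_expr + rng.normal(scale=0.5, size=(n_samples, n_genes))

gene_r = gene_level_correlation(true_expr, recon_expr)
hv_idx = top_variable_genes(true_expr, n_top=10)
hv_mean_r = gene_r[hv_idx].mean()      # agreement among highly variable genes
all_mean_r = gene_r.mean()             # agreement across all genes
```

Under a roughly constant error, genes with little true variation have less signal to correlate with, which is one reason gene-level correlations are often reported separately for highly variable genes.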

### Zephyr AI’s Reconstruction Model is Robust Across Diverse Commercial NGS Panels

A t-distributed Stochastic Neighbor Embedding (t-SNE) plot of reconstructed expression profiles shows that the model output is robust across genomic inputs from various commercial NGS providers and assays, while capturing salient clinical and biological features, including cancer type.

### Assessing the Clinical Utility of Reconstructed Expression

To assess the clinical utility of reconstructed expression, we applied our method to 564 breast cancer samples. Reconstructed expression performed comparably to true expression and outperformed DNA alterations alone in all tasks.

### Deriving Oncotype DX Signatures from Reconstructed Colorectal Cancer Expression Profiles

Reconstructed expression profiles were generated for 272 colon adenocarcinoma tumors. A high correlation was observed between risk scores from real and reconstructed expression.

### ACKNOWLEDGEMENTS

The authors express their gratitude to the Zephyr AI science, engineering, data and business development teams for invaluable technical support and discussion. We also acknowledge the contributions of the authors and organizations cited, with special thanks to AACR Project GENIE, TCGA and the Cancer Dependency Map for essential data resources.

### CONCLUSION
