Project PhenoSeq: Protein Network Analysis for Phenotypic Outcomes

The project showed promising results on basic prediction tasks and identified key areas for improvement in modeling protein-phenotype relationships. The findings provide a foundation for future work in protein network analysis and phenotype prediction.

This project is a step toward understanding protein-phenotype relationships, and it highlights areas for future research and development in computational biology.

Project Overview

PhenoSeq is an innovative project focused on understanding how protein networks contribute to organism-scale phenotypes, particularly in cancer growth and organism longevity. The project leverages protein embeddings from ESM (Evolutionary Scale Modeling) combined with graph neural networks to predict phenotypic outcomes through protein-protein interactions (PPIs).
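
As a rough illustration of how per-protein embeddings can be combined with PPI structure, the sketch below runs one round of mean-aggregation message passing over a toy interaction graph. The source does not specify the graph neural network architecture used, so the aggregation rule, the function name `aggregate_neighbors`, and the toy proteins and two-dimensional vectors are all assumptions made for illustration. Real ESM embeddings have far more dimensions.

```python
from collections import defaultdict


def aggregate_neighbors(embeddings, edges):
    """One round of mean-aggregation message passing (illustrative only).

    embeddings: dict mapping protein name -> list of floats
    edges: iterable of (protein_a, protein_b) undirected interactions
    Returns a new dict where each protein's vector is the mean of its own
    vector and the vectors of its interaction partners.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    updated = {}
    for protein, vector in embeddings.items():
        # Include the protein itself so isolated nodes keep their embedding.
        group = [vector] + [embeddings[n] for n in sorted(neighbors[protein])]
        dim = len(vector)
        updated[protein] = [
            sum(v[i] for v in group) / len(group) for i in range(dim)
        ]
    return updated


# Toy, hypothetical data: three proteins with 2-dimensional "embeddings".
toy_embeddings = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
toy_edges = [("A", "B"), ("B", "C")]
smoothed = aggregate_neighbors(toy_embeddings, toy_edges)
```

After this step each protein's vector reflects its interaction partners as well as its own sequence, which is the general idea behind feeding PPI data into a graph-based predictor.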

Core Objectives

  1. Develop predictive models for understanding biological drivers of complex diseases
  2. Create frameworks for inferring oncogenic potential of genetic mutations
  3. Analyze clinical significance of protein modifications using sequence embeddings
  4. Establish connections between protein networks and phenotypic outcomes

Data Sources

The project utilized three major public databases:

Methodological Approach

Model Development

The team developed three distinct models:

  1. Baseline Model

    • Fully connected network predicting CRISPR scores from embeddings
    • Achieved correlation of 0.55 with ground truth
    • Outperformed K-nearest neighbors baseline
    • Performance correlated with training set proximity
  2. Cell Line-Specific Model

    • Incorporated cell line identity through one-hot embedding
    • Included mutation status (wild type vs mutated)
    • Achieved 0.44 correlation with ground truth
    • Limited success in predicting cell line-specific differences
  3. PPI-Informed Model

    • Integrated protein-protein interaction data
    • Results comparable to cell line-specific model
    • Limited additional performance gain from PPI integration
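
The inputs described for the cell line-specific model, and the correlation used to score all three models, can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `build_features` and `pearson` are hypothetical, the source does not say which correlation coefficient was reported (Pearson is assumed here), and the tiny two-dimensional embedding stands in for a real ESM vector.

```python
from math import sqrt


def build_features(embedding, cell_line_index, n_cell_lines, is_mutated):
    """Concatenate a protein embedding with cell line and mutation inputs."""
    one_hot = [0.0] * n_cell_lines
    one_hot[cell_line_index] = 1.0          # one-hot cell line identity
    mutation_flag = [1.0 if is_mutated else 0.0]  # wild type = 0, mutated = 1
    return list(embedding) + one_hot + mutation_flag


def pearson(xs, ys):
    """Pearson correlation between predictions and ground-truth scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical example: a 2-dimensional embedding, cell line 1 of 3, mutated.
features = build_features(
    [0.1, 0.2], cell_line_index=1, n_cell_lines=3, is_mutated=True
)
```

The resulting vector is what a fully connected network would receive as input. The baseline model would use the embedding alone.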

Additional Analyses

Key Findings

  1. ESM3 embeddings contain valuable functional information
  2. Simple models can outperform basic baselines
  3. The current approach has limited ability to capture subtle effects
  4. Predicting mutation-specific impacts remains challenging
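
For context on the kind of basic baseline mentioned above, a K-nearest neighbors regressor over embeddings can be sketched in a few lines. The function name `knn_predict`, the choice of Euclidean distance, the value of `k`, and the toy data are all assumptions. The source does not describe how its K-nearest neighbors baseline was configured.

```python
from math import dist


def knn_predict(query, train_embeddings, train_scores, k=3):
    """Predict a score as the mean score of the k nearest training embeddings."""
    ranked = sorted(
        zip(train_embeddings, train_scores),
        key=lambda pair: dist(query, pair[0]),  # Euclidean distance to query
    )
    nearest = ranked[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Toy, hypothetical training set: 2-dimensional embeddings with scores.
train_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_scores = [-1.0, -0.5, 0.0, 2.0]
prediction = knn_predict([0.1, 0.1], train_embeddings, train_scores, k=3)
```

A baseline like this can only interpolate from nearby training proteins, which makes it a natural point of comparison for a learned network.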

Future Directions

  1. Integration of additional data types:
    • Copy number variation
    • Transcriptomic information
  2. Exploration of amino acid level embeddings
  3. Enhanced signal processing methods
  4. Improved model architectures

Technical Achievements

Limitations and Challenges

  1. Limited success in cell line-specific predictions
  2. Challenges in cross-phylogenetic predictions
  3. Limited ability to detect subtle effects
  4. Data integration complexities

Impact and Applications

PhenoSeq Longevity Analysis Component

This analysis revealed both the potential and limitations of using protein sequence data for predicting species longevity, highlighting the importance of taxonomic relationships in such predictions.

Overview

The longevity analysis component of PhenoSeq investigated the relationship between protein sequences and species lifespan across different taxonomic orders, with a particular focus on Primates, Chiroptera (bats), and Cetacea (whales).

Key Findings

1. Taxonomic Order Analysis

2. Prediction Performance

3. Model Architecture Insights

4. Protein Embedding Analysis

5. Hierarchical Prediction Accuracy

Correlation strength increased with taxonomic specificity.

Technical Limitations

Key Insights

PhenoSeq DepMap Analysis Component

This analysis demonstrated both the potential and current limitations of using protein sequence data to predict cancer-relevant protein functions, highlighting areas for future improvement in protein-phenotype prediction models.

Overview

The DepMap component investigated protein function in cancer through CRISPR-based knockout experiments, analyzing 9,353 proteins across 1,150 different cell lines to understand their effects on cancer cell growth.
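
To give a sense of the scale involved, the sketch below flattens a proteins-by-cell-lines score matrix into one training row per pair and computes the upper bound on pairs implied by the counts above. The function name `to_training_rows`, the use of `None` for missing measurements, and the toy values are assumptions. The source does not state how the data was laid out or whether every pair was measured.

```python
def to_training_rows(proteins, cell_lines, score_matrix):
    """Flatten a proteins x cell lines matrix into (protein, cell line, score) rows.

    Entries equal to None are treated as missing and skipped.
    """
    rows = []
    for i, protein in enumerate(proteins):
        for j, cell_line in enumerate(cell_lines):
            score = score_matrix[i][j]
            if score is not None:
                rows.append((protein, cell_line, score))
    return rows


# Upper bound on protein / cell line pairs given the stated counts.
max_pairs = 9353 * 1150

# Toy, hypothetical matrix: 2 proteins x 3 cell lines, one missing value.
rows = to_training_rows(
    ["P1", "P2"],
    ["line_a", "line_b", "line_c"],
    [[-1.2, -0.8, None], [0.1, 0.0, -0.3]],
)
```

With 9,353 proteins and 1,150 cell lines there are at most 10,755,950 protein and cell line pairs.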

Three models:

  1. Baseline Model
  2. Cell-line-specific Model
  3. PPI-informed Model

Key Findings

Model Performance

Technical Insights

Limitations

Technical Details

Future Implications