Drug Discovery and Development


Columbia-CZ team develops 10.3M parameter model that outperforms 100M parameter rivals on cell type classification

By Brian Buntz | July 11, 2025

Confusion matrices comparing cell-type classification performance across four models on human immune cell data. GREmLN (top left) shows the darkest diagonal, indicating superior accuracy in correctly identifying cell types compared to scGPT, Geneformer and scFoundation. For more, see the full bioRxiv paper.

A new foundation model called GREmLN, from a Columbia and Chan Zuckerberg Biohub team, delivers superior cell-type classification with only 10.3 million parameters, outperforming rivals like the 100-million-parameter scFoundation. Released July 9 on bioRxiv, it taps gene regulatory networks to achieve a 0.929 macro F1 score on immune cell data.

“Instead of using large language models, which are based on sequential data, we had to solve some very complicated math to extend the concept to what we call a large graph model,” explained Andrea Califano, Ph.D., president of the Chan Zuckerberg Biohub New York and the paper’s senior author. “In a cell, there is no sequence,” he said. “Gene number one and gene number two are not related in any inherent order. The order is created by the graph-like structure of how gene products regulate each other.”

Andrea Califano, Ph.D.

The GREmLN paper reported that it achieved superior performance relative to established foundation models in cell-type classification. For human immune cells, GREmLN achieved a macro F1 score of 0.929, outperforming scGPT, the influential 33-million-parameter model from Bo Wang, Ph.D.’s lab published in Nature Methods in 2024 (0.924±0.002); Geneformer, the 30-million-parameter transfer learning model from Christina Theodoris, M.D., Ph.D. et al. published in Nature in 2023 (0.792); and scFoundation, the 100-million-parameter model published in Nature Methods in 2024 (0.879).
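For readers unfamiliar with the metric, macro F1 is the unweighted average of per-class F1 scores, so a rare cell type counts as much as a common one. The snippet below is a minimal sketch of that calculation in plain Python. The cell labels are hypothetical toy values chosen for illustration, not data from the paper, and the function name `macro_f1` is ours.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the paper.
y_true = ["T cell", "T cell", "T cell", "B cell", "NK cell", "NK cell"]
y_pred = ["T cell", "T cell", "B cell", "B cell", "NK cell", "T cell"]
print(round(macro_f1(y_true, y_pred), 3))
```

In this toy case each of the three classes scores an F1 of 2/3, so the macro average is also 2/3, even though T cells make up half of the labels.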

The model also demonstrated a knack for reconstructing gene expression, with R² scores of 0.883 on immune cells and 0.861 on cancer-infiltrating myeloid cells. “This is critical because otherwise you need a huge amount of data and you need a huge amount of computational resources to train the model,” Califano said.
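As a rough guide to reading those numbers, R² is conventionally the coefficient of determination: one minus the ratio of squared prediction error to the total variance of the observed values. The sketch below assumes that standard definition; the article does not spell out how the paper computes it, and the expression values are made up for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical expression values for five genes in one cell.
observed = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.5, 1.0, 2.0, 3.0, 3.5]
print(r_squared(observed, predicted))
```

Here the squared error is 0.5 against a total variance of 10, giving an R² of 0.95. A score of 1.0 would mean the reconstruction matched the observed values exactly.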

GREmLN’s approach builds on decades of research into “master regulators”: hub proteins that integrate the effects of diverse mutations. “For the last 20 years, the mantra in oncology has been to target specific mutations, an approach that has not worked very well. Only about 11% of cancer patients benefit, and often the benefit is short-lived,” he noted. “The reason is that every cell in a tumor has a different set of mutations. If you target one mutation, you only kill the cells that depend on it.”

Targeting the hub, not the spokes

His solution involves finding the cellular equivalent of a telephone exchange: “Think of it like a telephone exchange where all calls go through the same hub. Instead of figuring out which individual conversation overloaded the system, we find the hub and fix the problem there. Those hubs are the master regulators: the proteins that integrate the effects of all the mutations. By targeting a small number of these proteins, typically about ten, we can address a cancer problem caused by countless different mutational patterns.”

Under the hood: The GREmLN blueprint

The model’s power comes from incorporating ARACNe algorithm-generated gene regulatory networks. ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) works by analyzing patterns in gene expression data to identify which genes control others. It does this by measuring how strongly genes are co-expressed (mutual information), using statistical sampling (bootstrapping) to ensure reliability, and removing spurious connections. These networks have been effective in shedding light on master regulator proteins and determining their sensitivity to small molecules, including use in clinical trials.
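The two core ideas described above, scoring gene pairs by mutual information and then removing spurious connections, can be illustrated with a heavily simplified sketch. It is not the ARACNe implementation: it assumes expression has already been binned into low (0) and high (1) levels, uses a naive plug-in mutual information estimate, skips bootstrapping entirely, and prunes edges with the data processing inequality, which is the criterion ARACNe is known for. The gene names, the 0.01 threshold and the toy data are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Plug-in mutual information estimate (in nats) for two discrete lists."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (count / n) * math.log(count * n / (px[a] * py[b]))
        for (a, b), count in pxy.items()
    )


def prune_with_dpi(mi, genes, threshold=0.01):
    """Keep edges above the threshold, then drop the weakest edge of each triangle."""
    edges = {
        frozenset(pair)
        for pair in combinations(genes, 2)
        if mi[frozenset(pair)] > threshold
    }
    indirect = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(edge in edges for edge in triangle):
            # Data processing inequality: the weakest link is treated as indirect.
            indirect.add(min(triangle, key=lambda edge: mi[edge]))
    return edges - indirect


# Hypothetical binned expression across 10 cells: A tracks B, B tracks C.
levels = {
    "A": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    "C": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
}
genes = sorted(levels)
mi = {
    frozenset((g, h)): mutual_information(levels[g], levels[h])
    for g, h in combinations(genes, 2)
}
network = prune_with_dpi(mi, genes)
print(sorted(tuple(sorted(edge)) for edge in network))
```

In this toy example the A-C pair shares a small amount of information only because both genes track B, so the pruning step keeps the A-B and B-C edges and discards A-C.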

Training the GREmLN model required just one full epoch on eight Nvidia H100 80 GB GPUs running in parallel. Although that hardware is powerful, the training run is short compared with typical foundation models, which often require weeks of training. The pre-training dataset consisted of 11 million scRNA-seq (single-cell RNA sequencing) profiles spanning 19,000 genes from healthy human cells, sourced from the CELLxGENE dataset and covering 162 cell types from various tissues. Each profile represents a snapshot of which genes are active in an individual cell, providing a large atlas of cellular states across the human body.

Looking ahead, Califano’s team plans to enhance GREmLN with sizable perturbational datasets unique to CZ Biohub NY. Those will include profiles where individual regulatory genes have been systematically silenced across millions of cells. “For this first iteration, we used the same data as other models for a fair comparison,” Califano noted. “But moving forward, we will use massive amounts of perturbational data we are generating in-house.” This approach could refine identification of master regulators, building on Califano’s prior work that has already informed successful cancer trials.

The Chan Zuckerberg Initiative has set an ambitious goal: to cure, prevent or manage all diseases by the end of this century. Califano’s vision for supporting that goal aligns with CZI’s framework-based approach. “The way I interpret ‘curing all diseases’ is not that we will develop a specific drug for every one of the 20,000+ rare genetic diseases. Instead, we will create the framework that allows us to solve these problems.” This framework approach represents what he calls “bucketization,” finding universal foundational elements rather than treating each disease as unique.

This work also dovetails with the Chan Zuckerberg Initiative’s work to build AI-powered virtual cells that can predict cellular behavior. Current AI models treat genes in cells like words in a sentence, but genes don’t have a natural order—they’re more like a network of interconnected components. “The cell is literally like a computer,” Califano explained. “If you can figure out its logic, the network of molecular interactions determining its behavior, you can predict what it will do in response to a perturbation with dramatic accuracy.” With GREmLN now available on CZI’s virtual cell platform, researchers can begin to decode these cellular circuits—moving beyond simply reading the genetic code to understanding their logic.




