The Align Foundation, a nonprofit accelerating predictive biology through research infrastructure driven by artificial intelligence (AI), has announced a collaboration with ATCC, the leading global nonprofit provider of microbial strains, cell lines, in vitro models, and standards, to create the world’s largest public, AI-ready microbial phenotyping dataset.


Together, Align and ATCC will generate high-quality phenotypic data for 1,000 phylogenetically diverse microbial strains across 1,000 cultivation conditions. This will create an unparalleled foundation for AI models that link genotype to phenotype—crucial to engineering biology in a safe and predictable manner.

By combining Align’s scalable, high-throughput experimental platform with ATCC’s extensive authenticated microbial collection and expertise in microbial and cellular genomics, Align aims to bridge a critical gap in the scientific landscape: the lack of large, standardized, public datasets collected under consistent experimental conditions.

Frictionless, scalable and shareable

“Our vision at Align is to build the research infrastructure needed to make biological data collection frictionless, scalable, and shareable,” said Erika DeBenedictis, PhD, co-founder of Align. “Collaborating with ATCC, an organization synonymous with biological quality and reproducibility, is an incredible opportunity to create a large-scale, public resource that can help enable the next generation of AI-driven biological discovery. We’re honored to work alongside them, and this opportunity would not be possible without their diverse and trusted biomaterials.”

The dataset will be expansive, covering diverse environments (such as atmospheric and temperature variations), multiple media types (undefined, semi-defined, and defined), and a wide range of metabolic supplements (carbon sources, vitamins, cofactors, and metals). Each strain’s growth and morphology will be systematically measured and linked to its genomic sequence, enabling researchers to train and validate machine learning models that predict microbial physiology from genetic information.

AI-driven insights

“The reliability of AI-driven biological insights depends entirely on the quality of the data—and ultimately, the source materials—used to train the predictive models. At ATCC, we are committed to providing reference datasets alongside our trusted biological reference materials so that these future insights can be physically reproduced and validated in the lab,” said Ruth Cheng, PhD, president and CEO of ATCC. “Our collaboration with Align is an important step towards enabling researchers to reliably apply AI in biology by building a dataset that is traceable to the authenticated microbial resources at ATCC.”

This first-of-its-kind effort reflects the two organizations’ shared commitment to making biological research more reproducible, scalable, and accessible, so that the entire scientific community can take on its most challenging research problems. Phenotypic results will be hosted on Align’s Phenome Portal, with links to ATCC’s Genome Portal, enabling seamless cross-navigation between genotype and phenotype.

Why it matters

Today’s biological data landscape is fragmented: datasets are often small, inconsistent, or inaccessible. By launching a public, scalable, and standardized microbial phenotyping resource, Align and ATCC aim to remove barriers for researchers and create a platform for faster, better model development that enables predictable engineering, with potential benefits for human health, environmental sustainability, and economic growth.

This collaboration marks a significant step forward, and the team has invited the global research community to engage with this open resource. Learn more about this project and explore opportunities to collaborate at alignbio.org/datasets-microbes or reach out directly at contact@alignbio.org.