“‘Make sure you finish your antibiotics course, even if you start feeling better’ is a medical mantra many hear but ignore,” says César de la Fuente of the University of Pennsylvania.



He explains that this phrase is crucial, however, because noncompliance could hamper the efficacy of antibiotics, a key 20th-century discovery. “And in recent decades, this has led to the rise of drug-resistant bacteria, a growing global health crisis that causes approximately 4.95 million deaths per year and threatens to make even common infections deadly,” he says.

De la Fuente, a Presidential Assistant Professor, and a team of interdisciplinary researchers have been working on biomedical innovations to tackle this looming threat. In a new study, published in Nature Biomedical Engineering, they developed an artificial intelligence tool to mine a vast and largely unexplored body of biological data, more than 10 million molecules from both modern and extinct organisms, to discover new antibiotic candidates.

Six years to a drug candidate

“With traditional methods, it takes around six years to develop new preclinical drug candidates to treat infections and the process is incredibly painstaking and expensive,” de la Fuente says. “Our deep learning approach can dramatically reduce that time, driving down costs as we identified thousands of candidates in just a few hours, and many of them have preclinical potential, as tested in our animal models, signaling a new era in antibiotic discovery.”

César de la Fuente holds a 3D model of a unique ATP synthase fragment, identified by his lab’s deep learning model, APEX, as having potent antibiotic properties. This molecular structure, resurrected from ancient genetic data, represents a promising lead in the fight against antibiotic-resistant bacteria.

These latest findings build on methods de la Fuente has been working on since his arrival at Penn in 2019. The team asked a fundamental question: Can machines be used to accelerate antibiotic discovery by mining the world’s biological information? He explains that this idea is based on the notion that biology, at its most basic level, is an information source, which could theoretically be explored with AI to find new useful molecules.

Simple algorithms

The team started by applying simple algorithms that could mine individual proteins to find small antibiotic molecules hidden within their amino acid sequences. With advances in computational power, de la Fuente realized that they could scale up from mining individual proteins to mining entire proteomes.
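The protein-mining step can be pictured as a sliding-window scan. The Python sketch below is only an illustration, not the lab's actual algorithm: the function names, the toy sequence, and the 8-to-50-residue length window are all assumptions made for the example. It enumerates every contiguous fragment of a protein, then repeats the scan across a whole proteome.

```python
from typing import Dict, Iterator, Set, Tuple


def encrypted_peptide_candidates(
    protein: str, min_len: int = 8, max_len: int = 50
) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, fragment) for every contiguous fragment of
    `protein` whose length lies in [min_len, max_len].

    The length window is an assumption chosen for illustration.
    """
    n = len(protein)
    for start in range(n):
        longest = min(max_len, n - start)
        for length in range(min_len, longest + 1):
            yield start, protein[start:start + length]


def mine_proteome(
    proteome: Dict[str, str], min_len: int = 8, max_len: int = 50
) -> Set[str]:
    """Collect the unique fragments found across all proteins in a proteome."""
    unique: Set[str] = set()
    for sequence in proteome.values():
        for _, fragment in encrypted_peptide_candidates(sequence, min_len, max_len):
            unique.add(fragment)
    return unique


# Toy 10-residue sequence (made up for the example, not a real protein).
candidates = list(encrypted_peptide_candidates("MKTAYIAKQR"))
print(len(candidates))  # 6 fragments of length 8 to 10
```

A real pipeline would then score each fragment for predicted antimicrobial activity. Because the number of fragments grows quickly with protein length and proteome size, it is easy to see why scaling from single proteins to whole proteomes depended on more computing power.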

De la Fuente says the team began by looking at one protein at a time, then scaled up as computer efficiency and power improved. Next, he says, they were able to mine “whole proteomes, which are all the proteins encoded in an organism’s genome, and this led us to discovering thousands of new antimicrobial molecules in the human proteome and later in the proteomes of ancient hominids like Neanderthals and Denisovans. Then, we challenged ourselves to mine all extinct organisms known to science.”

The team developed what they call “molecular de-extinction,” the revival of molecules from extinct organisms that have potential therapeutic properties, and it led to the discovery of therapeutic molecules in ancient organisms’ genomes. They hypothesize that many of the molecules they are finding may have played a role in host immunity throughout evolution.

Novel candidate peptides

This idea culminated in a separate paper, published in the journal Cell, for which de la Fuente and his team conducted an extensive analysis of 87,920 genomes from specific microbes and 63,410 microbial genome mixes (metagenomes) from environmental samples worldwide. This research identified 863,498 candidate antimicrobial peptides, more than 90% of them previously undescribed.

And in the recent Nature Biomedical Engineering paper, the team developed a powerful deep learning model called Antibiotic Peptide de-Extinction, or APEX, which can sample hundreds of proteomes across evolutionary history, helping identify the best antibiotic candidates from various organisms, including woolly mammoths, straight-tusked elephants, ancient sea cows, and extinct giant elk.

Marcelo Der Torossian Torres, co-first author of the study and a postdoctoral researcher in the de la Fuente Lab, says the team started building APEX by first creating a “highly standardized dataset to train it with, which has been missing in the literature.” “It’s surprising because there are so many datasets out there, and researchers will use multiple sets assuming all the samples were collected in a very systematic, consistent way, but that is not always the case,” he says.

Large dataset

APEX, he says, also makes use of “probably the largest dataset of this kind” as a control for the team’s experiments. This allowed the researchers to establish how their model performed relative to existing knowledge and to validate the uniqueness and efficacy of the antibiotic sequences discovered by APEX.

“AI will only be successful in a field as complex and chaotic as biology if we have high-quality datasets,” de la Fuente says. “We realized this many years ago and have been working hard to create datasets that can be used to train our algorithms.”

Fangping Wan, the other co-first author who is also a postdoctoral researcher in the de la Fuente Lab, says that APEX uses a combination of recurrent neural networks and attention networks, which perform two key tasks to identify encrypted peptides, fragments within proteins that have antimicrobial properties.

“Recurrent neural networks are great at processing sequences, like proteins, because they can handle data where inputs are dependent and ordered, and attention networks improve the network’s ability to home in on specific parts of the protein’s structure that are likely involved in antimicrobial activity,” Wan says.
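For readers who want to see how those two pieces fit together, here is a minimal Python sketch of a recurrent pass followed by attention pooling. It is not the APEX architecture: the layer sizes, the random untrained weights, the helper names, and the example peptide are all assumptions for illustration, so the score it produces carries no biological meaning.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_states(sequence, w_in, w_rec):
    """Simple recurrence: h_t = tanh(W_in x_t + W_rec h_{t-1}).

    Each state depends on the previous one, so residue order matters.
    """
    hidden = [0.0] * len(w_rec)
    states = []
    for residue in sequence:
        pre = [a + b for a, b in zip(matvec(w_in, one_hot(residue)),
                                     matvec(w_rec, hidden))]
        hidden = [math.tanh(v) for v in pre]
        states.append(hidden)
    return states


def attention_pool(states, query):
    """Softmax attention over positions; returns (weights, weighted sum)."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states))
               for i in range(dim)]
    return weights, context


rng = random.Random(0)  # fixed seed; the weights are random, not trained
hidden_size = 4
w_in = random_matrix(hidden_size, len(AMINO_ACIDS), rng)
w_rec = random_matrix(hidden_size, hidden_size, rng)
query = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]

peptide = "KKLLKWLLKK"  # made-up sequence for the example
states = rnn_states(peptide, w_in, w_rec)
weights, context = attention_pool(states, query)
score = sum(w * c for w, c in zip(w_out, context))
print([round(w, 3) for w in weights], round(score, 3))
```

The attention weights sum to one across the peptide's positions, so after training they could be read as how much each residue contributes to the prediction, which is the "homing in" Wan describes.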

Predicting activity

The researchers note that APEX did a markedly better job of predicting activity than the benchmark models, and it was able to mine through 10,311,899 peptides and identify 37,176 sequences with predicted broad-spectrum antimicrobial activity, including 11,035 sequences not found in extant organisms.
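To put those counts in proportion, the short Python calculation below works out what share of the mined peptides were flagged as active, and what share of those were absent from living organisms. The numbers come from the figures above; the variable names are only illustrative.

```python
# Counts reported for the APEX screen.
total_peptides = 10_311_899    # peptides mined
predicted_active = 37_176      # predicted broad-spectrum antimicrobial activity
not_in_extant = 11_035         # active sequences not found in extant organisms

active_fraction = predicted_active / total_peptides
novel_fraction = not_in_extant / predicted_active

print(f"Flagged as active: {active_fraction:.2%}")          # 0.36%
print(f"Of those, absent from extant organisms: {novel_fraction:.1%}")  # 29.7%
```

In other words, fewer than 4 in every 1,000 mined peptides passed the model's filter, and nearly a third of those that passed were sequences with no counterpart in living organisms.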

Some of these showed effectiveness in preclinical mouse models of infection. This is a vital step, as it moves these candidates closer to potential clinical trials and eventual therapeutic use. In addition, most of the archaic peptides had a new mechanism of action, depolarizing the cell membrane of bacteria, a unique way of targeting them that hints at a new paradigm of infectious disease control.

Altogether, the computational work performed in the de la Fuente Lab in the past five years has dramatically accelerated the ability to discover new antibiotics. What used to take many years of painstaking work with traditional methods can now be done in just a few hours with AI.

César de la Fuente is a Presidential Assistant Professor and leader of the Machine Biology Group. He has appointments in the Perelman School of Medicine, School of Engineering and Applied Science, and School of Arts & Sciences at the University of Pennsylvania.

Marcelo Der Torossian Torres is a research associate in the de la Fuente Lab at Penn.

Fangping Wan is a postdoctoral researcher in the de la Fuente Lab at Penn.

This research was funded by the Langer Prize (AIChE Foundation), National Institutes of Health (award R35GM138201), and Defense Threat Reduction Agency (DTRA, HDTRA11810041, HDTRA1-21-1-0014, and HDTRA1-23-1-0001).