Technological advances in sequencing have fueled the “omics revolution,” making big data a staple of biological research. However, many researchers feel ill-equipped to wrangle and analyze these massive datasets, leading them to seek the help of bioinformaticians. Now, with the help of advancing artificial intelligence (AI) technology, analysis can become less of an impediment.
In a bioRxiv preprint that has yet to undergo peer review, researchers described an AI chatbot called CellWhisperer that analyzes transcriptomics data and reports its findings in plain English.1 With it, researchers with limited computational chops can probe their dense datasets by giving CellWhisperer non-technical queries, such as “What are these selected cells?” or “Describe the sample concisely.”
Last year, AI algorithms called large language models spooked the world with their ability to respond to prompts in articulate English, but some researchers looked past the models’ startling nature and saw a way to streamline data analysis. Biologists have begun training these models on literature repositories to quickly retrieve information from publications. GeneGPT, for example, can answer questions about genes by consulting genomics databases.2 Moritz Schaefer, a bioinformatician at the Medical University of Vienna and study coauthor, wanted to harness AI to simplify analysis of transcriptomics data. “Right now, biologists need to learn programming languages,” he noted. “We wanted to turn this around and said, ‘the computer should learn English.’”
When a bioinformatician analyzes transcriptomics data, they draw on past research for contextual information about patterns of gene expression. For example, they check whether a group of genes is typically expressed together by comparing it against historical datasets. An AI model needs access to the same resources, so Schaefer and his colleagues trained their algorithm on pre-existing transcriptomics data. They used 20,000 studies from Gene Expression Omnibus and nearly 400,000 human transcriptomes from CELLxGENE Census.3,4 Together, these repositories gave the AI tool the training material it needed to recognize a cell type or a disease state based on its gene expression patterns.
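To make the co-expression check concrete, here is a minimal sketch of one common way to score it: average the pairwise Pearson correlation of the genes across reference samples. This is an illustration only, not the method used in the preprint. The gene names, expression values, and the function names `pearson` and `coexpression_score` are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of expression values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A gene with constant expression carries no co-variation signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpression_score(expression, genes):
    """Mean pairwise correlation of the given genes across reference samples."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        raise ValueError("need at least two genes to score co-expression")
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


# Toy reference data (invented): one list of values per gene, one value per sample.
reference = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0, 10.0],
    "GENE_C": [5.0, 4.0, 3.0, 2.0, 1.0],
}

score_ab = coexpression_score(reference, ["GENE_A", "GENE_B"])
score_abc = coexpression_score(reference, ["GENE_A", "GENE_B", "GENE_C"])
```

In this toy data, GENE_A and GENE_B rise and fall together, so their score is close to 1. Adding GENE_C, which moves in the opposite direction, pulls the average down.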
To make their tool even more user-friendly, they paired their trained model with an AI chatbot that could respond to prompts written in English. They turned to an open-source large language model called Mistral 7B and fine-tuned it on more than 100,000 examples of conversational questions and answers about transcriptomics data.5 Simple queries included “Give a brief description of these cells,” whereas complex prompts asked the model to list the most prominently expressed genes or the most active cellular pathways. The result was an AI chatbot adept at discussing transcriptomics, which the team made publicly available in October 2024.
To take CellWhisperer for a test run, Schaefer and his colleagues queried the model about transcriptomics studies that they had excluded from the training data. Starting with an easy task, the team confirmed that, most of the time, the model correctly identified distinct cell types from multiple organs and tissues, including fat, muscle, lung, and skin.6 It had somewhat more trouble distinguishing between similar cell types, namely those in the pancreas.7 The model also struggled with a few transcriptomic samples from diseased cells, suggesting that the training data lacked sufficient information about these conditions. Schaefer said CellWhisperer works well with some conditions, such as certain liver cancers, but struggles more with other diseases, such as skin melanoma.
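A held-out test of this kind is usually scored by comparing the model's answers against known labels, one category at a time, so that weak spots such as closely related cell types stand out. The sketch below shows that bookkeeping under simple assumptions. The labels and predictions are invented for illustration and are not results from the study, and `per_label_accuracy` is a hypothetical helper name.

```python
from collections import defaultdict


def per_label_accuracy(true_labels, predicted_labels):
    """Return the fraction of correct predictions for each true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Invented held-out examples: the known cell type and the model's answer.
truth = ["lung", "lung", "skin", "pancreatic alpha", "pancreatic beta"]
guess = ["lung", "lung", "skin", "pancreatic beta", "pancreatic beta"]

accuracy = per_label_accuracy(truth, guess)
```

Breaking accuracy down by label, rather than reporting a single overall number, is what reveals patterns like the one described above, where distinct cell types are easy and similar ones are confused with each other.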
Although CellWhisperer made correct predictions most of the time, Schaefer said that users should be aware that AI tools can make occasional errors. “It’s important to keep in mind that this AI tool is especially helpful for explorative analysis and brainstorming, and all its responses need to be cross-checked with other experiments,” Schaefer noted.
Maxim Nosenko, an immunologist at Trinity College Dublin who was not involved with the work, said, “Anyone can analyze the sequencing data using CellWhisperer, so that’s certainly a big advantage.” He added, “This tool is really timely now when there is a huge amount of sequencing data.” However, he noted that, in its present form, CellWhisperer is limited to data on human cells since the researchers excluded animal findings. “It is not applicable, for now at least, to mouse studies,” said Nosenko, who uses mice as a model species.
Schaefer plans to build on CellWhisperer’s capabilities. “We want to develop this further to become a semi-autonomous research assistant,” he said. Currently, CellWhisperer responds to one query at a time, but Schaefer hopes the tool will eventually carry out a comprehensive analysis on its own without the need for small talk.
1. Schaefer M, et al. Multimodal learning of transcriptomes and text enables interactive single-cell RNA-seq data exploration with natural-language chats. bioRxiv. 2024.10.15.618501.
2. Jin Q, et al. GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics. 2024;40(2):btae075.
3. Clough E, et al. NCBI GEO: Archive for gene expression and epigenomics data sets: 23-year update. Nucleic Acids Res. 2024;52(D1):D138-D144.
4. Abdulla S, et al. CZ CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data. bioRxiv. 2023.10.30.563174.
5. Jiang AQ, et al. Mistral 7B. arXiv. 2310.06825.
6. The Tabula Sapiens Consortium. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science. 2022;376(6594):eabl4896.
7. Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods. 2022;19(1):41-50.