
Genetic variants in Tumors

Using AI to identify genetic variants in tumors with DeepSomatic


DeepSomatic is an AI-powered tool that identifies cancer-related mutations in a tumor’s genetic sequence to help pinpoint what’s driving the cancer.

Cancer is fundamentally a genetic disease in which the genetic controls on cell division go awry. Many types of cancer exist, and each poses unique challenges as it can have distinct genetic underpinnings. A powerful way to study cancer, and a critical step toward creating a treatment plan, is to identify the genetic mutations in tumor cells. Indeed, clinicians will now often sequence the genomes of biopsied tumor cells to inform treatment plans that specifically disrupt how that cancer grows.

Our new paper, “DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies” in Nature Biotechnology, written with partners at the University of California, Santa Cruz Genomics Institute and other federal and academic researchers, presents a tool that uses machine learning to identify genetic variants in tumor cells more accurately than current methods. DeepSomatic is a flexible model that uses convolutional neural networks to identify tumor variants. It works on data from all major sequencing platforms and for different types of sample processing, and it can extend its learning to cancer types not included in training.

We have made both the tool and the high-quality training dataset we created openly available to the research community. This work is part of broader Google efforts to develop AI methods to understand cancer and help scientists treat it, including analyzing mammogram images for breast cancer screening and CT scans for lung cancer screening, as well as a partnership aimed at using AI to advance research on gynecological cancers. Our hope is to speed cancer research and further the goal of precision medicine.

Genetic variation acquired after birth


Genome sequencing is used in research and medical clinics to identify genetic variations between an individual and the human reference genome. Distinguishing between real variants and simple errors made during the sequencing process is challenging. That’s why almost a decade ago Google Research introduced DeepVariant to identify inherited variants, also called germline variants, that came from parents and are found in all of the body’s cells.

The genetics of cancer is more complex. Cancer is often driven by variants acquired after birth. Environmental exposure that damages DNA, such as UV light or chemical carcinogens, as well as random errors that occur during DNA replication, can cause cells in the body, known as somatic cells, to acquire new variants. Sometimes these acquired variants change the normal behavior of cells and cause them to replicate when they shouldn’t. This process drives the initial development of cancer, as well as its later progression to faster-growing and more invasive stages.

Identifying variants specific to some of a person’s somatic cells is much harder than identifying inherited variants. Tumor cells can contain a diverse set of acquired variants at different frequencies, and the error rate of sequencing can be higher than the rate at which a somatic variant is present in a sample.
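To make the last point concrete, here is a minimal arithmetic sketch. The function name and all numbers are hypothetical and chosen only for illustration; they are not measurements from the paper. It compares the expected number of reads carrying a real low-frequency variant with the expected number of reads carrying a sequencing error at the same position.

```python
def expected_supporting_reads(depth, variant_allele_fraction, per_base_error_rate):
    """Return (expected reads with the real variant, expected reads with an error).

    All inputs are illustrative assumptions, not values from the DeepSomatic paper.
    """
    variant_reads = depth * variant_allele_fraction
    error_reads = depth * per_base_error_rate
    return variant_reads, error_reads


# Hypothetical scenario: 100 reads cover the site, the variant is present in
# 0.5% of the sequenced DNA, and 1% of base calls at the site are wrong.
variant_reads, error_reads = expected_supporting_reads(
    depth=100,
    variant_allele_fraction=0.005,
    per_base_error_rate=0.01,
)

# A caller that only counts mismatching reads cannot separate signal from noise here.
noise_dominates = error_reads > variant_reads
```

Under these assumed numbers, errors are expected to outnumber true variant reads at the site, which is why simple read counting is not enough and a model has to weigh additional evidence such as base quality and alignment context.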

Training DeepSomatic to spot genetic variation in tumor cells


We developed DeepSomatic to address these challenges and accurately identify somatic variants. In most clinical and research settings, cancer is studied by sequencing the tumor cells acquired through biopsy, as well as normal cells that are unaffected by the tumor growth and contain more typical inherited genetic variations. DeepSomatic is trained to identify variations observed in tumor cells that are not inherited variants. These types of variations can provide critical insights about which variations are driving the tumor growth. DeepSomatic can also identify somatic variation in tumor-only mode, where a non-tumor sequence is not available, for example in a blood cancer like leukemia, where it is hard to get only normal cells from a blood draw. Because it supports the common ways clinicians and researchers study cancer, DeepSomatic is applicable to many research and clinical settings.

Like our earlier tool, DeepVariant, the DeepSomatic model works by first turning genetic sequencing data into a set of images. The images represent the sequencing data, alignment along the chromosome, the quality of the output, and other variables. DeepSomatic then uses its convolutional neural network on data from tumor cells and non-cancerous cells to differentiate between the reference genome, the non-cancer germline variants in that individual, and the cancer-caused somatic variants in the tumor, while discarding variations caused by small errors acquired during the sequencing process. The result is a list of cancer-related variants, or mutations.

DeepSomatic detects cancer variants in genomic data. First, sequencing data from the tumor cells and non-cancerous cells are turned into images. DeepSomatic passes these images through its convolutional neural network to differentiate between the reference genome, the non-cancer germline variants in that individual, and the cancer-caused somatic variants in the tumor, while discarding variations caused by small sequencing errors. The result is a list of cancer-caused variants, or mutations.
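The idea of turning aligned reads into an image can be sketched as follows. This is a simplified illustration, not DeepSomatic's actual encoding: the function `encode_pileup`, the three channels, the numeric base codes, and the quality cap of 40 are all assumptions made for this example, and reads are assumed to align without insertions or deletions.

```python
# Hypothetical numeric codes for the four bases (an assumption for this sketch).
BASE_CODES = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}


def encode_pileup(reference, reads, window_start):
    """Encode aligned reads as a rows x columns x 3 nested list.

    reference: reference bases for the window, one character per column.
    reads: list of (start_position, sequence, base_qualities), gap-free alignments.
    Channels per pixel: [base code, scaled base quality, mismatch-with-reference flag].
    """
    width = len(reference)
    image = []
    for start, sequence, qualities in reads:
        # Columns a read does not cover stay at zero in every channel.
        row = [[0.0, 0.0, 0.0] for _ in range(width)]
        for offset, (base, quality) in enumerate(zip(sequence, qualities)):
            column = start - window_start + offset
            if 0 <= column < width:
                row[column] = [
                    BASE_CODES.get(base, 0.0),
                    min(quality, 40) / 40.0,
                    1.0 if base != reference[column] else 0.0,
                ]
        image.append(row)
    return image


# Toy window starting at position 100; the second read differs from the
# reference at position 103 (A instead of T).
reference = "ACGTAC"
reads = [
    (100, "ACGTAC", [40, 40, 40, 40, 40, 40]),
    (102, "GAAC", [30, 20, 40, 40]),
]
image = encode_pileup(reference, reads, window_start=100)
```

In a tumor-normal setting, one plausible design is to stack rows from the normal sample and the tumor sample in the same image so a convolutional network can compare them directly; how DeepSomatic arranges its inputs in detail is described in the paper rather than here.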

Training accurate models that can identify genetic variation for different cancer types requires comprehensive, high-quality data and truth sets. For this work we created a new training and evaluation dataset for detecting variants in tumor cells. With our partners at UC Santa Cruz and the National Cancer Institute, we sequenced tumor cells and accompanying normal cells from four breast cancer samples and two lung cancer samples from research cell lines. We refer to the resulting reference dataset as CASTLE.

Testing DeepSomatic’s ability to spot cancer-related variants


We trained DeepSomatic on three of the breast cancer genomes and the two lung cancer genomes in the CASTLE reference dataset. We then tested DeepSomatic’s performance in several ways, including on the single breast cancer genome that was not included in its training data, and on chromosome 1 from each sample, which we also excluded from the training.
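The hold-out scheme described above can be sketched as a simple filter. The sample labels (`breast_1`, `lung_1`, and so on), the dictionary layout, and the function `split_examples` are hypothetical placeholders for this illustration; they are not the names used in the dataset or in DeepSomatic's code.

```python
def split_examples(examples, held_out_samples, held_out_chromosomes):
    """Split labeled examples into training and test sets.

    An example goes to the test set if it comes from a held-out sample
    or from a held-out chromosome; everything else is used for training.
    """
    train, test = [], []
    for example in examples:
        is_held_out = (
            example["sample"] in held_out_samples
            or example["chrom"] in held_out_chromosomes
        )
        (test if is_held_out else train).append(example)
    return train, test


# Hypothetical examples: four breast and two lung samples, two chromosomes each.
samples = ["breast_1", "breast_2", "breast_3", "breast_4", "lung_1", "lung_2"]
examples = [
    {"sample": sample, "chrom": chrom}
    for sample in samples
    for chrom in ("chr1", "chr2")
]

train, test = split_examples(
    examples,
    held_out_samples={"breast_4"},   # the one breast sample kept out of training
    held_out_chromosomes={"chr1"},   # chromosome 1 kept out for every sample
)
```

Holding out both a whole sample and a whole chromosome tests two different things: whether the model generalizes to a genome it has never seen, and whether it generalizes to unseen regions of genomes it has partly seen.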

Results show that DeepSomatic models developed for each of the three major sequencing platforms performed better than other methods, identifying more tumor variants with higher accuracy. The tools used for comparison on short-read sequencing data were SomaticSniper, MuTect2 and Strelka2 (with SomaticSniper specifically for single nucleotide variants, or SNVs). For long-read sequencing data we compared against ClairS, a deep learning model trained on synthetic data.

In our tests DeepSomatic identified 329,011 somatic variants across the six reference cell lines and a seventh preserved sample. DeepSomatic does particularly well at identifying cancer variations that involve insertions and deletions (“indels”) of genetic code. For these types of variants, DeepSomatic substantially increased the F1-score, a balanced measure of how well the model finds true variants in a sample (recall) while avoiding false positives (precision). On Illumina sequencing data the next-best method scored 80% at identifying indels, while DeepSomatic scored 90%. On Pacific Biosciences sequencing data, the next-best method scored less than 50% at identifying indels, and DeepSomatic scored more than 80%.
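For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives, and false negatives; the counts in the usage example are hypothetical and are not results from the paper.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from variant-call counts.

    precision = TP / (TP + FP): fraction of called variants that are real.
    recall    = TP / (TP + FN): fraction of real variants that were called.
    F1        = harmonic mean of precision and recall.
    """
    called = true_positives + false_positives
    real = true_positives + false_negatives
    precision = true_positives / called if called else 0.0
    recall = true_positives / real if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 real indels called, 10 false calls, 10 real indels missed.
precision, recall, f1 = precision_recall_f1(
    true_positives=90, false_positives=10, false_negatives=10
)
```

Because it is a harmonic mean, the F1-score stays low if either precision or recall is low, so a caller cannot score well by reporting everything or by reporting almost nothing.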


