Using AI to identify genetic variants in tumors with DeepSomatic
DeepSomatic is an AI-powered tool that identifies cancer-related mutations in a tumor’s genetic sequence to help pinpoint what’s driving the cancer.
Cancer is fundamentally a genetic disease in which the genetic controls on cell division go awry. Many types of cancer exist, and each poses unique challenges as it can have distinct genetic underpinnings. A powerful way to study cancer, and a critical step toward creating a treatment plan, is to identify the genetic mutations in tumor cells. Indeed, clinicians will now often sequence the genomes of biopsied tumor cells to inform treatment plans that specifically disrupt how that cancer grows.
Our new paper, “DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies,” published in Nature Biotechnology with partners at the University of California, Santa Cruz Genomics Institute and other federal and academic researchers, presents a tool that uses machine learning to identify genetic variants in tumor cells more accurately than current methods. DeepSomatic is a flexible model that uses convolutional neural networks to identify tumor variants. It works on data from all major sequencing platforms and with different types of sample processing, and it can extend its learning to cancer types not included in training.
We have made both the tool and the high-quality training dataset we created openly available to the research community. This work is part of broader Google efforts to develop AI methods to understand cancer and help scientists treat it. Those efforts include analyzing mammogram images for breast cancer screening and CT scans for lung cancer screening, as well as a partnership aimed at using AI to advance research on gynecological cancers. Our hope is to speed cancer research and further the goal of precision medicine.
Genetic variation acquired after birth
Genome sequencing is used in research and medical clinics to identify genetic variations between an individual and the human reference genome. Distinguishing between real variants and simple errors made during the sequencing process is challenging. That’s why almost a decade ago Google Research introduced DeepVariant to identify inherited variants, also called germline variants, that came from parents and are found in all of the body’s cells.
The genetics of cancer is more complex. Cancer is often driven by variants acquired after birth. Environmental exposures that damage DNA, such as UV light or chemical carcinogens, as well as random errors that occur during DNA replication, can cause cells in the body, known as somatic cells, to acquire new variants. Sometimes these acquired variants change the normal behavior of cells and cause them to replicate when they shouldn’t. This process drives the initial development of cancer, as well as its later progression to faster-growing and more invasive stages.
Identifying variants specific to some of a person’s somatic cells is much harder than identifying inherited variants. Tumor cells can contain a diverse set of acquired variants at different frequencies, and the sequencing error rate can be higher than the rate at which a somatic variant appears in a sample.
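To see why this is hard, consider a rough back-of-the-envelope sketch. It assumes a hypothetical read depth, error rate, and variant fraction (none of these numbers come from the paper) and treats each read as an independent draw, which real sequencing data does not strictly satisfy. Under those assumptions, it compares how often at least a few variant-supporting reads would appear from errors alone versus from a real low-frequency somatic variant.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# All values below are illustrative assumptions, not figures from the paper.
DEPTH = 100              # reads covering the site
ERROR_RATE = 0.01        # chance a read shows the variant base purely by error
VARIANT_FRACTION = 0.02  # fraction of reads expected to carry a real somatic variant
MIN_ALT_READS = 3        # reads required before a site counts as a candidate

p_from_errors = prob_at_least(MIN_ALT_READS, DEPTH, ERROR_RATE)
p_from_variant = prob_at_least(MIN_ALT_READS, DEPTH, VARIANT_FRACTION)

print(f"P(>= {MIN_ALT_READS} alt reads | errors only):  {p_from_errors:.3f}")
print(f"P(>= {MIN_ALT_READS} alt reads | real variant): {p_from_variant:.3f}")
```

With these assumed numbers, a fixed read-count threshold would miss many real variants while still letting some error-only sites through.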
Training DeepSomatic to spot genetic variation in tumor cells
We developed DeepSomatic to address these challenges and accurately identify somatic variants. In most clinical and research settings, cancer is studied by sequencing tumor cells acquired through biopsy, along with normal cells that are unaffected by the tumor growth and contain more typical inherited genetic variations. DeepSomatic is trained to identify variations observed in tumor cells that are not inherited. These variations can provide critical insights into what is driving the tumor growth. DeepSomatic can also identify somatic variation in a tumor-only mode, where a non-tumor sequence is not available, for example in a blood cancer like leukemia, where it is hard to get only normal cells from a blood draw. Because it extends to the common ways clinicians and researchers study cancer, DeepSomatic is applicable to many research and clinical settings.
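The paired tumor-normal idea can be illustrated with a deliberately naive rule. The sketch below is not how DeepSomatic decides (it uses a trained neural network). The function names, threshold, and read counts are all hypothetical and only show the underlying logic: a variant also seen in the normal sample is likely inherited, while one seen only in the tumor is a somatic candidate.

```python
def allele_fraction(alt_reads: int, depth: int) -> float:
    """Fraction of reads at a site that support the variant allele."""
    return alt_reads / depth if depth else 0.0


def naive_label(tumor_alt, tumor_depth, normal_alt, normal_depth, min_af=0.05):
    """Label a site using a simple allele-fraction threshold.

    min_af is an arbitrary illustrative cutoff, not a DeepSomatic parameter.
    """
    tumor_af = allele_fraction(tumor_alt, tumor_depth)
    normal_af = allele_fraction(normal_alt, normal_depth)
    if normal_af >= min_af:
        return "germline"            # present in normal cells, so likely inherited
    if tumor_af >= min_af:
        return "somatic"             # present only in the tumor
    return "reference_or_error"      # too little support in either sample


# Hypothetical read counts: (tumor_alt, tumor_depth, normal_alt, normal_depth)
sites = {
    "site_a": (20, 100, 0, 80),
    "site_b": (50, 100, 40, 80),
    "site_c": (1, 100, 0, 80),
}
for name, counts in sites.items():
    print(name, naive_label(*counts))
```

A fixed threshold like this breaks down when variant frequencies are low or normal tissue is contaminated with tumor cells, which is part of the motivation for a learned model.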
Like our earlier tool, DeepVariant, the DeepSomatic model works by first turning genetic sequencing data into a set of images. The images represent the sequencing data, alignment along the chromosome, the quality of the output, and other variables. DeepSomatic then uses its convolutional neural network on data from tumor cells and non-cancerous cells to differentiate between the reference genome, the non-cancer germline variants in that individual, and the cancer-caused somatic variants in the tumor, while discarding variations caused by small errors acquired during the sequencing process. The result is a list of cancer-related variants, or mutations.
Figure: DeepSomatic detects cancer variants in genomic data. Sequencing data from the tumor cells and non-cancerous cells are turned into images, which DeepSomatic passes through its convolutional neural network to produce a list of somatic variants, or mutations.
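The idea of turning sequencing data into an image can be sketched with a toy encoder. This is a simplified illustration, not DeepSomatic's actual encoding: the channel names, base codes, and example reads are assumptions, and the reads are assumed to align without insertions or deletions. Each read becomes one row, each reference position one column, and separate channels hold the base, its quality score, and whether it differs from the reference.

```python
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}  # 0 means "no read coverage"


def encode_pileup(reference, reads):
    """Encode aligned reads as three row-by-column channels.

    reads: list of (start, bases, qualities) tuples, where start is the
    0-based reference position of the read's first base.
    """
    width = len(reference)
    channels = {"base": [], "quality": [], "mismatch": []}
    for start, bases, qualities in reads:
        base_row = [0] * width
        quality_row = [0] * width
        mismatch_row = [0] * width
        for offset, (base, quality) in enumerate(zip(bases, qualities)):
            pos = start + offset
            if 0 <= pos < width:
                base_row[pos] = BASE_CODES.get(base, 0)
                quality_row[pos] = quality
                mismatch_row[pos] = int(base != reference[pos])
        channels["base"].append(base_row)
        channels["quality"].append(quality_row)
        channels["mismatch"].append(mismatch_row)
    return channels


# Hypothetical reference window and two aligned reads.
reference = "ACGTAC"
reads = [
    (0, "ACGT", [30, 30, 30, 30]),
    (2, "GAAC", [20, 10, 35, 35]),
]
image = encode_pileup(reference, reads)
for row in image["mismatch"]:
    print(row)
```

In this toy input, the second read disagrees with the reference at one position, and that base also has the lowest quality score, the kind of joint signal a convolutional network can learn to weigh.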
Training accurate models that can identify genetic variation for different cancer types requires comprehensive, high-quality data and truth sets. For this work we created a new training and evaluation dataset for detecting variants in tumor cells. With our partners at UC Santa Cruz and the National Cancer Institute, we sequenced tumor cells and accompanying normal cells from four breast cancer samples and two lung cancer samples from research cell lines. These samples form the CASTLE reference dataset.
Testing DeepSomatic’s ability to spot cancer-related variants
We trained DeepSomatic on three of the breast cancer genomes and the two lung cancer genomes in the CASTLE reference dataset. We then tested DeepSomatic’s performance in several ways, including on the single breast cancer genome that was not included in its training data, and on chromosome 1 from each sample, which we also excluded from the training.
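A hold-out scheme like this one can be sketched as a simple filter over labeled examples. The sample names, record fields, and function name below are hypothetical and are not taken from the DeepSomatic codebase. An example goes to the test set if it comes from a held-out sample or from the held-out chromosome, so the model is never evaluated on data it trained on.

```python
def split_examples(examples, holdout_samples, holdout_chrom="chr1"):
    """Split examples into train and test by sample and chromosome."""
    train, test = [], []
    for example in examples:
        held_out = (
            example["sample"] in holdout_samples
            or example["chrom"] == holdout_chrom
        )
        (test if held_out else train).append(example)
    return train, test


# Hypothetical labeled sites; real training examples carry far more detail.
examples = [
    {"sample": "breast_1", "chrom": "chr1", "pos": 1000},
    {"sample": "breast_1", "chrom": "chr2", "pos": 2000},
    {"sample": "lung_1", "chrom": "chr1", "pos": 3000},
    {"sample": "lung_1", "chrom": "chr7", "pos": 4000},
    {"sample": "breast_4", "chrom": "chr2", "pos": 5000},
]
train, test = split_examples(examples, holdout_samples={"breast_4"})
print(len(train), "training examples;", len(test), "test examples")
```

Holding out a whole sample tests generalization to an unseen genome, while holding out a chromosome tests generalization to unseen regions of genomes the model has otherwise seen.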
Results show that DeepSomatic models developed for each of the three major sequencing platforms performed better than other methods, identifying more tumor variants with higher accuracy. The tools used for comparison on short-read sequencing data were SomaticSniper, MuTect2 and Strelka2 (with SomaticSniper specifically for single nucleotide variants, or SNVs). For long-read sequencing data we compared against ClairS, a deep learning model trained on synthetic data.
In our tests, DeepSomatic identified 329,011 somatic variants across the six reference cell lines and a seventh preserved sample. DeepSomatic does particularly well at identifying cancer variations that involve insertions and deletions (“Indels”) of genetic code. For these types of variants, DeepSomatic substantially increased the F1-score, a balanced measure of how well the model finds true variants in a sample (recall) while avoiding false positives (precision). On Illumina sequencing data, the next-best method reached an F1-score of 80% on Indels, while DeepSomatic reached 90%. On Pacific Biosciences sequencing data, the next-best method scored less than 50% on Indels, while DeepSomatic scored more than 80%.
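For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The short sketch below computes all three from counts of true positives, false positives, and false negatives. The counts are made up for illustration and are not results from the paper.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from variant-call counts."""
    called = true_positives + false_positives   # everything the caller reported
    real = true_positives + false_negatives     # everything in the truth set
    precision = true_positives / called if called else 0.0
    recall = true_positives / real if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one variant caller on one sample.
precision, recall, f1 = precision_recall_f1(
    true_positives=80, false_positives=20, false_negatives=40
)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a caller cannot reach a high F1-score by maximizing recall at the expense of precision, or the reverse.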