Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences

Basic Information

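Grant number:
9306648
Principal investigator:
Georg Seelig
Amount:
approximately $596,600
Host institution:
UNIVERSITY OF WASHINGTON
Institution country:
United States
Project category:
Fiscal year:
2017
Funding country:
United States
Project period:
2017-04-21 to 2021-01-31
Project status:
Completed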

Source:
https://reporter.nih.gov/project-details/9306648
Keywords:
Adopted; Algorithms; Alternative Splicing; Area; Basic Science; Behavior; Big Data; Biological Assay; Biological Phenomena; CRISPR/Cas technology; Clinical Medicine; Code; Complex; Computers; DNA Sequence; Data; Data Set; Databases; Dependency; Disease; Gene Expression; Gene Expression Regulation; Generations; Genes; Genetic; Genetic Polymorphism; Genetic Variation; Genome; Genomics; Haplotypes; Human; Human Genome; Lead; Learning; Libraries; Machine Learning; Measurement; Measures; Mediating; Mendelian disorder; Modeling; Mutation; Natural Language Processing; Nucleotides; Polyadenylation; Protein Isoforms; Proteins; Publishing; RNA Splicing; RNA-Binding Proteins; Regulation; Regulator Genes; Reporter; Research; Risk; Scientist; Shapes; Specific qualifier value; Testing; Training; Transcript; Untranslated RNA; Validation; Variant; Work; base; clinically relevant; data modeling; disease-causing mutation; exon skipping; experimental study; genetic variant; human disease; knock-down; novel strategies; predictive modeling; repaired; synthetic biology; synthetic construct

Project Abstract

The proportion of the human genome that underlies gene regulation dwarfs the proportion that encodes proteins. However, we remain poorly equipped for identifying which genetic variants compromise gene regulatory function in ways that may contribute to risk for both rare and common human diseases. Understanding how non-coding sequences regulate gene expression, as well as being able to predict the functional consequences of genetic variation for gene regulation, are paramount challenges for the field. Here, we propose to combine synthetic biology, massively parallel functional assays, and machine learning to profoundly advance our understanding of the “regulatory code” of the human genome. While challenging, the task of unravelling complex codes from large amounts of empirical data is not without precedent. For example, over the past decade, computer scientists working in natural language processing have made immense progress, driven in large part by a combination of algorithmic and computational improvements and enormously larger training datasets than were available to the previous generations of scientists working in this area. Inspired by the revolutionizing impact of “big data” for traditional problems in machine learning, we propose to model gene regulatory phenomena using training datasets with several orders of magnitude more examples than naturally exist in the human genome. We predict that the models learned from massive numbers of synthetic examples will strongly outperform models learned from the small number of natural examples. We will demonstrate our approach by developing comprehensive, quantitative, and predictive models for alternative splicing and alternative polyadenylation, two widespread regulatory mechanisms by which a single gene can code for multiple transcripts and proteins. However, we anticipate that this basic paradigm – specifically, the massively parallel measurement of the functional behavior of extremely large numbers of synthetic sequences followed by quantitative modeling of sequence-function relationships – can be generalized to advance our understanding of diverse forms of gene regulation.
