CAREER: Genomic Data Science: From Informational Limits to Efficient Algorithms
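Basic Information
- Award number: 2046991
- Principal investigator:
- Amount: $500,000
- Host institution:
- Host institution country: United States
- Award type: Continuing Grant
- Fiscal year: 2021
- Funding country: United States
- Duration: 2021-06-01 to 2026-05-31
- Project status: Ongoing
- Source:
- Keywords:
Project Abstract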
Advances in DNA sequencing technologies have paved the way for a revolution in the biological and medical sciences. By sequencing the human genome, one can learn about the genetic basis of several diseases and use this information to develop treatments and preventative care. By sequencing the genomes of viruses and bacteria, one can obtain key insights into the mechanisms of infectious diseases. But the acquisition, processing, and analysis of large amounts of genomic data pose several fundamental questions such as: (i) how much sequencing data need be collected to reliably learn the genome of a species? (ii) how much can one compress genomic sequencing data while maintaining its usefulness? (iii) how do sequencing errors impact the ability to perform biologically valid inference? The goal of this project is to develop a framework to establish the informational limits of genomic data science problems, that is, establish what genomic data can and cannot reveal. This will lead to the development of computationally efficient algorithms that process genomic data in an information-optimal way. The project will also mentor underrepresented students and provide them with research opportunities in the field of genomics. The research efforts will directly shape the contents of an undergraduate course on data science, which, in turn, will produce materials (lecture notes, educational videos, data assignments, open-source code) that will be used to disseminate reliable information about genomic technologies to the community. A distinctive aspect of genomic data science is that it operates mainly on sequence data. The proposed research will be organized along three main thrusts, each one focused on a key data science task that arises when dealing with genomic sequence data: (1) aligning pairs of sequences, (2) reconstructing sequences from noisy fragments, and (3) clustering sequences based on appropriate metrics. 
Since the pairwise alignment of a large number of noisy sequences is often a bottleneck in genomic data science, the first thrust will study how low-dimensional representations of these sequences, or sketches, can be optimally used for alignment computation. A source-coding framework will be leveraged to study the tradeoffs between sketch size and the incurred distortion in alignment computation. The second thrust, on sequence reconstruction, will tackle the fact that computational complexity obstacles such as NP-hardness often do not appropriately capture the complexity of real-world problem instances. A notion of instance-based informational hardness is introduced to allow the development of efficient algorithms with instance-specific theoretical guarantees. Finally, the third thrust studies the problem of clustering sequences in the context of metagenomic sequencing, where the goal is to determine which sequences come from the same microbial genome. Information-theoretic metrics for the clustering of metagenomic sequencing data will be introduced and used in algorithms that seek to resolve microbial communities at the maximum resolution allowed by the data.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
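The sketching idea in the first thrust can be illustrated with a small example. The Python sketch below uses MinHash over k-mers, a standard sketching technique chosen here purely as an assumption; the abstract does not say which sketches the project studies. All function names and parameter values (`minhash_sketch`, `k=8`, `num_hashes=128`) are hypothetical. It shows the tradeoff the abstract mentions: a shorter sketch is cheaper to store and compare, but estimates sequence similarity less accurately.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Sketch a sequence as the minimum k-mer hash under each of num_hashes seeds.

    Assumes len(seq) >= k, so that the k-mer set is non-empty.
    """
    kmer_set = kmers(seq, k)
    return [min(hash_kmer(km, seed) for km in kmer_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate k-mer Jaccard similarity as the fraction of matching sketch entries."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    read = "".join(rng.choice("ACGT") for _ in range(2000))
    # Introduce roughly 2% random substitutions to mimic sequencing noise.
    noisy = "".join(rng.choice("ACGT") if rng.random() < 0.02 else c for c in read)

    print("exact:", exact_jaccard(read, noisy))
    for size in (16, 64, 256):
        est = estimate_jaccard(minhash_sketch(read, num_hashes=size),
                               minhash_sketch(noisy, num_hashes=size))
        print(f"sketch size {size}: estimate {est:.3f}")
```

The k-mer Jaccard similarity is only a proxy for alignment quality, so this example should be read as an illustration of the size versus distortion tradeoff rather than as an alignment method.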
Project Outputs
Journal articles (14)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Fundamental Limits of Multiple Sequence Reconstruction from Substrings
- DOI: 10.1109/isit54713.2023.10206707
- Published: 2023-05
- Journal:
- Impact factor: 0
- Authors: Kelly Levick; Ilan Shomorony
- Corresponding authors: Kelly Levick; Ilan Shomorony
Adaptive Power Method: Eigenvector Estimation from Sampled Data
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Seiyun Shin; Hanfang Zhao; Ilan Shomorony
- Corresponding authors: Seiyun Shin; Hanfang Zhao; Ilan Shomorony
Coded Shotgun Sequencing
- DOI: 10.1109/jsait.2022.3151737
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Ravi, Aditya Narayan; Vahid, Alireza; Shomorony, Ilan
- Corresponding author: Shomorony, Ilan
Torn-Paper Coding
- DOI: 10.1109/tit.2021.3120920
- Published: 2021
- Journal:
- Impact factor: 2.5
- Authors: Shomorony, Ilan; Vahid, Alireza
- Corresponding author: Vahid, Alireza
Finding a Burst of Positives via Nonadaptive Semiquantitative Group Testing
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Li, Yun-Han; Gabrys, Ryan; Sima, Jin; Shomorony, Ilan; Milenkovic, Olgica
- Corresponding author: Milenkovic, Olgica
Other publications by Ilan Shomorony
Capacity Results for the Noisy Shuffling Channel
- DOI: 10.1109/isit.2019.8849789
- Published: 2019
- Journal:
- Impact factor: 0
- Authors: Ilan Shomorony; Reinhard Heckel
- Corresponding author: Reinhard Heckel
Computing Half-Duplex Schedules in Gaussian Relay Networks via Min-Cut Approximations
- DOI: 10.1109/tit.2014.2359440
- Published: 2014
- Journal:
- Impact factor: 2.5
- Authors: R. Etkin; F. Parvaresh; Ilan Shomorony; A. Avestimehr
- Corresponding author: A. Avestimehr
Fast multiple sequence alignment via multi-armed bandits
- DOI:
- Published: 2024
- Journal:
- Impact factor: 5.8
- Authors: Kayvon Mazooji; Ilan Shomorony
- Corresponding author: Ilan Shomorony
Unsupervised integration of multimodal dataset identifies novel signatures of health and disease
- DOI: 10.1101/432641
- Published: 2018
- Journal:
- Impact factor: 0
- Authors: Ilan Shomorony; E. Cirulli; Lei Huang; Lori A. Napier; Robyn R. Heister; Michael A. Hicks; Isaac V. Cohen; Hung; C. Swisher; Natalie M. Schenker; Weizhong Li; A. Kahn; Timothy D. Spector; C. Caskey; J. Venter; D. Karow; E. Kirkness; N. Shah
- Corresponding author: N. Shah
An Information Theory for Out-of-Order Media With Applications in DNA Data Storage
- DOI:
- Published: 2024
- Journal:
- Impact factor: 2.2
- Authors: Aditya Narayan Ravi; Alireza Vahid; Ilan Shomorony
- Corresponding author: Ilan Shomorony
Other grants held by Ilan Shomorony
CIF: Small: Fundamental Limits of DNA-Based Storage
- Award number: 2007597
- Fiscal year: 2020
- Funding amount: $500,000
- Award type: Standard Grant
Similar Overseas Grants
Doctoral Dissertation Research Improvement Grant: Biobanking, Epistemic Infrastructure, and the Lifecycle of Genomic Data
- Award number: 2341622
- Fiscal year: 2024
- Funding amount: $500,000
- Award type: Standard Grant
From Data to Discovery: BlokBIO's Vision of Transforming Genomic Research with User-Centric Intelligence Solutions
- Award number: 10109374
- Fiscal year: 2024
- Funding amount: $500,000
- Award type: Launchpad
RII Track-4: NSF: Extracting Pan Genomic Information from Metagenomic Data: Distributed Algorithms and Scalable Software
- Award number: 2327456
- Fiscal year: 2024
- Funding amount: $500,000
- Award type: Standard Grant
EAGER: IMPRESS-U: Modeling and Forecasting of Infection Spread in War and Post War Settings Using Epidemiological, Behavioral and Genomic Surveillance Data
- Award number: 2412914
- Fiscal year: 2024
- Funding amount: $500,000
- Award type: Standard Grant
Implementation of an impact assessment tool to optimize responsible stewardship of genomic data in the cloud
- Award number: 10721762
- Fiscal year: 2023
- Funding amount: $500,000
- Award type:
Learn Systems Biology Equations From Snapshot Single Cell Genomic Data
- Award number: 10736507
- Fiscal year: 2023
- Funding amount: $500,000
- Award type:
Accelerating Genomic Data Sharing and Collaborative Research with Privacy Protection
- Award number: 10735407
- Fiscal year: 2023
- Funding amount: $500,000
- Award type:
Genomic Intensive Data Science Research, Education and Mentorship
- Award number: 10627583
- Fiscal year: 2023
- Funding amount: $500,000
- Award type:
Integrating Epidemiologic and Genomic Data to Elucidate the Genetic Overlap Between Congenital Anomalies and Pediatric Cancer
- Award number: 10749761
- Fiscal year: 2023
- Funding amount: $500,000
- Award type:
NSF Postdoctoral Fellowship in Biology: Integrating Phenotypic and Genomic Data across Multiple Hybrid Zones to Understand the Evolution of Reproductive Isolation in Snakes
- Award number: 2208959
- Fiscal year: 2023
- Funding amount: $500,000
- Award type: Fellowship Award