Robust Inference in the Presence of Data Heterogeneity and Structured Missing Data

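Basic Information

Grant number:
9916886
Principal investigator:
Meisam Razaviyayn
Amount:
$212,000
Host institution:
UNIVERSITY OF SOUTHERN CALIFORNIA
Host institution country:
United States
Project category:
Fiscal year:
2019
Funding country:
United States
Project period:
2019-09-01 to 2022-08-31
Project status:
Completed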

Project Abstract

Modern sequencing platforms can sequence tens of billions of bases per run and generate petabytes of data, but individual study sizes may be small. Similarly, a wide variety of health data are now publicly available to inform health policy decisions, and it may be advantageous to use data from several different surveys. The ability to aggregate and compare heterogeneous data across different datasets would be critical to expanding the usable data available for any individual study. We propose systematically studying two major barriers to this effort: 1) aggregating different medical and biological datasets; 2) dealing with batch effects and structured heterogeneous data. Aim 1 allows us to fully utilize information on related topics from diverse datasets. Information across different experiments needs to be combined in a statistically rigorous, reliable way: the process needs to fully exploit the available information, introduce no biases, and remain systematic and reproducible. Not all experiments study the same set of variables/features, so combining this information is a non-trivial task. Aim 2 allows researchers to handle heterogeneity between individuals or samples, which is ubiquitous in biological and health data. For instance, sequencing machines evolve over time, and samples obtained with new technologies cannot be directly compared to samples taken on older systems, even if the data were collected in the same lab. The same applies to samples obtained under different environmental conditions. Currently, researchers are forced either to ignore such biases, potentially violating statistical validity, or to limit their analysis to data generated in a single batch of samples. This work will extend the set of useful data available to researchers in a wide variety of domains and provide methods to compare and synthesize disparate datasets.
The proposed work will result in: (1) development of algorithms with theoretical performance guarantees for combining information from datasets with a small number of overlapping features; (2) development of rigorous statistical procedures for hypothesis testing in the presence of within-group heterogeneity. These methods are particularly helpful for pre-/post-treatment studies, studies containing batch effects, or studies where samples are collected over long time periods using different technologies; (3) implementation of these methods in case studies in molecular biology (genetic pathway hypothesis generation) and in population survey data for health policy modeling.
