Methods development for "Omics" data

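Basic information

Grant number:
10928612
Principal investigator:
Alison Motsinger-Reif
Amount:
$1.0156 million
Host institution:
NATIONAL INSTITUTE OF ENVIRONMENTAL HEALTH SCIENCES
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Project period:
to
Project status:
Ongoing

Source:
https://reporter.nih.gov/project-details/10928612
Keywords: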
Affect Algorithms Benchmarking Biological Biometry Code Collaborations Collection Complex Computational Biology Computer software Confidence Intervals Custom Data Data Set Dimensions Dose Epidemiology Explosion Future Genes Genomics Literature Location Machine Learning Maps Measures Methodology Methods Modeling National Health and Nutrition Examination Survey National Institute of Environmental Health Sciences National Toxicology Program Outcome Participant Performance Polymorphism Analysis Programming Languages Property Pythons Quantitative Trait Loci Research Research Personnel Rodent Sample Size Sampling Scheme Single Nucleotide Polymorphism Source Code Specific qualifier value Surveys Target Populations Testing Toxicogenomics Toxicology United States Environmental Protection Agency Visualization Weight Work base carcinogenicity cohort complex data computational toxicology design detection method epidemiology study flexibility gene environment interaction gene interaction genome wide association study gradient boosting high dimensionality improved machine learning method metabolomics method development microbiome mortality novel open source predictive modeling programs response simulation software development tool trend

Project abstract

In continued work on dose-response modeling, we have developed a suite of tools in collaboration with Dr. Mathew Wheeler. The need to analyze the complex relationships observed in high-throughput toxicogenomic and other omic platforms has resulted in an explosion of methodological advances in computational toxicology. However, advancements in the literature often outpace the development of software that researchers can implement in their pipelines, and existing software is frequently based on pre-specified workflows built from well-vetted assumptions that may not be optimal for novel research questions. Accordingly, there is a need for a stable platform and open-source codebase attached to a programming language that allows users to program new algorithms. To fill this gap, the Biostatistics and Computational Biology Branch of the National Institute of Environmental Health Sciences, in cooperation with the National Toxicology Program (NTP) and the US Environmental Protection Agency (EPA), developed ToxicR, an open-source R programming package. The ToxicR platform implements many of the standard analyses used by the NTP and EPA, including dose-response analyses for continuous and dichotomous data that employ Bayesian, maximum likelihood, and model averaging methods, as well as many standard tests the NTP uses in rodent toxicology and carcinogenicity studies, such as the poly-K and Jonckheere trend tests. ToxicR is built on the same codebase as current versions of the EPA's Benchmark Dose software and the NTP's BMDExpress software but has increased flexibility because it directly accesses this software. To demonstrate ToxicR, we developed a custom workflow to illustrate its capabilities for analyzing toxicogenomic data. The unique features of ToxicR will allow researchers in other fields to add modules, increasing its functionality in the future.

Additionally, with Dr. Nat McNell, we have evaluated approaches for epidemiological weighting schemes in machine learning methods. Despite the prominent use of complex survey data and the growing popularity of machine learning methods in epidemiologic research, few machine learning software implementations offer options for handling complex samples. A major challenge impeding the broader incorporation of machine learning into epidemiologic research is incomplete guidance for analyzing complex survey data, including the importance of sampling weights for valid prediction in target populations. Using data from 15,820 participants in the 1988-1994 National Health and Nutrition Examination Survey cohort, we determined whether ignoring weights in gradient boosting models of all-cause mortality affected prediction, as measured by the F1 score and corresponding 95% confidence intervals. In simulations, we additionally assessed the impact of sample size, weight variability, predictor strength, and model dimensionality. In the National Health and Nutrition Examination Survey data, unweighted model performance was inflated compared to the weighted model (F1 score: 81.9%; 95% confidence interval: 81.2%, 82.7% vs 77.4%; 95% confidence interval: 76.1%, 78.6%). However, the error was mitigated if the F1 score was subsequently recalculated with observed outcomes from the weighted dataset (F1 score: 77.0%; 95% confidence interval: 75.7%, 78.4%). In simulations, this finding held in the largest sample size (N = 10,000) under all analytic conditions assessed. For sample sizes <5,000, sampling weights had little impact in simulations that more closely resembled a simple random sample (low weight variability) or in models with strong predictors, but findings were inconsistent under other analytic scenarios. Failing to account for sampling weights in gradient boosting models may limit generalizability for data from complex surveys, depending on sample size and other analytic properties. In the absence of software for configuring weighted algorithms, post-hoc recalculation of unweighted model performance using weighted observed outcomes may more accurately reflect model prediction in target populations than ignoring weights entirely.

With Dr. David Reif's group, we developed the ToxPi*GIS Toolkit, a collection of methods for creating interactive feature layers that contain ToxPi profiles. It currently includes an ArcGIS Toolbox (ToxPiToolbox.tbx) for drawing location-specific ToxPi profiles in a single feature layer, a collection of modular Python scripts that create predesigned layer files containing ToxPi feature layers from the command line, and a collection of Python routines for data manipulation and preprocessing. We present workflows documenting ToxPi feature layer creation, sharing, and embedding, both for novice users and for advanced users looking for additional customizability. Map visualizations created with the ToxPi*GIS Toolkit can be made freely available on public URLs, allowing users without ArcGIS Pro access or expertise to view and interact with them. Novice users with ArcGIS Pro access can create de novo custom maps, and advanced users can exploit additional customization options. The ArcGIS Toolbox provides a simple means for generating ToxPi feature layers.

Ongoing projects build on methods for detecting gene-environment interactions, using variance QTLs to prioritize single nucleotide polymorphisms for detecting gene-gene interactions. Additionally, Dr. Ziyue Wang is developing new normalization approaches for microbiome data.
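
The post-hoc recalculation mentioned in the abstract for the survey-weight analysis (rescoring an unweighted model with sampling weights applied to the observed outcomes) can be illustrated with a minimal Python sketch. This is not the project's code: the function name `weighted_f1`, the reading of "weighted observed outcomes" as per-observation sampling weights in the confusion counts, and the toy data are all assumptions made for illustration.

```python
def weighted_f1(y_true, y_pred, weights=None):
    """F1 score for binary labels (1 = event), with optional sampling weights.

    Hypothetical helper: each observation contributes its sampling weight
    (instead of 1) to the true-positive, false-positive, and false-negative
    totals. With weights=None this reduces to the ordinary F1 score.
    """
    if weights is None:
        weights = [1.0] * len(y_true)
    if not (len(y_true) == len(y_pred) == len(weights)):
        raise ValueError("y_true, y_pred, and weights must have equal length")

    tp = fp = fn = 0.0
    for truth, pred, w in zip(y_true, y_pred, weights):
        if pred == 1 and truth == 1:
            tp += w
        elif pred == 1 and truth == 0:
            fp += w
        elif pred == 0 and truth == 1:
            fn += w

    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Toy data (invented): predictions from a model fitted without weights.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
# Assumed sampling weights: the third respondent represents four people.
survey_weights = [1, 1, 4, 1, 1]

f1_unweighted = weighted_f1(y_true, y_pred)
f1_reweighted = weighted_f1(y_true, y_pred, survey_weights)
print(f"unweighted F1: {f1_unweighted:.3f}")
print(f"re-weighted F1: {f1_reweighted:.3f}")
```

In this toy case the single false positive carries a large weight, so the re-weighted score is lower than the unweighted one. The direction and size of any such gap in real survey data depend on how the weights relate to the prediction errors.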
