
Design and Analysis for Cancer Epidemiology Studies


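Basic Information

Grant number:
7127228
Principal investigator:
MING Tony TAN
Amount:
$72,500
Host institution:
UNIVERSITY OF MARYLAND BALTIMORE
Host institution country:
United States
Project category:
Fiscal year:
2005
Funding country:
United States
Project period:
2005-09-30 to 2007-08-31
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/7127228
Keywords: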
biomarker; cancer risk; clinical research; computer data analysis; computer simulation; human data; mathematical model; method development; neoplasm /cancer epidemiology; statistics /biometry

Project Abstract

DESCRIPTION (provided by applicant): The overall goal of this research is to develop novel statistical methods for addressing the difficult issue of multiplicity in current cancer etiology. Cancer etiology studies, which aim to identify determinants of cancer and quantify their role, are intrinsically multi-factorial because of the multi-step nature of carcinogenesis and the multiple extrinsic factors that turn normal cells into malignant ones. Multiplicity inflates the false positive rate. In the simplest example, searching for a cutpoint of one quantitative biomarker for disease status, the common practice of examining different cutpoints and picking the one with the smallest p-value results in a highly inflated false positive rate. Even in the largest studies, statistical power for testing interactions quickly diminishes, sample sizes rapidly become inadequate with stratification, and risk estimates become unstable. Because there are so many risk factors, model overfitting is a common problem and the predictive performance of the statistical model is poor. It is thus not surprising that even main effects (e.g., candidate gene associations) have proven notoriously difficult to replicate, and reported interactions even harder. The multiplicity issue is acute today as more biomarkers of risk exposures, and even entire pathways easily comprising dozens of genes and their environmental substrates, become available. An effective means to reduce overfitting and prediction error is to constrain model parameters, as in the least absolute shrinkage and selection operator (lasso), to eliminate the large number of irrelevant variables (e.g., genes). Finding the maximum likelihood estimate (MLE) in such regression models with a large number of variables is challenging. Since some measures of exposure may not be indicative of cancer, and these irrelevant variables reduce the accuracy of the regression model, selecting the most relevant variables into the model would be a significant step.
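The cutpoint example in the abstract can be illustrated with a small simulation. The sketch below is not from the project; it assumes a standard normal marker that is independent of disease status, seven candidate cutpoints at fixed sample quantiles, and a pooled two-proportion z-test. The function names (`two_proportion_p`, `simulate_cutpoint_search`) and all default settings are hypothetical choices made for illustration.

```python
import math
import random


def two_proportion_p(a, n1, b, n2):
    """Two-sided p-value of a pooled two-proportion z-test (a/n1 vs b/n2)."""
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = (a + b) / (n1 + n2)
    var = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if var == 0.0:
        return 1.0
    z = (a / n1 - b / n2) / math.sqrt(var)
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_cutpoint_search(n_sim=2000, n=200, alpha=0.05, seed=0,
                             quantiles=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Estimate false positive rates when the marker is pure noise.

    Returns (rate with one prespecified cutpoint,
             rate when the smallest p-value over all cutpoints is used).
    """
    rng = random.Random(seed)
    fixed_hits = 0
    min_hits = 0
    for _ in range(n_sim):
        marker = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Disease status is drawn independently of the marker (null is true).
        disease = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
        ordered = sorted(marker)
        pvals = []
        for q in quantiles:
            cut = ordered[int(q * n)]
            high = [d for m, d in zip(marker, disease) if m >= cut]
            low = [d for m, d in zip(marker, disease) if m < cut]
            pvals.append(two_proportion_p(sum(high), len(high),
                                          sum(low), len(low)))
        # Prespecified analysis: the middle quantile only.
        if pvals[len(pvals) // 2] < alpha:
            fixed_hits += 1
        # Data-driven analysis: report the best-looking cutpoint.
        if min(pvals) < alpha:
            min_hits += 1
    return fixed_hits / n_sim, min_hits / n_sim


if __name__ == "__main__":
    fixed_rate, min_p_rate = simulate_cutpoint_search()
    print("prespecified cutpoint:", fixed_rate)
    print("minimum p-value over cutpoints:", min_p_rate)
```

Under these assumptions the prespecified cutpoint should reject at roughly the nominal 5% level, while the minimum p-value rule rejects noticeably more often, which is the inflation the abstract describes.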
However, classic methods for model/variable selection have not had much success in biomedical applications because they eliminate significant predictors too aggressively and are numerically unstable due to collinearity. This pilot project application focuses on the logistic regression model commonly used in cancer etiology studies. Building upon the novel accelerated expectation-maximization (EM) algorithm we developed for variable selection in linear models, we propose to develop fast variable selection procedures for the logistic regression model that reduce overfitting and improve predictive performance; to develop computer programs; to conduct simulation studies to assess the performance of the method/algorithm; and to analyze the esophageal data from two studies currently funded by the NCI. Upon completion of the proposed research, the methods/algorithms developed can be used to analyze cancer epidemiology data more effectively and efficiently. The project also provides a basis for further development of the approach, potentially into an R01 application. Future studies can include extensions to multinomial (i.e., multi-class) logistic regression models for cancer outcomes; to the Cox regression model for time-to-event data, such as time to advanced cancer, in cancer etiology; and to Bayesian hierarchical modeling and model selection that incorporate prior biological knowledge about pathways, which will enhance the ability to detect real causal effects.
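To make the idea of lasso-style variable selection in logistic regression concrete, here is a minimal sketch. It is not the applicants' accelerated EM algorithm, which the abstract does not specify; it substitutes a generic proximal gradient method (ISTA) for an L1-penalized logistic likelihood. The simulated data, the penalty value `lam`, and the function names `make_data` and `fit_l1_logistic` are all assumptions made for illustration.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def make_data(n=200, p=10, seed=0):
    """Hypothetical data: only the first two of p covariates affect risk."""
    rng = random.Random(seed)
    true_beta = [2.0, -2.0] + [0.0] * (p - 2)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        eta = sum(b * v for b, v in zip(true_beta, row))
        X.append(row)
        y.append(1.0 if rng.random() < sigmoid(eta) else 0.0)
    return X, y


def fit_l1_logistic(X, y, lam, n_iter=300):
    """Minimize mean logistic loss + lam * sum(|beta_j|) by proximal gradient.

    The intercept b0 is left unpenalized. Returns (b0, beta).
    """
    n, p = len(X), len(X[0])
    # The Hessian of the mean loss is bounded by (1/4n) * X'X (with an
    # intercept column); the squared Frobenius norm bounds its largest
    # eigenvalue, giving a conservative but safe step size.
    lipschitz = (sum(v * v for row in X for v in row) + n) / (4.0 * n)
    step = 1.0 / lipschitz
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, grad = 0.0, [0.0] * p
        for row, yi in zip(X, y):
            resid = sigmoid(b0 + sum(b * v for b, v in zip(beta, row))) - yi
            g0 += resid
            for j, v in enumerate(row):
                grad[j] += resid * v
        b0 -= step * g0 / n
        # Gradient step on the smooth part, then soft-threshold for the L1 part.
        beta = [soft_threshold(b - step * g / n, step * lam)
                for b, g in zip(beta, grad)]
    return b0, beta


if __name__ == "__main__":
    X, y = make_data()
    b0, beta = fit_l1_logistic(X, y, lam=0.1)
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    print("selected covariate indices:", selected)
    print("coefficients:", [round(b, 3) for b in beta])
```

The soft-thresholding step sets small coefficients exactly to zero, so fitting and variable selection happen together. The penalty also shrinks the retained coefficients toward zero, so in practice `lam` would be chosen by cross-validation or a similar criterion rather than fixed in advance.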
