
Theory and Methods for Large-Scale Multi-Modal Matrix Data


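Basic Information

Award number:
2015492
Principal investigator:
Jing Lei
Amount:
$200K
Host institution:
Carnegie-Mellon University
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2020
Funding country:
United States
Start and end dates:
2020-07-01 to 2024-06-30
Project status:
Completed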

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2015492&HistoricalAwards=false
Keywords:
Theory Methods Large Scale Multi

Project Abstract

Modern data acquisition technology produces new types of data that carry rich information but also poses new challenges for analysis. In many modern datasets, the basic unit of measurement can be a matrix or even a higher-order array recording the interactions among one or multiple groups of individuals. For example, a gene co-expression network measures the average strength of correlation between each pair of genes in a particular organ tissue. With gene co-expression networks collected at different developmental stages, it is possible to understand how groups of genes change their behavior in a coherent way. As another example, next-generation sequencing techniques are able to produce gene expression data at different scales: tissue sample data consist of gene expressions in bulk tissue samples, whereas single-cell RNA sequencing data contain expressions of the same genes for individual cells. Motivated by these examples, this research work aims at developing novel probability tools and statistical inference methods for complex matrix-valued datasets, which will enable scientists to uncover salient structures in such datasets in a coherent and efficient way. The project also provides research training opportunities for graduate students. This project consists of two parts. In the first part, the PI studies multiple-layer networks with a shared latent structure across layers and develops methods to efficiently combine the information across different layers to recover the latent structure, which would be impossible if only a single layer were available. The expected results will provide new probability theorems describing the behavior of random noise in matrix form, as well as its linear combinations and higher-order functions.
In the second part, the PI studies a series of inference problems related to tissue and single-cell RNA-seq data, starting from dimensionality reduction and variable selection in a computationally efficient manner, followed by downstream inference problems such as cell type deconvolution in tissue RNA-seq data. The expected results will provide an important addition to the sparse principal components analysis literature, by developing a projection-free, gradient-based algorithm with provable global convergence properties. The cell type deconvolution problem will be an interesting application combining techniques from variable selection, nonnegative matrix factorization, and optimization. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
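
To make the cell type deconvolution problem mentioned in the abstract concrete, here is a minimal sketch in Python. It is not the method developed under this award: it uses plain nonnegative least squares as a stand-in, and it assumes a known signature matrix of mean expression per cell type. The names `deconvolve`, `signature`, and `bulk`, and the simulated data, are all hypothetical.

```python
import numpy as np
from scipy.optimize import nnls


def deconvolve(signature, bulk):
    """Estimate cell type proportions for one bulk tissue sample.

    signature: (genes, cell_types) array of mean expression per cell type
               (assumed known here; in practice it might come from
               single-cell data).
    bulk:      (genes,) array of bulk expression for the same genes.
    """
    # Solve min ||signature @ w - bulk||_2 subject to w >= 0.
    weights, _residual_norm = nnls(signature, bulk)
    total = weights.sum()
    # Rescale to proportions; leave all-zero solutions untouched.
    return weights / total if total > 0 else weights


# Simulated, noiseless example (illustrative data only).
rng = np.random.default_rng(0)
signature = rng.gamma(shape=2.0, scale=1.0, size=(50, 3))
true_props = np.array([0.6, 0.3, 0.1])
bulk = signature @ true_props

estimated = deconvolve(signature, bulk)
print(np.round(estimated, 3))
```

With noiseless data and a full-rank signature matrix the estimate matches the true proportions; real bulk data are noisy, and the variable selection step described in the abstract would choose which genes enter `signature`.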
