CAREER: Distilling information structure from big and dirty data: Efficient learning of clusters and graphs in modern datasets

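Basic Information

Award number:
1252412
Principal investigator:
Aarti Singh
Amount:
$500,000
Institution:
Carnegie-Mellon University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2013
Funding country:
United States
Period:
2013-03-01 to 2018-02-28
Status:
Completed

Source: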
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1252412&HistoricalAwards=false
Keywords:
CAREER Distilling information structure big

Project Abstract

This CAREER project aims to advance the state of the art in theory and methods for extracting clusters and graphs from big and dirty datasets arising in modern application domains. Clusters and graphs provide a meaningful representation of the structure of information contained in data. For example, in the neuroscience and healthcare domains, clustering patients with similar phenotypes and genotypes helps identify target groups for drug design, clustering fiber tracks generated by high-resolution Diffusion Spectrum Imaging (DSI) scans of brains helps identify significant neural pathways, and graph structures can reflect connectivity between brain regions. The results of this work will significantly enhance the ability to exploit such modern datasets through new methods for learning clusters and graphs from data that is large-scale, high-dimensional, under-sampled, corrupted, and often available only in a compressed or streaming representation. Specifically, this project will develop computationally efficient and principled methods for learning clusters and graphs that can (i) perform unsupervised feature selection to discard irrelevant features in high dimensions, (ii) leverage feedback based on intelligent adaptive queries that focus resources on the most informative variables and features, (iii) use compressive measurement design that adapts to the information structure for measurement and computation efficiency, and (iv) handle noisy streaming data. The algorithms will be accompanied by performance guarantees in the form of a precise characterization of the mis-clustering rate and graph recovery error. Additionally, the project will investigate the tradeoffs between the number of measurements, computational complexity, and robustness in these problems. The methods and theory developed will be evaluated through simulations and through their applicability to real datasets in the neuroscience and healthcare domains, in collaboration with practitioners from these fields.
The results of this research could potentially transform many application domains that involve grouping similar variables and learning complex interactions between them, based on big and dirty datasets. In particular, the neuroscience and healthcare applications are likely to have very direct and significant implications for society. Accurately mapping neural pathways will help diagnose and treat brain pathologies at an early stage, and help understand brain functioning. Clustering patients and discovering disease-spreading pathways based on a few measurements of relevant genetic features or indicators could help prevent and cure diseases, and also minimize healthcare costs. The research activities will be tightly integrated with education efforts that aim to develop a diverse workforce that is better equipped with cross-disciplinary tools to address the challenges of modern datasets. The education plan includes development of two interdisciplinary courses, and enhancement of the joint Statistics & Machine Learning PhD program at Carnegie Mellon University (CMU). Outreach activities include promoting undergraduate research and broadening participation of women and underrepresented groups in STEM fields through OurCS (Opportunities for Undergraduate Research in Computer Science), Andrew's Leap (a summer enrichment program for area high school and middle school students), and the CS4HS program aimed at high school and K-8 teachers at Carnegie Mellon University. The results of this project (including publications, data sets, and software) will be disseminated online at http://www.cs.cmu.edu/~aarti/research_projects/.
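
The abstract names the mis-clustering rate as one of the quantities its guarantees would characterize, but does not define it. The sketch below shows one common reading, assumed here rather than taken from the project: the fraction of points placed in the wrong cluster, minimized over all relabelings of the predicted clusters, since cluster names are arbitrary. The function name `misclustering_rate` and the example labels are hypothetical, and the brute-force search over relabelings is only practical for a handful of clusters.

```python
from itertools import permutations


def misclustering_rate(true_labels, predicted_labels):
    """Fraction of points in the wrong cluster, minimized over all
    relabelings of the predicted clusters (illustrative definition)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must be non-empty")

    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))
    # Pad with None so that surplus predicted clusters can map to "no match".
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))

    best_errors = len(true_labels)
    # Try every one-to-one assignment of predicted cluster names to targets.
    for assignment in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, assignment))
        errors = sum(
            mapping[p] != t for t, p in zip(true_labels, predicted_labels)
        )
        best_errors = min(best_errors, errors)
    return best_errors / len(true_labels)


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    guess = [1, 1, 0, 0, 0, 0]  # same grouping up to renaming, one point misplaced
    print(misclustering_rate(truth, guess))
```

For larger numbers of clusters, the same minimum could be found with an assignment solver instead of enumerating permutations; that substitution is a design choice of this sketch, not something the abstract specifies.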
