CAREER: Distilling information structure from big and dirty data: Efficient learning of clusters and graphs in modern datasets

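Basic information

  • Award number:
    1252412
  • Principal investigator:
  • Amount:
    $500,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2013
  • Funding country:
    United States
  • Start and end dates:
    2013-03-01 to 2018-02-28
  • Project status:
    Completed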

Project abstract

This CAREER project aims to advance the state of the art in theory and methods for extracting clusters and graphs from big and dirty datasets arising in modern application domains. Clusters and graphs provide a meaningful representation of the structure of information contained in data. For example, in the neuroscience and healthcare domains, clustering patients with similar phenotypes and genotypes helps identify target groups for drug design, clustering fiber tracks generated by high-resolution Digital Surface Imaging (DSI) scans of brains helps identify significant neural pathways, and graph structures can reflect connectivity between brain regions. The results of this work will significantly enhance the ability to exploit such modern datasets through new methods for learning clusters and graphs from data that is large-scale, high-dimensional, under-sampled, corrupted, and often only available in a compressed or streaming representation. Specifically, this project will develop computationally efficient and principled methods for learning clusters and graphs that can (i) perform unsupervised feature selection to discard irrelevant features in high dimensions, (ii) leverage feedback based on intelligent adaptive queries that focus resources on the most informative variables and features, (iii) use compressive measurement design that adapts to the information structure for measurement and computation efficiency, and (iv) handle noisy streaming data. The algorithms will be accompanied by performance guarantees in the form of a precise characterization of the mis-clustering rate and graph recovery error. Additionally, the project will investigate the tradeoffs between the number of measurements, computational complexity, and robustness in these problems. The methods and theory developed will be evaluated through simulations as well as through their applicability to real datasets in the neuroscience and healthcare domains, in collaboration with practitioners from these fields.
The results of this research could potentially transform many application domains that involve grouping similar variables and learning complex interactions between them, based on big and dirty datasets. In particular, the neuroscience and healthcare applications are likely to have very direct and significant implications for society. Accurately mapping neural pathways will help diagnose and treat brain pathologies at an early stage, and help understand brain functioning. Clustering patients and discovering disease spreading pathways based on few measurements of relevant genetic features or indicators could help prevent and cure diseases, and also minimize healthcare costs. The research activities will be tightly integrated with education efforts that aim to develop a diverse workforce that is better equipped with cross-disciplinary tools to address the challenges of modern datasets. The education plan includes development of two inter-disciplinary courses, and enhancement of the joint Statistics & Machine Learning PhD program at Carnegie Mellon University (CMU). Outreach activities include promoting undergraduate research and broadening participation of women and underrepresented groups in STEM fields through OurCS (Opportunities for Undergraduate Research in Computer Science), Andrew's Leap (a summer enrichment program for area high school and middle school students), and the CS4HS program aimed at high school and K-8 teachers at Carnegie Mellon University. The results of this project (including publications, data sets, and software) will be disseminated online at http://www.cs.cmu.edu/~aarti/research_projects/.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Aarti Singh

Noise-Adaptive Margin-Based Active Learning and Lower Bounds under Tsybakov Noise Condition
A closer look at jobless recoveries
  • DOI:
  • Publication year:
    2003
  • Journal:
  • Impact factor:
    0
  • Authors:
    Stacey L. Schreft; Aarti Singh
  • Corresponding author:
    Aarti Singh
Design of an Intelligent and Adaptive Mapping Mechanism for Multiagent Interface
  • DOI:
    10.1007/978-3-642-22577-2_51
  • Publication year:
    2011
  • Journal:
  • Impact factor:
    0
  • Authors:
    Aarti Singh; Dimple Juneja; A. Sharma
  • Corresponding author:
    A. Sharma
Provably Correct Active Sampling Algorithms for Matrix Column Subset Selection with Missing Data
  • DOI:
  • Publication year:
    2015
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yining Wang; Aarti Singh
  • Corresponding author:
    Aarti Singh
An empirical comparison of sampling techniques for matrix column subset selection

Other grants held by Aarti Singh

AI Institute for Societal Decision Making (AI-SDM)
  • Award number:
    2229881
  • Fiscal year:
    2023
  • Funding amount:
    $500,000
  • Project type:
    Cooperative Agreement
Collaborative Research: New Perspectives on Deep Learning: Bridging Approximation, Statistical, and Algorithmic Theories
  • Award number:
    2134133
  • Fiscal year:
    2021
  • Funding amount:
    $500,000
  • Project type:
    Standard Grant
QuBBD: Collaborative Research: Personalized Predictive Neuromarkers for Stress-Related Health Risks
  • Award number:
    1557572
  • Fiscal year:
    2015
  • Funding amount:
    $500,000
  • Project type:
    Standard Grant
15th IMS New Researchers Conference
  • Award number:
    1301845
  • Fiscal year:
    2013
  • Funding amount:
    $500,000
  • Project type:
    Standard Grant
BIGDATA: Mid-Scale: DA: Distribution-based machine learning for high dimensional datasets
  • Award number:
    1247658
  • Fiscal year:
    2013
  • Funding amount:
    $500,000
  • Project type:
    Continuing Grant
III: Small: Spectral Methods for Active Clustering and Bi-Clustering
  • Award number:
    1116458
  • Fiscal year:
    2011
  • Funding amount:
    $500,000
  • Project type:
    Standard Grant

Similar overseas grants

DASS: Distilling Software Design Principles from Cybersecurity Caselaw
  • Award number:
    2217597
  • Fiscal year:
    2023
  • Funding amount:
    $500,000
  • Project type:
    Interagency Agreement
Distilling melodies with algorithms
  • Award number:
    572149-2022
  • Fiscal year:
    2022
  • Funding amount:
    $500,000
  • Project type:
    University Undergraduate Student Research Awards
Distilling the relationship of parental psychiatric illness to offspring productivity and social outcomes: evidence base for preventive strategies
  • Award number:
    10708878
  • Fiscal year:
    2022
  • Funding amount:
    $500,000
  • Project type:
Distilling the relationship of parental psychiatric illness to offspring productivity and social outcomes: evidence base for preventive strategies
  • Award number:
    10506724
  • Fiscal year:
    2022
  • Funding amount:
    $500,000
  • Project type:
Production, formulation and consumer testing of an organic extract of waste bananas which boosts the efficiency of distilling and brewing fermentations
  • Award number:
    90583
  • Fiscal year:
    2021
  • Funding amount:
    $500,000
  • Project type:
    Collaborative R&D
Northern Ontario Brewing and Distilling Summit
  • Award number:
    544902-2019
  • Fiscal year:
    2019
  • Funding amount:
    $500,000
  • Project type:
    Connect Grants Level 2 for colleges Ontario
Novel low viscosity wheats for distilling
  • Award number:
    102530
  • Fiscal year:
    2016
  • Funding amount:
    $500,000
  • Project type:
    BEIS-Funded Programmes
15AGRITECHCAT4: Novel low viscosity wheats for distilling
  • Award number:
    BB/N019164/1
  • Fiscal year:
    2016
  • Funding amount:
    $500,000
  • Project type:
    Research Grant
Evaluation of the major congeners produced by various Lallemand distilling yeast
  • Award number:
    488981-2015
  • Fiscal year:
    2015
  • Funding amount:
    $500,000
  • Project type:
    Applied Research and Development Grants - Level 1
EAGER: Distilling a Process for a National CI Roadmap from NSF Collaboratories
  • Award number:
    1153775
  • Fiscal year:
    2011
  • Funding amount:
    $500,000
  • Project type:
    Standard Grant