CAREER: Distilling information structure from big and dirty data: Efficient learning of clusters and graphs in modern datasets

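Basic Information

  • Award number:
    1252412
  • Principal investigator:
  • Amount:
    $500,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2013
  • Funding country:
    United States
  • Project period:
    2013-03-01 to 2018-02-28
  • Project status:
    Completed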

Project Abstract

This CAREER project aims to advance the state of the art in theory and methods for extracting clusters and graphs from big and dirty datasets arising in modern application domains. Clusters and graphs provide a meaningful representation of the structure of information contained in data. For example, in the neuroscience and healthcare domains, clustering patients with similar phenotypes and genotypes helps identify target groups for drug design, clustering fiber tracks generated by high-resolution Digital Surface Imaging (DSI) scans of brains helps identify significant neural pathways, and graph structures can reflect connectivity between brain regions. The results of this work will significantly enhance the ability to exploit such modern datasets through new methods for learning clusters and graphs from data that is large-scale, high-dimensional, under-sampled, corrupted, and often only available in a compressed or streaming representation. Specifically, this project will develop computationally efficient and principled methods for learning clusters and graphs that can (i) perform unsupervised feature selection to discard irrelevant features in high dimensions, (ii) leverage feedback based on intelligent adaptive queries that focus resources on the most informative variables and features, (iii) use compressive measurement design that adapts to the information structure for measurement and computation efficiency, and (iv) handle noisy streaming data. The algorithms will be accompanied by performance guarantees in the form of a precise characterization of the mis-clustering rate and graph recovery error. Additionally, the project will investigate the tradeoffs among the number of measurements, computational complexity, and robustness in these problems. The methods and theory developed will be evaluated through simulations and through their applicability to real datasets in the neuroscience and healthcare domains, in collaboration with practitioners from these fields.
The results of this research could potentially transform many application domains that involve grouping similar variables and learning complex interactions between them, based on big and dirty datasets. In particular, the neuroscience and healthcare applications are likely to have very direct and significant implications for society. Accurately mapping neural pathways will help diagnose and treat brain pathologies at an early stage, and help understand brain functioning. Clustering patients and discovering disease spreading pathways based on few measurements of relevant genetic features or indicators could help prevent and cure diseases, and also minimize healthcare costs. The research activities will be tightly integrated with education efforts that aim to develop a diverse workforce that is better equipped with cross-disciplinary tools to address the challenges of modern datasets. The education plan includes development of two inter-disciplinary courses, and enhancement of the joint Statistics & Machine Learning PhD program at Carnegie Mellon University (CMU). Outreach activities include promoting undergraduate research and broadening participation of women and underrepresented groups in STEM fields through OurCS (Opportunities for Undergraduate Research in Computer Science), Andrew's Leap (a summer enrichment program for area high school and middle school students), and the CS4HS program aimed at high school and K-8 teachers at Carnegie Mellon University. The results of this project (including publications, data sets, and software) will be disseminated online at http://www.cs.cmu.edu/~aarti/research_projects/.

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Aarti Singh

Noise-Adaptive Margin-Based Active Learning and Lower Bounds under Tsybakov Noise Condition
A closer look at jobless recoveries
  • DOI:
  • Year published:
    2003
  • Journal:
  • Impact factor:
    0
  • Authors:
    Stacey L. Schreft; Aarti Singh
  • Corresponding author:
    Aarti Singh
Design of an Intelligent and Adaptive Mapping Mechanism for Multiagent Interface
  • DOI:
    10.1007/978-3-642-22577-2_51
  • Year published:
    2011
  • Journal:
  • Impact factor:
    0
  • Authors:
    Aarti Singh; Dimple Juneja; A. Sharma
  • Corresponding author:
    A. Sharma
Provably Correct Active Sampling Algorithms for Matrix Column Subset Selection with Missing Data
  • DOI:
  • Year published:
    2015
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yining Wang; Aarti Singh
  • Corresponding author:
    Aarti Singh
An empirical comparison of sampling techniques for matrix column subset selection

Other grants by Aarti Singh

AI Institute for Societal Decision Making (AI-SDM)
  • Award number:
    2229881
  • Fiscal year:
    2023
  • Award amount:
    $500,000
  • Project type:
    Cooperative Agreement
Collaborative Research: New Perspectives on Deep Learning: Bridging Approximation, Statistical, and Algorithmic Theories
  • Award number:
    2134133
  • Fiscal year:
    2021
  • Award amount:
    $500,000
  • Project type:
    Standard Grant
QuBBD: Collaborative Research: Personalized Predictive Neuromarkers for Stress-Related Health Risks
  • Award number:
    1557572
  • Fiscal year:
    2015
  • Award amount:
    $500,000
  • Project type:
    Standard Grant
15th IMS New Researchers Conference
  • Award number:
    1301845
  • Fiscal year:
    2013
  • Award amount:
    $500,000
  • Project type:
    Standard Grant
BIGDATA: Mid-Scale: DA: Distribution-based machine learning for high dimensional datasets
  • Award number:
    1247658
  • Fiscal year:
    2013
  • Award amount:
    $500,000
  • Project type:
    Continuing Grant
III: Small: Spectral Methods for Active Clustering and Bi-Clustering
  • Award number:
    1116458
  • Fiscal year:
    2011
  • Award amount:
    $500,000
  • Project type:
    Standard Grant

Similar international grants

DASS: Distilling Software Design Principles from Cybersecurity Caselaw
  • Award number:
    2217597
  • Fiscal year:
    2023
  • Award amount:
    $500,000
  • Project type:
    Interagency Agreement
Distilling melodies with algorithms
  • Award number:
    572149-2022
  • Fiscal year:
    2022
  • Award amount:
    $500,000
  • Project type:
    University Undergraduate Student Research Awards
Distilling the relationship of parental psychiatric illness to offspring productivity and social outcomes: evidence base for preventive strategies
  • Award number:
    10506724
  • Fiscal year:
    2022
  • Award amount:
    $500,000
  • Project type:
Distilling the relationship of parental psychiatric illness to offspring productivity and social outcomes: evidence base for preventive strategies
  • Award number:
    10708878
  • Fiscal year:
    2022
  • Award amount:
    $500,000
  • Project type:
Production, formulation and consumer testing of an organic extract of waste bananas which boosts the efficiency of distilling and brewing fermentations
  • Award number:
    90583
  • Fiscal year:
    2021
  • Award amount:
    $500,000
  • Project type:
    Collaborative R&D
Northern Ontario Brewing and Distilling Summit
  • Award number:
    544902-2019
  • Fiscal year:
    2019
  • Award amount:
    $500,000
  • Project type:
    Connect Grants Level 2 for colleges Ontario
Novel low viscosity wheats for distilling
  • Award number:
    102530
  • Fiscal year:
    2016
  • Award amount:
    $500,000
  • Project type:
    BEIS-Funded Programmes
15AGRITECHCAT4: Novel low viscosity wheats for distilling
  • Award number:
    BB/N019164/1
  • Fiscal year:
    2016
  • Award amount:
    $500,000
  • Project type:
    Research Grant
Evaluation of the major congeners produced by various Lallemand distilling yeast
  • Award number:
    488981-2015
  • Fiscal year:
    2015
  • Award amount:
    $500,000
  • Project type:
    Applied Research and Development Grants - Level 1
EAGER: Distilling a Process for a National CI Roadmap from NSF Collaboratories
  • Award number:
    1153775
  • Fiscal year:
    2011
  • Award amount:
    $500,000
  • Project type:
    Standard Grant