CAREER: Statistically-Sound Knowledge Discovery from Data
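Basic Information
- Award number: 2238693
- Principal investigator:
- Amount: $600,300
- Host institution:
- Host institution country: United States
- Project category: Continuing Grant
- Fiscal year: 2023
- Funding country: United States
- Period: 2023-10-01 to 2028-09-30
- Project status: Ongoing
- Source:
- Keywords:
Project Abstract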
Methods for knowledge discovery from data (e.g., for extracting patterns or finding anomalies) have found their way to research labs in life and biological sciences, and in industries such as cybersecurity. In these fields, the statistical validity of the results produced by these methods is paramount: false discoveries cannot be tolerated. Current methods do not offer such stringent statistical guarantees. This project develops algorithms for statistically-sound Knowledge Discovery from Data. It transforms the field by shifting the goal of the Knowledge Discovery process from extracting information about the available data to gaining new understanding of the noisy, random process that generates the data. The proposed methods contribute towards a faster and higher-throughput scientific pipeline, by allowing scientists and practitioners to efficiently analyze rich large datasets and to trust the results of the analysis. Researchers can then focus on their discipline-specific research tasks without worrying about computational or statistical considerations. The project includes collaborations with a local museum and a local public library, to analyze data about their collections of historic materials, and with a cybersecurity company to develop methods for fast detection of network attacks with few false positives. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.
Research in knowledge discovery has mostly focused on understanding the available data, rather than the process that generated it. In the few cases where hypothesis testing was used to assess the results (mostly for simple patterns), only simplistic null models were considered, and the testing employed low-statistical-power approaches (e.g., the Bonferroni correction) to control only for one measure of false discovery, the Family-Wise Error Rate. This project is transformative because it will develop efficient methods for evaluating a wide variety of results (e.g., patterns, anomalies, graph/vertex/edge properties, and more) obtained from large rich datasets (e.g., transactional datasets, graphs, and time series), using realistic null models which are more appropriate for these tasks, and better encode available knowledge of the data generating process. We will create novel efficient procedures to sample from such models, both approximate (e.g., Markov-Chain Monte Carlo) and exact, and combine them with modern resampling-based multiple testing methods, in a multiple-hypothesis first approach that also controls the (marginal) False Discovery Rate.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
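The abstract contrasts the Bonferroni correction, which controls the Family-Wise Error Rate, with approaches that control the False Discovery Rate. As a rough illustration of why the former has lower statistical power, the sketch below compares Bonferroni with the classical Benjamini-Hochberg step-up procedure. This is not the project's method: the abstract refers to resampling-based procedures and the marginal False Discovery Rate, which are not shown here. The function names and the p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls the FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Classical step-up procedure (controls the FDR under independence)."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


# Hypothetical p-values, for illustration only.
p_values = [0.001, 0.008, 0.012, 0.03, 0.2, 0.6]
n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(n_bonferroni, n_bh)
```

On these made-up p-values Bonferroni rejects fewer hypotheses than Benjamini-Hochberg, because it compares every p-value against the same strict threshold `alpha / m`, while the step-up thresholds grow with the rank.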
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Matteo Riondato
The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling
- DOI: 10.1007/978-3-642-23783-6_42
- Year published: 2011
- Journal:
- Impact factor: 5.7
- Authors: Matteo Riondato; M. Akdere; U. Çetintemel; S. Zdonik; E. Upfal
- Corresponding author: E. Upfal
Sampling-Based Data Mining Algorithms: Modern Techniques and Case Studies
- DOI:
- Year published: 2014
- Journal:
- Impact factor: 0
- Authors: Matteo Riondato
- Corresponding author: Matteo Riondato
Sharpe Ratio: Estimation, Confidence Intervals, and Hypothesis Testing
- DOI:
- Year published: 2018
- Journal:
- Impact factor: 0
- Authors: Matteo Riondato
- Corresponding author: Matteo Riondato
MiSoSouP: Mining Interesting Subgroups with Sampling and Pseudodimension
- DOI:
- Year published: 2018
- Journal:
- Impact factor: 0
- Authors: Matteo Riondato; Fabio Vandin
- Corresponding author: Fabio Vandin
Other grants by Matteo Riondato
III: Small: RUI: Scalable and Iterative Statistical Testing of Multiple Hypotheses on Massive Datasets
- Award number: 2006765
- Fiscal year: 2020
- Funding amount: $600,300
- Project category: Standard Grant
NSF Student Travel Grant for 2019 SIAM International Conference on Data Mining (SDM)
- Award number: 1918446
- Fiscal year: 2019
- Funding amount: $600,300
- Project category: Standard Grant
Similar overseas grants
Incidental learning across statistically-structured input in active tasks
- Award number: 2420979
- Fiscal year: 2023
- Funding amount: $600,300
- Project category: Continuing Grant
Statistically efficient integration of animal tracking data into ecological theory and evidence-based conservation
- Award number: RGPIN-2021-02758
- Fiscal year: 2022
- Funding amount: $600,300
- Project category: Discovery Grants Program - Individual
Statistically robust topological data analysis for biomedical applications
- Award number: 2784904
- Fiscal year: 2022
- Funding amount: $600,300
- Project category: Studentship
Statistically explainable GAN inversion
- Award number: 2576597
- Fiscal year: 2021
- Funding amount: $600,300
- Project category: Studentship
Statistically efficient integration of animal tracking data into ecological theory and evidence-based conservation
- Award number: RGPIN-2021-02758
- Fiscal year: 2021
- Funding amount: $600,300
- Project category: Discovery Grants Program - Individual
Statistically efficient integration of animal tracking data into ecological theory and evidence-based conservation
- Award number: DGECR-2021-00089
- Fiscal year: 2021
- Funding amount: $600,300
- Project category: Discovery Launch Supplement
CIF: Small: Statistically Optimal Subsampling for Big Data and Rare Events Data
- Award number: 2105571
- Fiscal year: 2021
- Funding amount: $600,300
- Project category: Standard Grant
Designing Ligand Platforms of Titanium Imido Catalysts and Statistically Screening Alkyne Substrates for [2+2+1] Regioselective Pyrrole Synthesis
- Award number: 10322653
- Fiscal year: 2021
- Funding amount: $600,300
- Project category:
RII Track-4: Statistically Significant Signatures of Dark Matter from Astrophysical Observations (WoU-MMA)
- Award number: 2033382
- Fiscal year: 2021
- Funding amount: $600,300
- Project category: Standard Grant
AF: Medium: Collaborative Research: Estimation, Learning, and Memory: The Quest for Statistically Optimal Algorithms
- Award number: 2212841
- Fiscal year: 2021
- Funding amount: $600,300
- Project category: Continuing Grant