CAREER: Statistically-Sound Knowledge Discovery from Data

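Basic information

Award number:
2238693
Principal investigator:
Matteo Riondato
Amount:
approximately $600,300
Awardee institution:
Amherst College
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2023
Funding country:
United States
Start and end dates:
2023-10-01 to 2028-09-30
Project status:
Ongoing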

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2238693&HistoricalAwards=false
Keywords:
CAREER Statistically Sound Knowledge Discovery

Project abstract

Methods for knowledge discovery from data (e.g., for extracting patterns or finding anomalies) have found their way to research labs in life and biological sciences, and in industries such as cybersecurity. In these fields, the statistical validity of the results produced by these methods is paramount: false discoveries cannot be tolerated. Current methods do not offer such stringent statistical guarantees. This project develops algorithms for statistically-sound Knowledge Discovery from Data. It transforms the field by shifting the goal of the Knowledge Discovery process from extracting information about the available data to gaining new understanding of the noisy, random process that generates the data. The proposed methods contribute towards a faster and higher-throughput scientific pipeline, by allowing scientists and practitioners to efficiently analyze rich large datasets and to trust the results of the analysis. Researchers can then focus on their discipline-specific research tasks without worrying about computational or statistical considerations. The project includes collaborations with a local museum and a local public library, to analyze data about their collections of historic materials, and with a cybersecurity company to develop methods for fast detection of network attacks with few false positives. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.

Research in knowledge discovery has mostly focused on understanding the available data, rather than the process that generated it. In the few cases where hypothesis testing was used to assess the results (mostly for simple patterns), only simplistic null models were considered, and the testing employed low-statistical-power approaches (e.g., the Bonferroni correction) to control only for one measure of false discovery, the Family-Wise Error Rate. This project is transformative because it will develop efficient methods for evaluating a wide variety of results (e.g., patterns, anomalies, graph/vertex/edge properties, and more) obtained from large rich datasets (e.g., transactional datasets, graphs, and time series), using realistic null models which are more appropriate for these tasks, and better encode available knowledge of the data generating process. We will create novel efficient procedures to sample from such models, both approximate (e.g., Markov-Chain Monte Carlo) and exact, and combine them with modern resampling-based multiple testing methods, in a multiple-hypothesis-first approach that also controls the (marginal) False Discovery Rate.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
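
The abstract contrasts Family-Wise Error Rate control through the Bonferroni correction with control of the False Discovery Rate. As a rough illustration of why the former tends to have lower statistical power, the sketch below compares Bonferroni with the classical Benjamini-Hochberg step-up procedure. This is a textbook method chosen here for illustration, not the resampling-based approach the project proposes; the function names and the p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the Family-Wise Error Rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Classical step-up procedure; controls the False Discovery Rate
    when the p-values are independent or positively dependent."""
    m = len(p_values)
    # Indices of the hypotheses, sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values, one per tested pattern (illustrative only).
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]

print("Bonferroni rejections:        ", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

On these made-up inputs Bonferroni rejects one hypothesis while Benjamini-Hochberg rejects three, which reflects the different error measures the two procedures control rather than any result from the project.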
