
CAREER: Algorithms for understanding data


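Basic information

Award number:
1351108
Principal investigator:
Gregory Valiant
Amount:
$500,000
Institution:
Stanford University
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2014
Funding country:
United States
Start and end dates:
2014-07-01 to 2019-06-30
Project status:
Completed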

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1351108&HistoricalAwards=false
Keywords:
CAREER Algorithms understanding data

Project abstract

Given samples from some unknown distribution, what can one infer about the underlying distribution, and how efficiently can these inferences be made? In many of the most fundamental settings, our understanding of the computational and information-theoretic possibilities and barriers is still startlingly poor. This project tackles two broad research objectives: developing efficient algorithms for probing data, and understanding how to efficiently estimate properties of distributions. The first line of research seeks to understand which questions about a dataset can be answered extremely efficiently, requiring computational resources (time or memory) that are sublinear in the size of the dataset or distribution. The second research objective is to understand the minimal amount of information necessary to ascertain, with high probability, whether or not a distribution or dataset possesses a given property. In the context of statistical property estimation, this problem asks how few samples are needed to estimate the property in question to a desired accuracy, with high probability. This research pursues both new estimation algorithms and new information-theoretic tools and lower bounds.

With vast and important datasets emerging across many disciplines, from genetic, biological, and medical databases to databases documenting our economic and social behaviors, the challenge of how to make sense of them has immediate relevance and has rapidly become the bottleneck in scientific understanding. The specific problems investigated in this project arise in the analysis of these datasets; algorithmic advances on these problems have the potential to be adopted very quickly and to transform ongoing data analysis efforts. Beyond the immediate implications for the data sciences, these questions are extremely basic and foundational. As such, new techniques, perspectives, and insights gleaned from their study are likely to have broad implications for other problems throughout computer science, statistics, information theory, and the data sciences.
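
To make the sample-complexity question in the abstract concrete, here is a minimal Python sketch of a standard textbook case, not a method taken from the project itself. It estimates the fraction of items in a dataset that satisfy a predicate by inspecting only a random sample, with the sample size set by the two-sided Hoeffding bound. The function names, the dataset, and the predicate are all hypothetical, chosen only for illustration.

```python
import math
import random


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Samples sufficient to estimate a [0, 1]-bounded mean to within
    epsilon with probability at least 1 - delta (two-sided Hoeffding bound)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


def estimate_fraction(data, predicate, epsilon, delta, rng):
    """Estimate the fraction of items in `data` satisfying `predicate`
    from a uniform random sample drawn with replacement."""
    n = hoeffding_sample_size(epsilon, delta)
    hits = sum(1 for _ in range(n) if predicate(data[rng.randrange(len(data))]))
    return hits / n, n


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so runs are repeatable
    data = range(1_000_000)         # hypothetical dataset of one million items
    estimate, n = estimate_fraction(data, lambda x: x % 4 == 0, 0.05, 0.01, rng)
    print(f"inspected {n} of {len(data)} items; estimate = {estimate:.3f}")
```

The number of items inspected depends only on the target accuracy and confidence, not on the size of the dataset, which is the sense in which such a query is sublinear. Properties such as entropy or support size are much harder to estimate from few samples, and this sketch does not address them.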
