
CAREER: Scalable Algorithms for Large-Scale Data Mining


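Basic Information

Award Number:
0093404
Principal Investigator:
Inderjit Dhillon
Amount:
--
Institution:
University of Texas at Austin
Institution Country:
United States
Award Type:
Continuing Grant
Fiscal Year:
2001
Funding Country:
United States
Project Period:
2001-06-01 to 2008-05-31
Project Status:
Completed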

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=0093404&HistoricalAwards=false
Keywords:
CAREER Scalable Algorithms Large Scale

Project Abstract

Digital data can occur in diverse forms; it may occur as database records with numerical fields, as raw text documents or image files, or as website traffic log files. Data mining is the automatic discovery of interesting patterns, associations, changes, anomalies, rules, and statistically significant structures and events in data. A key feature, often an overwhelming feature, of the data is its sheer magnitude. The rapidly expanding internet already contains more than 1 billion web pages, and typical warehouse and web traffic data can occupy terabytes of disk space. It is clear that data mining tools must be efficient and scalable if they are to serve any practical purpose. Parallel computing can help in satisfying the demands on computing cycles and memory storage imposed by these large data sets.

The main focus of this project is to develop scalable solutions for large-scale data analysis. The main thrust is in exploring and developing efficient, parallel, mathematical and statistical methods that can mine large data sets and deliver results in a timely manner. In particular, new clustering techniques that partition data into disjoint partitions, the new method of concept decompositions for dimensionality reduction, improved computation of principal components analysis, efficient classification schemes for folding newly arriving unlabeled data into known classes, and effective visualization of multidimensional data will be investigated.

Another focus is to adapt the data analysis tools developed to the application area of text mining. A completely parallel text mining system will be built that is capable of (a) efficient preprocessing of text data into numerical data, (b) clustering large unlabeled document collections, (c) classifying unlabeled documents into a known concept hierarchy, and (d) visualization of document and word relationships. This system will allow the user to easily navigate, assimilate, search and organize the contents of very large document collections; we hope to process up to 100 million documents on a 128-processor cluster of workstations. Many of the text mining algorithms we develop will scale linearly with the size of the data. In this scenario, it becomes important to avoid I/O bottlenecks, exploit memory hierarchies of modern processors and hide network latencies.

The educational plan consists of three components: (i) a teaching philosophy that emphasizes the scientific method in undergraduate and graduate education, by incorporating new technologies for in-class and web-based offline instruction; (ii) a focus on multidisciplinary education with a commitment to develop centralized web-oriented primers designed to quickly acquaint students with desired pre-requisites; and (iii) curriculum development for two courses: the first, a scientific computing course for non-CS undergraduates as part of UT Austin's new "Elements of Computing" program, and the second, a new course on large-scale data mining for graduate students.
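Steps (a) and (b) of the text mining system described in the abstract, turning text into numerical data and clustering an unlabeled collection into disjoint groups, can be illustrated with a small serial sketch. The abstract does not specify the algorithms used, so everything here is an assumption made for illustration: the whitespace tokenizer, the unit-length term-frequency vectors, the choice of spherical k-means (k-means under cosine similarity), the function names, the seed-based initialization, and the four toy documents. The normalized centroids it returns play the role of per-cluster summary vectors, in the spirit of the concept decompositions the abstract mentions. It is not the project's parallel implementation.

```python
import math
from collections import Counter


def vectorize(docs):
    """Turn raw text into unit-length term-frequency vectors (sparse dicts)."""
    vectors = []
    for text in docs:
        counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
        norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        vectors.append({term: c / norm for term, c in counts.items()})
    return vectors


def cosine(u, v):
    """Sparse dot product; equals cosine similarity for unit-length inputs."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def normalized_centroid(members):
    """Sum the member vectors and rescale the result to unit length."""
    total = Counter()
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def spherical_kmeans(vectors, seeds, iterations=10):
    """Partition vectors into disjoint clusters by cosine similarity.

    seeds: indices of vectors used as initial centroids (a hypothetical,
    deterministic initialization chosen to keep the example reproducible).
    """
    centroids = [dict(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
            for vec in vectors
        ]
        # Update step: recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = normalized_centroid(members)
    return labels, centroids


# Toy documents, invented for this example.
docs = [
    "parallel clustering of large document collections",
    "clustering large document collections in parallel",
    "principal components analysis for dimensionality reduction",
    "dimensionality reduction with principal components",
]
labels, centroids = spherical_kmeans(vectorize(docs), seeds=[0, 2])
print(labels)  # → [0, 0, 1, 1]
```

Each pass over the data costs time proportional to the number of nonzero entries in the document vectors, which is one way an algorithm of this kind can scale linearly with data size; a parallel version would additionally have to distribute the documents and combine partial centroid sums across processors.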
