CAREER: Scalable Algorithms for Large-Scale Data Mining
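Basic information
- Award number: 0093404
- Principal investigator:
- Amount: --
- Host institution:
- Host institution country: United States
- Award type: Continuing Grant
- Fiscal year: 2001
- Funding country: United States
- Period: 2001-06-01 to 2008-05-31
- Status: Completed
- Source:
- Keywords: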
Project abstract
Digital data occurs in diverse forms: as database records with numerical fields, as raw text documents or image files, or as website traffic log files. Data mining is the automatic discovery of interesting patterns, associations, changes, anomalies, rules, and statistically significant structures and events in data. A key feature of the data, often an overwhelming one, is its sheer magnitude. The rapidly expanding internet already contains more than 1 billion web pages, and typical warehouse and web traffic data can occupy terabytes of disk space. Data mining tools must therefore be efficient and scalable if they are to serve any practical purpose. Parallel computing can help satisfy the demands on computing cycles and memory storage imposed by these large data sets.
The main focus of this project is to develop scalable solutions for large-scale data analysis. The main thrust is in exploring and developing efficient, parallel, mathematical and statistical methods that can mine large data sets and deliver results in a timely manner. In particular, the project will investigate new clustering techniques that partition data into disjoint partitions, the new method of concept decompositions for dimensionality reduction, improved computation of principal components analysis, efficient classification schemes for folding newly arriving unlabeled data into known classes, and effective visualization of multidimensional data.
Another focus is to adapt the data analysis tools developed to the application area of text mining. A completely parallel text mining system will be built that is capable of (a) efficient preprocessing of text data into numerical data, (b) clustering large unlabeled document collections, (c) classifying unlabeled documents into a known concept hierarchy, and (d) visualization of document and word relationships. This system will allow the user to easily navigate, assimilate, search and organize the contents of very large document collections; we hope to process up to 100 million documents on a 128-processor cluster of workstations. Many of the text mining algorithms we develop will scale linearly with the size of the data. In this scenario, it becomes important to avoid I/O bottlenecks, exploit memory hierarchies of modern processors and hide network latencies.
The educational plan consists of three components: (i) a teaching philosophy that emphasizes the scientific method in undergraduate and graduate education, by incorporating new technologies for in-class and web-based offline instruction; (ii) a focus on multidisciplinary education with a commitment to develop centralized web-oriented primers designed to quickly acquaint students with desired prerequisites; and (iii) curriculum development for two courses: the first, a scientific computing course for non-CS undergraduates as part of UT Austin's new "Elements of Computing" program, and the second, a new course on large-scale data mining for graduate students.
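As a rough illustration of steps (a) and (b) of the text mining system described in the abstract, the following serial Python sketch turns raw text into unit-length term-frequency vectors and groups them with spherical k-means (cosine similarity). The abstract does not name a specific algorithm, so the choice of spherical k-means, the function names (`vectorize`, `cluster`), the seed indices, and the sample documents are all assumptions made for illustration; this is not the project's implementation, and it does not attempt the parallelism the project targets.

```python
import math
from collections import Counter

def vectorize(docs):
    """Step (a): map each text to a sparse, unit-length term-frequency vector."""
    vectors = []
    for text in docs:
        counts = Counter(text.lower().split())
        norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        vectors.append({w: c / norm for w, c in counts.items()})
    return vectors

def cosine(u, v):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def centroid(members):
    """Normalized sum of the member vectors of one cluster."""
    total = Counter()
    for vec in members:
        for w, x in vec.items():
            total[w] += x
    norm = math.sqrt(sum(x * x for x in total.values())) or 1.0
    return {w: x / norm for w, x in total.items()}

def cluster(vectors, seeds, max_iters=20):
    """Step (b): spherical k-means, seeded with the vectors at the given indices."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iters):
        # Assign every document to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda j: cosine(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each centroid; an emptied cluster keeps its old centroid.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    return labels

# Hypothetical sample documents, for illustration only.
docs = [
    "parallel clustering of large document collections",
    "clustering large document collections in parallel",
    "principal components analysis for dimensionality reduction",
    "dimensionality reduction with principal components",
]
labels = cluster(vectorize(docs), seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Each iteration touches every document once per centroid, so for a fixed number of clusters and iterations the cost grows linearly with the number of documents, which is the kind of scaling the abstract aims for.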
Project outcomes
Journal papers (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Inderjit Dhillon
BIGDATA: Collaborative Research: F: Nomadic Algorithms for Machine Learning in the Cloud
- Award number: 1546452
- Fiscal year: 2016
- Amount: --
- Award type: Standard Grant
I-Corps: Faster than Light Big Data Analytics
- Award number: 1507631
- Fiscal year: 2015
- Amount: --
- Award type: Standard Grant
AF: Small: Divide-and-Conquer Numerical Methods for Analysis of Massive Data Sets
- Award number: 1320746
- Fiscal year: 2013
- Amount: --
- Award type: Standard Grant
AF: Small: Fast and Memory-Efficient Dimensionality Reduction for Massive Networks
- Award number: 1117055
- Fiscal year: 2011
- Amount: --
- Award type: Standard Grant
Non-Negative Matrix and Tensor Approximations: Algorithms, Software and Applications
- Award number: 0728879
- Fiscal year: 2007
- Amount: --
- Award type: Standard Grant
Novel Matrix Problems in Modern Applications
- Award number: 0431257
- Fiscal year: 2004
- Amount: --
- Award type: Standard Grant
Similar NSFC grants
Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
- Award number:
- Approval year: 2024
- Amount (10,000 CNY):
- Award type: Collaborative Innovation Research Team
Similar overseas grants
CAREER: Scalable algorithms for regularized and non-linear genetic models of gene expression
- Award number: 2336469
- Fiscal year: 2024
- Amount: --
- Award type: Continuing Grant
CAREER: Scalable and Robust Uncertainty Quantification using Subsampling Markov Chain Monte Carlo Algorithms
- Award number: 2340586
- Fiscal year: 2024
- Amount: --
- Award type: Continuing Grant
CAREER: Learning Kernels in Operators from Data: Learning Theory, Scalable Algorithms and Applications
- Award number: 2238486
- Fiscal year: 2023
- Amount: --
- Award type: Continuing Grant
CAREER: Scalable Algorithms for Nonlinear, Large-Scale Inverse Problems Governed by Dynamical Systems
- Award number: 2145845
- Fiscal year: 2022
- Amount: --
- Award type: Continuing Grant
CAREER: Scalable binning algorithms for genome-resolved metagenomics
- Award number: 1845890
- Fiscal year: 2019
- Amount: --
- Award type: Continuing Grant
CAREER: Pushing the Theoretical Limits of Scalable Distributed Algorithms
- Award number: 1845146
- Fiscal year: 2019
- Amount: --
- Award type: Continuing Grant
CAREER: Towards Fast and Scalable Algorithms for Big Proteogenomics Data Analytics
- Award number: 1925960
- Fiscal year: 2018
- Amount: --
- Award type: Standard Grant
CAREER: Towards Fast and Scalable Algorithms for Big Proteogenomics Data Analytics
- Award number: 1651724
- Fiscal year: 2017
- Amount: --
- Award type: Standard Grant
CAREER: Scalable Algorithms for Spectral Analysis of Massive Networked Systems
- Award number: 1651433
- Fiscal year: 2017
- Amount: --
- Award type: Standard Grant













