III: Small: A New Machine Learning Approach for Improved Entity Identification

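Basic Information

Award number:
1815538
Principal investigator:
Shivaram Venkataraman
Amount:
approx. $320,400
Institution:
University of Wisconsin-Madison
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2018
Funding country:
United States
Period:
2018-09-01 to 2022-08-31
Status:
Completed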

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1815538&HistoricalAwards=false
Keywords:
III Small New Machine Learning

Project Abstract

Modern analytics rely on data integration to combine heterogeneous data into a unified repository they can tap into for insights, services, and scientific knowledge. The typical goal of data integration is to combine heterogeneous data about the same real-world entity into a canonical representation of that entity. Traditionally, entity canonicalization methods focus on structured data and leverage the semantics of the schema accompanying the data to come up with canonical entity representations. This dependency on data semantics makes existing entity canonicalization methods inapplicable to dark data, i.e., operational data that corresponds to unstructured, noisy, and incomplete data. This project will develop entity canonicalization methods that focus on unstructured and semi-structured data and are suitable for large-scale integration applications. This work will help ease the currently challenging procedure of heuristically consolidating matching information about the same entity into unified representations and thus enable dark data to be more effectively used in downstream analytics applications.

The emphasis of this work is on entity canonicalization techniques that leverage representation learning (a.k.a. feature learning) and deep learning. The combination of distributed representations with deep architectures has emerged as the de facto standard for analyzing and processing unstructured data. This project will develop new deep learning architectures for: (1) record linkage, i.e., clustering unstructured data records that provide information about the same entity; and (2) data fusion, i.e., combining matching unstructured records into a canonical representation of the underlying entity. For record linkage, this work will introduce new deep learning techniques that capture multi-context domain-specific knowledge to learn the semantic similarity between records. For data fusion, this project will design new multi-sequence to one-sequence encoder-decoder recurrent neural networks with a particular focus on incomplete data. The outcomes of this project have the potential to advance the state of the art in large-scale data integration methods as well as machine learning methods for high-dimensional, sparse, and noisy data.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
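
To make the two tasks named in the abstract concrete, the sketch below shows record linkage followed by data fusion on a toy dataset. It is only an illustration under simplifying assumptions: token-level Jaccard similarity and per-field majority voting stand in for the learned similarity and encoder-decoder models the project proposes, and the function names (jaccard, link_records, fuse), the 0.5 threshold, and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def link_records(records, threshold=0.5):
    """Record linkage: cluster records whose 'name' fields are similar.

    Union-find makes the grouping transitive: if A matches B and
    B matches C, all three land in the same cluster.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i]["name"], records[j]["name"]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(records[i])
    return list(clusters.values())


def fuse(cluster):
    """Data fusion: majority vote per field, skipping missing (None) values."""
    fields = {key for record in cluster for key in record}
    fused = {}
    for field in sorted(fields):
        values = [r[field] for r in cluster if r.get(field) is not None]
        fused[field] = Counter(values).most_common(1)[0][0] if values else None
    return fused


# Hypothetical, incomplete records describing two real-world entities.
records = [
    {"name": "University of Wisconsin Madison", "city": "Madison", "state": None},
    {"name": "Univ of Wisconsin Madison", "city": None, "state": "WI"},
    {"name": "University of Wisconsin Madison", "city": "Madison", "state": "WI"},
    {"name": "Stanford University", "city": "Stanford", "state": "CA"},
]

clusters = link_records(records)
canonical = [fuse(c) for c in clusters]
for entity in canonical:
    print(entity)
```

The fusion step fills each field from whichever matching records supply it, which is the simplest way to handle the incomplete data the abstract emphasizes. A learned model would replace both the fixed similarity function and the voting rule.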

Project Outputs

Journal articles (8)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins

DOI:
10.14778/3494124.3494149
Published:
2021-06
Journal:
ArXiv
Impact factor:
0
Authors:
S. Suri; Ihab F. Ilyas; Christopher Ré; Theodoros Rekatsinas
Corresponding authors:
S. Suri; Ihab F. Ilyas; Christopher Ré; Theodoros Rekatsinas

Demo of Marius: A System for Large-scale Graph Embeddings

DOI:
Published:
2021
Journal:
Proceedings of the VLDB Endowment
Impact factor:
2.5
Authors:
Carlsson, Anders; Xie, Anze; Mohoney, Jason; Waleffe, Roger; Peters, Shanan; Rekatsinas, Theodoros; Venkataraman, Shivaram
Corresponding author:
Venkataraman, Shivaram

Picket: guarding against corrupted data in tabular data during learning and inference

DOI:
10.1007/s00778-021-00699-w
Published:
2020-06
Journal:
The VLDB Journal
Impact factor:
0
Authors:
Zifan Liu; Zhechun Zhou; Theodoros Rekatsinas
Corresponding authors:
Zifan Liu; Zhechun Zhou; Theodoros Rekatsinas

MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks

DOI:
10.1145/3552326.3567501
Published:
2022-02
Journal:
Proceedings of the Eighteenth European Conference on Computer Systems
Impact factor:
0
Authors:
R. Waleffe; J. Mohoney; Theodoros Rekatsinas; S. Venkataraman
Corresponding authors:
R. Waleffe; J. Mohoney; Theodoros Rekatsinas; S. Venkataraman

Principal Component Networks: Parameter Reduction Early in Training

DOI:
Published:
2020-06
Journal:
ArXiv
Impact factor:
0
Authors:
R. Waleffe; Theodoros Rekatsinas
Corresponding authors:
R. Waleffe; Theodoros Rekatsinas

Other publications by Shivaram Venkataraman

CHAI: Clustered Head Attention for Efficient LLM Inference

DOI:
Published:
2024
Journal:
arXiv.org
Impact factor:
0
Authors:
Saurabh Agarwal; Bilge Acun; Basil Homer; Mostafa Elhoushi; Yejin Lee; Shivaram Venkataraman; Dimitris Papailiopoulos; Carole
Corresponding author:
Carole