III: Small: Large-Scale High Dimensional Dense Vector Management
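Basic Information
- Award number: 2212629
- Principal investigator:
- Amount: $599,600
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2022
- Funding country: United States
- Period: 2022-09-01 to 2025-08-31
- Project status: Ongoing
- Source:
- Keywords:
Project Abstract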
Real-world objects such as images and documents often contain rich metadata. In addition, the rapid development of machine learning in recent years, especially deep learning, has made it possible to extract meaningful relationships between real-world objects and encode them in numerical representations. In this way, the semantics of objects can be conveniently processed by computers. Numerical representations of semantics play an important role in many data science and artificial intelligence applications, such as face recognition, image retrieval, video understanding, recommender systems, text analysis, and knowledge-base management. In these applications, the numerical representations of real-world objects and their associated metadata are usually queried jointly. While metadata management and representation management have each been investigated extensively on their own, managing the two jointly is under-investigated, which makes it difficult to do in practice. Joint management is also inherently challenging because of the large data volume and the notorious "curse-of-dimensionality" phenomenon, which makes all high-dimensional data objects appear far apart. To support the ability of applications to work with both traditional and numerical representations of data, this project will study how to leverage the synergy between them. If successful, this project will advance the development of science and technology by providing new knowledge about data management. Moreover, despite being widely used, metadata and representations are still largely managed by individual application developers. Without careful implementation, the performance can hardly meet the needs of a wide variety of potential users. This project will deliver an end-to-end data system that relieves machine learning practitioners and application developers of the burden of managing, by themselves, the representations and metadata created by their programs.
In addition, this project includes curriculum development and student training at Rutgers University to amplify the impact of the work.
Large-scale high dimensional dense vectors are ubiquitous nowadays due to the rapid development of representation learning (e.g., the feature vectors learned by well-established machine-learning systems such as word2vec, doc2vec, node2vec, graph2vec, and item2vec). They play an important role in many applications in areas such as data mining, natural language processing, computer vision, information retrieval, and recommendation. However, large-scale high dimensional dense vectors are notoriously hard to query efficiently because of the well-known "curse of dimensionality" phenomenon. Existing research on high dimensional dense vector management mainly focuses on approximate nearest-neighbor search (ANNS). However, a few widely used, compute-intensive dense vector queries are under-examined by the research community and not well supported by existing systems. This project will study three of them: multi-modal ANNS, parallel vector similarity join, and rank estimation. Specifically, multi-modal ANNS queries involve both dense vectors (e.g., vector representations of product images or documents) and their structured attributes (e.g., product price or last edit time). Given a collection of dense vectors, the vector similarity join connects every vector with its nearest neighbors. To deal with the huge computational cost of this operation, this project will study lock-free, massively parallel algorithms on CPUs and GPUs. Rank estimation approximates the inherent dimensionality of a data vector (e.g., vector representations of items in recommendation or documents in information retrieval) in a set of data vectors ordered by their distance to a relevant vector (e.g., a user who purchased the item or a keyword query related to the document). Such a scheme is useful in machine-learning model evaluation.
The long-term goal of this project is to build an end-to-end system to make large-scale dense vector management transparent to machine-learning practitioners and application developers.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
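The three query types named in the abstract can be illustrated with a small brute-force sketch. This is not the project's method: it is an exact, quadratic-time baseline written only to show what each query returns. All function names and the toy data are hypothetical, and `rank_of` assumes one reading of the abstract, namely that the quantity being approximated is a vector's position in a list ordered by distance to a relevant vector.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_knn(query, vectors, attrs, predicate, k):
    """Multi-modal search: the k nearest vectors whose attribute passes the predicate."""
    candidates = [i for i, a in enumerate(attrs) if predicate(a)]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]

def knn_join(vectors, k):
    """Similarity join: map every vector index to its k nearest other vectors."""
    result = {}
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        result[i] = sorted(others, key=lambda j: dist(v, vectors[j]))[:k]
    return result

def rank_of(target, relevant, vectors):
    """Exact 1-based rank of vectors[target] when ordered by distance to `relevant`."""
    d = dist(relevant, vectors[target])
    return 1 + sum(1 for v in vectors if dist(relevant, v) < d)

# Hypothetical toy data: four 2-D "product" vectors with a price attribute.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
prices = [5.0, 20.0, 8.0, 12.0]

# Nearest product to the query among those cheaper than 10.
nearest_cheap = filtered_knn([0.9, 0.1], vectors, prices, lambda p: p < 10.0, k=1)
# Each vector joined with its single nearest neighbor.
join = knn_join(vectors, k=1)
# Position of vector 2 in the list ordered by distance to the origin.
rank = rank_of(2, [0.0, 0.0], vectors)
```

In this toy data the unfiltered nearest neighbor of the query is the vector priced at 20.0, so the price predicate changes the answer. Every function here scans all vectors, which is the cost that approximate, parallel, and estimation-based techniques aim to avoid at scale.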
Project Outcomes
Journal papers (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
ARKGraph: All-Range Approximate K-Nearest-Neighbor Graph
- DOI: 10.14778/3603581.3603601
- Published: 2023-06
- Journal:
- Impact factor: 0
- Authors: Chaoji Zuo; Dong Deng
- Corresponding authors: Chaoji Zuo; Dong Deng
Other publications by Dong Deng
Vitexin alleviates MNNG-induced chronic atrophic gastritis via inhibiting NLRP3 inflammasome
- DOI: 10.1016/j.jep.2024.119272
- Published: 2025-01-31
- Journal:
- Impact factor: 5.400
- Authors: Jiaying Liu; Yuanfan Chen; Jing Zhang; Yun Zheng; Yun An; Chenglai Xia; Yonger Chen; Song Huang; Shaozhen Hou; Dong Deng
- Corresponding author: Dong Deng
Correction to: Internal and external memory set containment join
- DOI: 10.1007/s00778-021-00662-9
- Published: 2021-04-02
- Journal:
- Impact factor: 3.800
- Authors: Chengcheng Yang; Dong Deng; Shuo Shang; Fan Zhu; Li Liu; Ling Shao
- Corresponding author: Ling Shao
Database decay and how to avoid it
- DOI: 10.1109/bigdata.2016.7840584
- Published: 2016
- Journal:
- Impact factor: 0
- Authors: M. Stonebraker; Dong Deng; Michael L. Brodie
- Corresponding author: Michael L. Brodie
High-fidelity biosensing of dNTPs and nucleic acids by controllable subnanometer channel PaMscS
- DOI: 10.1016/j.bios.2021.113894
- Published: 2022-03-15
- Journal:
- Impact factor: 10.500
- Authors: Changjian Zhao; Kaiju Li; Xingyu Mou; Yibo Zhu; Chuan Chen; Ming Zhang; Yu Wang; Ke Zhou; Yingying Sheng; Hao Liu; Yunjin Bai; Xinqiong Li; Cuisong Zhou; Dong Deng; Jianping Wu; Hai-Chen Wu; Rui Bao; Jia Geng
- Corresponding author: Jia Geng
Spine: Scaling up Programming-by-Negative-Example for String Filtering and Transformation
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Chaoji Zuo; Sepehr Assadi; Dong Deng
- Corresponding author: Dong Deng
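Similar NSFC (National Natural Science Foundation of China) Grants
Forensic application of circadian-rhythm small RNAs in estimating the time of bloodstain formation
- Award number:
- Year approved: 2024
- Funding amount: CNY 0
- Award type: Provincial/municipal project
Mechanism by which tRNA-derived small RNA upregulates the YBX1/CCL5 pathway in bortezomib-induced chronic pain
- Award number: n/a
- Year approved: 2022
- Funding amount: CNY 100,000
- Award type: Provincial/municipal project
Small RNA regulation of the type I-F CRISPR-Cas adaptive immune response and its molecular mechanism
- Award number: 32000033
- Year approved: 2020
- Funding amount: CNY 240,000
- Award type: Young Scientists Fund
Mechanisms by which small RNAs regulate the biocontrol function of Bacillus amyloliquefaciens FZB42
- Award number: 31972324
- Year approved: 2019
- Funding amount: CNY 580,000
- Award type: General Program
Mechanisms by which Streptococcus mutans small RNAs link LuxS quorum sensing and biofilm formation
- Award number: 81900988
- Year approved: 2019
- Funding amount: CNY 210,000
- Award type: Young Scientists Fund
Function and mechanism of key gut-bacterial small RNAs in the onset and progression of Crohn's disease
- Award number: 31870821
- Year approved: 2018
- Funding amount: CNY 560,000
- Award type: General Program
Molecular mechanism of pigeon milk secretion analyzed by small RNA sequencing
- Award number: 31802058
- Year approved: 2018
- Funding amount: CNY 260,000
- Award type: Young Scientists Fund
Pathogenic mechanism of rice grassy stunt virus regulated by small RNA-mediated DNA methylation
- Award number: 31772128
- Year approved: 2017
- Funding amount: CNY 600,000
- Award type: General Program
Immune regulatory mechanism of acupuncture treatment of Hashimoto's thyroiditis based on small RNA-seq
- Award number: 81704176
- Year approved: 2017
- Funding amount: CNY 200,000
- Award type: Young Scientists Fund
Regulation of small RNA biosynthesis by rice OsSGS3 and OsHEN1 and its modulation of disease resistance
- Award number: 91640114
- Year approved: 2016
- Funding amount: CNY 850,000
- Award type: Major Research Plan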
Similar Overseas Grants
Collaborative Research: III: Small: Taming Large-Scale Streaming Graphs in an Open World
- Award number: 2236578
- Fiscal year: 2023
- Funding amount: $599,600
- Award type: Standard Grant
Collaborative Research: III: Small: Taming Large-Scale Streaming Graphs in an Open World
- Award number: 2236579
- Fiscal year: 2023
- Funding amount: $599,600
- Award type: Standard Grant
III: Small: Stochastic Algorithms for Large Scale Data Analysis
- Award number: 2131335
- Fiscal year: 2021
- Funding amount: $599,600
- Award type: Continuing Grant
III: Small: Collaborative Research: Cost-Efficient Sampling and Estimation from Large-Scale Networks
- Award number: 2209921
- Fiscal year: 2021
- Funding amount: $599,600
- Award type: Standard Grant
III: Small: Stochastic Algorithms for Large Scale Data Analysis
- Award number: 1908104
- Fiscal year: 2019
- Funding amount: $599,600
- Award type: Continuing Grant
III: Small: Collaborative Research: Cost-Efficient Sampling and Estimation from Large-Scale Networks
- Award number: 1908375
- Fiscal year: 2019
- Funding amount: $599,600
- Award type: Standard Grant
III: Small: A Query System for Rapid Audiovisual Analysis of Large-Scale Video Collections
- Award number: 1908727
- Fiscal year: 2019
- Funding amount: $599,600
- Award type: Continuing Grant
III: Small: Collaborative Research: Cost-Efficient Sampling and Estimation from Large-Scale Networks
- Award number: 1910749
- Fiscal year: 2019
- Funding amount: $599,600
- Award type: Standard Grant
III: Small: Robust Large-Scale Data Mining for Knowledge Discovery in Depression Thought Records
- Award number: 1845666
- Fiscal year: 2017
- Funding amount: $599,600
- Award type: Standard Grant
III: Small: Robust Large-Scale Data Mining for Knowledge Discovery in Depression Thought Records
- Award number: 1619308
- Fiscal year: 2016
- Funding amount: $599,600
- Award type: Standard Grant