CRII: AF: RUI: Faster and Cache-Efficient Similarity Filters and Searches for Big Data
CRII:AF:RUI:更快且高效缓存的相似性过滤器和大数据搜索
基本信息
- 批准号:1755791
- 负责人:
- 金额:$ 17.5万
- 依托单位:
- 依托单位国家:美国
- 项目类别:Standard Grant
- 财政年份:2018
- 资助国家:美国
- 起止时间:2018-02-15 至 2022-08-31
- 项目状态:已结题
- 来源:
- 关键词:
项目摘要
Analysing and organising the massive amount of data collected everyday is one of the biggest challenges faced by computer scientists. This impacts almost every field of modern human life, such as health care, military applications, smart cities, transportation, networks, etc. To quickly analyse and organise such huge amounts of data, one needs to develop space and time efficient algorithms, that accelerate the search for information while minimising memory requirements. This project aims to develop space-efficient similarity filters, which quickly filter out queries that have very little similarity from the records existing in the database. This has applications in security, databases, machine learning, file systems, vision, pattern recognition, compression and networks. Apart from the targeted impact of these applications to society, this project also aims at producing the next generation of researchers and promoting under-represented groups in science and technology. Minority undergraduate and high school students will be involved in this project. Queens College, CUNY is a diverse institution, representing a wide range of ethnic minorities and has 28% Hispanic students. In an effort to engage more undergraduate students and women in research (Queens College has roughly 56% female students), the PI plans to offer a new course that develops an understanding of approximate membership and similarity search data structures in the setting of big data.Similarity searching is a fundamental problem in this context, where a massive data of records needs to be organised so as to answer quickly queries of the form: given a record, is it similar to some record in the data? Similarity search, modeled as near(est) neighbor search (NNS) is an important and well-studied problem with numerous applications: Given a set P of n points lying in some d-dimensional metric space, the goal is to preprocess P so as to return for a given query point q the point p in P that is closest to q. Exact versions of this problem are hard to solve, and all solutions to the approximate version of the problem require superlinear (more than nd) space. This space usage is prohibitive for many big data settings, where not just n, but d could be huge (e.g., the number of pixels in an image). This project is aimed at developing the theory and applications of Similarity Filters, which are decision (yes/no) based data structures. Given a query q, a (c, r)-similarity filter always returns yes if q is within a distance r of some input point, and returns no with a small error probability (thus some false positives are allowed) if all input points are at least cr away from the query. Recent results from the researcher shows that such filters can be built with sublinear (asymptotically less than nd) space. The project takes on two tasks: 1) to understand the approximation factor-space-query tradeoff for similarity filters, developing tight upper and lower bounds, especially in the sublinear space regime, and 2) to develop I/O efficient Similarity Filters and nearest neighbor data structures, akin to quotient or cascade filters for the approximate membership problem (where the in-RAM Bloom Filter is the standard data structure, used in many networking applications). The overarching goal is to use similarity filters on RAM in conjunction with a disk-based data structure: faraway queries will be filtered quickly, and nearby queries will have their neighbor(s) returned following a slower but cache friendly search. The availability of such a two-layered space-and-time-efficient data structure will greatly boost the scope of applications when the amount of data is massive, especially in databases, networks, security and machine learning.
分析和组织每天收集的海量数据是计算机科学家面临的最大挑战之一。这几乎影响到现代人类生活的每一个领域,如医疗保健、军事应用、智能城市、交通运输、网络等。要快速分析和组织如此海量的数据,需要开发空间和时间高效的算法,以加快搜索信息的速度,同时将内存需求降至最低。这个项目的目标是开发空间高效的相似性过滤器,它可以快速过滤出与数据库中现有记录几乎没有相似性的查询。它在安全、数据库、机器学习、文件系统、视觉、模式识别、压缩和网络中都有应用。除了这些应用程序对社会有针对性的影响外,该项目还旨在培养下一代研究人员,并促进科学和技术中代表性不足的群体。少数民族本科生和高中生将参与到这个项目中来。纽约州立大学皇后学院是一所多元化的学院,代表了广泛的少数族裔,有28%的西班牙裔学生。为了吸引更多的本科生和女性参与研究(皇后学院约有56%的女生),PI计划开设一门新课程,了解大数据环境下的近似成员资格和相似搜索数据结构。在这种情况下,相似搜索是一个基本问题,需要组织海量记录数据,以便快速回答以下形式的查询:给定一条记录,它与数据中的某些记录相似吗?相似搜索是一个重要且被广泛研究的问题:给定一个位于d维度量空间中的n个点的集合P,其目标是对P进行预处理,以便对给定的查询点Q返回P中最接近Q的点p。这个问题的精确版本很难求解,并且该问题的所有近似版本的解都需要超线性(大于Nd)空间。这种空间使用对于许多大数据设置是令人望而却步的,其中不仅是n,而且d可能很大(例如,图像中的像素数)。该项目旨在发展基于决策(是/否)的数据结构的相似过滤器的理论和应用。在给定查询Q的情况下,如果Q在某个输入点的距离r内,则(c,r)相似性过滤器总是返回YES,而如果所有输入点距离查询至少cr,则返回具有小错误概率的NO(因此允许一些误报)。研究人员的最新结果表明,这种滤波器可以用次线性(渐近小于ND)的空间来构造。该项目承担了两项任务:1)了解相似过滤器的近似因子-空间-查询权衡,开发紧凑的上下界,特别是在次线性空间区域;2)开发I/O高效的相似过滤器和最近邻数据结构,类似于用于近似成员问题的商或级联过滤器(其中,In-RAM Bloom过滤器是标准数据结构,在许多网络应用中使用)。首要目标是将内存上的相似性过滤器与基于磁盘的数据结构结合使用:遥远的查询将被快速过滤,而附近的查询将在较慢但缓存友好的搜索后返回其邻居(S)。这种时空两层高效的数据结构的可用性将极大地扩大数据量巨大的应用范围,特别是在数据库、网络、安全和机器学习方面。
项目成果
期刊论文数量(4)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
A Manifold View of Adversarial Risk
- DOI:10.48550/arxiv.2203.13277
- 发表时间:2022-03
- 期刊:
- 影响因子:0
- 作者:Wen-jun Zhang;Yikai Zhang;Xiaoling Hu;Mayank Goswami;Chao Chen;Dimitris N. Metaxas
- 通讯作者:Wen-jun Zhang;Yikai Zhang;Xiaoling Hu;Mayank Goswami;Chao Chen;Dimitris N. Metaxas
A Topological Filter for Learning with Label Noise
- DOI:
- 发表时间:2020-12
- 期刊:
- 影响因子:0
- 作者:Pengxiang Wu;Songzhu Zheng;Mayank Goswami;Dimitris N. Metaxas;Chao Chen
- 通讯作者:Pengxiang Wu;Songzhu Zheng;Mayank Goswami;Dimitris N. Metaxas;Chao Chen
On the I/O Complexity of the k-Nearest Neighbors Problem
- DOI:10.1145/3375395.3387649
- 发表时间:2020-02
- 期刊:
- 影响因子:0
- 作者:Mayank Goswami;R. Jacob;R. Pagh
- 通讯作者:Mayank Goswami;R. Jacob;R. Pagh
Error-Bounded Correction of Noisy Labels
- DOI:
- 发表时间:2020-07
- 期刊:
- 影响因子:0
- 作者:Songzhu Zheng;Pengxiang Wu;A. Goswami;Mayank Goswami;Dimitris N. Metaxas;Chao Chen
- 通讯作者:Songzhu Zheng;Pengxiang Wu;A. Goswami;Mayank Goswami;Dimitris N. Metaxas;Chao Chen
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
Mayank Goswami其他文献
Weakly confined silicon nano-structures as a THz absorbing material
弱约束硅纳米结构作为太赫兹吸收材料
- DOI:
10.1063/5.0204896 - 发表时间:
2024 - 期刊:
- 影响因子:0
- 作者:
Pooja Sudha;Mayank Goswami;Arup Samanta - 通讯作者:
Arup Samanta
Adaptive Discretization for Computerized Tomography
计算机断层扫描的自适应离散化
- DOI:
- 发表时间:
2018 - 期刊:
- 影响因子:0
- 作者:
S. Shakya;A. Saxena;P. Munshi;Mayank Goswami - 通讯作者:
Mayank Goswami
Fair subgraph selection for contagion containment (Brief Announcement)
- DOI:
10.1016/j.procs.2023.08.250 - 发表时间:
2023-01-01 - 期刊:
- 影响因子:
- 作者:
Esther M. Arkin;Rezaul A. Chowdhury;Mayank Goswami;Jason Huang;Joseph S.B. Mitchell;Valentin Polishchuk;Rakesh Ravindran - 通讯作者:
Rakesh Ravindran
Pattern-Avoiding Access in Binary Search Trees
二叉搜索树中避免模式访问
- DOI:
10.1109/focs.2015.32 - 发表时间:
2015 - 期刊:
- 影响因子:0
- 作者:
Parinya Chalermsook;Mayank Goswami;L. Kozma;K. Mehlhorn;Thatchaphol Saranurak - 通讯作者:
Thatchaphol Saranurak
Missing Motility: High-resolution in vivo imaging of retinal microglia reveals stationary ramified cells.
运动性缺失:视网膜小胶质细胞的高分辨率体内成像揭示了静止的分支细胞。
- DOI:
- 发表时间:
2016 - 期刊:
- 影响因子:0
- 作者:
Eric B. Miller;Pengfei Zhang;Mayank Goswami;R. Zawadzki;E. Pugh;M. E. Burns - 通讯作者:
M. E. Burns
Mayank Goswami的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('Mayank Goswami', 18)}}的其他基金
AF: Small: RUI: Towards Resolving the Dynamic Optimality Conjecture.
AF:小:RUI:解决动态最优猜想。
- 批准号:
1910873 - 财政年份:2019
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
相似国自然基金
基于前瞻性队列的双酚AF联合果糖加重代谢损伤的靶向代谢组学研究
- 批准号:2025JJ30049
- 批准年份:2025
- 资助金额:0.0 万元
- 项目类别:省市级项目
U2AF2-circMMP1信号轴促进结直肠癌进展的分子机制研究
- 批准号:2025JJ80723
- 批准年份:2025
- 资助金额:0.0 万元
- 项目类别:省市级项目
U2AF2精氯酸甲基化调控RNA转录合成在MTAP缺失骨肉瘤T细胞耗竭中的机制研究
- 批准号:
- 批准年份:2024
- 资助金额:0 万元
- 项目类别:青年科学基金项目
BDA-366通过MYD88/NF-κB/PGC1β通路杀伤 KMT2A/AF9 AML细胞的机制研究
- 批准号:
- 批准年份:2024
- 资助金额:15.0 万元
- 项目类别:省市级项目
Lu AF21934减少缺血性脑卒中导致的神经损伤的机制研究
- 批准号:
- 批准年份:2024
- 资助金额:0.0 万元
- 项目类别:省市级项目
H2S介导剪接因子BraU2AF65a的S-巯基化修饰促进大白菜开花的分子机制
- 批准号:32372727
- 批准年份:2023
- 资助金额:50 万元
- 项目类别:面上项目
AF9通过ARRB2-MRGPRB2介导肠固有肥大细胞活化促进重症急性胰腺炎发生MOF的研究
- 批准号:82300739
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
剪接因子U2AF1突变在急性髓系白血病原发耐药中的机制研究
- 批准号:82370157
- 批准年份:2023
- 资助金额:49 万元
- 项目类别:面上项目
线粒体活性氧介导的胎盘早衰在孕期双酚AF暴露致婴幼儿神经发育迟缓中的作用
- 批准号:82304160
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
U2AF2-circMMP1调控能量代谢促进结直肠癌肝转移的分子机制
- 批准号:82303789
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
相似海外基金
CRII: AF: RUI: New Frontiers in Fundamental Error-Correcting Codes
CRII:AF:RUI:基本纠错码的新领域
- 批准号:
2347371 - 财政年份:2024
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: Algorithmic Fairness for Computational Social Choice Models
CRII:AF:RUI:计算社会选择模型的算法公平性
- 批准号:
2348275 - 财政年份:2024
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
AF: Small: RUI: Toward High-Performance Block Krylov Subspace Algorithms for Solving Large-Scale Linear Systems
AF:小:RUI:用于求解大规模线性系统的高性能块 Krylov 子空间算法
- 批准号:
2327619 - 财政年份:2023
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
Collaborative Research: AF: Small: RUI: Data Science from Economic Foundations
合作研究:AF:小型:RUI:来自经济基础的数据科学
- 批准号:
2218814 - 财政年份:2022
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
Collaborative Research: AF: Small: RUI: Data Science from Economic Foundations
合作研究:AF:小型:RUI:来自经济基础的数据科学
- 批准号:
2218813 - 财政年份:2022
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: Breaking Ground on Circulant TSP
CRII:AF:RUI:循环 TSP 破土动工
- 批准号:
2153331 - 财政年份:2022
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: New Approaches for Space-Efficient Similarity Search
CRII:AF:RUI:节省空间的相似性搜索的新方法
- 批准号:
2103813 - 财政年份:2021
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: Markov Chains and Random Sampling on Graphs
CRII:AF:RUI:马尔可夫链和图上的随机采样
- 批准号:
2104795 - 财政年份:2021
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: Engineering and Experiments with Geometric Spanner Construction Algorithms for Massive Point Sets
CRII:AF:RUI:大规模点集的几何扳手构造算法的工程和实验
- 批准号:
1947887 - 财政年份:2020
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: Verifiable Computation Outsourcing: A Non-Cooperative Approach
CRII:AF:RUI:可验证计算外包:一种非合作方法
- 批准号:
1947789 - 财政年份:2020
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant