CAREER: Robust and Secure Multi-Modal Learning for Library-Scale Text Collections

Basic information

Award number:
1652536
Principal investigator:
David Mimno
Amount:
$550,000
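Host institution:
Cornell University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2017
Funding country:
United States
Start and end dates:
2017-05-15 to 2024-04-30
Project status:
Completed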

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1652536&HistoricalAwards=false
Keywords:
CAREER Robust Secure Multi Modal

Project abstract

The growth of social media and digitized libraries has made computational text analysis a vital tool for modern scholarship. But too often methods that work on standardized collections for expert users don't translate to real-world data analysis. In order to be useful, text mining methodologies need to balance theoretical power with practical application. Real data sets are noisy and complicated. More importantly, vast amounts of data cannot be shared directly due to copyright, including all published books after 1923. This project will develop tools that can be applied to limited, privatized views of documents. Algorithms will focus on reliability and efficiency, so that powerful techniques can be used by non-expert users on easily accessible hardware, such as the 10 million K-12 students using low-powered browser-based Chromebooks thereby increasing the societal impact of the work.

Unsupervised text mining methods such as topic models and word embeddings have become popular outside of machine learning because they operate on simple, widely-available representations and identify latent variables that represent recognizable themes, events, or concepts. But standard algorithms do not scale well, require full access to potentially sensitive text collections, and cannot take advantage of non-textual data such as images. Although recent work in spectral inference has produced improvements in speed, current methods are plagued by sensitivity to noisy observations. This work will develop a unified approach to unsupervised text mining based on matrix and tensor factorization. The project will focus on data rectification methods for input matrices, enabling simple algorithms to work dramatically better, even in the presence of sparse and noisy observations, while also reducing model uncertainty. The project will develop new methods for learning from private and sensitive documents by creating public views of non-public data. These will include both noisy representations of individual documents as well as corpus-level summary matrices, and support both strong non-identifiability and weaker non-expressivity criteria. Finally, the project will develop new tools for modeling images and text optimized for the way images actually accompany text in real corpora, rather than short, artificial captions. By jointly modeling large volumes of text and semantically related images, the project will enable users to search for contextually related images, not just visually similar images, and identify topics that are grounded in the visual world, not just in text. For further information see the project web page: http://mimno.infosci.cornell.edu

Project outcomes

Journal articles (13)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

The strange geometry of skip-gram with negative sampling

DOI:
10.18653/v1/d17-1308
Published:
2017-09
Journal:
Impact factor:
0
Authors:
David Mimno; Laure Thompson
Corresponding authors:
David Mimno; Laure Thompson

Comparing Text Representations: A Theory-Driven Approach

DOI:
10.18653/v1/2021.emnlp-main.449
Published:
2021-09
Journal:
ArXiv
Impact factor:
0
Authors:
Gregory Yauney; David M. Mimno
Corresponding authors:
Gregory Yauney; David M. Mimno

Computational Cut-Ups: The Influence of Dada

DOI:
Published:
2018
Journal:
Journal of modern periodical studies
Impact factor:
0.3
Authors:
Thompson, Laure; Mimno, David
Corresponding author:
Mimno, David

Combatting The Challenges of Local Privacy for Distributional Semantics with Compression

DOI:
Published:
2019
Journal:
Impact factor:
0
Authors:
Alexandra Schofield
Corresponding author:
Alexandra Schofield

Like Two Pis in a Pod: Author Similarity Across Time in the Ancient Greek Corpus

DOI:
10.22148/001c.13680
Published:
2020
Journal:
Journal of Cultural Analytics
Impact factor:
0
Authors:
Storey, Grant; Mimno, David
Corresponding author:
Mimno, David

Other publications by David Mimno

Missing Photos, Suffering Withdrawal, or Finding Freedom? How Experiences of Social Media Non-Use Influence the Likelihood of Reversion

DOI:
Published:
Journal:
Impact factor:
0
Authors:
Eric Baumer; Shion Guha; Emily Quan; David Mimno; Geri K. Gay
Corresponding author:
Geri K. Gay

Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries

DOI:
10.1007/11863878_30
Published:
2006
Journal:
European Conference on Research and Advanced Technology for Digital Libraries
Impact factor:
0
Authors:
G. Crane; David Bamman; L. Cerrato; Alison Jones; David Mimno; A. Packel; D. Sculley; G. Weaver
Corresponding author:
G. Weaver