Collaborative Research: RI: Small: Unsupervised Islamicate Manuscript Transcription via Lacunae Reconstruction
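Basic Information
- Award number: 2200333
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2022
- Funding country: United States
- Period: 2022-07-01 to 2025-06-30
- Project status: Not yet concluded
- Source:
- Keywords:
Project Abstract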
This award tackles handwritten text recognition (HTR, the task of automatically transcribing images of handwritten manuscripts into symbolic text) for Islamicate manuscripts, a domain that encompasses Persian and Arabic written traditions originating in the premodern Islamic world (7th-19th centuries). HTR for modern text is itself a challenging problem that has received substantial attention from the fields of machine learning (ML) and artificial intelligence (AI). However, the predominance of modern text in HTR research is, to some extent, waning: current techniques are relatively robust on modern data, and contemporary written media production is already almost entirely digital. In contrast, historical manuscripts have received comparatively less attention from ML and AI, and at the same time represent both an exceptional opportunity for impact and a set of unique challenges for ML techniques. Specifically, the written traditions of the Islamicate world together form one of the largest -- if not the largest -- archives of human cultural production of the premodern world. Scanning and digitization efforts over the last decade have made images of Islamicate manuscripts in a large number of collections available to the public. However, this data remains 'locked' for most scholarly uses because it has not been transcribed into symbolic text, which is required for many types of analysis. In fact, the script styles used in Islamicate manuscripts -- 'scribal hands' -- vary so widely and differ so substantially from modern forms that even manual close reading of these texts requires expert training and is thus limited to a small subset of researchers. The primary outcome of this project will be new techniques that 'unlock' the Islamicate written tradition by accurately transcribing it.
As a result, this project has the potential to be transformative for humanities disciplines such as Islamic and Near Eastern Studies by enabling libraries to accurately transcribe entire collections and, further, by allowing individual researchers to accurately transcribe manuscripts outside the western canon. Finally, this research will also support interdisciplinary training of a diverse set of graduate students at the University of California San Diego and the University of Maryland.
Current techniques for HTR require large amounts of in-domain supervised training data in order to produce highly accurate transcriptions. The neural architectures behind these modern methods are capable of generalizing, to some degree, across modern handwriting styles when trained on larger and more diverse collections of transcribed data. However, their limitations make these techniques impractical for large-scale transcription of Islamicate texts for two reasons: (1) scribal hand variation across Islamicate manuscripts is much more pronounced than stylistic variation in modern handwriting; and (2) transcriptions of Islamicate manuscripts that can be used as supervised training data are extremely scarce because accurate manual transcription requires expert training. This project will develop a new unsupervised learning framework for Islamicate HTR centered around a novel pretraining task: lacuna reconstruction. The new approach trains a neural encoder for images of manuscript text lines by learning to reconstruct masked regions -- i.e., lacunae -- of unlabeled manuscript images. This completely unsupervised training criterion implicitly incentivizes the model to discover and encode discrete [...]
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
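The lacuna-reconstruction idea described in the abstract can be illustrated with a minimal sketch. It only shows how masked regions of a text-line image might be created and how a reconstruction error restricted to those regions might be scored. The function names, the strip-shaped masks, the parameter values, and the pixel-level mean squared error are all assumptions made for illustration; the abstract does not specify the project's masking scheme, encoder, or training objective, and no encoder is included here.

```python
import numpy as np


def mask_lacunae(line_img, num_spans=2, span_width=16, rng=None):
    """Blank out random vertical strips of a text-line image.

    Hypothetical helper: the strips stand in for 'lacunae'. Returns the
    masked image and a boolean mask that is True where pixels were removed.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = line_img.shape
    mask = np.zeros((height, width), dtype=bool)
    for _ in range(num_spans):
        # Pick a start column so the strip fits inside the image.
        start = int(rng.integers(0, width - span_width + 1))
        mask[:, start:start + span_width] = True
    masked = np.where(mask, 0.0, line_img)
    return masked, mask


def reconstruction_loss(predicted, original, mask):
    """Mean squared error computed over the masked pixels only (assumed objective)."""
    diff = (predicted - original)[mask]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    line = rng.random((32, 256))  # stand-in for an unlabeled text-line image
    masked, mask = mask_lacunae(line, rng=rng)
    # A model would predict the full line from `masked`; here the masked
    # input itself serves as a trivial baseline prediction.
    print("baseline loss:", reconstruction_loss(masked, line, mask))
    print("perfect loss:", reconstruction_loss(line, line, mask))
```

In a training loop, the prediction passed to `reconstruction_loss` would come from a model that sees only the masked image, so the error can be reduced only by inferring the hidden content from the visible context.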
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Taylor Berg-Kirkpatrick
CAREER: Modeling Language Evolution via Deep Probabilistic Factorization
- Award number: 2146151
- Fiscal year: 2022
- Funding amount: $300,000
- Project type: Continuing Grant
RI: Small: Print and Probability - A Statistical Approach to Analysis of Clandestine Publication
- Award number: 1936155
- Fiscal year: 2019
- Funding amount: $300,000
- Project type: Standard Grant
RI: Small: Print and Probability - A Statistical Approach to Analysis of Clandestine Publication
- Award number: 1816311
- Fiscal year: 2018
- Funding amount: $300,000
- Project type: Standard Grant
RI: Small: Collaborative Research: Unsupervised Transcription of Early Modern Documents
- Award number: 1618044
- Fiscal year: 2016
- Funding amount: $300,000
- Project type: Standard Grant
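Similar NSFC (National Natural Science Foundation of China) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Funding amount: CNY 0
- Project type: Provincial/municipal project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Funding amount: CNY 240,000
- Project type: Special Fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Funding amount: CNY 240,000
- Project type: Special Fund project
Cell Research
- Award number: 30824808
- Approval year: 2008
- Funding amount: CNY 240,000
- Project type: Special Fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Funding amount: CNY 450,000
- Project type: General Program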
Similar overseas grants
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312841
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312842
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: RI: Small: Foundations of Few-Round Active Learning
- Award number: 2313131
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: RI: Medium: Lie group representation learning for vision
- Award number: 2313151
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Continuing Grant
Collaborative Research: RI: Small: Motion Fields Understanding for Enhanced Long-Range Imaging
- Award number: 2232298
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312840
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: RI: Small: Deep Constrained Learning for Power Systems
- Award number: 2345528
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: CompCog: RI: Medium: Understanding human planning through AI-assisted analysis of a massive chess dataset
- Award number: 2312374
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: CompCog: RI: Medium: Understanding human planning through AI-assisted analysis of a massive chess dataset
- Award number: 2312373
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: RI: Small: End-to-end Learning of Fair and Explainable Schedules for Court Systems
- Award number: 2232055
- Fiscal year: 2023
- Funding amount: $300,000
- Project type: Standard Grant