Collaborative Research: RI: Small: Unsupervised Islamicate Manuscript Transcription via Lacunae Reconstruction

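Basic information

Award number:
2200333
Principal investigator:
Taylor Berg-Kirkpatrick
Amount:
approx. $300,000
Awardee institution:
University of California-San Diego
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2022
Funding country:
United States
Period of performance:
2022-07-01 to 2025-06-30
Project status:
Ongoing (not yet closed)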

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2200333&HistoricalAwards=false
Keywords:
Collaborative Research RI Small Unsupervised

Project abstract

This award tackles handwritten text recognition (HTR, the task of automatically transcribing images of handwritten manuscripts into symbolic text) for Islamicate manuscripts, a domain that encompasses Persian and Arabic written traditions originating in the premodern Islamic world (7th-19th centuries). HTR for modern text is itself a challenging problem that has received substantial attention from the fields of machine learning (ML) and artificial intelligence (AI). However, the predominance of modern text in HTR research is, to some extent, waning: current techniques are relatively robust on modern data, and contemporary written media production is already almost entirely digital. In contrast, historical manuscripts have received comparatively less attention from ML and AI, and at the same time represent both an exceptional opportunity for impact and a set of unique challenges for ML techniques. Specifically, the written traditions of the Islamicate world together form one of the largest -- if not the largest -- archives of human cultural production of the premodern world. Scanning and digitization efforts over the last decade have made images of Islamicate manuscripts in a large number of collections available to the public. However, this data remains 'locked' for most scholarly uses because it has not been transcribed into symbolic text, which is required for many types of analysis. In fact, the script styles used in Islamicate manuscripts -- 'scribal hands' -- vary so widely and differ so substantially from modern forms that even manual close reading of these texts requires expert training and is thus limited to a small subset of researchers. The primary outcome of this project will be new techniques that 'unlock' the Islamicate written tradition by accurately transcribing it. As a result, this project has the potential to be transformative for humanities disciplines such as Islamic and Near Eastern Studies by enabling libraries to accurately transcribe entire collections and, further, by allowing individual researchers to accurately transcribe manuscripts outside the western canon. Finally, this research will also support interdisciplinary training of a diverse set of graduate students at the University of California San Diego and the University of Maryland.

Current techniques for HTR require large amounts of in-domain supervised training data in order to produce highly accurate transcriptions. The neural architectures behind these modern methods are capable of generalizing, to some degree, across modern handwriting styles when trained on larger and more diverse collections of transcribed data. However, their limitations make these techniques impractical for large-scale transcription of Islamicate texts for two reasons: (1) scribal hand variation across Islamicate manuscripts is much more pronounced than stylistic variation in modern handwriting; and (2) transcriptions of Islamicate manuscripts that can be used as supervised training data are extremely scarce because accurate manual transcription requires expert training. This project will develop a new unsupervised learning framework for Islamicate HTR centered around a novel pretraining task: lacuna reconstruction. The new approach trains a neural encoder for images of manuscript text lines by learning to reconstruct masked regions -- i.e., lacunae -- of unlabeled manuscript images. This completely unsupervised training criterion implicitly incentivizes the model to discover and encode discrete [...]

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
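
The lacuna-reconstruction objective described in the abstract might be illustrated roughly as follows. This is only a minimal sketch under stated assumptions, not the project's actual method: the abstract does not specify the mask shape, the fill value, or the loss, so the column-span masking, zero fill, and mean-squared-error loss used here are all assumptions, and the names `mask_lacunae` and `lacuna_reconstruction_loss` are hypothetical. A random array stands in for an unlabeled text-line image, and no neural encoder is included.

```python
import numpy as np


def mask_lacunae(line_image, num_spans=2, span_width=8, rng=None):
    """Hide random column spans of a text-line image (hypothetical helper).

    Returns the masked copy and a boolean mask that is True where pixels
    were hidden.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = line_image.shape
    mask = np.zeros((height, width), dtype=bool)
    for _ in range(num_spans):
        # Choose a start column so the whole span fits inside the image.
        start = int(rng.integers(0, width - span_width + 1))
        mask[:, start:start + span_width] = True
    masked = line_image.copy()
    masked[mask] = 0.0  # assumption: hidden pixels are filled with zeros
    return masked, mask


def lacuna_reconstruction_loss(prediction, target, mask):
    """Mean squared error over the hidden pixels only (assumed loss)."""
    diff = (prediction - target)[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
line_image = rng.random((32, 128))  # stand-in for an unlabeled line image
masked_image, mask = mask_lacunae(line_image, rng=rng)

# A trivial "model" that echoes its input is wrong only inside the lacunae.
loss_identity = lacuna_reconstruction_loss(masked_image, line_image, mask)
# A perfect reconstruction incurs zero loss.
loss_perfect = lacuna_reconstruction_loss(line_image, line_image, mask)
```

Restricting the loss to the hidden pixels means a model cannot score well by copying the visible input; it has to infer the missing strokes from context, which is the intuition the abstract gives for why this criterion needs no transcriptions.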
