RI: Small: Collaborative Research: Unsupervised Transcription of Early Modern Documents
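Basic information
- Award number: 1618044
- Principal investigator:
- Award amount: $249,500
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2016
- Funding country: United States
- Period: 2016-09-01 to 2021-08-31
- Project status: Completed
- Source:
- Keywords:
Project abstract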
Recently, researchers in the social sciences and humanities have made increasing use of digital technologies in their work, seeking to answer important questions about human artifacts based on new kinds of analyses. However, since many of their methods are statistical in nature, they require a large amount of digitally readable text to operate. For example, to ask statistical questions about how the legal rights of women have changed during the past five centuries, a large and unbiased sample of court proceedings spanning that time period has to be accessible in digital form. Unfortunately, for many time periods this data is not available, not because the historical documents have been lost, but because they cannot be efficiently transcribed. In particular, the 400 years just after the invention of the printing press (the early modern period, ca. 1450-1850) represents a critical dark period for such research because documents from this period are notoriously hard to transcribe into machine-readable text with automatic methods for three reasons: they use obscure and unknown fonts, their text differs from modern language, and historical printing processes were imprecise. This proposal seeks to address these issues by treating transcription as a type of code-breaking and using machine learning to induce font and text structure directly from unannotated document images without relying on annotated examples, an approach called unsupervised learning. As a result, the proposal aims not just to digitize existing early modern corpora in major libraries, but also to produce a tool that researchers can use to digitize data at scale themselves and that is sufficiently flexible to develop new representations, for example, of non-standard character sets. The proposed approach treats the problem of document transcription as a linguistic decipherment problem, leveraging modeling techniques from work on decrypting historical ciphers. 
The key idea is that while properties like font and text structure are document-specific and therefore difficult to treat generally with supervised techniques, these phenomena are in fact regular within individual documents. For example, while the shape of a particular character in an obscure historical font may be unknown to the system, that shape is in fact regular; every time the character is printed it uses the same template. Models that leverage this kind of regularity by incorporating it as an assumption can constrain the otherwise difficult unsupervised learning problem and make it feasible. This proposal introduces a class of generative models with this goal in mind, designed to learn fonts and predict accurate transcriptions in an unsupervised fashion by capturing the core properties of the process that generated the input data: the historical printing process. These models represent the specific types of printing and typesetting noise exhibited by early modern documents, treat typesetting as a latent variable, and jointly consider possible character segmentations and transcriptions during inference. Their parameters can be estimated efficiently, directly from images of historical documents without accompanying transcriptions. Further, by treating damaged portions of the input documents as latent variables, this proposal aims to automatically reconstruct damaged documents using the same approach. The unsupervised techniques developed here may have uses in other areas of natural language processing where annotated training data is hard to obtain; for example, in personalized speech recognition and grounded semantics.
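The template-sharing assumption in the abstract can be illustrated with a toy sketch. It is not the project's model: a two-component Bernoulli mixture fitted with EM stands in for the proposed generative models, and typesetting, segmentation, and language structure are left out entirely. The 3x3 glyphs, the 5% pixel-flip noise, and all names (`TEMPLATES`, `print_glyph`, `fit_templates`) are hypothetical choices made for illustration. The point is only that, because every printed instance of a character reuses one template, the templates can be recovered from unlabeled images.

```python
import math
import random

# Two hidden 3x3 glyph templates (1 = ink). Hypothetical toy data; the
# learner below never sees these templates or any labels.
TEMPLATES = {
    "I": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "O": [1, 1, 1, 1, 0, 1, 1, 1, 1],
}


def print_glyph(template, flip_prob, rng):
    """Simulate noisy printing: each pixel flips with probability flip_prob."""
    return [1 - p if rng.random() < flip_prob else p for p in template]


def hamming(a, b):
    """Number of pixels on which two binary images differ."""
    return sum(x != y for x, y in zip(a, b))


def fit_templates(images, n_iter=25):
    """EM for a two-component Bernoulli mixture over binary images."""
    n_pix = len(images[0])
    # Deterministic initialisation: the first image and the image farthest
    # from it, softened so that no pixel probability starts at 0 or 1.
    seeds = [images[0], max(images, key=lambda im: hamming(im, images[0]))]
    theta = [[0.75 if p else 0.25 for p in s] for s in seeds]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each template for each image.
        resp = []
        for im in images:
            logp = [
                math.log(w)
                + sum(math.log(t if p else 1.0 - t) for p, t in zip(im, th))
                for w, th in zip(weights, theta)
            ]
            top = max(logp)
            unnorm = [math.exp(lp - top) for lp in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate mixing weights and per-pixel ink probabilities.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            weights[k] = min(max(mass / len(images), 1e-3), 1.0 - 1e-3)
            new_theta = []
            for d in range(n_pix):
                ink = sum(r[k] * im[d] for r, im in zip(resp, images))
                # Clamp to keep the logarithms in the E-step finite.
                new_theta.append(min(max(ink / max(mass, 1e-9), 0.05), 0.95))
            theta[k] = new_theta
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return theta, labels


rng = random.Random(0)
truth = ["I", "O"] * 30  # hidden character identities, used only for scoring
images = [print_glyph(TEMPLATES[c], 0.05, rng) for c in truth]
theta, labels = fit_templates(images)

# Cluster indices are arbitrary, so score up to a swap of the two labels.
agree = sum((lab == 0) == (c == "I") for lab, c in zip(labels, truth))
accuracy = max(agree, len(truth) - agree) / len(truth)
print(f"clustering accuracy (up to label swap): {accuracy:.2f}")
```

Thresholding each row of `theta` at 0.5 gives the learned glyph shapes. A real system would still need to map each cluster to a character, which is the part the abstract compares to decipherment.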
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants of Taylor Berg-Kirkpatrick
CAREER: Modeling Language Evolution via Deep Probabilistic Factorization
- Award number: 2146151
- Fiscal year: 2022
- Award amount: $249,500
- Award type: Continuing Grant
Collaborative Research: RI: Small: Unsupervised Islamicate Manuscript Transcription via Lacunae Reconstruction
- Award number: 2200333
- Fiscal year: 2022
- Award amount: $249,500
- Award type: Standard Grant
RI: Small: Print and Probability - A Statistical Approach to Analysis of Clandestine Publication
- Award number: 1936155
- Fiscal year: 2019
- Award amount: $249,500
- Award type: Standard Grant
RI: Small: Print and Probability - A Statistical Approach to Analysis of Clandestine Publication
- Award number: 1816311
- Fiscal year: 2018
- Award amount: $249,500
- Award type: Standard Grant
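Similar National Natural Science Foundation of China (NSFC) grants
Forensic application of circadian-rhythmic small RNAs in estimating the time of bloodstain formation
- Award number:
- Approval year: 2024
- Award amount: CNY 0
- Award type: Provincial/municipal project
Mechanism by which tRNA-derived small RNA upregulates the YBX1/CCL5 pathway in bortezomib-induced chronic pain
- Award number: n/a
- Approval year: 2022
- Award amount: CNY 100,000
- Award type: Provincial/municipal project
Small RNA regulation of the type I-F CRISPR-Cas adaptive immune response and its molecular mechanism
- Award number: 32000033
- Approval year: 2020
- Award amount: CNY 240,000
- Award type: Young Scientists Fund
Mechanism by which small RNAs regulate the biocontrol function of Bacillus amyloliquefaciens FZB42
- Award number: 31972324
- Approval year: 2019
- Award amount: CNY 580,000
- Award type: General Program
Mechanism by which Streptococcus mutans small RNAs link LuxS quorum sensing and biofilm formation
- Award number: 81900988
- Approval year: 2019
- Award amount: CNY 210,000
- Award type: Young Scientists Fund
Molecular mechanism of pigeon milk secretion analyzed by small RNA sequencing
- Award number: 31802058
- Approval year: 2018
- Award amount: CNY 260,000
- Award type: Young Scientists Fund
Function and mechanism of key gut-bacterial small RNAs in the onset and progression of Crohn's disease
- Award number: 31870821
- Approval year: 2018
- Award amount: CNY 560,000
- Award type: General Program
Pathogenic mechanism of rice grassy stunt virus regulated by small RNA-mediated DNA methylation
- Award number: 31772128
- Approval year: 2017
- Award amount: CNY 600,000
- Award type: General Program
Immune regulatory mechanism of acupuncture treatment for Hashimoto's thyroiditis based on small RNA-seq
- Award number: 81704176
- Approval year: 2017
- Award amount: CNY 200,000
- Award type: Young Scientists Fund
Regulation of small RNA biogenesis by rice OsSGS3 and OsHEN1 and its modulation of disease resistance
- Award number: 91640114
- Approval year: 2016
- Award amount: CNY 850,000
- Award type: Major Research Plan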
Similar overseas grants
Collaborative Research: RI: Small: Foundations of Few-Round Active Learning
- Award number: 2313131
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
Collaborative Research: RI: Small: Deep Constrained Learning for Power Systems
- Award number: 2345528
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
Collaborative Research: RI: Small: Motion Fields Understanding for Enhanced Long-Range Imaging
- Award number: 2232298
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
Collaborative Research: RI: Small: End-to-end Learning of Fair and Explainable Schedules for Court Systems
- Award number: 2232055
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
Collaborative Research: RI: Small: End-to-end Learning of Fair and Explainable Schedules for Court Systems
- Award number: 2232054
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
Collaborative Research: RI: Small: Motion Fields Understanding for Enhanced Long-Range Imaging
- Award number: 2232300
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
Collaborative Research: RI: Small: Motion Fields Understanding for Enhanced Long-Range Imaging
- Award number: 2232299
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
Collaborative Research: RI: Small: Foundations of Few-Round Active Learning
- Award number: 2313130
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
RI: Small: Collaborative Research: Evolutionary Approach to Optimal Morphology and Control of Transformable Soft Robots
- Award number: 2325491
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant
Collaborative Research: RI: Small: End-to-end Learning of Fair and Explainable Schedules for Court Systems
- Award number: 2334936
- Fiscal year: 2023
- Award amount: $249,500
- Award type: Standard Grant













