Training of machine-learning based procedures for automated postcorrection of OCRed historical printings

Basic information

Project abstract

OCR results for historical printings typically contain many recognition errors, so postcorrection methods play an important role in this field. Some automated postcorrection systems developed individually for a particular historical OCR corpus have shown good results. However, an "omnipotent" general system for automated postcorrection of OCR results, one that performs well across different OCR engines and arbitrary historical printings, remains an ambitious future goal. Within the OCR-D initiative, OCR postcorrection systems based on supervised machine learning are currently being developed. Ideally, these systems should be applicable to arbitrary OCR engines and historical texts. In this project we want to systematically study how training data and training methods influence the quality of the correction results. The long-term goal is the development of an "omnipotent" postcorrection model in the sense described above. As a first step, we look for training data and feature systems that lead to optimal correction results for specific OCR engines and classes of historical printings, and we analyze the correction problems that arise for other OCR engines and corpora. Using these results as a starting point, we search for methods that minimize the additional effort (in terms of ground truth preparation and post-training) needed to develop correction models for larger and inhomogeneous corpora. Specific points to be investigated include the combination of postcorrection models and the automated selection of a correction model for a given new OCR corpus.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants of Professor Dr. Klaus U. Schulz

Automated postcorrection of OCRed historical printings with integrated optional interactive postcorrection
  • Grant number:
    393215159
  • Fiscal year:
    2018
  • Funding amount:
    --
  • Project category:
    Research data and software (Scientific Library Services and Information Systems)
Development of a web-based system for the postcorrection of historical OCR'ed texts
  • Grant number:
    314731081
  • Fiscal year:
    2016
  • Funding amount:
    --
  • Project category:
    Research data and software (Scientific Library Services and Information Systems)
Domänen- und dokumentenadaptive Verfahren zur Nachkorrektur von OCR-Ergebnissen
(Domain- and document-adaptive methods for the postcorrection of OCR results)
  • Grant number:
    5419670
  • Fiscal year:
    2004
  • Funding amount:
    --
  • Project category:
    Research Grants
Erweiterung eines Abfragemodells für XML-Daten zur interaktiven Exploration
(Extension of a query model for XML data for interactive exploration)
  • Grant number:
    5231068
  • Fiscal year:
    2000
  • Funding amount:
    --
  • Project category:
    Research Grants

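Similar grants from the National Natural Science Foundation of China

Understanding structural evolution of galaxies with machine learning
  • Grant number:
    n/a
  • Approval year:
    2022
  • Funding amount:
    CNY 100,000
  • Project category:
    Provincial/municipal-level project
Optimal dynamic policies for non-standard stochastic scheduling models
  • Grant number:
    71071056
  • Approval year:
    2010
  • Funding amount:
    CNY 280,000
  • Project category:
    General Program
Self-organizing modeling and optimal control of microbial fermentation processes
  • Grant number:
    60704036
  • Approval year:
    2007
  • Funding amount:
    CNY 210,000
  • Project category:
    Young Scientists Fund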

Similar overseas grants

CAREER: Mitigating the Lack of Labeled Training Data in Machine Learning Based on Multi-level Optimization
  • Grant number:
    2339216
  • Fiscal year:
    2024
  • Funding amount:
    --
  • Project category:
    Continuing Grant
DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
  • Grant number:
    EP/Y029089/1
  • Fiscal year:
    2024
  • Funding amount:
    --
  • Project category:
    Research Grant
Collaborative Research: CyberTraining: Implementation: Small: Inclusive Cyberinfrastructure and Machine Learning Training to Advance Water Science Research
  • Grant number:
    2320980
  • Fiscal year:
    2024
  • Funding amount:
    --
  • Project category:
    Standard Grant
Collaborative Research: CyberTraining: Implementation: Small: Inclusive Cyberinfrastructure and Machine Learning Training to Advance Water Science Research
  • Grant number:
    2320979
  • Fiscal year:
    2024
  • Funding amount:
    --
  • Project category:
    Standard Grant
Unifying Pre-training and Multilingual Semantic Representation Learning for Low-resource Neural Machine Translation
  • Grant number:
    22KJ1843
  • Fiscal year:
    2023
  • Funding amount:
    --
  • Project category:
    Grant-in-Aid for JSPS Fellows
METEOR-Integrated Training Environment (METEORITE)
  • Grant number:
    10715026
  • Fiscal year:
    2023
  • Funding amount:
    --
  • Project category:
Alliance for Regenerative Rehabilitation Research & Training 2.0 (AR3T)
  • Grant number:
    10830114
  • Fiscal year:
    2023
  • Funding amount:
    --
  • Project category:
Interdisciplinary Training in Computational Neuroscience
  • Grant number:
    10746499
  • Fiscal year:
    2023
  • Funding amount:
    --
  • Project category:
DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
  • Grant number:
    2311500
  • Fiscal year:
    2023
  • Funding amount:
    --
  • Project category:
    Standard Grant
Collaborative Research: CyberTraining: Pilot: Operationalizing AI/Machine Learning for Cybersecurity Training
  • Grant number:
    2229976
  • Fiscal year:
    2023
  • Funding amount:
    --
  • Project category:
    Standard Grant