Development of a Repository for OCR Models and an Automatic Font Recognition tool OCR-D
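Basic information
- Grant number: 394448308
- Principal investigator:
- Amount: --
- Host institution:
- Host institution country: Germany
- Project category: Research data and software (Scientific Library Services and Information Systems)
- Fiscal year: 2018
- Funding country: Germany
- Duration: 2017-12-31 to 2019-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract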
The project addresses the strongly fluctuating OCR recognition rates for historical prints of the 16th to 18th centuries, which limit the full-text digitization of material created by the VD16, VD17, and VD18 programs. Recognition models trained on modern corpora that lack the specifics of historical prints, or on historical material without thorough bibliographic analysis, yield lower recognition rates than the accuracy now routinely achieved for scans of modern prints. Creating font-specific corpora by manual tagging is unrealistic: it requires non-trivial knowledge of printing history, and the approach would not scale. Because the task is repetitive, it is also very error-prone. The project will enable the humanities to use OCR in a font-specific manner with limited effort. To achieve this, the project has three main objectives:
- Development of an online training infrastructure that allows specific models to be trained for these font groups and, at the same time, for different OCR software.
- Development of a tool for the automatic recognition of fonts in digitizations of historical prints. An algorithm for recognizing fonts in incunabula is first trained using the ground truth found in the Typenrepertorium der Wiegendrucke. In a second step, the fonts are grouped according to their similarity in order to obtain as few groups as possible while maintaining OCR accuracy.
- Provision of a model repository in which the developed font-specific OCR models are made available to the public.
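The abstract does not say how fonts are grouped by similarity. As a minimal sketch only, one possible reading is complete-linkage agglomerative grouping with a distance cutoff. Everything here is an assumption for illustration: the function name `group_fonts`, the toy feature vectors, and the `max_distance` threshold, which merely stands in for the "while maintaining OCR accuracy" constraint. It is not the project's actual method.

```python
from itertools import combinations
from math import dist


def group_fonts(features, max_distance):
    """Group fonts by similarity using complete-linkage agglomeration.

    features: dict mapping a font name to a numeric feature vector
        (hypothetical; the abstract does not specify a representation).
    max_distance: largest distance allowed between any two fonts that
        end up in the same group (an assumed proxy for OCR accuracy).
    """
    # Start with every font in its own group.
    groups = [[name] for name in sorted(features)]

    def linkage(a, b):
        # Complete linkage: distance of the two most dissimilar members.
        return max(dist(features[x], features[y]) for x in a for y in b)

    while len(groups) > 1:
        # Pick the pair of groups that are closest to each other.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda p: linkage(groups[p[0]], groups[p[1]]),
        )
        if linkage(groups[i], groups[j]) > max_distance:
            break  # any further merge would mix fonts that are too dissimilar
        groups[i] = groups[i] + groups[j]
        del groups[j]  # safe: combinations guarantees i < j
    return groups


# Toy data: two tight clusters of made-up font features.
fonts = {
    "fraktur_a": (0.0, 0.0),
    "fraktur_b": (0.1, 0.0),
    "antiqua_a": (1.0, 1.0),
    "antiqua_b": (1.0, 1.1),
}
print(group_fonts(fonts, max_distance=0.5))
```

Raising `max_distance` yields fewer, broader groups; lowering it keeps groups narrower. In practice the cutoff would have to be chosen by measuring OCR accuracy per group, which this sketch does not do.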
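Project outputs
Number of journal articles (0)
Number of monographs (0)
Number of research awards (0)
Number of conference papers (0)
Number of patents (0)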
Other publications by Professor Dr. Manuel Burghardt, since 11/2019
Similar overseas grants
DMS/NIGMS 2: Deep learning for repository-scale analysis of tandem mass spectrometry proteomics data
- Grant number: 2245300
- Fiscal year: 2023
- Funding amount: --
- Project category: Continuing Grant
Continued Curation of the Marine Geology and Geophysics Collection in the OSU/CEOAS Marine and Geology Repository
- Grant number: 2310875
- Fiscal year: 2023
- Funding amount: --
- Project category: Continuing Grant
NIDDK Extramural Digital Pathology Repository System (HALO LINK)
- Grant number: 10884865
- Fiscal year: 2023
- Funding amount: --
- Project category:
A software tool to facilitate variable-level equivalency and harmonization in research data: Leveraging the NIH Common Data Elements Repository to link concepts and measures in an open format
- Grant number: 10821517
- Fiscal year: 2023
- Funding amount: --
- Project category:
THE NIH NEUROBIOBANK BRAIN AND TISSUE REPOSITORY (NBTR) TO PROVIDE SERVICES THAT WILL ACTIVELY ACQUIRE, RECEIVE, STORE, CURATE, PRESERVE, AND DISTRIBUTE CNS AND RELATED BIOLOGICAL SPECIMENS TO QUALIFI
- Grant number: 10948523
- Fiscal year: 2023
- Funding amount: --
- Project category:
THE NIH NEUROBIOBANK BRAIN AND TISSUE REPOSITORY (NBTR) TO PROVIDE SERVICES THAT WILL ACTIVELY ACQUIRE, RECEIVE, STORE, CURATE, PRESERVE, AND DISTRIBU
- Grant number: 10916992
- Fiscal year: 2023
- Funding amount: --
- Project category:
An international repository of individual patient data to predict development of chronic post-surgical pain after knee replacement surgery
- Grant number: 489302
- Fiscal year: 2023
- Funding amount: --
- Project category: Operating Grants
CENTRALIZED CHEMOPREVENTIVE AGENT REPOSITORY AND CHEMISTRY SUPPORT
- Grant number: 10884587
- Fiscal year: 2023
- Funding amount: --
- Project category:
STUDY OF PREGNANCY AND NEONATAL HEALTH (SPAN) SPECIMEN REPOSITORY
- Grant number: 10927705
- Fiscal year: 2023
- Funding amount: --
- Project category: