Automated Extraction of Fields of Interest from Unstructured Documents

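Basic information

  • Grant number:
    571403-2021
  • Principal investigator:
  • Amount:
    $64,700
  • Host institution:
  • Host institution country:
    Canada
  • Project category:
    Alliance Grants
  • Fiscal year:
    2022
  • Funding country:
    Canada
  • Project period:
    2022-01-01 to 2023-12-31
  • Project status:
    Completed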

Project abstract

The written knowledge of human experience is immense and rich. However, much of this knowledge remains unknown and inaccessible. Natural language processing (NLP) represents one avenue to unlock this valuable information by understanding text automatically. This potential is being realized with the recent emergence of foundation models, or large language models, which combine the power of self-supervised deep neural networks with unfathomably large and broad data sets. One hallmark of these models is the concept of "transfer learning," whereby a model is trained on one task and can then be fine-tuned to complete a related but different task. In this proposal, the team will focus on recent advances with large language models (e.g., BERT, GPT-3) in NLP to extract valuable information from unstructured documents. Despite their lack of cohesion and organization, unstructured documents often come with embedded structured data. Representing useful, structured information as tables is a common and effective method in scientific, financial, health, and many other domains for information retrieval and other tasks. However, manual extraction of structured data from documents typically costs tremendous time and labour, which motivates the need for a system that automates the process. Specifically, this proposal will leverage the advances made by the BERT and T5 models to develop tools for automated table extraction from documents. That is, given a collection of "fields of interest" (FoIs), the task is to automatically extract and populate a table with the FoIs as column headers. To illustrate that our tools are general and broadly applicable, we will demonstrate the effectiveness of our algorithms in two very different application domains: (1) breast cancer care; and (2) mineral mining. After such tables have been extracted, the data can be used for various "downstream" analytics tasks such as question answering and predictive modeling.
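The task described in the abstract (a set of fields of interest goes in, a table with those fields as column headers comes out) can be sketched in a few lines of Python. This is only an illustration of the input and output shape, not the project's method: the abstract names BERT and T5, whereas this sketch substitutes simple regular expressions as placeholder extractors. The function names (`regex_extractor`, `extract_table`), the field names (`commodity`, `depth_m`), and the sample documents are all hypothetical and invented for the example.

```python
import re
from typing import Callable, Dict, List, Optional

# A field extractor maps raw document text to a value, or None if not found.
Extractor = Callable[[str], Optional[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build a placeholder extractor from a regex with one capture group.

    Assumption: a real system would call a fine-tuned language model here;
    the regex only stands in for that step.
    """
    compiled = re.compile(pattern, re.IGNORECASE)

    def extract(text: str) -> Optional[str]:
        match = compiled.search(text)
        return match.group(1).strip() if match else None

    return extract


def extract_table(
    documents: List[str],
    extractors: Dict[str, Extractor],
) -> List[Dict[str, Optional[str]]]:
    """Return one row per document, with the FoI names as column headers."""
    return [
        {field: extract(doc) for field, extract in extractors.items()}
        for doc in documents
    ]


# Hypothetical fields of interest and invented sample text.
FIELDS: Dict[str, Extractor] = {
    "commodity": regex_extractor(r"commodity:\s*(\w+)"),
    "depth_m": regex_extractor(r"depth of\s*(\d+)\s*m"),
}

DOCS = [
    "Drill report. Commodity: copper. Hole reached a depth of 120 m.",
    "Survey notes with no drilling. Commodity: gold.",
]

if __name__ == "__main__":
    for row in extract_table(DOCS, FIELDS):
        print(row)
```

Each row is a dictionary keyed by the FoI names, so a field that cannot be found in a document appears as `None` rather than being dropped. That keeps every row aligned to the same columns, which is what downstream tasks such as question answering or predictive modeling would need.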

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Ng, Raymond T

Similar overseas grants

A novel damage characterization technique based on adaptive deconvolution extraction algorithm of multivariate AE signals for accurate diagnosis of osteoarthritic knees
  • Grant number:
    24K07389
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Grant-in-Aid for Scientific Research (C)
PROTSENS Rethinking Alternative PROTein Extraction: Decoding SENsory-Protein Extraction Relationships
  • Grant number:
    EP/Z000785/1
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Fellowship
Extraction and Use of Highly Explainable and Transferable Indicators for AI in Education
  • Grant number:
    23K25698
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Grant-in-Aid for Scientific Research (B)
Wide-area low-cost sustainable ocean temperature and velocity structure extraction using distributed fibre optic sensing within legacy seafloor cables
  • Grant number:
    NE/Y003365/1
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Research Grant
Unlocking mine waste potential: carbon sequestration and metals extraction
  • Grant number:
    LP230100371
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Linkage Projects
Objective Measurement and Analysis of Fibre Extraction
  • Grant number:
    10089054
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Collaborative R&D
RII Track-4:@NASA: Automating Character Extraction for Taxonomic Species Descriptions Using Neural Networks, Transformer, and Computer Vision Signal Processing Architectures
  • Grant number:
    2327168
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Standard Grant
I-Corps: Translation potential of a miniaturized biotechnology platform for nucleic acid extraction, purification, and library preparation
  • Grant number:
    2421022
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Standard Grant
BuildZero: transforming the UK's buildings for zero material extraction, zero carbon and zero waste
  • Grant number:
    EP/Y530578/1
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Research Grant
CAREER: Understanding Electrochemical Metal Extraction in Molten Salts from First Principles
  • Grant number:
    2340765
  • Fiscal year:
    2024
  • Funding amount:
    $64,700
  • Project category:
    Continuing Grant