Semi-Automating Data Extraction for Systematic Reviews

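Basic Information

Grant number:
9326367
Principal investigator:
Randolph Bias
Amount:
$293,500
Host institution:
NORTHEASTERN UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2015
Funding country:
United States
Project period:
2015-09-20 to 2019-08-31
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/9326367
Keywords: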
Age; Area; Caring; Categories; Characteristics; Clinical; Clinical Trials; Collaborations; Community Medicine; Complement; Computer software; Computing Methodologies; Data; Data Element; Data Set; Databases; Decision Making; Effectiveness of Interventions; Elements; Evidence Based Medicine; Evidence based practice; Exercise; Feedback; Goals; Growth; Healthcare; Hour; Human Resources; Interdisciplinary Study; Intervention; Letters; Link; Literature; Machine Learning; Manuals; Medical; Medicine; Methodology; Methods; Modeling; Modernization; National Health Policy; Natural Language Processing; Online Systems; Outcome; Patient Care; Performance; Persons; Population Characteristics; Positioning Attribute; Process; Public Health; Publishing; Research; Research Personnel; Resources; Sample Size; Services; Software Tools; Standardization; Structure; System; Text; Training; Work; Workload; base; clinical practice; computerized tools; cost; cost efficient; data mining; design; evidence base; experience; improved; innovation; interest; learning strategy; member; novel; open source; process optimization; study characteristics; systematic review; tool; trial design; usability; web services; web-based tool

Project Abstract

DESCRIPTION (provided by applicant): Evidence-based medicine (EBM) looks to inform patient care with the totality of available relevant evidence. Systematic reviews are the cornerstone of EBM and are critical to modern healthcare, informing everything from national health policy to bedside decision-making. But conducting systematic reviews is extremely laborious (and hence expensive): producing a single review requires thousands of person-hours. Moreover, the exponential expansion of the biomedical literature base has imposed an unprecedented burden on reviewers, thus multiplying costs. Researchers can no longer keep up with the primary literature, and this hinders the practice of evidence-based care. The long-term aim of this work is to develop computational tools and methods that optimize the practice of EBM. The proposed work thus builds upon our previous successful efforts developing computational approaches that reduce the workload in EBM. More specifically, we aim to develop tools that semi-automate the laborious task of data extraction - identifying and extracting the information of interest (e.g., trial sample size, interventions and outcomes) from the free text of biomedical articles - via novel machine learning methods. Semi-automating this task will drastically reduce reviewer workload, thus enabling the practice of EBM in an age of information overload. Previous efforts to automate data extraction from articles describing clinical trials have shown promise, but lack the accuracy and scope necessary for real-world use. These approaches have been impeded by the absence of a large corpus of annotated clinical trials, and by the difficulty of constructing models to automatically extract all of the variables necessary for synthesis. We describe methodological innovations to overcome these hurdles.
First, to train our machine learning models we propose leveraging large existing databases that contain structured information about clinical trials, in lieu of the usual approach of collecting expensive manual annotations. Practically, this means we will be able to exploit a very large 'pseudo-annotated' dataset that is an order of magnitude bigger than what has been used in previous efforts, thus substantially improving model performance. Our extensive preliminary work demonstrates the promise and feasibility of this approach. Second, we propose novel machine learning models appropriate for the tasks of article categorization and data extraction for EBM. These models will specifically be designed to perform extraction of multiple, correlated data elements of interest while simultaneously classifying articles into clinically salient categories useful for EBM. We will rigorously evaluate the developed methods to assess their practical utility, specifically by comparing automated extraction accuracy to that achieved by trained systematic reviewers. And to make these methods useful to end-users (systematic reviewers), we will develop and evaluate open-source software and tools, including a web-based extraction tool that integrates our machine learning models to automatically extract information from uploaded articles (PDFs). We will conduct a user study to evaluate the utility and usability of this tool in practice.
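
The 'pseudo-annotated' dataset idea in the abstract can be illustrated with a small sketch. The abstract does not say how structured database entries are aligned to article text, so the token-overlap heuristic below is an assumption chosen for simplicity, not the applicants' method. The field names, the `threshold` value, and the example record and sentences are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(field_value, sentence):
    """Fraction of the structured field's tokens that also occur in the sentence."""
    field_tokens = set(tokenize(field_value))
    if not field_tokens:
        return 0.0
    sentence_tokens = set(tokenize(sentence))
    return len(field_tokens & sentence_tokens) / len(field_tokens)

def pseudo_annotate(sentences, record, threshold=0.6):
    """Label each sentence with the database fields it appears to express.

    `record` is a hypothetical structured entry (field name -> text) for the
    same trial; `threshold` is an arbitrary illustrative cutoff.
    """
    labels = []
    for sentence in sentences:
        matched = sorted(
            field for field, value in record.items()
            if overlap_score(value, sentence) >= threshold
        )
        labels.append((sentence, matched))
    return labels

# Hypothetical example data, invented for illustration only.
record = {
    "sample_size": "120 participants",
    "intervention": "aerobic exercise program",
    "outcome": "systolic blood pressure",
}
sentences = [
    "We randomized 120 participants to two arms.",
    "The intervention was a supervised aerobic exercise program.",
    "The primary outcome was change in systolic blood pressure.",
    "Funding was provided by a national agency.",
]

pseudo_labels = pseudo_annotate(sentences, record)
for sentence, fields in pseudo_labels:
    print(fields, "<-", sentence)
```

Sentences that receive a field label in this way could serve as noisy training examples for an extraction model, with no manual annotation step. A real system would likely need a more robust matching scheme than exact token overlap.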
