
A Knowledge Provider for Scruffy Sources of Metadata in Translational Medicine


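Basic information

Grant number: 10057243
Principal investigator: Mark A Musen
Amount: approx. $56,000
Host institution: STANFORD UNIVERSITY
Host institution country: United States
Project category:
Fiscal year: 2020
Funding country: United States
Project period: 2020-01-23 to 2020-04-07
Project status: Completed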

Project abstract

An essential task for the Biomedical Data Translator is to identify scientific experiments that have been performed or that are ongoing, and to enable integration of knowledge of the experimental methods, the results, and—when available—the conclusions with other knowledge sources. Such capabilities will enable queries such as: (1) Has anyone ever performed an experiment using methods like these? (2) Has anyone performed a study where the data may support a particular conclusion? (3) Are there any clinical trials for a particular condition whose patient population is a good match for a patient whom I now need to treat? (4) What best practices are suggested by the results of current clinical trials for a particular condition? Sometimes such queries can be addressed through an analysis of the scientific literature. More often, however, the published literature does not provide the methodological details needed to address such questions—even if NLP techniques were good enough to find the answers. Publications also provide only summary statistics of the experimental results. To address the kinds of queries that are of most interest to the Translator, it is necessary to access the actual experimental data online, starting with the metadata that are intended to provide descriptions of the datasets and of the experiments that led to the collection of the data in the first place. The problem for the Translator project is that the metadata that describe most online experimental data sources are difficult for computers to find and to process. Our laboratory’s analysis of the NCBI BioSample metadata repository, for example, shows that scientists largely avoid using standard data dictionaries entirely, and—partly as a result—they are extremely sloppy when they provide metadata values [3]. (A case in point: Some 76% of the metadata values in BioSample that are intended to be Boolean are neither true nor false.) 
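The Boolean example above suggests a simple audit that could be run over any field that is meant to hold true/false values. The following is a minimal sketch of such an audit, not the laboratory's actual analysis: the function name `audit_boolean_field`, the set of accepted spellings, and the sample values are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: only these spellings count as valid Booleans (case-insensitive).
# The abstract does not say which spellings its analysis accepted.
VALID_BOOLEANS = {"true", "false"}


def audit_boolean_field(values):
    """Return (fraction_invalid, invalid_counts) for a field meant to be Boolean."""
    invalid = Counter()
    total = 0
    for raw in values:
        total += 1
        normalized = raw.strip().lower()
        if normalized not in VALID_BOOLEANS:
            invalid[normalized] += 1
    fraction = sum(invalid.values()) / total if total else 0.0
    return fraction, invalid


# Made-up sample values for illustration; not taken from BioSample.
sample = ["true", "FALSE", "yes", "n/a", "unknown", "False", "1", ""]
fraction, invalid = audit_boolean_field(sample)
print(f"{fraction:.0%} of values are neither true nor false")
print(invalid.most_common())
```

Counting the invalid spellings, rather than only flagging them, shows which variants (such as "yes" or "n/a") are common enough to deserve an explicit correction rule.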
Despite all the discussion in the past few years about making online datasets Findable, Accessible, Interoperable, and Re-usable (FAIR) [14], most online datasets are not close to FAIR. Our laboratory is developing technology that can rectify errors in online metadata. Like a spell-checker for metadata, our approach will attempt to identify the intentions of metadata authors, to correct typos, and to convert free-text strings to ontology terms whenever possible [6]. Our goal is to provide a service that will transform the scruffy metadata that pervade online descriptions of biomedical experiments into a form that will allow automated discovery, integration, and secondary analysis of research results in ways that are simply not possible at present. We anticipate that the Translator will call on our service to find experimental datasets and their accompanying metadata, to perform standard analyses of such datasets, and to integrate descriptions of experiments into the evolving knowledge graph.
We will evaluate the performance of our Knowledge Provider by studying its response to queries from the Translator community and by peer review of a subset of the underlying, cleaned-up metadata records that it processes from actual online repositories, such as BioSample and ClinicalTrials.gov. Our evaluation will necessarily be limited by the pragmatics of selecting a manageable test set of metadata and by the inherent shortcomings of manual peer review.
Our laboratory has a sustained tradition of collaborating to develop major national resources that bring semantic technology to biomedicine. Our BioPortal ontology repository [5] was developed by the National Center for Biomedical Ontology (NCBO), one of the NIH National Centers for Biomedical Computing. The CEDAR Workbench for the prospective authoring of standardized metadata [11,12] was developed under the NIH Big Data to Knowledge (BD2K) program. Our Protégé system for building and maintaining biomedical ontologies is the most widely used software for creating semantic technology in the world [15]. Our group has ongoing relationships with corporations such as Pinterest, BASF, and Elsevier to assist them in their work to develop enterprise-wide knowledge graphs. We are thus well equipped to develop our Knowledge Provider and to assist the consortium broadly in the area of semantic technology.
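The spell-checker analogy in the abstract (correct typos, then map free-text strings to ontology terms) can be illustrated with a small sketch. This is not the project's method: it swaps in plain string similarity from Python's standard `difflib` module, and the vocabulary, the `EX:` identifiers, the function name `suggest_term`, and the 0.8 cutoff are hypothetical placeholders.

```python
import difflib

# Hypothetical mini-vocabulary mapping labels to placeholder identifiers.
# A real service would draw labels and identifiers from an ontology repository.
VOCABULARY = {
    "homo sapiens": "EX:0001",
    "mus musculus": "EX:0002",
    "liver": "EX:0003",
    "blood": "EX:0004",
}


def suggest_term(free_text, vocabulary=VOCABULARY, cutoff=0.8):
    """Return (label, identifier) for the closest vocabulary label, or None."""
    # Normalize case and whitespace before comparing.
    cleaned = " ".join(free_text.lower().split())
    matches = difflib.get_close_matches(cleaned, list(vocabulary), n=1, cutoff=cutoff)
    if not matches:
        return None  # No label is similar enough; leave the value for review.
    label = matches[0]
    return label, vocabulary[label]


print(suggest_term("Homo sapeins"))  # typo in the species name
print(suggest_term("  LIVER "))      # stray whitespace and capitalization
print(suggest_term("kidney"))        # not in this toy vocabulary
```

Returning `None` below the cutoff, instead of forcing a best guess, keeps uncertain values out of the cleaned record; the right threshold would have to be chosen empirically.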
