PheneBank: automatic extraction and validation of a database of human phenotype-disease associations in the scientific literature

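Basic information

Grant number:
MR/M025160/1
Principal investigator:
Nigel Collier
Amount:
$591,200
Host institution:
University of Cambridge
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2015
Funding country:
United Kingdom
Duration:
2015 to (end date not available)
Project status:
Completed

Source:
https://gtr.ukri.org/projects?ref=MR%2FM025160%2F1
Keywords:
PheneBank automatic extraction validation database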

Project abstract

Free-text scientific literature is a potentially valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community. In recent years, a small number of expert curators have spent significant effort creating coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) [1] and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies.
There are three major challenges we seek to address: (1) knowledge brokering: to develop state-of-the-art text mining approaches to identify phenotypic descriptions in scientific texts; (2) knowledge management: to create a structured resource of phenotype terms used in scientific texts and link them to existing coding systems; and (3) adding insight to evidence: to work with domain experts to utilize statistical association algorithms to identify meaningful phenotype-disease and phenotype-gene profiles. The disease profiles will be evaluated against hand-curated standards in human disease databases (e.g. Online Mendelian Inheritance in Man and Orphanet) with a focus on rare diseases. Mined data will be provided in a machine-understandable database, a definitive output of the project, to support clinicians and scientists. At the technological level the project will pioneer new methods for text mining that exploit machine learning (ML).
Scientific texts remain a challenging area for a variety of reasons: descriptive naming, high levels of ambiguity and out-of-vocabulary words, use of complex sentence structures and an evolving vocabulary. Current techniques in term recognition employ ML in pipelines to search for continuous sequences of words that represent genes, proteins, cells and so on. State-of-the-art models include conditional random fields using feature sets based on dictionaries as well as the local and topical context where the term is located. However, phenotype descriptions are often represented by discontinuous sequences, such as 'growth in the patient was delayed'. One key aspect not previously addressed is the capture of such non-canonical terms. This requires a different paradigm based on grammatical parsing algorithms that capture structural relations, as well as joint learning techniques that can leverage large numbers of features simultaneously and optimise these across the diverse contexts in which phenotypes are mentioned.
The project also seeks to harness texts for extracting statistically significant associations between phenotypes, diseases and genes. Earlier approaches have suffered from not providing deep semantic descriptions of the relations they tried to target. This means that association scores merge notions of genetic, pharmacological and epidemiological relations without distinction. Our parsing-based approach is an attempt to overcome this issue by discovering more precise relationships. The approach follows groundbreaking work at the Wellcome Trust Sanger Institute (WTSI), including terminology alignment of phenotypes using pairwise scoring of the conceptual elements that make up the phenotype.
An exciting aspect of this project is interdisciplinary collaboration across stakeholders to build a resource of phenotype-disease profiles: (a) computer scientists from the Universities of Cambridge, Colorado and Manchester; (b) bioinformaticians and life scientists from the WTSI, McGill University and EMBL-EBI; and (c) clinicians from the NIHR Bioresource.
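
The abstract mentions statistical association algorithms for building phenotype-disease profiles but does not say which statistic is used. As a rough illustration only, the sketch below scores phenotype-disease pairs with pointwise mutual information (PMI) over document-level co-occurrence. PMI is an assumption chosen for simplicity, not the project's stated method, and the function name `pmi_scores`, the toy documents, and the labels "disease A" and "disease B" are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def pmi_scores(documents):
    """Score phenotype-disease pairs by pointwise mutual information.

    `documents` is a list of (phenotypes, diseases) pairs, each a set of
    terms mentioned in one document. This is a hypothetical input format;
    PMI is an illustrative choice, not the statistic named by the project.
    """
    n_docs = len(documents)
    phenotype_counts = Counter()
    disease_counts = Counter()
    pair_counts = Counter()
    for phenotypes, diseases in documents:
        phenotype_counts.update(set(phenotypes))
        disease_counts.update(set(diseases))
        # Count each co-occurring pair once per document.
        pair_counts.update(product(set(phenotypes), set(diseases)))

    scores = {}
    for (phenotype, disease), joint in pair_counts.items():
        # PMI = log2( P(p, d) / (P(p) * P(d)) ), with document frequencies.
        expected = phenotype_counts[phenotype] * disease_counts[disease]
        scores[(phenotype, disease)] = math.log2(joint * n_docs / expected)
    return scores


# Toy corpus: made-up documents, not real PheneBank data.
toy_documents = [
    ({"growth delay"}, {"disease A"}),
    ({"growth delay"}, {"disease A"}),
    ({"body weight loss"}, {"disease B"}),
    ({"growth delay", "body weight loss"}, {"disease B"}),
]

scores = pmi_scores(toy_documents)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A positive score means the pair co-occurs more often than chance would predict from the individual frequencies, and a negative score means less often. As the abstract notes, a raw co-occurrence score of this kind cannot tell genetic, pharmacological and epidemiological relations apart, which is the limitation the parsing-based approach is meant to address.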

Project outputs

Journal articles (10)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Large-scale Exploration of Neural Relation Classification Architectures

DOI:
10.18653/v1/d18-1250
Published:
2018
Journal:
Impact factor:
0
Authors:
Hoang-Quynh Le;Duy-Cat Can;Sinh T. Vu;T. Dang;Mohammad Taher Pilehvar;Nigel Collier
Corresponding author:
Hoang-Quynh Le;Duy-Cat Can;Sinh T. Vu;T. Dang;Mohammad Taher Pilehvar;Nigel Collier

A pragmatic guide to geoparsing evaluation

DOI:
10.17863/cam.55940
Published:
2019
Journal:
Impact factor:
0
Authors:
Gritta M
Corresponding author:
Gritta M

Other publications by Nigel Collier

Text Readability and Coreference Annotation across Heterogeneous Media for the Digital Archive of Rare Books

DOI:
Published:
2004
Journal:
The Journal of the Institute of Image Electronics Engineers of Japan (in Japanese) Vol.33, No.5
Impact factor:
0
Authors:
Asanobu Kitamoto;Takeo Yamamoto;Sonoko Sato;Nigel Collier;Ai Kawazoe;Kinji Ono
Corresponding author:
Kinji Ono

Incorporating topic information into semantic analysis models

DOI:
10.3115/1219044.1219069
Published:
2004
Journal:
Chemistry Letters
Impact factor:
1.6
Authors:
Tony Mullen;Nigel Collier
Corresponding author:
Nigel Collier

Synthetic Examples Improve Cross-Target Generalization: A Study on Stance Detection on a Twitter corpus.

DOI:
Published:
2021
Journal:
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Impact factor:
0
Authors:
Costanza Conforti;Jakob Berndt;Mohammad Taher Pilehvar;Chryssi Giannitsarou;Flavio Toxvaerd;Nigel Collier
Corresponding author:
Nigel Collier

Annotation of Biomedical Texts for Zone Analysis

DOI:
Published:
2004
Journal:
Impact factor:
0
Authors:
Y. Mizuta;Tony Mullen;Nigel Collier
Corresponding author:
Nigel Collier

在来産業の展開と資本主義, 有志舎, 佐々木寛司, 勝部真人編『講座, 明治維新』第8巻

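The Development of Indigenous Industry and Capitalism, in 佐々木寛司 and 勝部真人 (eds.), Lectures on the Meiji Restoration, vol. 8, 有志舎

DOI:
Published:
2013
Journal:
Impact factor:
0
Authors:
杉山将;Nigel Collier;高田輝子;山崎志郎;中西聡;Yoshiaki Ogura and Hirofumi Uchida;Shingo IOKIBE;林采成;北澤満;山本達司;T.Takada,A.Inoue;冨善一敏;松村敏弘;湯澤規子;Keiichi Hori and Hiroshi Osano;谷本雅之
Corresponding author:
谷本雅之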