PheneBank: automatic extraction and validation of a database of human phenotype-disease associations in the scientific literature
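Basic information
- Grant number: MR/M025160/1
- Principal investigator:
- Amount: $591,200
- Host institution:
- Host institution country: United Kingdom
- Project category: Research Grant
- Fiscal year: 2015
- Funding country: United Kingdom
- Duration: 2015 to (no data)
- Project status: Completed
- Source:
- Keywords:
Project abstract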
Free-text scientific literature has the potential to be a highly valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community. In recent years, significant effort has been spent by a small number of expert curators to create coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) [1] and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies.
There are three major challenges we seek to address: (1) knowledge brokering: to develop state-of-the-art text mining approaches to identify phenotypic descriptions in scientific texts; (2) knowledge management: to create a structured resource of phenotype terms used in scientific texts and link them to existing coding systems; and (3) adding insight to evidence: to work with domain experts to utilize statistical association algorithms to identify meaningful phenotype-disease and phenotype-gene profiles. The disease profiles will be evaluated against hand-curated standards in human disease databases (e.g. Online Mendelian Inheritance in Man and Orphanet) with a focus on rare diseases. Mined data will be provided in a machine-understandable database, a definitive output of the project, to support clinicians and scientists. At the technological level the project will pioneer new methods for text mining that exploit machine learning (ML).
Scientific texts remain a challenging area for a variety of reasons: descriptive naming, high levels of ambiguity and out-of-vocabulary words, use of complex sentence structures and an evolving vocabulary. Current techniques in term recognition employ ML in pipelines to search for continuous sequences of words that represent genes, proteins, cells and so on. State-of-the-art models include conditional random fields using feature sets based on dictionaries as well as the local and topical context where the term is located. However, phenotype descriptions are often represented by discontinuous sequences, such as 'growth in the patient was delayed'. One key aspect not previously addressed is the capture of such non-canonical terms. This requires a different paradigm based on grammatical parsing algorithms that capture structural relations, as well as joint learning techniques that can leverage large numbers of features simultaneously and optimise these across the diverse contexts in which phenotypes are mentioned.
The project also seeks to harness texts for extracting statistically significant associations between phenotypes, diseases and genes. Earlier approaches have suffered from not providing deep semantic descriptions of the relations they tried to target. This means that association scores merge notions of genetic, pharmacological and epidemiological relations without distinction. Our parsing-based approach is an attempt to overcome this issue by discovering more precise relationships. The approach follows groundbreaking work at the Wellcome Trust Sanger Institute (WTSI), including terminology alignment of phenotypes using pairwise scoring of the conceptual elements that make up the phenotype.
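To make the notion of an association score concrete, the following is a minimal sketch of a co-occurrence score (pointwise mutual information) over phenotype and disease mentions. It is an illustration only: the abstract does not say which statistic the project uses, and the toy documents, the disease labels and the function name `pmi_scores` are all assumptions invented here. A score of this kind also shows the limitation described above, since it counts co-occurrence without distinguishing the type of relation involved.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy corpus (invented for illustration): each document is
# reduced to the sets of phenotype and disease mentions found in it.
docs = [
    {"phenotypes": {"growth delay"}, "diseases": {"disease A"}},
    {"phenotypes": {"growth delay", "body weight loss"}, "diseases": {"disease A"}},
    {"phenotypes": {"body weight loss"}, "diseases": {"disease B"}},
    {"phenotypes": {"growth delay"}, "diseases": {"disease B"}},
]


def pmi_scores(docs):
    """Return PMI (log base 2) for every phenotype-disease pair that co-occurs."""
    n = len(docs)
    pheno_counts = Counter()
    disease_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        # Document-level counts: a mention is counted once per document.
        pheno_counts.update(doc["phenotypes"])
        disease_counts.update(doc["diseases"])
        pair_counts.update(product(doc["phenotypes"], doc["diseases"]))

    scores = {}
    for (pheno, disease), joint in pair_counts.items():
        p_joint = joint / n
        p_pheno = pheno_counts[pheno] / n
        p_disease = disease_counts[disease] / n
        # PMI > 0: the pair co-occurs more often than independence predicts.
        scores[(pheno, disease)] = math.log2(p_joint / (p_pheno * p_disease))
    return scores


scores = pmi_scores(docs)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy data 'growth delay' scores above zero with "disease A" and below zero with "disease B", but the score alone cannot say whether the underlying relation is genetic, pharmacological or epidemiological.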
An exciting aspect of this project is inter-disciplinary collaboration across stakeholders to build a resource of phenotype-disease profiles: (a) computer scientists from the Universities of Cambridge, Colorado and Manchester; (b) bioinformaticians and life scientists from the WTSI, McGill University and EMBL-EBI, and (c) clinicians from the NIHR Bioresource.
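Project outputs
Number of journal articles (10)
Number of monographs (0)
Number of research awards (0)
Number of conference papers (0)
Number of patents (0)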
Large-scale Exploration of Neural Relation Classification Architectures
- DOI: 10.18653/v1/d18-1250
- Publication year: 2018
- Journal:
- Impact factor: 0
- Authors: Hoang-Quynh Le; Duy-Cat Can; Sinh T. Vu; T. Dang; Mohammad Taher Pilehvar; Nigel Collier
- Corresponding authors: Hoang-Quynh Le; Duy-Cat Can; Sinh T. Vu; T. Dang; Mohammad Taher Pilehvar; Nigel Collier
A pragmatic guide to geoparsing evaluation
- DOI: 10.17863/cam.55940
- Publication year: 2019
- Journal:
- Impact factor: 0
- Authors: Gritta M
- Corresponding author: Gritta M
Other publications by Nigel Collier
Text Readability and Coreference Annotation across Heterogeneous Media for the Digital Archive of Rare Books
- DOI:
- Publication year: 2004
- Journal:
- Impact factor: 0
- Authors: Asanobu Kitamoto; Takeo Yamamoto; Sonoko Sato; Nigel Collier; Ai Kawazoe; Kinji Ono
- Corresponding author: Kinji Ono
Incorporating topic information into semantic analysis models
- DOI: 10.3115/1219044.1219069
- Publication year: 2004
- Journal:
- Impact factor: 1.6
- Authors: Tony Mullen; Nigel Collier
- Corresponding author: Nigel Collier
Synthetic Examples Improve Cross-Target Generalization: A Study on Stance Detection on a Twitter corpus.
- DOI:
- Publication year: 2021
- Journal:
- Impact factor: 0
- Authors: Costanza Conforti; Jakob Berndt; Mohammad Taher Pilehvar; Chryssi Giannitsarou; Flavio Toxvaerd; Nigel Collier
- Corresponding author: Nigel Collier
Annotation of Biomedical Texts for Zone Analysis
- DOI:
- Publication year: 2004
- Journal:
- Impact factor: 0
- Authors: Y. Mizuta; Tony Mullen; Nigel Collier
- Corresponding author: Nigel Collier
在来産業の展開と資本主義, 有志舎, 佐々木寛司, 勝部真人編『講座, 明治維新』第8巻
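(English translation: The development of indigenous industries and capitalism, in 佐々木寛司 and 勝部真人 (eds.), Lectures on the Meiji Restoration, vol. 8, 有志舎)
- DOI:
- Publication year: 2013
- Journal:
- Impact factor: 0
- Authors: 杉山将; Nigel Collier; 高田輝子; 山崎志郎; 中西聡; Yoshiaki Ogura and Hirofumi Uchida; Shingo IOKIBE; 林 采成; 北澤満; 山本達司; T.Takada, A.Inoue; 冨善一敏; 松村敏弘; 湯澤規子; Keiichi Hori and Hiroshi Osano; 谷本 雅之
- Corresponding author: 谷本 雅之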
Other grants held by Nigel Collier
EPI-AI: Automated Understanding and Alerting of Disease Outbreaks from Global News Media
- Grant number: ES/T012277/1
- Fiscal year: 2020
- Funding amount: $591,200
- Project category: Research Grant
SIPHS: Semantic interpretation of personal health messages for generating public health summaries
- Grant number: EP/M005089/1
- Fiscal year: 2015
- Funding amount: $591,200
- Project category: Fellowship
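Similar National Natural Science Foundation of China (NSFC) grants
Research on optimal exposure control technology for medical X-rays based on computational models
- Grant number: 60472004
- Approval year: 2004
- Funding amount: 260,000 yuan
- Project category: General Program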
Similar overseas grants
Automatic extraction of chemical reaction process information from papers and its utilization
- Grant number: 23K18500
- Fiscal year: 2023
- Funding amount: $591,200
- Project category: Grant-in-Aid for Challenging Research (Exploratory)
Automatic extraction of hazard information from satellite InSAR ground motion data
- Grant number: 2749968
- Fiscal year: 2022
- Funding amount: $591,200
- Project category: Studentship
Research on high-resolution coronary MRA imaging method using automatic extraction technology of coronary artery stationary period and super-resolution technology
- Grant number: 22K07646
- Fiscal year: 2022
- Funding amount: $591,200
- Project category: Grant-in-Aid for Scientific Research (C)
SBIR Phase I: Automatic Data Series Extraction from a Text Corpus
- Grant number: 2110123
- Fiscal year: 2021
- Funding amount: $591,200
- Project category: Standard Grant
Developing and implementing a continuous quality improvement system by automatic data extraction from homecare nursing records
- Grant number: 21K19632
- Fiscal year: 2021
- Funding amount: $591,200
- Project category: Grant-in-Aid for Challenging Research (Exploratory)
Application of Automatic Extraction Platform KingFisher Apex on Viral and Bacterial Pathogen to Increase the Capacity in Vet-LIRN Sample Analysis
- Grant number: 10448997
- Fiscal year: 2021
- Funding amount: $591,200
- Project category:
Development of writer support system using the user's intention extraction and automatic text generation
- Grant number: 20K19878
- Fiscal year: 2020
- Funding amount: $591,200
- Project category: Grant-in-Aid for Early-Career Scientists
Automatic Extraction and Analysis of Reference Data for Detonation and Hypersonic Research
- Grant number: 541625-2019
- Fiscal year: 2019
- Funding amount: $591,200
- Project category: University Undergraduate Student Research Awards
Study on Automatic extraction of language teaching materials from a large closed caption corpus by bottom-up assembly of linguistic units such as words, phrases, and conversations.
- Grant number: 19H04224
- Fiscal year: 2019
- Funding amount: $591,200
- Project category: Grant-in-Aid for Scientific Research (B)
Automatic Extraction of Rich Metadata from Broadcast Speech
- Grant number: 2104504
- Fiscal year: 2018
- Funding amount: $591,200
- Project category: Studentship