Textpresso information retrieval and extraction system for biological literature

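Basic information

Grant number:
8034342
Principal investigator:
Hans-Michael Muller
Amount:
approx. $332,700
Host institution:
CALIFORNIA INSTITUTE OF TECHNOLOGY
Institution country:
United States
Project category:
Fiscal year:
2006
Funding country:
United States
Project period:
2006-03-23 to 2013-08-31
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/8034342
Keywords: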
Access to Information Acquired Immunodeficiency Syndrome Address Algorithms Alzheimer's Disease Arabidopsis Area Biological Biological Models Biological Sciences Biological databases Brain Caenorhabditis elegans Categories Cells Classification Collaborations Communities Computer software Data Databases Development Dictyostelium Disease Drosophila genus Feedback Gene Expression Gene Expression Regulation Gene Proteins Genes Genome Gold Graph Health Individual Information Retrieval Label Learning Literature Location Machine Learning Malignant Neoplasms Measures Methods Names Natural Language Processing Neurosciences Notification Ontology Organism Paper Process Rattus Reading Research Research Personnel Retrieval Scientist Semantics Site Software Tools Specificity Speed System Taxonomy Testing Text Training Triage Work Writing Zebrafish base biological systems gene function gene interaction genome sequencing human disease improved indexing markov model model organisms databases novel strategies phrases software systems text searching theories tool web interface web site

Project abstract

DESCRIPTION (provided by applicant): We developed an information retrieval and extraction system that processes the full text of biological papers. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises approximately one hundred categories of terms, such as "gene", "regulation", "human disease" and "brain area", and also contains the main Gene Ontology (GO) categories. Extraction of particular biological facts, such as gene-gene interactions, or the curation of GO cellular components, can be accelerated significantly by ontologies, with Textpresso performing nearly as well as expert curators at identifying sentences automatically. We have established search engines for four literatures (C. elegans, Drosophila, Arabidopsis and neuroscience), and other groups around the world have developed nine systems for other literatures. The system will be further developed in many respects. In collaboration with the respective model organism databases, we will set up literature search engines for zebrafish, rat and Dictyostelium, and consider systems for important diseases such as cancer, Alzheimer's and AIDS. We will improve the quality of searchable full text by preserving superscripts, subscripts and special characters, and by recognizing the subsections of a paper. Website and system enhancements will include synonym searches, better website customization features ("myTextpresso"), browsing and searching a paper taxonomy, batch queries, and notification when search results change because the corpus has changed. We will offer web services for Textpresso and maintain a public Subversion repository for the software. Named entity recognition algorithms will be implemented to find new terms for the ontology from full text.
We will work on the problem of high specificity of terms in the lexica, which reduces recall, and enable searches for GO annotations. Strategies for (semi-)automated literature curation include installing a paper triage system and first-pass curation to identify where in a paper each relevant data type can be found. Automated curation tasks include producing connections between a paper and a biological entity such as a gene. We will develop learning algorithms that discover new categories and lexica in text. We will improve our curation strategy of developing specialized curation categories that are used to retrieve specific data, and develop corresponding curator interfaces to automate the processing pipeline from full text to database. We will research and implement new, more semantically oriented ways of searching by combining latent semantic indexing with new similarity measures. Machine learning algorithms for classifying sentences and extracting information will be implemented using hidden Markov models. A new approach to finding categories and lexica using graph theory will be investigated.

PUBLIC HEALTH RELEVANCE: Biomedical researchers need to read or skim many thousands of scientific articles each year, more than is humanly possible. This project will extend and improve an automatic system, Textpresso, that finds, among millions of sentences, the relevant ones likely to contain crucial information. Textpresso also extracts some types of information automatically, making it possible to build organized databases of important information.
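
The abstract's core pipeline (split text into sentences, label terms by ontology category, query the labeled sentences) can be illustrated with a minimal Python sketch. This is not Textpresso's actual code: the lexicon entries, the gene names abc-1 and xyz-2, the function names and the naive sentence splitter are all assumptions made for illustration.

```python
import re

# Hypothetical toy lexicon (category -> terms). The real ontology is described
# as having roughly one hundred categories; these entries are made up.
LEXICON = {
    "gene": ["abc-1", "xyz-2"],
    "regulation": ["represses", "activates", "regulates"],
    "human disease": ["cancer", "Alzheimer's disease"],
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def label_sentence(sentence, lexicon=LEXICON):
    """Return the set of categories that have at least one term in the sentence."""
    found = set()
    for category, terms in lexicon.items():
        for term in terms:
            # Match the whole term, case-insensitively, not as part of a longer word.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.add(category)
                break
    return found

def build_index(text):
    """Build a list of (sentence, categories) pairs: the 'labeled sentence database'."""
    return [(s, label_sentence(s)) for s in split_sentences(text)]

def query(index, required):
    """Return sentences that carry every required category."""
    required = set(required)
    return [s for s, cats in index if required <= cats]

if __name__ == "__main__":
    corpus = ("abc-1 represses xyz-2 in the gonad. "
              "The weather was fine. "
              "Mutations are linked to cancer.")
    index = build_index(corpus)
    for hit in query(index, ["gene", "regulation"]):
        print(hit)
```

Requiring several categories to co-occur in one sentence is what makes this kind of query useful for fact extraction such as gene-gene interactions: a sentence that mentions a gene together with a regulation term is a candidate for curation, whereas a plain keyword search over whole papers would return far more noise.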