
Tracking Evolution and Spread of Viral Genomes by Geospatial Observation Error

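Basic Information

Grant number:
9249484
Principal investigator:
GRACIELA GONZALEZ HERNANDEZ
Amount:
$461,000 (rounded)
Host institution:
ARIZONA STATE UNIVERSITY-TEMPE CAMPUS
Host institution country:
United States
Project category:
Fiscal year:
2016
Funding country:
United States
Project period:
2016-04-01 to 2020-03-31
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/9249484
Keywords: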
Animals Area Award Back China Computer software County Data Data Sources Databases Deposition Development Diffusion Disease Environmental Health Evaluation Evolution Funding Genbank Genetic Variation Genome Geography Goals Gold Hantavirus Health Human Imagery Influenza Knowledge Link Literature Location Manuals Metadata Methods Modeling Molecular Epidemiology National Institute of Allergy and Infectious Disease Natural Language Processing Nucleotides Population Genetics Public Health Publications RNA Viruses Rabies Records Research Research Infrastructure Research Personnel Resources Risk Running Science Source Surveillance Modeling System Time Trees United States National Institutes of Health Viral Viral Genome Virus Work Zoonoses improved information model interest journal article molecular sequence database pathogen population health programs public health relevance simulation surveillance data tool web portal

Project Abstract

DESCRIPTION (provided by applicant): Tracking evolutionary changes in viral genomes and their spread often requires the use of data deposited in public databases such as GenBank, the Influenza Research Database (IRD), or the Virus Pathogen Resource (ViPR). GenBank provides an abundance of viral sequence data for phylogeography. Sequences and their metadata can be downloaded and imported into software applications that generate phylogeographic trees and models for surveillance. IRD and ViPR are NIH/NIAID-funded programs that import data from GenBank but also offer additional data sources, visualization, and search tools for their users. Tracking evolutionary changes and spread also requires the geospatial assignment of taxa, which is often obtained from GenBank metadata. Unfortunately, geospatial metadata such as host location is often uncertain in GenBank entries: only 36% of entries contain a precise location such as a county, town, or region within a state. For example, an entry may list only China or USA rather than Beijing or Bedford, NH. While the town or county might be included in the corresponding journal article, this valuable information is not available for immediate use unless it is extracted and then linked back to the appropriate sequence. The goal of our work is to enable health agencies and other researchers to automatically generate phylogeographic models that incorporate enhanced geospatial data for better estimates of virus spread. This proposal focuses on developing and applying information extraction and statistical phylogeography approaches to enhance models that track evolutionary changes in viral genomes and their spread. We propose a framework that uses natural language processing (NLP) to automatically extract relevant geospatial data from the literature and that assigns a confidence to the link between each geospatial mention and the GenBank record.
We will then use these locations and their confidence estimates as observation error in the creation of phylogeographic models of zoonotic virus spread. We hypothesize that a combined NLP-phylogeography infrastructure that produces models that include observation error in the geospatial assignment of taxa will be closer to a gold standard than phylogeographic models that do not include it. Our research will extend phylogeography and zoonotic surveillance by: creating an NLP infrastructure that will improve the level of detail of geospatial data for phylogeography of zoonotic viruses (Aim 1), developing phylogeographic models using the estimates from Aim 1 as observation error (Aim 2), and evaluating our approach by comparing the models it produces to models that do not account for observation error in the geospatial assignment of taxa (Aim 3). We will allow users to generate enhanced models and view results on a web portal accessible via a LinkOut feature from GenBank, IRD, and ViPR. The addition of more precise geospatial information in building such models could enable health agencies to better target areas that represent the greatest public health risk.
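
One way to picture "observation error" in the geospatial assignment of a taxon is as a probability distribution over candidate locations instead of a single label. The sketch below is an illustration only and is not taken from the project: the function name `location_distribution`, the scoring scale, and the example mentions are all hypothetical, and the abstract does not say how the confidence estimates are actually computed or fed into the phylogeographic model.

```python
from collections import defaultdict


def location_distribution(mentions):
    """Normalize (location, score) pairs into a probability distribution.

    `mentions` is a hypothetical list of candidate locations extracted from
    an article linked to one sequence record, each with a non-negative
    confidence score. Scores for repeated locations are summed.
    """
    totals = defaultdict(float)
    for location, score in mentions:
        if score < 0:
            raise ValueError("scores must be non-negative")
        totals[location] += score

    mass = sum(totals.values())
    if mass == 0:
        raise ValueError("at least one mention needs a positive score")

    # Each location's share of the total score is its assumed probability.
    return {location: score / mass for location, score in totals.items()}


# Hypothetical record whose metadata says only "USA", while the linked
# article mentions two towns with different (made-up) scores.
mentions = [("Bedford, NH", 3), ("Manchester, NH", 1), ("Bedford, NH", 1)]
dist = location_distribution(mentions)
```

A model that accounts for observation error could treat `dist` as the uncertainty about where the sequence was sampled, whereas a model without it would commit to a single location.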
