
Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles


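Basic Information

Grant number:
10701081
Principal investigator:
GRACIELA GONZALEZ HERNANDEZ
Amount:
$583,700
Host institution:
ARIZONA STATE UNIVERSITY-TEMPE CAMPUS
Host institution country:
United States
Project type:
Fiscal year:
2021
Funding country:
United States
Project period:
2021-09-17 to 2024-08-31
Project status:
Completed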

Source:
https://reporter.nih.gov/project-details/10701081
Keywords:
2019-nCoV; Address; Age; Agreement; Algorithms; Base Sequence; COVID-19; COVID-19 pandemic; Clinical; Clinical Data; Collaborations; Communicable Diseases; Coronavirus; Data; Data Analyses; Data Set; Epidemiology; Evolution; Funding; Genbank; Gender; Genetic; Genotype; Geography; Goals; Health; International; Intervention; Joints; Journals; Knowledge; Link; Location; Manuals; Metadata; Methods; Modeling; Natural Language Processing; Ontology; Outcome; Patient Care; Patients; Peer Review; Performance; Phylogenetic Analysis; Population; Population Group; Populations at Risk; Printing; Probability; Public Health; Publications; Publishing; Race; Records; Relative Risks; Reporting; Research; Research Personnel; Resolution; Resources; Risk; SARS coronavirus; Scientist; Sequence Analysis; Severities; Specific qualifier value; System; Testing; Text; Unified Medical Language System; United States National Institutes of Health; Update; Viral; Viral Genome; Virus; Work; clinical phenotype; cohort; comorbidity; coronavirus disease; dashboard; data sharing; deep learning; demographics; field study; genomic epidemiology; heuristics; improved; insight; novel; pandemic disease; population health; prevent; public database; public repository; residence; response; secondary analysis; sex; text searching; transmission process; trend; virus characteristic

Project Summary

In response to the COVID-19 pandemic, scientists have published over one hundred thousand research articles and made available over eight hundred thousand virus genome sequences. These sequences, along with their metadata, can be used to understand virus evolution and spread and their implications for public health, a field of study called genomic epidemiology. However, these sequence records do not typically contain patient metadata such as demographics, clinical severity, or comorbidities, preventing researchers from uncovering trends in population health. To understand the severity of the problem, we analyzed nearly 748 thousand SARS-CoV-2 records from GISAID and 60 thousand from GenBank for the presence of patient metadata. Age and gender were represented in < 1% of GenBank records; in GISAID, 26% of records included sex and 24% included age. For other fields, the amount of missing data is even more pronounced: neither resource provides information on a patient's race, and only GISAID specifies severity (i.e., ICU), in less than 5% of records. To address missing virus metadata, researchers could use the publication associated with the new sequences; however, the virus sequence record is often not updated with a link to the publication. From the set of records that we analyzed, 3.4% (of 748K) in GISAID and < 1% (of 117K) in GenBank had a link to a publication. This greatly hinders secondary data analysis of these sequences and limits the ability to use them at scale to uncover associations between the viral genome, transmission risk, and health outcomes. The goal of this proposal is to enhance genomic epidemiology and population health of COVID-19 with a framework to continuously and automatically enrich SARS-CoV-2 nucleic acid sequence metadata in public databases such as GenBank and GISAID with metadata from associated published articles.
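The kind of completeness audit described above (what share of records carry age, sex, or a publication link) can be illustrated with a minimal Python sketch. This is not the investigators' analysis code: the field names (`age`, `sex`, `pubmed_id`), the set of placeholder strings treated as missing, and the toy records are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Mapping, Optional

# Placeholder strings counted as "missing". This set is an assumption;
# real repository exports may use other conventions.
MISSING = {"", "unknown", "n/a", "na", "not provided", "missing"}


def is_present(value: Optional[str]) -> bool:
    """Return True when a metadata value carries usable information."""
    return value is not None and value.strip().lower() not in MISSING


def completeness(
    records: Iterable[Mapping[str, Optional[str]]],
    fields: Iterable[str],
) -> Dict[str, float]:
    """Percentage of records with a usable value for each field."""
    rows: List[Mapping[str, Optional[str]]] = list(records)
    names = list(fields)
    if not rows:
        return {name: 0.0 for name in names}
    return {
        name: 100.0 * sum(is_present(r.get(name)) for r in rows) / len(rows)
        for name in names
    }


# Toy records invented for illustration; not real repository data.
toy_records = [
    {"accession": "X1", "age": "54", "sex": "female", "pubmed_id": None},
    {"accession": "X2", "age": "unknown", "sex": "male", "pubmed_id": None},
    {"accession": "X3", "age": None, "sex": "", "pubmed_id": "12345"},
    {"accession": "X4", "age": "", "sex": None, "pubmed_id": None},
]

report = completeness(toy_records, ["age", "sex", "pubmed_id"])
for field, pct in report.items():
    print(f"{field}: {pct:.1f}%")
```

Treating placeholder strings such as "unknown" as missing matters in practice, because a field that is technically populated may still carry no information.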
We will incorporate input from clinicians at the front line of patient care during the pandemic and build on our NIH-funded work (R01AI117011), which used Natural Language Processing (NLP) to enrich the geographic metadata of a sequence record using its corresponding published article. We have used these data in virus phylogeographic models and shown the benefit of using enriched metadata for modeling virus evolution and spread. The availability of SARS-CoV-2 sequences, paired with full-text COVID-19 articles and preprints, presents an opportunity for metadata enrichment and scientific discovery beyond our prior work. Our specific aims are to: (1) enrich SARS-CoV-2 sequence metadata using text extracted from publications and (2) derive key epidemiologic insights for different patient demographics using our enriched SARS-CoV-2 sequence dataset. We will leverage our prior joint work funded by the NIH to enable the secondary use of enriched metadata for genomic epidemiology to improve our understanding of SARS-CoV-2 evolution and spread among different population groups. We will disseminate the enriched data through our GeoBoost2 data dashboard, GenBank LinkOut, and the i2b2 platform. The latter will more immediately allow integration with COVID-specific clinical data shared by the 4CE Consortium.
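One prerequisite for enriching a sequence record from an article is finding which records the article mentions. The sketch below uses plain regular expressions as a simple stand-in for that linking step; it is not the NLP pipeline the project describes, and the function name `find_accessions` and the sample sentence are hypothetical. The patterns follow the documented GenBank nucleotide accession formats (one letter plus five digits, or two letters plus six or eight digits) and the GISAID `EPI_ISL_` prefix; a pattern this loose can also match unrelated codes, so real use would need validation against the repository.

```python
import re
from typing import Dict, List

# GenBank nucleotide accessions: 1 letter + 5 digits, or 2 letters + 6 or
# 8 digits, with an optional ".version" suffix that is not captured.
GENBANK_RE = re.compile(r"\b([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?\b")

# GISAID isolate identifiers: the literal prefix EPI_ISL_ followed by digits.
GISAID_RE = re.compile(r"\bEPI_ISL_\d+\b")


def find_accessions(text: str) -> Dict[str, List[str]]:
    """Return sorted, de-duplicated accession-like strings found in text."""
    return {
        "genbank": sorted(set(GENBANK_RE.findall(text))),
        "gisaid": sorted(set(GISAID_RE.findall(text))),
    }


# Sample sentence written for illustration only.
sample = (
    "Genomes were deposited in GenBank (MN908947.3, MT012098) "
    "and GISAID (EPI_ISL_402124)."
)

hits = find_accessions(sample)
print(hits)
```

Once an article is tied to specific records this way, any patient metadata extracted from its text would still need to be attributed to the right record, which is the harder part of the enrichment task.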
