Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles
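Basic information
- Grant number: 10390667
- Principal investigator:
- Amount: $757,100
- Host institution:
- Host institution country: United States
- Project type:
- Fiscal year: 2021
- Funding country: United States
- Project period: 2021-09-17 to 2022-08-31
- Project status: Completed
- Source: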
- Keywords: 2019-nCoV; Address; Age; Agreement; Algorithms; Base Sequence; COVID-19; COVID-19 pandemic; Clinical; Clinical Data; Collaborations; Communicable Diseases; Coronavirus; Data; Data Analyses; Data Set; Databases; Epidemiology; Evolution; Funding; Genbank; Gender; Genetic; Genomics; Genotype; Geography; Goals; Gold; Health; International; Intervention; Joints; Journals; Knowledge; Link; Location; Manuals; Metadata; Methods; Modeling; Natural Language Processing; Ontology; Outcome; Patient Care; Patients; Peer Review; Performance; Phylogenetic Analysis; Population; Population Group; Populations at Risk; Probability; Public Health; Publications; Publishing; Race; Records; Relative Risks; Reporting; Research; Research Personnel; Resolution; Resources; Risk; SARS coronavirus; Scientist; Sequence Analysis; Severities; Specific qualifier value; System; Testing; Text; Unified Medical Language System; United States National Institutes of Health; Update; Viral; Viral Genome; Virus; Work; clinical phenotype; cohort; comorbidity; coronavirus disease; dashboard; data sharing; deep learning; demographics; field study; genomic epidemiology; heuristics; improved; insight; novel; pandemic disease; population health; prevent; public repository; residence; response; secondary analysis; sex; text searching; transmission process; trend; virus characteristic
Project Summary
In response to the COVID-19 pandemic, scientists have published over one hundred thousand research articles
and made available over eight hundred thousand virus genome sequences. These sequences, along with their
metadata, can be used to understand virus evolution and spread and their implications for public health, a field of
study called genomic epidemiology. However, these sequence records do not typically contain patient metadata
such as demographics, clinical severity, or comorbidities, preventing researchers from uncovering trends in
population health. To understand the severity of the problem, we analyzed nearly 748 thousand SARS-CoV-2
records from GISAID and 60 thousand from GenBank for the presence of patient metadata. Age and
gender were represented in < 1% of GenBank records, while in GISAID 26% of records included sex and 24% included age. For
other fields, the amount of missing data is even more pronounced, with neither resource providing information on
a patient's race and only GISAID specifying severity (i.e. ICU) in less than 5% of records. To address missing
virus metadata, researchers could use the publication associated with the new sequences; however, the virus
sequence record is often not updated with a link to the publication. From the set of records that we analyzed,
3.4% (of 748K) in GISAID and < 1% (of 117K) in GenBank had a link to a publication. This greatly hinders
secondary data analysis of these sequences and limits the ability to use them at scale to uncover associations
between the viral genome, transmission risk, and health outcomes. The goal of this proposal is to enhance
genomic epidemiology and population health of COVID-19 with a framework to continuously and automatically
enrich SARS-CoV-2 nucleic acid sequence metadata in public databases such as GenBank and GISAID with
metadata in associated published articles. We will incorporate input from clinicians at the front line of patient
care during the pandemic and build on our NIH-funded work (R01AI117011), which used Natural Language
Processing (NLP) to enrich the geographic metadata of a sequence record using its corresponding published
article. We have used these data in virus phylogeographic models and shown the benefit of using enriched
metadata for modeling virus evolution and spread. The availability of SARS-CoV-2 sequences, paired with
full-text COVID-19 articles and preprints, presents an opportunity for metadata enrichment and scientific discovery
beyond our prior work. Our specific aims are to: (1) enrich SARS-CoV-2 sequence metadata using text extracted
from publications and (2) derive key epidemiologic insights for different patient demographics using our enriched
SARS-CoV-2 sequence dataset. We will leverage our prior joint work funded by the NIH to enable the secondary
use of enriched metadata for genomic epidemiology to improve our understanding of SARS-CoV-2 evolution and
spread among different population groups. We will disseminate the enriched data through our GeoBoost2 data
dashboard, GenBank LinkOut and the i2b2 platform. The latter will more immediately allow integration with
COVID-specific clinical data shared by the 4CE Consortium.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by GRACIELA GONZALEZ HERNANDEZ
Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles
- Grant number: 10681068
- Fiscal year: 2022
- Amount: $757,100
- Project type:
Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles
- Grant number: 10701081
- Fiscal year: 2021
- Amount: $757,100
- Project type:
Tracking Evolution and Spread of Viral Genomes by Geospatial Observation Error
- Grant number: 9249484
- Fiscal year: 2016
- Amount: $757,100
- Project type:
Text Processing and Geospatial Uncertainty for Phylogeography of Zoonotic Viruses
- Grant number: 8698542
- Fiscal year: 2013
- Amount: $757,100
- Project type:
Mining Social Network Postings for Mentions of Potential Adverse Drug Reactions
- Grant number: 8222740
- Fiscal year: 2012
- Amount: $757,100
- Project type:
Similar overseas grants
Rational design of rapidly translatable, highly antigenic and novel recombinant immunogens to address deficiencies of current snakebite treatments
- Grant number: MR/S03398X/2
- Fiscal year: 2024
- Amount: $757,100
- Project type: Fellowship
Re-thinking drug nanocrystals as highly loaded vectors to address key unmet therapeutic challenges
- Grant number: EP/Y001486/1
- Fiscal year: 2024
- Amount: $757,100
- Project type: Research Grant
CAREER: FEAST (Food Ecosystems And circularity for Sustainable Transformation) framework to address Hidden Hunger
- Grant number: 2338423
- Fiscal year: 2024
- Amount: $757,100
- Project type: Continuing Grant
Metrology to address ion suppression in multimodal mass spectrometry imaging with application in oncology
- Grant number: MR/X03657X/1
- Fiscal year: 2024
- Amount: $757,100
- Project type: Fellowship
CRII: SHF: A Novel Address Translation Architecture for Virtualized Clouds
- Grant number: 2348066
- Fiscal year: 2024
- Amount: $757,100
- Project type: Standard Grant
BIORETS: Convergence Research Experiences for Teachers in Synthetic and Systems Biology to Address Challenges in Food, Health, Energy, and Environment
- Grant number: 2341402
- Fiscal year: 2024
- Amount: $757,100
- Project type: Standard Grant
The Abundance Project: Enhancing Cultural & Green Inclusion in Social Prescribing in Southwest London to Address Ethnic Inequalities in Mental Health
- Grant number: AH/Z505481/1
- Fiscal year: 2024
- Amount: $757,100
- Project type: Research Grant
ERAMET - Ecosystem for rapid adoption of modelling and simulation METhods to address regulatory needs in the development of orphan and paediatric medicines
- Grant number: 10107647
- Fiscal year: 2024
- Amount: $757,100
- Project type: EU-Funded
Ecosystem for rapid adoption of modelling and simulation METhods to address regulatory needs in the development of orphan and paediatric medicines
- Grant number: 10106221
- Fiscal year: 2024
- Amount: $757,100
- Project type: EU-Funded
Recite: Building Research by Communities to Address Inequities through Expression
- Grant number: AH/Z505341/1
- Fiscal year: 2024
- Amount: $757,100
- Project type: Research Grant


















