Tuning Large language models to read biological literature

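Basic information

Grant number:
BB/Y514032/1
Principal investigator:
Antony McCabe
Amount:
approx. $237,800
Host institution:
University of Liverpool
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2024
Funding country:
United Kingdom
Start and end dates:
2024 to (no data)
Project status:
Not yet concluded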

Source:
https://gtr.ukri.org/projects?ref=BB%2FY514032%2F1
Keywords:
Tuning Large language models read

Project abstract

In this application, we focus on two related bioinformatics challenges that require interpretation and knowledge extraction from biological and biomedical literature at great scale.

First, gene/genome databases store information on gene function, which is ultimately derived from scientific experiments with results reported in publications. It is exceptionally time-consuming and expensive for human curators to read all relevant scientific literature, interpret what has been reported about the function or localisation of gene products, and assign specific controlled vocabulary terms (e.g. Gene Ontology terms) or short free text descriptions (gene names or product descriptions).

Second, there are enormous volumes of raw data sets accompanying scientific publications, which are deposited in archival databases from expensive omics experiments, including mass spectrometry (MS) proteomics. Our group and others develop and apply pipelines for re-analysing MS data for new purposes, including annotating genomes, discovering post-translational modifications and building quantitative atlases of species or tissues, amongst others. There is a major bottleneck in interpreting the original experimental design, sample descriptions and software parameters, which are currently described in blocks of free text submitted to the archival repository or within the Materials and Methods sections of accompanying articles.

For both challenges, we believe that, given the recent extraordinary improvements in large language models (LLMs), such models can be retrained and harnessed for these tasks to remove the bottleneck in knowledge extraction from literature. Our group has significant expertise in bioinformatics and machine learning, but limited expertise in natural language processing (NLP) to date. In this international partnering application, we are collaborating with a leading group in artificial intelligence and NLP from the University of Pennsylvania (UPenn). The UPenn team will help to guide us in the optimal approach for re-training open-source LLMs, using training data that our team has amassed over many years. We will produce open-source code for the two challenge areas, with a longer-term plan to put these into production within the context of major international databases and consortia, within which we have leading roles.
