
Large Databases of Small Molecules - Drug Development Tool and Public Resource


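Basic information

Grant number:
10703018
Principal investigator:
MARC NICKLAUS
Amount:
$180,600
Host institution:
DIVISION OF BASIC SCIENCES - NCI
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Project period:
to
Project status:
Not concluded

Source:
https://reporter.nih.gov/project-details/10703018
Keywords: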
3-Dimensional; Algorithmic Software; Area; Awareness; Biological; Biological Assay; Books; Cactaceae; Catalogs; Characteristics; Chemical Structure; Chemicals; Collection; Computer Assisted; Computers; Contracts; Custom; Data; Data Set; Databases; Deposition; Developmental Therapeutics Program; Drug Design; Evaluation; Generations; Goals; Information Sciences; Internet; Legal patent; Link; Malignant Neoplasms; Methods; Nature; Paper; Pharmacologic Substance; Property; PubChem; Publications; Readability; Records; Research Personnel; Resources; Sampling; Series; Services; Structure; System; Telephone; Time; United States National Institutes of Health; Update; Vendor; Work; Writing; chemical group; chemical synthesis; cloud platform; database structure; design; drug development; improved; insight; next generation; pharmacophore; programs; screening; small molecule; tautomer; tool; tool development; web based interface; web platform; web server; web services

Project abstract

The principal objective of this project is to make large collections of small molecules available to aid drug development, both in-house and publicly; to advance the fields of chemical structure identification and processing and of unique compound identifier generation; and to provide free chemoinformatics tools that help users work with such databases. The project started with posting the information in the Open NCI Database on the CADD Group's public web server. Many databases are available to the user, including large vendor catalogs of compounds that can be acquired for screening. Advanced processing is applied to the data, and powerful searching and display capabilities have been implemented. The nature of the resources currently being developed is exemplified by a brief description of one service: the data in the current Enhanced NCI Web Browser web service comprise data from NCI's Developmental Therapeutics Program (DTP) and additional information with which we have augmented the DTP data sets. We have subjected the Open NCI Database of about 260,000 compounds to various analyses that help to better understand its characteristics and to put it in the perspective of other large databases used in computer-aided drug design and chemical information sciences. Various clustering methods have been applied to it to elucidate its diversity, and the results have been compared with those for other databases. The Open NCI Database has been converted into various formats suitable for further processing, including 3D pharmacophore searching. We have also implemented a powerful public search tool for the Open NCI Database with a web interface based on the chemical information toolkit CACTVS. Using just a web browser, the user is able to search about 250,000 structures by more than 600 criteria. We have greatly augmented the original DTP files with numerous additional data fields containing calculated, predicted or hyperlinked information.
These data have also been made available in directly downloadable format. Links to several additional services for further processing have been implemented. An online 3D pharmacophore capability has been built, which, as far as we are aware, is currently unique on the web. Searchable predictions of more than 550 different biological activities, calculated by the program PASS for most of the quarter-million compounds, have been included in the web service. A more recent service is our Chemical Structure Lookup Service (CSLS), available at http://cactus.nci.nih.gov/lookup. CSLS is essentially a "phone book" for small molecules, allowing users to quickly find out in which, if any, of over 100 different databases (both public and commercial), comprising more than 74 million entries, their compounds occur. Updates of the user interface and of the structure and data holdings are underway as of this writing and will push the number of entries in CSLS beyond the 100 million mark. Part of these projects is the downloading, reformatting and evaluation, for cancer-related purposes, of the massive set of structure and assay data deposited in PubChem. The Chemical Identifier Resolver (CIR) is the most heavily used service, typically receiving several hundred thousand requests per day. CIR works as a resolver for different chemical structure identifiers and allows one to convert a given structure identifier into another representation or structure identifier. Among others, our NCI/CADD Structure Identifiers, developed in-house, as well as the new Standard InChI and InChIKey identifiers are handled by this service. One of CIR's key features is that it is a programmatic interface to the Chemical Structure Database (CSDB). An update of CSDB has been completed, bringing it to over 360 million original database records representing approximately 128 million unique small-molecule structures.
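To illustrate what converting one structure identifier into another representation can look like from a client's point of view, here is a minimal Python sketch. It assumes a URL scheme of the form `<server>/chemical/structure/<identifier>/<representation>` and representation names such as `smiles`, `stdinchi` and `stdinchikey`; neither is stated in the abstract, which gives only the server address, so treat both as assumptions to be checked against the service's own documentation. The helper names `cir_url` and `resolve` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: resolver requests look like
#   <base>/<identifier>/<representation>
# The abstract itself only gives the server address https://cactus.nci.nih.gov.
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier: str, representation: str) -> str:
    """Build a resolver URL for one identifier and one target representation."""
    # Percent-encode every reserved character, since identifiers such as
    # SMILES strings can contain "/", "#", "=", "(" and ")".
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{quote(representation, safe='')}"


def resolve(identifier: str, representation: str, timeout: float = 10.0) -> list[str]:
    """Fetch the response and return its non-empty lines (needs network access)."""
    with urlopen(cir_url(identifier, representation), timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Only builds and prints the URL; call resolve() to contact the server.
    print(cir_url("aspirin", "stdinchikey"))
```

Keeping URL construction separate from the network call makes the encoding step easy to check offline, and `resolve("aspirin", "stdinchikey")` would then perform the actual lookup if the assumed scheme holds.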
Many additional capabilities are planned for this service, which is increasingly being integrated with other web services and chemoinformatics tools worldwide. CIR will also become increasingly important in the area of publications involving chemical structures, as efforts increase to make it mandatory to include computer-readable representations of all compounds presented in a paper. We are working on the next-generation web platform, which will be the basis for a series of new web services and for updates of existing services, including the CADD Group's Chemical Structure Lookup Service (CSLS II). The URL of our public web server is https://cactus.nci.nih.gov. Usage of cactus from January 2016 through December 2021 has averaged 14 million accesses per month, i.e. more than 450,000 per day. We have analyzed a set of 43 million chemical structure records extracted from patent data (EP, US PTO, WO) by the IBM-led consortium of large pharmaceutical companies in the context of the SIIP (Strategic IP Insight Platform) project. OSRA, a utility originally developed by the CADD Group, was used in this project. Part of these data was released for public use to both PubChem and the CADD Group (see, e.g., http://www-935.ibm.com/services/us/gbs/bao/siip/nih/?sid=0015AFBF08D8F183C1F8E32A430CFFEB). Efforts to implement a resource for making affordable chemical synthesis of screening samples available to all NIH researchers were realized as an extension of the contract with the formerly independent company ChemNavigator (now part of Sigma-Aldrich, recently acquired by Merck GmbH), who have implemented the so-called Semi-Custom Synthesis Online Request System (SCSORS). The Chemical Activity Predictor (CAP) allows the user to calculate physicochemical properties and activities for compounds.
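The usage and database-size figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (14 million accesses per month, 360 million CSDB records, 128 million unique structures); the average month length of 365.25 / 12 days is an assumption made for the calculation, and the function names are hypothetical.

```python
# Figures quoted in the abstract.
MONTHLY_ACCESSES = 14_000_000
CSDB_RECORDS = 360_000_000
CSDB_UNIQUE_STRUCTURES = 128_000_000

# Assumption: an average month has 365.25 / 12 days.
DAYS_PER_MONTH = 365.25 / 12


def daily_accesses(monthly: int, days_per_month: float = DAYS_PER_MONTH) -> float:
    """Convert a monthly access count into an average daily count."""
    return monthly / days_per_month


def records_per_structure(records: int, unique_structures: int) -> float:
    """Average number of source records that map to one unique structure."""
    return records / unique_structures


if __name__ == "__main__":
    print(f"{daily_accesses(MONTHLY_ACCESSES):,.0f} accesses per day")
    print(f"{records_per_structure(CSDB_RECORDS, CSDB_UNIQUE_STRUCTURES):.2f} records per unique structure")
```

Under that assumption the daily average comes out at roughly 460,000, consistent with the "more than 450,000 per day" stated in the abstract, and each unique structure corresponds on average to a little under three original database records.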
Our database and chemoinformatics tools are benefiting from the work on tautomerism, in particular the redesign of tautomerism handling for version 2 of the IUPAC InChI identifier. A recent web tool in this context is the so-called Tautomerizer. Numerous additional downloadable data sets have been made available on the group's web server. The work of creating a database of more than a billion easily synthesizable compounds in the SAVI project is described elsewhere. Efforts to move some of these tools to cloud platforms are being undertaken. A very significant update of several of the services on our web server is currently underway.
