Large Databases of Small Molecules - Drug Development Tool and Public Resource

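Basic information

Grant number:
10262724
Principal investigator:
MARC NICKLAUS
Amount:
$112,300
Host institution:
DIVISION OF BASIC SCIENCES - NCI
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Project period:
to
Project status:
Not yet concluded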

Source:
https://reporter.nih.gov/project-details/10262724
Keywords:
3-Dimensional; Algorithmic Software; Area; Awareness; Biological; Biological Assay; Books; Cactaceae; Catalogs; Characteristics; Chemical Structure; Chemicals; Collection; Computer Assisted; Computers; Contracts; Custom; Data; Data Set; Databases; Deposition; Developmental Therapeutics Program; Drug Design; Evaluation; Generations; Goals; Information Sciences; Internet; Legal patent; Link; Malignant Neoplasms; Methods; Nature; Paper; Pharmacologic Substance; Property; PubChem; Publications; Readability; Records; Research Personnel; Resources; Sampling; Series; Services; Structure; System; Telephone; Time; United States National Institutes of Health; Update; Vendor; Work; Writing; chemical group; chemical synthesis; cloud platform; database structure; design; drug development; improved; insight; next generation; pharmacophore; programs; screening; small molecule; structured data; tautomer; tool; tool development; web based interface; web platform; web server; web services

Project abstract

The principal objective of this project is to make large collections of small molecules available for aiding in drug development, both in-house and publicly, to advance the fields of chemical structure identification and processing and of unique compound identifier generation, as well as to provide free chemoinformatics tools aiding one in dealing with such databases. This project started with posting the information in the Open NCI Database on the CADD Group's public web server. Many databases are available to the user, including large vendor catalogs of compounds that can be acquired for screening. Advanced processing is applied to the data, and powerful searching and display capabilities have been implemented. The nature of the resources currently being developed is exemplified by a brief description of this service: The data in this current Enhanced NCI Web Browser web service comprise data from NCI's Developmental Therapeutics Program (DTP) and additional information with which we have augmented the DTP data sets. We have subjected the Open NCI Database of about 260,000 compounds to various analyses that help to better understand its characteristics and put it in perspective of other large databases used in computer-aided drug design and chemical information sciences. Various clustering methods have been applied to it to elucidate its diversity, and the results have been compared with those for other databases. The Open NCI Database has been converted into various formats, suitable for further processing including 3D pharmacophore searching. We have also implemented a powerful public search tool for the Open NCI Database with a web interface based on the chemical information toolkit CACTVS. Using just a web browser, the user is able to search about 250,000 structures for more than 600 criteria. We have greatly augmented the original DTP files with numerous additional data fields, be it calculated, predicted or hyperlinked information. 
These data have also been made available in directly downloadable format. Links to several additional services for further processing have been implemented. An online 3D pharmacophore search capability has been built which, as far as we are aware, is currently unique on the web. Searchable predictions of more than 550 different biological activities, calculated by the program PASS for most of the quarter-million compounds, have been included in the web service.
A more recent service is our Chemical Structure Lookup Service (CSLS), available at http://cactus.nci.nih.gov/lookup. CSLS is essentially a "phone book" for small molecules: it lets the user quickly find out in which, if any, of over 100 different databases (both public and commercial), comprising more than 74 million entries, their compounds occur. Updates of the user interface and of the structure and data holdings are underway at the time of this writing and will push the number of entries in CSLS beyond the 100 million mark. Part of these projects is the downloading, reformatting, and evaluation, for cancer-related purposes, of the massive set of structure and assay data deposited in PubChem.
The Chemical Identifier Resolver (CIR) is the most heavily used service, typically receiving several hundred thousand requests per day. CIR works as a resolver for different chemical structure identifiers and allows one to convert a given structure identifier into another representation or structure identifier. Among others, this service handles our NCI/CADD Structure Identifiers, developed in-house, as well as the new Standard InChI and InChIKey identifiers. One of CIR's key features is that it is a programmatic interface into the Chemical Structure Database (CSDB). An update of CSDB has been completed, bringing it to over 360 million original database records representing approximately 128 million unique small-molecule structures.
Many additional capabilities are planned for this service, which is increasingly being integrated with other web services and chemoinformatics tools worldwide. CIR will also become increasingly important in the area of publications involving chemical structures, as efforts increase to make it mandatory to include computer-readable representations of all compounds presented in a paper. We are working on the next-generation web platform, which will be the basis for a series of new web services and updates of existing services, including the CADD Group's Chemical Structure Lookup Service (CSLS II). The URL of our public web server is https://cactus.nci.nih.gov.
We have analyzed a set of 43 million chemical structure records extracted from patent data (EP, US PTO, WO) by the IBM-led consortium of large pharmaceutical companies in the context of the SIIP (Strategic IP Insight Platform) project. The utility OSRA, originally developed by the CADD Group, was used in this project. Part of these data were given for public use to both PubChem and the CADD Group (see, e.g., http://www-935.ibm.com/services/us/gbs/bao/siip/nih/?sid=0015AFBF08D8F183C1F8E32A430CFFEB).
Efforts to implement a resource that makes affordable chemical synthesis of screening samples available to all NIH researchers were realized as an extension of the contract with ChemNavigator, a formerly independent company that is now part of Sigma-Aldrich (itself recently acquired by Merck GmbH) and that has implemented the so-called Semi-Custom Synthesis Online Request System (SCSORS). The Chemical Activity Predictor (CAP) allows the user to calculate physicochemical properties and activities for compounds. Our database and chemoinformatics tools are benefiting from the work on tautomerism, in particular the redesign of the handling of tautomerism for version 2 of the IUPAC InChI identifier. A recent new web tool in this context is the so-called Tautomerizer.
Numerous additional downloadable data sets have been made available on the group's web server. The work of creating a database of more than a billion easily synthesizable compounds in the SAVI project is described elsewhere. Efforts to move some of these tools to cloud platforms are being undertaken.
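
The abstract describes CIR as a programmatic interface that converts one structure identifier into another representation. As a rough illustration only, the sketch below builds request URLs in the form `/chemical/structure/<identifier>/<representation>` on the public server named above. That path layout and the representation names used (`stdinchikey`, `names`) come from the service's public usage, not from this abstract, so treat them as assumptions; `cir_url` and `resolve` are hypothetical helper names, and how the server treats percent-encoded characters is not verified here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Public server named in the abstract; the /chemical/structure path is an
# assumption about the resolver's URL scheme, not something the abstract states.
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier: str, representation: str) -> str:
    """Build a request URL of the form <base>/<identifier>/<representation>."""
    # quote() leaves "/" untouched by default and percent-encodes characters
    # such as "#" and "?", which would otherwise be read as URL syntax.
    return f"{CIR_BASE}/{quote(identifier)}/{quote(representation, safe='')}"


def resolve(identifier: str, representation: str, timeout: float = 10.0) -> list[str]:
    """Fetch one representation, assuming a plain-text reply with one value per line."""
    with urlopen(cir_url(identifier, representation), timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return [line for line in text.splitlines() if line.strip()]
```

With network access, a call such as `resolve("aspirin", "stdinchikey")` would be expected to return the matching identifier(s); a name the service cannot resolve surfaces as a `urllib.error.HTTPError`. Keeping URL construction separate from the network call makes the request format easy to inspect without contacting the server.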