
Graph Grammars for Molecular Structure Search and Classification


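Basic Information

Grant number:
416768284
Principal investigator:
Professor Dr. Ernst Althaus
Funding amount:
--
Host institution:
Arbeitsgruppe Software-Technik und Bioinformatik
Host institution country:
Germany
Project category:
Research Grants
Fiscal year:
2019
Funding country:
Germany
Project period:
2018-12-31 to 2022-12-31
Project status:
Completed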

Source:
https://gepris.dfg.de/gepris/projekt/416768284?language=en
Keywords:
Graph Grammars Molecular Structure Search

Project Abstract

Numerous fields of study focus on small molecules. A prominent example is drug design, where small molecules are used to inhibit or activate proteins in order to achieve a desired biological function. In these fields, we often want to scan databases for molecules containing certain substructures. Traditionally, these substructures are modelled in chemical description languages such as Daylight’s SMARTS. These languages tend to be very complex and are very restricted in their ability to describe the topological patterns of the underlying graphs. Moreover, parsing and matching patterns against a database of molecules is NP-complete. To circumvent these problems, we propose a simple graph grammar to describe substructures. Even very simple graph rewriting systems offer high expressive power, almost reaching that of SMARTS. To use these graph grammars for molecular structure search, we have to solve the subgraph matching problem. Although this problem remains NP-complete, it becomes polynomial if each minimal cut of the query graph has bounded size, which we empirically find to be true for most molecules contained in the standard databases. We will investigate the complexity of the problem for further known graph parameters and try to relate the maximal size of a minimal cut to other parameters, focusing on parameters that are typically small for molecular graphs. We will also make our basic algorithm more efficient in practice. Furthermore, we want to derive over-approximations of the class of graphs generated by a grammar for which the subgraph matching problem can be solved more efficiently.
As a second research direction, we will develop and implement efficient algorithms for learning graph grammars from positive and negative examples. We aim to find a graph grammar that is as simple as possible and matches the positive examples but does not match the negative examples for the chemical group. A trivial grammar that interpolates the positive and negative examples is one that generates exactly the positive examples, but it clearly overfits them. The underlying idea behind this learning task is to automatically identify aspects of the pharmacophore of these molecules. The challenge here is to simultaneously prevent overfitting and overgeneralization. We plan to develop constructive algorithms, i.e. algorithms that compute a simple graph grammar that interpolates the positive and negative examples, and improvement algorithms, i.e. algorithms that try to simplify a graph grammar while preserving its interpolating property.
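
To make the two tasks named in the abstract more concrete, the following minimal Python sketch illustrates (a) subgraph matching of a single fixed query pattern against a molecular graph and (b) the check that a pattern "interpolates" a set of positive and negative examples. It is not the project's algorithm: it uses plain backtracking rather than a graph grammar or any bounded-cut technique, it ignores bond orders and hydrogens, and all names (`LabeledGraph`, `find_embedding`, `interpolates`) as well as the example molecules are assumptions introduced purely for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


class LabeledGraph:
    """Undirected graph with one label (here: an element symbol) per node."""

    def __init__(self, labels: Dict[int, str], edges: List[Tuple[int, int]]):
        self.labels = dict(labels)
        self.adj: Dict[int, Set[int]] = {v: set() for v in labels}
        for u, v in edges:
            self.adj[u].add(v)
            self.adj[v].add(u)


def find_embedding(query: LabeledGraph,
                   target: LabeledGraph) -> Optional[Dict[int, int]]:
    """Return an injective map from query nodes to target nodes that
    preserves labels and query edges, or None if no such map exists.
    Worst-case running time is exponential (plain backtracking)."""
    order = list(query.labels)
    mapping: Dict[int, int] = {}
    used: Set[int] = set()

    def extend(i: int) -> bool:
        if i == len(order):
            return True
        q = order[i]
        for t in target.labels:
            if t in used or target.labels[t] != query.labels[q]:
                continue
            # Each already-mapped neighbour of q must land on a neighbour of t.
            if all(mapping[n] in target.adj[t]
                   for n in query.adj[q] if n in mapping):
                mapping[q] = t
                used.add(t)
                if extend(i + 1):
                    return True
                del mapping[q]
                used.discard(t)
        return False

    return dict(mapping) if extend(0) else None


def interpolates(query: LabeledGraph,
                 positives: List[LabeledGraph],
                 negatives: List[LabeledGraph]) -> bool:
    """True if the query matches every positive and no negative example."""
    return (all(find_embedding(query, g) is not None for g in positives)
            and all(find_embedding(query, g) is None for g in negatives))


# Hypothetical toy data: heavy atoms only, bond orders ignored.
c_o_pattern = LabeledGraph({0: "C", 1: "O"}, [(0, 1)])
ethanol = LabeledGraph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
dimethyl_ether = LabeledGraph({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)])
ethane = LabeledGraph({0: "C", 1: "C"}, [(0, 1)])

print(find_embedding(c_o_pattern, ethanol))
print(interpolates(c_o_pattern, [ethanol, dimethyl_ether], [ethane]))
```

In this toy setting the carbon-oxygen pattern occurs in ethanol and dimethyl ether but not in ethane, so it separates the positive from the negative examples. A grammar-based approach as proposed in the abstract would replace the single fixed pattern by the whole class of graphs a grammar generates.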
