Modeling and prediction of genome sequence information by using information representation models

Basic information

Grant number:
12208010
Principal investigator:
YADA Tetsushi
Amount:
$463,400
Host institution:
Kyoto University (2003-2004); The University of Tokyo (2001-2002); The Institute of Physical and Chemical Research (2000)
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research on Priority Areas
Fiscal year:
2000
Funding country:
Japan
Project period:
2000 to 2004
Project status:
Completed

Project abstract

In this research, we have focused on gene models that are capable of finding genes in genome sequences.
First, we have developed a general-purpose algorithm that finds genes by combining multiple existing gene-finders. The algorithm has been implemented in a novel gene-finder named DIGIT. An outline of the algorithm is as follows. First, existing gene-finders are applied to an uncharacterized genomic sequence (the input sequence). Next, DIGIT produces all possible exons from the results of the gene-finders and assigns each of them an exon type, a reading frame and an exon score. Finally, DIGIT searches for the set of exons whose additive score is maximized under their reading frame constraints. A Bayesian procedure is used to infer the exon scores, and a hidden Markov model (HMM) is used to search for the exon set. We have designed DIGIT to combine the results of FGENESH, GENSCAN and HMMgene, and have assessed its prediction accuracy using recently compiled benchmark data sets. For all data sets, DIGIT successfully discarded many false-positive exons predicted by the individual gene-finders and yielded remarkable improvements in sensitivity and specificity at the gene level compared with the best gene-level accuracies achieved by any single gene-finder.
Second, we have developed a novel index that precisely derives protein-coding regions from cross-species genome alignments. The index is closely related to the frame recovery observed in coding sequence alignments: if insertions or deletions of nucleotides cause frame shifts in coding regions, other in-dels that recover the reading frame are often observed in the vicinity. In contrast, such frame recoveries are not observed in other conserved regions.
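
The frame recovery idea described above can be illustrated with a small sketch. This is not the index defined in the project, whose formula is not given in the abstract. It is a minimal illustration that assumes a gapped pairwise alignment given as two equal-length strings with `-` for gaps. The function name `frame_recovery_counts` and the `window` parameter are hypothetical. The code tracks the relative reading-frame offset between the two sequences and counts how many frame shifts are restored by a nearby compensating in-del.

```python
from itertools import groupby


def column_state(x: str, y: str) -> int:
    """Classify one alignment column: +1 gap in first row, -1 gap in second row, 0 otherwise."""
    if x == "-" and y != "-":
        return 1
    if y == "-" and x != "-":
        return -1
    return 0


def frame_recovery_counts(aln_a: str, aln_b: str, window: int = 30):
    """Return (recovered, unrecovered) frame-shift counts for a gapped pairwise alignment.

    A frame shift starts when a gap run makes the relative offset non-zero modulo 3.
    It counts as recovered if a later gap run restores offset 0 within `window`
    columns (an illustrative cut-off, not a value from the source).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    offset = 0            # relative reading-frame offset, modulo 3
    shift_start = None    # column where the current frame shift began
    recovered = unrecovered = 0
    col = 0
    states = (column_state(x, y) for x, y in zip(aln_a, aln_b))
    for state, group in groupby(states):
        length = sum(1 for _ in group)
        if state != 0:
            # Apply the whole gap run at once, so a 3-base in-del keeps the frame.
            new_offset = (offset + state * length) % 3
            if offset == 0 and new_offset != 0:
                shift_start = col
            elif offset != 0 and new_offset == 0:
                if col - shift_start <= window:
                    recovered += 1
                else:
                    unrecovered += 1
                shift_start = None
            offset = new_offset
        col += length
    if offset != 0:
        unrecovered += 1  # the alignment ends while still out of frame
    return recovered, unrecovered
```

Under these assumptions, a coding alignment would be expected to show a higher share of recovered shifts than a conserved non-coding alignment, which is the contrast the abstract describes.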
We prepared two gene models: a model that finds genes by using sequence similarity and intrinsic gene measures (the basic model), and a model that finds genes by using the frame recovery index in addition to sequence similarity and intrinsic gene measures (the frame recovery model). We evaluated the prediction accuracies of the two models, and our benchmark test revealed that the frame recovery model significantly improved the prediction accuracy in comparison with the basic model.
Third, we have developed GeneDecoder, an HMM-based gene-finding technology for eukaryotes. Using dynamic programming and statistical models trained on annotated genome sequences, the algorithm divides the input nucleic acid sequence into meaningful segments. In addition, GeneDecoder has three further features: (1) a multi-stream architecture, (2) incorporation of similarity search and (3) SVM-driven screening of putative splice sites.
(1) In addition to nucleic acid sequences, GeneDecoder allows any other data streams to be added. Typically, dicodon bigram values can be calculated in advance and aligned on a 'Direct' stream, which makes the state transition networks much simpler. Any other meaningful features extracted in advance can be incorporated into the gene-finding process using this scheme.
(2) Combining the calculation of coding potential with a similarity search against a database of known sequences yields more reliable putative exons. For this purpose, GeneDecoder is able both to embed known motif models in exon models and to use segments for which a BLAST search found similarity to a known sequence.
(3) The Support Vector Machine (SVM) is a pattern recognition technique known to have high classification capability, and it has been successfully applied to splice site prediction. In GeneDecoder, this feature is implemented alongside PWM-based splice site models.
During parsing, putative splice sites that are derived from the PWM-based models but are poorly supported by the SVMs designed as splice site classifiers are excluded.
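
The two-stage screening in feature (3) can be sketched as follows. This is an illustration under stated assumptions, not GeneDecoder's implementation: PWM is taken to mean a position weight matrix, here built as log-odds against a uniform background, and the SVM is represented by an arbitrary callable `svm_decision` that returns a signed score. The names `build_pwm`, `pwm_score` and `screen_sites`, and both thresholds, are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

BASES = "ACGT"


def build_pwm(sites: Sequence[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build a log2-odds PWM (uniform background) from equal-length example sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in BASES
        })
    return pwm


def pwm_score(pwm: List[Dict[str, float]], window: str) -> float:
    """Sum the per-position log-odds for one window of the same width as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def screen_sites(
    seq: str,
    pwm: List[Dict[str, float]],
    pwm_threshold: float,
    svm_decision: Callable[[str], float],
    svm_threshold: float = 0.0,
) -> List[int]:
    """Return start positions proposed by the PWM and not rejected by the classifier."""
    width = len(pwm)
    kept = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows with ambiguous bases
        if pwm_score(pwm, window) < pwm_threshold:
            continue  # stage 1: not a PWM candidate
        if svm_decision(window) < svm_threshold:
            continue  # stage 2: PWM candidate with poor classifier support
        kept.append(start)
    return kept
```

In practice `svm_decision` would wrap a trained classifier. Any function mapping a window to a signed score fits this sketch, which keeps the PWM stage and the classifier stage independent of each other.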