Modeling and prediction of genome sequence information by using information representation models


Basic information

Project abstract

This research focused on gene models capable of finding genes in genome sequences.

First, we developed a general-purpose algorithm that finds genes by combining multiple existing gene-finders, and implemented it in a novel gene-finder named DIGIT. The algorithm proceeds as follows. Existing gene-finders are first applied to an uncharacterized genomic sequence (the input sequence). DIGIT then produces all possible exons from the gene-finders' results and assigns each an exon type, a reading frame, and an exon score. Finally, DIGIT searches for the set of exons whose additive score is maximal under the reading-frame constraints. A Bayesian procedure is used to infer the exon scores, and a hidden Markov model (HMM) is used to search for the exon set. We designed DIGIT to combine the results of FGENESH, GENSCAN and HMMgene, and assessed its prediction accuracy on recently compiled benchmark data sets. For all data sets, DIGIT discarded many false-positive exons predicted by the individual gene-finders and yielded remarkable improvements in gene-level sensitivity and specificity over the best gene-level accuracy achieved by any single gene-finder.

Second, we developed a novel index that precisely derives protein-coding regions from cross-species genome alignments. The index is closely related to the frame recovery observed in coding sequence alignments: if an insertion or deletion of nucleotides causes a frame shift in a coding region, other indels that restore the reading frame are often observed nearby. In contrast, such frame recoveries are not observed in other conserved regions.
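The frame recovery idea above can be illustrated with a small sketch. The abstract does not give the formula of the index, so the function below is a hypothetical stand-in, not the authors' definition: it walks a pairwise alignment column by column, tracks the relative reading-frame offset introduced by gaps, and reports the fraction of frame shifts that are restored within a chosen number of columns. The name `frame_recovery_fraction` and the `window` parameter are assumptions made for illustration.

```python
def frame_recovery_fraction(aln_a: str, aln_b: str, window: int = 30) -> float:
    """Fraction of frame shifts restored within `window` alignment columns.

    Hypothetical illustration only; the real index is not specified in the text.
    `aln_a` and `aln_b` are two aligned sequences of equal length, with '-' as gap.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    phase = 0           # relative frame offset between the two sequences, mod 3
    shift_start = None  # column where the current frame shift began
    shifts = 0          # number of times the alignment left the common frame
    recovered = 0       # number of those shifts restored within the window

    for col, (a, b) in enumerate(zip(aln_a, aln_b)):
        previous = phase
        if a == "-" and b != "-":
            phase = (phase + 1) % 3
        elif b == "-" and a != "-":
            phase = (phase - 1) % 3

        if previous == 0 and phase != 0:
            shifts += 1
            shift_start = col
        elif previous != 0 and phase == 0:
            if col - shift_start <= window:
                recovered += 1
            shift_start = None

    return recovered / shifts if shifts else 0.0
```

Under this toy definition, a coding alignment in which a one-base gap is compensated a few columns later scores high, while an alignment with an uncompensated gap scores zero. A gap whose length is a multiple of three also counts as immediately recovered, because it leaves the frame intact.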
We prepared two gene models: a basic model, which finds genes using sequence similarity and intrinsic gene measures, and a frame recovery model, which uses the frame recovery index in addition to those two sources. In our benchmark test, the frame recovery model significantly improved prediction accuracy over the basic model.

Third, we developed GeneDecoder, an HMM-based gene-finding technology for eukaryotes. Using dynamic programming and statistical models trained on annotated genome sequences, the algorithm divides the input nucleic acid sequence into meaningful segments. GeneDecoder also has three additional features: (1) a multi-stream architecture, (2) incorporation of similarity search, and (3) SVM-driven screening of putative splice sites. (1) In addition to nucleic acid sequences, GeneDecoder allows any other data streams to be added. Typically, dicodon bigram values are calculated in advance and aligned on a 'Direct' stream, which makes the state transition networks much simpler. Any other meaningful features extracted in advance can be incorporated into the gene-finding process using this scheme. (2) Combining the calculation of coding potential with similarity search against a database of known sequences yields more reliable putative exons. For this purpose, GeneDecoder can both embed known motif models in exon models and use segments found by BLAST search to be similar to known sequences. (3) The support vector machine (SVM) is a pattern recognition technique known for its high classification capability and has been applied successfully to splice site prediction. GeneDecoder implements this feature alongside PWM-based splice site models.
During parsing, putative splice sites that are derived from the PWM-based models but are poorly supported by the SVMs designed as splice site classifiers are excluded.
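The two-stage splice site screening described for GeneDecoder can be sketched as follows. This is a minimal illustration under stated assumptions, not GeneDecoder's implementation: the PWM is a toy two-column donor motif, and `svm_decision` is a plain callable standing in for a trained SVM's decision function (no real SVM library is used here). All names, thresholds and the toy classifier are hypothetical.

```python
import math
from typing import Callable, Dict, List

Pwm = List[Dict[str, float]]  # one base -> probability map per motif position


def pwm_log_odds(site: str, pwm: Pwm, background: float = 0.25) -> float:
    """Log-odds score of `site` under the PWM against a uniform background."""
    return sum(math.log(col[base] / background) for base, col in zip(site, pwm))


def screen_sites(
    seq: str,
    pwm: Pwm,
    svm_decision: Callable[[str, int], float],
    pwm_threshold: float = 0.0,
    svm_threshold: float = 0.0,
) -> List[int]:
    """Return start positions proposed by the PWM and also supported by the classifier."""
    width = len(pwm)
    kept = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if pwm_log_odds(site, pwm) < pwm_threshold:
            continue  # not proposed by the PWM-based model
        if svm_decision(seq, start) < svm_threshold:
            continue  # proposed by the PWM but poorly supported by the classifier
        kept.append(start)
    return kept


# Toy donor-site PWM: strongly prefers "GT" (hypothetical probabilities).
TOY_PWM: Pwm = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
]


def toy_svm_decision(seq: str, start: int) -> float:
    """Stand-in for an SVM decision value: favors a purine right after the motif."""
    nxt = start + 2
    return 1.0 if nxt < len(seq) and seq[nxt] in "AG" else -1.0
```

For example, `screen_sites("CCGTAAGTCCGTCC", TOY_PWM, toy_svm_decision)` keeps only the first of the three `GT` candidates, because the stand-in classifier rejects the other two. In practice the callable would wrap the decision function of a classifier trained on annotated splice sites.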
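The exon-selection step of DIGIT described in the abstract (choose the set of candidate exons with maximal additive score under reading-frame constraints) can be sketched with a simple dynamic program. The abstract states that DIGIT performs this search with an HMM; the code below swaps in a plain chain-building dynamic program as a simplified illustration. It assumes a single gene on one strand, ignores exon types, and uses hypothetical names (`Exon`, `best_exon_chain`) and a hypothetical phase convention.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Exon:
    start: int    # 0-based, inclusive
    end: int      # 0-based, inclusive
    phase: int    # leading bases completing a codon begun in the previous exon (0, 1 or 2)
    score: float  # exon score, e.g. as inferred from the combined gene-finder outputs

    @property
    def leftover(self) -> int:
        """Trailing bases that begin a codon completed in the next exon."""
        return (self.end - self.start + 1 - self.phase) % 3


def compatible(prev: Exon, nxt: Exon) -> bool:
    """True if `nxt` can follow `prev` without overlap and without breaking the frame."""
    return prev.end < nxt.start and nxt.phase == (3 - prev.leftover) % 3


def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Highest-scoring frame-consistent chain that starts in phase 0 and ends on a codon boundary."""
    order = sorted(exons, key=lambda e: (e.start, e.end))
    best: List[Optional[float]] = [None] * len(order)  # best chain score ending at exon i
    back: List[Optional[int]] = [None] * len(order)    # predecessor index in that chain

    for i, exon in enumerate(order):
        if exon.phase == 0:
            best[i] = exon.score  # exon may open a chain
        for j in range(i):
            if best[j] is not None and compatible(order[j], exon):
                candidate = best[j] + exon.score
                if best[i] is None or candidate > best[i]:
                    best[i], back[i] = candidate, j

    ends = [i for i, e in enumerate(order) if best[i] is not None and e.leftover == 0]
    if not ends:
        return 0.0, []
    cur: Optional[int] = max(ends, key=lambda i: best[i])
    total = best[cur]
    chain = []
    while cur is not None:
        chain.append(order[cur])
        cur = back[cur]
    return total, chain[::-1]
```

The frame constraint is what lets a combined predictor discard individually high-scoring but inconsistent exons: an exon with the best single score is dropped if no frame-compatible partner exists and a consistent chain scores higher overall.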

Project outcomes

Journal articles (92)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Marginalized kernels for RNA sequence data analysis.
Selective integration of multiple biological data for supervised network inference
  • DOI:
    10.1093/bioinformatics/bti339
  • Publication date:
    2005-05-15
  • Journal:
  • Impact factor:
    5.8
  • Authors:
    Kato, T; Tsuda, K; Asai, K
  • Corresponding author:
    Asai, K
Finishing the euchromatic sequence of the human genome
  • DOI:
    10.1038/nature03001
  • Publication date:
    2004-10-21
  • Journal:
  • Impact factor:
    64.8
  • Authors:
    Collins, FS; Lander, ES; Waterston, RH
  • Corresponding author:
    Waterston, RH
Differential display analysis of mutants for the transcription factor pdr1p regulating multidrug resistance in the budding yeast
  • DOI:
  • Publication date:
    2001
  • Journal:
  • Impact factor:
    0
  • Authors:
    Miura, F.; Yada, T.; Nakai, K.; Sakaki, Y.; Ito, T.
  • Corresponding author:
    T.
T. Kato, K. Tsuda, K. Tomii, K. Asai: "Maximum likelihood superposition of protein structures". Genome Informatics 14, 488-489 (2003)
  • DOI:
  • Publication date:
  • Journal:
  • Impact factor:
    0
  • Authors:
  • Corresponding author:

Other grants of YADA Tetsushi

Designing promoter sequences
  • Grant number:
    22240032
  • Fiscal year:
    2010
  • Funding amount:
    $463,400
  • Project category:
    Grant-in-Aid for Scientific Research (A)
Comparative analysis of large scale genome data and knowledge discovery
  • Grant number:
    17018021
  • Fiscal year:
    2005
  • Funding amount:
    $463,400
  • Project category:
    Grant-in-Aid for Scientific Research on Priority Areas

Similar overseas grants

Ancient human genome sequence analysis to elucidate the population structure of Kofun period humans
  • Grant number:
    23K05948
  • Fiscal year:
    2023
  • Funding amount:
    $463,400
  • Project category:
    Grant-in-Aid for Scientific Research (C)
The Dundee Resource for Sequence Analysis and Structure Prediction (DRSASP) - 2023 and Beyond
  • Grant number:
    BB/X018628/1
  • Fiscal year:
    2023
  • Funding amount:
    $463,400
  • Project category:
    Research Grant
Elucidation of pathogenesis and development of prophylaxis of lower extremity arterial disease using optical coherence tomography and single-cell RNA sequence analysis
  • Grant number:
    23K15129
  • Fiscal year:
    2023
  • Funding amount:
    $463,400
  • Project category:
    Grant-in-Aid for Early-Career Scientists
CC* Integration-Small: Harnessing FABRIC for Scalable Human Genome Sequence Analysis
  • Grant number:
    2201583
  • Fiscal year:
    2022
  • Funding amount:
    $463,400
  • Project category:
    Standard Grant
Whole-genome sequence analysis of radiation-induced mutations in human hematopoietic stem cells
  • Grant number:
    22K12388
  • Fiscal year:
    2022
  • Funding amount:
    $463,400
  • Project category:
    Grant-in-Aid for Scientific Research (C)
Advancing DNA sequence analysis algorithms using biologically inspired and mathematical sound analysis at the University of Tokyo
  • Grant number:
    577723-2022
  • Fiscal year:
    2022
  • Funding amount:
    $463,400
  • Project category:
    Canadian Graduate Scholarships Foreign Study Supplements
CAREER: Future phylogenies: novel computational frameworks for biomolecular sequence analysis involving complex evolutionary origins
  • Grant number:
    2144121
  • Fiscal year:
    2022
  • Funding amount:
    $463,400
  • Project category:
    Continuing Grant
Identification of structural variants and neoantigens of renal cell carcinoma using long-read sequence analysis
  • Grant number:
    22K09488
  • Fiscal year:
    2022
  • Funding amount:
    $463,400
  • Project category:
    Grant-in-Aid for Scientific Research (C)
Modern mathematical models of big data-driven problems in biological sequence analysis with applications to efficient algorithm design
  • Grant number:
    569312-2022
  • Fiscal year:
    2022
  • Funding amount:
    $463,400
  • Project category:
    Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Occupational transitions across the lifecourse and dementia risk: evaluating unemployment, occupational complexity using sequence analysis
  • Grant number:
    10302126
  • Fiscal year:
    2021
  • Funding amount:
    $463,400
  • Project category: