Motivation: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples using similarity-based clustering methods. A key requirement for clustering is to avoid using the noisy aberrations in the CGH samples.
Results: We develop a dynamic programming algorithm to identify a small set of important genomic intervals called markers. The advantage of using these markers is that the potentially noisy genomic intervals are excluded during the clustering process. We also develop two clustering strategies using these markers. The first, a prototype-based approach, maximizes the support for the markers. The second, a similarity-based approach, develops a new similarity measure called RSim and refines clusters with the aim of maximizing the RSim measure between the samples in the same cluster. Our results demonstrate that the markers we found represent the aberration patterns of cancer types well and that they improve the quality of clustering significantly.
Availability: All software developed in this paper and all the datasets used are available from the authors upon request.
Contact: juliu@cise.ufl.edu