Abstract: Cis-regulatory modules (CRMs) play a key role in the transcriptional regulation of eukaryotic genes, and identifying them is an important problem in computational biology. Many computational methods for CRM identification have been proposed, but their prediction accuracy still leaves room for improvement. Combining multiple features of CRMs can improve that accuracy. On this basis, the paper presents SegHMC (Segmental HMM model for discovery of cis-regulatory module), an algorithm that formulates CRM identification with a segmental HMM. The model extends the representation of CRM regulatory structure (the regulatory grammar): it describes a CRM not only as a combination of motifs, but also incorporates the co-occurrence frequency of motifs, motif order preference, and the distribution of distances between adjacent motifs within a CRM. Experiments on simulated and real biological datasets show that the proposed method identifies CRMs significantly more accurately than the current mainstream methods.
Keywords:
- Gene transcriptional regulation
- Motif
- Segmental HMM
- Cis-regulatory module identification
Fig. 1 Schematic structure of a cis-regulatory module (A cis-regulatory module is a sequence region that contains motifs for multiple transcription factors; motif orientation, the spacing between motifs, and the relationships among motifs may encode important properties of a given cis-regulatory module.)
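To make the idea of a regulatory grammar concrete, the following minimal sketch scores one fixed segmentation of a sequence by combining three of the features named above: motif match, motif order preference, and the spacing between adjacent motifs. It is an illustration under stated assumptions, not the SegHMC model: the motif matrices, the order table, the geometric gap distribution, and all function names are hypothetical, and a full segmental HMM would search over all segmentations (for example with a Viterbi-style recursion) rather than score a single given one.

```python
import math

# Hypothetical toy grammar: none of these numbers come from the paper.
# Position weight matrices: one dict of base probabilities per motif column.
PWMS = {
    "A": [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
          {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
          {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}],   # prefers "ACG"
    "B": [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
          {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
          {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}],   # prefers "TTA"
}
BACKGROUND = 0.25  # uniform background base probability (assumption)
# Order preference: P(next motif | previous motif); "START" opens the module.
ORDER = {"START": {"A": 0.8, "B": 0.2},
         "A": {"A": 0.1, "B": 0.9},
         "B": {"A": 0.5, "B": 0.5}}
GAP_P = 0.2  # geometric gap model (assumption): P(gap = d) = (1 - p)^d * p


def motif_log_odds(name, site):
    """Log-odds of `site` under the motif PWM versus the background."""
    return sum(math.log(col[base] / BACKGROUND)
               for col, base in zip(PWMS[name], site))


def gap_log_prob(d):
    """Log-probability of a spacer of length d between adjacent motifs."""
    return d * math.log(1 - GAP_P) + math.log(GAP_P)


def score_parse(seq, parse):
    """Score one segmentation; `parse` is a list of (motif name, start).

    Spacer bases are treated as background, so they add zero log-odds;
    only the spacer length is scored.
    """
    total, prev, prev_end = 0.0, "START", None
    for name, start in parse:
        width = len(PWMS[name])
        total += math.log(ORDER[prev][name])                      # order preference
        if prev_end is not None:
            total += gap_log_prob(start - prev_end)               # spacing
        total += motif_log_odds(name, seq[start:start + width])   # motif match
        prev, prev_end = name, start + width
    return total


# Same two motif sites and the same gap, in opposite orders.
forward = score_parse("GGACGCCTTAGG", [("A", 2), ("B", 7)])
reverse = score_parse("GGTTACCACGGG", [("B", 2), ("A", 7)])
print(forward, reverse)
```

Because both parses contain perfect matches to the same two motifs with the same spacer length, any difference between the two scores comes only from the order-preference table, which is the kind of information a grammar built from motif composition alone would not capture.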