posted on 2024-07-26, 14:36 authored by G. Y. Sofronov, T. V. Polushina, Madawa Weerasinghe Jayawar
The human genome is highly structured, and this structure typically forms regions that share patterns of a specific property. The analysis of biological sequences often involves measurements of gene expression levels. When these observations are ordered by their location on the genome, the values form clouds with different observed means, supposedly reflecting different mean levels. The statistical analysis of these sequences aims at finding chromosomal regions with "abnormal" (increased or decreased) mean levels. Identifying genomic regions associated with systematic aberrations provides insights into the initiation and progression of a disease, and improves diagnosis, prognosis and therapy strategies. In this paper, we extend our earlier work and propose a two-stage hybrid algorithm to identify structural patterns in genomic sequences. At the first stage of the algorithm, an efficient sequential change-point detection procedure (for example, the Shiryaev-Roberts procedure or the cumulative sum control chart (CUSUM) procedure) is applied. The change-point locations obtained are then used to initialize the Cross-Entropy (CE) algorithm, an evolutionary stochastic optimization method that estimates both the number of change-points and their corresponding locations. The first stage of the algorithm is very sensitive to the choice of thresholds, so identifying optimal thresholds increases the accuracy of the results and further improves the efficiency of the algorithm. In this study, we propose an improved hybrid algorithm for change-point detection, which uses optimal thresholds for the sequential change-point detection procedure and the CE method to obtain more precise estimates. To illustrate the usefulness of the algorithm, we compare the proposed hybrid algorithms on both artificially generated data and real aCGH experimental data.
Our results show that the proposed methodologies are effective in detecting multiple change-points in biological sequences.
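
As a rough illustration of the first stage only, the following Python sketch implements a textbook two-sided CUSUM procedure with restarts after each alarm. It is not the authors' implementation: the function name `cusum_change_points`, the running-mean baseline, and the `drift` and `threshold` values are assumptions chosen for this example, and the CE refinement stage is not shown.

```python
from typing import List, Sequence


def cusum_change_points(x: Sequence[float], drift: float, threshold: float) -> List[int]:
    """Two-sided CUSUM with restarts; returns the indices at which an alarm fires.

    Assumption: the baseline of each segment is estimated by a running mean,
    which is one simple choice among many and not taken from the paper.
    """
    alarms: List[int] = []
    if not x:
        return alarms

    reference = x[0]  # running mean of the current segment
    n_seg = 1         # number of observations in the current segment
    s_pos = 0.0       # evidence for an upward shift in the mean
    s_neg = 0.0       # evidence for a downward shift in the mean

    for i in range(1, len(x)):
        dev = x[i] - reference
        s_pos = max(0.0, s_pos + dev - drift)
        s_neg = max(0.0, s_neg - dev - drift)

        if s_pos > threshold or s_neg > threshold:
            # Alarm: record the index and restart the procedure from here.
            alarms.append(i)
            reference = x[i]
            n_seg = 1
            s_pos = 0.0
            s_neg = 0.0
        else:
            # No alarm: fold the observation into the running mean.
            n_seg += 1
            reference += (x[i] - reference) / n_seg

    return alarms


# Hypothetical noise-free signal with mean shifts at positions 20 and 40.
signal = [0.0] * 20 + [2.0] * 20 + [0.0] * 20
print(cusum_change_points(signal, drift=0.5, threshold=3.0))
```

Each alarm fires a few positions after the true shift, because evidence has to accumulate past `threshold`. A lower threshold shortens this delay but raises the false-alarm rate on noisy data, which is the sensitivity to threshold selection described above. In a hybrid scheme of the kind described, the alarm indices would serve as the initial change-point locations passed to the second stage.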