
Feature Selection for High-Dimensional Imbalanced Class Datasets Using Harmony Search and Kullback-Leibler Divergence

journal contribution
posted on 2025-11-24, 04:40, authored by Alireza Moayedikia, Richard Jensen, Sara Fin
<p dir="ltr">High-dimensional imbalanced datasets pose significant challenges in pattern recognition, often leading to overfitting and classifier bias toward majority classes. While numerous feature selection algorithms exist, most struggle to address high dimensionality and class imbalance simultaneously. This paper introduces Harmony Search Kullback–Leibler (HKL), a novel feature selection algorithm that integrates Kullback–Leibler (KL) divergence with the Harmony Search metaheuristic to address both challenges. HKL uses KL divergence as an information-theoretic criterion to evaluate feature subsets by their ability to separate minority and majority classes. Unlike existing Harmony Search variants, which operate as class-blind optimizers and treat feature selection as a generic optimization problem, HKL incorporates class distribution awareness directly into the optimization process. The algorithm implements a dual optimization approach that balances classification performance metrics against class distribution divergence. This design enhances minority class discrimination by prioritizing features that maximize the divergence between class distributions, ensuring that selected features remain discriminative for underrepresented classes rather than simply favoring the majority class. Experimental validation on multiple high-dimensional biomedical datasets shows that HKL consistently outperforms state-of-the-art methods on AUC and G-mean, with particular gains in minority class classification. The algorithm achieves this while using substantially reduced feature subsets, often requiring only a quarter to half of the original features to match or exceed baseline classification accuracy. Statistical significance testing confirms that these improvements reflect genuine algorithmic advantages rather than random variation. The proposed approach offers an effective solution to both dimensionality reduction and class imbalance, providing a valuable tool for complex classification tasks across domains.</p>
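The KL-based evaluation criterion described in the abstract can be sketched in a few lines. This is a hypothetical, per-feature illustration only (the helper names `kl_divergence` and `feature_kl_score` are not from the paper): each feature is scored by the divergence between its histogram in the minority class and in the majority class. The actual HKL algorithm goes further, using Harmony Search to optimize over feature subsets and combining this divergence term with classification performance.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(P || Q), with smoothing to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def feature_kl_score(x, y, bins=10):
    """Score one feature by the KL divergence between its value
    distribution in the minority class and in the majority class."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    edges = np.histogram_bin_edges(x, bins=bins)   # shared bins for both classes
    p, _ = np.histogram(x[y == minority], bins=edges)
    q, _ = np.histogram(x[y == majority], bins=edges)
    return kl_divergence(p, q)

# Toy imbalanced data: 90 majority vs 10 minority samples.
rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)
# A feature whose minority-class values are shifted is discriminative;
# a pure-noise feature has near-identical class distributions.
informative = np.where(y == 1, rng.normal(3, 1, 100), rng.normal(0, 1, 100))
noise = rng.normal(0, 1, 100)
informative_score = feature_kl_score(informative, y)
noise_score = feature_kl_score(noise, y)
```

Under this sketch, the shifted feature receives a much higher score than the noise feature, which is the behavior the paper attributes to KL-guided selection: features that separate the minority class from the majority class are preferred.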

History

Available versions

Accepted manuscript

ISSN

1384-5810

Journal title

Data mining and knowledge discovery

Volume

40

Article number

6

Publisher

Springer Nature

Copyright statement

Copyright © 2025 The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2025. Springer Nature holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Language

eng
