Addressing Imbalance in Multilabel Classification: Measures and Random Preprocessing Methods

This website contains supplementary material for the paper: F. Charte, A.J. Rivera, M.J. del Jesus, and F. Herrera, "Addressing Imbalance in Multilabel Classification: Measures and Random Preprocessing Methods", Neurocomputing, Volume 163, pp. 3-16 (2015).

Abstract

Learning from imbalanced datasets is a problem thoroughly studied in binary classification, and to a lesser extent in multiclass classification. Although most multilabel datasets suffer from a high imbalance level, the proposals on how to measure this characteristic and how to deal with this issue are scant.

The purpose of this paper is to present measures aimed to assess the imbalance level in multilabel datasets, as well as to propose several preprocessing algorithms designed to reduce it. Two of the proposed methods are random undersampling algorithms, called LP-RUS and ML-RUS, while the other two accomplish random oversampling, LP-ROS and ML-ROS. All of them are experimentally tested and their effectiveness is statistically evaluated. From the results obtained, a set of guidelines directed to show when these methods should be applied is also provided.


Algorithms proposed in the paper

Four preprocessing algorithms aimed at reducing the imbalance level in multilabel datasets are proposed. Two of them are based on the LP (Label Powerset) transformation, which treats each distinct combination of labels as a single class, whereas the other two analyze the imbalance of each label individually. All of them take a single parameter P, which sets the percentage of instances to remove (undersampling) or to produce (oversampling).
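The LP-based undersampling idea can be illustrated with a minimal sketch. This is not the implementation from the paper: the function name `lp_random_undersample`, the input format, and the removal policy (always take from the currently largest label combination, and never go below the mean combination size) are simplifying assumptions made here for illustration.

```python
import random
from collections import defaultdict

def lp_random_undersample(labelsets, p, seed=0):
    """Return the indices kept after removing about p percent of instances.

    labelsets: one collection of labels per instance.
    p: percentage of instances to remove (the parameter P).
    Assumed policy (not taken from the paper): instances are removed one
    at a time from the currently largest label combination, and never
    from a combination at or below the mean combination size.
    """
    rng = random.Random(seed)
    # Label Powerset view: each distinct label combination is one class.
    bags = defaultdict(list)
    for idx, labels in enumerate(labelsets):
        bags[frozenset(labels)].append(idx)
    mean_size = len(labelsets) / len(bags)
    to_remove = int(len(labelsets) * p / 100)
    for _ in range(to_remove):
        largest = max(bags.values(), key=len)
        if len(largest) <= mean_size:
            break  # no majority combination left to shrink
        largest.pop(rng.randrange(len(largest)))
    return sorted(i for bag in bags.values() for i in bag)

# Toy data: 8 instances labelled {a}, 2 instances labelled {a, b}.
toy = [{"a"}] * 8 + [{"a", "b"}] * 2
kept = lp_random_undersample(toy, p=20)
```

In this toy case only the majority combination {a} loses instances, so the two {a, b} instances always survive. An oversampling counterpart would clone random instances of the minority combinations instead of deleting from the majority ones.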


Datasets

The experiments were conducted using 13 datasets from the MULAN and MEKA repositories. Each dataset was randomly partitioned twice into five folds in order to perform 2x5-fold cross-validation, which gives 10 runs of every algorithm for each dataset. These partitions are available to download.
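The 2x5 scheme described above can be sketched as follows. The function name `repeated_kfold` and the plain, non-stratified shuffle are assumptions made for illustration; the downloadable partitions are the ones actually used in the experiments and will not match what this code generates.

```python
import random

def repeated_kfold(n_instances, repeats=2, folds=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per run."""
    rng = random.Random(seed)
    runs = []
    for _ in range(repeats):
        # A fresh random ordering for each repetition.
        order = list(range(n_instances))
        rng.shuffle(order)
        # Deal the shuffled indices into `folds` disjoint parts.
        parts = [order[k::folds] for k in range(folds)]
        for k in range(folds):
            test = parts[k]
            train = [i for j, part in enumerate(parts) if j != k for i in part]
            runs.append((train, test))
    return runs

# Example with the size of the emotions dataset (593 instances).
runs = repeated_kfold(593)
```

Each of the 10 resulting runs uses one fold for testing and the remaining four for training.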

Datasets and their characteristics - Download
Dataset     # instances  # features  # labels    Card
bibtex             7395        1836       159   2.402
cal500              502          68       174  26.044
corel5k            5000         499       374   3.522
corel16k          13766         500       161   2.867
emotions            593          72         6   1.869
enron              1702        1001        53   3.378
genbase             662        1186        27   1.252
llog               1460        1004        75   1.180
mediamill         43907         120       101   4.376
slashdot           3782        1079        22   1.181
scene              2407         294         6   1.074
tmc2007           28596         500        22   2.158
yeast              2417         198        14   4.237
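
The Card column is assumed here to be label cardinality in its usual sense, that is, the average number of active labels per instance. Under that assumption, it could be computed along these lines (the function name and the toy data are illustrative only and do not come from any of the datasets above):

```python
def label_cardinality(labelsets):
    """Average number of distinct active labels per instance."""
    return sum(len(set(labels)) for labels in labelsets) / len(labelsets)

# Toy data: four instances with 2, 1, 3 and 0 active labels.
toy_labels = [{"a", "b"}, {"a"}, {"b", "c", "d"}, set()]
card = label_cardinality(toy_labels)
```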

Experimentation results


Page created and maintained by Francisco Charte - 2013