MLSMOTE: A Synthetic Minority Oversampling Technique for Imbalanced Multilabel Classification

This website contains additional material for the paper: F. Charte, A.J. Rivera, M.J. del Jesus, and F. Herrera, "MLSMOTE: A Synthetic Minority Oversampling Technique for Imbalanced Multilabel Classification". Submitted to Knowledge-Based Systems.

Abstract

The problem of learning from imbalanced data arises in many real-world scenarios, as does the need to build classifiers able to predict more than one class simultaneously (multilabel classification). Facing imbalance by means of resampling methods is an approach that has been thoroughly studied in recent years, primarily in the context of traditional (non-multilabel) classification.

In this paper MLSMOTE, a new algorithm aimed at producing synthetic instances for imbalanced multilabel datasets, is proposed. Three different methods for generating the labelsets of the synthetic samples from nearest neighbors are presented and tested. A thorough experimental study has been conducted in order to verify the benefits of the proposed algorithm, considering several base multilabel classifiers and other multilabel oversampling algorithms. The empirical analysis, endorsed by appropriate statistical tests, shows that MLSMOTE leads to an overall improvement in classification results.

MLSMOTE basis

MLSMOTE is a novel multilabel oversampling algorithm designed to create synthetic instances associated with minority labels. To determine which labels are in the minority, MLSMOTE relies on the multilabel imbalance measures proposed in Addressing Imbalance in Multilabel Classification: Measures and Random Preprocessing Methods. The features of each synthetic instance are obtained by interpolating the values of its nearest neighbors, as in SMOTE. The labelset of each new instance is also derived from its nearest neighbors. Three different methods are studied for this task: the intersection of the labels appearing in the neighbors, their union, and a third method based on a ranking of label appearances.
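For illustration, the following minimal Java sketch outlines these two building blocks: SMOTE-style feature interpolation and the three labelset combination strategies. It is not the released implementation; the class and method names are hypothetical, and the half-occurrence threshold used in the ranking variant is an assumption of this sketch.

import java.util.*;

// Illustrative sketch of the building blocks described above (not the released code).
class MlsmoteSketch {

    // Numeric features are interpolated at a random point between the minority
    // seed instance and one of its nearest neighbors, as in SMOTE.
    static double[] interpolateFeatures(double[] seed, double[] neighbor, Random rnd) {
        double[] synthetic = new double[seed.length];
        double gap = rnd.nextDouble();
        for (int i = 0; i < seed.length; i++) {
            synthetic[i] = seed[i] + gap * (neighbor[i] - seed[i]);
        }
        return synthetic;
    }

    // Intersection: only labels present in the seed and in every neighbor.
    static Set<String> intersectionLabelset(Set<String> seed, List<Set<String>> neighbors) {
        Set<String> result = new HashSet<String>(seed);
        for (Set<String> labels : neighbors) {
            result.retainAll(labels);
        }
        return result;
    }

    // Union: every label present in the seed or in any neighbor.
    static Set<String> unionLabelset(Set<String> seed, List<Set<String>> neighbors) {
        Set<String> result = new HashSet<String>(seed);
        for (Set<String> labels : neighbors) {
            result.addAll(labels);
        }
        return result;
    }

    // Ranking: count label occurrences among the seed and its neighbors, keeping
    // those appearing in at least half of these instances (assumed threshold).
    static Set<String> rankingLabelset(Set<String> seed, List<Set<String>> neighbors) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        List<Set<String>> all = new ArrayList<Set<String>>(neighbors);
        all.add(seed);
        for (Set<String> labels : all) {
            for (String label : labels) {
                Integer c = counts.get(label);
                counts.put(label, c == null ? 1 : c + 1);
            }
        }
        Set<String> result = new HashSet<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() * 2 >= all.size()) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}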


Datasets

The experimentation was conducted using 13 datasets from the MULAN and MEKA repositories. Each dataset was randomly partitioned twice into five folds, following a 2x5-fold cross-validation scheme, which means ten runs of every algorithm for each dataset. These partitions are available for download.

Datasets and their characteristics - Download
Dataset      # instances   # features   # labels   Card. (label cardinality)
bibtex              7395         1836        159        2.402
cal500               502           68        174       26.044
corel5k             5000          499        374        3.522
corel16k           13766          500        161        2.867
emotions             593           72          6        1.869
enron               1702         1001         53        3.378
genbase              662         1186         27        1.252
medical              978         1449         45        1.245
mediamill          43907          120        101        4.376
slashdot            3782         1079         22        1.181
scene               2407          294          6        1.074
tmc2007            28596          500         22        2.158
yeast               2417          198         14        4.237

How to use MLSMOTE

The MLSMOTE algorithm has been implemented in Java. You will need JRE 1.7, along with the uncompressed content of the mlsmote.zip file. The mlsmote.jar file contains the MLSMOTE implementation. The program is designed to process all the partitions of one multilabel dataset at once. Run it with the following syntax:

java -jar mlsmote.jar -inpath input_path -outpath output_path -fileext file_pattern -xml xml_file -labelCombination n

The parameters needed are the following:

-inpath: path of the directory containing the dataset partitions to process.
-outpath: path of the directory where the oversampled partitions will be written.
-fileext: file name pattern used to select the partition files to process (e.g. tra.arff for the training partitions).
-xml: XML file describing the labels of the dataset (MULAN format).
-labelCombination: method used to generate the labelsets of the synthetic samples, selecting among the intersection, union and ranking strategies described above; the value 3 corresponds to the ranking method, as shown in the example below.

In the following example the MLSMOTE algorithm would be applied to the files emotions/XXXXtra.arff and the output would be saved in the files over-emotions/XXXXtra.arff. The new samples would have as labelsets those generated by the ranking method.

java -jar ~/app/mlsmote.jar -inpath emotions -outpath over-emotions -fileext tra.arff -xml emotions/emotions.xml -labelCombination 3
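
Once oversampled, the output partitions can be loaded like any other MULAN-formatted dataset. As a hedged illustration of the next step in the workflow, the following Java snippet loads one of the resulting training files with the MULAN API and trains a Binary Relevance classifier on it; the partition file name used here is hypothetical and depends on how the partitions are named, and the MULAN and Weka libraries are assumed to be on the classpath.

import mulan.classifier.transformation.BinaryRelevance;
import mulan.data.MultiLabelInstances;
import weka.classifiers.trees.J48;

// Loads one oversampled training partition (hypothetical file name) and trains
// a Binary Relevance classifier with C4.5 (J48) as the underlying base learner.
public class TrainOverEmotions {
    public static void main(String[] args) throws Exception {
        MultiLabelInstances train = new MultiLabelInstances(
                "over-emotions/emotions1tra.arff", "emotions/emotions.xml");
        BinaryRelevance learner = new BinaryRelevance(new J48());
        learner.build(train);
        System.out.println("Trained on " + train.getNumInstances()
                + " instances with " + train.getNumLabels() + " labels.");
    }
}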

Page created and maintained by Francisco Charte - 2015