MLSMOTE: A Synthetic Minority Oversampling Technique for Imbalanced Multilabel Classification

This website contains additional material for the paper: F. Charte, A.J. Rivera, M.J. del Jesus, and F. Herrera, "MLSMOTE: A Synthetic Minority Oversampling Technique for Imbalanced Multilabel Classification". Submitted to Knowledge-Based Systems.

Abstract

The problem of learning from imbalanced data arises in many real-world scenarios, as does the need to build classifiers able to predict more than one class simultaneously (multilabel classification). Tackling imbalance by means of resampling methods is an approach that has been studied extensively in recent years, primarily in the context of traditional (non-multilabel) classification.

In this paper MLSMOTE, a new algorithm aimed at producing synthetic instances for imbalanced multilabel datasets, is proposed. Three different methods for generating the labelsets of synthetic samples from their nearest neighbors are presented and tested. A thorough experimental study has been conducted to verify the benefits of the proposed algorithm, considering several base multilabel classifiers and other multilabel oversampling algorithms. The empirical analysis, supported by appropriate statistical tests, shows that MLSMOTE leads to an overall improvement in classification results.

MLSMOTE basis

MLSMOTE is a novel multilabel oversampling algorithm designed to create synthetic instances associated with minority labels. To determine which labels are minority labels, MLSMOTE relies on the multilabel imbalance measures proposed in "Addressing Imbalance in Multilabel Classification: Measures and Random Preprocessing Methods". The features of the synthetic instances are obtained by interpolating the values of nearest neighbors, as in SMOTE. The labelsets of these new instances are also derived from the nearest neighbors. Three different methods are studied for this task: the intersection of the labels that appear in the neighbors, the union of those labels, and a third method based on a ranking of label appearances.
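The ideas in this section can be illustrated with a minimal Python sketch. It is not the Java implementation distributed on this page, and it rests on several assumptions: the imbalance measures are taken to be the per-label imbalance ratio (IRLbl, the count of the most frequent label divided by the count of the given label) and its mean (MeanIR), with a label treated as minority when its IRLbl exceeds MeanIR; only numeric features are interpolated; the seed instance is counted together with its neighbors when building a labelset; and the ranking method keeps labels present in at least half of those instances. All function names are hypothetical.

```python
import random
from collections import Counter


def imbalance_measures(labelsets):
    """Return (IRLbl per label, MeanIR) for a list of labelsets (assumed definitions)."""
    counts = Counter(label for ls in labelsets for label in ls)
    most_frequent = max(counts.values())
    irlbl = {label: most_frequent / n for label, n in counts.items()}
    mean_ir = sum(irlbl.values()) / len(irlbl)
    return irlbl, mean_ir


def minority_labels(labelsets):
    """Labels whose imbalance ratio is above the mean imbalance ratio."""
    irlbl, mean_ir = imbalance_measures(labelsets)
    return {label for label, ratio in irlbl.items() if ratio > mean_ir}


def interpolate(seed, neighbor, rng):
    """SMOTE-style interpolation: a random point on the segment seed -> neighbor."""
    gap = rng.random()  # one random factor in [0, 1) shared by all features
    return [s + gap * (n - s) for s, n in zip(seed, neighbor)]


def synthetic_labelset(seed_labels, neighbor_labelsets, method):
    """Build a labelset with method 1 (intersection), 2 (union) or 3 (ranking)."""
    # Assumption: the seed instance is considered together with its neighbors.
    group = [set(seed_labels)] + [set(ls) for ls in neighbor_labelsets]
    if method == 1:
        return set.intersection(*group)
    if method == 2:
        return set.union(*group)
    if method == 3:
        # Assumption: keep labels that appear in at least half of the group.
        counts = Counter(label for ls in group for label in ls)
        return {label for label, n in counts.items() if n >= len(group) / 2}
    raise ValueError("method must be 1, 2 or 3")
```

Intersection tends to produce small labelsets and union large ones, so a frequency threshold such as the one assumed for the ranking method sits between the two.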

Datasets

The experiments were conducted using 13 datasets from the MULAN and MEKA repositories. Each of these datasets has been randomly partitioned twice into five separate folds in order to perform a 2x5-fold cross-validation, which means 10 runs of every algorithm for each dataset. These partitions are available for download.
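As an illustration of how a 2x5 scheme yields 10 runs, the following Python sketch shuffles the instance indices twice and splits each shuffle into five folds. It is a plain random split written for this page. The procedure actually used to build the downloadable partitions (for instance, whether any stratification was applied) is not described here, and the function name and seed are hypothetical.

```python
import random


def two_by_five_partitions(n_instances, repetitions=2, folds=5, seed=0):
    """Return a list of (repetition, fold, train_indices, test_indices) tuples."""
    rng = random.Random(seed)
    runs = []
    for rep in range(repetitions):
        indices = list(range(n_instances))
        rng.shuffle(indices)  # a new random order for every repetition
        chunks = [indices[f::folds] for f in range(folds)]
        for f in range(folds):
            test = sorted(chunks[f])
            train = sorted(i for g, chunk in enumerate(chunks) if g != f for i in chunk)
            runs.append((rep + 1, f + 1, train, test))
    return runs


# Example with the number of instances of the emotions dataset.
runs = two_by_five_partitions(593)
print(len(runs), "train/test pairs")
```

Each repetition uses every instance exactly once as test data, so an oversampling method would be applied to the five training parts of each repetition while the test parts stay untouched.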

Datasets and their characteristics - Download
Dataset	# instances	# features	# labels	Card (label cardinality)
bibtex	7395	1836	159	2.402
cal500	502	68	174	26.044
corel5k	5000	499	374	3.522
corel16k	13766	500	161	2.867
emotions	593	72	6	1.869
enron	1702	1001	53	3.378
genbase	662	1186	27	1.252
medical	978	1449	45	1.245
mediamill	43907	120	101	4.376
slashdot	3782	1079	22	1.181
scene	2407	294	6	1.074
tmc2007	28596	500	22	2.158
yeast	2417	198	14	4.237
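The Card column is understood here as label cardinality, that is, the average number of labels per instance. The short Python sketch below shows that computation on a toy list of labelsets, together with label density (cardinality divided by the number of labels). The function names are hypothetical, and the toy data is not taken from any of the datasets above.

```python
def label_cardinality(labelsets):
    """Average number of active labels per instance."""
    return sum(len(ls) for ls in labelsets) / len(labelsets)


def label_density(labelsets, n_labels):
    """Label cardinality normalized by the total number of labels."""
    return label_cardinality(labelsets) / n_labels


# Toy example: four instances described with three possible labels.
toy = [{"a", "b"}, {"a"}, set(), {"a", "b", "c"}]
print(label_cardinality(toy))  # 6 labels over 4 instances -> 1.5
print(label_density(toy, 3))   # 1.5 / 3 -> 0.5
```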

How to use MLSMOTE

The MLSMOTE algorithm has been implemented in Java. You will need a JRE 1.7, along with the uncompressed contents of the mlsmote.zip file. The mlsmote.jar file contains the MLSMOTE implementation. The program is designed to process all the partitions of one multilabel dataset at once. Run it with the following syntax:

java -jar mlsmote.jar -inpath input_path -outpath output_path -fileext file_pattern -xml xml_file -labelCombination n

The parameters needed are the following:

```
-inpath
```
The path where the dataset partitions to be processed are stored. Do not add a trailing / or \ character.
```
-outpath
```
The path where the processed dataset partitions will be stored. The program does not change the filenames, so this path must be different from the -inpath one.
```
-fileext
```
Trailing characters of the names of the files to be processed. Assuming that the files emotions5x2x1tra.arff, emotions5x2x2tra.arff, emotions5x2x1tst.arff, and emotions5x2x2tst.arff are stored in -inpath, the parameter -fileext tra.arff would process only the first two files. An individual file can be processed by giving its full filename.
```
-xml
```
Full path and filename of the XML file associated with the multilabel dataset to process. This file enumerates the labels present in the dataset.
```
-labelCombination
```
Establishes the procedure to generate the labelsets for synthetic instances. The valid values for this parameter are the following:
- 1 -> The synthetic labelset will be the intersection of the neighbors' labelsets.
- 2 -> The synthetic labelset will be the union of the neighbors' labelsets.
- 3 -> The synthetic labelset will be generated using a ranking of labels in the neighbors' labelsets.
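To make the -fileext behaviour concrete, the following Python sketch selects filenames by their trailing characters, using the four emotions files mentioned above. It assumes a plain suffix comparison. How mlsmote.jar matches filenames internally is not documented here, and the function name is hypothetical.

```python
def select_files(filenames, fileext):
    """Keep the filenames that end with the given trailing characters."""
    return [name for name in filenames if name.endswith(fileext)]


files = [
    "emotions5x2x1tra.arff",
    "emotions5x2x2tra.arff",
    "emotions5x2x1tst.arff",
    "emotions5x2x2tst.arff",
]

# Only the training partitions end in "tra.arff".
print(select_files(files, "tra.arff"))
# A full filename selects a single file.
print(select_files(files, "emotions5x2x1tst.arff"))
```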

In the following example, the MLSMOTE algorithm is applied to the files emotions/XXXXtra.arff and the output is saved to the files over-emotions/XXXXtra.arff. The labelsets of the new samples are those generated by the ranking method.

java -jar ~/app/mlsmote.jar -inpath emotions -outpath over-emotions -fileext tra.arff -xml emotions/emotions.xml -labelCombination 3
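When the three labelset methods need to be compared, the command line can be assembled programmatically. The Python sketch below only builds and prints the commands using the parameters documented above; it does not run Java. The helper name, the jar location, and the over-emotions-N output directory names are hypothetical. The checks mirror the constraints stated in the parameter descriptions (no trailing / or \ character, output path different from input path, method in 1 to 3).

```python
import os
import shlex


def mlsmote_command(jar, inpath, outpath, fileext, xml, label_combination):
    """Build the argument list for one MLSMOTE run (hypothetical helper)."""
    if label_combination not in (1, 2, 3):
        raise ValueError("-labelCombination must be 1, 2 or 3")
    # The paths must not end with a trailing / or \ character.
    inpath = inpath.rstrip("/\\")
    outpath = outpath.rstrip("/\\")
    # Filenames are kept, so the output path must differ from the input path.
    if os.path.normpath(inpath) == os.path.normpath(outpath):
        raise ValueError("-outpath must be different from -inpath")
    return [
        "java", "-jar", jar,
        "-inpath", inpath,
        "-outpath", outpath,
        "-fileext", fileext,
        "-xml", xml,
        "-labelCombination", str(label_combination),
    ]


# One command per labelset method, each with its own output directory.
for method in (1, 2, 3):
    cmd = mlsmote_command(
        "mlsmote.jar", "emotions", "over-emotions-%d" % method,
        "tra.arff", "emotions/emotions.xml", method,
    )
    print(shlex.join(cmd))
```

The resulting lists can be passed to subprocess.run once a JRE and mlsmote.jar are available. A jar path starting with ~ would need os.path.expanduser first, because no shell is involved to expand it.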

Page created and maintained by Francisco Charte - 2015