This website contains supplementary material for the paper: F. Charte, A.J. Rivera, M.J. del Jesus, and F. Herrera, "MLSMOTE: A Synthetic Minority Oversampling Technique for Imbalanced Multilabel Classification". Submitted to Knowledge-Based Systems.
The problem of learning from imbalanced data arises in many real-world scenarios, as does the need to build classifiers able to predict more than one class simultaneously (multilabel classification). Tackling imbalance by means of resampling methods is an approach that has been studied in depth recently, primarily in the context of traditional (non-multilabel) classification.
In this paper, MLSMOTE, a new algorithm for producing synthetic instances for imbalanced multilabel datasets, is proposed. Three different methods for generating the labelsets of synthetic samples from nearest neighbors are presented and tested. A thorough experimental study has been conducted to verify the benefits of the proposed algorithm, considering several base multilabel classifiers and other multilabel oversampling algorithms. The empirical analysis, endorsed by appropriate statistical tests, shows that MLSMOTE leads to an overall improvement in classification results.
MLSMOTE is a novel multilabel oversampling algorithm designed to create synthetic instances associated with minority labels. To determine which labels are minority labels, MLSMOTE relies on the multilabel imbalance measures proposed in "Addressing Imbalance in Multilabel Classification: Measures and Random Preprocessing Methods". The features of the synthetic instances are obtained by interpolating values belonging to nearest neighbors, as in SMOTE. The labelsets of these new instances are also gathered from nearest neighbors. Three different methods are studied for this task: the intersection of the labels which appear in the neighbors, the union of those labels, and a third method based on a ranking of appearances.
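The steps described above can be sketched in a few lines of Python. This is an illustrative sketch and not the code shipped in mlsmote.jar. All function names are hypothetical. The sketch assumes that a label counts as minority when its imbalance ratio (IRLbl) exceeds the mean imbalance ratio (MeanIR), that only numeric features are interpolated, that the labelsets passed in are those of the instances considered, and that the ranking method keeps the labels present in at least half of them.

```python
import random
from collections import Counter


def imbalance_ratios(labelsets):
    """IRLbl per label: count of the most frequent label / count of this label."""
    counts = Counter(label for ls in labelsets for label in ls)
    top = max(counts.values())
    return {label: top / n for label, n in counts.items()}


def minority_labels(labelsets):
    """Labels whose IRLbl exceeds MeanIR (assumed minority criterion)."""
    ir = imbalance_ratios(labelsets)
    mean_ir = sum(ir.values()) / len(ir)
    return {label for label, value in ir.items() if value > mean_ir}


def interpolate(sample, neighbor, rng=random):
    """SMOTE-style interpolation of numeric features towards one neighbor."""
    return [s + rng.random() * (n - s) for s, n in zip(sample, neighbor)]


def synthetic_labelset(labelsets, method):
    """Combine labelsets: 1 = intersection, 2 = union, 3 = ranking."""
    sets = [set(ls) for ls in labelsets]
    if method == 1:
        return set.intersection(*sets)
    if method == 2:
        return set.union(*sets)
    if method == 3:
        # Assumed threshold: keep labels present in at least half of the sets.
        counts = Counter(label for ls in sets for label in ls)
        return {label for label, n in counts.items() if 2 * n >= len(sets)}
    raise ValueError("method must be 1, 2 or 3")
```

With the labelsets {a, b}, {a}, {a, c} and {a, b}, the intersection keeps only a, the union keeps a, b and c, and the ranking sketch keeps a and b.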
The experimentation was conducted using 13 datasets from the MULAN and MEKA repositories. Each of these datasets has been randomly partitioned twice into five separate folds in order to perform a 2x5-fold cross validation, which means 10 runs of every algorithm for each dataset. These partitions are available to download.
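As a rough illustration of the 2x5 scheme, the following Python sketch produces the 10 train/test index splits for a dataset of a given size. It is not the procedure used to build the downloadable partitions. The function name, the seed, and the plain (non-stratified) shuffling are assumptions.

```python
import random


def repeated_kfold(n_instances, repetitions=2, folds=5, seed=0):
    """Yield (repetition, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(1, repetitions + 1):
        order = list(range(n_instances))
        rng.shuffle(order)  # a fresh random partition for each repetition
        for fold in range(folds):
            test = order[fold::folds]  # every folds-th shuffled index
            test_set = set(test)
            train = [i for i in order if i not in test_set]
            yield rep, fold + 1, train, test
```

Calling `list(repeated_kfold(593))` for a dataset with 593 instances yields 10 splits, one per run, and within each repetition every instance appears in exactly one test fold.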
Dataset | # instances | # features | # labels | Card (label cardinality) |
---|---|---|---|---|
bibtex | 7395 | 1836 | 159 | 2.402 |
cal500 | 502 | 68 | 174 | 26.044 |
corel5k | 5000 | 499 | 374 | 3.522 |
corel16k | 13766 | 500 | 161 | 2.867 |
emotions | 593 | 72 | 6 | 1.869 |
enron | 1702 | 1001 | 53 | 3.378 |
genbase | 662 | 1186 | 27 | 1.252 |
medical | 978 | 1449 | 45 | 1.245 |
mediamill | 43907 | 120 | 101 | 4.376 |
slashdot | 3782 | 1079 | 22 | 1.181 |
scene | 2407 | 294 | 6 | 1.074 |
tmc2007 | 28596 | 500 | 22 | 2.158 |
yeast | 2417 | 198 | 14 | 4.237 |
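
The Card column reports label cardinality, that is, the average number of active labels per instance. A minimal Python sketch of that computation follows. The function name and the representation of a dataset as a list of label sets are assumptions made for illustration.

```python
def label_cardinality(labelsets):
    """Average number of active labels per instance."""
    return sum(len(ls) for ls in labelsets) / len(labelsets)
```

For four instances with 2, 1, 0 and 3 active labels, the function returns 1.5.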
The MLSMOTE algorithm has been implemented in Java. You will need a JRE 1.7, along with the uncompressed content of the mlsmote.zip file. The mlsmote.jar file contains the MLSMOTE implementation. This program is designed to process all the partitions of one multilabel dataset at once. Run the program with the following syntax:
java -jar mlsmote.jar -inpath input_path -outpath output_path -fileext file_pattern -xml xml_file -labelCombination n
The parameters needed are the following:
-inpath: The path where the dataset partitions to be processed are stored. Do not add a trailing / or \ character.
-outpath: The path where the processed dataset partitions will be stored. The program will not change the filenames, so this path must be different from -inpath.
-fileext: Trailing characters in the filenames of the files to be processed. Assuming that the files emotions5x2x1tra.arff, emotions5x2x2tra.arff, emotions5x2x1tst.arff, and emotions5x2x2tst.arff are stored in -inpath, the parameter -fileext tra.arff would process the first two files. It is possible to process an individual file using its full filename.
-xml: Full path and filename of the XML file associated with the multilabel dataset to process. This file enumerates the labels that exist in the dataset.
-labelCombination: Establishes the procedure to generate the labelsets for synthetic instances. The valid values for this parameter are the following:
1 -> The synthetic labelset will be the intersection of the neighbors' labelsets.
2 -> The synthetic labelset will be the union of the neighbors' labelsets.
3 -> The synthetic labelset will be generated using a ranking of labels in the neighbors' labelsets.
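The rules stated for the path and filename parameters can be checked before launching the program. The Python sketch below is a hypothetical helper and not part of mlsmote.jar. It assumes that -fileext is matched as a plain filename suffix, as the description of that parameter suggests.

```python
import os


def select_by_suffix(filenames, fileext):
    """Return the filenames that end with fileext, sorted by name."""
    return sorted(name for name in filenames if name.endswith(fileext))


def check_paths(inpath, outpath):
    """Raise ValueError if the paths break the rules given for the parameters."""
    for path in (inpath, outpath):
        if path.endswith(("/", "\\")):
            raise ValueError(f"remove the trailing separator from {path!r}")
    if os.path.abspath(inpath) == os.path.abspath(outpath):
        raise ValueError("-outpath must be different from -inpath")
```

Applied to the four emotions files listed above, `select_by_suffix(names, "tra.arff")` returns the two training partitions, and passing a full filename selects that single file.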
In the following example the MLSMOTE algorithm would be applied to the files emotions/XXXXtra.arff and the output would be saved in the files over-emotions/XXXXtra.arff. The new samples would have as labelsets those generated by the ranking method.
java -jar ~/app/mlsmote.jar -inpath emotions -outpath over-emotions -fileext tra.arff -xml emotions/emotions.xml -labelCombination 3
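To compare the three labelset methods on the same partitions, the command above can be assembled programmatically. The following Python sketch only uses the options documented on this page. The helper names and the over-DATASET-N output directory naming are hypothetical, and creating the output directory beforehand is a precaution, since it is not stated whether the program creates it.

```python
import os
import subprocess


def mlsmote_command(jar, inpath, outpath, fileext, xml, label_combination):
    """Build the argument list for one run of mlsmote.jar."""
    if label_combination not in (1, 2, 3):
        raise ValueError("-labelCombination must be 1, 2 or 3")
    return ["java", "-jar", jar,
            "-inpath", inpath, "-outpath", outpath,
            "-fileext", fileext, "-xml", xml,
            "-labelCombination", str(label_combination)]


def run_all_methods(jar, dataset, xml, fileext="tra.arff"):
    """Run the three labelset methods, each into its own output directory."""
    jar = os.path.expanduser(jar)  # "~" is not expanded without a shell
    for method in (1, 2, 3):
        outpath = f"over-{dataset}-{method}"  # hypothetical naming scheme
        os.makedirs(outpath, exist_ok=True)
        command = mlsmote_command(jar, dataset, outpath, fileext, xml, method)
        subprocess.run(command, check=True)
```

A call such as `run_all_methods("~/app/mlsmote.jar", "emotions", "emotions/emotions.xml")` would launch one oversampling run per method and stop at the first run that exits with an error.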
Page created and maintained by Francisco Charte - 2015