This website contains additional material for the paper: F. Charte, A.J. Rivera, M.J. del Jesus, and F. Herrera. MLeNN: A First Approach to Heuristic Multilabel Undersampling. Proceedings of the 15th Intelligent Data Engineering and Automated Learning (IDEAL 2014), September, Volume 8669, Salamanca (Spain), pp. 1-9 (2014)
Learning from imbalanced multilabel data is a challenging task that has attracted considerable attention lately. Some resampling algorithms used in traditional classification, such as random undersampling and random oversampling, have already been adapted to work with multilabel datasets.
In this paper MLeNN, a heuristic multilabel undersampling algorithm based on the well-known Wilson's Edited Nearest Neighbor Rule, is proposed. The samples to be removed are heuristically selected, instead of randomly picked. The ability of MLeNN to improve classification results is experimentally tested, and its performance against multilabel random undersampling is analyzed. As will be shown, MLeNN is a competitive multilabel undersampling alternative, able to significantly enhance classification results.
The datasets shown in the following table were used in the experiments of this paper, partitioned with a 2x5 scheme. These partitions are available to download (185 MB).
Dataset | Instances | Attributes | Labels | MaxIR | MeanIR |
---|---|---|---|---|---|
corel5k | 5000 | 499 | 374 | 1120.00 | 189.57 |
corel16k | 13766 | 500 | 161 | 126.80 | 34.16 |
emotions | 593 | 72 | 6 | 1.78 | 1.48 |
enron | 1702 | 753 | 53 | 913.00 | 73.95 |
mediamill | 43907 | 120 | 101 | 1092.55 | 256.40 |
yeast | 2417 | 198 | 14 | 53.41 | 7.20 |
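The MaxIR and MeanIR columns summarize how imbalanced the labels of each dataset are. This page does not define them, so the following is only a minimal sketch under an assumed definition: the imbalance ratio of a label (IRLbl) is taken as the count of the most frequent label divided by the count of that label, MeanIR as the average of those ratios, and MaxIR as the largest one. The function name `imbalance_measures` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def imbalance_measures(labelsets: List[Set[str]]) -> Dict[str, object]:
    """Compute per-label IRLbl plus MeanIR and MaxIR (assumed definitions).

    Labels that never appear in `labelsets` are ignored in this sketch.
    """
    counts: Dict[str, int] = {}
    for labels in labelsets:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1

    majority = max(counts.values())  # count of the most frequent label
    irlbl = {label: majority / n for label, n in counts.items()}
    return {
        "IRLbl": irlbl,
        "MeanIR": sum(irlbl.values()) / len(irlbl),
        "MaxIR": max(irlbl.values()),
    }


# Toy labelsets: label "a" appears 4 times, "b" twice, "c" once.
toy = [{"a"}, {"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
measures = imbalance_measures(toy)
print(measures["MeanIR"], measures["MaxIR"])
```

Under this reading, a dataset such as emotions (MaxIR 1.78) is close to balanced, while corel5k (MaxIR 1120.00) contains labels that are extremely rare compared with the most frequent one.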
Classification results obtained by processing the base datasets, the ones preprocessed by MLeNN, and the ones preprocessed with LP-RUS are shown in the following tables.
Algorithm | Dataset | Base | MLeNN | LP-RUS |
---|---|---|---|---|
BR-J48 | corel16k | 0.0650 | 0.0768 | 0.0593 |
BR-J48 | corel5k | 0.0586 | 0.0746 | 0.0480 |
BR-J48 | emotions | 0.4435 | 0.4410 | 0.4328 |
BR-J48 | enron | 0.4010 | 0.4044 | 0.3292 |
BR-J48 | mediamill | 0.4194 | 0.4314 | 0.3964 |
BR-J48 | yeast | 0.4329 | 0.4395 | 0.4150 |
CLR | corel16k | 0.0456 | 0.0652 | 0.0401 |
CLR | corel5k | 0.0360 | 0.0606 | 0.0292 |
CLR | emotions | 0.4480 | 0.4424 | 0.4338 |
CLR | enron | 0.4171 | 0.4108 | 0.3422 |
CLR | mediamill | 0.4490 | 0.4607 | 0.4342 |
CLR | yeast | 0.4698 | 0.4766 | 0.4566 |
RAkEL-BR | corel16k | 0.0645 | 0.0766 | 0.0592 |
RAkEL-BR | corel5k | 0.0586 | 0.0746 | 0.0480 |
RAkEL-BR | emotions | 0.4435 | 0.4410 | 0.4328 |
RAkEL-BR | enron | 0.4010 | 0.4044 | 0.3292 |
RAkEL-BR | mediamill | 0.4194 | 0.4314 | 0.3964 |
RAkEL-BR | yeast | 0.4338 | 0.4400 | 0.4160 |
Algorithm | Dataset | Base | MLeNN | LP-RUS |
---|---|---|---|---|
BR-J48 | corel16k | 0.1336 | 0.1255 | 0.1198 |
BR-J48 | corel5k | 0.1774 | 0.1966 | 0.1552 |
BR-J48 | emotions | 0.5712 | 0.5745 | 0.5616 |
BR-J48 | enron | 0.4029 | 0.3902 | 0.4086 |
BR-J48 | mediamill | 0.2774 | 0.2750 | 0.2766 |
BR-J48 | yeast | 0.4341 | 0.4453 | 0.4289 |
CLR | corel16k | 0.1003 | 0.1404 | 0.0967 |
CLR | corel5k | 0.1330 | 0.1843 | 0.1328 |
CLR | emotions | 0.5982 | 0.5935 | 0.5805 |
CLR | enron | 0.4198 | 0.3900 | 0.4184 |
CLR | mediamill | 0.2276 | 0.2350 | 0.2340 |
CLR | yeast | 0.4480 | 0.4554 | 0.4562 |
RAkEL-BR | corel16k | 0.1277 | 0.1242 | 0.1176 |
RAkEL-BR | corel5k | 0.1774 | 0.1966 | 0.1552 |
RAkEL-BR | emotions | 0.5712 | 0.5745 | 0.5616 |
RAkEL-BR | enron | 0.4029 | 0.3902 | 0.4086 |
RAkEL-BR | mediamill | 0.2774 | 0.2750 | 0.2766 |
RAkEL-BR | yeast | 0.4466 | 0.4545 | 0.4436 |
Algorithm | Dataset | Base | MLeNN | LP-RUS |
---|---|---|---|---|
BR-J48 | corel16k | 0.1156 | 0.1306 | 0.1066 |
BR-J48 | corel5k | 0.1096 | 0.1314 | 0.0904 |
BR-J48 | emotions | 0.5845 | 0.5867 | 0.5762 |
BR-J48 | enron | 0.5334 | 0.5290 | 0.5054 |
BR-J48 | mediamill | 0.5622 | 0.5686 | 0.5490 |
BR-J48 | yeast | 0.5787 | 0.5843 | 0.5664 |
CLR | corel16k | 0.0846 | 0.1145 | 0.0766 |
CLR | corel5k | 0.0706 | 0.1094 | 0.0573 |
CLR | emotions | 0.6072 | 0.6032 | 0.5916 |
CLR | enron | 0.5596 | 0.5451 | 0.5308 |
CLR | mediamill | 0.5928 | 0.5984 | 0.5865 |
CLR | yeast | 0.6168 | 0.6229 | 0.6099 |
RAkEL-BR | corel16k | 0.1145 | 0.1304 | 0.1061 |
RAkEL-BR | corel5k | 0.1096 | 0.1314 | 0.0904 |
RAkEL-BR | emotions | 0.5845 | 0.5867 | 0.5762 |
RAkEL-BR | enron | 0.5334 | 0.5290 | 0.5054 |
RAkEL-BR | mediamill | 0.5622 | 0.5686 | 0.5490 |
RAkEL-BR | yeast | 0.5796 | 0.5850 | 0.5674 |
The previous results, grouped by measure and classification algorithm, are visually represented in the following spider plots.
The MLeNN algorithm has been implemented in Java. You will need a JRE 1.7, along with the uncompressed contents of the mlenn.zip file (coming soon). The mlenn.jar file contains the MLeNN implementation. This program is designed to process all the partitions of one multilabel dataset at once. Run the program with the following syntax:
java -jar mlenn.jar -inpath input_path -outpath output_path -fileext file_pattern -xml xml_file -HT float -NN int
The parameters needed are the following:
-inpath The path where the dataset partitions to be processed are stored. Do not add a trailing / or \ character.
-outpath The path where the processed dataset partitions will be stored. The program will not change the filenames, so this path has to be different from the one given in -inpath.
-fileext Trailing characters in the filenames of the files to be processed. Assuming that the files emotions5x2x1tra.arff, emotions5x2x2tra.arff, emotions5x2x1tst.arff, and emotions5x2x2tst.arff are stored in -inpath, the parameter -fileext tra.arff would process the first two files. It is possible to process an individual file using its full filename.
-xml Full path and filename of the XML file associated with the multilabel dataset to process. This file enumerates the labels present in the dataset.
-NN Sets the number of neighbors to use. Default value is 3.
-HT Sets the threshold for the Hamming distance. Must be a value in the interval (0,1]. Default value is 0.5.
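To illustrate how -NN and -HT could interact, here is a minimal sketch of an Edited Nearest Neighbor style pass over multilabel data. It is not the code in mlenn.jar, and several details are assumptions: the labelset distance is normalized by the number of labels active in either instance, neighbors are found with Euclidean distance, a sample is removed when at least half of its neighbors have a labelset farther than the threshold, and samples carrying any label in a hypothetical `protected` set (for example, minority labels) are never removed. The names `label_distance` and `mlenn_sketch` are made up for this example.

```python
import math
from typing import List, Sequence, Set


def label_distance(a: Set[str], b: Set[str]) -> float:
    """Hamming distance between labelsets, scaled to [0, 1] (assumed normalization)."""
    active = a | b
    return len(a ^ b) / len(active) if active else 0.0


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def mlenn_sketch(
    features: List[Sequence[float]],
    labelsets: List[Set[str]],
    protected: Set[str],
    nn: int = 3,    # plays the role of -NN
    ht: float = 0.5,  # plays the role of -HT
) -> List[int]:
    """Return the indices of the samples that survive the editing pass."""
    keep = []
    for i, (x, y) in enumerate(zip(features, labelsets)):
        if y & protected:
            keep.append(i)  # assumed rule: never drop samples with protected labels
            continue
        # Nearest neighbors are searched in the original, unedited dataset.
        others = [j for j in range(len(features)) if j != i]
        neighbours = sorted(others, key=lambda j: euclidean(x, features[j]))[:nn]
        different = sum(
            1 for j in neighbours if label_distance(y, labelsets[j]) > ht
        )
        if different < nn / 2:  # assumed rule: drop when half or more differ
            keep.append(i)
    return keep


# Toy data: index 3 carries label "b" but sits among samples labelled "a".
features = [(0.0,), (0.1,), (0.2,), (0.15,), (5.0,), (5.1,), (5.2,)]
labelsets = [{"a"}, {"a"}, {"a"}, {"b"}, {"b"}, {"b"}, {"b"}]
kept = mlenn_sketch(features, labelsets, protected=set(), nn=3, ht=0.5)
print(kept)  # [0, 1, 2, 4, 5, 6]
```

In this sketch a higher threshold makes the pass more conservative, since fewer neighbors count as different and fewer samples are removed.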
In the following example the MLeNN algorithm would be applied to the files emotions/XXXXtra.arff and the output would be saved in the files under-emotions/XXXXtra.arff. A Hamming distance threshold of 0.75 would be used to consider two instances different, and the 3 nearest neighbors would be used.
java -jar ~/app/mlenn.jar -inpath emotions -outpath under-emotions -fileext tra.arff -xml emotions/emotions.xml -HT 0.75 -NN 3
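When the program has to be run for several datasets, it can help to assemble the command line in a script. The sketch below is a hypothetical helper, not part of the MLeNN distribution: `select_partitions` mimics the suffix matching described for -fileext, and `build_command` checks the constraints stated in the parameter list before producing the argument list. Only the flags documented on this page are used.

```python
from typing import List


def select_partitions(filenames: List[str], fileext: str) -> List[str]:
    """Files whose names end with `fileext`, as described for -fileext."""
    return sorted(name for name in filenames if name.endswith(fileext))


def build_command(
    jar: str,
    inpath: str,
    outpath: str,
    fileext: str,
    xml: str,
    ht: float = 0.5,
    nn: int = 3,
) -> List[str]:
    """Build the argument list for mlenn.jar, validating documented constraints."""
    if inpath.endswith(("/", "\\")):
        raise ValueError("-inpath must not end with / or \\")
    if inpath == outpath:
        raise ValueError("-outpath must differ from -inpath")
    if not 0 < ht <= 1:
        raise ValueError("-HT must be in the interval (0,1]")
    return [
        "java", "-jar", jar,
        "-inpath", inpath,
        "-outpath", outpath,
        "-fileext", fileext,
        "-xml", xml,
        "-HT", str(ht),
        "-NN", str(nn),
    ]


files = [
    "emotions5x2x1tra.arff",
    "emotions5x2x2tra.arff",
    "emotions5x2x1tst.arff",
    "emotions5x2x2tst.arff",
]
selected = select_partitions(files, "tra.arff")
command = build_command(
    "mlenn.jar", "emotions", "under-emotions", "tra.arff",
    "emotions/emotions.xml", ht=0.75, nn=3,
)
print(selected)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run`. Note that a path such as ~/app/mlenn.jar is expanded by the shell, not by Java, so in a script it would need `os.path.expanduser` first.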