MLeNN: A First Approach to Heuristic Multilabel Undersampling

This website contains additional material to the paper: F. Charte, A.J. Rivera, M.J. del Jesus, and F. Herrera. MLeNN: A First Approach to Heuristic Multilabel Undersampling. Proceedings of the 15th Intelligent Data Engineering and Automated Learning (IDEAL 2014), Salamanca (Spain), September 2014, Volume 8669, pp. 1-9.

Summary

Abstract

Learning from imbalanced multilabel data is a challenging task that has attracted considerable attention lately. Some resampling algorithms used in traditional classification, such as random undersampling and random oversampling, have already been adapted to work with multilabel datasets.

In this paper MLeNN, a heuristic multilabel undersampling algorithm based on the well-known Wilson's Edited Nearest Neighbor Rule, is proposed. The samples to be removed are heuristically selected, instead of randomly picked. The ability of MLeNN to improve classification results is experimentally tested, and its performance against multilabel random undersampling is analyzed. As will be shown, MLeNN is a competitive multilabel undersampling alternative, able to significantly enhance classification results.
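The core of the heuristic can be outlined as follows. A minimal sketch in Java, assuming binary label vectors: the method names (adjustedHammingDistance, markForRemoval) and the exact majority rule are illustrative choices, not the authors' actual implementation. The idea, following the ENN rule, is to compare each instance's labelset against those of its nearest neighbors, and mark the instance for removal when most neighbors have a clearly different labelset.

```java
import java.util.List;

// Illustrative sketch of an ENN-style multilabel marking step.
public class MlennSketch {

    // Difference between two labelsets: number of differing labels
    // divided by the number of labels active in either instance.
    static double adjustedHammingDistance(boolean[] a, boolean[] b) {
        int diff = 0, active = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) diff++;
            if (a[i] || b[i]) active++;
        }
        return active == 0 ? 0.0 : (double) diff / active;
    }

    // ENN rule adapted to multilabel data: mark an instance for removal
    // when at least half of its nearest neighbors have a labelset whose
    // distance exceeds the threshold ht (the -HT parameter below).
    static boolean markForRemoval(boolean[] labels,
                                  List<boolean[]> neighbourLabels,
                                  double ht) {
        int differing = 0;
        for (boolean[] nb : neighbourLabels)
            if (adjustedHammingDistance(labels, nb) > ht) differing++;
        return differing >= (neighbourLabels.size() + 1) / 2;
    }
}
```

In the full algorithm only instances free of minority labels are considered as removal candidates, which is what keeps the undersampling from discarding the rare labels it is meant to protect.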

Datasets

The datasets shown in the following table were used in the experimentation for this paper, partitioned with a 2x5 fold cross-validation scheme. These partitions are available to download (185 MB).

Dataset characteristics
Dataset Instances Attributes Labels MaxIR MeanIR
corel5k 5000 499 374 1120.00 189.57
corel16k 13766 500 161 126.80 34.16
emotions 593 72 6 1.78 1.48
enron 1702 753 53 913.00 73.95
mediamill 43907 120 101 1092.55 256.40
yeast 2417 198 14 53.41 7.20
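The MaxIR and MeanIR columns above measure label imbalance. A brief sketch of these measures, following the definitions used by the authors in related work: IRLbl(l) is the count of the most frequent label divided by the count of label l, MeanIR averages it over all labels, and MaxIR is its maximum. The method names here are illustrative.

```java
// Imbalance measures for a multilabel dataset, computed from the
// per-label instance counts (labelCounts[i] = instances with label i).
public class ImbalanceMeasures {

    // IRLbl(l) = frequency of the most common label / frequency of l.
    static double[] irLbl(int[] labelCounts) {
        int max = 0;
        for (int c : labelCounts) max = Math.max(max, c);
        double[] ir = new double[labelCounts.length];
        for (int i = 0; i < labelCounts.length; i++)
            ir[i] = (double) max / labelCounts[i];
        return ir;
    }

    // MeanIR: average imbalance ratio over all labels.
    static double meanIR(int[] labelCounts) {
        double sum = 0;
        for (double v : irLbl(labelCounts)) sum += v;
        return sum / labelCounts.length;
    }

    // MaxIR: imbalance ratio of the rarest label.
    static double maxIR(int[] labelCounts) {
        double m = 0;
        for (double v : irLbl(labelCounts)) m = Math.max(m, v);
        return m;
    }
}
```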

Results

Classification results obtained by processing the base datasets, the datasets preprocessed with MLeNN, and the datasets preprocessed with LP-RUS are shown in the following tables.

Accuracy
Algorithm Dataset Base MLeNN LP-RUS
BR-J48 corel16k 0.0650 0.0768 0.0593
BR-J48 corel5k 0.0586 0.0746 0.0480
BR-J48 emotions 0.4435 0.4410 0.4328
BR-J48 enron 0.4010 0.4044 0.3292
BR-J48 mediamill 0.4194 0.4314 0.3964
BR-J48 yeast 0.4329 0.4395 0.4150
CLR corel16k 0.0456 0.0652 0.0401
CLR corel5k 0.0360 0.0606 0.0292
CLR emotions 0.4480 0.4424 0.4338
CLR enron 0.4171 0.4108 0.3422
CLR mediamill 0.4490 0.4607 0.4342
CLR yeast 0.4698 0.4766 0.4566
RAkEL-BR corel16k 0.0645 0.0766 0.0592
RAkEL-BR corel5k 0.0586 0.0746 0.0480
RAkEL-BR emotions 0.4435 0.4410 0.4328
RAkEL-BR enron 0.4010 0.4044 0.3292
RAkEL-BR mediamill 0.4194 0.4314 0.3964
RAkEL-BR yeast 0.4338 0.4400 0.4160
MacroFMeasure
Algorithm Dataset Base MLeNN LP-RUS
BR-J48 corel16k 0.1336 0.1255 0.1198
BR-J48 corel5k 0.1774 0.1966 0.1552
BR-J48 emotions 0.5712 0.5745 0.5616
BR-J48 enron 0.4029 0.3902 0.4086
BR-J48 mediamill 0.2774 0.2750 0.2766
BR-J48 yeast 0.4341 0.4453 0.4289
CLR corel16k 0.1003 0.1404 0.0967
CLR corel5k 0.1330 0.1843 0.1328
CLR emotions 0.5982 0.5935 0.5805
CLR enron 0.4198 0.3900 0.4184
CLR mediamill 0.2276 0.2350 0.2340
CLR yeast 0.4480 0.4554 0.4562
RAkEL-BR corel16k 0.1277 0.1242 0.1176
RAkEL-BR corel5k 0.1774 0.1966 0.1552
RAkEL-BR emotions 0.5712 0.5745 0.5616
RAkEL-BR enron 0.4029 0.3902 0.4086
RAkEL-BR mediamill 0.2774 0.2750 0.2766
RAkEL-BR yeast 0.4466 0.4545 0.4436
MicroFMeasure
Algorithm Dataset Base MLeNN LP-RUS
BR-J48 corel16k 0.1156 0.1306 0.1066
BR-J48 corel5k 0.1096 0.1314 0.0904
BR-J48 emotions 0.5845 0.5867 0.5762
BR-J48 enron 0.5334 0.5290 0.5054
BR-J48 mediamill 0.5622 0.5686 0.5490
BR-J48 yeast 0.5787 0.5843 0.5664
CLR corel16k 0.0846 0.1145 0.0766
CLR corel5k 0.0706 0.1094 0.0573
CLR emotions 0.6072 0.6032 0.5916
CLR enron 0.5596 0.5451 0.5308
CLR mediamill 0.5928 0.5984 0.5865
CLR yeast 0.6168 0.6229 0.6099
RAkEL-BR corel16k 0.1145 0.1304 0.1061
RAkEL-BR corel5k 0.1096 0.1314 0.0904
RAkEL-BR emotions 0.5845 0.5867 0.5762
RAkEL-BR enron 0.5334 0.5290 0.5054
RAkEL-BR mediamill 0.5622 0.5686 0.5490
RAkEL-BR yeast 0.5796 0.5850 0.5674

The previous results, grouped by measure and classification algorithm, are visually represented in the following spider plots.

[Spider plots: classification results, one plot per evaluation measure and classification algorithm]

How to use MLeNN

The MLeNN algorithm has been implemented in Java. You will need a JRE 1.7, along with the uncompressed content of the mlenn.zip file (coming soon). The mlenn.jar file contains the MLeNN implementation. This program is designed to process all the partitions of one multilabel dataset at once. Run the program with the following syntax:

java -jar mlenn.jar -inpath input_path -outpath output_path -fileext file_pattern -xml xml_file -HT float -NN int

The parameters needed are the following:

-inpath: folder containing the dataset partitions to process.
-outpath: folder where the preprocessed partitions will be saved.
-fileext: pattern used to select which files in the input folder are processed (e.g. tra.arff).
-xml: XML file describing the dataset labels (Mulan format).
-HT: threshold (float) above which two instances' labelsets are considered different.
-NN: number (int) of nearest neighbors to examine for each instance.

In the following example, the MLeNN algorithm would be applied to the files emotions/XXXXtra.arff and the output would be saved in the files under-emotions/XXXXtra.arff. A threshold of 0.75 would be used to decide whether two labelsets are considered different, and the 3 nearest neighbors of each instance would be examined.

java -jar ~/app/mlenn.jar -inpath emotions -outpath under-emotions -fileext tra.arff -xml emotions/emotions.xml -HT 0.75 -NN 3