MLeNN: A First Approach to Heuristic Multilabel Undersampling

This website contains additional material to the paper: F. Charte, A.J. Rivera, M.J. del Jesus, and F. Herrera. MLeNN: A First Approach to Heuristic Multilabel Undersampling. Proceedings of the 15th Intelligent Data Engineering and Automated Learning (IDEAL 2014), September, Volume 8669, Salamanca (Spain), p.1-9, (2014)

Summary

Abstract
Datasets
Results
How to use MLeNN

Abstract

Learning from imbalanced multilabel data is a challenging task that has attracted considerable attention lately. Some resampling algorithms used in traditional classification, such as random undersampling and random oversampling, have already been adapted to work with multilabel datasets.

In this paper MLeNN, a heuristic multilabel undersampling algorithm based on the well-known Wilson's Edited Nearest Neighbor Rule, is proposed. The samples to be removed are heuristically selected, instead of randomly picked. The ability of MLeNN to improve classification results is experimentally tested, and its performance against multilabel random undersampling is analyzed. As will be shown, MLeNN is a competitive multilabel undersampling alternative, able to significantly enhance classification results.

Datasets

The datasets shown in the following table were used in the experiments for this paper, partitioned with a 2x5 scheme. These partitions are available to download (185 MB).

Dataset characteristics
Dataset	Instances	Attributes	Labels	MaxIR	MeanIR
corel5k	5000	499	374	1120.00	189.57
corel16k	13766	500	161	126.80	34.16
emotions	593	72	6	1.78	1.48
enron	1702	753	53	913.00	73.95
mediamill	43907	120	101	1092.55	256.40
yeast	2417	198	14	53.41	7.20
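
The MaxIR and MeanIR columns summarize how imbalanced the labels of each dataset are. As a minimal sketch, assuming the usual definitions from the authors' related work (the IRLbl of a label is the count of the most frequent label divided by that label's count, MeanIR is the average IRLbl, and MaxIR is the largest IRLbl), they could be computed as shown here. The function name `imbalance_measures` and the toy label matrix are illustrative only and are not part of the MLeNN distribution.

```python
def imbalance_measures(label_matrix):
    """Return (MaxIR, MeanIR) for a list of binary label vectors."""
    n_labels = len(label_matrix[0])
    # Number of instances in which each label is active.
    counts = [sum(row[j] for row in label_matrix) for j in range(n_labels)]
    # Labels that never appear are skipped to avoid division by zero
    # (an assumption of this sketch).
    present = [c for c in counts if c > 0]
    most_frequent = max(present)
    # IRLbl: ratio between the most frequent label and each label.
    irlbl = [most_frequent / c for c in present]
    return max(irlbl), sum(irlbl) / len(irlbl)


# Toy data: 4 instances, 3 labels (hypothetical, not one of the datasets above).
toy = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
max_ir, mean_ir = imbalance_measures(toy)
```

In the toy data the three labels appear 4, 2, and 1 times, so their IRLbl values are 1, 2, and 4. A MeanIR well above 1, as in corel5k or mediamill, indicates that many labels are far rarer than the most frequent one.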

Results

Classification results obtained by processing the base datasets, the ones preprocessed with MLeNN, and the ones preprocessed with LP-RUS are shown in the following tables.

Accuracy
Algorithm	Dataset	Base	MLeNN	LP-RUS
BR-J48	corel16k	0.0650	0.0768	0.0593
BR-J48	corel5k	0.0586	0.0746	0.0480
BR-J48	emotions	0.4435	0.4410	0.4328
BR-J48	enron	0.4010	0.4044	0.3292
BR-J48	mediamill	0.4194	0.4314	0.3964
BR-J48	yeast	0.4329	0.4395	0.4150
CLR	corel16k	0.0456	0.0652	0.0401
CLR	corel5k	0.0360	0.0606	0.0292
CLR	emotions	0.4480	0.4424	0.4338
CLR	enron	0.4171	0.4108	0.3422
CLR	mediamill	0.4490	0.4607	0.4342
CLR	yeast	0.4698	0.4766	0.4566
RAkEL-BR	corel16k	0.0645	0.0766	0.0592
RAkEL-BR	corel5k	0.0586	0.0746	0.0480
RAkEL-BR	emotions	0.4435	0.4410	0.4328
RAkEL-BR	enron	0.4010	0.4044	0.3292
RAkEL-BR	mediamill	0.4194	0.4314	0.3964
RAkEL-BR	yeast	0.4338	0.4400	0.4160

MacroFMeasure
Algorithm	Dataset	Base	MLeNN	LP-RUS
BR-J48	corel16k	0.1336	0.1255	0.1198
BR-J48	corel5k	0.1774	0.1966	0.1552
BR-J48	emotions	0.5712	0.5745	0.5616
BR-J48	enron	0.4029	0.3902	0.4086
BR-J48	mediamill	0.2774	0.2750	0.2766
BR-J48	yeast	0.4341	0.4453	0.4289
CLR	corel16k	0.1003	0.1404	0.0967
CLR	corel5k	0.1330	0.1843	0.1328
CLR	emotions	0.5982	0.5935	0.5805
CLR	enron	0.4198	0.3900	0.4184
CLR	mediamill	0.2276	0.2350	0.2340
CLR	yeast	0.4480	0.4554	0.4562
RAkEL-BR	corel16k	0.1277	0.1242	0.1176
RAkEL-BR	corel5k	0.1774	0.1966	0.1552
RAkEL-BR	emotions	0.5712	0.5745	0.5616
RAkEL-BR	enron	0.4029	0.3902	0.4086
RAkEL-BR	mediamill	0.2774	0.2750	0.2766
RAkEL-BR	yeast	0.4466	0.4545	0.4436

MicroFMeasure
Algorithm	Dataset	Base	MLeNN	LP-RUS
BR-J48	corel16k	0.1156	0.1306	0.1066
BR-J48	corel5k	0.1096	0.1314	0.0904
BR-J48	emotions	0.5845	0.5867	0.5762
BR-J48	enron	0.5334	0.5290	0.5054
BR-J48	mediamill	0.5622	0.5686	0.5490
BR-J48	yeast	0.5787	0.5843	0.5664
CLR	corel16k	0.0846	0.1145	0.0766
CLR	corel5k	0.0706	0.1094	0.0573
CLR	emotions	0.6072	0.6032	0.5916
CLR	enron	0.5596	0.5451	0.5308
CLR	mediamill	0.5928	0.5984	0.5865
CLR	yeast	0.6168	0.6229	0.6099
RAkEL-BR	corel16k	0.1145	0.1304	0.1061
RAkEL-BR	corel5k	0.1096	0.1314	0.0904
RAkEL-BR	emotions	0.5845	0.5867	0.5762
RAkEL-BR	enron	0.5334	0.5290	0.5054
RAkEL-BR	mediamill	0.5622	0.5686	0.5490
RAkEL-BR	yeast	0.5796	0.5850	0.5674

The previous results, grouped by measure and classification algorithm, are visually represented in the following spider plots.

How to use MLeNN

The MLeNN algorithm has been implemented in Java. You will need JRE 1.7, along with the uncompressed content of the mlenn.zip file (coming soon). The mlenn.jar file contains the MLeNN implementation. This program is designed to process all the partitions of one multilabel dataset at once. Run the program with the following syntax:

java -jar mlenn.jar -inpath input_path -outpath output_path -fileext file_pattern -xml xml_file -HT float -NN int

The parameters needed are the following:

-inpath The path where the dataset partitions to be processed are stored. Do not add a trailing / or \ character.
-outpath The path where the processed dataset partitions will be stored. The program does not change the filenames, so this path must be different from -inpath.
-fileext Trailing characters in the filenames of the files to be processed. Assuming that the files emotions5x2x1tra.arff, emotions5x2x2tra.arff, emotions5x2x1tst.arff, and emotions5x2x2tst.arff are stored in -inpath, the parameter -fileext tra.arff would process the first two files. It is possible to process an individual file using its full filename.
-xml Full path and filename of the XML file associated with the multilabel dataset to process. This file enumerates the labels present in the dataset.
-NN Sets the number of neighbors to use. Default value is 3.
-HT Sets the threshold for the Hamming distance. Must be a value in the interval (0,1]. Default value is 0.5.
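
To make the -fileext behavior concrete, here is a minimal sketch that assumes the selection is a plain suffix match on the filename, as the description of the parameter suggests. The helper `select_partitions` is hypothetical and only mimics the selection step; it is not taken from mlenn.jar.

```python
def select_partitions(filenames, fileext):
    """Keep the filenames whose trailing characters equal fileext."""
    return [name for name in filenames if name.endswith(fileext)]


# The four files used as an example in the -fileext description.
files = [
    "emotions5x2x1tra.arff",
    "emotions5x2x2tra.arff",
    "emotions5x2x1tst.arff",
    "emotions5x2x2tst.arff",
]

# Only the training partitions match the pattern "tra.arff".
selected = select_partitions(files, "tra.arff")

# A full filename selects a single file.
single = select_partitions(files, "emotions5x2x1tst.arff")
```

Under this assumption, passing tra.arff restricts the undersampling to the training partitions and leaves the test partitions untouched.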

In the following example the MLeNN algorithm would be applied to the files emotions/XXXXtra.arff and the output would be saved in the files under-emotions/XXXXtra.arff. A 75% threshold would be used to decide whether two instances are considered different, and the 3 nearest neighbors would be used.

java -jar ~/app/mlenn.jar -inpath emotions -outpath under-emotions -fileext tra.arff -xml emotions/emotions.xml -HT 0.75 -NN 3
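
The way -HT and -NN might interact can be sketched with an Edited Nearest Neighbor style check. This is an illustration under stated assumptions, not the code in mlenn.jar: the distance is taken as the fraction of labels on which two label vectors differ (the actual implementation may normalize differently), two instances count as different when that distance exceeds the threshold, and a sample is flagged for removal when at least half of its neighbors are different. The selection of candidate samples and the neighbor search on the input features are omitted, and the names `hamming_distance` and `should_remove` are hypothetical.

```python
def hamming_distance(a, b):
    """Fraction of label positions in which two binary label vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def should_remove(sample_labels, neighbor_labels, ht=0.5):
    """ENN-style check: flag the sample when at least half of its
    neighbors have a label vector considered different from its own.

    The 'at least half' rule and the strict comparison against ht
    are assumptions of this sketch.
    """
    different = sum(
        1 for labels in neighbor_labels
        if hamming_distance(sample_labels, labels) > ht
    )
    return different >= len(neighbor_labels) / 2


# Hypothetical sample with its 3 nearest neighbors (as with -NN 3).
sample = [1, 0, 0, 0]
neighbors = [
    [0, 1, 1, 1],  # distance 1.00
    [0, 1, 1, 0],  # distance 0.75
    [1, 0, 0, 0],  # distance 0.00
]

removed_default = should_remove(sample, neighbors, ht=0.5)
removed_strict = should_remove(sample, neighbors, ht=0.75)
```

With the default threshold of 0.5, two of the three neighbors count as different and the sample is flagged. With 0.75, only one neighbor exceeds the threshold and the sample is kept, which shows how a higher -HT value makes the undersampling more conservative in this sketch.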