On the Impact of Dataset Complexity and Sampling Strategy in Multilabel Classifiers Performance

Title: On the Impact of Dataset Complexity and Sampling Strategy in Multilabel Classifiers Performance
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Charte, Francisco, Rivera-Rivas A.J., del Jesus M. J., and Herrera F.
Conference Name: 11th International Conference on Hybrid Artificial Intelligent Systems, HAIS 2016
Date Published: 4
Conference Location: Seville (Spain)
ISBN Number: 978-3-319-32033-5

Multilabel classification (MLC) is an increasingly widespread data mining technique. Its goal is to categorize patterns into several non-exclusive groups, and it is applied in fields such as news categorization, image labeling and music classification. MLC is a more complex task than binary or multiclass classification, since the classifier must learn the presence of several outputs at once from the same set of predictive variables. The very nature of the data the classifier has to deal with implies a certain degree of complexity, and measuring this complexity strictly from the characteristics of the data would be an interesting objective. At the same time, the strategy used to partition the data also influences the sample patterns the algorithm has at its disposal to train the classifier. In MLC, random sampling is commonly used to accomplish this task. This paper introduces TCS (Theoretical Complexity Score), a new characterization metric aimed at assessing the intrinsic complexity of a multilabel dataset, as well as a novel stratified sampling method specifically designed to fit the traits of multilabeled data. A detailed description of both proposals is provided, along with empirical results on their suitability for their respective tasks.