Dealing with difficult minority labels in imbalanced multilabel data sets

Author	Francisco Charte Ojeda, Antonio Jesús Rivera Rivas, Maria José del Jesus Díaz, Francisco Herrera Triguero
Abstract	Multilabel classification is an emerging data mining task with a broad range of real-world applications. Learning from imbalanced multilabel data has been studied intensively in recent years, and several resampling methods have been proposed in the literature. The unequal label distribution in most multilabel datasets, with disparate imbalance levels, can hinder the learning of new classifiers. In addition, this characteristic challenges many of the existing preprocessing algorithms. Furthermore, the concurrence between imbalanced labels can make certain labels harder to learn. These are what we call difficult labels. In this work, the problem of difficult labels is analyzed in depth, its influence on multilabel classifiers is studied, and a novel way to address it is proposed. Specific metrics to assess this trait in multilabel datasets, called SCUMBLE (Score of ConcUrrence among iMBalanced LabEls) and SCUMBLELbl, are presented along with REMEDIAL (REsampling MultilabEl datasets by Decoupling highly ImbAlanced Labels), a new algorithm aimed at relaxing label concurrence. How to deal with this problem using the R mldr package is also outlined.
Year of Publication	2019
Journal	Neurocomputing
Volume	326
Pages	39-53
DOI	10.1016/j.neucom.2016.08.158
Notes	TIN2014-57251-P,TIN2015-68454-R,P11-TIC-7765
Bibliography media	Document 2019-NeucomDealingDifficultLabels.pdf
