Learning class-imbalanced data with region-impurity synthetic minority oversampling technique

Der Chiang Li, Ssu Yang Wang, Kuan Cheng Huang, Tung I. Tsai

Research output: Contribution to journal, Article, peer-reviewed

20 Citations (Scopus)

Abstract

Learning from class-imbalanced data is a difficult task, and classifiers often fail to identify the minority class. To balance the class ratio, the synthetic minority oversampling technique (SMOTE) improves minority-class classification by generating synthetic minority instances. However, in some scenarios, SMOTE and its extensions generate noisy instances and thus degrade performance. This is because they are based on k nearest neighbors (kNN), which cannot identify the class distribution between a pair of minority instances. Furthermore, the number of synthetic instances to generate remains an open question in this field. To address these issues, we propose a new algorithm named Region-Impurity Synthetic Minority Oversampling Technique (RIOT). Specifically, using a region radius, we locate the neighbors of each minority instance, identify the relatively hard-to-learn minority instances from the class ratio within the region, and select them as the base for sample generation. Synthetic instances are then generated until the region is approximately balanced. In experiments on twelve real-world datasets, RIOT performed better than some SMOTE extensions on several model performance indicators while generating fewer synthetic instances.
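
The abstract gives only an outline of the procedure, so the following Python sketch is an illustration of the general idea and not the published RIOT algorithm. Everything specific in it is an assumption made for the example: the function name `region_oversample`, the rule that a minority instance is "hard to learn" when majority instances outnumber minority instances inside its region, SMOTE-style interpolation toward minority neighbors in the region, and a small random offset when no such neighbor exists.

```python
import math
import random


def region_oversample(X, y, radius, minority=1, seed=0):
    """Illustrative region-based oversampling (hypothetical, not the paper's RIOT).

    X is a list of feature lists and y a list of labels.
    Returns the list of synthetic minority points that were generated.
    """
    rng = random.Random(seed)
    synthetic = []
    minority_points = [x for x, label in zip(X, y) if label == minority]

    for base in minority_points:
        # Region: all original instances within `radius` of the base (base included).
        # Synthetic points created for other bases are not counted here.
        in_region = [(x, label) for x, label in zip(X, y)
                     if math.dist(base, x) <= radius]
        n_min = sum(1 for _, label in in_region if label == minority)
        n_maj = len(in_region) - n_min

        # Assumed hardness rule: skip regions where the minority is not outnumbered.
        if n_maj <= n_min:
            continue

        # Minority neighbors in the region, used as interpolation partners.
        partners = [x for x, label in in_region
                    if label == minority and x is not base]

        # Generate just enough points to balance the class counts in the region.
        for _ in range(n_maj - n_min):
            if partners:
                other = rng.choice(partners)
                t = rng.random()
                new = [b + t * (o - b) for b, o in zip(base, other)]
            else:
                # No minority neighbor: offset the base, staying inside the region.
                step = radius / math.sqrt(len(base))
                new = [b + rng.uniform(-step, step) for b in base]
            synthetic.append(new)

    return synthetic


# Toy data: one minority point surrounded by majority points, plus a clean minority pair.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
     [0.05, 0.05], [5.0, 5.0], [5.1, 5.0]]
y = [0, 0, 0, 0, 1, 1, 1]
new_points = region_oversample(X, y, radius=0.5)
```

In this toy data, only the minority point at `[0.05, 0.05]` sits in a majority-dominated region, so synthetic points are generated only around it. The isolated minority pair near `[5, 5]` is left alone, which reflects the abstract's point that the number of synthetic instances follows from the local class ratio instead of a global oversampling rate.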

Original language: English
Pages (from-to): 1391-1407
Number of pages: 17
Journal: Information Sciences
Volume: 607
Publication status: Published - 2022 Aug

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems and Management
  • Artificial Intelligence
  • Theoretical Computer Science
  • Control and Systems Engineering
  • Computer Science Applications
