TY - JOUR
T1 - LiDA
T2 - Language-Independent Data Augmentation for Text Classification
AU - Sujana, Yudianto
AU - Kao, Hung Yu
N1 - Funding Information:
This work was supported by the National Science and Technology Council (NSTC), Taiwan, under Grant MOST 111-2221-E-006-001.
Publisher Copyright:
© 2013 IEEE.
PY - 2023
Y1 - 2023
AB - Developing a high-performance text classification model in a low-resource language is challenging due to the lack of labeled data, and collecting large amounts of labeled data is costly. One approach to increasing the amount of labeled data is to create synthetic data using data augmentation techniques. However, most available data augmentation techniques work on English data and are highly language-dependent because they operate at the word or sentence level, for example by replacing some words or paraphrasing a sentence. We present Language-independent Data Augmentation (LiDA), a technique that utilizes a multilingual language model to create synthetic data from the available training dataset. Unlike other methods, our approach works at the sentence-embedding level, independent of any particular language. We evaluated LiDA in three languages on various fractions of the dataset, and the results showed improved performance in both the LSTM and BERT models. Furthermore, we conducted an ablation study to determine the impact of the components of our method on overall performance. The source code of LiDA is available at https://github.com/yest/LiDA.
UR - http://www.scopus.com/inward/record.url?scp=85147220658&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147220658&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3234019
DO - 10.1109/ACCESS.2023.3234019
M3 - Article
AN - SCOPUS:85147220658
SN - 2169-3536
VL - 11
SP - 10894
EP - 10901
JO - IEEE Access
JF - IEEE Access
ER -