SUSTEM: An Improved Rule-based Sundanese Stemmer

Irwan Setiawan, Hung Yu Kao

Research output: Article, peer-reviewed

1 citation (Scopus)


Current Sundanese stemmers either ignore reduplicated words or define rules that handle only affixes. Reduplicated words are common in Sundanese, so superior stemming precision cannot be achieved without addressing them. This article presents an improved stemmer for the Sundanese language that handles both affixed and reduplicated words. We use a rule-based stemming technique supported by a Sundanese root word list. In our approach, every stem produced by the affix removal or normalization processes is added to a stem list. Using a stem list can increase stemmer accuracy by reducing stemming errors caused by an incorrect affix removal sequence or by morphological issues. The current Sundanese stemmer, RBSS, was used for comparison. The evaluation used two datasets containing 8,218 unique affixed and reduplicated words. The results show that our stemmer's strength and accuracy improved noticeably. The stem list and the word reduplication rules improved our stemmer's affix-type recognition and allowed us to achieve up to 99.30% accuracy.
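The stem-list idea described in the abstract can be illustrated with a minimal sketch. It is not the SUSTEM implementation: the root list, the affix tables, the minimum stem length, and the function names `candidate_stems` and `stem` are all assumptions made for illustration, and only full reduplication written with a hyphen is handled. The point is that every intermediate stem is kept as a candidate, so a wrong affix removal order does not discard the correct root.

```python
# Toy root word list (assumption; the paper uses a full Sundanese root list).
ROOTS = {"dahar", "budak", "imah"}

# Small illustrative affix tables (assumption; not the paper's rule set).
PREFIXES = ("di", "ka", "pa", "sa", "ti")
SUFFIXES = ("keun", "eun", "an", "na")
MIN_STEM_LEN = 3  # assumed guard against over-stemming


def candidate_stems(word):
    """Collect every stem produced by normalization or affix removal."""
    stems = [word]

    # Normalize full reduplication, e.g. "budak-budak" -> "budak".
    if "-" in word:
        left, right = word.split("-", 1)
        if left == right:
            stems.append(left)

    # Strip affixes from every stem gathered so far, keeping all results.
    i = 0
    while i < len(stems):
        current = stems[i]
        for prefix in PREFIXES:
            rest = current[len(prefix):]
            if current.startswith(prefix) and len(rest) >= MIN_STEM_LEN:
                if rest not in stems:
                    stems.append(rest)
        for suffix in SUFFIXES:
            rest = current[:-len(suffix)]
            if current.endswith(suffix) and len(rest) >= MIN_STEM_LEN:
                if rest not in stems:
                    stems.append(rest)
        i += 1
    return stems


def stem(word):
    """Return the first candidate found in the root list, else the word itself."""
    word = word.lower()
    for candidate in candidate_stems(word):
        if candidate in ROOTS:
            return candidate
    return word
```

With this toy data, `stem("didahar")` and `stem("dahareun")` both resolve to `dahar`, and `stem("budak-budak")` resolves to `budak`; a word with no candidate in the root list is returned unchanged.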

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication status: Published - Jun 21 2024

All Science Journal Classification (ASJC) codes

  • General Computer Science
