TY - JOUR
T1 - SUSTEM
T2 - An Improved Rule-based Sundanese Stemmer
AU - Setiawan, Irwan
AU - Kao, Hung Yu
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/6/21
Y1 - 2024/6/21
N2 - Current Sundanese stemmers either ignore reduplicated words or define rules that handle only affixes. Sundanese contains a significant number of reduplicated words, so high stemming precision cannot be achieved without addressing them. This article presents an improved stemmer for the Sundanese language that handles both affixed and reduplicated words. We use a rule-based stemming technique together with a Sundanese root word list. In our approach, all stems produced by the affix removal or normalization processes are added to a stem list. Using a stem list can help increase stemmer accuracy by reducing stemming errors caused by an incorrect affix removal sequence or by morphological issues. The existing Sundanese stemmer, RBSS, was used for comparison. Two datasets with 8,218 unique affixed words and reduplicated words were evaluated. The results show that our stemmer’s strength and accuracy improved noticeably. The stem list and the word reduplication rules improved our stemmer’s recognition of affix types and allowed us to achieve up to 99.30% accuracy.
AB - Current Sundanese stemmers either ignore reduplicated words or define rules that handle only affixes. Sundanese contains a significant number of reduplicated words, so high stemming precision cannot be achieved without addressing them. This article presents an improved stemmer for the Sundanese language that handles both affixed and reduplicated words. We use a rule-based stemming technique together with a Sundanese root word list. In our approach, all stems produced by the affix removal or normalization processes are added to a stem list. Using a stem list can help increase stemmer accuracy by reducing stemming errors caused by an incorrect affix removal sequence or by morphological issues. The existing Sundanese stemmer, RBSS, was used for comparison. Two datasets with 8,218 unique affixed words and reduplicated words were evaluated. The results show that our stemmer’s strength and accuracy improved noticeably. The stem list and the word reduplication rules improved our stemmer’s recognition of affix types and allowed us to achieve up to 99.30% accuracy.
UR - http://www.scopus.com/inward/record.url?scp=85197404532&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85197404532&partnerID=8YFLogxK
U2 - 10.1145/3656342
DO - 10.1145/3656342
M3 - Article
AN - SCOPUS:85197404532
SN - 2375-4699
VL - 23
JO - ACM Transactions on Asian and Low-Resource Language Information Processing
JF - ACM Transactions on Asian and Low-Resource Language Information Processing
IS - 6
M1 - 77
ER -