TY - JOUR
T1 - Spam detection and high-quality features to analyse question–answer pairs
AU - Wang, Hei Chia
AU - Chiang, Yu Hung
AU - Lin, Si Ting
N1 - Funding Information:
The research is based on work supported by Taiwan Ministry of Science and Technology under Grant No. MOST 107-2410-H-006 040-MY3 and 108-2511-H-006-009. We would like to thank the Center of Innovative Fintech Business Models for a research grant to support this research. Thank you for your assistance.
Publisher Copyright:
© 2020, Emerald Publishing Limited.
PY - 2020/12/12
Y1 - 2020/12/12
N2 - Purpose: In community question and answer (CQA) services, because of user subjectivity and the limits of knowledge, the distribution of answer quality can vary drastically – from highly related to irrelevant or even spam answers. Previous studies of CQA portals have faced two important issues: answer quality analysis and spam answer filtering. Therefore, the purposes of this study are to filter spam answers in advance using two-phase identification methods and then automatically classify the different types of question and answer (QA) pairs by deep learning. Finally, this study proposes a comprehensive study of answer quality prediction for different types of QA pairs. Design/methodology/approach: This study proposes an integrated model with a two-phase identification method that filters spam answers in advance and uses a deep learning method [recurrent convolutional neural network (R-CNN)] to automatically classify various types of questions. Logistic regression (LR) is further applied to examine which answer quality features significantly indicate high-quality answers to different types of questions. Findings: There are four prominent findings. (1) This study confirms that conducting spam filtering before an answer quality analysis can reduce the proportion of high-quality answers that are misjudged as spam answers. (2) The experimental results show that answer quality is better when question types are included. (3) The analysis results for different classifiers show that the R-CNN achieves the best macro-F1 scores (74.8%) in the question type classification module. (4) Finally, the experimental results by LR show that author ranking, answer length and common words could significantly impact answer quality for different types of questions. Originality/value: The proposed system is simultaneously able to detect spam answers and provide users with quick and efficient retrieval mechanisms for high-quality answers to different types of questions in CQA. Moreover, this study further validates that crucial features exist among the different types of questions that can impact answer quality. Overall, an identification system automatically summarises high-quality answers for each different type of questions from the pool of messy answers in CQA, which can be very useful in helping users make decisions.
UR - http://www.scopus.com/inward/record.url?scp=85096587617&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096587617&partnerID=8YFLogxK
U2 - 10.1108/EL-05-2020-0120
DO - 10.1108/EL-05-2020-0120
M3 - Article
AN - SCOPUS:85096587617
SN - 0264-0473
VL - 38
SP - 1013
EP - 1033
JO - Electronic Library
JF - Electronic Library
IS - 5-6
ER -