TY - JOUR
T1 - Word co-occurrence augmented topic model in short text
AU - Chen, Guan Bin
AU - Kao, Hung Yu
N1 - Publisher Copyright: © 2017 IOS Press and the authors. All rights reserved.
PY - 2017
Y1 - 2017
AB - The large amount of text on the Internet makes it hard for people to grasp its meaning in a limited time. Topic models (e.g., LDA and PLSA) have been proposed to summarize long texts into several topic terms. In recent years, short-text media such as Twitter have become very popular. However, directly applying a traditional topic model to a short-text corpus usually yields incoherent topics, because a short document does not contain enough words to reveal word co-occurrence patterns. In this paper, we address the lack of local word co-occurrence in LDA by proposing an improved word co-occurrence method to enhance topic models. We generate new virtual documents by re-organizing the words in the original documents and use them to enhance traditional LDA. The experimental results show that our re-organized LDA (RO-LDA) method achieves better results on both the noisy tweet dataset and the regular news dataset. Moreover, our augmented model does not need any external data. Because our methods are based only on the original topic model, they can easily be applied to other existing LDA-based models.
UR - http://www.scopus.com/inward/record.url?scp=85017340982&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85017340982&partnerID=8YFLogxK
U2 - 10.3233/IDA-170872
DO - 10.3233/IDA-170872
M3 - Article
AN - SCOPUS:85017340982
SN - 1088-467X
VL - 21
SP - S55-S70
JO - Intelligent Data Analysis
JF - Intelligent Data Analysis
IS - S1
ER -