The large amount of text on the Internet makes it hard for people to grasp its meaning in a limited time. Topic models (e.g., LDA and PLSA) have been proposed to summarize long texts into a small set of topic terms. In recent years, short-text media such as Twitter have become very popular. However, directly applying a traditional topic model to a short-text corpus usually yields incoherent topics, because a short document does not contain enough words to reveal word co-occurrence patterns. In this paper, we address the lack of local word co-occurrence in LDA by proposing an improved word co-occurrence method to enhance topic models. We generate new virtual documents by re-organizing the words in the original documents and use them to enhance traditional LDA. The experimental results show that our re-organized LDA (RO-LDA) method achieves better results on both a noisy tweet dataset and a regular news dataset. Moreover, our proposed augmented model does not need any external data. Because our methods rely only on the original topic model, they can easily be applied to other existing LDA-based models.
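The abstract does not specify how the words are re-organized, so the following is only an illustrative sketch of the general idea, not the authors' RO-LDA procedure. It assumes one simple pooling scheme: for each word that appears in at least `min_docs` short documents, a virtual document is formed from the other words of every document containing it. The function name `build_virtual_documents`, the `min_docs` parameter, and the toy corpus are all hypothetical.

```python
from collections import defaultdict


def build_virtual_documents(docs, min_docs=2):
    """Pool co-occurring words into one virtual document per anchor word.

    Assumed scheme (not taken from the paper): the virtual document for a
    word is the concatenation of the other tokens of every short document
    that contains that word.
    """
    pooled = defaultdict(list)  # anchor word -> pooled context tokens
    doc_count = defaultdict(int)  # anchor word -> number of documents containing it
    for tokens in docs:
        for word in set(tokens):
            doc_count[word] += 1
            # Add this document's tokens, excluding the anchor word itself.
            pooled[word].extend(t for t in tokens if t != word)
    # Keep only anchors seen in enough documents to carry co-occurrence signal.
    return {w: ctx for w, ctx in pooled.items() if doc_count[w] >= min_docs}


# Toy corpus of tokenized short documents (hypothetical data).
short_docs = [
    ["topic", "model", "tweet"],
    ["tweet", "short", "text"],
    ["topic", "model", "news"],
]
virtual_docs = build_virtual_documents(short_docs)
```

Under this assumption, the virtual documents could be appended to the original corpus and passed to any standard LDA implementation, which is consistent with the abstract's claim that no external data is required.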
All Science Journal Classification (ASJC) codes
- Theoretical Computer Science
- Computer Vision and Pattern Recognition
- Artificial Intelligence