Word Co-occurrence Augmented Topic Model in Short Text

  • 陳 冠斌

Student thesis: Master's Thesis

Abstract

The large amount of text on the Internet makes it hard for people to grasp its meaning in a limited amount of time. Topic models (e.g., LDA and PLSA) have been proposed to summarize long texts into several topic terms. In recent years, short-text media such as tweets have become very popular. However, directly applying a traditional topic model to a short-text corpus usually yields incoherent topics, because a short document does not contain enough words to reveal word co-occurrence patterns. The biterm topic model (BTM) has been proposed to alleviate this problem. However, BTM considers only simple biterm frequency, which causes the generated topics to be dominated by common words. In this paper, we address both the lack of local word co-occurrence in LDA and the problem of frequent biterms in BTM. We propose two word co-occurrence methods to enhance topic models. First, we apply word co-occurrence information to BTM. Second, we generate new virtual documents by reorganizing the words in the documents and then apply traditional LDA to them. The experimental results show that our RO-LDA method performs well on the noisy tweet dataset and that PMI-β-BTM performs well on regular short news titles. Moreover, our methods have two advantages: they do not need any external data, and they build on the original topic models without modifying the models themselves, so they can easily be applied to other existing LDA- or BTM-based models.
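
To make the two ingredients named in the abstract more concrete, the sketch below shows how biterms (unordered word pairs within one short document) can be extracted, and how a co-occurrence score can be computed to down-weight pairs of merely common words. It is only an illustration under stated assumptions: the name PMI-β-BTM suggests pointwise mutual information, but the abstract does not give the formula, so the document-level PMI used here, along with the helper names `extract_biterms` and `pmi_table`, is hypothetical and not the thesis's actual method.

```python
import math
from collections import Counter
from itertools import combinations


def extract_biterms(doc):
    """Return every unordered word pair (biterm) in one short document."""
    return [tuple(sorted(pair)) for pair in combinations(doc, 2)]


def pmi_table(docs):
    """Document-level PMI for each word pair that co-occurs at least once.

    Assumption: probabilities are estimated from document frequencies;
    the thesis may estimate or smooth them differently.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing each word pair
    for doc in docs:
        vocab = sorted(set(doc))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return {
        (a, b): math.log(
            (count / n_docs) / ((word_df[a] / n_docs) * (word_df[b] / n_docs))
        )
        for (a, b), count in pair_df.items()
    }


# Toy corpus of tokenized short texts (made up for illustration).
docs = [
    ["apple", "iphone"],
    ["apple", "iphone"],
    ["apple", "pie"],
    ["the", "pie"],
]
pmi = pmi_table(docs)
```

A pair whose words appear together more often than their individual frequencies would predict gets a positive score, while a pair that co-occurs less often than chance gets a negative one. A score of this kind could be used to weight biterms instead of relying on raw frequency alone.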
Date of Award: 2015 Aug 18
Original language: English
Supervisor: Hung-Yu Kao (Supervisor)
