LDA based semi-supervised learning from streaming short text

Ji De Chen, Hung-Yu Kao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

With the rapid growth of real-time social media such as Twitter, many users share and discuss topics of interest on these platforms. A hashtag is a metadata tag that lets users annotate the topics of their tweets. In research, hashtags can, for example, improve event detection by revealing trends in hashtag usage. Although Twitter has grown rapidly, hashtag usage has not grown as expected: in our dataset, fewer than 20% of tweets contain hashtags. We believe this is because most users do not know which hashtags suit the tweets they post. Recommending suitable hashtags to users is one way to address this low usage rate. Hashtag recommendation is a supervised learning problem, and more labeled training data can yield better prediction performance. However, labeled data for hashtag recommendation is scarce because hashtags are used so rarely. We therefore exploit unlabeled data, i.e., tweets without hashtags, to address this problem. Although unlabeled data is plentiful, directly adding all non-hashtag tweets may not help train the model. To overcome this issue, we apply weight-updating mechanisms that filter out the unhelpful non-hashtag tweets. Because of Twitter's real-time nature, these mechanisms must also account for the temporal characteristics of hashtags. Experimental results show that extending the original training data with non-hashtag tweets outperforms baseline methods that train the model on labeled data alone.

Original language: English
Title of host publication: Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015
Editors: Gabriella Pasi, James Kwok, Osmar Zaiane, Patrick Gallinari, Eric Gaussier, Longbing Cao
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781467382731
DOIs: https://doi.org/10.1109/DSAA.2015.7344830
Publication status: Published - 2015 Dec 2
Event: IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015 - Paris, France
Duration: 2015 Oct 19 to 2015 Oct 21

Publication series

Name: Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015

Other

Other: IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015
Country: France
City: Paris
Period: 15-10-19 to 15-10-21

Fingerprint

Supervised learning
Metadata
Semi-supervised learning
Twitter
Train

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Information Systems and Management
  • Information Systems

Cite this

Chen, J. D., & Kao, H-Y. (2015). LDA based semi-supervised learning from streaming short text. In G. Pasi, J. Kwok, O. Zaiane, P. Gallinari, E. Gaussier, & L. Cao (Eds.), Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015 [7344830] (Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DSAA.2015.7344830
@inproceedings{729d669b62884b4ca95914de66c61f27,
title = "LDA based semi-supervised learning from streaming short text",
author = "Chen, {Ji De} and Kao, Hung-Yu",
year = "2015",
month = "12",
day = "2",
doi = "10.1109/DSAA.2015.7344830",
language = "English",
series = "Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
editor = "Gabriella Pasi and James Kwok and Osmar Zaiane and Patrick Gallinari and Eric Gaussier and Longbing Cao",
booktitle = "Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015",
address = "United States",
}



TY  - GEN
T1  - LDA based semi-supervised learning from streaming short text
AU  - Chen, Ji De
AU  - Kao, Hung-Yu
PY  - 2015/12/2
Y1  - 2015/12/2
UR  - http://www.scopus.com/inward/record.url?scp=84962821837&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84962821837&partnerID=8YFLogxK
U2  - 10.1109/DSAA.2015.7344830
DO  - 10.1109/DSAA.2015.7344830
M3  - Conference contribution
AN  - SCOPUS:84962821837
T3  - Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015
BT  - Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015
A2  - Pasi, Gabriella
A2  - Kwok, James
A2  - Zaiane, Osmar
A2  - Gallinari, Patrick
A2  - Gaussier, Eric
A2  - Cao, Longbing
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  -
