A feature selection approach for automatic e-book classification based on discourse segmentation

Jiunn Liang Guo, Hei-Chia Wang, Ming Way Lai

Research output: Article

1 citation (Scopus)

Abstract

Purpose – The purpose of this paper is to develop a novel feature selection (FS) approach for automatic text classification of large digital documents, namely the e-books of an online library system. The main idea is to automatically identify discourse features in order to improve the feature selection process, rather than focussing on the size of the corpus. Design/methodology/approach – The proposed framework automatically identifies the discourse segments within e-books, captures the discourse subtopics that are cohesively expressed in those segments, and treats these subtopics as informative and prominent features. The selected set of features is then used to train a support vector machine and perform the e-book classification task. Findings – The evaluation of the proposed framework shows that identifying discourse segments and capturing subtopic features leads to better performance than two conventional feature selection techniques: TFIDF and mutual information. It also demonstrates that discourse features play an important role among textual features, especially for large documents such as e-books. Research limitations/implications – Automatically extracted subtopic features cannot be entered directly into the FS process; they require a controlled threshold. Practical implications – The proposed technique demonstrates the promise of using discourse analysis, as opposed to conventional techniques, to enhance the classification of large digital documents such as e-books. Originality/value – A new FS technique is proposed that can inspect the narrative structure of large documents; it is new to the text classification domain. A further contribution is that it encourages the consideration of discourse information in future text analysis by providing more evidence through the evaluation of the results. The proposed system can be integrated into other library management systems.
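The abstract names mutual information as one of the two conventional feature selection baselines. As a rough illustration only, the sketch below scores terms by the mutual information between term presence and class label and keeps the top-scoring terms. It is not the authors' implementation and does not reproduce their discourse-based method; the function names (`mutual_information`, `select_top_k`), the whitespace tokenization, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Score each term by the mutual information (in bits) between
    its presence in a document and the document's class label."""
    n = len(docs)
    # Assumption: naive lowercase whitespace tokenization.
    doc_sets = [set(d.lower().split()) for d in docs]
    vocab = set().union(*doc_sets)
    class_counts = Counter(labels)
    scores = {}
    for term in vocab:
        mi = 0.0
        for present in (True, False):
            # Number of documents where the term is present / absent.
            n_term = sum(1 for s in doc_sets if (term in s) == present)
            for cls, n_cls in class_counts.items():
                joint = sum(
                    1 for s, y in zip(doc_sets, labels)
                    if (term in s) == present and y == cls
                )
                if joint == 0:
                    continue  # a zero joint count contributes nothing
                mi += (joint / n) * math.log2(joint * n / (n_term * n_cls))
        scores[term] = mi
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "svm kernel margin",
    "svm margin classifier",
    "novel plot character",
    "novel character story",
]
labels = ["cs", "cs", "lit", "lit"]

scores = mutual_information(docs, labels)
selected = select_top_k(scores, 4)
```

In this toy corpus, terms that occur in every document of one class and in none of the other (such as `svm` or `novel`) receive the maximum score of one bit, while terms that appear in only a single document score lower. The selected terms would then serve as the feature set for a classifier such as a support vector machine.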

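Original language: English
Pages (from-to): 2-22
Number of pages: 21
Journal: Program
Volume: 49
Issue number: 1
DOI: 10.1108/PROG-12-2012-0071
Publication status: Published - 2 Feb 2015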

Fingerprint

Feature extraction
discourse
Support vector machines
text analysis
role play
segmentation
evaluation
discourse analysis
narrative
methodology
management
performance
evidence
Values

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Library and Information Sciences

Cite this

@article{044d8c094aa64f52ab9b52fd96cad7ec,
title = "A feature selection approach for automatic e-book classification based on discourse segmentation",
abstract = "Purpose – The purpose of this paper is to develop a novel feature selection approach for automatic text classification of large digital documents – e-books of online library system. The main idea mainly aims on automatically identifying the discourse features in order to improving the feature selection process rather than focussing on the size of the corpus. Design/methodology/approach – The proposed framework intends to automatically identify the discourse segments within e-books and capture proper discourse subtopics that are cohesively expressed in discourse segments and treating these subtopics as informative and prominent features. The selected set of features is then used to train and perform the e-book classification task based on the support vector machine technique. Findings – The evaluation of the proposed framework shows that identifying discourse segments and capturing subtopic features leads to better performance, in comparison with two conventional feature selection techniques: TFIDF and mutual information. It also demonstrates that discourse features play important roles among textual features, especially for large documents such as e-books. Research limitations/implications – Automatically extracted subtopic features cannot be directly entered into FS process but requires control of the threshold. Practical implications – The proposed technique has demonstrated the promised application of using discourse analysis to enhance the classification of large digital documents – e-books as against to conventional techniques. Originality/value – A new FS technique is proposed which can inspect the narrative structure of large documents and it is new to the text classification domain. The other contribution is that it inspires the consideration of discourse information in future text analysis, by providing more evidences through evaluation of the results. The proposed system can be integrated into other library management systems.",
author = "Guo, {Jiunn Liang} and Hei-Chia Wang and Lai, {Ming Way}",
year = "2015",
month = "2",
day = "2",
doi = "10.1108/PROG-12-2012-0071",
language = "English",
volume = "49",
pages = "2--22",
journal = "Data Technologies and Applications",
issn = "2514-9288",
publisher = "Emerald Group Publishing Ltd.",
number = "1",

}

A feature selection approach for automatic e-book classification based on discourse segmentation. / Guo, Jiunn Liang; Wang, Hei-Chia; Lai, Ming Way.

In: Program, Vol. 49, No. 1, 02.02.2015, p. 2-22.

TY - JOUR

T1 - A feature selection approach for automatic e-book classification based on discourse segmentation

AU - Guo, Jiunn Liang

AU - Wang, Hei-Chia

AU - Lai, Ming Way

PY - 2015/2/2

Y1 - 2015/2/2

AB - Purpose – The purpose of this paper is to develop a novel feature selection approach for automatic text classification of large digital documents – e-books of online library system. The main idea mainly aims on automatically identifying the discourse features in order to improving the feature selection process rather than focussing on the size of the corpus. Design/methodology/approach – The proposed framework intends to automatically identify the discourse segments within e-books and capture proper discourse subtopics that are cohesively expressed in discourse segments and treating these subtopics as informative and prominent features. The selected set of features is then used to train and perform the e-book classification task based on the support vector machine technique. Findings – The evaluation of the proposed framework shows that identifying discourse segments and capturing subtopic features leads to better performance, in comparison with two conventional feature selection techniques: TFIDF and mutual information. It also demonstrates that discourse features play important roles among textual features, especially for large documents such as e-books. Research limitations/implications – Automatically extracted subtopic features cannot be directly entered into FS process but requires control of the threshold. Practical implications – The proposed technique has demonstrated the promised application of using discourse analysis to enhance the classification of large digital documents – e-books as against to conventional techniques. Originality/value – A new FS technique is proposed which can inspect the narrative structure of large documents and it is new to the text classification domain. The other contribution is that it inspires the consideration of discourse information in future text analysis, by providing more evidences through evaluation of the results. The proposed system can be integrated into other library management systems.

UR - http://www.scopus.com/inward/record.url?scp=84921772680&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84921772680&partnerID=8YFLogxK

U2 - 10.1108/PROG-12-2012-0071

DO - 10.1108/PROG-12-2012-0071

M3 - Article

AN - SCOPUS:84921772680

VL - 49

SP - 2

EP - 22

JO - Data Technologies and Applications

JF - Data Technologies and Applications

SN - 2514-9288

IS - 1

ER -