TY - JOUR
T1 - Annotation and verification of sense pools in OntoNotes
AU - Yu, Liang Chih
AU - Wu, Chung Hsien
AU - Chang, Ru Yng
AU - Liu, Chao Hong
AU - Hovy, Eduard
N1 - Funding Information:
This work was supported by the National Science Council, Taiwan, ROC, under Grant No. NSC 97-2218-E-155-011. The authors would like to thank the anonymous reviewers and the guest editors for their constructive comments.
PY - 2010/7
Y1 - 2010/7
N2 - This paper describes OntoNotes, a multilingual (English, Chinese, and Arabic) corpus with large-scale semantic annotations, including predicate-argument structure, word senses, ontology linking, and coreference. The underlying semantic model of OntoNotes groups word senses into so-called sense pools, i.e., sets of near-synonymous senses of words. Such information is useful for many applications, including query expansion for information retrieval (IR) systems, (near-)duplicate detection for text summarization systems, and alternative word selection for writing support systems. Although a sense pool provides a set of near-synonymous word senses, it does not indicate whether two words in the pool are interchangeable in practical use. Therefore, this paper devises an unsupervised algorithm that combines Google n-grams with a statistical test to determine whether a word in a pool can be substituted by other words in the same pool. The n-gram features are used to measure the degree of context mismatch for a substitution. The statistical test is then applied to determine, based on the degree of mismatch, whether the substitution is adequate. The proposed method is compared with a supervised method, namely Linear Discriminant Analysis (LDA). Experimental results show that the proposed unsupervised method achieves performance comparable to that of the supervised method.
AB - This paper describes OntoNotes, a multilingual (English, Chinese, and Arabic) corpus with large-scale semantic annotations, including predicate-argument structure, word senses, ontology linking, and coreference. The underlying semantic model of OntoNotes groups word senses into so-called sense pools, i.e., sets of near-synonymous senses of words. Such information is useful for many applications, including query expansion for information retrieval (IR) systems, (near-)duplicate detection for text summarization systems, and alternative word selection for writing support systems. Although a sense pool provides a set of near-synonymous word senses, it does not indicate whether two words in the pool are interchangeable in practical use. Therefore, this paper devises an unsupervised algorithm that combines Google n-grams with a statistical test to determine whether a word in a pool can be substituted by other words in the same pool. The n-gram features are used to measure the degree of context mismatch for a substitution. The statistical test is then applied to determine, based on the degree of mismatch, whether the substitution is adequate. The proposed method is compared with a supervised method, namely Linear Discriminant Analysis (LDA). Experimental results show that the proposed unsupervised method achieves performance comparable to that of the supervised method.
UR - http://www.scopus.com/inward/record.url?scp=77955231984&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77955231984&partnerID=8YFLogxK
U2 - 10.1016/j.ipm.2009.11.002
DO - 10.1016/j.ipm.2009.11.002
M3 - Article
AN - SCOPUS:77955231984
SN - 0306-4573
VL - 46
SP - 436
EP - 447
JO - Information Processing and Management
JF - Information Processing and Management
IS - 4
ER -