OntoNotes: Corpus cleanup of mistaken agreement using word sense disambiguation

Liang Chih Yu, Chung-Hsien Wu, Eduard Hovy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Annotated corpora are only useful if their annotations are consistent. Most large-scale annotation efforts take special measures to reconcile inter-annotator disagreement. To date, however, no one has investigated how to automatically determine exemplars in which the annotators agree but are wrong. In this paper, we use OntoNotes, a large-scale corpus of semantic annotations, including word senses, predicate-argument structure, ontology linking, and coreference. To determine the mistaken agreements in word sense annotation, we employ word sense disambiguation (WSD) to select a set of suspicious candidates for human evaluation. Experiments are conducted from three aspects (precision, cost-effectiveness ratio, and entropy) to examine the performance of WSD. The experimental results show that WSD is most effective at identifying erroneous annotations for highly ambiguous words, while a baseline is better for other cases. The two methods can be combined to improve the cleanup process. This procedure allows us to find approximately 2% remaining erroneous agreements in the OntoNotes corpus. A similar procedure can be easily defined to check other annotated corpora.
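The cleanup idea described in the abstract — flag instances where an automatic WSD model disagrees with the sense both annotators agreed on, and send those to human review — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the `Instance` record, the confidence threshold, and the sense labels are all assumptions made for the example.

```python
# Illustrative sketch: select "suspicious" agreed annotations for human review
# by comparing them against an automatic WSD prediction. Instances where the
# model confidently contradicts the agreed sense become cleanup candidates.
from dataclasses import dataclass


@dataclass
class Instance:
    word: str             # target lemma
    agreed_sense: str     # sense both annotators chose
    predicted_sense: str  # sense assigned by the WSD model
    confidence: float     # model's confidence in its prediction (0..1)


def suspicious_candidates(instances, min_confidence=0.8):
    """Return instances where the WSD model confidently disagrees with
    the agreed annotation, ranked most-confident first."""
    flagged = [x for x in instances
               if x.predicted_sense != x.agreed_sense
               and x.confidence >= min_confidence]
    return sorted(flagged, key=lambda x: x.confidence, reverse=True)


if __name__ == "__main__":
    data = [
        Instance("bank", "bank.n.1", "bank.n.1", 0.95),  # model agrees: keep
        Instance("bank", "bank.n.1", "bank.n.2", 0.91),  # confident disagreement: flag
        Instance("cold", "cold.a.1", "cold.a.2", 0.55),  # low confidence: ignore
    ]
    for cand in suspicious_candidates(data):
        print(cand.word, cand.agreed_sense, "->", cand.predicted_sense)
```

In the paper's terms, the flagged set corresponds to the candidates submitted for human evaluation; the abstract notes that a baseline works better than WSD for words of low ambiguity, so in practice the two selection methods would be combined.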

Original language: English
Title of host publication: Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Pages: 1057-1064
Number of pages: 8
Publication status: Published - 2008 Dec 1
Event: 22nd International Conference on Computational Linguistics, Coling 2008 - Manchester, United Kingdom
Duration: 2008 Aug 18 - 2008 Aug 22

Publication series

Name: Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Volume: 1

Other

Other: 22nd International Conference on Computational Linguistics, Coling 2008
Country: United Kingdom
City: Manchester
Period: 08-08-18 - 08-08-22


All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language

Cite this

Yu, L. C., Wu, C-H., & Hovy, E. (2008). OntoNotes: Corpus cleanup of mistaken agreement using word sense disambiguation. In Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference (pp. 1057-1064). (Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference; Vol. 1).
@inproceedings{15fdd585fd1b4a99886615b31e29f5fc,
title = "OntoNotes: Corpus cleanup of mistaken agreement using word sense disambiguation",
abstract = "Annotated corpora are only useful if their annotations are consistent. Most large-scale annotation efforts take special measures to reconcile inter-annotator disagreement. To date, however, no-one has investigated how to automatically determine exemplars in which the annotators agree but are wrong. In this paper, we use OntoNotes, a large-scale corpus of semantic annotations, including word senses, predicate-argument structure, ontology linking, and coreference. To determine the mistaken agreements in word sense annotation, we employ word sense disambiguation (WSD) to select a set of suspicious candidates for human evaluation. Experiments are conducted from three aspects (precision, cost-effectiveness ratio, and entropy) to examine the performance of WSD. The experimental results show that WSD is most effective on identifying erroneous annotations for highly-ambiguous words, while a baseline is better for other cases. The two methods can be combined to improve the cleanup process. This procedure allows us to find approximately 2{\%} remaining erroneous agreements in the OntoNotes corpus. A similar procedure can be easily defined to check other annotated corpora.",
author = "Yu, {Liang Chih} and Chung-Hsien Wu and Eduard Hovy",
year = "2008",
month = "12",
day = "1",
language = "English",
isbn = "9781905593446",
series = "Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference",
pages = "1057--1064",
booktitle = "Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference",

}


