OntoNotes: Corpus cleanup of mistaken agreement using word sense disambiguation

Liang Chih Yu, Chung Hsien Wu, Eduard Hovy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Annotated corpora are only useful if their annotations are consistent. Most large-scale annotation efforts take special measures to reconcile inter-annotator disagreement. To date, however, no one has investigated how to automatically identify exemplars on which the annotators agree but are wrong. In this paper, we use OntoNotes, a large-scale corpus of semantic annotations, including word senses, predicate-argument structure, ontology linking, and coreference. To find the mistaken agreements in word sense annotation, we employ word sense disambiguation (WSD) to select a set of suspicious candidates for human evaluation. Experiments examine the performance of WSD from three aspects: precision, cost-effectiveness ratio, and entropy. The experimental results show that WSD is most effective at identifying erroneous annotations for highly ambiguous words, while a baseline is better for other cases. The two methods can be combined to improve the cleanup process. This procedure allows us to find approximately 2% remaining erroneous agreements in the OntoNotes corpus. A similar procedure can be easily defined to check other annotated corpora.
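
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that a "suspicious candidate" is an instance whose agreed annotation differs from the sense predicted by a WSD system, and that ambiguity is measured as the Shannon entropy of a word's sense distribution. The abstract does not spell out either definition, so both are assumptions. The names `sense_entropy`, `flag_suspicious`, `precision`, and `toy_predict`, as well as the toy data, are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def sense_entropy(senses):
    """Shannon entropy (in bits) of a list of sense labels for one word.

    Assumption: entropy over the annotated sense distribution is used
    as the ambiguity measure; the abstract only names "entropy".
    """
    counts = Counter(senses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def flag_suspicious(instances, predict):
    """Return the instances whose agreed sense differs from the WSD prediction.

    `predict` is any callable mapping a context string to a sense label.
    """
    return [inst for inst in instances if predict(inst["context"]) != inst["sense"]]


def precision(flagged, is_error):
    """Fraction of flagged candidates that a human reviewer confirms as errors."""
    if not flagged:
        return 0.0
    return sum(1 for inst in flagged if is_error(inst)) / len(flagged)


def toy_predict(context):
    """Hypothetical stand-in for a trained WSD classifier."""
    return "bank.2" if "river" in context else "bank.1"


# Toy annotated instances (invented for illustration only).
instances = [
    {"id": 1, "context": "sat on the river bank", "sense": "bank.2"},
    {"id": 2, "context": "deposited cash at the bank", "sense": "bank.1"},
    {"id": 3, "context": "fished from the river bank", "sense": "bank.1"},
]

flagged = flag_suspicious(instances, toy_predict)
ambiguity = sense_entropy([inst["sense"] for inst in instances])
```

In this sketch only the flagged instances would go to a human reviewer, and `precision` would then be computed from the reviewer's verdicts. A per-word entropy threshold could decide whether to rely on the WSD system or on a baseline, in line with the finding that WSD helps most for highly ambiguous words. The threshold itself would have to be chosen empirically.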

Original language: English
Title of host publication: Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Pages: 1057-1064
Number of pages: 8
Publication status: Published - 2008
Event: 22nd International Conference on Computational Linguistics, Coling 2008 - Manchester, United Kingdom
Duration: 2008 Aug 18 to 2008 Aug 22

Publication series

Name: Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Volume: 1

Other

Other: 22nd International Conference on Computational Linguistics, Coling 2008
Country/Territory: United Kingdom
City: Manchester
Period: 08-08-18 to 08-08-22

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language
