Computational Linguistics

Quarterly (March, June, September, December)
160 pp. per issue
6 3/4 x 10
Founded: 1974
ISSN 0891-2017
E-ISSN 1530-9312
2008 ISI Impact Factor: 2.656

September 2006, Vol. 32, No. 3, Pages 295-340
Posted Online August 24, 2006.
(doi:10.1162/coli.2006.32.3.295)
© 2006 Massachusetts Institute of Technology
Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Christoph Ringlstetter*, Klaus U. Schulz*

CIS, University of Munich

Stoyan Mihov

Bulgarian Academy of Sciences, Sofia

*Funded by German Research Foundation (DFG)

Funded by VolkswagenStiftung

Because the Web is by far the largest public repository of natural language text, recent experiments, methods, and tools in corpus linguistics often use it as a corpus. Applications that depend on high accuracy must contend with the non-negligible number of orthographic and grammatical errors that occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, we develop methods for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
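
The abstract does not describe how erroneous pages are detected, so the following is only a rough illustration of the general idea and not the article's method. It assumes a simple lexicon lookup: tokens missing from a word list are marked as suspect, and a page is treated as acceptable when its share of suspect tokens stays below a threshold. The names `LEXICON`, `mark_errors`, `error_rate`, and `is_acceptable`, the toy word list, and the 0.2 threshold are all hypothetical choices made for this sketch.

```python
import re

# Hypothetical toy lexicon; a real system would load a large dictionary.
LEXICON = {"the", "web", "is", "a", "large", "corpus", "of", "texts"}

# Only alphabetic tokens are considered; digits and punctuation are ignored.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def mark_errors(text, lexicon=LEXICON):
    """Return (token, is_suspect) pairs; a token is suspect if not in the lexicon."""
    return [(tok, tok.lower() not in lexicon) for tok in TOKEN_RE.findall(text)]


def error_rate(text, lexicon=LEXICON):
    """Fraction of tokens marked as suspect (0.0 for text without tokens)."""
    marked = mark_errors(text, lexicon)
    if not marked:
        return 0.0
    return sum(1 for _, suspect in marked if suspect) / len(marked)


def is_acceptable(text, max_rate=0.2, lexicon=LEXICON):
    """Assumed page filter: accept the page if its error rate is at most max_rate."""
    return error_rate(text, lexicon) <= max_rate


if __name__ == "__main__":
    clean = "The Web is a large corpus of texts"
    noisy = "The Web is a lrage corpsu of texts"
    print(is_acceptable(clean), error_rate(clean))
    print(is_acceptable(noisy), error_rate(noisy))
    print(mark_errors("a lrage corpus"))
```

A plain lexicon lookup flags every unknown word, including names and rare but correct terms, so a practical filter would need a far larger dictionary and some way to tell genuine errors from out-of-vocabulary words.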