Computational Linguistics

Quarterly (March, June, September, December)
160 pp. per issue
6 3/4 x 10
Founded: 1974
ISSN 0891-2017
E-ISSN 1530-9312
2008 ISI Impact Factor: 2.656

September 2005, Vol. 31, No. 3, Pages 329-366
Posted Online March 13, 2006.
(doi:10.1162/089120105774321073)
© 2005 Association for Computational Linguistics
Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks

Ruth O'Donovan

Dublin City University

Michael Burke

Dublin City University

Aoife Cahill

Dublin City University

Josef van Genabith

Dublin City University

Andy Way

Dublin City University

National Centre for Language Technology, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland.

Centre for Advanced Studies, IBM, Dublin, Ireland.

We present a methodology for extracting subcategorization frames based on an automatic lexical-functional grammar (LFG) f-structure annotation algorithm for the Penn-II and Penn-III Treebanks. We extract syntactic-function-based subcategorization frames (LFG semantic forms) and traditional CFG category-based subcategorization frames as well as mixed function/category-based frames, with or without preposition information for obliques and particle information for particle verbs. Our approach associates probabilities with frames conditional on the lemma, distinguishes between active and passive frames, and fully reflects the effects of long-distance dependencies in the source data structures. In contrast to many other approaches, ours does not predefine the subcategorization frame types extracted, learning them instead from the source data. Including particles and prepositions, we extract 21,005 lemma frame types for 4,362 verb lemmas, with a total of 577 frame types and an average of 4.8 frame types per verb. We present a large-scale evaluation of the complete set of forms extracted against the full COMLEX resource. To our knowledge, this is the largest and most complete evaluation of subcategorization frames acquired automatically for English.
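As an illustration of what associating probabilities with frames conditional on the lemma can look like, the following minimal sketch estimates P(frame | lemma) by relative frequency over (lemma, frame) observations. The abstract does not state the estimator used, so relative frequency is an assumption here; the function names, the frame labels, and the toy data are hypothetical and are not taken from the paper. The final line only checks the arithmetic behind the reported average of 4.8 frame types per verb (21,005 lemma frame types over 4,362 verb lemmas).

```python
from collections import Counter, defaultdict


def frame_probabilities(observations):
    """Estimate P(frame | lemma) by relative frequency.

    observations: iterable of (lemma, frame) pairs, one pair per verb token.
    Relative-frequency estimation is an assumption, not the paper's stated method.
    """
    counts = defaultdict(Counter)
    for lemma, frame in observations:
        counts[lemma][frame] += 1

    probs = {}
    for lemma, frame_counts in counts.items():
        total = sum(frame_counts.values())
        probs[lemma] = {frame: n / total for frame, n in frame_counts.items()}
    return probs


def average_frames_per_lemma(probs):
    """Number of distinct (lemma, frame) pairs divided by the number of lemmas."""
    return sum(len(frames) for frames in probs.values()) / len(probs)


# Hypothetical toy observations; the frame labels are illustrative only.
toy = [
    ("give", "subj,obj,obj2"),
    ("give", "subj,obj,obj2"),
    ("give", "subj,obj,obl:to"),
    ("give", "subj,obj"),
    ("sleep", "subj"),
    ("sleep", "subj"),
]

probs = frame_probabilities(toy)
for lemma in sorted(probs):
    for frame, p in sorted(probs[lemma].items()):
        print(f"P({frame} | {lemma}) = {p:.2f}")

print("average frame types per lemma (toy data):", average_frames_per_lemma(probs))

# Arithmetic check on the figures reported in the abstract.
print("reported average:", round(21005 / 4362, 1))
```

Because the frame inventory is simply whatever appears in the observations, the sketch also mirrors the abstract's point that frame types are learned from the data rather than predefined. An active/passive distinction could be modeled by making the voice part of the frame label.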

