Founded: 1974
ISSN 0891-2017
E-ISSN 1530-9312
2008 ISI Impact Factor: 2.656
March 2004, Vol. 30, No. 1, Pages 75-93
Posted Online March 13, 2006.
(doi:10.1162/089120104773633394)
© 2004 Association for Computational Linguistics
Accessor Variety Criteria for Chinese Word Extraction

Haodi Feng
Shandong University; Tsinghua University; City University of Hong Kong
School of Computer Science and Technology, Jinan, PRC; Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong.
E-mail: fenghd@cs.cityu.edu.hk or fenghaodi@hotmail.com

Kang Chen
Department of Computer Science and Technology, Peking, PR China.
E-mail: ck99@mails.tsinghua.edu.cn

Xiaotie Deng
City University of Hong Kong; Tsinghua University
Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong.
E-mail: csdeng@cityu.edu.hk

Weimin Zheng
Department of Computer Science and Technology, Peking, PR China.
E-mail: zwm-dcs@mails.tsinghua.edu.cn
We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, ‘percent’ and ‘more and more’ are not regarded as traditional Chinese words by some people. In our work, however, they are words because they are very widely used and have specific meanings. We start from the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters directly before a string (predecessors) and the characters directly after a string (successors) to be important factors in determining the independence of the string. We call such characters the accessors of the string. We count the distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents) and use these counts to measure the context independence of the string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction: it is comparable to other iterative methods and outperforms them for long words.
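
The counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `max_len` cap on candidate length, the single `BOUNDARY` marker for sentence edges, and the choice to score a string by the smaller of its two counts are all assumptions made for illustration, since the abstract does not specify them. Latin letters stand in for Chinese characters in the toy input.

```python
from collections import defaultdict

# Hypothetical marker standing in for "start or end of sentence".
BOUNDARY = "\u0000"


def collect_accessors(sentences, max_len=4):
    """Record the distinct predecessors and successors of every substring.

    max_len caps the candidate string length (an assumption of this sketch).
    """
    preds = defaultdict(set)
    succs = defaultdict(set)
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                # Character directly before the string, or the boundary marker.
                preds[s].add(sent[i - 1] if i > 0 else BOUNDARY)
                # Character directly after the string, or the boundary marker.
                succs[s].add(sent[j] if j < n else BOUNDARY)
    return preds, succs


def accessor_variety(s, preds, succs):
    """Score a string by the smaller of its two distinct-accessor counts.

    Combining the counts with min() is an assumption of this sketch.
    """
    return min(len(preds.get(s, ())), len(succs.get(s, ())))


# Toy corpus: "bc" appears in three different contexts.
preds, succs = collect_accessors(["abcd", "xbcy", "zbcd"])
print(accessor_variety("bc", preds, succs))  # 2: predecessors {a, x, z}, successors {d, y}
print(accessor_variety("ab", preds, succs))  # 1: only ever followed by "c"
```

A string that appears in many different contexts, such as "bc" here, scores higher than one that is always followed by the same character, which is the intuition behind treating accessor counts as a measure of context independence.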