Computational Linguistics

Quarterly (March, June, September, December)
160 pp. per issue
6 3/4 x 10
Founded: 1974
ISSN 0891-2017
E-ISSN 1530-9312
2008 ISI Impact Factor: 2.656

March 2004, Vol. 30, No. 1, Pages 75-93
Posted Online March 13, 2006.
(doi:10.1162/089120104773633394)
© 2004 Association for Computational Linguistics

Accessor Variety Criteria for Chinese Word Extraction

Haodi Feng

Shandong University Tsinghua University City University of Hong Kong, School of Computer Science and Technology, Jinan, PRC; Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong.

Kang Chen

Department of Computer Science and Technology, Peking, PR China.

Xiaotie Deng

City University of Hong Kong Tsinghua University, Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong.

Weimin Zheng

Department of Computer Science and Technology, Peking, PR China.






We are interested in the problem of word extraction from Chinese text collections. We define a word as a meaningful string composed of several Chinese characters. For example, the strings meaning ‘percent’ and ‘more and more’ are not regarded as traditional Chinese words by some people. In our work, however, they are words, because they are very widely used and have specific meanings. We start from the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters directly before a string (predecessors) and directly after it (successors) to be important factors in determining the independence of the string, and we call such characters the accessors of the string. We count the distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents) and use these counts to measure the context independence of the string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction: it is comparable to other iterative methods and, for long words, outperforms them.
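
The counting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `accessor_counts` and `candidate_words`, the `max_len` and `threshold` parameters, the choice to ignore sentence boundaries, and the use of the smaller of the two counts as the score are all assumptions made for illustration; the abstract only states that the distinct predecessors and successors are counted.

```python
from collections import defaultdict


def accessor_counts(sentences, max_len=4):
    """Return {string: (distinct predecessors, distinct successors)}.

    Every substring of up to max_len characters is considered.
    Sentence boundaries are ignored here (an assumption of this sketch).
    """
    preds = defaultdict(set)
    succs = defaultdict(set)
    for sentence in sentences:
        n = len(sentence)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sentence[i:j]
                if i > 0:
                    preds[s].add(sentence[i - 1])  # character directly before s
                if j < n:
                    succs[s].add(sentence[j])      # character directly after s
    keys = set(preds) | set(succs)
    return {s: (len(preds.get(s, ())), len(succs.get(s, ()))) for s in keys}


def candidate_words(sentences, threshold=2, max_len=4):
    """Keep strings whose smaller accessor count reaches the threshold.

    Scoring by the minimum of the two counts is an assumption of this sketch.
    """
    counts = accessor_counts(sentences, max_len)
    return sorted(s for s, (p, q) in counts.items() if min(p, q) >= threshold)


if __name__ == "__main__":
    # Toy Latin-letter "sentences" stand in for Chinese text.
    toy = ["xaby", "zabw", "xabw"]
    print(accessor_counts(toy)["ab"])
    print(candidate_words(toy))
```

In the toy input, the string "ab" is preceded by two different characters and followed by two different characters, so it appears in varied contexts and is kept as a candidate, while "a" is always followed by "b" and is rejected.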
