Abstract
Word clustering is an important foundational step in automatic language processing. The main obstacle in Chinese word clustering research is that training data are scarce and of low quality, which degrades clustering results. To address this, this paper proposes a word clustering algorithm oriented to Chinese. The algorithm first uses the similarity of words' context distributions as the distance measure. It then analyzes the shortcomings of clustering Chinese words by the distance measure alone, introduces the concept of a word's near space (Word-Near-Space), and clusters according to it, so that clustering relies on the intrinsic semantics of words and requires neither the number nor the size of the classes to be specified. Finally, the algorithm takes the clustering result as the basis for recomputing similarity and repeats the steps above as an EM-style iteration until it converges, which markedly improves the clusters. Experiments show that the algorithm effectively overcomes the quantity and quality problems of Chinese training data and yields good clustering results.
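The loop described in the abstract (measure context similarity, cluster, then recompute similarity over the resulting classes) can be illustrated with a rough sketch. The abstract does not give the paper's similarity formula or its definition of the near space, so everything concrete here is an assumption made for illustration: cosine similarity over neighbouring-word counts stands in for the context distribution similarity, a similarity threshold with linked groups stands in for the near space, and the names `context_vectors`, `cluster_once`, and `iterative_clustering` are hypothetical.

```python
from collections import Counter, defaultdict
import math

def context_vectors(sentences, label, window=1):
    """For each word, count the class labels of its neighbours within `window`."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            ctx = vecs[w]  # make sure every word gets a (possibly empty) vector
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[label[sent[j]]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(v * b[k] for k, v in a.items())  # Counter yields 0 if k is absent
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_once(vecs, threshold):
    """Link every pair of words with similarity >= threshold (union-find).

    The number and size of the classes are not specified in advance;
    they fall out of the threshold (an assumed stand-in for the near space).
    """
    words = sorted(vecs)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vecs[a], vecs[b]) >= threshold:
                parent[find(b)] = find(a)
    return {w: find(w) for w in words}

def iterative_clustering(sentences, threshold=0.72, max_iter=10):
    """Re-cluster with class labels as context features until labels stop changing."""
    label = {w: w for s in sentences for w in s}  # start: one class per word
    for _ in range(max_iter):
        new_label = cluster_once(context_vectors(sentences, label), threshold)
        if new_label == label:
            break
        label = new_label
    return label

# Toy corpus: "eats" only ever follows "cat", "devours" only ever follows "dog".
sentences = [["cat", "sleeps"], ["dog", "sleeps"], ["cat", "runs"], ["dog", "runs"],
             ["cat", "jumps"], ["dog", "jumps"], ["cat", "eats"], ["dog", "devours"]]
labels = iterative_clustering(sentences)
print(labels["eats"] == labels["devours"])
```

In this toy corpus "eats" and "devours" share no neighbouring word, so a single pass keeps them apart. Once "cat" and "dog" fall into one class, the two verbs have identical class-level contexts and are merged in the next pass, which is the kind of gain on sparse data that the iteration is meant to provide.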
Source
《计算机工程与科学》
CSCD
2006, Issue 1, pp. 122-124, 142 (4 pages in total)
Computer Engineering & Science