Abstract
Document representation is a key component of document clustering, and this paper aims to improve clustering by improving document representation. Synonymy and polysemy are two major challenges for document representation. Motivated by the observation that both phenomena are mainly related to word sense, we propose the Sense Cluster Model (SCM), which represents documents in a space of word sense clusters. SCM first constructs the sense cluster space and then represents each document in that space, yielding the probability of each document over each sense cluster. The space is constructed from the development dataset in two steps: 1) word sense induction automatically discovers the senses of each word from raw text; and 2) word sense clustering identifies identical or similar senses and groups them into sense clusters. Once the space is built, word sense disambiguation is performed, and its results are used to represent each document as a probability distribution over the sense clusters. Experiments on benchmark data show that SCM outperforms both the baseline system and the classic topic model LDA on document clustering.
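The final representation step can be illustrated with a minimal sketch. It assumes that word sense disambiguation has already labeled each token with a sense ID and that sense clustering has produced a mapping from sense IDs to cluster indices. The function name, the sense ID format, and the relative-frequency estimate are all hypothetical choices for illustration; the abstract does not specify how the probabilities are actually estimated.

```python
from collections import Counter


def sense_cluster_distribution(token_senses, sense_to_cluster, num_clusters):
    """Estimate a document's distribution over sense clusters.

    token_senses: sense IDs, one per disambiguated token (hypothetical format).
    sense_to_cluster: dict mapping a sense ID to its sense cluster index.
    num_clusters: total number of sense clusters in the space.
    """
    # Count how many tokens fall into each cluster; skip unknown senses.
    counts = Counter(
        sense_to_cluster[s] for s in token_senses if s in sense_to_cluster
    )
    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution for empty documents.
        return [1.0 / num_clusters] * num_clusters
    # Relative frequency as a simple probability estimate (an assumption).
    return [counts[c] / total for c in range(num_clusters)]


# Toy example: two sense clusters, "river bank" (0) and "financial bank" (1).
sense_to_cluster = {"bank#1": 0, "shore#1": 0, "bank#2": 1, "lender#1": 1}
doc = ["bank#2", "lender#1", "bank#2", "shore#1"]
dist = sense_cluster_distribution(doc, sense_to_cluster, num_clusters=2)
print(dist)
```

In this toy setup, synonymous senses ("bank#2" and "lender#1") contribute to the same cluster, while the two senses of "bank" are kept apart, which is the intuition behind representing documents over sense clusters rather than surface words.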
Source
《中文信息学报》
CSCD
PKU Core Journal
2013, No. 3, pp. 113-119 (7 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (Grant No. 61272233)
Keywords
document clustering
document representation
topic model
word sense