Abstract
This paper proposes a text feature representation method oriented to clustered topics: the feature vector of a document is described in terms of topic concepts obtained by clustering, which raises the description of the text to the semantic level. First, clustering produces a set of latent topic concepts, each expressed as a vector. The term-space feature vector of each document is then projected onto this set of topic concepts, so that the document is described by latent topic concepts rather than by individual terms. Experimental analysis shows that a document vector built on the concept space is in essence the degree of association between the document vector and each topic concept. It highlights the topical features of the document and reflects its semantic content better, which effectively improves the performance of the model in applications such as text retrieval and classification. Moreover, because the dimensionality of the cluster-derived concept space can be adjusted to suit the application, the number of dimensions can be reduced effectively, which improves the practical effectiveness of the model.
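The two steps described in the abstract (cluster the documents into topic concepts, then project each term-space vector onto those concepts) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm or the association measure, so spherical k-means and cosine similarity are assumed here, and the function names `cluster_concepts` and `to_concept_space`, the seeding rule, and the toy data are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster_concepts(docs, k, iterations=10):
    """Assumed clustering step: spherical k-means over term-space vectors.

    Returns k unit-length centroids, used here as the "topic concept" vectors.
    """
    docs = [normalize(d) for d in docs]
    # Assumption: the first k documents seed the clusters (deterministic).
    concepts = [list(d) for d in docs[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in docs:
            # Assign each document to the concept with the highest cosine similarity.
            best = max(range(k), key=lambda j: dot(d, concepts[j]))
            groups[best].append(d)
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                concepts[j] = normalize(mean)
    return concepts

def to_concept_space(doc, concepts):
    """Project a term-space vector onto the concepts.

    Each output component is the cosine similarity between the document and
    one concept, standing in for the "degree of association".
    """
    d = normalize(doc)
    return [dot(d, c) for c in concepts]

# Toy term-frequency vectors over a hypothetical 4-term vocabulary.
docs = [
    [2, 1, 0, 0],
    [0, 0, 1, 2],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
]
concepts = cluster_concepts(docs, k=2)
new_doc_vector = to_concept_space([1, 1, 0, 0], concepts)
print(new_doc_vector)
```

In this sketch the number of clusters `k` is the dimensionality of the concept space, so choosing a smaller `k` is what reduces the dimension of the resulting document vectors.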
Source
《情报学报》
CSSCI
Peking University Core Journals (北大核心)
2009, No. 4, pp. 524-529 (6 pages)
Journal of the China Society for Scientific and Technical Information
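Funding
One of the research outputs of the 2008 Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (Grant No. 08JC870013).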
Keywords
document clustering
concept space model
text features