Abstract
Accurately and efficiently mining hidden user text preferences and concept vectors from a set of training documents is one of the key technologies in natural language processing applications such as text information filtering and multi-document summarization. Because a training document set often covers several topic categories, a text preference mining method based on cluster analysis is proposed. The basic idea is to cluster the training documents, analyze what the documents on the same topic have in common, and then, after feature weight adjustment and feature reduction, obtain concept vectors that represent the user's preferences on different topics. Experimental results show that the method characterizes user text preferences more precisely and is insensitive to changes in the relevance threshold. It can be combined with algorithms such as Rocchio to model user interests.
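The pipeline outlined in the abstract (cluster, find per-topic commonality, adjust weights, reduce features) can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the weighting formula, or the reduction rule, so everything concrete here is an assumption made for illustration: single-pass cosine clustering, weighting each term by the fraction of documents in its cluster that contain it, and keeping the `top_k` highest-weighted terms. The function names and the `sim_threshold` and `top_k` parameters are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector of a whitespace-tokenized document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average term weight over all vectors in a cluster."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts
    return {t: w / len(vectors) for t, w in total.items()}

def single_pass_cluster(vectors, sim_threshold=0.3):
    """Assumed clustering step: join the most similar cluster, else start a new one."""
    clusters = []
    for v in vectors:
        best, best_sim = None, sim_threshold
        for c in clusters:
            s = cosine(v, centroid(c))
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([v])
        else:
            best.append(v)
    return clusters

def concept_vector(cluster, top_k=3):
    """Assumed commonality weighting and feature reduction for one topic."""
    n = len(cluster)
    weighted = {
        # Scale the centroid weight by the share of documents containing the term.
        t: w * sum(1 for v in cluster if t in v) / n
        for t, w in centroid(cluster).items()
    }
    # Keep only the top_k terms; ties are broken alphabetically.
    top = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return dict(top)

def mine_preferences(docs, sim_threshold=0.3, top_k=3):
    """Return one concept vector per discovered topic cluster."""
    vectors = [tf_vector(d) for d in docs]
    clusters = single_pass_cluster(vectors, sim_threshold)
    return [concept_vector(c, top_k) for c in clusters]

if __name__ == "__main__":
    docs = [
        "stock market shares rise",
        "stock market shares fall",
        "football match goal score",
        "football match final goal",
    ]
    for profile in mine_preferences(docs):
        print(profile)
```

Terms shared by every document in a cluster keep their full weight, while terms that appear in only one document are down-weighted and then dropped by the reduction step, which is one simple way to read the "commonality analysis" described above.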
Source
《计算机应用研究》
CSCD
Peking University Core Journal
2005, Issue 12, pp. 21-23 (3 pages)
Application Research of Computers
Funding
National Natural Science Foundation of China (60373100)
National "863" Program of China (2002AA117010-09)