Abstract
This paper studies an improved n-gram incremental algorithm (a frequent pattern-growth approach) for extracting the semantic strings that express key information in a text. A multi-feature fusion evaluation method then weights the candidates and selects the most important semantic strings for each text, and these strings serve as the features that represent the text. K-means clustering experiments show that semantic string features yield a more compact text model than word features: they greatly reduce the dimensionality of the feature space and also improve the performance of the clustering algorithm.
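The abstract does not give the details of the extraction or scoring steps, so the following is only a minimal sketch of the general idea of growing frequent n-grams level by level. The function names `frequent_strings` and `top_strings`, the parameters `min_count` and `max_n`, and the score (frequency times length) are all assumptions made for illustration; they are not the paper's algorithm or its multi-feature fusion formula.

```python
from collections import Counter


def frequent_strings(tokens_per_doc, min_count=2, max_n=4):
    """Grow frequent n-grams level by level (hypothetical sketch).

    An n-gram is counted only if both of its (n-1)-gram parts were
    frequent at the previous level, so rare strings are pruned early.
    """
    counts = Counter((tok,) for doc in tokens_per_doc for tok in doc)
    current = {g: c for g, c in counts.items() if c >= min_count}
    frequent = dict(current)
    for n in range(2, max_n + 1):
        if not current:
            break
        counts = Counter()
        for doc in tokens_per_doc:
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                if gram[:-1] in current and gram[1:] in current:
                    counts[gram] += 1
        current = {g: c for g, c in counts.items() if c >= min_count}
        frequent.update(current)
    return frequent


def top_strings(frequent, k):
    """Rank strings by an assumed score: frequency times length."""
    ranked = sorted(frequent, key=lambda g: -frequent[g] * len(g))
    return ranked[:k]
```

In a pipeline like the one the abstract describes, the selected strings would replace single words as the feature vocabulary before clustering, which is where the reduction in feature-space dimensionality would come from.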
Source
《中文信息学报》
CSCD
Peking University Core Journal
2017, No. 5, pp. 99-107 (9 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (61562083, 61262062, 61262063)