摘要
文本向量化是将文本转化为向量的代数模型建立过程,在文本处理领域具有重要的应用价值,是文本数据挖掘算法的关键环节。在著名的PageRank算法基础上,提出一种基于句中词语间关系的文本向量化算法。通过引入语义层面的词语关联来克服传统的基于词频统计数据的向量化方法语义敏感度不佳的缺陷。在不同的语料测试集上的实验表明,基于句中词语间关系的文本向量化算法有更高的准确率。
Document vectorization is the process of building vector space model which has a number of potential applications on natural language processing. This paper describes an algorithm of vectorization through the relationships of word in a sentence based on the PageRank algorithm. The introduction of semantics relationship is then proposed to overcome the disadvantage of traditional statistics-based vectorization. Experimental results show that the new method has a better accuracy rate.
出处
《信息安全与通信保密》
2014年第4期84-88,共5页
Information Security and Communications Privacy
基金
国家自然科学基金资助项目(批准号:61272441
61171173)