Abstract
Research on hot keyword extraction for minority languages is scarce, and existing algorithms suffer from low accuracy and efficiency. To address this, a method for extracting hot keywords from Kazakh web text is proposed. First, keywords are extracted from the preprocessed text using a TF-IDF algorithm improved with multi-factor weighting, and the extracted keywords are then combined according to their position and frequency information to obtain a set of candidate hot keywords. Next, a keyword hotness scoring formula, KHD (Keywords Hot Degree), is constructed by combining the TF-PDF algorithm with the idea of media attention, and is used to extract the hot keywords. Experimental results show that the method is feasible and effective, and that both extraction accuracy and efficiency improve significantly.
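The abstract names TF-PDF as one ingredient of the KHD score but does not give the KHD formula itself. As a rough illustration of that ingredient only, the sketch below computes the standard TF-PDF weight, in which a term's weight is the sum over channels (media sources) of its normalized in-channel frequency multiplied by the exponential of the fraction of that channel's documents containing it. The function name `tf_pdf`, the input layout, and the sample data are assumptions made for illustration; the paper's multi-factor TF-IDF weighting, keyword combination step, and media-attention term are not reproduced here.

```python
import math
from collections import Counter


def tf_pdf(channels):
    """Standard TF-PDF weights (illustrative sketch, not the paper's KHD).

    channels: dict mapping a channel name to a list of documents,
              where each document is a list of already-segmented tokens.
    Returns a dict mapping each term to its TF-PDF weight.
    """
    weights = Counter()
    for docs in channels.values():
        if not docs:
            continue  # an empty channel contributes nothing
        # Frequency of each term across the whole channel.
        term_freq = Counter(tok for doc in docs for tok in doc)
        # Number of documents in the channel that contain each term.
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        # L2 norm of the channel's term-frequency vector.
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        if norm == 0:
            continue  # channel has documents but no tokens
        for term, freq in term_freq.items():
            proportion = doc_freq[term] / len(docs)
            weights[term] += (freq / norm) * math.exp(proportion)
    return dict(weights)


# Hypothetical toy input: two channels with two short documents each.
sample = {
    "site_a": [["x", "y"], ["x"]],
    "site_b": [["x"], ["z"]],
}
scores = tf_pdf(sample)
ranked = sorted(scores, key=scores.get, reverse=True)
```

With this toy input, the term that appears in both channels ranks first, which reflects the intuition behind TF-PDF: a term covered widely across many sources is more likely to be a hot topic than one confined to a single document.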
Source
《计算机应用与软件》
2017, No. 1, pp. 45-49, 67 (6 pages in total)
Computer Applications and Software
Funding
National Natural Science Foundation of China (61063025, 61363062)