Abstract
To address the lack of structural information when the Term Frequency-Inverse Document Frequency (TF-IDF) model is applied to topical crawlers, this paper designs a topical crawler based on the Classified Keyword Term Frequency (CKTF) model. Using the structural features of a Web page document and the distribution of topic words, each Web page is mapped to a five-dimensional vector. Keywords are selected from the Chinese Wikipedia corpus and the Sogou whole-web news corpus, and their relevance to the geopolitical topic is computed from the same corpora. A Support Vector Machine (SVM) is then used to learn and classify the Web page vectors. Experimental results show that, compared with traditional topical crawlers, the proposed crawler can mine the rich content of the geopolitical topic, measure the relevance between a Web page and the topic effectively, and achieve high crawling precision and stability.
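The mapping from a Web page to a five-dimensional vector can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not the paper's method: the abstract does not name the five page regions, so `REGIONS` is hypothetical; keyword weights are taken to be the cosine similarity between a keyword's word vector and a topic vector; and each dimension is a length-normalized weighted keyword frequency. The helper names `keyword_weights` and `cktf_vector` are likewise invented here.

```python
import math
import re
from collections import Counter

# Hypothetical page regions; the abstract does not say which five it uses.
REGIONS = ("title", "headings", "anchors", "meta", "body")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_weights(topic_vec, word_vecs):
    """Assumed weighting: similarity of each keyword vector to the topic vector."""
    return {w: max(0.0, cosine(topic_vec, v)) for w, v in word_vecs.items()}


def cktf_vector(page, weights):
    """Map a page (region name -> text) to one weighted keyword frequency per region."""
    vec = []
    for region in REGIONS:
        # Simple word tokenization; Chinese text would need a word segmenter instead.
        tokens = re.findall(r"\w+", page.get(region, "").lower())
        counts = Counter(tokens)
        score = sum(weights.get(t, 0.0) * c for t, c in counts.items())
        # Normalize by region length so long regions do not dominate.
        vec.append(score / len(tokens) if tokens else 0.0)
    return vec


# Toy two-dimensional word vectors, invented for this example.
word_vecs = {"border": [1.0, 0.0], "treaty": [0.6, 0.8], "recipe": [0.0, 1.0]}
weights = keyword_weights([1.0, 0.0], word_vecs)
page = {"title": "Border treaty signed", "body": "A recipe for soup"}
print(cktf_vector(page, weights))
```

In this toy page only the title contains topic keywords, so only the first component is non-zero. Vectors of this form, labeled as on-topic or off-topic, are the kind of input an SVM classifier would be trained on; the sketch stops before that step.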
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals
2016, Issue 2, pp. 45-50 (6 pages)
Computer Engineering
Funding
Open Fund of the Sichuan Engineering Research Center for Emergency Mapping and Disaster Reduction (K2015B014)
Keywords
topical crawler
Classified Keyword Term Frequency (CKTF) model
word vector
Support Vector Machine (SVM)
relevance