Abstract
The explosive growth in the number of Web pages has made Web document classification a major focus of current Web mining research. Targeting the characteristics of documents in a specific domain, this paper proposes a document feature representation method based on hierarchical feature word weights. On this basis, a Web document classification algorithm, HFSHA (Text Categorization Algorithm Based on Hierarchy Feature Word Weight and Structure and Hyperlink Analysis), is designed by analyzing Web page structure and text links. Classification experiments on a corpus of fashion Web documents show that, for specialized fashion documents, HFSHA achieves higher classification accuracy than an ordinary text classification algorithm based on the vector space model (VSM).
Source
Journal of North University of China (Natural Science Edition) (《中北大学学报(自然科学版)》)
PKU Core Journal (北大核心)
2017, Issue 3, pp. 354-359 (6 pages)
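Funding
Key Project of the Beijing Educational Science "Twelfth Five-Year" Plan (AJA11174)
Humanities and Social Sciences Project of the Ministry of Education (12YJA760014)
Beijing Institute of Fashion Technology 2014 Scientific Research Improvement Program, Cultivation Project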