Design and Implementation of a Web Document Conversion Algorithm in Data Mining

Abstract: Web text mining is an important application of data mining to the processing of network information. Converting web documents into the format that data mining requires, that is, web document preprocessing, is therefore an important research topic. The method in this paper is as follows. A large number of web pages are downloaded from the Internet and converted into text files. An algorithm then computes word frequency statistics over these text files, deletes non-content words (stop words), removes high-frequency words, stems the remaining words, eliminates redundant words, and builds a vocabulary of content words. The extracted content words are sorted alphabetically to generate a word frequency index, which is compared against a dictionary file to obtain the ID of each word. Finally, the data are written out in the Database data format of Reuters-21578. In this way the web document data are converted into a standard data set, in preparation for classification and clustering in data mining.
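
The steps listed in the abstract can be sketched in Python as follows. This is only an illustrative sketch and not the authors' implementation: the stop-word list, the suffix-stripping `naive_stem` function (a crude stand-in for a real stemmer), the `max_doc_ratio` threshold used to define "high-frequency" words, and all function names are assumptions introduced here. The dictionary of word IDs is built from the corpus itself rather than read from a separate dictionary file, and the output is a list of sparse (word ID, count) pairs per document, not the Reuters-21578 file format.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative stop-word list (assumption; the paper's list is not given).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on", "from"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Convert one web page (HTML string) to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def naive_stem(word):
    """Strip a few common English suffixes (hypothetical, very rough stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(pages, max_doc_ratio=0.9):
    """Return (word_id, vectors) for a list of HTML pages.

    word_id maps each kept stem to an ID assigned in alphabetical order.
    vectors holds, per page, a sorted list of (word ID, count) pairs.
    """
    docs = []
    for html in pages:
        tokens = re.findall(r"[a-z]+", html_to_text(html).lower())
        stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
        docs.append(Counter(stems))  # word frequency statistics per page

    # Document frequency: in how many pages each stem occurs.
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())

    # Drop stems that occur in too large a share of pages (assumed criterion).
    n_docs = len(docs)
    vocab = sorted(w for w, d in doc_freq.items() if d / n_docs <= max_doc_ratio)
    word_id = {w: i for i, w in enumerate(vocab, start=1)}

    vectors = [
        sorted((word_id[w], c) for w, c in counts.items() if w in word_id)
        for counts in docs
    ]
    return word_id, vectors
```

Calling `preprocess` on a list of HTML strings returns the alphabetically ordered word-to-ID table and one sparse count vector per page, which could then be serialized into whatever data set format the downstream classification or clustering tool expects.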

Authors: Zhao Xiaolong (赵小龙), She Dong (佘东)

Affiliation: 安徽工业经济学院 (Anhui Institute of Industry and Economics)

Source: Journal of Chaohu University (《巢湖学院学报》), 2011, No. 6, pp. 34-38 (5 pages)

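Funding: Anhui Provincial Outstanding Talent Fund for Higher Education Institutions (Project No. 2009SQRZ136); Chaohu University General Project (Project No. XLY-200910); supported by the Natural Science Fund project "Research on the Development of a College Research Management Information System" of 安徽工业经济学院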

Keywords: web documents; data mining; preprocessing

Classification number: TP393 [Automation and Computer Technology: Computer Application Technology]
