Abstract
Data mining uses the computing power of machines to take over part of the analysis that was previously done by hand. In traditional data analysis, people analyze, reason about, and interpret data themselves, but the amount of computation a human brain can carry out is limited. Computers now stand in for the human analyst in much of this work: they handle routine create, read, update, and delete operations that require no independent thinking, and they can sometimes take on tasks that require the ability to learn, such as high-quality analysis and mining of web page data. To explore web data analysis and mining further, this paper proposes an optimized method for calculating the distance between samples and uses it to improve how the K-means algorithm computes its cluster centers. Specifically, the paper collects nearly 12,000 publicly available records returned for the keyword "手机" (mobile phone) on the e-commerce site Dangdang.com (当当网) and applies text mining to them. The text of each record is fully preprocessed, including cleaning, Chinese word segmentation, and keyword weight calculation. Finally, K-means with the optimized cluster centers is used to uncover hidden information in a seemingly unrelated data set and to give e-commerce users guidance on market trends.
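The pipeline summarized above (cleaning, segmentation, keyword weighting, clustering) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify the optimized distance or cluster-center calculation, so the sketch uses ordinary K-means with cosine distance (the spherical variant) seeded by farthest-point initialization. Character bigrams stand in for a real Chinese word segmenter, TF-IDF is assumed as the keyword weighting, and the four product titles are invented sample data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Clean the text and split it into character bigrams.

    Assumption: bigrams are a dependency-free stand-in for a real
    Chinese word segmenter, which the abstract does not name.
    """
    cleaned = re.sub(r"[^\w]", "", text.lower())  # drop spaces and punctuation
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so that terms present in every document keep weight 1.
        vec = {t: (c / len(toks)) * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Normalized mean of sparse vectors, used as the new cluster center."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, k, max_iter=50):
    """Plain K-means on unit vectors; NOT the paper's optimized variant."""
    # Farthest-point seeding: start from the first sample, then repeatedly
    # add the sample that is farthest from its nearest existing center.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(
            vectors,
            key=lambda v: min(cosine_distance(v, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: cosine_distance(v, centers[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean_vector(members)
    return labels


# Hypothetical product titles, not taken from the paper's data set.
titles = [
    "华为手机 5G 全网通 智能手机",
    "小米手机 5G 智能手机 大电池",
    "手机壳 透明 防摔 保护套",
    "手机壳 硅胶 防摔 保护套",
]
labels = kmeans(tfidf_vectors(titles), k=2)
print(labels)
```

On this toy input the two handset titles fall into one cluster and the two phone-case titles into the other, even though all four contain "手机". In practice a dedicated segmenter and a library implementation of K-means would replace the hand-written parts, and the paper's own distance and center calculations would replace `cosine_distance` and `mean_vector`.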
Authors
叶昊
缪宜恒
张宏俊
YE Hao; MIAO Yiheng; ZHANG Hongjun (School of Modern Posts, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003; School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003; China Communications Services Co., Ltd., Beijing 100005; School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003)
Source
《软件》
2023, Issue 6, pp. 35-43 (9 pages)
Software
Funding
Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX22_1019).
Keywords
e-commerce page
data mining
data preprocessing
Chinese text clustering