Abstract
This paper proposes a session identification method based on web page features. The features of the web pages are analyzed, and a feature vector is computed for every page in the website. From these feature vectors, the degree of relatedness between any two pages can be computed. Following the chronological order of the user's page requests in the web log, a relatedness curve over all directly adjacent page records is obtained. Once a threshold is set, session boundaries are placed at the positions where the relatedness curve fluctuates strongly. Pages with high relatedness are grouped into the same session, which completes session identification.
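The pipeline described in the abstract can be sketched as follows. The abstract does not say how the feature vectors are built, which relatedness measure is used, or how the threshold is chosen, so this sketch assumes precomputed numeric vectors, cosine similarity as the relatedness measure, and a fixed threshold of 0.5. It also simplifies "strong fluctuation" to "relatedness between adjacent requests falls below the threshold". The names `cosine`, `split_sessions`, and the sample paths and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def split_sessions(pages, vectors, threshold=0.5):
    """Split a time-ordered list of requested pages into sessions.

    A new session starts whenever the relatedness between two directly
    adjacent requests drops below the threshold (a simplification of the
    fluctuation criterion described in the abstract).
    """
    if not pages:
        return []
    sessions = [[pages[0]]]
    for prev, cur in zip(pages, pages[1:]):
        if cosine(vectors[prev], vectors[cur]) < threshold:
            sessions.append([cur])      # low relatedness: session boundary
        else:
            sessions[-1].append(cur)    # high relatedness: same session
    return sessions


# Hypothetical feature vectors for four pages of one site.
vectors = {
    "/news/a": [1.0, 0.0, 0.0],
    "/news/b": [0.9, 0.1, 0.0],
    "/shop/x": [0.0, 1.0, 0.2],
    "/shop/y": [0.0, 0.9, 0.3],
}
log = ["/news/a", "/news/b", "/shop/x", "/shop/y"]  # requests in time order
sessions = split_sessions(log, vectors)
print(sessions)
```

With these sample vectors the two news pages and the two shop pages are each closely related, while the step from `/news/b` to `/shop/x` has low relatedness, so the log is split into two sessions at that point.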
Source
《燕山大学学报》
CAS
2008, No. 1, pp. 10-13 (4 pages)
Journal of Yanshan University
Keywords
web log mining
data preprocessing
session identification