Abstract
This paper describes the preliminary construction of a Web log preprocessing data stream in Clementine. The stream implements four main procedures: data cleaning, user identification, session identification, and path completion. It also provides auxiliary functions such as log merging, data auditing, code standardization, and association with external information. Experiments indicate that preprocessing Web logs with Clementine is entirely feasible, which lays a foundation for carrying out subsequent mining on the same platform. To some extent this resolves the difficulty of having Web log preprocessing and mining handled by different tools, and it raises the degree of automation of Web log mining.
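The first three procedures named in the abstract can be illustrated outside Clementine with a minimal Python sketch. It is not the paper's data stream: the record layout (dictionaries with `ip`, `agent`, `time`, `url`, `status`), the list of static-file suffixes, the (IP, user agent) heuristic for users, and the 30-minute session timeout are all assumptions chosen for illustration. Path completion is omitted because it depends on referrer and site-topology information that the abstract does not describe.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed values for illustration; the paper does not state them.
SESSION_TIMEOUT = timedelta(minutes=30)
STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")


def clean(records):
    """Data cleaning: keep successful requests that are not static resources."""
    return [
        r for r in records
        if 200 <= r["status"] < 400
        and not r["url"].lower().endswith(STATIC_SUFFIXES)
    ]


def identify_users(records):
    """User identification: treat each (IP, user agent) pair as one user."""
    users = defaultdict(list)
    for r in records:
        users[(r["ip"], r["agent"])].append(r)
    return users


def identify_sessions(user_records):
    """Session identification: open a new session when the gap between
    consecutive requests of one user exceeds SESSION_TIMEOUT."""
    sessions = []
    for r in sorted(user_records, key=lambda rec: rec["time"]):
        if sessions and r["time"] - sessions[-1][-1]["time"] <= SESSION_TIMEOUT:
            sessions[-1].append(r)
        else:
            sessions.append([r])
    return sessions
```

Calling `identify_sessions(identify_users(clean(records))[key])` for each user key would yield that user's sessions as lists of records in time order.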
Source
Journal of Medical Informatics (《医学信息学杂志》)
CAS
2009, No. 12, pp. 33-36, 40 (5 pages)
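Funding
Special Fund for Basic Scientific Research of the Institute of Medical Information, Chinese Academy of Medical Sciences: "Analysis of Library Website Reader Behavior Based on Web Log Statistics" (Project No. 08R0130)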