Abstract
To address the narrow topical coverage and the excess of garbage information in current Web information gathering, this paper discusses a heuristic gathering system with partial manual supervision, built on meta-search expansion. After the user submits a set of keywords on a single topic, the system automatically merges the results returned by several search engines into one ordered document collection. This collection is clustered with a suffix tree algorithm. The clusters are then labelled manually as valid or garbage, and the labelled data form the training set from which a classifier is built. When the user later submits further keywords on the same topic, the system automatically identifies and gathers the valid data among the results returned by each member search engine and filters out the garbage. Experimental results show that the average recognition rate for valid information on fixed-topic data exceeds 92%.
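The merge-then-filter flow described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the engines' results are merged or which classifier is trained, so both choices here are assumptions: reciprocal rank fusion stands in for the merging step, and a small multinomial Naive Bayes model stands in for the classifier. The names `merge_results` and `NaiveBayesFilter` are hypothetical, and the suffix tree clustering step is omitted.

```python
import math
from collections import Counter, defaultdict


def merge_results(ranked_lists, k=60):
    """Fuse ranked URL lists from several engines (assumed: reciprocal rank fusion)."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by URL so the order is stable.
    return sorted(scores, key=lambda u: (-scores[u], u))


class NaiveBayesFilter:
    """Two-class (valid / garbage) multinomial Naive Bayes with add-one smoothing.

    Assumed stand-in: the source does not name its classifier.
    """

    def __init__(self):
        self.word_counts = {"valid": Counter(), "garbage": Counter()}
        self.doc_counts = Counter()

    def train(self, labelled_docs):
        # labelled_docs: iterable of (text, label) pairs from the manual labelling step.
        for text, label in labelled_docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        vocab = set(self.word_counts["valid"]) | set(self.word_counts["garbage"])
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Smoothed log prior plus smoothed log likelihood of each token.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(counts.values()) + len(vocab) + 1  # +1 for unseen words
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In use, the lists returned by each member engine would be passed to `merge_results`, the merged documents clustered and labelled by hand, and the labelled texts passed to `NaiveBayesFilter.train`. Results for later keywords on the same topic would then be kept only when `classify` returns `"valid"`.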
Source
Journal of Beijing Institute of Petrochemical Technology (《北京石油化工学院学报》)
2007, No. 4, pp. 38-42 (5 pages)
Funding
Supported by the National Natural Science Foundation of China
Grant No. 60673160