Abstract
The rapid growth of network information resources has driven information retrieval technology toward maturity. When existing techniques are applied to massive, real-time network data, however, conventional information retrieval algorithms still show a number of shortcomings. Working with the massive real-time network data of the CERNET East China (North) region, this paper studies and designs a two-phase vector clustering algorithm for information retrieval. The algorithm processes information efficiently through two phases: insertion clustering and optimization clustering. In addition, a group-mail (mass-mailing) discrimination application is built on the clustering tree to filter junk mail out of the network data, which further improves retrieval efficiency.
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2004, No. z1, pp. 6-10 (5 pages)
Journal of South China University of Technology (Natural Science Edition)
Keywords
information retrieval
clustering
two-phase vector
mail discrimination