Abstract
Focused (topic-specific) Web retrieval is an emerging direction in information retrieval that combines crawling with filtering, and it is an active research topic in information processing. To address the weaknesses of existing topic retrieval systems in judging the topical relevance of Web page text and in their spider search strategies, this paper introduces two performance optimizations. The first applies information extraction: a pattern-set-based method for judging topical relevance, intended to improve the accuracy of topic judgments on retrieved documents. The second addresses the shortcomings of PageRank in topic retrieval: a page evaluation algorithm based on reinforcement learning is used to value Web pages and characterize the Web topic environment, yielding a search strategy that gives precedence to the Web environment. This strategy is reported to improve search efficiency for rare information. Finally, the performance of the two methods is evaluated experimentally.
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2006, No. 4, pp. 183-185, 188 (4 pages)
Funding
Natural Science Foundation of Hebei Province (Grant No. F2004000132)
Keywords
information extraction technology, information extraction pattern, pattern matching, Web environment, reinforcement learning