Abstract
To address the problem that the large volume of data in the deep Web cannot be indexed by traditional search engines, this paper uses improved heuristic rules to identify query form entry points during Web page extraction, and an improved semantics-based similarity matching algorithm to fill in form content when matching form labels against content. Experimental results show an accuracy of 94.23% for form label extraction, a matching success rate of 88.83%, and a form-filling success rate of 95.43%.
Source
《计算机工程》
CAS
CSCD
PKU Core Journal (北大核心)
2010, No. 7, pp. 66-67, 70 (3 pages)
Computer Engineering