Design and Implementation of a Heuristic Web Information Gathering System

Published English title: The Design and the Implementation of Net Information Gathering with Heuristic Method
Abstract: To address two problems in current web information gathering, namely narrow single-topic coverage and an excess of junk information, this paper discusses a heuristic gathering system with partial manual supervision, together with a meta-search expansion technique. After a user submits a group of keywords on one topic, the system automatically merges the results returned by several search engines into a single ordered document set. This set is clustered with a suffix tree algorithm. The clusters are then labelled manually as valid or junk, and the labelled clusters form the training set from which a classifier is built. When the user later submits more keywords on the same topic, the system automatically identifies and collects the valid data among the results returned by the member search engines and filters out the junk. Experimental results show that, on fixed-topic data, the average recognition rate for valid information exceeds 92%.
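The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the merged results are ordered, so reciprocal rank fusion is used here purely as an assumed stand-in, not as the authors' method. The function name `merge_results`, the constant `k`, and the example URLs are all hypothetical.

```python
from collections import defaultdict


def merge_results(result_lists, k=60):
    """Merge ranked URL lists from several engines into one ordered list.

    Scoring rule (an assumption, not taken from the paper): reciprocal
    rank fusion, where each engine contributes 1 / (k + rank) for a URL.
    """
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            if url in seen:
                # Count each URL at most once per engine.
                continue
            seen.add(url)
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically so the
    # output is deterministic.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical result lists from two member search engines.
engine_a = ["http://a.example/1", "http://b.example/2", "http://c.example/3"]
engine_b = ["http://b.example/2", "http://d.example/4", "http://a.example/1"]

merged = merge_results([engine_a, engine_b])
print(merged)
```

In this sketch, a URL returned by both engines outranks one returned by only a single engine, and duplicates collapse into one entry, which gives the deduplicated, ordered set that the clustering stage would take as input.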
Source: Journal of Beijing Institute of Petrochemical Technology, 2007, No. 4, pp. 38-42 (5 pages)
Funding: National Natural Science Foundation of China, Grant No. 60673160
Keywords: suffix tree; clustering; support vector machine (SVM); classification; inverse document frequency (IDF)

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部