
Geeking: A Sports News Search Engine System Based on Champion List
Abstract: This paper introduces the architecture and features of Geeking, a sports news search engine system. The system consists of four modules: web crawling, champion list construction, search processing, and the user interface. Its main features are query correction, query auto-completion, ranking of search results, clustering of similar news, keyword highlighting on the results page, and web page snapshots. When a query is entered, the system auto-completes it based on the search logs and current hot news keywords. When no relevant results are found, it corrects the query and suggests recommended query terms. During retrieval, the champion list is used to quickly locate the documents relevant to each query term. Document relevance is computed from tf-idf weights combined with other factors such as the news headline and release time, and the results are ranked by score. For clustering similar news, the longest common subsequence and the edit (Levenshtein) distance are combined to measure the similarity between news headlines, and headline similarity is taken as the similarity between the news documents. Test results show that the features of the champion-list-based Geeking system work well together and that retrieval responses are fast.
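As a rough illustration of the champion list idea mentioned in the abstract, the sketch below keeps, for each term, only the `r` highest-weighted documents and scores queries against those short lists. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `build_champion_lists` and `search`, the parameter `r`, and the log-scaled tf-idf variant are all hypothetical choices, and the headline and release-time factors described in the abstract are left out.

```python
import math
from collections import Counter, defaultdict


def build_champion_lists(docs, r=2):
    """Build a champion list per term (hypothetical helper).

    docs maps a document id to its list of tokens. For every term, only the
    r documents with the highest tf-idf weight are kept.
    """
    n = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    postings = defaultdict(list)
    for doc_id, counts in term_counts.items():
        for term, count in counts.items():
            # Assumed weighting: log-scaled term frequency times idf.
            weight = (1 + math.log10(count)) * math.log10(n / df[term])
            postings[term].append((doc_id, weight))

    # Keep the top r postings per term; ties are broken by document id.
    return {
        term: sorted(plist, key=lambda p: (-p[1], p[0]))[:r]
        for term, plist in postings.items()
    }


def search(query_terms, champions, k=10):
    """Rank documents by summed weights, looking only at champion lists."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in champions.get(term, []):
            scores[doc_id] += weight
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


if __name__ == "__main__":
    docs = {
        "d1": ["nba", "finals", "nba"],
        "d2": ["nba", "draft"],
        "d3": ["football", "finals"],
    }
    champions = build_champion_lists(docs, r=1)
    print(search(["nba", "draft"], champions))
```

Because each term contributes at most `r` candidates, the query cost no longer depends on the full posting list length. The trade-off is that a relevant document outside every champion list is never scored.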
Source: Journal of Integration Technology (《集成技术》), 2016, No. 2, pp. 97-108 (12 pages)
Funding: National Natural Science Foundation of China (61433012, U1435215, 11171086); Natural Science Foundation of Hebei Province (F2013201064)
Keywords: search engine; sports news; champion list; Levenshtein distance; clustering; query term correction
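The abstract states that headline similarity combines the longest common subsequence (LCS) with the edit distance, but this record does not give the exact formula. The following is a minimal sketch under assumptions: the equal-weight blend controlled by `alpha`, the clustering threshold, the greedy single-pass grouping, and the function names are hypothetical, chosen only to show how the two measures might be combined.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def title_similarity(a, b, alpha=0.5):
    """Blend of normalized LCS and normalized edit similarity, in [0, 1].

    The weighting by alpha is an assumption, not the paper's formula.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    lcs_score = lcs_length(a, b) / longest
    edit_score = 1 - edit_distance(a, b) / longest
    return alpha * lcs_score + (1 - alpha) * edit_score


def cluster_titles(titles, threshold=0.6):
    """Greedy clustering: join the first cluster whose first title is similar."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if title_similarity(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters


if __name__ == "__main__":
    headlines = [
        "Lakers beat Celtics",
        "Lakers beat the Celtics",
        "Chess open begins",
    ]
    for group in cluster_titles(headlines):
        print(group)
```

Both measures operate on characters, so the same functions apply unchanged to Chinese headlines. Comparing each headline only against the first member of a cluster keeps the pass cheap, at the cost of making the result depend on input order.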
