A Practical Algorithm for Extracting Subject from Web Pages without Thesaurus
Chinese title: 一种无词典的从Web新闻页面抽取主题的算法 (a thesaurus-free algorithm for extracting subjects from Web news pages)
Cited by: 2
Abstract: Subject extraction is one of the important problems in natural language processing. The prevailing approach is "thesaurus + matching", but its weaknesses show when it is applied to dynamically changing Web content: the thesaurus is limited in size and hard to update in time, while new concepts emerge on the Internet almost continuously. Based on a study of the content and structural characteristics of Chinese news Web pages, the authors propose a subject extraction algorithm that exploits Web page structure and requires no thesaurus. In an experiment on a corpus of 1,000 financial news articles from Xinhuanet (新华网), the automatically extracted subjects were compared with manually extracted ones, and the overlap rate exceeded 93%.
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 1, pp. 12-17 (6 pages)
Funding: Supported by the National 863 Program (No. 2002AA119905) and the National Natural Science Foundation of China (No. 60082003).
Keywords: subject extraction; Web pages; hyperlinks