Extracting Subject from Internet News by String Match (article in English) Cited by: 11
Abstract Extracting subject strings from a text is one of the important foundations of natural language processing. Traditional extraction methods mainly follow a "thesaurus plus match" pattern. A thesaurus cannot be updated as fast as new words appear in Internet news, and its content cannot fully cover the range of topics in Internet news, so this approach is not suitable for extracting the subjects of Internet news. This paper proposes and implements a new method that extracts news subjects without a thesaurus, and gives the main implementation procedure. The method exploits the special structure of Internet news by searching for strings that are repeated between the title and the body. After simple processing, these repeated strings reflect the subject of the news well. Experimental results show that the method can accurately and efficiently extract the subjects of most Internet news, meeting the needs of automatic news processing. The method is equally applicable to other Asian languages, such as Japanese and Korean, and to Western languages.
Source Journal of Software (《软件学报》), indexed in EI, CSCD and the Peking University Core Journals list, 2002, No. 2, pp. 159-167 (9 pages)
Funding Supported by the National Natural Science Foundation of China under Grant No.60082003
Keywords Web pages; information processing; Internet news; subject extraction; natural language processing; string match; thesaurus
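
The abstract describes finding strings repeated between a news item's title and its body, but does not give the procedure itself. The following is a minimal sketch of that idea only, not the paper's algorithm: the function name `repeated_strings`, the `min_len` threshold and the rule of keeping only maximal matches are all assumptions made for illustration.

```python
def repeated_strings(title: str, body: str, min_len: int = 2) -> list[str]:
    """Return title substrings (length >= min_len) that also occur in the body.

    Hypothetical helper: a character-level comparison that needs no
    dictionary or word segmentation.
    """
    found = set()
    n = len(title)
    for i in range(n):
        j = i + min_len
        # Skip start positions where even the shortest candidate is absent.
        if j > n or title[i:j] not in body:
            continue
        # Extend the match one character at a time while it still occurs in the body.
        while j < n and title[i:j + 1] in body:
            j += 1
        found.add(title[i:j])
    # Assumed "simple processing": drop strings contained in a longer match.
    maximal = [s for s in found if not any(s != t and s in t for t in found)]
    # Longest strings first; ties broken alphabetically for a stable order.
    return sorted(maximal, key=lambda s: (-len(s), s))


title = "上海举办国际电影节"
body = "第五届上海国际电影节昨日开幕,上海市民踊跃参加。"
print(repeated_strings(title, body))  # ['国际电影节', '上海']
```

Because the comparison works on characters rather than dictionary words, the same sketch runs unchanged on text in other languages, although for space-delimited languages it would probably be better to compare word sequences instead of characters.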
