Abstract
Extracting subject strings from text is one of the foundations of natural language processing. Traditional extraction methods rely mainly on a "dictionary (thesaurus) plus matching" approach. Because dictionaries cannot be updated as fast as new words appear in Internet news, and their contents cannot fully cover the range of Internet news, this approach is poorly suited to extracting subjects from Internet news. After analyzing the structure of news items, this paper proposes and implements a new method that extracts news subjects without a dictionary, and gives its main implementation procedure. The method exploits the special structure of Internet news by searching for strings that are repeated between the headline and the body text. After simple processing, these repeated strings reflect the subject of the news item well. Experimental results show that the method extracts the subjects of the great majority of Internet news items accurately, rapidly, and efficiently, which meets the needs of automatic news processing. The method is equally applicable to other Asian languages, such as Japanese and Korean, and to Western languages.
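The core idea stated in the abstract, finding strings repeated between the headline and the body without consulting a dictionary, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the actual procedure or the "simple processing" step. The function name `repeated_strings`, the `min_len` threshold, and the rule of keeping only maximal matches are all assumptions made here for illustration.

```python
def repeated_strings(title: str, body: str, min_len: int = 2) -> list[str]:
    """Return maximal substrings of `title` (at least `min_len` characters)
    that also occur in `body`. Illustrative only; hypothetical helper."""
    spans = []
    n = len(title)
    for start in range(n):
        end = start + min_len
        # Skip positions where even the shortest candidate is not repeated.
        if end > n or title[start:end] not in body:
            continue
        # Greedily extend the match while the longer string still occurs in body.
        while end < n and title[start:end + 1] in body:
            end += 1
        spans.append((start, end))

    # Assumed post-processing: drop spans fully contained in a longer span.
    maximal = [
        (s, e) for (s, e) in spans
        if not any(s2 <= s and e <= e2 and (s2, e2) != (s, e) for (s2, e2) in spans)
    ]

    result = []
    for s, e in maximal:
        candidate = title[s:e].strip()
        if len(candidate) >= min_len and candidate not in result:
            result.append(candidate)
    return result


if __name__ == "__main__":
    headline = "北京大学发布新词典"
    text = "北京大学今天宣布,新词典已经完成。"
    print(repeated_strings(headline, text))
```

Because the sketch works on raw characters rather than on segmented words, it needs no word list, which is the property the abstract emphasizes. A real system would likely need additional filtering, for example of function words or punctuation, which the abstract does not describe.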
Source
《软件学报》
EI
CSCD
PKU Core Journal (北大核心)
2002, Issue 2, pp. 159-167 (9 pages)
Journal of Software
Funding
Supported by the National Natural Science Foundation of China (国家自然科学基金) under Grant No. 60082003