Abstract
Text topic extraction is widely used to condense textual information. Chinese text is conventionally treated as a sequence of basic words, but the information granularity of a single word is too small, so word-level extraction cannot fully express the semantics of a text fragment. Phrases, in contrast, carry richer fine-grained semantic information and better represent the topic of a text fragment. This paper therefore proposes a double-linguistic-filter method (a lexical category, i.e. part-of-speech, filter and a phrase-extension rule filter) that removes redundant information from the corpus and extracts topic phrases from the text. The extracted phrases are close to a refined semantic expression of the text. Experimental results show that the proposed method is reasonably reliable and applicable, and it may suggest new approaches to text mining.
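The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the tag set `KEEP_TAGS`, the rule table `EXTEND_RULES`, and all function names are hypothetical, and the input is assumed to be already segmented and part-of-speech tagged by some Chinese tagger.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical tag set: n = noun, v = verb, a = adjective.
KEEP_TAGS = {"n", "v", "a"}

# Hypothetical extension rules: a phrase whose last token carries the
# left tag may be extended by a token carrying the right tag.
EXTEND_RULES = {("a", "n"), ("n", "n"), ("v", "n")}


def pos_filter(tokens: Iterable[Token]) -> List[List[Token]]:
    """Stage 1: drop tokens with unwanted tags, splitting the text into runs."""
    runs: List[List[Token]] = []
    current: List[Token] = []
    for word, tag in tokens:
        if tag in KEEP_TAGS:
            current.append((word, tag))
        elif current:
            # A filtered-out token ends the current run.
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


def extend_phrases(run: List[Token]) -> List[str]:
    """Stage 2: greedily grow phrases inside one run using EXTEND_RULES."""
    phrases: List[str] = []
    current = [run[0]]
    for token in run[1:]:
        if (current[-1][1], token[1]) in EXTEND_RULES:
            current.append(token)
        else:
            # Keep only multi-word candidates; single words are discarded.
            if len(current) > 1:
                phrases.append("".join(w for w, _ in current))
            current = [token]
    if len(current) > 1:
        phrases.append("".join(w for w, _ in current))
    return phrases


def extract_topic_phrases(tokens: Iterable[Token]) -> List[str]:
    """Apply both filters in sequence and collect the candidate phrases."""
    return [p for run in pos_filter(tokens) for p in extend_phrases(run)]
```

In this sketch, function words such as particles and adverbs act as phrase boundaries in the first stage, and the second stage joins only adjacent tokens whose tag pair matches a rule. A real system would presumably use a richer tag set and rules mined from data.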
Source
《计算机与现代化》
2015, No. 12, pp. 7-14 (8 pages in total)
Computer and Modernization
Keywords
phrase extraction
information extraction
rule mining