Abstract
This paper presents the basic approach and steps for automatic summarization of web pages in a Tibetan search engine, gives a definition of a Web summary, and discusses the word segmentation algorithm. It proposes a Web summary generation algorithm based on sentence extraction. The weight of each Web sentence is decomposed into a Web feature-word weight and a Web sentence-structure weight, combined as a weighted sum; the sentence-structure weight takes both typographic formatting and hyperlink attributes into account. Sentences are selected by weight according to a given proportion and then smoothed, producing a fluent summary of reasonable quality. Experimental analysis indicates that the method performs well.
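The sentence-extraction idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch and not the paper's actual algorithm: the abstract does not give the weighting formulas, so the use of normalized term frequency as the feature-word weight, the mixing parameter `alpha`, the selection `ratio`, and the function name `select_sentences` are all assumptions made for the example. The structure weights (formatting and hyperlink cues) are taken as precomputed inputs, sentences are assumed to be already segmented into words, and the smoothing step is reduced to restoring document order.

```python
import math
from collections import Counter


def select_sentences(sentences, structure_weights, ratio=0.3, alpha=0.5):
    """Return indices of the sentences chosen for the summary, in document order.

    sentences         : list of token lists (word segmentation already done)
    structure_weights : one float per sentence, assumed to lie in [0, 1]
    ratio             : fraction of sentences to keep (assumed parameter)
    alpha             : share of the feature-word weight in the sum (assumed)
    """
    # Document-level term frequency stands in for the feature-word weight.
    tf = Counter(tok for sent in sentences for tok in sent)
    max_tf = max(tf.values()) if tf else 1

    scores = []
    for tokens, struct_w in zip(sentences, structure_weights):
        # Average normalized term frequency of the sentence's words.
        if tokens:
            word_w = sum(tf[t] for t in tokens) / (len(tokens) * max_tf)
        else:
            word_w = 0.0
        # Weighted sum of feature-word weight and structure weight.
        scores.append(alpha * word_w + (1 - alpha) * struct_w)

    # Keep the top-scoring sentences according to the given proportion.
    k = max(1, math.ceil(ratio * len(sentences)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restoring document order is a crude placeholder for the smoothing step.
    return sorted(ranked[:k])
```

For example, calling `select_sentences(tokenized, weights, ratio=0.5)` on four segmented sentences returns the indices of the two highest-scoring ones in their original order; raising a sentence's structure weight (say, because it appears in a heading or as link text) makes it more likely to be kept.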
Source
Microprocessors (《微处理机》)
2010, Issue 5, pp. 77-80 (4 pages)
Funding
Supported by a project of the Ministry of Education of China (2008704)
Keywords
Natural language processing
Automatic summarization
Word segmentation
Weights
Smoothing