Abstract
Automatic subject extraction is a practical technique: it condenses an entire Web page into its subject feature words, which addresses the problem that wireless terminals cannot display a full page on their small screens. This paper summarizes existing research on text subject extraction and designs a feature-word weighting function that takes into account the number of paragraphs a word spans. Non-linear functions are used to describe the effects of the word-length factor and the paragraph-span factor, and the weighting function is applied to automatic subject extraction. A prototype system for automatic subject extraction from Chinese text was implemented, and tests on a text collection confirm that the weighting function is effective.
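The abstract names the ingredients of the weighting function (term frequency, a word-length factor, and a paragraph-span factor, the latter two non-linear) but does not give its form. The following is a minimal sketch under stated assumptions, not the paper's formula: the square root for word length, the logarithm for paragraph span, and the names `weight_terms` and `top_terms` are all hypothetical choices made for illustration. Input is assumed to be text already segmented into paragraphs and words.

```python
import math
from collections import Counter


def weight_terms(paragraphs):
    """Score each term in a document given as a list of token lists.

    Assumed form (illustrative only, not taken from the paper):
        weight = frequency * sqrt(word length) * (1 + ln(paragraph span))
    """
    freq = Counter()   # total occurrences of each term
    span = Counter()   # number of paragraphs each term appears in
    for tokens in paragraphs:
        freq.update(tokens)
        span.update(set(tokens))  # count each paragraph at most once

    weights = {}
    for term, count in freq.items():
        length_factor = math.sqrt(len(term))       # assumed non-linear length factor
        span_factor = 1.0 + math.log(span[term])   # assumed non-linear span factor
        weights[term] = count * length_factor * span_factor
    return weights


def top_terms(paragraphs, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    weights = weight_terms(paragraphs)
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]
```

With this form, a word that occurs twice across two paragraphs outranks a word of the same length that occurs twice within a single paragraph, which is the behavior the paragraph-span factor is meant to capture.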
Source
Journal of Sichuan University (Engineering Science Edition)
EI
CAS
CSCD
2004, No. 3, pp. 97-100 (4 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 40274058)
Keywords
Automatic subject extraction
Feature word
Weighting function
Automation
Statistics
Text processing
World Wide Web