Abstract
The growth of the Internet has changed how information spreads: people rely more and more on electronic devices, and Web pages have become the largest source of information. Making good use of these resources requires information extraction. To obtain key Tibetan-language information from Web pages, this paper proposes a text-density-based method for extracting the main text of Tibetan Web pages. The method exploits the continuity of body text in semi-structured HTML pages and uses regular expressions to filter out HTML tags. It achieves relatively high accuracy on topic-type pages, such as news pages.
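The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expressions, the per-line definition of density (text characters divided by all characters on the line), the thresholds `min_density` and `min_chars`, and the function names `line_densities` and `extract_body` are all assumptions made for illustration.

```python
import re

# Assumed patterns; the abstract does not give the paper's actual expressions.
SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                          re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def line_densities(html):
    """Return (text, density) for each non-empty source line.

    Density is assumed here to be: characters left after removing tags,
    divided by the total characters on the line.
    """
    # Drop scripts, styles and comments first: they are never body text.
    html = COMMENT.sub("", SCRIPT_STYLE.sub("", html))
    result = []
    for raw in html.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        text = TAG.sub("", raw).strip()
        result.append((text, len(text) / len(raw)))
    return result


def extract_body(html, min_density=0.5, min_chars=20):
    """Return the longest run of consecutive dense lines.

    Both thresholds are hypothetical. The "longest consecutive run" rule
    stands in for the continuity of body text mentioned in the abstract.
    """
    best, current = [], []
    for text, density in line_densities(html):
        if density >= min_density and len(text) >= min_chars:
            current.append(text)
            continue
        # A sparse line ends the current run; keep it if it is the longest.
        if sum(map(len, current)) > sum(map(len, best)):
            best = current
        current = []
    if sum(map(len, current)) > sum(map(len, best)):
        best = current
    return "\n".join(best)
```

Calling `extract_body(html)` on the source of a news-style page returns the block of consecutive text-heavy lines and drops navigation bars, scripts, and footers, whose lines are mostly markup. Lengths are counted in Unicode code points, so a Tibetan syllable built from several combining characters counts as several characters; the thresholds would need tuning on real pages.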
Authors
Luosong Qiupei (洛松求培)
Anjian Cairang (安见才让)
Computer Science, Qinghai University for Nationalities, Xining, Qinghai 810007, China
Source
Computer Era (《计算机时代》)
2017, No. 8, pp. 46-47, 51 (3 pages)
Funding
Supported by a project of the Qinghai Provincial Department of Science and Technology (2016-ZJ-Y04)