Abstract
As the Internet grows, Web pages carry ever more information at ever higher density. The topical information of a page, however, is usually not clearly marked, which makes it difficult to extract. To address this problem, an extraction algorithm is proposed. It first builds a Document Object Model (DOM) tree. Then, to compensate for the weak, semi-structured nature of HTML, it adds display and semantic attributes to the DOM nodes (number of links, number of non-link characters, height, width, and so on) and applies a proposed clustering rule to partition the page into blocks. Finally, it prunes the tree to remove useless information and extract the topical information. Experiments show that the method extracts topical information accurately.
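The pipeline described in the abstract (build a DOM tree, annotate nodes with link and non-link text counts, then prune) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not state the clustering rule or how height and width are obtained, so the sketch leaves out the display attributes and substitutes a simple link-text ratio threshold for the partitioning and pruning decision. All names (`Node`, `TreeBuilder`, `annotate`, `prune`, `extract_topic_text`), the `BLOCK_TAGS` set, and the `max_link_ratio` value are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: elements treated as candidate blocks for pruning.
BLOCK_TAGS = {"div", "ul", "ol", "table", "tr", "td", "p",
              "nav", "section", "aside", "header", "footer"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TEXT = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text          # non-empty only for "#text" nodes
        self.children = []
        self.link_count = 0       # number of <a> elements in the subtree
        self.link_chars = 0       # characters of text inside links
        self.text_chars = 0       # characters of text outside links


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.current.tag not in SKIP_TEXT:
            self.current.children.append(Node("#text", self.current, text))


def annotate(node, inside_link=False):
    """Fill in link count and link / non-link character counts bottom-up."""
    inside_link = inside_link or node.tag == "a"
    node.link_count = 1 if node.tag == "a" else 0
    node.link_chars = len(node.text) if inside_link else 0
    node.text_chars = 0 if inside_link else len(node.text)
    for child in node.children:
        annotate(child, inside_link)
        node.link_count += child.link_count
        node.link_chars += child.link_chars
        node.text_chars += child.text_chars


def prune(node, max_link_ratio=0.5):
    """Drop block elements that are empty or dominated by link text."""
    kept = []
    for child in node.children:
        if child.tag in BLOCK_TAGS:
            total = child.link_chars + child.text_chars
            if total == 0 or child.link_chars / total > max_link_ratio:
                continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def collect_text(node):
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topic_text(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    prune(builder.root, max_link_ratio)
    return " ".join(collect_text(builder.root))
```

Calling `extract_topic_text(page_html)` on a page with a link-only navigation bar and a text-heavy article block should keep the article text and drop the navigation links. A faithful implementation would also need rendered height and width, which require a layout engine rather than a plain HTML parser.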
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2010, Issue 5, pp. 188-190 (3 pages)
Keywords
DOM
Topic
Information extraction
Partitioning
Pruning