Abstract
With the rapid development of Internet technology, the Internet has become the richest and most important source of news. Building on an analysis of news Web pages and a review of existing information extraction and XML technologies, this paper proposes a Web news extraction system based on XML. The system exploits the strength of XPath in locating data: the paper presents an XPath generation algorithm based on the DOM tree, uses XSLT to describe the extraction rules, and uses XPath path expressions to locate the information points to be extracted.
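The abstract does not spell out the XPath generation algorithm, so the following is only a minimal sketch of one common way to derive an XPath from a DOM tree: walk from the target node up to the root and record each tag, adding a positional index when siblings share the tag. The function name `build_xpath` and the sample page are hypothetical and are not taken from the paper; the sketch also assumes well-formed markup that Python's standard `xml.etree.ElementTree` can parse.

```python
import xml.etree.ElementTree as ET


def build_xpath(root, target):
    """Return an absolute, positional XPath from root to target.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # Siblings with the same tag decide whether an index is needed.
        same_tag = [c for c in parent if c.tag == node.tag]
        if len(same_tag) > 1:
            # XPath positions are 1-based.
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        else:
            steps.append(node.tag)
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


# Made-up news page used only to exercise the sketch.
PAGE = """<html><body>
<div><p>nav</p></div>
<div><h1>Headline</h1><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

root = ET.fromstring(PAGE)
headline = root.find("body/div[2]/h1")
second = root.findall(".//p")[2]

print(build_xpath(root, headline))  # /html/body/div[2]/h1
print(build_xpath(root, second))    # /html/body/div[2]/p[2]
```

Purely positional paths like these are brittle when a page layout changes; a production extractor would likely prefer attribute-based predicates, which this sketch leaves out.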
Source
Intelligent Computer and Applications (《智能计算机与应用》)
2015, No. 2, pp. 58-61 (4 pages)
Keywords
Data Mining
Information Extraction
News Extraction System
XPath