Abstract
The rapid growth of the Web has made it an increasingly important source of useful data. However, Web sites vary in topic, format, and structure, which makes it difficult to extract target data from their pages in a systematic way. This paper uses ASP.NET to develop a method for automatically extracting data based on Web content. The method first selects the target data source and automatically retrieves its static HTML document; it then generates a page description file according to agreed rules, analyzes the HTML document, and sets target anchors; finally, it uses regular expressions and C# to extract the target data automatically and generate the required Web page. The method lets Web users quickly extract the data they need from structured and semi-structured Web pages.
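The anchor-then-regex step described in the abstract can be illustrated with a minimal sketch. The paper implements its method in C# under ASP.NET; the Python below is only an illustration of the general idea, not the authors' code. The function name `extract_records`, the sample HTML, the anchor strings, and the row pattern are all hypothetical, and the page description file is not modeled here.

```python
import re
from html import unescape


def extract_records(html, start_anchor, end_anchor, item_pattern):
    """Return the regex groups matched between two literal anchor strings.

    Hypothetical helper: the anchors play the role of the "target anchors"
    mentioned in the abstract, and item_pattern describes one record.
    """
    # Isolate the region bounded by the two anchors (non-greedy, multi-line).
    region = re.search(
        re.escape(start_anchor) + r"(.*?)" + re.escape(end_anchor),
        html,
        flags=re.DOTALL,
    )
    if region is None:
        return []  # anchors not found: nothing to extract
    # Apply the record pattern only inside the anchored region.
    return [
        tuple(unescape(group.strip()) for group in match.groups())
        for match in re.finditer(item_pattern, region.group(1), flags=re.DOTALL)
    ]


# Assumed sample page standing in for a fetched static HTML document.
SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<table id="forecast">
<tr><td>Nanjing</td><td>21</td></tr>
<tr><td>Suzhou</td><td>23</td></tr>
</table>
<div id="footer">contact</div>
</body></html>
"""

# One table row with two cells; both cell contents are captured.
ROW_PATTERN = r"<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>"

rows = extract_records(SAMPLE_HTML, '<table id="forecast">', "</table>", ROW_PATTERN)
print(rows)
```

Restricting the record pattern to the anchored region keeps it from matching similar markup elsewhere on the page, such as navigation or footer blocks, which is presumably why the method sets anchors before applying regular expressions.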
Source
《计算机技术与发展》
2012, No. 5, pp. 87-89, 93 (4 pages)
Computer Technology and Development
Funding
Jiangsu Province Special Scientific Research Fund for Public Welfare Industries (GYHY201106037)