Abstract
Information extraction is an important method for data mining and knowledge discovery; rule-based automatic or semi-automatic extraction of accurate, valid data from the Internet is key to knowledge mining. This paper builds a general-purpose text information extraction platform that uses several information matching techniques to extract data and information from web data sources, and applies rule-based processing to extract web page information intelligently. The platform is developed with Eclipse RCP, so its functionality can be extended through plug-ins, and its business logic is implemented with a rule engine. The platform has a friendly interface, is easy to extend and use, and can automatically obtain valid data and information from large-scale collections of web pages.
Source
Journal of Beijing City University (《北京城市学院学报》)
2010, No. 5, pp. 67-70 (4 pages)