Abstract
WHISK is a semi-automatic information extraction (IE) system that uses learned extraction rules to extract information from both structured and semi-structured Web text. However, its rule-learning procedure does not guarantee that rules are extended in an optimal way, and generating the rule set is time-consuming. To address these problems, this paper uses a genetic algorithm to improve WHISK's supervised learning algorithm through heuristic rule expansion, and adopts a removal method to generate the rule set. Experimental results show that the proposed method improves both efficiency and recall.
Source
Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)
EI
CSCD
Peking University Core Journal (北大核心)
2011, No. 3, pp. 385-390 (6 pages)
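Funding
Supported by:
National Natural Science Foundation of China (No. 60775028)
Jilin Province Special Fund for Information Industry Development (吉信发[2008]40号)
Major Project of the Dalian Science and Technology Bureau (No. 2007A14GX042)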
Keywords
Information Extraction, WHISK System, Genetic Algorithm, Rule Learning