Abstract
Drawing on ideas from ontology, this paper proposes an algorithm that combines rules with statistics to extract information about people from the web, enabling automatic extraction of semi-structured person information. A dictionary of 4,624 valid field names was built by programmatic statistics and is used to check whether an extracted field name is valid. The corresponding field value is extracted only when the field name is valid, which greatly improves extraction precision. Experimental results show that the algorithm is efficient at extracting information from semi-structured web pages about people, with an average precision of 97.6% and an average recall of 86.1%.
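The dictionary check described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the tiny `VALID_FIELD_NAMES` set stands in for the 4,624-entry dictionary, and the `name: value` line pattern, the function names, and the sample field names are all assumptions made for illustration. The helper `precision_recall` shows one common way such figures are computed, which may differ from the paper's evaluation procedure.

```python
import re

# Hypothetical stand-in for the paper's 4,624-entry field-name dictionary.
VALID_FIELD_NAMES = {"name", "gender", "birth date", "birthplace", "occupation"}

# Assumed candidate rule: "field name : field value" on one line,
# accepting either an ASCII colon or a full-width colon.
PAIR_PATTERN = re.compile(r"^\s*([^::\s][^::]*?)\s*[::]\s*(.+?)\s*$")


def extract_fields(lines, dictionary=VALID_FIELD_NAMES):
    """Keep a field value only when its field name is in the dictionary."""
    record = {}
    for line in lines:
        match = PAIR_PATTERN.match(line)
        if match is None:
            continue  # no candidate pair on this line
        name = match.group(1).lower()
        value = match.group(2)
        if name in dictionary:  # validate the name before taking the value
            record[name] = value
    return record


def precision_recall(extracted, expected):
    """Precision and recall over (field name, field value) pairs."""
    correct = sum(1 for key, value in extracted.items()
                  if expected.get(key) == value)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    page_lines = [
        "Name: Zhang San",
        "Gender : male",
        "Copyright: Example Site",   # rejected, name not in dictionary
        "a line without any pair",
    ]
    record = extract_fields(page_lines)
    print(record)
    print(precision_recall(record, {"name": "Zhang San",
                                    "gender": "male",
                                    "birthplace": "Beijing"}))
```

In this sketch the "Copyright" line is discarded because its name is not in the dictionary, which is how validating names first can raise precision. The missing "birthplace" field shows how recall can stay lower even when every extracted pair is correct.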
Source
《微计算机信息》 (Microcomputer Information)
2010, No. 12, pp. 145-147 (3 pages)
Control & Automation
Keywords
web information extraction
extraction rules
semi-structured web pages
XML
page layout analysis