Abstract
Traditional ontology-based Web page information extraction takes a single information item as the smallest extraction unit, so the extracted entities are weakly related semantically and the extraction accuracy is unsatisfactory. To address these problems, a two-level matching method for user information extraction is proposed on the basis of a microblog domain ontology: semantically related user information at different levels of a microblog is divided into corresponding information blocks, and each information block is then used as the smallest extraction unit from which the user's attribute information is extracted (personal information, information on followed friends, and posted text microblogs). Experimental results show that, compared with traditional information extraction methods, the designed extraction-rule algorithm effectively improves the accuracy and recall of information extraction and performs well on Web pages with complex structure and a large amount of information, such as microblog pages.
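The two-level idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the `ONTOLOGY` dictionary, its keywords, the regular-expression rules, and the function names `match_block`, `extract_attributes`, and `extract_user_info` are all hypothetical, and the page is assumed to be already segmented into plain-text blocks. Level 1 assigns each block to an ontology concept; level 2 extracts attributes only inside the matched block.

```python
import re

# Hypothetical stand-in for a microblog domain ontology (an assumption,
# not taken from the paper): each concept corresponds to one information
# block and lists label keywords plus one extraction rule per attribute.
ONTOLOGY = {
    "personal_info": {
        "keywords": ["nickname", "location", "gender"],
        "attributes": {
            "nickname": r"nickname:\s*(\S+)",
            "location": r"location:\s*(\S+)",
            "gender": r"gender:\s*(\S+)",
        },
    },
    "followed_friends": {
        "keywords": ["following"],
        "attributes": {"friend": r"following:\s*(\S+)"},
    },
    "posts": {
        "keywords": ["post"],
        "attributes": {"text": r"post:\s*(.+)"},
    },
}


def match_block(block_text):
    """Level 1: pick the concept whose keywords occur most often in the block."""
    scores = {
        concept: sum(block_text.count(k) for k in spec["keywords"])
        for concept, spec in ONTOLOGY.items()
    }
    best = max(scores, key=scores.get)
    # A block with no keyword hits is treated as noise (e.g. an ad banner).
    return best if scores[best] > 0 else None


def extract_attributes(concept, block_text):
    """Level 2: apply the attribute rules of the matched concept to the block."""
    found = {}
    for attr, pattern in ONTOLOGY[concept]["attributes"].items():
        values = re.findall(pattern, block_text)
        if values:
            found[attr] = values
    return found


def extract_user_info(blocks):
    """Run both levels over every block and group the results by concept."""
    record = {}
    for block in blocks:
        concept = match_block(block)
        if concept is None:
            continue
        record.setdefault(concept, []).append(extract_attributes(concept, block))
    return record


if __name__ == "__main__":
    # Toy input: one string per page block (made-up data).
    page_blocks = [
        "nickname: alice location: Wuhan gender: F",
        "following: bob following: carol",
        "post: hello world",
        "advertisement banner",
    ]
    print(extract_user_info(page_blocks))
```

Because attributes are only searched for inside a block that has already been matched to a concept, the values extracted together stay semantically associated, which is the property the abstract attributes to block-level extraction.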
Source
《长江大学学报(自科版)(上旬)》
CAS
2015, No. 4, pp. 36-40, 4 (5 pages in total)
JOURNAL OF YANGTZE UNIVERSITY (NATURAL SCIENCE EDITION) SCI & ENG
Funding
Anhui Provincial Department of Education fund project (KJ2013B020)
National Undergraduate Innovation and Entrepreneurship Training Program (201210363066, 201310363097)
Keywords
Domain ontology
Two-level matching
Information extraction
Microblog
Extraction rules