
Information Extraction of University Research Faculty Based on LCA Segmentation Algorithm

Cited by: 3
Abstract: Existing information extraction methods for semi-structured Web pages mostly assume that valid data share strong structural similarity: they split a page into data records and data regions with similar features and then extract from them. However, Web pages listing university research faculty are mostly written and filled in by hand rather than generated automatically from templates, so their structure is not rigorous. To handle the weak structure of such pages, this paper proposes a faculty information extraction method based on a lowest common ancestor (LCA) segmentation algorithm. It introduces the connection between the LCA and the strength of semantic relatedness into Web page segmentation, and defines the concepts of basic semantic blocks and effective semantic blocks. After the page is converted into a document object model (DOM) tree and preprocessed, it is first divided into basic semantic blocks by searching upward for LCA nodes. Next, the basic semantic blocks are merged, using the characteristics of faculty information, into effective semantic blocks that hold complete personnel information. Finally, the effective semantic blocks are aligned to obtain the relationally mapped faculty information for the whole page. Experimental results show that, on a large number of real university faculty pages, the method still maintains high precision and recall in segmentation and extraction compared with the MDR (mining data records) algorithm.
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), CSCD, Peking University Core Journal, 2016, Issue 6, pp. 761-772 (12 pages)
Funding: National Natural Science Foundation of China, No. 61202100; Open Fund of the State Key Laboratory of Software Engineering, No. SKLSE2012-09-20
Keywords: information extraction; lowest common ancestor (LCA); basic semantic block; effective semantic block; relational mapping
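
The abstract's central idea, searching upward in the DOM tree for an LCA node, can be illustrated with a minimal sketch. This is not the paper's segmentation algorithm: the sample page, the `id` attributes, and the helper names `build_parent_map`, `ancestors`, and `lowest_common_ancestor` are all hypothetical, and treating a deeper LCA as a sign of closer semantic relatedness is an assumption made here for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed faculty list fragment (not taken from the paper).
PAGE = """
<div id="list">
  <div id="p1"><b id="name1">Zhang San</b><span id="mail1">zs@example.edu</span></div>
  <div id="p2"><b id="name2">Li Si</b><span id="mail2">ls@example.edu</span></div>
</div>
""".strip()


def build_parent_map(root):
    """Map every element to its parent; ElementTree stores no parent links."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestors(node, parents):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path


def lowest_common_ancestor(a, b, parents):
    """Walk upward from b and return the first node that is also on a's path."""
    on_path_of_a = {id(n) for n in ancestors(a, parents)}
    for n in ancestors(b, parents):
        if id(n) in on_path_of_a:
            return n
    return None


root = ET.fromstring(PAGE)
parents = build_parent_map(root)
by_id = {el.get("id"): el for el in root.iter()}

# Two fields of the same person meet at that person's own container.
lca_same = lowest_common_ancestor(by_id["name1"], by_id["mail1"], parents)
# Fields of different people only meet at the enclosing list container.
lca_cross = lowest_common_ancestor(by_id["name1"], by_id["mail2"], parents)

print(lca_same.get("id"))   # p1
print(lca_cross.get("id"))  # list
```

In this toy page, text nodes whose LCA is a person-level container would plausibly belong to one block, while nodes that only meet at the list container would not. Real faculty pages are rarely well-formed XML, so an HTML parser would be needed in place of `ET.fromstring`.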

