
Web Information Extraction Based on Sub-tree Breadth (基于子树广度的Web信息抽取)

Cited by: 3
Abstract: This paper proposes a new Web information extraction method that uses the breadth of sub-trees to extract page information automatically from different scientific literature websites, without treating each site differently. Extraction experiments were carried out on a large number of scientific literature websites, and the method has been applied to the Gansu Science and Technology Document Sharing Platform. The experimental results show that the method extracts the relevant information automatically regardless of which website a page comes from, while maintaining high recall and precision.
Authors: Wang Quan (王权), Shi Shaoting (施韶亭)
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2009, No. 3, pp. 89-90, 93 (3 pages)
Funding: Gansu Province Technology Research and Development Special Program (Grant No. 2007GS05285)
Keywords: sub-tree breadth; information extraction; cross-search
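
The abstract does not spell out how sub-tree breadth is computed, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that the "breadth" of a sub-tree is the number of direct child elements of its root, and that the sub-tree with the largest breadth is a plausible candidate for the region holding the repeated result records. The names `Node`, `TreeBuilder`, `parse`, `breadth`, and `widest_subtree` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have children; we must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare-bones element node (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML; text nodes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    """Parse an HTML string and return the synthetic root node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def breadth(node):
    """ASSUMED definition: number of direct child elements of the node."""
    return len(node.children)


def widest_subtree(root):
    """Return the node whose sub-tree has the largest assumed breadth."""
    best, stack = root, [root]
    while stack:
        node = stack.pop()
        if breadth(node) > breadth(best):
            best = node
        stack.extend(node.children)
    return best
```

For a page whose navigation bar has two links and whose result list has four items, `widest_subtree(parse(page))` would return the list element, since it has the most direct children. A real extractor would likely need further filtering (for example, ignoring wide navigation menus), which the abstract does not describe.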

