A Text Information Extraction Algorithm Based on Tag Path (XPath) Clustering (cited by: 2)
Abstract: To address web page noise and the high complexity of extracting unstructured information from web pages, this paper proposes a text information extraction algorithm based on tag path (XPath) clustering. The algorithm first preprocesses the page to remove noise, then clusters tag paths according to the page's DOM tree structure. It uses an automatically trained threshold together with a web page segmentation algorithm to quickly identify the key parts of the page, and derives a text extraction template from the nested structure of the data blocks. Experiments on several different types of websites show that the method is fast and achieves relatively high accuracy.
Author: Liu Yunfeng (刘云峰)
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2010, No. 11, pp. 199-202 (4 pages)
Keywords: XPath; web page segmentation; information extraction; clustering; threshold
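
The pipeline outlined in the abstract (noise preprocessing, tag path clustering over the DOM tree, threshold-based selection of the key region) can be illustrated with a minimal sketch. This is not the paper's implementation: the names `TagPathCollector`, `extract_main_text`, `NOISE_TAGS`, and `min_share` are hypothetical, the threshold here is a fixed parameter rather than an automatically trained one, and the web page segmentation and template generation steps are omitted. Clustering is reduced to grouping text nodes that share an identical tag path.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose content is treated as noise (an assumption for this sketch,
# not the paper's noise list).
NOISE_TAGS = {"script", "style", "noscript"}
# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Groups text nodes by their tag path, e.g. /html/body/div/p."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []                    # currently open tags
        self.clusters = defaultdict(list)  # tag path -> list of text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only nodes and anything nested inside a noise tag.
        if not text or any(t in NOISE_TAGS for t in self.stack):
            return
        self.clusters["/" + "/".join(self.stack)].append(text)


def extract_main_text(html, min_share=0.5):
    """Return text nodes from tag-path clusters holding at least
    `min_share` of the page's total text length (fixed threshold)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    sizes = {path: sum(len(t) for t in texts)
             for path, texts in collector.clusters.items()}
    total = sum(sizes.values())
    if total == 0:
        return []
    return [text
            for path, texts in collector.clusters.items()
            if sizes[path] / total >= min_share
            for text in texts]


SAMPLE = """
<html><body>
<div><ul><li><a href="#">Home</a></li><li><a href="#">News</a></li></ul></div>
<div><p>First paragraph of the article body, long enough to dominate.</p>
<p>Second paragraph of the article body, also fairly long text.</p></div>
<script>var x = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    for line in extract_main_text(SAMPLE):
        print(line)
```

In this sketch the navigation links share one short tag path and the article paragraphs share another, so a length-share threshold separates them. A real page would likely need the segmentation step and a tuned threshold that the abstract mentions.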