Automatic Identification of Deep Web Query Interfaces Based on a Maximum Entropy Classifier (cited by: 1)

Automatic identifying query interfaces of deep Web with maximum entropy classifier

Abstract: The Web holds a vast amount of high-quality information that lies deep within sites and cannot be indexed by traditional search engines; such resources are called the Deep Web. Because a query interface is the only entry point to a Deep Web source, retrieving Deep Web information requires determining which web page forms are Deep Web query interfaces. The maximum entropy model can combine many kinds of observed probabilistic knowledge, whether correlated or not, and it gives good results on many problems. This paper therefore uses a maximum entropy classification algorithm to identify query interfaces automatically. Experiments compare the maximum entropy classifier with other commonly used classifiers: it outperforms the Bayes and C4.5 methods and performs on par with SVM, which indicates that it is a practical method for classifying query interfaces.
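As a rough illustration of the idea in the abstract, the sketch below extracts a few binary features from an HTML form and fits a two-class maximum entropy model, which for binary features is equivalent to logistic regression. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature set (password field, text field, select box, the word "search"), the toy training forms, the hyperparameters, and the use of plain gradient ascent in place of the iterative scaling procedures often used to train maximum entropy models.

```python
import math
from html.parser import HTMLParser


class FormFeatures(HTMLParser):
    """Collect a few hypothetical binary features from one HTML form."""

    def __init__(self):
        super().__init__()
        self.has_password = 0.0
        self.has_text = 0.0
        self.has_select = 0.0
        self.has_search_word = 0.0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            kind = (attrs.get("type") or "text").lower()
            if kind == "password":
                self.has_password = 1.0
            elif kind == "text":
                self.has_text = 1.0
            elif kind == "submit" and "search" in (attrs.get("value") or "").lower():
                self.has_search_word = 1.0
        elif tag == "select":
            self.has_select = 1.0

    def handle_data(self, data):
        if "search" in data.lower():
            self.has_search_word = 1.0


def extract(html):
    """Return the feature vector for a form; the leading 1.0 is a bias feature."""
    parser = FormFeatures()
    parser.feed(html)
    parser.close()
    return [1.0, parser.has_password, parser.has_text,
            parser.has_select, parser.has_search_word]


def predict(weights, x):
    """P(query interface | x) under a two-class maximum entropy (logistic) model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=500, rate=0.5, l2=0.01):
    """Gradient ascent on the L2-regularized conditional log-likelihood."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            err = y - predict(weights, x)  # observed minus expected label
            for j, xj in enumerate(x):
                grad[j] += err * xj
        weights = [w + rate * (g / len(samples) - l2 * w)
                   for w, g in zip(weights, grad)]
    return weights


# Made-up toy forms for illustration only: label 1 = query interface, 0 = other form.
TOY_FORMS = [
    ('<form><input type="text" name="q"><input type="submit" value="Search"></form>', 1),
    ('<form>Search books <input type="text"><select><option>Title</option></select></form>', 1),
    ('<form><select name="city"></select><input type="text">'
     '<input type="submit" value="Search flights"></form>', 1),
    ('<form><input type="text" name="user"><input type="password" name="pw"></form>', 0),
    ('<form><input type="text" name="email"><input type="password">'
     '<input type="submit" value="Register"></form>', 0),
    ('<form><input type="text" name="email"><input type="submit" value="Subscribe"></form>', 0),
]

samples = [extract(html) for html, _ in TOY_FORMS]
labels = [label for _, label in TOY_FORMS]
weights = train(samples, labels)

new_form = '<form>Search the catalog <input type="text" name="kw"></form>'
print("P(query interface) =", round(predict(weights, extract(new_form)), 3))
```

A form would be judged a query interface when the predicted probability exceeds 0.5. A realistic setup would need a much richer feature set and a labeled corpus of forms, neither of which this record describes.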

Authors: Fang Wei (方巍), Huang Li (黄黎), Cui Zhiming (崔志明)

Affiliation: Jiangsu Key Laboratory of Computer Information Processing Technology

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2008, No. 21, pp. 133-137 (5 pages)

Funding: National Natural Science Foundation of China (Grant No. 60673092); Key Project of the Chinese Ministry of Education, 2005 (Grant No. 205059); "Six Talent Peak" Project of Jiangsu Province, 2006 (Grant No. 06-E-037); Specialized Fund Project for the Software and IC Industry of Jiangsu Province, 2006 (Grant No. [2006]221-41); Open Fund Project of the Jiangsu Key Laboratory of Computer Information Processing Technology, 2007.

Keywords: Deep Web; HTML form; feature extraction; maximum entropy model

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
