Abstract
Traditional Web search engines are good at finding static Web pages, but they cannot reach the large volume of data hidden behind Web search forms. A significant and ever-growing amount of data is accessible only by filling out HTML forms to query an underlying Web data source. This paper describes an effective technique for detecting search forms: candidate forms are described with more representative features and then classified with a decision tree, which yields relatively high detection accuracy.
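The two steps named in the abstract, describing a form by features and classifying it with a decision tree, can be sketched as follows. This is only a minimal illustration under stated assumptions: the three boolean features, the toy training forms, and the ID3-style tree builder are all hypothetical stand-ins, not the feature set or the decision tree algorithm used in the paper.

```python
from collections import Counter
from html.parser import HTMLParser
from math import log2


class FormFeatures(HTMLParser):
    """Collect a few illustrative (assumed) features from one HTML form."""

    def __init__(self):
        super().__init__()
        self.text_inputs = 0
        self.has_password = False
        self.has_search_word = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            kind = (attrs.get("type") or "text").lower()
            if kind in ("text", "search"):
                self.text_inputs += 1
            elif kind == "password":
                self.has_password = True
        # Look for a search-like keyword in any attribute value of the tag.
        values = " ".join(v for v in attrs.values() if v).lower()
        if "search" in values or "query" in values:
            self.has_search_word = True


def extract(html):
    """Turn form markup into a dict of boolean features."""
    parser = FormFeatures()
    parser.feed(html)
    return {
        "has_password": parser.has_password,
        "has_search_word": parser.has_search_word,
        "single_text_input": parser.text_inputs == 1,
    }


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def build_tree(rows, labels, features):
    """ID3-style tree over boolean features (a stand-in for the paper's tree)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf: pure node, or nothing left to split on

    def split_entropy(feature):
        total = 0.0
        for value in (True, False):
            subset = [l for r, l in zip(rows, labels) if r[feature] == value]
            if subset:
                total += len(subset) / len(labels) * entropy(subset)
        return total

    best = min(features, key=split_entropy)
    rest = [f for f in features if f != best]
    node = {"feature": best}
    for value in (True, False):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        if pairs:
            sub_rows, sub_labels = zip(*pairs)
            node[value] = build_tree(list(sub_rows), list(sub_labels), rest)
        else:
            node[value] = majority  # unseen branch falls back to the majority
    return node


def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree[row[tree["feature"]]]
    return tree


# Toy labeled forms, invented purely for illustration.
TRAIN = [
    ('<form><input type="text" name="q"><input type="submit" value="Search"></form>', "search"),
    ('<form><input type="text" name="user"><input type="password" name="pw"></form>', "other"),
    ('<form action="/query"><input name="keywords"><input type="submit" value="Go"></form>', "search"),
    ('<form><input type="text" name="email"><input type="submit" value="Subscribe"></form>', "other"),
]

rows = [extract(html) for html, _ in TRAIN]
labels = [label for _, label in TRAIN]
tree = build_tree(rows, labels, list(rows[0]))
```

Calling `classify(tree, extract(form_html))` then labels an unseen form. With real data, the feature set would need to be much richer and the tree would be evaluated on held-out forms; the sketch only shows how the pieces fit together.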
Source
《微计算机信息》
PKU Core Journal (北大核心)
2008, No. 33, pp. 204-205, 208 (3 pages)
Control & Automation
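Funding
National Science and Technology Infrastructure Platform Portal Application System; issuing body: Ministry of Science and Technology of China (2005DKA63901)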