Abstract
This paper proposes a method for clustering Deep Web data sources based on the Dirichlet process. The method uses the hierarchical Dirichlet process (HDP) for feature extraction: the high-dimensional, sparse text of each query interface is first represented as topic features, and the number of features is determined automatically. Each text is then treated as a multinomial model and clustered with a Dirichlet process mixture model. The number of clusters does not have to be specified manually in advance; it is inferred from the data by the Dirichlet process, which makes the method well suited to Deep Web data sources, which are numerous and change quickly. Validation experiments on the widely used TEL-8 data set, compared against other clustering methods on F-measure and entropy, show good results on both metrics.
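The claim that the number of clusters need not be fixed in advance can be illustrated with the Chinese restaurant process, a standard sequential view of the Dirichlet process prior. The sketch below is only an illustration under stated assumptions, not the paper's implementation: it samples from the prior alone and omits the multinomial likelihood and the HDP topic features. The function name `crp_assignments` and the values of `alpha`, `n_items`, and `seed` are hypothetical.

```python
import random
from collections import Counter


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    alpha is the concentration parameter (assumed value, must be > 0):
    larger alpha makes opening a new cluster more likely.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items already assigned to cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels


if __name__ == "__main__":
    labels = crp_assignments(n_items=200, alpha=1.0)
    # The number of clusters is an outcome of sampling, not an input.
    print(len(set(labels)), Counter(labels).most_common(3))
```

In a full Dirichlet process mixture model, these prior weights would be multiplied by the likelihood of each text under each cluster's multinomial parameters before sampling an assignment.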
Source
Microcomputer & Its Applications (《微型机与应用》)
2015, No. 7, pp. 75-78 (4 pages)