This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential c...This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential cluster by dynamically adjusting the radius of neighborhood according to local density. It avoids density-based spatial clustering of applications with noise (DBSCAN) ′s global density parameters and reduces input parameters to one. The results of experiment on real document show that CABDET achieves better accuracy of clustering than DBSCAN method. The CABDET algorithm obtains the max F-measure value 0.347 with the root node's radius of neighborhood 0.80, which is higher than 0.332 of DBSCAN with the radius of neighborhood 0.65 and the minimum number of objects 6.展开更多
The paper considers the problem of semantic processing of web documents by designing an approach, which combines extracted semantic document model and domain- related knowledge base. The knowledge base is populated wi...The paper considers the problem of semantic processing of web documents by designing an approach, which combines extracted semantic document model and domain- related knowledge base. The knowledge base is populated with learnt classification rules categorizing documents into topics. Classification provides for the reduction of the dimensio0ality of the document feature space. The semantic model of retrieved web documents is semantically labeled by querying domain ontology and processed with content-based classification method. The model obtained is mapped to the existing knowledge base by implementing inference algorithm. It enables models of the same semantic type to be recognized and integrated into the knowledge base. The approach provides for the domain knowledge integration and assists the extraction and modeling web documents semantics. Implementation results of the proposed approach are presented.展开更多
基金Science and Technology Development Project of Tianjin(No. 06FZRJGX02400)National Natural Science Foundation of China (No.60603027)
文摘This paper focuses on document clustering by clustering algorithm based on a DEnsityTree (CABDET) to improve the accuracy of clustering. The CABDET method constructs a density-based treestructure for every potential cluster by dynamically adjusting the radius of neighborhood according to local density. It avoids density-based spatial clustering of applications with noise (DBSCAN) ′s global density parameters and reduces input parameters to one. The results of experiment on real document show that CABDET achieves better accuracy of clustering than DBSCAN method. The CABDET algorithm obtains the max F-measure value 0.347 with the root node's radius of neighborhood 0.80, which is higher than 0.332 of DBSCAN with the radius of neighborhood 0.65 and the minimum number of objects 6.
文摘The paper considers the problem of semantic processing of web documents by designing an approach, which combines extracted semantic document model and domain- related knowledge base. The knowledge base is populated with learnt classification rules categorizing documents into topics. Classification provides for the reduction of the dimensio0ality of the document feature space. The semantic model of retrieved web documents is semantically labeled by querying domain ontology and processed with content-based classification method. The model obtained is mapped to the existing knowledge base by implementing inference algorithm. It enables models of the same semantic type to be recognized and integrated into the knowledge base. The approach provides for the domain knowledge integration and assists the extraction and modeling web documents semantics. Implementation results of the proposed approach are presented.