A new common phrase scoring method is proposed according to term frequency-inverse document frequency (TFIDF) and independence of the phrase. Combining the two properties can help identify more reasonable common phr...A new common phrase scoring method is proposed according to term frequency-inverse document frequency (TFIDF) and independence of the phrase. Combining the two properties can help identify more reasonable common phrases, which improve the accuracy of clustering. Also, the equation to measure the in-dependence of a phrase is proposed in this paper. The new algorithm which improves suffix tree clustering algorithm (STC) is named as improved suffix tree clustering (ISTC). To validate the proposed algorithm, a prototype system is implemented and used to cluster several groups of web search results obtained from Google search engine. Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.展开更多
This paper presents an effective clustering mode and a novel clustering result evaluating mode. Clustering mode has two limited integral parameters. Evaluating mode evaluates clustering results and gives each a mark. ...This paper presents an effective clustering mode and a novel clustering result evaluating mode. Clustering mode has two limited integral parameters. Evaluating mode evaluates clustering results and gives each a mark. The higher mark the clustering result gains, the higher quality it has. By organizing two modes in different ways, we can build two clustering algorithms: SECDU(Self-Expanded Clustering Algorithm based on Density Units) and SECDUF(Self-Expanded Clustering Algorithm Based on Density Units with Evaluation Feedback Section). SECDU enumerates all value pairs of two parameters of clustering mode to process data set repeatedly and evaluates every clustering result by evaluating mode. Then SECDU output the clustering result that has the highest evaluating mark among all the ones. By applying "hill-climbing algorithm", SECDUF improves clustering efficiency greatly. Data sets that have different distribution features can be well adapted to both algorithms. SECDU and SECDUF can output high-quality clustering results. SECDUF tunes parameters of clustering mode automatically and no man's action involves through the whole process. In addition, SECDUF has a high clustering performance.展开更多
基金Foundation item: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086)Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow Uni-versity (KJS0714)+1 种基金Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082)National Natural Science Foundation of Jiangsu (BK2006094).
文摘A new common phrase scoring method is proposed according to term frequency-inverse document frequency (TFIDF) and independence of the phrase. Combining the two properties can help identify more reasonable common phrases, which improve the accuracy of clustering. Also, the equation to measure the in-dependence of a phrase is proposed in this paper. The new algorithm which improves suffix tree clustering algorithm (STC) is named as improved suffix tree clustering (ISTC). To validate the proposed algorithm, a prototype system is implemented and used to cluster several groups of web search results obtained from Google search engine. Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.
基金Supported by the National Natural Science Foundation of China(60573089)
文摘This paper presents an effective clustering mode and a novel clustering result evaluating mode. Clustering mode has two limited integral parameters. Evaluating mode evaluates clustering results and gives each a mark. The higher mark the clustering result gains, the higher quality it has. By organizing two modes in different ways, we can build two clustering algorithms: SECDU(Self-Expanded Clustering Algorithm based on Density Units) and SECDUF(Self-Expanded Clustering Algorithm Based on Density Units with Evaluation Feedback Section). SECDU enumerates all value pairs of two parameters of clustering mode to process data set repeatedly and evaluates every clustering result by evaluating mode. Then SECDU output the clustering result that has the highest evaluating mark among all the ones. By applying "hill-climbing algorithm", SECDUF improves clustering efficiency greatly. Data sets that have different distribution features can be well adapted to both algorithms. SECDU and SECDUF can output high-quality clustering results. SECDUF tunes parameters of clustering mode automatically and no man's action involves through the whole process. In addition, SECDUF has a high clustering performance.