To improve the accuracy of text clustering, fuzzy c-means clustering based on topic concept sub-space (TCS2FCM) is introduced for classifying texts. Five evaluation functions are combined to extract key phrases. Con...To improve the accuracy of text clustering, fuzzy c-means clustering based on topic concept sub-space (TCS2FCM) is introduced for classifying texts. Five evaluation functions are combined to extract key phrases. Concept phrases, as well as the descriptions of final clusters, are presented using WordNet origin from key phrases. Initial centers and membership matrix are the most important factors affecting clustering performance. Orthogonal concept topic sub-spaces are built with the topic concept phrases representing topics of the texts and the initialization of centers and the membership matrix depend on the concept vectors in sub-spaces. The results show that, different from random initialization of traditional fuzzy c-means clustering, the initialization related to text content contributions can improve clustering precision.展开更多
A method that combines category-based and keyword-based concepts for a better information retrieval system is introduced. To improve document clustering, a document similarity measure based on cosine vector and keywor...A method that combines category-based and keyword-based concepts for a better information retrieval system is introduced. To improve document clustering, a document similarity measure based on cosine vector and keywords frequency in documents is proposed, but also with an input ontology. The ontology is domain specific and includes a list of keywords organized by degree of importance to the categories of the ontology, and by means of semantic knowledge, the ontology can improve the effects of document similarity measure and feedback of information retrieval systems. Two approaches to evaluating the performance of this similarity measure and the comparison with standard cosine vector similarity measure are also described.展开更多
Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach still suffers from two drawbacks: thecurse of dimension...Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach still suffers from two drawbacks: thecurse of dimension and the lack of semantic information. In this paper, a novel ontology-based feature optimization method for agricultural text was proposed. First, terms of vector space model were mapped into concepts of agricultural ontology, which concept frequency weights are computed statistically by term frequency weights; second, weights of concept similarity were assigned to the concept features according to the structure of the agricultural ontology. By combining feature frequency weights and feature similarity weights based on the agricultural ontology, the dimensionality of feature space can be reduced drastically. Moreover, the semantic information can be incorporated into this method. The results showed that this method yields a significant improvement on agricultural text clustering by the feature optimization.展开更多
Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, a new feature selection method...Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It uses supervised feature selection method on the intermediate clustering result which is generated during iterative clustering to do feature selection for text clustering; meanwhile, the Davies-Bouldin's index is used to evaluate the intermediate feature subsets indirectly. Then feature subsets are selected according to the curve of the Davies-Bouldin's index. Experiment is carried out on several popular datasets and the results show the advantages of the proposed method.展开更多
This paper proposed an incremental textclustering algorithm based on semantic sequence. Using similarity relation of semantic sequences and calculating the cover of similarity semantic sequences set, the candidate clu...This paper proposed an incremental textclustering algorithm based on semantic sequence. Using similarity relation of semantic sequences and calculating the cover of similarity semantic sequences set, the candidate cluster with minimum entropy overlap value was selected as a result cluster every time in this algorithm. The comparison of experimental results shows that the precision of the algorithm is higher than other algorithms under same conditions and this is obvious especially on long documents set.展开更多
K-means algorithm is one of the most widely used algorithms in the clustering analysis. To deal with the problem caused by the random selection of initial center points in the traditional al- gorithm, this paper propo...K-means algorithm is one of the most widely used algorithms in the clustering analysis. To deal with the problem caused by the random selection of initial center points in the traditional al- gorithm, this paper proposes an improved K-means algorithm based on the similarity matrix. The im- proved algorithm can effectively avoid the random selection of initial center points, therefore it can provide effective initial points for clustering process, and reduce the fluctuation of clustering results which are resulted from initial points selections, thus a better clustering quality can be obtained. The experimental results also show that the F-measure of the improved K-means algorithm has been greatly improved and the clustering results are more stable.展开更多
To construct a high efficient text clustering algorithm the multilevel graph model and the refinement algorithm used in the uncoarsening phase is discussed. The model is applied to text clustering. The performance of ...To construct a high efficient text clustering algorithm the multilevel graph model and the refinement algorithm used in the uncoarsening phase is discussed. The model is applied to text clustering. The performance of clustering algorithm has to be improved with the refinement algorithm application. The experiment result demonstrated that the multilevel graph text clustering algorithm is available. Key words text clustering - multilevel coarsen graph model - refinement algorithm - high-dimensional clustering CLC number TP301 Foundation item: Supported by the National Natural Science Foundation of China (60173051)Biography: CHEN Jian-bin(1970-), male, Associate professor, Ph. D., research direction: data mining.展开更多
Most of the existing text clustering algorithms overlook the fact that one document is a word sequence with semantic information. There is some important semantic information existed in the positions of words in the s...Most of the existing text clustering algorithms overlook the fact that one document is a word sequence with semantic information. There is some important semantic information existed in the positions of words in the sequence. In this paper, a novel method named Frequent Itemset-based Clustering with Window (FICW) was proposed, which makes use of the semantic information for text clustering with a window constraint. The experimental results obtained from tests on three (hypertext) text sets show that FICW outperforms the method compared in both clustering accuracy and efficiency.展开更多
Semi-supervised clustering improves learning performance as long as it uses a small number of labeled samples to assist un-tagged samples for learning.This paper implements and compares unsupervised and semi-supervise...Semi-supervised clustering improves learning performance as long as it uses a small number of labeled samples to assist un-tagged samples for learning.This paper implements and compares unsupervised and semi-supervised clustering analysis of BOA-Argo ocean text data.Unsupervised K-Means and Affinity Propagation(AP)are two classical clustering algorithms.The Election-AP algorithm is proposed to handle the final cluster number in AP clustering as it has proved to be difficult to control in a suitable range.Semi-supervised samples thermocline data in the BOA-Argo dataset according to the thermocline standard definition,and use this data for semi-supervised cluster analysis.Several semi-supervised clustering algorithms were chosen for comparison of learning performance:Constrained-K-Means,Seeded-K-Means,SAP(Semi-supervised Affinity Propagation),LSAP(Loose Seed AP)and CSAP(Compact Seed AP).In order to adapt the single label,this paper improves the above algorithms to SCKM(improved Constrained-K-Means),SSKM(improved Seeded-K-Means),and SSAP(improved Semi-supervised Affinity Propagationg)to perform semi-supervised clustering analysis on the data.A DSAP(Double Seed AP)semi-supervised clustering algorithm based on compact seeds is proposed as the experimental data shows that DSAP has a better clustering effect.The unsupervised and semi-supervised clustering results are used to analyze the potential patterns of marine data.展开更多
In order to improve the clustering results and select in the results, the ontology semantic is combined with document clustering. A new document clustering algorithm based WordNet in the phrase of document processing ...In order to improve the clustering results and select in the results, the ontology semantic is combined with document clustering. A new document clustering algorithm based WordNet in the phrase of document processing is proposed. First, every word vector by new entities is extended after the documents are represented by tf-idf. Then the feature extracting algorithm is applied for the documents. Finally, the algorithm of ontology aggregation clustering (OAC) is proposed to improve the result of document clustering. Experiments are based on the data set of Reuters 20 News Group, and experimental results are compared with the results obtained by mutual information(MI). The conclusion draws that the proposed algorithm of document clustering based on ontology is better than the other existed clustering algorithms such as MNB, CLUTO, co-clustering, etc.展开更多
Ant-based text clustering is a promising technique that has attracted great research attention. This paper attempts to improve the standard ant-based text-clustering algorithm in two dimensions. On one hand, the ontol...Ant-based text clustering is a promising technique that has attracted great research attention. This paper attempts to improve the standard ant-based text-clustering algorithm in two dimensions. On one hand, the ontology-based semantic similarity measure is used in conjunction with the traditional vector-space-model-based measure to provide more accurate assessment of the similarity between documents. On the other, the ant behavior model is modified to pursue better algorithmic performance. Especially, the ant movement rule is adjusted so as to direct a laden ant toward a dense area of the same type of items as the ant's carrying item, and to direct an unladen ant toward an area that contains an item dissimilar with the surrounding items within its Moore neighborhood. Using WordNet as the base ontology for assessing the semantic similarity between documents, the proposed algorithm is tested with a sample set of documents excerpted from the Reuters-21578 corpus and the experiment results partly indicate that the proposed algorithm perform better than the standard ant-based text-clustering algorithm and the k-means algorithm.展开更多
Focusing on the problem that the ant colony algorithm gets into stagnation easily and cannot fully search in solution space,a text clustering approach based on the fusion of the ant colony and genetic algorithms is pr...Focusing on the problem that the ant colony algorithm gets into stagnation easily and cannot fully search in solution space,a text clustering approach based on the fusion of the ant colony and genetic algorithms is proposed.The four parameters that influence the performance of the ant colony algorithm are encoded as chromosomes,thereby the fitness function,selection,crossover and mutation operator are designed to find the combination of optimal parameters through a number of iteration,and then it is applied to text clustering.The simulation results show that compared with the classical k-means clustering and the basic ant colony clustering algorithm,the proposed algorithm has better performance and the value of F-Measure is enhanced by 5.69%,48.60%and 69.60%,respectively,in 3 test datasets.Therefore,it is more suitable for processing a larger dataset.展开更多
A method of realization of automatic abstracting based on text clustering and natural language understanding is explored, aimed at overcoming shortages of some current methods. The method makes use of text clustering ...A method of realization of automatic abstracting based on text clustering and natural language understanding is explored, aimed at overcoming shortages of some current methods. The method makes use of text clustering and can realize automatic abstracting of multi-documents. The algo- rithm of twice word segmentation based on the title and first sentences in paragraphs is investigated. Its precision and recall is above 95 %. For a specific domain on plastics, an automatic abstracting system named TCAAS is implemented. The precision and recall of multi-document’s automatic ab- stracting is above 75 %. Also, the experiments prove that it is feasible to use the method to develop a domain automatic abstracting system, which is valuable for further in-depth study.展开更多
Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm...Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm based on Spark big data platform.Since the TF-IDF(term frequency-inverse document frequency)algorithm under Spark is irreversible to word mapping,the mapped words indexes cannot be traced back to the original words.In this paper,an optimized method is proposed that TF-IDF under Spark to ensure the text words can be restored.Firstly,the text feature is extracted by the TF-IDF algorithm combined CountVectorizer proposed in this paper,and then the features are inputted to the LDA(Latent Dirichlet Allocation)topic model for training.Finally,the text topic clustering is obtained.Experimental results show that for large data samples,the processing speed of LDA topic model clustering has been improved based Spark.At the same time,compared with the LDA topic model based on word frequency input,the model proposed in this paper has a reduction of perplexity.展开更多
In recent years,the volume of information in digital form has increased tremendously owing to the increased popularity of the World Wide Web.As a result,the use of techniques for extracting useful information from lar...In recent years,the volume of information in digital form has increased tremendously owing to the increased popularity of the World Wide Web.As a result,the use of techniques for extracting useful information from large collections of data,and particularly documents,has become more necessary and challenging.Text clustering is such a technique;it consists in dividing a set of text documents into clusters(groups),so that documents within the same cluster are closely related,whereas documents in different clusters are as different as possible.Clustering depends on measuring the content(i.e.,words)of a document in terms of relevance.Nevertheless,as documents usually contain a large number of words,some of them may be irrelevant to the topic under consideration or redundant.This can confuse and complicate the clustering process and make it less accurate.Accordingly,feature selection methods have been employed to reduce data dimensionality by selecting the most relevant features.In this study,we developed a text document clustering optimization model using a novel genetic frog-leaping algorithm that efficiently clusters text documents based on selected features.The proposed approach is based on two metaheuristic algorithms:a genetic algorithm(GA)and a shuffled frog-leaping algorithm(SFLA).The GA performs feature selection,and the SFLA performs clustering.To evaluate its effectiveness,the proposed approach was tested on a well-known text document dataset:the“20Newsgroup”dataset from the University of California Irvine Machine Learning Repository.Overall,after multiple experiments were compared and analyzed,it was demonstrated that using the proposed algorithm on the 20Newsgroup dataset greatly facilitated text document clustering,compared with classical K-means clustering.Nevertheless,this improvement requires longer computational time.展开更多
Purpose-When a large number of project proposals are evaluated to alocate available funds,grouping them based on their simiarites is benefciaL.Current approaches to group proposals are primarily based on manual matchi...Purpose-When a large number of project proposals are evaluated to alocate available funds,grouping them based on their simiarites is benefciaL.Current approaches to group proposals are primarily based on manual matching of similar topics,discipline areas and keywordls declared by project applicants.When the number of proposals increases,this task becomes complex and requires excessive time.This paper aims to demonstrate how to ffctively use the rich information in the titles and abstracts of Turkish project propsals to group them atmaially.Design/methodology/approach-This study proposes a model that effectively groups Turkish project proposals by combining word embedding,clustering and classification technigues.The proposed model uses FastText,BERT and term frequency/inverse document frequency(TF/IDF)word-embedding techniques to extract terms from the titles and abstracts of project proposals in Turkish.The extracted terms were grouped using both the clustering and classification techniques.Natural groups contained within the corpus were discovered using k-means,k-means++,k-medoids and agglomerative clustering algorithms,Additionally,this study employs classification approaches to predict the target class for each document in the corpus.To classify project proposals,var ious classifiers,including k nearest neighbors(KNN),support vector machines(SVM),artificial neural networks(ANN),cassftcation and regression trees(CART)and random forest(RF),are used.Empirical experiments were conducted to validate the effectiveness of the proposed method by using real data from the Istanbul Development Agency.Findings-The results show that the generated word embeddings an fftvely represent proposal texts as vectors,and can be used as inputs for dustering or casificatiomn algorithms.Using clustering algorithms,the document corpus is divided into five groups.In adition,the results demonstrate that the proposals can easily be categoried into predefmned categories using cassifiation algorithms.SVM-Linear achieved the highest prediction accuracy(89.2%)with the FastText word embedding.method.A comparison of mamual grouping with automatic casification and clutering results revealed that both classification and custering techniques have a high sucess rate.Research limitations/implications-The propsed mdelatomatically benefits fromthe rich information in project proposals and significantly reduces numerous time consuming tasks that managers must perform manually.Thus,it eliminates the drawbacks of the curent manual methods and yields significantly more acurate results.In the future,additional experiments should be conducted to validate the proposed method using data from other funding organizations.Originality/value-This study presents the application of word embedding methods to eftively use the rich information in the titles and abstracts of Turkish project proposals.Existing research studies focus on the automatice grouping of proposals;traditional frequency-based word embedding methods are used for feature extraction methods to represent project proposals.Unlike previous research,this study employs two outperforming neural network-based textual feature extraction techniques to obtain termns representing the proposals:BERT as a contextual word embedding method and F astText as a static word embedding method.Moreover,to the best of our knowledge,there has been no research conducted on the grouping of project proposals in Turkish.展开更多
In order to study deeply the prominent problems faced by China’s clean government work,and put forward effective coping strategies,this article analyzes the network information of anti-corruption related news events,...In order to study deeply the prominent problems faced by China’s clean government work,and put forward effective coping strategies,this article analyzes the network information of anti-corruption related news events,which is based on big data technology.In this study,we take the news report from the website of the Communist Party of China(CPC)Central Commission for Discipline Inspection(CCDI)as the source of data.Firstly,the obtained text data is converted to word segmentation and stop words under preprocessing,and then the pre-processed data is improved by vectorization and text clustering,finally,after text clustering,the key words of clean government work is derived from visualization analysis.According to the results of this study,it shows that China’s clean government work should focus on‘the four forms of decadence’issue,and related departments must strictly crack down five categories of phenomena,such as“illegal payment of subsidies or benefits,illegal delivery of gifts and cash gift,illegal use of official vehicles,banquets using public funds,extravagant wedding ceremonies and funeral”.The results of this study are consistent with the official data released by the CCDI’s website,which also suggests that the method is feasible and effective.展开更多
Text summarization creates subset that represents the most important or relevant information in the original content,which effectively reduce information redundancy.Recently neural network method has achieved good res...Text summarization creates subset that represents the most important or relevant information in the original content,which effectively reduce information redundancy.Recently neural network method has achieved good results in the task of text summarization both in Chinese and English,but the research of text summarization in low-resource languages is still in the exploratory stage,especially in Tibetan.What’s more,there is no large-scale annotated corpus for text summarization.The lack of dataset severely limits the development of low-resource text summarization.In this case,unsupervised learning approaches are more appealing in low-resource languages as they do not require labeled data.In this paper,we propose an unsupervised graph-based Tibetan multi-document summarization method,which divides a large number of Tibetan news documents into topics and extracts the summarization of each topic.Summarization obtained by using traditional graph-based methods have high redundancy and the division of documents topics are not detailed enough.In terms of topic division,we adopt two level clustering methods converting original document into document-level and sentence-level graph,next we take both linguistic and deep representation into account and integrate external corpus into graph to obtain the sentence semantic clustering.Improve the shortcomings of the traditional K-Means clustering method and perform more detailed clustering of documents.Then model sentence clusters into graphs,finally remeasure sentence nodes based on the topic semantic information and the impact of topic features on sentences,higher topic relevance summary is extracted.In order to promote the development of Tibetan text summarization,and to meet the needs of relevant researchers for high-quality Tibetan text summarization datasets,this paper manually constructs a Tibetan summarization dataset and carries out relevant experiments.The experiment results show that our method can effectively improve the quality of summarization and our method is competitive to previous unsupervised methods.展开更多
In this study, we uncover the topics of Chinese public cultural activities in 2020 with a two-step short text clustering(self-taught neural networks and graph-based clustering) and topic modeling approach. The dataset...In this study, we uncover the topics of Chinese public cultural activities in 2020 with a two-step short text clustering(self-taught neural networks and graph-based clustering) and topic modeling approach. The dataset we use for this research is collected from 108 websites of libraries and cultural centers, containing over 17,000 articles. With the novel framework we propose, we derive 3 clusters and 8 topics from 21 provinciallevel regions in China. By plotting the topic distribution of each cluster, we are able to shows unique tendencies of local cultural institutes, that is, free lessons and lectures on art and culture, entertainment and service for socially vulnerable groups, and the preservation of intangible cultural heritage respectively. The findings of our study provide decision-making support for cultural institutes, thus promoting public cultural service from a data-driven perspective.展开更多
基金The National Natural Science Foundation of China(No60672056)Open Fund of MOE-MS Key Laboratory of Multime-dia Computing and Communication(No06120809)
文摘To improve the accuracy of text clustering, fuzzy c-means clustering based on topic concept sub-space (TCS2FCM) is introduced for classifying texts. Five evaluation functions are combined to extract key phrases. Concept phrases, as well as the descriptions of final clusters, are presented using WordNet origin from key phrases. Initial centers and membership matrix are the most important factors affecting clustering performance. Orthogonal concept topic sub-spaces are built with the topic concept phrases representing topics of the texts and the initialization of centers and the membership matrix depend on the concept vectors in sub-spaces. The results show that, different from random initialization of traditional fuzzy c-means clustering, the initialization related to text content contributions can improve clustering precision.
基金The Young Teachers Scientific Research Foundation (YTSRF) of Nanjing University of Science and Technology in the Year of2005-2006.
文摘A method that combines category-based and keyword-based concepts for a better information retrieval system is introduced. To improve document clustering, a document similarity measure based on cosine vector and keywords frequency in documents is proposed, but also with an input ontology. The ontology is domain specific and includes a list of keywords organized by degree of importance to the categories of the ontology, and by means of semantic knowledge, the ontology can improve the effects of document similarity measure and feedback of information retrieval systems. Two approaches to evaluating the performance of this similarity measure and the comparison with standard cosine vector similarity measure are also described.
基金supported by the National Natural Science Foundation of China (60774096)the National HighTech R&D Program of China (2008BAK49B05)
文摘Feature optimization is important to agricultural text mining. Usually, the vector space model is used to represent text documents. However, this basic approach still suffers from two drawbacks: thecurse of dimension and the lack of semantic information. In this paper, a novel ontology-based feature optimization method for agricultural text was proposed. First, terms of vector space model were mapped into concepts of agricultural ontology, which concept frequency weights are computed statistically by term frequency weights; second, weights of concept similarity were assigned to the concept features according to the structure of the agricultural ontology. By combining feature frequency weights and feature similarity weights based on the agricultural ontology, the dimensionality of feature space can be reduced drastically. Moreover, the semantic information can be incorporated into this method. The results showed that this method yields a significant improvement on agricultural text clustering by the feature optimization.
基金Supported by the National Natural Science Foundation of China (60503020, 60373066)the Outstanding Young Scientist’s Fund (60425206)+1 种基金the Natural Science Foundation of Jiangsu Province (BK2005060)the Opening Foundation of Jiangsu Key Laboratory of Computer Informa-tion Processing Technology in Soochow University
文摘Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It uses supervised feature selection method on the intermediate clustering result which is generated during iterative clustering to do feature selection for text clustering; meanwhile, the Davies-Bouldin's index is used to evaluate the intermediate feature subsets indirectly. Then feature subsets are selected according to the curve of the Davies-Bouldin's index. Experiment is carried out on several popular datasets and the results show the advantages of the proposed method.
基金Supported by the National Natural Science Funda-tion of China (60173058)
文摘This paper proposed an incremental textclustering algorithm based on semantic sequence. Using similarity relation of semantic sequences and calculating the cover of similarity semantic sequences set, the candidate cluster with minimum entropy overlap value was selected as a result cluster every time in this algorithm. The comparison of experimental results shows that the precision of the algorithm is higher than other algorithms under same conditions and this is obvious especially on long documents set.
文摘K-means algorithm is one of the most widely used algorithms in the clustering analysis. To deal with the problem caused by the random selection of initial center points in the traditional al- gorithm, this paper proposes an improved K-means algorithm based on the similarity matrix. The im- proved algorithm can effectively avoid the random selection of initial center points, therefore it can provide effective initial points for clustering process, and reduce the fluctuation of clustering results which are resulted from initial points selections, thus a better clustering quality can be obtained. The experimental results also show that the F-measure of the improved K-means algorithm has been greatly improved and the clustering results are more stable.
文摘To construct a high efficient text clustering algorithm the multilevel graph model and the refinement algorithm used in the uncoarsening phase is discussed. The model is applied to text clustering. The performance of clustering algorithm has to be improved with the refinement algorithm application. The experiment result demonstrated that the multilevel graph text clustering algorithm is available. Key words text clustering - multilevel coarsen graph model - refinement algorithm - high-dimensional clustering CLC number TP301 Foundation item: Supported by the National Natural Science Foundation of China (60173051)Biography: CHEN Jian-bin(1970-), male, Associate professor, Ph. D., research direction: data mining.
基金Supported by the Natural Science Foundation ofHubei Province(ABA048)
文摘Most of the existing text clustering algorithms overlook the fact that one document is a word sequence with semantic information. There is some important semantic information existed in the positions of words in the sequence. In this paper, a novel method named Frequent Itemset-based Clustering with Window (FICW) was proposed, which makes use of the semantic information for text clustering with a window constraint. The experimental results obtained from tests on three (hypertext) text sets show that FICW outperforms the method compared in both clustering accuracy and efficiency.
基金This work was supported in part by the National Natural Science Foundation of China(51679105,61872160,51809112)“Thirteenth Five Plan”Science and Technology Project of Education Department,Jilin Province(JJKH20200990KJ).
文摘Semi-supervised clustering improves learning performance as long as it uses a small number of labeled samples to assist un-tagged samples for learning.This paper implements and compares unsupervised and semi-supervised clustering analysis of BOA-Argo ocean text data.Unsupervised K-Means and Affinity Propagation(AP)are two classical clustering algorithms.The Election-AP algorithm is proposed to handle the final cluster number in AP clustering as it has proved to be difficult to control in a suitable range.Semi-supervised samples thermocline data in the BOA-Argo dataset according to the thermocline standard definition,and use this data for semi-supervised cluster analysis.Several semi-supervised clustering algorithms were chosen for comparison of learning performance:Constrained-K-Means,Seeded-K-Means,SAP(Semi-supervised Affinity Propagation),LSAP(Loose Seed AP)and CSAP(Compact Seed AP).In order to adapt the single label,this paper improves the above algorithms to SCKM(improved Constrained-K-Means),SSKM(improved Seeded-K-Means),and SSAP(improved Semi-supervised Affinity Propagationg)to perform semi-supervised clustering analysis on the data.A DSAP(Double Seed AP)semi-supervised clustering algorithm based on compact seeds is proposed as the experimental data shows that DSAP has a better clustering effect.The unsupervised and semi-supervised clustering results are used to analyze the potential patterns of marine data.
基金The National Natural Science Foundation of China(No.60373099),the Natural Science Foundation for Young Scholars of Northeast Normal University (No.20061005)
文摘In order to improve the clustering results and select in the results, the ontology semantic is combined with document clustering. A new document clustering algorithm based WordNet in the phrase of document processing is proposed. First, every word vector by new entities is extended after the documents are represented by tf-idf. Then the feature extracting algorithm is applied for the documents. Finally, the algorithm of ontology aggregation clustering (OAC) is proposed to improve the result of document clustering. Experiments are based on the data set of Reuters 20 News Group, and experimental results are compared with the results obtained by mutual information(MI). The conclusion draws that the proposed algorithm of document clustering based on ontology is better than the other existed clustering algorithms such as MNB, CLUTO, co-clustering, etc.
基金This work was supported in part by National Natural Science Foundation of China under Grants No.70301009 and No. 70431001, and by Ministry of Education, Culture, Sports, Science and Technology of Japan under the "Kanazawa Region, Ishikawa High-Tech Sensing Cluster of Knowledge-Based Cluster Creation Project"
文摘Ant-based text clustering is a promising technique that has attracted great research attention. This paper attempts to improve the standard ant-based text-clustering algorithm in two dimensions. On one hand, the ontology-based semantic similarity measure is used in conjunction with the traditional vector-space-model-based measure to provide more accurate assessment of the similarity between documents. On the other, the ant behavior model is modified to pursue better algorithmic performance. Especially, the ant movement rule is adjusted so as to direct a laden ant toward a dense area of the same type of items as the ant's carrying item, and to direct an unladen ant toward an area that contains an item dissimilar with the surrounding items within its Moore neighborhood. Using WordNet as the base ontology for assessing the semantic similarity between documents, the proposed algorithm is tested with a sample set of documents excerpted from the Reuters-21578 corpus and the experiment results partly indicate that the proposed algorithm perform better than the standard ant-based text-clustering algorithm and the k-means algorithm.
基金supported by the Hi-Tech Research and Development Program of China (No.2006AA01Z210).
文摘Focusing on the problem that the ant colony algorithm gets into stagnation easily and cannot fully search in solution space,a text clustering approach based on the fusion of the ant colony and genetic algorithms is proposed.The four parameters that influence the performance of the ant colony algorithm are encoded as chromosomes,thereby the fitness function,selection,crossover and mutation operator are designed to find the combination of optimal parameters through a number of iteration,and then it is applied to text clustering.The simulation results show that compared with the classical k-means clustering and the basic ant colony clustering algorithm,the proposed algorithm has better performance and the value of F-Measure is enhanced by 5.69%,48.60%and 69.60%,respectively,in 3 test datasets.Therefore,it is more suitable for processing a larger dataset.
基金supported by the National Natural Science Foundation of China(No.70572090,No.60305009)the Ph.D.Degree Teacher Foundation of North China Electric Power University.
文摘A method of realization of automatic abstracting based on text clustering and natural language understanding is explored, aimed at overcoming shortages of some current methods. The method makes use of text clustering and can realize automatic abstracting of multi-documents. The algo- rithm of twice word segmentation based on the title and first sentences in paragraphs is investigated. Its precision and recall is above 95 %. For a specific domain on plastics, an automatic abstracting system named TCAAS is implemented. The precision and recall of multi-document’s automatic ab- stracting is above 75 %. Also, the experiments prove that it is feasible to use the method to develop a domain automatic abstracting system, which is valuable for further in-depth study.
基金This work is supported by the Science Research Projects of Hunan Provincial Education Department(Nos.18A174,18C0262)the National Natural Science Foundation of China(No.61772561)+2 种基金the Key Research&Development Plan of Hunan Province(Nos.2018NK2012,2019SK2022)the Degree&Postgraduate Education Reform Project of Hunan Province(No.209)the Postgraduate Education and Teaching Reform Project of Central South Forestry University(No.2019JG013).
文摘Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm based on Spark big data platform.Since the TF-IDF(term frequency-inverse document frequency)algorithm under Spark is irreversible to word mapping,the mapped words indexes cannot be traced back to the original words.In this paper,an optimized method is proposed that TF-IDF under Spark to ensure the text words can be restored.Firstly,the text feature is extracted by the TF-IDF algorithm combined CountVectorizer proposed in this paper,and then the features are inputted to the LDA(Latent Dirichlet Allocation)topic model for training.Finally,the text topic clustering is obtained.Experimental results show that for large data samples,the processing speed of LDA topic model clustering has been improved based Spark.At the same time,compared with the LDA topic model based on word frequency input,the model proposed in this paper has a reduction of perplexity.
基金This research was supported by a grant from the Research Center of the Center for Female Scientific and Medical Colleges Deanship of Scientific Research,King Saud University.
文摘In recent years,the volume of information in digital form has increased tremendously owing to the increased popularity of the World Wide Web.As a result,the use of techniques for extracting useful information from large collections of data,and particularly documents,has become more necessary and challenging.Text clustering is such a technique;it consists in dividing a set of text documents into clusters(groups),so that documents within the same cluster are closely related,whereas documents in different clusters are as different as possible.Clustering depends on measuring the content(i.e.,words)of a document in terms of relevance.Nevertheless,as documents usually contain a large number of words,some of them may be irrelevant to the topic under consideration or redundant.This can confuse and complicate the clustering process and make it less accurate.Accordingly,feature selection methods have been employed to reduce data dimensionality by selecting the most relevant features.In this study,we developed a text document clustering optimization model using a novel genetic frog-leaping algorithm that efficiently clusters text documents based on selected features.The proposed approach is based on two metaheuristic algorithms:a genetic algorithm(GA)and a shuffled frog-leaping algorithm(SFLA).The GA performs feature selection,and the SFLA performs clustering.To evaluate its effectiveness,the proposed approach was tested on a well-known text document dataset:the“20Newsgroup”dataset from the University of California Irvine Machine Learning Repository.Overall,after multiple experiments were compared and analyzed,it was demonstrated that using the proposed algorithm on the 20Newsgroup dataset greatly facilitated text document clustering,compared with classical K-means clustering.Nevertheless,this improvement requires longer computational time.
文摘Purpose-When a large number of project proposals are evaluated to alocate available funds,grouping them based on their simiarites is benefciaL.Current approaches to group proposals are primarily based on manual matching of similar topics,discipline areas and keywordls declared by project applicants.When the number of proposals increases,this task becomes complex and requires excessive time.This paper aims to demonstrate how to ffctively use the rich information in the titles and abstracts of Turkish project propsals to group them atmaially.Design/methodology/approach-This study proposes a model that effectively groups Turkish project proposals by combining word embedding,clustering and classification technigues.The proposed model uses FastText,BERT and term frequency/inverse document frequency(TF/IDF)word-embedding techniques to extract terms from the titles and abstracts of project proposals in Turkish.The extracted terms were grouped using both the clustering and classification techniques.Natural groups contained within the corpus were discovered using k-means,k-means++,k-medoids and agglomerative clustering algorithms,Additionally,this study employs classification approaches to predict the target class for each document in the corpus.To classify project proposals,var ious classifiers,including k nearest neighbors(KNN),support vector machines(SVM),artificial neural networks(ANN),cassftcation and regression trees(CART)and random forest(RF),are used.Empirical experiments were conducted to validate the effectiveness of the proposed method by using real data from the Istanbul Development Agency.Findings-The results show that the generated word embeddings an fftvely represent proposal texts as vectors,and can be used as inputs for dustering or casificatiomn algorithms.Using clustering algorithms,the document corpus is divided into five groups.In adition,the results demonstrate that the proposals can easily be categoried into predefmned categories using cassifiation algorithms.SVM-Linear achieved the highest prediction accuracy(89.2%)with the FastText word embedding.method.A comparison of mamual grouping with automatic casification and clutering results revealed that both classification and custering techniques have a high sucess rate.Research limitations/implications-The propsed mdelatomatically benefits fromthe rich information in project proposals and significantly reduces numerous time consuming tasks that managers must perform manually.Thus,it eliminates the drawbacks of the curent manual methods and yields significantly more acurate results.In the future,additional experiments should be conducted to validate the proposed method using data from other funding organizations.Originality/value-This study presents the application of word embedding methods to eftively use the rich information in the titles and abstracts of Turkish project proposals.Existing research studies focus on the automatice grouping of proposals;traditional frequency-based word embedding methods are used for feature extraction methods to represent project proposals.Unlike previous research,this study employs two outperforming neural network-based textual feature extraction techniques to obtain termns representing the proposals:BERT as a contextual word embedding method and F astText as a static word embedding method.Moreover,to the best of our knowledge,there has been no research conducted on the grouping of project proposals in Turkish.
基金funded by the Open Foundation for the University Innovation Platform in the Hunan Province,grant number 16K013Hunan Provincial Natural Science Foundation of China,grant number 2017JJ2016+2 种基金2016 Science Research Project of Hunan Provincial Department of Education,grant number 16C0269Accurate crawler design and implementation with a data cleaning function,National Students innovation and entrepreneurship of training program,grant number 201811532010This research work is implemented at the 2011 Collaborative Innovation Center for Development and Utilization of Finance and Economics Big Data Property,Universities of Hunan Province.Open project,grant number 20181901CRP03,20181901CRP04,20181901CRP05.
文摘In order to study deeply the prominent problems faced by China’s clean government work,and put forward effective coping strategies,this article analyzes the network information of anti-corruption related news events,which is based on big data technology.In this study,we take the news report from the website of the Communist Party of China(CPC)Central Commission for Discipline Inspection(CCDI)as the source of data.Firstly,the obtained text data is converted to word segmentation and stop words under preprocessing,and then the pre-processed data is improved by vectorization and text clustering,finally,after text clustering,the key words of clean government work is derived from visualization analysis.According to the results of this study,it shows that China’s clean government work should focus on‘the four forms of decadence’issue,and related departments must strictly crack down five categories of phenomena,such as“illegal payment of subsidies or benefits,illegal delivery of gifts and cash gift,illegal use of official vehicles,banquets using public funds,extravagant wedding ceremonies and funeral”.The results of this study are consistent with the official data released by the CCDI’s website,which also suggests that the method is feasible and effective.
基金This work was supported in part by the National Science Foundation Project of P.R.China 484 under Grant No.52071349partially supported by Young and Middle-aged Talents Project of the State Ethnic Affairs 487 Commission.
文摘Text summarization creates subset that represents the most important or relevant information in the original content,which effectively reduce information redundancy.Recently neural network method has achieved good results in the task of text summarization both in Chinese and English,but the research of text summarization in low-resource languages is still in the exploratory stage,especially in Tibetan.What’s more,there is no large-scale annotated corpus for text summarization.The lack of dataset severely limits the development of low-resource text summarization.In this case,unsupervised learning approaches are more appealing in low-resource languages as they do not require labeled data.In this paper,we propose an unsupervised graph-based Tibetan multi-document summarization method,which divides a large number of Tibetan news documents into topics and extracts the summarization of each topic.Summarization obtained by using traditional graph-based methods have high redundancy and the division of documents topics are not detailed enough.In terms of topic division,we adopt two level clustering methods converting original document into document-level and sentence-level graph,next we take both linguistic and deep representation into account and integrate external corpus into graph to obtain the sentence semantic clustering.Improve the shortcomings of the traditional K-Means clustering method and perform more detailed clustering of documents.Then model sentence clusters into graphs,finally remeasure sentence nodes based on the topic semantic information and the impact of topic features on sentences,higher topic relevance summary is extracted.In order to promote the development of Tibetan text summarization,and to meet the needs of relevant researchers for high-quality Tibetan text summarization datasets,this paper manually constructs a Tibetan summarization dataset and carries out relevant experiments.The experiment results show that our method can effectively improve the quality of summarization and our method is competitive to previous unsupervised methods.
基金the key laboratory project“Research on the Wisdom Mode Clustering and Dynamic Display System of Public Culture”(No.2020008)supported by the Ministry of Culture and Tourism of the People’s Republic of China
文摘In this study, we uncover the topics of Chinese public cultural activities in 2020 with a two-step short text clustering(self-taught neural networks and graph-based clustering) and topic modeling approach. The dataset we use for this research is collected from 108 websites of libraries and cultural centers, containing over 17,000 articles. With the novel framework we propose, we derive 3 clusters and 8 topics from 21 provinciallevel regions in China. By plotting the topic distribution of each cluster, we are able to shows unique tendencies of local cultural institutes, that is, free lessons and lectures on art and culture, entertainment and service for socially vulnerable groups, and the preservation of intangible cultural heritage respectively. The findings of our study provide decision-making support for cultural institutes, thus promoting public cultural service from a data-driven perspective.