Funding: Supported by the National Social Science Foundation of China (09BYY024 and 11&ZD188).
Abstract: This study investigates the feasibility of applying complex networks to fine-grained language classification and of employing word co-occurrence networks based on parallel texts as a substitute for syntactic dependency networks in complex-network-based language classification. Fourteen word co-occurrence networks were constructed from parallel texts of 12 Slavic languages and 2 non-Slavic languages. With appropriate combinations of the major parameters of these networks, cluster analysis was able to distinguish the Slavic languages from the non-Slavic ones and to group the Slavic languages correctly into their respective sub-branches. Moreover, the clustering also captured the genetic relationships of some of these Slavic languages within their sub-branches. The results show that word co-occurrence networks based on parallel texts are applicable to fine-grained language classification and that they constitute a more convenient substitute for syntactic dependency networks in complex-network-based language classification.
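As a rough illustration of what a word co-occurrence network and its parameters can look like, the following Python sketch links adjacent words and reports a few global measures. The adjacency window, the whitespace tokenization, the chosen measures, and the names `build_cooccurrence_network` and `network_parameters` are all assumptions made for this example; the abstract does not specify how the networks were built or which parameters were combined.

```python
from collections import defaultdict


def build_cooccurrence_network(sentences):
    """Link each pair of adjacent words in a sentence with an undirected edge.

    Assumption: co-occurrence means direct adjacency within a sentence.
    """
    neighbors = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for token in tokens:
            neighbors[token]  # register every word as a node, even if isolated
        for left, right in zip(tokens, tokens[1:]):
            if left != right:  # skip self-loops from repeated words
                neighbors[left].add(right)
                neighbors[right].add(left)
    return dict(neighbors)


def network_parameters(neighbors):
    """Compute a few global parameters of an undirected network."""
    n = len(neighbors)
    m = sum(len(adjacent) for adjacent in neighbors.values()) // 2
    return {
        "nodes": n,
        "edges": m,
        "average_degree": 2 * m / n if n else 0.0,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
    }
```

Computing such a parameter vector for each language's parallel text would give one data point per language, which a standard cluster analysis could then group.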
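Abstract: The SimCSE framework uses only the classification token ([CLS]) as the text vector and ignores the layer-wise information inside the base model, so the semantic features output by the base model are not fully extracted. Building on SimCSE, this paper proposes SimCSE-HFF (SimCSE with hierarchical feature fusion), a method that fuses the hierarchical features of the pretrained model. SimCSE-HFF is based on a two-path parallel network that uses a short path and a long path to strengthen feature learning: the short path uses a convolutional neural network to learn local text features and reduce dimensionality, while the long path uses a bidirectional gated recurrent neural network to learn deep semantic information and an autoencoder to fuse features from the other internal layers of the base model, which addresses the insufficient extraction of output features. On the Chinese and English STS-B datasets, SimCSE-HFF outperforms traditional methods on the Spearman and Pearson correlation metrics for semantic similarity, with improvements across different pretrained models; it also outperforms the SimCSE framework on the downstream retrieval question-answering task, showing better generality.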
Funding: Supported by the Science and Technology Plan Projects of Sichuan Province of China under Grant No. 2008GZ0003 and the Key Technologies R&D Program of Sichuan Province of China under Grant No. 2008SZ0100.
Abstract: Feature selection is an important topic in text classification. However, most existing feature selection methods are serial and too inefficient to be applied to massive text data sets. To address this, a feature selection method based on a parallel collaborative evolutionary genetic algorithm is presented. The method uses a genetic algorithm to select feature subsets and takes advantage of parallel collaborative evolution to improve time efficiency, so it can quickly acquire more representative feature subsets. The experimental results show that, in terms of accuracy and recall, the presented method is better than the information gain, χ² statistic, and mutual information methods. With only one CPU, the presented method takes more time than these three methods, but it is faster once the parallel strategy is used.
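To make the idea of selecting feature subsets with a genetic algorithm concrete, here is a minimal serial Python sketch. It is not the authors' method: it omits the parallel collaborative evolution step entirely, and the function name `evolve_feature_mask`, the truncation selection, one-point crossover, bit-flip mutation, and all default parameter values are assumptions chosen for illustration. The caller supplies `fitness`, a function that scores a Boolean feature mask (for example, classifier accuracy on the selected features).

```python
import random


def evolve_feature_mask(fitness, n_features, pop_size=20, generations=30,
                        mutation_rate=0.05, seed=0):
    """Search for a high-fitness Boolean feature mask with a simple GA.

    Assumes n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection: keep top half
        children = [ranked[0][:]]          # elitism: carry the best mask over
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: toggle each gene with a small probability.
            child = [(not gene) if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best
```

In a parallel variant, the fitness evaluations or separate sub-populations would be the natural units to distribute across CPUs, but how the presented method divides that work is not stated in the abstract.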