Funding: the Natural Science Foundation of China (Grant Nos. 72074014 and 72004012).
Abstract: Purpose: Many science, technology and innovation (STI) resources carry several different labels. To assign such labels automatically to a given instance, many approaches with good performance on benchmark datasets have been proposed in the literature for the multi-label classification task, and several open-source tools implementing these approaches have also been developed. However, the characteristics of real-world multi-label patent and publication datasets do not fully match those of the benchmark datasets. The main purpose of this paper is therefore to evaluate seven multi-label classification methods comprehensively on real-world datasets. Research limitations: The three real-world datasets differ in statement, data quality, and purpose. Additionally, the open-source tools designed for multi-label classification differ intrinsically in how they handle data processing and feature selection, which in turn affects the performance of a given multi-label classification approach. In future work, we will improve experimental precision and strengthen the validity of our conclusions by controlling variables more rigorously through expanded parameter settings. Practical implications: The Macro F1 and Micro F1 scores observed on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Approaches leveraging deep learning offer promising solutions by accommodating the hierarchical relationships and interdependencies among labels. With ongoing advances in deep learning algorithms and large-scale models, the efficacy of multi-label classification is expected to improve significantly, reaching practical utility in the foreseeable future. Originality/value: (1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.
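As a reminder of the two evaluation measures named above: Macro F1 averages per-label F1 scores, so rare labels count as much as common ones, while Micro F1 pools true/false positives across all labels, so frequent labels dominate. A minimal sketch on a toy multi-label prediction (the label matrices below are illustrative, not taken from the paper):

```python
import numpy as np

def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """Macro and Micro F1 for binary indicator matrices (n_samples x n_labels)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)  # per-label counts
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    macro = float(np.mean([f1(t, p, n) for t, p, n in zip(tp, fp, fn)]))
    micro = f1(tp.sum(), fp.sum(), fn.sum())
    return macro, micro

# Toy example: 4 documents, 3 labels (the third label is rare).
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
macro, micro = macro_micro_f1(y_true, y_pred)
```

Here the completely missed rare label drags Macro F1 well below Micro F1, which is exactly why unbalanced document-label distributions separate the two scores.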
Funding: the Deanship of Scientific Research at Shaqra University for supporting this work.
Abstract: The Internet revolution has resulted in abundant data from various sources, including social media, traditional media, and others. Although the availability of data is no longer an issue, labelling data for supervised machine learning remains an expensive process that involves tedious human effort. The overall purpose of this study is to propose a strategy for automatically labelling unlabeled textual data with the support of active learning in combination with deep learning. More specifically, this study assesses the performance of different active learning strategies for automatic labelling of textual datasets at the sentence and document levels. To this end, different experiments were performed on publicly available datasets. In the first set of experiments, we randomly chose a subset of instances from the training dataset and trained a deep neural network to assess its performance on the test set. In the second set of experiments, we replaced the random selection with different active learning strategies to choose a subset of the training dataset, trained the same model, and reassessed its performance on the test set. The experimental results suggest that active learning strategies yield a performance improvement of 7% on document-level datasets and 3% on sentence-level datasets for auto-labelling.
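One widely used active learning strategy of the kind compared in such studies is least-confidence (uncertainty) sampling: instead of picking training instances at random, pick the ones the current model is least sure about. A minimal sketch, with the model's class probabilities simulated rather than produced by the deep neural network the abstract describes:

```python
import numpy as np

def least_confidence_query(probs, k):
    """Select the k pool instances whose top predicted-class probability
    is lowest, i.e. where the current model is least confident."""
    probs = np.asarray(probs)          # shape: (n_pool, n_classes)
    confidence = probs.max(axis=1)     # confidence of the predicted class
    return np.argsort(confidence)[:k]  # indices of the k least confident

# Simulated class probabilities for a pool of 5 unlabeled sentences.
pool_probs = [
    [0.95, 0.05],   # very confident
    [0.55, 0.45],   # near the decision boundary
    [0.80, 0.20],
    [0.51, 0.49],   # least confident
    [0.70, 0.30],
]
picked = least_confidence_query(pool_probs, k=2)
```

The selected indices are then sent for labelling (by a human or, in auto-labelling setups, by the model itself), and the model is retrained on the enlarged labelled set.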
Funding: Scientific and Technological Research Council of Turkey (TUBITAK) (No. 120E100).
Abstract: Using all samples in the optimization process does not produce robust results on datasets with label noise, because the gradients computed from the losses of noisy samples push the optimization in the wrong direction. In this paper, we recommend using only the samples whose loss is below a threshold determined during optimization, instead of all samples in the mini-batch. Our proposed method, Adaptive-k, aims to exclude label-noise samples from the optimization process and thereby make it robust. On noisy datasets, we found that a threshold-based approach such as Adaptive-k produces better results than using all samples, or a fixed number of low-loss samples, in the mini-batch. On the basis of our theoretical analysis and experimental results, we show that Adaptive-k comes closest to the performance of the Oracle, in which noisy samples are entirely removed from the dataset. Adaptive-k is a simple but effective method: it requires no prior knowledge of the dataset's noise ratio, needs no additional model training, and does not significantly increase training time. In the experiments, we also show that Adaptive-k is compatible with different optimizers such as SGD, SGDM, and Adam. The code for Adaptive-k is available on GitHub.
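The paper defines its own adaptive threshold rule; the sketch below only illustrates the general mechanism of dropping high-loss samples from a mini-batch before the gradient step. The threshold here is a simple batch mean, a hypothetical stand-in for the paper's rule:

```python
import numpy as np

def select_low_loss(losses, threshold):
    """Keep only the mini-batch samples whose loss is below the threshold,
    so suspected label-noise samples do not contribute to the gradient."""
    losses = np.asarray(losses, dtype=float)
    keep = losses < threshold
    # Guard: if every sample is above the threshold, fall back to all samples.
    return keep if keep.any() else np.ones_like(keep, dtype=bool)

# A mini-batch where two samples have suspiciously high loss (likely noisy).
batch_losses = np.array([0.2, 0.3, 2.5, 0.25, 3.1])
threshold = batch_losses.mean()        # stand-in for the adaptive threshold
mask = select_low_loss(batch_losses, threshold)
clean_losses = batch_losses[mask]      # only these feed the optimizer step
```

Because the selection happens per mini-batch on already-computed losses, it adds essentially no training-time overhead, which matches the abstract's claim.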
Abstract: To address the shortcomings of existing intrusion detection techniques, this paper studies anomaly-based intrusion detection systems built on machine learning, applies multi-label and semi-supervised learning to intrusion detection, and proposes an intrusion detection algorithm based on multi-label learning. The algorithm adopts a k-nearest-neighbour classification rule: it collects the class-label information of neighbouring samples and infers the label set of unlabeled data by maximum a posteriori (MAP) estimation. Simulation results on the KDD CUP99 dataset show that the algorithm effectively improves the performance of intrusion detection systems.
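The neighbour-statistics step described above can be sketched as follows: count how many of an unlabeled point's k nearest neighbours carry each label, then assign the labels favoured by the resulting vote. This toy version uses a simple majority decision in place of the paper's full MAP formulation, and the data are illustrative:

```python
import numpy as np

def knn_label_votes(X_train, Y_train, x, k):
    """Count, for each label, how many of x's k nearest neighbours carry it."""
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x), axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of k nearest samples
    return np.asarray(Y_train)[nearest].sum(axis=0)  # per-label vote counts

# Toy data: 2-D points, each with two binary labels.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
Y_train = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]]
k = 3
votes = knn_label_votes(X_train, Y_train, x=[0.2, 0.2], k=k)
# Simplified decision: assign a label if a majority of neighbours carry it.
assigned = votes > k / 2
```

In the paper's formulation the decision instead maximizes the posterior probability of each label given the neighbour counts, estimated from the training set.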
Abstract: The support vector machine (SVM), a classic classification method, has been widely applied in many fields. However, the standard SVM faces the following problems in classification decisions: (1) it does not consider the distribution characteristics of the data; (2) it ignores the relative relationships between sample classes; (3) it cannot handle large-scale classification problems. In view of this, a rank preservation learning machine based on data distribution fusion (RPLM-DDF) is proposed. The method characterizes the data distribution by introducing within-class scatter; it preserves the global ordering of samples by keeping the relative positions of the class data centres unchanged; and it handles large-scale classification by establishing the equivalence between the proposed method and the dual form of the core vector machine. Comparative experiments on artificial datasets, small and medium-scale datasets, and large-scale datasets verify the effectiveness of the proposed method.
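Within-class scatter, the quantity RPLM-DDF introduces to characterize the data distribution, measures how samples spread around their own class mean. A minimal sketch of the standard computation, on toy data not taken from the paper:

```python
import numpy as np

def within_class_scatter(X, y):
    """Within-class scatter matrix:
    S_w = sum over classes c of sum_{x in c} (x - mean_c)(x - mean_c)^T."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)   # deviations from the class centre
        S_w += diff.T @ diff
    return S_w

# Two well-separated classes in 2-D.
X = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]]
y = [0, 0, 1, 1]
S_w = within_class_scatter(X, y)
```

The class means computed here also serve as the "class data centres" whose relative positions the method keeps fixed to preserve sample ordering.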