139 articles found
Feature Selection Based on Difference and Similitude in Data Mining
1
Authors: WU Ming, YAN Puliu. Wuhan University Journal of Natural Sciences (CAS), 2007, No. 3, pp. 467-470 (4 pages)
Feature selection is a preprocessing step in data mining, and heuristic search algorithms are often used for this task. Many heuristic search algorithms are based on discernibility matrices, which consider only the differences in an information system. Because similar characteristics are not revealed in a discernibility matrix, the result may not be the simplest rules. Although difference-similitude (DS) methods take both difference and similitude into account, the existing search strategy causes some important features to be ignored. This paper proposes an improved DS-based algorithm to solve this problem. The improved algorithm defines an attribute rank function that considers both difference and similitude in feature selection. Experiments show that it is an effective algorithm, especially for large-scale databases. The time complexity of the algorithm is O(|C|^2 |U|^2).
Keywords: knowledge reduction; feature selection; rough set; difference set; similitude set; attribute rank function
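The discernibility matrix that the abstract above refers to can be illustrated with a small sketch. This is not the paper's DS algorithm or its attribute rank function; it only shows the standard rough-set construction, where each pair of objects with different decisions is mapped to the attributes on which they differ. The toy decision table and the helper names `discernibility_matrix` and `attribute_frequency` are assumptions made for illustration.

```python
from itertools import combinations


def discernibility_matrix(rows, decisions):
    """Map each pair of object indices with different decisions to the
    set of condition attributes on which the two objects differ."""
    matrix = {}
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] == decisions[j]:
            continue  # only pairs that must be told apart are recorded
        matrix[(i, j)] = {a for a in rows[i] if rows[i][a] != rows[j][a]}
    return matrix


def attribute_frequency(matrix):
    """Count how often each attribute appears across the matrix entries."""
    counts = {}
    for attrs in matrix.values():
        for a in attrs:
            counts[a] = counts.get(a, 0) + 1
    return counts


# Hypothetical toy decision table (not taken from the paper).
rows = [
    {"outlook": "sunny", "windy": False, "humid": True},
    {"outlook": "sunny", "windy": True, "humid": True},
    {"outlook": "rainy", "windy": False, "humid": False},
]
decisions = ["no", "no", "yes"]

matrix = discernibility_matrix(rows, decisions)
freq = attribute_frequency(matrix)
```

Building the matrix compares every pair of objects on every attribute, so the construction alone already costs on the order of |C| |U|^2 comparisons; a frequency count like the one above is one simple heuristic signal, whereas the paper's rank function also accounts for similitude.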
Feature Selection Method by Applying Parallel Collaborative Evolutionary Genetic Algorithm (cited by: 1)
2
Authors: Hao-Dong Zhu, Hong-Chan Li, Xiang-Hui Zhao, Yong Zhong. Journal of Electronic Science and Technology (CAS), 2010, No. 2, pp. 108-113 (6 pages)
Feature selection is one of the important topics in text classification. However, most existing feature selection methods are serial and too inefficient to be applied to massive text data sets. To address this, a feature selection method based on a parallel collaborative evolutionary genetic algorithm is presented. The method uses a genetic algorithm to select feature subsets and takes advantage of parallel collaborative evolution to improve time efficiency, so it can quickly acquire more representative feature subsets. The experimental results show that, in terms of accuracy ratio and recall ratio, the presented method is better than the information gain, χ² statistic, and mutual information methods. With only one CPU the presented method is slower than these three methods, but it is faster once the parallel strategy is used.
Keywords: feature selection; genetic algorithm; parallel collaborative evolutionary; text mining
Effective and Efficient Feature Selection for Large-scale Data Using Bayes' Theorem (cited by: 7)
3
Authors: Subramanian Appavu Alias Balamurugan, Ramasamy Rajaram. International Journal of Automation and Computing (EI), 2009, No. 1, pp. 62-71 (10 pages)
This paper proposes a method of feature selection using Bayes' theorem. The purpose of the proposed method is to reduce the computational complexity and increase the classification accuracy of the selected feature subsets. The dependence between two (binary) attributes is determined from the probabilities of their joint values that contribute to positive and negative classification decisions. If opposing sets of attribute values do not lead to opposing classification decisions (zero probability), the two attributes are considered independent of each other; otherwise they are considered dependent, and one of them can be removed, which reduces the number of attributes. The process must be repeated on all combinations of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms over 8 datasets from the University of California, Irvine (UCI) machine learning databases. The proposed method shows better results than most existing algorithms in terms of number of selected features, classification accuracy, and running time.
Keywords: data mining; classification; feature selection; dimensionality reduction; Bayes' theorem
Method for Fault Feature Selection for a Baler Gearbox Based on an Improved Adaptive Genetic Algorithm
4
Authors: Bin Ren, Dong Bai, Zhanpu Xue, Hu Xie, Hao Zhang. Chinese Journal of Mechanical Engineering (SCIE, EI, CAS, CSCD), 2022, No. 3, pp. 312-323 (12 pages)
The performance and efficiency of a baler deteriorate as a result of gearbox failure. One way to overcome this challenge is to select appropriate fault feature parameters for fault diagnosis and monitoring of gearboxes. This paper proposes a fault feature selection method using an improved adaptive genetic algorithm for a baler gearbox. This method directly obtains the minimum fault feature parameter set that is most sensitive to fault features through attribute reduction. The main benefit of the improved adaptive genetic algorithm is its excellent performance in terms of the efficiency of attribute reduction without requiring prior information. Therefore, this method should be capable of timely diagnosis and monitoring. Experimental validation was performed, and promising findings highlighting the relationship between diagnosis results and faults were obtained. The results indicate that when the improved genetic algorithm is used to reduce 12 fault characteristic parameters to three without a priori information, 100% fault diagnosis accuracy can be achieved based on these fault characteristics, and the time required for fault feature parameter selection is reduced by half compared to traditional methods. The proposed method provides important insights into the instant fault diagnosis and fault monitoring of mechanical devices.
Keywords: fault diagnosis; feature selection; attribute reduction; improved adaptive genetic algorithm
Importance of Features Selection, Attributes Selection, Challenges and Future Directions for Medical Imaging Data: A Review (cited by: 6)
5
Authors: Nazish Naheed, Muhammad Shaheen, Sajid Ali Khan, Mohammed Alawairdhi, Muhammad Attique Khan. Computer Modeling in Engineering & Sciences (SCIE, EI), 2020, No. 10, pp. 315-344 (30 pages)
In the area of pattern recognition and machine learning, features play a key role in prediction. Well-known applications of features include medical imaging and image classification, to name a few. With the exponential growth of information investments in medical data repositories and health service provision, medical institutions are collecting large volumes of data. These data repositories contain detailed information essential to support medical diagnostic decisions and to improve the quality of patient care. On the other hand, this growth has also made it difficult to comprehend and utilize data for various purposes. The results of imaging data can become biased because of extraneous features present in larger datasets. Feature selection gives a chance to decrease the number of components in such large datasets. Selection techniques discard the unimportant features and select a subset of components that produces superior classification accuracy. Choosing good attributes produces a precise classification model, which enhances learning speed and prediction. This paper presents a review of feature selection techniques and attribute selection measures for medical imaging. The review is meant to describe feature selection techniques in the medical domain with their pros and cons and to signify their application in imaging data and data mining algorithms. The review reveals the shortcomings of the existing feature and attribute selection techniques for multi-sourced data. Moreover, it highlights the importance of feature selection for correct classification of medical infections. In the end, critical analysis and future directions are provided.
Keywords: medical imaging; imaging data; feature selection; data mining; attribute selection; medical challenges; future directions
A new feature selection method for handling redundant information in text classification (cited by: 3)
6
Authors: You-wei WANG, Li-zhou FENG. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2018, No. 2, pp. 221-234 (14 pages)
Feature selection is an important approach to dimensionality reduction in the field of text classification. Because the selected features always contain redundant information and this problem is difficult to handle, we propose a new simple feature selection method that can effectively filter the redundant features. First, to calculate the relationship between two words, the definitions of word frequency based relevance and correlative redundancy are introduced. Furthermore, an optimal feature selection (OFS) method is chosen to obtain a feature subset FS1. Finally, to improve the execution speed, the redundant features in FS1 are filtered using a predetermined threshold, and the filtered features are stored in linked lists. Experiments are carried out on three datasets (WebKB, 20-Newsgroups, and Reuters-21578), wherein support vector machines and naïve Bayes are used. The results show that the classification accuracy of the proposed method is generally higher than that of typical traditional methods (information gain, improved Gini index, and improved comprehensively measured feature selection) and the OFS methods. Moreover, the proposed method runs faster than typical mutual information-based methods (improved and normalized mutual information-based feature selections, and multilabel feature selection based on maximum dependency and minimum redundancy) while simultaneously ensuring classification accuracy. Statistical results validate the effectiveness of the proposed method in handling redundant information in text classification.
Keywords: feature selection; dimensionality reduction; text classification; redundant features; support vector machine; naive Bayes; mutual information
Feature subset selection based on Mahalanobis distance: a statistical rough set method (cited by: 1)
7
Authors: 孙亮, 韩崇昭. Journal of Pharmaceutical Analysis (SCIE, CAS), 2008, No. 1, pp. 14-18 (5 pages)
In order to select effective feature subsets for pattern classification, a novel statistical rough set method is presented based on generalized attribute reduction. Unlike classical reduction approaches, the objects in the universe of discourse are signs of training sample sets, and the values of attributes are taken as statistical parameters. The binary relation and discernibility matrix for the reduction are induced by a distance function. Furthermore, based on the monotonicity of the distance function defined by the Mahalanobis distance, the effective feature subsets are obtained as generalized attribute reducts. Experimental results show that classification performance can be improved by using the selected feature subsets.
Keywords: feature subset selection; rough set; attribute reduction; Mahalanobis distance
Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features (cited by: 10)
8
Authors: Zachary Miller, Brian Dickinson, Wei Hu. International Journal of Intelligence Science, 2012, No. 4, pp. 143-148 (6 pages)
The rapid growth of social networks has produced an unprecedented amount of user-generated data, which provides an excellent opportunity for text mining. Authorship analysis, an important part of text mining, attempts to learn about the author of a text through subtle variations in writing style that occur between gender, age, and social groups. Such information has a variety of applications, including advertising and law enforcement. One of the most accessible sources of user-generated data is Twitter, which makes the majority of its user data freely available through its data access API. In this study we seek to identify the gender of users on Twitter using Perceptron and Naive Bayes with selected 1 through 5-gram features from tweet text. Stream applications of these algorithms were employed for gender prediction to handle the speed and volume of tweet traffic. Because informal text, such as tweets, cannot be easily evaluated using traditional dictionary methods, n-gram features were implemented in this study to represent streaming tweets. The large number of 1 through 5-grams requires that only a subset of them be used in gender classification; for this reason, informative n-gram features were chosen using multiple selection algorithms. In the best case the Naive Bayes and Perceptron algorithms produced accuracy, balanced accuracy, and F-measure above 99%.
Keywords: Twitter; gender identification; stream mining; n-gram; feature selection; text mining
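The 1 through 5-gram character features mentioned in the abstract above can be sketched as follows. The function name `char_ngrams` and the sample string are illustrative assumptions; the paper's own preprocessing, feature selection, and stream classifiers are not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


# Hypothetical tweet fragment used only for illustration.
features = char_ngrams("haha ok")
```

Even this seven-character string yields 25 n-gram occurrences, which hints at why the study restricts classification to a selected subset of informative n-grams.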
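A Data Mining Algorithm Based on Making Tacit Knowledge in Multi-Label Data Explicit (cited by: 3)
9
Authors: 刘利民, 张勇. Computer Simulation (《计算机仿真》) (PKU Core), 2023, No. 4, pp. 504-508 (5 pages)
When multi-label data are mined, differences in data mining patterns lead to a low algorithm speedup ratio. A multi-label data mining algorithm based on the SECI model and attribute classification is proposed. SECI theory is applied to build a data conversion model that makes the tacit knowledge in multi-label data explicit. The ReliefF algorithm is combined with mutual information to extract features from the multi-label data. Using attribute classification, the multi-label data mining pattern is designed according to the principle of minimizing the within-class sum of squared distances and maximizing the between-class sum of squared distances, and the mining results are obtained. Under the MVVM pattern, an interaction scheme for the mining results is established so that mining results can be obtained in real time. Simulation results show that the proposed algorithm effectively improves the speedup ratio.
Keywords: attribute classification; multi-label data; data mining; feature selection; tacit knowledge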
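Simulation of a Text Big Data Mining Algorithm Based on High-Dimensional Clustering (cited by: 2)
10
Authors: 郭红建, 陈一飞, 梅轶群. Computer Simulation (《计算机仿真》) (PKU Core), 2023, No. 6, pp. 499-503 (5 pages)
Text data are large in scale and high in feature dimensionality. They usually contain a large amount of redundant data with complex spatial dimensions, which makes information mining from text big data difficult. Therefore, a text big data mining method based on a high-dimensional clustering algorithm is proposed. The isometric feature mapping algorithm is used to map multidimensional data into a low-dimensional space. Key features of the big data are extracted through phase space reconstruction. With average information entropy as the criterion for evaluating cluster items, the text cluster centers are updated repeatedly; when the average information entropy reaches a small value, a density function is used to determine the original text cluster centers, which completes the text big data mining. Experimental results show that the F1 value of the proposed method is above 95%, indicating that the clustering of text big data is highly accurate and that over-mining does not occur.
Keywords: clustering algorithm; average information entropy; dimensionality reduction; phase space reconstruction; text clustering; feature selection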
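Research on a Student Grade Prediction Method Based on Feature Selection
11
Authors: 刘晓雲, 刘鸿雁, 李劲松, 王冠帮. Information Technology (《信息技术》), 2023, No. 10, pp. 17-22 (6 pages)
Academic performance is an important indicator of learning outcomes and teaching quality. Predicting grades can help improve learning and teaching methods and thus teaching quality, and how to predict grades accurately has become a hot research topic in educational data mining. To improve prediction accuracy, a grade prediction method based on feature selection is proposed. First, the sequential forward selection algorithm is used to perform feature selection on the sample data, selecting the optimal feature subset with which to build a multiple linear regression prediction model; the model is then used to predict grades. To test the effectiveness of the method, it was validated on a real data set. The experimental results show that the proposed method achieves higher prediction accuracy and can provide data support for improving teaching methods and teaching quality.
Keywords: data mining; feature selection; dimensionality reduction; multiple linear regression; grade prediction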
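The pairing of sequential forward selection with multiple linear regression described in the entry above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the selection criterion (training sum of squared errors), the fixed subset size `k`, the helper names, and the toy data are all hypothetical, and the normal equations are solved with a small Gaussian elimination that assumes the chosen columns are not collinear.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_sse(X, y, cols):
    """Fit least squares with an intercept on cols; return the training SSE."""
    design = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, t in zip(design, y))


def forward_select(X, y, k):
    """Greedily add the feature that lowers the SSE most until k are chosen."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < k:
        best = min(remaining, key=lambda c: fit_sse(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: y = 2*x0 + 3*x2, while x1 is irrelevant.
X = [[1, 5, 2], [2, 3, 1], [3, 8, 4], [4, 1, 3], [5, 7, 7], [6, 2, 5]]
y = [8, 7, 18, 17, 31, 27]
selected = forward_select(X, y, 2)
```

Because the training error never increases when a feature is added, a practical version would stop on a held-out validation score or an information criterion rather than a fixed `k`.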
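Dimensionality Reduction in Statistical Pattern Recognition and Low Loss Dimensionality Reduction (cited by: 44)
12
Authors: 宋枫溪, 高秀梅, 刘树海, 杨静宇. Chinese Journal of Computers (《计算机学报》) (EI, CSCD, PKU Core), 2005, No. 11, pp. 1915-1922 (8 pages)
This paper gives a fairly comprehensive review of the mainstream feature dimensionality reduction methods commonly used in statistical pattern recognition, such as feature selection and feature extraction, and describes their characteristics and scope of application. On this basis, it proposes low loss dimensionality reduction, a new feature selection method based on the optimal classifier (the Bayes classifier) that can be used for automatic text classification and other large-sample pattern classification tasks. Simulation experiments on the standard data set Reuters-21578 show that, compared with three mainstream text feature selection methods (mutual information, the χ² statistic, and document frequency), low loss dimensionality reduction performs on par with mutual information and the χ² statistic and better than document frequency.
Keywords: dimensionality reduction; feature selection; feature extraction; low loss dimensionality reduction; text classification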
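Research Progress in Text Mining Technology (cited by: 58)
13
Authors: 袁军鹏, 朱东华, 李毅, 李连宏, 黄进. Application Research of Computers (《计算机应用研究》) (CSCD, PKU Core), 2006, No. 2, pp. 1-4 (4 pages)
Text mining is the process of analyzing semantically rich text in order to understand its content and meaning, and it has become an increasingly popular and important research area within data mining. The paper first gives the definition and framework of text mining, then analyzes in detail the preprocessing, text summarization, text classification, clustering, association analysis, and visualization techniques used in text mining, and summarizes the latest research progress. Finally, it points out the significance of text mining for knowledge discovery and looks ahead to its prospects in information technology.
Keywords: text mining; Chinese word segmentation; feature selection; text summarization; text classification; text clustering; association analysis; data visualization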
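Numerical Attribute Reduction Based on Neighborhood Granulation and Rough Approximation (cited by: 291)
14
Authors: 胡清华, 于达仁, 谢宗霞. Journal of Software (《软件学报》) (EI, CSCD, PKU Core), 2008, No. 3, pp. 640-649 (10 pages)
Any subset of the space is approximated by basic neighborhood information granules, from which the neighborhood information system and neighborhood decision table models are proposed. The properties of the model are analyzed, and an algorithm for selecting numerical attributes is constructed on the basis of the model. A comparative analysis against existing algorithms is carried out on UCI standard data sets, and the experimental results show that the model can select fewer features while maintaining or improving classification ability.
Keywords: numerical feature; granular computing; neighborhood relation; rough set; variable precision; attribute reduction; feature selection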
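The idea of approximating decision classes with neighborhood granules, as summarized in the entry above, can be sketched roughly as follows. The sketch assumes a Euclidean δ-neighborhood, measures dependency as the fraction of samples whose whole neighborhood shares their label, and uses a greedy forward search; the radius, the search strategy, the function names, and the toy data are illustrative assumptions rather than the paper's exact algorithm.

```python
import math


def neighborhood(samples, i, attrs, delta):
    """Indices of samples within Euclidean distance delta of sample i on attrs."""
    def dist(a, b):
        return math.sqrt(sum((a[k] - b[k]) ** 2 for k in attrs))
    return [j for j, s in enumerate(samples) if dist(samples[i], s) <= delta]


def dependency(samples, labels, attrs, delta):
    """Fraction of samples whose entire neighborhood shares their label."""
    consistent = sum(
        1 for i in range(len(samples))
        if all(labels[j] == labels[i]
               for j in neighborhood(samples, i, attrs, delta))
    )
    return consistent / len(samples)


def greedy_reduct(samples, labels, delta):
    """Add the attribute with the largest dependency until nothing improves."""
    all_attrs = list(range(len(samples[0])))
    selected, best = [], 0.0
    while len(selected) < len(all_attrs):
        candidates = [a for a in all_attrs if a not in selected]
        scores = {a: dependency(samples, labels, selected + [a], delta)
                  for a in candidates}
        top = max(candidates, key=lambda a: scores[a])
        if scores[top] <= best:
            break  # no candidate enlarges the consistent region
        selected.append(top)
        best = scores[top]
    return selected, best


# Hypothetical toy data: attribute 0 separates the classes, attribute 1 does not.
samples = [(0.1, 0.9), (0.2, 0.1), (0.15, 0.5),
           (0.8, 0.85), (0.9, 0.15), (0.85, 0.5)]
labels = [0, 0, 0, 1, 1, 1]
reduct, score = greedy_reduct(samples, labels, 0.2)
```

Working directly with distances means numerical attributes need no discretization, although the result depends on the chosen radius and on how the attributes are scaled.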
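Research Progress in Text Mining (cited by: 15)
15
Authors: 湛燕, 陈昊, 袁方, 王丽娟. Journal of Hebei University (Natural Science Edition) (《河北大学学报(自然科学版)》) (CAS), 2003, No. 2, pp. 221-226 (6 pages)
Data mining closely combines artificial intelligence and database technology so that computers can help people intelligently and automatically extract valuable knowledge patterns from huge volumes of data to meet the needs of different applications. Because text is the natural form in which the most information is stored, text mining is of great significance. Drawing on the authors' own research, the paper introduces the research content, mining process, mining algorithms, and application prospects of text mining.
Keywords: text mining; feature selection; text classification; text clustering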
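A Feature Selection Framework Based on Category Distribution (cited by: 18)
16
Authors: 靖红芳, 王斌, 杨雅辉, 徐燕. Journal of Computer Research and Development (《计算机研究与发展》) (EI, CSCD, PKU Core), 2009, No. 9, pp. 1586-1593 (8 pages)
Many feature selection methods exist, but as far as is known, none of them performs well on imbalanced corpora. A feature selection framework based on category distribution is proposed according to how features are distributed across categories. The framework uses the distribution information of features to select features with strong discriminative power, and it allows weights to be assigned flexibly to categories: assigning larger weights to rare categories improves classification performance on those categories. It is therefore suitable for imbalanced corpora and is also highly extensible. In addition, OCFS and feature filtering based on category distribution difference can be regarded as special cases of the framework. Concrete feature selection methods were obtained by implementing the framework, and experiments on two imbalanced corpora, Reuters-21578 and the Fudan University corpus, show that their Macro F1 and Micro F1 results are better than those of IG, CHI, and OCFS.
Keywords: feature selection; imbalanced corpus; feature dimensionality reduction; text classification; data mining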
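Research and Progress in Biomedical Named Entity Recognition (cited by: 25)
17
Authors: 郑强, 刘齐军, 王正华, 朱云平. Application Research of Computers (《计算机应用研究》) (CSCD, PKU Core), 2010, No. 3, pp. 811-815, 832 (6 pages)
To obtain the knowledge in the literature directly and efficiently, named entity recognition is used to identify entities with specific meanings in text. It is the critical first step in applying text mining technology to acquire knowledge automatically and has therefore attracted growing attention. The paper introduces the main research methods and results of recent years in biomedical named entity recognition from the perspectives of evaluation methods, feature selection, machine learning methods, and post-processing, analyzes and discusses the problems that remain in each of these areas, and finally looks ahead to the research prospects of the field.
Keywords: named entity recognition; text mining; feature selection; machine learning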
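A Survey of Feature Dimensionality Reduction Methods in Text Classification (cited by: 79)
18
Authors: 陈涛, 谢阳群. Journal of the China Society for Scientific and Technical Information (《情报学报》) (CSSCI, PKU Core), 2005, No. 6, pp. 690-695 (6 pages)
The key to text classification is reducing the dimensionality of the high-dimensional feature set. The main approaches to dimensionality reduction are feature selection and feature extraction. This paper surveys existing feature selection and feature extraction methods and evaluates their strengths, weaknesses, and scope of application.
Keywords: text classification; feature dimensionality reduction; feature selection; feature extraction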
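Improved KNN Text Classification Based on Category Selection (cited by: 9)
19
Authors: 刘海峰, 张学仁, 姚泽清, 刘守生. Computer Science (《计算机科学》) (CSCD, PKU Core), 2009, No. 11, pp. 213-216 (4 pages)
High feature dimensionality and the generalization ability of the algorithm limit the classification performance of the KNN classifier. An improved category-based KNN model under dimensionality reduction is proposed, which solves the problem that large categories and high-density samples dominate the selection of the k nearest neighbors. First, an improved odds ratio method is used for feature selection; next, category vectors are used to make a preliminary judgment of the text category; finally, the KNN classifier is applied to the compressed sample set. Experimental results show that the proposed improved classification model increases classification efficiency.
Keywords: k-nearest neighbor; feature dimensionality reduction; feature selection; text classification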
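Sentiment Orientation Analysis of Product Reviews (cited by: 20)
20
Authors: 李明, 胡吉霞, 侯琳娜, 严峻. Journal of Computer Applications (《计算机应用》) (CSCD, PKU Core), 2019, No. S02, pp. 15-19 (5 pages)
Because coarse-grained sentiment analysis of product reviews cannot describe user preferences in detail, a fine-grained sentiment analysis method for product reviews is proposed that combines a support vector machine (SVM) with pointwise mutual information (PMI). First, the chi-square test is used for text feature selection and dimensionality reduction. Next, four common sentiment classification methods are compared: naive Bayes, decision tree, SVM, and k-nearest neighbor (KNN). SVM achieves the highest recall and precision, both reaching 94.5%, so SVM is used for coarse-grained sentiment analysis of the product reviews. Then, typical product attributes are summarized from human experience and expanded using the PMI method. Finally, for the expanded product attributes, fine-grained sentiment analysis and statistics are carried out on the basis of the coarse-grained analysis. Fine-grained sentiment analysis of product reviews lets manufacturers see users' preferences regarding product attributes and which aspects of product design, sales, and service need improvement.
Keywords: sentiment analysis; feature selection; text classification; machine learning; product attributes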
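The chi-square feature selection step named in the entry above is commonly computed from a 2x2 term/class contingency table. The sketch below uses that standard formula; the tokenized toy reviews, the label names, and the function names are assumptions for illustration and do not come from the paper.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # term or class is absent or universal: no evidence
    return n * (a * d - b * c) ** 2 / denom


def term_scores(docs, labels, target):
    """Score every term against the target class; docs are sets of tokens."""
    scores = {}
    for term in set().union(*docs):
        a = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab == target)
        b = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab != target)
        c = sum(1 for doc, lab in zip(docs, labels) if term not in doc and lab == target)
        d = len(docs) - a - b - c
        scores[term] = chi_square(a, b, c, d)
    return scores


# Hypothetical tokenized reviews used only for illustration.
docs = [{"good", "phone"}, {"good", "screen"}, {"bad", "phone"}, {"bad", "battery"}]
labels = ["pos", "pos", "neg", "neg"]
scores = term_scores(docs, labels, "pos")
```

Terms such as "phone" that occur equally often in both classes score zero, so ranking by this statistic and keeping the top terms is one way to perform the dimensionality reduction the abstract describes.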