
Imbalanced telecom customer data classification method based on dissimilarity (cited by: 11)
Abstract: Conventional classification techniques have difficulty identifying churning customers in imbalanced telecom customer datasets. To address this, an Improved Dissimilarity-Based imbalanced data Classification (IDBC) algorithm is proposed, which improves the prototype selection strategy of the Dissimilarity-Based Classification (DBC) algorithm. In the prototype selection stage, an improved sample subset optimization method selects the most informative prototype set from the whole dataset, avoiding the uncertainty introduced by random selection. In the classification stage, feature spaces are constructed from the dissimilarities between training samples and prototypes, and between test samples and prototypes; a conventional classification algorithm is then trained on the dissimilarity datasets mapped into these feature spaces. The algorithm was validated on the telecom customer dataset and six other ordinary imbalanced datasets from the UCI database. Compared with traditional feature-based imbalanced data classification algorithms, DBC improved the recognition rate of the rare class by 8.3% on average, and IDBC improved it by 11.3% on average. The experimental results show that the proposed IDBC algorithm is not affected by the class distribution, and that its ability to recognize the rare class in imbalanced datasets is better than that of existing advanced classification techniques.
Authors: Wang Lin (王林), Guo Nana (郭娜娜)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 4, pp. 1032-1037 (6 pages)
Funding: National Natural Science Foundation of China (61405157)
Keywords: customer churn prediction; imbalanced data classification; Sample Subset Optimization (SSO); prototype selection; dissimilarity transformation
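
The classification stage described in the abstract (mapping training and test samples into a feature space built from their dissimilarities to a prototype set, then training a conventional classifier there) can be illustrated with a minimal sketch. Everything specific below is an assumption, not the authors' implementation: Euclidean distance stands in for the dissimilarity measure, the prototype set is hand-picked rather than chosen by the improved sample subset optimization step (which the abstract does not detail), and a simple nearest-centroid rule stands in for the conventional classifier. The toy data and all function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Dissimilarity measure; Euclidean distance is an assumption here."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def to_dissimilarity_space(samples, prototypes, dissim=euclidean):
    """Map each sample to its vector of dissimilarities to the prototypes."""
    return [[dissim(s, p) for p in prototypes] for s in samples]

class NearestCentroid:
    """Stand-in for the conventional classifier trained on the mapped data."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            # Column-wise mean of all rows belonging to this class.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        return [
            min(self.centroids, key=lambda c: euclidean(x, self.centroids[c]))
            for x in X
        ]

# Toy imbalanced data: class 1 (the rare class, e.g. churners) has 2 samples,
# class 0 has 6. These values are invented for illustration only.
train_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1],
           [2.0, 2.1], [2.2, 1.9]]
train_y = [0, 0, 0, 0, 0, 0, 1, 1]
test_X = [[0.1, 0.1], [2.1, 2.0]]

# Hypothetical prototype set: one majority-class and one rare-class sample,
# picked by hand in place of the paper's prototype selection stage.
prototypes = [train_X[0], train_X[6]]

# Training and test sets are both mapped against the same prototype set.
train_D = to_dissimilarity_space(train_X, prototypes)
test_D = to_dissimilarity_space(test_X, prototypes)

predictions = NearestCentroid().fit(train_D, train_y).predict(test_D)
print(predictions)
```

In this sketch the dimensionality of the new feature space equals the number of prototypes, which is why the choice of prototype set matters: a poorly chosen or randomly drawn set may leave the rare class without a nearby reference point.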