User Behavior Prediction Based on Feature Engineering of Quadratic Combination and XGBoost Model
Cited by: 21
Abstract: Feature construction has long been a difficulty in data mining. Conventional, fixed feature engineering brings only limited benefit to data mining tasks whose business scenarios vary widely, so feature construction has become one of the bottlenecks of data mining; with machine learning algorithms developing rapidly, features are increasingly the part of a model that most needs attention. Using user behavior data from an e-commerce platform, this paper proposes a method for constructing quadratic-combination statistical features on top of the original feature group. Second-order crossing derives a rich set of features that fits the business scenario, and two sliding-window methods are combined with it: a fixed-length sliding window to obtain more training samples, and a variable-length sliding window to obtain training features that carry time weights, so as to reproduce users' real behavior habits as closely as possible. Finally, control models are built from different feature combinations together with dimensionality reduction, and the models are cross-validated with linear logistic regression (LR), a linear support vector machine (SVM), and the tree models extremely randomized trees (ET) and XGBoost. The results show that the combined features are expressed very well by the tree-model algorithms, and that in both linear and tree models the F1 score of the derived feature group is better than that of the base feature group.
Authors: YANG Li-hong; BAI Zhao-qiang (Department of Mathematics, South China University of Technology, Guangzhou 510640, China)
Source: Science Technology and Engineering (《科学技术与工程》), Peking University core journal, 2018, No. 14, pp. 186-189 (4 pages)
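Funding: Guangdong Province Industry-University-Research Collaborative Innovation Achievement Transformation Project (2016B090918041); Guangzhou Industry-University-Research Collaborative Innovation Major Special Project (201504302222568)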
Keywords: feature engineering; quadratic combination features; user behavior prediction; XGBoost
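
The abstract names three ingredients: pairwise (quadratic) combination of base statistics, a fixed-length sliding window that multiplies training samples, and a variable-length window that gives recent behavior more weight. The record gives no formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. The toy log, the action names (`click`, `cart`, `buy`), the smoothed-ratio form of the combination, the next-day purchase label, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy log of (user_id, action, day); not data from the paper.
LOG = [
    ("u1", "click", 1), ("u1", "click", 2), ("u1", "cart", 2),
    ("u1", "buy", 3), ("u2", "click", 1), ("u2", "click", 3),
    ("u2", "click", 4), ("u2", "cart", 4), ("u1", "click", 5),
]
ACTIONS = ("click", "cart", "buy")  # assumed action types


def base_features(log, user, start, end):
    """Count each action type for `user` on days start..end (inclusive)."""
    counts = Counter(a for u, a, d in log if u == user and start <= d <= end)
    return {a: counts[a] for a in ACTIONS}


def quadratic_features(base):
    """Combine every pair of base statistics into a smoothed ratio (assumed form)."""
    return {
        f"{b}_per_{a}": base[b] / (base[a] + 1)  # +1 avoids division by zero
        for a, b in combinations(ACTIONS, 2)
    }


def fixed_window_samples(log, user, width, last_day):
    """Slide a fixed-length window; each position yields one (features, label) pair.

    The label is 1 if the user buys on the day right after the window (assumed target).
    """
    samples = []
    for start in range(1, last_day - width + 1):
        end = start + width - 1
        feats = base_features(log, user, start, end)
        feats.update(quadratic_features(feats))
        label = int(any(u == user and a == "buy" and d == end + 1
                        for u, a, d in log))
        samples.append((feats, label))
    return samples


def variable_window_features(log, user, ref_day, widths=(1, 3, 5)):
    """Compute the same statistics over nested windows that all end at ref_day.

    Recent days fall inside every window, so they are counted more often,
    which is one simple way to encode a time weight.
    """
    feats = {}
    for w in widths:
        base = base_features(log, user, ref_day - w + 1, ref_day)
        for name, value in {**base, **quadratic_features(base)}.items():
            feats[f"{name}_last{w}d"] = value
    return feats


samples = fixed_window_samples(LOG, "u1", width=2, last_day=5)
recent = variable_window_features(LOG, "u1", ref_day=3, widths=(1, 3))
```

Each dictionary in `samples` could be turned into one row of a training matrix for the models the abstract compares (LR, linear SVM, ET, XGBoost); the variable-window features in `recent` would be appended as extra columns of the same row.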