Abstract
Feature construction has long been a difficulty in data mining. Conventional, fixed feature engineering brings only limited benefit to data mining tasks whose business scenarios vary widely, so feature construction has become one of the bottlenecks of data mining; with machine learning algorithms developing rapidly, features increasingly demand attention as part of the model. Based on user behavior data from an e-commerce platform, a method for building second-order combined statistical features on top of the original feature group is proposed. Second-order crossing derives a rich feature group that fits the business scenario, and two sliding-window methods are combined with it: a fixed-length sliding window to obtain more training samples, and a variable-length sliding window to obtain training features that carry time weights, so as to reproduce users' real behavioral habits as closely as possible. Finally, different feature combinations together with dimensionality reduction are used to build control models, and the models are cross-validated with linear logistic regression (LR), a linear support vector machine (SVM), and the tree models extremely randomized trees (ET) and XGBoost. The results show that the combined features are expressed very well in the tree-model algorithms, and that in both the linear models and the tree models the F1 score of the derived feature group exceeds that of the base feature group.
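As a rough illustration of the ideas summarized in the abstract, the sketch below builds count features over several look-back spans (variable-length windows), adds one second-order combination of two base statistics, and slides the reference day forward to produce more training rows (fixed-length sliding). The event log, the function names, the window lengths, and the choice of a buy-per-click ratio as the combined feature are all assumptions made for illustration; the record above does not specify the authors' actual features or implementation.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (user_id, action, day_index). Not data from the paper.
EVENTS = [
    ("u1", "click", 1), ("u1", "click", 5), ("u1", "buy", 6),
    ("u1", "click", 9), ("u2", "click", 8), ("u2", "buy", 9),
]

def window_counts(events, end_day, length):
    """Count each user's actions in the half-open day range [end_day - length, end_day)."""
    counts = defaultdict(Counter)
    for user, action, day in events:
        if end_day - length <= day < end_day:
            counts[user][action] += 1
    return counts

def variable_window_features(events, end_day, lengths=(1, 3, 7)):
    """Variable-length windows: the same statistics over several look-back spans.
    Recent events fall inside every span, so they implicitly weigh more."""
    users = sorted({user for user, _, _ in events})
    feats = {user: {} for user in users}
    for length in lengths:
        counts = window_counts(events, end_day, length)
        for user in users:
            clicks = counts[user]["click"]
            buys = counts[user]["buy"]
            feats[user][f"click_{length}d"] = clicks
            feats[user][f"buy_{length}d"] = buys
            # Second-order combination of two base statistics (a ratio, by assumption).
            feats[user][f"buy_per_click_{length}d"] = buys / clicks if clicks else 0.0
    return feats

def fixed_window_samples(events, first_end, last_end, step=1, lengths=(1, 3, 7)):
    """Fixed-length sliding: move the reference day forward by `step` and
    emit one (end_day, user, features) training row per user per position."""
    rows = []
    for end_day in range(first_end, last_end + 1, step):
        for user, f in variable_window_features(events, end_day, lengths).items():
            rows.append((end_day, user, f))
    return rows
```

Calling `fixed_window_samples(EVENTS, 8, 10)` yields one row per user for each of the reference days 8, 9 and 10; in a real pipeline each row would also carry a label taken from the period after its reference day.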
Authors
杨立洪
白肇强
YANG Li-hong; BAI Zhao-qiang (Department of Mathematics, South China University of Technology, Guangzhou 510640, China)
Source
《科学技术与工程》
Peking University Core Journal
2018, Issue 14, pp. 186-189 (4 pages)
Science Technology and Engineering
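Funding
Supported by the Guangdong Province Industry-University-Research Collaborative Innovation Achievement Transformation Project (2016B090918041)
and the Guangzhou Industry-University-Research Collaborative Innovation Major Special Project (201504302222568)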