Clustering Algorithm Based on Integrating Distribution Features for Unstructured Big Data of Third-Party Payment Platforms (面向第三方支付平台非结构化大数据分布特征的融合聚类算法)

Abstract: For rapidly growing third-party payment platforms, mining the distributional information contained in unstructured data helps the platform segment merchants precisely. To make full use of multivariate transaction distributions, this paper proposes a merchant clustering algorithm for samples that are themselves multivariate distributions. We first fit the transaction distribution with a Gaussian mixture model, then define a distance between multivariate distributions based on the Wasserstein distance, and design an iterative algorithm to cluster merchants. Building on this, we further propose a clustering algorithm that fuses multivariate distributions with covariate features and yields more comprehensive and robust clustering results. We validate the method on simulated data and on real merchant transaction data from a third-party payment platform. The results show that the proposed algorithm clusters more accurately than the competing methods, and that the clustering results intuitively reveal the behavior patterns of abnormal cash-out merchants in terms of both distributions and covariate features, providing decision support for risk identification and personalized management on third-party payment platforms. The method is broadly applicable to customer clustering scenarios that require fusing multi-dimensional data.

Summary: With the rapid development of big data technology and the popularization of third-party payment services, more and more transactions are conducted digitally and recorded in databases. The massive transaction data, which contain merchants' behavior logs, are a valuable resource for mining merchants' behavioral information. Segmenting merchants into groups according to their behavior patterns enables precise and personalized decision support for marketing, risk control, and many other merchant-related management issues. This matters greatly for the management and revenue of third-party payment platforms, as well as for promoting the sustainable development of the real economy.

Previous methods for segmenting individuals from transaction data are typically based on feature engineering. A large number of transaction records are compressed into low-dimensional dense feature vectors, and clustering methods are then applied to those vectors to produce a partition of all merchants. However, feature-based methods use the data inefficiently and inevitably lose information, which may greatly restrict the effectiveness of the segmentation results. To make full use of transaction data, this paper investigates the empirical distributions of transactions for a better understanding of merchants' behaviors. Compared with low-dimensional feature vectors, the empirical distributions of transactions are much more informative. Nevertheless, clustering based on empirical distributions is a challenging task: traditional clustering algorithms, which are typically designed for structured data, can hardly be applied directly to empirical distributions.

To address this problem, this paper proposes a novel clustering algorithm for merchant segmentation based on the multivariate distribution functions of transactions. First, a Gaussian Mixture Model (GMM) is fitted to the distribution of the whole dataset. Combinations of the Gaussian components within the GMM describe meaningful patterns of transaction behavior. With all transactions modelled by the GMM, the relations between each transaction and the Gaussian components are estimated simultaneously, so the relations between a merchant and the Gaussian components can be inferred by aggregating the results for that merchant's transactions. Second, based on the GMM estimates, the Wasserstein distance is used to measure the dissimilarities among merchants' distributions. Specifically, we apply the sliced Wasserstein distance for computational efficiency. Finally, we develop an iterative algorithm, called the K-means Clustering algorithm based on GMM and Wasserstein Distance (GWKC), to cluster all merchants according to the dissimilarities among their distributions. By taking the empirical distributions of transactions fully into account, the method provides a reasonable solution for merchant segmentation. For the hyperparameters of the method, we also provide an information criterion as a reference for real applications.

The GWKC algorithm uses only the differences in transaction distributions among merchants. To further improve clustering performance, this paper integrates additional transaction-related covariate features into GWKC. These covariate features, e.g., average transaction amount, average number of transactions, and suspected cash-out transactions, serve as supplementary information to assist and adjust the results of GWKC. The improved algorithm is called GWKC With Weighted Covariates (GWKC+WCov). It covers information from both feature vectors and empirical distributions, and it allows distribution-based clustering to be combined with feature engineering, so that highly personalized and complex features reflecting expert experience and business knowledge can enter the clustering process. When distributions and covariates are combined to measure the differences among merchants, the weights of the different parts, e.g., the distribution-based measure and the feature-vector-based measure, must be determined. To obtain appropriate weights, this paper proposes an adaptive approach that iteratively searches for weights that optimize the clustering performance. GWKC+WCov is thus able to integrate multiple structural features for comprehensive clustering.

Both simulation and real data analysis show that the proposed algorithm significantly outperforms previous methods. With the structural information of distributions involved, GWKC performs much better than methods based on feature vectors. Moreover, visualizing the GWKC results intuitively illustrates the behavior patterns of cash-out merchants, providing decision support for risk detection and differentiated management on payment platforms. Among the methods compared, GWKC+WCov achieves the best performance. Since it adaptively integrates multiple kinds of structural information within transactions, it is expected to be a promising solution for real applications of merchant segmentation.

Possible directions for future work are also discussed. First, the proposed methods can be extended by integrating more unstructured data, e.g., network data or text data, so that the clustering results become more informative as more data sources are incorporated. Second, the empirical distributions could be modelled with different finite mixture models. For datasets with arbitrary distributions, non-Gaussian components, e.g., Gamma components, or nonparametric estimators may also be useful. Based on the proposed methods, more flexible versions can be derived to further optimize the clustering performance.
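The distance and fusion steps described in the summary can be illustrated with a minimal sketch. It is not the authors' GWKC implementation: it computes a Monte Carlo sliced 2-Wasserstein distance directly on raw transaction samples rather than on GMM estimates, and it fuses that distance with covariates using a fixed weight `w` rather than the adaptive weight search the paper describes. The function names, the Euclidean distance on standardized covariates, and all default parameters are assumptions made for illustration.

```python
import numpy as np


def sliced_wasserstein(A, B, n_projections=50, n_quantiles=100, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance between two point clouds.

    A and B are arrays of shape (n_a, d) and (n_b, d); the two sample
    sizes may differ because each projection is compared on a quantile grid.
    """
    rng = np.random.default_rng(seed)
    # Random unit directions on which the d-dimensional samples are projected.
    dirs = rng.normal(size=(n_projections, A.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # In one dimension, W2 compares quantile functions level by level.
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qa = np.quantile(A @ dirs.T, q, axis=0)  # shape (n_quantiles, n_projections)
    qb = np.quantile(B @ dirs.T, q, axis=0)
    return float(np.sqrt(np.mean((qa - qb) ** 2)))


def pairwise_sliced_wasserstein(samples, **kwargs):
    """Symmetric matrix of sliced Wasserstein distances between merchants."""
    n = len(samples)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = sliced_wasserstein(samples[i], samples[j], **kwargs)
    return D


def fused_distance(samples, covariates, w=0.5):
    """Weighted sum of a distribution distance and a covariate distance.

    ASSUMPTION: a fixed weight w in [0, 1] stands in for the adaptive
    weight search described in the paper.
    """
    D_dist = pairwise_sliced_wasserstein(samples)
    # Standardize covariates so that no single feature dominates the distance.
    Z = (covariates - covariates.mean(axis=0)) / (covariates.std(axis=0) + 1e-12)
    D_cov = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    return w * D_dist + (1.0 - w) * D_cov


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Three hypothetical merchants, each with 2-D transaction records.
    samples = [rng.normal(loc=m, size=(300, 2)) for m in (0.0, 0.2, 3.0)]
    covariates = np.array([[10.0, 5.0], [11.0, 6.0], [80.0, 40.0]])
    print(np.round(fused_distance(samples, covariates, w=0.7), 3))
```

The resulting distance matrix could be passed to any clustering routine that accepts precomputed dissimilarities. A K-means-style iteration of the kind the summary names would additionally need a notion of cluster center for distributions, which this sketch leaves out.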

Authors: Danyang Huang (黄丹阳); Yilin Luo (罗伊琳); Yingqiu Zhu (朱映秋) (School of Statistics, Renmin University of China; School of Statistics, University of International Business and Economics)

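Affiliations: School of Statistics, Renmin University of China; School of Statistics, University of International Business and Economics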

Source: Quarterly Journal of Economics and Management (《经济管理学刊》), 2023, No. 3, pp. 179-208 (30 pages)

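Funding: This research was supported by the General Program of the National Natural Science Foundation of China (12071477) and the Fundamental Research Funds for the Central Universities at the University of International Business and Economics (CXTD14-05).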

Keywords: Customer Segmentation; Gaussian Mixture Model; Wasserstein Distance; Clustering Analysis; Unstructured Data

Classification number: O212 [Science: Probability Theory and Mathematical Statistics]
