Abstract
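For rapidly growing third-party payment platforms, mining the distributional information contained in unstructured data helps the platform segment merchants precisely. To make full use of multivariate transaction distribution information, this paper proposes a merchant clustering algorithm for multivariate distribution samples. The paper first fits the transaction distribution with a Gaussian mixture model, then defines a distance between multivariate distributions based on the Wasserstein distance, and designs an iterative algorithm to cluster merchants. On this basis, the paper further proposes a clustering algorithm that integrates multivariate distributions with covariate features, yielding more comprehensive and robust clustering results. The algorithms are validated on simulated data and on real merchant transaction data from a third-party payment platform. The results show that the proposed algorithm clusters more accurately than the comparison methods, and that the clustering results intuitively reveal the behavior patterns of abnormal cash-out merchants from the perspectives of distributions and covariate features, providing decision support for risk identification and personalized management on third-party payment platforms. The method is broadly applicable to customer clustering scenarios that require integrating multi-dimensional data analysis.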
Summary: With the rapid development of big data technology and the popularization of third-party payment services, more and more transactions are conducted digitally and recorded in databases. The massive transaction data, which contain merchants' behavior logs, can serve as a valuable resource for mining behavioral information about merchants. Segmenting merchants into different groups according to their behavior patterns contributes to precise and personalized decision support for marketing, risk control, and many other merchant-related management issues. This is of great importance to the management and revenue of third-party payment platforms, as well as to promoting the sustainable development of the real economy.
For the segmentation of individuals using transaction data, previous methods are typically based on feature engineering. In this way, a large number of transaction records are compressed into low-dimensional dense feature vectors. Then, with merchants represented by feature vectors, clustering methods are applied to those vectors to produce a partition of all merchants. However, feature-based methods have limited efficiency in utilizing the data and inevitably lead to information loss, which may greatly restrict the effectiveness of the segmentation results. To make full use of transaction data, this paper investigates the empirical distributions of transactions for a better understanding of merchants' behaviors. Compared with low-dimensional feature vectors, the empirical distributions of transactions are much more informative. Nevertheless, conducting clustering analysis based on empirical distributions is a challenging task. Traditional clustering algorithms, which are typically applicable to structured data, can hardly be used directly on empirical distributions.
To address this problem, this paper proposes a novel clustering algorithm for merchant segmentation based on the multivariate distribution functions of transactions. Firstly, the Gaussian Mixture Model (GMM) is adopted to fit the distribution over the whole dataset. The combinations of Gaussian components within the GMM are used to describe meaningful patterns of transaction behaviors. With all transactions modelled by the GMM, the relations between a transaction and the Gaussian components are simultaneously estimated. As a result, the relations between a merchant and the Gaussian components can be inferred by aggregating the results of the corresponding transactions. Secondly, based on the GMM estimates, the Wasserstein distance is exploited to measure the dissimilarities among merchants' distributions. Specifically, the sliced Wasserstein distance is applied for computational efficiency. Finally, an iterative algorithm, called the K-means Clustering algorithm based on GMM and Wasserstein Distance (GWKC), is developed to cluster all merchants according to the dissimilarities among their distributions. With the empirical distributions of transactions fully taken into consideration, the method provides a reasonable solution for the segmentation of merchants. For the hyperparameters of the method, an information criterion is also provided as a reference for real applications.
The GWKC algorithm described above uses the differences in transaction distribution among merchants for clustering. To further improve the clustering performance, this paper considers integrating more transaction-related covariate features to boost the GWKC algorithm. These covariate features, e.g., average transaction amount, average number of transactions, and suspected cash-out transactions, serve as supplementary information to assist and adjust the results of GWKC. The improved clustering algorithm is called GWKC With Weighted Covariates (GWKC+WCov) in this paper. This version covers information from both feature vectors and empirical distributions. It allows distribution-based clustering methods to be integrated with feature engineering, incorporating highly personalized and complex features that involve expert experience and business knowledge into the clustering process. It is noteworthy that when integrating distributions and covariates to measure the differences among merchants, it is necessary to determine the weights of the different parts, e.g., the measure based on distributions and the measure based on feature vectors. To obtain appropriate weights, this paper proposes an adaptive approach that iteratively searches for weights that optimize the clustering performance. Thus, GWKC+WCov is able to integrate multiple structural features for comprehensive clustering.
Both simulation and real data analysis show that the proposed algorithm significantly outperforms previous methods. With the structural information of distributions involved, GWKC performs much better than methods based on feature vectors. Moreover, the visualization of the GWKC results intuitively illustrates the behavior patterns of cash-out merchants, thus providing decision support for risk detection and differentiated management on payment platforms. Among the various methods, GWKC+WCov achieves the best performance. Since it adaptively integrates multiple kinds of structural information within transactions, it is expected to be a promising solution for real applications of merchant segmentation.
Possible directions for future work are also discussed in this paper. Firstly, the proposed methods can be further extended by integrating more unstructured data, e.g., network data or text data. Thus, the clustering results may be more informative with more available data sources incorporated. Secondly, for the modelling of empirical distributions, it is possible to apply different finite mixture models. For datasets with arbitrary distributions, non-Gaussian components, e.g., Gamma components, or nonparametric estimations may also be useful. Based on the proposed methods, more flexible versions can be derived to further optimize the clustering performance.
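The sliced Wasserstein distance mentioned in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the summary says the distance is computed from the GMM estimates, whereas the sketch below works directly on raw transaction samples, which is a simplifying assumption. The function name `sliced_wasserstein` and its parameters `n_projections`, `n_quantiles`, and `seed` are hypothetical. The idea is to project both samples onto random directions and compare the resulting one-dimensional quantile functions.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=200, n_quantiles=100, seed=0):
    """Approximate sliced 2-Wasserstein distance between two samples.

    x has shape (n, d) and y has shape (m, d); n and m may differ.
    Illustrative only: it uses raw samples, not GMM estimates.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]

    # Draw random directions and normalize them onto the unit sphere.
    dirs = rng.normal(size=(n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    # Project every point onto every direction: shape (n, n_projections).
    px = x @ dirs.T
    py = y @ dirs.T

    # In one dimension, the Wasserstein distance compares quantile functions.
    # Midpoint levels avoid the extreme quantiles 0 and 1.
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qx = np.quantile(px, qs, axis=0)  # shape (n_quantiles, n_projections)
    qy = np.quantile(py, qs, axis=0)

    # Average the squared quantile gaps over levels and directions.
    return float(np.sqrt(np.mean((qx - qy) ** 2)))


# Toy usage with two synthetic "merchants" (made-up data, not from the paper).
rng = np.random.default_rng(1)
merchant_a = rng.normal(size=(500, 2))
merchant_b = merchant_a + np.array([3.0, 0.0])  # same shape, shifted location
d_same = sliced_wasserstein(merchant_a, merchant_a)
d_shift = sliced_wasserstein(merchant_a, merchant_b)
```

A pairwise matrix of such distances could then feed a K-means-style iteration; how GWKC defines cluster centers and updates them is not specified in this record, so that step is left out.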
Authors
黄丹阳
罗伊琳
朱映秋
Danyang Huang; Yilin Luo; Yingqiu Zhu (School of Statistics, Renmin University of China; School of Statistics, University of International Business and Economics)
Source
《经济管理学刊》
2023, No. 3, pp. 179-208 (30 pages)
Quarterly Journal of Economics and Management
Funding
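National Natural Science Foundation of China, General Program (12071477)
Fundamental Research Funds for the Central Universities, University of International Business and Economics (CXTD14-05), in support of this research.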