Abstract
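Environmental pollution accounts for about 60% of the causes of human cancer. Spatial co-location pattern mining algorithms identify patterns whose instances are frequently located near one another in geographic space, and can be applied to explore the potential link between outdoor air pollutants from industrial emissions and cancer. Traditional spatial co-location pattern mining algorithms usually measure pattern interest by computing prevalence from how often pattern instances occur. However, the influence of a pollution source instance on a cancer instance also depends on the distance between them, and pollution sources differ in their effects owing to meteorological conditions, concentration, and degree of harm, so interest cannot be measured by instance counts alone. To address this, a spatial ordered-pair pattern and a corresponding mining algorithm are proposed based on the Gaussian kernel density estimation model. The Gaussian kernel function captures well how the influence of a pollution source on cancer cases decays with distance. To reproduce the real-world diffusion of pollution sources as closely as possible, a new criterion for measuring spatial neighbor relationships is defined that takes urban wind direction, wind speed, and emission concentration into account. Pollution sources are also classified by carcinogenic category, and pollutants of different categories are distinguished by weighting, yielding a novel measure of pollution source-cancer relationship patterns and a corresponding mining algorithm. Finally, the effectiveness and efficiency of the proposed measure and mining algorithm are verified on real and synthetic datasets. The results show that the proposed influence degree measure captures spatial ordered-pair patterns of greater practical significance than the traditional participation measure, and that mining efficiency improves by about 60% on average compared with algorithms of the same type.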
About 60% of all known causes of cancer are related to environmental pollution. Identifying spatial co-location patterns, that is, sets of spatial features whose instances are frequently neighbors in geographic space, is important for exploring the potential relationship between industrial outdoor air pollutants and cancer risk. Traditional spatial co-location pattern mining algorithms usually calculate the prevalence of co-locations from the frequency of pattern instances when measuring pattern interest. However, the influence of a pollution source on cancer instances also depends on their proximity. In addition, pollution sources are affected by factors such as meteorological conditions, concentration levels, and degree of harm, so pattern interest cannot be measured solely by the number of instance occurrences. To address this issue, a new spatial co-location pattern (called the spatial ordered-pair pattern) is defined, and a novel mining algorithm is proposed based on the Gaussian kernel density estimation model. The Gaussian kernel function captures well the decay of the influence of pollution sources on cancer cases with distance. To better represent the real-world diffusion of pollution sources, a spatial neighbor relationship between pollution sources and cancer is defined that considers urban wind direction, wind speed, and pollution emission concentration. Furthermore, pollution sources are categorized into different carcinogenic groups, and a weighting method is used to distinguish pollutants by carcinogenic category: the influence of each pollutant on cancer is calculated by weighting its contribution by a "carcinogenic coefficient." On this basis, a novel metric of the influence of pollution sources on cancer is presented, along with a corresponding mining algorithm. The metric not only measures the effect of the distance between pollution sources and cancer instances on pattern prevalence, but also models the mechanism by which pollution sources influence cancer under real-world conditions, overcoming the limitations of traditional methods. The study also improves the robustness of the method by using a smoothing factor to mitigate mining anomalies caused by uneven distributions of cancer instances. Finally, the effectiveness and efficiency of the proposed metric and mining algorithm are tested through experiments on real and synthetic datasets, and insights are provided for cancer prevention and urban planning in Yunnan Province. The experimental results indicate that the influence degree and the participation index can accurately reflect pattern interest from macroscopic and microscopic perspectives. Furthermore, mining efficiency increases by an average of 60% compared with other algorithms. The proposed influence degree measure captures spatial co-location patterns more effectively and better reflects the impact of pollution sources on the incidence of cancer.
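The distance-decay idea in the abstract can be illustrated with a minimal sketch. It is not the authors' influence degree measure, whose exact formula is not given here: the function names, the bandwidth parameter, and the multiplicative combination of emission concentration and carcinogenic coefficient are all assumptions made for illustration, and the wind direction and wind speed terms are omitted entirely.

```python
import math


def gaussian_influence(distance, bandwidth):
    """Gaussian kernel weight: 1.0 at distance 0, decaying smoothly with distance.

    `bandwidth` is a hypothetical parameter controlling how fast influence fades.
    """
    return math.exp(-(distance ** 2) / (2 * bandwidth ** 2))


def influence_on_case(case_xy, sources, bandwidth):
    """Sum the weighted influence of all pollution sources on one cancer case.

    Each source is a tuple (x, y, concentration, carcinogenic_coeff).
    Multiplying concentration by the coefficient is an assumed combination,
    not the formula from the paper.
    """
    total = 0.0
    for x, y, concentration, carcinogenic_coeff in sources:
        d = math.hypot(case_xy[0] - x, case_xy[1] - y)
        total += carcinogenic_coeff * concentration * gaussian_influence(d, bandwidth)
    return total


if __name__ == "__main__":
    # Hypothetical data: two sources with different weights and distances.
    sources = [(0.0, 0.0, 2.0, 0.5), (3.0, 4.0, 1.0, 1.0)]
    print(influence_on_case((0.0, 0.0), sources, bandwidth=2.0))
```

Unlike a plain instance count, this score gives a nearby, highly weighted source more influence on a case than a distant one, which is the contrast the abstract draws with the traditional participation measure.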
Authors
张玲莉
王丽珍
杨培忠
ZHANG Lingli; WANG Lizhen; YANG Peizhong (School of Information Science and Engineering, Yunnan University, Kunming 650504, China; Dianchi College of Yunnan University, Kunming 650228, China)
Source
《地球信息科学学报》
EI
CSCD
PKU Core Journal
2023, Issue 12, pp. 2340-2360 (21 pages)
Journal of Geo-information Science
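Funding
National Natural Science Foundation of China (62276227, 61966036, 62266050)
Key Project of the Yunnan Provincial Basic Research Program (202201AS070015)
Yunnan Provincial Innovation Team Project (2018HC019)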
Keywords
spatial data mining (空间数据挖掘)
spatial co-location pattern (空间同位模式)
spatial ordered-pair pattern (空间序偶模式)
pollution sources (污染源)
cancer (癌症)
kernel density estimation (核密度估计)
distance attenuation (距离衰减)
influence factor (影响因子)