Abstract
Comprehensive, rich point-of-interest (POI) data directly affects the location-based services offered by map app vendors. To address the long cycle and slow pace of traditional POI data collection and publication, this paper proposes an efficient approach to collecting and publishing POI data. The publication workflow is divided into three stages: data collection, data formatting, and data deduplication and storage. The data collection module uses a load-balanced distributed web crawler. The data formatting module resolves inconsistencies in the format of the raw data produced by the collection module. The deduplication module computes the similarity between the names of new and existing records, then combines it with the distance computed from their latitude and longitude to decide whether a record is a duplicate. The deduplication model, built by combining Word2Vec with Siamese-LSTM, reaches an accuracy of 93.5%.
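The deduplication rule described in the abstract (name similarity combined with geographic distance) can be illustrated with a minimal sketch. This is not the authors' model: the name similarity here uses Python's `difflib.SequenceMatcher` as a simple stand-in for the Word2Vec + Siamese-LSTM similarity, and the function names, the record layout, and both thresholds (`sim_threshold`, `max_distance_m`) are assumptions chosen only for illustration.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a learned similarity model."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(new_poi, old_poi, sim_threshold=0.8, max_distance_m=200.0):
    """Treat two POIs as duplicates when names are similar AND locations are close.

    Both thresholds are hypothetical values, not taken from the paper.
    """
    sim = name_similarity(new_poi["name"], old_poi["name"])
    dist = haversine_m(new_poi["lat"], new_poi["lon"], old_poi["lat"], old_poi["lon"])
    return sim >= sim_threshold and dist <= max_distance_m


if __name__ == "__main__":
    old = {"name": "Starbucks Coffee", "lat": 31.8206, "lon": 117.2272}
    new = {"name": "Starbucks Coffee", "lat": 31.8207, "lon": 117.2273}
    print(is_duplicate(new, old))
```

Requiring both conditions keeps chain stores with identical names but different locations apart, while still merging the same place reported with slightly different coordinates.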
Authors
Zhang Xianrong (张先荣)
Zheng Guijun (郑贵俊)
School of Software Engineering, University of Science and Technology of China, Hefei 230051, Anhui, China
Source
Computer Applications and Software (《计算机应用与软件》)
Peking University Core Journal (北大核心)
2020, No. 12, pp. 17-25 (9 pages)