Abstract
Address segmentation is an important foundation for geocoding. This paper studies Chinese address segmentation based on the conditional random field (CRF) model and achieves fast, accurate segmentation of Chinese addresses. It first analyzes the strengths and weaknesses of existing Chinese address segmentation methods and designs an optimized tagging scheme for address segmentation. It then defines tail-word features and feature templates for address segmentation, and combines semi-supervised learning with manual annotation to obtain a high-quality labeled training corpus for CRF training. Finally, a CRF model is trained on the labeled corpus to automatically segment and recognize Chinese address information.
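The abstract does not give the tag set, the tail-word list, or the feature templates, so the following is only a minimal sketch of how character-level features with a tail-word flag might be prepared for a CRF, and how a predicted tag sequence could be turned back into address elements. The B/M/E/S tag set, the `TAIL_CHARS` list, and the function names `char_features` and `tags_to_segments` are all assumptions made for illustration, not the authors' design.

```python
# Hypothetical tail characters that often close an address element
# (province, city, district, county, town, road, street, number).
# This list is an assumption; the paper's actual tail words are not given.
TAIL_CHARS = set("省市区县镇乡村路街道号")


def char_features(address, i):
    """Build a feature dict for the character at position i."""
    ch = address[i]
    return {
        "char": ch,
        "is_tail": ch in TAIL_CHARS,   # tail-word feature
        "is_digit": ch.isdigit(),
        # Context window of one character on each side.
        "prev_char": address[i - 1] if i > 0 else "<BOS>",
        "next_char": address[i + 1] if i < len(address) - 1 else "<EOS>",
    }


def tags_to_segments(address, tags):
    """Turn an assumed B/M/E/S tag sequence back into address elements."""
    segments, current = [], ""
    for ch, tag in zip(address, tags):
        current += ch
        if tag in ("E", "S"):          # E or S closes the current element
            segments.append(current)
            current = ""
    if current:                        # tolerate a sequence that ends without E or S
        segments.append(current)
    return segments


address = "安徽省合肥市蜀山区"
features = [char_features(address, i) for i in range(len(address))]
# Hand-written tags standing in for the output of a trained CRF.
tags = ["B", "M", "E", "B", "M", "E", "B", "M", "E"]
segments = tags_to_segments(address, tags)
```

In a real pipeline, per-character feature dictionaries like these would be passed to a CRF toolkit together with the labeled corpus, and the tags would come from the trained model rather than being written by hand.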
Authors
YANG Debin (杨德彬)
MA Weichun (马卫春)
Provincial Fundamental Geomatic Center of Anhui, Hefei 230031, China; Anhui Key Laboratory of Smart City and Geographical Condition Monitoring, Hefei 230031, China
Source
Geomatics & Spatial Information Technology (《测绘与空间地理信息》)
2021, No. 11, pp. 73-75, 79 (4 pages in total)
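Funding
Supported by the Tianditu provincial platform supporting-technology R&D project of the Provincial Fundamental Geomatic Center of Anhui (2019FACN2756).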
Keywords
Chinese address
geocoding
conditional random field
word segmentation
geographic information