Abstract
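Existing enterprise-level full-text retrieval systems are not well suited to direct use in the urban rail transit industry: the industry has its own professional terminology and large data volumes, so it places special demands on the accuracy of Chinese word segmentation, on segmentation speed, and on full-text query efficiency. For Chinese word segmentation, a dictionary-based scheme is used, extended with a lexicon of Chinese urban rail transit terms and built on the MMseg Chinese word segmentation algorithm. Because the data volume of an urban rail transit network is large and query traffic is relatively concentrated, the distributed full-text retrieval system combines a main index with an incremental index; when the system is idle, the executing hosts process retrieval tasks with multiple processes and merge the indexes. The experiment has three parts. First, two Chinese word segmentation schemes, one dictionary-based and one based on character tagging, are compared to justify the choice of the dictionary-based scheme. Second, for the same retrieval task, the effect of the number of processes on retrieval time is compared on a single machine to justify the multi-process approach. Finally, the performance of the whole system is tested. The results indicate that the scheme and methods help improve retrieval performance and retrieval efficiency.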
Existing full-text retrieval systems cannot be directly applied to the urban rail transit industry because of its large vocabulary of professional terms and its special requirements for the accuracy and speed of Chinese word segmentation and the efficiency of full-text retrieval. In this paper, following the dictionary-based scheme, the MMseg Chinese word segmentation algorithm is adopted and extended with professional terms from the urban rail industry. In view of the large data volume and query-intensive workload, a distributed full-text retrieval system with a main index and an incremental index is designed; the indexes are merged while the system is idle, and retrieval tasks are executed in parallel on multi-core machines. The experiment is divided into three steps. First, two Chinese word segmentation methods, one based on a dictionary and one based on character tagging, are compared; second, the retrieval times of the same search task under different numbers of processes are compared, in order to verify the rationale for the multi-process technique; and finally, the performance of the whole system is tested. The test results show that the proposed method improves retrieval performance and retrieval efficiency.
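The dictionary-based segmentation idea described in the abstract can be illustrated with a minimal sketch. It assumes only the simple forward maximum-matching rule that MMseg builds on, not the full chunk-based disambiguation rules, and it is not the authors' implementation. The function name `forward_max_match` and both lexicons are hypothetical and chosen for illustration only.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching over a word lexicon.

    At each position the longest lexicon word is taken; if no word
    matches, a single character is emitted as its own token.
    """
    if max_len is None:
        # Longest word in the lexicon bounds the candidate window.
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical lexicons: a tiny general dictionary and a few rail transit terms.
GENERAL_LEXICON = {"列车", "进入", "车辆", "检修"}
RAIL_TERMS = {"车辆段", "屏蔽门", "列车自动监控"}

sentence = "列车进入车辆段检修"
print(forward_max_match(sentence, GENERAL_LEXICON))
print(forward_max_match(sentence, GENERAL_LEXICON | RAIL_TERMS))
```

With the general lexicon alone, the depot term 车辆段 is split into 车辆 and 段; once the domain terms are added, it is kept as a single token, which is the effect a domain lexicon is meant to have on indexing and querying.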
Source
《城市轨道交通研究》
Peking University Core Journal (北大核心)
2015, No. 12, pp. 135-139 (5 pages)
Urban Mass Transit
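Funding
Guangdong Province Public Welfare Research and Capacity Building Special Project (Topic 0401) (2014A040401016)
Cooperative project between Guangzhou Metro Group Co., Ltd. and South China University of Technology (J11KFA6D0004)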
Keywords
urban rail transit
full-text retrieval system
Chinese word segmentation