Abstract
To build a comprehensive and accurate error-correction lexicon for legal texts, this paper proposes a lexicon construction method based on a web crawler. Starting from a commonly used crawler, the method adds functional modules such as topic selection and page ranking to improve the crawler's precision and recall. The crawled data is then cleaned to filter out the useful words, which form the final, usable domain-specific error-correction lexicon. A trial run of the system verifies the feasibility of the crawler design, which can support the construction of related lexicons.
Authors
刘明洁
李珅
梁毅
LIU Ming-jie; LI Shen; LIANG Yi (School of Computer, Beijing University of Technology, Beijing 100124, China; China Judicial Big Data Research Institute, Beijing 100043, China)
Source
《软件》 (Software)
2020, No. 5, pp. 57-60 (4 pages)
Funding
National Key Research and Development Program of China (Grant No. 2018YFC0831200).
Keywords
Web crawler
Legal text
Word segmentation dictionary