Abstract
Natural language processing (NLP) has advanced rapidly in recent years, and it depends on large, high-quality datasets. Tibetan language processing is an important application of NLP, but publicly available Tibetan datasets are hard to obtain. To improve the construction of self-built Tibetan named entity datasets, this paper studies a semi-automatic builder for such datasets, based on web crawling and WPF (Windows Presentation Foundation) technology. The builder consists of a crawler component and a splitting component, and the paper proposes a named entity matching algorithm based on a sliding window. The crawler component is implemented by creating collection tasks in Octopus Collector (八爪鱼采集器), a mature and stable tool. The splitting component uses WPF for its user interface and implements the splitting algorithm in C#.
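The abstract names a sliding-window named entity matching algorithm but does not describe it. The following is only a minimal sketch of one plausible reading, not the authors' method: a dictionary lookup in which a window of syllables slides over the text and is tried widest-first. It is written in Python for brevity, whereas the paper's implementation is in C#. The function name `match_entities`, the `entity_dict` mapping, the `max_window` parameter, and the choice to split on the Tibetan tsheg (U+0F0B) are all assumptions made for illustration.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg); assumed as the split point


def match_entities(text, entity_dict, max_window=4):
    """Hypothetical helper: return (start, end, entity, label) tuples.

    start/end are syllable indices (end exclusive). entity_dict maps an
    entity string (syllables joined by tsheg) to a label such as "LOC".
    """
    syllables = [s for s in text.split(TSHEG) if s]
    matches = []
    i = 0
    while i < len(syllables):
        hit = None
        # Try the widest window first, then shrink it one syllable at a time.
        for size in range(min(max_window, len(syllables) - i), 0, -1):
            candidate = TSHEG.join(syllables[i:i + size])
            if candidate in entity_dict:
                hit = (i, i + size, candidate, entity_dict[candidate])
                break
        if hit:
            matches.append(hit)
            i = hit[1]  # jump past the matched entity
        else:
            i += 1      # no entity starts here; slide the window by one
    return matches


# Illustrative input only: a short sentence containing the place name Lhasa.
sentence = TSHEG.join(["ང", "ལྷ", "ས", "ལ", "འགྲོ"])
lhasa = TSHEG.join(["ལྷ", "ས"])
print(match_entities(sentence, {lhasa: "LOC"}))
```

Trying the widest window first means a longer dictionary entry wins over a shorter one that begins at the same syllable. The sketch ignores other Tibetan punctuation such as the shad, which a real splitter would need to handle.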
Authors
李甜华
央啦
杨文艺
春燕
Li Tianhua; Yang La; Yang Wenyi; Chun Yan (School of Information Science and Technology, Tibet University, Lhasa 850000, China)
Source
《现代计算机》
2023, No. 21, pp. 93-97 (5 pages)
Modern Computer
Funding
Tibet University 2022 Autonomous Region-Level Undergraduate Innovation Training Program (S202210694053).
Keywords
Web crawler (爬虫)
Tibetan language (藏文)
Named entity (命名实体)
Octopus Collector (八爪鱼采集器)