Abstract
[Objective] This paper proposes MWEC, a method for discovering new Chinese words based on multi-sense word embeddings, to address inaccurate word segmentation of multi-domain social media texts. [Methods] First, we trained multi-sense word embeddings on social media texts, combined with data from Chinese HowNet and a Chinese character stroke database, to reduce semantic confusion. Then, we applied N-gram frequent string mining to identify highly correlated sub-word sets and build the new-word candidate set. Finally, we evaluated the candidates using the semantic similarity of the multi-sense word embeddings to identify new words. [Results] On datasets from four domains, MWEC improved the F1 score over the compared methods by 2.0 (finance), 3.0 (sports), 2.6 (tourism), and 11.3 (music) percentage points, respectively. [Limitations] The candidate-generation strategy focuses on the popularity of sub-words, so low-frequency new words are hard to identify. [Conclusions] By strengthening the semantic understanding of word embeddings and using multi-sense embeddings to prune candidate words, the proposed method effectively improves new word discovery for Chinese social media texts.
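The pipeline described above (mine frequent character n-grams as candidates, then prune candidates by embedding similarity) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `ngram_candidates`, `prune_by_similarity`, the frequency threshold, and the tiny hand-made character vectors are all hypothetical stand-ins for the paper's trained multi-sense embeddings and mining procedure.

```python
from collections import Counter

def ngram_candidates(corpus, n_max=4, min_freq=2):
    """Collect character n-grams (length 2..n_max) whose corpus frequency
    reaches min_freq -- a toy stand-in for the paper's N-gram
    frequent-string mining step."""
    counts = Counter()
    for text in corpus:
        for n in range(2, n_max + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return {gram for gram, c in counts.items() if c >= min_freq}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def prune_by_similarity(candidates, embed, threshold=0.5):
    """Keep a candidate only if every adjacent pair of its characters has
    embedding similarity above the threshold -- a simplified version of
    the semantic-similarity pruning the abstract describes."""
    kept = []
    for cand in candidates:
        vecs = [embed[ch] for ch in cand if ch in embed]
        if len(vecs) < 2:
            continue
        sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
        if min(sims) >= threshold:
            kept.append(cand)
    return kept
```

In the actual method, the similarity scores would come from the multi-sense word embeddings trained with HowNet senses and stroke features, rather than from per-character toy vectors.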
Authors
Zhang Le (张乐); Leng Jidong (冷基栋); Lv Xueqiang (吕学强); Yuan Menglong (袁梦龙); You Xindong (游新冬)
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Source
《数据分析与知识发现》 (Data Analysis and Knowledge Discovery)
Indexed in: CSSCI; CSCD; PKU Core Journals
2022, Issue 1, pp. 113-121 (9 pages)
Funding
Beijing Natural Science Foundation (No. 4212020)
Open Fund of the Key Laboratory of Tibetan Information Processing and Machine Translation of Qinghai Province / Key Laboratory of Tibetan Information Processing, Ministry of Education (No. 2019Z002)
National Natural Science Foundation of China (No. 61671070)
Keywords
Word Embedding
New Word
Word Segmentation
N-gram
Multi-sense Word Embedding
Semantic Similarity