Abstract
Text similarity calculation is both a research hotspot and a difficult problem in natural language processing. Since the Belt and Road Initiative was proposed in 2013, China has urgently needed business intelligence from countries and regions where less widely used languages are spoken. This paper takes Chinese and Tibetan as the language pair and computes Tibetan-Chinese text similarity with an algorithm based on multilevel bilingual vector space mapping. First, the texts are preprocessed and the Chinese and Tibetan texts are segmented into words. Next, a multilevel bilingual vector space mapping framework maps the Tibetan and Chinese word vectors into the same abstract semantic space, and word-to-word similarity is computed from the mapped vectors. Finally, text similarity is computed from the word similarities. Pre-trained Chinese and Tibetan word vectors are used to select the best multilevel framework; Chinese and Tibetan news articles from six categories then serve as experimental data, and their text similarity is computed with the mapped word vectors. The experimental results show that the similarity scores effectively distinguish same-category news from different-category news.
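The pipeline described in the abstract (map word vectors into a shared space, score word pairs, aggregate to a text score) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the abstract does not give the multilevel framework or the aggregation formula, so a single fixed linear map `W` stands in for the mapping, cosine similarity stands in for word similarity, and a symmetric average of best matches stands in for text similarity. The tokens and vectors are toy placeholders, not data from the paper.

```python
import math


def apply_map(matrix, vec):
    """Project a vector into the shared space with a linear map (one row per output dimension)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_similarity(tokens_a, tokens_b, vecs_a, vecs_b):
    """Symmetric average of best-match word similarities (an assumed aggregation rule)."""
    a = [vecs_a[t] for t in tokens_a if t in vecs_a]
    b = [vecs_b[t] for t in tokens_b if t in vecs_b]
    if not a or not b:
        return 0.0

    def directed(src, dst):
        # For each source word, keep its most similar word on the other side.
        return sum(max(cosine(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(a, b) + directed(b, a))


# Hypothetical mapping matrix: swaps the two axes of the toy Tibetan space.
W = [[0.0, 1.0], [1.0, 0.0]]

# Placeholder tokens standing in for segmented Tibetan and Chinese words.
tibetan_raw = {"bo_sports": [0.0, 1.0], "bo_economy": [1.0, 0.0]}
chinese = {"zh_sports": [1.0, 0.0], "zh_economy": [0.0, 1.0]}

# Map the Tibetan vectors into the space of the Chinese vectors.
tibetan_mapped = {word: apply_map(W, vec) for word, vec in tibetan_raw.items()}

same = text_similarity(["bo_sports"], ["zh_sports"], tibetan_mapped, chinese)
diff = text_similarity(["bo_sports"], ["zh_economy"], tibetan_mapped, chinese)
print(same, diff)
```

In this toy setup the same-topic pair scores higher than the cross-topic pair, which is the kind of separation between same-category and different-category news that the abstract reports.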
Authors
LIU Yiding (刘一丁)
CHEN Xiaolin (陈晓琳)
YIN Xiaoyang (尹晓阳)
LIU Gongshen (刘功申)
LIU Yiding; CHEN Xiaolin; YIN Xiaoyang; LIU Gongshen (School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; The 28th Research Institute of China Electronics Technology Group Corporation, Nanjing 210007, China)
Source
Command Information System and Technology (《指挥信息系统与技术》)
2019, No. 4, pp. 27-32 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (61772337)
Keywords
resource-poor language
text similarity
bilingual vector space mapping
multilevel framework