Abstract
Word representations that capture semantic meaning are fundamental to natural language understanding. Traditional encodings built from semantic dictionaries are costly to construct, and one-hot representations are high-dimensional and sparse. Distributed word representations, which map words to dense real-valued vectors in a low-dimensional space, effectively capture the semantic relatedness between words; they are currently the mainstream representation technique and are widely used in NLP tasks. In this paper, we present an in-depth analysis of existing word representation learning methods from three perspectives: input data, learning objectives, and optimization algorithms, focusing on their theoretical basis, key techniques, evaluation metrics, and application fields. We also summarize the main challenges and the latest advances in this research area, and discuss possible future directions for word representation learning.
Authors
潘俊
吴宗大
Pan Jun; Wu Zongda (School of Science, Zhejiang University of Science and Technology, Hangzhou 310023; Wenzhou Popper Big Data Research, Wenzhou 325035)
Source
《情报学报》
CSSCI
CSCD
PKU Core (Peking University Core Journals)
2019, Issue 11, pp. 1222-1240 (19 pages)
Journal of the China Society for Scientific and Technical Information
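Funding
Ministry of Education Humanities and Social Sciences Research Youth Fund Project "Research on Lexical Semantic Representation Based on Knowledge Bases and Large-Scale Text" (18YJCZH137)
Zhejiang Provincial Natural Science Foundation Key Project "Research on Methods for Protecting Users' Personal Privacy in Personalized Text Retrieval Services" (LZ18F020001)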
Keywords
word representation
representation learning
word vector
distributed representation
deep learning