Abstract
Traditional methods for extracting information entities from resumes generalize poorly and are hard to maintain. To address these problems, a resume information entity extraction method based on a deep neural network is proposed. After preprocessing steps such as data cleaning and word segmentation, the unstructured resume text is converted into a word sequence. Each word is mapped to a low-dimensional real-valued vector through a word-vector table trained by Word2Vec in an unsupervised manner on a large-scale corpus. A bidirectional LSTM layer fuses the contextual information of each word to be labeled and passes the scores of all possible tag sequences to a CRF layer, which introduces constraints between adjacent tags to solve for the optimal tag sequence. The model is trained with stochastic gradient descent, with dropout used to prevent overfitting. Experimental results show that the proposed method improves parsing and labeling performance and generalizes better.
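The CRF decoding step described in the abstract can be illustrated with a minimal Viterbi sketch in plain Python. This is not the authors' implementation: the tag set (`O`, `B-NAME`, `I-NAME`), the emission scores standing in for BiLSTM outputs, and the transition scores are all hypothetical values chosen for illustration. It shows how a penalty on an invalid tag transition changes the decoded sequence compared with picking the best tag for each word independently.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[t][j]: score of tag j for word t (e.g. produced by a BiLSTM).
    transitions[i][j]: score of moving from tag i to tag j.
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    # best[j]: best score of any path that ends in tag j at the current word
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(num_tags):
            # Pick the previous tag that maximizes the score of reaching tag j
            prev = max(range(num_tags),
                       key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path
    last = max(range(num_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


# Hypothetical tag set and scores (assumptions, not taken from the paper)
TAGS = ["O", "B-NAME", "I-NAME"]
emissions = [
    [1.0, 0.8, 0.0],  # word 0: "O" scores slightly higher than "B-NAME"
    [0.0, 0.1, 1.0],  # word 1: "I-NAME" scores highest
]
# Penalize the invalid transition O -> I-NAME; all other transitions are neutral
transitions = [
    [0.0, 0.0, -10.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

decoded = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)  # ['B-NAME', 'I-NAME']
```

With all transition scores set to zero, the same input decodes to `O`, `I-NAME`, which is the per-word best choice but an invalid tag sequence. The transition penalty is what steers the decoder toward `B-NAME`, `I-NAME`.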
Authors
HUANG Sheng; LI Wei; ZHANG Jian (Key Laboratory of Optical Communication and Networks, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; Peking University Shenzhen Institute, Shenzhen 518057, China)
Source
Computer Engineering and Design
Indexed in: Peking University Core Journals
2018, Issue 12, pp. 3873-3878 (6 pages)
Funding
National Natural Science Foundation of China (61371096)
Shenzhen Science and Technology Program (JCYJ20170307151743672)
Keywords
resume extraction
information entity
sequence labeling
long short-term memory
conditional random fields