Abstract: Text mining is the analysis of text data to discover concepts and the relationships among them in unstructured text. It extracts previously unrecognized patterns or associations from large text databases, and some information retrieval and text processing systems can also find relationships between words and paragraphs. This article first describes the data sources and briefly introduces the related platforms and functional components. Second, it explains the Chinese and Korean word segmentation systems. Finally, it takes news, documents, and materials about the Korean Peninsula, together with various public opinion data from the network, as the basic data for the research. Examples of word frequency graphs and word cloud graphs are presented to show the results of text mining with the Chinese and Korean word segmentation systems.
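The word frequency step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which tools the authors used, so everything here is an assumption: the function name `word_frequencies`, the stopword handling, and the toy token list, which stands in for the output of a Chinese or Korean word segmentation system.

```python
from collections import Counter


def word_frequencies(tokens, stopwords=frozenset(), top_n=None):
    """Count tokens that are already segmented, skipping blanks and stopwords.

    Hypothetical helper for illustration; not taken from the article.
    Returns (word, count) pairs sorted from most to least frequent.
    """
    counts = Counter(t for t in tokens if t.strip() and t not in stopwords)
    return counts.most_common(top_n)


# Toy input standing in for segmenter output (assumed, not real data).
tokens = ["peninsula", "summit", "news", "summit", "the", "peninsula", "summit"]
top_words = word_frequencies(tokens, stopwords={"the"}, top_n=2)
print(top_words)  # [('summit', 3), ('peninsula', 2)]
```

The resulting (word, count) pairs are the kind of table that a word frequency graph or word cloud is typically drawn from.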
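Abstract: To address the poor segmentation of Chinese text in the electric power domain, which contains a large number of domain-specific terms, a Chinese Word Segmentation (CWS) method for the power domain based on an improved BERT (Bidirectional Encoder Representation from Transformers) is proposed. First, dictionaries covering general words and domain words respectively are built, and a dual-dictionary matching and fusion mechanism is designed to integrate word features directly into the BERT model, so that the model uses external knowledge more effectively. Second, the DEEPNORM method is introduced to improve the model's feature extraction ability, and the Bayesian Information Criterion (BIC) is used to determine the optimal model depth, allowing the BERT model to be stably deepened to 40 layers. Finally, the classical self-attention layers in the BERT model are replaced with ProbSparse self-attention layers, and the Particle Swarm Optimization (PSO) algorithm is used to determine the optimal value of the sampling factor, reducing model complexity while keeping model performance unchanged. Segmentation performance was tested on a manually annotated dataset of power-domain patent texts. The experimental results show that the proposed method achieves an F1 score of 92.87% on the segmentation task for this dataset, which is 14.70, 9.89, and 3.60 percentage points higher than the Hidden Markov Model (HMM), the multi-criteria segmentation model METASEG (pre-training model with META learning for Chinese word SEGmentation), and the Lexicon Enhanced BERT (LEBERT) model, respectively, verifying that the proposed method effectively improves the segmentation quality of Chinese text in the power domain.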
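For readers unfamiliar with how a segmentation F1 score such as the one reported here is usually obtained, the following is a minimal sketch. The abstract does not describe the evaluation procedure, so this assumes the common word-level convention in which a predicted word counts as correct only if both of its boundaries match the gold annotation. The function names and the example sentence are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_f1(gold, pred):
    """Word-level precision, recall, and F1 for one segmented sentence.

    Assumed convention: a predicted word is correct only when its exact
    span also appears in the gold segmentation.
    """
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and pred must segment the same text")
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: the domain term "变压器" (transformer) is wrongly split.
gold = ["变压器", "绕组", "温度"]
pred = ["变压", "器", "绕组", "温度"]
precision, recall, f1 = segmentation_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```

Splitting a single domain term lowers both precision and recall, which is why texts dense with specialized vocabulary tend to score poorly with general-purpose segmenters.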