Abstract
As railway engineering geological work has progressed, a large volume of related text material has accumulated. Because this text is unstructured and unintuitive, it is difficult to use efficiently in the process of informatization. To convert the text into a form that computers can read directly, this paper collects literature, reports, specifications, manuals, and other types of text from the field of railway engineering geology and uses the Jieba Chinese word segmentation library to build a railway engineering geology corpus of 4,192,189 words. A Word2vec model is then used to embed the segmented words of the unstructured text into a word vector space, converting them into numerical values that carry semantic information. Tests based on dimensionality-reduction visualization, clustering, and semantic similarity calculation show that the corpus and the word vectors trained on it effectively record semantic information, providing a data basis for semantic analysis, entity recognition, and knowledge graph construction in railway engineering geology.
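The semantic similarity check mentioned in the abstract is commonly done with cosine similarity between word vectors. The following is a minimal sketch of that idea only, not the paper's implementation: the function names and the three-dimensional toy vectors are hypothetical and are not taken from the paper's corpus, where the vectors would instead come from the trained Word2vec model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors for illustration only (assumption, not real data).
toy_vectors = {
    "sandstone": [0.9, 0.1, 0.0],
    "mudstone": [0.8, 0.2, 0.1],
    "landslide": [0.1, 0.9, 0.2],
    "tunnel": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    for neighbour, score in most_similar("sandstone", toy_vectors):
        print(f"{neighbour}: {score:.3f}")
```

With real embeddings, terms from the same geological category would be expected to rank closer to each other than to unrelated terms, which is the kind of evidence the similarity test in the abstract relies on.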
Source
Technology Innovation and Application (《科技创新与应用》)
2022, No. 35, pp. 89-92 (4 pages)
Funding
Major Special Project of China Railway Construction Corporation (2021-A02).