Based on characteristics of the Chinese language such as its simple and uniform structure, distinct hierarchy, and construction by word order and function words, and taking the human cognitive mechanism into account, this paper puts forward a hierarchical combination method for computer understanding of Chinese. In this method, the information of a whole sentence is combined hierarchically from the partial information of its basic units, using the unification operation under attribute description frames. The method combines syntactic analysis with semantic analysis well, is easy to implement, and is well suited to computer systems that process Chinese.
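The unification step mentioned in this abstract can be illustrated with a minimal sketch. It assumes attribute description frames can be modeled as nested Python dictionaries; the function name `unify`, the exception `UnificationError`, and the attribute names are hypothetical and are not taken from the paper.

```python
class UnificationError(Exception):
    """Raised when two frames carry conflicting values for the same attribute."""


def unify(frame_a, frame_b):
    """Return a new frame holding the combined information of both inputs.

    Frames are modeled as nested dicts (an assumption for illustration).
    Attributes present in only one frame are copied; attributes present in
    both must agree, or must themselves be frames that unify recursively.
    """
    result = dict(frame_a)  # shallow copy so the inputs are not modified
    for key, value in frame_b.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = unify(result[key], value)
        elif result[key] != value:
            raise UnificationError(
                f"conflict on {key!r}: {result[key]!r} vs {value!r}"
            )
    return result


# Partial information from two units is combined into one larger frame.
noun = {"cat": "NP", "agr": {"num": "sg"}}
constraint = {"agr": {"person": 3}}
combined = unify(noun, constraint)
```

Applying `unify` bottom-up over the units of a sentence is one way to read "hierarchical combination": each level merges the frames of its parts, and a conflict signals that the parts cannot combine.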
Chinese word segmentation is the basis of natural language processing. The dictionary mechanism significantly influences both the efficiency of word segmentation and the understanding of the user's intention implied in the user's query. Because traditional dictionary mechanisms cannot meet the needs of personalized mobile search, this paper presents a new dictionary mechanism that contains word classification information. The paper also puts forward an approach for improving the traditional word bank structure and proposes an improved FMM segmentation algorithm. The results show that the new dictionary mechanism significantly improves query efficiency and better meets users' individual requirements.
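For readers unfamiliar with FMM, which is commonly read as forward maximum matching, the baseline algorithm can be sketched as follows. This is the textbook version, not the improved algorithm the paper proposes; the lexicon layout (word mapped to a category label), the category names, and the function name `fmm_segment` are assumptions made for illustration.

```python
def fmm_segment(text, lexicon):
    """Segment text by forward maximum matching.

    lexicon maps each known word to a category label (hypothetical layout,
    loosely mirroring a dictionary that carries word classification
    information). Returns a list of (word, category) pairs; characters that
    match no entry are emitted alone with category None.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon or size == 1:
                tokens.append((candidate, lexicon.get(candidate)))
                i += size
                break
    return tokens


# Hypothetical lexicon with illustrative category labels.
lexicon = {"研究": "academic", "研究生": "education", "生命": "biology", "起源": "general"}
segments = fmm_segment("研究生命起源", lexicon)
```

With this lexicon the greedy match takes "研究生" first and leaves "命" unmatched, a well-known weakness of plain forward maximum matching that motivates improved variants.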
Switzerland is one of the most desirable European destinations for Chinese tourists; therefore, a better understanding of Chinese tourists is essential for successful business practices. In China, the largest and leading social media platform, Sina Weibo (a hybrid of Twitter and Facebook), has more than 600 million users. Weibo's great market penetration suggests that tourism operators and markets need to understand how to build effective and sustainable communications on Chinese social media platforms. To offer a better decision support platform to tourism destination managers as well as Chinese tourists, we propose a framework using linked data on Sina Weibo. Linked Data is a term referring to using the Internet to connect related data. We show how it can be used and how an ontology can be designed to include the users' context (e.g., GPS locations). Our framework provides a theoretical foundation for further understanding Chinese tourists' expectations, experiences, behaviors, and new trends in Switzerland.
This paper describes and exemplifies a semantic scoring system for students' online English-Chinese translations. To achieve accurate assessment, the system adopts a comprehensive method that combines semantic scoring with keyword matching scoring. After parsing, four kinds of words are identified: verbs, adjectives, adverbs, and "the rest", which includes nouns, pronouns, idioms, prepositions, etc. The system treats words tagged with different parts of speech differently. It then calculates the semantic similarity between the words of the standard versions and those of the students' translations from the distinctive differences in the semantic features of these words, with the aid of HowNet. The first semantic feature of verbs and the last semantic features of adjectives and adverbs are used in the calculation. "The rest" is scored by keyword matching. The experimental results show that the semantic scoring system is applicable to the task of scoring students' online English-Chinese translations.
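Objective: To conduct a bibliometric analysis and evaluation of literature published in China and abroad over the past 20 years on the application of natural language processing (NLP) technology to the recognition or annotation of traditional Chinese medicine (TCM) terminology, and to explore the application and prospects of NLP technology in TCM terminology standards research. Methods: Relevant literature published between January 2003 and October 2023 was retrieved from Chinese and English databases, including CNKI, the VIP Chinese Science and Technology Journal Database, the Wanfang Data Knowledge Service Platform, the China Biomedical Literature Service System, and Web of Science. Data processing and statistical analysis tools such as Excel VBA, Gephi, and PyCharm were used, and bibliometric methods including frequency statistics, Apriori association analysis, and word cloud statistics were applied to visualize research hotspots. Results: (1) After screening, 442 publications met the study criteria, of which 320 were in Chinese and 122 in English. (2) Since 2016, the number of related publications has grown continuously. (3) Publishing countries were concentrated mainly in China. (4) Master's and doctoral theses accounted for a large share of the Chinese literature, with Beijing Jiaotong University producing the most. (5) Among institutions publishing in Chinese, the China Academy of Chinese Medical Sciences had the most publications; among those publishing in English, the University of Science and Technology Beijing had the most; TCM institutions and computing-related institutions collaborated frequently. (6) BERT-based named entity recognition algorithms showed the most notable results in TCM terminology research. (7) Data drawn from TCM literature accounted for a relatively large share. Conclusion: Research on TCM terminology standardization based on NLP technology is still at an exploratory stage; existing studies show diversity in the techniques applied but lack systematic organization. Given the potential of NLP technology for recognizing and annotating TCM terminology, further research is needed to make TCM terminology standards research systematic, intelligent, and widely applicable.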
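Abstract Meaning Representation (AMR) is a deep, sentence-level semantic representation that abstracts the semantic information of a sentence into a directed acyclic graph of concept nodes and relations. Compared with shallower semantic representations such as semantic role labeling and semantic dependency parsing, AMR captures deep semantic information well and is therefore widely used in downstream tasks such as information extraction, question answering, and dialogue systems. AMR parsing converts natural language into an AMR graph. Although most concept nodes and relations in an AMR graph align fairly clearly with words in the sentence, the original English AMR corpus does not provide explicit alignment information. To overcome the obstacles that this lack of alignment information poses for AMR parsing and for the use of AMR in downstream tasks, Li et al. [14] proposed and annotated a Chinese AMR corpus with concept and relation alignments. However, existing AMR parsing methods cannot make good use of, or generate, alignment information during parsing. This paper therefore proposes, for the first time, an AMR parsing method that can both use and generate alignment information; it consists of two stages, concept prediction and relation prediction. The method is highly flexible and extensible. Experimental results show that it achieves Align Smatch scores of 77.6 (+10.6) and 70.7 (+8.5) on the public CAMR 2.0 dataset and the CAMRP 2022 blind test set, respectively, surpassing previous methods based on sequence-to-sequence models. The paper also analyzes AMR parsing performance and fine-grained metrics in detail and discusses directions for improvement. The code and model parameters are open-sourced at https://github.com/pkunlp-icler/Two-Stage-CAMRP for reproduction and reference.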
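Chinese story ending generation (SEG) is one of the downstream tasks in natural language processing. CLSEG (Contrastive Learning of Story Ending Generation), which is based entirely on wrong endings, performs well in terms of story consistency. However, because wrong endings also contain content identical to the original ending text, contrastive training that uses only wrong endings strips the main correct parts of the original ending out of the generated text. Positive-ending enhancement training is therefore added on top of CLSEG to preserve the correct parts lost in contrastive training; at the same time, introducing positive endings gives the generated endings greater diversity and relevance. The Chinese story ending generation model based on bidirectional contrastive training has two main parts: 1) multi-ending sampling, which obtains positive, enhancing endings and negative, contrasting wrong endings through different model-based methods; 2) contrastive training, which modifies the loss function during training so that generated endings move closer to the positive endings and away from the wrong endings. Experimental results on the public story dataset OutGen show that, compared with models such as GPT2.ft and deep layer-wise latent variable fusion (Della), the proposed model achieves better results on metrics such as BERTScore and METEOR, and the generated endings show greater diversity and relevance.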
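The idea of a loss that pulls generated endings toward positive endings and pushes them away from wrong endings can be sketched as below. The abstract does not give the loss formula, so this is an illustration under assumptions: it adds a margin-based term over model log-probabilities to the usual negative log-likelihood. The function name, the `margin` and `weight` parameters, and the hinge form are all hypothetical.

```python
def bidirectional_contrastive_loss(logp_gold, logp_positive, logp_negative,
                                   margin=1.0, weight=0.5):
    """Combine likelihood training with a two-sided contrastive term.

    logp_gold:     model log-probability of the reference ending (float)
    logp_positive: log-probabilities of sampled positive endings (list)
    logp_negative: log-probabilities of sampled wrong endings (list)

    The hinge term is zero once every positive ending is at least `margin`
    more likely than every wrong ending (assumed form, not from the paper).
    """
    nll = -logp_gold  # standard negative log-likelihood on the reference
    hinge_terms = [
        max(0.0, margin - (lp_pos - lp_neg))
        for lp_pos in logp_positive
        for lp_neg in logp_negative
    ]
    contrast = sum(hinge_terms) / len(hinge_terms) if hinge_terms else 0.0
    return nll + weight * contrast


# Well-separated endings add no penalty; poorly separated ones do.
separated = bidirectional_contrastive_loss(-2.0, [-1.0], [-5.0])
close = bidirectional_contrastive_loss(-2.0, [-1.0], [-1.5])
```

In a real training loop the log-probabilities would come from the generation model and the loss would be differentiated by an autograd framework; plain floats are used here only to keep the sketch self-contained.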
While large language models (LLMs) have made significant strides in natural language processing (NLP), they continue to face challenges in adequately addressing the intricacies of the Chinese language in certain scenarios. We propose a framework called Six-Writings multimodal processing (SWMP) to enable direct integration of Chinese NLP (CNLP) with morphological and semantic elements. The first part of SWMP, known as Six-Writings pictophonetic coding (SWPC), is introduced with a suitable level of granularity for radicals and components, enabling effective representation of Chinese characters and words. We conduct several experimental scenarios, including the following: (1) We establish an experimental database consisting of images and SWPC for Chinese characters, enabling dual-mode processing and matrix generation for CNLP. (2) We characterize various generative modes of Chinese words, such as thousands of Chinese idioms, used as question-and-answer (Q&A) prompt functions, facilitating analogies by SWPC. The experiments achieve 100% accuracy in answering all questions in the Chinese morphological data set (CA8-Mor-10177). (3) A fine-tuning mechanism is proposed to refine word embedding results using SWPC, resulting in an average relative error of ≤25% for 39.37% of the questions in the Chinese wOrd Similarity data set (COS960). The results demonstrate that SWMP/SWPC methods effectively capture the distinctive features of Chinese and offer a promising mechanism to enhance CNLP with better efficiency.
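This study summarizes current research hotspots and future trends in the application of the BERT model in medicine, to provide reference and suggestions for medical informatization in China. Using bibliometric methods, relevant literature on medical applications of the BERT model in the Web of Science Core Collection (WoSCC) from January 1, 2018 to December 31, 2022 was collected and analyzed. After screening, 267 publications were included. The results show that BERT is applied mainly in Western medicine; the participating countries are mainly China and the United States, with little involvement from other countries; author affiliations are dominated by universities, supplemented by medical institutions, research institutes, and government agencies; and research content focuses mainly on medical information extraction and named entity recognition. BERT was applied in traditional Chinese medicine (TCM) relatively early, but work there is still at an initial stage. Because China's health care system gives equal weight to TCM and Western medicine, future research could further explore how BERT can promote TCM informatization.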
Funding (semantic scoring system for English-Chinese translation): The National Natural Science Foundation of China (No. 60496326); The Second Phase of the 985 Project of Shanghai Jiaotong University.
Funding (Six-Writings multimodal processing study): Project partially supported by the Brazilian National Council for Scientific and Technological Development (CNPq) (No. 309545/2021-8).