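Abstract: Word embedding models map words into a low-dimensional vector space so that word semantics can be analyzed, providing an effective means for machine understanding and text processing. Traditional Chinese word embedding models learn semantic information from the internal composition of Chinese words; however, existing models either under-use or over-use the information carried by Chinese characters and their components at different levels. To make better use of this multi-level component information and generate high-quality word embeddings, a multi-level component fusion Chinese word embedding model (MJWE) is proposed. The model jointly considers the features of words, characters, and multi-level components, fuses character embeddings that carry position information, and builds multi-level component embeddings from side components (偏旁), radicals (部首), and finer-grained components, thereby capturing the internal semantic information of Chinese words more comprehensively. In addition, a vocabulary of non-compositional words is constructed to prevent overuse of word-internal information. Experimental results show that on the word similarity task WS-295, MJWE improves accuracy by 2.11% over the JWE (Joint learning Word Embeddings) model; on the word analogy task state, it improves accuracy by 2.52% over the skip-gram (SG) model; and on the word analogy task family, it improves accuracy by 6.58% over the continuous bag-of-words (CBOW) model. On the binary sentiment classification task, MJWE improves accuracy by 0.71% over JWE; on the seven-class sentiment classification task, it improves accuracy by 8.60% over SG. MJWE was also applied to the analysis of traditional Chinese medicine (TCM) literature: in the task of identifying the core drugs of prescriptions, MJWE can identify the core drugs used to treat different syndromes of chronic glomerulonephritis. MJWE can therefore generate good-quality Chinese word embeddings and, combined with a community detection algorithm, can identify the core drugs for treating different syndromes of chronic glomerulonephritis, which helps support the clinical decisions of TCM physicians.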
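The abstract does not give the MJWE formulas, so the following is only a rough sketch of the general idea it describes: a word vector is combined with position-tagged character vectors and component vectors, and words on a non-compositional list skip the internal information. The component table, the B/M/E position tags, the simple averaging, and all names (`COMPONENTS`, `NON_COMPOSITIONAL`, `compose`) are assumptions made for illustration, and the multi-level hierarchy is flattened into a single component list.

```python
import random
from typing import Dict, List, Tuple

DIM = 8
random.seed(0)

# Hypothetical decomposition table (flattened; the paper uses several levels).
COMPONENTS: Dict[str, List[str]] = {"江": ["氵", "工"], "河": ["氵", "可"]}
# Hypothetical non-compositional vocabulary, e.g. a transliterated loanword.
NON_COMPOSITIONAL = {"沙发"}


def random_vector() -> List[float]:
    return [random.uniform(-0.5, 0.5) for _ in range(DIM)]


def average(vectors: List[List[float]]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def position_tag(index: int, length: int) -> str:
    """Tag a character as Beginning, Middle, or End of the word (assumed scheme)."""
    if index == 0:
        return "B"
    return "E" if index == length - 1 else "M"


def compose(word: str, emb: Dict[Tuple, List[float]]) -> List[float]:
    """Average word, character, and component vectors; vectors are created lazily."""
    word_vec = emb.setdefault(("w", word), random_vector())
    if word in NON_COMPOSITIONAL:
        return word_vec  # internal information is deliberately ignored
    parts = [word_vec]
    char_vecs = [
        emb.setdefault(("c", ch, position_tag(i, len(word))), random_vector())
        for i, ch in enumerate(word)
    ]
    parts.append(average(char_vecs))
    comp_vecs = [
        emb.setdefault(("p", comp), random_vector())
        for ch in word
        for comp in COMPONENTS.get(ch, [])
    ]
    if comp_vecs:
        parts.append(average(comp_vecs))
    return average(parts)


embeddings: Dict[Tuple, List[float]] = {}
river_vec = compose("江河", embeddings)   # uses word, characters, and components
sofa_vec = compose("沙发", embeddings)    # uses the word vector only
```

In a trained model these vectors would be learned from context rather than sampled at random; the sketch only shows which sources of information are combined for which words.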
Abstract: To improve Chinese overlapping ambiguity resolution based on a support vector machine, statistical features for representing the feature vectors are studied. First, four statistical parameters (mutual information, accessor variety, two-character word frequency, and single-character word frequency) are each used on their own to describe the feature vectors. Then, the remaining parameters are added as complementary features to the parameter that gives the best result, in order to further improve classification performance. Experimental results show that features represented by mutual information, single-character word frequency, and accessor variety together achieve the best result, 94.39%. Compared with a commonly used word probability model, accuracy is improved by 6.62%. These comparative results confirm that classification performance can be improved by feature selection and representation.
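Two of the statistics named above can be illustrated on a toy corpus. This is a minimal sketch under common textbook definitions, which the abstract itself does not spell out: pointwise mutual information of two adjacent characters, and accessor variety as the smaller of the numbers of distinct left and right neighbors of a string. Treating a sentence boundary as one shared marker is a simplifying assumption, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Tuple


def char_and_bigram_counts(sentences: List[str]) -> Tuple[Counter, Counter]:
    """Count single characters and adjacent character pairs."""
    chars: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in sentences:
        chars.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return chars, bigrams


def mutual_information(x: str, y: str, chars: Counter, bigrams: Counter) -> float:
    """log2( p(xy) / (p(x) * p(y)) ); -inf when the pair was never observed."""
    if bigrams[x + y] == 0:
        return float("-inf")
    p_xy = bigrams[x + y] / sum(bigrams.values())
    p_x = chars[x] / sum(chars.values())
    p_y = chars[y] / sum(chars.values())
    return math.log2(p_xy / (p_x * p_y))


def accessor_variety(s: str, sentences: List[str]) -> int:
    """min(number of distinct left neighbors, number of distinct right neighbors)."""
    left, right = set(), set()
    for sentence in sentences:
        start = sentence.find(s)
        while start != -1:
            end = start + len(s)
            left.add(sentence[start - 1] if start > 0 else "<s>")
            right.add(sentence[end] if end < len(sentence) else "</s>")
            start = sentence.find(s, start + 1)
    return min(len(left), len(right))


corpus = ["abcd", "abce", "xbcd"]  # letters stand in for Chinese characters
chars, bigrams = char_and_bigram_counts(corpus)
mi_bc = mutual_information("b", "c", chars, bigrams)
av_bc = accessor_variety("bc", corpus)
```

For an overlapping string such as ABC, where both AB and BC are dictionary words, values like these for each competing segmentation could be assembled into the feature vector passed to the classifier.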
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60873174.
Abstract: This paper proposes a unified framework that combines the advantages of the fast one-at-a-time approach and the high-performance all-at-once approach to Chinese Word Segmentation (CWS) and Part-of-Speech (PoS) tagging. In this framework, the input to the PoS tagger is a candidate set of several CWS results provided by the CWS model. The widely used one-at-a-time and all-at-once approaches are two extreme cases of the proposed candidate-based approach. Experiments on Penn Chinese Treebank 5 and Tsinghua Chinese Treebank show that the generalized candidate-based approach outperforms the one-at-a-time approach and even the all-at-once approach, and that it is also faster than the time-consuming all-at-once approach. The authors compare three methods for generating the candidate set, based on sentences, words, and character intervals; the word-based method performs best.
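The candidate-based idea can be sketched as follows. The abstract does not say how the final segmentation and tags are chosen, so this example assumes that each candidate's segmentation score and tagging score are log-probabilities that are simply added; the toy lexicon, the scores, and the names `candidate_based_decode` and `toy_tagger` are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segmentation = Tuple[str, ...]
Tagged = Tuple[Segmentation, List[str], float]

# Hypothetical tagger lexicon: word -> (tag, log-probability).
LEXICON = {
    "研究": ("VV", -0.5), "生命": ("NN", -0.4), "起源": ("NN", -0.3),
    "研究生": ("NN", -0.6), "命": ("NN", -3.0),
}


def toy_tagger(words: Segmentation) -> Tuple[List[str], float]:
    """Tag each word independently; unknown words get tag X and a low score."""
    tags, score = [], 0.0
    for word in words:
        tag, log_prob = LEXICON.get(word, ("X", -5.0))
        tags.append(tag)
        score += log_prob
    return tags, score


def candidate_based_decode(
    candidates: Sequence[Tuple[Segmentation, float]],
    tag: Callable[[Segmentation], Tuple[List[str], float]],
    k: int,
) -> Optional[Tagged]:
    """Tag the k best segmentations and keep the best combined score."""
    top_k = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    best: Optional[Tagged] = None
    for words, seg_score in top_k:
        tags, tag_score = tag(words)
        total = seg_score + tag_score  # assumed way of combining the scores
        if best is None or total > best[2]:
            best = (words, tags, total)
    return best


cws_candidates = [
    (("研究生", "命", "起源"), -1.0),   # top-ranked by the (toy) CWS model
    (("研究", "生命", "起源"), -1.2),
]
one_at_a_time = candidate_based_decode(cws_candidates, toy_tagger, k=1)
candidate_based = candidate_based_decode(cws_candidates, toy_tagger, k=2)
```

With `k=1` the tagger only sees the top segmentation, which corresponds to the one-at-a-time extreme; letting `k` cover every possible segmentation corresponds to the all-at-once extreme. In this toy case `k=2` lets the tagger's evidence overturn the segmenter's first choice.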
Abstract: Every language has three cardinal elements: phonetics, lexis, and grammatical structure, of which lexis is the fundamental pillar supporting the whole system of a language. The close relationship between language and culture is most readily seen in words. As the most active and flexible element of a language, vocabulary has the greatest capacity to carry culture. Vocabulary teaching is an integral part of foreign language teaching, and its efficiency bears directly on the development of learners' communicative competence. Because vocabulary is culture-bound, introducing culture is indispensable in teaching. The author compares English and Chinese cultures, shows how cultural disparities are reflected in English and Chinese vocabulary, and puts forward constructive suggestions on how to integrate culture into vocabulary teaching in Chinese schools, so as to make vocabulary teaching more efficient and improve learners' competence in intercultural communication.
Abstract: Vocabulary is the most active element of a language, and color words are especially rich in meaning. Nature is multicolored, and over their long histories different nationalities have each formed a distinctive view of color that reflects their national culture. Culture constrains how the meanings of color words develop, and the cultural meanings of color words in turn reflect rich cultural connotations. Because of differences in cultural background, cultural tradition, and cultural psychology, the cultural connotations of English and Chinese color words differ greatly; these connotations were shaped by different nationalities in different environments. The meanings of English and Chinese color words show many similarities and differences. This paper analyzes Chinese and English color terms from the perspective of lexicology, guided by the book An Introduction to English Lexicology. After reviewing the research of several linguists, it starts from the definition and origin of color terms, examines their change and word formation, and ends with a study of idioms containing color terms, giving a systematic comparison of how Chinese and English color terms have developed.