Abstract: The rise of social networking has driven the growth of Internet-accessible digital documents in many languages. Such documents can be handled through Cross-Language Text Summarization (CLTS), in which documents from disparate source languages are used to generate documents in a target language. The documents must be processed using contextual semantic information together with a decoding scheme. This paper presents a multilingual cross-language document processing method with abstractive summarization. The proposed model is called Hidden Markov Model LSTM Reinforcement Learning (HMMlstmRL). First, the model uses a Hidden Markov Model to compute keywords among the cross-language words for clustering. In the second stage, bidirectional long short-term memory (LSTM) networks are used for keyword extraction in the cross-language process. Finally, HMMlstmRL uses a voting concept in reinforcement learning to identify and extract the keywords. The reported performance of HMMlstmRL is 2% better than that of the conventional bidirectional LSTM model.
Funding: This work was supported by three projects. Zhao Y received grants Nos. 61976236 and 2020MDJC06; Bi X J received grant No. 20&ZD279.
Abstract: Tibetan is one of China's minority languages, and until recently Tibetan speech recognition had not been researched as extensively as Chinese and English speech recognition. This, along with the relatively small Tibetan corpus, has resulted in unsatisfactory performance of Tibetan speech recognition based on end-to-end models. This paper aims to achieve accurate Tibetan speech recognition using a small amount of Tibetan training data. We demonstrate effective methods for Tibetan end-to-end speech recognition via cross-language transfer learning from three aspects: modeling unit selection, transfer learning method, and source language selection. Experimental results show that the Chinese-Tibetan multi-language learning method, using a multi-language character set as the modeling unit, yields the best performance, with a Tibetan Character Error Rate (CER) of 27.3%, which is reduced by 26.1% compared to the language-specific model. Our method also achieves 2.2% higher accuracy with less data than the method using Tibetan multi-dialect transfer learning under the same model structure and data set.
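The Character Error Rate (CER) cited in the Tibetan speech recognition abstract is conventionally the character-level edit distance between the recognized text and the reference transcript, divided by the number of reference characters. The sketch below illustrates that conventional definition only; it is not taken from the paper. The function names are hypothetical, and it assumes each Unicode code point counts as one character, which may differ from the modeling units the authors actually scored (Tibetan syllables are built from several code points).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """CER over (reference, hypothesis) pairs, pooled across all utterances.

    Assumption: one Unicode code point is one character.
    """
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        raise ValueError("reference transcripts contain no characters")
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return total_edits / total_chars
```

Pooling edits over the whole test set, rather than averaging per-utterance rates, keeps long and short utterances weighted by their length; whether the paper pools or averages is not stated in the abstract.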
Abstract: The present paper describes the use of free online language resources for translating and expanding queries in cross-language information retrieval (CLIR). In a previous study, we proposed a method in which queries were translated by two machine translation systems on the Language Grid. The queries were then expanded using an online dictionary to translate compound words or word phrases. A concept base was used to compare back-translated words with the original query in order to delete mistranslated words. To evaluate the proposed method, we constructed a CLIR system and used the science documents of the NTCIR1 dataset. The proposed method achieved high precision. However, proper nouns (names of people and places) appear infrequently in science documents. In information retrieval, proper nouns present unique problems. Since proper nouns are usually unknown words, they are difficult to find in monolingual dictionaries, not to mention bilingual dictionaries. Furthermore, the user's initial query is not always the best description of the desired information. Query expansion is often proposed to solve this problem and to create a better query representation. In this work, Wikipedia was used to translate compound words or word phrases, and also to expand queries together with a concept base. The NTCIR1 and NTCIR6 datasets were used to evaluate the proposed method. With the proposed method, the CLIR system achieved a high rate of precision. The proposed system ranked higher than the NTCIR1 and NTCIR6 participating systems.
Funding: National Natural Science Foundation of China under Project No. 61876062; Scientific Research Fund of Hunan Provincial Education Department of China under Project No. 16K030; Hunan Provincial Natural Science Foundation of China under Project No. 2017JJ2101; Hunan Provincial Innovation Foundation for Postgraduate under Project No. CX2018B671.
Abstract: Bilingual word vectors have been widely exploited in cross-language information retrieval research. However, most of this research currently focuses on similar language pairs. Very few studies explore the impact of using bilingual word vectors for cross-language information retrieval in long-distance language pairs. This paper systematically analyzes the retrieval performance of various European languages (English, German, Italian, French, Finnish, Dutch) as well as Asian languages (Chinese, Japanese) in the ad hoc task of the CLEF 2002–2003 campaigns. Genetic proximity was used to visually represent the relationships between languages and to compare their cross-lingual retrieval performance in various settings. The results show that differences in language vocabulary dramatically affect retrieval performance. At the same time, the term-by-term translation retrieval method performs slightly better than the simple vector addition retrieval method. This indicates that the translation-based retrieval model can still maintain its advantage under the new semantic scheme.
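The two retrieval methods contrasted in the bilingual word vector abstract can be pictured with a toy sketch. Everything here is an assumption made for illustration: the function names, the two-dimensional vectors, the use of cosine similarity, and the simple term-count scoring are not from the paper, whose actual models the abstract does not specify. Vector addition sums the word vectors of the query and of the document in a shared space and compares the sums; term-by-term translation maps each query word to its nearest target-language word and then matches terms.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def add_vectors(words, emb):
    """Sum the vectors of all words found in the embedding table."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for word in words:
        if word in emb:
            total = [t + x for t, x in zip(total, emb[word])]
    return total


def score_by_addition(query, doc, src_emb, tgt_emb):
    """Vector addition: compare summed query and document vectors."""
    return cosine(add_vectors(query, src_emb), add_vectors(doc, tgt_emb))


def translate_term(word, src_emb, tgt_emb):
    """Nearest target-language word in the shared space, or None if unknown."""
    if word not in src_emb:
        return None
    return max(tgt_emb, key=lambda t: cosine(src_emb[word], tgt_emb[t]))


def score_by_translation(query, doc, src_emb, tgt_emb):
    """Term-by-term translation: count occurrences of translated query terms."""
    translated = [translate_term(w, src_emb, tgt_emb) for w in query]
    return sum(doc.count(t) for t in translated if t is not None)


# Hypothetical toy embeddings in a shared two-dimensional space.
SRC_EMB = {"hund": [1.0, 0.1], "haus": [0.1, 1.0]}
TGT_EMB = {"dog": [0.9, 0.2], "house": [0.2, 0.9]}
```

With these toy tables, `translate_term("hund", SRC_EMB, TGT_EMB)` picks `"dog"`, and a document mentioning "dog" scores higher than one mentioning only "house" under either method. A real system would replace the term-count score with a proper ranking function.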
Funding: This work was funded by the Major Science and Technology Projects in Henan Province, China, Grant No. 221100210600.
Abstract: Prior studies have demonstrated that deep learning-based approaches can enhance source code vulnerability detection by training neural networks to learn vulnerability patterns in code representations. However, due to limitations in code representation and neural network design, the validity and practicality of these models still need to be improved. Additionally, due to differences among programming languages, most methods lack cross-language detection generality. To address these issues, we analyze the shortcomings of previous code representations and neural networks. We propose a novel hierarchical code representation that combines Concrete Syntax Trees (CST) with Program Dependence Graphs (PDG). Furthermore, we introduce a Tree-Graph-Gated-Attention (TGGA) network based on gated recurrent units and attention mechanisms to build a Hierarchical Code Representation learning-based Vulnerability Detection (HCRVD) system. This system enables cross-language vulnerability detection at the function level. The experiments show that HCRVD surpasses many competitors in vulnerability detection capability. It benefits from the hierarchical code representation learning method and outperforms the baseline in cross-language vulnerability detection by 9.772% and 11.819% on the C/C++ and Java datasets, respectively. Moreover, HCRVD shows some ability to detect vulnerabilities in unknown programming languages and is useful in real open-source projects. HCRVD shows good validity, generality, and practicality.
Abstract: Cross-lingual image description, the task of generating image captions in a target language from images and descriptions in a source language, is addressed in this study through a novel approach that combines neural network models and semantic matching techniques. Experiments conducted on the Flickr8k and AraImg2k benchmark datasets, which feature images and descriptions in English and Arabic, show performance improvements over state-of-the-art methods. Our model, equipped with the Image & Cross-Language Semantic Matching module and the Target Language Domain Evaluation module, significantly enhances the semantic relevance of generated image descriptions. For English-to-Arabic and Arabic-to-English cross-language image descriptions, our approach achieves a CIDEr score for English and Arabic of 87.9% and 81.7%, respectively. Comparative analyses with previous works further affirm the superior performance of our approach, and visual results show that our model generates image captions that are both semantically accurate and stylistically consistent with the target language. In summary, this study advances the field of cross-lingual image description, offering an effective solution for generating image captions across languages, with the potential to impact multilingual communication and accessibility. Future research directions include expanding to more languages and incorporating diverse visual and textual data sources.
Funding: 2021 Undergraduate Innovation and Entrepreneurship Training Program (No. XJ2021284); First-Class Curriculum Construction Program of USST "English Interpreting Ability Training" (YLKC202204).
Abstract: Since late 2019, the COVID pandemic has caused the world a great deal of pain and financial loss. Virologists around the world are working hard to eradicate it. Vaccines and treatment methods have been found, which could not have been accomplished without the joint efforts of the world's virologist community. Naturally, facilitating global communication would help advance the research. This paper analyzes syntactical features of English virology texts and finds that, in these texts, verbs and postpositive attributes are used frequently, complicated logic needs careful analysis, and personification is often used. Some knowledge of these sentence features may contribute to better communication in the virology community.
Funding: 2021 Undergraduate Innovation and Entrepreneurship Training Program (No. XJ2021284); First-Class Curriculum Construction Program of USST "English Interpreting Ability Training" (YLKC202204); The Eleventh China Foreign Language Education Fund Project "On the blended teaching model of interpretation course with the synergistic development of interpretation ability and critical thinking ability" (ZGWYJYJJ11A071).
Abstract: COVID has been recognized as a global pandemic, caused by the virus SARS-CoV-2. Virologists around the world have been working hard to find remedies and protection methods. This paper is a brief analysis of the lexical features of English virology texts, showing that technical words, semi-technical words, and acronyms are the three prominent lexical features of these texts. It may be helpful in guiding cross-language communication in the virologists' community, which the authors hope will facilitate global research on COVID.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 70903032) and the Foundation for Humanities and Social Science of the Chinese Ministry of Education (Grant No. 08JC870007).
Abstract: Taxonomy denotes the hierarchical structure of a knowledge organization system. It has important applications in knowledge navigation, semantic annotation, and semantic search. It is useful to study multilingual taxonomies generated automatically in a dynamic information environment in which massive amounts of information are processed and retrieved. Multilingual taxonomy is the core component of a multilingual thesaurus or ontology. This paper presents two methods of bilingual taxonomy generation: cross-language terminology clustering and mixed-language based terminology clustering. In our experiments on terminology clustering in four specific subject domains, we found that when a parallel corpus is used to cluster multilingual terminologies, mixed-language based terminology clustering outperforms cross-language terminology clustering.