The present paper describes the use of online free language resources for translating and expanding queries in CLIR (cross-language information retrieval). In a previous study, we proposed method queries that were t...The present paper describes the use of online free language resources for translating and expanding queries in CLIR (cross-language information retrieval). In a previous study, we proposed method queries that were translated by two machine translation systems on the Language Gridem. The queries were then expanded using an online dictionary to translate compound words or word phrases. A concept base was used to compare back translation words with the original query in order to delete mistranslated words. In order to evaluate the proposed method, we constructed a CLIR system and used the science documents of the NTCIR1 dataset. The proposed method achieved high precision. However~ proper nouns (names of people and places) appear infrequently in science documents. In information retrieval, proper nouns present unique problems. Since proper nouns are usually unknown words, they are difficult to find in monolingual dictionaries, not to mention bilingual dictionaries. Furthermore, the initial query of the user is not always the best description of the desired information. In order to solve this problem, and to create a better query representation, query expansion is often proposed as a solution. Wikipedia was used to translate compound words or word phrases. It was also used to expand queries together with a concept base. The NTCIRI and NTCIR 6 datasets were used to evaluate the proposed method. In the proposed method, the CLIR system was implemented with a high rate of precision. The proposed syst had a higher ranking than the NTCIRI and NTCIR6 participation systems.展开更多
Bilingual word vectors have been exploited a lot in cross-language information retrieval research. However, most of the research is currently focused on similar language pairs. There are very few studies exploring the...Bilingual word vectors have been exploited a lot in cross-language information retrieval research. However, most of the research is currently focused on similar language pairs. There are very few studies exploring the impact of using bilingual word vectors for cross-language information retrieval in long-distance language pairs. In this paper, it systematically analyzes the retrieval performance of various European languages (English, German, Italian, French, Finnish, Dutch) as well as Asian languages (Chinese, Japanese) in the adhoc task of CLEF 2002–2003 campaign. Genetic proximity was used to visually represent the relationships between languages and compare their crosslingual retrieval performance in various settings. The results show that the differences in language vocabulary would dramatically affect the retrieval performance. At the same time, the term by term translation retrieval method performs slightly better than the simple vector addition retrieval methods. It proves that the translation-based retrieval model can still maintain its advantage under the new semantic scheme.展开更多
未登录词(out of vocabulary,OOV)的查询翻译是影响跨语言信息检索(cross-language information retrieval,CLIR)性能的关键因素之一.它根据维基百科(Wikipedia)的数据结构和语言特性,将译文环境划分为目标存在环境和目标缺失环境.针对...未登录词(out of vocabulary,OOV)的查询翻译是影响跨语言信息检索(cross-language information retrieval,CLIR)性能的关键因素之一.它根据维基百科(Wikipedia)的数据结构和语言特性,将译文环境划分为目标存在环境和目标缺失环境.针对目标缺失环境下的译文挖掘难点,它采用频度变化信息和邻接信息实现候选单元抽取,并建立基于频度-距离模型、表层匹配模板和摘要得分模型的混合译文挖掘策略.实验将基于搜索引擎的未登录词挖掘技术作为baseline,并采用TOP1进行评测.实验验证基于维基百科的混合译文挖掘方法可达到0.6822的译文正确率,相对baseline取得6.98%的改进.展开更多
文摘The present paper describes the use of online free language resources for translating and expanding queries in CLIR (cross-language information retrieval). In a previous study, we proposed method queries that were translated by two machine translation systems on the Language Gridem. The queries were then expanded using an online dictionary to translate compound words or word phrases. A concept base was used to compare back translation words with the original query in order to delete mistranslated words. In order to evaluate the proposed method, we constructed a CLIR system and used the science documents of the NTCIR1 dataset. The proposed method achieved high precision. However~ proper nouns (names of people and places) appear infrequently in science documents. In information retrieval, proper nouns present unique problems. Since proper nouns are usually unknown words, they are difficult to find in monolingual dictionaries, not to mention bilingual dictionaries. Furthermore, the initial query of the user is not always the best description of the desired information. In order to solve this problem, and to create a better query representation, query expansion is often proposed as a solution. Wikipedia was used to translate compound words or word phrases. It was also used to expand queries together with a concept base. The NTCIRI and NTCIR 6 datasets were used to evaluate the proposed method. In the proposed method, the CLIR system was implemented with a high rate of precision. The proposed syst had a higher ranking than the NTCIRI and NTCIR6 participation systems.
基金National Natural Science Foundation of China under Project No. 61876062Scientific Research Fund of Hunan Provincial Education Department of China under Project No. 16K030Hunan Provincial Natural Science Foundation of China under Project No. 2017JJ2101, Hunan Provincial Innovation Foundation for Postgraduate under Project No. CX2018B671.
文摘Bilingual word vectors have been exploited a lot in cross-language information retrieval research. However, most of the research is currently focused on similar language pairs. There are very few studies exploring the impact of using bilingual word vectors for cross-language information retrieval in long-distance language pairs. In this paper, it systematically analyzes the retrieval performance of various European languages (English, German, Italian, French, Finnish, Dutch) as well as Asian languages (Chinese, Japanese) in the adhoc task of CLEF 2002–2003 campaign. Genetic proximity was used to visually represent the relationships between languages and compare their crosslingual retrieval performance in various settings. The results show that the differences in language vocabulary would dramatically affect the retrieval performance. At the same time, the term by term translation retrieval method performs slightly better than the simple vector addition retrieval methods. It proves that the translation-based retrieval model can still maintain its advantage under the new semantic scheme.
文摘未登录词(out of vocabulary,OOV)的查询翻译是影响跨语言信息检索(cross-language information retrieval,CLIR)性能的关键因素之一.它根据维基百科(Wikipedia)的数据结构和语言特性,将译文环境划分为目标存在环境和目标缺失环境.针对目标缺失环境下的译文挖掘难点,它采用频度变化信息和邻接信息实现候选单元抽取,并建立基于频度-距离模型、表层匹配模板和摘要得分模型的混合译文挖掘策略.实验将基于搜索引擎的未登录词挖掘技术作为baseline,并采用TOP1进行评测.实验验证基于维基百科的混合译文挖掘方法可达到0.6822的译文正确率,相对baseline取得6.98%的改进.