A Discriminative Latent Model (DLM) is proposed for extracting Multiword Expressions (MWEs) from Chinese text, with the aim of improving the performance of Machine Translation (MT) systems such as Template-Based MT (TBMT). For MT systems to become more practically useful, they need MWE processing capability. Toward this goal, we propose DLM, which is developed for sequence labeling tasks that involve hidden structures, to extract MWEs for MT systems. DLM combines the advantages of existing discriminative models and can learn hidden structures in sequence labeling tasks. In our evaluations, DLM achieves precision of up to 90.73% for some types of MWEs, which is higher than that of state-of-the-art discriminative models. These results demonstrate that it is feasible to automatically identify many Chinese MWEs using our DLM tool. With the MWE processing model, the BLEU score of the MT system also increased by up to 0.3 in a closed test.
Multiword Expressions (MWEs) appear frequently and ungrammatically in natural languages. Identifying MWEs in free texts is a very challenging problem. This paper proposes a knowledge-free, unsupervised, and language-independent Multiword Expression Distance (MED). The new metric is derived from an accepted physical principle, measures the distance from an n-gram to its semantics, and outperforms other state-of-the-art methods on MWEs in two applications: question answering and named entity extraction.
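To address sentiment orientation analysis for new words on microblogs, a method for extracting new microblog sentiment words based on improved mutual information is proposed. First, the preprocessed microblog data is split into N-grams to obtain candidate strings. Next, multiword mutual information (MMI) and left and right adjacency entropy are computed to extend and filter the candidate strings, yielding candidate new words, which are then compared against the corresponding dictionaries to obtain the new words. Finally, the sentiment similarity between words (SW) is used to compute each new word's sentiment orientation value, which yields the new sentiment words. Experimental results show that, compared with the method of reference [4], the proposed method improves precision, recall, and F1 for new-word sentiment orientation recognition by 13.14%, 5.81%, and 8.59%, respectively, so the method has good practical value.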
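The candidate-filtering step in the microblog abstract above (N-gram candidates scored by mutual information and by left and right adjacency entropy) can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not give the MMI formula, so the hypothetical `min_split_pmi` substitutes the minimum pointwise mutual information over all binary splits of a candidate, and the function names and probability estimates are assumptions made for the example.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def min_split_pmi(candidate, counts, total_chars):
    """Assumed stand-in for MMI: the minimum pointwise mutual information
    over every way of splitting the candidate into a left and right part.
    Probabilities are crude estimates: n-gram count / corpus length."""
    if len(candidate) < 2 or counts[candidate] == 0:
        raise ValueError("candidate must be an observed n-gram of length >= 2")
    p_whole = counts[candidate] / total_chars
    scores = []
    for k in range(1, len(candidate)):
        p_left = counts[candidate[:k]] / total_chars
        p_right = counts[candidate[k:]] / total_chars
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return min(scores)


def _entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def adjacency_entropy(candidate, text):
    """Return (left_entropy, right_entropy) of the characters that appear
    immediately before and after each occurrence of candidate in text."""
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(candidate)
        if end < len(text):
            right[text[end]] += 1
        # Advance by one character so overlapping occurrences are counted.
        start = text.find(candidate, start + 1)
    return _entropy(left), _entropy(right)
```

A candidate string would typically be kept only when its mutual-information score and both adjacency entropies exceed chosen thresholds: a high score suggests the characters cohere internally, and high entropy on both sides suggests the string occurs in varied contexts and is therefore a free-standing unit. The thresholds themselves are not given in the abstract and would need tuning on real data.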
Funding (DLM paper): Supported by the Liaoning Province Doctor Startup Fund under Grant No. 20101021; the Fund of the State Ethnic Affairs Commission under Grant No. 10DL08; and the Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine.
Funding (MED paper): Supported mainly by Canada's IDRC Research Chair in Information Technology Program under Grant No. 104519006; also supported by the National Natural Science Foundation of China under Grant No. 60973104; the National Basic Research 973 Program of China under Grant No. 2007CB311003; NSERC Grant OGP0046506; the Canada Research Chair Program; MITACS; an NSERC Collaborative Grant; and Ontario's Premier's Discovery Award.