基于外部排序的字串左右熵快速计算方法
Rapid algorithm for left (right) entropy of character strings based on external sort

Cited by: 2

Abstract: Left and right entropy are widely used in natural language processing, but there is as yet no effective method for rapidly computing them for a massive number of patterns in a large-scale corpus. This paper proposes such a method. To compute the entropy of strings of a given length, it first extracts all strings of the required length from the corpus and obtains their frequencies by external sorting and merging. It then removes the first or the last character of each extracted string to construct frequency-provision files for the strings whose entropy is to be computed. Finally, it computes the right and left entropy by comparing the frequency records in these files. Analysis and experiments show that the computational cost of the method grows linearly with corpus size, so it is suited to computing left and right entropy for massive numbers of strings in large-scale corpora.
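The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `neighbor_entropy` and the `run_size` parameter are hypothetical, and in-memory sorted runs merged with `heapq.merge` stand in for the on-disk run files of a real external sort. Each string of length n + 1 is split into a target string of length n and one adjacent character, the records are sorted and merged, and the entropy of the adjacent-character distribution is computed in one pass over the sorted stream.

```python
import heapq
import math
from itertools import groupby


def neighbor_entropy(corpus, n, side="right", run_size=100_000):
    """Entropy (in bits) of the character adjacent to each length-n string.

    side="right" uses the following character, side="left" the preceding one.
    Hypothetical helper for illustration only.
    """
    # Every (n+1)-gram becomes a (target string, adjacent character) record.
    if side == "right":
        pairs = [(corpus[i:i + n], corpus[i + n])
                 for i in range(len(corpus) - n)]
    elif side == "left":
        pairs = [(corpus[i + 1:i + n + 1], corpus[i])
                 for i in range(len(corpus) - n)]
    else:
        raise ValueError("side must be 'left' or 'right'")

    # Assumption: each sorted run stands in for one sorted file on disk.
    runs = [sorted(pairs[k:k + run_size])
            for k in range(0, len(pairs), run_size)]
    merged = heapq.merge(*runs)  # k-way merge of the sorted runs

    entropy = {}
    for target, records in groupby(merged, key=lambda p: p[0]):
        # Records for one target are contiguous; count each adjacent character.
        counts = [sum(1 for _ in same)
                  for _, same in groupby(records, key=lambda p: p[1])]
        total = sum(counts)
        # H = log2(T) - (1/T) * sum(c * log2(c)), equal to -sum(p * log2(p)).
        entropy[target] = (math.log2(total)
                           - sum(c * math.log2(c) for c in counts) / total)
    return entropy


corpus = "abcabdabe"
# "ab" is followed by c, d and e once each, so its right entropy is log2(3).
right = neighbor_entropy(corpus, 2, "right", run_size=3)
# "ab" is preceded by c and d once each (the first "ab" has no left neighbor).
left = neighbor_entropy(corpus, 2, "left", run_size=3)
print(right["ab"], left["ab"])
```

Once the records are sorted, a single sequential pass yields the entropy of every target string, which is why the approach suits corpora too large to index in memory.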

Authors: 张海军 (Zhang Haijun), 彭成 (Peng Cheng), 栾静 (Luan Jing)

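Affiliations: School of Computer Science and Technology, Xinjiang Normal University (新疆师范大学); School of Computer Science and Technology, University of Science and Technology of China (中国科技大学)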

Source: 《计算机工程与应用》 (Computer Engineering and Applications), indexed in CSCD and the Peking University Core Journals list, 2011, No. 19, pp. 18-20 (3 pages)

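Funding: National Natural Science Foundation of China (No. 61040035); Xinjiang Normal University Research Start-up Fund for Outstanding Young Teachers (No. XJNU1011)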

Keywords: natural language processing; left (right) entropy; statistical feature; new word detection

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]


