Text knowledge management oriented adaptive Chinese word segmentation algorithms (面向文本知识管理的自适应中文分词算法)

Cited by: 1

Abstract: To address the weaknesses of traditional dictionary-matching segmentation in recognizing new words and handling special words, this paper presents SACWSA, an adaptive Chinese word segmentation algorithm for text knowledge management based on a 2-gram statistical model. In the preprocessing stage, SACWSA combines finite state machine theory, a conjunction-based partition method, and a divide-and-conquer strategy to split the input text into sub-sentences, which reduces the complexity of the segmentation algorithm. In the segmentation stage, it applies the 2-gram statistical model, combining local probability with global probability, to segment each sub-sentence into words, which improves the recognition rate for new words and resolves ambiguity. In the post-processing stage, part-of-speech collocation rules are established to further resolve ambiguity in the 2-gram segmentation results. The main features of SACWSA are its use of the divide-and-conquer idea to handle long sentences and long words, and its combination of local and global probability to identify new words and disambiguate. Experiments on corpora from different domains show that SACWSA adapts automatically, accurately, and efficiently to the text knowledge management requirements of different industries.
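The first two stages described in the abstract can be illustrated with a minimal Python sketch. It is not SACWSA itself: the toy corpus, the conjunction list, the add-one smoothing, and the names `split_clauses`, `bigram_logp`, and `segment` are all assumptions made for illustration. It uses a single corpus-level 2-gram estimate and does not attempt the paper's combination of local and global probability, its finite state machine, or its part-of-speech post-processing.

```python
import math
import re
from collections import Counter

# Hypothetical toy data: a tiny pre-segmented corpus standing in for real training text.
CORPUS = [
    ["知识", "管理", "系统"],
    ["文本", "知识", "管理"],
    ["中文", "分词", "算法"],
    ["自适应", "分词", "算法"],
]
# Hypothetical conjunction list used as extra clause boundaries.
# A naive split like this would also break words that contain these characters.
CONJUNCTIONS = ["以及", "并且", "和"]

def split_clauses(text):
    """Split text into sub-sentences at punctuation and at the assumed conjunctions."""
    pattern = "[，。；！？,.;!?]|" + "|".join(map(re.escape, CONJUNCTIONS))
    return [clause for clause in re.split(pattern, text) if clause]

# Count unigrams and bigrams, with "<s>" marking the start of each sentence.
unigrams = Counter(word for sent in CORPUS for word in sent)
bigrams = Counter()
for sent in CORPUS:
    padded = ["<s>"] + sent
    bigrams.update(zip(padded, padded[1:]))
unigrams["<s>"] = len(CORPUS)
VOCAB = len(unigrams)
MAX_LEN = max(len(word) for word in unigrams if word != "<s>")

def bigram_logp(prev, word):
    """Log of the add-one smoothed estimate of P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB))

def segment(clause):
    """Return the segmentation of one clause with the highest 2-gram score."""
    n = len(clause)
    # best[j][i] = (log-prob, start of previous word) for paths whose last word is clause[i:j]
    best = [dict() for _ in range(n + 1)]
    best[0][0] = (0.0, None)  # empty prefix; the "previous word" is <s>
    for j in range(1, n + 1):
        for i in range(max(0, j - MAX_LEN), j):
            word = clause[i:j]
            # Single characters are always candidates; longer strings must be in the lexicon.
            if len(word) > 1 and word not in unigrams:
                continue
            for k, (score, _) in best[i].items():
                prev = "<s>" if i == 0 else clause[k:i]
                cand = score + bigram_logp(prev, word)
                if i not in best[j] or cand > best[j][i][0]:
                    best[j][i] = (cand, k)
    # Backtrack from the best-scoring final word.
    i = max(best[n], key=lambda start: best[n][start][0])
    j = n
    words = []
    while j > 0:
        words.append(clause[i:j])
        i, j = best[j][i][1], i
    return words[::-1]

if __name__ == "__main__":
    for clause in split_clauses("文本知识管理和中文分词算法"):
        print(segment(clause))
```

Splitting into clauses first keeps each dynamic-programming table small, which is the complexity benefit the abstract attributes to the preprocessing stage. Because unknown single characters are always allowed as candidates, every clause has at least one valid segmentation even when the lexicon does not cover it.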

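Authors: Feng Yong (冯永), He Xun (贺迅), Tang Li (唐黎), Chen Xianyong (陈显勇), Chen Zhen (陈贞)

Affiliation: College of Computer Science, Chongqing University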

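Source: Journal of Chongqing University (Natural Science Edition) (《重庆大学学报（自然科学版）》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2010, No. 10, pp. 110-117 (8 pages)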

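Funding: Natural Science Foundation of Chongqing (2008BB2183); Fundamental Research Funds for the Central Universities (DJIR10180006); Project 211, Phase III construction (S-10218); China Postdoctoral Science Foundation (20080440699); National Key Technology R&D Program (2008BAH37B04); National Social Science Foundation, 11th Five-Year Plan key project in education (ACA07004-08)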

Keywords: knowledge management; text processing; statistical methods; adaptive algorithms

Classification: TP182 [Automation and Computer Technology: Control Theory and Control Engineering]
