
Research on Method for Chinese Word Segmentation without Thesaurus in Chinese Biomedical Text (中文生物医学文本无词典分词方法研究)
Cited by: 4
Abstract: To segment Chinese biomedical text effectively without a dictionary (thesaurus), this paper introduces a dictionary-free word segmentation method based on the recurrence principle. The choice is motivated by three characteristics of Chinese biomedical text: a high density of specialized terms, the continual emergence of new terms, and the use of structured abstracts. In practical application the method is improved in two respects. First, no upper limit is set on the length of a segmented term. Second, terms and hierarchical feature terms are extracted in a single pass. Experimental results show that, without a dictionary or corpus-based learning, the method effectively extracts the key specialized terms in biomedical text, with a segmentation accuracy of about 84.51%. Finally, based on the segmentation results, the paper makes a preliminary study of word length distribution in the biomedical domain. The results indicate that word length distribution in Chinese biomedical text differs greatly from that of general Chinese text, which offers a reference for choosing the value of N in N-gram models when processing Chinese biomedical text.
Source: 《情报学报》 (Journal of the China Society for Scientific and Technical Information), CSSCI, Peking University Core Journal, 2011, No. 2, pp. 197-203 (7 pages)
Keywords: Chinese word segmentation without thesaurus; structured abstract; biomedical text
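
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general recurrence idea, not the authors' method. It assumes that a character string occurring at least twice within punctuation-delimited clauses is a candidate term, that no upper bound is placed on term length, and that a shorter string is dropped when a longer repeated string contains it with the same frequency. The function names `repeated_substrings` and `length_distribution`, the delimiter set, and the thresholds are all hypothetical choices made for illustration.

```python
from collections import Counter

# Assumed clause delimiters; the paper's actual delimiter set is not stated.
DELIMITERS = set("，。；：、！？,.;:!?()（） \n\t")


def repeated_substrings(text, min_len=2, min_count=2):
    """Return maximal repeated strings and their counts (illustrative only)."""
    # Split the text into clauses so candidates never span punctuation.
    clauses, current = [], []
    for ch in text:
        if ch in DELIMITERS:
            if current:
                clauses.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        clauses.append("".join(current))

    # Count every substring of length >= min_len, with no upper length limit.
    counts = Counter()
    for clause in clauses:
        n = len(clause)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                counts[clause[i:j]] += 1

    repeated = {s: c for s, c in counts.items() if c >= min_count}

    # Drop a string if a longer repeated string contains it with equal count.
    maximal = {}
    for s, c in repeated.items():
        subsumed = any(
            s != t and s in t and repeated[t] == c for t in repeated
        )
        if not subsumed:
            maximal[s] = c
    return maximal


def length_distribution(terms):
    """Map term length (in characters) to the number of extracted terms."""
    return dict(Counter(len(term) for term in terms))


if __name__ == "__main__":
    sample = "高血压患者。高血压治疗。患者治疗方案"
    terms = repeated_substrings(sample)
    print(terms)
    print(length_distribution(terms))
```

Enumerating all substrings is quadratic in clause length, which is acceptable for abstracts but would need a suffix-array or similar structure on larger corpora. The length distribution helper mirrors the kind of word length statistic the abstract relates to choosing N in an N-gram model.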
