Natural Annotation Research in Large-Scale Corpora with a Focus on Chinese Word Segmentation (article in English)
Cited by: 2
Abstract  With Chinese word segmentation as the target application, this paper divides the natural annotation found in large-scale corpora into explicit and implicit annotation, and examines the distribution of each kind and its effect on language computing over large datasets. The results show that both kinds express, directly or indirectly, the author's intention about how the language should be segmented, and therefore both benefit word segmentation. Word-extraction experiments indicate that, in raw text lacking rich explicit annotation, implicit natural annotation derived from the inherent laws and patterns of the language is powerful at splitting character strings.
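Source  Acta Scientiarum Naturalium Universitatis Pekinensis (《北京大学学报(自然科学版)》), 2013, No. 1, pp. 140-146 (7 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding  Supported by the National Natural Science Foundation of China (60973062, 61170162) and the Fundamental Research Funds for the Central Universities (2012-jbyz-001).
Keywords  natural annotation; Chinese word segmentation; word extraction; large-scale corpora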
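
To make the idea of explicit natural annotation concrete, the following is a minimal sketch, not the paper's method. It assumes punctuation is the only explicit annotation available and treats every punctuation mark as a boundary that no word may cross. The punctuation set, the function names `split_on_punctuation` and `candidate_counts`, and the `max_len` parameter are all illustrative assumptions.

```python
import re
from collections import Counter

# Assumed punctuation set (illustrative, not taken from the paper).
# The writer placed these marks, so no word can span one of them.
PUNCT = r"[，。！？；：、,.!?;:()（）《》\s]+"


def split_on_punctuation(text):
    """Split raw text into segments that contain no punctuation."""
    return [seg for seg in re.split(PUNCT, text) if seg]


def candidate_counts(text, max_len=4):
    """Count character n-grams (2..max_len) that stay inside one segment.

    Strings that would straddle a punctuation mark are never generated,
    so the explicit annotation prunes impossible word candidates.
    """
    counts = Counter()
    for seg in split_on_punctuation(text):
        for n in range(2, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    sample = "我们研究中文分词。中文分词很重要，我们使用语料库。"
    counts = candidate_counts(sample)
    for candidate, freq in counts.most_common(5):
        print(candidate, freq)
```

Frequency counts like these would only be a first step. A real word-extraction system would go on to score candidates with a goodness measure, and the implicit annotation discussed in the abstract is not modeled here at all.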