
Chinese Webpage Encoding Identification Algorithm Based on Word Frequency Distribution (基于字频分布的中文网页编码识别算法)

Cited by: 2
Abstract: Encoding identification is a prerequisite for webpage content filtering, and the coexistence of several Chinese encodings complicates the filtering of Chinese webpages. To address this problem, the paper proposes FKI (Frequency Keyword Identification), a Chinese webpage encoding identification algorithm based on character frequency distribution. Chinese characters are ranked by usage frequency, and the most frequently used ones are selected to build a high-frequency character encoding table. Using the high-frequency character codes as keywords, FKI scans the webpage to be identified with an improved pattern matching algorithm and counts the matches. The match results for each encoding then serve as the basis for determining the page's actual encoding. Experimental results show that, compared with the Unigram algorithm, FKI achieves a higher recognition rate on the Chinese encodings in common use today, which makes it suitable for fast and accurate encoding identification of Chinese webpages whose encoding is unknown.
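The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's FKI implementation: the character list `HIGH_FREQ_CHARS`, the candidate set `CANDIDATE_ENCODINGS`, and the function names are all hypothetical, and plain substring counting (`bytes.count`) stands in for the improved pattern matching algorithm, whose details the abstract does not give.

```python
from typing import Dict, List, Optional

# Hypothetical, very small high-frequency character list (an assumption;
# the paper's actual table is not given in the abstract). All characters
# are shared by simplified and traditional Chinese.
HIGH_FREQ_CHARS = "的一是不了人我在有他中大上到"

# Candidate encodings to test (also an assumption).
CANDIDATE_ENCODINGS = ("gb2312", "big5", "utf-8")


def build_tables(chars: str = HIGH_FREQ_CHARS) -> Dict[str, List[bytes]]:
    """Encode every high-frequency character under each candidate encoding."""
    return {enc: [ch.encode(enc) for ch in chars] for enc in CANDIDATE_ENCODINGS}


TABLES = build_tables()


def count_matches(data: bytes, patterns: List[bytes]) -> int:
    """Count occurrences of each byte pattern with naive substring search."""
    return sum(data.count(p) for p in patterns)


def identify_encoding(data: bytes) -> Optional[str]:
    """Return the candidate encoding whose table matches most often, or None."""
    scores = {enc: count_matches(data, pats) for enc, pats in TABLES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    sample = "我是他的人,他不在了,我在。"
    for enc in CANDIDATE_ENCODINGS:
        print(enc, "->", identify_encoding(sample.encode(enc)))
```

Because the byte patterns of different encodings can occasionally coincide at misaligned offsets, a sketch like this is only dependable when the input contains enough high-frequency characters for the correct encoding to dominate the counts. A multi-pattern automaton could replace the repeated `bytes.count` calls to scan the page in a single pass.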
Source: Computer Engineering (《计算机工程》), CAS, CSCD, 2014, No. 12, pp. 199-204 (6 pages)
Funding: Industry-University-Research fund project of the Ministry of Education and Guangdong Province (2009B090200049)
Keywords: Chinese encoding; Web filtering; high-frequency characters; pattern matching; finite state automata


