Abstract
In Web text mining, identifying a page's character encoding is a key preliminary step: whether it succeeds directly determines whether later tasks can proceed. This paper proposes a method that uses non-GB2312 codes to determine the encoding of a web page. A comparison value is computed and compared against an empirical threshold, and the resulting binary decision determines the page's encoding type. The approach is reported to solve the encoding-identification problem that a web information collection system faces when handling large numbers of pages of different types.
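The abstract gives no implementation details, so the following is only a minimal sketch of one way such a test could work, not the authors' method. It assumes the "comparison value" is the fraction of high-bit bytes that do not form a structurally valid GB2312 (EUC-CN) two-byte pair. The function names, the label strings, and the threshold of 0.1 are all hypothetical; the paper's empirical threshold is not stated in this record.

```python
# Hypothetical sketch: classify a page as GB2312 or not by counting
# bytes that cannot belong to a GB2312 (EUC-CN) two-byte sequence.

ASSUMED_THRESHOLD = 0.1  # placeholder value, not taken from the paper


def non_gb2312_ratio(data: bytes) -> float:
    """Return the fraction of high-bit units that are not valid GB2312 pairs."""
    total = 0    # high-bit units examined (valid pairs or stray bytes)
    invalid = 0  # units that do not fit the GB2312 two-byte layout
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII byte: carries no information about the encoding
            i += 1
            continue
        total += 1
        # Simplified structural check: lead byte 0xA1-0xF7, trail byte 0xA1-0xFE
        if 0xA1 <= b <= 0xF7 and i + 1 < n and 0xA1 <= data[i + 1] <= 0xFE:
            i += 2
        else:
            invalid += 1
            i += 1
    return invalid / total if total else 0.0


def classify_page(data: bytes, threshold: float = ASSUMED_THRESHOLD) -> str:
    """Binary decision: 'GB2312' if the ratio is within the threshold."""
    return "GB2312" if non_gb2312_ratio(data) <= threshold else "non-GB2312"


if __name__ == "__main__":
    sample = "中文网页"
    print(classify_page(sample.encode("gb2312")))
    print(classify_page(sample.encode("utf-8")))
```

A ratio is used rather than a raw count so that the decision does not depend on page length. A real system would likely also consult the HTTP header and any `meta` charset declaration before falling back to byte statistics.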
Source
Computer CD Software and Application (《计算机光盘软件与应用》)
2011, No. 2, p. 84 (1 page)