Abstract
This paper studies techniques for extracting specified text, on demand, from different types of HTML pages. It first analyzes the strengths and weaknesses of the current mainstream text extraction techniques and, to address the shortcomings of traditional approaches, proposes a web page text extraction technique based on machine learning. It then examines the working principle of this technique in detail and concludes with a case study describing how to build a text extraction system based on it in Java.
Source
《图书馆学研究》
CSSCI
2008, No. 5, pp. 21-22 (2 pages)
Research on Library Science