Abstract
Content on the Internet keeps growing, and search engines often return long lists of results. This paper first discusses four differences between the text of Web pages and ordinary text, then introduces a method for automatically extracting the subject of Chinese text in Web pages. The method relies mainly on statistics, with match checking as a supplementary step, and helps users grasp the subject of the current page in the shortest possible time. Experiments show that the first 15 extracted strings reflect the subject with an average precision above 85%, while processing takes only tens to hundreds of milliseconds.
Source
《情报学报》
CSSCI
PKU Core Journal
2001, No. 2, pp. 217-223 (7 pages)
Journal of the China Society for Scientific and Technical Information
Funding
Supported by the National 863 Program (Contract No. 863-306-ZD03-04-1)