Abstract
This study uses 9,894 lapsed aquatic-product patents with publication dates from January 1, 1992 to December 31, 2011 as the text corpus for data mining. First, 56 patents are selected and their abstracts are segmented with a word segmenter. Second, words with low relevance to the categories are filtered out with a chi-square test, and the remaining words form a term matrix. Third, a Naive Bayes classifier is trained on this matrix and implemented as a program. Fourth, the trained program is tested on lapsed patents. Finally, once it passes the test, the program classifies the abstracts of all the patents, and the results are analyzed and verified. Verification shows a classification accuracy of 85%, which is reliable enough for the program to be used for text classification. Lapsed aquatic patents can thus be sorted into predefined categories to show how they are distributed over a given period, providing a reference for future decision-making.
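The pipeline described in the abstract (segmented abstracts, chi-square term filtering, Naive Bayes training, classification) can be illustrated with a minimal sketch. It is not the authors' program: the toy documents, the category names `farming` and `processing`, and all function names are hypothetical, the abstracts are assumed to be already segmented into tokens, and both the max-over-categories chi-square score and the multinomial model with Laplace smoothing are assumptions the record does not confirm.

```python
import math
from collections import Counter, defaultdict


def chi_square(doc_sets, labels, term, category):
    """2x2 chi-square statistic between term presence and category membership."""
    a = b = c = d = 0
    for tokens, label in zip(doc_sets, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_terms(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all categories."""
    vocab = sorted({t for tokens in docs for t in tokens})
    cats = sorted(set(labels))
    doc_sets = [set(tokens) for tokens in docs]
    score = {t: max(chi_square(doc_sets, labels, t, c) for c in cats) for t in vocab}
    # Ties are broken alphabetically so the selection is deterministic.
    return set(sorted(vocab, key=lambda t: (-score[t], t))[:k])


def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing, in log space."""
    prior = Counter(labels)
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for c in prior:
        total = sum(counts[c].values())
        log_like = {t: math.log((counts[c][t] + 1) / (total + len(vocab))) for t in vocab}
        model[c] = (math.log(prior[c] / len(labels)), log_like)
    return model


def classify(model, tokens):
    """Return the category with the highest log posterior; unseen terms are ignored."""
    def log_posterior(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(sorted(model), key=log_posterior)


# Hypothetical, already-segmented training abstracts and their categories.
docs = [
    ["fish", "pond", "feed", "device"],
    ["shrimp", "pond", "feed", "method"],
    ["fish", "fillet", "freezing", "device"],
    ["shrimp", "drying", "freezing", "method"],
]
labels = ["farming", "farming", "processing", "processing"]

vocab = select_terms(docs, labels, k=3)
model = train_nb(docs, labels, vocab)
print(sorted(vocab))
print(classify(model, ["crab", "pond", "feed"]))
print(classify(model, ["squid", "freezing"]))
```

In this toy set, terms such as `fish` or `device` occur equally often in both categories, so their chi-square score is zero and they are dropped, which mirrors the filtering step in the abstract. A real run would start from a Chinese word segmenter and a labeled sample comparable to the 56 patents mentioned above.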
Source
《渔业信息与战略》
2014, No. 1, pp. 54-59 (6 pages)
Fishery Information & Strategy
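Funding
National Key Technology R&D Program of the 12th Five-Year Plan (2013BAD13B01)
Science and Technology Commission of Shanghai Municipality funded project (12511501200)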
Keywords
Naive Bayes
text classification
chi-square test