Feature Selection Method Based on Domain-specific Term Extraction

Cited by: 4

Abstract: Traditional document representation methods for text classification are generally based on full-text (bag-of-words) analysis. Because they ignore domain-specific semantic features, they do not carry over well to domain-specific text classification tasks. This paper proposes a feature selection method based on domain-specific term extraction through corpus comparison and combines it with an SVM classifier to build a text classification system for specific domains, one that can be applied easily to other domains. The system achieved the highest score among runs from 19 groups in the evaluation of the Genomics Track Categorization Task at the 2005 Text REtrieval Conference (TREC).
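The abstract does not state how terms are scored, so the following is only a minimal sketch of the general idea of corpus-comparison feature selection, not the authors' method. It assumes a smoothed log-ratio of a term's relative frequency in a domain corpus to its relative frequency in a general corpus; the function names (`domain_scores`, `select_features`, `to_vector`), the scoring formula, and the `smoothing` parameter are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def domain_scores(domain_docs, general_docs, smoothing=1.0):
    """Score each domain-corpus term by a smoothed log frequency ratio.

    Assumption: this ratio stands in for whatever corpus-comparison
    measure the paper actually uses, which the abstract does not give.
    Both arguments are lists of tokenized documents (lists of strings).
    """
    domain_counts = Counter(tok for doc in domain_docs for tok in doc)
    general_counts = Counter(tok for doc in general_docs for tok in doc)
    d_total = sum(domain_counts.values())
    g_total = sum(general_counts.values())
    vocab = len(set(domain_counts) | set(general_counts))

    scores = {}
    for tok, n in domain_counts.items():
        # Additive smoothing keeps terms unseen in the general corpus finite.
        p_dom = (n + smoothing) / (d_total + smoothing * vocab)
        p_gen = (general_counts[tok] + smoothing) / (g_total + smoothing * vocab)
        scores[tok] = math.log(p_dom / p_gen)
    return scores


def select_features(domain_docs, general_docs, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = domain_scores(domain_docs, general_docs)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def to_vector(doc, features):
    """Represent a tokenized document as term counts over the selected features."""
    counts = Counter(doc)
    return [counts[f] for f in features]


if __name__ == "__main__":
    # Toy corpora, invented purely for illustration.
    domain = [["gene", "expression", "the", "mouse"], ["gene", "mutation", "the"]]
    general = [["the", "market", "the", "price"], ["the", "weather", "price"]]
    feats = select_features(domain, general, k=2)
    print(feats)
    print(to_vector(["gene", "gene", "the"], feats))
```

Vectors produced this way could then be passed to an SVM classifier, as the abstract describes; that step is omitted here to keep the sketch free of third-party dependencies.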

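Authors: Sun Lin (孙麟), Niu Junyu (牛军钰)

Affiliation: Department of Computer Science and Engineering, Fudan University

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 5, pp. 895-899 (5 pages)

Funding: Supported by the National Natural Science Foundation of China (60305006)

Keywords: text classification; document representation; feature selection; domain-specific

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]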

