Classifying Named Entities on Chinese Wikipedia (中文维基百科的实体分类研究)

Cited by: 1

Abstract: Classifying Wikipedia entities is important for natural language processing and machine learning. This paper presents a machine-learning-based method for classifying Chinese Wikipedia articles by entity type. In addition to basic features drawn from the semi-structured data and unstructured text of Wikipedia pages, it uses extended features tailored to Chinese and semantic features to improve classification performance. Experiments on a manually annotated corpus show that the additional features significantly improve entity classification: the overall F1-measure reaches 96% on the ACE entity type hierarchy and 95% on the extended entity type hierarchy.

Authors: XU Zhihao (徐志浩), HUI Haotian (惠浩添), QIAN Longhua (钱龙华), ZHU Qiaoming (朱巧明)

Affiliations: Natural Language Processing Lab, Soochow University; School of Computer Science and Technology, Soochow University

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 5, pp. 91-97, 124 (8 pages)

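Funding: National Natural Science Foundation of China (61373096, 90920004); Major Natural Science Research Project of Jiangsu Higher Education Institutions (11KJA520003)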

Keywords: Wikipedia; named entity classification; semi-structured data; Infobox

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]


