Structure Function Recognition of Academic Text: Chapter Content Based Recognition
Cited by: 39

Abstract: The structure function of an academic text describes and summarizes the text's structure and the function of each chapter. It falls into five main categories: introduction, related research, method, experiment, and conclusion. Depending on the object studied, structure function recognition can operate at three levels: chapter titles, chapter content, and paragraphs. Title-based recognition has notable limitations, such as the difficulty of constructing a dataset and a low recognition rate for titles that contain out-of-vocabulary words. This paper therefore takes chapter content as its object of study, addresses the second level of structure function recognition, and casts chapter-content-based recognition as a text classification problem. For feature selection, it introduces word clustering features in addition to traditional lexical (bag-of-words) features, uses a support vector machine (SVM) as the classifier, and reports an empirical study on an experimental dataset built from natural annotations. The experimental results show that the proposed method recognizes structure functions markedly better than using lexical features alone.
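The feature construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the word clusters were obtained, so the `WORD_CLUSTERS` mapping, the toy chapters, and the helper names `build_vocabulary` and `featurize` are all hypothetical. The resulting vectors would then be passed to an SVM classifier, which is omitted here.

```python
from collections import Counter

# Hypothetical word-to-cluster mapping (an assumption for illustration).
# In practice it might come from clustering word vectors; the abstract
# does not specify the clustering method.
WORD_CLUSTERS = {
    "propose": 0, "present": 0,
    "dataset": 1, "corpus": 1,
    "accuracy": 2, "precision": 2,
}
N_CLUSTERS = 3


def build_vocabulary(documents):
    """Collect the sorted set of tokens seen in the tokenized documents."""
    return sorted({token for doc in documents for token in doc})


def featurize(tokens, vocabulary, word_clusters, n_clusters):
    """Concatenate bag-of-words counts with per-cluster counts."""
    word_counts = Counter(tokens)
    lexical = [word_counts[w] for w in vocabulary]
    cluster_counts = Counter(
        word_clusters[t] for t in tokens if t in word_clusters
    )
    clustered = [cluster_counts[c] for c in range(n_clusters)]
    return lexical + clustered


# Toy tokenized chapters (made up for this sketch).
chapters = [
    ["we", "propose", "a", "method"],
    ["accuracy", "on", "the", "corpus"],
]
vocabulary = build_vocabulary(chapters)
vectors = [
    featurize(c, vocabulary, WORD_CLUSTERS, N_CLUSTERS) for c in chapters
]

# "present" is not in the vocabulary, but it still contributes to the
# vector through its cluster, shared with "propose".
unseen = featurize(
    ["we", "present", "results"], vocabulary, WORD_CLUSTERS, N_CLUSTERS
)
```

The last call hints at why cluster features may help: a word absent from the training vocabulary can still carry signal when it shares a cluster with a known word.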

Authors: Huang Yong (黄永), Lu Wei (陆伟), Cheng Qikai (程齐凯)

Affiliation: Institute of Information Retrieval and Knowledge Mining, School of Information Management, Wuhan University

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list, 2016, No. 3, pp. 293-300 (8 pages)

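Funding: One of the research outputs of the National Natural Science Foundation of China General Program project "Lexical-Function-Oriented Semantic Recognition of Academic Text and Knowledge Graph Construction" (Grant No. 71473183) and the Ministry of Education Major Project of Key Research Bases in Humanities and Social Sciences "Research on Fine-Grained Web Information Retrieval Models and Framework Construction" (Grant No. 10JJD630014).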

Keywords: structure function, text classification, lexical feature

Classification code: G350 [Culture and Science: Information Science]
