Abstract
The structure function of an academic text describes and summarizes the text's structure and the role each section plays. Sections fall into five main types: introduction, related work, methods, experiments, and conclusion. Depending on the object of study, structure function recognition operates at three levels: recognition based on section headings, on section content, and on paragraphs. Heading-based recognition has notable limitations, however, such as the difficulty of constructing datasets and low recognition rates for headings that contain out-of-vocabulary words. This paper therefore takes section content as its object of study, addressing the second level, and recasts content-based structure function recognition as a text classification problem. For feature selection, it adds word-cluster features to the traditional lexical features and uses a support vector machine (SVM) as the classifier, with an empirical study on an experimental dataset built from naturally annotated data. The results show that the proposed method recognizes structure functions clearly better than lexical features alone.
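The feature construction described in the abstract can be illustrated with a minimal sketch, assuming each section is already tokenized and a word-to-cluster mapping is available. The mapping `WORD_CLUSTERS`, the feature prefixes `w=` and `c=`, and the helper names are hypothetical; the abstract does not state how the clusters were obtained or how features were encoded.

```python
from collections import Counter

# Hypothetical word-to-cluster mapping. In practice it might come from
# clustering word vectors; the abstract does not specify the method.
WORD_CLUSTERS = {
    "propose": "c_method", "algorithm": "c_method", "model": "c_method",
    "accuracy": "c_eval", "precision": "c_eval", "baseline": "c_eval",
}


def extract_features(tokens, word_clusters=WORD_CLUSTERS):
    """Return a sparse feature dict: word counts plus word-cluster counts."""
    features = Counter()
    for tok in tokens:
        tok = tok.lower()
        features["w=" + tok] += 1          # lexical (bag-of-words) feature
        cluster = word_clusters.get(tok)
        if cluster is not None:
            features["c=" + cluster] += 1  # cluster feature shared by related words
    return dict(features)


def vectorize(feature_dicts):
    """Map sparse feature dicts to dense count vectors over a sorted vocabulary."""
    vocab = sorted({name for fd in feature_dicts for name in fd})
    index = {name: i for i, name in enumerate(vocab)}
    vectors = []
    for fd in feature_dicts:
        row = [0] * len(vocab)
        for name, count in fd.items():
            row[index[name]] = count
        vectors.append(row)
    return vocab, vectors


# Two toy "sections", tokenized by whitespace for illustration only.
sections = [
    "We propose a model and an algorithm".split(),
    "Precision exceeds the baseline accuracy".split(),
]
vocab, X = vectorize([extract_features(s) for s in sections])
```

The resulting rows of `X`, paired with section-type labels, could then be passed to any SVM implementation (for example, scikit-learn's `LinearSVC`). One plausible benefit of the cluster features is that a word absent from the training data can still activate a feature it shares with other words in its cluster.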
Source
《情报学报》
CSSCI
Peking University Core Journal
2016, No. 3, pp. 293-300 (8 pages)
Journal of the China Society for Scientific and Technical Information
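Funding
National Natural Science Foundation of China, General Program, "Word-Function-Oriented Semantic Recognition of Academic Text and Knowledge Graph Construction" (Grant No. 71473183)
Ministry of Education Major Project of Key Research Bases for Humanities and Social Sciences, "Research on Fine-Grained Web Information Retrieval Models and Framework Construction" (Grant No. 10JJD630014); this paper is one of the research outputs of these projects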
Keywords
structure function
text classification
lexical feature