Abstract
Taking botany as a representative specialized domain, this paper explores automatic new word recognition in specialized fields. A set of 200 plant description documents randomly drawn from "Flora of China" (《中国植物志》) serves as the sample set. New word candidates are first extracted with an N-Gram statistical method applied to the output of the ICTCLAS word segmenter. The candidates are then ranked separately by term frequency (TF), document frequency (D), and average term frequency (TF/D), and those falling within a certain range of each ranking are taken as recognized new words. The experiments show that selecting candidates by TF gives the best recognition performance, with an F-measure of 0.65. The method can automatically produce a user dictionary for a specialized domain and is highly portable.
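The candidate extraction and ranking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the documents are already segmented into tokens (for example by ICTCLAS, which is not called here), that candidates are formed by joining 2 to `max_n` adjacent tokens, and that an optional `lexicon` of known words is filtered out. The function names `ngram_candidates` and `rank` and the parameters `max_n` and `lexicon` are hypothetical, and the abstract does not state what selection range was used.

```python
from collections import Counter


def ngram_candidates(docs, max_n=3, lexicon=frozenset()):
    """Collect new-word candidates from pre-segmented documents.

    docs    : list of documents, each a list of tokens (assumed segmenter output)
    max_n   : longest run of adjacent tokens joined into one candidate (assumption)
    lexicon : known words to exclude from the candidates (assumption)
    Returns (tf, df): total occurrences and number of documents per candidate.
    """
    tf, df = Counter(), Counter()
    for tokens in docs:
        seen = set()  # candidates found in this document, for the document frequency
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                cand = "".join(tokens[i:i + n])
                if cand in lexicon:
                    continue
                tf[cand] += 1
                seen.add(cand)
        df.update(seen)
    return tf, df


def rank(tf, df, key="tf"):
    """Sort candidates by TF ("tf"), document frequency ("df"), or TF/D ("avg")."""
    score = {
        "tf": lambda c: tf[c],
        "df": lambda c: df[c],
        "avg": lambda c: tf[c] / df[c],
    }[key]
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(tf, key=lambda c: (-score(c), c))


if __name__ == "__main__":
    # Toy segmented documents, invented for illustration only.
    docs = [["花", "冠", "筒", "状"], ["花", "冠", "黄", "色"], ["花", "冠", "花", "冠"]]
    tf, df = ngram_candidates(docs, max_n=2)
    for key in ("tf", "df", "avg"):
        print(key, rank(tf, df, key)[:3])
```

Taking the first few entries of each ranking, as in the usage lines, stands in for the "certain range" mentioned in the abstract; the actual cutoff would have to be tuned against a labelled sample.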
Source
《现代图书情报技术》
CSSCI
PKU Core Journal (北大核心)
2012, No. 2, pp. 41-47 (7 pages)
New Technology of Library and Information Service
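Funding
One of the research outputs of the Ministry of Education Humanities and Social Sciences Research Youth Fund project "Research on Web-based Chinese Academic Information Extraction Based on Deep Semantic Annotation: A Case Study of Biodiversity Description" (Project No. 10YJC870004)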