Abstract
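This paper proposes a method that combines a corpus-based approach with machine learning to identify simple, non-recursive noun phrases (Base NP) in English sentences. From a training corpus annotated with part-of-speech tags and Base NP boundaries, the part-of-speech sequences corresponding to all distinct types of Base NP (Base NP rules) are extracted. Using rule ranking and linguistic knowledge, rules that have low precision and clearly violate the grammar are removed. At identification time, a rule matching tree is used to perform longest-length matching, and context information is introduced through the inductive machine learning algorithm C4.5: C4.5 learns the conditions under which applying a Base NP rule is valid (or invalid), and rule application is constrained by these context conditions. Experimental results show that the proposed method achieves high precision and recall.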
A new method that combines a corpus-based approach with machine learning is put forward in this paper to identify simple, non-recursive noun phrases (BaseNP). First, all distinct part-of-speech (POS) sequences that correspond to BaseNPs (BaseNP rules) are extracted from a training corpus tagged with POS and with the boundaries of each BaseNP. Through rule ranking on the training data and with the help of linguistic knowledge, BaseNP rules that have low precision and are clearly ungrammatical are deleted. Second, the remaining BaseNP rules are used to identify BaseNPs in new sentences. In this process, a heuristic longest-match algorithm is applied, combined with inductive decision-tree learning so that context is taken into account. Experiments show that the method achieves high precision and recall.
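The rule-extraction and longest-match steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `RuleTrie`, the functions `extract_rules` and `chunk`, the toy corpus format, and the Penn Treebank style tags are all assumptions made for illustration. The optional `accept` callback is a hypothetical stand-in for the context conditions that the paper learns with C4.5; no decision tree is trained here.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

# Hypothetical context test: (tags, start, end) -> may the rule be applied here?
Accept = Callable[[Sequence[str], int, int], bool]


class RuleTrie:
    """Trie over POS-tag sequences; a node is terminal if its path is a rule."""

    def __init__(self) -> None:
        self.children: Dict[str, "RuleTrie"] = {}
        self.is_rule = False

    def add(self, tags: Sequence[str]) -> None:
        node = self
        for tag in tags:
            node = node.children.setdefault(tag, RuleTrie())
        node.is_rule = True

    def longest_match(self, tags: Sequence[str], start: int,
                      accept: Optional[Accept] = None) -> int:
        """Return the length of the longest accepted rule starting at `start`."""
        node: Optional[RuleTrie] = self
        best = 0
        for i in range(start, len(tags)):
            node = node.children.get(tags[i])
            if node is None:
                break
            if node.is_rule and (accept is None or accept(tags, start, i + 1)):
                best = i + 1 - start
        return best


def extract_rules(corpus: List[List[Tuple[bool, List[str]]]]) -> Set[Tuple[str, ...]]:
    """Collect the distinct POS sequences of all chunks marked as BaseNP.

    Assumed toy format: a sentence is a list of (is_base_np, tags) chunks.
    """
    return {tuple(tags) for sent in corpus for is_np, tags in sent if is_np}


def chunk(tags: Sequence[str], trie: RuleTrie,
          accept: Optional[Accept] = None) -> List[Tuple[int, int]]:
    """Scan left to right, taking the longest rule match at each position."""
    spans, i = [], 0
    while i < len(tags):
        n = trie.longest_match(tags, i, accept)
        if n:
            spans.append((i, i + n))
            i += n
        else:
            i += 1
    return spans


# Toy training data (assumed): two sentences with BaseNP chunks marked True.
corpus = [
    [(True, ["DT", "JJ", "NN"]), (False, ["VBZ"]), (True, ["DT", "NN"])],
    [(True, ["NNS"]), (False, ["VBP"]), (True, ["NN"])],
]
trie = RuleTrie()
for rule in extract_rules(corpus):
    trie.add(rule)

sentence = ["DT", "JJ", "NN", "VBZ", "DT", "NN", "IN", "NNS"]
spans = chunk(sentence, trie)
```

On the toy sentence, `spans` holds half-open index pairs for the three matched phrases. Passing an `accept` function lets a learned classifier veto a rule in a given context, in which case the matcher falls back to a shorter rule or to no match at that position.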
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journals (北大核心)
2000, Issue 7, pp. 826-832 (7 pages)
Journal of Computer Research and Development
Funding
National Natural Science Foundation of China
National "863" High-Tech Research and Development Program of China