Abstract
Classification algorithms based on the support vector machine (SVM) have attracted growing attention from researchers in China and abroad because of their sound theoretical foundation and good empirical results. Compared with other classification algorithms, SVMs, which are based on the structural risk minimization principle, generalize well in pattern recognition with small samples. Text chunking, a preprocessing step for parsing, divides text into a set of non-overlapping, syntactically related groups of words (chunks) and thereby reduces the difficulty of full parsing. In this paper, we treat Chinese chunk identification as a classification problem and solve it with SVMs. The chunking experiments were carried out on the Harbin Institute of Technology (HIT) Chinese Treebank corpus. The results show that the SVM approach is effective for Chinese chunking, achieving an F score of 88.67%, and that it is particularly suitable when only a limited amount of labeled Chinese data is available.
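As a rough illustration of how a chunk-level F score such as the one reported here can be computed, the sketch below assumes that chunks are encoded as per-token B-/I-/O tags and that F is the balanced harmonic mean of precision and recall. The abstract states neither the tag scheme nor the exact F definition, so both are assumptions, and the names `extract_chunks`, `f_score`, and the chunk types `NP`/`VP` are hypothetical.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) chunks in a B-/I-/O tag sequence.

    Assumed encoding (not taken from the paper): "B-X" opens a chunk of
    type X, "I-X" continues an open chunk of the same type, "O" is outside.
    Stray "I-" tags that do not continue an open chunk are ignored.
    """
    chunks = set()
    start, ctype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        inside = tag.startswith("I-") and ctype == tag[2:]
        if not inside:
            if ctype is not None:
                chunks.add((start, i, ctype))  # end index is exclusive
            start, ctype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return chunks


def f_score(gold_tags, pred_tags):
    """Balanced F score over exactly matching chunks (assumed definition)."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical five-token sentence: the prediction misses the last NP chunk.
gold = ["B-NP", "I-NP", "B-VP", "B-NP", "O"]
pred = ["B-NP", "I-NP", "B-VP", "O", "O"]
print(f_score(gold, pred))
```

A chunk counts as correct only when its boundaries and its type both match, so a classifier that labels every token of a chunk correctly except one gets no credit for that chunk.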
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2004, No. 2, pp. 1-7 (7 pages)
Journal of Chinese Information Processing
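Funding
National Natural Science Foundation of China (60083006)
National Key Basic Research Development Program of China, 973 Program (G1998030511)
Joint project funded by the National Natural Science Foundation of China and Microsoft Research Asia (60203019)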