Purpose: To design an efficient, high-performance algorithm for semantic annotation of biodiversity documents in Chinese.
Design/methodology/approach: The data set consists of 1,000 randomly selected documents from Flora of China. The proposed approach was evaluated against a Naïve Bayes algorithm that had been developed earlier for the same purpose.
Findings: Experimental results show that the heuristics-based algorithm outperformed the Naïve Bayes algorithm. The use of leading words helped improve annotation performance, while prioritizing rule application by rule weight had no significant impact on algorithm performance.
Research limitations: ICTCLAS was used off the shelf to identify word boundaries, without optimization for the biodiversity domain, so the tool may not have been used to best effect.
Practical implications & Originality/value: The heuristics-based approach, enhanced by leading-word analysis, reached an F value of 0.9216, which is sufficiently accurate for practical use.
Funding: Supported by the National Social Science Foundation of China (Grant No.: 11BTQ024) and the Foundation for Humanities and Social Sciences of the Chinese Ministry of Education (Grant No.: 10YJC87004).
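The abstract reports performance as an F value but does not say how it was computed. As a rough illustration only, the sketch below assumes the common balanced F measure (the harmonic mean of precision and recall), micro-averaged over annotated text segments. The function name, the segment labels, and the convention that None marks a segment the annotator left unlabeled are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence, Tuple


def precision_recall_f(
    gold: Sequence[str],
    predicted: Sequence[Optional[str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F) for one label per segment.

    Assumption: a predicted value of None means the annotator declined
    to label that segment, so it counts against recall but not precision.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Segments the annotator actually labeled.
    attempted = sum(1 for p in predicted if p is not None)
    # Labeled segments whose label matches the gold standard.
    correct = sum(
        1 for g, p in zip(gold, predicted) if p is not None and g == p
    )

    precision = correct / attempted if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # Balanced F measure: harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical gold and predicted labels for five description segments.
gold_labels = ["leaf", "leaf", "flower", "fruit", "stem"]
predicted_labels = ["leaf", "flower", "flower", "fruit", None]

p, r, f = precision_recall_f(gold_labels, predicted_labels)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

In this toy case three of the four labeled segments are correct (precision 0.75) and three of the five gold segments are recovered (recall 0.60), giving an F value of about 0.667. Under the same assumption, the reported 0.9216 would require both precision and recall to be high.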