Abstract
Constructing a concept map is a complex task, and the concept-term extraction stage usually requires domain experts to spend a great deal of time working by hand on unstructured text. As concept maps are used more and more widely in information processing and knowledge management systems, relying solely on manual term extraction can no longer meet demand. To address this, a method is proposed that combines web crawler technology with latent semantic analysis (LSA) to extract concept terms automatically and generate concept maps, which reduces the manual effort of building concept maps and could broaden their range of application. First, a large volume of domain text is crawled from designated websites. Next, the texts are preprocessed and features are extracted from them. Finally, LSA is used to mine the latent semantic structure among features and between features and texts, eliminating noisy and redundant features so that the concept terms can be extracted. Experimental results show that combining web crawler technology with LSA reduces the manual effort of concept-term extraction, removes redundant concepts, and improves accuracy.
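The preprocessing, feature extraction, and LSA steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the toy `docs` list standing in for crawled pages, the simple English tokenizer (the paper's corpus would need Chinese word segmentation), the use of power iteration in place of a library SVD, and the rule of scoring each term by the length of its vector in the rank-k latent space.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens (illustrative only)."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Build a raw-count matrix: one row per term, one column per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[float(c[term]) for c in counts] for term in vocab]
    return vocab, matrix


def top_singular_triplets(matrix, k, iterations=200, seed=0):
    """Approximate the k largest singular triplets by power iteration with deflation."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    residual = [row[:] for row in matrix]
    triplets = []
    for _ in range(k):
        # Start from a random positive unit vector.
        v = [rng.random() + 0.1 for _ in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            u = [sum(a * b for a, b in zip(row, v)) for row in residual]
            w = [sum(residual[i][j] * u[i] for i in range(n_rows))
                 for j in range(n_cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        u = [sum(a * b for a, b in zip(row, v)) for row in residual]
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma < 1e-12:
            break  # nothing left to decompose
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: subtract the component just found before searching for the next.
        for i in range(n_rows):
            for j in range(n_cols):
                residual[i][j] -= sigma * u[i] * v[j]
    return triplets


def rank_terms(docs, k=2):
    """Score each term by the length of its vector in the rank-k latent space."""
    vocab, matrix = term_document_matrix(docs)
    triplets = top_singular_triplets(matrix, k)
    scores = []
    for i, term in enumerate(vocab):
        weight = math.sqrt(sum((sigma * u[i]) ** 2 for sigma, u, _ in triplets))
        scores.append((term, weight))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for text crawled from domain websites.
docs = [
    "a concept map links concept terms with labeled relations",
    "latent semantic analysis finds structure between terms and documents",
    "a web crawler collects domain documents for concept term extraction",
    "the weather was pleasant on the day of the workshop",
]

for term, weight in rank_terms(docs, k=2)[:5]:
    print(f"{term}\t{weight:.3f}")
```

Terms with low weight in the reduced space would be the candidates to discard as noise. A realistic pipeline would also remove stop words and weight the counts (for example with TF-IDF) before decomposition, and would normally call a tested SVD routine instead of hand-written power iteration.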
Source
《计算机工程与设计》
CSCD
PKU Core Journal (北大核心)
2012, No. 7, pp. 2864-2867 (4 pages)
Computer Engineering and Design
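Funding
National Education Science Planning Project, National Youth Fund Program (CCA100176)
Scientific Research Fund of the Sichuan Provincial Education Department (09ZC080)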
Keywords
concept map
concept terms
web crawler technology
latent semantic analysis
features