Abstract
Topic mining from Web pages is important for natural language processing tasks such as Web text classification, automatic summarization, and information fusion, and it helps users better understand page content. Although some prior work mines concepts from ordinary text, it rarely considers how the label (tag) a word belongs to and the position at which the word appears affect the word's weight, and none of it gives a method for computing these two impact factors. Using WordNet, this paper extends Web page topics from the word level to the concept level. It proposes a topic concept mining method that uses part-of-speech tagging and word sense disambiguation to determine the sense of each word in a page, and that uses a label impact factor and a location impact factor to adjust the weights of the text features in the page body. Formulas for computing both impact factors are given. Experiments on the DMOZ dataset show that, compared with unadjusted weights, the adjusted weights clearly improve topic mining accuracy, which reaches 0.95 in the best case.
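The weight adjustment described in the abstract can be illustrated with a minimal sketch. The abstract does not state the paper's formulas, so everything below is an assumption made for illustration: the multipliers in `TAG_FACTOR`, the linear decay in `position_factor`, and the hand-written `concept_of` mapping, which stands in for the WordNet-based part-of-speech tagging and word sense disambiguation step. None of these names or values come from the paper.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag multipliers; the paper's actual label impact factor
# formula is not given in the abstract.
TAG_FACTOR = {"title": 3.0, "h1": 2.0, "b": 1.5}
DEFAULT_TAG_FACTOR = 1.0


def position_factor(index, total):
    """Hypothetical location factor: 2.0 for the first word, decaying linearly toward 1.0."""
    return 1.0 + (total - index) / total


class WordCollector(HTMLParser):
    """Collects (word, innermost enclosing tag) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.words = []   # (word, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else ""
        for raw in data.lower().split():
            word = raw.strip(".,;:!?")
            if word:
                self.words.append((word, tag))


def adjusted_weights(html, concept_of=None):
    """Sum, per concept, one count per occurrence scaled by both factors.

    concept_of is a hypothetical word-to-concept mapping standing in for
    the WordNet-based sense disambiguation described in the abstract.
    """
    concept_of = concept_of or {}
    parser = WordCollector()
    parser.feed(html)
    parser.close()
    total = len(parser.words)
    weights = defaultdict(float)
    for i, (word, tag) in enumerate(parser.words):
        concept = concept_of.get(word, word)
        factor = TAG_FACTOR.get(tag, DEFAULT_TAG_FACTOR)
        weights[concept] += factor * position_factor(i, total)
    return dict(weights)


if __name__ == "__main__":
    page = "<h1>Jaguar cars</h1><p>The jaguar is a car brand.</p>"
    for concept, weight in sorted(adjusted_weights(page, {"cars": "car"}).items(),
                                  key=lambda kv: -kv[1]):
        print(concept, weight)
```

In this sketch, words that appear in a heading and near the start of the page rank above body words that occur equally often, which is the effect the two impact factors are meant to capture.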
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal
2015, Issue 5, pp. 62-66 (5 pages)
Funding
National Natural Science Youth Fund project (20130206051GX)
Jilin Province Key Science and Technology Research Project (20130206051GX)
Keywords
Part-of-speech tagging
Word sense disambiguation
Label impact factor
Location impact factor
Weight adjustment