
Topic Concept Discovery for Web Pages
Abstract: Topic discovery from Web pages is important for natural language processing tasks such as Web text classification, automatic abstract generation, and information fusion, and mining a page's topics can help users better understand its content. Although some work addresses concept discovery from ordinary text, it rarely considers how the tag (label) a word belongs to and the position at which the word appears affect the word's weight, and none of it gives methods for calculating these two impact factors. Using WordNet, this paper extends Web page topics from the word level to the concept level. The proposed topic concept discovery method uses part-of-speech tagging and word sense disambiguation to determine the sense of each word in a page, then applies a label impact factor and a location impact factor to adjust the weights of the text features in the page's main content; calculation formulas for both factors are given. Experimental results on the DMOZ dataset show that, compared with unadjusted weights, the adjusted weights significantly improve topic mining accuracy, which reaches up to 0.95 in the best case.
Source: Computer Science (《计算机科学》), indexed in CSCD and the PKU Core Journals list (北大核心), 2015, No. 5, pp. 62-66 (5 pages)
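Funding: Supported by the National Natural Science Foundation of China Youth Program (20130206051GX) and the Jilin Province Key Science and Technology Research Project (20130206051GX)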
Keywords: part-of-speech tagging; word sense disambiguation; label impact factor; location impact factor; weight adjustment
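
The weight adjustment described in the abstract can be illustrated with a minimal Python sketch. This record does not include the paper's formulas, so everything specific here is an assumption made for illustration: the `TAG_FACTOR` table, the linear `location_factor`, the multiplicative combination of the two factors, and the concept labels in the example. The input is assumed to be a list of (concept, tag) pairs in document order, as might be produced after part-of-speech tagging and word sense disambiguation.

```python
from collections import defaultdict

# Hypothetical label (tag) impact factors; the paper's actual values are not given here.
TAG_FACTOR = {"title": 3.0, "h1": 2.0, "b": 1.5, "p": 1.0}


def location_factor(index, total):
    """Hypothetical location impact factor: decays linearly from 1.5 (first word) to 1.0 (last word)."""
    if total <= 1:
        return 1.5
    return 1.5 - 0.5 * index / (total - 1)


def adjusted_weights(tokens):
    """Sum, per concept, one occurrence count scaled by the tag and location factors.

    tokens: list of (concept, tag) pairs in document order.
    Unknown tags fall back to a neutral factor of 1.0.
    """
    weights = defaultdict(float)
    total = len(tokens)
    for i, (concept, tag) in enumerate(tokens):
        weights[concept] += TAG_FACTOR.get(tag, 1.0) * location_factor(i, total)
    return dict(weights)


def top_concepts(tokens, k=3):
    """Return the k concepts with the highest adjusted weight (ties broken by name)."""
    w = adjusted_weights(tokens)
    return sorted(w, key=lambda c: (-w[c], c))[:k]


# Illustrative input: both concepts occur twice, so plain frequency cannot separate them.
tokens = [
    ("jaguar (animal)", "title"),
    ("car", "p"),
    ("car", "p"),
    ("jaguar (animal)", "p"),
]
print(top_concepts(tokens, k=1))
```

In this toy input the two concepts have equal raw frequency; only the assumed tag and location factors move the concept that appears in the title to the top.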