Abstract
This paper reports experiments on two representative corpora, one of news articles and one of micro-blog posts, to determine how different part-of-speech (POS) features and their combinations affect topic detection on these two common types of web platform. The results show that, when a single part of speech is used as the feature, nouns give the best detection results, while named entities greatly reduce the dimensionality of the clustering features without sacrificing accuracy. When a combination of parts of speech is used as the feature, the combination of nouns or named entities, numerals, time phrases, adjectives, and quantifiers improves the accuracy of topic detection on news, whereas the combination of nouns or named entities, adjectives, quantifiers, numerals, and special symbols and URLs gives good detection results on the micro-blog corpus.
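The feature-selection idea summarized above can be illustrated with a minimal sketch. It assumes documents are already segmented and POS-tagged; the toy documents, the tag labels (`n`, `ns`, `v`, `a`, `m`), and the helper names `pos_filtered_vectors` and `cosine` are all hypothetical and are not taken from the paper, whose tagset and clustering algorithm are not stated in the abstract.

```python
from collections import Counter

# Hypothetical pre-tagged documents as (token, tag) pairs.
# Tag labels are illustrative: n = noun, ns = place name (a named entity),
# v = verb, a = adjective, m = numeral.
DOCS = [
    [("earthquake", "n"), ("strikes", "v"), ("Sichuan", "ns"), ("7.0", "m"), ("strong", "a")],
    [("Sichuan", "ns"), ("earthquake", "n"), ("rescue", "n"), ("continues", "v")],
    [("team", "n"), ("wins", "v"), ("final", "n"), ("3", "m"), ("great", "a")],
]


def pos_filtered_vectors(docs, keep_tags):
    """Build bag-of-words count vectors from tokens whose tag is in keep_tags."""
    filtered = [[tok for tok, tag in doc if tag in keep_tags] for doc in docs]
    vocab = sorted({tok for doc in filtered for tok in doc})
    vectors = []
    for doc in filtered:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


all_tags = {"n", "ns", "v", "a", "m"}
vocab_all, _ = pos_filtered_vectors(DOCS, all_tags)
vocab_noun, vec_noun = pos_filtered_vectors(DOCS, {"n", "ns"})

# Restricting features to nouns and named entities shrinks the feature space.
print(len(vocab_all), len(vocab_noun))
# Similarity between the two earthquake documents vs. an unrelated one.
print(cosine(vec_noun[0], vec_noun[1]), cosine(vec_noun[0], vec_noun[2]))
```

In this toy setting, keeping only noun and named-entity tokens reduces the vocabulary while the two documents on the same topic remain more similar to each other than to the unrelated one. Any clustering step run on such vectors would be a separate choice and is not shown here.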
Source
《北京工业大学学报》
CAS
CSCD
Peking University Core Journals (北大核心)
2015, Issue 4, pp. 526-533 (8 pages)
Journal of Beijing University of Technology
Funding
Key Program of the National Natural Science Foundation of China (613300194)
Keywords
topic detection
part of speech
text feature
news
micro-blog