Abstract
Text interval hot word query extracts hot words from text data within a user-specified query time range. Existing hot word extraction algorithms are mainly designed for mining tasks; their time complexity is high, which makes them difficult to apply directly to online query processing of hot words. To address this, an online query processing algorithm for text interval hot words is proposed. By using data partitioning and range query techniques, it reduces the time complexity of hot word extraction while keeping accuracy and space complexity unchanged. Experimental results show that, compared with existing mining-oriented algorithms, the running time of the proposed algorithm over the entire time range covered by the CNN, BBC and NYT datasets is reduced by 59.7%, 65.1% and 75.5%, respectively, which effectively improves the efficiency of online hot word queries.
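The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea of combining data partitioning with range queries, not the authors' method. It assumes integer timestamps (for example, day numbers), fixed-width time buckets, and raw term frequency as the "hotness" score; the class name `IntervalHotWords` and its parameters are hypothetical. Buckets that lie fully inside the query range contribute their precomputed counts, and only the partial buckets at the two ends are scanned, so the counts match those of a full scan.

```python
from bisect import bisect_left
from collections import Counter


class IntervalHotWords:
    """Hypothetical index: per-bucket word counts plus time-sorted documents."""

    def __init__(self, docs, bucket_size):
        # docs: iterable of (timestamp, list_of_words); timestamps are ints (assumption).
        self.bucket_size = bucket_size
        self.docs = sorted(docs, key=lambda d: d[0])
        self.times = [t for t, _ in self.docs]
        # Data partitioning: bucket b covers [b * bucket_size, (b + 1) * bucket_size).
        self.buckets = {}
        for t, words in self.docs:
            self.buckets.setdefault(t // bucket_size, Counter()).update(words)

    def _scan(self, lo, hi, counts):
        # Range query over the sorted timestamps: add words of docs with lo <= t < hi.
        i = bisect_left(self.times, lo)
        j = bisect_left(self.times, hi)
        for _, words in self.docs[i:j]:
            counts.update(words)

    def counts(self, start, end):
        # Word counts for documents with start <= t <= end.
        bs = self.bucket_size
        hi = end + 1                 # work with the half-open range [start, hi)
        first_full = -(-start // bs)  # first bucket starting at or after `start`
        last_full = hi // bs          # buckets first_full .. last_full - 1 are fully covered
        total = Counter()
        if first_full >= last_full:
            # No bucket is fully covered: scan the whole range directly.
            self._scan(start, hi, total)
            return total
        self._scan(start, first_full * bs, total)      # partial bucket on the left
        for b in range(first_full, last_full):          # precomputed full buckets
            total.update(self.buckets.get(b, Counter()))
        self._scan(last_full * bs, hi, total)           # partial bucket on the right
        return total

    def query(self, start, end, k):
        # Top-k words by frequency in [start, end]; frequency as hotness is an assumption.
        return self.counts(start, end).most_common(k)


if __name__ == "__main__":
    docs = [(1, ["a", "b"]), (5, ["a"]), (12, ["b", "c"]),
            (25, ["a", "c", "c"]), (31, ["c"])]
    index = IntervalHotWords(docs, bucket_size=10)
    print(index.query(5, 25, k=2))
```

In this sketch the bucket width trades index size against the amount of scanning at the range boundaries; how the paper chooses its partitions is not stated in the abstract.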
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journals (北大核心)
2018, No. 2, pp. 17-23, 30 (8 pages in total)
Funding
National Natural Science Foundation of China (61370080)
Shanghai Science and Technology Innovation Action Plan Project (16DZ1100200)