Funding: Supported by the National Natural Science Foundation of China (No. 61502312), the Fundamental Research Funds for the Central Universities (No. 2017BQ024), the Natural Science Foundation of Guangdong Province (No. 2017A030310428), and the Science and Technology Program of Guangzhou (No. 201806020075, 20180210025).
Abstract: Single-pass clustering is commonly used in topic detection and tracking (TDT) because of its simplicity, high efficiency, and low cost. On large-scale data, however, its time cost rises sharply and clustering quality degrades. To address this problem, a hierarchical clustering algorithm based on single-pass is proposed; drawing on hierarchical and concurrent ideas, it divides the clustering process into three stages. News reports are first classified into categories. Two rounds of single-pass clustering are then performed within each category, followed by one round of agglomerative clustering across categories. In addition, to capture semantic similarity between news reports, the topic model is improved using named entities. Experimental results show that the proposed method both accelerates clustering and improves its performance.
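For readers unfamiliar with the baseline, the following is a minimal sketch of plain single-pass clustering only, not the three-stage hierarchical method the abstract proposes. The whitespace tokenizer, raw term counts, cosine similarity, the `threshold` value, and the sample headlines are all illustrative assumptions and do not come from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold=0.3):
    """Visit each document once: join the most similar cluster or open a new one."""
    centroids = []  # one Counter of summed term counts per cluster
    labels = []
    for text in docs:
        vec = Counter(text.lower().split())  # assumed tokenizer: whitespace split
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            centroids[best].update(vec)  # fold the document into the cluster
            labels.append(best)
        else:
            centroids.append(Counter(vec))  # start a new cluster
            labels.append(len(centroids) - 1)
    return labels


# Hypothetical headlines, used only to exercise the function.
docs = [
    "earthquake hits city rescue teams arrive",
    "rescue teams search city after earthquake",
    "team wins football final",
    "football final ends with late goal",
]
print(single_pass(docs))  # prints [0, 0, 1, 1]
```

Each new document is compared against every existing cluster, so the cost grows with the number of clusters. That is consistent with the abstract's motivation for classifying reports into categories first and clustering within each category.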
Funding: Funded by the Planning Project of the National Language Committee in the "12th 5-year Plan" (No. YB125-49), the Foundation for Key Program of the Ministry of Education, China (No. 212167), and the Fundamental Research Funds for the Central Universities (No. SWJTU12CX096).
Abstract: Quickly and accurately detecting new topics in massive online data has become a central problem in monitoring public opinion in cyberspace. This paper presents a new event detection method based on a sorted subtopic matching algorithm and lays out the complete design framework. The subtopics contained in old topics (or news stories) are sorted in descending order of their importance to the topic (or news story), forming a sorted subtopic sequence. During subtopic matching, a subtopic scoring matrix is used to determine whether a new story reports a new event. Experimental results show that the sorted subtopic matching model improves the accuracy and effectiveness of new event detection in cyberspace.
Funding: Supported by the National Science and Technology Support Program of China (No. 2012BAH45B01), the National Natural Science Foundation of China (No. 61100189, 61370215, 61370211, 61402137), and the National “242” Project of China (No. 2016A104).
Abstract: Low accuracy in resource relevance causes the content of detected communities to deviate. To address this, an algorithm based on the topology of sites and the similarity between their topics is proposed. By fully considering topic content factors, the algorithm searches for topically similar site clusters on the premise of inter-site topology. Experimental results show that the algorithm produces more accurate detection results on a real network.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. U1636208, 61862008, 61902013) and the Beihang Youth Top Talent Support Program (Grant No. YWF-21-BJJ-1039).
Abstract: Network intrusion poses a severe threat to the Internet. However, existing intrusion detection models cannot effectively distinguish between intrusions whose features overlap heavily, and efficient real-time detection remains an urgent problem. To address these two problems, we propose a Latent Dirichlet Allocation topic model-based framework for real-time network Intrusion Detection (LDA-ID), consisting of static and online LDA-ID. The feature overlap problem is transformed into topic number optimization and topic selection in static LDA-ID, so that detection is based on latent topic features. To achieve efficient real-time detection, we design an online computing mode for static LDA-ID, in which a momentum-based parameter iteration method balances the contribution of prior knowledge and new information. Furthermore, we design two matching mechanisms, one each for static and online LDA-ID. Experimental results on the public NSL-KDD and UNSW-NB15 datasets show that our framework achieves higher accuracy than the compared methods.
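The abstract does not give the momentum-based iteration rule, so the sketch below only illustrates the general idea of balancing prior knowledge against new information. It assumes an exponential-moving-average blend of a previously learned topic-word distribution with an estimate from a new batch. The function name `momentum_update`, the `momentum` value, and the feature names are hypothetical.

```python
def momentum_update(prior, batch, momentum=0.9):
    """Blend a prior distribution with a new batch estimate, then renormalize.

    A momentum near 1 favors the prior; a momentum near 0 favors the batch.
    This is an assumed update rule, not the one defined in the paper.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    keys = set(prior) | set(batch)
    mixed = {
        k: momentum * prior.get(k, 0.0) + (1.0 - momentum) * batch.get(k, 0.0)
        for k in keys
    }
    total = sum(mixed.values())
    if total == 0.0:
        raise ValueError("both distributions are empty or all zero")
    return {k: v / total for k, v in mixed.items()}


# Hypothetical word probabilities for one topic before and after a new batch.
prior = {"tcp": 0.7, "syn": 0.3}
batch = {"tcp": 0.2, "syn": 0.3, "udp": 0.5}
updated = momentum_update(prior, batch, momentum=0.8)
print(sorted(updated.items()))
```

With `momentum=0.8`, the feature `udp`, which is absent from the prior, receives a small probability of about 0.1, while `tcp` stays dominant at about 0.6.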
Abstract: COVID-19 has become one of the most severe diseases of recent years. Because it has disrupted the livelihoods of people across the world, administrators and healthcare professionals need to be aware of the community's views in order to monitor the severity of the outbreak's spread. Public opinion is shared in large volumes on microblogging media such as Twitter, which is one of the most popular sources for collecting opinions on topics such as politics, sports, and entertainment. This work combines an Intensity Based Emotion Classification Convolution Neural Network (IBEC-CNN) model with Non-negative Matrix Factorization (NMF) to detect and analyze the topics discussed in COVID-19 tweets as well as the intensity of their emotional content. Topics are identified using NMF, and emotions are classified using the pretrained IBEC-CNN based on predefined intensity scores. The research aims to identify the emotions in Indian tweets related to COVID-19 and to produce a list of the topics users discussed during the pandemic. A large number of COVID-19 tweets were retrieved through the Twitter Application Programming Interface (Twitter API) during January and July 2020. The extracted tweets are analyzed for the emotions fear, joy, sadness, and trust with the pretrained IBEC-CNN model. Each classified tweet is given an intensity score from 1 to 3, where 1 is low, 2 is moderate, and 3 is high intensity for the emotion. NMF is employed to identify the topics in the tweets and the themes of those topics. The emotion analysis finds that positive tweets outnumber negative tweets during the period considered, and that negative tweets related to COVID-19 account for less than 5%. In addition, more than 75% of the negative tweets, which express sadness and fear, are of low intensity. A qualitative analysis groups the detected topics into themes such as economic impacts, case reports, treatments, entertainment, and vaccination. The results show that pandemic-related issues are expressed with different emotions on Twitter, which helps in interpreting public views during the pandemic and is useful for planning the dissemination of factual health statistics to build public trust. In the performance comparison, the proposed IBEC-CNN model outperforms the conventional models and achieves 83.71% accuracy. The share of COVID-19 tweets discussing each topic varies from 7.45% to 26.43% across the topics Economy, Statistics on Cases, Government/Politics, Entertainment, Lockdown, Treatments, and Virtual Events. Government/Politics was discussed in the fewest tweets and Treatments in the most.
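To make the topic-identification step more concrete, here is a small pure-Python sketch of NMF using the standard multiplicative update rules for squared error. It is not the authors' pipeline: the toy document-term counts, the vocabulary, the rank `k=2`, the iteration count, and the random initialization are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(r) for r in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of v - w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative v (docs x terms) into w (docs x k) and h (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical term counts for four short texts over a four-word vocabulary.
terms = ["vaccine", "dose", "lockdown", "curfew"]
v = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 2, 6]]
w, h = nmf(v, k=2)
for topic in h:
    ranked = sorted(zip(topic, terms), reverse=True)
    print([term for _, term in ranked[:2]])
```

Each row of `h` is a topic expressed as term weights, and each row of `w` gives how strongly a text loads on each topic. In practice a library implementation with TF-IDF features would normally be used instead of this hand-written loop.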
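Abstract: Revealing the path of technological evolution is a prerequisite for grasping the laws of technological development. Topic mining based on patent information is an important line of research that presents macro-level regularities through the micro-level mechanisms of technological development, and it is of great significance for forward-looking technology planning and innovation-driven practice. DPL-BMM (Dirichlet process biterm-based mixture model with labelling), a method for analyzing the dynamic evolution of technology topics, is a labelled mixture model based on biterms and the Dirichlet process. It overcomes the limitation of traditional topic models, which require the number of topics to be fixed in advance, and adds a technology topic representation module that makes the content of the identified topics clearer. This paper conducts an empirical analysis using technologies in the field of artificial intelligence as an example; the results show that the method has practical value for presenting technology topics and their evolution.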