This study introduces the Orbit Weighting Scheme (OWS), a novel approach aimed at enhancing the precision and efficiency of vector space information retrieval (IR) models, which have traditionally relied on weighting schemes like tf-idf and BM25. These conventional methods often struggle with accurately capturing document relevance, leading to inefficiencies in both retrieval performance and index size management. OWS proposes a dynamic weighting mechanism that evaluates the significance of terms based on their orbital position within the vector space, emphasizing term relationships and distribution patterns overlooked by existing models. Our research focuses on evaluating OWS's impact on model accuracy using information retrieval metrics like Recall, Precision, Interpolated Average Precision (IAP), and Mean Average Precision (MAP). Additionally, we assess OWS's effectiveness in reducing the inverted index size, crucial for model efficiency. We compare OWS-based retrieval models against others using different schemes, including tf-idf variations and BM25Delta. Results reveal OWS's superiority, achieving 54% Recall and 81% MAP, and a notable 38% reduction in the inverted index size. This highlights OWS's potential in optimizing retrieval processes and underscores the need for further research in this underrepresented area to fully leverage OWS's capabilities in information retrieval methodologies.
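The abstract does not define OWS's weighting formula, but the two baselines it compares against are standard. A minimal sketch of per-term tf-idf and Okapi BM25 scoring (the k1 and b defaults are the usual textbook values, not taken from this paper):

```python
import math

def tf_idf(tf, df, n_docs):
    """Classic tf-idf: term frequency scaled by inverse document frequency."""
    return tf * math.log(n_docs / df)

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 contribution of one term, with document-length normalization."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

Note how BM25 saturates with term frequency and penalizes longer-than-average documents, the two behaviours plain tf-idf lacks.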
Purpose - Natural languages have a fundamental quality of suppleness that makes it possible to present a single idea in many different ways. This feature is often exploited in the academic world, leading to the theft of work referred to as plagiarism. Many approaches have been put forward to detect such cases based on various text features and grammatical structures of languages. However, there is considerable scope for improvement in detecting intelligent plagiarism. Design/methodology/approach - To realize this, the paper introduces a hybrid model that detects intelligent plagiarism by breaking the entire process into three stages: (1) clustering, (2) vector formulation in each cluster based on semantic roles, normalization and similarity index calculation, and (3) summary generation using an encoder-decoder. An effective weighting scheme based on K-means has been introduced to select the terms used to build vectors; it is calculated over the synonym set for each term. Only if the value calculated in the preceding stage lies above a predefined threshold is the next semantic argument analyzed. When the similarity score for two documents is beyond the threshold, a short summary of the plagiarized documents is created. Findings - Experimental results show that this method is able to detect the connotation and concealment used in idea plagiarism, besides detecting literal plagiarism. Originality/value - The proposed model can help academics stay updated by providing summaries of relevant articles. It would curb the practice of plagiarism, which is infesting the academic community at an unprecedented pace. The model will also accelerate the process of reviewing academic documents, aiding the speedy publication of research articles.
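Stage (2) of the pipeline, computing a similarity index over normalized vectors and comparing it against a predefined threshold, can be sketched as follows; the 0.8 threshold is an illustrative assumption, not the paper's value:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def flag_plagiarism(vec_a, vec_b, threshold=0.8):
    """Flag a document pair when similarity exceeds the threshold."""
    return cosine(vec_a, vec_b) >= threshold
```

Because cosine similarity is scale-invariant, a document and a proportionally reweighted copy of it score 1.0, which is what makes it a natural choice for paraphrase-heavy comparisons.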
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP over this period has been devoted to finding and optimizing solutions to this problem, in effect to feature selection in NLP. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as a data-structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, which are typically referred to as word embeddings. In this review of algorithms such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
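The distributional hypothesis the survey builds on can be demonstrated in a few lines: represent each word by the counts of the words it co-occurs with, and words used in similar contexts come out close under cosine similarity. The toy corpus below is illustrative only:

```python
import math
from itertools import combinations

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks rose on the market",
    "stocks fell on the market",
]

# Represent each word by the counts of the words it co-occurs with
# in the same sentence (the distributional hypothesis in miniature).
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}
vectors = {w: [0] * len(vocab) for w in vocab}
for sent in corpus:
    for a, b in combinations(sent.split(), 2):
        if a != b:
            vectors[a][index[b]] += 1
            vectors[b][index[a]] += 1

def cos(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))
```

Here "cat" ends up closer to "dog" than to "rose", purely from context counts; LSA, Word2Vec and their successors are, at heart, increasingly sophisticated compressions of matrices like this one.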
Predicting the anomalous behaviour of a running process using its system call trace is a common practice in the security community and is still an active research area. It is a typical pattern recognition problem and can be dealt with using machine learning algorithms. Standard system call datasets were employed to train these algorithms. However, advancements in operating systems have made these datasets outdated and irrelevant. The Australian Defence Force Academy Linux Dataset (ADFA-LD) and the Australian Defence Force Academy Windows Dataset (ADFA-WD) are new-generation system call datasets that contain labelled system call traces for modern exploits and attacks on various applications. In this paper, we evaluate the performance of the Modified Vector Space Representation technique on the ADFA-LD and ADFA-WD datasets using various classification algorithms. Our experimental results show that our method performs well and helps accurately distinguish process behaviour through system calls.
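The paper's Modified Vector Space Representation is not detailed in this abstract, but the general idea of turning a variable-length system call trace into a fixed feature vector is commonly done with call n-grams. A hedged sketch with made-up syscall IDs (not drawn from ADFA-LD/ADFA-WD):

```python
from collections import Counter

def ngram_vector(trace, n=2):
    """Bag of system-call n-grams: a sparse feature vector a classifier can consume."""
    return Counter(zip(*(trace[i:] for i in range(n))))

# Hypothetical syscall-ID traces for illustration.
normal = [5, 3, 5, 3, 5, 3]
attack = [5, 3, 9, 9, 9, 3]
v_normal = ngram_vector(normal)
v_attack = ngram_vector(attack)
# The attack trace introduces bigrams such as (9, 9) never seen in normal runs,
# which is exactly the signal a classifier learns to separate on.
```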
A hybrid model based on the combination of keywords and concepts is put forward. The hybrid model is built on the vector space model and a probabilistic reasoning network. It can not only exploit the advantages of both keyword retrieval and concept retrieval but also compensate for their shortcomings. Its parameters can be adjusted for different usage scenarios in order to obtain the best information retrieval results, as our experiments confirm.
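The adjustable combination of keyword and concept evidence suggests a parameterized blend of two relevance scores; the linear form and the alpha parameter below are assumptions for illustration, not the paper's exact formulation:

```python
def hybrid_score(keyword_score, concept_score, alpha=0.5):
    """Blend keyword-match and concept-match relevance.
    alpha=1 gives pure keyword retrieval, alpha=0 pure concept retrieval;
    intermediate values trade one source of evidence against the other."""
    return alpha * keyword_score + (1 - alpha) * concept_score
```

Tuning alpha per collection or query type is the simplest way to realize the abstract's claim that "parameters can be adjusted according to different usage".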
We propose an algorithm for learning hierarchical user interest models from the Web pages users have browsed. In this algorithm, a user's interests are represented as a tree, called a user interest tree, whose content and structure can change simultaneously to adapt to changes in the user's interests. This expression represents a user's specific and general interests as a continuum. In some sense, specific interests correspond to short-term interests, while general interests correspond to long-term interests, so this representation reflects users' interests more realistically. The algorithm can automatically model a user's multiple interest domains, dynamically generate the interest models, and prune a user interest tree when the number of nodes in it exceeds a given value. Finally, we show experimental results from a Chinese website.
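The pruning step, removing nodes once the tree exceeds a size budget, can be sketched as follows. The weight field and the drop-the-weakest-leaf policy are illustrative assumptions; the paper's actual pruning criterion may differ:

```python
class InterestNode:
    def __init__(self, topic, weight=1.0):
        self.topic = topic        # interest domain label
        self.weight = weight      # interest strength (assumed to decay over time)
        self.children = []        # more specific sub-interests

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)

def prune(root, max_nodes):
    """Drop the lowest-weight leaf repeatedly until the tree fits the budget."""
    while count_nodes(root) > max_nodes:
        leaves = []
        def collect(n, parent):
            if not n.children and parent is not None:
                leaves.append((n, parent))
            for c in n.children:
                collect(c, n)
        collect(root, None)
        victim, parent = min(leaves, key=lambda pair: pair[0].weight)
        parent.children.remove(victim)
```

Pruning leaves rather than internal nodes preserves the specific-to-general continuum the abstract describes: general (long-term) interests near the root survive, while weak specific (short-term) interests are forgotten first.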
Funding: This work is supported by the Technical Education Quality Program (TEQIP) III. The project is implemented by NPIU, a unit of MHRD, Govt of India, for the implementation of World Bank assisted projects in technical education.
The modular multilevel converter (MMC) offers stable operation and versatile control, and its accurate modeling has become a topic of active research. This paper proposes a complex-vector-based method for modeling the AC- and DC-side impedances of an MMC. Following the basic principles of complex-vector modeling, the positive- and negative-sequence conversion relationships of the converter are analyzed; taking three-phase coupling into account, the key points of modeling a three-phase MMC are outlined from a differential-mode/common-mode perspective, and a complete time-domain model of the three-phase system is established. Harmonic state space (HSS) theory is then introduced to process the model and improve modeling accuracy, and the AC/DC-side impedance model of the MMC is built using an interface matrix and the basic principles of impedance calculation. Finally, comparative experiments on a simulation platform confirm the feasibility of the modeling method.
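As a sketch of the harmonic state space lifting mentioned above (standard HSS notation, not necessarily the paper's symbols): a linear time-periodic model $\dot{x}=A(t)x+B(t)u$ with $A(t)=\sum_k A_k e^{jk\omega_1 t}$ is converted, by expanding states and inputs in Fourier harmonics $x(t)=\sum_k X_k e^{jk\omega_1 t}$, into a time-invariant harmonic-domain system:

```latex
s X = (\mathcal{A} - \mathcal{N}) X + \mathcal{B} U, \qquad
\mathcal{N} = \operatorname{diag}(\dots,\, -j\omega_1 I,\; 0,\; j\omega_1 I,\, \dots)
```

where $\mathcal{A}$ and $\mathcal{B}$ are block-Toeplitz matrices built from the Fourier coefficients $A_k$, $B_k$, and $\mathcal{N}$ accounts for frequency shifting between harmonics. The AC/DC-side impedance then follows from the input-output relation of this lifted LTI system.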
Funding: Supported by the National Natural Science Foundation of China (69973012, 60273080).