Funding: Supported by the National Natural Science Foundation of China (No. 62271456) and the Open Projects Program of the State Key Laboratory of Multimodal Artificial Intelligence Systems.
Abstract: Low-resource text plagiarism detection is difficult because little labeled training data is available. The task requires algorithms that can identify similarities and differences between texts, particularly for semantic rewriting and translation-based plagiarism. In this paper, we present an enhanced attentive Siamese Long Short-Term Memory (LSTM) network designed for Tibetan-Chinese plagiarism detection. We first introduce translation-based data augmentation to expand the bilingual training dataset. We then propose a pre-detection method that uses abstract document vectors to improve detection efficiency. Finally, we introduce an improved attentive Siamese LSTM network tailored to Tibetan-Chinese plagiarism detection. We conduct comprehensive experiments to demonstrate the effectiveness of the proposed plagiarism detection framework.
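The pre-detection step described above can be illustrated with a minimal sketch. It assumes that pre-detection means discarding source documents whose document vector is dissimilar to the suspicious document before the costlier Siamese LSTM is run. The bag-of-words vector, the cosine measure, the 0.3 threshold, and the names `doc_vector`, `cosine`, and `pre_detect` are all illustrative assumptions; the abstract does not specify how the abstract document vectors are built.

```python
import math
from collections import Counter

def doc_vector(tokens):
    # Assumption: a bag-of-words count vector stands in for the paper's
    # "abstract document vector", whose construction is not given.
    return Counter(tokens)

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pre_detect(suspicious_tokens, sources, threshold=0.3):
    # Keep only the sources similar enough to deserve the expensive
    # pairwise model; the threshold value is a hypothetical choice.
    query = doc_vector(suspicious_tokens)
    return [name for name, tokens in sources.items()
            if cosine(query, doc_vector(tokens)) >= threshold]

sources = {"a": ["cat", "sat", "mat"], "b": ["dog", "ran", "far"]}
candidates = pre_detect(["cat", "sat", "hat"], sources)
```

Only the documents returned in `candidates` would then be passed to the pairwise model, which is where the efficiency gain would come from under these assumptions.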
Funding: This work is supported by the Technical Education Quality Program (TEQIP III). The project is implemented by NPIU, a unit of MHRD, Government of India, which implements World Bank-assisted projects in technical education.
Abstract: Purpose: Natural languages are flexible enough that a single idea can be expressed in many different ways. This feature is often exploited in the academic world, leading to the theft of work known as plagiarism. Many approaches have been proposed to detect such cases based on various text features and the grammatical structures of languages. However, there is still considerable room for improvement in detecting intelligent plagiarism. Design/methodology/approach: To address this, the paper introduces a hybrid model that detects intelligent plagiarism in three stages: (1) clustering; (2) vector formulation in each cluster based on semantic roles, normalization, and similarity index calculation; and (3) summary generation using an encoder-decoder. A weighting scheme based on K-means, calculated on the synonym set of each term, is introduced to select the terms used to build the vectors. The next semantic argument is analyzed only if the value calculated in the last stage lies above a predefined threshold. When the similarity score for two documents exceeds the threshold, a short summary of the plagiarized documents is created. Findings: Experimental results show that this method is able to detect the connotation and concealment used in idea plagiarism, in addition to literal plagiarism. Originality/value: The proposed model can help academics stay up to date by providing summaries of relevant articles. It would help eliminate the practice of plagiarism, which is spreading through the academic community at an unprecedented pace. The model will also accelerate the review of academic documents, aiding the speedy publication of research articles.
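The threshold-gated comparison of semantic arguments can be sketched as follows. This is only an illustration under stated assumptions: the role names (`agent`, `predicate`, `patient`), the Jaccard index as the similarity measure, the 0.5 threshold, and the choice to treat a score equal to the threshold as passing are all hypothetical, since the abstract names neither the roles nor the similarity index.

```python
def jaccard(a, b):
    # Set overlap, used here as a stand-in similarity index (assumption).
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def gated_similarity(doc_a, doc_b,
                     roles=("agent", "predicate", "patient"),
                     threshold=0.5):
    # Compare the documents one semantic role at a time and stop at the
    # first role whose score falls below the threshold, so later roles
    # are analyzed only when the earlier ones look similar.
    scores = []
    for role in roles:
        score = jaccard(doc_a.get(role, ()), doc_b.get(role, ()))
        scores.append(score)
        if score < threshold:
            return False, scores
    return True, scores

a = {"agent": ["author"], "predicate": ["copy", "reuse"], "patient": ["text"]}
b = {"agent": ["author"], "predicate": ["copy"], "patient": ["essay"]}
flagged, per_role = gated_similarity(a, b)
```

The early exit is the point of the sketch: a pair that already disagrees on the first role never pays for the remaining comparisons.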
Funding: Supported by the Social Science Fund of Heilongjiang Province (No. 18TQB103), the National Natural Science Foundation of China (No. 61806075, No. 61772177), and the Natural Science Foundation of Heilongjiang Province (No. F2018029).
Abstract: Plagiarism detection systems play an essential role in improving education quality by helping teachers detect plagiarism. The most common approach in plagiarism detection tools is to use a number of measures customized to identify occurrences of plagiarism. This approach is simple and effective, but it lacks flexibility in more complicated situations. This paper proposes MLChecker, a smart plagiarism detection system that provides more flexible detection tactics. MLChecker uses an automatic plagiarism dataset construction method to dynamically update its plagiarism detection algorithms according to the detection task. MLChecker also provides full-process quality management functions. The results show that detection accuracy is raised by 56%. Compared with traditional plagiarism detection tools, MLChecker achieves higher accuracy and efficiency.
Funding: Supported by the National Social Science Foundation of China (No. 14CTQ032) and the National Natural Science Foundation of China (No. 61370170).
Abstract: Plagiarism source retrieval is the core task of plagiarism detection. It has become standard practice to use queries extracted from suspicious documents to retrieve the plagiarism sources, so generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic-based method has its own advantages, and none statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements to heuristic methods rely mainly on the experience of experts, which makes it difficult to put forward new heuristics that overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach that selects the best queries from the candidates. Query generation for source retrieval is formulated as a ranking framework that aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method uses learning to rank to generate queries from the candidates. To our knowledge, this is the first work to apply machine learning methods to the problem of query generation for source retrieval. To address the absence of training data for learning to rank, we also build training samples for source retrieval. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that the proposed machine-learning-based query generation method yields statistically significant improvements in source retrieval effectiveness over the established baselines.
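The idea of selecting queries by learning to rank can be illustrated with a small pairwise sketch. It is not the authors' model: the pairwise perceptron update, the two-dimensional feature vectors, the retrieval-quality labels, and the names `train_pairwise` and `best_query` are assumptions made for illustration. The sketch assumes each candidate query of a segment comes with a feature vector and, at training time, a measured retrieval quality.

```python
def score(weights, features):
    # Linear scoring function over hypothetical query features.
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(segments, epochs=20, lr=0.1):
    # segments: one list per suspicious-document segment, holding
    # (features, retrieval_quality) for each candidate query.
    # Whenever a better query is not scored strictly higher than a worse
    # one from the same segment, nudge the weights toward the better one.
    dim = len(segments[0][0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for candidates in segments:
            for feats_i, qual_i in candidates:
                for feats_j, qual_j in candidates:
                    if (qual_i > qual_j and
                            score(weights, feats_i) <= score(weights, feats_j)):
                        weights = [w + lr * (a - b)
                                   for w, a, b in zip(weights, feats_i, feats_j)]
    return weights

def best_query(weights, candidates):
    # candidates: list of (query_text, features); return the top-ranked text.
    return max(candidates, key=lambda c: score(weights, c[1]))[0]

training = [
    [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.1)],
    [([0.8, 0.2], 0.7), ([0.1, 0.9], 0.2)],
]
weights = train_pairwise(training)
chosen = best_query(weights, [("q1", [0.2, 0.8]), ("q2", [0.9, 0.1])])
```

Ranking within each segment, rather than across the whole corpus, mirrors the abstract's stated goal of optimizing retrieval performance per suspicious document segment.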