Funding: Supported by the National Social Science Foundation of China (No. 14CTQ032) and the National Natural Science Foundation of China (No. 61370170).
Abstract: Plagiarism source retrieval is the core task of plagiarism detection. It has become standard practice to retrieve plagiarism sources using queries extracted from suspicious documents, so generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic-based method has its own advantages, and none statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements to heuristic methods rely mainly on the experience of experts, which makes it difficult to put forward new heuristic methods that overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach that selects the best queries from the candidates. Query generation for source retrieval is formulated as a ranking framework that aims to achieve the optimal source retrieval performance for each suspicious document segment, and the proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, this work is the first to apply machine learning methods to the problem of query generation for source retrieval. To address the absence of training data for learning to rank, training samples for source retrieval are also constructed. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that the proposed machine-learning-based query generation method yields statistically significant improvements over the established baselines in source retrieval effectiveness.
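The idea of ranking candidate queries can be illustrated with a minimal sketch. It is not the paper's method: it assumes a linear scoring function trained with perceptron-style pairwise updates, and the two-dimensional feature vectors, the retrieval-quality labels, and all names (`train_pairwise`, `select_queries`, `candidate_a`, and so on) are hypothetical.

```python
from typing import Dict, List, Tuple

Features = List[float]
# One training sample: (feature vector of a candidate query, retrieval quality).
# Both the features and the quality labels are made-up illustrative values.
Sample = Tuple[Features, float]


def score(weights: Features, x: Features) -> float:
    """Linear ranking score of one candidate query."""
    return sum(w * v for w, v in zip(weights, x))


def train_pairwise(segments: List[List[Sample]],
                   epochs: int = 50, lr: float = 0.1) -> Features:
    """Perceptron-style pairwise ranker (an assumed stand-in for learning to rank).

    Each inner list holds the candidate queries of one suspicious document
    segment; pairs are only compared within the same segment.
    """
    dim = len(segments[0][0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for candidates in segments:
            for xa, ya in candidates:
                for xb, yb in candidates:
                    # Update only when a better query is not scored above a worse one.
                    if ya > yb and score(weights, xa) <= score(weights, xb):
                        weights = [w + lr * (a - b)
                                   for w, a, b in zip(weights, xa, xb)]
    return weights


def select_queries(weights: Features, candidates: Dict[str, Features],
                   k: int = 1) -> List[str]:
    """Return the names of the k highest-scoring candidate queries."""
    ranked = sorted(candidates,
                    key=lambda name: score(weights, candidates[name]),
                    reverse=True)
    return ranked[:k]


# Toy training data: two segments, each with labelled candidate queries.
training_segments = [
    [([1.0, 0.2], 0.9), ([0.2, 0.8], 0.1), ([0.6, 0.5], 0.5)],
    [([0.9, 0.4], 0.8), ([0.3, 0.6], 0.2)],
]
weights = train_pairwise(training_segments)

# Candidates produced by different heuristics for a new, unlabelled segment.
new_candidates = {
    "candidate_a": [0.7, 0.3],
    "candidate_b": [0.4, 0.6],
    "candidate_c": [0.5, 0.5],
}
print(select_queries(weights, new_candidates, k=2))
```

Comparing pairs only within a segment mirrors the stated goal of optimizing retrieval performance per suspicious document segment; how the original work actually builds its features and training labels is not described in this abstract.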
Funding: Supported by the National Natural Science Foundation of China (Grant No. 41630530) and the National Key Research and Development Program of China (Grant No. 2016YFC0209000).
Abstract: The ultimate solution to anthropogenic air pollution depends on an adjustment and upgrade of industrial and energy structures. Until this process is completed, reducing anthropogenic pollutant emissions is an effective measure. This is a problem belonging to "Natural Cybernetics", i.e., the problem of air pollution control should be solved together with weather prediction; however, this is very complicated. Considering that heavy air pollution usually occurs in stable weather conditions and that the feedbacks between air pollutants and meteorological changes are insufficient, we propose a simplified natural cybernetics method. Here, an off-line air pollution evolution equation is first solved with data from a given anthropogenic emission inventory under the predicted weather conditions, and then a related "incomplete adjoint problem" is solved to obtain the optimal reduction of anthropogenic emissions. Usually, such a solution is sufficient to satisfy the air quality and economic/social requirements. However, a better solution can be obtained by iteration after updating the emission inventory with the reduced anthropogenic emissions. This paper then discusses the retrieval of the pollutant emission source from a known spatio-temporal distribution of the pollutant concentrations and proposes a feasible mathematical method to achieve this. The retrieval of the emission source would also help control air pollution.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 41630530 & 41877316), the Key Research Program of Frontier Sciences, Chinese Academy of Sciences (Grant No. QYZDY-SSW-DQC002), and the Youth Innovation Promotion Association, Chinese Academy of Sciences (Grant No. 2019079).
Abstract: Using the incomplete adjoint operator method in Part I of this series of papers, the total emission source S can be retrieved from the pollutant concentrations ρ_ob obtained from the air pollution monitoring network. This paper studies the problem of retrieving the anthropogenic emission source S_a from S. The natural source S_n is assumed to be known, and the internal source S_c due to chemical reactions is a function of pollutant concentrations. If the chemical reaction equations are complete and the parameters are accurate, S_c can be calculated directly from ρ_ob, and S_a can then be obtained from S. However, if the chemical reaction parameters (denoted as γ) are insufficiently accurate, both γ and S_c should be corrected. This article proposes a "double correction iterative method" to retrieve S_c and correct γ, and proves that this iterative method converges.
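The relationship among the sources described above can be written out explicitly. The following is a sketch that assumes the total source is the plain sum of the anthropogenic, natural, and chemical parts; the abstract implies this decomposition but does not state it, and the functional form of the chemical source is not given here.

```latex
% Assumed additive decomposition of the retrieved total source S.
% S_a: anthropogenic source, S_n: natural source (known),
% S_c: internal source from chemical reactions, gamma: reaction parameters.
\[
  S = S_a + S_n + S_c(\rho;\gamma)
  \quad\Longrightarrow\quad
  S_a = S - S_n - S_c(\rho_{ob};\gamma)
\]
```

Under this assumption, any error in γ propagates directly into S_c and therefore into S_a, which is why an inaccurate γ has to be corrected together with S_c.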
Funding: This study is supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. SJCX22_0470).
Abstract: Fast and accurate identification of the pollutant source location and release rate is important for improving indoor air quality. From the perspective of public health, identification of the airborne pathogen source in public buildings is particularly important for ensuring people's safety and health. The existing adjoint probability method has difficulty in distinguishing the temporal source, and the optimization algorithm can only analyze a few potential sources in space. This study proposed an algorithm combining the adjoint-pulse and regularization methods to identify the spatiotemporal information of the point pollutant source in an entire room space. We first obtained a series of source-receptor response matrices using the adjoint-pulse method in the room based on the validated CFD model, and then used the regularization method and composite Bayesian inference to identify the release rate and location of the dynamic pollutant source. The results showed that the MAPEs (mean absolute percentage errors) of the estimated source intensities were mostly below 15%, and the source localization success rates were above 25/30 in this study. Combined with sensors for disease-specific biomarkers, this method has the potential to be used to identify the airborne pathogen source in public buildings.
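The inversion step and the MAPE metric can be sketched in a few lines. The abstract does not say which regularization method was used, so this example assumes Tikhonov (ridge) regularization; the 3x3 response matrix, the release rates, and the function names are hypothetical, and the Bayesian localization step is not covered.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve_linear(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def tikhonov_estimate(response: Matrix, readings: Vector, lam: float) -> Vector:
    """Minimise ||R q - c||^2 + lam * ||q||^2 via the normal equations.

    Tikhonov regularization is an assumption; the source only says
    "regularization method".
    """
    m, n = len(response), len(response[0])
    # Normal matrix R^T R + lam * I and right-hand side R^T c.
    normal = [[sum(response[k][i] * response[k][j] for k in range(m))
               + (lam if i == j else 0.0)
               for j in range(n)] for i in range(n)]
    rhs = [sum(response[k][i] * readings[k] for k in range(m)) for i in range(n)]
    return solve_linear(normal, rhs)


def mape(true_values: Vector, estimates: Vector) -> float:
    """Mean absolute percentage error in percent; true values must be nonzero."""
    terms = [abs((e - t) / t) for t, e in zip(true_values, estimates)]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical source-receptor response matrix: row = sensor reading time,
# column = release time step (lower triangular, since a reading cannot
# depend on a later release).
response = [[0.5, 0.0, 0.0],
            [0.3, 0.5, 0.0],
            [0.1, 0.3, 0.5]]
true_rates = [2.0, 4.0, 3.0]
# Noise-free synthetic sensor readings generated from the assumed rates.
readings = [sum(r * q for r, q in zip(row, true_rates)) for row in response]

estimated = tikhonov_estimate(response, readings, lam=1e-6)
print(round(mape(true_rates, estimated), 4))
```

With real, noisy sensor data the regularization weight `lam` would need to be chosen with care (for example by an L-curve or cross-validation); the tiny value used here only suits the noise-free toy data.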