Funding: Supported by the National Social Science Foundation of China (No. 14CTQ032) and the National Natural Science Foundation of China (No. 61370170).
Abstract: Plagiarism source retrieval is the core task of plagiarism detection. It has become standard practice in plagiarism detection to use queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic-based method has its own advantages, and none statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements to heuristic methods for source retrieval rely mainly on expert experience, which makes it difficult to put forward new heuristic methods that overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach that selects the best queries from the candidates. The approach formulates query generation for source retrieval within a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method uses learning to rank to generate queries from the candidates. To our knowledge, our work is the first to apply machine learning methods to the problem of query generation for source retrieval. To address the lack of training data for learning to rank, we also construct training samples for source retrieval. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that the proposed machine-learning-based query generation method yields statistically significant improvements over the established baselines in source retrieval effectiveness.
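The idea of ranking candidate queries per suspicious segment can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not name the ranking model or the features, so the sketch assumes a simple pairwise ranking perceptron, two hypothetical features per candidate query, and training labels that stand for the retrieval performance a candidate achieved. All names and numbers below are illustrative.

```python
from itertools import combinations


def score(weights, features):
    """Linear ranking score of one candidate query."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(segments, epochs=10):
    """Pairwise ranking perceptron (an assumed stand-in model).

    segments: one list per suspicious segment, each holding
    (features, label) pairs; a higher label means the candidate
    query retrieved sources better for that segment.
    """
    dim = len(segments[0][0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for candidates in segments:
            for (fa, ya), (fb, yb) in combinations(candidates, 2):
                if ya == yb:
                    continue  # no preference between equally good queries
                better, worse = (fa, fb) if ya > yb else (fb, fa)
                diff = [p - q for p, q in zip(better, worse)]
                # Update only when the pair is ranked in the wrong order.
                if score(weights, diff) <= 0:
                    weights = [w + d for w, d in zip(weights, diff)]
    return weights


def select_queries(weights, candidates, k=1):
    """Return the k highest-scoring query texts from (text, features) pairs."""
    ranked = sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical training data: two segments, two candidate queries each.
TRAINING_SEGMENTS = [
    [((0.9, 0.2), 0.8), ((0.3, 0.7), 0.1)],
    [((0.8, 0.5), 0.6), ((0.2, 0.4), 0.2)],
]
weights = train_pairwise(TRAINING_SEGMENTS)

# Hypothetical candidates for a new segment, e.g. from different heuristics.
new_candidates = [
    ("plagiarism source retrieval corpus", (0.7, 0.3)),
    ("various aspects of the", (0.1, 0.9)),
]
best = select_queries(weights, new_candidates, k=1)
print(best)
```

Training on preferences within a segment, rather than on absolute scores, matches the stated goal of optimizing retrieval performance for each suspicious document segment separately.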
Abstract: In the current biomedical data movement, numerous efforts have been made to convert and normalize large amounts of traditional structured and unstructured data (e.g., EHRs, reports) into semi-structured data (e.g., RDF, OWL). With the increasing amount of semi-structured data coming into the biomedical community, data integration and knowledge discovery across heterogeneous domains have become an important research problem. At the application level, detecting related concepts among medical ontologies is an important goal of life science research. It is even more crucial to figure out how different concepts are related within a single ontology or across multiple ontologies by analysing predicates in different knowledge bases. However, in today's world of information explosion, it is extremely difficult for biomedical researchers to find existing or potential predicates for linking cross-domain concepts without any support from schema pattern analysis. Therefore, there is a need for a mechanism that performs predicate-oriented pattern analysis to partition heterogeneous ontologies into smaller, more closely related topics and then generates queries to discover cross-domain knowledge from each topic. In this paper, we present such a model, which performs predicate-oriented pattern analysis based on how closely predicates are related and generates a similarity matrix. Based on this similarity matrix, we apply a novel unsupervised learning algorithm to partition large data sets into smaller, more closely related topics and generate meaningful queries to fully discover knowledge over a set of interlinked data sources. We have implemented a prototype system named BmQGen and evaluated the proposed model with a colorectal surgical cohort from the Mayo Clinic.
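A rough sketch of the two steps described above, a predicate similarity matrix followed by partitioning into topics, is given below. The abstract does not specify the similarity measure or the unsupervised algorithm used in BmQGen, so this sketch substitutes Jaccard similarity over the concepts each predicate connects and a simple threshold-based grouping (connected components). The triples, the threshold, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical schema-level triples: (subject concept, predicate, object concept).
TRIPLES = [
    ("Patient", "hasDiagnosis", "Disease"),
    ("Patient", "hasProcedure", "Surgery"),
    ("Surgery", "treats", "Disease"),
    ("Drug", "hasIngredient", "Compound"),
    ("Drug", "interactsWith", "Compound"),
]


def predicate_concepts(triples):
    """Map each predicate to the set of concepts it links."""
    concepts = defaultdict(set)
    for subject, predicate, obj in triples:
        concepts[predicate].update((subject, obj))
    return concepts


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(concepts):
    """Return the sorted predicate names and their pairwise similarity matrix."""
    preds = sorted(concepts)
    matrix = [[jaccard(concepts[p], concepts[q]) for q in preds] for p in preds]
    return preds, matrix


def partition(preds, matrix, threshold):
    """Group predicates whose similarity chain stays at or above the threshold."""
    seen = set()
    topics = []
    for start in range(len(preds)):
        if start in seen:
            continue
        seen.add(start)
        stack, topic = [start], []
        while stack:
            i = stack.pop()
            topic.append(preds[i])
            for j in range(len(preds)):
                if j not in seen and matrix[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        topics.append(sorted(topic))
    return topics


preds, matrix = similarity_matrix(predicate_concepts(TRIPLES))
topics = partition(preds, matrix, threshold=0.3)
print(topics)
```

Each resulting topic could then seed query generation restricted to its own predicates, which is the role the abstract assigns to the partitioning step.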
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (2020R1A2B5B01002145).
Abstract: The social internet of things (SIoT) is an emerging paradigm that was proposed to solve the problems of network service discovery, navigability, and service composition. The SIoT aims to socialize IoT devices and shape the interconnections between them into social interactions, much like those among human beings. In the IoT, an object can offer multiple services, and different objects can offer the same services with different parameters and interest factors. The proliferation of offered services has led to difficulties in service customization and service filtering. This problem is known as service explosion. Selecting a suitable service that fits the requirements of applications and objects is a challenging task. To address these issues, we propose an efficient, automated, query-based service search model based on the local network navigability concept for the SIoT. In the proposed model, objects can use information from their friends or friends of their friends while searching for the desired services, rather than exploring the global network. We employ a centrality metric that computes the degree of importance of each object in the social IoT, which helps in selecting neighboring objects with high centrality scores. The distributed nature of our navigation model results in high scalability and short navigation times. We verified the efficacy of our model on a real-world SIoT-related dataset. The experimental results confirm the validity of our model in terms of scalability and navigability, and show that the desired objects that provide services are found quickly via the shortest path, which in turn improves the service search process in the SIoT.
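The local search described above can be sketched as a breadth-first search limited to two hops (friends and friends of friends) that visits high-centrality neighbors first. The abstract does not name the centrality metric, so degree centrality is assumed here. The friendship graph, the service table, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical SIoT friendship graph (undirected) and offered services.
FRIENDS = {
    "thermostat": ["phone", "lamp"],
    "phone": ["thermostat", "lamp", "camera", "printer"],
    "lamp": ["thermostat", "phone"],
    "camera": ["phone"],
    "printer": ["phone", "sensor"],
    "sensor": ["printer"],
}
SERVICES = {
    "camera": {"video"},
    "printer": {"print"},
    "sensor": {"humidity"},
}


def degree_centrality(friends):
    """Fraction of the other objects each object is directly connected to."""
    n = len(friends)
    return {node: len(nbrs) / (n - 1) for node, nbrs in friends.items()}


def local_search(friends, services, start, wanted, max_hops=2):
    """Find the shortest path to an object offering `wanted` within max_hops.

    Neighbors are expanded in descending order of degree centrality
    (ties broken by name), so well-connected friends are asked first.
    Returns the path as a list of object names, or None if not found.
    """
    centrality = degree_centrality(friends)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if wanted in services.get(node, set()):
            return path
        if len(path) - 1 >= max_hops:
            continue  # stay local: do not look beyond friends of friends
        for nbr in sorted(friends[node], key=lambda x: (-centrality[x], x)):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(path + [nbr])
    return None


print_path = local_search(FRIENDS, SERVICES, "thermostat", "print")
humidity_path = local_search(FRIENDS, SERVICES, "thermostat", "humidity")
print(print_path, humidity_path)
```

In this toy graph the printer is reachable through the phone within two hops, while the sensor sits three hops away and is therefore outside the local neighborhood that the search explores.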