Abstract: This paper describes the development of a knowledge management system for the National Hydro Data Center of Thailand. The system was created after the major flood event of 2011 to improve water resource management. It addresses the need for easy access to water situation reports, which are crucial for informed decision-making on water usage, allocation, and reservoir management. The system uses Optical Character Recognition to convert scanned water situation reports into searchable text, and it applies FastText and ElasticSearch for advanced search functionality. FastText identifies documents related to the search query even when the query contains typos or misspelled words. ElasticSearch allows efficient, relevance-based searching of text data. The system also integrates Google Search for access to additional information. This knowledge management system therefore provides an efficient way to access and analyze water situation data in Thailand.
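The typo tolerance attributed to FastText above comes from representing words by their character n-grams. The following is a minimal sketch of that idea only, not the system's actual pipeline: it scores words by n-gram overlap (Jaccard) instead of learned embeddings, and the function names, n-gram range, and sample reports are all assumptions made for illustration.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word wrapped in FastText-style boundary markers."""
    padded = f"<{word.lower()}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def subword_similarity(a, b):
    """Jaccard overlap of character n-grams (a stand-in for embedding similarity)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def rank_documents(query, documents):
    """Rank documents by the mean best subword match of each query term."""
    query_terms = query.split()
    scored = []
    for doc_id, text in documents.items():
        doc_terms = text.split()
        if not query_terms or not doc_terms:
            scored.append((doc_id, 0.0))
            continue
        best = [
            max(subword_similarity(q, t) for t in doc_terms)
            for q in query_terms
        ]
        scored.append((doc_id, sum(best) / len(best)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical report snippets; the misspelled query still finds the right one.
reports = {
    "r1": "reservoir storage report",
    "r2": "rainfall forecast bulletin",
}
ranking = rank_documents("reservior", reports)
```

Because "reservior" and "reservoir" share several n-grams such as `<re` and `erv`, the first report ranks highest even though the query never matches a word exactly.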
Abstract: Despite extensive efforts to improve intelligent educational tools for smart learning environments, automatic Arabic essay scoring remains a major research challenge, and the writing style of the Arabic language makes the problem even more complicated. This study designs, implements, and evaluates an automatic Arabic essay scoring system. The proposed system first pre-processes the dataset of student answers and model answers using data cleaning and natural language processing tasks. It then comprises two main components: the grading engine and the adaptive fusion engine. The grading engine applies string-based and corpus-based similarity algorithms separately. The adaptive fusion engine then prepares the students' scores to be delivered to different feature selection algorithms, such as Recursive Feature Elimination and Boruta. Machine learning algorithms such as Decision Tree, Random Forest, AdaBoost, Lasso, Bagging, and K-Nearest Neighbor are then employed to improve the system's efficiency. The experimental results for the grading engine show that the Extracting DIStributionally similar words using CO-occurrences (DISCO) similarity measure achieves the best correlation values. Furthermore, in the adaptive fusion engine, the Random Forest algorithm outperforms all other machine learning algorithms using the 80%–20% splitting method on the original dataset. It achieves 91.30%, 94.20%, 0.023, 0.106, and 0.153 in terms of Pearson's Correlation Coefficient, Willmott's Index of Agreement, Mean Square Error, Mean Absolute Error, and Root Mean Square Error, respectively.
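The five evaluation metrics named above can be computed as in the following minimal sketch. It assumes the standard textbook definitions, with the index of agreement taken in Willmott's original form, d = 1 − Σ(P − O)² / Σ(|P − Ō| + |O − Ō|)². The function name and the sample scores are illustrative and are not taken from the study.

```python
import math


def evaluate_scores(observed, predicted):
    """Return Pearson r, Willmott's d, MSE, MAE, and RMSE for paired scores."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n

    errors = [p - o for o, p in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)

    # Pearson correlation: covariance over the product of standard deviations.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    pearson = cov / math.sqrt(var_o * var_p) if var_o and var_p else 0.0

    # Willmott's index of agreement: 1 minus squared error over potential error.
    potential = sum(
        (abs(p - mean_o) + abs(o - mean_o)) ** 2
        for o, p in zip(observed, predicted)
    )
    willmott = 1.0 - sum(e * e for e in errors) / potential if potential else 1.0

    return {"pearson": pearson, "willmott": willmott,
            "mse": mse, "mae": mae, "rmse": rmse}


# Illustrative human scores versus system scores (hypothetical values).
metrics = evaluate_scores([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A constant offset, as in this sample, leaves Pearson's correlation at 1 while the index of agreement and the error metrics still register the disagreement, which is one reason to report several metrics together.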
Funding: Supported by the Foundation of the State Key Laboratory of Software Development Environment (No. SKLSDE-2015ZX-04).
Abstract: Long-document semantic measurement has great significance in many applications such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts. Document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transition. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements such as the research purpose, methodology, and domain are included and enriched. We can then obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured by word vectors. To improve the semantic representation quality of word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that, in the measurement of document semantic similarity, our approach achieves substantial improvement over state-of-the-art methods, and our joint word-embedding model produces significantly better word representations than traditional word-embedding models.
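The profile comparison described above might be sketched as follows. The abstract does not say how concept distances are aggregated, so the symmetric best-match averaging here is an assumption, as are the function names, the toy three-dimensional vectors, and the example profiles. A real system would use learned word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def element_similarity(concepts_a, concepts_b, vectors):
    """Symmetric mean of each concept's best cosine match in the other list."""
    def directed(source, target):
        scores = [
            max((cosine(vectors[s], vectors[t]) for t in target if t in vectors),
                default=0.0)
            for s in source if s in vectors
        ]
        return sum(scores) / len(scores) if scores else 0.0

    return 0.5 * (directed(concepts_a, concepts_b) + directed(concepts_b, concepts_a))


def profile_similarity(profile_a, profile_b, vectors):
    """Average element-level similarity over the elements both profiles share."""
    shared = [name for name in profile_a if name in profile_b]
    if not shared:
        return 0.0
    total = sum(
        element_similarity(profile_a[name], profile_b[name], vectors)
        for name in shared
    )
    return total / len(shared)


# Toy word vectors and profiles (hypothetical, for illustration only).
toy_vectors = {
    "clustering": (1.0, 0.1, 0.0),
    "classification": (0.9, 0.3, 0.0),
    "genomics": (0.0, 0.2, 1.0),
    "proteomics": (0.1, 0.1, 0.9),
    "retrieval": (0.2, 1.0, 0.1),
}
paper_a = {"methodology": ["clustering"], "domain": ["genomics"]}
paper_b = {"methodology": ["classification"], "domain": ["proteomics"]}
paper_c = {"methodology": ["retrieval"], "domain": ["retrieval"]}
```

With these toy vectors, `profile_similarity(paper_a, paper_b, toy_vectors)` is higher than `profile_similarity(paper_a, paper_c, toy_vectors)`, because the methods and domains of the first pair lie close together in the vector space.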
Abstract: As software bots become widely used in open source software repositories, some drawbacks are coming to light, such as giving newcomers non-positive feedback and misleading the empirical studies of software engineering researchers. Researchers have proposed several techniques for bot detection, but most of them are limited to identifying bots that perform specific activities and cannot distinguish between GitHub Apps and OAuth Apps. In this paper, we propose a bot detection technique for OAuth Apps, named BDGOA. BDGOA uses 24 features, which fall into three dimensions: account information, account activity, and text similarity. To better explore the behavioral features, we define a fine-grained classification of behavioral events and introduce self-similarity to quantify the repeatability of behavioral sequences. We apply five machine learning classifiers to the benchmark dataset to conduct bot detection and finally choose random forest as the classifier, which achieves the highest F1-score of 95.83%. Experimental comparisons with state-of-the-art approaches also demonstrate the superiority of BDGOA.
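The abstract does not define how self-similarity is computed, so the following is only one plausible sketch of quantifying the repeatability of a behavioral sequence: split the event sequence into fixed-size windows and average their pairwise similarity. The window size, the use of `difflib.SequenceMatcher`, and the event labels are all assumptions, not BDGOA's actual definition.

```python
from difflib import SequenceMatcher
from itertools import combinations


def self_similarity(events, window=4):
    """Mean pairwise similarity of non-overlapping windows of an event sequence.

    Returns a value in [0, 1]; higher values mean more repetitive behavior.
    Sequences too short to form two windows yield 0.0.
    """
    windows = [
        tuple(events[i:i + window])
        for i in range(0, len(events) - window + 1, window)
    ]
    if len(windows) < 2:
        return 0.0
    ratios = [
        SequenceMatcher(None, a, b, autojunk=False).ratio()
        for a, b in combinations(windows, 2)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical event labels: a bot repeating one pattern versus a varied human.
bot_events = ["IssueComment", "Close"] * 6
human_events = [
    "Push", "IssueComment", "PullRequest", "Review",
    "Fork", "Push", "Star", "IssueComment",
    "Release", "Push", "Review", "Close",
]
bot_score = self_similarity(bot_events)
human_score = self_similarity(human_events)
```

A score like this could serve as one account-activity feature alongside the others before they are passed to a classifier such as random forest.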