Nowadays,we can use the multi-task learning approach to train a machine-learning algorithm to learn multiple related tasks instead of training it to solve a single task.In this work,we propose an algorithm for estimat...Nowadays,we can use the multi-task learning approach to train a machine-learning algorithm to learn multiple related tasks instead of training it to solve a single task.In this work,we propose an algorithm for estimating textual similarity scores and then use these scores in multiple tasks such as text ranking,essay grading,and question answering systems.We used several vectorization schemes to represent the Arabic texts in the SemEval2017-task3-subtask-D dataset.The used schemes include lexical-based similarity features,frequency-based features,and pre-trained model-based features.Also,we used contextual-based embedding models such as Arabic Bidirectional Encoder Representations from Transformers(AraBERT).We used the AraBERT model in two different variants.First,as a feature extractor in addition to the text vectorization schemes’features.We fed those features to various regression models to make a prediction value that represents the relevancy score between Arabic text units.Second,AraBERT is adopted as a pre-trained model,and its parameters are fine-tuned to estimate the relevancy scores between Arabic textual sentences.To evaluate the research results,we conducted several experiments to compare the use of the AraBERT model in its two variants.In terms of Mean Absolute Percentage Error(MAPE),the results showminor variance between AraBERT v0.2 as a feature extractor(21.7723)and the fine-tuned AraBERT v2(21.8211).On the other hand,AraBERT v0.2-Large as a feature extractor outperforms the finetuned AraBERT v2 model on the used data set in terms of the coefficient of determination(R2)values(0.014050,−0.032861),respectively.展开更多
Communication and coordination between OSS developers who do not work physically in the same location have always been the challenging issues.The pull-based development model,as the state-of-art collaborative developm...Communication and coordination between OSS developers who do not work physically in the same location have always been the challenging issues.The pull-based development model,as the state-of-art collaborative development mechanism,provides high openness and transparency to improve the visibility of contributors'work.However,duplicate contributions may still be submitted by more than one contributors to solve the same problem due to the parallel and uncoordinated nature of this model.If not detected in time,duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant work.In this paper,we propose an approach combining textual and change similarities to automatically detect duplicate contributions in pull-based model at submission time.For a new-arriving contribution,we first compute textual similarity and change similarity between it and other existing contributions.And then our method returns a list of candidate duplicate contributions that are most similar with the new contribution in terms of the combined textual and change similarity.The evaluation shows that 83.4%of the duplicates can be found in average when we use the combined textual and change similarity compared to 54.8%using only textual similarity and 78.2%using only change similarity.展开更多
文摘Nowadays,we can use the multi-task learning approach to train a machine-learning algorithm to learn multiple related tasks instead of training it to solve a single task.In this work,we propose an algorithm for estimating textual similarity scores and then use these scores in multiple tasks such as text ranking,essay grading,and question answering systems.We used several vectorization schemes to represent the Arabic texts in the SemEval2017-task3-subtask-D dataset.The used schemes include lexical-based similarity features,frequency-based features,and pre-trained model-based features.Also,we used contextual-based embedding models such as Arabic Bidirectional Encoder Representations from Transformers(AraBERT).We used the AraBERT model in two different variants.First,as a feature extractor in addition to the text vectorization schemes’features.We fed those features to various regression models to make a prediction value that represents the relevancy score between Arabic text units.Second,AraBERT is adopted as a pre-trained model,and its parameters are fine-tuned to estimate the relevancy scores between Arabic textual sentences.To evaluate the research results,we conducted several experiments to compare the use of the AraBERT model in its two variants.In terms of Mean Absolute Percentage Error(MAPE),the results showminor variance between AraBERT v0.2 as a feature extractor(21.7723)and the fine-tuned AraBERT v2(21.8211).On the other hand,AraBERT v0.2-Large as a feature extractor outperforms the finetuned AraBERT v2 model on the used data set in terms of the coefficient of determination(R2)values(0.014050,−0.032861),respectively.
基金This work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB1004202the National Natural Science Foundation of China under Grant No. 61702534.
文摘Communication and coordination between OSS developers who do not work physically in the same location have always been the challenging issues.The pull-based development model,as the state-of-art collaborative development mechanism,provides high openness and transparency to improve the visibility of contributors'work.However,duplicate contributions may still be submitted by more than one contributors to solve the same problem due to the parallel and uncoordinated nature of this model.If not detected in time,duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant work.In this paper,we propose an approach combining textual and change similarities to automatically detect duplicate contributions in pull-based model at submission time.For a new-arriving contribution,we first compute textual similarity and change similarity between it and other existing contributions.And then our method returns a list of candidate duplicate contributions that are most similar with the new contribution in terms of the combined textual and change similarity.The evaluation shows that 83.4%of the duplicates can be found in average when we use the combined textual and change similarity compared to 54.8%using only textual similarity and 78.2%using only change similarity.