Funding: Prince Sultan University supported the Article Processing Charges (APC) of this publication.
Abstract: Author Profiling (AP) is a subfield of digital forensics that focuses on detecting an author’s personal information, such as age, gender, occupation, and education, from various linguistic features, e.g., stylistic, semantic, and syntactic. AP is important in several fields, including forensics, security, medicine, and marketing. Previous studies have addressed AP in many languages, e.g., English, Arabic, and French. However, research on Roman Urdu remains limited. Hence, this study focuses on detecting the author’s age and gender from Roman Urdu text messages. The dataset used in this study is Fire’18-MaponSMS. The study proposes an ensemble model based on AdaBoostM1 and Random Forest (ABMRF) for AP using multiple linguistic features: stylistic, character-based, word-based, and sentence-based. The proposed model is compared with several well-known models from the literature, including J48 Decision Tree (J48), Naïve Bayes (NB), K Nearest Neighbor (KNN), Composite Hypercube on Random Projection (CHIRP), NB-Updatable, RF, and AdaBoostM1. Overall, the proposed ABMRF performs best, with an accuracy of 54.2857% for age prediction and 71.1429% for gender prediction on stylistic features. On word-based features, the accuracies for age and gender were 50.5714% and 60%, respectively. In contrast, KNN and CHIRP show the weakest performance across all the linguistic features for age and gender prediction.
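The abstract names character-based, word-based, and sentence-based features but does not list them. As a rough illustration only, the sketch below computes a few counts of each kind for a single text message. The function name `stylistic_features`, the chosen counts, and the sample message are all assumptions made for this example, not the feature set used in the study.

```python
import re


def stylistic_features(message: str) -> dict:
    """Compute a few illustrative character-, word-, and sentence-level counts.

    Hypothetical feature set: the study's actual features are not given
    in the abstract.
    """
    words = message.split()
    # Treat runs of '.', '!' or '?' as sentence boundaries; drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    return {
        # Character-based counts
        "char_count": len(message),
        "digit_count": sum(ch.isdigit() for ch in message),
        "upper_count": sum(ch.isupper() for ch in message),
        # Word-based counts
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Sentence-based count
        "sentence_count": len(sentences),
    }


# Made-up Roman Urdu sample message, used only to exercise the function.
features = stylistic_features("Aap kahan ho? Main 5 baje aunga.")
print(features)
```

A vector like this, computed per message, is the kind of input a classifier such as the ensemble described above could be trained on.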