Abstract: Purpose: Normalization is an important step in all natural language processing applications that handle social media text. Text from social media poses problems that are not present in regular text. Recently, a considerable amount of work has been done in this direction, but mostly for the English language. Non-English speakers code-mix their text with their native language and post it on social media using the Roman script. This kind of text further aggravates the normalization problem. This paper discusses the concept of normalization with respect to code-mixed social media text and proposes a model to normalize such text. Design/methodology/approach: The system is divided into two phases: candidate generation and most probable sentence selection. Candidate generation is treated as a machine translation task in which the Roman text is the source language and the Gurmukhi text is the target language. A character-based translation system is proposed to generate candidate tokens. Once candidates are generated, the second phase uses beam search to select the most probable sentence based on a hidden Markov model. Findings: Character error rate (CER) and bilingual evaluation understudy (BLEU) scores are reported. The proposed system has been compared with the Akhar software and the RB_R2G system, which are also capable of transliterating Roman text to Gurmukhi. The proposed system outperforms the Akhar software. The CER and BLEU scores are 0.268121 and 0.6807939, respectively, for ill-formed text. Research limitations/implications: It was observed that the system produces dialectal variants of a word, or the word with minor errors such as a missing diacritic. A spell checker can improve the output of the system by correcting these minor errors. Extensive experimentation is needed to optimize the language identifier, which will further help in improving the output. The language model also needs further exploration. Inclusion of wider context, particularly from social media text, is an important area that deserves further investigation. Practical implications: The practical implications of this study are: (1) development of a parallel dataset containing Roman and Gurmukhi text; (2) development of a dataset annotated with language tags; (3) development of the normalizing system, which is the first of its kind and proposes a translation-based solution for normalizing noisy social media text from Roman to Gurmukhi, and which can be extended to any pair of scripts; (4) the proposed system can be used for better analysis of social media text. Theoretically, our study helps in better understanding text normalization in the social media context and opens the door for further research in multilingual social media text normalization. Originality/value: Existing research focuses on normalizing monolingual text. This study contributes towards the development of a normalization system for multilingual text.
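The second phase described in the abstract above (beam search over candidate tokens, scored with a hidden Markov model) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the placeholder tokens `w1a`, `w1b`, `w2a`, `w2b`, the toy probabilities, the `UNSEEN` floor value, and the function names. Each candidate carries an emission probability (how well it matches the Roman input token) and the bigram table supplies the transition probabilities.

```python
import math

# Hypothetical toy data: two input tokens, two candidates each,
# given as (candidate, emission probability). Not from the paper.
CANDIDATES = [
    [("w1a", 0.6), ("w1b", 0.4)],
    [("w2a", 0.5), ("w2b", 0.5)],
]

# Hypothetical bigram (transition) probabilities between target words.
BIGRAMS = {
    ("<s>", "w1a"): 0.5, ("<s>", "w1b"): 0.5,
    ("w1a", "w2a"): 0.1, ("w1a", "w2b"): 0.9,
    ("w1b", "w2a"): 0.8, ("w1b", "w2b"): 0.2,
}
UNSEEN = 1e-6  # assumed floor probability for bigrams missing from the table


def transition_logp(prev_word, word):
    """Log-probability of `word` following `prev_word`."""
    return math.log(BIGRAMS.get((prev_word, word), UNSEEN))


def beam_search(candidates, beam_width=3):
    """Return (best sentence, log-probability) over per-token candidates."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, words so far)
    for options in candidates:
        expanded = []
        for score, words in beams:
            for word, emission_p in options:
                new_score = (score
                             + transition_logp(words[-1], word)
                             + math.log(emission_p))
                expanded.append((new_score, words + [word]))
        # Prune: keep only the `beam_width` most probable partial sentences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    best_score, best_words = beams[0]
    return best_words[1:], best_score  # drop the "<s>" start marker


best_sentence, best_logp = beam_search(CANDIDATES)
```

With a beam wide enough to hold every partial sentence, this is equivalent to exhaustive search; a narrower beam trades exactness for speed, which matters when each token has many candidates.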
Abstract: The study explores linguistic and media genre characteristics of the British, U.S. and Georgian print media. The theoretical apparatus of media studies and other interdisciplinary linguistic fields was employed for comparative analysis of genre characteristics. The paper is part of a longitudinal study of print media genres over the period 2002-2010. The aim of the research is to (a) define and compare genre characteristics of the British, U.S. and Georgian print media, (b) examine and define structural and linguistic (semantic, pragmatic, semiotic) characteristics of the British, U.S. and Georgian quality newspaper genres, (c) define the deictic composition of newspaper articles and (d) study the expression of coded meanings in media texts. In this paper, I will focus on two major genres of quality print media: news and features. The media genres are analysed within the theoretical framework of pragmatics, semantics, semiotics and media studies.
Abstract: Cyberbullying occurs when someone threatens or humiliates another person online by sending unpleasant messages or comments. Recently, Bangla text has been used much more often on social media, where people communicate with others through messages and comments. Bullies therefore use social media as a rich environment to bully others, especially on political issues. Fights involving cyberbullying on political and social media posts are common today and most of the time do a lot of damage. However, little work has been done on monitoring Bangla text on social media, and no work has yet been done on detecting bullying Bangla text on political issues, due to the lack of annotated corpora and morphological analyzers. In this work, we used several machine learning classifiers and a model that will help to detect Bangla bullying texts on social media. For this work, 11,000 Bangla texts were collected from the comments sections of political Facebook posts to make a new dataset, and the data were labelled as either bullied or not. This dataset was used to train the machine learning classifiers. The results indicate that Random Forest achieves superior accuracy, at 91.08%.
Funding: Supported by the National Natural Science Foundation of China under Grants No. 71271044, No. U1233118, and No. 71572029.
Abstract: Online social media carry a massive volume of messages relevant to organizational events, and well-categorized event information can be useful in many real-world applications. In this paper, we propose a research framework to extract high-quality event information from massive online media data. The main contributions lie in two aspects: first, we present an event-extraction and event-categorization system for online media data; second, we present a novel approach for both discovering important event categories and classifying extracted events based on word representation and a clustering model. Experimental results on a real dataset show that the proposed framework is effective at extracting high-quality event information.
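The abstract above does not say which word representation or clustering model the authors use, so the following is only a generic sketch of the idea: represent each extracted event as the mean of its word vectors, discover categories with k-means, and classify a new event by its nearest centroid. The 2-D vectors, the vocabulary, the example events, and all function names are made up for illustration.

```python
# Hypothetical 2-D word vectors; real embeddings would have many more dimensions.
EMBEDDINGS = {
    "merger": [1.0, 0.0], "acquisition": [0.9, 0.1],
    "lawsuit": [0.0, 1.0], "court": [0.1, 0.9],
}


def mean_vector(words, embeddings):
    """Represent an event as the average of the vectors of its known words."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic start (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


# Hypothetical extracted events, each reduced to its keywords.
EVENTS = [["merger", "acquisition"], ["lawsuit", "court"], ["acquisition"], ["court"]]
points = [mean_vector(e, EMBEDDINGS) for e in EVENTS]
centroids = kmeans(points, k=2)
labels = [nearest(p, centroids) for p in points]          # discovered categories
new_label = nearest(mean_vector(["merger"], EMBEDDINGS), centroids)  # classify a new event
```

Initializing from the first k points keeps the sketch reproducible; a practical system would normally use a randomized or k-means++ style initialization and pick k from the data.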
Abstract: The increasing prevalence of technology in society has an impact on young people’s language use and development. Greeklish is the writing of Greek texts using the Latin instead of the Greek alphabet, a practice known as Latinization, which is also employed for many languages with non-Latin alphabets. The primary aim of this research is to evaluate the effect of Greeklish on reading time. A sample of 732 young Greeks were asked about their habits when communicating with their friends through e-mail and social media, and they then participated in an experiment in which they were asked to read and understand two short texts, one written in Greek and the other in Greeklish. The findings show that nearly one third of the participants use Greeklish. The results of the experiment reveal that understanding is not affected by the alphabet used, but reading Greeklish is significantly more time-consuming than reading Greek, independently of the participants' sex and familiarity with Greeklish. The findings suggest that augmenting social and communication media with software utilities related to Latinization, such as language identifiers and converters, may reduce reading time and thus facilitate written communication among users.
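As a rough illustration of the kind of language identifier the abstract above mentions, a first step could be a script check that separates text in the Greek alphabet from text in the Latin alphabet. This is a minimal sketch, not the study's software: the function name `dominant_script` is hypothetical, and Latin-script text is only a candidate for Greeklish, since it could equally be English or another language.

```python
import unicodedata


def dominant_script(text):
    """Return "greek", "latin", or "unknown" by counting letters per script.

    The script is read from each character's Unicode name, e.g.
    "GREEK SMALL LETTER ALPHA" or "LATIN SMALL LETTER A".
    """
    counts = {"GREEK": 0, "LATIN": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        for script in counts:
            if name.startswith(script):
                counts[script] += 1
    if counts["GREEK"] == 0 and counts["LATIN"] == 0:
        return "unknown"
    # Ties are resolved in favour of Greek (an arbitrary choice for this sketch).
    return "greek" if counts["GREEK"] >= counts["LATIN"] else "latin"
```

A message flagged as `"latin"` could then be passed to a real language identifier or a Greeklish-to-Greek converter; the script check alone cannot tell Greeklish from other Latin-script languages.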