As Natural Language Processing(NLP)continues to advance,driven by the emergence of sophisticated large language models such as ChatGPT,there has been a notable growth in research activity.This rapid uptake reflects in...As Natural Language Processing(NLP)continues to advance,driven by the emergence of sophisticated large language models such as ChatGPT,there has been a notable growth in research activity.This rapid uptake reflects increasing interest in the field and induces critical inquiries into ChatGPT’s applicability in the NLP domain.This review paper systematically investigates the role of ChatGPT in diverse NLP tasks,including information extraction,Name Entity Recognition(NER),event extraction,relation extraction,Part of Speech(PoS)tagging,text classification,sentiment analysis,emotion recognition and text annotation.The novelty of this work lies in its comprehensive analysis of the existing literature,addressing a critical gap in understanding ChatGPT’s adaptability,limitations,and optimal application.In this paper,we employed a systematic stepwise approach following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses(PRISMA)framework to direct our search process and seek relevant studies.Our review reveals ChatGPT’s significant potential in enhancing various NLP tasks.Its adaptability in information extraction tasks,sentiment analysis,and text classification showcases its ability to comprehend diverse contexts and extract meaningful details.Additionally,ChatGPT’s flexibility in annotation tasks reducesmanual efforts and accelerates the annotation process,making it a valuable asset in NLP development and research.Furthermore,GPT-4 and prompt engineering emerge as a complementary mechanism,empowering users to guide the model and enhance overall accuracy.Despite its promising potential,challenges persist.The performance of ChatGP Tneeds tobe testedusingmore extensivedatasets anddiversedata structures.Subsequently,its limitations in handling domain-specific language and the need for fine-tuning in specific applications highlight the importance of further investigations to address these issues.展开更多
Nowadays short texts can be widely found in various social data in relation to the 5G-enabled Internet of Things (IoT). Short text classification is a challenging task due to its sparsity and the lack of context. Prev...Nowadays short texts can be widely found in various social data in relation to the 5G-enabled Internet of Things (IoT). Short text classification is a challenging task due to its sparsity and the lack of context. Previous studies mainly tackle these problems by enhancing the semantic information or the statistical information individually. However, the improvement achieved by a single type of information is limited, while fusing various information may help to improve the classification accuracy more effectively. To fuse various information for short text classification, this article proposes a feature fusion method that integrates the statistical feature and the comprehensive semantic feature together by using the weighting mechanism and deep learning models. In the proposed method, we apply Bidirectional Encoder Representations from Transformers (BERT) to generate word vectors on the sentence level automatically, and then obtain the statistical feature, the local semantic feature and the overall semantic feature using Term Frequency-Inverse Document Frequency (TF-IDF) weighting approach, Convolutional Neural Network (CNN) and Bidirectional Gate Recurrent Unit (BiGRU). Then, the fusion feature is accordingly obtained for classification. Experiments are conducted on five popular short text classification datasets and a 5G-enabled IoT social dataset and the results show that our proposed method effectively improves the classification performance.展开更多
With the popularity of English learning in the world,the demand for instructional materials is increasing,and how to accurately convey the information of these texts to English-learners and provide the learners with a...With the popularity of English learning in the world,the demand for instructional materials is increasing,and how to accurately convey the information of these texts to English-learners and provide the learners with an effective learning experience has become a major problem.Therefore,the role of the translator is crucial.It is found that that translators should make linguistic choices based on the text functions and target readers’expectations,and flexibly adopt the translation strategies,such as addition,conversion,and cohesion to convey the intention of the source author and generate the source context accurately and appropriately.Applying relevant theories to the analysis of translation cases,the paper tentatively puts forward solutions to the problems encountered in the E-C translation of chapter one of Murder Your Darlings and Other Gentle Writing Advice from Aristotle to Zinsser.Hopefully,it could provide a reference for other translators working on instructional texts.展开更多
Integrated writing tasks,requiring writers to summarize source content and integrate their ideas or experience with those in the source text,are widely used in standardized tests of English proficiency.Integrated writ...Integrated writing tasks,requiring writers to summarize source content and integrate their ideas or experience with those in the source text,are widely used in standardized tests of English proficiency.Integrated writing adds an element not found in traditional independent writing:the use of source text material.Recently,concern has been raised that over-reliance on source texts in integrated writing may lead to plagiarism as well as inaccurate assessment which calls for further investigation.Research on L2 students’writing using sources has broadened its focus from transgressive citing practices to stages of skill development in this complicated literacy,the successive challenges students face,and how instruction might be of benefit.Previous literature mainly focused on source use in academic writing and seldom considered summary writing in standardized tests,while summary strategy is particularly important in standardized tests,in which summary accounts for a significant part of the assessment criteria.Given that a series of previous literature have indicated that source text use,source texts,and prompts are associated with the quality of summary writing,this article reviews the source text borrowing in the summary of integrated writing to further explore the relationship between some variables and the quality of summary writing and therefore finds some research gaps in this field.展开更多
Purpose: The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for ...Purpose: The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles. In this informal essay, I will give my personal perspective on Don's contributions to science, and outline some current and future directions in literature-based discovery that are rooted in concepts that he developed.Design/methodology/approach: Personal recollections and literature review. Findings: The Swanson A-B-C model of literature-based discovery has been successfully used by laboratory investigators analyzing their findings and hypotheses. It continues to be a fertile area of research in a wide range of application areas including text mining, drug repurposing, studies of scientific innovation, knowledge discovery in databases, and bioinformatics. Recently, additional modes of discovery that do not follow the A-B-C model have also been proposed and explored (e.g. so-called storytelling, gaps, analogies, link prediction, negative consensus, outliers, and revival of neglected or discarded research questions). Research limitations: This paper reflects the opinions of the author and is not a comprehensive nor technically based review of literature-based discovery. Practical implications: The general scientific public is still not aware of the availability of tools for literature-based discovery. Our Arrowsmith project site maintains a suite of discovery tools that are free and open to the public (http://arrowsmith.psych.uic.edu), as does BITOLA which is maintained by Dmitar Hristovski (http:// http://ibmi.mf.uni-lj.si/bitola), and Epiphanet which is maintained by Trevor Cohen (http://epiphanet.uth.tme.edu/). Bringing user-friendly tools to the public should be a high priority, since even more than advancing basic research in informatics, it is vital that we ensure that scientists actually use discovery tools and that these are actually able to help them make experimental discoveries in the lab and in the clinic. Originality/value: This paper discusses problems and issues which were inherent in Don's thoughts during his life, including those which have not yet been fully taken up and studied systematically.展开更多
With the flooding of pornographic information on the Internet, how to keep people away from that offensive information is becoming one of the most important research areas in network information security. Some applica...With the flooding of pornographic information on the Internet, how to keep people away from that offensive information is becoming one of the most important research areas in network information security. Some applications which can block or filter such information are used. Approaches in those systems can be roughly classified into two kinds: metadata based and content based. With the development of distributed technologies, content based filtering technologies will play a more and more important role in filtering systems. Keyword matching is a content based method used widely in harmful text filtering. Experiments to evaluate the recall and precision of the method showed that the precision of the method is not satisfactory, though the recall of the method is rather high. According to the results, a new pornographic text filtering model based on reconfirming is put forward. Experiments showed that the model is practical, has less loss of recall than the single keyword matching method, and has higher precision.展开更多
With the development of Web 2.0,more and more people choose to use the Internet to express their opinions.All this opinions together into a new form text which contains a lot of valuable emotional information,this is ...With the development of Web 2.0,more and more people choose to use the Internet to express their opinions.All this opinions together into a new form text which contains a lot of valuable emotional information,this is why how to deal with these texts and analysis the emotional information is significant for us.We get three main tasks of sentiment analysis,including sentiment extraction,sentiment classification,sentiment application and summarization.In this paper,based on the R software,we introduced the steps of sentiment analysis in detail.Finally,we collect the movie reviews from the Internet,and use R software to do sentiment analysis in order to judge the emotional tendency of the text.展开更多
This study introduces the Orbit Weighting Scheme(OWS),a novel approach aimed at enhancing the precision and efficiency of Vector Space information retrieval(IR)models,which have traditionally relied on weighting schem...This study introduces the Orbit Weighting Scheme(OWS),a novel approach aimed at enhancing the precision and efficiency of Vector Space information retrieval(IR)models,which have traditionally relied on weighting schemes like tf-idf and BM25.These conventional methods often struggle with accurately capturing document relevance,leading to inefficiencies in both retrieval performance and index size management.OWS proposes a dynamic weighting mechanism that evaluates the significance of terms based on their orbital position within the vector space,emphasizing term relationships and distribution patterns overlooked by existing models.Our research focuses on evaluating OWS’s impact on model accuracy using Information Retrieval metrics like Recall,Precision,InterpolatedAverage Precision(IAP),andMeanAverage Precision(MAP).Additionally,we assessOWS’s effectiveness in reducing the inverted index size,crucial for model efficiency.We compare OWS-based retrieval models against others using different schemes,including tf-idf variations and BM25Delta.Results reveal OWS’s superiority,achieving a 54%Recall and 81%MAP,and a notable 38%reduction in the inverted index size.This highlights OWS’s potential in optimizing retrieval processes and underscores the need for further research in this underrepresented area to fully leverage OWS’s capabilities in information retrieval methodologies.展开更多
The frame of text classification system was presented. The high dimensionality in feature space for text classification was studied. The mutual information is a widely used information theoretic measure, in a descript...The frame of text classification system was presented. The high dimensionality in feature space for text classification was studied. The mutual information is a widely used information theoretic measure, in a descriptive way, to measure the stochastic dependency of discrete random variables. The measure method was used as a criterion to reduce high dimensionality of feature vectors in text classification on Web. Feature selections or conversions were performed by using maximum mutual information including linear and non-linear feature conversions. Entropy was used and extended to find right features commendably in pattern recognition systems. Favorable foundation would be established for text classification mining.展开更多
In the realm of Multi-Label Text Classification(MLTC),the dual challenges of extracting rich semantic features from text and discerning inter-label relationships have spurred innovative approaches.Many studies in sema...In the realm of Multi-Label Text Classification(MLTC),the dual challenges of extracting rich semantic features from text and discerning inter-label relationships have spurred innovative approaches.Many studies in semantic feature extraction have turned to external knowledge to augment the model’s grasp of textual content,often overlooking intrinsic textual cues such as label statistical features.In contrast,these endogenous insights naturally align with the classification task.In our paper,to complement this focus on intrinsic knowledge,we introduce a novel Gate-Attention mechanism.This mechanism adeptly integrates statistical features from the text itself into the semantic fabric,enhancing the model’s capacity to understand and represent the data.Additionally,to address the intricate task of mining label correlations,we propose a Dual-end enhancement mechanism.This mechanism effectively mitigates the challenges of information loss and erroneous transmission inherent in traditional long short term memory propagation.We conducted an extensive battery of experiments on the AAPD and RCV1-2 datasets.These experiments serve the dual purpose of confirming the efficacy of both the Gate-Attention mechanism and the Dual-end enhancement mechanism.Our final model unequivocally outperforms the baseline model,attesting to its robustness.These findings emphatically underscore the imperativeness of taking into account not just external knowledge but also the inherent intricacies of textual data when crafting potent MLTC models.展开更多
In order to solve the poor performance in text classification when using traditional formula of mutual information (MI),a feature selection algorithm were proposed based on improved mutual information.The improved mut...In order to solve the poor performance in text classification when using traditional formula of mutual information (MI),a feature selection algorithm were proposed based on improved mutual information.The improved mutual information algorithm,which is on the basis of traditional improved mutual information methods that enhance the MI value of negative characteristics and feature's frequency,supports the concept of concentration degree and dispersion degree.In accordance with the concept of concentration degree and dispersion degree,formulas which embody concentration degree and dispersion degree were constructed and the improved mutual information was implemented based on these.In this paper,the feature selection algorithm was applied based on improved mutual information to a text classifier based on Biomimetic Pattern Recognition and it was compared with several other feature selection methods.The experimental results showed that the improved mutual information feature selection method greatly enhances the performance compared with traditional mutual information feature selection methods and the performance is better than that of information gain.Through the introduction of the concept of concentration degree and dispersion degree,the improved mutual information feature selection method greatly improves the performance of text classification system.展开更多
The developed system for eye and face detection using Convolutional Neural Networks(CNN)models,followed by eye classification and voice-based assistance,has shown promising potential in enhancing accessibility for ind...The developed system for eye and face detection using Convolutional Neural Networks(CNN)models,followed by eye classification and voice-based assistance,has shown promising potential in enhancing accessibility for individuals with visual impairments.The modular approach implemented in this research allows for a seamless flow of information and assistance between the different components of the system.This research significantly contributes to the field of accessibility technology by integrating computer vision,natural language processing,and voice technologies.By leveraging these advancements,the developed system offers a practical and efficient solution for assisting blind individuals.The modular design ensures flexibility,scalability,and ease of integration with existing assistive technologies.However,it is important to acknowledge that further research and improvements are necessary to enhance the system’s accuracy and usability.Fine-tuning the CNN models and expanding the training dataset can improve eye and face detection as well as eye classification capabilities.Additionally,incorporating real-time responses through sophisticated natural language understanding techniques and expanding the knowledge base of ChatGPT can enhance the system’s ability to provide comprehensive and accurate responses.Overall,this research paves the way for the development of more advanced and robust systems for assisting visually impaired individuals.By leveraging cutting-edge technologies and integrating them into amodular framework,this research contributes to creating a more inclusive and accessible society for individuals with visual impairments.Future work can focus on refining the system,addressing its limitations,and conducting user studies to evaluate its effectiveness and impact in real-world scenarios.展开更多
Text visualization is concerned with the representation of text in a graphicalform to facilitate comprehension of large textual data. Its aim is to improve the ability tounderstand and utilize the wealth of text-based...Text visualization is concerned with the representation of text in a graphicalform to facilitate comprehension of large textual data. Its aim is to improve the ability tounderstand and utilize the wealth of text-based information available. An essential task inany scientific research is the study and review of previous works in the specified domain,a process that is referred to as the literature survey process. This process involves theidentification of prior work and evaluating its relevance to the research question. With theenormous number of published studies available online in digital form, this becomes acumbersome task for the researcher. This paper presents the design and implementationof a tool that aims to facilitate this process by identifying relevant work and suggestingclusters of articles by conceptual modeling, thus providing different options that enablethe researcher to visualize a large number of articles in a graphical easy-to-analyze form.The tool helps the researcher in analyzing and synthesizing the literature and building aconceptual understanding of the designated research area. The evaluation of the toolshows that researchers have found it useful and that it supported the process of relevantwork analysis given a specific research question, and 70% of the evaluators of the toolfound it very useful.展开更多
Text event mining,as an indispensable method of text mining processing,has attracted the extensive attention of researchers.A modeling method for knowledge graph of events based on mutual information among neighbor do...Text event mining,as an indispensable method of text mining processing,has attracted the extensive attention of researchers.A modeling method for knowledge graph of events based on mutual information among neighbor domains and sparse representation is proposed in this paper,i.e.UKGE-MS.Specifically,UKGE-MS can improve the existing text mining technology's ability of understanding and discovering high-dimensional unmarked information,and solves the problems of traditional unsupervised feature selection methods,which only focus on selecting features from a global perspective and ignoring the impact of local connection of samples.Firstly,considering the influence of local information of samples in feature correlation evaluation,a feature clustering algorithm based on average neighborhood mutual information is proposed,and the feature clusters with certain event correlation are obtained;Secondly,an unsupervised feature selection method based on the high-order correlation of multi-dimensional statistical data is designed by combining the dimension reduction advantage of local linear embedding algorithm and the feature selection ability of sparse representation,so as to enhance the generalization ability of the selected feature items.Finally,the events knowledge graph is constructed by means of sparse representation and l1 norm.Extensive experiments are carried out on five real datasets and synthetic datasets,and the UKGE-MS are compared with five corresponding algorithms.The experimental results show that UKGE-MS is better than the traditional method in event clustering and feature selection,and has some advantages over other methods in text event recognition and discovery.展开更多
It is important to effectively identify the data value of open source scientific and technological information and to help intelligence analysts select high-value data from a large number of open-source scientific and...It is important to effectively identify the data value of open source scientific and technological information and to help intelligence analysts select high-value data from a large number of open-source scientific and technological information. The data value evaluation methods of scientific and technological information is proposed in the open source environment. According to the characteristics of the methods, the data value evaluation methods were divided into the following three aspects: research on data value evaluation methods based on information metrology, research on data value evaluation methods based on economic perspective and research on data value assessment methods based on text analysis. For each method, it indicated the main ideas, application scenarios, advantages and disadvantages.展开更多
Understanding the underlying goal behind a user's Web query has been proved to be helpful to improve the quality of search. This paper focuses on the problem of automatic identification of query types according to th...Understanding the underlying goal behind a user's Web query has been proved to be helpful to improve the quality of search. This paper focuses on the problem of automatic identification of query types according to the goals. Four novel entropy-based features extracted from anchor data and click-through data are proposed, and a support vector machines (SVM) classifier is used to identify the user's goal based on these features. Experi- mental results show that the proposed entropy-based features are more effective than those reported in previous work. By combin- ing multiple features the goals for more than 97% of the queries studied can be correctly identified. Besides these, this paper reaches the following important conclusions: First, anchor-based features are more effective than click-through-based features; Second, the number of sites is more reliable than the number of links; Third, click-distribution- based features are more effective than session-based ones.展开更多
In order to provide predictable runtime performante for text categorization (TC) systems, an innovative system design method is proposed for soft real time TC systems. An analyzable mathematical model is established...In order to provide predictable runtime performante for text categorization (TC) systems, an innovative system design method is proposed for soft real time TC systems. An analyzable mathematical model is established to approximately describe the nonlinear and time-varying TC systems. According to this mathematical model, the feedback control theory is adopted to prove the system's stableness and zero steady state error. The experiments result shows that the error of deadline satisfied ratio in the system is kept within 4 of the desired value. And the number of classifiers can be dynamically adjusted by the system itself to save the computa tion resources. The proposed methodology enables the theo retical analysis and evaluation to the TC systems, leading to a high-quality and low cost implementation approach.展开更多
文摘As Natural Language Processing(NLP)continues to advance,driven by the emergence of sophisticated large language models such as ChatGPT,there has been a notable growth in research activity.This rapid uptake reflects increasing interest in the field and induces critical inquiries into ChatGPT’s applicability in the NLP domain.This review paper systematically investigates the role of ChatGPT in diverse NLP tasks,including information extraction,Name Entity Recognition(NER),event extraction,relation extraction,Part of Speech(PoS)tagging,text classification,sentiment analysis,emotion recognition and text annotation.The novelty of this work lies in its comprehensive analysis of the existing literature,addressing a critical gap in understanding ChatGPT’s adaptability,limitations,and optimal application.In this paper,we employed a systematic stepwise approach following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses(PRISMA)framework to direct our search process and seek relevant studies.Our review reveals ChatGPT’s significant potential in enhancing various NLP tasks.Its adaptability in information extraction tasks,sentiment analysis,and text classification showcases its ability to comprehend diverse contexts and extract meaningful details.Additionally,ChatGPT’s flexibility in annotation tasks reducesmanual efforts and accelerates the annotation process,making it a valuable asset in NLP development and research.Furthermore,GPT-4 and prompt engineering emerge as a complementary mechanism,empowering users to guide the model and enhance overall accuracy.Despite its promising potential,challenges persist.The performance of ChatGP Tneeds tobe testedusingmore extensivedatasets anddiversedata structures.Subsequently,its limitations in handling domain-specific language and the need for fine-tuning in specific applications highlight the importance of further investigations to address these issues.
基金supported in part by the Beijing Natural Science Foundation under grants M21032 and 19L2029in part by the National Natural Science Foundation of China under grants U1836106 and 81961138010in part by the Scientific and Technological Innovation Foundation of Foshan under grants BK21BF001 and BK20BF010.
文摘Nowadays short texts can be widely found in various social data in relation to the 5G-enabled Internet of Things (IoT). Short text classification is a challenging task due to its sparsity and the lack of context. Previous studies mainly tackle these problems by enhancing the semantic information or the statistical information individually. However, the improvement achieved by a single type of information is limited, while fusing various information may help to improve the classification accuracy more effectively. To fuse various information for short text classification, this article proposes a feature fusion method that integrates the statistical feature and the comprehensive semantic feature together by using the weighting mechanism and deep learning models. In the proposed method, we apply Bidirectional Encoder Representations from Transformers (BERT) to generate word vectors on the sentence level automatically, and then obtain the statistical feature, the local semantic feature and the overall semantic feature using Term Frequency-Inverse Document Frequency (TF-IDF) weighting approach, Convolutional Neural Network (CNN) and Bidirectional Gate Recurrent Unit (BiGRU). Then, the fusion feature is accordingly obtained for classification. Experiments are conducted on five popular short text classification datasets and a 5G-enabled IoT social dataset and the results show that our proposed method effectively improves the classification performance.
文摘With the popularity of English learning in the world,the demand for instructional materials is increasing,and how to accurately convey the information of these texts to English-learners and provide the learners with an effective learning experience has become a major problem.Therefore,the role of the translator is crucial.It is found that that translators should make linguistic choices based on the text functions and target readers’expectations,and flexibly adopt the translation strategies,such as addition,conversion,and cohesion to convey the intention of the source author and generate the source context accurately and appropriately.Applying relevant theories to the analysis of translation cases,the paper tentatively puts forward solutions to the problems encountered in the E-C translation of chapter one of Murder Your Darlings and Other Gentle Writing Advice from Aristotle to Zinsser.Hopefully,it could provide a reference for other translators working on instructional texts.
文摘Integrated writing tasks,requiring writers to summarize source content and integrate their ideas or experience with those in the source text,are widely used in standardized tests of English proficiency.Integrated writing adds an element not found in traditional independent writing:the use of source text material.Recently,concern has been raised that over-reliance on source texts in integrated writing may lead to plagiarism as well as inaccurate assessment which calls for further investigation.Research on L2 students’writing using sources has broadened its focus from transgressive citing practices to stages of skill development in this complicated literacy,the successive challenges students face,and how instruction might be of benefit.Previous literature mainly focused on source use in academic writing and seldom considered summary writing in standardized tests,while summary strategy is particularly important in standardized tests,in which summary accounts for a significant part of the assessment criteria.Given that a series of previous literature have indicated that source text use,source texts,and prompts are associated with the quality of summary writing,this article reviews the source text borrowing in the summary of integrated writing to further explore the relationship between some variables and the quality of summary writing and therefore finds some research gaps in this field.
基金supported by NIH grants R01LM010817 and P01AG039347
文摘Purpose: The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles. In this informal essay, I will give my personal perspective on Don's contributions to science, and outline some current and future directions in literature-based discovery that are rooted in concepts that he developed.Design/methodology/approach: Personal recollections and literature review. Findings: The Swanson A-B-C model of literature-based discovery has been successfully used by laboratory investigators analyzing their findings and hypotheses. It continues to be a fertile area of research in a wide range of application areas including text mining, drug repurposing, studies of scientific innovation, knowledge discovery in databases, and bioinformatics. Recently, additional modes of discovery that do not follow the A-B-C model have also been proposed and explored (e.g. so-called storytelling, gaps, analogies, link prediction, negative consensus, outliers, and revival of neglected or discarded research questions). Research limitations: This paper reflects the opinions of the author and is not a comprehensive nor technically based review of literature-based discovery. Practical implications: The general scientific public is still not aware of the availability of tools for literature-based discovery. Our Arrowsmith project site maintains a suite of discovery tools that are free and open to the public (http://arrowsmith.psych.uic.edu), as does BITOLA which is maintained by Dmitar Hristovski (http:// http://ibmi.mf.uni-lj.si/bitola), and Epiphanet which is maintained by Trevor Cohen (http://epiphanet.uth.tme.edu/). Bringing user-friendly tools to the public should be a high priority, since even more than advancing basic research in informatics, it is vital that we ensure that scientists actually use discovery tools and that these are actually able to help them make experimental discoveries in the lab and in the clinic. Originality/value: This paper discusses problems and issues which were inherent in Don's thoughts during his life, including those which have not yet been fully taken up and studied systematically.
文摘With the flooding of pornographic information on the Internet, how to keep people away from that offensive information is becoming one of the most important research areas in network information security. Some applications which can block or filter such information are used. Approaches in those systems can be roughly classified into two kinds: metadata based and content based. With the development of distributed technologies, content based filtering technologies will play a more and more important role in filtering systems. Keyword matching is a content based method used widely in harmful text filtering. Experiments to evaluate the recall and precision of the method showed that the precision of the method is not satisfactory, though the recall of the method is rather high. According to the results, a new pornographic text filtering model based on reconfirming is put forward. Experiments showed that the model is practical, has less loss of recall than the single keyword matching method, and has higher precision.
文摘With the development of Web 2.0,more and more people choose to use the Internet to express their opinions.All this opinions together into a new form text which contains a lot of valuable emotional information,this is why how to deal with these texts and analysis the emotional information is significant for us.We get three main tasks of sentiment analysis,including sentiment extraction,sentiment classification,sentiment application and summarization.In this paper,based on the R software,we introduced the steps of sentiment analysis in detail.Finally,we collect the movie reviews from the Internet,and use R software to do sentiment analysis in order to judge the emotional tendency of the text.
文摘This study introduces the Orbit Weighting Scheme(OWS),a novel approach aimed at enhancing the precision and efficiency of Vector Space information retrieval(IR)models,which have traditionally relied on weighting schemes like tf-idf and BM25.These conventional methods often struggle with accurately capturing document relevance,leading to inefficiencies in both retrieval performance and index size management.OWS proposes a dynamic weighting mechanism that evaluates the significance of terms based on their orbital position within the vector space,emphasizing term relationships and distribution patterns overlooked by existing models.Our research focuses on evaluating OWS’s impact on model accuracy using Information Retrieval metrics like Recall,Precision,InterpolatedAverage Precision(IAP),andMeanAverage Precision(MAP).Additionally,we assessOWS’s effectiveness in reducing the inverted index size,crucial for model efficiency.We compare OWS-based retrieval models against others using different schemes,including tf-idf variations and BM25Delta.Results reveal OWS’s superiority,achieving a 54%Recall and 81%MAP,and a notable 38%reduction in the inverted index size.This highlights OWS’s potential in optimizing retrieval processes and underscores the need for further research in this underrepresented area to fully leverage OWS’s capabilities in information retrieval methodologies.
文摘The frame of text classification system was presented. The high dimensionality in feature space for text classification was studied. The mutual information is a widely used information theoretic measure, in a descriptive way, to measure the stochastic dependency of discrete random variables. The measure method was used as a criterion to reduce high dimensionality of feature vectors in text classification on Web. Feature selections or conversions were performed by using maximum mutual information including linear and non-linear feature conversions. Entropy was used and extended to find right features commendably in pattern recognition systems. Favorable foundation would be established for text classification mining.
基金supported by National Natural Science Foundation of China(NSFC)(Grant Nos.62162022,62162024)the Key Research and Development Program of Hainan Province(Grant Nos.ZDYF2020040,ZDYF2021GXJS003)+2 种基金the Major Science and Technology Project of Hainan Province(Grant No.ZDKJ2020012)Hainan Provincial Natural Science Foundation of China(Grant Nos.620MS021,621QN211)Science and Technology Development Center of the Ministry of Education Industry-University-Research Innovation Fund(2021JQR017).
文摘In the realm of Multi-Label Text Classification(MLTC),the dual challenges of extracting rich semantic features from text and discerning inter-label relationships have spurred innovative approaches.Many studies in semantic feature extraction have turned to external knowledge to augment the model’s grasp of textual content,often overlooking intrinsic textual cues such as label statistical features.In contrast,these endogenous insights naturally align with the classification task.In our paper,to complement this focus on intrinsic knowledge,we introduce a novel Gate-Attention mechanism.This mechanism adeptly integrates statistical features from the text itself into the semantic fabric,enhancing the model’s capacity to understand and represent the data.Additionally,to address the intricate task of mining label correlations,we propose a Dual-end enhancement mechanism.This mechanism effectively mitigates the challenges of information loss and erroneous transmission inherent in traditional long short term memory propagation.We conducted an extensive battery of experiments on the AAPD and RCV1-2 datasets.These experiments serve the dual purpose of confirming the efficacy of both the Gate-Attention mechanism and the Dual-end enhancement mechanism.Our final model unequivocally outperforms the baseline model,attesting to its robustness.These findings emphatically underscore the imperativeness of taking into account not just external knowledge but also the inherent intricacies of textual data when crafting potent MLTC models.
基金Sponsored by the National Nature Science Foundation Projects (Grant No. 60773070,60736044)
文摘In order to solve the poor performance in text classification when using traditional formula of mutual information (MI),a feature selection algorithm were proposed based on improved mutual information.The improved mutual information algorithm,which is on the basis of traditional improved mutual information methods that enhance the MI value of negative characteristics and feature's frequency,supports the concept of concentration degree and dispersion degree.In accordance with the concept of concentration degree and dispersion degree,formulas which embody concentration degree and dispersion degree were constructed and the improved mutual information was implemented based on these.In this paper,the feature selection algorithm was applied based on improved mutual information to a text classifier based on Biomimetic Pattern Recognition and it was compared with several other feature selection methods.The experimental results showed that the improved mutual information feature selection method greatly enhances the performance compared with traditional mutual information feature selection methods and the performance is better than that of information gain.Through the introduction of the concept of concentration degree and dispersion degree,the improved mutual information feature selection method greatly improves the performance of text classification system.
文摘The developed system for eye and face detection using Convolutional Neural Networks(CNN)models,followed by eye classification and voice-based assistance,has shown promising potential in enhancing accessibility for individuals with visual impairments.The modular approach implemented in this research allows for a seamless flow of information and assistance between the different components of the system.This research significantly contributes to the field of accessibility technology by integrating computer vision,natural language processing,and voice technologies.By leveraging these advancements,the developed system offers a practical and efficient solution for assisting blind individuals.The modular design ensures flexibility,scalability,and ease of integration with existing assistive technologies.However,it is important to acknowledge that further research and improvements are necessary to enhance the system’s accuracy and usability.Fine-tuning the CNN models and expanding the training dataset can improve eye and face detection as well as eye classification capabilities.Additionally,incorporating real-time responses through sophisticated natural language understanding techniques and expanding the knowledge base of ChatGPT can enhance the system’s ability to provide comprehensive and accurate responses.Overall,this research paves the way for the development of more advanced and robust systems for assisting visually impaired individuals.By leveraging cutting-edge technologies and integrating them into amodular framework,this research contributes to creating a more inclusive and accessible society for individuals with visual impairments.Future work can focus on refining the system,addressing its limitations,and conducting user studies to evaluate its effectiveness and impact in real-world scenarios.
文摘Text visualization is concerned with the representation of text in a graphicalform to facilitate comprehension of large textual data. Its aim is to improve the ability tounderstand and utilize the wealth of text-based information available. An essential task inany scientific research is the study and review of previous works in the specified domain,a process that is referred to as the literature survey process. This process involves theidentification of prior work and evaluating its relevance to the research question. With theenormous number of published studies available online in digital form, this becomes acumbersome task for the researcher. This paper presents the design and implementationof a tool that aims to facilitate this process by identifying relevant work and suggestingclusters of articles by conceptual modeling, thus providing different options that enablethe researcher to visualize a large number of articles in a graphical easy-to-analyze form.The tool helps the researcher in analyzing and synthesizing the literature and building aconceptual understanding of the designated research area. The evaluation of the toolshows that researchers have found it useful and that it supported the process of relevantwork analysis given a specific research question, and 70% of the evaluators of the toolfound it very useful.
基金This study was funded by the International Science and Technology Cooperation Program of the Science and Technology Department of Shaanxi Province,China(No.2021KW-16)the Science and Technology Project in Xi’an(No.2019218114GXRC017CG018-GXYD17.11),Thesis work was supported by the special fund construction project of Key Disciplines in Ordinary Colleges and Universities in Shaanxi Province,the authors would like to thank the anonymous reviewers for their helpful comments and suggestions.
文摘Text event mining,as an indispensable method of text mining processing,has attracted the extensive attention of researchers.A modeling method for knowledge graph of events based on mutual information among neighbor domains and sparse representation is proposed in this paper,i.e.UKGE-MS.Specifically,UKGE-MS can improve the existing text mining technology's ability of understanding and discovering high-dimensional unmarked information,and solves the problems of traditional unsupervised feature selection methods,which only focus on selecting features from a global perspective and ignoring the impact of local connection of samples.Firstly,considering the influence of local information of samples in feature correlation evaluation,a feature clustering algorithm based on average neighborhood mutual information is proposed,and the feature clusters with certain event correlation are obtained;Secondly,an unsupervised feature selection method based on the high-order correlation of multi-dimensional statistical data is designed by combining the dimension reduction advantage of local linear embedding algorithm and the feature selection ability of sparse representation,so as to enhance the generalization ability of the selected feature items.Finally,the events knowledge graph is constructed by means of sparse representation and l1 norm.Extensive experiments are carried out on five real datasets and synthetic datasets,and the UKGE-MS are compared with five corresponding algorithms.The experimental results show that UKGE-MS is better than the traditional method in event clustering and feature selection,and has some advantages over other methods in text event recognition and discovery.
文摘It is important to effectively identify the data value of open source scientific and technological information and to help intelligence analysts select high-value data from a large number of open-source scientific and technological information. The data value evaluation methods of scientific and technological information is proposed in the open source environment. According to the characteristics of the methods, the data value evaluation methods were divided into the following three aspects: research on data value evaluation methods based on information metrology, research on data value evaluation methods based on economic perspective and research on data value assessment methods based on text analysis. For each method, it indicated the main ideas, application scenarios, advantages and disadvantages.
基金the Tianjin Applied Fundamental Research Plan (07JCYBJC14500)
文摘Understanding the underlying goal behind a user's Web query has been proved to be helpful to improve the quality of search. This paper focuses on the problem of automatic identification of query types according to the goals. Four novel entropy-based features extracted from anchor data and click-through data are proposed, and a support vector machines (SVM) classifier is used to identify the user's goal based on these features. Experi- mental results show that the proposed entropy-based features are more effective than those reported in previous work. By combin- ing multiple features the goals for more than 97% of the queries studied can be correctly identified. Besides these, this paper reaches the following important conclusions: First, anchor-based features are more effective than click-through-based features; Second, the number of sites is more reliable than the number of links; Third, click-distribution- based features are more effective than session-based ones.
基金Supported by the National Natural Science Foun-dation of China (90104032) ,the National High-Tech Research andDevelopment Plan of China (2003AA1Z2090)
文摘In order to provide predictable runtime performante for text categorization (TC) systems, an innovative system design method is proposed for soft real time TC systems. An analyzable mathematical model is established to approximately describe the nonlinear and time-varying TC systems. According to this mathematical model, the feedback control theory is adopted to prove the system's stableness and zero steady state error. The experiments result shows that the error of deadline satisfied ratio in the system is kept within 4 of the desired value. And the number of classifiers can be dynamically adjusted by the system itself to save the computa tion resources. The proposed methodology enables the theo retical analysis and evaluation to the TC systems, leading to a high-quality and low cost implementation approach.