Abstract: Text alignment is crucial to the accuracy of machine translation (MT) systems, some natural language processing (NLP) tools, and any other text processing task requiring bilingual data. This research proposes a language-independent sentence alignment approach based on Polish-to-English experiments (Polish is not a position-sensitive language). The approach was developed on the TED (Translanguage English Database) talks corpus but can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and semantic text structure analysis as additional information. Minimization of data loss was ensured. The solution is compared to other sentence alignment implementations, and an improvement in MT system score with text processed by the described tool is shown.
Funding: The National Hi-Tech Research and Development Program (863) of China (No. 2002AA119050)
Abstract: The issue of proper name recognition in Chinese text is discussed. An automatic approach based on association analysis to extract rules from a corpus is presented. The method tries to discover rules relevant to external evidence by association analysis, without additional manual effort. These rules can be used to recognize proper nouns in Chinese texts. The experimental results show that the method is practical in some applications. Moreover, the method is language independent.
Funding: This publication was supported by the Universiti Kebangsaan Malaysia (UKM) under the Research University Grant (Project Code: DIP-2016-024).
Abstract: Feature selection and sentiment analysis are two common areas of current study, consistent with advances in computing and the growing use of social media. High-dimensional or large feature sets are a key issue in sentiment analysis, as they can decrease the accuracy of sentiment classification and make it difficult to obtain the optimal subset of features. Furthermore, most reviews from social media carry a lot of noise and irrelevant information. Therefore, this study proposes a new text-feature selection method that combines rough set theory (RST) and teaching-learning based optimization (TLBO), known as RSTLBO. The framework for developing the proposed RSTLBO includes four stages: (1) acquiring the standard datasets (user reviews of six major U.S. airlines), which are used to validate search result feature selection methods; (2) preprocessing the dataset using text processing methods from natural language processing, combined with linguistic processing techniques, to produce high classification results; (3) employing the RSTLBO method; and (4) using the features selected in the previous stage for sentiment classification with the Support Vector Machine (SVM) technique. Results show an improvement in sentiment analysis when natural language processing is combined with linguistic processing for text processing. More importantly, the proposed RSTLBO feature selection algorithm produces improved sentiment analysis results.
Abstract: The fourth International Conference on Web Information Systems and Applications (WISA 2007) received 409 submissions and accepted 37 papers for publication in this issue. The papers cover broad research areas, including Web mining and data warehouses, the Deep Web and Web integration, P2P networks, text processing and information retrieval, as well as Web Services and Web infrastructure. After briefly introducing the WISA conference, the survey outlines current activities and future trends concerning Web information systems and applications, based on the papers accepted for publication.
Abstract: This paper reviews the theories and studies in the field of inferential situation models. The Construction-Integration (CI) model, the Structure Building Framework (SBF), and three empirical studies are introduced. The paper concludes that future studies taking a quantitative approach should improve their test materials, their manipulation of language proficiency, and their treatment of language background.
Funding: Supported by King Fahd University of Petroleum and Minerals (KFUPM) of Saudi Arabia under Grant Nos. RG-1009-1 and RG-1009-2
Abstract: In this paper, we present a general model for Arabic bank check processing that indicates the major phases of a check processing system. We then survey the available databases for Arabic bank check processing research. The state of the art in the different phases of Arabic bank check processing is surveyed (i.e., pre-processing, check analysis and segmentation, feature extraction, and legal and courtesy amount recognition). Open issues for future research are stated, and areas that need improvement are presented. To the best of our knowledge, this is the first survey of Arabic bank check processing.
Abstract: Nowadays, natural language processing (NLP) is one of the most popular areas of, broadly understood, artificial intelligence. Therefore, every day, new research contributions are posted, for instance, to the arXiv repository. Hence, it is rather difficult to capture the current "state of the field" and thus to enter it. This brought the idea to use state-of-the-art NLP techniques to analyse the NLP-focused literature. As a result, (1) meta-level knowledge concerning the current state of NLP has been captured, and (2) a guide to the use of basic NLP tools is provided. It should be noted that all the tools and the dataset described in this contribution are publicly available. Furthermore, the originality of this review lies in its full automation, which allows easy reproducibility, continuation, and updating of this research as new work emerges in the field of NLP.
Funding: Supported by the Social Science Foundation of Shaanxi Province of China (2018P03) and the Humanities and Social Sciences Research Youth Fund Project of Ministry of Education of China (13YJCZH251)
Abstract: In the K-means clustering algorithm, each data point is uniquely placed into one category. The clustering quality is heavily dependent on the initial cluster centroids. Different initializations can yield varied results, and local adjustment cannot save the clustering result from poor local optima. If there is an anomaly in a cluster, it will seriously affect the cluster mean value. The K-means clustering algorithm is also only suitable for clusters with convex shapes. We therefore propose a novel clustering algorithm, CARDBK ("centroid all rank distance" (CARD) means that all centroids are sorted by their distance from one point, and "BK" stands for "batch K-means"), in which one point modifies not only the cluster centroid nearest to it but also multiple cluster centroids adjacent to it, and the degree of influence of a point on a cluster centroid depends on the distance between this point and the other nearer cluster centroids. Experimental results showed that our CARDBK algorithm outperformed other algorithms when tested on a number of different data sets, based on the following performance indexes: entropy, purity, F1 value, Rand index, and normalized mutual information (NMI). Our algorithm also proved to be more stable, linearly scalable, and faster.
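The two K-means weaknesses named in this abstract (hard assignment and sensitivity of the cluster mean to an anomaly) can be illustrated with a minimal sketch. This is plain Lloyd-style K-means in one dimension, not the CARDBK algorithm; the function name `kmeans_1d` and the toy data are assumptions made only for illustration.

```python
def kmeans_1d(points, centroids, iters=20):
    """Plain Lloyd iterations in one dimension (hard assignment)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # Each point is placed into exactly one cluster: its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Each centroid becomes the mean of its members; empty clusters stay put.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical toy data: two well-separated groups.
points = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
clean_centroids, _ = kmeans_1d(points, [0.0, 10.0])

# Adding a single anomaly (100.0) drags the means so far that the two
# genuine groups end up merged into one cluster.
noisy_centroids, _ = kmeans_1d(points + [100.0], [0.0, 10.0])
```

With the anomaly present, one centroid ends up dedicated to the outlier while the other sits between the two real groups, which is the kind of failure the abstract attributes to the cluster mean.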
Funding: Supported by the Natural Science Foundation Research Project of Shaanxi Province of China (2015JM6318) and the Humanities and Social Sciences Research Youth Fund Project of Ministry of Education of China (13YJCZH251)
Abstract: In hard competition learning, each point modifies only the one cluster centroid that wins. In soft competition learning, by contrast, each point modifies not only the winning cluster centroid but also many other cluster centroids near this point. A soft competition learning method is proposed. Centroid all rank distance (CARD), CARDx, and centroid all rank distance batch K-means (CARDBK) are three clustering algorithms that adopt the proposed method. In these algorithms, the extent to which one point affects a cluster centroid depends on the distances from this point to the other nearer cluster centroids, rather than just on the rank of the distance from this point to this cluster centroid among the distances from this point to all cluster centroids. Validation experiments are carried out to compare the three soft competition learning algorithms CARD, CARDx, and CARDBK with several hard competition learning algorithms, as well as the neural gas (NG) algorithm, on five data sets from different sources. Judging from the values of five performance indexes in the clustering results, this kind of soft competition learning method has better clustering effect and efficiency, and has linear scalability.
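The contrast between hard and soft competition can be sketched in a few lines. The soft rule below is the rank-based neural gas update, which the abstract uses as a baseline; it is not the CARD weighting, which according to the abstract depends on distances rather than rank alone. The function names, learning rate `lr`, and decay `lam` are assumptions chosen for illustration.

```python
import math


def hard_update(centroids, x, lr=0.5):
    """Winner-take-all: only the nearest centroid moves toward x."""
    winner = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
    updated = list(centroids)
    updated[winner] += lr * (x - updated[winner])
    return updated


def neural_gas_update(centroids, x, lr=0.5, lam=1.0):
    """Soft update: every centroid moves toward x, weighted by its distance rank."""
    order = sorted(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
    updated = list(centroids)
    for rank, i in enumerate(order):
        # Rank 0 (the winner) gets full weight; farther ranks decay exponentially.
        updated[i] += lr * math.exp(-rank / lam) * (x - updated[i])
    return updated


centroids = [0.0, 4.0, 10.0]
hard = hard_update(centroids, 1.0)        # only the first centroid moves
soft = neural_gas_update(centroids, 1.0)  # all three move, by shrinking amounts
```

In the hard rule, the second and third centroids are untouched by the point; in the soft rule, they are also pulled toward it, with a step that shrinks as the rank increases.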
Funding: Supported by the National Key R&D Program of China (No. 2018YFC0830200)
Abstract: Personalized search is a promising way to improve the quality of Web search, and it has attracted much attention from both academic and industrial communities. Much of the current related research is based on commercial search engine data, which cannot be released publicly for reasons such as privacy protection and information security. This leads to a serious lack of accessible public data sets in this field. The few publicly available data sets have not become widely used in academia because of the complexity of the processing required to study personalized search methods. The lack of data sets, together with the difficulties of data processing, has hindered fair comparison and evaluation of personalized search models. In this paper, we construct a large-scale data set, AOL4PS, for evaluating personalized search methods, collected and processed from AOL query logs. We present the complete and detailed data processing and construction process. Specifically, to address the challenges of processing time and storage space posed by massive data volumes, we optimized the process of data set construction and proposed an improved BM25 algorithm. Experiments are performed on AOL4PS with some classic and state-of-the-art personalized search methods, and the results demonstrate that AOL4PS can measure the effect of personalized search models.
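For readers unfamiliar with BM25, the following is a minimal sketch of the standard Okapi BM25 ranking function. It is not the improved variant proposed in the paper, whose details the abstract does not give; the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF form, and the toy documents are all assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with standard Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection of tokenized documents.
docs = [["cheap", "flights", "paris"], ["paris", "hotels"], ["weather", "today"]]
scores = bm25_scores(["paris", "flights"], docs)
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third matches nothing and scores zero.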
Funding: This research was partly supported by the Japan Society for the Promotion of Science KAKENHI Grants-in-Aid for Scientific Research (S) No. 23220005, (A) No. 26240050, (B) No. 23300094, and Young Scientists (B) No. 23700310.
Abstract: This paper, which aims to extend the vocabulary of an existing thesaurus using hyponymy relations, focuses on an agricultural thesaurus called AGROVOC. Our main goal is to acquire AGROVOC-qualified candidates from the hyponymy relations of legal texts and tables. We propose a pattern-based approach to hyponymy relation acquisition. Our experimental results showed that 222 candidates were extracted from statutory sentences with 67.1% precision and 868 candidates from tables with 37.0% precision.
Funding: National Natural Science Foundation of China under Grant (No. 61802160), Doctoral Start-Up Fund of Liaoning Province (No. 20180540106), and Liaoning Public Opinion and Network Security Big Data System Engineering Laboratory (No. 04-2016-0089013).
Abstract: Event extraction (EE) is a difficult task in natural language processing (NLP). The target of EE is to obtain and present key information described in natural language in a structured form. Internet opinion, as an essential bearer of social information, is crucial. To help readers quickly get the main idea of news, a method of analyzing public sentiment information on the Internet and extracting events from news information is proposed, enabling users to quickly obtain the information they need. The event extraction method is based on Chinese-language public opinion information and aims to automatically classify different types of public opinion events by using sentence-level features, with neural networks applied to extract events. A sentence feature model is introduced to classify different types of public opinion events. To ensure the effective retention of text information in the calculation process, an attention mechanism is added to the semantic information, and an effective public opinion event extractor is trained through CNN and LSTM networks. Experiments show that structured information can be extracted from unstructured text, and that public opinion event entities, event-entity relationships, and entity attribute information can be obtained.
Abstract: This paper takes as its main point of departure a body of empirical research on reading and text processing, and makes particular reference to the type of experiments conducted in Egidi and Gerrig (2006) and Rapp and Gerrig (2006). Broadly put, these experiments (i) explore the psychology of readers' preferences for narrative outcomes, (ii) examine the way readers react to characters' goals and actions, and (iii) investigate how readers tend to identify with characters' goals the more 'urgently' those goals are narrated. The present paper signals how stylistics can productively enrich such experimental work. Stylistics, it is argued, is well equipped to deal with subtle and nuanced variations in textual patterns without losing sight of the broader cognitive and discoursal positioning of readers in relation to these patterns. Making particular reference to what might constitute narrative 'urgency', the article develops a model which amalgamates different strands of contemporary research in narrative stylistics. This model advances and elaborates three key components: a Stylistic Profile, a Burlesque Block, and a Kuleshov Monitor. Developing analyses of, and informal informant tests on, examples of both fiction and film, the paper calls for a more rounded and sophisticated understanding of style in empirical research on subjects' responses to patterns in narrative.