Abstract: All human languages have words that can mean different things in different contexts; such words with multiple meanings are potentially "ambiguous". The process of "deciding which of several meanings of a term is intended in a given context" is known as word sense disambiguation (WSD). This paper presents a WSD method that assigns a target word the sense that is most related to the senses of its neighboring words. We explore measures of relatedness between word senses based on a novel hybrid approach. First, we investigate how to express a "concept" "literally" and "regularly". We apply set algebra to WordNet's synsets in combination with WordNet's word ontology. In this way we establish regular rules for constructing various representations (lexical notations) of a concept using Boolean operators and the word forms in the synsets defined in WordNet. We then establish a formal mechanism for quantifying and estimating the semantic relatedness between concepts: we use "concept distribution statistics" to determine the degree of semantic relatedness between two lexically expressed concepts. The experimental results show good performance on SemCor, a subset of the Brown corpus. We observe that measures of semantic relatedness are useful sources of information for WSD.
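The core selection rule described in this abstract (choose the sense of the target word that is most related to the senses of its neighbors) can be illustrated with a minimal sketch. The sense inventory, the relatedness scores, and the function names below are hypothetical toy values chosen for illustration; they are not the paper's WordNet-based lexical notations or its concept distribution statistics.

```python
# Hypothetical toy sense inventory: word -> candidate sense labels.
SENSES = {
    "bank": ["bank#finance", "bank#river"],
    "deposit": ["deposit#money", "deposit#sediment"],
    "money": ["money#currency"],
}

# Hypothetical symmetric relatedness scores between pairs of senses.
# In the paper these would come from concept distribution statistics.
RELATEDNESS = {
    frozenset(["bank#finance", "deposit#money"]): 0.9,
    frozenset(["bank#finance", "money#currency"]): 0.8,
    frozenset(["bank#river", "deposit#sediment"]): 0.7,
    frozenset(["bank#finance", "deposit#sediment"]): 0.1,
    frozenset(["bank#river", "deposit#money"]): 0.1,
    frozenset(["bank#river", "money#currency"]): 0.05,
}


def relatedness(sense_a, sense_b):
    """Look up the toy relatedness score; unknown pairs score 0."""
    return RELATEDNESS.get(frozenset([sense_a, sense_b]), 0.0)


def disambiguate(target, neighbors):
    """Pick the sense of `target` most related to the neighbors' senses.

    For each candidate sense, take the best-matching sense of every
    neighbor word and sum those scores; return the top candidate.
    """
    best_sense, best_score = None, float("-inf")
    for sense in SENSES[target]:
        score = sum(
            max((relatedness(sense, ns) for ns in SENSES.get(n, [])), default=0.0)
            for n in neighbors
        )
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


print(disambiguate("bank", ["deposit", "money"]))
```

Summing the best per-neighbor score is one simple way to aggregate evidence from the context window; other aggregations (for example, averaging over all sense pairs) would fit the same rule.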
Funding: Projects (61170156, 60933005) supported by the National Natural Science Foundation of China
Abstract: Sentiment analysis is the computational study of how opinions, attitudes, emotions, and perspectives are expressed in language, and it has been an important task in natural language processing. Sentiment analysis is highly valuable for both research and practical applications. This work focuses on the difficulty of constructing sentiment classifiers, which normally need large amounts of labeled in-domain training data, and proposes a novel unsupervised framework that uses Chinese idiom resources to develop a general sentiment classifier. Furthermore, the domain adaptation of the general sentiment classifier is improved by taking it as the base of a self-training procedure that yields a domain self-training sentiment classifier. To validate the unsupervised framework, several experiments were carried out on a publicly available dataset of Chinese online reviews. The experiments show that the proposed framework is effective and achieves encouraging results. Specifically, the general classifier outperforms two baselines (a naïve 50% baseline and a cross-domain classifier), and the bootstrapped self-training classifier approaches the upper-bound domain-specific classifier, with a lowest accuracy of 81.5%, while its performance is more stable and the framework needs no labeled training data.
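The self-training step described in this abstract can be sketched as follows, under loud assumptions: a small seed polarity lexicon stands in for the idiom-based general classifier, and "retraining" is reduced to re-estimating word polarities from confidently pseudo-labeled documents. The lexicon, the threshold, and all function names are hypothetical; the paper's actual classifier and features are not specified in the abstract.

```python
from collections import Counter


def score(tokens, lexicon):
    """Sum word polarities; positive totals suggest positive sentiment."""
    return sum(lexicon.get(t, 0.0) for t in tokens)


def self_train(seed_lexicon, unlabeled_docs, rounds=3, threshold=1.0):
    """Adapt a general polarity lexicon to a domain without labels.

    Each round: pseudo-label the documents the current lexicon is
    confident about, then re-estimate the polarity of non-seed words
    from those pseudo-labels.
    """
    lexicon = dict(seed_lexicon)
    for _ in range(rounds):
        pos_counts, neg_counts = Counter(), Counter()
        for doc in unlabeled_docs:
            s = score(doc, lexicon)
            if s >= threshold:        # confident positive pseudo-label
                pos_counts.update(doc)
            elif s <= -threshold:     # confident negative pseudo-label
                neg_counts.update(doc)
        for word in set(pos_counts) | set(neg_counts):
            if word in seed_lexicon:  # keep the general knowledge fixed
                continue
            p, n = pos_counts[word], neg_counts[word]
            lexicon[word] = (p - n) / (p + n)
    return lexicon


# Hypothetical seed lexicon and unlabeled in-domain reviews (tokenized).
seed = {"excellent": 1.0, "terrible": -1.0}
docs = [
    ["excellent", "battery", "durable"],
    ["terrible", "screen", "flimsy"],
    ["durable", "case"],
    ["flimsy", "hinge"],
]
adapted = self_train(seed, docs)
print(sorted(adapted.items()))
```

In this toy run, domain words such as "durable" are learned in the first round and then let the classifier label documents that contain no seed word at all, which is the effect self-training is meant to achieve.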
Funding: Project (60763001) supported by the National Natural Science Foundation of China; Projects (2009GZS0027, 2010GZS0072) supported by the Natural Science Foundation of Jiangxi Province, China
Abstract: To overcome the defects of the classical hidden Markov model (HMM), a new statistical model, the Markov family model (MFM), was proposed. The Markov family model was applied to speech recognition and natural language processing. Speaker-independent continuous speech recognition experiments and part-of-speech tagging experiments show that the Markov family model outperforms the hidden Markov model. The precision is raised from 94.642% to 96.214% in the part-of-speech tagging experiments, and the work rate is reduced by 11.9% in the speech recognition experiments with respect to the HMM baseline system.
Abstract: Text alignment is crucial to the accuracy of machine translation (MT) systems, some natural language processing (NLP) tools, and any other text processing task that requires bilingual data. This research proposes a language-independent sentence alignment approach based on Polish-to-English experiments (Polish is not a position-sensitive language). The alignment approach was developed on the TED (Translanguage English Database) talks corpus, but can be used for any text domain or language pair. The proposed approach implements various heuristics for sentence recognition, some of which use synonyms and semantic text structure analysis as additional information. Minimization of data loss was ensured. The solution is compared to other sentence alignment implementations. An improvement in MT system score on text processed with the described tool is also shown.
Abstract: In this paper, we present a modular incremental statistical model for English full parsing. Unlike other full parsing approaches, in which the analysis of the sentence is a uniform process, our model separates full parsing into shallow parsing and sentence skeleton parsing. In shallow parsing, we perform POS tagging, base NP identification, prepositional phrase attachment, and subordinate clause identification. In skeleton parsing, we use a layered feature-oriented statistical method. Modularity has the advantage of solving different parsing problems with corresponding mechanisms. Feature-oriented rules can express complex linguistic phenomena at key points when needed. Evaluated on the Penn Treebank corpus, the model obtains 89.2% precision and 89.8% recall.