Abstract: With the widespread use of Chinese globally, the number of Chinese learners has been increasing, and beginners produce a variety of grammatical errors. Additionally, as domestic efforts to develop industrial information grow, electronic documents have also proliferated. Electronic documents and texts written by Chinese beginners often contain hidden grammatical errors, posing a significant challenge to traditional manual proofreading. Correcting these grammatical errors is crucial to ensure fluency and readability. However, certain types of grammatical or logical errors can have a huge impact, and manually proofreading a large number of texts individually is clearly impractical. Consequently, research on text error correction techniques has garnered significant attention in recent years. The advent and advancement of deep learning have paved the way for sequence-to-sequence learning methods to be extensively applied to the task of text error correction. This paper presents a comprehensive analysis of Chinese text grammar error correction technology, elaborates on its current research status, discusses existing problems, proposes preliminary solutions, and conducts experiments using judicial documents as an example. The aim is to provide a feasible research approach for Chinese text error correction technology.
Abstract: Word segmentation is an integral step in many knowledge discovery applications. However, existing word segmentation methods have problems when applied to Chinese judicial documents: (1) existing methods rely on large-scale labeled data, which is typically unavailable for judicial documents, and (2) judicial documents have their own language features and writing formats. In this paper, a word segmentation method is proposed for Chinese judicial documents. The proposed method consists of two steps: (1) automatically generating some labeled data as legal dictionaries, and (2) applying a hybrid multilayer neural network to perform word segmentation that incorporates the legal dictionaries. Experiments conducted on a dataset of Chinese judicial documents show that the proposed model can achieve better results than the existing methods.
Abstract: In Chinese language studies, both “The Textual Research on Historical Documents” and “The Comparative Study of Historical Data” are traditional methodologies, and both deserve to be treasured, passed on, and further developed. It will certainly harm the development of academic research if either of the two methods is given unreasonable priority. The author claims that the best, or one of the best, methodologies for the historical study of the Chinese language is the combination of the two, hence a new interpretation of “The Double-proof Method”. Meanwhile, this essay is also an attempt to put forward “The Law of Quan-ma and Gui-mei” in Chinese language studies, under which the author holds that it is not advisable to treat Gui-mei as Quan-ma, or vice versa, in linguistic research. It is crucial always to respect the language facts first, which is considered the very soul of linguistics.
Funding: Supported by the Shanghai International Studies University (Grant No. 2011114061)
Abstract: Purpose: The thrust of this paper is to present a method for improving the accuracy of automatic indexing of Chinese-English mixed documents. Design/methodology/approach: Based on the inherent characteristics of Chinese-English mixed texts and cybernetics theory, we proposed an integrated control method for indexing documents. It consists of 'feed-forward control', 'in-progress control' and 'feed-back control', aiming at improving the accuracy of automatic indexing of Chinese-English mixed documents. An experiment was conducted to investigate the effect of our proposed method. Findings: This method distinguishes Chinese and English documents in grammatical structures and word formation rules. Applying this method in the three phases of automatic indexing for Chinese-English mixed documents gave encouraging results: precision increased from 88.54% to 97.10% and recall improved from 97.37% to 99.47%. Research limitations: The indexing method is relatively complicated and the whole indexing process requires substantial human intervention. Because pattern matching is based on a brute-force (BF) approach, indexing efficiency is reduced to some extent. Practical implications: The research is of both theoretical significance and practical value in improving the accuracy of automatic indexing of multilingual documents (not confined to Chinese-English mixed documents). The proposed method will benefit not only the indexing of life science documents but also the indexing of documents in other subject areas. Originality/value: So far, few studies have been published on methods for increasing the accuracy of multilingual automatic indexing. This study will provide insights into the automatic indexing of multilingual documents, especially Chinese-English mixed documents.
Abstract: The reasoning of judgment documents is the touchstone of justice. Attaching importance to the reasoning of judgment documents is essentially the embodiment of judicial civilization. In order to promote the reform of judgment document reasoning and raise its level, technology for the automated evaluation of judgment document reasoning must be studied. Building an evidence chain relational model is the basis of, and the key to, this technology. An approach is proposed to build an evidence chain relational model based on Chinese judgment documents. Automated text preprocessing of Chinese judgment documents creates semi-structured XML documents and extracts the evidence set and the fact set. A key element extraction method is used to obtain the keywords of the evidence and facts. Calculating the degree of association yields the connection points of the evidence chain relational model. The evidence chain relational model can then be displayed in tabular and graphical form.
Abstract: In the recent informatization of Chinese courts, the huge number of law cases and judgment documents, which are digitally stored, has provided a good foundation for research on judicial big data and machine learning. In this situation, some tasks in Chinese courts can be automated or improved through machine learning research, such as similar document recommendation, workload evaluation based on the similarity of judgment documents, and prediction of possibly relevant statutes. To achieve these goals, and in view of the characteristics of Chinese judgment documents, we propose a topic-model-based approach to measure the text similarity of Chinese judgment documents, which is based on TF-IDF, Latent Dirichlet Allocation (LDA), Labeled Latent Dirichlet Allocation (LLDA) and other treatments. Taking the characteristics of Chinese judgment documents into account, we focus on the specific steps of the approach, the preprocessing of the corpus, the choice of training parameters, and the evaluation of the similarity measure results. In addition, by applying the approach to the prediction of possible statutes and using prediction accuracy as the evaluation metric, we designed experiments to demonstrate the soundness of the design decisions and the high performance of our approach on text similarity measurement. The experiments also reveal limitations of our approach that need to be addressed in future work.
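To make the TF-IDF component mentioned in this abstract concrete, the following is a minimal sketch of TF-IDF weighting followed by cosine similarity between documents. It is not the authors' pipeline: the LDA and LLDA stages are omitted, the toy documents are invented English tokens standing in for segmented judgment text, and the function names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (assumption): already-tokenized stand-ins for judgment documents.
docs = [
    ["theft", "property", "sentence"],
    ["theft", "property", "fine"],
    ["contract", "breach", "damages"],
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])    # documents sharing terms
sim_unrelated = cosine(vectors[0], vectors[2])  # documents sharing no terms
print(sim_related > sim_unrelated)
```

In this sketch a term that occurs in every document gets an IDF of zero and so contributes nothing to the similarity, which is the usual reason for applying TF-IDF before comparing documents.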
Abstract: Based on cognitive science, the EnergyCalculus in Chinese language segmentation was presented to eliminate segmentation ambiguity. The notion of “EnergyCost” was advanced to denote the extent of the understandability of a certain segmentation. The EnergyCost function was defined with Z notation. This approach is effective for all natural language segmentation.
Funding: National Natural Science Foundation of China (No. 71331008)
Abstract: Sentiment analysis is increasingly important in modern natural language processing, and sentiment classification is one of its most popular applications. The crucial part of sentiment classification is feature extraction. In this paper, two methods for feature extraction, feature selection and feature embedding, are compared. Then Word2Vec is used as an embedding method. In this experiment, Chinese documents are used as the corpus, and three methods are used to get the features of a document: average word vectors, Doc2Vec and weighted average word vectors. After that, these samples are fed to three machine learning algorithms to do the classification, and the support vector machine (SVM) has the best result. Finally, the parameters of random forest are analyzed.
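Two of the document features named in this abstract, average word vectors and weighted average word vectors, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the two-dimensional embeddings are invented toy values instead of trained Word2Vec vectors, the abstract does not say which weighting scheme was used (TF-IDF weights are one common choice), and the function names are hypothetical.

```python
def average_vector(tokens, embeddings):
    """Unweighted mean of the vectors of all tokens found in `embeddings`."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return []
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def weighted_average_vector(tokens, embeddings, weights):
    """Mean of token vectors, each scaled by a per-token weight (default 1.0)."""
    pairs = [(embeddings[t], weights.get(t, 1.0)) for t in tokens if t in embeddings]
    total = sum(w for _, w in pairs)
    if not pairs or total == 0:
        return []
    dim = len(pairs[0][0])
    return [sum(v[i] * w for v, w in pairs) / total for i in range(dim)]


# Toy embeddings and weights (assumptions, not trained values).
embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
weights = {"good": 3.0, "bad": 1.0}
tokens = ["good", "bad", "unknown"]  # "unknown" has no vector and is skipped

plain = average_vector(tokens, embeddings)
weighted = weighted_average_vector(tokens, embeddings, weights)
print(plain, weighted)
```

The resulting fixed-length vectors are what a classifier such as an SVM would take as input; the weighted variant lets more informative words dominate the document representation.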
Abstract: Objective: To explore the rules and characteristics of the adverse drug reactions (ADRs) of three Chinese patent medicines and three herbal formulas for the treatment of COVID-19, and to provide a reference for safe clinical medication. Methods: The cases and ADR reports of the three Chinese patent medicines and three herbal formulas in the PubMed, Web of Science, Springer Link, CNKI, Wanfang and VIP databases were retrieved from December 2019 to May 2021. Then we extracted and analyzed the effective information included in the literature. Results and Conclusion: According to the pre-developed retrieval plan, a total of 136 documents were obtained, and 6 documents finally met the inclusion criteria. A total of 553 patients used the three Chinese patent medicines and three herbal formulas, and there were 133 cases of adverse reactions. The adverse reactions of patients taking the three Chinese patent medicines and three herbal formulas can all be explained under the theory of traditional Chinese medicine, and the adverse reactions can be eliminated by adding or subtracting the flavor of the medicine or stopping the medicine.
Funding: This work is supported by the National Natural Science Foundation of China (Grant No. 71971196).
Abstract: Named entity recognition (NER) is essential in many natural language processing (NLP) tasks such as information extraction and document classification. A construction document usually contains critical named entities, and an effective NER method can provide a solid foundation for downstream applications to improve construction management efficiency. This study presents an NER method for Chinese construction documents based on conditional random field (CRF), including a corpus design pipeline and a CRF model. The corpus design pipeline identifies typical NER tasks in construction management, enables word-based tokenization, and controls the annotation consistency with a newly designed annotating specification. The CRF model engineers nine transformation features and seven classes of state features, covering the impacts of word position, part-of-speech (POS), and word/character states within the context. The F1-measure on a labeled construction data set is 87.9%. Furthermore, as more domain knowledge features are infused, the marginal performance improvement of including POS information will decrease, leading to a promising research direction of POS customization to improve NLP performance with limited data.
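As an illustration of the kinds of state features this abstract mentions (word position, POS, and word/character states within the context), the sketch below builds a feature dictionary for one token of a word-segmented, POS-tagged sentence. It does not reproduce the paper's nine transformation features or seven state-feature classes: the feature names, the one-word context window, the example sentence and its POS tags are all assumptions made for the example.

```python
def token_features(sentence, i):
    """Build a feature dict for token i of a list of (word, pos) pairs."""
    word, pos = sentence[i]
    features = {
        "word": word,
        "pos": pos,
        "position": i,                       # word position in the sentence
        "is_first": i == 0,
        "is_last": i == len(sentence) - 1,
        "first_char": word[:1],              # character-level state
        "last_char": word[-1:],
    }
    # Context features from the neighbouring words, when they exist.
    if i > 0:
        prev_word, prev_pos = sentence[i - 1]
        features["prev_word"] = prev_word
        features["prev_pos"] = prev_pos
    if i < len(sentence) - 1:
        next_word, next_pos = sentence[i + 1]
        features["next_word"] = next_word
        features["next_pos"] = next_pos
    return features


# Hypothetical segmented and POS-tagged sentence ("concrete pouring completed").
sentence = [("混凝土", "n"), ("浇筑", "v"), ("完成", "v")]
all_features = [token_features(sentence, i) for i in range(len(sentence))]
print(all_features[0])
```

Per-token dictionaries of this shape are a common input format for CRF toolkits, which pair them with label sequences during training; the transition (label-to-label) features are typically learned by the toolkit rather than written by hand.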
Funding: Jointly supported by the National Science Library of the Chinese Academy of Sciences and the Bureau of Development and Planning of the Chinese Academy of Sciences
Abstract: Purpose: This paper documents an exploration of an innovative approach to the sharing of documents and information among the members of the National Alliance of Academies of Sciences (NAAS) in China, based on the practice initiated by the National Science Library of the Chinese Academy of Sciences (NSLC). Design/methodology/approach: Through interviews and user surveys, we analyzed the general information demands of users from provincial academies of sciences (PASs) and the problems of their document and information service teams. Based on our findings, we designed targeted services to help Alliance members support their document resources, information services for science and technology (S&T) decisions, and their knowledge transfer achievements. Furthermore, we offered training courses for provincial service teams, researchers, and administrators to improve their information skills. These activities represent a new collaborative model for professional library consortia. Findings: To date, our service has been extended to all Alliance members, covering 19 provinces in China, and the NSLC service covers all aspects of knowledge services of Alliance members, from basic document delivery services to subject information analyses. Research limitations: Different PASs have different understandings of the role of document and information services in the process of scientific research. These differences limit information service sharing between the NSLC and the PASs and affect service performance. For the sake of convenience, the original survey was conducted in only three provinces, which may not fully reflect the information needs of users in each Alliance institution. In addition, quantitative and qualitative analyses have been limited by the coverage of the sample. Practical implications: Document and information sharing has not only taken advantage of the NSLC knowledge service system and cooperation model, but has also enhanced the range of services of the NAAS in China. Originality/value: Based on knowledge service enhancements, the NAAS in China has formed a new kind of library consortium, which has broken with the traditional library alliance model that was based mainly on the sharing of resources and services.