Funding: Supported by the National Natural Science Foundation of China (Nos. 61872339, 61502184 and 61925203).
Abstract: Regular expressions are widely used within and even outside of computer science due to their expressiveness and flexibility. However, regular expressions have a compact and rather tolerant syntax that makes them hard to understand, hard to compose, and error-prone. Faulty regular expressions may cause failures of the applications that use them. Therefore, ensuring the correctness of regular expressions is a vital prerequisite for their use in practical applications. The importance and necessity of ensuring correct definitions of regular expressions have attracted extensive attention from researchers and practitioners, especially in recent years. In this study, we provide a review of recent work on ensuring the correct usage of regular expressions. We classify this work into different categories, including empirical study, test string generation, automatic synthesis and learning, static checking and verification, visual representation and explanation, and repairing. For each category, we review the main results, compare different approaches, and discuss their advantages and disadvantages. We also discuss some potential future research directions.
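As a small illustration of why a compact, tolerant syntax is error-prone and how test strings can expose a fault, the sketch below compares a hypothetical faulty pattern for decimal numbers (an unescaped dot) with a corrected one. The patterns, the helper `failures`, and the test strings are assumptions made for illustration; they are not taken from the reviewed works.

```python
import re

# Hypothetical patterns: the dot in FAULTY is unescaped, so it matches any character.
FAULTY = re.compile(r"\d+.\d+")
FIXED = re.compile(r"\d+\.\d+")

def failures(pattern, accept, reject):
    """Return the test strings on which the pattern behaves unexpectedly."""
    # Strings that should match but do not.
    bad = [s for s in accept if not pattern.fullmatch(s)]
    # Strings that should be rejected but are matched.
    bad += [s for s in reject if pattern.fullmatch(s)]
    return bad

# Illustrative positive and negative test strings.
ACCEPT = ["3.14", "10.0"]
REJECT = ["3x14", "314", "3."]

if __name__ == "__main__":
    print("faulty:", failures(FAULTY, ACCEPT, REJECT))
    print("fixed: ", failures(FIXED, ACCEPT, REJECT))
```

The faulty pattern accepts both positive strings, so only negative test strings reveal the defect, which is one reason test string generation matters.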
Funding: Supported by the National High Technology Development 863 Program of China (No. 2008AA01Z117).
Abstract: Nowadays, using Deterministic Finite Automata (DFA) or Non-deterministic Finite Automata (NFA) to parse regular expressions is the most popular approach to Deep Packet Inspection (DPI), and research on DPI focuses on improving DFA to reduce memory use. However, most of the existing literature ignores a special kind of "overlap-matching expression", which causes state explosion and accounts for a large share of DPI rules. To solve this problem, this paper proposes a new bitmap-based mechanism. We start with a simple regular expression to describe "overlap-matching expressions" and state the problem. Then, after calculating the very large number of exploded states for this kind of expression, the procedure of the Bitmap-based Soft Parallel Mechanism (BSPM) is described. Based on BSPM, we discuss all the different types of "overlap-matching expressions" and give optimization suggestions for each of them. Finally, experimental results show that BSPM performs very well on the problem stated above, and that the optimization suggestions are also effective for memory reduction on all types of "overlap-matching expressions".
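The abstract does not define "overlap-matching expressions" precisely, so the sketch below uses a textbook stand-in: the pattern `.*a.{n}` over the alphabet {a, b}, assumed here only to show how DFA state counts can explode. It applies the standard subset construction to a small hand-built NFA and counts the reachable DFA states. It is not BSPM, and the function name `count_dfa_states` is hypothetical.

```python
def count_dfa_states(n, alphabet="ab"):
    """Count reachable DFA states for the pattern .*a.{n} via subset construction.

    NFA states: 0 loops on every symbol and moves to 1 on 'a';
    states 1..n advance on any symbol; state n + 1 is accepting.
    """
    def step(states, ch):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)          # self-loop for the leading .*
                if ch == "a":
                    nxt.add(1)      # guess that this 'a' starts the match
            elif s <= n:
                nxt.add(s + 1)      # consume one of the n wildcard symbols
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    stack = [start]
    while stack:
        cur = stack.pop()
        for ch in alphabet:
            nxt = step(cur, ch)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

if __name__ == "__main__":
    for n in range(6):
        print(n, count_dfa_states(n))
```

For this stand-in pattern the count doubles with every increment of n (2^(n+1) states), because the DFA must remember which of the last n + 1 symbols were 'a'.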
Funding: This research was supported by the National Natural Science Foundation of China under Grant (No. 42050102), the Postgraduate Education Reform Project of Jiangsu Province under Grant (No. SJCX22_0343), and the Dou Wanchun Expert Workstation of Yunnan Province (No. 202205AF150013).
Abstract: With the rapid development of information technology, the electronification of medical records has gradually become a trend. In China, the population base is huge and the supporting medical institutions are numerous, which drives the conversion of paper medical records to electronic medical records. Electronic medical records are the basis for establishing a smart hospital and an important guarantee for achieving medical intelligence, and the massive amount of electronic medical record data is also an important data set for conducting research in the medical field. However, electronic medical records contain a large amount of private patient information, which must be desensitized before the records are used as open resources. Therefore, to solve the above problems, data masking for Chinese electronic medical records with named entity recognition is proposed in this paper. Firstly, the text is vectorized to satisfy the required format of the model input. Secondly, since input sentences vary in length and the contextual relationship between sentences is not negligible, a neural network model for named entity recognition based on bidirectional long short-term memory (BiLSTM) with conditional random fields (CRF) is constructed. Finally, the data masking operation is performed based on the named entity recognition results, mainly using regular expression filtering encryption and principal component analysis (PCA) word vector compression and replacement. In addition, comparison experiments with the hidden Markov model (HMM), LSTM-CRF model, and BiLSTM model are conducted in this paper. The experimental results show that the method used in this paper achieves 92.72% accuracy, 92.30% recall, and a 92.51% F1 score, which is higher than the other models.
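To make the masking step concrete, here is a minimal sketch of regex-based masking applied after entity recognition. It assumes the NER stage yields character spans as `(start, end, label)` tuples and that phone numbers are 11 digits starting with 1; the function `mask`, the `PHONE` pattern, and the placeholder format are hypothetical and are not the paper's "filtering encryption" or PCA replacement.

```python
import re

# Assumption: a mobile number is 11 digits starting with 1, not embedded in a longer digit run.
PHONE = re.compile(r"(?<!\d)1\d{10}(?!\d)")

def mask(text, entities):
    """Replace recognized entity spans with labels, then mask phone numbers.

    entities: list of (start, end, label) character spans, assumed to come
    from an upstream NER model and to be non-overlapping.
    """
    # Replace from the rightmost span so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    # Keep the first 3 and last 4 digits of each phone number.
    return PHONE.sub(lambda m: m.group()[:3] + "****" + m.group()[-4:], text)

if __name__ == "__main__":
    record = "Patient Zhang San, phone 13812345678, admitted."
    start = record.index("Zhang San")
    print(mask(record, [(start, start + len("Zhang San"), "NAME")]))
```

Span replacement handles entities that only a model can find (names, hospitals), while the regex handles values with a fixed surface form.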
Abstract: Data governance is a subject that is becoming increasingly important in business and government. Good data governance improves interactions between employees of one or more organizations. Data quality is a great challenge because the cost of poor quality can be very high, so attention to data quality becomes an absolute necessity within an organization. To improve data quality in a Big Data source, our purpose in this paper is to add semantics to the data and help users recognize the Big Data schema. The originality of this approach lies in the semantic aspect it offers: it detects issues in the data and proposes a data schema by applying semantic data profiling.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61373035 and 61100049), the National High Technology Research and Development Program of China ("863" Program, No. 2013AA013204), the Fundamental Research Funds for the Central Universities (Nos. 3122014C018 and 3122015C022), and the Scientific Research Funds of Civil Aviation University of China (No. 09QD02X).
Abstract: A new concept of rare axis based on statistical facts is proposed, and an evaluation algorithm is designed accordingly. For nested regular expressions containing rare axes, the proposed algorithm can reduce their evaluation complexity from polynomial time to nearly linear time. A distributed technique is also employed to construct the navigation axis indexes for resource description framework (RDF) graph data. Experimental results on DrugBank and BioGRID show that this method can improve query efficiency significantly while ensuring accuracy, and can meet the query requirements of Web-scale RDF graph data.
Funding: Supported by the National Natural Science Foundation of China (No. 60073045).
Abstract: Independent XML storage based on XSD (XML Schema Document) is adopted in NXD (Native XML Database). An XML storage structure based on tree-structure decomposition and an algorithm for dynamically updating XML documents are provided in this paper. The main idea is that, in terms of the data model of an XML document, the document is parsed into a Document Structure-Tree with a hierarchical model and Leaf-Data with a relational model for storage. A proxy node is also introduced to solve the problem of XML data being stored across blocks. With XSD model information, a sparse index is constructed to save storage space. It is proved that this storage structure can improve the efficiency of XML document operations.
Abstract: Unified Modeling Language (UML) is a powerful graphical modeling language with intuitive meaning. It provides various diagrams to depict system characteristics and complex environments from different viewpoints and different application layers. UML-based software development and modeling environments have been widely accepted in industry, including areas in which safety is an important issue, such as spaceflight, defense, and automobiles. Ensuring and improving software quality has become a main concern in the field. As one of the key techniques for software quality, software testing can effectively detect system faults. UML-based software testing is an important research direction in software engineering. The key to software testing is the generation of test cases. This dissertation studies an approach to generating test cases from UML statecharts.
Abstract: The energy-momentum tensor, which is coordinate-independent, is used to calculate the energy, momentum and angular momentum of two different tetrad fields. Although the two tetrad fields reproduce the same space-time, their energies are different. Therefore, a regularized expression of the gravitational energy-momentum tensor of the teleparallel equivalent of general relativity (TEGR) is used to make the energies of the two tetrad fields equal. The definition of the gravitational energy-momentum is used to investigate the energy within the external event horizon. The components of angular momentum associated with these space-times are calculated. In spite of using a static space-time, we get a non-zero component of angular momentum. Therefore, we derive the Killing vectors associated with these space-times using the definition of the Lie derivative of a second-rank tensor in the framework of the TEGR to make the picture clearer.
Funding: Supported by the National Social Science Foundation of China (Grant No. 12CTQ032).
Abstract: Purpose: In order to annotate the semantic information and extract the research level information of research papers, we attempt to find a method for developing an information extraction system. Design/methodology/approach: A semantic dictionary and a conditional random field model (CRFM) were used to annotate the semantic information of research papers. Based on the annotation results, the research level information was extracted through regular expressions. All the functions were implemented on the Sybase platform. Findings: According to the result of our experiment in carbon nanotube research, the precision and recall rates reached 65.13% and 57.75%, respectively, after the semantic properties of word classes had been labeled, and the F-measure increased dramatically from less than 50% to 60.18% when semantic features were added. Our experiment also showed that the information extraction system for research level (IESRL) can extract performance indicators from research papers rapidly and effectively. Research limitations: Some text information, such as formatting and charts, might have been lost when converting from PDF to TXT files. Semantic labeling on sentences could be insufficient due to the rich meaning of lexicons in the semantic dictionary. Research implications: The established system can help researchers rapidly compare the level of different research papers and find their implicit innovation values. It could also be used as an auxiliary tool for analyzing the research levels of various research institutions. Originality/value: In this work, we have successfully established an information extraction system for research papers using a revised semantic annotation method based on CRFM and the semantic dictionary. Our system can analyze the information extraction problem at two levels, i.e., the sentence level and the noun (phrase) level of research papers. Compared with extraction methods based on knowledge engineering and those based on machine learning, our system shows the advantages of both.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61232007 and 91118004) and the Innovation Program of Shanghai Municipal Education Commission (No. 13ZZ023).
Abstract: In this paper, a novel approach for service substitutions based on the service type, in terms of its interface type and behavior semantics, is proposed. In order to analyze and verify behavior-consistent service substitutions in dynamic environments, we first present a formal language to describe services from a control-flow perspective, then introduce a type and effect system to infer conservative approximations of all possible behaviors of these services. The service behaviors are represented by concurrent behavior expressions (CBEs). Built upon the interpretation of CBEs, behavior-consistent service substitutions are defined and analyzed by subtyping. The correctness of the analysis approach is guaranteed by a type safety theorem, which is mechanically proved in the Coq proof assistant. Finally, applications in web services show that our method is effective and feasible.
Abstract: In this paper, a new method, named L-tree match, is presented for extracting data from complex data sources. Firstly, based on the data extraction logic presented in this work, a new data extraction model is constructed in which model components are structurally correlated via a generalized template. Secondly, a database-populating mechanism is built, along with some object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Thirdly, top-down and bottom-up strategies are combined to design a new extraction algorithm that can extract data from data sources with optional, unordered, nested, and/or noisy components. Lastly, this method is applied to extract accurate data from biological documents amounting to 100 GB for the first online integrated biological data warehouse of China.
Abstract: An extent join for computing path expressions containing parent-children and ancestor-descendant operations, and two path expression optimization rules, path-shortening and path-complementing, are presented in this paper. Path-shortening reduces the number of joins by shortening the path, while path-complementing optimizes path execution by using an equivalent complementary path expression to compute the original one. Experimental results show that the proposed algorithms are more efficient than traditional algorithms.
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 2007AA01Z468).
Abstract: Modern datacenter and enterprise networks require application identification to enable granular traffic control that either improves data transfer rates or ensures network security. Providing application visibility as a core network function is challenging due to its performance requirements, including high throughput, low memory usage, and high identification accuracy. This paper presents a payload-based application identification method using a signature matching engine that exploits characteristics of application identification. The solution uses two-stage matching and pre-classification to simultaneously improve throughput and reduce memory use. Compared to a state-of-the-art general-purpose regular expression engine, this matching engine reduces memory use by 38% and triples the throughput. In addition, the solution is orthogonal to most existing optimization techniques for regular expression matching, which means it can be leveraged to further increase the performance of other matching algorithms.
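The abstract does not spell out how its two stages or pre-classification work, so the following is only a generic sketch of the idea of two-stage signature matching: a cheap literal check filters candidates before the full regular expression runs. The signature table, the application names, and the function `identify` are hypothetical and do not reproduce the paper's engine.

```python
import re

# Hypothetical signature table: (required literal, full pattern) per application.
SIGNATURES = {
    "http": (b"HTTP/", re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]\r\n")),
    "ssh": (b"SSH-", re.compile(rb"^SSH-\d\.\d+-\S+")),
}

def identify(payload):
    """Return the first application whose signature matches, or None."""
    for app, (literal, pattern) in SIGNATURES.items():
        # Stage 1: fast substring test discards most non-matching payloads.
        if literal not in payload:
            continue
        # Stage 2: run the more expensive regex only on surviving candidates.
        if pattern.search(payload):
            return app
    return None

if __name__ == "__main__":
    print(identify(b"GET /index.html HTTP/1.1\r\nHost: example\r\n"))
    print(identify(b"SSH-2.0-OpenSSH_8.9\r\n"))
    print(identify(b"\x00\x01opaque"))
```

The design choice is that the regex cost is paid only when the literal is present, which is where throughput gains in prefiltering schemes of this kind typically come from.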