Abstract: In this paper a new feature called crosscount for document analysis is introduced. The crosscount feature is a function of white line segments whose starting points lie on the edge of document images. It reflects not only the contour of the image, but also the periodicity of white lines (background) and text lines in the document images. In complex printed-page layouts, there are different blocks such as textual, graphical, and tabular blocks. Of these blocks, textual ones have the most obvious periodicity, with their homogeneous white lines arranged regularly. This important property of textual blocks can be extracted by crosscount functions. Here the document layouts are classified into three classes on the basis of their physical structures. Then the definition and properties of the crosscount function are described. According to this classification of document layouts, the application of the new feature to the analysis and understanding of different types of document images is discussed.
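One simplified reading of the crosscount idea, as a minimal sketch: for each scan line, measure the white run that starts at the image edge, so blank background lines and text lines yield periodically alternating values. The function name and the exact definition are assumptions for illustration; the paper's precise formulation may differ.

```python
import numpy as np

def crosscount_profile(binary_image):
    """For each row, length of the white run starting at the left edge.

    binary_image: 2D array, 1 = white (background), 0 = black (ink).
    A hypothetical simplification of the crosscount feature: blank lines
    produce long runs, text lines short ones, so textual blocks show a
    regular alternation down the profile.
    """
    profile = []
    for row in binary_image:
        run = 0
        for pixel in row:
            if pixel == 1:
                run += 1
            else:
                break
        profile.append(run)
    return np.array(profile)

page = np.array([
    [1, 1, 1, 1, 1, 1],   # blank background line
    [1, 1, 0, 0, 1, 1],   # text line: run stops at the first black pixel
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
])
print(crosscount_profile(page))  # [6 2 6 1]
```

The alternation between long and short runs in the printed profile is the periodicity the abstract attributes to textual blocks.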
Abstract: Neural attention-based encoders, which effectively attend sentence tokens to their associated context without being restricted by long-term distance or dependency, have demonstrated outstanding performance in embedding sentences into meaningful representations (embeddings). The Universal Sentence Encoder (USE) is one of the most well-recognized deep neural network (DNN) based solutions; it is built on an attention-driven transformer architecture and has been pre-trained on a large number of sentences from the Internet. Although USE has been widely used in many downstream applications, including information retrieval (IR), interpreting its complicated internal working mechanism remains challenging. In this work, we present a visual analytics solution to address this challenge. Specifically, focusing on the semantics and syntactics (concepts and relations) that are critical to clinical domain IR, we designed and developed a visual analytics system, USEVis. The system investigates the power of USE in effectively extracting sentences' semantics and syntactics by exploring and interpreting how linguistic properties are captured by attentions. Furthermore, by thoroughly examining and comparing the inherent patterns of these attentions, we are able to exploit attentions to retrieve sentences/documents that have similar semantics or are closely related to a given clinical problem in IR. By collaborating with domain experts, we demonstrate use cases with inspiring findings to validate the contribution of our work and the effectiveness of our system.
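The retrieval step described above can be sketched in miniature: given sentence embeddings (in the paper these come from USE; here we use tiny made-up vectors), rank documents by cosine similarity to a query embedding. The function name and the toy data are assumptions for illustration.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=2):
    """Rank document embeddings by cosine similarity to the query.

    Embeddings are assumed to come from a sentence encoder such as USE;
    the 2-dimensional vectors below are purely illustrative.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]     # indices of the top-k documents
    return order, scores[order]

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
idx, scores = retrieve(query, docs)
print(idx)  # closest documents first: [0 1]
```

USEVis goes further than this plain embedding comparison by comparing the internal attention patterns themselves, but the ranking mechanics are the same.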
Funding: supported by the Foundation of the State Key Laboratory of Software Development Environment (No. SKLSDE-2015ZX-04)
Abstract: Long-document semantic measurement has great significance in many applications such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts. Document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transitions. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements such as the research purpose, methodology, and domain are included and enriched. As such, we can obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured by word vectors. To improve the semantic representation quality of word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that, in the measurement of document semantic similarity, our approach achieves substantial improvement over state-of-the-art methods, and our joint word-embedding model produces significantly better word representations than traditional word-embedding models.
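The profile-to-profile comparison can be sketched as follows, under stated assumptions: each profile is a set of concept words, each concept has a word vector (in the paper, trained by the joint word-embedding model; here, toy 2-d vectors), and one simple matching scheme scores each concept in profile A against its closest concept in B and averages. The matching scheme and all data are illustrative, not the paper's exact algorithm.

```python
import numpy as np

# Toy word vectors standing in for trained embeddings.
vectors = {
    "classification": np.array([1.0, 0.1]),
    "categorization": np.array([0.9, 0.2]),
    "svm":            np.array([0.2, 1.0]),
    "kernel":         np.array([0.1, 0.9]),
}

def cos(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def profile_similarity(profile_a, profile_b):
    """Match each concept in profile A to its closest concept in B,
    then average the scores -- one simple way to compare profiles."""
    best = [max(cos(vectors[a], vectors[b]) for b in profile_b)
            for a in profile_a]
    return sum(best) / len(best)

paper1 = ["classification", "svm"]     # hypothetical profile concepts
paper2 = ["categorization", "kernel"]
print(round(profile_similarity(paper1, paper2), 3))
```

Because "classification"/"categorization" and "svm"/"kernel" point in similar directions, the two profiles score as highly similar even though they share no literal words, which is the point of measuring concept distance with embeddings.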
Abstract: Rule selection has long been a challenging problem that must be solved when developing a rule-based knowledge learning system. Many methods have been proposed to evaluate the eligibility of a single rule based on some criteria. However, in a knowledge learning system there is usually a set of rules. These rules are not independent, but interactive: they tend to affect each other and form a rule system. In such a case, it is no longer reasonable to isolate each rule from the others for evaluation. The best rule according to a certain criterion is not always the best one for the whole system. Furthermore, the real-world data from which people want to create their learning systems are often ill-defined and inconsistent. In this case, the completeness and consistency criteria for rule selection are no longer essential. In this paper, some ideas about how to solve the rule-selection problem in a systematic way are proposed. These ideas have been applied in the design of a Chinese business card layout analysis system and achieved good results on a training data set of 425 images. The implementation of the system and the results are presented in this paper.
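The core observation that the best individual rules need not form the best rule set can be made concrete with a toy example: two strong rules that overlap heavily can cover fewer samples together than one strong rule plus a weaker but complementary one. The rules, samples, and the coverage criterion below are all hypothetical illustrations, not the paper's actual evaluation method.

```python
# Samples each rule classifies correctly (hypothetical).
rules = {
    "r1": {1, 2, 3, 4},   # strongest individual rule
    "r2": {1, 2, 3},      # strong alone, but overlaps r1 heavily
    "r3": {5, 6},         # weak alone, but complementary to r1
}

def set_score(selected):
    """System-level criterion: total samples covered by the whole set."""
    covered = set()
    for r in selected:
        covered |= rules[r]
    return len(covered)

# Greedily taking the two best individual rules loses to a set chosen
# with the whole system in mind.
print(set_score(["r1", "r2"]), set_score(["r1", "r3"]))  # 4 6
```

Evaluating rules jointly, as here, is the "systematic" view the abstract argues for, in contrast to ranking each rule in isolation.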