Funding: This work was supported by the National Natural Science Foundation of China (grant no. 61602515).
Abstract: In the field of information security, a gap exists in the study of coreference resolution of entities. A hybrid method is proposed to solve the problem of coreference resolution in information security. The work consists of two parts: the first extracts all candidates (including noun phrases, pronouns, entities, and nested phrases) from a given document and classifies them; the second is coreference resolution of the selected candidates. In the first part, a method combining rules with a deep learning model (Dictionary BiLSTM-Attention-CRF, or DBAC) is proposed to extract all candidates in the text and classify them. In the DBAC model, a domain dictionary matching mechanism is introduced, and new features of words and their contexts are obtained according to the domain dictionary. In this way, full use can be made of the entities and entity-type information contained in the domain dictionary, which can help solve the recognition problem of both rare and long entities. In the second part, candidates are divided into pronoun candidates and noun phrase candidates according to their part of speech; coreference of pronoun candidates is resolved with hand-written rules, and coreference of noun phrase candidates with machine learning. Finally, a dataset built from information security data is created to evaluate the methods. The experimental results show that the proposed model performs better than the baseline models.
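The dictionary matching idea in the abstract above can be illustrated with a small sketch. It is not the DBAC implementation: the greedy longest-match strategy, the BIO-style feature names, the function `dictionary_features`, and the toy dictionary are all assumptions made for illustration. The sketch only shows how entity and entity-type information from a domain dictionary could be turned into per-token features that a sequence model might consume.

```python
from typing import Dict, List, Tuple


def dictionary_features(tokens: List[str],
                        domain_dict: Dict[Tuple[str, ...], str]) -> List[str]:
    """Tag each token with a BIO-style dictionary feature (greedy longest match).

    `domain_dict` maps lower-cased token tuples to an entity type.
    Both the format and the matching strategy are illustrative assumptions.
    """
    max_len = max((len(key) for key in domain_dict), default=0)
    features = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so long entities win over their sub-phrases.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + span])
            if key in domain_dict:
                entity_type = domain_dict[key]
                features[i] = f"B-{entity_type}"
                for j in range(i + 1, i + span):
                    features[j] = f"I-{entity_type}"
                i += span
                matched = True
                break
        if not matched:
            i += 1
    return features


# Hypothetical toy dictionary and sentence, not taken from the paper's data.
toy_dict = {("sql", "injection"): "ATTACK", ("wannacry",): "MALWARE"}
toy_tokens = ["WannaCry", "is", "unrelated", "to", "SQL", "injection", "."]
toy_features = dictionary_features(toy_tokens, toy_dict)
print(list(zip(toy_tokens, toy_features)))
```

Matching the longest span first is one simple way to favour long entities, which the abstract names as a recognition problem; how DBAC actually combines these features with BiLSTM, attention, and CRF layers is not specified here.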
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61836007 and 61772354.
Abstract: Due to the small size of the annotated corpora and the sparsity of event trigger words, an event coreference resolver cannot capture enough event semantics, especially trigger semantics, to identify coreferential event mentions. To address these issues, this paper proposes a trigger semantics augmentation mechanism to boost event coreference resolution. First, the mechanism applies a trigger-oriented masking strategy to pre-train a BERT (Bidirectional Encoder Representations from Transformers)-based encoder (Trigger-BERT), which is fine-tuned on the large-scale unlabeled dataset Gigaword. Second, it combines the event semantic relations from the Trigger-BERT encoder with the event interactions from a soft-attention mechanism to resolve event coreference. Experimental results on both the KBP2016 and KBP2017 datasets show that the proposed model outperforms several state-of-the-art baselines.
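A trigger-oriented masking strategy can be pictured as masked-language-model masking that favours trigger positions. The sketch below is a guess at the general shape only: the abstract does not give masking rates or explain how triggers are located, so `p_trigger`, `p_other`, the `[MASK]` string, and the function name are assumptions.

```python
import random
from typing import List, Optional, Set, Tuple

MASK = "[MASK]"  # placeholder mask token, as used by BERT-style vocabularies


def trigger_oriented_mask(tokens: List[str],
                          trigger_positions: Set[int],
                          p_trigger: float = 0.8,
                          p_other: float = 0.05,
                          seed: Optional[int] = None
                          ) -> Tuple[List[str], List[Optional[str]]]:
    """Mask trigger tokens with a higher probability than other tokens.

    Returns the masked sequence and, per position, the original token to
    predict (None where nothing was masked). The rates are assumed values.
    """
    rng = random.Random(seed)
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for idx, tok in enumerate(tokens):
        prob = p_trigger if idx in trigger_positions else p_other
        if rng.random() < prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


# Hypothetical sentence; position 2 ("attacked") is treated as the event trigger.
sentence = ["Rebels", "reportedly", "attacked", "the", "convoy"]
masked_tokens, mlm_targets = trigger_oriented_mask(
    sentence, trigger_positions={2}, p_trigger=1.0, p_other=0.0, seed=0)
print(masked_tokens)
```

With the extreme rates used in the usage example, only the trigger is masked, so the encoder would have to recover the trigger from its context.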
Funding: Supported by the National Natural Science Foundation of China (No. 61872038) and the Fundamental Research Funds for the Central Universities (No. FRF-GF-19-020B).
Abstract: A tool for the manual annotation of cross-document entity and event coreferences that helps annotators label mention coreference relations in text is essential for the annotation of coreference corpora. To the best of our knowledge, CROss-document Main Events and entities Recognition (CROMER) is the only open-source manual annotation tool available for cross-document entity and event coreferences. However, CROMER lacks multi-language support and extensibility. Moreover, to label cross-document mention coreference relations, CROMER requires the support of another intra-document coreference annotation tool known as Content Annotation Tool, which is now unavailable. To address these problems, we introduce Cross-Document Coreference Annotation Tool (CDCAT), a new multi-language open-source manual annotation tool for cross-document entity and event coreference, which can handle different input/output formats, preprocessing functions, languages, and annotation systems. Using this new tool, annotators can label a reference relation with only two mouse clicks. Best practice analyses reveal that annotators can reach an annotation speed of 0.025 coreference relations per second on a corpus with a coreference density of 0.076 coreference relations per word. As the first multi-language open-source cross-document entity and event coreference annotation tool, CDCAT can theoretically achieve higher annotation efficiency than CROMER.
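The two reported figures can be combined into more intuitive rates. The arithmetic below assumes the annotator works at a constant pace and that the density is uniform across the corpus; neither assumption is stated in the abstract.

```latex
\[
\frac{0.025\ \text{relations/s}}{0.076\ \text{relations/word}}
  \approx 0.33\ \text{words/s} \approx 19.7\ \text{words/min},
\qquad
0.025\ \text{relations/s} \times 3600\ \text{s/h} = 90\ \text{relations/h}.
\]
```

Under those assumptions, the reported speed corresponds to covering roughly 20 words of text per minute and labelling about 90 coreference relations per hour.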
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 60873150, 90920004 and 61003153.
Abstract: Knowledge of noun phrase anaphoricity might be profitably exploited in coreference resolution to bypass the resolution of non-anaphoric noun phrases. However, recent attempts to incorporate automatically acquired anaphoricity information into coreference resolution systems have surprisingly fallen far short of expectations. This paper proposes a global learning method that determines the anaphoricity of noun phrases via a label propagation algorithm to improve learning-based coreference resolution. To reduce the huge computational burden of the label propagation algorithm, we employ the weighted support vectors as the critical instances to represent all the anaphoricity-labeled NP instances in the training texts. In addition, two kinds of kernels, i.e., the feature-based RBF (Radial Basis Function) kernel and the convolution tree kernel with approximate matching, are explored to compute the anaphoricity similarity between two noun phrases. Experiments on the ACE2003 corpus demonstrate the effectiveness of our method in anaphoricity determination of noun phrases and its application in learning-based coreference resolution.
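As a rough illustration of label propagation with an RBF kernel, the sketch below spreads anaphoricity labels from a few labeled noun phrases to unlabeled ones over a fully connected similarity graph. It is a generic textbook-style variant with clamped labels, not the paper's method: the support-vector selection of critical instances and the convolution tree kernel are omitted, and the 2-D feature vectors are invented.

```python
import math
from typing import List, Optional, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Feature-based RBF similarity: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def label_propagation(points: List[Sequence[float]],
                      labels: List[Optional[int]],
                      gamma: float = 1.0,
                      iterations: int = 100) -> List[float]:
    """Return a score in [0, 1] per point (1 = anaphoric, 0 = non-anaphoric).

    `labels` holds 1, 0, or None (unlabeled). Labeled points are clamped to
    their label; unlabeled points repeatedly take the similarity-weighted
    average of all other points' scores.
    """
    n = len(points)
    weights = [[0.0 if i == j else rbf_kernel(points[i], points[j], gamma)
                for j in range(n)] for i in range(n)]
    scores = [0.5 if lab is None else float(lab) for lab in labels]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(float(labels[i]))  # clamp known labels
                continue
            total = sum(weights[i])
            updated.append(sum(w * s for w, s in zip(weights[i], scores)) / total)
        scores = updated
    return scores


# Invented feature vectors: two tight clusters, one labeled seed in each.
toy_points = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (2.8, 3.1)]
toy_labels = [1, None, 0, None]
toy_scores = label_propagation(toy_points, toy_labels)
print([round(s, 3) for s in toy_scores])
```

Each unlabeled point ends up with the label of the seed it is most similar to, which is the behaviour the propagation step is meant to provide; the full weight matrix is quadratic in the number of instances, which hints at why the abstract restricts the labeled side to a small set of critical instances.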
Funding: Supported in part by the National Natural Science Foundation of China under Grant Nos. 61003018 and 60973024, in part by the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No. 20100091120041, and in part by the IBM CRL UR Joint Project.
Abstract: An object on the Semantic Web is likely to be denoted with several URIs by different parties. Object coreferencing is a process to identify "equivalent" URIs of objects for achieving a better Data Web. In this paper, we propose a bootstrapping approach for object coreferencing on the Semantic Web. For an object URI, we first establish a kernel that consists of semantically equivalent URIs derived from same-as, (inverse) functional properties and (max-)cardinalities, and then extend the kernel with respect to the textual descriptions (e.g., labels and local names) of URIs. We also propose a trustworthiness-based method to rank the coreferent URIs in the kernel as well as a similarity-based method for ranking the URIs in the extension of the kernel. We implement the proposed approach, called ObjectCoref, on a large-scale dataset that contains 76 million URIs collected by the Falcons search engine until 2008. The evaluation on precision, relative recall and response time demonstrates the feasibility of our approach. Additionally, we apply the proposed approach to investigate the popularity of the URI alias phenomenon on the current Semantic Web.
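The kernel construction can be illustrated for the same-as case alone. Because same-as is symmetric and transitive, the URIs equivalent to a given URI form a connected component, which a union-find structure computes directly. The sketch below is not ObjectCoref's implementation: it ignores (inverse) functional properties and cardinalities, and the URIs and function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, item: str) -> str:
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a: str, b: str) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def same_as_kernel(uri: str, same_as_pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return every URI reachable from `uri` through same-as statements."""
    uf = UnionFind()
    for left, right in same_as_pairs:
        uf.union(left, right)
    root = uf.find(uri)
    return {candidate for candidate in list(uf.parent) if uf.find(candidate) == root}


# Hypothetical same-as statements between made-up URIs.
pairs = [
    ("http://example.org/a/Berlin", "http://example.org/b/Berlin"),
    ("http://example.org/b/Berlin", "http://example.org/c/city42"),
    ("http://example.org/a/Paris", "http://example.org/b/Paris"),
]
berlin_kernel = same_as_kernel("http://example.org/a/Berlin", pairs)
print(sorted(berlin_kernel))
```

A URI that appears in no statement yields a kernel containing only itself, which matches the intuition that the kernel always includes the queried URI.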
Abstract: We present a novel approach for extracting noun phrases in general and named entities in particular from a digital repository of text documents. The problem of coreference resolution is divided into two subproblems: pronoun resolution and non-pronominal resolution. A rule-based technique was used for pronoun resolution, while a learning approach was used for non-pronominal resolution. For named entity resolution, ambiguity arises mainly from polysemy and synonymy. The proposed approach addresses both problems with the help of WordNet and the Word Sense Disambiguation tool. The proposed approach, to our knowledge, outperforms several baseline techniques with a higher balanced F-measure, which is the harmonic mean of recall and precision. The improvements in system performance are due to the filtering of antecedents for the anaphor based on several linguistic disagreements, the use of a hybrid approach, and an extended feature vector that includes more linguistic details in the learning technique.
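For reference, the balanced F-measure mentioned above is the harmonic mean of precision $P$ and recall $R$. The numeric values in the example are illustrative only and are not results from the paper.

```latex
\[
F_1 = \frac{2PR}{P + R},
\qquad
\text{e.g. } P = 0.8,\ R = 0.6 \ \Rightarrow\
F_1 = \frac{2 \times 0.8 \times 0.6}{0.8 + 0.6} = \frac{0.96}{1.4} \approx 0.686 .
\]
```

Because the harmonic mean is pulled toward the smaller of the two values, a system cannot reach a high balanced F-measure by maximizing precision or recall alone.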
Funding: Funded by the European Research Council (ERC) Starting Grant 714868 awarded to Dr. Erdem Yörük for his project Emerging Welfare.
Abstract: We describe a gold standard corpus of protest events that comprises various local and international English-language sources from various countries. The corpus contains document-, sentence-, and token-level annotations. This corpus facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, and constructing knowledge bases that enable comparative social and political science studies. For each news source, the annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality that are necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish a common foundation in automated protest event collection studies, which is currently lacking in the literature.