Funding: Funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R113), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Language-independent text tokenization can aid in the classification of languages with few sources. There is a global research effort to generate text classification for any language. Human text classification is a slow procedure. Consequently, text summary generation for different languages using machine text classification has been considered in recent years. There is no research on machine text classification for many languages, such as Czech, Rome, and Urdu. This research proposes a cross-language text tokenization model using a Transformer technique. The proposed Transformer employs an encoder that has ten layers, each with self-attention encoding and a feedforward sublayer. This model improves the efficiency of text classification by providing a draft text classification for a number of documents. We also propose a novel Sub-Word tokenization model based on frequent vocabulary usage in the documents. The Sub-Word Byte-Pair Tokenization technique (SBPT) utilizes the sharing of the vocabulary of one sentence with other sentences. The Sub-Word tokenization model improves on other Sub-Word tokenization models, such as the pair encoding model, by +10% on the precision metric.
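The abstract does not spell out how SBPT works, so the following is only a minimal sketch of standard byte-pair encoding, the baseline family that sub-word tokenizers of this kind build on. It is not the authors' method. The function names (`learn_bpe`, `merge_pair`, `encode`), the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (plain BPE, not SBPT)."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically smallest pair
        # (an arbitrary choice made here only to keep the result deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Because the merges are learned from pair counts pooled over the whole corpus, sub-words that recur across many sentences are merged first. This is one plausible reading of the vocabulary sharing the abstract mentions, though the paper itself may define it differently.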
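Abstract: News data about public-opinion events is varied and complex. Even news data about a single public-opinion event often covers different subtopics (different aspects of the event). Generating labels that accurately describe the meaning of an event's subtopics is therefore important for in-depth analysis of public-opinion events, including identifying event hotspots and monitoring how an event develops. Generating event subtopic labels usually involves two key steps: first discover the subtopics, then generate an effective label for each subtopic from its keywords or document content. Traditional methods mostly discover topics by clustering or classification, grouping documents on the same topic into one cluster. However, because documents belonging to the same event are highly similar, existing methods have difficulty measuring the distances between them and therefore cannot be applied to the task of discovering event subtopics. In addition, traditional methods usually generate subtopic labels by extraction, and the accuracy of labels generated this way cannot be guaranteed. To address this, the paper proposes ET-TAG, a model that discovers the subtopics within an event using PLSA with Background Language combined with keyword clustering, and then generates event subtopic labels based on knowledge bases such as Wikipedia. Experimental results on datasets covering multiple types of public-opinion events show that ET-TAG performs better than existing subtopic discovery methods such as K-means and LDA. In terms of subtopic label generation, the labels generated by ET-TAG are also more accurate and summarize the subtopics better than those of traditional methods. Finally, the paper applies the subtopic labels generated by ET-TAG to event comparison and tracking, and the results show that subtopic labels can reveal commonalities between events and reflect trends in the popularity of event subtopics.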
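As a rough illustration of the subtopic discovery step, the sketch below runs EM for a PLSA model mixed with a fixed background word distribution. It is not the ET-TAG implementation. The function name `plsa_background`, the fixed background weight `lam`, the use of corpus-wide word frequencies as the background model, and the random initialisation are all assumptions. The keyword clustering and the Wikipedia-based label generation are not covered.

```python
import random
from collections import Counter


def plsa_background(docs, num_topics, lam=0.5, iters=30, seed=0):
    """EM for PLSA with a background model (illustrative sketch only).

    docs: list of non-empty token lists.
    lam:  assumed fixed probability that a word comes from the background.
    Returns (p_w_z, p_z_d): one word distribution (dict) per topic, and
    one topic distribution (list) per document.
    """
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    corpus = Counter(w for d in docs for w in d)
    vocab = sorted(corpus)
    n_tokens = sum(corpus.values())
    # Background model: corpus-wide relative frequency of each word.
    p_bg = {w: corpus[w] / n_tokens for w in vocab}

    def normalize(values):
        s = sum(values)
        return [v / s for v in values]

    # Random positive initialisation of p(w|z) and p(z|d).
    p_w_z = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in docs]

    for _ in range(iters):
        word_acc = [dict.fromkeys(vocab, 0.0) for _ in range(num_topics)]
        doc_acc = [[0.0] * num_topics for _ in docs]
        for d, cnt in enumerate(counts):
            for w, c in cnt.items():
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(num_topics)]
                mix = lam * p_bg[w] + (1 - lam) * sum(joint)
                for z in range(num_topics):
                    # E-step: expected count of (d, w) occurrences generated
                    # by topic z rather than by the background or another topic.
                    r = c * (1 - lam) * joint[z] / mix
                    word_acc[z][w] += r
                    doc_acc[d][z] += r
        # M-step: renormalise the expected counts into distributions.
        p_w_z = [dict(zip(vocab, normalize([acc[w] for w in vocab])))
                 for acc in word_acc]
        p_z_d = [normalize(row) for row in doc_acc]
    return p_w_z, p_z_d
```

The background component absorbs words that are common to every document about the event, so the remaining topics are pushed toward the words that distinguish one subtopic from another. That is the usual motivation for adding a background model when the documents are highly similar to each other.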