Abstract: The recent informatization of Chinese courts has produced a huge number of digitally stored law cases and judgment documents, which provides a good foundation for research on judicial big data and machine learning. In this setting, several court tasks can be automated or improved through machine learning, such as recommending similar documents, evaluating workload based on the similarity of judgment documents, and predicting possibly relevant statutes. To support these tasks, and in view of the characteristics of Chinese judgment documents, we propose a topic-model-based approach to measuring the text similarity of Chinese judgment documents, which is based on TF-IDF, Latent Dirichlet Allocation (LDA), Labeled Latent Dirichlet Allocation (LLDA), and other treatments. Taking the characteristics of Chinese judgment documents into account, we focus on the specific steps of the approach, the preprocessing of the corpus, the choice of training parameters, and the evaluation of the similarity measurement results. In addition, by applying the approach to the prediction of possible statutes and using prediction accuracy as the evaluation metric, we designed experiments to demonstrate the reasonableness of our design decisions and the high performance of our approach on text similarity measurement. The experiments also reveal limitations of our approach that need to be addressed in future work.
Funding: Project supported by the National Natural Science Foundation of China (No. 61602204)
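As a rough illustration of the TF-IDF ingredient named in the judgment-document abstract above, the sketch below compares documents by the cosine similarity of their TF-IDF vectors. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say how TF-IDF, LDA, and LLDA are combined, the function names `tfidf_vectors` and `cosine` are hypothetical, and the toy English tokens stand in for Chinese text that would first need word segmentation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (assumed data): two theft cases and one contract dispute.
docs = [
    "defendant theft property sentence".split(),
    "defendant theft vehicle sentence".split(),
    "contract breach damages payment".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # the two theft cases share three terms
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms
print(sim_01 > sim_02)
```

In a topic-model variant of this idea, the same cosine comparison could be applied to per-document topic distributions instead of TF-IDF vectors; which representation the paper actually compares is not stated in the abstract.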
Abstract: Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency information of words (i.e., the number of classes in which a word has occurred in the training data), which is significant for classification. To address this, we propose a method, namely the class frequency weight (CF-weight), to weight words by considering class frequency knowledge. The CF-weight is based on the intuition that a word with higher (lower) class frequency will be less (more) discriminative. In this study, the CF-weight is used to improve L-LDA and dependency-LDA. A number of experiments have been conducted on real-world multi-label datasets. Experimental results demonstrate that CF-weight-based algorithms are competitive with existing supervised topic models.
Funding: National Natural Science Foundation of China (No. 60873134)
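The class frequency defined in the CF-weight abstract above (the number of classes in which a word occurs in the training data) can be computed as in the following sketch. The counting step follows that definition, under the assumption that a word in a multi-label document counts toward every label of that document. The weighting function is only a hypothetical IDF-style form chosen to match the stated intuition that higher class frequency means lower weight; the abstract does not give the actual CF-weight formula, and all names and data here are illustrative.

```python
import math
from collections import defaultdict


def class_frequency(labeled_docs):
    """Map each word to the number of distinct classes it occurs in."""
    classes_by_word = defaultdict(set)
    for tokens, labels in labeled_docs:
        for word in set(tokens):
            # Assumption: a word counts toward every label of its document.
            classes_by_word[word].update(labels)
    return {word: len(classes) for word, classes in classes_by_word.items()}


def illustrative_weight(cf, n_classes):
    """Hypothetical IDF-style weight that decreases as class frequency grows."""
    return {word: math.log(n_classes / freq) + 1.0 for word, freq in cf.items()}


# Toy multi-label training set (assumed data): (tokens, set of labels).
labeled_docs = [
    (["market", "stock", "report"], {"finance"}),
    (["match", "goal", "report"], {"sports"}),
    (["election", "vote", "report"], {"politics"}),
    (["stock", "vote"], {"finance", "politics"}),
]
cf = class_frequency(labeled_docs)
weights = illustrative_weight(cf, n_classes=3)
# "report" occurs in all three classes, so it receives the smallest weight.
print(cf["report"], cf["stock"], cf["market"])
```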
Abstract: Automatic thread labeling for news events can help people understand different aspects of a news event. In this paper, we present a method to label the threads of a news event. We use the latent Dirichlet allocation (LDA) topic model to extract news threads from a news corpus. Our method first selects a subset of thread words and then extracts phrases based on co-occurrence calculation. The extracted phrase is then used as the label of a news thread. Experimental results show that about 60% of the generated labels reflect meaningful aspects of a news event. These labels can help people quickly grasp many different aspects of a news event.
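The two steps named in the thread-labeling abstract above (select a subset of thread words, then extract a phrase by co-occurrence) might be sketched as follows. This is a simplified illustration under assumptions: the thread words are taken as given rather than read from a trained LDA model, co-occurrence is reduced to counting adjacent word pairs, and the function name `label_thread` and the toy corpus are hypothetical. The paper's actual co-occurrence calculation is not described in the abstract.

```python
from collections import Counter


def label_thread(thread_words, documents):
    """Pick the most frequent adjacent pair of thread words as the label."""
    thread_words = set(thread_words)
    pair_counts = Counter()
    for tokens in documents:
        # Count adjacent pairs in which both words belong to the thread.
        for left, right in zip(tokens, tokens[1:]):
            if left in thread_words and right in thread_words:
                pair_counts[(left, right)] += 1
    if not pair_counts:
        return None
    (left, right), _ = pair_counts.most_common(1)[0]
    return f"{left} {right}"


# Toy news corpus (assumed data), already tokenized.
documents = [
    "rescue team reached the flood zone".split(),
    "the rescue team saved residents".split(),
    "flood zone residents wait for the rescue team".split(),
]
# Assumed top words of one thread, e.g. taken from an LDA topic.
thread_words = ["rescue", "team", "flood", "zone", "residents"]
label = label_thread(thread_words, documents)
print(label)  # "rescue team" occurs three times, "flood zone" twice
```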
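Abstract: [Objective] To summarize and review methods for the automatic semantic labeling of topics produced by topic models, in order to promote the development and application of topic models. [Coverage] We searched the Web of Science and CNKI databases with queries such as "Topic Labeling OR Topic Labelling OR Topic Tagging OR Topic Indexing" and "主题模型 AND (标注 OR 标签)" ("topic model AND (annotation OR label)"), and obtained 57 representative papers through manual screening. [Methods] We read and analyzed the relevant papers in depth and, using the source from which topic labels are generated as the organizing thread, classified and compared the existing methods. [Results] Automatic semantic labeling of topics for topic models consists of two main steps, candidate label generation and candidate label ranking. According to the source of the candidate labels, the methods fall into two categories: those that rely on the corpus itself and those that rely on an external corpus. [Limitations] Research in this field is not yet abundant, and the analysis and review are not sufficiently systematic or comprehensive. [Conclusions] The field still leaves considerable room for exploration. Topic semantic labeling for social media content is a direction for future research, and methods can be improved by incorporating richer knowledge bases and adopting deep learning techniques.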