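Abstract: Current word-vector-based multi-document summarization methods ignore the order of words within a sentence, so different sentences can be mapped to the same vector, and the summaries they generate from small-scale training data are highly redundant. To address these problems, a multi-document summarization method based on the PV-DM (Distributed Memory Model of Paragraph Vectors) model is proposed. The method first constructs a monotone submodular objective function. It then trains a PV-DM model to obtain sentence vectors, computes the semantic similarity between sentences from those vectors, and uses the similarities to evaluate the monotone submodular objective. Finally, an optimization algorithm extracts sentences to generate the summary. Experimental results on the standard Opinosis dataset show that the method outperforms current mainstream multi-document summarization methods.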
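The pipeline described in this abstract (sentence vectors, pairwise similarity, a monotone submodular objective, then an optimizer that picks sentences) can be sketched roughly as follows. The abstract does not state which objective or optimizer the authors use, so this sketch assumes a facility-location coverage objective with a standard greedy selection, and it uses hand-written toy vectors in place of trained PV-DM sentence vectors. All function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coverage(selected, sim):
    """Facility-location objective (an assumption, not the paper's stated objective).

    Each sentence i is credited with its best similarity to any selected
    sentence. With non-negative similarities this is monotone and submodular.
    """
    return sum(
        max((sim[i][j] for j in selected), default=0.0)
        for i in range(len(sim))
    )


def greedy_summary(vectors, budget):
    """Greedily pick up to `budget` sentence indices that maximize coverage."""
    n = len(vectors)
    # Clamp negative cosines to 0 so the objective stays monotone.
    sim = [
        [max(0.0, cosine(vectors[i], vectors[j])) for j in range(n)]
        for i in range(n)
    ]
    selected = []
    while len(selected) < min(budget, n):
        base = coverage(selected, sim)
        best, best_gain = None, 0.0
        for candidate in range(n):
            if candidate in selected:
                continue
            gain = coverage(selected + [candidate], sim) - base
            if gain > best_gain:
                best, best_gain = candidate, gain
        if best is None:  # no remaining sentence adds any coverage
            break
        selected.append(best)
    return selected


# Toy stand-ins for PV-DM sentence vectors: two near-duplicate pairs.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
summary = greedy_summary(toy_vectors, budget=2)
print(summary)
```

With a budget of two, the greedy step takes one sentence from each near-duplicate pair, because a second sentence from an already covered pair adds almost no marginal gain. That diminishing-returns behavior is what keeps redundancy low in this kind of objective.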
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 71171187, 71371107, and 61473284.
Abstract: The risk classification of BBS posts is important for evaluating the societal risk level within a given period. Using posts collected from the Tianya forum as the data source, the authors adopt societal risk indicators from social psychology and conduct document-level, multi-class societal risk classification of BBS posts. To capture both the semantics and the word order of documents, a shallow neural network, Paragraph Vector, is applied to obtain distributed vector representations of the posts. Based on these document vectors, the authors apply the KNN classification method to identify the societal risk category of each post. The experimental results show that, for document-level societal risk classification, Paragraph Vector trains much faster than Bag-of-Words and improves F-measure by at least 10%. Furthermore, Paragraph Vector also outperforms edit distance and a Lucene-based search method. The present work is the first attempt to combine a document embedding method with social psychology research results in the area of public opinion.
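The classification step described in this abstract (KNN over document vectors) can be illustrated with a minimal sketch. It assumes cosine similarity as the nearness measure and majority voting among the k nearest posts, neither of which the abstract specifies. The vectors and the labels `risk_a` and `risk_b` are hypothetical placeholders for Paragraph Vector outputs and the paper's societal risk categories.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, labeled_vectors, k=3):
    """Return the majority label among the k labeled vectors most similar to `query`."""
    ranked = sorted(
        labeled_vectors,
        key=lambda item: cosine(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled document vectors (stand-ins for trained post embeddings).
training_posts = [
    ([1.0, 0.0], "risk_a"),
    ([0.8, 0.2], "risk_a"),
    ([0.9, 0.3], "risk_a"),
    ([0.0, 1.0], "risk_b"),
    ([0.1, 0.9], "risk_b"),
    ([0.2, 0.8], "risk_b"),
]

print(knn_predict([0.95, 0.05], training_posts, k=3))
print(knn_predict([0.05, 0.95], training_posts, k=3))
```

An odd k with two classes avoids tied votes in this toy setting. With more categories, a tie-breaking rule such as similarity-weighted voting would be needed.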