Abstract
French is one of the working languages of the United Nations. Its complex grammar and inflection rules make term extraction methods such as N-gram unreliable, which in turn reduces the accuracy of French text mining. This paper proposes an efficient French term extraction method that automatically extracts a term set, comprising both words and phrases, from the French text under analysis and so builds the lexicon that French text mining requires. First, the word co-occurrence information in the text is compressed into an FP (Frequent Pattern) sequence tree, from which frequent word sequences are extracted rapidly; the termhood of each frequent word sequence is then calculated to obtain the term set. The FP sequence tree is a newly designed data structure that reduces the time complexity of word co-occurrence statistics to linear time. Experiments show that the proposed method reaches an accuracy of approximately 90% and a higher recall than existing French term extraction methods, and can thus effectively support French text mining applications.
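The idea of compressing word co-occurrence into a tree and reading frequent word sequences off it can be illustrated with a minimal sketch. This is not the authors' FP sequence tree, whose construction the abstract does not specify; it is an ordinary prefix tree over word sequences, and the names `build_sequence_trie`, `frequent_sequences`, `max_len`, and `min_support` are hypothetical. The termhood calculation is omitted because the abstract gives no formula for it.

```python
def build_sequence_trie(sentences, max_len=3):
    """Count every word sequence of length 1..max_len in a nested-dict prefix tree.

    Illustrative stand-in only: a plain prefix tree, not the paper's FP sequence tree.
    """
    root = {"count": 0, "children": {}}
    for sentence in sentences:
        words = sentence.lower().split()
        for start in range(len(words)):
            node = root
            # Insert the sequence starting at `start`, one word per tree level.
            for word in words[start:start + max_len]:
                node = node["children"].setdefault(
                    word, {"count": 0, "children": {}}
                )
                node["count"] += 1
    return root


def frequent_sequences(root, min_support=2):
    """Return {word tuple: count} for sequences seen at least min_support times."""
    result = {}
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        for word, child in node["children"].items():
            # A sequence is never more frequent than its prefix,
            # so an infrequent child lets us skip its whole subtree.
            if child["count"] >= min_support:
                seq = prefix + (word,)
                result[seq] = child["count"]
                stack.append((seq, child))
    return result


sentences = [
    "la pomme de terre est cuite",
    "il mange une pomme de terre",
    "la terre est ronde",
]
freq = frequent_sequences(build_sequence_trie(sentences), min_support=2)
```

In this toy corpus the phrase `("pomme", "de", "terre")` survives the support threshold, while one-off sequences are pruned together with everything below them in the tree. A termhood score would then be applied to the surviving candidates.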
Authors
YU Juan (于娟)
WU Xiao-peng (吴晓鹏)
LIAO Xiao (廖晓)
LIU Jian-guo (刘建国)
YU Juan; WU Xiao-peng; LIAO Xiao; LIU Jian-guo (School of Economics and Management, Fuzhou University, Fuzhou 350108; School of Internet Finance and Information Engineering, Guangdong University of Finance, Guangzhou 510521; Institute of Finance and Accounting, Shanghai University of Finance and Economics, Yangpu, Shanghai 200433)
Source
Journal of University of Electronic Science and Technology of China (《电子科技大学学报》)
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2021, No. 1, pp. 84-90 (7 pages)
Funding
National Natural Science Foundation of China (71771054).