Abstract: This paper proposes a new method for the semi-automatic extraction of semantic structures from unlabelled corpora in specific domains. The approach is statistical in nature. The extracted structures can be used for shallow parsing and semantic labeling. By iteratively extracting new words and clustering words, we obtain an initial semantic lexicon that groups words with the same meaning together as a class. After that, a bootstrapping algorithm is adopted to extract semantic structures. Then the semantic structures are used to extract new ...
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61672163.
Abstract: A log is a text message generated by various services, frameworks, and programs. Most log data mining tasks rely on log parsing as the first step, which transforms raw logs into formatted log templates. Existing log parsing approaches often fail to handle the trade-off between parsing quality and performance effectively. In view of this, we present Multi-Layer Parser (ML-Parser), an online log parser that runs in a streaming manner. Specifically, we use a multi-layer structure in log parsing to strike a balance between efficiency and effectiveness. Coarse-grained tokenization and a fast similarity measure are applied for efficiency, while fine-grained tokenization and an accurate similarity measure are used for effectiveness. In experiments, we compare ML-Parser with two existing online log parsing approaches, Drain and Spell, on ten real-world datasets, five labeled and five unlabeled. On the five labeled datasets, we use the proportion of correctly parsed logs to measure accuracy, and ML-Parser achieves the highest accuracy on four of them. On all ten datasets, we use the Loss metric to measure parsing quality. ML-Parser achieves the highest quality on seven of the ten datasets while maintaining relatively high efficiency.
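The coarse-then-fine idea in the abstract can be illustrated with a small sketch. This is not ML-Parser's actual algorithm, which the abstract does not specify. It is a hypothetical two-layer online parser written under stated assumptions: the coarse layer buckets a log by its whitespace token count and first token (a cheap filter), and the fine layer splits on punctuation as well and compares token positions against the stored templates. The class name `TwoLayerParser`, the `<*>` wildcard, and the 0.6 threshold are all illustrative choices.

```python
import re
from collections import defaultdict

WILDCARD = "<*>"  # assumed placeholder for variable parts of a template


def coarse_tokens(line):
    # Coarse-grained tokenization: whitespace only (cheap).
    return line.split()


def fine_tokens(line):
    # Fine-grained tokenization: alphanumeric runs and single punctuation marks.
    return re.findall(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]", line)


def similarity(template, tokens):
    # Fraction of positions where the template matches the log tokens.
    if not tokens:
        return 1.0
    same = sum(1 for a, b in zip(template, tokens) if a == b or a == WILDCARD)
    return same / len(tokens)


class TwoLayerParser:
    """Hypothetical streaming parser: coarse bucket lookup, then fine matching."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        # coarse key -> list of templates, each a list of fine-grained tokens
        self.buckets = defaultdict(list)

    def parse(self, line):
        # Layer 1: fast, coarse key narrows the candidate templates.
        coarse = coarse_tokens(line)
        key = (len(coarse), coarse[0] if coarse else "")

        # Layer 2: fine-grained comparison within the bucket only.
        fine = fine_tokens(line)
        best, best_score = None, 0.0
        for template in self.buckets[key]:
            if len(template) != len(fine):
                continue
            score = similarity(template, fine)
            if score > best_score:
                best, best_score = template, score

        if best is not None and best_score >= self.threshold:
            # Generalize the matched template: differing positions become wildcards.
            for i, tok in enumerate(fine):
                if best[i] != tok:
                    best[i] = WILDCARD
            return " ".join(best)

        # No sufficiently similar template: this log starts a new one.
        self.buckets[key].append(list(fine))
        return " ".join(fine)
```

Feeding `"Connected to 10.0.0.1 port 22"` and then `"Connected to 10.0.0.7 port 8080"` to one parser instance yields the template `Connected to 10 . 0 . 0 . <*> port <*>` for the second log. The expensive comparison runs only against templates in the same coarse bucket, which is the trade-off the abstract describes.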