Webpage classification differs from traditional text classification in its irregular words and phrases and its massive, unlabeled features, which makes it harder to obtain effective features. To cope with this problem, we propose two scenarios for extracting meaningful strings, based on document clustering and term clustering with multiple strategies, to optimize a Vector Space Model (VSM) and thereby improve webpage classification. The results show that document clustering works better than term clustering in handling document content; however, better overall performance is obtained when spectral clustering is combined with document clustering. Moreover, since images appear on the same webpage as the document content, the proposed method is also applied to extract meaningful terms for images, and experimental results likewise show its effectiveness in improving webpage classification.
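As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch (not the authors' implementation) of optimizing a VSM with document clustering: webpages are grouped by spectral clustering, the highest-scoring strings inside each cluster are kept as the "meaningful strings", and a classifier is trained on the reduced VSM. The corpus, the cluster count, the scoring rule, and the choice of classifier are all assumptions made for the example.

```python
# Hypothetical sketch (not the paper's code): spectral clustering of documents
# selects "meaningful strings", which form a reduced Vector Space Model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering
from sklearn.svm import LinearSVC

def reduced_vsm_classifier(pages, labels, n_doc_clusters=10, strings_per_cluster=100):
    # 1. Ordinary term-level VSM over the raw (often irregular) webpage text.
    vsm = TfidfVectorizer()
    X = vsm.fit_transform(pages)                              # (n_pages, n_terms)
    vocab = np.array(vsm.get_feature_names_out())

    # 2. Spectral clustering of the documents.
    doc_cluster = SpectralClustering(
        n_clusters=n_doc_clusters, affinity="nearest_neighbors"
    ).fit_predict(X.toarray())

    # 3. Keep the strings that score highest inside each document cluster.
    keep = set()
    for c in range(n_doc_clusters):
        cluster_mean = np.asarray(X[doc_cluster == c].mean(axis=0)).ravel()
        keep.update(int(i) for i in np.argsort(cluster_mean)[-strings_per_cluster:])
    cols = sorted(keep)

    # 4. Train a classifier on the optimized (reduced) VSM.
    clf = LinearSVC().fit(X[:, cols], labels)
    return clf, vocab[cols]
```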
This article acquaints the public with the insights gained from conducting document searches in the Slovak public administration information system, supported by knowledge of its management. Additionally, it discusses the advantages of simulating performance parameters and comparing the obtained results with the real parameters of the eZbierka (eCollection) legislation webpage. The comparison was based on simulated results, obtained with the Gatling simulation tool, versus results obtained by measuring the properties of the public administration legislation webpage. Both sets of data (simulated and real) were generated via the document search technologies in place on the eZbierka legislation webpage. The webpage provides users with binding laws and bylaws as electronically signed PDF files, and it is free and open source. To simulate access to documents on the webpage, the Gatling simulation tool was used. The tool simulated the activity performed in the background of the information system as a user attempted to read the data via the steps described in the scenario. The settings of the simulated environment corresponded as closely as possible to the hardware parameters and network infrastructure used to operate the information system. Based on these data, by varying the load we determined the number of users, the response time to queries, and the number of queries; these parameters define the throughput of the server of the legislation webpage. The determination of the required parameters and the performance of the search technology operations are confirmed by a suitable hardware design and appropriate webpage parameter settings. For comparison, we used data from the eZbierka legislation webpage covering its operational period from January 2016 to January 2019, and analysed the relevant data to determine the parameter values of the legislation webpage of the slov-lex information system. The basic elements of the design solution include the technology used, the technology for searching legislative documents with the support of a search tool, and a graphical database interface. By comparing the results, their dependencies, and their proportionality, it is possible to confirm that the search technology applied to document selection was properly chosen. Further, the graphical interface of the real web database was confirmed.
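The load-measurement idea can be illustrated without Gatling itself. The sketch below is an assumed Python analogue, not the eZbierka/Gatling scenario: it issues concurrent document requests and reports the response times and throughput that the article compares. The URL, user count, and request count are placeholders.

```python
# Hypothetical analogue of the load test (not the Gatling scenario itself):
# concurrent "users" request a document and we report response time and throughput.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

DOC_URL = "https://example.invalid/document.pdf"   # placeholder for a legislation PDF URL
N_USERS = 50                                       # simulated concurrent users
REQUESTS_PER_USER = 10

def one_user(_):
    timings = []
    for _ in range(REQUESTS_PER_USER):
        start = time.perf_counter()
        with urlopen(DOC_URL, timeout=30) as resp:  # fetch and fully read the document
            resp.read()
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=N_USERS) as pool:
        times = [t for user in pool.map(one_user, range(N_USERS)) for t in user]
    wall = time.perf_counter() - wall_start

    print(f"requests:      {len(times)}")
    print(f"mean response: {sum(times) / len(times):.3f} s")
    print(f"throughput:    {len(times) / wall:.1f} req/s")
```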
With the explosive growth of Internet information, it is increasingly important to fetch real-time, relevant information. This places higher demands on the speed of webpage classification, one of the common methods used to retrieve and manage information. To obtain a more efficient classifier, this paper proposes a webpage classification method based on locality-sensitive hash functions. It contains three innovative modules: building a feature dictionary, mapping feature vectors to fingerprints using locality-sensitive hashing, and extending webpage features. The comparison results show that the proposed algorithm achieves better performance in less time than the naive Bayes classifier.
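As a rough illustration of the fingerprinting step, the sketch below uses random-hyperplane locality-sensitive hashing to map a term-frequency vector to a short binary signature; pages with similar content tend to receive fingerprints that differ in few bits. The dictionary construction, the 32-bit signature length, and the toy pages are assumptions, not the paper's exact modules.

```python
# Hypothetical sketch of locality-sensitive hashing for webpage fingerprints:
# random hyperplanes turn each feature vector into a compact binary signature.
import numpy as np

rng = np.random.default_rng(0)

def build_dictionary(pages):
    """Assign every distinct token an index in the feature dictionary."""
    vocab = {}
    for text in pages:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def feature_vector(text, vocab):
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

def fingerprint(vec, planes):
    """One bit per random hyperplane: which side of the hyperplane the vector lies on."""
    return (planes @ vec >= 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

# Toy usage with placeholder pages.
pages = ["sport football match score", "football league match report", "stock market price index"]
vocab = build_dictionary(pages)
planes = rng.standard_normal((32, len(vocab)))      # 32-bit fingerprints
fps = [fingerprint(feature_vector(p, vocab), planes) for p in pages]

# Similar pages usually collide on more bits than dissimilar ones.
print("sport vs sport:  ", hamming(fps[0], fps[1]))
print("sport vs finance:", hamming(fps[0], fps[2]))
```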
Webpage keyword extraction is very important for automatic webpage summarization, retrieval, automatic question answering, character relation extraction, and other tasks. In this paper, the environment vector of words is constructed from lexical chains, word context, word frequency, and webpage attribute weights according to the characteristics of keywords. A multi-factor table of words is then built, and the keyword extraction problem is cast as a two-class decision over that table: keyword versus non-keyword. Words are then classified with a support vector machine (SVM); this method can extract keywords that are unregistered words and eliminate semantic ambiguities. Experimental results show that this method achieves higher precision and recall than the simple tf/idf algorithm.
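A minimal sketch of the two-class formulation described above: each candidate word is represented by a small multi-factor vector and an SVM decides keyword versus non-keyword. The four factor columns, the training data, and the RBF kernel are assumptions made for the example, not the paper's feature set.

```python
# Hypothetical sketch: keyword extraction as binary classification with an SVM.
# Each candidate word gets a multi-factor vector; the factors here are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Columns: lexical-chain strength, context score, term frequency, webpage-attribute weight
X_train = np.array([
    [0.8, 0.7, 12, 0.9],   # word appearing in <title> and a strong lexical chain
    [0.1, 0.2,  3, 0.1],   # ordinary body word
    [0.6, 0.5,  9, 0.7],
    [0.2, 0.1,  2, 0.0],
])
y_train = np.array([1, 0, 1, 0])   # 1 = keyword, 0 = non-keyword

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

candidate = np.array([[0.7, 0.6, 10, 0.8]])   # factors for a new candidate word
print("keyword" if model.predict(candidate)[0] == 1 else "non-keyword")
```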
Effectively distinguishing the important nodes of a power system makes it possible, under limited resources, to give those nodes additional protection or to change the topology, thereby improving system robustness and reducing the probability of accidents. Inspired by webpage ranking algorithms, an electrical stochastic approach for link structure analysis (E-SALSA) is proposed for evaluating important nodes in power systems. The algorithm jointly considers the influence of the power system topology, power flow, and other factors on each node, effectively reflects the real situation of the power system, and its characteristics better fit the power system context. On the IEEE 300-bus power system, E-SALSA was compared with the electrical betweenness algorithm and the model based on co-citation hypertext induced topic search (MBCC-HITS) algorithm using two indices: the scale of lost load and the scale of the largest subgroup. The results show that E-SALSA outperforms the electrical betweenness algorithm on both indices and, compared with MBCC-HITS, accounts more comprehensively for the influence of various factors on each node, which demonstrates the rationality and effectiveness of the E-SALSA algorithm.
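The abstract does not give the E-SALSA update rule, so the sketch below implements plain SALSA authority scoring on a weighted directed graph, with edge weights standing in for electrical quantities such as line flows; the weighting scheme and the toy network are assumptions, not the proposed algorithm.

```python
# Hypothetical sketch: plain SALSA authority scores on a weighted directed graph.
# Edge weights stand in for electrical quantities (e.g. line flows) -- an assumption;
# the actual E-SALSA weighting is defined in the paper, not here.
import numpy as np

def salsa_authority(W, iters=200, tol=1e-10):
    """W[i, j] > 0 means a weighted link from node i to node j."""
    W = np.asarray(W, dtype=float)
    row_sum = W.sum(axis=1, keepdims=True)
    col_sum = W.sum(axis=0, keepdims=True)
    Wr = np.divide(W, row_sum, out=np.zeros_like(W), where=row_sum > 0)  # row-normalised
    Wc = np.divide(W, col_sum, out=np.zeros_like(W), where=col_sum > 0)  # column-normalised
    M = Wc.T @ Wr                       # authority-to-authority random-walk matrix
    a = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        a_next = a @ M
        if np.abs(a_next - a).sum() < tol:
            break
        a = a_next
    return a / a.sum()

# Toy 4-node network; the weights are made-up "flows".
W = np.array([[0, 5, 0, 2],
              [0, 0, 3, 0],
              [4, 0, 0, 1],
              [0, 0, 6, 0]], dtype=float)
scores = salsa_authority(W)
print(np.argsort(scores)[::-1])   # nodes ranked from most to least important
```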
Funding: supported by the National Natural Science Foundation of China under Grants No. 61100205 and No. 60873001, the Hi-Tech Research and Development Program of China under Grant No. 2011AA010705, and the Fundamental Research Funds for the Central Universities under Grant No. 2009RC0212