Webpage classification differs from traditional text classification in its irregular words and phrases and its massive, unlabeled features, which makes it harder to obtain effective features. To cope with this problem, we propose two scenarios for extracting meaningful strings, based on document clustering and on term clustering with multiple strategies, to optimize a Vector Space Model (VSM) and thereby improve webpage classification. The results show that document clustering works better than term clustering in handling document content, and the best overall performance is obtained by spectral clustering combined with document clustering. Moreover, since images coexist with document content on the same webpage, the proposed method is also applied to extract meaningful image terms, and experimental results show its effectiveness in improving webpage classification as well.
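As a rough illustration of the kind of pipeline this abstract describes, the sketch below builds a TF-IDF vector space model over webpage text and applies spectral clustering to the documents, then surfaces the terms that dominate each cluster as candidate features. It is a minimal sketch using scikit-learn; the toy pages, parameters, and library choices are assumptions, not the authors' implementation.

```python
# Sketch: TF-IDF vector space model (VSM) plus spectral clustering of documents,
# keeping the strongest terms per cluster as candidate "meaningful strings".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering
import numpy as np

webpages = [
    "irregular phrases scraped from a news portal about local elections",
    "product listing page with laptop model numbers and prices",
    "another news article covering the election results",
    "online store category page for laptops and accessories",
]

# TF-IDF turns each page into a vector in the VSM.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(webpages)

# Spectral clustering over document similarity (cosine kernel on TF-IDF rows).
clusterer = SpectralClustering(n_clusters=2, affinity="cosine", random_state=0)
labels = clusterer.fit_predict(X)

# Terms with the highest mean weight in each cluster are kept as candidates.
terms = np.array(vectorizer.get_feature_names_out())
for c in sorted(set(labels)):
    centroid = X[labels == c].mean(axis=0).A1
    print(f"cluster {c}:", terms[centroid.argsort()[::-1][:5]])
```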
This article acquaints the public with the insights gained from conducting document searches in the Slovak public administration information system, supported by knowledge of its management. It also discusses the advantages of simulating performance parameters and comparing the results with the real parameters of the eZbierka (eCollection) legislation webpage. The comparison was based on simulated results, obtained with the Gatling simulation tool, versus measurements of the properties of the public administration legislation webpage. Both sets of data (simulated and real) were generated via the document search technologies in place on the eZbierka legislation webpage. The webpage provides users with binding laws and bylaws as electronically signed PDF files, freely available and open source. To simulate the accessing of documents on the webpage, the Gatling simulation tool was used. This tool simulated the activity performed in the background of the information system as a user attempted to read the data via the steps defined in the scenario. The settings of the simulated environment corresponded as closely as possible to the hardware parameters and network infrastructure used to operate the information system. Based on this data, by changing the load we determined the number of users, the response time to queries, and the number of queries; these parameters define the throughput of the legislation webpage's server. The required parameter determination and the performance of the search technology operations are confirmed by a suitable hardware design and by the webpage's property parameter settings. For comparison, we used data from the eZbierka legislation webpage over its operational period from January 2016 to January 2019, and analysed the relevant data to determine the parameter values of the legislation webpage of the slov-lex information system. The basic elements of the design solution include the technology used, the technology for searching legislative documents with the support of a search tool, and a graphic database interface. By comparing the results, their dependencies, and their proportionality, it is possible to confirm that an appropriate search technology was chosen for document selection; the graphic interface of the real web database was likewise confirmed.
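The study itself uses Gatling's own scenario DSL; as a language-neutral illustration of the underlying idea, the sketch below simulates concurrent users querying a document search endpoint and reports mean response time and throughput, the same parameters the article compares against real operation. The URL, user counts, and request counts are placeholders, not the real slov-lex or eZbierka endpoints or settings.

```python
# Minimal load-simulation sketch: N_USERS concurrent "users" each issue a few
# search requests; response times and overall throughput are reported.
import time
import statistics
import concurrent.futures
import urllib.request

SEARCH_URL = "https://example.org/ezbierka/search?q=zakon"  # placeholder endpoint
N_USERS = 20           # simulated concurrent users
REQUESTS_PER_USER = 5  # requests issued by each simulated user

def one_user(user_id: int) -> list[float]:
    """Issue a series of search requests and record each response time."""
    timings = []
    for _ in range(REQUESTS_PER_USER):
        start = time.perf_counter()
        with urllib.request.urlopen(SEARCH_URL, timeout=10) as resp:
            resp.read()
        timings.append(time.perf_counter() - start)
    return timings

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=N_USERS) as pool:
    all_timings = [t for user in pool.map(one_user, range(N_USERS)) for t in user]
elapsed = time.perf_counter() - start

print(f"requests: {len(all_timings)}")
print(f"mean response time: {statistics.mean(all_timings):.3f} s")
print(f"throughput: {len(all_timings) / elapsed:.1f} req/s")
```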
With the explosive growth of Internet information, it is increasingly important to fetch real-time, relevant information, which places higher demands on the speed of webpage classification, one of the common methods for retrieving and managing information. To obtain a more efficient classifier, this paper proposes a webpage classification method based on locality-sensitive hash functions. It contains three innovative modules: building a feature dictionary, mapping feature vectors to fingerprints using locality-sensitive hashing, and extending webpage features. Comparative results show that the proposed algorithm achieves better performance in less time than the naive Bayes method.
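To make the fingerprinting step concrete, here is a minimal sketch of one common locality-sensitive hashing scheme (random-hyperplane hashing): a webpage's feature vector is mapped to a short binary fingerprint, and similar pages tend to differ in fewer bits. The toy feature dictionary and surrounding code are assumptions for illustration, not the paper's exact modules.

```python
# Sketch: map term-count feature vectors to binary fingerprints with random
# hyperplanes; Hamming distance between fingerprints approximates similarity.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["news", "sport", "price", "laptop", "election", "review"]  # toy feature dictionary
N_BITS = 16
planes = rng.normal(size=(N_BITS, len(VOCAB)))  # one random hyperplane per fingerprint bit

def features(text: str) -> np.ndarray:
    """Toy feature vector: term counts over the feature dictionary."""
    words = text.lower().split()
    return np.array([words.count(term) for term in VOCAB], dtype=float)

def fingerprint(vec: np.ndarray) -> int:
    """The sign of the projection onto each hyperplane gives one bit."""
    bits = (planes @ vec) >= 0
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

fp_a = fingerprint(features("laptop price review laptop"))
fp_b = fingerprint(features("laptop review price"))
fp_c = fingerprint(features("election news sport"))
print("distance to similar page:", hamming(fp_a, fp_b))
print("distance to different page:", hamming(fp_a, fp_c))
```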
Webpage keyword extraction is very important for automatic webpage summarization, retrieval, automatic question answering, character relation extraction, and similar tasks. In this paper, the environment vector of each word is constructed from lexical chains, word context, word frequency, and webpage attribute weights according to the characteristics of keywords. A multi-factor table of words is thus constructed, and keyword extraction is cast as a two-class problem over this table: keyword versus non-keyword. Words are then classified with a support vector machine (SVM); this method can extract keywords that are unregistered words and eliminate semantic ambiguities. Experimental results show that this method achieves higher precision and recall than the simple tf-idf algorithm.
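As a small illustration of casting keyword extraction as binary classification, the sketch below gives each candidate word a short feature vector standing in for the multi-factor table (normalised frequency, a lexical-chain weight, a webpage-attribute weight) and trains an SVM to separate keywords from non-keywords. The feature columns, training data, and candidate words are invented for illustration and are not the paper's actual factors or corpus.

```python
# Sketch: keyword vs. non-keyword classification of candidate words with an SVM.
from sklearn.svm import SVC

# columns: [normalised term frequency, lexical-chain weight, attribute weight (title/anchor)]
X_train = [
    [0.09, 0.8, 1.0],   # frequent, strong lexical chain, appears in title -> keyword
    [0.07, 0.6, 0.5],
    [0.01, 0.1, 0.0],   # rare, weak chain, body text only                 -> non-keyword
    [0.02, 0.2, 0.0],
    [0.08, 0.7, 0.8],
    [0.01, 0.0, 0.0],
]
y_train = [1, 1, 0, 0, 1, 0]  # 1 = keyword, 0 = non-keyword

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)

candidates = {"classification": [0.06, 0.7, 0.9], "the": [0.12, 0.0, 0.0]}
for word, feats in candidates.items():
    label = "keyword" if clf.predict([feats])[0] == 1 else "non-keyword"
    print(word, "->", label)
```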
Funding: supported by the National Natural Science Foundation of China under Grants No. 61100205 and No. 60873001, the Hi-Tech Research and Development Program of China under Grant No. 2011AA010705, and the Fundamental Research Funds for the Central Universities under Grant No. 2009RC0212.