Webpage classification differs from traditional text classification: its irregular words and phrases, together with massive numbers of unlabeled features, make it harder to obtain effective features. To cope with this problem, we propose two scenarios for extracting meaningful strings, based on document clustering and on term clustering with multiple strategies, in order to optimize a Vector Space Model (VSM) and thereby improve webpage classification. The results show that document clustering works better than term clustering in handling document content; the best overall performance, however, is obtained by spectral clustering combined with document clustering. Moreover, since images appear in the same webpage as the document content, the proposed method is also applied to extract meaningful terms for images, and experimental results likewise show its effectiveness in improving webpage classification.
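As a minimal illustration of the Vector Space Model the abstract refers to, the sketch below builds tf-idf vectors and compares documents by cosine similarity; the documents and the greedy similarity comparison are hypothetical stand-ins, not the paper's actual clustering pipeline.

```python
import math
from collections import Counter

def tfidf_vsm(docs):
    """Build a tf-idf Vector Space Model: one sparse vector per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    vsm = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf}
        vsm.append(vec)
    return vsm

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["cheap flights hotel booking",
        "hotel booking deals flights",
        "python machine learning tutorial"]
vsm = tfidf_vsm(docs)
# The two travel pages score more similar to each other than to the tutorial page,
# which is the property document clustering over a VSM exploits.
print(cosine(vsm[0], vsm[1]) > cosine(vsm[0], vsm[2]))
```

A clustering algorithm (k-means, or the spectral clustering the abstract finds best) would then group documents by exactly this similarity signal.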
This article acquaints the public with the insights gained from conducting document searches in the Slovak public administration information system, supported by knowledge of its management. It also discusses the advantages of simulating performance parameters and comparing the results obtained with the real parameters of the eZbierka (eCollection) legislation webpage. The comparison set simulated results, obtained with the Gatling simulation tool, against measurements of the properties of the public administration legislation webpage itself. Both data sets (simulated and real) were generated via the document search technologies in place on the eZbierka legislation webpage. The webpage provides users with binding laws and bylaws as electronically signed PDF files, and it is free and open source. To simulate access to documents on the webpage, the Gatling simulation tool was used; it reproduces the activity performed in the background of the information system as a user reads data through the steps defined in the scenario. The simulated environment was configured to match, as closely as possible, the hardware parameters and network infrastructure properties used to operate the respective information system. From this data, by varying the load, we determined the number of users, the response time to queries, and the number of queries; these parameters define the throughput of the legislation webpage's server. The required parameter determination and the performance of the search operations are confirmed by a suitable hardware design and the webpage's property parameter settings. For comparison, we used data from the eZbierka legislation webpage's operational period of January 2016 to January 2019, and analysed the relevant data to determine the parameter values of the legislation webpage of the slov-lex information system. The basic elements of the design solution include the technology used, the technology for searching the legislative documents with the support of a search tool, and a graphic database interface. By comparing the results, their dependencies, and their proportionality, it is possible to confirm that the applied search technology for document selection was determined appropriately; the graphic interface of the real web database was likewise confirmed.
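Gatling scenarios are written in its own Scala/Java DSL; as a language-neutral illustration of the quantities the abstract says the simulation yields (number of users, response time, throughput), here is a stdlib-only Python sketch that drives concurrent simulated "user" sessions against a stub search handler. The handler, user counts, and sleep time are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def search_documents(query):
    """Stub standing in for one document-search request to the legislation webpage."""
    time.sleep(0.01)  # simulated server processing time
    return f"results for {query}"

def run_load(n_users, requests_per_user):
    """Run concurrent user sessions and report the metrics a load tool would chart."""
    latencies = []  # list.append is thread-safe in CPython

    def user_session(uid):
        for i in range(requests_per_user):
            t0 = time.perf_counter()
            search_documents(f"law-{uid}-{i}")
            latencies.append(time.perf_counter() - t0)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        list(pool.map(user_session, range(n_users)))
    wall = time.perf_counter() - start
    return {"requests": len(latencies),
            "mean_response_s": sum(latencies) / len(latencies),
            "throughput_rps": len(latencies) / wall}

stats = run_load(n_users=10, requests_per_user=5)
print(stats["requests"])  # 50
```

Varying `n_users` while watching `mean_response_s` and `throughput_rps` is the "load changing" the abstract describes for finding the server's saturation point.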
Webpage keyword extraction is very important for automatic webpage summarization, retrieval, automatic question answering, character relation extraction, and similar tasks. In this paper, an environment vector for each word is constructed from lexical chains, word context, word frequency, and webpage attribute weights, according to the characteristics of keywords. A multi-factor table of words is thus constructed, and the keyword extraction problem becomes a two-class decision over that table: keyword versus non-keyword. Words are then classified with a support vector machine (SVM); this method can extract keywords that are unregistered words and can eliminate semantic ambiguities. Experimental results show that this method achieves higher precision and recall than the plain tf-idf algorithm.
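To make the keyword/non-keyword decision concrete, the sketch below trains a tiny perceptron, a simple linear classifier standing in for the paper's SVM, over a hypothetical multi-factor table: each word is a row of made-up feature scores (normalized term frequency, in-title weight, lexical-chain score), none of which are the paper's actual features.

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Tiny linear classifier used here as a stand-in for the SVM in the paper."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred  # -1, 0, or +1
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical multi-factor rows: (term frequency, in-title weight, lexical-chain score)
features = [(0.9, 1.0, 0.8), (0.7, 1.0, 0.6), (0.1, 0.0, 0.1), (0.2, 0.0, 0.0)]
labels   = [1, 1, 0, 0]  # 1 = keyword, 0 = non-keyword
w, b = train_perceptron(features, labels)
print(predict(w, b, (0.8, 1.0, 0.7)))  # 1: classified as a keyword
```

A real SVM adds margin maximization and kernels, but the pipeline shape is the same: word features in, keyword/non-keyword label out.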
This paper proposes a watermarking algorithm for tamper-proofing web pages. For a web page, it generates a watermark consisting of a sequence of spaces and tabs; the watermark is then embedded into the web page after each word and each line. When a watermarked web page is tampered with, the extracted watermark can detect and locate the modifications to the page. The framework of a watermarked web server system is also given. Compared with traditional digital signature methods, this watermarking method is more transparent in that there is no need to detach the watermark before displaying web pages. Experimental results show that the proposed scheme is an effective tool for tamper-proofing web pages.
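A minimal sketch of the space/tab idea, assuming a simplified per-line embedding (the paper also embeds after each word): each watermark bit becomes an invisible trailing character, so editing a line destroys its mark and localizes the tampering. The page content and bit pattern are illustrative only.

```python
# Encode each watermark bit as a trailing whitespace character: space = 0, tab = 1.
def embed(lines, bits):
    """Append one invisible bit to each of the first len(bits) lines of the page source."""
    assert len(bits) <= len(lines)
    marks = [" " if b == 0 else "\t" for b in bits]
    return [line + mark for line, mark in zip(lines, marks)] + lines[len(bits):]

def extract(lines, n_bits):
    """Read the trailing-whitespace bits back out of a (possibly tampered) page."""
    return [0 if line.endswith(" ") else 1 for line in lines[:n_bits]]

page = ["<html>", "<body>", "<p>Hello</p>", "</body>", "</html>"]
watermark = [1, 0, 0, 1]
marked = embed(page, watermark)
print(extract(marked, len(watermark)) == watermark)  # True: page is intact

# Tamper detection: editing line 2 destroys its trailing mark,
# so the extracted bits no longer match and the change is localized.
tampered = marked[:]
tampered[2] = "<p>Hacked</p>"
print(extract(tampered, len(watermark)) == watermark)  # False
```

Because the marks are whitespace, browsers render the watermarked page identically to the original, which is the transparency advantage the abstract claims over detached signatures.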
Funding: supported by the National Natural Science Foundation of China under Grants No. 61100205 and No. 60873001; the Hi-Tech Research and Development Program of China under Grant No. 2011AA010705; and the Fundamental Research Funds for the Central Universities under Grant No. 2009RC0212.