Focusing on the problem that it is hard to utilize multi-field web information in various forms in large-scale web search, a novel approach is proposed that can automatically acquire features from web pages based on a set of well-defined rules. The features describe the contents of web pages from different aspects and can be used to improve ranking performance for web search. The acquired features have the advantages of a unified form and less noise, and can easily be used in web page relevance ranking. A specification for judging the relevance between user queries and acquired features is also proposed. Experimental results show that the features acquired by the proposed approach, together with the feature relevance specification, can significantly improve relevance ranking performance for web search.
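The rule-based acquisition step can be pictured with a small sketch: a dictionary of field-extraction rules maps each page field to features in a unified form. The regular-expression rules and field names below are illustrative assumptions, not the paper's actual rule set.

```python
import re

def extract_features(page_html):
    """Rule-based acquisition of unified-form features from page fields.
    The rules here are illustrative; the paper defines its own rule set."""
    rules = {
        "title":  r"<title>(.*?)</title>",
        "h1":     r"<h1>(.*?)</h1>",
        "anchor": r"<a[^>]*>(.*?)</a>",
    }
    # Every field yields a list of strings -- the same shape for all fields,
    # which is what makes the acquired features easy to use in ranking.
    return {field: re.findall(pat, page_html, re.I | re.S)
            for field, pat in rules.items()}

html = "<title>Web Search</title><h1>Ranking</h1><a href='/x'>features</a>"
feats = extract_features(html)
```

A ranker can then score each field's features against the query terms separately and combine the per-field scores.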
A new approach for automated ontology mapping using web search engines (such as Google) is presented. Based on lexico-syntactic patterns, the hyponymy relationships between ontology concepts can be obtained from the web by search engines, and an initial candidate mapping set consisting of ontology concept pairs is generated. According to the concept hierarchies of the ontologies, a set of production rules is proposed to delete concept pairs inconsistent with the ontology semantics from the initial candidate mapping set and to add concept pairs consistent with the ontology semantics to it. Finally, ontology mappings are chosen from the candidate mapping set automatically with a mapping selection rule based on mutual information. Experimental results show that the F-measure can reach 75% to 100% and that the approach can effectively accomplish the mapping between ontologies.
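A mutual-information-based selection rule can be sketched with pointwise mutual information (PMI) computed from web hit counts. This is a minimal sketch under the assumption that the rule thresholds a PMI-style score; the hit counts and threshold below are made up, not the paper's data.

```python
import math

def pmi(pair_hits, hits_a, hits_b, total_pages):
    """Pointwise mutual information of two concepts from web hit counts
    (hypothetical counts; a real system would query a search engine)."""
    p_ab = pair_hits / total_pages
    p_a = hits_a / total_pages
    p_b = hits_b / total_pages
    return math.log(p_ab / (p_a * p_b), 2)

def select_mappings(candidates, threshold=0.0):
    """Keep candidate concept pairs whose score exceeds a threshold."""
    return [(a, b) for (a, b, score) in candidates if score > threshold]

# Concepts that co-occur far more often than chance get a high PMI.
score = pmi(pair_hits=8000, hits_a=20000, hits_b=40000, total_pages=10**8)
```

Pairs scoring above the threshold survive as mappings; the production rules would then prune pairs that contradict the concept hierarchies.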
This study examined users' querying behaviors based on a sample of 30 Chinese college students from Peking University. The authors designed 5 search tasks and each participant conducted two randomly selected search tasks during the experiment. The results show that when searching for pre-designed search tasks, users often have relatively clear goals and strategies before searching. When formulating their queries, users often select words from tasks, use concrete concepts directly, or extract 'central words' or keywords. When reformulating queries, seven query reformulation types were identified from users' behaviors, i.e. broadening, narrowing, issuing new query, paralleling, changing search tools, reformulating syntax terms, and clicking on suggested queries. The results reveal that the search results and/or the contexts can also influence users' querying behaviors.
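Several of the reformulation types above can be approximated automatically by comparing the token sets of consecutive queries. The heuristic below is a sketch of such a classifier; the study itself coded behaviors manually, and the rules here (subset = broadening, superset = narrowing, overlap = paralleling) are simplifying assumptions.

```python
def reformulation_type(prev_query, new_query):
    """Heuristic classification of a query reformulation, inspired by the
    taxonomy above. A rough sketch: real coding also considers search tools,
    syntax operators, and clicked suggestions."""
    a = set(prev_query.lower().split())
    b = set(new_query.lower().split())
    if a == b:
        return "identical"
    if b < a:
        return "broadening"   # constraints removed
    if b > a:
        return "narrowing"    # constraints added
    if a & b:
        return "paralleling"  # shares some terms, shifts focus
    return "new query"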
The paper presents a novel benefit-based query processing strategy for efficient query routing. Using a DHT as the overlay network, it first applies Nash equilibrium to construct the optimal peer group based on the correlations of keywords and the coverage and overlap of the peers to decrease the time cost, and then presents a two-layered architecture for query processing that utilizes a Bloom filter as a compact representation to reduce bandwidth consumption. Extensive experiments conducted on a real-world dataset demonstrate that our approach markedly decreases the processing time while also improving precision and recall.
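The compact-representation idea can be illustrated with a minimal Bloom filter: membership of a peer's keyword set is summarized in a fixed-size bit array, so peers exchange a few kilobytes instead of full keyword lists. Parameters (m, k) and the keywords are illustrative, not taken from the paper's setup.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter as a compact set representation. False positives
    are possible (rare at this size); false negatives are not."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for kw in ["database", "query", "routing"]:
    bf.add(kw)
```

A peer receiving the filter can test a query keyword against it and skip peers whose filters report no match.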
A web-based translation method for Chinese organization names is proposed. After analyzing the structure of Chinese organization names, the methods of bilingual query formulation and maximum-entropy-based translation re-ranking are suggested to retrieve the English translation from the web via a public search engine. Experiments on Chinese university names demonstrate the validity of this approach.
Online reviews are considered an important indicator for users deciding on an activity, whether it is watching a movie, going to a restaurant, or buying a product. They also serve businesses by tracking user feedback. The sheer volume of online reviews makes it difficult for a human to process and extract all the significant information needed to make purchasing choices. As a result, there has been a trend toward systems that can automatically summarize opinions from a set of reviews. In this paper, we present a hybrid algorithm that combines an auto-summarization algorithm with a sentiment analysis (SA) algorithm to offer a personalized user experience and to bridge the semantic-pragmatic gap. The algorithm consists of six steps that start with the original text document and generate a summary of that text by choosing the N most relevant sentences in the text. The tagged texts are then processed and passed to a Naive Bayesian classifier along with their tags as training data. The raw data used in this paper belong to the tagged corpus of positive and negative processed movie reviews introduced in [1]. The measures used to gauge the performance of the SA and classification algorithm for all test cases are accuracy, recall, and precision. We describe in detail both the aspect extraction and sentiment detection modules of our system.
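The "choose the N most relevant sentences" step can be sketched generically: score each sentence by how frequent its words are in the whole document and keep the top N. This is a standard frequency-based extractive summarizer, assumed here for illustration; it is not the paper's exact six-step algorithm.

```python
from collections import Counter

def summarize(text, n=2):
    """Extractive summary: keep the n sentences whose words are most
    frequent across the document (a generic sketch)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w for s in sentences for w in s.lower().split())
    # Higher total word frequency = more central to the document.
    scored = sorted(sentences, key=lambda s: -sum(freq[w] for w in s.lower().split()))
    top = set(scored[:n])
    # Re-emit the selected sentences in their original order.
    return ". ".join(s for s in sentences if s in top) + "."

review = "The movie was great. The plot was great. I hated the popcorn."
summary = summarize(review, n=2)
```

The resulting summary sentences would then be tagged and fed to the Naive Bayesian sentiment classifier as described above.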
Purpose: The aim of this paper is to discuss how the keyword concentration change ratio (KCCR) is used to identify the stability-mutation feature of Web search keywords during information analyses and predictions.
Design/methodology/approach: After introducing the stability-mutation feature of keywords and its significance, the paper describes the function of the KCCR in identifying keyword stability-mutation features. Using Ginsberg's influenza keywords, the paper shows how the KCCR can identify the keyword stability-mutation feature effectively.
Findings: The keyword concentration ratio has a close positive correlation with the rate of change of the research objects retrieved by users, so from the stability-mutation characteristic of keywords we can understand the relationship between the keywords and certain information. In general, keywords representing mutation fit objects that change in the short term, while those representing stability suit objects that change over the long term.
Research limitations: It is difficult to acquire the true frequency of keywords, so indexes or parameters closely related to the true search volume were chosen for this study.
Practical implications: Stability-mutation feature identification of Web search keywords can be applied to predict and analyze information about unknown public events by observing trends in the keyword concentration ratio.
Originality/value: The stability-mutation feature of Web search can be quantitatively described by the keyword concentration change ratio (KCCR). Through the KCCR, the authors used Ginsberg's influenza epidemic data to demonstrate how accurate and effective the proposed method is in information analyses and predictions.
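One plausible formalization of the idea can be sketched as follows: define the concentration ratio as the share of total search volume held by the top-k keywords, and the change ratio as its relative change between two periods. The abstract does not give the paper's exact equations, so both definitions below are assumptions for illustration.

```python
def concentration_ratio(volumes, k=3):
    """Share of total search volume captured by the top-k keywords."""
    top = sorted(volumes, reverse=True)[:k]
    return sum(top) / sum(volumes)

def kccr(volumes_t0, volumes_t1, k=3):
    """Keyword concentration change ratio between two periods -- one
    plausible formalization; the paper's exact definition may differ."""
    c0 = concentration_ratio(volumes_t0, k)
    c1 = concentration_ratio(volumes_t1, k)
    return (c1 - c0) / c0

stable = [10, 10, 10, 10, 10]   # volumes spread evenly: a "stability" keyword set
spike  = [50, 5, 5, 5, 5]       # volume piles onto one keyword: a "mutation"
```

Under this sketch, a near-zero KCCR signals stability, while a large positive KCCR signals mutation, i.e., a short-term event.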
A new common phrase scoring method is proposed based on term frequency-inverse document frequency (TFIDF) and the independence of the phrase. Combining the two properties helps identify more reasonable common phrases, which improves the accuracy of clustering. An equation to measure the independence of a phrase is also proposed in this paper. The new algorithm, which improves the suffix tree clustering algorithm (STC), is named improved suffix tree clustering (ISTC). To validate the proposed algorithm, a prototype system was implemented and used to cluster several groups of web search results obtained from the Google search engine. Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.
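The combination of the two properties can be sketched as a product of a TFIDF weight and an independence term. The abstract does not give the paper's independence equation, so the sketch below assumes a common proxy: the entropy of the words that follow the phrase (a phrase that can be followed by many different words tends to be an independent unit).

```python
import math
from collections import Counter

def occurrences(phrase, toks):
    """Start indices where the token list `phrase` occurs in `toks`."""
    n = len(phrase)
    return [i for i in range(len(toks) - n + 1) if toks[i:i + n] == phrase]

def follower_entropy(phrase, docs):
    """Entropy of the word immediately following the phrase (assumed
    independence proxy, not the paper's equation)."""
    followers = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for i in occurrences(phrase, toks):
            if i + len(phrase) < len(toks):
                followers[toks[i + len(phrase)]] += 1
    total = sum(followers.values())
    if not total:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in followers.values())

def phrase_score(phrase, docs):
    """TFIDF weight times an independence term (illustrative combination)."""
    tf = sum(len(occurrences(phrase, d.lower().split())) for d in docs)
    df = sum(1 for d in docs if occurrences(phrase, d.lower().split()))
    idf = math.log2(len(docs) / df) if df else 0.0
    return tf * idf * (1 + follower_entropy(phrase, docs))

docs = ["new york is big", "new york is old", "a new idea"]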
Modern search engines record user interactions and use them to improve search quality. In particular, user click-through has been successfully used to improve click-through rate (CTR), Web search ranking, and query recommendations and suggestions. Although click-through logs can provide implicit feedback on users' click preferences, deriving accurate absolute relevance judgments is difficult because of click noise and behavior biases. Previous studies showed that user clicking behaviors are biased in many aspects, such as "position" (the user's attention decreases from top to bottom) and "trust" (Web site reputations affect the user's judgment). To address these problems, researchers have proposed several behavior models (usually referred to as click models) to describe users' practical browsing behaviors and to obtain an unbiased estimation of result relevance. In this study, we review recent efforts to construct click models for better search ranking and propose a novel convolutional neural network architecture for building click models. Compared to traditional click models, our model not only considers user behavior assumptions as input signals but also uses the content and context information of search engine result pages. In addition, our model uses parameters from traditional click models to restrict the meaning of some outputs in our model's hidden layer. Experimental results show that the proposed model achieves considerable improvement over state-of-the-art click models on the evaluation metric of click perplexity.
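The examination hypothesis that underlies many traditional click models can be shown in a few lines: a result is clicked only if it is examined and judged relevant, so observed CTR mixes position bias with relevance, and dividing out the examination probability recovers relevance. All probabilities below are illustrative numbers, not estimates from any log.

```python
def debias(ctr, exam):
    """Recover relevance from observed CTR under the examination hypothesis:
    P(click) = P(examined) * P(relevant | examined)."""
    return [c / e for c, e in zip(ctr, exam)]

# Assumed examination probabilities (attention decays from top to bottom)
# and true relevances of four results.
exam = [1.0, 0.7, 0.5, 0.3]
rels = [0.2, 0.9, 0.9, 0.9]
# The click-through rates a position-biased log would record:
ctr = [r * e for r, e in zip(rels, exam)]
```

Note how the raw CTRs fall with rank even though results 2-4 are equally relevant; the neural click model described above learns such bias structure jointly with content signals instead of assuming it.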
In this paper we discuss three important kinds of Markov chains used in Web search algorithms: the maximal irreducible Markov chain, the minimal irreducible Markov chain, and the middle irreducible Markov chain. We discuss the stationary distributions, the convergence rates, and the Maclaurin series of the stationary distributions of the three kinds of Markov chains. Among other things, our results show that the maximal and minimal Markov chains have the same stationary distribution and that the stationary distribution of the middle Markov chain reflects the real Web structure more objectively. Our results also prove that the maximal and middle Markov chains have the same convergence rate and that the maximal Markov chain converges faster than the minimal Markov chain when the damping factor α > 1/√2.
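The maximal irreducible chain is the one behind standard PageRank: with probability α the surfer follows a link, otherwise it teleports uniformly, and dangling pages redistribute their mass over all pages. The sketch below computes its stationary distribution by power iteration on a toy three-page web; the graph is illustrative.

```python
def pagerank(links, alpha=0.85, iters=100):
    """Power iteration for the PageRank (maximal irreducible) chain.
    `links[i]` lists the pages that page i links to."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - alpha) / n] * n          # uniform teleportation mass
        for i, outs in enumerate(links):
            if outs:
                share = alpha * rank[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:                            # dangling page: spread uniformly
                for j in range(n):
                    new[j] += alpha * rank[i] / n
        rank = new
    return rank

# Tiny web: page 0 -> 1, page 1 -> 2, page 2 -> 0 and 1.
ranks = pagerank([[1], [2], [0, 1]])
```

Page 1, with two in-links, ends up with the largest stationary probability; the damping factor α also governs the convergence rate discussed above.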
Internet users heavily rely on web search engines for their intended information. The major revenue of search engines is advertisements (or ads). However, search advertising suffers from fraud. Fraudsters generate fake traffic which does not reach the intended audience and increases advertisers' costs. Therefore, it is critical to detect fraud in web search. Previous studies solve this problem through fraudster detection (especially of bots) by leveraging fraudsters' unique behaviors. However, they may fail to detect new means of fraud, such as crowdsourcing fraud, since crowd workers behave in part like normal users. To this end, this paper proposes an approach to detecting fraud in web search from the perspective of fraudulent keywords. We begin by using a unique dataset of 150 million web search logs to examine the discriminating features of fraudulent keywords. Specifically, we model the temporal correlation of fraudulent keywords as a graph, which reveals a very well-connected community structure. Next, we design DFW (detection of fraudulent keywords), which mines the temporal correlations between candidate fraudulent keywords and a given list of seeds. In particular, DFW leverages several refinements to filter out non-fraudulent keywords that co-occur with seeds only occasionally. Evaluation using the search logs shows that DFW achieves high fraud detection precision (99%) and accuracy (93%). A further analysis reveals several typical temporal evolution patterns of fraudulent keywords and the co-existence of both bots and crowd workers as fraudsters in web search fraud.
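The temporal-correlation idea can be sketched as follows: keywords whose daily search-volume series move in lockstep with a known fraudulent seed become candidates. This is a much-simplified sketch of DFW's graph mining, assuming Pearson correlation with a fixed threshold; the real system adds several refinements to prune occasional co-occurrences, and all series below are made up.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def correlated_with_seeds(series, seeds, threshold=0.9):
    """Flag keywords whose volume series correlate strongly with any seed."""
    flagged = set()
    for kw, ts in series.items():
        if kw in seeds:
            continue
        if any(pearson(ts, series[s]) >= threshold for s in seeds):
            flagged.add(kw)
    return flagged

series = {
    "seed_kw":  [1, 2, 3, 4, 10],   # known fraudulent keyword
    "burst_kw": [2, 4, 6, 8, 20],   # moves in lockstep with the seed
    "flat_kw":  [5, 5, 4, 5, 5],    # ordinary, uncorrelated traffic
}
flagged = correlated_with_seeds(series, seeds={"seed_kw"})
```

In graph terms, edges above the threshold connect keywords, and the well-connected communities containing seeds are the fraud candidates.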
There are several issues with Web-based search interfaces on a Sensor Web data infrastructure. It can be difficult to (1) find the proper keywords for the formulation of queries and (2) explore the information if the user does not have previous knowledge about the particular sensor systems providing the information. We investigate how the visualization of sensor resources on a 3D Web-based Digital Earth globe, organized by level of detail (LOD), can enhance search and exploration of information by easing the formulation of geospatial queries against the metadata of sensor systems. Our case study provides an approach inspired by geographical mashups in which freely available functionality and data are flexibly combined. We use PostgreSQL, PostGIS, PHP, and X3D-Earth technologies to allow the Web3D standard and its geospatial component to be used for visual exploration and LOD control of a dynamic scene. Our goal is to facilitate the dynamic exploration of the Sensor Web and to allow the user to seamlessly focus in on a particular sensor system from a set of registered sensor networks deployed across the globe. We present a prototype metadata exploration system featuring LOD for a multiscaled Sensor Web as a Digital Earth application.
In the era of big data, stock markets are closely connected with Internet big data from diverse sources. This paper makes the first attempt to compare the linkage between stock markets and various Internet big data collected from search engines, public media, and social media. To achieve this purpose, a big data-based causality testing framework is proposed with three steps, i.e., data crawling, data mining, and causality testing. Taking the Shanghai Stock Exchange and Shenzhen Stock Exchange as the target stock markets, and web search data, news, and microblogs as samples of Internet big data, some interesting findings are obtained. 1) There is a strong bi-directional, linear and nonlinear Granger causality between stock markets and investors' web search behaviors due to some similar trends and uncertain factors. 2) News sentiments from public media have Granger causality with stock markets in a bi-directional linear way, while microblog sentiments from social media have Granger causality with stock markets in a unidirectional linear way, running from stock markets to microblog sentiments. 3) News sentiments can explain changes in stock markets better than microblog sentiments due to their authority. These results might provide valuable information for both stock market investors and modelers.
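The causality-testing step rests on the linear Granger test: does adding the past of series x reduce the error of predicting y beyond y's own past? The pure-Python sketch below implements the lag-1 F-statistic; the function names, single lag, and toy series are illustrative assumptions (real studies use multiple lags and significance tables).

```python
def ols_rss(X, y):
    """Residual sum of squares of regressing y on the columns of X plus an
    intercept, via Gaussian elimination on the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for col in range(k):                       # forward elimination w/ pivoting
        piv = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k                           # back-substitution
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return sum((yi - sum(w * v for w, v in zip(beta, row))) ** 2
               for row, yi in zip(rows, y))

def granger_f(x, y):
    """Lag-1 F-statistic for 'x Granger-causes y': compare y regressed on its
    own past (restricted) vs. on its past plus x's past (unrestricted)."""
    Y = y[1:]
    X_r = [[y[t - 1]] for t in range(1, len(y))]
    X_u = [[y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r, rss_u = ols_rss(X_r, Y), ols_rss(X_u, Y)
    n, q, k = len(Y), 1, 3
    if rss_u == 0:
        return float("inf")
    return ((rss_r - rss_u) / q) / (rss_u / (n - k))

# Toy series where y is exactly x delayed by one step, so x "causes" y.
x = [1, 3, 2, 5, 4, 6, 5, 8, 7, 9]
y = [0, 1, 3, 2, 5, 4, 6, 5, 8, 7]
```

A large F-statistic (compared against the F-distribution's critical value) rejects the null of no Granger causality, which is how findings 1)-2) above are established in both directions.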
This paper addresses the issue of searching for definitions. Specifically, for a given term, we find its definition candidates and rank the candidates according to their likelihood of being good definitions. This is in contrast to the traditional methods of either generating a single combined definition or outputting all retrieved definitions. Definition ranking is essential for this task. A specification for judging the goodness of a definition is given, in which a definition is categorized into one of three levels: good definition, indifferent definition, or bad definition. Methods of performing definition ranking are also proposed in this paper, which formalize the problem as either classification or ordinal regression. We employ SVM (Support Vector Machines) as the classification model and Ranking SVM as the ordinal regression model, respectively, so that they rank definition candidates according to their likelihood of being good definitions. Features for constructing the SVM and Ranking SVM models are defined, which represent the characteristics of terms, definition candidates, and their relationship. Experimental results indicate that the use of SVM and Ranking SVM can significantly outperform baseline methods such as heuristic rules, the conventional information retrieval method Okapi, and SVM regression. This holds both when the answers are paragraphs and when they are sentences. Experimental results also show that SVM and Ranking SVM models trained in one domain can be adapted to another domain, indicating that generic models for definition ranking can be constructed.
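The Ranking SVM formulation reduces ordinal regression over the three goodness levels to classification on pairs: each (better, worse) candidate pair yields a feature-vector difference labeled +1 (and its negation labeled -1), on which a standard linear SVM is trained. The sketch below shows that reduction; the two-dimensional features and level encoding (2 = good, 1 = indifferent, 0 = bad) are illustrative.

```python
def pairwise_transform(candidates):
    """Ranking SVM's pairwise reduction: each candidate is (features, level);
    every pair with a strict level order produces two signed training
    examples on the feature difference."""
    pairs = []
    for (fa, la) in candidates:
        for (fb, lb) in candidates:
            if la > lb:  # candidate a is a better definition than b
                diff = [u - v for u, v in zip(fa, fb)]
                pairs.append((diff, +1))
                pairs.append(([-d for d in diff], -1))
    return pairs

# Three candidates at the three goodness levels (features are made up).
candidates = [([1.0, 0.0], 2), ([0.5, 0.5], 1), ([0.0, 1.0], 0)]
pairs = pairwise_transform(candidates)
```

A linear classifier trained on `pairs` yields a weight vector whose dot product with a candidate's features serves as its ranking score.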
In the present era of big data, searching and ranking web pages efficiently on the World Wide Web to satisfy the specific search needs of the modern user is undoubtedly a major challenge for search engines. Even though a large number of web search techniques have been developed, some problems still exist when searching with generic search engines, as none of them can index the entire web. The issue is not just the volume but also the relevance concerning the user's requirements. Moreover, if the search query is partially incomplete or ambiguous, most modern search engines tend to return results by interpreting all possible meanings of the query. Concerning search quality, more than half of the retrieved web pages have been reported to be irrelevant. Hence web search personalization is required to retrieve search results that incorporate the user's interests. In the proposed research work we highlight the strengths and weaknesses of various studies proposed in the literature for web search personalization by carrying out a detailed comparison among them. The in-depth comparative study with baselines leads to the recommendation of the Intelligent Meta Search System (IMSS) and the Advanced Cluster Vector Page Ranking (ACVPR) algorithm as among the best approaches proposed in the literature for web search personalization. Furthermore, the detailed discussion of the comparative analysis of all categories opens new opportunities for research in different areas.
Funding (web multi-field feature acquisition): The National Natural Science Foundation of China (No. 60673087).
Funding (automated ontology mapping): The National Natural Science Foundation of China (Nos. 60425206, 90412003) and the Foundation of Excellent Doctoral Dissertation of Southeast University (No. YBJJ0502).
基金partially supported by China Scholarship Council(Grant No.:2009601175)
文摘This study examined users' querying behaviors based on a sample of 30 Chinese college students from Peking University. The authors designed 5 search tasks and each participant conducted two randomly selected search tasks during the experiment. The results show that when searching for pre-designed search tasks, users often have relatively clear goals and strategies before searching. When formulating their queries, users often select words from tasks, use concrete concepts directly, or extract 'central words' or keywords. When reformulating queries, seven query reformulation types were identified from users' behaviors, i.e. broadening, narrowing, issuing new query, paralleling, changing search tools, reformulating syntax terms, and clicking on suggested queries. The results reveal that the search results and/or the contexts can also influence users' querying behaviors.
Funding (benefit-based query routing): Supported by the National Natural Science Foundation of China (60673139, 60473073, 60573090).
Funding (Chinese organization name translation): Supported by the National Natural Science Foundation of China (Nos. 60736044, 60773066) and the Postdoctoral Funds of Heilongjiang.
Funding (KCCR stability-mutation study): supported by the National Social Science Foundation of China (Grant No. 13&ZD173).
Funding (improved suffix tree clustering): Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086); the Opening Foundation of the Jiangsu Key Laboratory of Computer Information Processing Technology at Soochow University (KJS0714); the Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082); and the National Natural Science Foundation of Jiangsu (BK2006094).
Funding (Markov chains in Web search): Supported by the National Natural Science Foundation of China (No. 10371034). Acknowledgements: We thank Zhi-Ming Ma for his valuable suggestions and instruction, and Guo-Lie Lan for helpful discussions.
基金supported by the National Key Research and Development Program of China under Grant No.2018YFB1800205the National Natural Science Foundation of China under Grant Nos.61725206 and U20A20180CAS-Austria Project under Grant No.GJHZ202114.
Abstract: Internet users heavily rely on web search engines for their intended information. The major revenue of search engines comes from advertisements (or ads). However, search advertising suffers from fraud: fraudsters generate fake traffic which does not reach the intended audience and increases the cost for advertisers. Therefore, it is critical to detect fraud in web search. Previous studies solve this problem through fraudster detection (especially of bots) by leveraging fraudsters' unique behaviors. However, they may fail to detect new means of fraud, such as crowdsourcing fraud, since crowd workers behave in part like normal users. To this end, this paper proposes an approach to detecting fraud in web search from the perspective of fraudulent keywords. We begin by using a unique dataset of 150 million web search logs to examine the discriminating features of fraudulent keywords. Specifically, we model the temporal correlation of fraudulent keywords as a graph, which reveals a very well-connected community structure. Next, we design DFW (detection of fraudulent keywords), which mines the temporal correlations between candidate fraudulent keywords and a given list of seeds. In particular, DFW leverages several refinements to filter out non-fraudulent keywords that co-occur with seeds only occasionally. The evaluation using the search logs shows that DFW achieves high fraud detection precision (99%) and accuracy (93%). A further analysis reveals several typical temporal evolution patterns of fraudulent keywords and the co-existence of both bots and crowd workers as fraudsters in web search fraud.
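The seed-expansion idea behind DFW, as described above, can be sketched in simplified form: keywords that repeatedly appear in the same time window as known fraudulent seeds are flagged, while keywords that co-occur only occasionally are filtered out by a minimum-co-occurrence threshold. The windowed logs, keyword names, and threshold below are invented for illustration and are not the paper's actual refinements:

```python
from collections import defaultdict

def expand_seeds(windows, seeds, min_cooccur=2):
    """Flag non-seed keywords that co-occur with any seed in at least
    min_cooccur time windows (a crude stand-in for temporal correlation)."""
    cooccur = defaultdict(int)
    for window in windows:              # each window: set of keywords seen together
        if not (seeds & window):
            continue                    # no seed in this window: ignore it
        for kw in window - seeds:
            cooccur[kw] += 1
    return {kw for kw, c in cooccur.items() if c >= min_cooccur}

windows = [
    {"cheap-clicks", "buy-traffic", "weather"},
    {"cheap-clicks", "buy-traffic", "news"},
    {"weather", "news"},                # no seed present: contributes nothing
]
seeds = {"cheap-clicks"}
print(sorted(expand_seeds(windows, seeds)))   # only "buy-traffic" clears the threshold
```

"weather" and "news" each co-occur with the seed only once, so the threshold filters them out, mirroring the paper's goal of discarding keywords that co-occur with seeds only occasionally.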
Funding: This work was supported in part by the Korea Institute of Science and Technology (KIST) Institutional Program (Project No. 2E24100).
Abstract: There are several issues with Web-based search interfaces on a Sensor Web data infrastructure. It can be difficult to (1) find the proper keywords for the formulation of queries and (2) explore the information if the user does not have previous knowledge about the particular sensor systems providing the information. We investigate how the visualization of sensor resources on a 3D Web-based Digital Earth globe, organized by level of detail (LOD), can enhance search and exploration of information by easing the formulation of geospatial queries against the metadata of sensor systems. Our case study provides an approach inspired by geographical mashups in which freely available functionality and data are flexibly combined. We use PostgreSQL, PostGIS, PHP, and X3D-Earth technologies to allow the Web3D standard and its geospatial component to be used for visual exploration and LOD control of a dynamic scene. Our goal is to facilitate the dynamic exploration of the Sensor Web and to allow the user to seamlessly focus in on a particular sensor system from a set of registered sensor networks deployed across the globe. We present a prototype metadata exploration system featuring LOD for a multiscaled Sensor Web as a Digital Earth application.
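The LOD idea described above can be sketched very simply: which tier of sensor metadata to render depends on the viewer's distance from the globe. The tier names, distance thresholds, and sensor records below are illustrative assumptions, not the prototype's actual configuration:

```python
def lod_level(camera_distance_km):
    """Map camera distance to a metadata tier (coarse to fine)."""
    if camera_distance_km > 5000:
        return "network"    # whole sensor networks shown as single markers
    if camera_distance_km > 500:
        return "system"     # individual sensor systems become visible
    return "sensor"         # per-sensor metadata and observations

TIERS = ["network", "system", "sensor"]

# Hypothetical registered resources, each tagged with the tier at which it appears.
sensors = [("buoy-net-7", "network"), ("station-42", "system"), ("temp-3", "sensor")]

# At 800 km the viewer sees networks and systems, but not yet per-sensor detail.
visible = [name for name, lvl in sensors
           if TIERS.index(lvl) <= TIERS.index(lod_level(800.0))]
print(visible)
```

In the actual prototype this selection would be driven by the X3D-Earth scene graph and geospatial queries against the PostGIS metadata store rather than a hard-coded list.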
Funding: Sponsored by the National Natural Science Foundation of China under Grant Nos. 71573244, 71532013, 71202115, and 71403260.
Abstract: In the era of big data, stock markets are closely connected with Internet big data from diverse sources. This paper makes the first attempt to compare the linkage between stock markets and various kinds of Internet big data collected from search engines, public media, and social media. To achieve this purpose, a big-data-based causality testing framework is proposed with three steps, i.e., data crawling, data mining, and causality testing. Taking the Shanghai Stock Exchange and Shenzhen Stock Exchange as targets for stock markets, and web search data, news, and microblogs as samples of Internet big data, some interesting findings can be obtained. 1) There is a strong bi-directional, linear and nonlinear Granger causality between stock markets and investors' web search behaviors due to some similar trends and uncertain factors. 2) News sentiments from public media have Granger causality with stock markets in a bi-directional linear way, while microblog sentiments from social media have Granger causality with stock markets in a unidirectional linear way, running from stock markets to microblog sentiments. 3) News sentiments can explain the changes in stock markets better than microblog sentiments due to their authority. The results of this paper might provide some valuable information for both stock market investors and modelers.
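The linear Granger-causality step of such a framework asks: does adding lagged web-search volume x improve a one-lag autoregressive fit of the stock series y? A self-contained lag-1 sketch follows, with a hand-rolled least-squares solver and synthetic series (y deterministically follows x with a one-step delay plus small noise); the data, lag order, and threshold are illustrative assumptions, not the paper's:

```python
def ols(X, y):
    """Least squares via normal equations and Gaussian elimination with pivoting."""
    k = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(k)]
    for col in range(k):                            # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):                  # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

def rss(X, y, beta):
    """Residual sum of squares of a fitted linear model."""
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def granger_f(x, y):
    """F-statistic for 'x Granger-causes y' at lag 1: compare the restricted
    model y_t = a + b*y_{t-1} against the model that also includes x_{t-1}."""
    Xr = [[1.0, y[t - 1]] for t in range(1, len(y))]            # restricted
    Xu = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]  # + lagged x
    yt = y[1:]
    r0 = rss(Xr, yt, ols(Xr, yt))
    r1 = rss(Xu, yt, ols(Xu, yt))
    n, q = len(yt), 1                               # q restrictions, 3 params unrestricted
    return ((r0 - r1) / q) / (r1 / (n - 3))

# Synthetic series: y tracks x with a one-step delay plus tiny deterministic noise.
x = [1.0, 2.0, 3.0, 2.0, 4.0, 5.0, 3.0, 6.0, 4.0, 7.0]
y = [0.0]
for t in range(1, len(x)):
    y.append(0.5 * x[t - 1] + 0.01 * (-1) ** t)
print(granger_f(x, y) > 10.0)   # large F: lagged x clearly helps predict y
```

A production analysis would use a proper time-series library with lag selection and p-values, but the restricted-versus-unrestricted comparison above is the core of the linear test.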
Abstract: This paper addresses the issue of search of definitions. Specifically, for a given term, we are to find its definition candidates and rank the candidates according to their likelihood of being good definitions. This is in contrast to the traditional methods of either generating a single combined definition or outputting all retrieved definitions. Definition ranking is essential for such search tasks. A specification for judging the goodness of a definition is given. In the specification, a definition is categorized into one of three levels: good definition, indifferent definition, or bad definition. Methods of performing definition ranking are also proposed in this paper, which formalize the problem as either classification or ordinal regression. We employ SVM (Support Vector Machines) as the classification model and Ranking SVM as the ordinal regression model, respectively; thus they rank definition candidates according to their likelihood of being good definitions. Features for constructing the SVM and Ranking SVM models are defined, which represent the characteristics of terms, definition candidates, and their relationship. Experimental results indicate that the use of SVM and Ranking SVM can significantly outperform baseline methods such as heuristic rules, the conventional information retrieval method Okapi, or SVM regression. This is true both when the answers are paragraphs and when they are sentences. Experimental results also show that SVM and Ranking SVM models trained in one domain can be adapted to another domain, indicating that generic models for definition ranking can be constructed.
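The pairwise idea behind Ranking SVM can be illustrated in miniature: learn a weight vector w such that w·f(better) > w·f(worse) for each ordered pair of definition candidates. Here a perceptron-style update stands in for the SVM's margin optimization, and the two binary features (say, "contains an 'is a' pattern" and "term appears at the start") are invented for illustration:

```python
def train_pairwise(pairs, dims, epochs=50, lr=0.1):
    """Learn w so that score(good) > score(bad) for every training pair.
    Each pair is (good_features, bad_features); mis-ordered pairs trigger
    an update along the feature difference, as in pairwise ranking."""
    w = [0.0] * dims
    for _ in range(epochs):
        for good, bad in pairs:
            margin = sum(wi * (g - b) for wi, g, b in zip(w, good, bad))
            if margin <= 0:                      # pair mis-ordered: update w
                w = [wi + lr * (g - b) for wi, g, b in zip(w, good, bad)]
    return w

def score(w, feats):
    return sum(wi * fi for wi, fi in zip(w, feats))

# Hypothetical (good, bad) candidate pairs over two binary features.
pairs = [([1.0, 1.0], [0.0, 1.0]),
         ([1.0, 0.0], [0.0, 0.0]),
         ([1.0, 1.0], [1.0, 0.0])]
w = train_pairwise(pairs, dims=2)
ranked = sorted([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
                key=lambda f: score(w, f), reverse=True)
print(ranked[0])   # the candidate with both "good definition" cues ranks first
```

A real Ranking SVM additionally maximizes the margin and handles non-separable pairs with slack variables, but the ordering constraint over feature differences is the same.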
Abstract: In the present era of big data, searching and ranking web pages efficiently on the World Wide Web to satisfy the specific search needs of the modern user is undoubtedly a major challenge for search engines. Even though a large number of web search techniques have been developed, some problems still exist when searching with generic search engines, as none of the search engines can index the entire web. The issue is not just the volume but also the relevance concerning the user's requirements. Moreover, if the search query is partially incomplete or ambiguous, then most modern search engines tend to return results by interpreting all possible meanings of the query. Concerning search quality, more than half of the retrieved web pages have been reported to be irrelevant. Hence web search personalization is required to retrieve search results while incorporating the user's interests. In the proposed research work we highlight the strengths and weaknesses of various studies proposed in the literature for web search personalization by carrying out a detailed comparison among them. The in-depth comparative study with baselines leads to the recommendation of the Intelligent Meta Search System (IMSS) and the Advanced Cluster Vector Page Ranking (ACVPR) algorithm as among the best approaches proposed in the literature for web search personalization. Furthermore, the detailed discussion of the comparative analysis of all categories opens new research opportunities in different areas.