In the era of big data, E-commerce plays an increasingly important role, and steel E-commerce occupies a prominent position within it. However, it is difficult for purchasing staff to choose satisfactory steel raw materials from the diverse commodities offered on steel E-commerce platforms. To improve the efficiency with which purchasers search for commodities on these platforms, we propose a novel deep-learning-based loss function for named entity recognition (NER). To address the impact of small samples and imbalanced data, our NER scheme incorporates the focal loss, label smoothing, and cross entropy into a lite bidirectional encoder representations from transformers (BERT) model to avoid over-fitting. Moreover, after analyzing the classic annotation techniques used to tag data, we choose the most suitable one for training the model in our proposed scheme. Experiments are conducted on Chinese steel E-commerce datasets. The experimental results show that the training time of the a-lite-BERT (ALBERT)-based method is much shorter than that of BERT-based models, while achieving similar performance in terms of precision, recall, and F1. Meanwhile, our proposed approach performs much better than the combination of Word2Vec, bidirectional long short-term memory (Bi-LSTM), and conditional random field (CRF) models with respect to both training time and F1.
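The loss described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the class-probability input, and the default `gamma` and `smoothing` values are assumptions; it only shows how a focal weighting term and a smoothed target distribution combine inside a cross-entropy sum.

```python
import math

def focal_label_smoothing_loss(probs, target, gamma=2.0, smoothing=0.1):
    """Illustrative per-token loss: cross entropy against a label-smoothed
    target, with each class term down-weighted by the focal factor
    (1 - p)^gamma so easy, confident classes contribute less."""
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        # Smoothed target: (1 - smoothing) mass on the gold label,
        # the remaining mass spread uniformly over all k classes.
        q = (1.0 - smoothing if i == target else 0.0) + smoothing / k
        loss += -q * ((1.0 - p) ** gamma) * math.log(p)
    return loss
```

A confident, correct prediction should incur a smaller loss than an uncertain one, which is the behavior that mitigates the imbalanced-data problem the abstract mentions.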
The multi-signature method can improve the accuracy of entity resolution. However, it introduces redundant computation in a parallel processing framework. In this paper, a multi-signature-based parallel entity resolution method called multi-sig-er is proposed. The method is implemented in a MapReduce-based framework, which first tags multiple signatures for each input object, uses these signatures to generate key-value pairs, and then shuffles the pairs to the reduce tasks responsible for similarity computation. To improve performance, two strategies are adopted: one prunes the candidate pairs produced by the blocking technique, and the other eliminates redundancy according to the transitive property. Both strategies reduce the number of similarity computations without affecting resolution accuracy. Experimental results on real-world datasets show that the method is better suited to large datasets than to small ones, and to complex similarity computation rather than simple similarity matching.
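The map/shuffle/reduce flow above can be sketched sequentially. This is a simplified stand-in for the MapReduce pipeline, not the multi-sig-er implementation: the signature function (first name token plus a length bucket) and the `seen`-set redundancy check are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def signatures(record):
    # Hypothetical multi-signature function: first token of the name
    # plus a coarse length bucket.
    name = record["name"].lower()
    return {name.split()[0], f"len{len(name) // 4}"}

def map_phase(records):
    """Map + shuffle: emit (signature, record) pairs, grouped by signature."""
    buckets = defaultdict(list)
    for rec in records:
        for sig in signatures(rec):
            buckets[sig].append(rec)
    return buckets

def reduce_phase(buckets, similar):
    """Reduce: compare pairs inside each signature bucket; the seen set
    stops a pair sharing several signatures from being compared twice."""
    seen, matches = set(), []
    for bucket in buckets.values():
        for a, b in combinations(bucket, 2):
            key = tuple(sorted((a["id"], b["id"])))
            if key in seen:
                continue
            seen.add(key)
            if similar(a, b):
                matches.append(key)
    return matches
```

Because each pair is compared at most once regardless of how many signatures it shares, the redundancy elimination does not change which matches are found.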
E-commerce plays an essential role in modern trade. Global e-commerce volume is estimated to have amounted to 29 trillion USD in 2017 and is expected to grow with the spread of the Internet and information and communication technologies (ICTs). Brazil, Russia, India, China and South Africa (BRICS), together with many other countries, consider e-commerce a means to facilitate rapid, inclusive and sustainable economic growth, improve living standards and alleviate poverty. This article examines areas for potential cooperation among the BRICS countries in developing e-commerce across rural and remote areas to fight poverty. It analyses the current state of e-commerce development in rural and remote areas of each BRICS country, including cases of public and private initiatives that support it. The article also identifies the opportunities which e-commerce brings to people living in rural and remote areas, and evaluates the existing challenges and risks. It concludes that, despite rapid e-commerce development in the BRICS countries and the significant opportunities created, e-commerce remains unevenly developed across regions and BRICS cooperation in this sphere is still lacking. Based on a comparative, normative and systematic in-depth analysis, the article develops a set of recommendations for deepening the BRICS countries' cooperation in the following areas: infrastructure in rural and remote regions; education; consumer protection; online dispute resolution; and coordinated policy on the international scene, including representation of the BRICS countries in international indexes such as the Organisation for Economic Co-operation and Development (OECD) Digital Services Trade Restrictiveness Index (STRI).
To address the problem that existing cross-modal entity resolution methods tend to ignore the high-level semantic correlations between cross-modal data, we propose a novel cross-modal entity resolution method for images and text that integrates a global and a fine-grained joint attention mechanism. First, we map the cross-modal data into a common embedding space using a feature extraction network. Then, we integrate the global joint attention mechanism with the fine-grained joint attention mechanism, so that the model can learn both the global semantic characteristics and the local fine-grained semantic characteristics of the cross-modal data, fully exploiting the cross-modal semantic correlation and boosting the performance of cross-modal entity resolution. Experiments on the Flickr-30K and MS-COCO datasets show that the overall R@sum performance exceeds five state-of-the-art methods by 4.30% and 4.54%, respectively, demonstrating the superiority of the proposed method.
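The retrieval protocol behind metrics such as R@sum (the sum of recall@1/5/10 in both directions) can be sketched as follows. This is a minimal illustration of ranking in a common embedding space by cosine similarity, not the proposed attention model; the convention that the i-th image pairs with the i-th text is an assumption of the sketch.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recall_at_k(image_embs, text_embs, k):
    """Fraction of images whose paired text (same index) appears in the
    top-k texts ranked by cosine similarity in the shared space."""
    hits = 0
    for i, img in enumerate(image_embs):
        ranked = sorted(range(len(text_embs)),
                        key=lambda j: cosine(img, text_embs[j]),
                        reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)
```

R@sum would then add recall_at_k for k in {1, 5, 10} for both image-to-text and text-to-image directions.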
Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system that manages dirty data without cleaning it. The system takes the real-world entity as its basic storage unit and retrieves query results according to the user's quality requirement. It is able to handle all kinds of inconsistencies recognized by entity resolution. We elaborate on the EntityManager system, covering its architecture, data model, and query processing techniques. To process queries efficiently, the system adopts novel indices, similarity operators, and query optimization techniques. Finally, we verify the efficiency and effectiveness of the system and present future research challenges.
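The idea of answering queries over uncleaned data at the entity level can be sketched as below. This is a toy illustration, not EntityManager's data model: the `entity_of` resolver, the majority-vote answer, and the `min_support` quality threshold are all assumptions standing in for the system's quality-aware query semantics.

```python
from collections import Counter, defaultdict

def entity_query(records, entity_of, attr, min_support=0.5):
    """Group dirty records under their resolved entity and answer an
    attribute query only when one value's support among the entity's
    records meets the caller's quality requirement."""
    groups = defaultdict(list)
    for rec in records:
        groups[entity_of(rec)].append(rec.get(attr))
    answers = {}
    for ent, values in groups.items():
        value, count = Counter(values).most_common(1)[0]
        if count / len(values) >= min_support:
            answers[ent] = value  # confident enough to report
    return answers
```

Conflicting records are retained rather than repaired; only the query answer reflects the requested quality level.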
Entity resolution is a key aspect of data quality and data integration, identifying which records in data sources correspond to the same real-world entity. Many existing approaches require manually designed match rules, which demand domain knowledge and are time-consuming to produce. We propose a novel genetic-algorithm-based entity resolution approach via active learning. It learns effective match rules by logically combining comparisons of several different attributes with proper thresholds. We use active learning to reduce the amount of manually labeled data and to speed up the learning process. An extensive evaluation shows that the proposed approach outperforms state-of-the-art entity resolution approaches in accuracy.
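A match rule of the kind described, a conjunction of per-attribute similarity thresholds, can be evolved with a bare-bones genetic loop. This is a sketch under stated assumptions, not the paper's algorithm: the mutation-only evolution, population size, and fitness-as-accuracy choices are all simplifications, and the active-learning component is omitted.

```python
import random

def rule_matches(rule, pair, sims):
    # A rule maps attribute -> threshold; attributes combine conjunctively,
    # so every attribute's similarity must clear its threshold.
    return all(sims[attr](pair) >= t for attr, t in rule.items())

def fitness(rule, labeled_pairs, sims):
    # Accuracy of the rule on labeled (pair, is_match) examples.
    return sum(rule_matches(rule, p, sims) == y
               for p, y in labeled_pairs) / len(labeled_pairs)

def evolve(labeled_pairs, sims, attrs, generations=30, pop=20, seed=0):
    rng = random.Random(seed)
    population = [{a: rng.random() for a in attrs} for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, labeled_pairs, sims),
                        reverse=True)
        survivors = population[:pop // 2]
        children = []
        for parent in survivors:
            child = dict(parent)
            a = rng.choice(attrs)  # mutate one threshold slightly
            child[a] = min(1.0, max(0.0, child[a] + rng.gauss(0, 0.1)))
            children.append(child)
        population = survivors + children
    return max(population, key=lambda r: fitness(r, labeled_pairs, sims))
```

Active learning would additionally pick the pairs the current population disagrees on most and ask a human to label only those.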
Entity resolution (ER) aims to identify whether two entities in an ER task refer to the same real-world thing. Crowd ER uses humans, in addition to machine algorithms, to obtain the truths of ER tasks. However, inaccurate or erroneous results are likely to be generated when humans give unreliable judgments. Previous studies have found that correctly estimating human accuracy or expertise in crowd ER is crucial to truth inference. However, many of them assume that humans have consistent expertise over all tasks, ignoring the fact that humans may have varied expertise on different topics (e.g., music versus sport). In this paper, we deal with crowd ER in the Semantic Web area. We identify multiple topics of ER tasks and model human expertise on each topic. Furthermore, we leverage similar-task clustering to enhance the topic modeling and expertise estimation. We propose a probabilistic graphical model that computes ER task similarity, estimates human expertise, and infers the task truths in a unified framework. Evaluation results on real-world and synthetic datasets show that, compared with several state-of-the-art approaches, our model achieves higher accuracy on task truth inference and is more consistent with humans' real expertise.
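The interplay between per-topic expertise and truth inference can be shown with a simplified alternating scheme. This is not the paper's probabilistic graphical model: the expertise-weighted vote, the 0.7 prior for unseen worker-topic pairs, and the fixed iteration count are assumptions of this sketch.

```python
def infer_truths(votes, topics, iterations=10):
    """votes: {task: {worker: bool}}; topics: {task: topic}.
    Alternate between (a) estimating each task's truth by a vote weighted
    with per-topic worker expertise and (b) re-estimating expertise as the
    worker's per-topic agreement rate with the current truths."""
    expertise = {}  # (worker, topic) -> accuracy estimate
    truths = {}
    for _ in range(iterations):
        for task, answers in votes.items():
            topic = topics[task]
            score = sum((1 if v else -1) * expertise.get((w, topic), 0.7)
                        for w, v in answers.items())
            truths[task] = score >= 0
        counts = {}
        for task, answers in votes.items():
            topic = topics[task]
            for w, v in answers.items():
                c = counts.setdefault((w, topic), [0, 0])
                c[0] += (v == truths[task])
                c[1] += 1
        expertise = {k: c[0] / c[1] for k, c in counts.items()}
    return truths
```

A worker who is reliable on one topic but not another ends up with different weights per topic, which is the key distinction from uniform-expertise models.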
Entity resolution (ER) is a significant task in data integration that aims to detect all entity profiles corresponding to the same real-world entity. Because of ER's inherently quadratic complexity, blocking was proposed to ameliorate it: blocking offers an approximate solution that clusters similar entity profiles into blocks, so that pairwise comparisons need only be performed inside each block, reducing the computational cost of ER. This paper presents a comprehensive survey of existing blocking technologies. We summarize and analyze all classic blocking methods, with emphasis on the different block construction and optimization techniques. We find that traditional blocking ER methods, which depend on a fixed schema, may not work in highly heterogeneous information spaces; using schema information flexibly is of great significance for efficiently processing data with the new features of this era. Machine learning is an important tool for ER, but end-to-end and efficient machine learning methods still need to be explored. We also summarize the most promising directions for future work, including real-time blocking ER, incremental blocking ER, and deep learning for ER.
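Classic blocking as surveyed above can be demonstrated in a few lines. This is a generic token-blocking sketch, not any specific method from the survey; the choice of name tokens as blocking keys is an illustrative assumption.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(profiles, key):
    """Group profile ids by a cheap blocking key (here, any token the
    key function emits), so comparisons happen only inside a block."""
    blocks = defaultdict(list)
    for p in profiles:
        for token in key(p):
            blocks[token].append(p["id"])
    return blocks

def candidate_pairs(blocks):
    # Deduplicated pairs drawn from within each block: this set is what
    # replaces the quadratic all-pairs comparison.
    pairs = set()
    for ids in blocks.values():
        pairs.update(tuple(sorted(p)) for p in combinations(ids, 2))
    return pairs
```

On four profiles, all-pairs comparison would need six comparisons; blocking on name tokens cuts that down while keeping the likely matches together.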
Funding: This work was supported in part by the National Natural Science Foundation of China under Grants U1836106 and 81961138010; in part by the Beijing Natural Science Foundation under Grants M21032 and 19L2029; in part by the Beijing Intelligent Logistics System Collaborative Innovation Center under Grant BILSCIC-2019KF-08; in part by the Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB, under Grants BK20BF010 and BK19BF006; and in part by the Fundamental Research Funds for the University of Science and Technology Beijing under Grant FRF-BD-19-012A.
Funding: National Natural Science Foundation of China (No. 61402100) and the Fundamental Research Funds for the Central Universities of China (No. 17D111205).
Funding: The Special Research Fund of the China Postdoctoral Science Foundation (No. 2015M582832), the Major National Science and Technology Program (No. 2015ZX01040201), and the National Natural Science Foundation of China (No. 61371196).
Funding: This work was partially supported by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH10F01, the National Natural Science Foundation of China under Grant Nos. U1509216, 61472099, and 61133002, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province of China under Grant No. LC2016026, and the Ministry of Education (MOE)-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.
Funding: The authors thank the anonymous reviewers for their inspiring doubts and helpful suggestions during the reviewing process. This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the Fundamental Research Funds for the Central Universities (N120816001), and the National Natural Science Foundation of China (Grant Nos. 61472070 and 61402213).
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61872172 and 61772264.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61772268 and the Fundamental Research Funds for the Central Universities of China under Grant Nos. NS2018057 and NJ2018014.