Abstract: JSTOR is a full-text database of back issues of academic journals, especially those in the humanities, social sciences, and natural sciences. This article describes the original purpose of JSTOR, analyzes the features of its collections, its user interface, and its value to users, and discusses how it is introduced to users.
Funding: This research is funded by the Open Foundation for the University Innovation Platform in the Hunan Province, grant number 16K013; the Hunan Provincial Natural Science Foundation of China, grant number 2017JJ2016; the 2016 Science Research Project of Hunan Provincial Department of Education, grant number 16C0269; "Accurate crawler design and implementation with a data cleaning function", National Students innovation and entrepreneurship training program, grant number 201811532010; and Open project, grant numbers 20181901CRP03, 20181901CRP04, and 20181901CRP05. This research work is implemented at the 2011 Collaborative Innovation Center for Development and Utilization of Finance and Economics Big Data Property, Universities of Hunan Province.
Abstract: Web crawlers are an important part of modern search engines. Data volumes have exploded, and we have entered a "big data era". Wikipedia, for example, carries knowledge from all over the world, records real-time news every day, and gives users a rich database, but the sheer amount of data makes searching it difficult. Single-threaded crawling can no longer meet the requirements of text crawling. To improve on the performance and generality of single-threaded crawlers, a high-speed multi-threaded web crawler is designed to crawl a very large-scale online text database. Multi-threaded crawling uses multiple threads to process web pages in parallel and combines breadth-first and depth-first algorithms to control the crawl. The practice project uses Python to implement a multi-threaded method for crawling Wikipedia books, a very large-scale text database. The project was inspired by an article about Wikipedia published by the Big Data Digest public account.
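The parallel page processing described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a hypothetical in-memory link graph (`LINKS`) and a stub `fetch_links` in place of real network requests, and it shows only a breadth-first crawl with a depth cap, fetched level by level through a thread pool. How the paper combines breadth-first and depth-first control is not stated in the abstract and is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link graph standing in for real pages (assumption: a real
# crawler would download and parse HTML instead of reading this dict).
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def fetch_links(page):
    """Return the outgoing links of a page (stub for a network fetch)."""
    return LINKS.get(page, [])


def crawl(start, max_depth=2, workers=4):
    """Breadth-first crawl; the pages of each level are fetched in parallel."""
    visited = {start}
    order = [start]
    frontier = [start]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth):
            if not frontier:
                break
            # pool.map keeps the input order, so the visit order is stable.
            results = list(pool.map(fetch_links, frontier))
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        order.append(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return order


if __name__ == "__main__":
    print(crawl("A"))
```

Only the fetches run concurrently; the visited set is updated in the main thread after each level, so the sketch needs no locks.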
Abstract: Well-developed continuous speech recognition and synthesis systems demand a high-quality continuous speech database that is compact and valid, and whose design would benefit from incorporating linguistic and phonetic knowledge. It is argued that at the present stage the database should be limited to read speech. To describe the very complex variability in continuous speech, the following speech units are proposed: (1) 401 syllables without tone; (2) 415 inter-syllabic diphones; (3) 3035 inter-syllabic triphones; (4) 781 inter-syllabic final-initial structures. The 17 basic sentence patterns in standard Chinese are summarized to cover the most important prosodic phenomena. Using an automatic method, 2393 sentences and 388 phrases are selected according to the above phonetic rules from a large corpus, which includes recent issues of People's Daily, TV play scripts, and dictionary entries, as the reading text of a continuous speech recognition database in standard Chinese. This set of sentences and phrases covers 99.8% of the toneless syllables, 100% of the inter-syllabic diphones, 99.6% of the inter-syllabic triphones, and 100% of the sentence patterns.
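The abstract does not say which automatic method selects the sentences. One common way to pick a compact reading text that covers a set of phonetic units is greedy set cover, sketched below under stated assumptions: the corpus `CORPUS` is a hypothetical toy example of space-separated toneless syllables, and `units` treats adjacent syllable pairs as a rough stand-in for inter-syllabic diphones.

```python
# Hypothetical toy corpus of toneless syllable strings (illustration only).
CORPUS = ["ni hao ma", "ni hao", "hao ma ni", "ma ni hao"]


def units(sentence):
    """Adjacent syllable pairs, a rough proxy for inter-syllabic diphones."""
    syllables = sentence.split()
    return set(zip(syllables, syllables[1:]))


def select_sentences(corpus):
    """Greedy set cover: keep picking the sentence that adds most new units."""
    remaining = set().union(*(units(s) for s in corpus))
    chosen = []
    while remaining:
        # max returns the first sentence among ties, so the result is stable.
        best = max(corpus, key=lambda s: len(units(s) & remaining))
        chosen.append(best)
        remaining -= units(best)
    return chosen


if __name__ == "__main__":
    print(select_sentences(CORPUS))
```

Greedy selection does not guarantee the smallest possible set, but it is simple and usually yields a compact one, which matches the stated goal of a database that is both compact and valid.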