Abstract
This paper proposes a method for transcribing speech corpora based on human computation. A Web-based language learning system collects orthographic and phonetic transcriptions entered by a large number of learners (users), and the most frequently occurring input is selected as the correct transcription of each item. To ensure the quality of the transcriptions obtained in this way, computer-aided mechanisms are used to verify the reliability of the collected input. The main advantage of the method is that it combines corpus transcription with language learning, so no dedicated labor has to be spent on tedious transcription work, which reduces the cost of transcribing corpora. The paper discusses the human-computation-based transcription technique and describes the design of the language learning system that collects user input and of the transcription generation system. Application of the system shows that the method generates orthographic and phonetic transcriptions of speech corpora effectively and at low cost.
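The selection step described in the abstract (keep the most frequent user input, subject to a reliability check) can be illustrated with a minimal sketch. The function name `select_transcription`, the whitespace and case normalization, and the thresholds `min_votes` and `min_share` are assumptions made for this illustration; the abstract does not specify the paper's actual verification mechanisms.

```python
from collections import Counter


def select_transcription(submissions, min_votes=3, min_share=0.5):
    """Return the most frequent submission, or None if support is too weak.

    min_votes and min_share are hypothetical reliability thresholds,
    not values taken from the paper.
    """
    # Normalize case and whitespace so trivially different inputs agree.
    normalized = [" ".join(s.lower().split()) for s in submissions if s.strip()]
    if len(normalized) < min_votes:
        return None  # too few users have transcribed this item
    label, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) < min_share:
        return None  # no input is dominant enough to trust
    return label
```

Items for which the function returns `None` would stay in the pool shown to learners until enough agreeing input has been collected.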
Source
CAAI Transactions on Intelligent Systems (《智能系统学报》)
2009, No. 3, pp. 270-277 (8 pages)
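Funding
State Scholarship Fund of China (2006104705)
Natural Science Foundation of Fujian Province (2006J0043)
Xiamen University "985 Project" Phase II Information Innovation Platform (0000-X07204)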
Keywords
speech corpora transcription
human-computation
distributed knowledge acquisition
Web-based language learning