RAPID: Harvesting Speech Datasets for Linguistic Research on the Web (Digging into Data Challenge)
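Basic information
- Award number: 1035151
- Principal investigator:
- Amount: $100,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2010
- Funding country: United States
- Duration: 2010-08-01 to 2013-07-31
- Project status: Completed
- Source:
- Keywords:
Project abstract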
Distinctions of prosody (rhythm, stress, and intonation) are ubiquitous in spoken language. It often seems obvious to native speakers of English what prosody is most appropriate in a given sentence and context, and researchers in linguistics and related fields have proposed numerous formalized hypotheses about it. But establishing the validity of these hypotheses has proved remarkably elusive. Much of the problem is that it is difficult to observe enough examples of a given phenomenon to evaluate hypotheses. The project aims to address this dearth of data by collecting, or "harvesting", examples of specific word sequences or word patterns from web sources. It is often possible to find hundreds or thousands of examples of people using the very same word pattern. If these examples are collected into a dataset and made available to the research community, it will be possible to evaluate theories about the form and meaning of prosody on an unprecedented scale. Scaling up the available data can be expected to have a transformative effect on our understanding of prosody.
Audio and audio-video recordings of spoken language, including podcasts, radio and television broadcasts, lectures, and much else, are pervasive on the web. This does not help in itself, because it is not possible to listen to tens of thousands of hours of speech in order to find a few hundred examples of a certain type. Fortunately, more sites are becoming available that provide text transcriptions obtained with automatic speech recognition (for instance Fox Business News, WNYC, Elections Video Search at Google, and university lectures at MIT). Industry blogs and newsletters indicate that more large sites will come online soon. By searching for a word pattern in the text transcription and subsequently retrieving the corresponding audio or video file, it becomes possible to find relevant data.
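The search step described above can be sketched as follows. This is a minimal illustration, not the project's harvest engine: it assumes a transcript is already available as a list of words with start and end times, and the names `Word`, `Hit`, and `find_pattern`, as well as the `*` wildcard convention, are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Word(NamedTuple):
    """One transcribed word with its time span in seconds (assumed format)."""
    text: str
    start: float
    end: float


class Hit(NamedTuple):
    """Time span of one match, usable for cutting a clip from the media file."""
    start: float
    end: float
    words: Tuple[str, ...]


def find_pattern(transcript: Sequence[Word], pattern: Sequence[str]) -> List[Hit]:
    """Return every place where `pattern` matches consecutive words.

    Matching is case-insensitive; the token "*" matches any single word.
    """
    tokens = [p.lower() for p in pattern]
    n = len(tokens)
    if n == 0:
        return []
    hits = []
    for i in range(len(transcript) - n + 1):
        window = transcript[i:i + n]
        if all(p == "*" or p == w.text.lower() for p, w in zip(tokens, window)):
            hits.append(Hit(window[0].start, window[-1].end,
                            tuple(w.text for w in window)))
    return hits


# Invented example transcript, for illustration only.
transcript = [
    Word("better", 0.0, 0.4), Word("than", 0.4, 0.6),
    Word("I", 0.6, 0.7), Word("did", 0.7, 1.0),
    Word("than", 2.0, 2.2), Word("she", 2.2, 2.4), Word("did", 2.4, 2.7),
]
hits = find_pattern(transcript, ["than", "*", "did"])
for hit in hits:
    print(hit.start, hit.end, " ".join(hit.words))
```

Each hit carries the time span of the match, so a later step could fetch only that stretch of the audio or video file rather than the whole recording.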
To construct datasets for prosody research from these web sources, the project team will implement software harvest engines that interact with the web through standard protocols. Datasets for eight to twelve specific phenomena will be collected. In order to demonstrate the impact of a data-intensive methodology, the samples will be analyzed using techniques of statistics and formal linguistics. For instance, an approach known as machine learning classification will be used to identify the specific features of the sound signal (such as pitch, vowel duration, and intensity) that are responsible for the perception of prosody.
Prosody and intonation play an important role in making discourse coherent, in signaling which part of the communicated information is foregrounded and which is backgrounded, and in disambiguating speaker intention. Any advance in understanding prosody will not only deepen our understanding of the human language capability but also have implications in a wide range of areas, including language instruction, translation studies, speech therapy, improving the comprehensibility of synthesized speech, and improving speech recognition systems.
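As a rough illustration of the classification idea mentioned in the abstract, the sketch below labels a token from three acoustic features (pitch, vowel duration, intensity). The abstract does not say which classifier the project uses; a nearest-centroid classifier on standardized features is swapped in here only because it is short. The feature values, the labels "accented" and "unaccented", and all function names are invented for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def fit_scaler(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature mean and standard deviation (1.0 if a feature is constant)."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def scale(row: Sequence[float], means: Sequence[float], stds: Sequence[float]) -> List[float]:
    """Standardize one feature vector so no feature dominates by its units."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def fit_centroids(rows: Sequence[Sequence[float]], labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the (already scaled) feature vectors of each class."""
    grouped: Dict[str, List[Sequence[float]]] = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: [mean(c) for c in zip(*rs)] for label, rs in grouped.items()}


def predict(row: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(row, centroids[label]))


# Synthetic toy data: (pitch in Hz, vowel duration in ms, intensity in dB).
train = [(220, 140, 72), (235, 150, 74), (210, 135, 71),
         (180, 90, 65), (170, 85, 63), (185, 95, 66)]
labels = ["accented"] * 3 + ["unaccented"] * 3

means, stds = fit_scaler(train)
centroids = fit_centroids([scale(r, means, stds) for r in train], labels)

print(predict(scale((225, 145, 73), means, stds), centroids))
print(predict(scale((175, 88, 64), means, stds), centroids))
```

Standardizing first matters because pitch in Hz and intensity in dB are on different scales; without it, the feature with the largest numeric range would dominate the distance.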
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Mats Rooth
GIVENNESS, AVOIDF AND OTHER CONSTRAINTS ON THE PLACEMENT OF ACCENT*
- DOI:
- Publication year: 1999
- Journal:
- Impact factor: 0
- Authors: M. Bittner; Daniel Büring; K. Fintel; J. Grimshaw; I. Kohlhof; B. Ladusaw; A. Prince; R. Raffelsiefen; Mats Rooth; Lisa Selkirk; U. Sauerland
- Corresponding author: U. Sauerland
Association with focus
- DOI:
- Publication year: 1985
- Journal:
- Impact factor: 0
- Authors: Mats Rooth
- Corresponding author: Mats Rooth
Harvesting speech datasets for linguistic research on the web
- DOI:
- Publication year: 2013
- Journal:
- Impact factor: 0
- Authors: Mats Rooth; Jonathan Howell; M. Wagner
- Corresponding author: M. Wagner
Induction of fine-grained lexical parameters of treebank PCFGs with inside-outside estimation and lexical transformations
- DOI:
- Publication year: 2009
- Journal:
- Impact factor: 0
- Authors: Mats Rooth; Tejaswini Deoskar
- Corresponding author: Tejaswini Deoskar
Similar overseas grants
High-performance thin film porous pyroelectric materials and composites for thermal sensing and harvesting
- Award number: EP/Y017412/1
- Fiscal year: 2024
- Amount: $100,000
- Project type: Fellowship
NSF Convergence Accelerator Track M: Water-responsive Materials for Evaporation Energy Harvesting
- Award number: 2344305
- Fiscal year: 2024
- Amount: $100,000
- Project type: Standard Grant
RE-WITCH Renewable and Waste heat valorisation in Industries via Technologies for Cooling production and energy Harvesting
- Award number: 10092071
- Fiscal year: 2024
- Amount: $100,000
- Project type: EU-Funded
Exploring Microbial Light-Harvesting with Rhodopsin in Extreme Polar Environments: Unveiling Distribution, Diversity, and Functional Insights
- Award number: 24K03072
- Fiscal year: 2024
- Amount: $100,000
- Project type: Grant-in-Aid for Scientific Research (B)
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
- Award number: 2403399
- Fiscal year: 2024
- Amount: $100,000
- Project type: Standard Grant
CAREER: Dynamics and harvesting of stochastic populations
- Award number: 2339000
- Fiscal year: 2024
- Amount: $100,000
- Project type: Continuing Grant
MetacMed: Acoustic and mechanical metamaterials for biomedical and energy harvesting applications
- Award number: EP/Y034635/1
- Fiscal year: 2024
- Amount: $100,000
- Project type: Research Grant
Researching the feasibility of enhancing the productivity of daffodil harvesting through the use of a daffodil collection robotic platform (Daffy)
- Award number: 10107691
- Fiscal year: 2024
- Amount: $100,000
- Project type: Launchpad
Asymmetric Biomembranes for Blue Energy Harvesting
- Award number: DP240101192
- Fiscal year: 2024
- Amount: $100,000
- Project type: Discovery Projects
Green Optical Wireless Communications Facilitated by Photonic Power Harvesting "GreenCom"
- Award number: EP/X027511/2
- Fiscal year: 2024
- Amount: $100,000
- Project type: Research Grant