Word segmentation from noisy data with minimal supervision
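Basic information
- Grant number: EP/H050442/1
- Principal investigator:
- Amount: $359,000
- Host institution:
- Host institution country: United Kingdom
- Project type: Research Grant
- Fiscal year: 2011
- Funding country: United Kingdom
- Period: 2011 to (no data)
- Project status: Completed
- Source:
- Keywords:
Project abstract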
In recent years, the field of natural language processing (NLP) has made great advances in a wide range of areas, such as machine translation, document summarization, and topic identification. However, much of this success is due to systems that are built using large quantities of human-annotated data in a supervised machine learning approach. This means that languages with fewer annotated resources (low-density languages) are left without much useful language technology. An important direction in NLP research is therefore to improve our ability to develop successful systems using as little annotated data as possible. Research on completely unsupervised systems is particularly interesting not only for its potential to broaden the reach of NLP technology, but also because it may shed light on the ways in which human infants manage to learn language with little or no explicit instruction.
We propose to focus on the particular problem of word segmentation, and to develop a new type of probabilistic model, the infinite noisy channel model, for solving this problem in settings where little or no annotated data is available. Word segmentation refers to the problem of identifying word boundaries in either text or speech. It arises in NLP systems for many Asian languages, where words are not separated by whitespace, and also for infants learning language, because most spoken words are not separated by pauses. Previous work on unsupervised word segmentation has assumed that every time a particular word occurs, it is realized in exactly the same way. However, this is not the case for infants learning language (since words are subject to phonetic variability and noise in pronunciation), nor is it always true in NLP (if the input text contains errors, such as those produced by an optical character recognition system).
Our new model will address this shortcoming by simultaneously performing word segmentation and correction of noise and variability, to recover a sequence of de-noised words from the unsegmented noisy input. We plan to develop two different versions of our model. One of these will be designed to correct for phonetic variability, and will be evaluated as a cognitive model of human language acquisition. With this model, we hope to gain insight into the computational mechanisms that allow infants to successfully extract words from noisy input, and in particular to show that the Bayesian inference techniques used in our model are a plausible explanation of infants' learning behavior. The second version of our model will be designed to correct for errors resulting from optical character recognition, and will be evaluated as a word segmentation and error-correcting NLP application in several different languages. We hope to show that the model reduces the number of character errors in the document while also producing successful segmentations. We expect these improvements to be particularly pronounced in low-density language situations.
Project outputs
Journal articles (8)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Fully unsupervised small-vocabulary speech recognition using a segmental Bayesian model
- DOI: 10.21437/interspeech.2015-239
- Published: 2015
- Journal:
- Impact factor: 0
- Authors: H. Kamper; A. Jansen; S. Goldwater
- Corresponding authors: H. Kamper; A. Jansen; S. Goldwater
Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings
- DOI: 10.1109/taslp.2016.2517567
- Published: 2016-03
- Journal:
- Impact factor: 0
- Authors: H. Kamper; A. Jansen; S. Goldwater
- Corresponding authors: H. Kamper; A. Jansen; S. Goldwater
Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
- DOI:
- Published:
- Journal:
- Impact factor: 0
- Authors: Micha Elsner (Author)
- Corresponding authors: Micha Elsner (Author)
A segmental framework for fully-unsupervised large-vocabulary speech recognition
- DOI: 10.1016/j.csl.2017.04.008
- Published: 2016-06
- Journal:
- Impact factor: 0
- Authors: H. Kamper; A. Jansen; S. Goldwater
- Corresponding authors: H. Kamper; A. Jansen; S. Goldwater
Weak semantic context helps phonetic learning in a model of infant language acquisition
- DOI: 10.3115/v1/p14-1101
- Published: 2014-06
- Journal:
- Impact factor: 0
- Authors: Stella Frank; Naomi H Feldman; S. Goldwater
- Corresponding authors: Stella Frank; Naomi H Feldman; S. Goldwater
Other publications by Sharon Goldwater
Inflecting when there’s no majority: Limitations of encoder-decoder neural networks as cognitive models for German plurals
- DOI:
- Published:
- Journal:
- Impact factor: 0
- Authors: Kate McCurdy; Sharon Goldwater; Adam Lopez
- Corresponding author: Adam Lopez
Online Learning Mechanisms for Bayesian Models of Word Segmentation
- DOI: 10.1007/978-3-319-19650-3_2414
- Published: 2011
- Journal:
- Impact factor: 0
- Authors: Sharon Goldwater; M. Steyvers
- Corresponding author: M. Steyvers
Correction: Cross-linguistically consistent semantic and syntactic annotation of child-directed speech
- DOI: 10.1007/s10579-024-09775-3
- Published: 2024-09-20
- Journal:
- Impact factor: 1.800
- Authors: Ida Szubert; Omri Abend; Nathan Schneider; Samuel Gibbon; Louis Mahon; Sharon Goldwater; Mark Steedman
- Corresponding author: Mark Steedman
Other grants held by Sharon Goldwater
Modeling the Development of Phonetic Representations
- Grant number: ES/R006660/1
- Fiscal year: 2018
- Funding amount: $359,000
- Project type: Research Grant
Similar overseas grants
Digging Deeper with AI: Canada-UK-US Partnership for Next-generation Plant Root Anatomy Segmentation
- Grant number: BB/Y513908/1
- Fiscal year: 2024
- Funding amount: $359,000
- Project type: Research Grant
Ultra-precision clinical imaging and detection of Alzheimers Disease using deep learning
- Grant number: 10643456
- Fiscal year: 2023
- Funding amount: $359,000
- Project type:
Dynamical maintenance of left-right symmetry during vertebrate development
- Grant number: 10797382
- Fiscal year: 2023
- Funding amount: $359,000
- Project type:
A Connectomic Analysis of a Developing Brain Undergoing Neurogenesis
- Grant number: 10719296
- Fiscal year: 2023
- Funding amount: $359,000
- Project type:
Early-Stage Clinical Trial of AI-Driven CBCT-Guided Adaptive Radiotherapy for Lung Cancer
- Grant number: 10575081
- Fiscal year: 2023
- Funding amount: $359,000
- Project type:
A mechanistic understanding of glymphatic transport and its implications in neurodegenerative disease
- Grant number: 10742654
- Fiscal year: 2023
- Funding amount: $359,000
- Project type:
Understanding metabolic and vascular vulnerabilities of residual disease in triple negative breast cancer to inform on treatment strategies
- Grant number: 10744480
- Fiscal year: 2023
- Funding amount: $359,000
- Project type:
Multi-Resolution Curriculum Learning Guided Convolutional Neural Networks for Automatic Segmentation of iPS Cell Colonies
- Grant number: 23K11170
- Fiscal year: 2023
- Funding amount: $359,000
- Project type: Grant-in-Aid for Scientific Research (C)
Individual differences in temporal predictions in speech: word segmentation and conversational turn-taking
- Grant number: 2863885
- Fiscal year: 2023
- Funding amount: $359,000
- Project type: Studentship
SBIR Phase II: High-Resolution Image Segmentation for Natural Resource Management
- Grant number: 2233680
- Fiscal year: 2023
- Funding amount: $359,000
- Project type: Cooperative Agreement