
The Probabilistic Representation of Linguistic Knowledge

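Basic information

Grant number:
ES/J022969/1
Principal investigator:
Shalom Lappin
Amount:
$584,600
Host institution:
King's College London
Host institution country:
United Kingdom
Project category:
Fellowship
Fiscal year:
2012
Funding country:
United Kingdom
Start and end dates:
2012 to (no data)
Project status:
Completed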

Source:
https://gtr.ukri.org/projects?ref=ES%2FJ022969%2F1
Keywords:
Probabilistic Representation Linguistic Knowledge

Project abstract

In the past twenty-five years, work in natural language technology has made impressive progress across a wide range of tasks, which include, among others, information retrieval and extraction, text interpretation and summarization, speech recognition, morphological analysis, syntactic parsing, word sense identification, and machine translation. Much of this progress has been due to the successful application of powerful techniques for probabilistic modeling and statistical analysis to large corpora of linguistic data. These methods have given rise to a set of engineering tools that are rapidly shaping the digital environment in which we access and process most of the information that we use. In recent work (Lappin and Shieber (2007), Clark and Lappin (2011a), Clark and Lappin (2011b)) my co-authors and I have argued that the machine learning methods that are driving the expansion of natural language technology are also directly relevant to understanding central features of human language acquisition. When these methods are used to construct carefully specified formal models and implementations of the grammar induction task, they yield striking insights into the limits and possibilities of human learning on the basis of the primary linguistic data to which children are exposed. These models indicate that language learning can be achieved without the sorts of strong innate learning biases that have been posited by traditional theories of universal grammar. Weak biases, some derivable from non-linguistic cognitive domains, and domain-general learning procedures are sufficient to support efficient data-driven learning of plausible systems of grammatical representation.

In the current research I am focussing on the problem of how to specify the class of representations that encode human knowledge of the syntax of natural languages. I am pursuing the hypothesis that a representation in this class is best expressed as an enriched statistical language model that assigns probability values to the sentences of a language. A central part of the enrichment of the model consists of a procedure for determining the acceptability (grammaticality) of a sentence as a graded value, relative to the properties of that sentence and the language of which it is a part. This procedure avoids the simple reduction of the grammaticality of a string to its estimated probability of occurrence, while still characterizing grammaticality in probabilistic terms. An enriched model of this kind will provide a straightforward explanation for the fact that individual native speakers generally judge the well-formedness of sentences along a continuum, rather than through the imposition of a sharp boundary between acceptable and unacceptable sentences. The pervasiveness of gradedness in the linguistic knowledge of individual speakers poses a serious problem for classical theories of syntax, which partition strings of words into the grammatical sentences of a language and ill-formed strings of words.

This research holds out the prospect of significant impact in two areas. First, it can shed light on the relationship between the representation and acquisition of linguistic knowledge on the one hand, and learning and the encoding of knowledge in other cognitive domains on the other. This work can, in turn, help to clarify the respective roles of biologically conditioned learning biases and data-driven learning in human cognition. Second, this work can contribute to the development of more effective language technology by providing insight, from a computational perspective, into the way in which humans represent the syntactic properties of sentences in their language. To the extent that natural language processing systems take account of this class of representations, they will provide more efficient tools for parsing and interpreting text and speech.
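
The abstract does not specify the procedure that turns sentence probabilities into graded acceptability values. As a rough illustration only, the sketch below shows one possible normalization, assuming a toy add-one-smoothed bigram model and a score that subtracts the unigram log probability and divides by sentence length (a measure of this shape is sometimes called the syntactic log-odds ratio). The corpus, the function names, and the choice of normalization are all assumptions made for this example and are not taken from the project.

```python
import math
from collections import Counter

# Toy corpus: purely illustrative, not data from the project.
CORPUS = [
    "the dog chased the cat",
    "the cat chased the dog",
    "the dog saw the cat",
    "a cat saw a dog",
]


def train(corpus):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(sentence, unigrams, bigrams):
    """Raw log probability of the sentence under an add-one bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(unigrams) + 1  # +1 reserves one type for unseen words
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
        for prev, word in zip(tokens, tokens[1:])
    )


def unigram_logprob(sentence, unigrams):
    """Log probability of the words ignoring order (add-one smoothed)."""
    vocab = len(unigrams) + 1
    total = sum(unigrams.values())
    return sum(
        math.log((unigrams[word] + 1) / (total + vocab))
        for word in sentence.split()
    )


def acceptability(sentence, unigrams, bigrams):
    """Hypothetical graded score: discount word frequency and length.

    This is an assumed normalization, not the procedure from the abstract.
    """
    length = len(sentence.split())
    model = bigram_logprob(sentence, unigrams, bigrams)
    baseline = unigram_logprob(sentence, unigrams)
    return (model - baseline) / length


UNIGRAMS, BIGRAMS = train(CORPUS)

if __name__ == "__main__":
    for s in ["the dog chased the cat", "cat the the chased dog", "cat the"]:
        print(s, bigram_logprob(s, UNIGRAMS, BIGRAMS),
              acceptability(s, UNIGRAMS, BIGRAMS))
```

On this toy corpus the raw log probability ranks the two-word string "cat the" above the five-word sentence "the dog chased the cat", simply because shorter strings multiply fewer probabilities. The normalized score reverses that ranking, which is the sense in which a graded measure can remain probabilistic without equating acceptability with probability of occurrence.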
