Assessing latent-structure models in natural language processing with artificial datasets

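Basic information

Grant number:
RGPIN-2021-03134
Principal investigator:
Venant, Antoine
Amount:
$11,000
Host institution:
Université de Montréal
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Start and end dates:
2022-01-01 to 2023-12-31
Project status:
Completed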

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=743921
Keywords:
Assessing latent structure models natural

Project abstract

Deep learning is now ubiquitous in natural language processing (NLP). It has raised performance to new levels in a variety of tasks, including (but not limited to) language modeling, syntactic and semantic parsing, named entity recognition, sentiment analysis and natural language understanding. At the heart of deep learning lies the idea of automatically learning and extracting features from the data that are useful to the task at hand, thereby sparing developers the need to handcraft all linguistic knowledge into computational models. Doing so counters some difficulties encountered by rule-based approaches. It notably offers much easier handling of out-of-vocabulary items (leveraging pre-trained word embeddings), and a more expressive set of models with fewer commitments to (overly) strong assumptions (such as context-freeness hypotheses in approaches based on formal language theory). These strong advantages, however, come with some downsides. First, very large amounts of data are generally required to infer good models, and in NLP (especially in semantics and syntax) such data is often produced by skilled annotators in a time-consuming and expensive process. Second, inference on large amounts of data is computationally heavy and has a strong environmental impact. Finally, deep learning models are much more difficult for humans to interpret. Injecting linguistic knowledge, especially *structural* knowledge, into deep learning models might overcome these limitations. To do this, one must determine how to make deep learning models aware of linguistic structure, and find the right kind of structural bias. A (popular) direction of research models aspects of linguistic structure, such as syntax trees, with latent variables. Learning these models from data requires specific inference algorithms, often in combination with variance-reducing tricks.
If and when the inferred models do not improve performance, or the learned structures differ a lot from linguists' expectations (which tends to be the case), there is always a question of whether this indicates a failure to learn the 'correct' structures or, on the contrary, whether the putative structures do not explain the data as well as what the model is doing. To test the strict ability of these inference techniques to capture the kind of structures theorized by linguists, we propose to evaluate these algorithms on *artificial* datasets. Controlling the structures that actually govern the data generation, and their distribution, should enable testing the ability of these algorithms to learn a given type of structure, independently of concerns pertaining to the salience of these structures for modeling real linguistic data and of other dataset-specific issues. We will put a particular emphasis on inferring latent syntactic trees for (compositional) semantic parsing, and design our artificial datasets to exhibit specific structural aspects inspired by the work of semanticists.
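
As a rough illustration of the evaluation idea in the abstract, the sketch below generates a toy artificial dataset in which the structure governing each example is known, then scores a candidate bracketing against it. Everything here is an assumption made for illustration and is not taken from the project: the arithmetic-expression grammar, the function names (`sample_tree`, `gold_spans`, `span_f1`, and so on), the right-branching baseline standing in for a learned model, and the choice of unlabeled span F1 as the metric.

```python
import random

# Hypothetical toy grammar (an assumption for illustration):
#   E -> E op E | digit, with op in {+, *}
# The surface string carries no parentheses, so the tree is latent,
# while the meaning (the numeric value) depends on that hidden tree.

def sample_tree(rng, depth=0, max_depth=3):
    """Sample a tree: a digit string (leaf) or a (left, op, right) tuple."""
    if depth >= max_depth or rng.random() < 0.4:
        return str(rng.randint(0, 9))
    op = rng.choice(["+", "*"])
    left = sample_tree(rng, depth + 1, max_depth)
    right = sample_tree(rng, depth + 1, max_depth)
    return (left, op, right)

def tokens(tree):
    """Flatten a tree into its unparenthesized token sequence."""
    if isinstance(tree, str):
        return [tree]
    left, op, right = tree
    return tokens(left) + [op] + tokens(right)

def evaluate(tree):
    """Compositional meaning: the value computed along the gold tree."""
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b

def gold_spans(tree, start=0):
    """Return (spans, end): token spans (start, end) of all internal nodes."""
    if isinstance(tree, str):
        return set(), start + 1
    left, _, right = tree
    spans_left, mid = gold_spans(left, start)
    spans_right, end = gold_spans(right, mid + 1)  # mid is the operator token
    return spans_left | spans_right | {(start, end)}, end

def right_branching_spans(n_tokens):
    """Baseline 'prediction': a purely right-branching analysis."""
    return {(i, n_tokens) for i in range(0, n_tokens - 1, 2)}

def span_f1(pred, gold):
    """Unlabeled bracketing F1 between predicted and gold span sets."""
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def make_dataset(n_examples, seed=0):
    """Build examples that pair the surface form and meaning with the gold tree."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_examples):
        tree = sample_tree(rng)
        spans, _ = gold_spans(tree)
        data.append({"tokens": tokens(tree), "value": evaluate(tree), "spans": spans})
    return data

if __name__ == "__main__":
    dataset = make_dataset(200)
    scores = [span_f1(right_branching_spans(len(ex["tokens"])), ex["spans"])
              for ex in dataset]
    print("mean span F1 of right-branching baseline:", sum(scores) / len(scores))
```

In an actual experiment, the spans induced by a latent-tree model trained only on `tokens` and `value` would take the place of `right_branching_spans`. Because the generator is fully controlled, a low score could then be attributed to the inference procedure rather than to doubts about whether the gold structures suit the data.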
