Assessing latent-structure models in natural language processing with artificial datasets
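Basic information
- Grant number: RGPIN-2021-03134
- Principal investigator:
- Amount: $11,000
- Host institution:
- Host institution country: Canada
- Project category: Discovery Grants Program - Individual
- Fiscal year: 2022
- Funding country: Canada
- Start and end dates: 2022-01-01 to 2023-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract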
Deep learning is now ubiquitous in natural language processing (NLP). It has raised performance to new levels in a variety of tasks, including (but not limited to) language modeling, syntactic and semantic parsing, named entity recognition, sentiment analysis and natural language understanding. At the heart of deep learning lies the idea of automatically learning and extracting features from the data that are useful to the task at hand, thereby sparing developers the need to handcraft all linguistic knowledge into computational models. Doing so counters some difficulties encountered by rule-based approaches. It notably offers much easier handling of out-of-vocabulary items (leveraging pre-trained word embeddings), and a more expressive set of models with fewer commitments to (overly) strong assumptions (such as context-freeness hypotheses in approaches based on formal language theory). These strong advantages, however, come with some downsides. First, very large amounts of data are generally required to infer good models, and in NLP (especially in semantics and syntax) such data is often produced by skilled annotators in a time-consuming and expensive process. Second, inference on large amounts of data is computationally heavy and has a strong environmental impact. Finally, deep learning models are much more difficult for humans to interpret. Injecting linguistic knowledge, especially *structural* knowledge, into deep learning models might overcome these limitations. To do this, one must determine how to make deep learning models aware of linguistic structure, and find the right kind of structural bias. A (popular) direction of research models aspects of linguistic structure, such as syntax trees, with latent variables. Learning these models from data requires specific inference algorithms, often in combination with variance-reducing tricks.
If and when the inferred models do not improve performance, or the learned structures differ a lot from linguists' expectations (which tends to be the case), there is always a question of whether this indicates a failure to learn the 'correct' structures or, on the contrary, whether the putative structures do not explain the data as well as what the model is doing. To test the strict ability of these inference techniques to capture the kind of structures theorized by linguists, we propose to evaluate these algorithms on *artificial* datasets. Controlling the structures that actually govern the data generation, and their distribution, should enable testing the ability of these algorithms to learn a given type of structure, independently of concerns pertaining to the salience of these structures for modeling real linguistic data and of other dataset-specific issues. We will put particular emphasis on inferring latent syntactic trees for (compositional) semantic parsing, and design our artificial datasets to exhibit specific structural aspects inspired by the work of semanticists.
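As a rough illustration of the evaluation idea described in the abstract, the sketch below generates a toy artificial dataset in which a hidden binary tree fully determines the meaning of each sentence, then scores a predicted tree against the gold one. Everything here is an assumption made for illustration and not the project's actual design: the subtraction "grammar", the function names (`sample_tree`, `meaning`, `bracket_f1`, `make_dataset`), and the choice of unlabeled bracket F1 as the metric.

```python
import random

# Hypothetical toy setup: a tree is either an int leaf or a (left, right) pair.

def sample_tree(n_leaves, rng):
    """Sample a random binary tree with n_leaves digit leaves (1-9)."""
    if n_leaves == 1:
        return rng.randint(1, 9)
    split = rng.randint(1, n_leaves - 1)
    return (sample_tree(split, rng), sample_tree(n_leaves - split, rng))

def tokens(tree):
    """Observed sentence: leaves joined by '-', with the bracketing hidden."""
    if isinstance(tree, int):
        return [str(tree)]
    return tokens(tree[0]) + ["-"] + tokens(tree[1])

def meaning(tree):
    """Compositional semantics: subtraction applied along the hidden tree."""
    if isinstance(tree, int):
        return tree
    return meaning(tree[0]) - meaning(tree[1])

def spans(tree, start=0):
    """Return (leaf spans of internal nodes, number of leaves)."""
    if isinstance(tree, int):
        return set(), 1
    left, n_left = spans(tree[0], start)
    right, n_right = spans(tree[1], start + n_left)
    width = n_left + n_right
    return left | right | {(start, start + width)}, width

def bracket_f1(gold, predicted):
    """Unlabeled bracket F1 between a gold tree and a predicted tree."""
    g, _ = spans(gold)
    p, _ = spans(predicted)
    if not g and not p:
        return 1.0  # two single-leaf trees agree trivially
    overlap = len(g & p)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def make_dataset(n_examples, n_leaves, seed=0):
    """Build (sentence, meaning, gold_tree) triples from a seeded generator."""
    rng = random.Random(seed)
    trees = [sample_tree(n_leaves, rng) for _ in range(n_examples)]
    return [(tokens(t), meaning(t), t) for t in trees]
```

Because subtraction is not associative, the same token sequence can carry different meanings under different bracketings, so a model trained only on (sentence, meaning) pairs has to recover something about the hidden tree. Since the generator keeps the gold tree, a learned structure can be compared against it directly, which is not possible with real linguistic data.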
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Venant, Antoine
Assessing latent-structure models in natural language processing with artificial datasets
- Grant number: RGPIN-2021-03134
- Fiscal year: 2021
- Funding amount: $11,000
- Project category: Discovery Grants Program - Individual
Assessing latent-structure models in natural language processing with artificial datasets
- Grant number: DGECR-2021-00145
- Fiscal year: 2021
- Funding amount: $11,000
- Project category: Discovery Launch Supplement
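Similar NSFC grants
Drug development for treating EBV-induced nasopharyngeal carcinoma by targeting the fifth transmembrane domain of LMP-1
- Grant number: 21602216
- Approval year: 2016
- Funding amount: CNY 200,000
- Project category: Young Scientists Fund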
Similar overseas grants
Technology to capture latent relationships using network structure and its applications
- Grant number: 23K01632
- Fiscal year: 2023
- Funding amount: $11,000
- Project category: Grant-in-Aid for Scientific Research (C)
Novel Epigenetic Marks for HIV Latency Entry and Reversal
- Grant number: 10617943
- Fiscal year: 2023
- Funding amount: $11,000
- Project category:
Bottom-up and top-down computational modeling approaches to study CMV retinitis
- Grant number: 10748709
- Fiscal year: 2023
- Funding amount: $11,000
- Project category:
DDALAB: Identifying Latent States from Neural Recordings with Nonlinear Causal Analysis
- Grant number: 10643212
- Fiscal year: 2023
- Funding amount: $11,000
- Project category:
PARP1-Chromatin and NAD-Metabolism in EBV Epithelial Cancers
- Grant number: 10627691
- Fiscal year: 2023
- Funding amount: $11,000
- Project category:
Exploring herpesvirus exonucleases as potential antiviral targets
- Grant number: 10825475
- Fiscal year: 2023
- Funding amount: $11,000
- Project category:
HSV-1 reactivation and glaucomatous trabecular meshwork damage
- Grant number: 10592565
- Fiscal year: 2023
- Funding amount: $11,000
- Project category:
Cognitive and Neural Strategies for Latent Feature Inference
- Grant number: 10662877
- Fiscal year: 2023
- Funding amount: $11,000
- Project category:
Tuberculosis Immunopathogenesis During Superinfection with SARS-CoV2
- Grant number: 10737053
- Fiscal year: 2023
- Funding amount: $11,000
- Project category:
Combining In Vitro and In Silico Models to Investigate Antiretroviral Drug Transport Across the Blood Brain Barrier for the Treatment of HIV-1 Infection in the Brain
- Grant number: 10838759
- Fiscal year: 2023
- Funding amount: $11,000
- Project category: