CAREER: Multilingual Learning for Event Structures from Text
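Basic Information
- Award number: 2239570
- Principal investigator:
- Amount: $582,200
- Host institution:
- Host institution country: United States
- Award type: Continuing Grant
- Fiscal year: 2023
- Funding country: United States
- Period: 2023-06-01 to 2028-05-31
- Status: Ongoing
- Source:
- Keywords:
Project Abstract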
Natural language text is replete with important events across many areas, such as protests, cybersecurity breaches, elections, disease outbreaks, and business transactions. Identifying events that describe who did what to whom, together with the relations between them (causal, subevent, and coreferential), from large amounts of text can provide valuable data to support intelligent applications and data-driven decisions across various domains. However, current event structure extraction systems work only on text in a few widely used languages such as English, Chinese, Spanish, and Arabic. Text in many other languages of the world thus cannot be processed by current event extraction systems. This limitation has restricted the coverage of data sources for these systems, introduced language biases into the extracted events, and delayed updates with the latest events in local reports. As a result, the event data collected with current techniques cannot comprehensively represent the latest dynamics around the world and so cannot effectively support decision making on important problems of national interest. To address these multilingual challenges, this project will develop event extraction and event-event relation extraction systems that are effective for data in multiple languages, emphasizing understudied and low-resource languages to improve the coverage of extracted data and promote the democratization of technology. In information retrieval, multilingual event structure data from the developed technologies can enable data management systems to quickly obtain answers and create summaries for broader user queries in many more languages. In cybersecurity, databases of cyber attack events extracted from multilingual sources can be used to generate more fine-grained and comprehensive reports to inform resource allocation decisions and better protect online activities. In socio-political science, coded conflict and mediation events from more languages can increase the scope and reduce the biases of the data to support better decisions on foreign policy, civil war prevention, environmental challenges, or economic strategies.
This project will address three fundamental limitations of existing multilingual learning research for event structure extraction: (i) the lack of multilingual datasets annotated in enough languages to support evaluation of model generalization across different language families, (ii) the limitations of current multilingual representation learning methods in aligning representations between languages to induce language-general features, and (iii) the scarcity of labeled data in different languages for training multilingual models. First, the project will annotate documents for all event extraction and event-event relation extraction tasks in many more languages using consistent schemas. The languages selected for annotation will be typologically diverse, understudied, and low-resource, so that they provide reliable multilingual evaluation data for the developed methods. Second, to boost cross-lingual performance for event structure extraction, this project will devise multilingual representation learning methods that enable effective knowledge transfer, so that models trained on labeled data in high-resource languages can be applied directly to data in other languages. The project will develop novel representation alignment methods for different languages using representation matching, augmentation, and language-general structure induction for text. Third, to address the limited training data for multilingual learning, this project will develop novel methods to automatically generate labeled data in different languages. The project will introduce techniques to mitigate noise in the generated data and optimize generation procedures to boost multilingual learning and performance. The research activities in this project will be closely integrated with education and outreach missions to broaden their impact.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project Outcomes
Journal articles (4)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Generating Labeled Data for Relation Extraction: A Meta Learning Approach with Joint GPT-2 Training
- DOI: 10.18653/v1/2023.findings-acl.727
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Amir Pouran Ben Veyseh; Franck Dernoncourt; Bonan Min; Thien Huu Nguyen
- Corresponding authors: Amir Pouran Ben Veyseh; Franck Dernoncourt; Bonan Min; Thien Huu Nguyen
Retrieving Relevant Context to Align Representations for Cross-lingual Event Detection
- DOI: 10.18653/v1/2023.findings-acl.135
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Chien Nguyen; Linh Van Ngo; Thien Huu Nguyen
- Corresponding authors: Chien Nguyen; Linh Van Ngo; Thien Huu Nguyen
Contextualized Soft Prompts for Extraction of Event Arguments
- DOI: 10.18653/v1/2023.findings-acl.266
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Chien Van Nguyen; Hieu Man; Thien Huu Nguyen
- Corresponding authors: Chien Van Nguyen; Hieu Man; Thien Huu Nguyen
Hybrid Knowledge Transfer for Improved Cross-Lingual Event Detection via Hierarchical Sample Selection
- DOI: 10.18653/v1/2023.acl-long.296
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Luis Guzman Nateras; Franck Dernoncourt; Thien Huu Nguyen
- Corresponding authors: Luis Guzman Nateras; Franck Dernoncourt; Thien Huu Nguyen
Other publications by Thien Nguyen
A Comparative Study of Several Classical, Discrete Differential and Isogeometric Methods for Solving Poisson's Equation on the Disk
- DOI: 10.3390/axioms3020280
- Published: 2014-06-01
- Journal:
- Impact factor: 2
- Authors: Thien Nguyen; Karciauskas, Kestutis; Peters, Jorg
- Corresponding author: Peters, Jorg
Addressing the Challenges in the Placement of Seafloor Infrastructure on the East Breaks Slide-A Case Study: The Falcon Field (EB 579/623), Northwestern Gulf of Mexico
- DOI: 10.4043/16748-ms
- Published: 2004
- Journal:
- Impact factor: 0
- Authors: J. S. Hoffman; Michael J. Kaluza; R. Griffiths; Gary McCullough; J. Hall; Thien Nguyen
- Corresponding author: Thien Nguyen
Time-Resolved Velocity Measurements in a Matched Refractive Index Facility of Randomly Packed Spheres
- DOI: 10.1115/icone26-82425
- Published: 2018
- Journal:
- Impact factor: 0
- Authors: E. Kappes; M. Marciniak; A. Mills; R. Muyshondt; S. King; Thien Nguyen; Y. Hassan; V. Ugaz
- Corresponding author: V. Ugaz
Highly-sensitive fluorescence detection and imaging with microfabricated total internal reflection (TIR)-based devices
- DOI:
- Published: 2012
- Journal:
- Impact factor: 0
- Authors: N. Le; D. Dao; R. Yokokawa; Thien Nguyen; J. Wells; S. Sugiyama
- Corresponding author: S. Sugiyama
Optimizing Bilingual Neural Transducer with Synthetic Code-switching Text Generation
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Thien Nguyen; Nathalie Tran; Liuhui Deng; T. F. D. Silva; Matthew Radzihovsky; Roger Hsiao; Henry Mason; Stefan Braun; E. McDermott; Dogan Can; P. Swietojanski; Lyan Verwimp; Sibel Oyman; Tresi Arvizo; Honza Silovsky; Arnab Ghoshal; M. Martel; Bharat Ram Ambati; Mohamed Ali
- Corresponding author: Mohamed Ali
Other grants by Thien Nguyen
Phase I IUCRC University of Oregon: Center for Big Learning
- Award number: 1747798
- Fiscal year: 2018
- Amount: $582,200
- Award type: Continuing Grant
Similar overseas grants
Unifying Pre-training and Multilingual Semantic Representation Learning for Low-resource Neural Machine Translation
- Award number: 22KJ1843
- Fiscal year: 2023
- Amount: $582,200
- Award type: Grant-in-Aid for JSPS Fellows
Elementary Teacher Professional Learning of Equitable Engineering Pedagogies for Multilingual Students
- Award number: 2300766
- Fiscal year: 2023
- Amount: $582,200
- Award type: Standard Grant
The role and effects of Bilingual Learning Assistants in supporting multilingual learners in schools
- Award number: 2737845
- Fiscal year: 2022
- Amount: $582,200
- Award type: Studentship
Enabling Deep Learning for Multilingual Sociopragmatics
- Award number: RGPIN-2018-04267
- Fiscal year: 2022
- Amount: $582,200
- Award type: Discovery Grants Program - Individual
Exploring the Dynamics of Language and Thought in the Multilingual Mind: Effects of Multiple Language Learning
- Award number: ES/V012274/1
- Fiscal year: 2021
- Amount: $582,200
- Award type: Fellowship
Towards cultivation of multilingual competence involving English: A longitudinal investigation of Conversation-for Learning
- Award number: 21K13051
- Fiscal year: 2021
- Amount: $582,200
- Award type: Grant-in-Aid for Early-Career Scientists
Reconceptualising the Motivational Complexity of Multilingual Learning
- Award number: 2565882
- Fiscal year: 2021
- Amount: $582,200
- Award type: Studentship
Enabling Deep Learning for Multilingual Sociopragmatics
- Award number: RGPIN-2018-04267
- Fiscal year: 2021
- Amount: $582,200
- Award type: Discovery Grants Program - Individual
Multilingual speech synthesis based on deep learning to reproduce the speaker and emotion of input speech in different languages
- Award number: 20K11862
- Fiscal year: 2020
- Amount: $582,200
- Award type: Grant-in-Aid for Scientific Research (C)
Enabling Deep Learning for Multilingual Sociopragmatics
- Award number: RGPIN-2018-04267
- Fiscal year: 2020
- Amount: $582,200
- Award type: Discovery Grants Program - Individual