EAGER: DCL: SaTC: Enabling Interdisciplinary Collaboration: Efficient Human-in-the-Loop Redaction of Language Development Corpora

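Basic Information

Award number:
2210193
Principal investigator:
Blase Ur
Amount:
$300,000
Institution:
University of Chicago
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2022
Funding country:
United States
Period:
2022-07-01 to 2024-06-30
Project status:
Completed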

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2210193&HistoricalAwards=false
Keywords:
EAGER DCL SaTC Enabling Interdisciplinary

Project Abstract

At great effort and expense, and with the cooperation of hundreds of parents, teachers, and children, researchers have collected conversation transcripts to study topics like children's language development. The data most useful for science are longitudinal and naturalistic, such as data collected periodically over time in children's homes. Unfortunately, the longitudinal, naturalistic corpora most likely to advance knowledge may contain information that renders participants identifiable. For this reason, naturalistic corpora are rarely shared with other researchers, hindering science. Sharing requires careful redaction--the removal of potentially identifying information. Currently, naturalistic corpora are often too large for manual redaction, and current automated tools both miss critical redactions and over-redact important information. To enable such data to be shared, this project seeks to develop novel computational methods for redaction.

This project's aim is to develop initially automated, human-in-the-loop redaction of identifying information in unstructured text data. First, to better understand key challenges around what aspects of transcripts make participants identifiable, the researchers are conducting interviews with social and behavioral science researchers and members of ethics boards. From these insights, the researchers are developing novel models for predicting what language may need to be redacted and they are designing novel user interactions for leveraging human expertise in redaction decisions. The unique characteristics of conversation transcripts require modeling novel features of language, drawing from natural language processing, psychology, privacy engineering, and linguistics. Because automated methods lack human insights into conversational context for making complex redaction decisions, the researchers are designing user interfaces that summarize how marked language, or tokens, appear longitudinally in transcripts, enabling human coders to quickly make redaction decisions. As a case study, the researchers are applying these techniques to the Language Development Project, a longitudinal corpus of 100 diverse children's development of language. The project is also training students in multidisciplinary research across the computational and social sciences.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
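
The abstract's idea of summarizing how flagged tokens appear across a longitudinal corpus, and then letting a human coder decide what to redact, can be illustrated with a minimal sketch. This is not the project's method or tooling: the function names, the `[REDACTED]` placeholder, the simple regex tokenizer, and the sample sentences are all assumptions made for illustration, and the set of flagged tokens is assumed to come from some upstream automated model.

```python
import re
from collections import defaultdict


def summarize_flagged_tokens(sessions, flagged):
    """Summarize where each flagged token occurs across sessions.

    sessions: list of (session_id, text) pairs in chronological order.
    flagged:  set of lowercase tokens that an upstream model (assumed)
              marked as possibly identifying.
    Returns {token: {"count": total occurrences, "sessions": [ids in order]}}.
    """
    summary = defaultdict(lambda: {"count": 0, "sessions": []})
    for session_id, text in sessions:
        # Hypothetical tokenizer: runs of letters and apostrophes.
        for word in re.findall(r"[A-Za-z']+", text):
            key = word.lower()
            if key in flagged:
                entry = summary[key]
                entry["count"] += 1
                if session_id not in entry["sessions"]:
                    entry["sessions"].append(session_id)
    return dict(summary)


def apply_redactions(text, approved, placeholder="[REDACTED]"):
    """Replace whole-word occurrences of tokens a human coder approved."""
    if not approved:
        return text
    alternatives = "|".join(re.escape(token) for token in sorted(approved))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(placeholder, text)


# Made-up example data, not drawn from any real corpus.
sessions = [
    ("visit-01", "Maya went to Lincoln Park with Grandma."),
    ("visit-02", "Maya likes the park."),
]
summary = summarize_flagged_tokens(sessions, {"maya", "lincoln"})
redacted = apply_redactions(sessions[0][1], {"maya"})
```

In this sketch the per-token summary is what a reviewer would look at once, rather than judging every occurrence separately; only tokens the reviewer approves are passed to `apply_redactions`.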

Project Outcomes

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Blase Ur

Forgotten But Not Gone: Identifying the Need for Longitudinal Data Management in Cloud Storage

DOI:
10.1145/3173574.3174117
Year published:
2018
Journal:
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems
Impact factor:
0
Authors:
Mohammad Taha Khan; Maria Hyun; Chris Kanich; Blase Ur
Corresponding author:
Blase Ur