SLES: A Theoretical Lens on Generative AI Safety: Near and Long Term


Basic Information

  • Award Number:
    2331831
  • Principal Investigator:
  • Amount:
    $0.8 million
  • Host Institution:
  • Host Institution Country:
    United States
  • Project Type:
    Standard Grant
  • Fiscal Year:
    2023
  • Funding Country:
    United States
  • Project Period:
    2023-11-01 to 2026-10-31
  • Project Status:
    Ongoing

Project Abstract

Generative AI technologies like ChatGPT have taken the world by storm with their ability to synthesize strikingly coherent text, code, and more. These systems continue to improve in quality at a remarkable pace and increasingly shape diverse facets of society and industry, yet the field's ability to control them and ensure their reliability has not kept up. The models remain notoriously prone to confidently making factually incorrect yet convincing-sounding statements; even when they in principle have all of the knowledge needed to prevent this, they often stumble in putting the pieces together. As this technology makes its way into mission-critical contexts like healthcare or policy decisions, it is crucial to avoid such failure modes. This research will develop mathematically rigorous AI deployment methods that come with solid theoretical assurances that the systems will not stray from their intended behavior in this way. The findings of this project will be instrumental in establishing sustainable checks and fail-safes so that generative AI technologies can scale in a controlled fashion that is aligned with human interests.

The research tackles both near-term challenges in safety for generative AI and emerging, longer-term ones that will arise as these models grow in their capabilities. For the former, the project will establish mathematical parameters for factuality and non-hallucination in generative models. This encompasses detecting when models make factual assertions, calibrating confidence scores for these assertions, reliably attributing them to their sources in the training data, and encouraging models to abstain from generation when faced with sufficiently out-of-distribution input. Another goal is to investigate methodologies for eliciting and editing knowledge stored in generative models, and to isolate fundamental barriers to doing so using tools from fine-grained complexity theory and computational notions of entropy. For safety in the longer term, the project will examine the feasibility of integrating emergency-stop functionality into AI systems via cryptographic backdoors, as well as implementing "AI arms protocols" based on zero-knowledge proofs that publicly certify a system's safety properties while keeping certain of its components private. The research will also rigorously stress-test existing proposals for scalable oversight of AI systems, such as natural-language debate and iterated amplification, using techniques from combinatorial game theory and average-case analysis of recursive heuristics.

This research is supported by a partnership between the National Science Foundation and Open Philanthropy. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
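As a purely illustrative reading of the near-term goal of calibrated confidence and abstention, the sketch below shows one generic recipe: choose a confidence threshold on held-out data so that the predictions kept above the threshold meet a target precision, and abstain otherwise. This is not the project's method; the function names, the held-out logits and labels, and the target_precision parameter are assumptions introduced only for illustration.

    # Illustrative sketch only: a generic calibrate-then-abstain rule, not the award's method.
    # Assumed inputs: held-out logits and labels from some classifier-style factuality probe.
    import numpy as np

    def softmax(logits):
        z = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def pick_abstention_threshold(val_logits, val_labels, target_precision=0.95):
        # Smallest confidence threshold whose retained predictions reach the target
        # precision on held-out data; if none does, abstain on everything (threshold 1.0).
        probs = softmax(val_logits)
        conf = probs.max(axis=-1)
        correct = probs.argmax(axis=-1) == val_labels
        for t in np.sort(np.unique(conf)):
            kept = conf >= t
            if kept.any() and correct[kept].mean() >= target_precision:
                return float(t)
        return 1.0

    def predict_or_abstain(logits, threshold):
        # Return the predicted class, or None to signal abstention.
        probs = softmax(np.asarray(logits))
        return int(probs.argmax()) if probs.max() >= threshold else None

In this toy setting the guarantee is only empirical (precision measured on one held-out set); the abstract's stated aim is to replace heuristics of this kind with criteria that carry rigorous theoretical assurances.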

Project Outcomes

Journal Articles (0)
Monographs (0)
Research Awards (0)
Conference Papers (0)
Patents (0)

Other Publications by Sitan Chen

Provably learning a multi-head attention layer
  • DOI:
    10.48550/arxiv.2402.04084
  • Publication Year:
    2024
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Sitan Chen;Yuanzhi Li
  • Corresponding Author:
    Yuanzhi Li
An optimal tradeoff between entanglement and copy complexity for state tomography
Efficient learning of many-body systems
  • DOI:
    10.1038/s41567-024-02393-4
  • Publication Year:
    2024
  • Journal:
  • Impact Factor:
    19.6
  • Authors:
    Sitan Chen
  • Corresponding Author:
    Sitan Chen
A Hierarchy for Replica Quantum Advantage
  • DOI:
    10.1145/2746539.2746582
  • Publication Year:
    2021
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Sitan Chen;Jordan S. Cotler;Hsin-Yuan Huang;J. Li
  • Corresponding Author:
    J. Li
Beyond the low-degree algorithm: mixtures of subcubes and their applications

Other Grants by Sitan Chen

PostDoctoral Research Fellowship
  • Award Number:
    2103300
  • Fiscal Year:
    2021
  • Funding Amount:
    $0.8 million
  • Project Type:
    Fellowship Award

Similar International Grants

REU Site: REU in Theoretical and Experimental Physics
  • Award Number:
    2348872
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Continuing Grant
Electron momentum spectroscopy of radiosensitizers: new benchmark data for assessing the theoretical models
  • Award Number:
    EP/Y022297/1
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Research Grant
CREST HBCU-RISE: Advancing Theoretical Artificial Intelligence Infrastructure for Modern Data Science Challenges
  • Award Number:
    2409093
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Continuing Grant
CRII: SHF: Theoretical Foundations of Verifying Function Values and Reducing Annotation Overhead in Automatic Deductive Verification
  • Award Number:
    2348334
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Standard Grant
CAREER: Gaussian Processes for Scientific Machine Learning: Theoretical Analysis and Computational Algorithms
  • Award Number:
    2337678
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Continuing Grant
Labor Market Polarization, Earnings Inequality and Optimal Tax Progressivity: A Theoretical and Empirical Analysis
  • Award Number:
    24K04909
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Grant-in-Aid for Scientific Research (C)
The syntax of nominal copular clauses: theoretical and empirical perspectives
  • Award Number:
    AH/Y007492/1
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Research Grant
CAREER: Theoretical foundations for deep learning and large-scale AI models
  • Award Number:
    2339904
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Continuing Grant
CAREER: Theoretical and Computational Advances for Enabling Robust Numerical Guarantees in Linear and Mixed Integer Programming Solvers
  • Award Number:
    2340527
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Continuing Grant
Theoretical and Experimental Investigation of Photoheterolysis Reactions
  • Award Number:
    2349051
  • Fiscal Year:
    2024
  • Funding Amount:
    $0.8 million
  • Project Type:
    Standard Grant