Frameworks: arXiv as an accessible large-scale open research platform

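Basic Information

Award number:
2311521
Principal investigator:
Ramin Zabih
Amount:
$4.9665 million
Host institution:
Cornell University
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2024
Funding country:
United States
Project period:
2024-01-01 to 2028-12-31
Project status:
Ongoing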

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2311521&HistoricalAwards=false
Keywords:
Frameworks arXiv accessible large scale

Project Abstract

arXiv is an open-access repository that has played a leading role in disciplines such as computer science, mathematics and physics for over 30 years. It hosts more than 2 million scientific papers and has a large user community. Each month there are approximately 5 million active users and 100 million web accesses. Despite its size and usage, arXiv has very limited search and recommendation functionality. In order to better serve the arXiv community, this project is building a new generation of search and recommendation functionality and simultaneously creating a research sandbox to reduce reliance on third-party, commercial services. To make arXiv's trove of scientific content accessible to the visually impaired, support is being added for well-structured HTML as well as PDF. Improved discovery of research results provides broad multidisciplinary benefits across areas of science. These include less researcher time wasted browsing through large amounts of irrelevant papers, revelation of "unknown unknowns," and accelerating research across different subject areas through unexpected synergies. Improved recommendation tools, which can provide unbiased and diverse sources of relevant research results and techniques, are urgently needed to break silos. arXiv will provide improved mechanisms for scientists to find out about important advances, both in their own field of expertise and in adjacent fields.
This project includes 4 major focus areas: Open A/B Testing, Neural Representations of Scientific Text, arXiv Dynamics, and Security & Privacy.
(1) Open A/B Testing enables arXiv to become a platform for A/B testing of search and recommendation algorithms. In addition to online A/B testing, offline A/B testing is provided using historical data along with counterfactual estimators for policy rewards.
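The abstract does not say which counterfactual estimators the offline A/B testing in focus area (1) will use. As one common example, the following minimal sketch shows inverse propensity scoring (IPS), which reweights logged rewards by the ratio of the candidate policy's action probability to the logging policy's. The log format, the ranker names `"A"` and `"B"`, and the function names are hypothetical and chosen only for illustration.

```python
def ips_estimate(logs, target_policy):
    """Inverse propensity scoring estimate of a target policy's mean reward.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability with which the logging policy
          chose `action` in `context` (must be > 0).
    target_policy: function (context, action) -> probability that the
          candidate policy would choose `action` in `context`.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the candidate policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


def always_b(context, action):
    """Hypothetical candidate policy: always show results from ranker B."""
    return 1.0 if action == "B" else 0.0


# Toy historical log: the logging policy picked ranker A or B uniformly
# at random (probability 0.5 each); reward is 1 for a click, 0 otherwise.
toy_logs = [
    ("query-1", "A", 0, 0.5),
    ("query-2", "B", 1, 0.5),
    ("query-3", "A", 1, 0.5),
    ("query-4", "B", 0, 0.5),
]

estimated_reward = ips_estimate(toy_logs, always_b)
print(estimated_reward)
```

In this toy log, ranker B was clicked in one of its two impressions, and the IPS estimate for the "always B" policy matches that rate of 0.5. IPS is unbiased only when the logging policy gives nonzero probability to every action the candidate policy might take, and its variance grows as logging probabilities shrink.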
(2) Neural Representations of Scientific Text provides a vector-based representation of scientific texts (documents, paragraphs, and sentences) appropriate for multiple tasks, including citation, author, title, and keyword prediction. Differentiable search indices are investigated due to their potential to provide additional search performance improvements without requiring incremental re-training. Finally, this supports the construction of a scientific question-answering system which can also be used as a context-sensitive "chat-bot," enabling researchers to converse with it and get a list of recent publications relevant to their interests.
(3) The arXiv Dynamics project investigates how scientific fields grow, shrink, and transform over time. A "trending and emerging arXiv topics" pattern recognition system is being created to predict how interesting current and historical articles are to researchers. Research is investigating methods to remove the "rich-get-richer" effect from this model, to correct the model for the effects of the users' historical interactions with the system, and to track performance and solicit user feedback as these models change over time.
(4) Under Security & Privacy, arXiv's privacy policy is updated so that users are aware of how their (meta-)data may be used and the protections that will be deployed to protect their privacy. A "Layer 1" API allows researchers to make coarse-grained queries on anonymized arXiv weblogs, and a "Layer 2" API allows researchers to securely experiment on arXiv metadata and weblogs. Privacy is preserved by a combination of query restrictions and researcher usage agreements.
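The abstract does not specify what the "query restrictions" on the Layer 1 API look like. One common kind of restriction for coarse-grained queries is small-group suppression, where an aggregate is returned only if enough records contribute to it. The sketch below illustrates that idea under assumed names: the record fields, the threshold of 5, and the function `coarse_counts` are all hypothetical and are not part of any arXiv API.

```python
from collections import Counter

# Hypothetical threshold: groups with fewer records than this are withheld.
MIN_GROUP_SIZE = 5


def coarse_counts(records, key, min_group_size=MIN_GROUP_SIZE):
    """Count records per value of `key`, suppressing small groups.

    records: iterable of dicts representing anonymized log entries.
    key: the field to aggregate on (for example a subject category).
    Returns a dict mapping each value to its count, omitting any value
    whose count falls below `min_group_size`.
    """
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n >= min_group_size}


# Toy anonymized weblog entries (illustrative data only).
toy_weblog = (
    [{"category": "cs.LG"} for _ in range(6)]
    + [{"category": "math.AG"} for _ in range(2)]
)

print(coarse_counts(toy_weblog, "category"))
```

Here the two `math.AG` accesses are withheld because so small a group could make individual readers easier to single out, while the six `cs.LG` accesses are reported.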
A machine-learning API layer is being developed which supports differential privacy and allows researchers to investigate the utility of these tools for novel ML-based applications, such as free-form question answering about scientific texts and neural recommender systems.
This award by the Office of Advanced Cyberinfrastructure is jointly supported by the Division of Information and Intelligent Systems in the Directorate for Computer and Information Science and Engineering and the Division of Physics within the Directorate for Mathematical and Physical Sciences.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
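The abstract states that the machine-learning API layer supports differential privacy but does not name a mechanism. As background, the following minimal sketch shows the standard Laplace mechanism applied to a counting query, which has sensitivity 1 because adding or removing one record changes the count by at most 1. The function name `private_count`, the toy data, and the choice of epsilon are assumptions made for illustration.

```python
import random


def private_count(values, predicate, epsilon, rng=None):
    """Return a differentially private count of values matching predicate.

    Uses the Laplace mechanism: noise with scale sensitivity / epsilon is
    added to the true count. A counting query has sensitivity 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    scale = 1.0 / epsilon
    # The difference of two independent exponential variables with rate
    # 1/scale follows a Laplace distribution with mean 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Toy data: subject categories of papers viewed in one session.
toy_views = ["cs.LG", "cs.LG", "hep-th", "cs.LG", "math.AG"]

noisy = private_count(toy_views, lambda c: c == "cs.LG", epsilon=1.0,
                      rng=random.Random(0))
print(noisy)
```

Smaller values of epsilon give stronger privacy but noisier answers. Answering many queries on the same data consumes privacy budget cumulatively, which this sketch does not track.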
