
EAGER: Annotating and extracting detailed syntactic information from a 1.1-billion-word corpus

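Basic Information

Award number:
2026850
Principal investigator:
Beatrice Santorini
Amount:
Approximately $298,400
Host institution:
University of Pennsylvania
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2020
Funding country:
United States
Project period:
2020-08-15 to 2024-01-31
Project status:
Completed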

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2026850&HistoricalAwards=false
Keywords:
EAGER Annotating extracting detailed syntactic

Project Abstract

Over the past decade, very large text corpora of English have become available to researchers that turn out to be of considerable value for the language sciences. Even more recently, methods in natural language processing have advanced to a point where we can begin to imagine conducting linguistic research using automatically parsed and uncorrected corpora of the sort that has so far been conducted using human-corrected corpora. It is this new situation that the PIs wish to exploit by producing an automatically parsed billion-plus word corpus of early modern English based on the digitized Early English Books Online (EEBO) corpus that has recently been completed and made accessible to research. The aim is to create an automatically parsed database with a level of accuracy suitable for both linguistic and computational research, using the recently developed cutting-edge methods in natural language processing. The resulting resource will make possible investigations hitherto impossible; specifically, the information contained in a parsed version of EEBO will permit researchers to investigate frequency effects not just of words, but of larger grammatical units (phrases and clauses). In addition to their inherent linguistic interest, the results of such investigations may lead to the discovery of more sophisticated meaning-based properties and how these vary, which should be of value for research in natural language processing. The PIs have made progress on this goal, having created a first automatically parsed version of the EEBO corpus and begun to assess its accuracy. Some features like the syntax of clausal negation are already within our reach, but for many other structures, it remains to be determined how accurate retrieval with large-scale methods can be. 
Since EEBO is more than 300 times larger than even the largest individual human-corrected corpora, it is expected that a more accurately parsed version of it than the one now available will begin to allow researchers to study phenomena that are only sporadically attested in existing English corpora, to zero in on the very beginnings and ends of historical changes, to investigate many different types of frequency effects (including the novel ones already mentioned) with an accuracy and reliability not hitherto possible, and to rigorously evaluate mathematical models of language change. Because the stage of English covered by EEBO (1500-1700) is already recognizably the modern language, a parsed version of EEBO can to some extent stand proxy for a corpus of Present-Day English for research in the language sciences. As a result, it should be useful as a training and testing ground for applications in computational linguistics including part-of-speech tagging, parsing, named entity recognition, and eventually lemmatization, sense disambiguation, and others. EEBO’s great genre variety and variable orthography and its moderate distance from Present-Day English will also make a parsed version of it a natural candidate for assessing and improving the robustness of these applications and for developing novel parser evaluation metrics that can serve as linguistically informed benchmarks for computational linguistics. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
