EAGER: Annotating and extracting detailed syntactic information from a 1.1-billion-word corpus
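Basic information
- Award number: 2026850
- Principal investigator:
- Amount: $298,400
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2020
- Funding country: United States
- Project period: 2020-08-15 to 2024-01-31
- Project status: Completed
- Source:
- Keywords:
Project abstract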
Over the past decade, very large text corpora of English have become available to researchers that turn out to be of considerable value for the language sciences. Even more recently, methods in natural language processing have advanced to a point where we can begin to imagine conducting linguistic research using automatically parsed and uncorrected corpora of the sort that has so far been conducted using human-corrected corpora. It is this new situation that the PIs wish to exploit by producing an automatically parsed billion-plus word corpus of early modern English based on the digitized Early English Books Online (EEBO) corpus that has recently been completed and made accessible to research. The aim is to create an automatically parsed database with a level of accuracy suitable for both linguistic and computational research, using the recently developed cutting-edge methods in natural language processing. The resulting resource will make possible investigations hitherto impossible; specifically, the information contained in a parsed version of EEBO will permit researchers to investigate frequency effects not just of words, but of larger grammatical units (phrases and clauses). In addition to their inherent linguistic interest, the results of such investigations may lead to the discovery of more sophisticated meaning-based properties and how these vary, which should be of value for research in natural language processing. The PIs have made progress on this goal, having created a first automatically parsed version of the EEBO corpus and begun to assess its accuracy. Some features like the syntax of clausal negation are already within our reach, but for many other structures, it remains to be determined how accurate retrieval with large-scale methods can be. 
Since EEBO is more than 300 times larger than even the largest individual human-corrected corpora, it is expected that a more accurately parsed version of it than the one now available will begin to allow researchers to study phenomena that are only sporadically attested in existing English corpora, to zero in on the very beginnings and ends of historical changes, to investigate many different types of frequency effects (including the novel ones already mentioned) with an accuracy and reliability not hitherto possible, and to rigorously evaluate mathematical models of language change. Because the stage of English covered by EEBO (1500-1700) is already recognizably the modern language, a parsed version of EEBO can to some extent stand proxy for a corpus of Present-Day English for research in the language sciences. As a result, it should be useful as a training and testing ground for applications in computational linguistics including part-of-speech tagging, parsing, named entity recognition, and eventually lemmatization, sense disambiguation, and others. EEBO’s great genre variety and variable orthography and its moderate distance from Present-Day English will also make a parsed version of it a natural candidate for assessing and improving the robustness of these applications and for developing novel parser evaluation metrics that can serve as linguistically informed benchmarks for computational linguistics.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (3)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Parsing Early Modern English for linguistic search
- DOI: 10.7275/twww-ef90
- Year published: 2022
- Journal:
- Impact factor: 0
- Authors: Kulick, Seth; Ryant, Neville; Santorini, Beatrice
- Corresponding author: Santorini, Beatrice
Parsing "Early English Books Online" for linguistic search
- DOI:
- Year published: 2023
- Journal:
- Impact factor: 0
- Authors: Kulick, Seth; Ryant, Neville; Santorini, Beatrice
- Corresponding author: Santorini, Beatrice
Other publications by Beatrice Santorini
評議コーパスから見た裁判員と裁判官の思考体系の差異
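Differences in the thinking of lay judges and professional judges as seen in a deliberation corpus
- DOI:
- Year published: 2010
- Journal:
- Impact factor: 0
- Authors: Fumi Karahashi; Beatrice Santorini; 玉岡賀津雄; 堀田秀吾
- Corresponding author: 堀田秀吾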
Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
- DOI:
- Year published: 1990
- Journal:
- Impact factor: 0
- Authors: Beatrice Santorini
- Corresponding author: Beatrice Santorini
The Penn Parsed Corpus of Modern British English: First Parsing Results and Analysis
- DOI:
- Year published: 2014
- Journal:
- Impact factor: 0
- Authors: S. Kulick; A. Kroch; Beatrice Santorini
- Corresponding author: Beatrice Santorini
Sumerian Relative Clauses with Anticipated Arguments : A Null Analysis
- DOI:
- Year published: 2012
- Journal:
- Impact factor: 0
- Authors: Fumi Karahashi; Beatrice Santorini
- Corresponding author: Beatrice Santorini
Codeswitching and the syntactic status of adnominal adjectives
- DOI: 10.1016/0024-3841(94)00026-i
- Year published: 1995
- Journal:
- Impact factor: 1.1
- Authors: Beatrice Santorini; Shahrzad Mahootian
- Corresponding author: Shahrzad Mahootian
Other grants by Beatrice Santorini
Collaborative Research: A corpus of New York City English: Audio-aligned and parsed
- Award number: 1629348
- Fiscal year: 2016
- Funding amount: $298,400
- Project type: Standard Grant
Collaborative Research: A syntactically annotated corpus of Appalachian English
- Award number: 1151630
- Fiscal year: 2012
- Funding amount: $298,400
- Project type: Standard Grant
Similar overseas grants
Annotating the New Testament: Codex H, Euthalian Traditions, and the Humanities
- Award number: AH/X001458/1
- Fiscal year: 2023
- Funding amount: $298,400
- Project type: Research Grant
Annotating the New Testament: Textual Transmission as Reception History
- Award number: 2890120
- Fiscal year: 2023
- Funding amount: $298,400
- Project type: Studentship
Annotating dark ion-channel functions using evolutionary features, machine learning and knowledge graph mining
- Award number: 10457684
- Fiscal year: 2022
- Funding amount: $298,400
- Project type:
Annotating dark ion-channel functions using evolutionary features, machine learning and knowledge graph mining (Kennady Boyd)
- Award number: 10809950
- Fiscal year: 2022
- Funding amount: $298,400
- Project type:
Annotating dark ion-channel functions using evolutionary features, machine learning and knowledge graph mining
- Award number: 10661550
- Fiscal year: 2022
- Funding amount: $298,400
- Project type:
Computational framework for analyzing and annotating single bacterium RNA-Seq data
- Award number: 10444669
- Fiscal year: 2022
- Funding amount: $298,400
- Project type:
Annotating dark ion-channel functions using evolutionary features, machine learning and knowledge graph mining (Rayna Carter)
- Award number: 10809931
- Fiscal year: 2022
- Funding amount: $298,400
- Project type:
Computational framework for analyzing and annotating single bacterium RNA-Seq data
- Award number: 10610447
- Fiscal year: 2022
- Funding amount: $298,400
- Project type:
Annotating Reference and Coreference In Dialogue Using Conversational Agents in games
- Award number: EP/W001632/1
- Fiscal year: 2022
- Funding amount: $298,400
- Project type: Research Grant
Annotating Recorded Telemetry for Extracting Meaning and Insight from Scenarios in Virtual Reality (ARTEMIS-VR)
- Award number: 10022324
- Fiscal year: 2022
- Funding amount: $298,400
- Project type: Feasibility Studies