Semantic Annotation and Mark Up for Enhancing Lexical Searches (SAMUELS)

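Basic information

  • Grant number:
    AH/L010062/1
  • Principal investigator:
  • Amount:
    $517,800
  • Host institution:
  • Host institution country:
    United Kingdom
  • Project type:
    Research Grant
  • Fiscal year:
    2014
  • Funding country:
    United Kingdom
  • Start and end dates:
    2014 to (no data)
  • Project status:
    Completed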

Project abstract

As humanities datasets get ever larger, researchers have a pressing need for more sophisticated techniques of analysis. The most significant issue in big data research into textual datasets is that our primary methodology for searching, aggregating and analysing them relies not on concepts or meanings, but rather on word forms. These forms are imperfect and evasive proxies for the meanings they refer to, and with 60% of word forms in English referring to more than one meaning, and some word forms referring to close to two hundred meanings, the irrelevant "noise" which appears when searching using word forms grows with the size of the texts being searched.
In big data contexts, this problem cripples research, making any sort of detailed analysis entirely intractable and requiring impossible amounts of manual intervention. In this project, we will deliver a system for automatically annotating words in texts with their precise meanings, enabling a step-change in the way we deal with large textual data. The system is based around the unparalleled Historical Thesaurus of English, which contains 797,000 words from across the history of English arranged into 236,000 hierarchical categories of meanings alongside each word's dates of known use. The annotation software will take a text and provide for each word it contains an XML annotation giving the word meaning's Historical Thesaurus category code. The system will automatically disambiguate word meanings using a range of state-of-the-art computational techniques alongside new context-dependent methods unlocked by the Thesaurus's dating codes and its uniquely detailed and fine-grained hierarchical structure.
Textual data tagged in this way can then be accurately searched and precisely investigated, with any results also able to be aggregated at a range of levels of precision, without the need for manual intervention.
A major part of the project is also the development of new techniques for working with semantically-aggregated and disambiguated data. Project partners will conduct research on resources including the Hansard Corpus, consisting of over 2.3 billion words of text, the Oxford English Corpus, the world's largest stratified corpus of modern English, and the EEBO-TCP corpus of 40,000 early modern books. As part of our work on changing the nature of how we deal with data on this scale, we will mine these text collections for frequently-occurring or statistically unusual concepts, will take advantage of our ability to search large datasets for terms realised by ambiguous word forms (such as "union" in the particular context of industrial relations rather than any of the other 33 possible meanings of this word), and will examine the data as a whole from a distant-reading perspective in order to look for striking or significant patterns of meaning changes across time.
These research projects based on tagged data will also drive the development of our tools for using this data, with teams of researchers across the UK and abroad providing a range of different demands on the data, ensuring a variety of needs and use-cases are catered for in the development of the project. In this way, we are committed to producing a set of compelling, fruitful, and practical research outcomes using semantically-tagged data during the lifetime of the project, in order to demonstrate the value of our approach and to help ensure the work of the project is as widely utilised and exploited as possible.
By doing all of this, we will enable new and transformative techniques of exploring, searching and investigating large-scale cultural, literary, historical and linguistic phenomena in big humanities datasets; through this project, it will be possible to place meaning - rather than word forms - at the heart of digital humanities research into text.
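
To make the idea of per-word XML annotation with hierarchical category codes concrete, here is a minimal Python sketch. It is not the project's tagger. The lexicon, the category codes, the `<w>` element, the `cat` attribute, and the overlap-based disambiguation heuristic are all assumptions invented for illustration. It only shows how a word form such as "union" might be resolved from its context and how a hierarchical code can be truncated to aggregate results at a coarser level.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical lexicon: these category codes are invented for this sketch
# and are NOT real Historical Thesaurus codes.
TOY_LEXICON = {
    "union": ["03.04.01", "01.02.07"],   # labour organisation / act of joining
    "strike": ["03.04.02", "01.05.03"],  # withdrawal of labour / physical blow
    "workers": ["03.04.03"],
}


def shared_depth(a: str, b: str) -> int:
    """Count how many leading hierarchy levels two category codes share."""
    depth = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        depth += 1
    return depth


def pick_sense(word: str, context: List[str]) -> Optional[str]:
    """Pick the candidate code that overlaps most with the codes of context words."""
    candidates = TOY_LEXICON.get(word)
    if not candidates:
        return None
    context_codes = [c for w in context for c in TOY_LEXICON.get(w, [])]
    return max(
        candidates,
        key=lambda cand: sum(shared_depth(cand, c) for c in context_codes),
    )


def annotate(text: str) -> ET.Element:
    """Wrap each token in a <w> element; add a 'cat' attribute when a sense is found."""
    tokens = text.lower().split()
    root = ET.Element("text")
    for i, tok in enumerate(tokens):
        w = ET.SubElement(root, "w")
        w.text = tok
        code = pick_sense(tok, tokens[:i] + tokens[i + 1:])
        if code is not None:
            w.set("cat", code)
    return root


def aggregate(code: str, levels: int) -> str:
    """Truncate a code to its first `levels` hierarchy levels for coarser grouping."""
    return ".".join(code.split(".")[:levels])


doc = annotate("workers in the union voted to strike")
print(ET.tostring(doc, encoding="unicode"))
```

In this toy setup, a search for the labour sense of "union" would match on the `cat` attribute rather than on the word form, and `aggregate` groups related senses under a shared parent code.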

Project outputs

Journal articles (10)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Impression management in the Early Modern English courtroom
"In barbarous times and in uncivilized countries" Two centuries of the evolving uncivil in the Hansard Corpus
Mapping Hansard Impression Management Strategies through Time and Space
  • DOI:
    10.1080/00393274.2017.1370981
  • Publication date:
    2017
  • Journal:
  • Impact factor:
    0.4
  • Authors:
    Archer D
  • Corresponding author:
    Archer D
The Routledge Handbook of English Language and Digital Humanities
  • DOI:
    10.4324/9781003031758
  • Publication date:
    2020-04
  • Journal:
  • Impact factor:
    0
  • Authors:
    S. Adolphs; Dawn Knight
  • Corresponding author:
    S. Adolphs; Dawn Knight
Metaphor, Popular Science, and Semantic Tagging: Distant reading with the Historical Thesaurus of English

Other publications by Marc Alexander

Presenting the Meta-Performance Test, a Metacognitive Battery based on Performance
Experiences with Parallelisation of an Existing NLP Pipeline: Tagging Hansard
Formulating and managing neighbourhood complaints: a comparative study of service provision
  • DOI:
  • Publication date:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Marc Alexander
  • Corresponding author:
    Marc Alexander
The metaphorical understanding of power and authority
  • DOI:
  • Publication date:
    2016
  • Journal:
  • Impact factor:
    0
  • Authors:
    Marc Alexander
  • Corresponding author:
    Marc Alexander
Schemata

Other grants held by Marc Alexander

The formulation and management of social problems in service provision
  • Grant number:
    ES/T008172/1
  • Fiscal year:
    2019
  • Funding amount:
    $517,800
  • Project type:
    Fellowship

Similar overseas grants

CRII: SHF: Theoretical Foundations of Verifying Function Values and Reducing Annotation Overhead in Automatic Deductive Verification
  • Grant number:
    2348334
  • Fiscal year:
    2024
  • Funding amount:
    $517,800
  • Project type:
    Standard Grant
ProtFunAI: AI based methods for functional annotation of proteins in crop genomes
  • Grant number:
    BB/Y514044/1
  • Fiscal year:
    2024
  • Funding amount:
    $517,800
  • Project type:
    Research Grant
Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods PID 7012435
  • Grant number:
    BB/X018563/1
  • Fiscal year:
    2024
  • Funding amount:
    $517,800
  • Project type:
    Research Grant
Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods
  • Grant number:
    BB/X018660/1
  • Fiscal year:
    2024
  • Funding amount:
    $517,800
  • Project type:
    Research Grant
Doctoral Dissertation Research: Discourse relation annotation in speech databases
  • Grant number:
    2336603
  • Fiscal year:
    2024
  • Funding amount:
    $517,800
  • Project type:
    Standard Grant
Prosodic Event Annotation and Detection in Three Varieties of English
  • Grant number:
    2316030
  • Fiscal year:
    2023
  • Funding amount:
    $517,800
  • Project type:
    Standard Grant
EAGER: Integrating Multi-Omics Biological Networks and Ontologies for lncRNA Function Annotation using Deep Learning
  • Grant number:
    2400785
  • Fiscal year:
    2023
  • Funding amount:
    $517,800
  • Project type:
    Standard Grant
K-mer indexing for pan-genome reference annotation
  • Grant number:
    10793082
  • Fiscal year:
    2023
  • Funding amount:
    $517,800
  • Project type:
Unsupervised Annotation of Complex 3D BioMedical Data.
  • Grant number:
    2882348
  • Fiscal year:
    2023
  • Funding amount:
    $517,800
  • Project type:
    Studentship
Connecting the universe of proteins to address annotation inequality in the microbial proteome
  • Grant number:
    10658439
  • Fiscal year:
    2023
  • Funding amount:
    $517,800
  • Project type: