Semantic Annotation and Mark Up for Enhancing Lexical Searches (SAMUELS)

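Basic Information

Grant number:
AH/L010062/1
Principal investigator:
Marc Alexander
Amount:
$517,800
Host institution:
University of Glasgow
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2014
Funding country:
United Kingdom
Start and end dates:
2014 to (end date not available)
Project status:
Completed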

Source:
https://gtr.ukri.org/projects?ref=AH%2FL010062%2F1
Keywords:
Semantic Annotation Mark Up Enhancing

Project Abstract

As humanities datasets get ever larger, researchers have a pressing need for more sophisticated techniques of analysis. The most significant issue in big data research into textual datasets is that our primary methodology for searching, aggregating and analysing them relies not on concepts or meanings, but rather on word forms. These forms are imperfect and evasive proxies for the meanings they refer to, and with 60% of word forms in English referring to more than one meaning, and some word forms referring to close to two hundred meanings, the irrelevant "noise" which appears when searching using word forms grows with the size of the texts being searched. In big data contexts, this problem cripples research, making any sort of detailed analysis entirely intractable and requiring impossible amounts of manual intervention. In this project, we will deliver a system for automatically annotating words in texts with their precise meanings, enabling a step-change in the way we deal with large textual data. The system is based around the unparalleled Historical Thesaurus of English, which contains 797,000 words from across the history of English arranged into 236,000 hierarchical categories of meanings alongside each word's dates of known use. The annotation software will take a text and provide for each word it contains an XML annotation giving the word meaning's Historical Thesaurus category code. The system will automatically disambiguate word meanings using a range of state-of-the-art computational techniques alongside new context-dependent methods unlocked by the Thesaurus's dating codes and its uniquely detailed and fine-grained hierarchical structure. Textual data tagged in this way can then be accurately searched and precisely investigated, with any results also able to be aggregated at a range of levels of precision, without the need for manual intervention.
A major part of the project is also the development of new techniques for working with semantically-aggregated and disambiguated data. Project partners will conduct research on resources including the Hansard Corpus, consisting of over 2.3 billion words of text, the Oxford English Corpus, the world's largest stratified corpus of modern English, and the EEBO-TCP corpus of 40,000 early modern books. As part of our work on changing the nature of how we deal with data on this scale, we will mine these text collections for frequently-occurring or statistically unusual concepts, will take advantage of our ability to search large datasets for terms realised by ambiguous word forms (such as "union" in the particular context of industrial relations rather than any of the other 33 possible meanings of this word), and will examine the data as a whole from a distant-reading perspective in order to look for striking or significant patterns of meaning changes across time. These research projects based on tagged data will also drive the development of our tools for using this data, with teams of researchers across the UK and abroad providing a range of different demands on the data, ensuring a variety of needs and use-cases are catered for in the development of the project. In this way, we are committed to producing a set of compelling, fruitful, and practical research outcomes using semantically-tagged data during the lifetime of the project, in order to demonstrate the value of our approach and to help ensure the work of the project is as widely utilised and exploited as possible. By doing all of this, we will enable new and transformative techniques of exploring, searching and investigating large-scale cultural, literary, historical and linguistic phenomena in big humanities datasets; through this project, it will be possible to place meaning - rather than word forms - at the heart of digital humanities research into text.
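
As a rough illustration of the annotation step described above, the following is a minimal Python sketch and not the project's actual software. Everything specific in it is an assumption made for the example: the toy lexicon, the category codes ("X.01", "X.02"), the dates of use, the cue words, and the `<w cat="...">` element and attribute names are all invented. The disambiguation shown is a deliberately simple stand-in (filter candidate senses by the text's date, then pick the sense sharing the most cue words with the context), not the state-of-the-art techniques the abstract refers to.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Sense:
    code: str        # hypothetical category code, not a real Historical Thesaurus code
    first_use: int   # first year of use (invented for this example)
    last_use: int    # last year of use (invented for this example)
    cues: frozenset  # context words hinting at this sense (invented)


# Toy lexicon: every entry is made up for illustration only.
TOY_LEXICON = {
    "union": [
        Sense("X.01", 1400, 2100, frozenset({"marriage", "wedding"})),
        Sense("X.02", 1830, 2100, frozenset({"strike", "workers", "wages"})),
    ],
}


def pick_sense(word, context, year):
    """Return the best candidate sense for a word, or None if none applies."""
    candidates = TOY_LEXICON.get(word, [])
    # Step 1: keep only senses whose date range covers the text's year.
    dated = [s for s in candidates if s.first_use <= year <= s.last_use]
    if not dated:
        return None
    # Step 2: prefer the sense sharing the most cue words with the context
    # (ties resolve to the first sense listed).
    return max(dated, key=lambda s: len(s.cues & context))


def annotate(text, year):
    """Wrap each word in a <w> element, adding a cat attribute when a sense is found."""
    tokens = re.findall(r"[A-Za-z]+", text)
    context = {t.lower() for t in tokens}
    parts = []
    for tok in tokens:
        sense = pick_sense(tok.lower(), context, year)
        attr = f" cat={quoteattr(sense.code)}" if sense else ""
        parts.append(f"<w{attr}>{escape(tok)}</w>")
    return "<text>" + " ".join(parts) + "</text>"


print(annotate("The union called a strike", 1926))
```

With the toy data above, a text dated 1926 that mentions a strike is tagged with the invented industrial-relations code, while the same sentence dated 1700 falls back to the only sense the toy lexicon lists as in use at that date. This is meant only to show how dating information can narrow the candidate meanings before context is considered.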
