Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction

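Basic information

  • Grant number:
    ES/M011348/1
  • Principal investigator:
  • Amount:
    $1.836 million
  • Host institution:
  • Host institution country:
    United Kingdom
  • Project type:
    Research Grant
  • Fiscal year:
    2016
  • Funding country:
    United Kingdom
  • Period:
    2016 to (no data)
  • Project status:
    Completed

Project abstract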

This project will create a major corpus of the Welsh language: CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes: National Corpus of Contemporary Welsh). A corpus is a principled collection of language data sampled from real-life contexts, presented as a searchable database. This will be the first corpus to represent spoken, written and electronically-mediated Welsh, and the first in any language with a functional design informed, from the outset, by representatives of all anticipated academic and community user groups. CorCenCC will provide societal, economic and academic benefits by:

  • Facilitating uses of Welsh in public, commercial, educational and governmental settings.
  • Redefining the scope, relevance and design infrastructure of corpus development methodology.

A corpus allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it 'should' be used. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, voice recognition and web search tools. Welsh has had no comprehensive corpus facility able to meet these requirements.

CorCenCC will capitalise on extensive community interest in sustaining and 'growing' Welsh, using the novel integration of crowdsourcing, a powerful data collection method which has the potential to revolutionise corpus construction. Recruited through social and broadcast media, roadshows and existing networks, Welsh speakers will record and upload their own data via a mobile app, and even contribute to data coding. This approach promises representative language across genres, language varieties (regional and social) and contexts. Traditional data collection will supplement the crowdsourcing, ensuring a representative balance of data as specified in the project targets.

Preliminary engagement with stakeholders (including a briefing event at the Senedd) generated collaboration from the Welsh Government, Welsh Language Commissioner, Welsh Joint Education Committee, Welsh for Adults, BBC, Gwasg y Lolfa press, and University of Wales Dictionary; all have identified current needs which CorCenCC can meet, and all will be represented in the project advisory group, so the corpus design is user-informed throughout. A language corpus able to inform delivery of Welsh has been called for by, for example, the National Foundation for Educational Research (2008:48) and the Welsh Government (2013:27,71). CorCenCC, with its integrated pedagogical toolkit, will impact significantly on Welsh language teaching practice, enabling data-driven, inductive learning and assessment.

CorCenCC will be open-source and publicly accessible, with user interfaces for specific groups. It will enable, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; language learners to learn from real-life models of Welsh; and researchers to investigate patterns of language use and change. In order to ensure that CorCenCC remains a sustainable, permanent and user-oriented record of language, an in-built facility will allow data to be added and moderated beyond the life of the project. The project team comprises experts in corpus linguistics, Welsh, and language pedagogy and assessment, who specialise in the application of linguistic tools to real-world issues. Working with an advisory body of stakeholder representatives, they are optimally placed to meet the project aims: creating a permanent, sustainable and fit-for-purpose record of the living language, and pioneering an approach to content generation and user-driven applications that will provide a model for future corpus creation.

Project outputs

Journal articles (10)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes - National Corpus of Contemporary Welsh): A demonstration
  • DOI:
  • Publication date:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Knight D
  • Corresponding author:
    Knight D
Creating Welsh Language Word Embeddings
  • DOI:
    10.3390/app11156896
  • Publication date:
    2021-07
  • Journal:
  • Impact factor:
    0
  • Authors:
    P. Corcoran;Geraint I. Palmer;Laura Arman;Dawn Knight;Irena Spasic
  • Corresponding author:
    P. Corcoran;Geraint I. Palmer;Laura Arman;Dawn Knight;Irena Spasic
Leveraging Pre-Trained Embeddings for Welsh Taggers
  • DOI:
    10.18653/v1/w19-4332
  • Publication date:
    2019-08
  • Journal:
  • Impact factor:
    0
  • Authors:
    I. Ezeani;S. Piao;Steven Neale;Paul Rayson;Dawn Knight
  • Corresponding author:
    I. Ezeani;S. Piao;Steven Neale;Paul Rayson;Dawn Knight
Creating pedagogical wordlists: a comparison of thematic and corpus approaches
  • DOI:
  • Publication date:
    2016
  • Journal:
  • Impact factor:
    0
  • Authors:
    Fitzpatrick T
  • Corresponding author:
    Fitzpatrick T
Introducing the Welsh Text Summarisation Dataset and Baseline Systems
  • DOI:
    10.48550/arxiv.2205.02545
  • Publication date:
    2022-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ignatius M Ezeani;Mahmoud El-Haj;Jonathan Morris;Dawn Knight
  • Corresponding author:
    Ignatius M Ezeani;Mahmoud El-Haj;Jonathan Morris;Dawn Knight

Other publications by Dawn Knight

Building a spoken corpus: what are the basics?
Multimodal Corpora
I’m having a Spring Clear Out: A Corpus-based Analysis of e-transactional Discourse
  • DOI:
    10.1093/applin/amv019
  • Publication date:
    2017
  • Journal:
  • Impact factor:
    3.6
  • Authors:
    Dawn Knight;Steve Walsh;S. Papagiannidis
  • Corresponding author:
    S. Papagiannidis
PriPA: A Tool for Privacy-Preserving Analytics of Linguistic Data
  • DOI:
  • Publication date:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jérémie Clos;Emma Mcclaughlin;Pepita Barnard;Elena Nichele;Dawn Knight;Derek McAuley;S. Adolphs
  • Corresponding author:
    S. Adolphs
1.3 Designing a National Corpus in a Minoritised Language

Other grants held by Dawn Knight

FreeTxt: supporting bilingual free-text survey and questionnaire data analysis
  • Grant number:
    AH/W004844/1
  • Fiscal year:
    2022
  • Amount:
    $1.836 million
  • Project type:
    Research Grant
Interactional variation online: harnessing emerging technologies in the digital humanities to analyse online discourse in different workplace contexts
  • Grant number:
    AH/W001608/1
  • Fiscal year:
    2021
  • Amount:
    $1.836 million
  • Project type:
    Research Grant