Corpus linguistic methods

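Basic information

Project number:
455915757
Principal investigator:
Professorin Dr. Anke Lüdeling
Funding amount:
--
Host institution:
Institut für deutsche Sprache und Linguistik
Host institution country:
Germany
Project category:
Research Units
Fiscal year:
Funding country:
Germany
Start and end dates:
Project status:
Not concluded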

Source:
https://gepris.dfg.de/gepris/projekt/455915757?language=en
Keywords:
Corpus linguistic methods

Project abstract

Project Pc is both an infrastructure and a research project within RUEG2. It is the successor to project Pd in RUEG1. On the infrastructure and support side, it will continuously integrate new and/or corrected annotations, ensure data curation and sustainability, and provide technical support and research engineering, i.e. the improvement of automatic and semi-automatic annotation of non-standard data across two modalities and, more generally, the development of tools and pipelines for information retrieval/text mining and quantitative analysis. It will also provide support and consultation in the choice and application of quantitative research methods for projects P8-P11 in RUEG2.
On the research side, it aims to advance the field of corpus linguistics in two ways: (1) through an evaluation of advanced machine learning techniques and of the feasibility and usefulness of applying them to automatic and semi-automatic annotation of, and information retrieval in, non-standard corpora of limited size; and (2) through a focus on the development, validation, evaluation, and epistemological embedding of methods for the RUEG corpus specifically, as well as for small and mid-sized corpora in general. The RUEG corpus is mid-sized, very well controlled in terms of topic, structure, setting, and participants' backgrounds, and enriched with ample metadata. It therefore offers the chance to deeply understand, annotate, and analyze the full data set in a collaborative effort of the whole research group. It is in fact one of the few corpora that allow for variationist analyses across samples from different production situations and modes, speaker groups, age groups, and two languages recorded for each speaker. However, the trade-off for capturing this complexity is a smaller sample size for each group, which typically does not reach the representativeness that frequentist statistics would require.
Since no existing set of quantitative techniques yields reliable results for smaller corpora beyond reasonable doubt, methodological development is crucial to the quantitative study of the RUEG data. At the same time, RUEG is unusually well suited as a testing ground for the evaluation of methods. It thus offers exceptional synergetic potential for the development of corpus-linguistic methods overall. Pc will investigate and evaluate several promising techniques: a) the applicability (including the validity, reliability, and explanatory power) of mixed-effects models (MEMs); b) two frameworks that are currently almost unused in core linguistics, namely graph theory (or network analysis) and Bayesian statistics, but that show promising results in other quantitative fields; and c) the application of machine learning techniques for knowledge gain (rather than for text mining objectives, for which they are currently mainly used in computational linguistics).
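
The small-sample concern behind the interest in Bayesian statistics can be illustrated with a minimal sketch. It is not taken from the project: the counts, the function name `beta_binomial_posterior_mean`, and the uniform Beta(1, 1) prior are all assumptions chosen for illustration. The sketch compares the raw proportion of a variant with its posterior mean under a conjugate Beta prior, for a very small group and for a larger group with the same raw proportion.

```python
from fractions import Fraction


def beta_binomial_posterior_mean(successes: int, trials: int,
                                 prior_a: int = 1, prior_b: int = 1) -> Fraction:
    """Posterior mean of a proportion under a Beta(prior_a, prior_b) prior.

    With a binomial likelihood, the posterior is
    Beta(prior_a + successes, prior_b + trials - successes),
    whose mean is (successes + prior_a) / (trials + prior_a + prior_b).
    """
    if trials < 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + prior_a, trials + prior_a + prior_b)


# Hypothetical counts (not RUEG data): uses of some variant out of all
# relevant tokens in one small group and in one larger group.
groups = {"small group": (3, 4), "large group": (300, 400)}

for name, (successes, trials) in groups.items():
    raw = Fraction(successes, trials)
    posterior = beta_binomial_posterior_mean(successes, trials)
    print(f"{name}: n={trials}, raw={float(raw):.3f}, "
          f"posterior mean={float(posterior):.3f}")
```

Both groups have a raw proportion of 0.750, but the estimate for the small group is pulled noticeably toward the prior mean of 0.5 (to about 0.667), while the estimate for the large group barely moves (about 0.749). Under these assumptions, the posterior makes the weaker evidence in the small group explicit, which a bare proportion does not.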
