
EVIDENCE: computer-assisted interactive extraction of good dictionary examples from large corpora


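Basic information

Grant number:
433249742
Principal investigator:
Privatdozent Dr. Alexander Geyken
Amount:
--
Host institution:
Fachgebiet Ubiquitäre Wissensverarbeitung
Host institution country:
Germany
Project category:
Research data and software (Scientific Library Services and Information Systems)
Fiscal year:
2019
Funding country:
Germany
Project period:
2018-12-31 to 2023-12-31
Project status:
Completed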

Source:
https://gepris.dfg.de/gepris/projekt/433249742?language=en
Keywords:
EVIDENCE computer assisted interactive extraction

Project abstract

The project will bring together computer scientists and lexicographers to solve a lexicographical problem: the identification and extraction of good examples from a large set of corpus examples. Machine learning will be applied to help lexicographers select good examples from corpora for inclusion in dictionary articles. It should facilitate the lexicographers' task by ranking the examples according to their measured quality, thereby directing the lexicographers' attention to the best examples. Since the quality and appropriateness of corpus examples are not well-defined features, unanimous judgment cannot be achieved even among professional lexicographers. With interactive learning, we plan to train an adaptive machine learning model on preferences, which we assume are more consistent across lexicographers, since they are more likely to agree that example 1 is better than example 2 than to agree on explicit scores for both examples. Furthermore, it is planned to acquire and integrate the judgment of dictionary users (i.e. informed lay persons) on sets of ranked good examples. The outcome of the project will be a system for the extraction, classification, and ranking of corpus examples. This system will initially be tested in the context of the DWDS (Digitales Wörterbuch der deutschen Sprache), where it will support the lexicographers in their daily work. It is expected that for each headword the final system will present a set of good examples that are sufficiently diverse to illustrate various facets of the real use of this word. Furthermore, it will generate additional value for non-expert dictionary users, as it will also supply good examples for headwords that have not yet received full lexical treatment. The new system will allow any user to provide feedback on the quality of examples, which the system then uses to learn. In teaching, for example, students no longer only consume but actively participate in the development of a lexicographic resource. The project will also organize workshops to acquire early adopters and to gather feedback from the community. Thus, the proposed method and its application will be useful for other dictionary projects, as they are language-independent and easy to integrate into current state-of-the-art systems for lexicography.
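
The idea of learning from "example 1 is better than example 2" judgments rather than explicit scores can be illustrated with a small sketch. The abstract does not say which model the project uses; the code below assumes a Bradley-Terry style model fitted by gradient ascent, purely as an illustration. The function names, the learning rate, the regularization constant, and the placeholder example labels are all hypothetical.

```python
import math


def fit_preference_scores(n_items, comparisons, iterations=200, lr=0.1, l2=0.01):
    """Estimate one latent quality score per item from pairwise preferences.

    comparisons is a list of (preferred_index, other_index) pairs.
    Assumption: a Bradley-Terry style model, where the probability that
    item a is preferred over item b is sigmoid(score[a] - score[b]).
    """
    scores = [0.0] * n_items
    for _ in range(iterations):
        grads = [0.0] * n_items
        for preferred, other in comparisons:
            # Probability the current scores assign to the observed preference.
            p = 1.0 / (1.0 + math.exp(scores[other] - scores[preferred]))
            # Gradient of the log-likelihood for this single comparison.
            grads[preferred] += 1.0 - p
            grads[other] -= 1.0 - p
        for i in range(n_items):
            # The small L2 penalty keeps scores finite for items that never lose.
            scores[i] += lr * (grads[i] - l2 * scores[i])
    return scores


def rank_examples(examples, comparisons):
    """Return the examples ordered from highest to lowest estimated quality."""
    scores = fit_preference_scores(len(examples), comparisons)
    order = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return [examples[i] for i in order]


# Hypothetical judgments: each pair says "first index preferred over second".
examples = ["example A", "example B", "example C"]
judgments = [(0, 1), (0, 1), (1, 2), (0, 2)]
ranking = rank_examples(examples, judgments)
print(ranking)
```

In an interactive setting, each new judgment from a lexicographer or dictionary user would be appended to the list of comparisons and the scores refitted, so the ranking shown to the next user reflects the feedback collected so far.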
