Natural Language Data Management

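Basic Information

Grant number:
RGPIN-2018-04683
Principal investigator:
Rafiei, Davood
Amount:
$20,400
Host institution:
University of Alberta
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2018
Funding country:
Canada
Period:
2018-01-01 to 2019-12-31
Project status:
Completed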

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=662877
Keywords:
Natural Language Data Management

Project Abstract

A large volume of the data generated every day is in some form of natural language intended for human consumption; this includes news articles, blog posts, tweets, scientific articles, Wikipedia entries, financial reports, etc. However, our querying capabilities over this data have remained largely limited to keyword search, which reduces the efficiency of the search and the scope of the information that can be retrieved. Specifically, a keyword search is not sufficient when the search is not limited to selection and involves join and set operations over the contents of the documents. Additionally, a keyword search is not very applicable when the granularity of the search result is smaller than a document.

This proposal advances the research in querying natural language data and the study of issues that hinder querying and managing such data (both structured and unstructured) in documents. The particular challenges to be studied are: (1) storage and indexing, (2) querying and query processing, and (3) data integration and aggregation.

(1) Standard text-based indices such as the inverted index often ignore the structure of natural language data and will not provide the best support for queries. A storage system for natural language data may track both the ordering and the lexical relations between words and between senses to better support certain classes of queries. For example, synonymy and hyponymy relationships may indicate a degree of locality, in the sense that a document that matches a word is likely to match the synonyms and hyponyms of the word as well.

(2) Natural language data may be stored in and queried using a relational database, but composing queries over the data can be cumbersome, and relational systems may not provide the best support for the queries. A natural language data management system is expected to be geared towards the needs of applications that use natural language data by providing native support and treating natural language data as a first-class citizen. In particular, natural language data may be transformed to a meaning representation to better support reasoning and entailment detection and for integration with other sources. The querying system can then provide some support for these transformations; the querying system can also utilize the known relationships between fragments (e.g., distributional similarity) in both evaluating the queries and optimizing their evaluation.

(3) Natural language data that resides in different sources can refer to the same entities differently; even the references within the same source can be ambiguous if taken out of their contexts. Ambiguities introduce problems in integrating and aggregating data from multiple sources. Despite the progress in the area of entity resolution, many challenges remain. We will work toward addressing the challenges related to querying by exploiting new developments in knowledge bases and linked data.
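
As an illustration of challenge (1), the following is a minimal sketch of an inverted index that records word positions (ordering) and expands a query term with its synonyms and hyponyms. It is not the project's implementation: the function names, the toy lexicon in `SYNONYMS` and `HYPONYMS`, and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy lexicon (assumption); a real system might draw these
# relations from a lexical resource instead of hard-coding them.
SYNONYMS = {"car": {"automobile"}}
HYPONYMS = {"car": {"sedan", "hatchback"}}


def build_index(docs):
    """Map each word to {doc_id: [positions]} so word order is preserved."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def expand(word):
    """Return the word together with its listed synonyms and hyponyms."""
    return {word} | SYNONYMS.get(word, set()) | HYPONYMS.get(word, set())


def search(index, word):
    """Return sorted ids of documents matching the word or any expansion."""
    hits = set()
    for term in expand(word.lower()):
        # .get avoids creating empty entries for unseen terms.
        hits.update(index.get(term, {}))
    return sorted(hits)


# Sample documents (assumption, for illustration only).
docs = {
    1: "the car was parked outside",
    2: "an automobile needs regular service",
    3: "she bought a sedan last year",
    4: "the train left on time",
}
index = build_index(docs)
print(search(index, "car"))    # [1, 2, 3]
print(search(index, "train"))  # [4]
```

A plain keyword lookup for "car" would return only document 1; the expansion also retrieves documents 2 and 3, which is the kind of locality between a word and its synonyms and hyponyms that the abstract describes.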
