RealPDBs: Realistic Data Models and Query Compilation for Large-Scale Probabilistic Databases
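Basic information
- Grant number: EP/R013667/1
- Principal investigator:
- Amount: $995,500
- Host institution:
- Host institution country: United Kingdom
- Project type: Research Grant
- Fiscal year: 2017
- Funding country: United Kingdom
- Duration: 2017 to (no data)
- Project status: Completed
- Source:
- Keywords:
Project abstract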
In recent years, there has been strong interest in academia and industry in building large-scale probabilistic knowledge bases from data in an automated way, which has resulted in a number of systems, such as DeepDive, NELL, Yago, Freebase, Microsoft's Probase, and Google's Knowledge Vault. These systems continuously crawl the Web and extract structured information, and thus populate their databases with millions of entities and billions of tuples. To what extent can these search and extraction systems help with real-world use cases? This turns out to be an open-ended question. For example, DeepDive is used to build knowledge bases for domains such as paleontology, geology, medical genetics, and human movement. From a broader perspective, the quest for building large-scale knowledge bases serves as a new dawn for artificial intelligence research. Fields such as information extraction, natural language processing (e.g., question answering), relational and deep learning, knowledge representation and reasoning, and databases are taking the initiative towards a common goal. Querying large-scale probabilistic knowledge bases is commonly regarded as being at the heart of these efforts.
Beyond all these success stories, however, probabilistic knowledge bases still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which seriously limits their potential applications in practice. These problems are rooted in the semantics of (tuple-independent) probabilistic databases, which are used for encoding most probabilistic knowledge bases. For reasons of computational efficiency, probabilistic databases are typically based on strong, unrealistic completeness assumptions, such as the closed-world assumption, the tuple-independence assumption, and the lack of commonsense knowledge.
These strong, unrealistic assumptions not only lead to unwanted consequences, but also put probabilistic databases on a weak footing in terms of knowledge base learning, completion, and querying. More specifically, each of the above systems encodes only a portion of the real world, and this description is necessarily incomplete; these systems continuously crawl the Web, encounter new sources, and consequently new facts, which they then add to their databases. However, when it comes to querying, most of these systems employ the closed-world assumption, i.e., any fact that is not present in the database is assigned the probability 0, and is thus assumed to be impossible. As a closely related problem, it is common practice to view every extracted fact as an independent Bernoulli variable, i.e., any two facts are probabilistically independent. For example, the fact that a person starred in a movie is treated as independent of the fact that this person is an actor, which is in conflict with the fundamental nature of the knowledge domain. Furthermore, current probabilistic databases lack commonsense knowledge (in particular, ontological knowledge), which can often be exploited in reasoning to deduce implicit consequences from data, and which is often essential for querying large-scale probabilistic databases in an uncontrolled environment such as the Web.
The main goal of this proposal is to enhance large-scale probabilistic databases (and so to unlock their full data modelling potential) with more realistic data models, while preserving their computational properties. We are planning to develop different semantics for the resulting probabilistic databases and analyse their computational properties and sources of intractability.
We are also planning to design practical, scalable query answering algorithms for them, especially algorithms based on knowledge compilation techniques, both extending existing knowledge compilation approaches and developing new ones based on tensor factorisation and neural-symbolic knowledge compilation. We will also produce a prototype implementation and experimentally evaluate the proposed algorithms.
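The two assumptions criticised in the abstract, closed-world and tuple-independence, can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the project: the toy facts, their probabilities, and the function names are all hypothetical. It computes the probability of a conjunction of facts as a product of marginals (tuple independence), assigns probability 0 to any fact missing from the database (closed world), and cross-checks the product against a brute-force enumeration of possible worlds.

```python
from itertools import product

# Hypothetical toy database: each fact maps to its marginal probability.
FACTS = {
    ("StarredIn", "alice", "movie1"): 0.9,
    ("Actor", "alice"): 0.6,
}

def fact_probability(fact, facts=FACTS, default=0.0):
    """Closed-world lookup: a fact absent from the database gets `default`."""
    return facts.get(fact, default)

def conjunction_probability(query_facts, facts=FACTS, default=0.0):
    """Probability that all distinct query facts hold, assuming independence."""
    prob = 1.0
    for fact in set(query_facts):
        prob *= fact_probability(fact, facts, default)
    return prob

def conjunction_by_enumeration(query_facts, facts=FACTS):
    """Same quantity by summing the weights of all possible worlds
    (exponential in the number of facts, for illustration only)."""
    items = list(facts.items())
    total = 0.0
    for choices in product([True, False], repeat=len(items)):
        world = {fact for (fact, _), keep in zip(items, choices) if keep}
        weight = 1.0
        for (_, p), keep in zip(items, choices):
            weight *= p if keep else 1.0 - p
        if all(q in world for q in query_facts):
            total += weight
    return total

starred = ("StarredIn", "alice", "movie1")
actor = ("Actor", "alice")
unknown = ("Actor", "bob")  # never extracted, so absent from FACTS

p_both = conjunction_probability([starred, actor])
p_both_worlds = conjunction_by_enumeration([starred, actor])
p_unknown_closed = conjunction_probability([unknown])
# Hypothetical open-world variant: unseen facts get a small default instead of 0.
p_unknown_open = conjunction_probability([unknown], default=0.1)
```

In this sketch the database treats "alice starred in movie1" and "alice is an actor" as unrelated events, so the conjunction gets the plain product of the two marginals, and the unseen fact about "bob" is impossible unless a non-zero default is supplied. The `default` parameter is only a stand-in for the idea of relaxing the closed-world assumption; it is an assumption of this example, not the semantics the project proposes.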
Project outputs
Journal articles (10)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs
- DOI: 10.4230/lipics.icdt.2020.5
- Publication date: 2019-10
- Journal:
- Impact factor: 0
- Authors: Antoine Amarilli; I. Ceylan
- Corresponding authors: Antoine Amarilli; I. Ceylan
An ontology-based deep learning approach for triple classification with out-of-knowledge-base entities
- DOI: 10.1016/j.ins.2021.02.018
- Publication date: 2021-02
- Journal:
- Impact factor: 0
- Authors: Elvira Amador-Domínguez; E. Serrano; Daniel Manrique; Patrick Hohenecker; Thomas Lukasiewicz
- Corresponding authors: Elvira Amador-Domínguez; E. Serrano; Daniel Manrique; Patrick Hohenecker; Thomas Lukasiewicz
Approximate weighted model integration on DNF structures
- DOI: 10.1016/j.artint.2022.103753
- Publication date: 2022
- Journal:
- Impact factor: 14.4
- Authors: Abboud R
- Corresponding authors: Abboud R
Combining RDF and SPARQL with CP-theories to reason about preferences in a Linked Data setting
- DOI: 10.3233/sw-180339
- Publication date: 2020-04
- Journal:
- Impact factor: 3
- Authors: V. W. Anelli; R. Leone; T. D. Noia; Thomas Lukasiewicz; Jessica Rosati
- Corresponding authors: V. W. Anelli; R. Leone; T. D. Noia; Thomas Lukasiewicz; Jessica Rosati
The Dichotomy of Evaluating Homomorphism-Closed Queries on Probabilistic Graphs
- DOI: 10.46298/lmcs-18(1:2)2022
- Publication date: 2022
- Journal:
- Impact factor: 0.6
- Authors: Amarilli A
- Corresponding authors: Amarilli A
Other publications by Thomas Lukasiewicz
Uncertainty Representation and Reasoning in the Semantic Web
- DOI: 10.4018/978-1-60566-112-4.ch013
- Publication date: 2008
- Journal:
- Impact factor: 0
- Authors: Thomas Lukasiewicz
- Corresponding author: Thomas Lukasiewicz
Hybrid Deep-Semantic Matrix Factorization for Tag-Aware Personalized Recommendation
- DOI: 10.1109/icassp40776.2020.9053044
- Publication date: 2017
- Journal:
- Impact factor: 0
- Authors: Zhenghua Xu; Cheng Chen; Thomas Lukasiewicz; Yishu Miao
- Corresponding author: Yishu Miao
Ontology-Mediated Query Answering over Log-Linear Probabilistic Data
- DOI:
- Publication date: 2019
- Journal:
- Impact factor: 0
- Authors: Stefan Borgwardt; I. Ceylan; Thomas Lukasiewicz
- Corresponding author: Thomas Lukasiewicz
Complexity Results for Preference Aggregation over (m)CP-nets: Pareto and Majority Voting
- DOI:
- Publication date: 2018
- Journal:
- Impact factor: 14.4
- Authors: Thomas Lukasiewicz; Enrico Malizia
- Corresponding author: Enrico Malizia
Complexity results for preference aggregation over (m)CP-nets: Max and rank voting
- DOI:
- Publication date: 2021
- Journal:
- Impact factor: 14.4
- Authors: Thomas Lukasiewicz; Enrico Malizia
- Corresponding author: Enrico Malizia
Other grants by Thomas Lukasiewicz
PrOQAW: Probabilistic Ontological Query Answering on the Web
- Grant number: EP/J008346/1
- Fiscal year: 2012
- Funding amount: $995,500
- Project type: Research Grant
Similar overseas grants
Advancing Bio-Realistic Modeling via the Brain Modeling ToolKit and SONATA Data Format
- Grant number: 10306896
- Fiscal year: 2021
- Funding amount: $995,500
- Project type:
Advancing Bio-Realistic Modeling via the Brain Modeling ToolKit and SONATA Data Format
- Grant number: 10477439
- Fiscal year: 2021
- Funding amount: $995,500
- Project type:
Collaborative Research: EAGER: SaTC-EDU: Safeguarding STEM Education and Scientific Knowledge in the Age of Hyper-Realistic Data Generated Using Artificial Intelligence
- Grant number: 2039613
- Fiscal year: 2020
- Funding amount: $995,500
- Project type: Standard Grant
RAPID: Combining Big Data in Transportation with Hospital Health Data to Build Realistic "Flattening the Curves" Models during the COVID-19 Outbreak
- Grant number: 2027678
- Fiscal year: 2020
- Funding amount: $995,500
- Project type: Standard Grant
Collaborative Research: EAGER: SaTC-EDU: Safeguarding STEM Education and Scientific Knowledge in the Age of Hyper-Realistic Data Generated Using Artificial Intelligence
- Grant number: 2039614
- Fiscal year: 2020
- Funding amount: $995,500
- Project type: Standard Grant
Collaborative Research: EAGER: SaTC-EDU: Safeguarding STEM Education and Scientific Knowledge in the Age of Hyper-Realistic Data Generated Using Artificial Intelligence
- Grant number: 2039612
- Fiscal year: 2020
- Funding amount: $995,500
- Project type: Standard Grant
Data-Driven Computation of Lagrangian Transport Structure in Realistic Flows
- Grant number: 1821145
- Fiscal year: 2018
- Funding amount: $995,500
- Project type: Continuing Grant
Build/enhance a network simulator to provide realistic data to aid in system testing and demonstrat
- Grant number: 513547-2017
- Fiscal year: 2017
- Funding amount: $995,500
- Project type: Experience Awards (previously Industrial Undergraduate Student Research Awards)
Biologically realistic extensions to artificial neural networks for integrative pattern recognition from multiple data sources
- Grant number: 476267-2015
- Fiscal year: 2017
- Funding amount: $995,500
- Project type: Postgraduate Scholarships - Doctoral













