基于Spark的大图数据最优子模式匹配查询方法研究

批准号:
61502258
项目类别:
青年科学基金项目
资助金额:
20.0 万元
负责人:
彭云
依托单位:
学科分类:
F0202.系统软件、数据库与工业软件
结题年份:
2018
批准年份:
2015
项目状态:
已结题
项目参与者:
蔡冠球、赵华伟、张路、葛程程、杨涛
国基评审专家1V1指导 中标率高出同行96.8%
结合最新热点,提供专业选题建议
深度指导申报书撰写,确保创新可行
指导项目中标800+,快速提高中标率
微信扫码咨询
中文摘要
近年来,包含海量节点的大图数据得到了广泛的应用,如知识图谱和社交网络等。模式匹配查询是这些大图数据上应用非常广泛的查询方式。但现有的模式匹配查询方法在查询的查全率、多种模式约束的整合、多查询优化和分布式求解方法的可扩展性方面还存在一些不足。因此本项目提出基于新一代分布式计算框架Spark且支持多种模式约束和多查询优化的最优子模式匹配查询方法。该方法能在保证查准率的同时提高查全率;支持多种模式约束,有较广的应用范围;利用多个查询之间的共性进行优化,在多查询并发时整体的查询处理效率较高;基于更适合模式匹配计算的Spark框架,其分布式求解方法有较高的可扩展性。本项目不仅能为当前的大图数据应用提供一种有效且快速的查询方法,而且能进一步丰富大图数据模式匹配的理论,具有重要的理论和现实意义。本项目的研究成果还能为大图数据检索、挖掘等相关领域的实践与应用提供理论指导和方法借鉴。
英文摘要
Recently, massive graph-structured data with billions of nodes has been adopted in several applications, such as knowledge graphs and social networks, etc. Pattern matching query is one of the most widely used approach to retrieve these graphs. Given a pattern graph Q and a data graph G, pattern matching query is to find a subgraph of G satisfying the pattern constrains of Q. ..However, existing pattern matching techniques have at least four technical challenges to effectively and efficiently retrieve the data graphs. Firstly, these queries require that all nodes of Q must match to nodes of G. But, such complete matching is often unachievable due to the great data incompleteness in G. Secondly, most existing works study pattarn matching query with a certain pattern constrain, which can only support a certain application scenario. They cannot efficiently support the applications with diverse scenarios. Thirdly, there are often many queries submitted to be processed concurrently in real applications, and the queries usually have common expressions. But, most existing works just focus on processing a single query. It is under-optimized for the real applications. Fourthly, most existing distributed pattern matching methods are based on the traditional distributed frameworks, such as MapReduce and Vertex-oriented framework. But, these frameworks do not suit the characteristics of pattern matching computation and hence the methods cannot scale to the ever-growing massive graphs. ..To address these challenges, this project studies the optimal subpattern matching query with multi-query optimization using Spark. The novelties of this research are as follows. Firstly, the optimal subpattern matching query allows incomplete matching and therefore is of higher effectiveness on the data graphs having great data incompleteness. Secondly, this project integrates the processing of multiple pattern constrains. Therefore, our technique can efficiently support the applications with diverse scenarios. Thirdly, this project studies multi-query optimization techniques, which can significantly improve the performance of the query processing in real applications. Fourthly, this project adopts the recently proposed and well-accepted Spark framework. It is more suitable to the characteristics of pattern matching computation and can achieve higher scalability. ..The research results of this project can provide an effective and efficient way for retrieving real massive data graphs, and have great theoretical and practical significance. The results can also guide the practices and applications of pattern matching techniques to massive graph retrieval and mining.
本项目研究一种具有较高查全率同时保证查准率、支持多种模式约束、支持多查询优化、基于新一代分布式计算框架Spark具有较高可扩展性的最优子模式匹配查询方法。主要研究内容包括:1)研究单个最优子模式匹配查询的处理方法。定义最优子模式匹配查询,分析其计算复杂性,并研究它的近似算法。该算法不仅要支持多种模式约束,还要有较好的近似度和较低阶的时间复杂度。研究单个最优子模式匹配查询最优执行计划的搜索,包括搜索算法和选择度估计方法。2)研究多个最优子模式匹配查询的优化方法。根据最优子模式匹配查询的性质,研究公共子查询的定义及其提取算法。研究一个成本模型来衡量一个整体执行计划的优劣。基于该成本模型,研究最优整体执行计划的搜索算法。研究将多个查询进行分组并对各组独立地进行多查询优化,以进一步提高多查询优化的效果。3)研究它们在Spark框架下的分布式算法。对前述的最优子模式匹配查询算法及其多查询优化,研究如何使用Spark所支持的操作来设计分布式算法。研究利用Spark的DAG任务调度机制和内存持久化机制来最小化Spark算法的数据混洗。研究将多个分布式计算任务进行合并,以进一步提高计算效率。在本项目的资助下,研究团队取得了11项研究成果。在国内外权威期刊和国际知名学术会议共发表学术论文8篇,其中在SCI 检索的期刊中发表论文3 篇,包括2篇论文发表在国际权威学术期刊《IEEE Transactions on Knowledge and Data Engineering》和《ACM Transactions on Intelligent Systems and Technology》,2篇论文发表在国际权威学术会议ICDE 2016和ICDE 2018中;授权1项美国发明专利、1项中国发明专利、1项软件著作权。培养硕士生2人。各项指标均完成了预期目标。本项目的研究可以为当前的大图数据应用提供一种具有较高查全率和更广的应用范围、支持多查询优化且可扩展性更高的查询方法,能进一步丰富和完善大图数据的模式匹配和分布式计算的基本理论,具有重要的理论和现实意义。本项目的研究成果还能为大图数据检索、挖掘等相关领域的实践与应用提供理论指导和方法借鉴。
期刊论文列表
专著列表
科研奖励列表
会议论文列表
专利列表
DOI:10.1007/s11704-019-7235-0
发表时间:2019-12
期刊:Frontiers of Computer Science
影响因子:4.2
作者:Yun Peng;Yitong Xu;Huawei Zhao;Zhizheng Zhou;Huimin Han
通讯作者:Huimin Han
DOI:10.1145/2960408
发表时间:2016-12-01
期刊:ACM TRANSACTIONS ON INFORMATION SYSTEMS
影响因子:5.6
作者:Wang, Shuaiqiang;Huang, Shanshan;Veijalainen, Jari
通讯作者:Veijalainen, Jari
Human-Powered Data Cleaning for Probabilistic Reachability Queries on Uncertain Graphs
不确定图上概率可达性查询的人力数据清理
DOI:10.1109/tkde.2017.2684166
发表时间:2017-07
期刊:IEEE Transactions on Knowledge and Data Engineering
影响因子:8.9
作者:Xin Lin;Yun Peng;Byron Choi;Jianliang Xu
通讯作者:Jianliang Xu
国内基金
海外基金
