Over the past 30 years, the world of data management has undergone a dramatic evolution, allowing us to consider an increasingly wide range of data types, sources, and processing models. This has also led to the collection of tremendous volumes of data, and we are now at a point where almost every aspect of our lives is data-driven. In this talk, Prof. Tova Milo will share her perspective on some of the "hot topics" that she has been personally involved with in this evolution, and discuss why certain research topics become popular and when they lose their relevance. She will furthermore argue that we are now facing a turning point where the deluge of data has become a serious risk, and that the next "hot" research agenda should focus on data disposal.
Despite advances in storage technology, the amount of data generated is projected to surpass storage production by an order of magnitude by 2025, and uncontrolled data retention further poses significant risks to security and privacy. She will discuss the logical, algorithmic, and methodological foundations necessary for the systematic disposal of large-scale data, highlighting new research challenges and the potential for reusing existing techniques. She will also share insights from the research conducted by the Tel Aviv Databases group in this direction. Ultimately, managing data effectively while respecting storage, processing, and regulatory constraints is a significant challenge, and she looks forward to exploring this topic further in her talk.
Modern data analytics requires iteration, yet relational database engines are mostly optimized for non-recursive queries. SQL supports only a limited form of recursion. A better formalism for recursive queries is datalog, which has some elegant properties (recursion always terminates) and led to the development of two powerful optimization techniques: semi-naive evaluation and magic set rewriting. But standard datalog is restricted to monotone queries over sets and does not support aggregates, which has limited its adoption.
In this talk I will describe a new approach to recursive queries and their optimization. First, we extend datalog to semirings, while preserving some of the elegant properties of datalog and also supporting aggregates naturally. Then, I will describe a simple yet very powerful optimization rule, called the FGH rule, that rewrites a recursive program into a different recursive program. The rule captures many optimizations discussed in the literature, such as magic set optimization, the PreM rule, and semi-naive evaluation, as well as new semantic optimizations. Our implementation of the FGH rule is based on the egg term rewriting engine and the Rosette program synthesizer.
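To make the semi-naive idea concrete, here is a minimal sketch in Python of semi-naive evaluation for transitive closure, a standard textbook instance (the rule set and the nested-loop join are illustrative, not the talk's implementation):
```python
# Semi-naive evaluation of:  path(x, y) :- edge(x, y).
#                            path(x, z) :- path(x, y), edge(y, z).
def transitive_closure(edges):
    path = set(edges)    # all facts derived so far
    delta = set(edges)   # facts that are new since the last iteration
    while delta:
        # Join only the *new* facts with edge, rather than re-deriving
        # everything from scratch -- this is the semi-naive optimization.
        derived = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = derived - path   # keep only genuinely new facts
        path |= delta
    return path

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```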
The scientific field of geospatial artificial intelligence (GeoAI) was recently formed by combining innovations in spatial science with the rapid growth of methods in artificial intelligence (AI), particularly machine learning (e.g., deep learning), data mining, and high-performance computing, to glean meaningful information from spatial big data. GeoAI is an interdisciplinary field spanning several scientific subjects, including computer science, engineering, statistics, and spatial science. In this seminar talk, we will explore research work currently under way at Ordnance Survey and partner universities to unlock the potential of GeoAI in the connected and autonomous vehicle (CAV) sector.
Speaker Bio: Dr Stefano Cavazzi is a Principal Innovation and Research Scientist at Ordnance Survey, where he leads the development of GIS and geomatics research programmes. He holds a PhD from Cranfield University and has spent the last fifteen years working in the geospatial sector, in both academia and industry, specialising in geospatial data science. At the Geo IoT World Awards 2018, a team led by Stefano won the GeoData Intelligence award for developing an autonomous decision-making support system for emergency responders.
SQL-on-Hadoop, NewSQL and NoSQL databases provide semi-structured data models (typically JSON-based). They are now moving towards declarative, SQL-like query languages. However, their idiomatic, non-SQL language constructs, the many variations, and the lack of formal syntax and semantics pose problems. Notably, database vendors end up with unclear semantics and complicated implementations as they add one feature at a time.
The presented SQL++ semi-structured data model bridges semi-structured data and the SQL data model. The SQL++ query language aims at backward compatibility with SQL. We show that a relatively small set of SQL restriction removals and feature additions is enough to provide a SQL-compatible extension to semi-structured data. SQL++ is currently being adopted by the industry.
The extension to Configurable SQL++ includes configuration options that describe different choices of language semantics and formally capture the variations among existing database languages. Configurable SQL++ is unifying: by appropriate choices of configuration options, the Configurable SQL++ semantics can morph into the semantics of any of eleven popular semi-structured databases that we surveyed, as the experimental validation shows. In this way, Configurable SQL++ allows a formal characterization of the capabilities of the emerging query languages.
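As a rough illustration of how configuration options can parameterize semantics (a hypothetical sketch, not the actual SQL++ formalism), consider how different semi-structured databases disagree on comparing a missing attribute:
```python
# A hypothetical config option choosing what equality on a MISSING
# attribute returns, mimicking how semi-structured databases diverge
# on such corner cases. Names here are illustrative only.

MISSING = object()  # stand-in for an absent attribute

def eval_eq(left, right, config):
    if left is MISSING or right is MISSING:
        if config["missing_eq"] == "null":    # comparison yields NULL/unknown
            return None
        if config["missing_eq"] == "false":   # comparison is simply false
            return False
        if config["missing_eq"] == "error":   # comparison raises an error
            raise ValueError("cannot compare MISSING")
    return left == right

# The same query text behaves differently under different configurations:
print(eval_eq(MISSING, 1, {"missing_eq": "null"}))   # None
print(eval_eq(MISSING, 1, {"missing_eq": "false"}))  # False
```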
Short Bio: Yannis Papakonstantinou is a Senior Principal Scientist at Amazon Web Services and Professor (on leave) of Computer Science and Engineering at the University of California, San Diego. His research is at the intersection of data management technologies and the web, where he has published over one hundred research articles and received over 15,000 citations. He has given multiple tutorials and invited talks, has served on journal editorial boards, and has chaired and participated in program committees for many international conferences and workshops. He also co-founded and taught for UCSD's Master of Advanced Studies in Data Science.
We discuss the complexity of enumerating the answers of Unions of Conjunctive Queries (UCQs), and focus on the ability to list output tuples with constant delay following linear-time preprocessing. A known dichotomy classifies the self-join-free CQs into those that admit such enumeration, and those that do not. However, this classification no longer holds in the common case where the database exhibits dependencies among attributes. In such cases, some queries classified as hard are in fact tractable, and we generalize the dichotomy to accommodate Functional Dependencies (FDs). Next, we aspire to have a similar characterization for UCQs, even when there are no FDs. It was claimed in the past that a UCQ is hard if one of its queries is hard. We examine the task of enumerating UCQs, and show that some unions containing a hard query are tractable. In fact, a UCQ may be tractable even if it does not contain a single tractable CQ.
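For intuition, here is a minimal Python sketch of what "linear-time preprocessing followed by constant-delay enumeration" means, on the simple free-connex query Q(x, y) :- R(x), S(x, y) (illustrative only; the talk's dichotomies concern general CQs and UCQs):
```python
from collections import defaultdict

def preprocess(R, S):
    # Linear time: index S by its first attribute, and keep only those
    # x in R that have at least one S-partner, so enumeration never stalls.
    s_index = defaultdict(list)
    for (x, y) in S:
        s_index[x].append(y)
    live = [x for x in R if x in s_index]
    return live, s_index

def enumerate_answers(live, s_index):
    # Every indexed list is non-empty, so each answer is emitted after
    # O(1) work: constant delay between consecutive outputs.
    for x in live:
        for y in s_index[x]:
            yield (x, y)

live, idx = preprocess({1, 2, 3}, {(1, "a"), (1, "b"), (3, "c")})
print(sorted(enumerate_answers(live, idx)))  # [(1, 'a'), (1, 'b'), (3, 'c')]
```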
Short Bio: Nofar is a PhD student in the Data and Knowledge group at the Technion, advised by Prof. Benny Kimelfeld. She is currently a visiting researcher in the FDB group at Oxford. Her research focuses on query optimization using enumeration techniques. Nofar completed her BSc in 2015 in the Lapidim excellence program of the Computer Science department of the Technion.
Information Extraction (IE) is the task of extracting structured information from textual data. We explore a programming paradigm that is supported by several IE systems, where relations extracted by atomic extractors undergo a relational manipulation. In our efforts toward achieving a better understanding of IE systems, we study the computational complexity of queries in the framework of document spanners (spanners, for short) that was introduced by Fagin et al. A spanner is a function that extracts from a document (string) a relation over text intervals, called spans, using either atomic extractors or a relational algebra query on top of these extractors. Evaluating a spanner on a document is a computational problem that involves three components: the atomic extractors, the relational structure of the query, and the document. We investigate the complexity of this problem from various angles, each keeping some components fixed and regarding the rest as input.
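As a toy illustration of the model, the sketch below builds atomic extractors from regular expressions and applies a simple relational operation on the extracted spans (an assumed encoding of spans as [begin, end) pairs, not the formal spanner machinery):
```python
import re

def regex_extractor(pattern, doc):
    # Atomic extractor: a unary relation of spans matching the pattern.
    return {(m.start(), m.end()) for m in re.finditer(pattern, doc)}

def join_adjacent(rel1, rel2):
    # A simple relational manipulation on spans: keep pairs where the
    # second span starts right after the first one (plus one space).
    return {(s1, s2) for s1 in rel1 for s2 in rel2 if s2[0] == s1[1] + 1}

doc = "Alice met Bob"
caps = regex_extractor(r"[A-Z][a-z]+", doc)   # capitalized words as spans
verbs = regex_extractor(r"met", doc)
print(join_adjacent(caps, verbs))             # {((0, 5), (6, 9))}
```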
Short Bio: Liat Peterfreund is a PhD candidate in the Computer Science Department at the Technion. Her research is done under the supervision of Prof. Benny Kimelfeld and focuses on establishing the foundations of incorporating Information Extraction in Database Queries.
Since 2009, we have been developing and operating Cloud-deployed, hybrid transactional/analytic processing (HTAP) systems for large retail customers. We implement these applications primarily in a datalog-based language called LogiQL, which permits the concise expression of rich models with powerful constraints and business rules. The ability to build such systems at scale in LogiQL is due to a number of innovations in the LogicBlox platform, including advanced multi-way join algorithms, powerful query optimizations, write-optimized data structures, and the ability to harness the parallelism available in modern servers. Unfortunately, not every LogiQL program can (yet) take full advantage of these innovations. When tuning such a program, a developer will often rewrite parts of the program to introduce new predicates and rules that materialize intermediate results. Inevitably, these rewrites have a deleterious effect on the concision, readability, and maintainability of a program. In practice, we mitigate these challenges through optimization heuristics and by generating a significant component of the LogiQL program from a higher-level form (called CubiQL) that allows us to automate their application. This talk surveys the challenges that arise when using LogiQL to build HTAP applications in this domain and the heuristics we developed to address them. We conclude with open problems that remain difficult to automate.
Short Bio: Kurt Stirewalt joined LogicBlox in 2009 and has since served as Dev Lead, Chief Application Architect, and eventually VP of Software Development for Infor Retail, which acquired LogicBlox in 2016. Previously, he was a professor of computer science and engineering at Michigan State University, where he specialized in model-based software development, generative programming, and software reuse. Kurt received his PhD in computer science from the Georgia Institute of Technology in 1997.
Performing data analytics on huge data sets has become crucial for many enterprises. It has been facilitated by the introduction of the MapReduce programming and execution model, and by other more recent distributed computing platforms. MapReduce users often have analysis tasks that are too complex to express as individual MapReduce jobs, and therefore they use high-level query languages such as Pig or Hive to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. In my talk, I will present ReStore, a system that manages the storage and reuse of intermediate results between MapReduce jobs in such workflows. At the end of the talk, I will briefly discuss related ideas for optimizing queries executed on other distributed computing platforms.
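A minimal sketch of the reuse idea, assuming intermediate results can be keyed by a signature of the producing job and its input (the function names and the signature scheme are illustrative, not ReStore's actual design):
```python
store = {}  # signature -> materialized intermediate result

def run_job(name, fn, inputs):
    # Key each job by its name and input; a later workflow containing
    # the same sub-job over the same input can skip recomputation.
    signature = (name, tuple(sorted(map(str, inputs))))
    if signature in store:
        return store[signature]          # reuse a stored intermediate
    result = fn(inputs)                  # otherwise compute and keep it
    store[signature] = result
    return result

# Two workflows sharing a filtering sub-job: the second run is free.
data = [1, 2, 3, 4, 5, 6]
evens = run_job("filter_even", lambda xs: [x for x in xs if x % 2 == 0], data)
evens_again = run_job("filter_even", lambda xs: [x for x in xs if x % 2 == 0], data)
print(evens, evens_again)                # [2, 4, 6] [2, 4, 6]
```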
Hardware-conscious database systems evaluate queries in milliseconds that take minutes in conventional systems, turning long-running jobs into interactive queries. However, the plethora of hardware-focused tuning techniques creates a design space that is hard to navigate for a skilled performance engineer and even harder to exploit for modern, code-generating data processing systems. In addition, hardware-conscious tuning is often at odds with other design goals such as implementation effort, ease of use and maintainability -- in particular when developing code-generating database systems. Well-designed programming abstractions are essential to allow the creation of systems that are fast, easy to use and maintainable.
In my talk, I demonstrate how existing frameworks for high-performance, data-parallel programming fall short of this goal. I argue that the poor performance of many mainstream data processing systems is due to the lack of an appropriate intermediate abstraction layer, i.e., one that is amenable to both traditional data-oriented and low-level hardware-focused optimizations.
To address this problem, I introduce Voodoo, a data-parallel intermediate language that is abstract enough to allow effective code generation and optimization but low-level enough to express many common optimizations such as parallelization, loop tiling or memory locality optimizations. I demonstrate how we used Voodoo to build a relational data processing system that outperforms the fastest state-of-the-art in-memory database systems by up to five times. I also demonstrate how Voodoo can be used as a performance engineering framework, allowing the expression of many known optimizations and even enabling the discovery of entirely new optimizations.
Short Bio: Holger is a Lecturer (roughly assistant professor) in the Department of Computing at Imperial College London. As such, he is a member of the Large-Scale Data and Systems Group. Before that, he was a Postdoc in the Database group at MIT CSAIL. He spent his PhD years in the Database Architectures group at CWI in Amsterdam, resulting in a PhD from the University of Amsterdam in 2015. He received a master's degree (Diplom) in computer science from Humboldt-Universität zu Berlin in 2010. His research interests lie in analytical query processing on memory-resident data. In particular, he studies storage schemes and processing models for modern hardware.
Relational queries, and in particular join queries, often generate large outputs when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a data processing pipeline. In this talk, we consider the problem of constructing space-efficient compressed representations of the output of conjunctive queries, with the goal of supporting efficient access to the intermediate compressed result for a given access pattern. In particular, we will study an important tradeoff: minimizing the space necessary to store the compressed result, versus minimizing the answer time and delay for an access request over the result. We present a novel parameterized data structure, which can be tuned to trade off space for answer time. This tradeoff allows us to control the space requirement of the data structure precisely, and depends both on the structure of the query and on the access pattern. We then show how we can use the data structure in conjunction with query decomposition techniques, in order to efficiently represent the outputs for several classes of conjunctive queries. Finally, we conclude with some exciting open problems in this area.
This is joint work with Shaleen Deep.
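For intuition, here is a toy sketch of trading materialization for access time on the query Q(x, z) :- R(x, y), S(y, z), under the access pattern "given x, enumerate all z" (a simplified rendering of the idea, not the talk's parameterized structure):
```python
from collections import defaultdict

def build(R, S):
    # Linear-size indexes instead of a (possibly quadratic) join output.
    r_index = defaultdict(list)   # x -> [y]
    s_index = defaultdict(list)   # y -> [z]
    for (x, y) in R:
        r_index[x].append(y)
    for (y, z) in S:
        s_index[y].append(z)
    return r_index, s_index

def access(x, r_index, s_index):
    # Answer an access request for a given x without ever having
    # materialized the join output, paying a little time per access.
    for y in r_index.get(x, []):
        for z in s_index.get(y, []):
            yield z

r_idx, s_idx = build({("a", 1), ("a", 2)}, {(1, "u"), (2, "v")})
print(sorted(access("a", r_idx, s_idx)))   # ['u', 'v']
```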
Short Bio: Paris Koutris is an assistant professor at the University of Wisconsin-Madison, where he started in Fall 2015. He completed his Ph.D. in the Computer Science & Engineering Department at the University of Washington, advised by Dan Suciu. His research focuses on the theoretical aspects of data management. He is particularly interested in applying formal methods to various problems of modern data management systems: data processing in massively parallel systems and at scale, data pricing, and managing data with uncertainty. For his Ph.D. thesis, he received the 2016 SIGMOD Jim Gray Doctoral Dissertation Award.
LogiQL is a declarative modeling language for concisely expressing rich data models with powerful constraints and business rules. The declarative nature and simple syntax of LogiQL make it amenable to many different implementation strategies, including: allowing the database to determine how data are stored and indexed; deciding how queries are evaluated and results cached; evaluating concurrent transactions to maximize throughput; and even employing powerful constraint optimization solvers to solve linear and integer programming problems.
This ability to declare a model in a manner that is abstract with respect to implementation strategy is powerful. As a simple example, it removes much of the complexity of multi-core programming. It also enables a new approach to programming large database applications, especially those requiring complex prescriptive and descriptive analytics. Infor Retail has adopted LogiQL (and the LogicBlox platform) to build out our suite of retail forecasting and supply-chain optimization products. This talk gives a brief introduction to LogiQL and LogicBlox in the context of several common modeling problems in the retail domain.
Short Bio: Kurt Stirewalt joined LogicBlox in 2009 and has since served as Dev Lead, Chief Application Architect, and eventually VP of Software Development for Infor Retail, which acquired LogicBlox in 2016. Previously, he was a professor of computer science and engineering at Michigan State University, where he specialized in model-based software development, generative programming, and software reuse. Kurt received his PhD in computer science from the Georgia Institute of Technology in 1997.
An early vision in Computer Science has been to create intelligent systems capable of reasoning on large amounts of data. Today, this vision can be delivered by integrating Relational Databases with the Semantic Web using the W3C standards: a graph data model (RDF), ontology language (OWL), mapping language (R2RML) and query language (SPARQL). The research community has successfully been showing how intelligent systems can be created with Semantic Web technologies, now dubbed Knowledge Graphs.
However, where is the mainstream industry adoption? What are the barriers to adoption? Are these engineering and social barriers, or are they open scientific problems that need to be addressed?
This talk will chronicle our journey of deploying Semantic Web technologies with real-world users to address Business Intelligence and Data Integration needs, describe technical and social obstacles that are present in large organizations, and highlight scientific challenges that require attention.
Short Bio: Juan F. Sequeda is the co-founder of Capsenta, a spin-off from his research, and the Senior Director of Capsenta Labs. He holds a PhD in Computer Science from the University of Texas at Austin. His research interests are at the intersection of Logic and Data, and in particular between the Semantic Web and Relational Databases, for data integration, ontology-based data access and semantic/graph data management. Juan is the recipient of the NSF Graduate Research Fellowship, received 2nd Place in the 2013 Semantic Web Challenge for his work on ConstituteProject.org, Best Student Research Paper at the 2014 International Semantic Web Conference, and the 2015 Best Transfer and Innovation Project awarded by the Institute for Applied Informatics. Juan is the General Chair of the 2018 Alberto Mendelzon Workshop on Foundations of Databases (AMW2018), was the PC chair of the ISWC 2017 In-Use track, is on the Editorial Board of the Journal of Web Semantics, and is a member of multiple program committees (ISWC, ESWC, WWW, AAAI, IJCAI). Juan is a member of the Linked Data Benchmark Council (LDBC) Graph Query Languages task force and has also been an invited expert and standards editor at the World Wide Web Consortium (W3C).
This talk considers a method for compressing graphs into a smaller graph grammar based on the RePair compression scheme. We start by defining the necessary notion of context-free hyperedge replacement grammars using examples. We then extend RePair to graphs using this grammar formalism and discuss how the problem of finding non-overlapping occurrences appears more difficult on graphs than on strings and trees.
We also give some intuition on graphs that are difficult to compress with HR grammars, present some experimental results based on a proof-of-concept implementation, and briefly mention how to navigate the represented data without decompression.
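For reference, a minimal sketch of RePair on strings, the base scheme the talk generalizes to graphs (a toy version; real implementations track pair frequencies far more efficiently):
```python
from collections import Counter

def repair(s):
    # Repeatedly replace the most frequent adjacent pair of symbols
    # with a fresh nonterminal, recording one grammar rule per step.
    seq = list(s)
    rules = {}
    next_symbol = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                      # no pair worth replacing
        nt = f"N{next_symbol}"
        next_symbol += 1
        rules[nt] = pair               # new grammar rule: nt -> pair
        out, i = [], 0
        while i < len(seq):            # replace non-overlapping occurrences
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

print(repair("abababab"))
# (['N1', 'N1'], {'N0': ('a', 'b'), 'N1': ('N0', 'N0')})
```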
Short Bio: Fabian Peternek is about to receive his Ph.D. from the University of Edinburgh on the topic of grammar-based compression of graphs. His research includes a generalization of the popular RePair compression scheme to graphs, as well as algorithmic results on compressed graph and tree grammars, with the aim of achieving polynomial time bounds (with respect to the grammar's size) on queries run on grammars.
In this talk, we study the architecture of query compilers. The state of the art in query compiler construction lags behind that in the compilers field. We attempt to remedy this by exploring the key causes of technical challenges in need of well-founded solutions, and by gathering the most relevant ideas and approaches from the PL and compilers communities for easy digestion by database researchers. All query compilers known to us are more or less monolithic template expanders that do the bulk of the compilation task in one large leap. Such systems are hard to build and maintain. We propose instead to use a stack of multiple DSLs on different levels of abstraction, with lowering in multiple steps, to make query compilers easier to build and extend, ultimately allowing us to create more convincing and sustainable compiler-based data management systems. We attempt to derive our advice for creating such DSL stacks from widely accepted principles. We have also re-created a well-known query compiler following these ideas and report on this effort.
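A minimal sketch of the multi-level idea: a declarative query expression is lowered step by step to executable loop form instead of being template-expanded in one leap (all names here are illustrative, not an actual query-compiler API):
```python
# Level 1: declarative algebra (what to compute).
query = ("sum", ("filter", lambda t: t > 10, ("scan", "numbers")))

# Level 2: lower the algebra to a loop/generator-based intermediate form.
def lower(expr, tables):
    op = expr[0]
    if op == "scan":
        return iter(tables[expr[1]])
    if op == "filter":
        return (t for t in lower(expr[2], tables) if expr[1](t))
    if op == "sum":
        return sum(lower(expr[1], tables))
    raise ValueError(op)

# A further level would emit specialized low-level code from the loop
# form; here we simply execute it.
print(lower(query, {"numbers": [5, 12, 40, 7]}))  # 52
```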
Amir Shaikhha is a 5th year Ph.D. student at EPFL. His research aims to build efficient data analytics systems using high-level languages. More specifically, he is interested in using compilation techniques for generating efficient low-level code (e.g. C code) from the high-level specification (e.g. Scala code) of performance-critical systems (e.g. database systems). He received his M.Sc. from EPFL and B.S. from the Sharif University of Technology.
Modern computing tasks such as real-time analytics require refreshing query results under high update rates. Incremental View Maintenance (IVM) approaches this problem by materializing results in order to avoid recomputation. IVM naturally induces a trade-off between the space needed to maintain the materialized results and the time used to process updates.
In this talk, we show that the full materialization of results is a barrier to more general optimization strategies. In particular, we present a new approach for evaluating queries under updates. Instead of materializing the results, we require a data structure that allows: (1) linear-time maintenance under updates, (2) constant-delay enumeration of the output, (3) constant-time lookups in the output, while (4) using only linear space in the size of the database. We call such a structure a Dynamic Constant delay Linear Representation (DCLR) for the query. We show that Dyn, a dynamic version of the Yannakakis algorithm, yields DCLRs for the class of free-connex acyclic CQs. We show that this is optimal in the sense that no DCLR can exist for CQs that are not free-connex acyclic. Moreover, we identify a sub-class of queries for which Dyn features constant-time updates per tuple and show that this class is maximal. An experimental validation of Dyn shows that it is highly effective in practice.
The talk will conclude with current insights from ongoing research on how Dyn can be extended to also allow processing of more general classes of queries, in particular queries that feature inequality joins rather than equality joins. Such inequality joins are important in relational-style queries, but are also prominently present in the area of Complex Event Recognition.
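To make the contrast with full materialization concrete, here is a toy sketch of a representation that supports constant-time single-tuple updates while still letting answers be enumerated on demand, for the free-connex query Q(x, y) :- R(x), S(x, y) (illustrative only; Dyn covers the general class with formal delay guarantees):
```python
from collections import defaultdict

class DynamicView:
    def __init__(self):
        self.r = set()
        self.s = defaultdict(set)   # x -> {y}

    def insert_r(self, x):          # O(1) update, no result materialized
        self.r.add(x)

    def insert_s(self, x, y):       # O(1) update
        self.s[x].add(y)

    def enumerate(self):            # answers reflect all updates so far
        for x in self.r:
            for y in self.s.get(x, ()):
                yield (x, y)

v = DynamicView()
v.insert_r(1); v.insert_s(1, "a"); v.insert_s(2, "b")
print(list(v.enumerate()))          # [(1, 'a')]
v.insert_r(2)
print(sorted(v.enumerate()))        # [(1, 'a'), (2, 'b')]
```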
We study planning in Markov decision processes involving discrete and continuous states and actions, and an unknown number of objects. Planning in such domains is notoriously challenging and often requires restrictive assumptions. We introduce HYPE, a very general sample-based planner for hybrid domains that combines model-based approaches with state abstraction. Most significantly, the domains where such planners are deployed are usually very complex, with deep structural and geometric constraints. HYPE is instantiated in a probabilistic programming language that allows compact codification of such constraints.
In our empirical evaluations, we show that HYPE is a general and widely applicable planner in domains ranging from strictly discrete to strictly continuous to hybrid ones. Moreover, empirical results show that abstraction provides significant improvements.
In the final part of the talk, we turn to the question of whether there is any hope of developing computational methodologies that are not based on sampling. In particular, it is tricky in hybrid domains to deal with low-probability observations, and most sampling-based schemes only provide asymptotic guarantees.
This talk is based on a Machine Learning Journal article (2017), and is joint work with Davide Nitti, Tinne De Laet and Luc De Raedt.
This talk overviews recent results on bounding the output size and on efficiently evaluating a disjunctive datalog rule, given input statistics such as relation sizes, functional dependencies, and degree bounds. These are the kind of statistics prevalent in database query evaluation, and our results apply to aggregate queries as well. The disjunctive datalog query evaluation problem is fairly general, as both conjunctive query evaluation and the basic constraint satisfaction problem are special cases. These new combinatorial and algorithmic results are built on a fundamental connection between query evaluation and Shannon-type inequalities. It was observed in different contexts over the past 40 years that information-theoretic inequalities can be used to bound combinatorial quantities. First, one can derive (sometimes tight) output size bounds for conjunctive queries and disjunctive datalog rules using Shannon-type inequalities. This talk discusses these bounds and techniques. Second, we show how one can turn a proof of an information inequality into an efficient algorithm to evaluate such queries. The algorithm's runtime is bounded by a generalized version of the submodular width of the query, which is optimal modulo complexity-theoretic assumptions.
The talk is based on joint works with Mahmoud Abo Khamis and Dan Suciu.
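A classic worked instance of this connection is the output size bound for the triangle query Q(x, y, z) :- R(x, y), S(y, z), T(z, x) (a standard textbook example used here for illustration, not taken from the talk): take h to be the entropy of the uniform distribution over the output, so that h(XYZ) = log |Q|, and apply Shannon-type inequalities:
```latex
% Submodularity and monotonicity give the Shannon-type inequality
%   2 h(XYZ) \le h(XY) + h(YZ) + h(ZX),
% and each marginal is supported on one input relation, so
%   h(XY) \le \log |R|, \quad h(YZ) \le \log |S|, \quad h(ZX) \le \log |T|.
% Combining the two yields the (tight) AGM bound for the triangle query:
\[
  \log |Q| \;\le\; \tfrac{1}{2}\left(\log |R| + \log |S| + \log |T|\right)
  \quad\Longrightarrow\quad
  |Q| \;\le\; \sqrt{|R|\,|S|\,|T|}.
\]
```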
I will talk about algorithmic and theoretical aspects of an adaptive clustering framework for content recommendation, based on exploration-exploitation strategies in multi-armed bandit scenarios. First, I'll give an introduction to centralized clustering methods in a standard stochastic noise setting for networked users on a graph. Next, in the decentralized clustering setting, I'll describe two efficient distributed algorithms for solving linear-regression-based problems in peer-to-peer networks with limited communication capabilities. Last, I'll establish upper bounds for generalized collaborative filtering applications combined with clustered models. In all cases, sound theoretical guarantees are provided, and extensive experiments on real-world data sets demonstrate significantly better performance compared to state-of-the-art approaches.
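For background, a minimal sketch of the exploration-exploitation trade-off that these methods build on, using the classic UCB1 index for a stochastic multi-armed bandit (standard textbook material, not the talk's clustering algorithms):
```python
import math, random

def ucb1(arm_means, horizon):
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                        # play every arm once first
        else:
            # index = empirical mean + confidence bonus (exploration term)
            arm = max(range(n_arms), key=lambda a:
                      sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward, counts

random.seed(0)
print(ucb1([0.2, 0.5, 0.8], horizon=2000))     # most pulls go to the best arm
```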
Dr Shuai Li is currently a postdoctoral researcher in the Engineering Design Centre of the Department of Engineering at the University of Cambridge. He has 7+ years of professional experience across Europe, North America, the Middle East, and Asia Pacific, and 19+ years of project experience in Computer Science and Information Engineering; he has also served as a CEO in industry and released several successful products. He has been engaged in internationally leading research in the analysis of complex, dynamic data at scale, and has brought cutting-edge data mining and machine learning to the heart of industrial-scale big data analytics and data science. In academic service, he has been involved in a number of prestigious international conferences and journals, including SIGIR, SIGKDD, ICDM, SDM, NIPS, UAI, AISTATS, WWW, WSDM, IJCAI, etc.
Many of today's popular computing applications require real-time analytics over large and dynamic datasets, from social web applications to online data warehousing, network monitoring, and algorithmic trading. These applications have long-lived analysis queries that require low-latency processing over rapidly changing datasets.
In this talk, I will present techniques for efficient incremental processing of complex analytical queries, ranging from classical SQL queries to linear algebra programs. Our system, called DBToaster, compiles declarative database queries into high-performance stream processing engines that keep query results (views) fresh at very high update rates. DBToaster uses a recursive query compilation algorithm that materializes a supporting set of higher-order delta views to achieve a substantially lower view maintenance cost. Our implementation supports batched processing in local and distributed environments and can deliver up to 5 orders of magnitude better performance than existing DBMSs and stream engines.
The LINVIEW system focuses on the incremental computation of iterative linear algebra programs that consist of the standard matrix operations. LINVIEW uses matrix factorization techniques to make the incremental computation of standard machine learning algorithms, like linear regression, practical and usually substantially cheaper than re-evaluation. LINVIEW generates parallel incremental programs that outperform re-evaluation techniques by more than an order of magnitude.
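To illustrate the flavor of delta-based view maintenance (a toy first-order example; DBToaster's contribution is taking deltas of deltas, i.e., higher-order views), consider keeping the size of a two-way join fresh under single-tuple inserts:
```python
from collections import defaultdict

class JoinCountView:
    # Maintains  V = |R join S on b|  with O(1) work per insert,
    # instead of recomputing the join on every update.
    def __init__(self):
        self.r_count = defaultdict(int)   # b -> number of R-tuples with key b
        self.s_count = defaultdict(int)   # b -> number of S-tuples with key b
        self.view = 0

    def insert_r(self, b):
        self.view += self.s_count[b]      # delta: new matches created
        self.r_count[b] += 1

    def insert_s(self, b):
        self.view += self.r_count[b]
        self.s_count[b] += 1

v = JoinCountView()
v.insert_r("x"); v.insert_s("x"); v.insert_s("x"); v.insert_r("y")
print(v.view)   # 2
```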
The talk presents the HANA column store, then focuses on three historical versions of the snapshot isolation implementation, presenting for each what worked well and why we evolved to the next one.
Mihnea Andrei
MS in computer science in 1988; the Bucharest Polytechnic Institute, Automatic Control and Computers engineering school; Prof. Cristian Giumale
DEA in Machine Learning in 1990; Universite Paris 6; Prof. Jean-Gabriel Ganascia
Joined Sybase in 1993; currently working at SAP, which acquired Sybase in 2010.
Worked on the core engine of several RDBMSs (Sybase ASE and IQ; SAP HANA): query optimization, Abstract Plans (optimizer hints), query compilation and execution, eager-lazy aggregation, shared-disk and shared-nothing scale-out. Current focus: database stores (in-memory and on-disk, row and column oriented), transaction processing, data lifecycle.
Large-scale internet operations such as Google, Facebook, and Amazon manage amazing amounts of data. Doing so requires databases that are distributed across multiple servers or even multiple data centers, with high throughput, stringent latency requirements, "five nines" of availability, and often strict data consistency requirements. This talk starts by introducing relational SQL databases, NoSQL databases, and the current state of the art in such databases as deployed in industry. It then provides an introduction to Google F1, a SQL database based on Google's Spanner distributed storage system. F1 is used to store the data for AdWords, Google's search advertising product, as well as several other major products in the Ads & Commerce area. F1 and Spanner represent a new, hybrid approach to distributed databases that combines the scalability and availability of NoSQL storage systems like Google's Bigtable and Amazon's DynamoDB with the convenience and consistency guarantees provided by traditional SQL relational databases.
Bart Samwel is a senior staff software engineer at Google. He is infamous at his alma mater, Leiden University, for completing his Master's degree with honors (and with several peer-reviewed publications on his resume), but more importantly for taking a full twelve years to do it. It is rumored that this discrepancy has caused him to be excluded as an outlier from the Computer Science department's graduation statistics. At Google, Bart is a core member of the F1 team, working mostly on the F1 SQL query engine.
We show how a variety of problems from databases, logic, probabilistic graphical model (PGM) inference, matrix computation, constraint satisfaction (CSP), etc. can be formulated as instances of the same problem, called the Functional Aggregate Query (FAQ) problem. Then, we explain how a pair of very simple algorithms called OutsideIn and InsideOut can be used in concert to solve the FAQ problem. OutsideIn is a slightly more general version of recent worst-case optimal join algorithms. InsideOut is a slightly more general version of the classic variable elimination algorithm.
As is the case with constraint programming and graphical model inference, to make InsideOut run efficiently we need to solve an optimization problem to compute an appropriate variable ordering. The main technical contribution of this work is a precise characterization of when a variable ordering is "semantically equivalent" to the variable ordering given by the input FAQ expression. Then, we design an approximation algorithm to find an equivalent variable ordering that has the best "fractional FAQ-width". Our results imply a host of known and a few new results in graphical model inference, matrix operations, relational joins, and logic.
This is joint work with Mahmoud Abo Khamis and Atri Rudra.
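Since InsideOut generalizes variable elimination, a minimal sketch of the classic algorithm may help fix intuitions: factors are functions over variables, and we repeatedly sum a variable out of the product of the factors that mention it (a toy Boolean-domain version, not the FAQ formalism itself):
```python
from itertools import product

def eliminate(factors, var, domain):
    # Factors are (variable-tuple, function) pairs. Split off those
    # mentioning `var`, sum it out of their product, and return the
    # untouched factors plus the one new factor this produces.
    touching = [(vs, f) for (vs, f) in factors if var in vs]
    rest = [(vs, f) for (vs, f) in factors if var not in vs]
    new_vars = sorted({v for (vs, _) in touching for v in vs} - {var})
    table = {}
    for assignment in product(domain, repeat=len(new_vars)):
        env = dict(zip(new_vars, assignment))
        total = 0
        for value in domain:
            env[var] = value
            prod_val = 1
            for vs, f in touching:
                prod_val *= f(*(env[v] for v in vs))
            total += prod_val
        table[assignment] = total
    return rest + [(tuple(new_vars), lambda *args: table[args])]

# Sum_{x,y,z} R(x,y) * S(y,z) over the Boolean domain, eliminating z, y, x.
R = (("x", "y"), lambda x, y: 1 if (x, y) in {(0, 1), (1, 1)} else 0)
S = (("y", "z"), lambda y, z: 1 if (y, z) in {(1, 0)} else 0)
factors = [R, S]
for v in ("z", "y", "x"):
    factors = eliminate(factors, v, domain=(0, 1))
print(factors[0][1]())   # 2 = number of (x,y,z) with R(x,y)*S(y,z) = 1
```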
Fix a full conjunctive query, and consider the following problem: what is the amount of communication required to compute the query in parallel, on p servers, over a large database instance? We define the Massively Parallel Communication (MPC) model, where the computation proceeds in rounds consisting of local computations followed by a global reshuffling of the data. Servers have unlimited computational power and are allowed to exchange any data; the only cost parameters are the number of rounds and the maximum amount of communication per server. I will describe tight bounds on the amount of communication for the case of a single round and data without skew, then discuss extensions to skewed data and to multiple rounds.
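For intuition, a minimal simulation of a single MPC round for a two-way join: tuples are routed by hashing on the join attribute, after which the join can be computed locally (illustrative only; the talk's bounds concern how much data each server must receive, especially under skew):
```python
def reshuffle(R, S, p):
    # One communication round: every tuple is sent to the server
    # determined by the hash of its join attribute y, so matching
    # tuples meet on exactly one server.
    servers = [{"R": [], "S": []} for _ in range(p)]
    for (x, y) in R:
        servers[hash(y) % p]["R"].append((x, y))
    for (y, z) in S:
        servers[hash(y) % p]["S"].append((y, z))
    return servers

def local_join(server):
    # Local computation after the round: no further communication needed.
    return [(x, y, z) for (x, y) in server["R"]
                      for (y2, z) in server["S"] if y == y2]

servers = reshuffle({(1, "a"), (2, "b")}, {("a", 7), ("b", 8)}, p=4)
answers = [t for s in servers for t in local_join(s)]
print(sorted(answers))   # [(1, 'a', 7), (2, 'b', 8)]
```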
We study knowledge compilation for Boolean formulas that are given as groundings of First Order formulas. This problem is motivated by probabilistic databases, where each record in the database is an independent probabilistic event, and the query is given by a SQL expression or, equivalently, a First Order formula. The query's probability can be computed in linear time in the size of the compiled representation, hence the interest in studying the size of such a representation. We consider the "data complexity" setting, where the query is fixed, and the input to the problem consists only of the database instance. We consider several compilation targets of increasing expressive power: OBDDs, FBDDs, and decision-DNNFs (a subclass of d-DNNFs). For the case of OBDDs we establish a dichotomy theorem for queries in the restricted languages FO(∃, ∧, ∨) and FO(∀, ∧, ∨): for each such query the OBDD is either linear in the size of the database or grows exponentially, and the complexity can be determined through a simple analysis of the query expression. For the other targets we describe a class of queries for which (a) the decision-DNNF is exponentially large in the size of the database, and (b) the probability of the query can be computed in polynomial time in the size of the database. This suggests that the compilation target decision-DNNF is too weak to capture all tractable cases of probabilistic inference. Our lower bound for decision-DNNFs relies on a translation into FBDDs, which is of independent interest.
Joint work with Paul Beame, Abhay Jha, Jerry Li, and Sudeepa Roy.