28 April 2023 Tova Milo ( Tel Aviv University )
Data, more data, too much data

Over the past 30 years, the world of data management has undergone a dramatic evolution, allowing us to consider an increasingly wide range of data types, sources, and processing models. This has also led to the collection of tremendous volumes of data, and we are now at a point where almost every aspect of our lives is data-driven. In this talk, Prof. Tova Milo will share her perspective on some of the "hot topics" that she has been personally involved with in this evolution, and discuss why certain research topics become popular and when they lose their relevance. She will furthermore argue that we are now facing a turning point where the deluge of data has become a serious risk, and that the next "hot" research agenda should focus on data disposal.
Read More..

16 March 2023: Dan Suciu ( University of Washington )
Optimizing Recursive Queries

Modern data analytics requires iteration, yet relational database engines are mostly optimized for non-recursive queries. SQL supports only a limited form of recursion. A better formalism for recursive queries is datalog, which has some elegant properties (recursion always terminates), and led to the development of two powerful optimizations techniques: semi-naive evaluation, and magic set rewriting. But standard datalog is restricted to monotone queries over sets and does not support aggregates, which has limited its adoption.
Read More..

10 October 2019: Stefano Cavazzi ( Ordnance Survey )
GeoAI for CAV: an Ordnance Survey perspective

The scientific field of geospatial artificial intelligence (GeoAI) was recently formed from combining innovations in spatial science with the rapid growth of methods in artificial intelligence (AI), particularly machine learning (e.g., deep learning), data mining, and high-performance computing to glean meaningful information from spatial big data. GeoAI is an interdisciplinary field, spanning several scientific subjects including computer science, engineering, statistics, and spatial science. In this seminar talk, we will explore current research work undergoing at Ordnance Survey and partner universities in unlocking the potential of GeoAI in the connected and autonomous vehicle (CAV) sector.
Read More..

22 March 2019: Yannis Papakonstantinou ( Amazon Web Services and University of California, San Diego )
The SQL++ Query Language

SQL-on-Hadoop, NewSQL and NoSQL databases provide semi-structured data models (typically JSON-based). They now drive towards declarative, SQL-alike query languages. However, their idiomatic, non-SQL language constructs, the many variations and the lack of formal syntax and semantics pose problems. Notably, database vendors end up with unclear semantics and complicated implementations, as they add one feature at-a-time.
Read More..

29 January 2019: Liat Peterfreund ( Technion )
Complexity Bounds for Relational Algebra over Text Extractors

Information Extraction (IE) is the task of extracting structured information from textual data. We explore a programming paradigm that is supported by several IE systems where relations extracted by atomic extractors undergo a relational manipulation. In our efforts toward achieving a better understanding of IE systems, we study the computational complexity of queries in the framework of document spanners (spanners, for short) that was introduced by Fagin et al. A spanner is a function that extracts from a document (string) a relation over text intervals, called spans, using either atomic extractors or a relational algebra query on top of these extractors. Evaluating a spanner on a document is a computational problem that involves three components: the atomic extractors, the relational structure of the query, and the document. We investigate the complexity of this problem from various angles, each keeping some components fixed and regarding the rest as input.
Read More..

2 January 2019: Nofar Carmeli ( Technion )
Enumeration of UCQs with Constant Delay

We discuss the complexity of enumerating the answers of Unions of Conjunctive Queries (UCQs), and focus on the ability to list output tuples with constant delay following linear-time preprocessing. A known dichotomy classifies the self-join-free CQs into those that admit such enumeration, and those that do not. However, this classification no longer holds in the common case where the database exhibits dependencies among attributes. In such cases, some queries classified as hard are in fact tractable, and we generalize the dichotomy to accommodate Functional Dependencies (FDs). Next, we aspire to have a similar characterization for UCQs, even when there are no FDs. It was claimed in the past that a UCQ is hard if one of its queries is hard. We examine the task of enumerating UCQs, and show that some unions containing a hard query are tractable. In fact, a UCQ may be tractable even if it does not contain a single tractable CQ.
Read More..

11 December 2018: Kurt Stirewalt ( Vice President of Software Development, Infor )
Practical challenges to building large applications written in a datalog-based language

Since 2009, we have been developing and operating Cloud-deployed, hybrid transactional/analytic processing (HTAP) systems for large retail customers. We implement these applications primarily in a datalog-based language called LogiQL, which permits the concise expression of rich models with powerful constraints and business rules. The ability to build such systems at scale in LogiQL owes to a number of innovations in the LogicBlox platform, including advanced multi-way join algorithms, powerful query optimizations, write-optimized data structures, and the ability to harness the parallelism available in modern servers. Unfortunately, not every LogiQL program can (yet) take full advantage of these innovations. When tuning such a program, a developer will often rewrite parts of the program to introduce new predicates and rules that materialize intermediate results. Inevitably, these rewrites have a deleterious effect on the concision, readability, and maintainability of a program. In practice, we mitigate these challenges through optimization heuristics and by generating a significant component of the LogiQL program from a higher-level form (called CubiQL) that allows us to automate their application. This talk surveys the challenges that arise when using LogiQL to build HTAP applications in this domain and the heuristics we developed to address them. We conclude with open problems that remain difficult to automate.
Read More..

24 July 2018: Iman Elghandour ( Alexandria University and Université Libre de Bruxelles )
ReStore: Reusing Results of MapReduce Jobs

Performing data analytics on huge data sets have been very crucial for many enterprises. It was facilitated by the introduction of the MapReduce programming and execution model and other more recent distributed computing platforms. MapReduce users often have analysis tasks that are too complex to express as individual MapReduce jobs, and therefore they use high-level query languages such as Pig or Hive to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. In my talk, I will present ReStore, a system that manages the storage and reuse of intermediate results between MapReduce jobs in such workflows. At the end of the talk, I will briefly discuss related ideas for optimizing queries executed on other distributed computing platforms.
Read More..

5 June 2018: Holger Pirk (Imperial College London)
Hardware-conscious data processing systems

Hardware-conscious database systems evaluate queries in milliseconds that take minutes in conventional systems, turning long-running jobs into interactive queries. However, the plethora of hardware-focused tuning techniques creates a design-space that is hard to navigate for a skilled performance engineer and even harder to exploit for modern, code-generating data processing systems. In addition, hardware-conscious tuning is often at odds with other design goals such as implementation effort, ease of use and maintainability -- in particular when developing code-generating database systems. Well-designed programming abstractions are essential to allow the creation of systems that are fast, easy to use and maintainable...
Read More..

22 March 2018: Paris Koutris (University of Wisconsin-Madison)
Compressed Representations of Conjunctive Query Results

Relational queries, and in particular join queries, often generate large output results when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a data processing pipeline. In this talk, we consider the problem of constructing space-efficient compressed representations of the output of conjunctive queries, with the goal of supporting the efficient access of the intermediate compressed result for a given access pattern...
Read More..

16-18 Jan 2018: Fabian Peternek (University of Edinburgh)
Graph Compression using Graph Grammars

This talk considers a method for compressing graphs into a smaller graph grammar based on the RePair compression scheme. We start by defining the necessary notion of context-free hyperedge replacement grammars using examples. We then extend RePair to graphs using this grammar formalism and discuss how the problem of finding non-overlapping occurrences appears more difficult on graphs than on strings and trees....
Read More..

3-5 Jan 2018: Amir Shaikhha (EPFL)
How to Architect a Query Compiler

In this talk, we study architecting query compilers. The state of the art in query compiler construction is lagging behind that in the compilers field. We attempt to remedy this by exploring the key causes of technical challenges in need of well-founded solutions, and by gathering the most relevant ideas and approaches from the PL and compilers communities for easy digestion by database researchers...
Read More..

21-22 Nov 2017: FWA project partner Stijn Vansummeren (Université Libre de Bruxelles)
The Dynamic Yannakakis Algorithm: Compact and Efficient Query Processing Under Updates

Modern computing tasks such as real-time analytics require refresh of query results under high update rates. Incremental View Maintenance (IVM) approaches this problem by materializing results in order to avoid recomputation. IVM naturally induces a trade-off between the space needed to maintain the materialized results and the time used to process updates....
Read More..

13 Nov 2017: Vaishak Belle ( Edinburgh )
Decision-theoretic planning via probabilistic programming

We study planning in Markov decision processes involving discrete and continuous states and actions, and an unknown number of objects. Planning in such domains is notoriously challenging and often requires restrictive assumptions. We introduce HYPE: a sample-based planner for hybrid domains that is very general, which combines model-based approaches with state abstraction. Most significantly, the domains where such planners are deployed are usually very complex with deep structural and geometric constraints. HYPE is instantiated in a probabilistic programming language that allows compact codification of such constraints....
Read More..

11 Nov 2017: Hung Ngo ( stealth mode )
Shannon-type inequalities, submodular width, and disjunctive datalog

This talk overviews recent results on bounding the output size and on efficiently evaluating a disjunctive datalog rule, given input statistics such as relation sizes, functional dependencies, and degree bounds. These are the kind of statistics prevalent in database query evaluation, and our results apply to aggregate queries as well. The disjunctive datalog query evaluation problem is fairly general, as both conjunctive query evaluation and the basic constraint satisfaction problem are special cases. These new combinatorial and algorithmic results are built up on a fundamental connection between query evaluation and Shannon-type inequalities. It was observed in different contexts over the...
Read More..

11 Oct 2017: Shuai Li ( Cambridge )
Toward a Principled Framework for Clustering Techniques

I will talk about algorithmic and theoretical aspects of an adaptive clustering framework for the content recommendation based on exploration-exploitation strategies in the multi-armed bandit scenarios. First, I'll give an introduction to the centralized clustering methods in a standard stochastic noise setting for networked users on the graph. Next, in the decentralized clustering setting I'll describe two distributed efficient algorithms for solving linear regression based problems in peer to peer networks with limited communication capabilities. Last, I'll establish upper bounds on the generalized collaborative filtering applications combined with clustered models. In all cases, the technically sound theoretical guarantees...
Read More..

19 Feb 2016: Bart Samwel ( Google )
F1 - The Distributed SQL Database Supporting Google's Ad Business

Large scale internet operations such as Google, Facebook, and Amazon manage amazing amounts of data. Doing so requires databases that are distributed across multiple servers or even multiple data centers, with high throughput, strong latency requirements, "five nines" of availability, and often with strict data consistency requirements. This talk starts by introducing relational SQL databases, NoSQL databases, and the current state of the art in such databases as deployed in industry. It then provides an introduction to Google F1, a SQL database based on Google's Spanner distributed storage system. F1 is used to store the data for AdWords, Google's search...
Read More..

17 Nov 2015: Hung Ngo ( LogicBlox and SUNY-Buffalo )
Functional Aggregate Queries Asked Frequently (FAQAF)

We show how a variety of problems from database, logic, probabilistic graphical model (PGM) inference, matrix computation, constraint satisfaction (CSP), etc. can be formulated as instances of the same problem called the Functional Aggregate Query (FAQ) problem. Then, we explain how a pair of very simple algorithms called OutsideIn and InsideOut can be used in concert to solve the FAQ problem. OutsideIn is a slightly more general version of recent worst-case optimal join algorithms. InsideOut is a slightly more general version of the classic variable elimination algorithm. ...
Read More..

6 Nov 2015: Dan Suciu ( University of Washington )
Query Compilation: the View from the Database Side

We study knowledge compilation for Boolean formulas that are given as groundings of First Order formulas. This problem is motivated by probabilistic databases, where each record in the database is an independent probabilistic event, and the query is given by a SQL expression or, equivalently, a First Order formula. The query's probability can be computed in linear time in the size of the compilation representation, hence the interest in studying the size of such a representation. ...
Read More..

5 Nov 2015: Dan Suciu ( University of Washington )
Parallel Query Evaluation

Fix a full, conjunctive query, and consider the following problem: what is the amount of communication required to compute the query in parallel, on p servers, over a large database instance? We define the Massively Parallel Communication (MPC) model, where the computation proceeds in rounds consisting of local computations followed by a global reshuffling of the data. Servers have unlimited computational power and are allowed to exchange any data, the only cost parameters are the number of rounds and the maximum amount of communication per server. I will describe tight bounds on the amount of communication for...
Read More..