Since their introduction eleven years ago, factorized
databases have seen great progress in their theory, systems, and applications. This workshop offers a
brief look at some very recent progress on several
facets of factorized databases and a great opportunity to look ahead. It features
talks on: graph database systems; enumeration in static and
dynamic query evaluation; query provenance management; and factorized data analytics including
factorized
machine learning over relational data.
The program can be found below.
August 2
9:30 am - 10:00 am:
Morning Coffee & Tea
10:00 am - 11:00 am:
Query Enumeration and Ranking (1/2)
Maximilian Schleich (RelationalAI Vienna, Austria)
Amir Shaikhha (University of Edinburgh, UK)
Ainur Smagulova* (TigerGraph, USA)
Jaclyn Smith* (Omics Data Automation, USA)
Dan Suciu (University of Washington, USA)
Immanuel Trummer* (Cornell University, USA)
Jun Tu* (University of Zurich, Switzerland)
Nils Vortmeier (University of Bochum, Germany)
Haozhe Zhang* (University of Zurich, Switzerland)
Rui Zhou* (University of Zurich, Switzerland)
Dorde Zivanovic* (University of Oxford, UK)
This workshop is organised by the Data Systems and Theory (DaST) Group at the University of Zurich. Please
contact Denise Gloor or Dan Olteanu for further information or if
you want to attend.
We discuss the task of direct access to the k-th answer of a Conjunctive Query according to a user-specified
order over the answers. In other words, our goal is to simulate the sorted array of query answers without
materializing all answers. Assuming we can also efficiently count the query answers, this allows efficiently
answering quantile queries (e.g., finding the median). We focus on identifying the preprocessing time needed to
allow logarithmic access time.
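The flavor of direct access is easiest to see in a toy case. The sketch below (my own illustration, not the talk's algorithm) simulates the sorted array of answers to Q(x, y) :- R(x), S(y) under lexicographic order: after sorting the two relations, the k-th answer follows from constant-time index arithmetic, so quantile queries come for free.

```python
# Hypothetical sketch: direct access to the k-th answer of
# Q(x, y) :- R(x), S(y) under lexicographic order, without
# materializing the |R| * |S| answers.

def kth_answer(R, S, k):
    """Return the k-th tuple (0-indexed) of the sorted product R x S."""
    R, S = sorted(R), sorted(S)
    if not 0 <= k < len(R) * len(S):
        raise IndexError("k out of range")
    i, j = divmod(k, len(S))   # constant-time index arithmetic
    return (R[i], S[j])

R = [3, 1, 2]
S = [20, 10]
# Sorted answers: (1,10), (1,20), (2,10), (2,20), (3,10), (3,20)
print(kth_answer(R, S, 3))  # (2, 20)
```

General conjunctive queries with joins need genuinely more machinery; the preprocessing-time question the talk studies is exactly about when this kind of simulation with logarithmic access remains possible.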
Join queries often generate large output results when executed over a huge dataset. In such cases, it is
prohibitively expensive to store the whole materialized output if we plan to reuse it further down a data
processing pipeline. In this talk, I will present several techniques that allow us to avoid this materialization
step and discuss their applications. First, I will present a space-efficient compressed representation of the
output of join queries, where we can control the tradeoff between minimizing the space necessary to store the
result, versus minimizing the answering time during enumeration. Then, I will discuss how we can avoid
materialization when the output of the join query is accessed in an order given by a ranking function.
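As a rough illustration of the idea of avoiding materialization (a textbook-style sketch of my own, not the talk's actual data structure), the output of Q(a,b,c) :- R(a,b), S(b,c) can be stored grouped by the join key in space linear in the input and enumerated answer by answer:

```python
# Illustrative sketch: a factorized representation of
# Q(a,b,c) :- R(a,b), S(b,c), grouped by the join key b.
# Space is O(|R| + |S|) rather than O(|R join S|), and all
# answers can be enumerated directly from it.

from collections import defaultdict

def factorize(R, S):
    """Group both relations by the shared join attribute b."""
    by_b = defaultdict(lambda: ([], []))
    for a, b in R:
        by_b[b][0].append(a)
    for b, c in S:
        by_b[b][1].append(c)
    # keep only join keys present on both sides
    return {b: (As, Cs) for b, (As, Cs) in by_b.items() if As and Cs}

def enumerate_answers(fact):
    """Enumerate (a, b, c) answers from the factorized form."""
    for b, (As, Cs) in fact.items():
        for a in As:
            for c in Cs:
                yield (a, b, c)

R = [(1, 'x'), (2, 'x'), (3, 'y')]
S = [('x', 10), ('x', 20), ('z', 30)]
fact = factorize(R, S)
print(sorted(enumerate_answers(fact)))
# [(1, 'x', 10), (1, 'x', 20), (2, 'x', 10), (2, 'x', 20)]
```

The space/answering-time tradeoff in the talk refines this picture: richer compressed representations cost more to enumerate from, and ranked access adds further constraints.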
CQs with projections are an interesting class of queries that are widely used in practice. In this talk, I will
review our results on ranked (ICDT 2021, VLDB 2022) and unranked enumeration for CQs (ICDT 2021) with
projections. We will also look at some new results and techniques on how we can build data structures that allow
for answering Boolean queries over CQs (such as k-path reachability). These results build upon the core ideas of
factorized databases literature with several twists that lead to intriguing tradeoffs between space requirement
of the data structure and the answering time.
This talk will present our latest results about efficient enumeration algorithms. These algorithms are defined
on circuit classes inspired by knowledge compilation: they can be seen as factorized representations with some
differences such as having non-constant height. I will review our results on enumeration for d-DNNF circuits
(ICALP'17), and their applications. One first application is to enumerate the matches of document spanners on
words (ICALP'19). The second application is to enumerate the answers to monadic second-order queries on trees
with support for efficient update operations (PODS'19). The third application is to enumerate extraction results for annotated grammars (PODS'22).
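The enumeration principle on such circuits can be sketched as follows (a toy encoding of my own; determinism and decomposability are assumed, and the `list(...)` materialization means this sketch does not achieve the constant-delay guarantees of the actual algorithms): at a deterministic OR, branches yield disjoint sets of models, so we concatenate; at a decomposable AND, children mention disjoint variables, so we take the product.

```python
# Hedged sketch of enumeration on a decomposable, deterministic circuit.
# The circuit encoding ('or' / 'and' / 'lit' tuples) is invented here
# purely for illustration.

from itertools import product

def enumerate_models(node):
    kind = node[0]
    if kind == 'lit':                  # ('lit', var, value)
        yield {node[1]: node[2]}
    elif kind == 'or':                 # deterministic: branches are disjoint
        for child in node[1:]:
            yield from enumerate_models(child)
    elif kind == 'and':                # decomposable: children are independent
        for parts in product(*(list(enumerate_models(c)) for c in node[1:])):
            model = {}
            for p in parts:
                model.update(p)
            yield model

# (x AND y) OR (x AND NOT y) -- a deterministic OR of decomposable ANDs
circuit = ('or',
           ('and', ('lit', 'x', True), ('lit', 'y', True)),
           ('and', ('lit', 'x', True), ('lit', 'y', False)))
print(list(enumerate_models(circuit)))
# [{'x': True, 'y': True}, {'x': True, 'y': False}]
```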
Querying graph-structured relations, i.e., those with many-to-many (m-n) relationships between entities, is
ubiquitous and integral to a wide range of analytical applications such as recommendations on social networks
and fraud detection in financial transactional networks. These queries are heavy on joins, can be broadly
categorized as cyclic and acyclic, and tend to be challenging as they produce large intermediate results. For
cyclic queries, the large intermediate results can be unnecessary and stem from the suboptimality of binary joins, the most common core join algorithms. For acyclic queries, while the large intermediate results are
part of the final results, they contain a lot of redundancy.
I will present a comprehensive query processor design and implementation that integrates factorized
representations and go over various research questions tackled to make the integration possible and efficient.
Our query processor supports a robust query plan space with many efficient plans that lead to orders
of magnitude speedups.
We present a scheme for parallel execution of SQL queries on top of any vertex-centric BSP graph processing
engine. The scheme comprises a graph encoding of relational instances and a vertex program specification of our
algorithm called TAG-join, which matches the theoretical communication and computation complexity of
state-of-the-art join algorithms (and even improves on it in some scenarios using factorized representation of
the join result).
When run on top of the vertex-centric TigerGraph database engine on a single multi-core server, TAG-join
exploits thread parallelism and is competitive with (and often outperforms) reference RDBMSs on the TPC
benchmarks they are traditionally tuned for. In a distributed cluster, TAG-join outperforms the popular Spark
SQL engine.
Bio: Ainur recently completed her PhD at the University of California San Diego, where she was advised by Alin
Deutsch. Her research interests are at the intersection of database theory and systems applied to graph
processing engines. Ainur is currently at TigerGraph on the core language team working on graph query processing
and optimization.
In this talk, we introduce the problem of optimizing a linear program
whose variables are the tuples of a relation. For this we propose a language for specifying linear programs whose constraints and objective functions depend on relation schemas. The naive
interpretation of such linear programs has one variable per element in
the relation, which can be large in the size of the database when the
relation is not given explicitly in the input, e.g., if it is
specified as the result of conjunctive queries or as a factorized
database. In these cases, we present an approach to construct a linear
program having the same optimal value as the naive interpretation but
whose number of variables is much smaller, by exploiting the structure
of the conjunctive query or the circuit itself.
Data analytics over normalized databases typically requires computing and materializing expensive joins.
Factorized query execution casts query evaluation as message passing between relations in the join graph and pushes
aggregations through joins to reduce intermediate result sizes. Although this accelerates query execution, it
only optimizes a single query. In contrast, analytics is iterative, and offers novel work-sharing opportunities
between queries.
This work shows that carefully materializing messages during query execution can accelerate subsequent
aggregation queries by >10^5x as compared to factorized execution, and only incurs a constant factor overhead.
The key challenge is that messages are sensitive to the message passing ordering. To address this challenge, we
borrow the concept of calibration in probabilistic graphical models to materialize sufficient messages to
support any ordering. We manifest these ideas in the novel Calibrated Junction Hypertree (CJT) data structure,
which is fast to build, aggressively re-uses messages to accelerate future queries, and is incrementally
maintainable under updates. We further show how CJTs benefit applications such as OLAP, query explanation,
streaming data, and data augmentation for ML.
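The basic message-passing step that CJTs materialize and re-use can be sketched in a few lines (a toy example of my own): a COUNT over a two-relation join is computed by sending per-key partial counts from one relation to the other, never materializing the join itself.

```python
# Minimal sketch of aggregation pushdown, the basic step that CJT
# messages generalize: COUNT(*) over R(a,b) join S(b,c), computed by
# passing a per-key count message from S to R.

from collections import Counter

def count_join(R, S):
    msg = Counter(b for b, _ in S)      # message: count of S-tuples per key b
    return sum(msg[b] for _, b in R)    # absorb the message at R

R = [(1, 'x'), (2, 'x'), (3, 'y')]
S = [('x', 10), ('x', 20), ('y', 30)]
print(count_join(R, S))  # 5  (2 R-tuples * 2 S-tuples on 'x', plus 1 on 'y')
```

Materializing `msg` is what enables the work sharing described above: a later aggregate over the same join can absorb the stored message instead of rescanning S, and calibration ensures messages exist for any passing order.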
The query optimization technique of pushing down aggregates through joins has been studied in the RDBMS world
for over two decades. Factorized machine learning (ML) brings that paradigm to ML algorithms on multi-table
data. The talk will give a birds-eye overview of our line of work on introducing and enabling factorized ML for
various families of ML algorithms--generalized linear models, clustering methods, a unified formal framework
based on linear algebra, and models with feature interactions--as well as in various data analytics tool contexts: in-RDBMS ML tools, in-memory R and Python tools, ML on Spark, and an optimizing polyglot compiler. I
will share our lessons learned on the benefits and limits of factorized ML, as well as the roadblocks and open
challenges in translating factorized ML to practice.
Speaker Bio: Arun Kumar is an Associate Professor in the Department of Computer Science and Engineering and the
Halicioglu Data Science Institute and an HDSI Faculty Fellow at the University of California, San Diego. His
primary research interests are in data management and systems for machine learning/artificial intelligence-based
data analytics. Systems and ideas from his work have been released as part of the Apache MADlib open-source
library and shipped as part of products from or used internally by many database, Web, and cloud companies. He
is a recipient of three SIGMOD research paper awards, four distinguished reviewer/metareviewer awards from
SIGMOD/VLDB, the IEEE TCDE Rising Star Award, an NSF CAREER Award, and research award gifts from Amazon, Google,
Oracle, and VMware.
Predicting the execution cost of query plans is hard. Traditional query optimizers estimate cost under various
simplifying assumptions, risking highly sub-optimal plan choices. This motivates intra-query learning, an
adaptive processing strategy based on reinforcement learning. Using specialized join operators, enabling fast
join order switching, it divides query execution into small time slices in which different plans are tried. By
judiciously balancing exploration of new plans versus exploitation of promising plans, it places upper bounds on
the expected gap between optimal and actual execution cost.
The recently proposed SkinnerDB system has shown that this approach is practical for a wide range of scenarios.
Its guarantees refer however to traditional, binary join operators. In this talk, I discuss ongoing research
aimed at combining intra-query learning with ideas from the domain of worst-case optimal joins and factorized
databases.
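To give a feel for the intra-query learning loop (a toy simulation with an invented reward signal, not SkinnerDB's actual operators or regret analysis): execution proceeds in fixed time slices, and a UCB1 bandit picks which join order to run in each slice, balancing exploration of untried plans against exploitation of fast ones.

```python
# Toy simulation of intra-query learning: each "plan" has a hidden
# processing speed; a UCB1 bandit allocates time slices among plans.
# Speeds and the reward model are invented for illustration.

import math
import random

def choose_plan(counts, rewards, t):
    """UCB1: pick the plan maximizing mean reward plus exploration bonus."""
    for p in range(len(counts)):          # try every plan once first
        if counts[p] == 0:
            return p
    return max(range(len(counts)),
               key=lambda p: rewards[p] / counts[p]
                             + math.sqrt(2 * math.log(t) / counts[p]))

random.seed(0)
true_speed = [0.2, 0.9, 0.5]              # hidden tuples-per-slice of each plan
counts, rewards = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 301):                   # 300 time slices
    p = choose_plan(counts, rewards, t)
    counts[p] += 1
    rewards[p] += random.gauss(true_speed[p], 0.05)   # simulated progress

print(counts)   # plan 1, the fastest, accumulates the most slices
```

The bound the abstract mentions is of this bandit flavor: the gap between optimal and actual cost is controlled because slow plans are abandoned after a bounded number of slices.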
In this talk we introduce semi-ring dictionaries, a powerful class of compositional and purely functional
collections that subsume other collection types such as sets, multisets, arrays, vectors, and matrices. We
developed SDQL, a statically typed language that can express relational algebra with aggregations, linear
algebra, and functional collections over data such as relations and matrices using semi-ring dictionaries.
Furthermore, thanks to the algebraic structure behind these dictionaries, SDQL unifies a wide range of
optimizations commonly used in databases (DB) and linear algebra (LA). As a result, SDQL enables efficient
processing of hybrid DB and LA workloads, by putting together optimizations that are otherwise confined to
either DB systems or LA frameworks. We show experimentally that a handful of DB and LA workloads can take
advantage of the SDQL language and optimizations. SDQL can be competitive with or outperforms a host of systems
that are state of the art in their own domain: in-memory DB systems Typer and Tectorwise for (flat, not nested)
relational data; SciPy for LA workloads; sparse tensor compiler taco; the Trance nested relational engine; and
the in-database machine learning engines LMFAO and Morpheus for hybrid DB/LA workloads over relational data.
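The core idea of semi-ring dictionaries can be sketched in a few lines (my own illustrative encoding; SDQL itself is statically typed and far more general): a dictionary from keys to semi-ring values lets one primitive express both a sparse matrix-vector product and a relational join-with-aggregation.

```python
# Sketch of the semi-ring dictionary idea: the same "multiply values,
# sum over shared keys" primitive serves linear algebra and relational
# workloads alike. Encodings here are invented for illustration.

def matvec(M, v):
    """M: {row: {col: value}}, v: {col: value} -> {row: value}."""
    out = {}
    for i, row in M.items():
        s = sum(a * v.get(j, 0) for j, a in row.items())
        if s:
            out[i] = s
    return out

# A sparse matrix-vector product ...
M = {'r1': {'a': 2, 'b': 1}, 'r2': {'b': 3}}
v = {'a': 10, 'b': 1}
print(matvec(M, v))  # {'r1': 21, 'r2': 3}

# ... and the same primitive as a relational join-aggregate:
# SELECT cust, SUM(qty * price) FROM Orders NATURAL JOIN Prices GROUP BY cust
orders = {'alice': {'apple': 2, 'pear': 1}, 'bob': {'pear': 3}}
prices = {'apple': 10, 'pear': 1}
print(matvec(orders, prices))  # {'alice': 21, 'bob': 3}
```

Because both workloads reduce to the same dictionary operations, optimizations written once against those operations apply to hybrid DB/LA programs, which is the unification the talk describes.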
Optimizing machine learning (ML) workloads on structured data is a key concern for data platforms. One class of
optimizations called "factorized ML" helps reduce ML runtimes over multi-table datasets by pushing ML
computations down through joins, avoiding the need to materialize such joins. The recent Morpheus system
automated factorized ML to any ML algorithm expressible in linear algebra (LA). But all such prior factorized
ML/LA stacks are restricted by their chosen programming language (PL) and runtime environment, limiting their
reach in emerging industrial data science environments with many PLs (R, Python, etc.) and even cross-PL
analytics workflows. Re-implementing Morpheus from scratch in each PL/environment is a massive developability
overhead for implementation, testing, and maintenance. We tackle this challenge by proposing a new system
architecture, Trinity, to enable factorized LA logic to be written only once and easily reused across many
PLs/LA tools in one go. To do this in an extensible and efficient manner without costly data copies, Trinity
leverages and extends an emerging industrial polyglot compiler and runtime, Oracle's GraalVM. Trinity enables
factorized LA in multiple PLs and even cross-PL workflows. Experiments with real datasets show that Trinity is
significantly faster than materialized execution (> 8x speedups in some cases), while being largely competitive
to a prior single PL-focused Morpheus stack.
This work was first presented at VLDB'21 Industry Track.
Modern database systems have been progressively expanding their use cases far outside traditional bookkeeping
and into data analytics, machine learning, and mathematical optimization among other application domains. This
in turn motivates the need for native in-database automatic differentiation to better support these use cases.
In this talk, we present relationalAD (rAD), our framework for automatic differentiation at RelationalAI (rAI). Rel, the declarative modeling language underlying rAI, can be viewed as a (vast) generalization of Datalog with aggregation and function symbols. In rAD, the input is a Rel program that
computes some (multivalued) function and the output is another Rel program that computes its derivatives with
respect to some given input relations. We show that performing AutoDiff inside a high-level database language
like Rel allows us to evaluate derivatives while enjoying many features offered by the underlying database
engine like factorization, query optimization and compilation, as well as support for higher order derivatives.
We present several examples covering recursive Rel programs, matrix calculus, neural network training, and
gradient descent among others. We conclude with some challenges, design issues, as well as open problems.
In this talk I will discuss how to factorize a noisy input relation. Traditional data management is based on a
"schema first" approach, where the designer has a clear conceptual model of the data, but most of data science
is done using "data first", where the data is given “as is”, and is often noisy or incomplete. Our goal is to
find a good approximate factorization. I will describe how to use information theory to reason about approximate
dependencies between the data: approximate FDs and MVDs become constraints on conditional entropy and mutual
information respectively, and traditional axioms for FDs and MVDs become simple Shannon-inequalities.
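The entropy characterization is easy to make concrete (a standard textbook computation, shown here for illustration): under the uniform distribution over the tuples of a relation, the FD X → Y holds exactly iff the conditional entropy H(Y | X) is 0, and a small positive value witnesses an approximate FD.

```python
# Testing (approximate) functional dependencies via conditional entropy
# over the empirical distribution of a relation.

import math
from collections import Counter

def cond_entropy(tuples, x, y):
    """H(Y | X) in bits, with each tuple equally likely."""
    n = len(tuples)
    pxy = Counter((t[x], t[y]) for t in tuples)
    px = Counter(t[x] for t in tuples)
    return sum(c / n * math.log2(px[xv] / c) for (xv, _), c in pxy.items())

# zip -> city holds exactly; city -> zip only approximately
R = [{'zip': 1, 'city': 'A'}, {'zip': 1, 'city': 'A'},
     {'zip': 2, 'city': 'B'}, {'zip': 3, 'city': 'B'}]
print(cond_entropy(R, 'zip', 'city'))  # 0.0  (exact FD)
print(cond_entropy(R, 'city', 'zip'))  # 0.5  (approximate FD)
```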
Missing data is pervasive in real-world databases and is often a major cause of misleading analysis and poor
predictions. Common methods of dealing with missing data such as tuple deletion and mean imputation may
introduce bias and distort the data distribution. More reliable imputation methods utilize regression models for
predicting missing values but rely on slow external libraries that operate over denormalized data.
This talk presents our ongoing work on enabling data imputation within a database system. We focus on Multiple
Imputation by Chained Equations (MICE), a state-of-the-art method that requires expensive training of multiple
regression models to impute missing values of multiple attributes. We show how to reformulate the MICE method
such that the computationally most intensive parts of the training process are shared across the models and executed
entirely within the database system. Our implementation within PostgreSQL supports scalable data imputation of
both numerical and categorical values, supports normalized datasets, and exhibits orders of magnitude better
performance than competitors such as scikit-learn and MADlib.
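For intuition, here is a heavily simplified two-column version of the chained-equations loop (toy code of my own; the actual system shares computation across the regression models and runs inside PostgreSQL, neither of which this sketch shows): initialize missing cells with column means, then repeatedly re-impute each column from a regression model fit on the others.

```python
# Simplified MICE sketch: two numeric columns, mean initialization,
# then alternating least-squares re-imputation.

def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return b, my - b * mx

def mice(rows, rounds=20):
    cols = ['x', 'y']
    miss = {c: [i for i, r in enumerate(rows) if r[c] is None] for c in cols}
    for c in cols:                                    # mean initialization
        obs = [r[c] for r in rows if r[c] is not None]
        for i in miss[c]:
            rows[i][c] = sum(obs) / len(obs)
    for _ in range(rounds):                           # chained equations
        for c, other in [('x', 'y'), ('y', 'x')]:
            train = [r for i, r in enumerate(rows) if i not in miss[c]]
            b, a = fit_line([r[other] for r in train], [r[c] for r in train])
            for i in miss[c]:                         # re-impute from model
                rows[i][c] = a + b * rows[i][other]
    return rows

# y = 2x exactly; one missing value per column
data = [{'x': 1, 'y': 2}, {'x': 2, 'y': 4}, {'x': 3, 'y': None},
        {'x': None, 'y': 8}]
done = mice(data)
print(round(done[2]['y'], 2), round(done[3]['x'], 2))
```

The imputed values converge toward the pattern in the observed data (here y = 2x); the system described in the talk reformulates exactly this kind of loop so its heavy aggregates are shared and pushed into the database.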
In this talk we present Figaro, an algorithm for computing the upper-triangular matrix in the QR decomposition
of the matrix defined by the natural join over a relational database. The QR decomposition lies at the core of
many linear algebra techniques and their machine learning applications, including: matrix inversion; least squares; singular value decomposition; eigenvalue problems; and principal component analysis.
Figaro's main novelty is that it pushes the QR decomposition past the join. This leads to several desirable
properties. For acyclic joins, it takes time linear in the database size and independent of the join size. Its
execution is equivalent to the application of a sequence of Givens rotations proportional to the join size. Its
number of rounding errors relative to the classical QR decomposition algorithms is on par with the input size
relative to the join size.
In experiments with real-world and synthetic databases, Figaro outperforms the LAPACK libraries openblas and Intel MKL in both runtime and accuracy, by a factor proportional to the gap between the join output and input sizes.
This is joint work with Dorde Zivanovic (Oxford) and Dan Olteanu (Zurich) and appeared in ACM SIGMOD'22.
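For background, here is a plain textbook Givens-rotation QR in pure Python (my own sketch; Figaro's novelty is pushing this computation past the join, which this sketch does not show). Each rotation zeroes one subdiagonal entry; applying them in sequence leaves the upper-triangular factor R.

```python
# Textbook QR via Givens rotations: repeatedly rotate pairs of rows to
# zero the entries below the diagonal, leaving R in place.

import math

def givens_qr_R(A):
    """Overwrite A (m x n, m >= n) with the upper-triangular R of its QR."""
    m, n = len(A), len(A[0])
    for j in range(n):
        for i in range(m - 1, j, -1):       # zero A[i][j] against A[i-1][j]
            a, b = A[i - 1][j], A[i][j]
            r = math.hypot(a, b)
            if r == 0:
                continue
            c, s = a / r, b / r
            for k in range(n):              # apply the rotation to both rows
                u, v = A[i - 1][k], A[i][k]
                A[i - 1][k] = c * u + s * v
                A[i][k] = -s * u + c * v
    return A

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
R = givens_qr_R(A)
print([[round(x, 3) for x in row] for row in R])
# R[0][0] equals the norm of A's first column; below-diagonal entries are ~0
```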
The Shapley value is a game-theoretic notion for wealth distribution that is nowadays extensively used to
explain complex data-intensive computation, for instance, in network analysis or machine learning.
Recent theoretical works show that query evaluation over relational databases fits well in this explanation
paradigm.
Yet, these works fall short of providing practical solutions to the computational challenge inherent to the
Shapley computation.
We present two practically effective solutions for computing Shapley values in query answering.
We start by establishing a tight theoretical connection to the extensively studied problem of query evaluation
over probabilistic databases, which allows us to obtain a polynomial-time algorithm for the class of queries for
which probability computation is tractable.
We then propose a first practical solution for computing Shapley values that adopts tools from probabilistic
query evaluation. In particular, we capture the dependence of query answers on input database facts using
Boolean expressions (data provenance), and then transform it, via Knowledge Compilation, into a particular
circuit form for which we devise an algorithm for computing the Shapley values.
Our second practical solution is a faster yet inexact approach that transforms the provenance to a Conjunctive
Normal Form and uses a heuristic to compute the Shapley values.
Our experiments on TPC-H and IMDB demonstrate the practical effectiveness of our solutions.
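For small provenance expressions, the Shapley value of each fact can be computed directly from its permutation-based definition (the exponential brute force that the solutions above are designed to avoid; toy code of my own): average, over all orderings of the endogenous facts, the marginal contribution of each fact to the query answer.

```python
# Brute-force Shapley values of database facts for a Boolean query,
# straight from the definition. Exponential in the number of facts.

from itertools import permutations

def shapley(facts, query):
    """query: set-of-facts -> bool; returns {fact: Shapley value}."""
    vals = {f: 0.0 for f in facts}
    perms = list(permutations(facts))
    for perm in perms:
        present = set()
        for f in perm:
            before = query(present)
            present.add(f)
            vals[f] += query(present) - before   # marginal contribution
    return {f: v / len(perms) for f, v in vals.items()}

# Provenance of one query answer: r1 and (s1 or s2)
def q(db):
    return 'r1' in db and ('s1' in db or 's2' in db)

print(shapley(['r1', 's1', 's2'], q))
# r1 is indispensable (value 2/3); s1 and s2 split the rest (1/6 each)
```

The connection to probabilistic databases exploited above is what replaces this enumeration with provenance circuits on which the values can be computed efficiently.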
In this talk I give an overview of our exploits in using query instrumentation to capture factorized provenance graphs based on factorized database techniques. Furthermore, I will present some recent results on how factorization of provenance as circuits impacts the performance of queries over bag-semantics probabilistic databases.