Since their introduction eleven years ago, factorized databases have seen great progress in their theory, systems, and applications. This workshop offers a brief look at some very recent progress on several facets of factorized databases and a great opportunity to look ahead. It features talks on: graph database systems; enumeration in static and dynamic query evaluation; query provenance management; and factorized data analytics, including factorized machine learning over relational data.
The program can be found below.
9:30 am - 10:00 am:
Morning Coffee & Tea
10:00 am - 11:00 am:
Query Enumeration and Ranking (1/2)
We discuss the task of direct access to the k-th answer of a Conjunctive Query according to a user-specified order over the answers. In other words, our goal is to simulate the sorted array of query answers without materializing all answers. Assuming we can also efficiently count the query answers, this allows efficiently answering quantile queries (e.g., finding the median). We focus on identifying the preprocessing time needed to allow logarithmic access time.
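As a toy illustration of the flavor of such algorithms (and not the talk's actual construction), direct access over the Cartesian product of two unary relations can simulate the sorted answer array with arithmetic alone, after a sort-only preprocessing step:

```python
# Toy sketch: direct access to the k-th answer of Q(a, b) :- R(a), S(b),
# sorted lexicographically, without materializing the |R| * |S| answers.
def kth_answer(R, S, k):
    """Return the k-th (0-based) pair of R x S in lexicographic order."""
    R, S = sorted(R), sorted(S)          # preprocessing
    return (R[k // len(S)], S[k % len(S)])

def median_answer(R, S):
    """A quantile query: the median of the (virtual) sorted answer array."""
    return kth_answer(R, S, (len(R) * len(S)) // 2)
```

Here counting answers is trivial (`len(R) * len(S)`), which is exactly what makes the quantile queries mentioned above efficient; for general conjunctive queries, achieving logarithmic access time after limited preprocessing is the hard part.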
Join queries often generate large output results when executed over a huge dataset. In such cases, it is prohibitively expensive to store the whole materialized output if we plan to reuse it further down a data processing pipeline. In this talk, I will present several techniques that allow us to avoid this materialization step and discuss their applications. First, I will present a space-efficient compressed representation of the output of join queries, where we can control the tradeoff between minimizing the space necessary to store the result, versus minimizing the answering time during enumeration. Then, I will discuss how we can avoid materialization when the output of the join query result is accessed in an order given by a ranking function.
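A minimal sketch of the underlying idea (my own toy, not the speaker's system): for a two-atom join, grouping one relation by the join key once yields a compressed representation of linear size from which answers can be enumerated on demand, rather than stored:

```python
from collections import defaultdict

# Factorized representation of Q(a, b, c) :- R(a, b), S(b, c):
# index S by the join attribute once, then enumerate answers lazily.
def build_index(S):
    idx = defaultdict(list)
    for b, c in S:
        idx[b].append(c)
    return idx

def enumerate_join(R, S_idx):
    # Enumeration over the compressed form: space is O(|R| + |S|),
    # while the materialized output may be as large as |R| * |S|.
    for a, b in R:
        for c in S_idx.get(b, ()):
            yield (a, b, c)
```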
CQs with projections are an interesting class of queries that are widely used in practice. In this talk, I will review our results on ranked (ICDT 2021, VLDB 2022) and unranked enumeration for CQs (ICDT 2021) with projections. We will also look at some new results and techniques on how we can build data structures that allow for answering Boolean queries over CQs (such as k-path reachability). These results build upon the core ideas of factorized databases literature with several twists that lead to intriguing tradeoffs between space requirement of the data structure and the answering time.
This talk will present our latest results about efficient enumeration algorithms. These algorithms are defined on circuit classes inspired by knowledge compilation: they can be seen as factorized representations with some differences such as having non-constant height. I will review our results on enumeration for d-DNNF circuits (ICALP'17), and their applications. One first application is to enumerate the matches of document spanners on words (ICALP'19). The second application is to enumerate the answers to monadic second-order queries on trees with support for efficient update operations (PODS'19). The third application is to enumerate extraction results for enumeration grammars (PODS'22).
Querying graph-structured relations, i.e., those with many-to-many (m-n) relationships between entities, is ubiquitous and integral to a wide range of analytical applications such as recommendations on social networks and fraud detection in financial transaction networks. These queries are join-heavy, can be broadly categorized as cyclic or acyclic, and tend to be challenging because they produce large intermediate results. For cyclic queries, the large intermediate results can be unnecessary and stem from the suboptimality of binary joins, the most common core join algorithms. For acyclic queries, the large intermediate results are part of the final results but contain a lot of redundancy.
I will present a comprehensive query processor design and implementation that integrates factorized representations, and go over the research questions we tackled to make the integration possible and efficient. Our query processor supports a robust query plan space with many efficient plans that lead to orders-of-magnitude speedups.
We present a scheme for parallel execution of SQL queries on top of any vertex-centric BSP graph processing engine. The scheme comprises a graph encoding of relational instances and a vertex program specification of our algorithm called TAG-join, which matches the theoretical communication and computation complexity of state-of-the-art join algorithms (and even improves on it in some scenarios using factorized representation of the join result).
When run on top of the vertex-centric TigerGraph database engine on a single multi-core server, TAG-join exploits thread parallelism and is competitive with (and often outperforms) reference RDBMSs on the TPC benchmarks they are traditionally tuned for. In a distributed cluster, TAG-join outperforms the popular Spark SQL engine.
Bio: Ainur recently completed her PhD at the University of California San Diego, where she was advised by Alin Deutsch. Her research interests are at the intersection of database theory and systems applied to graph processing engines. Ainur is currently at TigerGraph on the core language team working on graph query processing and optimization.
In this talk, we introduce the problem of optimizing a linear program whose variables are the tuples of a relation. For this, we propose a language for specifying linear programs whose constraints and objective functions depend on relation schemas. The naive interpretation of such a linear program has one variable per tuple of the relation, which can be large in the size of the database when the relation is not given explicitly in the input, e.g., when it is specified as the result of a conjunctive query or as a factorized database. In these cases, we present an approach to construct a linear program that has the same optimal value as the naive interpretation but far fewer variables, by exploiting the structure of the conjunctive query or of the circuit itself.
Data analytics over normalized databases typically requires computing and materializing expensive joins. Factorized query execution models execution as message passing between relations in the join graph and pushes aggregations through joins to reduce intermediate result sizes. Although this accelerates query execution, it only optimizes a single query. In contrast, analytics is iterative, and offers novel work-sharing opportunities between queries.
This work shows that carefully materializing messages during query execution can accelerate subsequent aggregation queries by >10^5x as compared to factorized execution, and only incurs a constant factor overhead. The key challenge is that messages are sensitive to the message passing ordering. To address this challenge, we borrow the concept of calibration in probabilistic graphical models to materialize sufficient messages to support any ordering. We manifest these ideas in the novel Calibrated Junction Hypertree (CJT) data structure, which is fast to build, aggressively re-uses messages to accelerate future queries, and is incrementally maintainable under updates. We further show how CJTs benefit applications such as OLAP, query explanation, streaming data, and data augmentation for ML.
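A hedged sketch of the aggregate-pushdown idea that factorized execution and the CJT both build on (a toy of my own, not the paper's data structure): a SUM over a join can be computed from per-key partial aggregates, the analogue of the "messages" above, without materializing the join:

```python
from collections import defaultdict

# SUM(R.v * S.w) over the join R(b, v) |x| S(b, w), computed without
# materializing the join: aggregate each side per join key first.
def factorized_sum(R, S):
    msg_R, msg_S = defaultdict(int), defaultdict(int)
    for b, v in R:
        msg_R[b] += v          # "message" from R on join key b
    for b, w in S:
        msg_S[b] += w          # "message" from S on join key b
    # sum over the join = sum over keys of (sum of v) * (sum of w)
    return sum(msg_R[b] * msg_S[b] for b in msg_R.keys() & msg_S.keys())
```

Materializing `msg_R` and `msg_S` is what lets a later query with a different grouping reuse work instead of rescanning the base tables.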
The query optimization technique of pushing aggregates down through joins has been studied in the RDBMS world for over two decades. Factorized machine learning (ML) brings that paradigm to ML algorithms on multi-table data. The talk will give a bird's-eye overview of our line of work on introducing and enabling factorized ML for various families of ML algorithms (generalized linear models, clustering methods, a unified formal framework based on linear algebra, and models with feature interactions) and in various data analytics tool contexts: in-RDBMS ML tools, in-memory R and Python tools, ML on Spark, and an optimizing polyglot compiler. I will share our lessons learned on the benefits and limits of factorized ML, as well as the roadblocks and open challenges in translating factorized ML to practice.
Speaker Bio: Arun Kumar is an Associate Professor in the Department of Computer Science and Engineering and the Halicioglu Data Science Institute and an HDSI Faculty Fellow at the University of California, San Diego. His primary research interests are in data management and systems for machine learning/artificial intelligence-based data analytics. Systems and ideas from his work have been released as part of the Apache MADlib open-source library and shipped as part of products from or used internally by many database, Web, and cloud companies. He is a recipient of three SIGMOD research paper awards, four distinguished reviewer/metareviewer awards from SIGMOD/VLDB, the IEEE TCDE Rising Star Award, an NSF CAREER Award, and research award gifts from Amazon, Google, Oracle, and VMware.
Predicting the execution cost of query plans is hard. Traditional query optimizers estimate cost under various simplifying assumptions, risking highly sub-optimal plan choices. This motivates intra-query learning, an adaptive processing strategy based on reinforcement learning. Using specialized join operators, enabling fast join order switching, it divides query execution into small time slices in which different plans are tried. By judiciously balancing exploration of new plans versus exploitation of promising plans, it places upper bounds on the expected gap between optimal and actual execution cost.
The recently proposed SkinnerDB system has shown that this approach is practical for a wide range of scenarios. Its guarantees refer however to traditional, binary join operators. In this talk, I discuss ongoing research aimed at combining intra-query learning with ideas from the domain of worst-case optimal joins and factorized databases.
In this talk we introduce semi-ring dictionaries, a powerful class of compositional and purely functional collections that subsume other collection types such as sets, multisets, arrays, vectors, and matrices. We developed SDQL, a statically typed language that can express relational algebra with aggregations, linear algebra, and functional collections over data such as relations and matrices using semi-ring dictionaries. Furthermore, thanks to the algebraic structure behind these dictionaries, SDQL unifies a wide range of optimizations commonly used in databases (DB) and linear algebra (LA). As a result, SDQL enables efficient processing of hybrid DB and LA workloads, by putting together optimizations that are otherwise confined to either DB systems or LA frameworks. We show experimentally that a handful of DB and LA workloads can take advantage of the SDQL language and optimizations. SDQL can be competitive with or outperforms a host of systems that are state of the art in their own domain: in-memory DB systems Typer and Tectorwise for (flat, not nested) relational data; SciPy for LA workloads; sparse tensor compiler taco; the Trance nested relational engine; and the in-database machine learning engines LMFAO and Morpheus for hybrid DB/LA workloads over relational data.
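A toy sketch inspired by (not taken from) SDQL: plain Python dicts whose values live in a semiring already let the same two operations express both relational aggregation over bags and sparse linear algebra:

```python
# Semi-ring dictionary sketch: dict values are semiring elements, so
# pointwise addition serves both bag union and vector addition.
def dict_add(d1, d2):
    """Pointwise addition: union of keys, summed values/multiplicities."""
    out = dict(d1)
    for k, v in d2.items():
        out[k] = out.get(k, 0) + v
    return out

def mat_vec(A, x):
    """Sparse matrix-vector product, with A as {row: {col: val}}."""
    # Zero results are dropped, keeping the output sparse.
    return {i: s for i, row in A.items()
            if (s := sum(v * x.get(j, 0) for j, v in row.items()))}
```

For instance, `dict_add({'alice': 2}, {'alice': 1, 'bob': 1})` is bag union of two multiset relations, while the same function on `{0: 1.0, 1: 2.0}`-style dicts is sparse vector addition.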
Optimizing machine learning (ML) workloads on structured data is a key concern for data platforms. One class of optimizations called "factorized ML" helps reduce ML runtimes over multi-table datasets by pushing ML computations down through joins, avoiding the need to materialize such joins. The recent Morpheus system automated factorized ML to any ML algorithm expressible in linear algebra (LA). But all such prior factorized ML/LA stacks are restricted by their chosen programming language (PL) and runtime environment, limiting their reach in emerging industrial data science environments with many PLs (R, Python, etc.) and even cross-PL analytics workflows. Re-implementing Morpheus from scratch in each PL/environment is a massive developability overhead for implementation, testing, and maintenance. We tackle this challenge by proposing a new system architecture, Trinity, to enable factorized LA logic to be written only once and easily reused across many PLs/LA tools in one go. To do this in an extensible and efficient manner without costly data copies, Trinity leverages and extends an emerging industrial polyglot compiler and runtime, Oracle's GraalVM. Trinity enables factorized LA in multiple PLs and even cross-PL workflows. Experiments with real datasets show that Trinity is significantly faster than materialized execution (> 8x speedups in some cases), while being largely competitive to a prior single PL-focused Morpheus stack.
This work was first presented at VLDB'21 Industry Track.
Modern database systems have been progressively expanding their use cases far outside traditional bookkeeping and into data analytics, machine learning, and mathematical optimization among other application domains. This in turn motivates the need for native in-database automatic differentiation to better support these use cases.
In this talk, we present relationalAD (rAD), our framework for automatic differentiation at relationalAI (rAI). The modeling language underlying rAI is called Rel and is declarative and can be viewed as a (vast) generalization of Datalog with aggregation and function symbols. In rAD, the input is a Rel program that computes some (multivalued) function and the output is another Rel program that computes its derivatives with respect to some given input relations. We show that performing AutoDiff inside a high-level database language like Rel allows us to evaluate derivatives while enjoying many features offered by the underlying database engine like factorization, query optimization and compilation, as well as support for higher order derivatives. We present several examples covering recursive Rel programs, matrix calculus, neural network training, and gradient descent among others. We conclude with some challenges, design issues, as well as open problems.
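To make the AutoDiff setting concrete, here is a generic forward-mode AD sketch in Python; rAD itself transforms Rel programs, so this only illustrates the general idea of a program that computes derivatives of another program:

```python
# Forward-mode AD with dual numbers: each value carries (value, derivative).
class Dual:
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.der + o.der)

    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # product rule
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)

    __rmul__ = __mul__

def derivative(f, x):
    """d f / d x at x, obtained by seeding the dual part with 1."""
    return f(Dual(x, 1.0)).der
```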
In this talk I will discuss how to factorize a noisy input relation. Traditional data management follows a "schema first" approach, where the designer has a clear conceptual model of the data, but most data science is done "data first", where the data is given as is and is often noisy or incomplete. Our goal is to find a good approximate factorization. I will describe how to use information theory to reason about approximate dependencies in the data: approximate FDs and MVDs become constraints on conditional entropy and mutual information, respectively, and the traditional axioms for FDs and MVDs become simple Shannon inequalities.
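The FD-to-entropy connection is easy to see on a small example (a sketch of the standard definition, not the talk's algorithm): the FD X -> Y holds exactly in a relation iff the conditional entropy H(Y | X) of its empirical distribution is 0, and the value of H(Y | X) measures how approximately it holds:

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, X, Y):
    """H(Y | X) over the empirical distribution of the relation.
    X and Y are lists of column indices."""
    n = len(rows)
    joint = Counter((tuple(r[i] for i in X), tuple(r[i] for i in Y))
                    for r in rows)
    marg = Counter(tuple(r[i] for i in X) for r in rows)
    # H(Y | X) = sum over (x, y) of p(x, y) * log(p(x) / p(x, y))
    return sum((c / n) * log2(marg[x] / c) for (x, _y), c in joint.items())
```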
Missing data is pervasive in real-world databases and is often a major cause of misleading analysis and poor predictions. Common methods of dealing with missing data such as tuple deletion and mean imputation may introduce bias and distort the data distribution. More reliable imputation methods utilize regression models for predicting missing values but rely on slow external libraries that operate over denormalized data.
This talk presents our ongoing work on enabling data imputation within a database system. We focus on Multiple Imputation by Chained Equations (MICE), a state-of-the-art method that requires expensive training of multiple regression models to impute missing values of multiple attributes. We show how to reformulate the MICE method such that computationally most intensive parts of the training process are shared across the models and executed entirely within the database system. Our implementation within PostgreSQL supports scalable data imputation of both numerical and categorical values, supports normalized datasets, and exhibits orders of magnitude better performance than competitors such as scikit-learn and MADlib.
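A bare-bones sketch of the regression-imputation idea (the talk's MICE reformulation, model sharing, and in-DBMS execution are not shown here): regress an incomplete column on an observed one and overwrite its missing entries with the predictions; MICE chains this step over all incomplete columns, repeatedly:

```python
def simple_regress(xs, ys):
    """Least-squares fit of y = a + b*x on fully observed pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return mx, my, b  # predict via my + b * (x - mx)

def impute_column(rows, target, predictor):
    """Impute None entries of column `target` from column `predictor`."""
    obs = [(r[predictor], r[target]) for r in rows if r[target] is not None]
    mx, my, b = simple_regress([p for p, _ in obs], [t for _, t in obs])
    for r in rows:
        if r[target] is None:
            r[target] = my + b * (r[predictor] - mx)
    return rows
```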
In this talk we present Figaro, an algorithm for computing the upper-triangular matrix in the QR decomposition of the matrix defined by the natural join over a relational database. The QR decomposition lies at the core of many linear algebra techniques and their machine learning applications, including: the matrix inverse; the least squares; the singular value decomposition; eigenvalue problems; and the principal component analysis.
Figaro's main novelty is that it pushes the QR decomposition past the join. This leads to several desirable properties. For acyclic joins, it takes time linear in the database size and independent of the join size. Its execution is equivalent to applying a sequence of Givens rotations proportional to the join size. The ratio of its number of rounding errors to that of classical QR decomposition algorithms is on par with the ratio of the input size to the join size.
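Figaro operates directly with Givens rotations; as a much simpler related observation (my sketch, not Figaro's algorithm), the Gram matrix J^T J of the join matrix J, from which the upper-triangular factor could in principle be recovered via Cholesky since R^T R = J^T J, is itself computable without materializing the join:

```python
from collections import defaultdict

def factorized_gram(R_rel, S_rel):
    """Gram matrix of the 2-column join matrix of R(b, x) |x| S(b, y),
    computed from per-key counts and sums only."""
    sx, sxx, cR = defaultdict(float), defaultdict(float), defaultdict(int)
    sy, syy, cS = defaultdict(float), defaultdict(float), defaultdict(int)
    for b, x in R_rel:
        sx[b] += x; sxx[b] += x * x; cR[b] += 1
    for b, y in S_rel:
        sy[b] += y; syy[b] += y * y; cS[b] += 1
    g_xx = g_xy = g_yy = 0.0
    for b in sx.keys() & sy.keys():
        g_xx += cS[b] * sxx[b]   # each x value repeats once per S-match
        g_yy += cR[b] * syy[b]   # each y value repeats once per R-match
        g_xy += sx[b] * sy[b]
    return [[g_xx, g_xy], [g_xy, g_yy]]
```

Going through the Gram matrix squares the condition number, which is precisely why working with Givens rotations as Figaro does is numerically preferable.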
In experiments with real-world and synthetic databases, Figaro outperforms the LAPACK libraries OpenBLAS and Intel MKL in both runtime and accuracy, by a factor proportional to the gap between the join output and input sizes.
This is joint work with Dorde Zivanovic (Oxford) and Dan Olteanu (Zurich) and appeared in ACM SIGMOD'22.
The Shapley value is a game-theoretic notion for wealth distribution that is nowadays extensively used to explain complex data-intensive computation, for instance, in network analysis or machine learning.
Recent theoretical works show that query evaluation over relational databases fits well in this explanation paradigm.
Yet, these works fall short of providing practical solutions to the computational challenge inherent to the Shapley computation.
We present two practically effective solutions for computing Shapley values in query answering.
We start by establishing a tight theoretical connection to the extensively studied problem of query evaluation over probabilistic databases, which allows us to obtain a polynomial-time algorithm for the class of queries for which probability computation is tractable.
We then propose a first practical solution for computing Shapley values that adopts tools from probabilistic query evaluation. In particular, we capture the dependence of query answers on input database facts using Boolean expressions (data provenance), and then transform it, via Knowledge Compilation, into a particular circuit form for which we devise an algorithm for computing the Shapley values.
Our second practical solution is a faster yet inexact approach that transforms the provenance to a Conjunctive Normal Form and uses a heuristic to compute the Shapley values.
Our experiments on TPC-H and IMDB demonstrate the practical effectiveness of our solutions.
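For reference, the definition being computed can be sketched by brute force (exponential in the number of facts, which is exactly why the practical solutions above are needed): the Shapley value of a fact for a Boolean query is its average marginal contribution to making the query true, over all orderings of the endogenous facts:

```python
from itertools import combinations
from math import factorial

def shapley(facts, query):
    """Brute-force Shapley values; `query` maps a set of facts to a bool."""
    n = len(facts)
    vals = {f: 0.0 for f in facts}
    for f in facts:
        others = [g for g in facts if g != f]
        for k in range(n):
            # weight of coalitions of size k in the permutation average
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                marginal = query(set(S) | {f}) - query(set(S))
                vals[f] += weight * marginal
    return vals
```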
In this talk I give an overview of our exploits in using query instrumentation to capture factorized provenance graphs using factorized database techniques. Furthermore, I will present some recent results on how factorizing provenance as circuits impacts the performance of queries over bag-semantics probabilistic databases.