Ten Years of Factorized Databases: A 2022 Review of their Impact

The work that started factorized databases

The International Conference on Database Theory (ICDT) 2012 paper [DBLP:CONF/ICDT/OLTEANUZ12] and its extended version published in the ACM Transactions on Database Systems (TODS) 2015 journal [DBLP:journals/tods/OlteanuZ15], both by Olteanu and Závodný, put forward two basic yet fundamental observations:

  • The relational representation of the answers to conjunctive queries entails redundancy that can be systematically avoided by a succinct and lossless factorized representation.
  • This factorized representation can be computed directly from the input database, in time proportional to the size of the representation and of the input.

Such factorized representations strike a proper balance between compactness (elimination of redundancy) and efficiency. Subsequent work by the same authors also showed that a host of computations can be performed directly on the factorized representation, that is, without the need to de-factorize the represented data.

A factorized representation can be viewed as a relational algebra query that involves data values and only two operators: union and Cartesian product. We also refer generally to such representations as factorized databases. They are a special form of compression: they factorize data akin to the factorization of polynomials, e.g.,

a*x + a*y + b*x + b*y = (a+b) * (x+y).

Data factorization is orthogonal to value-based compression, such as run-length encoding; the two can be combined to achieve an even greater compression factor.
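
To make the analogy concrete, here is a minimal Python sketch under a simplified, hypothetical encoding (not the representation system of the papers): a factorized representation is a tree of value, union, and product nodes, and the represented tuples can be recovered by a straightforward traversal.

    from itertools import product

    # Simplified encoding of a factorized representation as a nested tuple:
    #   ("val", column, constant)   -- singleton relation {(constant)}
    #   ("union", child, ...)       -- union of the children
    #   ("prod", child, ...)        -- Cartesian product of the children
    # This mirrors a*x + a*y + b*x + b*y = (a+b) * (x+y).
    factorized = ("prod",
                  ("union", ("val", "A", "a"), ("val", "A", "b")),
                  ("union", ("val", "X", "x"), ("val", "X", "y")))

    def enumerate_tuples(node):
        """Yield the tuples (as column-to-value dicts) represented by the factorization."""
        kind = node[0]
        if kind == "val":
            _, column, constant = node
            yield {column: constant}
        elif kind == "union":
            for child in node[1:]:
                yield from enumerate_tuples(child)
        elif kind == "prod":
            for combination in product(*(list(enumerate_tuples(c)) for c in node[1:])):
                merged = {}
                for part in combination:
                    merged.update(part)
                yield merged

    # Four value nodes represent four tuples here; with unions of size n on both
    # sides, 2n values would represent n*n tuples.
    print(list(enumerate_tuples(factorized)))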

This initial work also gives effective tools for managing such factorized representations, including:

  • representation systems,
  • their worst-case optimal size and computation, and
  • constant-delay enumeration of the represented tuples.

Where are Factorized Databases Used?

Factorized databases have impacted research in both database systems and database theory, in particular on:

  • graph representation and processing,
  • constant-delay enumeration for static and dynamic query evaluation,
  • fine-grained query provenance,
  • factorized aggregate computation, and
  • factorized machine learning.

For this reason, the ICDT 2012 paper [DBLP:CONF/ICDT/OLTEANUZ12] on factorized databases received the ICDT 2022 Test of Time Award. This award "recognizes a paper selected from the proceedings of the ICDT 2012 conference that has had the highest impact in terms of research, methodology, conceptual contribution, or transfer to practice over the past decade."

The rest of this note highlights a selection of research efforts that acknowledge the use of factorized database techniques in new settings or that extend the initial ICDT 2012 work. It quotes from these publications to emphasize their pivotal reliance on factorized databases. The lasting influence of this work is remarkable: a solid body of follow-up work continues to appear a decade after the initial publication.

Graph Representation and Processing

As argued in the system-building efforts described below, factorization is a natural fit for the representation and processing of graph data. Graph traversals call for joins with large outputs that can be represented succinctly in factorized form.

Salihoglu et al [DBLP:journals/pvldb/GuptaMS21] revisit column-oriented storage and query processing techniques in the context of contemporary graph database management systems. A core decision in the design of such a system is to “adopt a factorized tuple set representation scheme” to “avoid repetitions of values.” The work develops a new block-based processor that “uses factorized representation of intermediate tuples.” Salihoglu and his team also study the problem of optimizing one-time and continuous subgraph queries using worst-case optimal join plans [10.1145/3446980]. They put forward a cache that “gives benefits similar to factorization”: whenever the matches of some query vertices are independent, they can be computed once and reused across the matches of further query vertices.

Deutsch et al [10.1145/3448016.3457314] present a scheme for parallel execution of SQL queries on top of a graph processing engine. This work uses an aggregation scheme “inspired by aggregation over hypertree decompositions in factorized databases.” The values received by different vertices in the system “correspond to the factorized representation of the join result.”

Wu et al [10.1007/978-3-030-27520-4_20] address the challenge of quickly finding homomorphic matches for hybrid patterns over large data graphs, which is fundamental to graph analytics. They introduce the concept of an answer graph to encode all the possible homomorphisms from a pattern to the graph. This answer graph represents results succinctly “similar to factorized representations of query results studied in the context of classical and probabilistic databases.” This property is useful, as in factorized databases, to “calculate the cardinality of the query answer [...] without explicitly enumerating the occurrences.”

Godfrey et al [DBLP:conf/edbt/Abul-BasherYGCC21] (best short paper award at EDBT 2021) propose an answer-graph method for the evaluation of SPARQL conjunctive queries. This approach “finds a factorized answer set first, an answer graph, and then finds the embedding tuples from this.” They also point out that factorization can be effective for graph data: “While factorization is sometimes a significant win for evaluating relational queries, it is virtually always a win for evaluating graph CQs.”

Vidal et al [10.1145/3102254.3102260] present “a solution to the problem of factorizing RDF graphs describing semantic sensor data.” A key contribution is an algorithm “that solves the problem of query evaluation on a factorized RDF graph.” This work builds on “factorization techniques [...] for optimization of relational data and SQL query processing” and proposes a “factorization technique tailored for semantically described sensor data.” The extended journal article [DBLP:journals/jiis/KarimVA21] also investigates “the effectiveness of the proposed factorization techniques [...] as well as the impact of factorizing semantic sensor data on query processing using LinkedSensorData benchmark.”

Enumeration for Static Query Evaluation

A key property of factorized representations is that they allow tuple enumeration with constant delay: the time needed to output the first tuple, as well as the time between outputting one tuple and the next, does not depend on the size of the representation. This gives a refinement of the computational complexity of a query that consists of two components:

  1. The preprocessing time, which is essentially the construction of the factorized representation of the query result.
  2. The enumeration delay, which is the time between consecutive output tuples.

The ICDT 2012 paper [DBLP:CONF/ICDT/OLTEANUZ12] and its extended TODS version [DBLP:journals/tods/OlteanuZ15] are the first to settle this refined complexity to O(N^fhtw(Q)) preprocessing time and O(1) enumeration delay for any conjunctive query Q, where N is the size of the database and fhtw(Q) is the fractional hypertree width of Q. The acclaimed prior result by Bagan, Durand, and Grandjean [DBLP:conf/csl/BaganDG07] guarantees linear preprocessing time and constant delay for free-connex acyclic conjunctive queries. A number of follow-up publications considered more expressive queries, restricted classes of databases, and preprocessing-enumeration trade-offs for both static and dynamic query evaluation. Some of these efforts that directly use the results on constant-delay enumeration in factorized databases are highlighted below.
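
As a toy illustration of this two-component cost (a sketch only, not the general algorithm from these papers; all names are hypothetical), the following Python snippet factorizes the join R(A,B) ⋈ S(B,C) in time linear in the input and then enumerates its tuples; the work between two consecutive outputs depends only on the query, not on the database.

    from collections import defaultdict

    def preprocess(R, S):
        """Linear-time preprocessing: group R(A,B) and S(B,C) by the join key B.
        The result factorizes the join as a union over b of b x (A-values) x (C-values)."""
        A_by_b, C_by_b = defaultdict(list), defaultdict(list)
        for a, b in R:
            A_by_b[b].append(a)
        for b, c in S:
            C_by_b[b].append(c)
        return [(b, A_by_b[b], C_by_b[b]) for b in A_by_b if b in C_by_b]

    def enumerate_join(factorization):
        """Enumerate the join tuples; the work between consecutive outputs is
        bounded by a constant that depends on the query, not on the data."""
        for b, As, Cs in factorization:
            for a in As:
                for c in Cs:
                    yield (a, b, c)

    R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
    S = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]
    fr = preprocess(R, S)            # O(|R| + |S|) time and space
    print(list(enumerate_join(fr)))  # 4 join tuples from 5 stored data values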

In a series of papers, Amarilli et al [DBLP:conf/icalp/AmarilliBJM17] develop circuits that are more general than factorized representations and still allow for constant-delay enumeration. Their work focuses on efficient enumeration algorithms, such as those for factorized representations of query results, with strict requirements: the preprocessing must be linear in the input size, and the delay between successive solutions must be constant. It shows how d-DNNF circuits “generalize the deterministic factorized representations of relational instances” and “gives enumeration algorithms with linear preprocessing and constant delay for arbitrary deterministic d-representations [a form of factorized representation], extending the enumeration result [of factorized databases].” The work also describes “links to factorized databases and strengthens the enumeration result” in the original factorized database work. Amarilli and collaborators [DBLP:conf/icdt/AmarilliBM18] further consider the problem of enumeration on trees under relabelings. They define set-valued circuits, which “can also be seen to be isomorphic to arithmetic circuits, and generalize factorized representations.”

Toruńczyk [10.1145/3375395.3387660] proposes an algebraic framework for studying efficient algorithms for query evaluation, aggregation, enumeration, and maintenance under updates, on sparse databases. This work introduces circuits that “can be alternatively viewed as factorized representations (extended by suitable permanent operators) of query answers.”

Koutris et al [DBLP:conf/icdt/DeepHK21] investigate the trade-off between preprocessing time and delay guarantees for the enumeration of star and path queries with projections. Their further work considers similar trade-offs between the size of the compressed representation and the answer time for queries with access patterns [10.1145/3196959.3196979]. They say: “The key observation is that we can take advantage of the underlying logical structure in order to design algorithms that can compress the data effectively. This idea has been explored before in the context of factorized databases, which can be viewed as a form of logical compression. Our approach builds upon the idea of using query decompositions as a factorized representation, and we show that for certain access patterns it is possible to go below |D|^fhw space for constant delay enumeration.”

Further work considers the enumeration of query results following a given order. Bakibayev et al [DBLP:journals/pvldb/BakibayevKOZ13] revisit enumeration on factorized databases and characterize which factorization structures admit constant-delay enumeration for a given order-by clause. This is later generalized to the complex enumeration orders used in ranked enumeration by Tziavelis et al [10.14778/3397230.3397250] and Carmeli et al [DBLP:conf/pods/CarmeliTGKR21].

Tziavelis et al [10.14778/3397230.3397250] note that “factorized databases exploit the distributivity of product over union to represent query results compactly and generalize the results on bounded fhw to the non-Boolean case. Our encoding as a DP graph leverages the same principles and is at least as efficient space-wise.”

Carmeli et al [DBLP:conf/pods/CarmeliTGKR21] conclude their paper with the statement that “we view this work as part of a bigger challenge that continues the line of research on factorized representations in databases: how can we represent the output of a query in a way that, compared to the explicit representation, is fundamentally more compact and efficiently computable, yet equally useful to downstream operations?”. Their non-disruptive-trio characterization of the queries that admit constant access time to tuples in their results corresponds to the alternative characterization of constant-delay enumeration from factorized representations over variable orders put forward in [DBLP:journals/pvldb/BakibayevKOZ13]: constant-delay enumeration can be achieved for those factorized representations over variable orders where the order of the variables in the order-by clause is a prefix of a topological ordering of the variable order.

Enumeration for Dynamic Query Evaluation

Beyond the static setting, factorized representations have been used for dynamic query evaluation, that is, for maintaining the results of queries under inserts to and deletes from the input database.

Schweikardt et al [10.1145/3034786.3034789] consider the task of enumerating and counting answers to conjunctive queries in the presence of tuple insertions and deletions. Essentially, “the dynamic data structure that is computed by our algorithm can be viewed as an f-representation [a form of factorized representation] of the query result, but not every f-representation can be efficiently maintained under database updates.”

Salihoglu et al [10.1145/3446980] point out that the Dynamic Constant-delay Linear Representation (DCLR) data structure and the Dynamic Yannakakis Algorithm [DBLP:conf/sigmod/IdrisUV17] are “reminiscent of factorized database representation and processing.”

F-IVM is an open-source prototype that unifies maintenance for a variety of tasks, including gradient computation for learning linear regression models over joins, matrix chain multiplication, and factorized evaluation of conjunctive queries [DBLP:conf/sigmod/NikolicO18]. It is “the first approach to employ factorized computation for three aspects of incremental maintenance for queries with aggregates and joins: (1) it exploits insights from query evaluation algorithms with best known complexity and optimizations that push aggregates past joins; (2) it can process bulk updates expressed as low-rank decompositions; and (3) it can maintain compressed representation of query results.”
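
As a toy illustration of the idea behind factorized maintenance of an aggregate over a join (a sketch only, not F-IVM's view-tree machinery; all names are hypothetical), consider maintaining SUM(A*C) over the join of R(A,B) and S(B,C) under single-tuple inserts by keeping per-key partial sums:

    from collections import defaultdict

    # Maintain Q = SUM(a * c) over the join of R(A,B) and S(B,C) under inserts.
    # Per join key b we keep sum_a[b], the sum of A-values in R with key b, and
    # sum_c[b], the sum of C-values in S with key b; then Q = sum over b of
    # sum_a[b] * sum_c[b].
    sum_a = defaultdict(float)
    sum_c = defaultdict(float)
    Q = 0.0

    def insert_R(a, b):
        """Insert (a, b) into R and refresh Q in O(1) time for this query."""
        global Q
        Q += a * sum_c[b]   # the new tuple joins with all existing S-tuples of key b
        sum_a[b] += a

    def insert_S(b, c):
        """Insert (b, c) into S and refresh Q in O(1) time for this query."""
        global Q
        Q += sum_a[b] * c
        sum_c[b] += c

    insert_R(2.0, "b1"); insert_S("b1", 3.0); insert_R(4.0, "b1")
    print(Q)  # 2*3 + 4*3 = 18.0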

Kara et al [10.1145/3375395.3387646] introduce a preprocessing-enumeration-update trade-off map for the static and dynamic evaluation of hierarchical conjunctive queries. This map recovers a large number of prior results, including the original result on factorized databases in the static setting. The approach uses F-IVM's factorized representations, called view trees, to support the partial evaluation and maintenance of the result of a given hierarchical query under light/heavy value degree constraints.

Fine-Grained Query Provenance

The ICDT 2012 paper [DBLP:CONF/ICDT/OLTEANUZ12] also introduces effective techniques for the factorization of the provenance polynomials of conjunctive queries. These techniques have been used subsequently in a number of works, in settings where the provenance information would otherwise be too large to store and query effectively.

Glavic et al [DBLP:journals/vldb/LeeLG19] introduce a practical system for why and why-not provenance called PUG (Provenance Unification through Graphs). PUG takes a provenance question and a Datalog query as input and generates a Datalog program that computes an explanation, that is, the part of the provenance that is relevant to answering the question. It demonstrates “how a desirable factorization of provenance can be achieved by rewriting an input query [...] This is akin to factorization of provenance polynomials in the semiring model and utilizes factorized databases techniques”. Section 9 of this work is devoted to provenance factorization and to how to “extend our approach to produce concise, factorized representations of provenance”.

In work that received the VLDB 2017 best paper award [DBLP:journals/pvldb/DeutchFG17] and follow-up journal articles [DBLP:journals/vldb/DeutchFG20, 10.1145/3277006.3277017], Deutch and his team put forward a natural language (NL) interface for formulating database queries and an answering system that is based on the provenance of tuples in the query result, detailing not only the results but also their explanations. Since provenance is typically large, this work introduces “solutions for its effective presentation as NL text: one that is based on provenance factorization, with novel desiderata relevant to the NL case, and one that is based on summarization”. The core of their method is based on the observation underlying factorization: “As observed already in previous work, different assignments (explanations) may have significant parts in common, and this can be leveraged in a factorization that groups together multiple occurrences... Importantly, we impose a novel constraint on the factorizations that we look for (which we call compatibility), intuitively capturing that their structure is consistent with a partial order defined by the parse tree of the question”. The solutions put forward rely on extensions of the original provenance factorization technique: “We propose two solutions: the first based on the idea of provenance factorization, and the second leveraging factorization to provide a summarized form”.

Ives et al [10.14778/3436905.3436909] discuss solutions for storing fine-grained provenance in relational storage systems while both compressing it and protecting it via cryptographic hashes. The motivation is to facilitate reproducible data science and auditable data analyses by tracking the processes and inputs responsible for each result of an analysis. A key technical contribution is a provenance archival system that efficiently stores provenance for computations performed over time. This system “opportunistically exploits shared subexpressions as they occur”, which is acknowledged as a form of factorization. It also points out that Bao et al [10.1145/2396761.2398439] developed strategies for factoring out provenance storage for common query expressions, following the original work on provenance factorization.

Factorized Aggregate Computation

A distinct and rich line of work enabled by the ICDT 2012 paper is on efficient computation over factorized data. A first approach appeared in the VLDB 2013 paper by Bakibayev et al [DBLP:journals/pvldb/BakibayevKOZ13], which shows that the computation of group-by aggregates can benefit from factorized joins.
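
The essence of aggregate computation over factorized data can be shown in a minimal sketch under the same simplified node encoding as the earlier example (names hypothetical): an aggregate is evaluated in one pass over the factorization, without enumerating tuples, because unions add up and products multiply.

    # COUNT over a factorization, without enumerating any tuple: a value node
    # counts 1, a union sums the counts of its children, a product multiplies them.
    def count(node):
        kind = node[0]
        if kind == "val":
            return 1
        counts = [count(child) for child in node[1:]]
        if kind == "union":
            return sum(counts)
        if kind == "prod":
            result = 1
            for c in counts:
                result *= c
            return result

    factorized = ("prod",
                  ("union", ("val", "A", "a"), ("val", "A", "b")),
                  ("union", ("val", "X", "x"), ("val", "X", "y")))
    print(count(factorized))  # 4, computed in time linear in the factorization size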

The framework of Functional Aggregate Queries (FAQ) by Abo Khamis et al [DBLP:conf/pods/KhamisNR16] (PODS 2016 best paper award) and AJAR by Ré et al [10.1145/2902251.2902293] established the same computational complexity for answering conjunctive queries with aggregates as that of computing factorized aggregates [DBLP:journals/pvldb/BakibayevKOZ13] over factorized representations of conjunctive query results [DBLP:journals/tods/OlteanuZ15]. The FAQ paper notes about the relationship with prior work on factorized databases: “Bakibayev et al and Olteanu and Závodný introduced the notion of factorized databases, and showed how one can efficiently compute join and aggregates over factorized databases. In hindsight there is much in common between their approach and InsideOut applied to the single semiring case. Both approaches have the same runtime complexity, because both are dynamic programming algorithms, InsideOut is bottom-up, and factorized database computation is top-down (memoized).”

The AJAR paper [10.1145/2902251.2902293] also notes that “There is a standard modification to Yannakakis to handle aggregations, but the classic analysis provides only a O(IN · OUT) bound. Bakibayev, Kocisky, Olteanu, and Zavodny study aggregation-join queries in factorized databases, and later Olteanu and Závodný connected factorized databases and GHDs/GHDJoin [the TODS extended version of the ICDT 2012 paper]. They develop the intuition that if output attributes are above non-output attributes, the +OUT runtime is preserved; we use the same intuition to develop and analyze AggroGHDJoin, a variant to GHDJoin for aggregate-join queries.” AJAR forms the basis of the EmptyHeaded system [10.1145/3129246], which adopts factorized aggregate computation: “Previous work has investigated aggregations over hypertree decompositions [citations to factorized databases]. EmptyHeaded adopts this previous work in a straightforward way.”

Wu et al [huang2021reptile] adopt the factorized computational paradigm for the design and implementation of a query explanation system called Reptile. This system “computes a succinct factorised matrix representation that reduces the matrix representation by orders of magnitude” and “extend[s] prior work, which developed model training procedures over factorised matrices derived from join queries, to matrices based on join-aggregation queries that exhibit fewer redundancies.” It “adapts factorized representations to compactly represent the feature matrix, and extend[s] matrix operations to support factorized representations.”

Capelli et al [DBLP:journals/corr/abs-1901-03633] have recently proposed an approach to compute complex dependency weighted aggregations over factorized databases. They “based [their] approach on the framework of factorized databases of Olteanu and Závodný”. The core of their computational approach is to factorize the linear program, whose optimal solution is the dependency weighted aggregation, following the structure of the underlying factorized data join: “To solve ground linear programs efficiently, we will work on factorized representations of relations. We use a generalization of the framework of factorized databases introduced by Olteanu et al”.

Factorized Machine Learning

Realizing that model training in machine learning boils down to the evaluation of aggregates over joins, there followed a solid line of work on in-database factorized machine learning for a variety of models: linear regression [10.1145/2882903.2882939]; polynomial regression, factorization machines, and principal component analysis [10.1145/3375661]; support vector machines and arbitrary models trained using non-polynomial loss functions [DBLP:journals/tods/KhamisCMNNOS20]; and k-means clustering [DBLP:conf/aistats/CurtinM0NOS20]. This work led to several engines for factorized computation of batches of queries, the latest being the open-source LMFAO system [DBLP:conf/sigmod/SchleichOK0N19].
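
The core observation admits a compact illustration. The following hypothetical sketch (not the LMFAO or F-IVM algorithm) computes two of the aggregates needed for least-squares regression over the join of R(A,B) and S(B,C), namely COUNT(*) and SUM(A*C), from per-key partial aggregates and without ever materializing the join:

    from collections import defaultdict

    # Toy version of the key idea behind factorized (in-database) learning:
    # the sum-of-products aggregates that feed gradient computations for
    # linear regression over the join of R(A,B) and S(B,C) can be assembled
    # from per-key partial aggregates, without materializing the join.
    def regression_aggregates(R, S):
        cnt_a, sum_a = defaultdict(int), defaultdict(float)
        cnt_c, sum_c = defaultdict(int), defaultdict(float)
        for a, b in R:
            cnt_a[b] += 1; sum_a[b] += a
        for b, c in S:
            cnt_c[b] += 1; sum_c[b] += c
        count = sum_ac = 0.0
        for b in cnt_a.keys() & cnt_c.keys():
            count  += cnt_a[b] * cnt_c[b]    # COUNT(*) over the join
            sum_ac += sum_a[b] * sum_c[b]    # SUM(A*C) over the join
        return count, sum_ac

    R = [(1.0, "b1"), (2.0, "b1"), (5.0, "b2")]
    S = [("b1", 10.0), ("b1", 20.0)]
    print(regression_aggregates(R, S))  # (4.0, 90.0): (1+2) * (10+20) = 90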

Inspired by this line of work on factorized databases, Kumar coined the terms “factorized learning” [DBLP:journals/pvldb/JustoYSPK21, DBLP:conf/sigmod/KumarNP15] and “factorized linear algebra” [DBLP:journals/pvldb/ChenKNP17, DBLP:conf/sigmod/LiC019] to emphasize the core aspect of computing aggregates over joins without materializing the joins explicitly: “We call our technique factorized learning, borrowing the terminology from “factorized” databases [...] We extend the general idea of factorized computation to ML algorithms over joins” [DBLP:conf/sigmod/KumarNP15]; “Is it possible to generalize the idea of factorized ML and automate its application to a much wider variety of ML algorithms and platforms in a unified manner?” [DBLP:journals/pvldb/ChenKNP17]. In recent years, factorized machine learning has become an accepted approach beyond the works co-authored by Kumar and Olteanu, with examples such as factorized support vector machines [9101671].
