The aim of our research is to build a scalable engine to compute hybrid data management and analytics workloads over large-scale normalized relational databases. Current state-of-the-art analytics solutions rely on a conventional technology stack with separate, specialized systems for data management and for analytics. This separation, however, results in a poor integration of data management and analytics techniques, which can lead to major bottlenecks when computing analytics over normalized databases. An in-database analytics engine thus bears the promise of significant improvements in performance, as well as in our understanding of scalable analytics over databases.
Why compute analytics tasks inside the database?
Hybrid data management and analytics workloads are very common in industry. At LogicBlox, for instance, clients in the retail domain compute regression models over a database that contains weekly sales data, promotions, and product descriptions, to predict the additional demand generated for a given product due to a promotion. Since machine learning algorithms require a single dataset as input, the training dataset for such applications is the result of a relational join query over the database. In a conventional technology stack, the query is computed inside a database management system, and the output is then imported into a specialized analytics engine that computes the regression model. In-database analytics thus brings the analytics closer to the data and saves the non-trivial time usually spent on data import/export at the interface between database systems and analytics packages. This was the main motivation for early incarnations of in-database analytics systems, such as MADlib [6] or Bismarck [5]. Such systems, however, still require the computation of the join result, which is given as input to the analytics problem.
The engine that we propose in our research can significantly outperform these early in-database analytics systems, because it computes analytics problems over databases without materializing the join result. This is particularly important because join queries can introduce significant redundancy, due to the nature of the relational data representation. This redundancy is not necessary for subsequent processing, such as computing regression or classification models. In order to avoid this redundancy, we express the data-intensive computation needed for a host of machine learning tasks as a set of database queries with joins and group-by aggregates, and devise a new factorized computation technique for such queries [1,11,9].
Factorized computation of relational queries
Factorized query computation is a new trend in database management research. It exploits three distinct ideas: worst-case optimal join algorithms [10], query plans that are defined by hypertree decompositions of the query hypergraph [2], and heavy-light partitioning of input relations for more adaptive processing of heavy values [3]. In addition, factorized computation of aggregate-join queries systematically pushes aggregates past joins [4,2,9].
The combination of these techniques represents the current state of the art in query evaluation, with significant benefits: factorized query computation can compute queries with joins and group-by aggregates with the lowest known complexity [3, 10]. In practice, these low complexity guarantees translate to orders-of-magnitude performance improvements over non-factorized query evaluation techniques [4].
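The core idea of pushing aggregates past joins can be made concrete with a toy example (not the engine's actual algorithm, and using hypothetical relations): to count the tuples in a join, we never need to materialize the join itself; partial counts per join key suffice.

```python
# Toy illustration of pushing a COUNT aggregate past a join: instead of
# materializing R JOIN S and counting its tuples, count matching tuples
# per join key in each relation and combine the partial counts.
from collections import Counter

# Hypothetical relations R(key, a) and S(key, b), joined on `key`.
R = [(1, "a1"), (1, "a2"), (2, "a3")]
S = [(1, "b1"), (1, "b2"), (1, "b3"), (2, "b4")]

def count_join_naive(R, S):
    # Materialize the join, then count: produces |R_k| * |S_k| tuples per key.
    return sum(1 for (k1, _a) in R for (k2, _b) in S if k1 == k2)

def count_join_factorized(R, S):
    # Count per key in each relation, then multiply the partial counts:
    # linear in the input, never touching the join result.
    cr = Counter(k for k, _ in R)
    cs = Counter(k for k, _ in S)
    return sum(cr[k] * cs[k] for k in cr.keys() & cs.keys())

assert count_join_naive(R, S) == count_join_factorized(R, S) == 7
```

The same multiply-partial-results pattern extends to sum-product aggregates, which is what the factorized evaluation of aggregate-join queries exploits.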
Our research shows that the data-intensive part of many analytics problems over relational databases can be expressed as factorized queries that compute group-by aggregates over joins. The intuition behind this approach is illustrated by the example of computing linear regression models over normalized databases.
Efficient computation of linear regression models over databases
The parameters of a linear regression model can be computed with gradient descent optimization that minimizes the least-squares loss function. Gradient descent is an iterative algorithm, which repeatedly updates the model parameters in the direction of the gradient of the objective function until convergence. A rewriting of the gradient formulation allows for a decoupling of the data-dependent terms from the model parameters. For models that are computed over relational databases, we can then compute the data-dependent terms as relational aggregate-join queries that compute sum-product aggregates over the join of the input relations in the database. Subsequently, the iterative gradient descent process is run directly over the output of these aggregate-join queries, without going over the training dataset (i.e. the output of the join query) [11].
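A minimal sketch of this decoupling, assuming for clarity a tiny, already-joined training set: the data-dependent terms are the sum-product aggregates Sigma = X^T X and c = X^T y, which an engine can compute as aggregate-join queries; gradient descent then iterates over these d-by-d aggregates only, never rescanning the training data.

```python
# Training data: each row is (features x, label y); x[0] = 1.0 is a bias term.
data = [([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
d = 2

# One pass computes all SUM(x_i * x_j) and SUM(x_i * y) aggregates;
# this is the part that is expressed as aggregate-join queries.
Sigma = [[sum(x[i] * x[j] for x, _ in data) for j in range(d)] for i in range(d)]
c = [sum(x[i] * y for x, y in data) for i in range(d)]
m = len(data)

# Gradient descent over the aggregates: grad = (Sigma @ theta - c) / m.
theta = [0.0] * d
for _ in range(5000):
    grad = [(sum(Sigma[i][j] * theta[j] for j in range(d)) - c[i]) / m
            for i in range(d)]
    theta = [theta[i] - 0.1 * grad[i] for i in range(d)]

# The data follows y = 1 + 2*x, so theta converges to about [1.0, 2.0].
```

The key point is that each gradient step costs O(d^2), independent of the (potentially huge) size of the join result.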
An additional benefit of the rewriting of data-intensive analytics code into aggregate-join queries is that it can capture regression models with categorical variables without any intermediate encoding of the input data. The state-of-the-art approach for learning models with categorical variables is to one-hot encode their active domain: each value in the active domain of a categorical variable is encoded by an indicator vector whose dimension is the size of the domain. For instance, the colors in the domain {red, green, blue} can be represented by indicator vectors [1, 0, 0] for red, [0, 1, 0] for green, and [0, 0, 1] for blue. One-hot encoding the training dataset results in a relational representation with one new attribute per distinct category of each categorical variable and with wide tuples whose values are mostly 0. This entails huge redundancy due to the presence of the many 0 values. It also blurs the usual database distinction between schema and data, since the schema can be as large as the input database under one-hot encoding. One-hot encoding of the input data is not necessary for in-database analytics engines. Instead, we compute linear regression models with categorical variables by rewriting the data-intensive computations into relational queries that compute group-by aggregates over joins, where categorical variables become group-by variables, and continuous variables are aggregated away [1].
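A toy illustration (with hypothetical data) of why one-hot encoding is unnecessary: the aggregate SUM(x_c * x_c') over one-hot indicators of a single categorical variable is nonzero only when c = c', and then equals the group-by count of category c. A group-by aggregate thus recovers exactly the non-zero entries of the one-hot aggregates, without materializing any 0-padded vectors.

```python
from collections import Counter

colors = ["red", "green", "red", "blue", "red"]
domain = ["red", "green", "blue"]

# One-hot route: encode, then aggregate over mostly-zero vectors.
def one_hot(v):
    return [1 if v == c else 0 for c in domain]

encoded = [one_hot(v) for v in colors]
dense = [[sum(row[i] * row[j] for row in encoded) for j in range(len(domain))]
         for i in range(len(domain))]

# Group-by route: one sparse aggregate, no encoding, e.g.
# SELECT color, COUNT(*) FROM colors GROUP BY color;
sparse = Counter(colors)

# The dense aggregate matrix is diagonal, and its diagonal is exactly
# the group-by result; all other entries are redundant zeros.
for i, cat in enumerate(domain):
    assert dense[i][i] == sparse[cat]
    assert all(dense[i][j] == 0 for j in range(len(domain)) if j != i)
```

With several categorical variables, the off-diagonal blocks become group-by aggregates over pairs of variables; the sparse group-by representation remains proportional to the number of categories that actually occur.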
The number of queries that are computed for a linear regression model is quadratic in the number of features of the model. Despite this large number of queries, evaluating them individually with factorized query computation techniques and then learning a regression model over them can be asymptotically much faster than state-of-the-art specialized analytics systems. By computing the set of queries together, however, we can share a lot of computation. This can further improve the complexity of the problem [11].
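A back-of-the-envelope sketch of the sharing opportunity (over a hypothetical, already-joined dataset for simplicity): a model with d features needs all pairwise aggregates SUM(x_i * x_j), i.e. O(d^2) queries. Evaluating them one at a time scans the data once per query, whereas computing them together shares a single scan; in the factorized setting over joins, much of the join computation is shared as well.

```python
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # hypothetical joined rows
d = 3

# Separate evaluation: one pass over the data per query, d * d passes total.
separate = {(i, j): sum(row[i] * row[j] for row in data)
            for i in range(d) for j in range(d)}

# Shared evaluation: a single pass fills all d * d accumulators at once.
shared = {(i, j): 0.0 for i in range(d) for j in range(d)}
for row in data:
    for i in range(d):
        for j in range(d):
            shared[(i, j)] += row[i] * row[j]

assert separate == shared
```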
In-database analytics as a multi-query optimization problem
The aggregate-join queries that constitute the data-intensive part of analytics problems have very similar structure and share non-trivial amounts of computation. It is possible to exploit this sharing by computing all queries together at once. Thus, the in-database analytics problem becomes primarily a multi-query optimization problem for the special case of many queries that have similar structure.
In order to abstract away from specific analytics problems, we designed a general query expression that captures all queries needed for a wide variety of applications. This is in the spirit of functional aggregate queries [2], yet with products and sums of functions of arbitrary arity. The class of analytics problems over databases for which this generalized query captures the data-intensive computations includes (among others) regression, factorization machines, principal component analysis, decision trees, random forests, Bayesian networks, k-nearest neighbors, and frequent itemset mining. The multi-query optimization problem thus requires the computation of a set of queries of this generalized form.
The solution to this multi-query optimization problem can be computed with a single, highly optimized query evaluation algorithm, which systematically decomposes the queries into small, sharable views along the structure of a tree decomposition of the query. This algorithm generalizes the well-known message passing algorithm commonly used to do inference in probabilistic graphical models.
One form of sharing across such queries employs ring operations beyond sum and product: instead of computing a set of very similar queries, we can compute one overapproximation and one compensation per query. For $n$ queries, where each query takes time linear in $n$ in expression complexity (in addition to the data complexity, which is the same for all these queries), the non-ring approach would take $O(n^2)$ time, whereas the ring approach takes $O(n)$ time for the overapproximation and $O(1)$ time per compensation, so $O(n)$ overall. This is akin to Strassen's algorithm, which exploits similar ideas to reduce the number of value multiplications required for the product of two matrices [12].
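A toy analogue of the overapproximation-plus-compensation idea (deliberately simplified, not the actual ring construction): suppose each of n queries sums all terms except one. Computing each query independently costs O(n) per query, O(n^2) total; computing the common overapproximation once and one O(1) compensation per query costs O(n) overall.

```python
terms = [3, 1, 4, 1, 5, 9, 2, 6]
n = len(terms)

# Independent evaluation: O(n) work per query, O(n^2) total.
independent = [sum(t for j, t in enumerate(terms) if j != i) for i in range(n)]

# Shared evaluation: one overapproximation plus O(1) compensation per query.
total = sum(terms)                             # computed once: O(n)
shared = [total - terms[i] for i in range(n)]  # O(1) per query

assert independent == shared
```

The ring structure matters because subtraction (the additive inverse) is what makes the compensation step possible; a sum-product semiring alone would not allow it.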
All computation of this generalized multi-query optimization problem can be expressed in a declarative way, by defining the views in SQL, Datalog or any other relational query language. This means that the techniques can be readily deployed by anyone with a robust relational database management system.
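A minimal sketch of this declarative deployment, using Python's built-in sqlite3 module (the table and column names are hypothetical): the sum-product aggregates needed for, say, a linear regression over a join are plain SQL views over the base relations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product_id INT, qty REAL);
    CREATE TABLE products (product_id INT, price REAL);
    INSERT INTO sales VALUES (1, 2.0), (1, 3.0), (2, 4.0);
    INSERT INTO products VALUES (1, 10.0), (2, 20.0);

    -- Sum-product aggregates over the join, as a declarative view;
    -- the join result itself is never exported to an analytics tool.
    CREATE VIEW aggregates AS
    SELECT SUM(qty * qty)    AS s_qq,
           SUM(qty * price)  AS s_qp,
           SUM(price * price) AS s_pp
    FROM sales NATURAL JOIN products;
""")
s_qq, s_qp, s_pp = conn.execute("SELECT * FROM aggregates").fetchone()
```

An off-the-shelf relational engine evaluates such views directly; the factorized techniques described above determine how efficiently they can be evaluated.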
This proposed framework is a generalization of the results from earlier publications, which have shown that this in-database analytics framework can lead to significant performance improvements for the computation of regression models and factorization machines over databases [11, 9, 1].
Further exploiting structure in the underlying data
In addition to factorizing and sharing the computation of the data-intensive parts of the analytics problem, an in-database analytics engine can further speed up the computation by exploiting structural properties in the underlying data.
Our research has shown that functional dependencies present in the database can be used to reduce the dimensionality of regression models, including standard linear regression models and factorization machines [1]. Given a model to learn, we instead learn a potentially much smaller model that has no functionally determined parameters. This smaller model can be learned more quickly, because it requires fewer aggregate-join queries and each gradient descent step takes less time to compute. Using the functional dependencies, we can then map the small model back into the original large model by adding back the functionally determined parameters.
In future research, we plan to investigate the impact of such properties on the generalized multi-query optimization problem. In particular, we will consider degree constraints and conditional functional dependencies.
In-database model selection, online learning, and distributed processing
Beyond scaling the computation of analytics, the rewriting of the data-intensive computations into aggregate-join queries is beneficial for three related problems in machine learning. First, the output of the aggregate-join queries that are used to compute a given model can be re-used to compute any model over a subset of the original features. Therefore, once the queries are computed, it is possible to efficiently explore the feature space to select the most accurate model without ever touching the input data [8]. This means that model selection can be done very efficiently inside a database engine. Second, the aggregate-join queries that capture the data-intensive computations can be maintained efficiently under updates to the underlying database by using novel incremental maintenance techniques [7]. Such techniques provide an efficient framework for online learning of analytics over databases. Third, the aggregate-join queries that capture the data-intensive computations of the analytics problem are distributive, which means that the computation of the problem can be easily parallelized and distributed [11].
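The model-selection claim can be sketched in a few lines (over hypothetical, already-joined data for simplicity): once the aggregate matrix Sigma = X^T X over the full feature set is computed, the aggregates for any feature subset are just a sub-block of Sigma, so exploring candidate sub-models needs no further scan of the input data.

```python
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
d = 3

# Aggregates over all features, computed once from the data.
Sigma = [[sum(row[i] * row[j] for row in data) for j in range(d)]
         for i in range(d)]

def restrict(Sigma, features):
    # Aggregates for a candidate sub-model: pure index arithmetic on Sigma.
    return [[Sigma[i][j] for j in features] for i in features]

# Aggregates for the sub-model over features {0, 2}, without touching `data`.
sub = restrict(Sigma, [0, 2])

# Sanity check: the sub-block matches a direct recomputation over the data.
direct = [[sum(r[i] * r[j] for r in data) for j in (0, 2)] for i in (0, 2)]
assert sub == direct
```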
Conclusion
This research project offers an exciting new research agenda towards large-scale in-database analytics. The unifying problem formulation of this approach allows for a systematic investigation of data analytics from a database viewpoint, and constitutes a significant step towards designing end-to-end data analytics solutions.