Over the past 30 years, the world of data management has undergone a dramatic evolution, allowing us to consider an increasingly wide range of data types, sources, and processing models. This has also led to the collection of tremendous volumes of data, and we are now at a point where almost every aspect of our lives is data-driven. In this talk, Prof. Tova Milo will share her perspective on some of the "hot topics" that she has been personally involved with in this evolution, and discuss why certain research topics become popular and when they lose their relevance. She will furthermore argue that we are now facing a turning point where the deluge of data has become a serious risk, and that the next "hot" research agenda should focus on data disposal.
Despite advances in storage technology, the amount of data generated is projected to surpass storage production by an order of magnitude by 2025, and uncontrolled data retention further poses significant risks to security and privacy. She will discuss the logical, algorithmic, and methodological foundations necessary for the systematic disposal of large-scale data, highlighting new research challenges and the potential for reusing existing techniques. She will also share insights from the research conducted by the Tel Aviv Databases group in this direction. Ultimately, managing data effectively while respecting storage, processing, and regulatory constraints is a significant challenge, and she looks forward to exploring this topic further in her talk.
Modern data analytics requires iteration, yet relational database engines are mostly optimized for non-recursive queries. SQL supports only a limited form of recursion. A better formalism for recursive queries is datalog, which has some elegant properties (recursion always terminates) and led to the development of two powerful optimization techniques: semi-naive evaluation and magic set rewriting. But standard datalog is restricted to monotone queries over sets and does not support aggregates, which has limited its adoption.
In this talk I will describe a new approach to recursive queries and their optimization. First, we extend datalog to semirings, while preserving some of the elegant properties of datalog and also supporting aggregates naturally. Then, I will describe a simple yet very powerful optimization rule, called the FGH rule, that rewrites a recursive program into a different recursive program. The rule captures many optimizations discussed in the literature, such as magic set optimization, the PreM rule, and semi-naive evaluation, as well as new semantic optimizations. Our implementation of the FGH rule is based on the egg term rewriting engine and the Rosette program synthesizer.
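To make the semi-naive idea concrete, here is a minimal sketch in Python of semi-naive evaluation for transitive closure, a standard textbook instance (the rule set and the nested-loop join are illustrative, not the talk's implementation):
```python
# Semi-naive evaluation of:  path(x, y) :- edge(x, y).
#                            path(x, z) :- path(x, y), edge(y, z).
def transitive_closure(edges):
    path = set(edges)    # all facts derived so far
    delta = set(edges)   # facts that are new since the last iteration
    while delta:
        # Join only the *new* facts with edge, rather than re-deriving
        # everything from scratch -- this is the semi-naive optimization.
        derived = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = derived - path   # keep only genuinely new facts
        path |= delta
    return path

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```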
The scientific field of geospatial artificial intelligence (GeoAI) was recently formed by combining innovations in spatial science with the rapid growth of methods in artificial intelligence (AI), particularly machine learning (e.g., deep learning), data mining, and high-performance computing, to glean meaningful information from spatial big data. GeoAI is an interdisciplinary field spanning several scientific subjects, including computer science, engineering, statistics, and spatial science. In this seminar talk, we will explore research work currently under way at Ordnance Survey and partner universities to unlock the potential of GeoAI in the connected and autonomous vehicle (CAV) sector.
Speaker Bio: Dr Stefano Cavazzi is a Principal Innovation and Research Scientist at Ordnance Survey, where he leads the development of GIS and geomatics research programmes. He holds a PhD from Cranfield University and has spent the last fifteen years working in the geospatial sector, in both academia and industry, specialising in geospatial data science. At the Geo IoT World Awards 2018, a team led by Stefano won the GeoData Intelligence award for developing an autonomous decision-making support system for emergency responders.
SQL-on-Hadoop, NewSQL and NoSQL databases provide semi-structured data models (typically JSON-based). They are now moving towards declarative, SQL-like query languages. However, their idiomatic, non-SQL language constructs, the many variations, and the lack of formal syntax and semantics pose problems. Notably, database vendors end up with unclear semantics and complicated implementations as they add one feature at a time.
The presented SQL++ semi-structured data model bridges semi-structured data and the SQL data model. The SQL++ query language aims at backward compatibility with SQL. We show that a relatively small set of SQL restriction removals and feature additions is enough to provide a SQL-compatible extension to semi-structured data. SQL++ is currently being adopted by the industry.
The extension to Configurable SQL++ includes configuration options that describe different choices of language semantics and formally capture the variations among existing database languages. Configurable SQL++ is unifying: by appropriate choices of configuration options, the Configurable SQL++ semantics can morph into the semantics of any of eleven popular semi-structured databases that we surveyed, as the experimental validation shows. In this way, Configurable SQL++ allows a formal characterization of the capabilities of the emerging query languages.
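As a rough illustration of how configuration options can parameterize semantics (a hypothetical sketch, not the actual SQL++ formalism), consider how different semi-structured databases disagree on comparing a missing attribute:
```python
# A hypothetical config option choosing what equality on a MISSING
# attribute returns, mimicking how semi-structured databases diverge
# on such corner cases. Names here are illustrative only.

MISSING = object()  # stand-in for an absent attribute

def eval_eq(left, right, config):
    if left is MISSING or right is MISSING:
        if config["missing_eq"] == "null":    # comparison yields NULL/unknown
            return None
        if config["missing_eq"] == "false":   # comparison is simply false
            return False
        if config["missing_eq"] == "error":   # comparison raises an error
            raise ValueError("cannot compare MISSING")
    return left == right

# The same query text behaves differently under different configurations:
print(eval_eq(MISSING, 1, {"missing_eq": "null"}))   # None
print(eval_eq(MISSING, 1, {"missing_eq": "false"}))  # False
```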
Short Bio: Yannis Papakonstantinou is a Senior Principal Scientist at Amazon Web Services and Professor (on leave) of Computer Science and Engineering at the University of California, San Diego. His research is at the intersection of data management technologies and the web, where he has published over one hundred research articles and received over 15,000 citations. He has given multiple tutorials and invited talks, has served on journal editorial boards, and has chaired and participated in program committees for many international conferences and workshops. He also co-founded and taught for UCSD's Master of Advanced Studies in Data Science.
We discuss the complexity of enumerating the answers of Unions of Conjunctive Queries (UCQs), and focus on the ability to list output tuples with constant delay following linear-time preprocessing. A known dichotomy classifies the self-join-free CQs into those that admit such enumeration, and those that do not. However, this classification no longer holds in the common case where the database exhibits dependencies among attributes. In such cases, some queries classified as hard are in fact tractable, and we generalize the dichotomy to accommodate Functional Dependencies (FDs). Next, we aspire to have a similar characterization for UCQs, even when there are no FDs. It was claimed in the past that a UCQ is hard if one of its queries is hard. We examine the task of enumerating UCQs, and show that some unions containing a hard query are tractable. In fact, a UCQ may be tractable even if it does not contain a single tractable CQ.
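For intuition, here is a minimal Python sketch of what "linear-time preprocessing followed by constant-delay enumeration" means, on the simple free-connex query Q(x, y) :- R(x), S(x, y) (illustrative only; the talk's dichotomies concern general CQs and UCQs):
```python
from collections import defaultdict

def preprocess(R, S):
    # Linear time: index S by its first attribute, and keep only those
    # x in R that have at least one S-partner, so enumeration never stalls.
    s_index = defaultdict(list)
    for (x, y) in S:
        s_index[x].append(y)
    live = [x for x in R if x in s_index]
    return live, s_index

def enumerate_answers(live, s_index):
    # Every indexed list is non-empty, so each answer is emitted after
    # O(1) work: constant delay between consecutive outputs.
    for x in live:
        for y in s_index[x]:
            yield (x, y)

live, idx = preprocess({1, 2, 3}, {(1, "a"), (1, "b"), (3, "c")})
print(sorted(enumerate_answers(live, idx)))  # [(1, 'a'), (1, 'b'), (3, 'c')]
```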
Short Bio: Nofar is a PhD student in the Data and Knowledge group at the Technion, advised by Prof. Benny Kimelfeld. She is currently a visiting researcher in the FDB group at Oxford. Her research focuses on query optimization using enumeration techniques. Nofar completed her BSc in 2015 in the Lapidim excellence program of the Computer Science department of the Technion.
Information Extraction (IE) is the task of extracting structured information from textual data. We explore a programming paradigm that is supported by several IE systems, where relations extracted by atomic extractors undergo a relational manipulation. In our efforts toward achieving a better understanding of IE systems, we study the computational complexity of queries in the framework of document spanners (spanners, for short) that was introduced by Fagin et al. A spanner is a function that extracts from a document (string) a relation over text intervals, called spans, using either atomic extractors or a relational algebra query on top of these extractors. Evaluating a spanner on a document is a computational problem that involves three components: the atomic extractors, the relational structure of the query, and the document. We investigate the complexity of this problem from various angles, each keeping some components fixed and regarding the rest as input.
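As a toy illustration of the model, the sketch below builds atomic extractors from regular expressions and applies a simple relational operation on the extracted spans (an assumed encoding of spans as [begin, end) pairs, not the formal spanner machinery):
```python
import re

def regex_extractor(pattern, doc):
    # Atomic extractor: a unary relation of spans matching the pattern.
    return {(m.start(), m.end()) for m in re.finditer(pattern, doc)}

def join_adjacent(rel1, rel2):
    # A simple relational manipulation on spans: keep pairs where the
    # second span starts right after the first one (plus one space).
    return {(s1, s2) for s1 in rel1 for s2 in rel2 if s2[0] == s1[1] + 1}

doc = "Alice met Bob"
caps = regex_extractor(r"[A-Z][a-z]+", doc)   # capitalized words as spans
verbs = regex_extractor(r"met", doc)
print(join_adjacent(caps, verbs))             # {((0, 5), (6, 9))}
```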
Short Bio: Liat Peterfreund is a PhD candidate in the Computer Science Department at the Technion. Her research is done under the supervision of Prof. Benny Kimelfeld and focuses on establishing the foundations of incorporating Information Extraction in Database Queries.
Since 2009, we have been developing and operating Cloud-deployed, hybrid transactional/analytic processing (HTAP) systems for large retail customers. We implement these applications primarily in a datalog-based language called LogiQL, which permits the concise expression of rich models with powerful constraints and business rules. The ability to build such systems at scale in LogiQL is due to a number of innovations in the LogicBlox platform, including advanced multi-way join algorithms, powerful query optimizations, write-optimized data structures, and the ability to harness the parallelism available in modern servers. Unfortunately, not every LogiQL program can (yet) take full advantage of these innovations. When tuning such a program, a developer will often rewrite parts of the program to introduce new predicates and rules that materialize intermediate results. Inevitably, these rewrites have a deleterious effect on the concision, readability, and maintainability of a program. In practice, we mitigate these challenges through optimization heuristics and by generating a significant component of the LogiQL program from a higher-level form (called CubiQL) that allows us to automate their application. This talk surveys the challenges that arise when using LogiQL to build HTAP applications in this domain and the heuristics we developed to address them. We conclude with open problems that remain difficult to automate.
Short Bio: Kurt Stirewalt joined LogicBlox in 2009 and has since served as Dev Lead, Chief Application Architect, and eventually VP of Software Development for Infor Retail, which acquired LogicBlox in 2016. Previously, he was a professor of computer science and engineering at Michigan State University, where he specialized in model-based software development, generative programming, and software reuse. Kurt received his PhD in computer science from the Georgia Institute of Technology in 1997.
Performing data analytics on huge data sets has become crucial for many enterprises. It has been facilitated by the introduction of the MapReduce programming and execution model, and by other more recent distributed computing platforms. MapReduce users often have analysis tasks that are too complex to express as individual MapReduce jobs, and therefore they use high-level query languages such as Pig or Hive to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. In my talk, I will present ReStore, a system that manages the storage and reuse of intermediate results between MapReduce jobs in such workflows. At the end of the talk, I will briefly discuss related ideas for optimizing queries executed on other distributed computing platforms.
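A minimal sketch of the reuse idea, assuming intermediate results can be keyed by a signature of the producing job and its input (the function names and the signature scheme are illustrative, not ReStore's actual design):
```python
store = {}  # signature -> materialized intermediate result

def run_job(name, fn, inputs):
    # Key each job by its name and input; a later workflow containing
    # the same sub-job over the same input can skip recomputation.
    signature = (name, tuple(sorted(map(str, inputs))))
    if signature in store:
        return store[signature]          # reuse a stored intermediate
    result = fn(inputs)                  # otherwise compute and keep it
    store[signature] = result
    return result

# Two workflows sharing a filtering sub-job: the second run is free.
data = [1, 2, 3, 4, 5, 6]
evens = run_job("filter_even", lambda xs: [x for x in xs if x % 2 == 0], data)
evens_again = run_job("filter_even", lambda xs: [x for x in xs if x % 2 == 0], data)
print(evens, evens_again)                # [2, 4, 6] [2, 4, 6]
```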
Hardware-conscious database systems evaluate queries in milliseconds that take minutes in conventional systems, turning long-running jobs into interactive queries. However, the plethora of hardware-focused tuning techniques creates a design space that is hard to navigate for a skilled performance engineer and even harder to exploit for modern, code-generating data processing systems. In addition, hardware-conscious tuning is often at odds with other design goals such as implementation effort, ease of use and maintainability -- in particular when developing code-generating database systems. Well-designed programming abstractions are essential to allow the creation of systems that are fast, easy to use and maintainable.
In my talk, I demonstrate how existing frameworks for high-performance, data-parallel programming fall short of this goal. I argue that the poor performance of many mainstream data processing systems is due to the lack of an appropriate intermediate abstraction layer, i.e., one that is amenable to both traditional data-oriented and low-level hardware-focused optimizations.
To address this problem, I introduce Voodoo, a data-parallel intermediate language that is abstract enough to allow effective code generation and optimization but low-level enough to express many common optimizations such as parallelization, loop tiling or memory locality optimizations. I demonstrate how we used Voodoo to build a relational data processing system that outperforms the fastest state-of-the-art in-memory database systems by up to five times. I also demonstrate how Voodoo can be used as a performance engineering framework, allowing the expression of many known optimizations and even enabling the discovery of entirely new optimizations.
Short Bio: Holger is a Lecturer (roughly assistant professor) in the Department of Computing at Imperial College London. As such, he is a member of the Large-Scale Data and Systems Group. Before that, he was a Postdoc in the Database group at MIT CSAIL. He spent his PhD years in the Database Architectures group at CWI in Amsterdam, resulting in a PhD from the University of Amsterdam in 2015. He received a master's degree (Diplom) in computer science from Humboldt-Universität zu Berlin in 2010. His research interests lie in analytical query processing on memory-resident data. In particular, he studies storage schemes and processing models for modern hardware.
Relational queries, and in particular join queries, often generate large outputs when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a data processing pipeline. In this talk, we consider the problem of constructing space-efficient compressed representations of the output of conjunctive queries, with the goal of supporting efficient access to the intermediate compressed result for a given access pattern. In particular, we will study an important tradeoff: minimizing the space necessary to store the compressed result, versus minimizing the answer time and delay for an access request over the result. We present a novel parameterized data structure, which can be tuned to trade off space for answer time. This tradeoff allows us to control the space requirement of the data structure precisely, and depends both on the structure of the query and on the access pattern. We then show how we can use the data structure in conjunction with query decomposition techniques, in order to efficiently represent the outputs for several classes of conjunctive queries. Finally, we conclude with some exciting open problems in this area.
This is joint work with Shaleen Deep.
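For intuition, here is a toy sketch of trading materialization for access time on the query Q(x, z) :- R(x, y), S(y, z), under the access pattern "given x, enumerate all z" (a simplified rendering of the idea, not the talk's parameterized structure):
```python
from collections import defaultdict

def build(R, S):
    # Linear-size indexes instead of a (possibly quadratic) join output.
    r_index = defaultdict(list)   # x -> [y]
    s_index = defaultdict(list)   # y -> [z]
    for (x, y) in R:
        r_index[x].append(y)
    for (y, z) in S:
        s_index[y].append(z)
    return r_index, s_index

def access(x, r_index, s_index):
    # Answer an access request for a given x without ever having
    # materialized the join output, paying a little time per access.
    for y in r_index.get(x, []):
        for z in s_index.get(y, []):
            yield z

r_idx, s_idx = build({("a", 1), ("a", 2)}, {(1, "u"), (2, "v")})
print(sorted(access("a", r_idx, s_idx)))   # ['u', 'v']
```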
Short Bio: Paris Koutris is an assistant professor at the University of Wisconsin-Madison, where he started in Fall 2015. He completed his Ph.D. in the Computer Science & Engineering Department at the University of Washington, advised by Dan Suciu. His research focuses on the theoretical aspects of data management. He is particularly interested in applying formal methods to various problems of modern data management systems: data processing in massively parallel systems and at scale, data pricing, and managing data with uncertainty. For his Ph.D. thesis, he received the 2016 SIGMOD Jim Gray Doctoral Dissertation Award.
LogiQL is a declarative modeling language for concisely expressing rich data models with powerful constraints and business rules. The declarative nature and simple syntax of LogiQL make it amenable to many different implementation strategies, including: allowing the database to determine how data are stored and indexed; deciding how queries are evaluated and results cached; evaluating concurrent transactions to maximize throughput; and even employing powerful constraint optimization solvers to solve linear and integer programming problems.
This ability to declare a model in a manner that is abstract with respect to implementation strategy is powerful. As a simple example, it removes much of the complexity of multi-core programming. It also enables a new approach to programming large database applications, especially those requiring complex prescriptive and descriptive analytics. Infor Retail has adopted LogiQL (and the LogicBlox platform) to build out our suite of retail forecasting and supply-chain optimization products. This talk gives a brief introduction to LogiQL and LogicBlox in the context of several common modeling problems in the retail domain.
Short Bio: Kurt Stirewalt joined LogicBlox in 2009 and has since served as Dev Lead, Chief Application Architect, and eventually VP of Software Development for Infor Retail, which acquired LogicBlox in 2016. Previously, he was a professor of computer science and engineering at Michigan State University, where he specialized in model-based software development, generative programming, and software reuse. Kurt received his PhD in computer science from the Georgia Institute of Technology in 1997.
An early vision in Computer Science has been to create intelligent systems capable of reasoning on large amounts of data. Today, this vision can be delivered by integrating Relational Databases with the Semantic Web using the W3C standards: a graph data model (RDF), ontology language (OWL), mapping language (R2RML) and query language (SPARQL). The research community has successfully been showing how intelligent systems can be created with Semantic Web technologies, now dubbed Knowledge Graphs.
However, where is the mainstream industry adoption? What are the barriers to adoption? Are these engineering and social barriers, or are they open scientific problems that need to be addressed?
This talk will chronicle our journey of deploying Semantic Web technologies with real-world users to address Business Intelligence and Data Integration needs, describe technical and social obstacles that are present in large organizations, and highlight scientific challenges that require attention.
Short Bio: Juan F. Sequeda is the co-founder of Capsenta, a spin-off from his research, and the Senior Director of Capsenta Labs. He holds a PhD in Computer Science from the University of Texas at Austin. His research interests are at the intersection of Logic and Data, and in particular between the Semantic Web and Relational Databases, for data integration, ontology-based data access and semantic/graph data management. Juan is the recipient of the NSF Graduate Research Fellowship, received 2nd Place in the 2013 Semantic Web Challenge for his work on ConstituteProject.org, Best Student Research Paper at the 2014 International Semantic Web Conference, and the 2015 Best Transfer and Innovation Project awarded by the Institute for Applied Informatics. Juan is the General Chair of the 2018 Alberto Mendelzon Workshop on Foundations of Databases (AMW2018), was the PC chair of the ISWC 2017 In-Use track, is on the Editorial Board of the Journal of Web Semantics, and is a member of multiple program committees (ISWC, ESWC, WWW, AAAI, IJCAI). Juan is a member of the Linked Data Benchmark Council (LDBC) Graph Query Languages task force and has also been an invited expert and standards editor at the World Wide Web Consortium (W3C).
This talk considers a method for compressing graphs into a smaller graph grammar based on the RePair compression scheme. We start by defining the necessary notion of context-free hyperedge replacement grammars using examples. We then extend RePair to graphs using this grammar formalism and discuss how the problem of finding non-overlapping occurrences appears more difficult on graphs than on strings and trees.
We also give some intuition on graphs that are difficult to compress with HR grammars, present some experimental results based on a proof-of-concept implementation, and briefly mention how to navigate the represented data without decompression.
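For reference, a minimal sketch of RePair on strings, the base scheme the talk generalizes to graphs (a toy version; real implementations track pair frequencies far more efficiently):
```python
from collections import Counter

def repair(s):
    # Repeatedly replace the most frequent adjacent pair of symbols
    # with a fresh nonterminal, recording one grammar rule per step.
    seq = list(s)
    rules = {}
    next_symbol = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                      # no pair worth replacing
        nt = f"N{next_symbol}"
        next_symbol += 1
        rules[nt] = pair               # new grammar rule: nt -> pair
        out, i = [], 0
        while i < len(seq):            # replace non-overlapping occurrences
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

print(repair("abababab"))
# (['N1', 'N1'], {'N0': ('a', 'b'), 'N1': ('N0', 'N0')})
```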
Short Bio: Fabian Peternek is about to receive his Ph.D. from the University of Edinburgh on the topic of grammar-based compression of graphs. His research includes a generalization of the popular RePair compression scheme to graphs, as well as algorithmic results on compressed graph and tree grammars, with the aim of achieving polynomial time bounds (with respect to the grammar's size) on queries run on grammars.
In this talk, we study the architecture of query compilers. The state of the art in query compiler construction lags behind that in the compilers field. We attempt to remedy this by exploring the key causes of technical challenges in need of well-founded solutions, and by gathering the most relevant ideas and approaches from the PL and compilers communities for easy digestion by database researchers. All query compilers known to us are more or less monolithic template expanders that do the bulk of the compilation task in one large leap. Such systems are hard to build and maintain. We propose instead to use a stack of multiple DSLs on different levels of abstraction, with lowering in multiple steps, to make query compilers easier to build and extend, ultimately allowing us to create more convincing and sustainable compiler-based data management systems. We attempt to derive our advice for creating such DSL stacks from widely accepted principles. We have also re-created a well-known query compiler following these ideas and report on this effort.
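A minimal sketch of the multi-level idea: a declarative query expression is lowered step by step to executable loop form instead of being template-expanded in one leap (all names here are illustrative, not an actual query-compiler API):
```python
# Level 1: declarative algebra (what to compute).
query = ("sum", ("filter", lambda t: t > 10, ("scan", "numbers")))

# Level 2: lower the algebra to a loop/generator-based intermediate form.
def lower(expr, tables):
    op = expr[0]
    if op == "scan":
        return iter(tables[expr[1]])
    if op == "filter":
        return (t for t in lower(expr[2], tables) if expr[1](t))
    if op == "sum":
        return sum(lower(expr[1], tables))
    raise ValueError(op)

# A further level would emit specialized low-level code from the loop
# form; here we simply execute it.
print(lower(query, {"numbers": [5, 12, 40, 7]}))  # 52
```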
Amir Shaikhha is a 5th year Ph.D. student at EPFL. His research aims to build efficient data analytics systems using high-level languages. More specifically, he is interested in using compilation techniques for generating efficient low-level code (e.g. C code) from the high-level specification (e.g. Scala code) of performance-critical systems (e.g. database systems). He received his M.Sc. from EPFL and B.S. from the Sharif University of Technology.
Modern computing tasks such as real-time analytics require refreshing query results under high update rates. Incremental View Maintenance (IVM) approaches this problem by materializing results in order to avoid recomputation. IVM naturally induces a trade-off between the space needed to maintain the materialized results and the time used to process updates.
In this talk, we show that the full materialization of results is a barrier to more general optimization strategies. In particular, we present a new approach for evaluating queries under updates. Instead of materializing the results, we require a data structure that allows: (1) linear-time maintenance under updates, (2) constant-delay enumeration of the output, (3) constant-time lookups in the output, while (4) using only linear space in the size of the database. We call such a structure a Dynamic Constant delay Linear Representation (DCLR) for the query. We show that Dyn, a dynamic version of the Yannakakis algorithm, yields DCLRs for the class of free-connex acyclic CQs. We show that this is optimal in the sense that no DCLR can exist for CQs that are not free-connex acyclic. Moreover, we identify a sub-class of queries for which Dyn features constant-time updates per tuple and show that this class is maximal. An experimental validation of Dyn shows that it is highly effective in practice.
The talk will conclude with current insights from ongoing research on how Dyn can be extended to also allow processing of more general classes of queries, in particular queries that feature inequality joins rather than equality joins. Such inequality joins are important in relational-style queries, but are also prominently present in the area of Complex Event Recognition.
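To make the contrast with full materialization concrete, here is a toy sketch of a representation that supports constant-time single-tuple updates while still letting answers be enumerated on demand, for the free-connex query Q(x, y) :- R(x), S(x, y) (illustrative only; Dyn covers the general class with formal delay guarantees):
```python
from collections import defaultdict

class DynamicView:
    def __init__(self):
        self.r = set()
        self.s = defaultdict(set)   # x -> {y}

    def insert_r(self, x):          # O(1) update, no result materialized
        self.r.add(x)

    def insert_s(self, x, y):       # O(1) update
        self.s[x].add(y)

    def enumerate(self):            # answers reflect all updates so far
        for x in self.r:
            for y in self.s.get(x, ()):
                yield (x, y)

v = DynamicView()
v.insert_r(1); v.insert_s(1, "a"); v.insert_s(2, "b")
print(list(v.enumerate()))          # [(1, 'a')]
v.insert_r(2)
print(sorted(v.enumerate()))        # [(1, 'a'), (2, 'b')]
```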
We study planning in Markov decision processes involving discrete and continuous states and actions, and an unknown number of objects. Planning in such domains is notoriously challenging and often requires restrictive assumptions. We introduce HYPE, a very general sample-based planner for hybrid domains that combines model-based approaches with state abstraction. Most significantly, the domains where such planners are deployed are usually very complex, with deep structural and geometric constraints. HYPE is instantiated in a probabilistic programming language that allows compact codification of such constraints.
In our empirical evaluations, we show that HYPE is a general and widely applicable planner in domains ranging from strictly discrete to strictly continuous to hybrid ones. Moreover, empirical results show that abstraction provides significant improvements.
In the final part of the talk, we turn to the question of whether there is any hope of developing computational methodologies that are not based on sampling. In particular, it is tricky in hybrid domains to deal with low-probability observations, and most sampling-based schemes only provide asymptotic guarantees.
This talk is based on a Machine Learning Journal article (2017), and is joint work with Davide Nitti, Tinne De Laet and Luc De Raedt.
This talk overviews recent results on bounding the output size and on efficiently evaluating a disjunctive datalog rule, given input statistics such as relation sizes, functional dependencies, and degree bounds. These are the kind of statistics prevalent in database query evaluation, and our results apply to aggregate queries as well. The disjunctive datalog query evaluation problem is fairly general, as both conjunctive query evaluation and the basic constraint satisfaction problem are special cases. These new combinatorial and algorithmic results are built on a fundamental connection between query evaluation and Shannon-type inequalities. It was observed in different contexts over the past 40 years that information-theoretic inequalities can be used to bound combinatorial quantities. First, one can derive (sometimes tight) output size bounds for conjunctive queries and disjunctive datalog rules using Shannon-type inequalities. This talk discusses these bounds and techniques. Second, we show how one can turn a proof of an information inequality into an efficient algorithm to evaluate such queries. The algorithm's runtime is bounded by a generalized version of the submodular width of the query, which is optimal modulo complexity-theoretic assumptions.
The talk is based on joint works with Mahmoud Abo Khamis and Dan Suciu.
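A classic worked instance of this connection is the output size bound for the triangle query Q(x, y, z) :- R(x, y), S(y, z), T(z, x) (a standard textbook example used here for illustration, not taken from the talk): take h to be the entropy of the uniform distribution over the output, so that h(XYZ) = log |Q|, and apply Shannon-type inequalities:
```latex
% Submodularity and monotonicity give the Shannon-type inequality
%   2 h(XYZ) \le h(XY) + h(YZ) + h(ZX),
% and each marginal is supported on one input relation, so
%   h(XY) \le \log |R|, \quad h(YZ) \le \log |S|, \quad h(ZX) \le \log |T|.
% Combining the two yields the (tight) AGM bound for the triangle query:
\[
  \log |Q| \;\le\; \tfrac{1}{2}\left(\log |R| + \log |S| + \log |T|\right)
  \quad\Longrightarrow\quad
  |Q| \;\le\; \sqrt{|R|\,|S|\,|T|}.
\]
```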
I will talk about algorithmic and theoretical aspects of an adaptive clustering framework for content recommendation, based on exploration-exploitation strategies in multi-armed bandit scenarios. First, I'll give an introduction to centralized clustering methods in a standard stochastic noise setting for networked users on a graph. Next, in the decentralized clustering setting, I'll describe two efficient distributed algorithms for solving linear-regression-based problems in peer-to-peer networks with limited communication capabilities. Last, I'll establish upper bounds for generalized collaborative filtering applications combined with clustered models. In all cases, sound theoretical guarantees are provided, and extensive experiments on real-world data sets demonstrate significantly better performance compared to state-of-the-art approaches.
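For background, a minimal sketch of the exploration-exploitation trade-off that these methods build on, using the classic UCB1 index for a stochastic multi-armed bandit (standard textbook material, not the talk's clustering algorithms):
```python
import math, random

def ucb1(arm_means, horizon):
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                        # play every arm once first
        else:
            # index = empirical mean + confidence bonus (exploration term)
            arm = max(range(n_arms), key=lambda a:
                      sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward, counts

random.seed(0)
print(ucb1([0.2, 0.5, 0.8], horizon=2000))     # most pulls go to the best arm
```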
Dr Shuai Li is currently a postdoctoral researcher in the Engineering Design Centre of the Department of Engineering at the University of Cambridge. He has 7+ years of professional experience across Europe, North America, the Middle East, and Asia Pacific, and 19+ years of project experience in Computer Science and Information Engineering; he has also served as a CEO in industry and released several successful products. He has been engaged in internationally leading research in the analysis of complex, dynamic data at scale, and has brought cutting-edge data mining and machine learning to the heart of industrial-scale big data analytics and data science. In academic service, he has been involved in a number of prestigious international conferences and journals, including SIGIR, SIGKDD, ICDM, SDM, NIPS, UAI, AISTATS, WWW, WSDM, IJCAI, etc.
Many of today's popular computing applications require real-time analytics over large and dynamic datasets, from social web applications to online data warehousing, network monitoring, and algorithmic trading. These applications have long-lived analysis queries that require low-latency processing over rapidly changing datasets.
In this talk, I will present techniques for efficient incremental processing of complex analytical queries, ranging from classical SQL queries to linear algebra programs. Our system, called DBToaster, compiles declarative database queries into high-performance stream processing engines that keep query results (views) fresh at very high update rates. DBToaster uses a recursive query compilation algorithm that materializes a supporting set of higher-order delta views to achieve a substantially lower view maintenance cost. Our implementation supports batched processing in local and distributed environments and can deliver up to 5 orders of magnitude better performance than existing DBMSs and stream engines.
The LINVIEW system focuses on the incremental computation of iterative linear algebra programs that consist of the standard matrix operations. LINVIEW uses matrix factorization techniques to make the incremental computation of standard machine learning algorithms, like linear regression, practical and usually substantially cheaper than re-evaluation. LINVIEW generates parallel incremental programs that outperform re-evaluation techniques by more than an order of magnitude.
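To illustrate the flavor of delta-based view maintenance (a toy first-order example; DBToaster's contribution is taking deltas of deltas, i.e., higher-order views), consider keeping the size of a two-way join fresh under single-tuple inserts:
```python
from collections import defaultdict

class JoinCountView:
    # Maintains  V = |R join S on b|  with O(1) work per insert,
    # instead of recomputing the join on every update.
    def __init__(self):
        self.r_count = defaultdict(int)   # b -> number of R-tuples with key b
        self.s_count = defaultdict(int)   # b -> number of S-tuples with key b
        self.view = 0

    def insert_r(self, b):
        self.view += self.s_count[b]      # delta: new matches created
        self.r_count[b] += 1

    def insert_s(self, b):
        self.view += self.r_count[b]
        self.s_count[b] += 1

v = JoinCountView()
v.insert_r("x"); v.insert_s("x"); v.insert_s("x"); v.insert_r("y")
print(v.view)   # 2
```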
The talk presents the HANA column store, then focuses on three historical versions of the snapshot isolation implementation, presenting for each what worked well and why we evolved to the next one.
Mihnea Andrei
MS in computer science in 1988; the Bucharest Polytechnic Institute, Automatic Control and Computers engineering school; Prof. Cristian Giumale
DEA in Machine Learning in 1990; Universite Paris 6; Prof. Jean-Gabriel Ganascia
Joined Sybase in 1993; currently working at SAP, which acquired Sybase in 2010.
Worked on the core engine of several RDBMSs (Sybase ASE and IQ; SAP HANA): query optimization, Abstract Plans (optimizer hints), query compilation and execution, eager-lazy aggregation, shared-disk and shared-nothing scale-out. Current focus: database stores (in-memory and on-disk, row and column oriented), transaction processing, data lifecycle.
Large-scale internet operations such as Google, Facebook, and Amazon manage amazing amounts of data. Doing so requires databases that are distributed across multiple servers or even multiple data centers, with high throughput, stringent latency requirements, "five nines" of availability, and often strict data consistency requirements. This talk starts by introducing relational SQL databases, NoSQL databases, and the current state of the art in such databases as deployed in industry. It then provides an introduction to Google F1, a SQL database based on Google's Spanner distributed storage system. F1 is used to store the data for AdWords, Google's search advertising product, as well as several other major products in the Ads & Commerce area. F1 and Spanner represent a new, hybrid approach to distributed databases that combines the scalability and availability of NoSQL storage systems like Google's Bigtable and Amazon's DynamoDB with the convenience and consistency guarantees provided by traditional SQL relational databases.
Bart Samwel is a senior staff software engineer at Google. He is infamous at his alma mater, Leiden University, for completing his Master's degree with honors (and with several peer-reviewed publications on his resume), but more importantly for taking a full twelve years to do it. It is rumored that this discrepancy has caused him to be excluded as an outlier from the Computer Science department's graduation statistics. At Google, Bart is a core member of the F1 team, working mostly on the F1 SQL query engine.
We show how a variety of problems from databases, logic, probabilistic graphical model (PGM) inference, matrix computation, constraint satisfaction (CSP), etc. can be formulated as instances of the same problem, called the Functional Aggregate Query (FAQ) problem. Then, we explain how a pair of very simple algorithms called OutsideIn and InsideOut can be used in concert to solve the FAQ problem. OutsideIn is a slightly more general version of recent worst-case optimal join algorithms. InsideOut is a slightly more general version of the classic variable elimination algorithm.
As is the case with constraint programming and graphical model inference, to make InsideOut run efficiently we need to solve an optimization problem to compute an appropriate variable ordering. The main technical contribution of this work is a precise characterization of when a variable ordering is "semantically equivalent" to the variable ordering given by the input FAQ expression. Then, we design an approximation algorithm to find an equivalent variable ordering that has the best "fractional FAQ-width". Our results imply a host of known and a few new results in graphical model inference, matrix operations, relational joins, and logic.
This is joint work with Mahmoud Abo Khamis and Atri Rudra.
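Since InsideOut generalizes variable elimination, a minimal sketch of the classic algorithm may help fix intuitions: factors are functions over variables, and we repeatedly sum a variable out of the product of the factors that mention it (a toy Boolean-domain version, not the FAQ formalism itself):
```python
from itertools import product

def eliminate(factors, var, domain):
    # Factors are (variable-tuple, function) pairs. Split off those
    # mentioning `var`, sum it out of their product, and return the
    # untouched factors plus the one new factor this produces.
    touching = [(vs, f) for (vs, f) in factors if var in vs]
    rest = [(vs, f) for (vs, f) in factors if var not in vs]
    new_vars = sorted({v for (vs, _) in touching for v in vs} - {var})
    table = {}
    for assignment in product(domain, repeat=len(new_vars)):
        env = dict(zip(new_vars, assignment))
        total = 0
        for value in domain:
            env[var] = value
            prod_val = 1
            for vs, f in touching:
                prod_val *= f(*(env[v] for v in vs))
            total += prod_val
        table[assignment] = total
    return rest + [(tuple(new_vars), lambda *args: table[args])]

# Sum_{x,y,z} R(x,y) * S(y,z) over the Boolean domain, eliminating z, y, x.
R = (("x", "y"), lambda x, y: 1 if (x, y) in {(0, 1), (1, 1)} else 0)
S = (("y", "z"), lambda y, z: 1 if (y, z) in {(1, 0)} else 0)
factors = [R, S]
for v in ("z", "y", "x"):
    factors = eliminate(factors, v, domain=(0, 1))
print(factors[0][1]())   # 2 = number of (x,y,z) with R(x,y)*S(y,z) = 1
```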
Fix a full conjunctive query, and consider the following problem: what is the amount of communication required to compute the query in parallel, on p servers, over a large database instance? We define the Massively Parallel Communication (MPC) model, where the computation proceeds in rounds consisting of local computations followed by a global reshuffling of the data. Servers have unlimited computational power and are allowed to exchange any data; the only cost parameters are the number of rounds and the maximum amount of communication per server. I will describe tight bounds on the amount of communication for the case of a single round and data without skew, then discuss extensions to skewed data and to multiple rounds.
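For intuition, a minimal simulation of a single MPC round for a two-way join: tuples are routed by hashing on the join attribute, after which the join can be computed locally (illustrative only; the talk's bounds concern how much data each server must receive, especially under skew):
```python
def reshuffle(R, S, p):
    # One communication round: every tuple is sent to the server
    # determined by the hash of its join attribute y, so matching
    # tuples meet on exactly one server.
    servers = [{"R": [], "S": []} for _ in range(p)]
    for (x, y) in R:
        servers[hash(y) % p]["R"].append((x, y))
    for (y, z) in S:
        servers[hash(y) % p]["S"].append((y, z))
    return servers

def local_join(server):
    # Local computation after the round: no further communication needed.
    return [(x, y, z) for (x, y) in server["R"]
                      for (y2, z) in server["S"] if y == y2]

servers = reshuffle({(1, "a"), (2, "b")}, {("a", 7), ("b", 8)}, p=4)
answers = [t for s in servers for t in local_join(s)]
print(sorted(answers))   # [(1, 'a', 7), (2, 'b', 8)]
```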
We study knowledge compilation for Boolean formulas that are given as groundings of First Order formulas. This problem is motivated by probabilistic databases, where each record in the database is an independent probabilistic event, and the query is given by a SQL expression or, equivalently, a First Order formula. The query's probability can be computed in linear time in the size of the compiled representation, hence the interest in studying the size of such a representation. We consider the "data complexity" setting, where the query is fixed, and the input to the problem consists only of the database instance. We consider several compilation targets of increasing expressive power: OBDDs, FBDDs, and decision-DNNFs (a subclass of d-DNNFs). For the case of OBDDs we establish a dichotomy theorem for queries in the restricted languages FO(∃, ∧, ∨) and FO(∀, ∧, ∨): for each such query the OBDD is either linear in the size of the database or grows exponentially, and the complexity can be determined through a simple analysis of the query expression. For the other targets we describe a class of queries for which (a) the decision-DNNF is exponentially large in the size of the database, and (b) the probability of the query can be computed in polynomial time in the size of the database. This suggests that the compilation target decision-DNNF is too weak to capture all tractable cases of probabilistic inference. Our lower bound for decision-DNNFs relies on a translation into FBDDs, which is of independent interest.
Joint work with Paul Beame, Abhay Jha, Jerry Li, and Sudeepa Roy.