Publications
2017

Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning
Niccolò Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer
SIGMOD 2017 (to appear).

The Linearization of Belief Propagation on pairwise Markov Random Fields
Wolfgang Gatterbauer
AAAI 2017.
new [Long version (arXiv:1502.04956)], (Version Dec 2016) 
Algorithms for automatic ranking of participants and tasks in an anonymized contest
Yang Jiao, R. Ravi, Wolfgang Gatterbauer
WALCOM 2017.
new [Long version (arXiv:1612.04794)], (Version Dec 2016)
2016

Dissociation and propagation for approximate lifted inference with standard relational database management systems
Wolfgang Gatterbauer, Dan Suciu
VLDBJ (Special Issue of VLDB Journal on VLDB 2015). pp. 126, 2016.
new selection [paper (Springer)]
Extended 33 page online version with all proofs: [paper (arXiv:1310.6257)], (Version June 2016)
Extends and complements PVLDB 8(5):629640, 2015: [paper (VLDB)]
Project page: Propagation
This paper proposes an approach to uncertain query evaluation by which every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME selfjoinfree conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking functions from graphs to hypergraphs. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.

Semisupervised learning with heterophily
Wolfgang Gatterbauer
arXiv:1412.3100.
[working paper (arXiv:1412.3100)], (Version Dec 2016)
Project page: SSLH
We propose a novel linear semisupervised learning formulation that is derived from a solid probabilistic framework: belief propagation. We show that our formulation generalizes a number of label propagation algorithms described in the literature by allowing them to propagate generalized assumptions about influences between classes of neighboring nodes. We call this formulation SemiSupervised Learning with Heterophily (SSLH). We also show how the modularization matrix can be learned from observed data with a simple convex optimization framework that is inspired by locally linear embedding. We call this approach Linear Heterophily Estimation (LHE). Experiments on synthetic data show that both approaches combined can learn heterophily of agraph with 1M nodes and 10M edges in under 1min.
2015

The complexity of resilience and responsibility for selfjoinfree conjunctive queries
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou
PVLDB 9(3):180191, 2015.
selection [paper (VLDB)], [bib]
Full 36 page version with all proofs (arXiv:1507.00674): [paper (arXiv:1507.00674)], (Version July 2015)
Project page: Causality
Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false. We refer to this problem as "the resilience of a query" and show its connections to the wellstudied problems of deletion propagation and causal responsibility. In this paper, we study the complexity of resilience for selfjoinfree conjunctive queries, and also make several contributions to previous known results for the problems of deletion propagation with source sideeffects and causal responsibility: (1) We define the notion of resilience and provide a complete dichotomy for the class of selfjoinfree conjunctive queries with arbitrary functional dependencies; this dichotomy also extends and generalizes previous tractability results on deletion propagation with source sideeffects. (2) We formalize the connection between resilience and causal responsibility, and show that resilience has a larger class of tractable queries than responsibility. (3) We identify a mistake in a previous dichotomy for the problem of causal responsibility and offer a revised characterization based on new, simpler, and more intuitive notions. (4) Finally, we extend the dichotomy for causal responsibility in two ways: (a) we treat cases where the input tables contain functional dependencies, and (b) we compute responsibility for a set of tuples specified via wildcards.

Linearized and singlepass belief propagation
Wolfgang Gatterbauer, Stephan Günnemann, Danai Koutra, Christos Faloutsos
PVLDB 8(5):581592, 2015.
selection [paper (VLDB)], [slides (2MB)], [narrated slides (32MB), annotations only work on Windows], [video (21min)], [Python code on Github], [SQL code on Github], [bib]
Full 18 page version with all proofs (arXiv:1406.7288): [paper (arXiv:1406.7288)], (Version Oct 2014)
Project page: SSLH
How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily ("birds of a feather flock together") or heterophily ("opposites attract"). One of the most widely used methods for this kind of inference is Belief Propagation (BP) which iteratively propagates the information from a few nodes with explicit labels throughout a network until convergence. A wellknown problem with BP, however, is that there are no known exact guarantees of convergence in graphs with loops. This paper introduces Linearized Belief Propagation (LinBP), a linearization of BP that allows a closedform solution via intuitive matrix equations and, thus, comes with exact convergence guarantees. It handles homophily, heterophily, and more general cases that arise in multiclass settings. Plus, it allows a compact implementation in SQL. The paper also introduces Singlepass Belief Propagation (SBP), a localized (or "myopic") version of LinBP that propagates information across every edge at most once and for which the final class assignments depend only on the nearest labeled neighbors. In addition, SBP allows fast incremental updates in dynamic networks. Our runtime experiments show that LinBP and SBP are orders of magnitude faster than standard BP, while leading to almost identical node labels.

Approximate lifted inference with probabilistic databases
Wolfgang Gatterbauer, Dan Suciu
PVLDB 8(5):629640, 2015.
selection [paper (VLDB)], [paper (arXiv:1412.1069)], [slides (4MB)], [bib]
Invited to the Special Issue of VLDB Journal on VLDB 2015.
Project page: Propagation
This paper proposes a new approach for approximate evaluation of #Phard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results of PTIME selfjoinfree conjunctive queries: A query is safe if and only if our algorithm returns one single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over nonprobabilistic methods for ranking query answers.

FaultTolerant Entity Resolution with the Crowd
Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann
arXiv:1512.00537.
[working paper (arXiv:1512.00537)], (Version Dec 2016)
2014

Oblivious bounds on the probability of Boolean functions
Wolfgang Gatterbauer, Dan Suciu
ACM TODS, vol. 39, no. 1, pp. 191208, 2014.
selection [paper (ACM)], [paper (local preprint)], [paper (arXiv:1409.6052)], [SQL code], [bib]
Prior superseded working paper "Optimal Upper and Lower Bounds for Boolean Expressions by Dissociation": arXiv:1105.2813
Project page: Propagation
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #Phard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxationbased and modelbased approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
2013

Counterexamples to commonly held assumptions on unit commitment and market power assessment
Wolfgang Gatterbauer, Marija Ilic
Chapter 10 of: Engineering ITEnabled Sustainable Electricity Services, Marija Ilic, Le Xie, Qizing Liu (Eds.), Springer, 2013.
[chapter]
Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns shortterm supply optimization for a given demand, and does not consider longterm investment issues. In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally selfscheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.
2012

Querying provenance for ranking and recommending
Zachary Ives, Andreas Haeberlen, Tao Feng, Wolfgang Gatterbauer.
TaPP 2012.
[paper]
2011

Defaultall is dangerous!
Wolfgang Gatterbauer, Alexandra Meliou, Dan Suciu
TaPP 2011.
selection [paper], [paper (arXiv:1105.4395)], [slides], [slides], [bib]
Project pages: BeliefDB, Causality
This paper shows that the "defaultall propagation" scheme for database annotations can have some problematic semantic consequences. We propose an alternative "minimumpropagation" provenance that fixes the issue and that comes with several desirable properties.

Rules of thumb for information acquisition from large and redundant data
Wolfgang Gatterbauer
ECIR 2011, pp. 479490.
selection [paper], [slides], [slides], [bib],
Full 40 page version with all proofs (arXiv:1012.3502): [paper (arXiv:1012.3502)], [bib], (Version Dec 2010)
Project page: Unique recall
Assume you crawl 20% of the Web. Are you able to learn 80% of the available information? This paper develops an analytic model (and uses generally accepted assumptions of power laws distributions in data) to show that we can expect to learn less then 40% of the Web's content, hence the 8020 rules does not hold. The paper further describes a new family of power law distribution which remains invariant under sampling, i.e. randomly sampling from this distribution will lead again to the original distribution in the sample.

QueryViz: Helping users understand SQL queries and their patterns
Jonathan Danaparamita, Wolfgang Gatterbauer
EDBT 2011, pp. 558561. (System demonstration)
selection [demonstration paper], [demonstration paper (ACM version)], [bib]
Project page: QueryViz

Tracing data errors with viewconditioned causality
Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, Dan Suciu
SIGMOD 2011, pp. 505516.
selection [paper], [paper (ACM version)], [bib]
Project page: Causality
This paper shows how causal reasoning can be used for postfactum data cleaning, where errors are detected after data has been transformed (e.g. by a query) and which need to be corrected in the original input data. We achieve this with a novel way for translating the problem into a SAT and a weighted MAXSAT problem, and then using very effective existing tools.

Databases will visualize queries too
Wolfgang Gatterbauer
PVLDB 4(12):14981501, 2011. (VLDB challenges and visions track)
[paper], [narrated slides (16MB)], [video (19min)], [bib]
Project page: QueryViz
This paper describes a humanquery interaction pattern in which users reuse existing queries as templates to compose their own queries. This interaction mode is only made possible with new visualization tools which help users understand SQL patterns and the intent of existing SQL queries quickly. QueryViz is our new visualization approach.

Managing structured collections of community data
Wolfgang Gatterbauer, Dan Suciu
CIDR 2011, pp. 207210. (Outrageous ideas and vision track)
[vision paper], [slides], [slides], [bib]
Project page: BeliefDB

Reverse data management
Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu
PVLDB 4(12):14901493, 2011. (VLDB challenges and visions track)
[paper], [bib]
Project page: Causality

Bringing provenance to its full potential using causal reasoning
Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu
TaPP 2011.
[paper], [bib]
Project page: Causality

Sessionbased browsing for better query reuse
Nodira Khoussainova, Yongchul Kwon, WeiTing Liao, Magdalena Balazinska, Wolfgang Gatterbauer, Dan Suciu
SSDBM 2011, pp. 583585. (Poster)
[poster paper]
Full 10 page version (UW CSE TR 110402): [technical report]
Project page: CQMS
2010

Data conflict resolution using trust mappings
Wolfgang Gatterbauer, Dan Suciu
SIGMOD 2010, pp. 219230.
selection [paper], [slides], [slides], [poster], [bib]
Full 20 page version with all proofs (arXiv:1012.3320), extended UW CSE TR 091101: [paper], [bib], (Version Dec 2010)
Project page: BeliefDB

The complexity of causality and responsibility for query answers and nonanswers
Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu
PVLDB 4(1):3445, 2010.
selection [paper], [bib]
Full 15 page version with all proofs (arXiv:1009.2021): [paper], [bib], (Version Sept 2010)
Project page: Causality

Dissociation and propagation for efficient query evaluation over probabilistic databases
Wolfgang Gatterbauer, Abhay K. Jha, Dan Suciu
MUD 2010, pp. 8397.
[paper], [bib]
Full 29 page version with all proofs (arXiv:1310.6257): [paper], (Version Aug 2014)
Project page: Propagation

Causality in databases
Alexandra Meliou, Wolfgang Gatterbauer, Joseph Halpern, Christoph Koch, Katherine F. Moore, Dan Suciu
IEEE Data Engineering Bulletin, Special issue on Data Provenance, 33(3):5967, Sept. 2010.
[paper], [bib]
Project page: Causality

Why so? or Why no? Functional causality for explaining query answers
Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu
MUD 2010, pp. 317.
[paper], [bib]
Full 18 page version with all proofs (arXiv:0912.5340), also UW CSE TR 091201: [paper], [bib], (Version Dec 2009)
Project page: Causality

Visual structurebased web page clustering and retrieval
Paul Bohunsky, Wolfgang Gatterbauer
WWW 2010, pp. 10671068. (Poster)
[poster paper], [poster paper (ACM version)], [poster], [bib]
Former project page (not maintained anymore): VENTex
Online demo: VisBox
Example DIV table: TABLE vs DIV table
2009

Believe it or not: Adding belief annotations to databases
Wolfgang Gatterbauer, Magda Balazinska, Nodira Khoussainova, Dan Suciu
PVLDB 2(1):112, 2009.
selection [paper], [slides], [slides], [bib]
Full 17 page version with all proofs (arXiv:0912.5241, extended UW CSE TR 081201): [paper (arXiv:0912.5241)], [bib], (Version Sept 2009)
Project page: BeliefDB

Integrating and ranking uncertain scientific data
Landon Detwiler, Wolfgang Gatterbauer, Brent Louie, Dan Suciu, Peter TarczyHornoch
ICDE 2009, pp. 12351238. (Short paper)
selection [short paper], [poster], [bib]
Full 26 page version with all details (UW CSE TR 080603): [paper], (Version Sept 2008)
[slides], (Jan 2010)
Project pages: BioRank/U2/Biomediator (not maintained anymore), Propagation

A Case for a collaborative query management system
Nodira Khoussainova, Magdalena Balazinska, Wolfgang Gatterbauer, YongChul Kwon, Dan Suciu
CIDR 2009. (Perspectives Track)
[paper], [paper (arXiv:0909.1778)], [bib]
Project web page: CQMS

Web data extraction system
Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob
Encyclopedia of Database Systems, Part 23, pp. 34653471, Springer, 2009.
[article] [bib]

Web harvesting
Wolfgang Gatterbauer
Encyclopedia of Database Systems, Part 23, pp. 34723473, Springer, 2009.
[article] [bib]
2007
 Towards domainindependent information extraction from web tables
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak
WWW 2007, pp. 7180.
selection [paper], [paper (ACM version)], [bib], [Citations],
Copy of former Project page: VENTex
Patent 8,719,291  Creating permanent test collections of web pages for information extraction research
Bernhard Pollak, Wolfgang Gatterbauer
SOFSEM 2007, Volume II, pp. 103115.
[paper], [slides], [poster], [bib]
Project pages: WebPageDump, VENTex
WebPageDump is a Firefox extension which allows you to save local copies of pages from the Web. It sounds simple, but it's not. The standard "Save page as" function of web browsers fails with most dynamic web pages and this shortcoming was a serious problem for our research. Comes the birth of WebPageDump. We hope you find it useful too.
2006
 Table extraction using spatial reasoning on the CSS2 visual box model
Wolfgang Gatterbauer, Paul Bohunsky
AAAI 2006, pp. 13131318.
selection [paper], [paper (AAAI version)], [slides], [bib]
Copy of former Project page: VENTrec
 Estimating required recall for successful knowledge acquisition from the Web
Wolfgang Gatterbauer
WWW 2006, pp. 969970. (Poster)
[poster paper], [poster], [poster 1+2], [bib]
Project page: Unique recall
2005

Using visual cues for extraction of tabular data from arbitrary HTML documents
Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer.
WWW 2005, pp. 10001001. (Poster)
[poster paper], [bib]

Web information extraction using eupeptic data in Web tables
Wolfgang Gatterbauer, Bernhard Krüpl, Wolfgang Holzinger, Marcus Herzog.
RAWS 2005 (1st International Workshop on Representation and Analysis of Web Space), pp. 4148.
[paper], [bib]
2002 (Electrical Engineering)

Counterexamples to commonly held assumptions on unit commitment and market power assessment
Wolfgang Gatterbauer, Marija Ilic
NAPS 2002 (34th Annual North American Power Symposium), pp. 219225.
[paper], [slides], [bib]

Interdependencies of electricity market characteristics and bidding strategies of power producers
Wolfgang Gatterbauer
Master's thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, May 2002.
[thesis], [bib]
Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns shortterm supply optimization for a given demand, and does not consider longterm investment issues.
In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally selfscheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.
Finally, by taking statistic uncertainties and intertemporal interdependencies into account, we show in the third part that a generator owner's optimum bid sequence for a centralized auction market is generally above marginal cost. In contrast to current literature, this is true even where absolutely no abuse of market power is involved. We conclude that marginal production costs cannot be used by themselves as baseline for the assessment of market power in electricity markets.
2000 (Mechanical Engineering)

Combining the Graz Cycle with coal and heavy oil gasification for industrial power stations
Original title in German: Der Graz Cycle für Industriekraftwerke gefeuert mit Brenngasen aus Kohle und Schwerölvergasung
Herbert Jericha, Armin Lukasser, Wolfgang Gatterbauer
VDIBerichte Nr. 1566, Gas Turbines for Combined Cycles, pp. 177185, Essen, Germany, September 2000.
[paper], [bib]
Project page: Graz Cycle