ProvenanceWeek 2023

ProvenanceWeek will take place on April 30-May 1. Following successful previous ProvenanceWeek events, this year’s instalment will again co-locate the IPAW and TaPP workshops in conjunction with WebConf 2023. IPAW and TaPP build on a successful history of provenance workshops that bring together researchers from a wide range of computer science fields, including workflows, semantic web, databases, high performance computing, distributed systems, operating systems, programming languages, and software engineering, as well as researchers from other fields with urgent provenance needs, such as biology and physics.

Provenance is increasingly important in data science, workflow systems, and many other areas, particularly to support transparency, accountability and explanations. By providing a record of the data creation process and of dependencies between data, provenance information is essential for tracing errors in transformed data back to erroneous inputs, access control, auditing, repeatability and reproducibility, evaluating data quality, and establishing ownership of data.

Happy Birthday W3C PROV! The W3C PROV standard is 10 years old. As a part of ProvenanceWeek 2023, we will be celebrating the standard's creation and impact. There will be cake.

All speakers and presenters participating in any way are expected to attend the conference in person. If, for exceptional reasons, you are unable to present in person, you may assign a proxy, who must attend in person.

Workshop Program

AT&T Hotel and Conference Center Zlotnik Ballroom 5

Sunday, April 30

9:15-9:30 am: Welcome and Introduction.

9:30-10:30 am: Keynote by Vanessa Braganholo

Title: Myths and Truths about noWorkflow

Chair: Yuval Moskovitch

Abstract: Scientific experiments are frequently written as scripts. While scripts are powerful and convenient, they lack provenance support. With that problem in mind, we designed noWorkflow aiming at transparently and automatically capturing provenance of Python scripts. In its 10 years of existence, it has been used for different purposes. In this talk, we provide an overview of noWorkflow’s evolution throughout the years and analyze myths and truths about it. To do that, we go over the 150+ papers that cite noWorkflow and group their main claims, classifying them as myths or truths.

Bio: Vanessa Braganholo has been an Associate Professor at the Instituto de Computação of Universidade Federal Fluminense (UFF), Brazil, since 2014. Before that, she worked as an Assistant Professor at Universidade Federal do Rio de Janeiro (UFRJ). She holds a Ph.D. degree (2004) in Computer Science from Universidade Federal do Rio Grande do Sul (UFRGS). She received the Best PhD Thesis Award of the Brazilian Computer Society in 2005. She has published over 120 papers in journals and conferences and acted as the PC Chair of IPAW in 2020 and 2021. Her research area is databases, and her current research interests include provenance.

10:30-10:50 am Break

10:50 am-12:30 pm Session 1

Chair: Vanessa Braganholo

10:50-11:20 am: Chenjie Li, Boris Glavic, Sudeepa Roy and Zhengjie Miao, CaJaDE: Explaining Query Results by Augmenting Provenance with Context
11:20-11:50 am: Thomas Delva, Anastasia Dimou, Maxime Jakubowski and Jan Van den Bussche, Data Provenance for SHACL
11:50-12:05 pm: Sara Moshtaghi and Seokki Lee, Efficiently Sampling Big Provenance
12:05-12:20 pm: Kerstin Gierend, Judith A.H. Wodke, Sascha Genehr, Robert Gött, Ron Henkel, Frank Krüger, Markus Mandalka, Lea Michaelis, Alexander Scheuerlein, Max Schröder, Atinkut Zeleke and Dagmar Waltemath, Defining standard provenance information for clinical research data and workflows - Obstacles and opportunities
12:20-12:30 pm: Stefan Grafberger, Paul Groth and Sebastian Schelter, Provenance Tracking for End-to-End Machine Learning Pipelines

12:30-1:30 pm Lunch break

1:30-2:30 pm: Keynote by Boris Glavic

Title: Provenance, Relevance-based Data Management, and the Value of Data

Chair: Tanu Malik

Abstract: In this talk, I introduce the audience to relevance-based data management (RBDM), a new paradigm where information about what data is "relevant" for producing a result of a computation, i.e., the data provenance of the computation, is used to improve management of the data, e.g., by restricting computations to what data is relevant or by making caching decisions based on relevance. I will present provenance-based data skipping (PBDS) as one example of relevance-based data management. In PBDS, a system maintains lightweight sketches of the provenance of queries based on a (virtual) horizontal partitioning of tables, which are captured at query runtime. Such provenance sketches are then utilized to speed up the evaluation of future queries by translating them into selection conditions that can utilize the existing physical design of a database. Additionally, I will discuss preliminary ideas on the use of relevance as an objective metric for the value of data.

Bio: Boris Glavic (http://www.cs.iit.edu/~dbgroup/members/bglavic.html) is an Associate Professor in the Department of Computer Science at Illinois Institute of Technology leading the IIT DBGroup. His research spans several areas of database systems and data science including data provenance, explanations, data integration, query execution and optimization, uncertain data, and data curation. Boris strives to build systems that are based on solid theoretical foundations.

2:30-3:00 pm Town Hall

Monday, May 1

9:30-10:30 am: Keynote by Sudeepa Roy

Title: Provenance Semirings: Beautiful Theory and Applications in Query Debugging for Database Education

Chair: Paul Groth

Abstract: Database provenance tracks the relationships between input tuples and outputs of a relational query that transforms the source data into a desired view. In a seminal work by Green, Karvounarakis, and Tannen (PODS 2007), Provenance Semirings were proposed as a formal and general approach to track data provenance as annotations to tuples produced in different stages of the transformation in a query, which was later extended to aggregate operations by Amsterdamer, Deutch, and Tannen (PODS 2011). In this talk, I will give an overview of provenance semirings and describe how the powerful concept of provenance semirings has been used in our work on building tools for helping new programmers and students learn, debug, and trace relational queries. In our work, we defined and studied the complexity of the “small counterexample” problem for query debugging, built tools for relational algebra and SQL efficiently tracking provenance with user interfaces for easy exploration and understanding, and studied the effectiveness of these tools in database education for helping students debug queries in classroom settings. The "RATest" tool for debugging relational algebra has been successfully used by more than 1000 students in our undergraduate and graduate classes in the last few years.

This is joint work with Zhengjie Miao, Jun Yang, Yihao Hu, Amir Gilad, Kristin Stephens-Martinez, and many other graduate and undergraduate students in the HNRQ project.

Bio: Sudeepa Roy is an Associate Professor in Computer Science at Duke University. She works broadly in data management, with a focus on foundational aspects of big data analysis, which includes causality and explanations for big data, data provenance, query optimization, data repair, probabilistic databases, database theory, and applications of database techniques in other domains. Before she joined Duke in 2015, she did a postdoc at the University of Washington, and obtained her Ph.D. from the University of Pennsylvania. She is a recipient of the VLDB Early Career Research Contributions Award, an NSF CAREER Award, and a Google Ph.D. fellowship in structured data. She co-directs the Almost Matching Exactly (AME) lab for interpretable causal inference at Duke (https://almost-matching-exactly.github.io/).

10:30-10:50 am Break

10:50 am-12:30 pm Session 2

Chair: Sudeepa Roy

10:50-11:20 am: Adriane Chapman, Paolo Missier, Giulia Simonelli and Riccardo Torlone, Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science
11:20-11:50 am: Stefan Grafberger, Paul Groth, Julia Stoyanovich and Sebastian Schelter, Data Distribution Debugging in Machine Learning Pipelines
11:50-12:05 pm: Seokki Lee, Boris Glavic, Adriane Chapman and Bertram Ludäscher, Hybrid Query and Instance Explanations and Repairs
12:05-12:20 pm: Aniket Modi, Moaz Reyad, Tanu Malik and Ashish Gehani, Querying Container Provenance
12:20-12:30 pm: Tanja Auge, A provenance system for reproducing query results

12:30-1:30 pm Lunch break

1:30-3:00 pm Session 3

Chair: Boris Glavic

Tanja Auge, Gunnar Bali, Meike Klettke, Bertram Ludäscher, Wolfgang Söldner, Simon Weishäupl and Tilo Wettig, Provenance for Lattice QCD workflows
Luc Moreau, Nicola Hogan and Nick O'Donnell, Implementing an Environmental Management System Using Provenance-By-Design
Débora Pina, Adriane Chapman, Daniel de Oliveira and Marta Mattoso, Deep Learning Provenance Data Integration: a Practical Approach
Nikolaus Nova Parulian and Bertram Ludäscher, Trust the Process: Analyzing Prospective Provenance for Data Cleaning

3:00-3:30 pm Break

3:30-4:30 pm: W3C PROV 10 years celebration Panel

Moderator: Paul Groth
Panelists: Bryon Jacob (data.world), Deborah McGuinness (Rensselaer Polytechnic Institute), Paolo Missier (Newcastle University), Luc Moreau (King's College London)

Organizers

Chair

Yuval Moskovitch

Co-Chair

Daniel Deutch

IPAW SC Chair

Paul Groth

TaPP SC Chair

Adriane Chapman

Program Committee

Khalid Belhajjame, PSL, Université Paris-Dauphine, LAMSADE
Ghita Berrada, King's College London
Elisa Bertino, Purdue University
Pierre Bourhis, CNRS CRIStAL
Vanessa Braganholo, UFF
Vasa Curcin, King's College London
Daniel de Oliveira, Fluminense Federal University
David Eyers, University of Otago
Irini Fundulaki, ICS-FORTH
Amir Gilad, Duke University
Boris Glavic, Illinois Institute of Technology
Paul Groth, University of Amsterdam
Xueyuan Han, Wake Forest University
Melanie Herschel, University of Stuttgart
Matteo Interlandi, Microsoft
David Koop, Northern Illinois University
Seokki Lee, University of Cincinnati
Bertram Ludäscher, University of Illinois at Urbana-Champaign
Tanu Malik, DePaul University
Marta Mattoso, COPPE - Federal Univ. Rio de Janeiro
Paolo Missier, Newcastle University
Tope Omitola, University of Southampton
Liat Peterfreund, CNRS
Sudeepa Roy, Duke University
Pierre Senellart, DI, École normale supérieure, PSL Research University
Roee Shraga, Northeastern University
Gianmaria Silvello, University of Padua
Xiao Yu, Stellar Cyber

Call for Papers

We invite innovative and creative contributions, including papers outlining new formal approaches to provenance, innovative use of provenance, experience-based insights, and visionary ideas.

Topics of interest include, but are not limited to:

Provenance visualization, and human interaction with provenance
Provenance for big data and extreme computing
Provenance for attribution and trust
Provenance for transparency and accountability
Security and privacy implications of provenance
Provenance, social media, and the semantic web
Provenance analytics, discovery, and reasoning about provenance and its quality
Data sharing and data citation
Provenance of workflows and annotations
Standardization of provenance models, services, and representations
Provenance management system prototypes and commercial solutions
Applications of provenance in real-life settings
Theoretical foundations of provenance
Connections between provenance and established topics in other research fields (programming languages, security, software engineering, fairness, etc.)
Provenance-based audit and forensics
Design, performance and scalability of provenance systems

Location and Dates:

ProvenanceWeek will be held in conjunction with WebConf 2023 in Austin, Texas, on Sunday, April 30, and Monday, May 1, 2023.

Important Dates:

Papers Due: February 6, 2023
Notification of Acceptance: March 6, 2023
Camera-ready version due: March 15, 2023

All deadlines are end-of-day in the Anywhere on Earth (AoE) time zone.

Submission Site:

Submission is via EasyChair, using the following link: https://easychair.org/conferences/?conf=thewebconf2023iwpd

Formatting the submissions:

Papers must be submitted in PDF format according to the ACM template published in the ACM guidelines, selecting the generic “sigconf” sample. The PDF files must have all non-standard fonts embedded. Workshop papers must be self-contained and in English. Papers should not exceed 12 pages in length (maximum 8 pages for the main paper content + maximum 2 pages for appendixes + maximum 2 pages for references).

Authors should indicate the track in the title (IPAW, TAPP, DEMO, POSTER or BEST).

Tracks

Papers can be submitted to one of the following tracks:

IPAW:
Authors are invited to submit original research work to the IPAW track. This track solicits full research papers that describe mature, high-quality research on the topics of interest of the ProvenanceWeek. The title of submissions to IPAW must begin with "IPAW:".
TaPP:
We invite innovative and creative contributions, including papers outlining new challenges for provenance research, promising formal approaches to provenance, experiments, and visionary (and possibly risky) ideas. The title of submissions to TaPP must begin with "TAPP:".

In addition to regular research papers, we also encourage submissions of the following flavors:
- Short papers (up to 4 pages, including references) that would typically include work in progress or visionary ideas that are not yet fully developed, but have the potential to lead to cutting-edge research.
- Application papers (up to 6 pages, including references) focusing on innovative applications and uses of provenance.

“Best of the Rest”
As ProvenanceWeek is meant to bring together the provenance community, which is now established and publishing in conferences and journals across computational domains, the Theory and Practice of Provenance (TaPP) workshop is forming a “Best of the Rest” category. If you have had a provenance paper accepted to a conference or journal between 2022 and February 2023, we invite you to submit the abstract to TaPP. We will invite the “Best of the Rest” to present their work at TaPP to widen the conversation.

Submit the abstract and original publication venue and date of the previously published work. The title of the abstract must begin with "BEST:".

We also encourage the presentation of ongoing work as posters or demonstrations. Proposals for posters or demonstrations should be formatted and submitted as described above, with the following additional restrictions:

Demonstrations:
Using no more than 4 pages, describe the context and highlights of the proposed demonstration, including a brief description of the demonstration scenario. The title of the proposal must begin with "DEMO:".

Posters:
Submit a 1-page abstract of the poster. The title of the abstract must begin with "POSTER:".