Examining Statistics of Workflow Evolution Provenance: A First Study

L. Lins, D. Koop, E. W. Anderson, S. P. Callahan, E. Santos, C. E. Scheidegger, J. Freire, and C. T. Silva. Examining statistics of workflow evolution provenance: A first study. In B. Ludäscher and N. Mamoulis, editors, Scientific and Statistical Database Management, volume 5069 of Lecture Notes in Computer Science, pages 573–579. Springer Berlin Heidelberg, 2008.


Abstract

Provenance (also referred to as audit trail, lineage, and pedigree) captures information about the steps used to generate a given data product. Such information provides documentation that is key to determining data quality and authorship, and necessary for preserving, reproducing, sharing and publishing the data. Workflow design, in particular for exploratory tasks (e.g., creating a visualization, mining a data set), requires an involved, trial-and-error process. To solve a problem, a user has to iteratively refine a workflow to experiment with different techniques and try different parameter values, as she formulates and tests hypotheses. The maintenance of detailed provenance (or history) of this process has many benefits that go beyond documentation and result reproducibility. Notably, it supports several operations that facilitate exploration, including the ability to return to a previous workflow version in an intuitive way, to undo bad changes, to compare different workflows, and to be reminded of the actions that led to a particular result.

As provenance-enabled systems are deployed, and increasing volumes of provenance information are collected, there is a unique opportunity to leverage and obtain useful knowledge from this data. In this paper, we take a first step toward analyzing this data. We present a preliminary analysis of workflow evolution provenance generated by thirty subjects who worked on six distinct exploratory tasks over a period of four months. This initial analysis shows that useful statistics can be extracted from this data, and that these statistics provide insights into how different people interact with workflow systems to solve problems.
