---
title: "Overview of provTraceR"
date: "26 July 2020"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## provTraceR

The provTraceR package displays information about files used or created by an R script or a series of R scripts. The package uses provenance collected by [rdtLite](https://CRAN.R-project.org/package=rdtLite) and stored in [prov-json format](https://github.com/End-to-end-provenance/ExtendedProvJson/blob/master/JSON-format.md). Output from provTraceR can be used to help manage files and to identify the input files needed to reproduce an analysis.

## Usage

This package includes two functions:

1. To use existing provenance to trace file lineage:

```
prov.trace(scripts, prov.dir=NULL, file.details=FALSE, console=TRUE, save=FALSE, save.dir=NULL, check=TRUE)
```

2. To run one or more scripts, collect provenance, and trace file lineage:

```
prov.trace.run(scripts, prov.dir=NULL, file.details=FALSE, console=TRUE, save=FALSE, save.dir=NULL, check=TRUE, prov.tool="rdtLite", details=FALSE, ...)
```

The scripts parameter may contain a single script name, a vector of script names, or a text file (with extension .txt) of script names.

For prov.trace only: If more than one script is specified, the order of the scripts must match the order of execution as recorded in the provenance; otherwise an error message is displayed. For console sessions, set scripts = "console".

For prov.trace.run only: The provenance collection tool specified by prov.tool must be "rdtLite" or "rdt". If details = TRUE, fine-grained provenance is collected. Other optional parameters (...) are passed to rdtLite or rdt. Scripts are executed in the order listed.

It is assumed that provenance for each script is stored under a single provenance directory set by the prov.dir option. If not, the provenance directory may be specified with the prov.dir parameter.
Timestamped provenance and provenance in scattered locations are not currently supported.

Files are matched by hash value. INPUTS lists files that are required to run the script or scripts. These are files read by a script that were neither written by an earlier script nor previously written by the same script. OUTPUTS lists files written by the script or scripts. EXCHANGES lists files with the same hash value that were written by one script and read by a later script; if the location changed, both locations are listed.

If file.details = TRUE, additional details are displayed, including script execution timestamps, file timestamps, file hash values, and saved file names.

Results of both functions are returned as a string. If console = TRUE (the default), results are displayed in the console. If save = TRUE, results are saved to the file prov-trace.txt.

The save.dir parameter determines where the results file is saved. If NULL (the default), the R session temporary directory is used. If a period (.), the current working directory is used. Otherwise the directory specified by save.dir is used.

If check = TRUE (the default), each file recorded in the provenance is checked against the user's file system. A dash (-) in the output indicates that the file no longer exists, a plus (+) indicates that the file exists but the hash value has changed, and a colon (:) indicates that the file exists and the hash value is unchanged. If check = FALSE, no comparison is made.

## Example

In this example, three R scripts are used to gap fill, harmonize, and combine data from two meteorological stations to create a single dataset. The script names are contained in the file "update-hf300.txt".

In the first case, the prov.trace.run function is used to run the scripts, collect provenance, and display summary file information.

```
prov.trace.run("update-hf300.txt")
```

Console output (below) shows the save message for each script from rdtLite followed by output from prov.trace.run.
Scripts are numbered in the order of execution. Each line shows the script number, a symbol indicating whether the file has changed since provenance was collected, and the file path and name.

```
[1] "Saving prov.json in C:/Prov/prov_gap-fill-shaler"
[1] "Saving prov.json in C:/Prov/prov_combine-shaler-fisher"
[1] "Saving prov.json in C:/Prov/prov_calculate-hf-annual-monthly"

SCRIPTS:

1 : C:/TraceR/gap-fill-shaler.R
2 : C:/TraceR/combine-shaler-fisher.R
3 : C:/TraceR/calculate-hf-annual-monthly.R

INPUTS:

1 : C:/TraceR/amherst-ma-1964-2002.csv
1 : C:/TraceR/bedford-ma-1964-2002.csv
1 : C:/TraceR/hf000-02-daily-e.csv
2 : C:/TraceR/hf001-06-daily-m.csv
2 : C:/TraceR/hf001-08-hourly-m.csv

OUTPUTS:

1 : C:/TraceR/hf-shaler-gap-filled.csv
2 : C:/TraceR/hf-shaler-fisher-overlap.csv
2 : C:/TraceR/hf300-05-daily-m.csv
2 : C:/TraceR/hf300-06-daily-e.csv
3 : C:/TraceR/hf300-01-annual-m.csv
3 : C:/TraceR/hf300-02-annual-e.csv
3 : C:/TraceR/hf300-03-monthly-m.csv
3 : C:/TraceR/hf300-04-monthly-e.csv

EXCHANGES:

1 > 2 : C:/TraceR/hf-shaler-gap-filled.csv
2 > 3 : C:/TraceR/hf300-05-daily-m.csv
```

In the second case, the prov.trace function is used to display detailed file information contained in the provenance without running the scripts.

```
prov.trace("update-hf300.txt", file.details=TRUE)
```

For each file, the console output (below) shows the file timestamp, the file hash value and algorithm, and the path and name of the saved copy of the file in the provenance directory. For scripts, the execution timestamp is also shown.
```
SCRIPTS:

1 : C:/TraceR/gap-fill-shaler.R
    Timestamp: 2019-10-19T09.42.45EDT
    Hash: 9ab73da3681ae9cbe85efb912550e432 / md5
    Saved: C:/Prov/prov_gap-fill-shaler/scripts/gap-fill-shaler.R
    Executed: 2020-07-08T10.21.30EDT
2 : C:/TraceR/combine-shaler-fisher.R
    Timestamp: 2019-10-19T09.41.59EDT
    Hash: 848a20e2696b1fb7c9bdeec27df059f5 / md5
    Saved: C:/Prov/prov_combine-shaler-fisher/scripts/combine-shaler-fisher.R
    Executed: 2020-07-08T10.21.35EDT
3 : C:/TraceR/calculate-hf-annual-monthly.R
    Timestamp: 2019-10-19T10.16.12EDT
    Hash: 213661ba5f7e4de68d2205c9fe8c0922 / md5
    Saved: C:/Prov/prov_calculate-hf-annual-monthly/scripts/calculate-hf-annual-monthly.R
    Executed: 2020-07-08T10.21.41EDT

INPUTS:

1 : C:/TraceR/amherst-ma-1964-2002.csv
    Timestamp: 2019-10-16T10.51.53EDT
    Hash: 06c82be1ceeec8f41216ee670f485d77 / md5
    Saved: C:/Prov/prov_gap-fill-shaler/data/2-amherst-ma-1964-2002.csv
1 : C:/TraceR/bedford-ma-1964-2002.csv
    Timestamp: 2019-10-17T10.43.55EDT
    Hash: d7f8e08fd84f4b75941325cd82ca7768 / md5
    Saved: C:/Prov/prov_gap-fill-shaler/data/3-bedford-ma-1964-2002.csv
1 : C:/TraceR/hf000-02-daily-e.csv
    Timestamp: 2019-10-16T10.37.42EDT
    Hash: e9f67f7074eb68059385c683d0410c01 / md5
    Saved: C:/Prov/prov_gap-fill-shaler/data/1-hf000-02-daily-e.csv
2 : C:/TraceR/hf001-06-daily-m.csv
    Timestamp: 2020-06-01T09.07.21EDT
    Hash: 5e515ea3e7080543fba92b9b9114810f / md5
    Saved: C:/Prov/prov_combine-shaler-fisher/data/2-hf001-06-daily-m.csv
2 : C:/TraceR/hf001-08-hourly-m.csv
    Timestamp: 2019-10-17T11.34.57EDT
    Hash: af36c84e4c0b8f72632eba5661506129 / md5
    Saved: C:/Prov/prov_combine-shaler-fisher/data/3-hf001-08-hourly-m.csv

OUTPUTS:

1 : C:/TraceR/hf-shaler-gap-filled.csv
    Timestamp: 2020-07-08T10.21.34EDT
    Hash: a5022c912b1ec50e8cd4c20d8ed636cf / md5
    Saved: C:/Prov/prov_gap-fill-shaler/data/4-hf-shaler-gap-filled.csv
2 : C:/TraceR/hf-shaler-fisher-overlap.csv
    Timestamp: 2020-07-08T10.21.40EDT
    Hash: f7334fb30cf16c566f8e1de2b7643cf2 / md5
    Saved: C:/Prov/prov_combine-shaler-fisher/data/6-hf-shaler-fisher-overlap.csv
2 : C:/TraceR/hf300-05-daily-m.csv
    Timestamp: 2020-07-08T10.21.39EDT
    Hash: 1c9eabddcd5474e11e36168234a1cfae / md5
    Saved: C:/Prov/prov_combine-shaler-fisher/data/4-hf300-05-daily-m.csv
2 : C:/TraceR/hf300-06-daily-e.csv
    Timestamp: 2020-07-08T10.21.40EDT
    Hash: e463c55ff22f56c2fe5e7a69758d3339 / md5
    Saved: C:/Prov/prov_combine-shaler-fisher/data/5-hf300-06-daily-e.csv
3 : C:/TraceR/hf300-01-annual-m.csv
    Timestamp: 2020-07-08T10.21.42EDT
    Hash: e4969c413d3abce641335ad418b51f5c / md5
    Saved: C:/Prov/prov_calculate-hf-annual-monthly/data/2-hf300-01-annual-m.csv
3 : C:/TraceR/hf300-02-annual-e.csv
    Timestamp: 2020-07-08T10.21.42EDT
    Hash: cc99e71c31fd4696d99e68e724497dc5 / md5
    Saved: C:/Prov/prov_calculate-hf-annual-monthly/data/3-hf300-02-annual-e.csv
3 : C:/TraceR/hf300-03-monthly-m.csv
    Timestamp: 2020-07-08T10.21.42EDT
    Hash: de8267dba4643b5d174d4a3140bd9414 / md5
    Saved: C:/Prov/prov_calculate-hf-annual-monthly/data/4-hf300-03-monthly-m.csv
3 : C:/TraceR/hf300-04-monthly-e.csv
    Timestamp: 2020-07-08T10.21.42EDT
    Hash: bf6841b2b01b81c87b30f843b6dda0b1 / md5
    Saved: C:/Prov/prov_calculate-hf-annual-monthly/data/5-hf300-04-monthly-e.csv

EXCHANGES:

1 > 2 : C:/TraceR/hf-shaler-gap-filled.csv
    Timestamp: 2020-07-08T10.21.34EDT
    Hash: a5022c912b1ec50e8cd4c20d8ed636cf / md5
    Saved out: C:/Prov/prov_gap-fill-shaler/data/6-hf-shaler-fisher-overlap.csv
    Saved in: C:/Prov/prov_combine-shaler-fisher/data/1-hf-shaler-gap-filled.csv
2 > 3 : C:/TraceR/hf300-05-daily-m.csv
    Timestamp: 2020-07-08T10.21.39EDT
    Hash: 1c9eabddcd5474e11e36168234a1cfae / md5
    Saved out: C:/Prov/prov_combine-shaler-fisher/data/4-hf300-03-monthly-m.csv
    Saved in: C:/Prov/prov_calculate-hf-annual-monthly/data/1-hf300-05-daily-m.csv
```

In both cases, the colon after the script number for each file indicates that the file has not changed since the provenance was collected.
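
The meaning of the three status symbols can be illustrated with a minimal base R sketch. This is not provTraceR's own code: the helper name `file_status` is hypothetical, and MD5 hashing via `tools::md5sum()` is assumed only because the example output above reports md5 hashes.

```r
# Hypothetical helper (not part of provTraceR): returns the status symbol
# described for check = TRUE, given a file path and the hash recorded
# when provenance was collected. Assumes MD5 hashes.
file_status <- function(path, recorded_hash) {
  if (!file.exists(path)) {
    return("-")                                   # file no longer exists
  }
  current_hash <- unname(tools::md5sum(path))     # hash of the file as it is now
  if (identical(current_hash, recorded_hash)) ":" else "+"
}

# Demonstration with a temporary file standing in for a data file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), tmp)
recorded <- unname(tools::md5sum(tmp))            # hash at "collection" time

status_unchanged <- file_status(tmp, recorded)    # file untouched
writeLines(c("a,b", "1,3"), tmp)                  # modify the file contents
status_changed <- file_status(tmp, recorded)
unlink(tmp)                                       # delete the file
status_missing <- file_status(tmp, recorded)

print(c(unchanged = status_unchanged,
        changed   = status_changed,
        missing   = status_missing))
```

In this sketch, a file that still exists with the recorded hash yields a colon, a modified file yields a plus, and a deleted file yields a dash, mirroring the symbols printed between the script number and the file path.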