JAR Data and Code Sharing Policy
This template is designed to satisfy the Journal of Accounting Research’s Data and Code Sharing Policy. The policy expects authors to provide three things:
- Code that converts raw data into the final analytical dataset and produces the reported tables and figures.
- A comprehensive log file documenting the end-to-end execution of that code.
- Identifiers (e.g., gvkey, permno) of the observations comprising the final sample.
project-template is designed around these requirements:
- The pipeline separates raw WRDS pulls (RAW_DATA_DIR) from derived data (DATA_DIR). A replication run can re-execute scripts 2-4 against the original researcher’s preserved raw inputs without hitting WRDS.
- Every pipeline step produces a per-script log in the SAS-log style: every command echoed, output interleaved, plain text. R steps go through batch_run() (an R CMD BATCH wrapper in utils.R); Python steps go through an equivalent batch_run() in utils.py that runs each script in a subprocess through an AST-based echo wrapper; Stata’s native log using and SAS’s native proc printto produce the same shape. All four supported languages emit visually consistent logs.
- The 005-data-provenance.{R,py} step exports sample-identifiers.{parquet,csv} (gvkey, permno, rdq, datadate, fyearq, fqtr) into the template’s provenance/ folder and prints SHA256 hashes for every raw, derived, and output file. That step’s own .Rout/.log is the project’s content-addressed provenance record. The data files are gitignored by default so the template stays clean as it is passed around; submitters force-add the artifacts they want to ship with git add -f provenance/sample-identifiers.csv.
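The hashing half of the provenance step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the template's actual 005-data-provenance code: the function names sha256_of_file and hash_tree and the folder names in the main block are hypothetical, and the output format is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its SHA256 digest."""
    if not root.is_dir():
        return {}
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


if __name__ == "__main__":
    # Hypothetical folder names: substitute the project's RAW_DATA_DIR,
    # DATA_DIR, and output locations.
    for folder in (Path("raw-data"), Path("data"), Path("output")):
        for name, digest in hash_tree(folder).items():
            print(f"{digest}  {folder.name}/{name}")
```

Because the digests are printed to standard output, running such a script through the logging wrapper would place them in that step's log, which is what makes the log usable as a provenance record.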
Reference sample for exact-match replication
For users who want to confirm an exact-row match against the author’s
own run of the template, a reference copy of sample-identifiers.csv
lives in this hub’s provenance/ folder. After
running project-template end-to-end against your own WRDS access,
diff your project-template/provenance/sample-identifiers.csv
against the file in the hub. Matching row counts and identifier sets
confirm that your build reproduces the author’s; mismatches point to
where in the pipeline your raw data, sample filters, or merge keys
diverge.
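Beyond a line-level diff, the comparison can be done on identifier sets so that row order does not matter. The sketch below uses only the Python standard library. It assumes the CSV header uses the six column names listed above, compares values as raw strings (so a formatting difference such as 1234 versus 1234.0 would register as a mismatch), and the names load_identifiers and compare_samples and the hub path in the main block are hypothetical.

```python
import csv
from pathlib import Path

# Assumed header names, taken from the columns the provenance step exports.
KEY_COLUMNS = ("gvkey", "permno", "rdq", "datadate", "fyearq", "fqtr")


def load_identifiers(path: Path) -> list[tuple[str, ...]]:
    """Read a sample-identifiers CSV into a list of key tuples (as strings)."""
    with path.open(newline="", encoding="utf-8") as handle:
        return [
            tuple((row[col] or "").strip() for col in KEY_COLUMNS)
            for row in csv.DictReader(handle)
        ]


def compare_samples(mine: Path, reference: Path) -> dict:
    """Report row counts and the identifier tuples unique to each file."""
    mine_rows = load_identifiers(mine)
    ref_rows = load_identifiers(reference)
    mine_set, ref_set = set(mine_rows), set(ref_rows)
    return {
        "rows_mine": len(mine_rows),
        "rows_reference": len(ref_rows),
        "only_in_mine": sorted(mine_set - ref_set),
        "only_in_reference": sorted(ref_set - mine_set),
    }


if __name__ == "__main__":
    # Hypothetical paths: point these at your build and the hub's reference copy.
    mine = Path("project-template/provenance/sample-identifiers.csv")
    reference = Path("hub/provenance/sample-identifiers.csv")
    if mine.exists() and reference.exists():
        report = compare_samples(mine, reference)
        print("rows:", report["rows_mine"], "vs", report["rows_reference"])
        print("only in mine:", len(report["only_in_mine"]))
        print("only in reference:", len(report["only_in_reference"]))
```

Equal row counts with both "only in" lists empty indicate an exact identifier match; the tuples in either list show which firm-quarters to trace back through the pipeline.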
A real-world example
The following published research project applies these conventions in practice:
- Paper: Larocque, S. A., Watkins, J., and Weisbrod, E. H. (forthcoming). “Consensus? An Examination of Differences in Earnings Information Across Forecast Data Providers.” Journal of Accounting Research (in production).
- DOI: 10.1111/1475-679x.70072 (activates once Wiley completes production)
- Companion GitHub repository: https://github.com/eweisbrod/consensus
The consensus repository applies the same patterns described above — per-script logs, sample-identifier export, separation of raw and derived data — across a multi-language (R, Stata, SAS) production project. It is also the source of the SAS-side worked example presented in the SAS macros chapter.