Provenance folder

Reference copy of the sample-identifier file from the author’s own pipeline run, provided so replicators can confirm an exact-row match against their own re-run of project-template.

This reference file lives in the hub repo rather than in project-template so that the template stays reusable: anyone who clones project-template, or creates a repository from it via “Use this template”, shouldn’t have to remember to delete a sample-id file tied to the author’s specific WRDS pull before adapting the skeleton to their own project. The hub’s role is to host project-specific artifacts (this paper’s reference data, the JOSE paper, the SAS reference macros); the template’s role is to ship clean, project-agnostic code.

Files

File                      Description
sample-identifiers.csv    One row per firm-quarter in the final analytic sample (617,494 rows). Columns: gvkey, permno, rdq, datadate, fyearq, fqtr.
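
Before comparing files, it can help to confirm that a generated file has the layout described above. The following is a minimal sketch, not part of the pipeline: `check_layout` is a hypothetical helper, the toy rows are made up for illustration, and treating gvkey plus datadate as the firm-quarter key is an assumption.

```python
import pandas as pd

# Column names as listed in the file description above.
EXPECTED_COLS = ["gvkey", "permno", "rdq", "datadate", "fyearq", "fqtr"]
# Assumption: one firm-quarter is identified by gvkey + datadate.
FIRM_QUARTER_KEY = ["gvkey", "datadate"]


def check_layout(df: pd.DataFrame) -> dict:
    """Return simple layout diagnostics for an identifier table."""
    missing = [c for c in EXPECTED_COLS if c not in df.columns]
    if all(c in df.columns for c in FIRM_QUARTER_KEY):
        # Count rows that repeat an earlier (gvkey, datadate) pair.
        n_dupes = int(df.duplicated(subset=FIRM_QUARTER_KEY).sum())
    else:
        n_dupes = None  # cannot check uniqueness without the key columns
    return {
        "missing_columns": missing,
        "n_rows": len(df),
        "n_duplicate_firm_quarters": n_dupes,
    }


# Made-up toy rows; the last row repeats the firm-quarter of the third.
toy = pd.DataFrame({
    "gvkey": ["000001", "000001", "000002", "000002"],
    "permno": ["10001", "10001", "10002", "10002"],
    "rdq": ["2020-04-20", "2020-07-20", "2020-04-25", "2020-04-25"],
    "datadate": ["2020-03-31", "2020-06-30", "2020-03-31", "2020-03-31"],
    "fyearq": ["2020", "2020", "2020", "2020"],
    "fqtr": ["1", "2", "1", "1"],
})
report = check_layout(toy)
print(report)
```

For a real file, pass the result of `pd.read_csv(path, dtype=str)`. If the file matches the description above, no columns should be missing and `n_rows` should equal the stated row count.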

How to use

Run the companion project-template end-to-end against your own WRDS access, then compare your generated project-template/provenance/sample-identifiers.csv to the file in this folder. If the row counts and the identifier sets match, your intermediate data builds are reproducing the author’s. If they do not, the differences point to where in the pipeline your raw data, sample filters, or merge keys diverge from the author’s.

A small Python or R diff is enough, for example:

Python:

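import pandas as pd

# Load your generated file and the author's reference copy
mine = pd.read_csv("project-template/provenance/sample-identifiers.csv")
ref  = pd.read_csv("example-project/provenance/sample-identifiers.csv")
# Row counts for each file
print(f"Mine: {len(mine):,}  Reference: {len(ref):,}")
# Distinct gvkeys present in both files (firm-level overlap only)
print(f"Common gvkeys: {len(set(mine['gvkey']) & set(ref['gvkey'])):,}")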

R:

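mine <- read.csv("project-template/provenance/sample-identifiers.csv")
ref  <- read.csv("example-project/provenance/sample-identifiers.csv")
# Row counts for each file
cat(sprintf("Mine: %s  Reference: %s\n",
            format(nrow(mine), big.mark = ","),
            format(nrow(ref),  big.mark = ",")))
# Distinct gvkeys present in both files (firm-level overlap only)
cat(sprintf("Common gvkeys: %s\n",
            format(length(intersect(mine$gvkey, ref$gvkey)), big.mark = ",")))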
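
The snippets above stop at row counts and gvkey overlap. To see which rows differ, one option is an outer join on all identifier columns. The following is a sketch under stated assumptions rather than the author's method: `compare_identifiers` is a hypothetical helper, the toy rows are made up, and joining on all six columns is one reading of what "the identifier set" means.

```python
import pandas as pd

# Identifier columns as listed in the Files section.
KEY_COLS = ["gvkey", "permno", "rdq", "datadate", "fyearq", "fqtr"]


def compare_identifiers(mine: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:
    """Outer-join two identifier tables on all key columns.

    The result has a `_merge` column: 'both' for rows in both tables,
    'left_only' for rows only in `mine`, 'right_only' for rows only in `ref`.
    """
    left = mine[KEY_COLS].drop_duplicates()
    right = ref[KEY_COLS].drop_duplicates()
    return left.merge(right, on=KEY_COLS, how="outer", indicator=True)


# Made-up toy rows standing in for the two CSV files.
mine = pd.DataFrame({
    "gvkey": ["000001", "000001", "000002"],
    "permno": ["10001", "10001", "10002"],
    "rdq": ["2020-04-20", "2020-07-20", "2020-04-25"],
    "datadate": ["2020-03-31", "2020-06-30", "2020-03-31"],
    "fyearq": ["2020", "2020", "2020"],
    "fqtr": ["1", "2", "1"],
})
ref = pd.DataFrame({
    "gvkey": ["000001", "000001", "000003"],
    "permno": ["10001", "10001", "10003"],
    "rdq": ["2020-04-20", "2020-07-20", "2020-05-01"],
    "datadate": ["2020-03-31", "2020-06-30", "2020-03-31"],
    "fyearq": ["2020", "2020", "2020"],
    "fqtr": ["1", "2", "1"],
})

diff = compare_identifiers(mine, ref)
# Tally rows found in both tables, only in mine, or only in the reference.
print(diff["_merge"].value_counts())
# Show the rows that do not match.
print(diff[diff["_merge"] != "both"])
```

With the real files, read both CSVs with `dtype=str` so that values are compared exactly as written, then pass them to the same function. Grouping the unmatched rows by fyearq or by the `_merge` value is one way to narrow down whether a divergence comes from the raw pull, a sample filter, or a merge key.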

How this file was produced

Generated by src/005-data-provenance.py in project-template against the author’s WRDS access on 2026-05-25. The script that produced it, together with the full pipeline that built the upstream analytic dataset, is the canonical specification of how the sample was constructed.