Provenance folder
Reference copy of the sample-identifier file from the author’s own
pipeline run, provided so replicators can confirm an exact-row match
against their own re-run of project-template.
This reference file lives in the hub repo rather than in
project-template so the template stays reusable: people who clone
or “Use this template” from project-template shouldn’t have to
remember to delete a sample-id file tied to the author’s specific
WRDS pull before adapting the skeleton to their own project. The
hub’s role is to host project-specific artifacts (this paper’s
reference data, the JOSE paper, the SAS reference macros); the
template’s role is to ship clean, project-agnostic code.
Files
| File | Description |
|---|---|
| sample-identifiers.csv | One row per firm-quarter in the final analytic sample (617,494 rows). Columns: gvkey, permno, rdq, datadate, fyearq, fqtr. |
How to use
Run the companion project-template end-to-end against your own WRDS
access, then compare your generated
project-template/provenance/sample-identifiers.csv to the file in this
folder. If you match the row counts and the identifier set, your
intermediate data builds are reproducing the author’s. If you do not,
the differences point at where in the pipeline your raw data, sample
filters, or merge keys diverge from the author’s.
A small Python or R diff is enough — for example:
Python:
import pandas as pd
mine = pd.read_csv("project-template/provenance/sample-identifiers.csv")
ref = pd.read_csv("example-project/provenance/sample-identifiers.csv")
print(f"Mine: {len(mine):,} Reference: {len(ref):,}")
# Firm-level overlap only: counts distinct gvkeys present in both files.
print(f"Common gvkeys: {len(set(mine['gvkey']) & set(ref['gvkey'])):,}")
R:
mine <- read.csv("project-template/provenance/sample-identifiers.csv")
ref <- read.csv("example-project/provenance/sample-identifiers.csv")
cat(sprintf("Mine: %s Reference: %s\n",
format(nrow(mine), big.mark = ","),
format(nrow(ref), big.mark = ",")))
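The snippets above compare row counts and firm-level overlap. For a row-level check of the full identifier set, the following is a minimal sketch. It assumes the six columns listed under Files together identify a firm-quarter; the function name `compare_identifiers` and the `KEY_COLS` constant are hypothetical and not part of project-template.

```python
import pandas as pd

# Assumption: these six columns (from the Files table) jointly identify a row.
KEY_COLS = ["gvkey", "permno", "rdq", "datadate", "fyearq", "fqtr"]


def compare_identifiers(mine: pd.DataFrame, ref: pd.DataFrame) -> dict:
    """Count key rows found in both files, only in mine, and only in ref."""
    # Cast to string so every key column has the same dtype on both sides,
    # and drop duplicate keys so each key is counted once.
    left = mine[KEY_COLS].astype(str).drop_duplicates()
    right = ref[KEY_COLS].astype(str).drop_duplicates()

    # An outer merge with indicator=True labels each key row as
    # "both", "left_only", or "right_only" in the "_merge" column.
    merged = left.merge(right, on=KEY_COLS, how="outer", indicator=True)
    counts = merged["_merge"].value_counts()

    return {
        "both": int(counts.get("both", 0)),
        "only_mine": int(counts.get("left_only", 0)),
        "only_ref": int(counts.get("right_only", 0)),
    }
```

To use it, read both files with `pd.read_csv(path, dtype=str)` so identifiers are compared exactly as written, then pass the two frames to `compare_identifiers`. An exact match should report zero for both `only_mine` and `only_ref`; nonzero counts indicate which side has the extra firm-quarters.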
How this file was produced
Generated by src/005-data-provenance.py in project-template against the
author’s WRDS access on 2026-05-25. The script that
produced it, together with the full pipeline that built the upstream
analytic dataset, is the canonical specification of how the sample
was constructed.