Environment variables and the .env file
Most research projects need a small amount of configuration —
values that depend on where the project is running rather than what
the project is trying to compute. Examples are the path to your raw
data folder and the path to your output folder. These
are not part of the analysis; they are part of the local environment.
This note explains how the templates in this hub keep configuration
out of the code using environment variables and a .env file, why
that matters, and a few practical gotchas.
What is an environment variable?
An environment variable is a KEY=VALUE pair maintained by your
operating system and passed to every program it launches.
Programs read them by name: R calls Sys.getenv("DATA_DIR"),
Python calls os.getenv("DATA_DIR"), and the sections below show how the same idea extends to Stata and SAS. The variable
is the same in all four; it lives at the operating-system level,
and each program is just asking the OS for its current value.
When you open a terminal and type set (Windows CMD), env (Bash /
macOS), or Get-ChildItem env: (PowerShell), you’ll see a long list
of variables your OS is already maintaining: PATH, HOME,
USERNAME, TEMP, and many others. You could add a project-specific
variable like DATA_DIR directly to that OS-level list,
and on some systems people do, but in practice that turns out to
be a bad fit for research projects. The next two sections explain
why, and what the templates do instead.
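A minimal Python sketch of the idea that a launched program receives its variables from whoever started it. The helper name read_in_child is hypothetical and is not part of the templates; it starts a second Python process with one extra variable and reports what that process sees.

```python
import os
import subprocess
import sys


def read_in_child(name: str, value: str) -> str:
    """Launch a child Python process with NAME=VALUE set and return what it reads."""
    child_env = dict(os.environ)   # copy of this process's environment
    child_env[name] = value        # add or replace a single variable
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.getenv({name!r}))"],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(read_in_child("DATA_DIR", "D:/demo/derived"))
```

The child never sees a file or a script argument; it simply asks for DATA_DIR by name, which is the same lookup R and Python perform in the snippets on this page.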
Why use environment variables in a research project?
If you hardcode the path to your data in a script:
data <- read_parquet("D:/Dropbox/my-project/data/regdata.parquet")
…then the script runs on your machine, and only your machine, and only as long as that path doesn’t change. The moment a coauthor opens the same script, or you move the data to a different drive, the script breaks.
If instead you read the path from an environment variable:
data_dir <- Sys.getenv("DATA_DIR")
data <- read_parquet(glue("{data_dir}/regdata.parquet"))
…the script reads whatever DATA_DIR resolves to on the machine
where it’s currently running. Different machine, different
DATA_DIR, same script — no edits needed.
This is one of the most important habits in a reproducible research workflow: separate code from configuration. The code lives in git (version control), the configuration lives outside git, and you don’t have to edit code to change configuration. This also makes code shareable between coauthors and clean for posting on academic journal websites without revealing your personal information.
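For comparison, here is a sketch of the same pattern in Python. The function name data_file and the optional env argument are assumptions made for illustration (the argument just makes the function easy to try without touching the real environment); the idea is to build paths from DATA_DIR and fail loudly when it is missing.

```python
import os
from pathlib import Path


def data_file(filename, env=None):
    """Return the path of FILENAME inside the folder named by DATA_DIR."""
    env = os.environ if env is None else env
    data_dir = env.get("DATA_DIR", "")
    if not data_dir:
        # An unset variable is a configuration problem, not a data problem.
        raise RuntimeError("DATA_DIR is not set; check that .env was loaded")
    return Path(data_dir) / filename


if __name__ == "__main__":
    demo_env = {"DATA_DIR": "/tmp/derived"}
    print(data_file("regdata.parquet", demo_env))
```

Raising an error on an empty value is a design choice: it surfaces a missing configuration at the first line that needs it rather than as a confusing "file not found" later.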
Project-level configuration via .env
The OS-level approach is fine for variables that really are global —
PATH, HOME, your default editor. Research projects don’t fit
that mold. A typical empirical researcher has several active
projects at any one time, each with its own data folder, its own
output folder, sometimes its own sample window or feature flags.
If DATA_DIR lives at the OS level there is only one of it at any
moment, and you’d be editing your OS environment every time you
switched between projects. That’s noisy, error-prone, and
impractical to share with a coauthor.
The convention the templates follow instead is a small text file
named .env at the project root. The file is just KEY=VALUE
lines, for example:
RAW_DATA_DIR=D:/Dropbox/my-project/data/raw
DATA_DIR=D:/Dropbox/my-project/data/derived
OUTPUT_DIR=output
For each of R, Python, Stata, and SAS, the templates provide code that reads this file at the start of a script and exposes its contents much as if they were OS-level environment variables (the section on how each language reads .env, below, covers the specifics). The benefits over an OS-level setup:
- One file per project. Every project has its own .env, so DATA_DIR can take a different value in every project on the same machine: no manual switching, no global conflicts.
- Project-level overrides win. When a .env value coexists with an OS-level variable of the same name (a stale DATA_DIR left over from another project, say), the project’s .env takes precedence, provided you call the loader correctly (in Python, override=True; see below). The OS keeps its global defaults and individual projects get the last word inside their own scope.
- No admin rights needed. Setting OS-level env vars on a managed university machine can require admin privileges; editing a text file in your own project folder never does.
- A coauthor copies .example-env, edits the paths, runs. Sharing OS-level configuration is impractical because everyone’s environment is different. Sharing a project-level .env template is trivial.
The .env file is just a text file. There’s no magic. You can
edit it in any text editor. The format is intentionally trivial so
that every language can read it.
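To show how little there is to the format, here is a deliberately small parser sketch. It is not the code used by the dotenv packages or by the templates; parse_env_text is a hypothetical name, and this version assumes strict KEY=VALUE lines and trims surrounding whitespace, which real loaders may or may not do.

```python
def parse_env_text(text):
    """Parse strict KEY=VALUE lines into a dict (illustrative only)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not KEY=VALUE
        key, _, value = line.partition("=")  # split on the first "=" only
        values[key.strip()] = value.strip()
    return values


if __name__ == "__main__":
    sample = (
        "RAW_DATA_DIR=D:/Dropbox/my-project/data/raw\n"
        "DATA_DIR=D:/Dropbox/my-project/data/derived\n"
        "OUTPUT_DIR=output\n"
    )
    for key, value in parse_env_text(sample).items():
        print(key, "->", value)
```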
The three (sometimes more) env vars used in this template
The example pipeline reads three variables out of .env:
| Variable | What it points at |
|---|---|
| RAW_DATA_DIR | The folder holding raw data pulls (large, slow to refresh, treated as read-only inputs). |
| DATA_DIR | The folder holding derived datasets that the pipeline produces (regdata.parquet, figure-data.parquet, etc.). Safe to delete and regenerate. |
| OUTPUT_DIR | The folder holding final tables and figures (.tex, .pdf, .png, .docx, .rtf). |
A few optional variables are read when present:
| Variable | Purpose |
|---|---|
| STATA_BIN | Full path to the Stata executable, if stata isn’t on your PATH. Read by batch_run_stata(). |
| SAS_BIN | Full path to the SAS executable, if sas isn’t on your PATH. Read by batch_run_sas(). |
| SAS_WORK_DIR | Override for SAS’s WORK library when the default %TEMP% location is on a too-small drive. |
The rule for what belongs here: anything that varies by machine
or user. The path to your data folder, the path to a Stata
executable, a per-machine memory budget. Anything that should be
the same for every collaborator and every future replicator — the
sample window, the winsorization percentile, the regression
specification — belongs in code, committed to git, not in .env.
The “Other uses” section near the end gives concrete examples on
both sides of that line.
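One way to make the three required variables explicit is a small check at the top of a script. This is a sketch, not something the templates are known to ship; REQUIRED and missing_vars are hypothetical names.

```python
import os

# The three variables the example pipeline reads from .env.
REQUIRED = ("RAW_DATA_DIR", "DATA_DIR", "OUTPUT_DIR")


def missing_vars(env, required=REQUIRED):
    """Return the required variable names that are unset or empty in ENV."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Missing from the environment:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Optional variables such as STATA_BIN are left out of the check on purpose, since the scripts only read them when present.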
A worked example: two authors, three machines
Suppose you (Author 1) are collaborating with one coauthor (Author 2) on a paper. Between the two of you, you use three machines, each of which has different paths to the data:
Author 1, laptop (small SSD, Dropbox folder on the C: drive):
RAW_DATA_DIR=C:/Dropbox/our-paper/data/raw
DATA_DIR=C:/Dropbox/our-paper/data/derived
OUTPUT_DIR=output
Author 1, desktop (large secondary drive, data on the D: drive):
RAW_DATA_DIR=D:/Dropbox/our-paper/data/raw
DATA_DIR=D:/Dropbox/our-paper/data/derived
OUTPUT_DIR=output
Author 2, MacBook (macOS, no drive letters, Dropbox in ~):
RAW_DATA_DIR=/Users/coauthor/Dropbox/our-paper/data/raw
DATA_DIR=/Users/coauthor/Dropbox/our-paper/data/derived
OUTPUT_DIR=output
Same code, three different .env files. Every author runs the
pipeline on their own machine, the scripts pick up the right paths
from .env, and no one ever has to edit a read_parquet() line.
This is also why paths in .env use forward slashes even on
Windows. R, Python, Stata, and SAS all accept / on Windows
without conversion, but interpreting \ requires escaping (\\)
which is error-prone. The convention across the templates is
unconditional forward slashes; the scripts handle them correctly on
every OS.
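A short Python illustration of both halves of that claim, using only the standard library. It shows Python string literals specifically; other languages and parsers have their own escaping rules, so treat it as an example of the failure mode rather than a description of how any .env loader behaves.

```python
from pathlib import PureWindowsPath

# Forward slashes and properly escaped backslashes name the same Windows path.
forward = PureWindowsPath("D:/Dropbox/our-paper/data/raw")
escaped = PureWindowsPath("D:\\Dropbox\\our-paper\\data\\raw")
same_path = forward == escaped

# Forgetting to escape is where things go wrong: in this literal, "\n" and "\t"
# are read as a newline and a tab, not as folder separators.
mangled = "D:\new\table"
has_control_chars = "\n" in mangled and "\t" in mangled

if __name__ == "__main__":
    print("same path:", same_path)
    print("control characters in mangled path:", has_control_chars)
```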
.env must be gitignored
.env is never committed to git. The templates’ .gitignore
files all include .env. This is non-negotiable
for three reasons:
- .env is machine-specific. If Author 1’s .env got committed from the laptop, Author 2 would clone it and the paths would point at directories that don’t exist on the Mac.
- .env is the natural place to accidentally put secrets. Even though the templates teach you to keep credentials in the OS keyring (see below), some teammate at some point will paste a WRDS password into .env while debugging. Once committed, that password is in your git history forever, even after you delete it from the current file. Treat .env like a credentials file: never commit, never push, etc.
- .env is meant to be customized. Each collaborator should have their own. Committing yours creates merge conflicts and confused expectations.
What you DO commit: .example-env
To tell new contributors what variables they need to set, the
templates commit a sibling file called .example-env. Same KEY=VALUE
format, but with placeholder values:
RAW_DATA_DIR=D:/Dropbox/your-project-name/data/raw
DATA_DIR=D:/Dropbox/your-project-name/data/derived
OUTPUT_DIR=output
A new contributor copies .example-env to .env, fills in their
real paths, and is up and running.
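(The name is .example-env, not .env.example. The reason is
collision safety: a name that begins with .env is matched by the
same .env* glob patterns that target the real .env, so an ignore
rule or a casual git rm .env* would treat the example file and the
real .env alike. .example-env does not match .env* and avoids the
problem.)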
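The name collision can be checked with Python’s fnmatch module, which is assumed here to behave like shell and gitignore globbing for a pattern as simple as .env* (the full gitignore rules are richer than fnmatch).

```python
from fnmatch import fnmatchcase

candidates = [".env", ".env.example", ".example-env"]

# Names that a ".env*" pattern would sweep up together.
matched = [name for name in candidates if fnmatchcase(name, ".env*")]

if __name__ == "__main__":
    for name in candidates:
        status = "matches" if name in matched else "does not match"
        print(f"{name}: {status} .env*")
```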
How each language reads .env
The mechanism is slightly different in each language, but the
result is the same: you call a function or %include a macro,
and after that your script can read any variable defined in .env.
R
library(dotenv)
load_dot_env(".env")
raw_data_dir <- Sys.getenv("RAW_DATA_DIR")
data_dir <- Sys.getenv("DATA_DIR")
The dotenv package is available on CRAN. load_dot_env(".env") parses the
file and sets each KEY as a real environment variable of the R
process for the duration of the session; nothing is written to your
system-wide settings.
Python
from dotenv import load_dotenv
import os
load_dotenv(".env", override=True)
raw_data_dir = os.getenv("RAW_DATA_DIR")
data_dir = os.getenv("DATA_DIR")
The python-dotenv package is available on PyPI. The override=True
argument matters: by default, python-dotenv does NOT override an
existing environment variable with the same name. If you have
DATA_DIR set in your Windows User Environment Variables (for
example, because a previous project set it system-wide), Python
will silently use the system value and ignore your project’s .env.
override=True makes the project’s .env win, which is almost
always what you want for a reproducible project.
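The precedence rule can be pictured with a few lines of plain Python. This is an illustration of the behavior described above, not python-dotenv’s implementation; apply_env is a hypothetical name, and it works on ordinary dicts so nothing in your real environment changes.

```python
def apply_env(file_values, environ, override):
    """Merge values read from .env into a copy of ENVIRON.

    With override=False an existing variable keeps its old value;
    with override=True the value from .env replaces it.
    """
    merged = dict(environ)
    for key, value in file_values.items():
        if override or key not in merged:
            merged[key] = value
    return merged


if __name__ == "__main__":
    system = {"DATA_DIR": "D:/old-project/data"}           # stale system-wide value
    project = {"DATA_DIR": "D:/Dropbox/my-project/data/derived"}
    print(apply_env(project, system, override=False)["DATA_DIR"])
    print(apply_env(project, system, override=True)["DATA_DIR"])
```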
Stata
Stata does not ship with a .env reader, so the templates use a
combination of two community packages to get the same behavior:
- projectpaths: a Stata package I wrote that records the on-disk location of each registered project, so a .do file can cd to its own project root before any other command runs. This matters here because doenv looks for .env in the current working directory; without projectpaths, you’d have to manually cd to the project root every time you launch Stata. This package is also useful for any other commands that rely on setting Stata’s current working directory to the corresponding project root for each user or machine.
- doenv by Vik Jam: parses a .env file and exposes each KEY as an r() result.
Install both once per Stata installation:
net install projectpaths, from("https://raw.githubusercontent.com/eweisbrod/projectpaths/main/src/") replace
net install doenv, from("https://github.com/vikjam/doenv/raw/master/")
Then register the project’s location with projectpaths (once per
project), for example:
project_paths_list, add project(your-project-name) path("C:/_git/your-project-name")
After this one-time setup for each coauthor, every .do file in the project can open with the same
preamble:
project_paths_list, project(your-project-name) cd
doenv using ".env"
local raw_data_dir "`r(RAW_DATA_DIR)'"
local data_dir "`r(DATA_DIR)'"
Note the slightly different idiom: Stata exposes the values as
r() results rather than as OS env vars, so you assign them to
local macros for use in your script.
SAS
SAS does not ship with a .env reader, and the templates do not rely
on a community package for one, so the
sas-example/MACROS.sas file in this
hub provides a %load_env macro that fills the gap.
Unlike R, Python, and Stata, SAS has no clean built-in for “where
am I running from” — and %load_env needs that, because it looks
for .env relative to the executing .sas file. The preamble
therefore resolves &codepath (the full path of the currently
executing .sas file) before doing anything else, by reading
whichever of two SAS-provided values is populated:
- SYSIN: set automatically when SAS is launched in batch mode via sas -SYSIN script.sas.
- SAS_EXECFILEPATH: set automatically by Enhanced Editor and Enterprise Guide when you submit a script interactively.
Once &codepath is known, the preamble can %include the macros
file and call %load_env:
/* Resolve the running script's path. SYSIN is set in batch mode (sas -SYSIN);
SAS_EXECFILEPATH is set by Enhanced Editor / Enterprise Guide interactively. */
%let codepath = %sysfunc(getoption(sysin));
%if %length(&codepath) = 0 %then %do;
    %let codepath = %sysfunc(sysget(SAS_EXECFILEPATH));
%end;
%include "&codepath\..\MACROS.sas";
%load_env;
libname raw "&RAW_DATA_DIR";
libname data "&DATA_DIR";
After %load_env;, each KEY from .env is available as a global
SAS macro variable (&RAW_DATA_DIR, &DATA_DIR, etc.). By default,
the macro derives the path to .env from &codepath by stripping
the .sas filename and then stripping one more directory level —
with the conventional src/foo.sas layout, that lands on the repo
root. The macro doesn’t check for the literal name src/; any
project where the .sas file lives one directory below .env
works. If the auto-derived path doesn’t exist, %load_env errors
out clearly and tells you to pass an explicit path via
%load_env(file=...).
See sas-example/README.md for the
fine print on how the SAS implementation handles batch vs.
interactive mode.
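The default path derivation described above (strip the .sas filename, then strip one more directory level) can be sketched in Python for readers who want to check where %load_env will look. This is only an illustration of the stated rule, not the macro’s code; default_env_path is a hypothetical name, and Windows-style paths are assumed because the SAS preamble above uses them.

```python
from pathlib import PureWindowsPath


def default_env_path(codepath):
    """Return where .env is expected, given the full path of the running .sas file."""
    script = PureWindowsPath(codepath)
    project_root = script.parent.parent   # drop the filename, then one directory
    return (project_root / ".env").as_posix()


if __name__ == "__main__":
    print(default_env_path("C:/_git/your-project-name/src/foo.sas"))
```

Any layout with the script exactly one directory below .env gives the same result, which matches the note that the folder does not have to be named src.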
Credentials do NOT go in .env
This is the rule that catches most people the first time they set
up a .env file. .env is for non-secret configuration only:
things like paths, executable locations, and feature flags. It is NOT for:
- WRDS usernames and passwords
- API keys
- Database connection strings that include passwords
- SSH private keys
- Anything else you would not want to accidentally email
The risk is the same as the gitignore risk above: even with .env
gitignored, a plain-text file leaks through too many side
channels, such as cloud-sync folders, backups, zipped project
folders sent to a coauthor, and shared screens.
The templates handle credentials in one of two ways:
- R, Python: the keyring package stores credentials in the OS-native credential store (Windows Credential Manager on Windows, Keychain on macOS, Secret Service on Linux). The templates store both the WRDS username and password under the service "wrds", using the literal key names "username" and "password", so R and Python read the same two entries. In R the setter is interactive: key_set prompts you to type the value at runtime, so no plaintext appears in your script:

  keyring::key_set("wrds", "username") # prompts; type your WRDS username
  keyring::key_set("wrds", "password") # prompts; type your WRDS password

  Python’s keyring.set_password() takes the value as an argument rather than prompting, so the interactive equivalent reads from input/getpass:

  import keyring
  from getpass import getpass
  keyring.set_password("wrds", "username", input("WRDS username: "))
  keyring.set_password("wrds", "password", getpass("WRDS password: "))

  After that, scripts read on demand (keyring::key_get("wrds", "username") in R, keyring.get_password("wrds", "username") in Python) and the credentials never sit in a file; they live in the OS’s encrypted credential vault. The templates’ project_setup() (R) and setup.py (Python) walk through this prompt-and-store flow the first time you run the pipeline.
- Alternate (from WRDS Examples): The WRDS guidance is to use a .pgpass file in your home directory with mode 0600 file permissions, which the PostgreSQL client libraries read automatically. Same logical separation: secrets outside the project tree, project tree free of secrets.
A useful mental test: if you would not want this value to appear
in a screenshot during a Zoom presentation, it does not belong in
.env.
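A rough guard against the most common slip is to scan the variable names in .env for secret-looking words. The sketch below is an assumption-laden heuristic, not a feature of the templates: SECRET_HINTS and suspicious_keys are hypothetical names, and a clean result does not prove the file is free of secrets.

```python
# Substrings that suggest a variable holds a credential (heuristic, not exhaustive).
SECRET_HINTS = ("PASSWORD", "PASSWD", "SECRET", "TOKEN", "API_KEY")


def suspicious_keys(env_values):
    """Return variable names that look like they hold credentials."""
    return sorted(
        key
        for key in env_values
        if any(hint in key.upper() for hint in SECRET_HINTS)
    )


if __name__ == "__main__":
    example = {"DATA_DIR": "D:/foo/bar", "WRDS_PASSWORD": "placeholder"}
    for key in suspicious_keys(example):
        print(f"{key} looks like a credential; move it to the keyring")
```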
Other uses of .env beyond paths
Paths and binary locations are the most common uses, but anything else that legitimately varies by machine or by user is a good fit too. A few examples:
- Per-machine compute resources: MAX_RAM_GB=16, N_WORKERS=8. Lets the same parallel-processing code run on a 16 GB laptop and on a 256 GB server without manual tuning.
- Non-standard service endpoints, only if they actually vary. WRDS_HOST=wrds-pgdata.wharton.upenn.edu is the same for everyone and so is hardcoded in the templates; if you’re hitting a custom mirror or a private host, that’s a legitimate .env use.
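Values read from the environment are always strings, so numeric settings such as N_WORKERS need converting. A minimal sketch, assuming a hypothetical helper named int_setting and a fallback default for machines where the variable is not set:

```python
import os


def int_setting(env, name, default):
    """Read an integer setting from ENV, falling back to DEFAULT when unset."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


if __name__ == "__main__":
    n_workers = int_setting(os.environ, "N_WORKERS", default=1)
    max_ram_gb = int_setting(os.environ, "MAX_RAM_GB", default=8)
    print(f"workers={n_workers}, max RAM={max_ram_gb} GB")
```

The defaults of 1 worker and 8 GB are placeholders chosen for the example, not values taken from the templates.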
What does NOT belong in .env
The reverse matters just as much. Anything that is part of the
analytical design should NOT live in .env, because it should
be the same for every collaborator and every future replicator.
That includes:
- The project name, citation key, paper title.
- Sample window (e.g., START_FYEAR=1970, END_FYEAR=2024). These define your study’s sample and have to be reproducible.
- Variable construction choices: winsorization percentile, scaling denominator, which sub-sample to exclude. These are analytical decisions.
- Regression specifications, fixed-effect choices, clustering.
All of those belong in code, committed to git, where they can be versioned and reviewed. The dividing test:
If a future replicator running your code needs a particular value to be a particular number to reproduce your tables, that value is part of the analysis, so put it in code. If it’s something they’d legitimately want to change to match their own machine or account, put it in .env.
Common gotchas
A handful of things that bite people the first time:
- Backslashes on Windows. Use forward slashes (/) in .env regardless of OS. R, Python, Stata, and SAS all accept them on Windows; backslashes require escaping and create platform-specific paths that break on Mac and Linux.
- No # comments inside .env. Strict KEY=VALUE only. Some dotenv libraries’ #-comment parsing is finicky around values that contain # (e.g., a URL with a fragment); the templates’ convention is to avoid them entirely and document what each variable means in this topic page or in a README instead. The .example-env files in the templates are also comment-free.
- Quotes around values. Avoid them: write DATA_DIR=D:/foo/bar, not DATA_DIR="D:/foo/bar". The dotenv libraries usually strip the quotes correctly, but not always, and it’s easier to never introduce them.
- Trailing whitespace. DATA_DIR=D:/foo/bar followed by a trailing space can, depending on the loader, silently set DATA_DIR to a path that ends in a space. This breaks file reads in baffling ways. Use a text editor that shows whitespace if you keep seeing “file not found” on paths that look right.
- System-level env vars shadowing your project’s .env. In Python this is the override=True issue covered above. In R, Sys.getenv() returns whatever was set most recently in the session, so if you accidentally set DATA_DIR from another project earlier, it will leak. Restart R or call Sys.unsetenv("DATA_DIR") if you’re not getting the expected value.
- .env not found. The template code generally looks for .env in the current working directory, which is assumed to be the project root directory, not the script’s directory (e.g., src/). If the working directory is not the project root, or more generally not where .env is stored, you may get the .env not found error.
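Several of these gotchas can be caught mechanically before a script runs. The checker below is a sketch under the conventions stated in the list above (strict KEY=VALUE, no quotes, no backslashes, no #); lint_env_line is a hypothetical name and the templates are not known to include such a check.

```python
def lint_env_line(line):
    """Return a list of convention problems found in one line of a .env file."""
    body = line.rstrip("\r\n")
    if not body.strip():
        return []                      # blank lines are harmless
    problems = []
    if body != body.rstrip():
        problems.append("trailing whitespace")
    key, sep, value = body.partition("=")
    if not sep:
        problems.append("not a KEY=VALUE line")
        return problems
    if value.strip()[:1] in ('"', "'"):
        problems.append("quoted value")
    if "\\" in value:
        problems.append("backslash in value")
    if "#" in body:
        problems.append("'#' character")
    return problems


if __name__ == "__main__":
    for text in ["DATA_DIR=D:/foo/bar", "DATA_DIR=D:/foo/bar ", 'DATA_DIR="D:/foo/bar"']:
        print(repr(text), "->", lint_env_line(text) or "ok")
```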
How to debug when .env isn’t loading
The symptom is usually a path-related error: “file not found,”
“directory does not exist,” Sys.getenv("DATA_DIR") returning the
empty string. Three diagnostic steps in order:
- Print the loaded value. Right after load_dot_env(".env"), add print(Sys.getenv("DATA_DIR")) (or the Python/Stata/SAS equivalent). If it’s empty, the file wasn’t loaded or the variable wasn’t in it. If it’s the wrong value, you have a shadowing problem (see the system-level shadowing gotcha above).
- Check the working directory. getwd() (R), os.getcwd() (Python), pwd (Stata): confirm you’re in the project root.
- Open .env in a hex-aware editor if all else fails. Things like a BOM at the start of the file (some Windows editors add one), Windows-style CRLF line endings in a file read on a Mac, or invisible trailing whitespace can all break parsing. VS Code’s “Render Whitespace” toggle is the easiest way to see what’s actually there.
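The file-level checks in these steps (working directory, file presence, BOM, line endings) can also be gathered in one place without a hex editor. A sketch in Python, with the hypothetical name diagnose_env_file; it only reads the file and reports, and it assumes a UTF-8 BOM is the one of interest.

```python
import os
from pathlib import Path


def diagnose_env_file(path=".env"):
    """Report where we are running and what the raw bytes of the .env file look like."""
    env_path = Path(path)
    report = {"cwd": os.getcwd(), "exists": env_path.is_file()}
    if report["exists"]:
        raw = env_path.read_bytes()
        report["has_bom"] = raw.startswith(b"\xef\xbb\xbf")   # UTF-8 byte-order mark
        report["has_crlf"] = b"\r\n" in raw                   # Windows line endings
    return report


if __name__ == "__main__":
    for key, value in diagnose_env_file().items():
        print(f"{key}: {value}")
```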
A note on R’s .Renviron
R has long had its own version of project-level configuration via
a file called .Renviron (at the project root) or ~/.Renviron
(at the user level). It does very similar work to .env: a file
of KEY=VALUE lines that R reads at startup and exposes through
Sys.getenv(). Earlier versions of these templates used
.Renviron for exactly that purpose.
The current templates standardize on .env plus the dotenv
package instead, for one reason: a project that mixes R, Python,
Stata, and SAS needs one configuration file that all four
languages can read. .Renviron is R-only; .env is universal.
See also
- sas-example/README.md: the SAS %load_env macro and its batch-vs-interactive handling.
- The project_setup() function in each language’s utils.{R,py} in the project-template repo: walks new users through creating their first .env interactively.
- .example-env in the polyglot template: the committed placeholder file.