The following material is adapted from an HPC Carpentry lesson.
Let’s look at an example project which follows the project structure guidelines given in the previous episode. The project runs OSeMOSYS and plots a couple of charts.
To follow along, clone this repository:
$ git clone https://github.com/KTH-dESA/osemosys_workflow.git
$ cd osemosys_workflow
$ git checkout simple
The example project directory listing is:
.
├── README.md
├── data
│   ├── README.md
│   └── simplicity.txt
├── env
│   ├── dag.yaml
│   ├── osemosys.yaml
│   └── plot.yaml
├── model
│   ├── LICENSE
│   └── osemosys.txt
├── processed_data
├── scripts
│   ├── osemosys_run.sh
│   └── plot_results.py
└── snakefile
In this example we wish to:
- solve the OSeMOSYS model with glpsol
- extract two data tables from the SelectedResults.csv file and save them into separate csv files
- plot each csv file and save the chart as a pdf
Ideally, we would go on to automate this so that it scales to many model runs.
Example (for one model run only) - let us test this out:
$ glpsol -d data/simplicity.txt -m model/osemosys.txt -o processed_data/results.sol
$ ???
$ python scripts/plot_results.py processed_data/total_annual_capacity.csv results/total_annual_capacity.pdf
$ ???
$ python scripts/plot_results.py processed_data/tid_demand.csv results/tid_demand.pdf
Can you relate? Are you using similar setups in your research?
This was for one model run - how about 3 model runs? How about 3000 model runs?
Discussion
Discuss the pros and cons of this approach. Is it reproducible? Does it scale to hundreds of model runs? Can it be automated? What if you modify only one data file and do not wish to rerun the pipeline for all datafiles again?
Exercise
What commands are required to replace the ??? placeholders? Think back to the first workshop on the shell…
Let’s express it more compactly with a shell script (Bash). Let’s call it run_analysis.sh:
#!/usr/bin/env bash
# Solve the model with GLPK
glpsol -d data/simplicity.txt -m model/osemosys.txt -o processed_data/results.sol
# Extract the total annual capacity table (lines 298-326 of SelectedResults.csv) and plot it
head -n 326 processed_data/SelectedResults.csv | tail -n 29 > processed_data/total_annual_capacity.csv
python scripts/plot_results.py processed_data/total_annual_capacity.csv results/total_annual_capacity.pdf
# Extract the demand table (lines 12-33) and plot it
head -n 33 processed_data/SelectedResults.csv | tail -n 22 > processed_data/tid_demand.csv
python scripts/plot_results.py processed_data/tid_demand.csv results/tid_demand.pdf
We can run it with:
$ bash run_analysis.sh
This is still imperative style: we tell the script to run these steps in precisely this order.
Exercise/discussion
Discuss the pros and cons of this approach. Is it reproducible? Does it scale to hundreds of model runs? Can it be automated? What if you modify only one data file and do not wish to rerun the pipeline for all datafiles again?
Enter Make. A Makefile contains rules that relate targets to dependencies and commands:
# rule (mind the tab)
target: dependencies
	command(s)
We can think of it as follows:
outputs: inputs
	command(s)
Make uses declarative style: we describe dependencies but we let Make figure out the series of steps to produce results (targets). Fun fact: Excel is also declarative, not imperative.
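As a sketch (not taken from the example repository), the solve step above could be expressed as a Make rule using the same file paths:

# hypothetical Make rule for the solve step (the recipe line must start with a tab)
processed_data/SelectedResults.csv: data/simplicity.txt model/osemosys.txt
	glpsol -d data/simplicity.txt -m model/osemosys.txt -o processed_data/results.sol

Make would then rerun glpsol only when one of the two input files is newer than the results file.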
Switch to the simple branch:
$ git checkout simple
First study a simple Snakefile:
rule all:
    input: "processed_data/SelectedResults.csv"

rule solve:
    input: data="data/simplicity.txt", model="model/osemosys.txt"
    output: results="processed_data/results.sol", default="processed_data/SelectedResults.csv"
    log: "processed_data/glpsol.log"
    shell:
        "glpsol -d {input.data} -m {input.model} -o {output.results} > {log}"

rule clean:
    shell:
        "rm -f processed_data/*.sol processed_data/*.csv processed_data/*.log"
Snakemake uses the declarative style too: we describe the dependencies between files and let Snakemake figure out the steps needed to produce the targets.
Try it out:
$ snakemake clean
$ snakemake
Try running snakemake again, observe that it refuses to rerun all steps, and discuss why:
$ snakemake
Building DAG of jobs...
Nothing to be done.
Make a modification to the data file, run snakemake again, and discuss what you see. One way to modify files is to use the touch command, which will only update the file’s timestamp:
$ touch data/simplicity.txt
$ snakemake
You can try a dry run with the -n flag if you’re not sure what’s going to be built:
$ snakemake clean
$ snakemake -n
Exercise: extending the simple example
Create snakemake rules to extract our two data tables and write each to its own csv file.
Hint:
head -n 326 processed_data/SelectedResults.csv | tail -n 29 > processed_data/total_annual_capacity.csv
head -n 33 processed_data/SelectedResults.csv | tail -n 22 > processed_data/tid_demand.csv
rule extract_tid_demand:
    input: "processed_data/SelectedResults.csv"
    output: "processed_data/tid_demand.csv"
    shell:
        "head -n 33 {input} | tail -n 22 > {output}"
Exercise: plotting the csv files
Create snakemake rules to generate the plots:
Hint:
python scripts/plot_results.py processed_data/total_annual_capacity.csv results/total_annual_capacity.pdf
python scripts/plot_results.py processed_data/tid_demand.csv results/tid_demand.pdf
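A possible solution mirrors the extract rules, with one rule per plot. Note that the rule all shown below expects the pdfs under processed_data/, so we write them there rather than to results/:

rule plot_tid_demand:
    input: "processed_data/tid_demand.csv"
    output: "processed_data/tid_demand.pdf"
    shell:
        "python scripts/plot_results.py {input} {output}"

rule plot_total_annual_capacity:
    input: "processed_data/total_annual_capacity.csv"
    output: "processed_data/total_annual_capacity.pdf"
    shell:
        "python scripts/plot_results.py {input} {output}"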
After creating your new rules, try running snakemake. Why does nothing happen?
The first rule in your Snakefile should list the targets - the final files you want to produce:

RESULTS = ['tid_demand', 'total_annual_capacity']

rule all:
    input: expand("processed_data/{x}.pdf", x=RESULTS)
    message: "Running pipeline to generate the files '{input}'"

Here expand() builds one pdf target per name in RESULTS, so rule all depends on both plots.
Switch to the general branch:
$ git checkout general
The rules we made in the previous exercise were very similar to one another. This is a common coding “smell” and should be a warning to you to refactor your code. With a wildcard, a single rule covers both plots:
rule plot:
    input: "processed_data/{result}.csv"
    output: "processed_data/{result}.pdf"
    message: "Generating plot using '{input}' and writing to '{output}'"
    shell:
        "python scripts/plot_results.py {input} {output}"
If you install snakemake with conda, you can define conda environments per rule.
rule solve:
    input: data="data/simplicity.txt", model="model/osemosys.txt"
    output: results="processed_data/results.sol", default="processed_data/SelectedResults.csv"
    log: "processed_data/glpsol.log"
    conda: "env/osemosys.yaml"
    shell:
        "glpsol -d {input.data} -m {input.model} -o {output.results} > {log}"
The env/osemosys.yaml file:

channels:
  - conda-forge
  - defaults
dependencies:
  - glpk
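The other environment files follow the same pattern. For instance, env/plot.yaml might look like this; the exact package list is an assumption about what scripts/plot_results.py needs:

channels:
  - conda-forge
dependencies:
  - python
  - pandas      # assumed: reads the csv files
  - matplotlib  # assumed: draws the charts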
Then, run snakemake with the --use-conda flag:
$ snakemake --use-conda --cores 2
Building DAG of jobs...
Using shell: /usr/local/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1	extract_tid_demand
	1	extract_total_annual_capacity
	2	plot
	1	solve
	6
[Wed Sep 11 22:53:50 2019]
rule solve:
input: data/simplicity.txt, model/osemosys.txt
output: processed_data/results.sol, processed_data/SelectedResults.csv
log: processed_data/glpsol.log
jobid: 5
Activating conda environment: /Users/wusher/repository/osemosys_snakemake/.snakemake/conda/6fbfeaaf
[Wed Sep 11 22:54:09 2019]
Finished job 5.
1 of 6 steps (17%) done
Switch to the model_runs branch:
$ git checkout model_runs
We need to wrap OSeMOSYS in a little shell script to allow it to run in parallel. This allows us to change the destination of the results file which is written out:
#!/usr/bin/env bash
# Name of this model run, passed as the first argument
MODELRUN=$1
# Slashes are escaped so the path can be used in the sed pattern below
RESULTS="processed_data\/$MODELRUN\/SelectedResults.csv"
mkdir -p processed_data/$MODELRUN
# Copy the model file and point its results path at this run's folder
cat model/osemosys.txt > processed_data/$MODELRUN/osemosys.txt
# The empty '' after -i is BSD (macOS) sed syntax; on GNU/Linux sed, drop it
sed -i '' "s/FILEPATH/$RESULTS/g" processed_data/$MODELRUN/osemosys.txt
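The Snakefile can then call this wrapper once per model run through a wildcard. A minimal sketch, assuming the wrapper is saved as scripts/osemosys_run.sh; the run names and rule names here are illustrative, not taken from the repository:

MODELRUNS = ["run_0", "run_1", "run_2"]  # hypothetical model run names

rule all_model_files:
    input: expand("processed_data/{model_run}/osemosys.txt", model_run=MODELRUNS)

rule create_model_file:
    input: "model/osemosys.txt"
    output: "processed_data/{model_run}/osemosys.txt"
    shell:
        "bash scripts/osemosys_run.sh {wildcards.model_run}"

Because each run writes into its own folder, Snakemake can execute the runs in parallel.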
You can choose as many cores as your laptop will cope with:
$ snakemake --use-conda --cores 8
If you’re using the OSeMOSYS example, try running snakemake plot_dag.
We can visualize the directed acyclic graph (DAG) of our current Snakefile using the --dag option, which outputs the DAG in the dot language (a plain-text format for describing graphs, used by the Graphviz software, which can be installed with conda install graphviz):
$ snakemake --dag | dot -Tpng > dag.png
Rules that have yet to be completed are indicated with solid outlines, while already completed rules are indicated with dashed outlines.
Exercise/discussion
Discuss the pros and cons of this approach. Is it reproducible? Does it scale to hundreds of model runs? Can it be automated?
Snakemake has a few more features worth knowing about. You can archive a workflow, including its input files and software environments:
$ snakemake --archive my-workflow.tar.gz
There is even a browser-based graphical user interface:
$ snakemake --gui
To run each rule in its own isolated software environment, invoke snakemake --use-conda. Example:
rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    script:
        "scripts/plot-stuff.R"
Snakemake can run jobs in parallel while respecting declared resources such as GPUs:
$ snakemake clean
$ snakemake -j 4 --resources gpu=1
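For --resources to have any effect, a rule must declare what it consumes. A hypothetical sketch (the rule name and command are illustrative):

rule heavy_step:
    input: "table.txt"
    output: "model.bin"
    resources: gpu=1    # with --resources gpu=1, such jobs run one at a time
    shell:
        "train_model table.txt model.bin"   # hypothetical command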
A workflow can be moved to a cluster by archiving it, copying the archive across, and unpacking it there:
$ snakemake --archive myworkflow.tar.gz
$ scp myworkflow.tar.gz <some-cluster>
$ ssh <some-cluster>
$ tar zxf myworkflow.tar.gz
$ cd myworkflow
$ snakemake -n --use-conda
Cluster resources can be specified per rule in a cluster configuration file, for example cluster.json:

{
    "__default__":
    {
        "account": "a_slurm_submission_account",
        "mem": "1G",
        "time": "0:5:0"
    },
    "count_words":
    {
        "time": "0:10:0",
        "mem": "2G"
    }
}
The workflow can now be executed by:
$ snakemake -j 100 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} --mem={cluster.mem} -t {cluster.time} -c {threads}"
Note that in this case -j does not correspond to the number of cores used; instead it represents the maximum number of jobs that Snakemake is allowed to have submitted at the same time.
The --cluster-config flag specifies the config file for the particular cluster, and the --cluster flag specifies the command used to submit jobs on that cluster.
Snakemake can also run rules inside containers, invoked with snakemake --use-singularity. Example:
rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    singularity:
        "docker://joseespinosa/docker-r-ggplot2"
    script:
        "scripts/plot-stuff.R"
Preserve the steps for re-generating published results.
Hundreds of workflow management tools exist.
Make and Snakemake are comparatively simple and lightweight options for creating transferable and scalable data analyses.
Sometimes a script is enough.