Reproducible research: Recording dependencies

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • How can we communicate different versions of software dependencies?

Dependencies

Our codes often depend on other codes that in turn depend on other codes …

  • Reproducibility: We can control our code but how can we control dependencies?
  • 10-year challenge: Try to build/run your own code that you created 10 (or fewer) years ago. Will your code from today work in 5 years if you don’t change it?
  • Dependency hell: Different codes in the same environment can have conflicting dependencies.

Conda, Anaconda, pip, Virtualenv, Pipenv, pyenv, Poetry, requirements.txt …

These tools try to solve the following problems:

  • Installing a specific set of dependencies, possibly with well-defined versions
  • Recording the versions of all dependencies
  • Isolating environments on your computer for projects that have conflicting dependencies
  • Isolating environments on computers with many users
  • Using different Python versions per project
  • Providing tools and services to share packages

Exercise/discussion

Compare these four approaches to recording dependencies:

A:

Code depends on a number of packages but there is no requirements.txt file or equivalent.

B:

scipy
numpy
sympy
click
git+https://github.com/someuser/someproject.git@master
git+https://github.com/anotheruser/anotherproject.git@master

C:

scipy==1.3.1
numpy==1.16.4
sympy==1.4
click==7.0
git+https://github.com/someuser/someproject.git@d7b2c7e
git+https://github.com/anotheruser/anotherproject.git@sometag

D:

scipy==1.3.1
numpy==1.16.4
sympy==1.4
click==7.0
someproject==1.2.3
anotherproject==2.3.4

Pip and PyPI

  • PyPI is the Python Package Index.
  • It is the standard place to share Python packages.
  • Mixed-language packages are also possible if they are wrapped in a Python layer.
  • Install a package:
    $ pip install somepackage
    
  • Install a specific version:
    $ pip install somepackage==1.2.3
    
  • Freeze the current environment into requirements.txt:
    $ pip freeze > requirements.txt
    
  • Install all dependencies listed in requirements.txt:
    $ pip install -r requirements.txt
    
  • Creating and sharing your own package: https://packaging.python.org/tutorials/packaging-projects/
  • It is possible to pip install from GitHub or other places:
    $ pip install git+https://github.com/anotheruser/anotherproject.git@sometag
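
To make the idea of "freezing" more concrete, here is a minimal sketch of how the installed versions could be listed from within Python using the standard library module importlib.metadata (Python 3.8 or later). The function name freeze_lines is hypothetical, and this is not how pip itself is implemented: pip freeze also handles editable and version-control installs, which this sketch ignores.

```python
from importlib.metadata import distributions


def freeze_lines():
    """Return 'name==version' pins for every distribution this interpreter can see."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installations with missing or broken metadata
            pins.add(f"{name}=={dist.version}")
    # Sort case-insensitively so the output is stable between runs.
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(freeze_lines()))
```

The output depends entirely on the environment in which it runs, which is why such a record is most useful when taken inside an isolated environment.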
    

Conda, Anaconda, and Miniconda

  • Not only for Python: packages can contain code in any language, as well as binaries.
  • Created by Continuum Analytics (now Anaconda, Inc.); part of Anaconda and Miniconda, but can also be installed standalone.
  • Open source under a BSD license.
  • Manages isolated software environments.
  • Allows you to create and share conda packages.
  • Miniconda is a lightweight alternative to Anaconda.
  • Install a package:
    $ conda install somepackage
    
  • Install a specific version:
    $ conda install somepackage=1.2.3
    
  • Create a new environment:
    $ conda create --name myenvironment
    
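  • Create a new environment from requirements.txt (conda reads the file as a list of conda package specifications, for example one written by conda list --export):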
    $ conda create --name myenvironment --file requirements.txt
    
  • Activate a specific environment:
    $ conda activate myenvironment
    
  • Deactivate current environment:
    $ conda deactivate
    
  • List all environments:
    $ conda info -e
    
  • Freeze the current environment into requirements.txt:
    $ conda list --export > requirements.txt
    
  • Freeze the current environment into environment.yml:
    $ conda env export > environment.yml
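
The file written by conda list --export uses name=version=build lines with a comment header, which differs from the name==version lines that pip freeze writes. The following sketch shows one way to read the conda format and print pip-style pins for comparison. The function names and the build strings in the sample are made up for illustration, and conda package names do not always match the names used on PyPI, so the result should be treated as a starting point rather than a drop-in requirements.txt.

```python
def parse_conda_export(text):
    """Map package name -> version from 'conda list --export' style text."""
    versions = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        parts = line.split("=")
        if len(parts) >= 2:
            # parts[0] is the name, parts[1] the version, parts[2] (if any) the build
            versions[parts[0]] = parts[1]
    return versions


def to_pip_pins(versions):
    """Format a name -> version mapping as sorted 'name==version' lines."""
    return [f"{name}=={version}" for name, version in sorted(versions.items())]


# Illustrative sample; the build strings are invented for this example.
sample = """\
# platform: linux-64
numpy=1.16.4=py37h7e9f1db_0
scipy=1.3.1=py37h7c811a0_0
"""

if __name__ == "__main__":
    print(to_pip_pins(parse_conda_export(sample)))
```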
    

Using conda to share a package

Conda packages can be built from a recipe and shared on anaconda.org via your own private or public channel, or via conda-forge.

  • conda-forge is a GitHub organization containing repositories of conda recipes.
  • Has become the de facto standard channel for packages.
  • Several continuous integration providers ensure that each repository (“feedstock”) automatically builds its own recipe on Windows, Linux and OSX.

A step-by-step guide on how to contribute packages can be found in the conda-forge documentation.

To get an idea of what’s needed, let’s have a look at the boost feedstock (a set of C++ libraries). We see that:

  • Every commit is tested on every platform.
  • There’s a list of maintainers.
  • There’s a meta.yaml file under the recipe/ directory, along with (optional) build.sh and bld.bat files for building non-Python code on OSX/Linux and Windows platforms.

Virtualenv

  • Isolated Python environments: https://docs.python-guide.org/dev/virtualenvs/
  • Create a new environment:
    $ virtualenv myenvironment
    
  • Create a new environment with a specific Python version:
    $ virtualenv --python=python3 myenvironment
    
  • Create a new environment in a path outside of current directory:
    $ virtualenv /path/to/myenvironment
    
  • Activate a specific environment (in Bash):
    $ source myenvironment/bin/activate
    
  • Install into the current environment:
    $ pip install somepackage
    
  • Deactivate current environment:
    $ deactivate
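
An isolated environment can also be created from within Python. The sketch below uses the standard library module venv, which is a different tool from the virtualenv command shown above but produces a similar directory layout. The helper names are hypothetical. The check in in_virtual_environment relies on sys.prefix differing from sys.base_prefix, which holds for environments created by venv; older virtualenv releases may behave differently.

```python
import sys
import tempfile
import venv
from pathlib import Path


def create_environment(path):
    """Create a bare environment (without bootstrapping pip) and return its config file path."""
    venv.create(path, with_pip=False)
    return Path(path) / "pyvenv.cfg"


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    # A temporary directory is used here so that the example cleans up after itself.
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_environment(Path(tmp) / "myenvironment")
        print("config file written:", cfg.exists())
        print("running inside an environment:", in_virtual_environment())
```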
    

Pipenv

  • https://pipenv.readthedocs.io
  • Alternative to virtualenv: you can activate and install in one step
  • Tool to easily manage per-project/per-directory Python packages
  • One dependency file and one lock file: Pipfile and Pipfile.lock
  • Records checksums of dependencies
  • Easier than virtualenv to separate dependencies for development and usage
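
To illustrate what recording checksums means, here is a minimal sketch that computes a SHA-256 digest of a downloaded file and compares it with a recorded value written in the sha256:<hex digest> form. The function names and the file name in the usage note are hypothetical, and this is not Pipenv's own implementation; it only shows the kind of check a lock file makes possible.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 digest of a file as 'sha256:<hex digest>'."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def matches_recorded_hash(path, recorded):
    """Return True when the file's digest equals the recorded checksum."""
    return sha256_of_file(path) == recorded
```

A call such as matches_recorded_hash("somepackage-1.2.3.tar.gz", recorded) would return False if the file differed by even a single byte from the one whose checksum was recorded.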

Poetry

  • https://poetry.eustace.io
  • Alternative to virtualenv and Pipenv
  • [If you use this tool, please send a pull request with your experiences]

Pyenv

Key points

  • Capturing software dependencies is a must for reproducibility.

  • Files like requirements.txt, environment.yml, Pipfile, … should be part of the source repository.

  • Be skeptical when you see dependency lists without versions.
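
As a small aid for the last point, the following sketch flags lines in a requirements.txt that do not fix an exact version or commit. The function name and the two patterns are assumptions made for this example and cover only the simple forms used in the exercise above: name==version lines and git+ URLs ending in a hexadecimal commit hash. Real requirements files allow many more forms, and a tag that happens to look like a hash would be misclassified.

```python
import re

# name==version, with nothing else on the line
PINNED = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9._+!-]+$")
# git+<url>@<7 to 40 hexadecimal characters>, i.e. something that looks like a commit hash
VCS_COMMIT = re.compile(r"^git\+\S+@[0-9a-f]{7,40}$")


def unpinned_requirements(text):
    """Return the requirement lines that do not fix an exact version or commit."""
    loose = []
    for raw in text.splitlines():
        # Drop comments: a '#' at the start of the line or preceded by whitespace.
        line = re.sub(r"(^|\s+)#.*$", "", raw).strip()
        if not line:
            continue
        if PINNED.match(line) or VCS_COMMIT.match(line):
            continue
        loose.append(line)
    return loose


if __name__ == "__main__":
    example_c = """\
scipy==1.3.1
numpy==1.16.4
sympy==1.4
click==7.0
git+https://github.com/someuser/someproject.git@d7b2c7e
git+https://github.com/anotheruser/anotherproject.git@sometag
"""
    for line in unpinned_requirements(example_c):
        print("not pinned:", line)
```

Applied to list C from the exercise, this sketch reports only the line ending in @sometag, since a tag can be moved to another commit while a commit hash cannot.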