Pick a Python Lockfile and Improve Security
The Zen of Python by Tim Peters is a great reminder of what it means to be “Pythonic,” that is, to write idiomatic and beautiful code. If you haven’t seen it before — or maybe you just need a refresher — type import this into a Python interpreter and you’ll get the full text. This is a favorite line:
There should be one-- and preferably only one --obvious way to do it.
This is a reminder within the reminder of what it means to be Pythonic. Plus, it has a fun Easter egg tucked away in plain sight: the postfixed dashes for one-- and the prefixed ones for --obvious. Legend has it that this is a dig at languages like C, C++, Java, and JavaScript, where subtle bugs are introduced by those uninitiated in the nuances of increment/decrement operators. This is likely intentional, as the last line of the Zen of Python uses dashes in the more conventional manner:
Namespaces are one honking great idea -- let's do more of those!
Python is a great language. It is popular for having an easy learning curve and being both natural to read and write. It has “batteries included,” with a large standard library. It offers massive additional functionality in the form of third party packages that can be used as dependencies. But for all its advantages, Python has a problem: there is not one obvious way to specify dependencies. Nor is there a standard for representing those dependencies in a lockfile format.
Background
Other languages have made package management explicit. For example, Rust has its native Cargo package manager for working with Cargo.toml dependency manifests and Cargo.lock dependency lockfiles. There are no other officially supported tools, and even the file names are fixed.
The story is not so simple for Python. A standard was proposed for a Python lockfile format, PEP 665, but it was rejected due to “lukewarm reception from the community from the lack of source distribution support.” This is a shame, since malicious packages published to PyPI tend to rely on running code during package installation (e.g., via the setup.py file).
Phylum has identified and helped to remove a number of these malicious packages. Whether it’s crypto stealers or W4SP stealer malware, more of these supply chain attack attempts would be neutered by restricting lockfile-based installations to wheels. True, attack vectors also include running malicious code during package import (e.g., via __init__.py files) and package execution (e.g., cloning a valid project and modifying its existing modules or adding new code to them). Still, the adoption of PEP 665 would have helped to prevent a specific and common attack vector.
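As a rough stopgap for what PEP 665 aimed to standardize, pip can already be told to refuse source distributions at install time. This is only a sketch, and the requirements file name is a placeholder:
# Refuse sdists (and the setup.py execution they imply) for all packages
❯ python -m pip install --only-binary=:all: -r requirements.txt
# Separately, pip can verify artifact hashes recorded in the requirements file
❯ python -m pip install --require-hashes -r requirements.txt
Neither flag is a full substitute for a standardized lockfile format, but both narrow the window for install-time attacks.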
Python Lockfile Landscape
Lockfiles are important because they allow for repeatable and deterministic installations. This is most beneficial for applications living at the end of the dependency chain. It can also be useful for internal development and testing of libraries so that other issues can be isolated and reproduced.
There is disagreement about the use of lockfiles for libraries, and this post is not meant to settle that debate. A great case for why lockfiles should be committed on all projects was made by the folks who built the Yarn package manager for the npm ecosystem, and the same advice applies to Python projects.
What is the current state of Python lockfiles? Not great. The ecosystem suffers from an abundance of choice, making it hard for developers to know where to start. Let's take a look at the current Python lockfile landscape.
Pip
Most Python developers will start with pip since it is core to the ecosystem, even though it is not part of the standard library. Pip makes use of so-called requirements files, which are simple text files containing a list of items to be installed using pip install. The file name is not fixed, but requirements.txt is the traditional name. Whatever the name, these files should contain only strict requirements to qualify as a lockfile. Here is an example of a requirements file with both loose and strict entries:
❯ cat requirements-mixed.txt
# loose dependencies
cryptography
wheel>0.38.0
requests>=2.25.0,<3.0
# strict dependencies
pyyaml==6.0
packaging==21.3
It is okay to have loose requirements sometimes (e.g., for libraries). However, if a concrete environment is going to be created from these requirements, it makes more sense to generate a lockfile from them, with only strict, or pinned, dependencies, in order to support the goals of reproducibility and security. Rather than pinning everything by hand, there are common tools for transforming a loose or mixed requirements file into a lockfile containing only strict, pinned dependencies.
Pip itself can be used with the pip freeze command:
# `pip freeze` works on installed packages in
# the current environment...so create that first
~/dev/phylum
❯ python -m venv .venv
# Activate the environment
~/dev/phylum took 2s
❯ source .venv/bin/activate
# Install the requirements in the environment
~/dev/phylum via 🐍 v3.11.0 (.venv)
❯ python -m pip install -r requirements-mixed.txt
# ---OUTPUT-TRIMMED---
# Check to see what was installed in the environment
~/dev/phylum via 🐍 v3.11.0 (.venv)
❯ python -m pip list
Package Version
------------------ ---------
certifi 2022.9.24
cffi 1.15.1
charset-normalizer 2.1.1
cryptography 38.0.3
idna 3.4
packaging 21.3
pip 22.3
pycparser 2.21
pyparsing 3.0.9
PyYAML 6.0
requests 2.28.1
setuptools 65.5.0
urllib3 1.26.12
wheel 0.38.4
# Freeze the packages from the environment
~/dev/phylum 13 2 via 🐍 v3.11.0 (.venv)
❯ python -m pip freeze --all > requirements-strict.txt
# Check the contents of the new requirements file
~/dev/phylum 13 3 via 🐍 v3.11.0 (.venv)
❯ cat requirements-strict.txt
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
cryptography==38.0.3
idna==3.4
packaging==21.3
pip==22.3
pycparser==2.21
pyparsing==3.0.9
PyYAML==6.0
requests==2.28.1
setuptools==65.5.0
urllib3==1.26.12
wheel==0.38.4
# The contents match the environment. We have a lockfile!
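To close the loop, the strict requirements file can be used to recreate the same set of packages elsewhere. A minimal sketch, assuming a fresh virtual environment:
# In a fresh environment, install exactly what the lockfile specifies
❯ python -m pip install --no-deps -r requirements-strict.txt
The --no-deps flag is optional here; since every transitive dependency is already pinned in the file, it simply prevents the resolver from pulling in anything that isn't listed.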
The pip-tools suite can also be used to create a lockfile, this time by “compiling” the same loose requirements file directly rather than freezing an installed environment.
# Install pip-tools in the same environment
~/dev/phylum via 🐍 v3.11.0 (.venv)
❯ python -m pip install pip-tools
# ---OUTPUT-TRIMMED---
# "Compile" a strict requirements file from a loose one
~/dev/phylum via 🐍 v3.11.0 (.venv)
❯ pip-compile -o requirements.txt requirements-mixed.txt
# Check the contents of the new requirements file
~/dev/phylum via 🐍 v3.11.0 (.venv)
❯ cat requirements.txt
#
# This file is autogenerated by pip-compile with python 3.11
# To update, run:
#
# pip-compile --output-file=requirements.txt requirements-mixed.txt
#
certifi==2022.9.24
# via
# -r requirements-mixed.txt
# requests
cffi==1.15.1
# via
# -r requirements-mixed.txt
# cryptography
charset-normalizer==2.1.1
# via
# -r requirements-mixed.txt
# requests
cryptography==38.0.3
# via -r requirements-mixed.txt
idna==3.4
# via
# -r requirements-mixed.txt
# requests
packaging==21.3
# via -r requirements-mixed.txt
pycparser==2.21
# via
# -r requirements-mixed.txt
# cffi
pyparsing==3.0.9
# via
# -r requirements-mixed.txt
# packaging
pyyaml==6.0
# via -r requirements-mixed.txt
requests==2.28.1
# via -r requirements-mixed.txt
urllib3==1.26.12
# via
# -r requirements-mixed.txt
# requests
wheel==0.38.4
# via -r requirements-mixed.txt
# Success...we have another lockfile!
Pip-tools is nice because not only can it work with plain text requirements files, it can also “compile” input dependencies specified in pyproject.toml, setup.py, and setup.cfg files.
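For example, the same kind of lockfile can be compiled from a pyproject.toml, and artifact hashes can be added for extra integrity checking. A sketch, with the file names as placeholders:
# Compile a lockfile from the dependencies declared in pyproject.toml
❯ pip-compile -o requirements.txt pyproject.toml
# Or include hashes so pip can verify each downloaded artifact
❯ pip-compile --generate-hashes -o requirements.txt pyproject.toml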
Poetry
The Poetry project “helps you declare, manage and install dependencies of Python projects, ensuring you have the right stack everywhere.” It can also build and publish projects. Poetry uses the modern pyproject.toml file (see PEP 518, PEP 621, and the canonical spec) for configuration and abstract dependency specification. The Poetry dependency resolver is advanced and thorough, always finding a solution when one exists and providing a detailed explanation when it doesn’t. It can even be used to solve sudoku puzzles! Poetry generates a lockfile named poetry.lock that works in concert with the Poetry-managed environment to manage projects in a deterministic way.
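A minimal sketch of the typical Poetry workflow; the requests dependency is just an example:
# Create pyproject.toml interactively
❯ poetry init
# Add a dependency; this updates pyproject.toml and poetry.lock
❯ poetry add requests
# Re-resolve and rewrite poetry.lock after editing pyproject.toml by hand
❯ poetry lock
# Install exactly what poetry.lock specifies into the managed environment
❯ poetry install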
Pipenv
The Pipenv project “aims to bring the best of all packaging worlds (bundler, composer, npm, cargo, yarn, etc.) to the Python world.” It combines pip and virtual environments in a single command so that projects can create and manage their environments. More importantly, it handles package management. It does this with a dependency manifest file, Pipfile, and a Pipfile.lock lockfile. Pipfile is where top-level dependencies are specified in their loose form; Pipfile.lock is the completely resolved collection of those abstract declarations.
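A minimal sketch of the equivalent Pipenv workflow, again with a placeholder dependency:
# Add a dependency; this creates or updates Pipfile and Pipfile.lock
❯ pipenv install requests
# Regenerate Pipfile.lock from Pipfile without installing anything
❯ pipenv lock
# Install exactly what Pipfile.lock specifies
❯ pipenv sync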
PDM
The PDM project “is a modern Python package and dependency manager supporting the latest PEP standards.” Like Poetry and Pipenv, PDM creates and manages environments. It is unique in allowing a choice of virtual environment backend (virtualenv, venv, or conda). It goes further by supporting the PEP 582 __pypackages__ directory for direct dependency access, bypassing the need for virtual environments entirely. PDM is PEP 621 compliant, with project and dependency metadata written to pyproject.toml. It resolves those dependencies into a lockfile named pdm.lock, which is meant to be tracked in source control to ensure all installers use the same versions of dependencies.
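A minimal sketch of the PDM workflow; the dependency name is a placeholder:
# Create a PEP 621 compliant pyproject.toml
❯ pdm init
# Add a dependency; this updates pyproject.toml and pdm.lock
❯ pdm add requests
# Re-resolve and rewrite pdm.lock after manual edits to pyproject.toml
❯ pdm lock
# Install exactly what pdm.lock specifies
❯ pdm sync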
Conda
The Conda project almost didn’t make this list because it is not exclusive to Python and does not rely on PyPI as its package registry. However, it is so popular in the data science world, and that world is so dominated by Python, that it deserves at least a mention here.
Conda takes reproducibility to the extreme. Conda packages exist for everything and are not limited to Python. They can include libraries and executables from just about anywhere. This makes it possible to pin the entire environment, from the Python interpreter onwards. Conda packages can come from multiple sources, called "channels." Conda-Forge and Anaconda are two of the most popular channel providers.
Specifying dependencies for Conda can be done with an environment.yml file. Like pip’s requirements files, environment.yml may contain both loose and strict dependencies. There is a third party tool called conda-lock that appears to be the most widely used option for creating lockfiles for use with Conda. If used, the name of the lockfile will either be conda-lock.yml or end in that as an extension (e.g., somefile.conda-lock.yml).
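A minimal sketch of how conda-lock might be used; the environment file contents and target platform are placeholders, and flags may vary between conda-lock versions:
❯ cat environment.yml
name: example
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
# Resolve the environment into a lockfile for a specific platform
❯ conda-lock -f environment.yml -p linux-64
# Create an environment from the resulting conda-lock.yml
❯ conda-lock install -n example conda-lock.yml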
Other authors have provided better and more exhaustive explanations of the differences between pip and Conda. Pythonspeed and Anaconda’s blog are two excellent resources.
Lockfiles for Security
With all the choices available, it can be hard for a Python developer to decide which lockfile should be used. This post is not stating which option is best, only that it is better to use a lockfile — any lockfile — than to go without. PEP 665 outlined four main motivations for why applications want reproducible installs with a lockfile, with the third one being security:
Three, reproducibility is more secure. When you control exactly what files are installed, you can make sure no malicious actor is attempting to slip nefarious code into your application (i.e. some supply chain attacks). By using a lock file which always leads to reproducible installs, we can avoid certain risks entirely.
Developers are the new high-value targets for attackers infecting the software supply chain. Some of the vectors used include, but are certainly not limited to:
- Dependency confusion
- Typosquatting
- Author compromise
- Expired author or maintainer domain takeovers
- Repo Jacking
- Exploiting software vulnerabilities
Phylum offers a solution for analyzing dependencies in lockfiles against five risk domains: malicious code, software vulnerabilities, engineering, author, and license. See The Phylum Risk Framework and documentation for more detail. Phylum continuously ingests and processes new packages submitted to supported package registries, including PyPI, to provide risk scores for each of these five domains. Using a lockfile is good. Using a lockfile only after confirming its dependencies meet your configured risk thresholds is better.
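As a rough sketch of what that looks like in practice with the Phylum CLI, assuming it is installed and a Phylum project has already been created for the repository (exact commands may differ between CLI versions; consult the documentation):
# Submit the lockfile for analysis; the result reflects whether the
# project's configured risk thresholds are met
❯ phylum analyze requirements.txt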
Summary
Python offers many ways to specify dependencies. Loose, uncapped dependency specification is okay for libraries. Applications, as well as library developers working in their local, testing, and CI environments, should use strict, pinned dependencies in the form of lockfiles for reproducibility and security. Analyzing lockfiles for risk is a great way to protect developers and secure the software supply chain.
At the time of this writing, Phylum offers Python support for lockfiles consumed by pip, Poetry, and Pipenv. A free community edition is available for everyone to automate software supply chain security: block new risks, prioritize existing issues, and only use open source code that you trust. We hope you’ll give it a try!