Adding Spurious Wheels to PyPI

Finding malicious code in packages is like finding a needle in a collection of haystacks
Photo by Leshaesvan / Unsplash
This is part of a series of posts examining the methods by which malicious Python code gains execution.

This technique is more about avoiding detection by hiding in plain sight and leveraging other techniques already discussed to gain execution. Think of it as reducing the signal-to-noise ratio for the good guys looking to root out malware.

Monitoring for new package publications (i.e., new versions) on the Python Package Index (PyPI) is common. Static analysis of source files to find malicious behavior is also common, but usually only for the files present at the time of publication. Less common is continuous monitoring of existing package publications for artifacts added later.

Most specific artifact wins

It is possible to create a benign package and upload it to PyPI such that the source distribution and all wheels would pass inspection. Then, a malicious wheel could be custom-built and uploaded separately. Package installers will select the wheel most specific to the installation environment. It could be a platform wheel, tailored for a specific target (e.g., black-24.4.0-cp312-cp312-macosx_11_0_arm64.whl) to limit exposure and infect only the system type matching that of the desired victim. The camouflaged wheel could differ from the benign artifacts in only a minor way, such as the addition of a single malicious dependency entry.
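
The selection behavior described above can be sketched in a few lines. This is a simplified illustration, not pip's actual implementation: real installers build a long, ordered list of supported tags for the running interpreter and prefer the wheel matching the earliest one. The `SUPPORTED` list and the `parse_wheel_tags` and `pick_wheel` helpers here are hypothetical names invented for this example.

```python
def parse_wheel_tags(filename):
    """Expand a wheel filename into its set of (python, abi, platform) tags."""
    stem = filename[: -len(".whl")]
    # name-version[-build]-python-abi-platform: the last three fields are tags
    py_tags, abi_tags, plat_tags = stem.split("-")[-3:]
    return {
        (py, abi, plat)
        for py in py_tags.split(".")
        for abi in abi_tags.split(".")
        for plat in plat_tags.split(".")
    }


def pick_wheel(filenames, supported_tags):
    """Return the wheel whose best tag appears earliest in supported_tags."""
    rank = {tag: i for i, tag in enumerate(supported_tags)}
    best, best_rank = None, len(supported_tags)
    for filename in filenames:
        ranks = [rank[t] for t in parse_wheel_tags(filename) if t in rank]
        if ranks and min(ranks) < best_rank:
            best, best_rank = filename, min(ranks)
    return best


# Hypothetical, heavily shortened preference list for CPython 3.12 on Apple
# silicon; most specific tag first, universal pure-Python tag last.
SUPPORTED = [
    ("cp312", "cp312", "macosx_11_0_arm64"),
    ("py3", "none", "any"),
]

published = ["black-24.4.0-py3-none-any.whl"]
print(pick_wheel(published, SUPPORTED))  # the universal wheel is the only match

# A more specific wheel uploaded later silently wins for matching targets.
published.append("black-24.4.0-cp312-cp312-macosx_11_0_arm64.whl")
print(pick_wheel(published, SUPPORTED))
```

Nothing about the universal wheel changed between the two calls. Only the set of published files did, which is why an inspection performed at release time says little about what a later install will fetch.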

Examining release timelines

Warehouse, the web application that implements the canonical Python package index, offers RSS feeds to get the newest packages, latest updates, and project releases. It also provides legacy XML-RPC methods that can be used to find all the artifacts added to a given release. Here is an example of what a common package, black, looks like when it publishes a new version to PyPI:

>>> import xmlrpc.client
>>> from datetime import datetime as dt
>>> client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
>>> changelog_last_serial = client.changelog_last_serial()
>>> changelog_last_serial
22745974
>>> # Go back to an arbitrarily older serial id
>>> older_serial = 22740000
>>> recent_changes = client.changelog_since_serial(older_serial)
>>> black_changes = [change for change in recent_changes if change[0] == "black"]
>>> for change in black_changes:
...     print(change)
...
['black', '24.4.0', 1712952858, 'new release', 22744937]
['black', '24.4.0', 1712952858, 'add py3 file black-24.4.0-py3-none-any.whl', 22744941]
['black', '24.4.0', 1712952864, 'add source file black-24.4.0.tar.gz', 22744942]
['black', '24.4.0', 1712952997, 'add cp38 file black-24.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', 22745004]
['black', '24.4.0', 1712953022, 'add cp312 file black-24.4.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', 22745015]
['black', '24.4.0', 1712953024, 'add cp310 file black-24.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', 22745018]
['black', '24.4.0', 1712953034, 'add cp311 file black-24.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', 22745024]
['black', '24.4.0', 1712953037, 'add cp39 file black-24.4.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', 22745029]
['black', '24.4.0', 1712953064, 'add cp310 file black-24.4.0-cp310-cp310-win_amd64.whl', 22745039]
['black', '24.4.0', 1712953074, 'add cp311 file black-24.4.0-cp311-cp311-win_amd64.whl', 22745049]
['black', '24.4.0', 1712953084, 'add cp39 file black-24.4.0-cp39-cp39-win_amd64.whl', 22745058]
['black', '24.4.0', 1712953085, 'add cp38 file black-24.4.0-cp38-cp38-win_amd64.whl', 22745060]
['black', '24.4.0', 1712953096, 'add cp312 file black-24.4.0-cp312-cp312-win_amd64.whl', 22745065]
['black', '24.4.0', 1712953529, 'add cp310 file black-24.4.0-cp310-cp310-macosx_11_0_arm64.whl', 22745137]
['black', '24.4.0', 1712953658, 'add cp310 file black-24.4.0-cp310-cp310-macosx_10_9_x86_64.whl', 22745160]
['black', '24.4.0', 1712953677, 'add cp311 file black-24.4.0-cp311-cp311-macosx_11_0_arm64.whl', 22745161]
['black', '24.4.0', 1712953877, 'add cp38 file black-24.4.0-cp38-cp38-macosx_11_0_arm64.whl', 22745226]
['black', '24.4.0', 1712953884, 'add cp311 file black-24.4.0-cp311-cp311-macosx_10_9_x86_64.whl', 22745231]
['black', '24.4.0', 1712953949, 'add cp312 file black-24.4.0-cp312-cp312-macosx_10_9_x86_64.whl', 22745260]
['black', '24.4.0', 1712954026, 'add cp38 file black-24.4.0-cp38-cp38-macosx_10_9_x86_64.whl', 22745270]
['black', '24.4.0', 1712954034, 'add cp39 file black-24.4.0-cp39-cp39-macosx_11_0_arm64.whl', 22745271]
['black', '24.4.0', 1712954217, 'add cp39 file black-24.4.0-cp39-cp39-macosx_10_9_x86_64.whl', 22745292]
['black', '24.4.0', 1712955135, 'add cp312 file black-24.4.0-cp312-cp312-macosx_11_0_arm64.whl', 22745530]
>>> latest_timestamp = black_changes[-1][2]
>>> earliest_timestamp = black_changes[0][2]
>>> black_artifact_release_window = dt.fromtimestamp(latest_timestamp) - dt.fromtimestamp(earliest_timestamp)
>>> print(black_artifact_release_window)
0:37:57

Python code to find artifacts added to a specific release of black on PyPI

That is one source distribution and 21 built distributions, released over a span of almost 38 minutes! That is a lot of files to analyze for a single release. The black project is known to be good, but imagine a different project with even more platform wheels added, spread out over a longer period of time, perhaps even months later:

Screenshot of the `typed-ast` package v1.4.1 artifacts showing a wide upload date range
Release artifacts uploaded months after the initial release
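
One way to surface late additions like these is to compare each file's timestamp against the "new release" entry for the same version. The sketch below assumes changelog rows shaped like the XML-RPC output shown earlier (`[name, version, timestamp, action, serial]`). The `late_artifacts` helper and its `max_delay` threshold are hypothetical, and a sensible threshold will vary from project to project.

```python
from datetime import timedelta


def late_artifacts(changes, max_delay=timedelta(hours=1)):
    """Yield (filename, delay) for files added more than max_delay after release."""
    release_times = {}
    for name, version, timestamp, action, _serial in changes:
        if action == "new release":
            release_times[(name, version)] = timestamp
    for name, version, timestamp, action, _serial in changes:
        # File additions look like "add <pyversion> file <filename>"
        if not action.startswith("add ") or " file " not in action:
            continue
        released = release_times.get((name, version))
        if released is None:
            # The release entry is older than the fetched window. A real
            # monitor should look it up, since this is the suspicious case.
            continue
        delay = timedelta(seconds=timestamp - released)
        if delay > max_delay:
            yield action.split(" file ", 1)[1], delay


# Three rows copied from the black 24.4.0 changelog output above
sample = [
    ["black", "24.4.0", 1712952858, "new release", 22744937],
    ["black", "24.4.0", 1712952858, "add py3 file black-24.4.0-py3-none-any.whl", 22744941],
    ["black", "24.4.0", 1712955135, "add cp312 file black-24.4.0-cp312-cp312-macosx_11_0_arm64.whl", 22745530],
]

for filename, delay in late_artifacts(sample, max_delay=timedelta(minutes=30)):
    print(filename, delay)
# black-24.4.0-cp312-cp312-macosx_11_0_arm64.whl 0:37:57
```

With a 30-minute threshold, the last black wheel is flagged even though it is legitimate, which illustrates how noisy this signal can be for projects with slow build pipelines.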

How confident are you that there is no hidden malicious activity buried in the noise? How do you know the project wasn’t taken over through an expired author or maintainer domain or a compromised account?

Countermeasures

PyPI is immutable in that you cannot replace an existing release artifact with a different file. It is still possible, however, to publish new, more specific artifacts for an existing release, which gives rise to the class of attack outlined in this post. To avoid installing a malicious artifact, use hashes.

pip and other package installers allow for specifying dependencies with a matching hash. pip calls it “Hash-checking mode” and pipenv makes use of it in Pipfile.lock as a default security feature. The trick is to ensure any lockfile created with hashes from known good dependencies is guarded against updates that may add new hashes. Of course, Phylum is here to help with that.
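
As a rough illustration of guarding a lockfile, the sketch below compares two lockfile snapshots and reports hashes that newly appear for a requirement that was already locked. It assumes each lockfile has already been parsed into a mapping of requirement to a set of SHA-256 hex digests. The `added_hashes` helper, the mapping shape, and the stand-in wheel bytes are all hypothetical, and real lockfile formats differ between tools.

```python
import hashlib


def sha256_hex(data):
    """Return the hex SHA-256 digest of the given artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def added_hashes(old_lock, new_lock):
    """Map each already-locked requirement to hashes found only in new_lock."""
    added = {}
    for requirement, new in new_lock.items():
        # Requirements absent from old_lock are new pins, not new hashes for
        # an existing pin, so they are not reported here.
        extra = new - old_lock.get(requirement, new)
        if extra:
            added[requirement] = extra
    return added


# Stand-in bytes; in practice these digests come from the index or lockfile.
reviewed = sha256_hex(b"stand-in bytes for a reviewed wheel")
later = sha256_hex(b"stand-in bytes for a wheel uploaded later")

old_lock = {"black==24.4.0": {reviewed}}
new_lock = {"black==24.4.0": {reviewed, later}}

print(added_hashes(old_lock, new_lock) == {"black==24.4.0": {later}})  # True
```

A new hash for an unchanged version pin means a file was added to an existing release. That may be a legitimate late wheel, but it deserves a review before the lockfile update is merged.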


Charles Coggins

Senior Software Engineer, responsible for integrations and author of the "phylum" Python package. Documentation and quality champion, runner, baseball and scout dad, pod-faster, and lover of outdoors.