Phylum’s Monthly Malware Report: March 2022 – Unknown Unknowns
Relying on security research to manually discover open-source packages that exhibit supply chain issues is no longer enough. To truly mitigate the risk of using open-source software written by strangers on the Internet, we must analyze all packages published into the various ecosystems, in real time and at scale.
Phylum was purpose-built to analyze the risk of every package release, with the necessary scale, depth, and automation in mind. In the last 29 days, Phylum has processed a total of 545,777 package releases across three ecosystems (NPM, PyPI, and RubyGems), for an average of 18,818 packages processed each day. We have analyzed the metadata and source code for each of these packages, which meant processing 38,798,991 individual source files.
| Package Registry | # of Packages | # of Packages/day |
|---|---|---|
| NPM | 467,818 | 16,131 |
| PyPI | 69,966 | 2,412 |
| RubyGems | 7,993 | 275.6 |
| Total | 545,777 | Average: 18,818 |
Once this data has been collected, Phylum’s analytics, heuristics and ML models comb through the data to identify risk indicators. We look for a myriad of these risk indicators but are uniquely positioned to identify and convict malware. Phylum’s analytics seek to make determinations on the maliciousness of a package before a developer adds it to a production release. Examples of the analytics that identify elements of malware include:
- high-entropy strings used as arguments to evaluation functions
- risky function calls
- author/maintainer transition
- package name similarity (for typosquatting detection)
- package code similarity (for typosquatting detection)
- pre- and post-installation hooks
- calls to system binaries frequently used for host compromise
- presence of, and changes to, URIs and IP addresses
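As an illustration of the first indicator in the list above, here is a minimal sketch of how string entropy might be measured. It is not Phylum's implementation: the function names, the threshold, and the minimum length are hypothetical values chosen only for the example.

```python
import math
from collections import Counter

# Hypothetical tuning values, chosen for illustration only.
ENTROPY_THRESHOLD = 4.5  # bits per character
MIN_LENGTH = 32          # ignore short strings, which are noisy


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # H = sum over symbols of p * log2(1 / p), where p = count / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def looks_obfuscated(text: str) -> bool:
    """Flag long strings whose characters are close to uniformly distributed."""
    return len(text) >= MIN_LENGTH and shannon_entropy(text) >= ENTROPY_THRESHOLD
```

A real pipeline would presumably apply a check like this only to string literals passed to evaluation functions, since entropy alone also flags benign content such as hashes and embedded images.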
Our completely automated processing and analytics pipeline identified 72 package versions of interest over the past month. These include genuine malware, packages built for reconnaissance and network enumeration, and packages published by individuals attempting to conduct security research. On average, these packages were identified within 11.2 minutes of publication.
Next, Phylum researchers validate the results for the identified packages, use the resulting data to improve our analytics, and report the malicious packages to the respective package registry. This amounted to 41 packages reported to two package registries, as several packages had multiple affected versions.
In doing so, Phylum has simultaneously reduced the window of opportunity for an attacker to infect a victim and made it more difficult for a published malicious package to remain undetected for months or years.
Preview: Malware Spotlight
Going forward, Phylum will select interesting packages identified during this process for deeper review on our blog. We’ll use adblock-lists as an example for a preview.
This package was quickly identified as a package version of interest, then verified and reported in a total of 14 minutes. In the image above, we can see three artifacts that the system identified; a fourth is explained below:
- The package is missing a README. This isn’t common for legitimate software libraries and is a minor contributor to the classification.
- The package has only two released versions yet has a version number of 99.99.0. This is almost assuredly an attempt to target victims with a dependency confusion attack and is a large contributor to the classification.
- The package is missing a link to a version control system such as GitHub or GitLab. This is uncommon for legitimate software libraries, but very common for packages released by attackers. This is a minor contributor to the classification.
- Via automated static analysis of the source code, Phylum identified the package as employing an install hook that executes curl to send an HTTP request to a URI. This is almost assuredly illegitimate and is a large contributor to the classification.
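To make the four indicators above concrete, the sketch below checks an NPM-style manifest for each of them. It is an assumption-laden illustration rather than Phylum's analysis: the function name, the finding labels, and both thresholds are hypothetical, and the manifest is assumed to be an already-parsed `package.json` dictionary.

```python
import re

# npm lifecycle scripts that run automatically during installation.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")
# Binaries commonly used to make network requests from a shell command.
NETWORK_BINARIES = re.compile(r"\b(curl|wget)\b")

# Hypothetical thresholds, chosen for illustration only.
HIGH_MAJOR_VERSION = 90
FEW_RELEASES = 3


def manifest_indicators(manifest: dict, release_count: int, has_readme: bool) -> list:
    """Return a list of hypothetical finding labels for one package version."""
    findings = []
    if not has_readme:
        findings.append("missing-readme")
    if not manifest.get("repository"):
        findings.append("missing-repository")

    # A very high major version on a package with few releases suggests
    # an attempt to win version resolution (dependency confusion).
    match = re.match(r"\d+", str(manifest.get("version", "")))
    major = int(match.group()) if match else 0
    if major >= HIGH_MAJOR_VERSION and release_count <= FEW_RELEASES:
        findings.append("inflated-version")

    # Install hooks that shell out to a network binary.
    scripts = manifest.get("scripts") or {}
    for hook in INSTALL_HOOKS:
        if NETWORK_BINARIES.search(scripts.get(hook, "")):
            findings.append("install-hook-network-call:" + hook)
    return findings
```

For example, a made-up manifest such as `{"version": "99.99.0", "scripts": {"preinstall": "curl https://example.invalid/ping"}}` with two releases and no README would trigger all four findings. A production system would weight these findings rather than treat them equally, as the minor and large contributors described above suggest.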
Packages of Interest
Phylum identified 41 packages of interest, with several triggering on multiple versions of the package.
NPM Packages:
Why Phylum & What’s Coming Next
Phylum’s capabilities extend beyond pure source code analysis. We have constructed authorship models that, in combination with other metrics, allow us to identify odd behaviors around commits and activity. We also analyze maintainer information for a package, allowing us to spot packages that have recently changed ownership and may be at risk for the introduction of malware (as was the case with event-stream in 2018).
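A maintainer-change check of the kind described above could be sketched as a simple set comparison between two releases. This is only an illustration under stated assumptions: the function names are hypothetical, and the maintainer sets are assumed to have been extracted from registry metadata beforehand.

```python
def maintainer_changes(previous: set, current: set) -> dict:
    """Report which maintainers were added or removed between two releases."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


def ownership_transferred(previous: set, current: set) -> bool:
    """True when none of the previous release's maintainers remain."""
    return bool(previous) and previous.isdisjoint(current)
```

A change in maintainers is not malicious in itself, so a signal like this is most useful when combined with other indicators, such as new install hooks or newly introduced URIs in the same release.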
Looking ahead, we are preparing the imminent release of C#/NuGet and Java/Maven support. In addition, we are pushing hard to increase both the sophistication and the number of our heuristics and analytics.
Phylum, at its core, is a risk detection system focusing on the software supply chain. Unlike other SCA products that focus nearly exclusively on well-known issues, we are looking for the unknown unknowns: the subtle modifications to a software package that will surreptitiously exfiltrate keys to your critical infrastructure. We do this at the scale of open source, tackling the problem in an automated fashion, to make software supply chain security proactive instead of merely reactive.
To learn more about Phylum’s automated malware identification capability and how we support secure and efficient use of open-source software, contact us for a conversation.