The Great npm Garbage Patch

In April of this year, the Phylum Research Team revealed the proliferation of spam packages in npm associated with the Tea protocol, a decentralized initiative that promises to compensate software developers in cryptocurrency for their open-source contributions. Last month, we published our quarterly research report, which estimated that approximately one out of every four packages published to npm in Q2 was associated with Tea. Virtually all of these had no redeeming quality; they existed only to game the protocol and artificially inflate a software developer’s contribution. With new research from a fresh perspective, our team can now report that the volume of these packages is likely larger than our initial estimates. Like the island of discarded plastic twice the size of Texas floating in the North Pacific Ocean, npm has accrued an astonishing number of spam packages over the past six months. Join us as we take a fresh look at the pollution in this open-source ecosystem.

A quick recap

As detailed in our previous blog post, the Tea protocol perversely incentivizes software developers to exaggerate their contribution to open-source development. Using a modified PageRank called teaRank, the protocol rewards software developers based on their “Proof of Contribution”. Just as early SEO spammers figured out how to game PageRank for their benefit, history is repeating itself: a few software developers have spammed open-source repositories with absurd numbers of worthless packages.

npm, the largest open-source ecosystem, has suffered the worst of this pollution, which comes from various actors. Hallmarks of these spam packages include gibberish package names, names built from random combinations of words from a list, implausible lists of dependencies, a dubious number of dependent packages, and, somewhere in this morass of transitive dependencies, the ubiquitous tea.yaml file that ultimately identifies the code owner. When Phylum first started investigating this situation in February, we were continually amazed at the sheer volume of packages being published, clearly through automation. So, we turned our attention to understanding the full scope of this spam problem.
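To make the hallmarks above concrete, here is a minimal sketch of how a few of them might be checked automatically. It is an illustration only, not Phylum's triage criteria: the function name tea_spam_signals, the name-suffix pattern, and the dependency threshold are all hypothetical, and each signal on its own is weak (legitimate Tea participants also ship a tea.yaml).

```python
import re
from typing import List

# Hypothetical threshold chosen for illustration; not a value from the research.
MAX_PLAUSIBLE_DEPENDENCIES = 50

def tea_spam_signals(package_json: dict, file_names: List[str]) -> List[str]:
    """Return which of the hallmarks described above a package exhibits."""
    signals = []
    name = package_json.get("name", "")
    # Names such as "quasar-fig-0e1t" end in a short token that mixes in a digit.
    if re.search(r"-(?=[a-z0-9]*\d)[a-z0-9]{3,6}$", name):
        signals.append("random-looking name suffix")
    if len(package_json.get("dependencies", {})) > MAX_PLAUSIBLE_DEPENDENCIES:
        signals.append("implausible number of dependencies")
    # Compare only the final path component so nested paths also match.
    if any(f.rsplit("/", 1)[-1] == "tea.yaml" for f in file_names):
        signals.append("tea.yaml present")
    return signals
```

A package would only be a candidate for manual review when several signals fire together; the sketch deliberately reports signals rather than a verdict.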

A fresh perspective

For a baseline, at the start of 2024, Phylum ingested about 1,500 packages from npm each business day and about half that on weekend days. Starting in February 2024, Phylum began to notice a steady increase in npm package publications, from a few thousand to tens of thousands per day. The high-water mark came on 8 April 2024, with over 48,000 packages published to npm. This explosion of packages led us to our first discovery of the perverse incentives of the Tea protocol.

Last month, in preparation for our quarterly report, we took a random sample of 1,600 npm packages published in Q2 and triaged them manually. If a package contained the markers of Tea protocol abuse noted above, we marked it as spam. From this sample, we estimated with 95% confidence that between 21.25% and 25.5% of the packages published to npm in Q2 were spam, or in other words, over 500,000 spam packages.
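For readers who want to see how an interval like this arises, the following is a minimal sketch using the normal-approximation (Wald) interval for a sample proportion. Both the method and the spam count of 374 are assumptions made for illustration; the report does not state which interval method was used or how many sampled packages were marked as spam.

```python
import math

def wald_interval(successes: int, n: int, z: float = 1.959964):
    """Normal-approximation confidence interval for a proportion (z is the 95% quantile)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# 374 spam packages out of 1,600 is a hypothetical count, not a figure from the report.
low, high = wald_interval(374, 1600)
print(f"{low:.2%} to {high:.2%}")
```

With these assumed inputs the interval comes out close to, but not identical to, the one reported above, which suggests a slightly different method was used in the actual analysis.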

Upon further reflection, we considered that many npm projects publish nightly builds or alpha, beta, and canary versions. These legitimate packages, which enjoy a robust development cycle, might therefore dilute the measured share of spam and hide its true impact. What if we restricted our search to new packages, ones that had never been seen before in npm?

We widened the search of our npm data back to February, when we saw the first Tea protocol spam, and then removed every package that had at least one version published before then. This left us with over 890,000 new, never-before-seen packages between February 2024 and the present. From this set, we took a random sample of 900 packages and applied the same criteria as before. From this new perspective, our 95% confidence interval for the share of Tea protocol spam among new packages over the past six months jumped to between 68.66% and 74.67%, or somewhere between 613,000 and 667,000 packages.
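The "never-before-seen" filter described above can be sketched as follows. This assumes the publication data is available as (package name, publish date) pairs, one per published version; that record layout, the function name, and the exact cutoff date of 1 February 2024 are assumptions for illustration, not details from Phylum's pipeline.

```python
from datetime import date

def never_before_seen(publications, cutoff):
    """Return names of packages whose earliest publish date is on or after cutoff.

    publications: iterable of (package_name, publish_date), one per version.
    """
    first_seen = {}
    for name, published in publications:
        # Track the earliest publication date observed for each package name.
        if name not in first_seen or published < first_seen[name]:
            first_seen[name] = published
    return {name for name, first in first_seen.items() if first >= cutoff}

# Hypothetical records: "old-lib" predates the cutoff, so its 2024 release is excluded.
publications = [
    ("old-lib", date(2023, 5, 1)),
    ("old-lib", date(2024, 3, 1)),
    ("new-pkg", date(2024, 2, 10)),
    ("new-pkg", date(2024, 4, 8)),
]
print(never_before_seen(publications, date(2024, 2, 1)))
```

Filtering on the first publication date, rather than on each version, is what keeps nightly and canary releases of established projects out of the sampled population.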

In other words, among all new packages published to npm in the past six months, about five out of every seven packages are Tea spam.

Is there a threat here?

Followers of this blog know that most of our content focuses on exposing active malicious attacks against open-source software developers. In the spirit of full disclosure, Phylum has yet to discover evidence that these packages contain or lead to the usual kind of malice that we regularly report. But, as a general observation, this pollution is itself a kind of malice, and there are several ways in which it could become dangerous.

First, unlike malicious typosquatting campaigns, in which an unsuspecting developer might accidentally install reaxt instead of react, it is not at all likely that a developer would make the same mistake with, for example, quasar-fig-0e1t. However, a mistaken install of a package like web3-cover is more plausible, and the developer would then also get its 170 dependencies along with the complete transitive dependency tree of each one.

Next, because the AI hype train is at full steam, we must point out the obvious. Training AI models on these packages will almost certainly skew their outputs in unintended directions. These packages are ultimately garbage, and the mantra of “garbage in, garbage out” holds true.

Finally, these large-scale spam campaigns hinder the open-source package registry’s ability to reason through the safety of all packages in an ecosystem, despite the fact that no reasonable person would ever endeavor to install one of these spam packages. They raise the noise floor and create an environment in which an adversary could surreptitiously hide actual maliciousness.

Thinking like the adversary

Let’s start by taking a look at the following package, sournoise. The npmjs website lists a single dependency on axios.

npmjs.org showing axios as a dependency of sournoise

There is not a lot happening here. The package does not contain code, and according to npm, its only dependency is the extremely popular axios package. Is this package safe to install?

The package.json tells a different story.

{
  "name": "sournoise",
  "version": "1.0.1",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "axios": "https://registry.npmjs.org/@putrifransiska/kwonthol36/-/kwonthol36-1.1.4.tgz"
  }
}

package.json for sournoise showing the spurious axios dependency

Contrary to what npm states, this package actually depends on one of our aforementioned spam packages: the dependency is named axios, but it resolves to a tarball URL for @putrifransiska/kwonthol36. This is a by-product of how npm handles and displays dependencies to users on its website. The site shows no clear linkage to @putrifransiska/kwonthol36, and axios lists sournoise as a dependent.
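A mismatch like this can be caught by reading the manifest directly rather than trusting the website. The sketch below flags any dependency whose value is a URL and whose registry tarball path names a different package than the dependency key. The function name and the path pattern are assumptions for illustration; it only covers registry-style tarball URLs and reports anything else as unresolved.

```python
import json
import re
from urllib.parse import urlparse

# Registry tarball paths look like /<name>/-/<file>.tgz or /@scope/<name>/-/<file>.tgz
TARBALL_PATH = re.compile(r"^/((?:@[^/]+/)?[^/]+)/-/[^/]+\.tgz$")

def mismatched_url_dependencies(manifest_text):
    """Return (declared_name, actual_name_or_None, spec) for suspicious URL dependencies."""
    manifest = json.loads(manifest_text)
    findings = []
    for section in ("dependencies", "devDependencies", "optionalDependencies"):
        for declared_name, spec in manifest.get(section, {}).items():
            if not spec.startswith(("http://", "https://")):
                continue  # ordinary version ranges are out of scope for this check
            match = TARBALL_PATH.match(urlparse(spec).path)
            actual_name = match.group(1) if match else None
            if actual_name != declared_name:
                findings.append((declared_name, actual_name, spec))
    return findings

# The relevant part of the sournoise manifest shown above.
manifest_text = json.dumps({
    "name": "sournoise",
    "dependencies": {
        "axios": "https://registry.npmjs.org/@putrifransiska/kwonthol36/-/kwonthol36-1.1.4.tgz"
    },
})
for declared, actual, spec in mismatched_url_dependencies(manifest_text):
    print(f"{declared!r} actually resolves to {actual!r}")
```

Run against the sournoise manifest, the check reports that the dependency declared as axios points at @putrifransiska/kwonthol36.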

To say that you’d never install one of these spam packages is to ignore the complexity of the supply chain: transitive dependencies can pull in packages that the developer neither wants nor expects to receive.

Conclusion

Open-source software ecosystem pollution is a problem for everyone. The Tea protocol project is taking steps to remediate this problem; it would be unfair to legitimate participants in the Tea protocol to have their remuneration reduced because others are scamming the system. npm has also begun to take down some of these spammers, but the take-down rate does not match the new publication rate. Nor is this problem limited to npm. For example, one user published nearly 1,800 spam packages on RubyGems in late February and early March 2024. Phylum is actively researching this area, and we will continue to seek new ways to detect this spam as these actors adapt their tactics.

Phylum Research Team

Hackers, Data Scientists, and Engineers responsible for the identification and takedown of software supply chain attackers.