Phylum 2021 Top 5 Most Viewed Blog Posts

Peter Morgan, President - January 13, 2022

Over the course of 2021, members of our team here at Phylum published several technical and research blog posts. In case you missed them the first time round, here are our Top Five Most Viewed Blog Posts for 2021.


Spark and Rust - How to Build Fast, Distributed and Flexible Analytics Pipelines with Side Effects
By Andrea Venuta, Senior Software Engineer

Apache Spark is a powerful piece of software that has enabled Phylum to build and run complex analytics and models over a big data lake comprising data from popular programming language ecosystems.

Spark handles the nitty-gritty details of distributed computation, providing an abstraction that allows our team to focus on the actual unit of computation.

Rust is another particularly important tool in our engineering toolbox. It enables us to build reliable, safe, and fast software, all from the comfort of a modern and well-thought-out type system that makes expressing complex ideas a breeze.




Using Entropy to Identify Obfuscated Malicious Code
By Eric Freitag, Chief Engineer

A lot can be determined about a typical software program simply by examining the strings it contains. For example, you can see the files the program uses, network addresses or hostnames, environment variables, and runtime libraries. These are easily discernible with static analysis tools, and even simple utilities like GNU strings are well suited to examining them.

To avoid detection, certain authors might wish to deliberately obscure the nature of a program by rendering these strings less transparent.
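One common way to quantify this kind of obfuscation is Shannon entropy: packed or encoded strings tend to score higher bits-per-character than ordinary identifiers. A minimal sketch (the sample strings are illustrative, not drawn from real malware):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

# A fully repetitive string carries no information per character...
print(shannon_entropy("aaaa"))                  # 0.0
# ...while base64-like blobs score higher than plain identifiers.
print(shannon_entropy("aaaa_config_path"))      # ~3.1
print(shannon_entropy("aGVsbG8gd29ybGQhIQ=="))  # ~3.8
```

Scanning a program's string table and flagging outliers above some entropy threshold is a cheap first-pass heuristic; it won't catch every trick, but it surfaces strings worth a closer look.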




Detecting Potential Bad Actors in GitHub
By Chris Tokita, Data Scientist

The vast open-source software ecosystem contains millions of packages and tens of millions of contributing authors. This is both the strength and the weakness of open-source software: its crowdsourced nature means that packages are continually updated and innovated (for free!), while at the same time leaving them vulnerable to someone slipping in some harmful code. Therefore, to truly detect risk in the open-source software ecosystem, we ought to care about its social components—the behavior and interactions of the many thousands of software authors—as much as we care about its computational components—the code itself.

Here at Phylum, I’m taking on this challenge to find the proverbial malicious needle in the otherwise benign haystack: can we use machine learning to detect unusual behavior among authors of open-source software?
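As a toy illustration of the idea (the features and numbers below are made up; the post's actual approach applies machine learning to real GitHub data), even a simple per-feature z-score can flag an author whose activity pattern deviates sharply from the crowd:

```python
from statistics import mean, stdev

# Hypothetical per-author features: commits/month, active repos,
# account age in months. Real signals would come from GitHub data.
authors = {
    "alice":   [120, 4.0, 30],
    "bob":     [110, 3.5, 28],
    "carol":   [130, 4.2, 33],
    "mallory": [900, 40.0, 1],   # bursty brand-new account: the "needle"
}

def anomaly_scores(rows):
    """Score each author by the largest absolute z-score across features."""
    cols = list(zip(*rows.values()))
    stats = [(mean(c), stdev(c)) for c in cols]
    return {
        name: max(abs((v - m) / s) for v, (m, s) in zip(feats, stats))
        for name, feats in rows.items()
    }

scores = anomaly_scores(authors)
flagged = max(scores, key=scores.get)
print(flagged)  # mallory
```

A production system would use richer behavioral features and a proper anomaly-detection model, but the principle is the same: unusual authors stand out against the statistical baseline of the community.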




A Spooky Occurrence in the Open-Source Ecosystem: Hacktoberfest 2020
By Chris Tokita, Data Scientist

One of the things that excites me about the open-source software ecosystem is entirely outside the technical components of code and computation. Instead, as someone whose PhD was focused on behavior and networks in social systems—ant colonies, bee hives, social media, etc.—I find the social element of open-source software to be the most intriguing.

Much like the advent of social media, open-source software is as much a technological innovation as it is a social one. Open-source software is entirely driven by a large community of pro bono developers who collaborate to create and update the software that now underlies much of our most important technology! With this social core of open-source, some have naturally begun trying to recruit new members to this grassroots development community.




Design Matters: How We Created Phylum’s Risk Score for Open-Source Packages
By Aaron Bray, CEO

Generating meaningful scores for open-source packages is extremely complex. Effective scoring for risk and reputation needs to incorporate disparate pieces of information while also accounting for important edge cases. The key challenge is to ensure that the score is:

  • Maintainable - as time goes on and more attributes are incorporated, the score should become more accurate, not less.
  • Intuitive - it needs to align with user expectations.
  • Understandable - as above, it should be easy to understand how the scoring mechanism works.
  • Useful - the scoring mechanism needs to provide utility and be actionable.
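One way to illustrate how such criteria might be combined (the attribute names and weights below are hypothetical, not Phylum's actual model) is a weighted geometric mean: unlike a plain average, a single severely risky attribute drags the whole score down instead of being averaged away, which keeps the result intuitive and actionable.

```python
def package_score(attributes: dict, weights: dict) -> float:
    """Combine per-attribute scores in [0, 1] (1 = no risk observed)
    via a weighted geometric mean."""
    total_w = sum(weights[name] for name in attributes)
    score = 1.0
    for name, value in attributes.items():
        # Clamp away from zero so one attribute can't make log/power math blow up.
        score *= max(value, 1e-6) ** (weights[name] / total_w)
    return score

# Illustrative attributes and weights only.
print(package_score(
    {"malware": 1.0, "vulnerability": 0.6, "author_reputation": 0.9},
    {"malware": 3.0, "vulnerability": 2.0, "author_reputation": 1.0},
))
```

Because new attributes simply enter the product with their own weight, a design like this also stays maintainable: adding signals refines the score rather than destabilizing it.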


