How I re-created a 2500+ star open source project with 52 lines of code

Lesson 101: Don’t be frightened or intimidated by GitHub stars

So, I was finishing off my exploit sandwich by creating a requirements.txt, you know, the icing on the cake.

If you don’t know what a requirements.txt file is, then it’s your lucky day!

“Requirements files” are files containing a list of items (Python packages) to be installed using pip install like so: pip install -r requirements.txt. Logically, a Requirements file is just a list of pip install arguments placed in a file.

While manually generating the file, I caught myself wondering, “what’s my workflow for this particular task?”

Looking at my code, reading the imports, taking the first module name and then doing something like,

or, if I’m on Windows,

But, at that moment, it hit me: can this be automated?

And my brain quickly responded with:

Why freaking not? It’s just a shell script away!
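To make that manual workflow concrete, here is a minimal sketch of just its first step, collecting the top-level module names that a piece of code imports, written in Python rather than shell. The function name top_level_imports and the sample source string are made up for illustration; this is not the code from pipreqs or Rqmts.

```python
import ast


def top_level_imports(source):
    """Return the sorted top-level module names imported by `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import os.path` -> keep only the top-level name `os`
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (`from . import x`); they are local code.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return sorted(names)


# Hypothetical input, only for demonstration.
example = "import os.path\nfrom requests import get\nfrom . import helpers\n"
print(top_level_imports(example))  # ['os', 'requests']
```

Note that this only yields import names. Turning each of them into a package name and a version is the hard part.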

I was about to dive in and write a shabby shell script, but then thought, why reinvent the wheel? Somebody must have done it already!

I Googled and, as expected, tons of threads about grepping pip output popped up, but not a single one-fell-swoop, reliable shell script with edge case handling.

As I kept searching out of curiosity, wondering what on earth developers were doing, my search wagon got wrecked by a HUGE GitHub repository, HUGE.

It was “bndr/pipreqs: Generate pip requirements.txt file based on imports of any project.”

Thoughts came hurtling in, saying things like …

HOLY SHIT, LOOK AT THOSE STARS

WHAT THE … 45 ISSUES! ... SO MANNNYYY PULLS

I’VE NEVER BUILT SOMETHING THAT BIG!

It was so overwhelming that I went numb for a few moments, but at the same time I wondered why a simple script that belongs in a gist holds so much value.

Then I realized the challenges, and the project started making sense. I started to like the whole idea of it. There are many problems in this task, big and minuscule, the major challenge being the inconsistency among module names and respective versions (which I’ll explain in detail later). Also, virtual environments tend to screw things up. I started to respect the repository and went ahead to look at the code-base and, most importantly, the issues, since my exploit was done and I had a whole night to fiddle around.

What I concluded was that it works well but is not flawless. I didn’t like some of its underlying core principles, such as using hard-coded mappings, which cause lots of issues, e.g.

I went ahead to make my own version, set my stopwatch to 30 minutes and the coding began.

Let’s do it!

But before starting, you may suggest the following solution,

Given the name of a Python package, we just want to know the name of the module to import and its version. That’s it. Right?

The answer is “Yes!”. But unfortunately, it’s not as easy as you may think.

“Regrettably, there’s no method to the madness. The name in the package index is independent of the module name you import. Disastrously some packages share module names. If you install both, your application will break with even odds.

Packaging in Python is generally dire. The root cause is that the language ships without a package manager. Ruby and Nodejs ship with full-featured package managers Gem and Npm, and have nurtured sharing communities centred around GitHub. Npm makes publishing packages as easy as installing them. Nodejs arrived 2009 and already has 14k packages. The venerable Python package index lists 24k. Ruby Gems lists 44k packages.

Fortunately, there is one decent package manager for Python, called Pip. Pip is inspired by Ruby’s Gem, but lacks some vital features (eg. listing packages, and upgrading en mass). Ironically, Pip itself is complicated to install. Installation on the popular 64-bit Windows demands building and installing two packages from source. This is a big ask for anyone new to programming.

Python’s devs are ignorant of all this frustration because they are seasoned programmers comfortable building from source, and they use Linux distributions with packaged Python modules.

Until Python ships with a package manager, thousands of developers will needlessly waste time reinventing the wheel.” — Colonel Panic
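The mismatch between index names and import names is easy to see for yourself. Below is a small sketch that lists the pairs where the two differ. It assumes Python 3.10 or newer for importlib.metadata.packages_distributions(); the helper name mismatched_names and the sample mapping are hypothetical, picked only for illustration.

```python
try:
    # Maps top-level import names to the distributions that provide them.
    from importlib.metadata import packages_distributions  # Python 3.10+
except ImportError:
    packages_distributions = None


def mismatched_names(mapping=None):
    """Return sorted (import_name, distribution_name) pairs that differ."""
    if mapping is None:
        # Fall back to an empty mapping on interpreters older than 3.10.
        mapping = packages_distributions() if packages_distributions else {}
    pairs = set()
    for import_name, dist_names in mapping.items():
        for dist_name in dist_names:
            # Compare case-insensitively, treating "-" and "_" as the same.
            if dist_name.lower().replace("-", "_") != import_name.lower():
                pairs.add((import_name, dist_name))
    return sorted(pairs)


# Hand-written sample so the output does not depend on what is installed.
sample = {"yaml": ["PyYAML"], "bs4": ["beautifulsoup4"], "requests": ["requests"]}
print(mismatched_names(sample))  # [('bs4', 'beautifulsoup4'), ('yaml', 'PyYAML')]
```

Calling mismatched_names() with no argument inspects the current environment instead, so its output will vary from machine to machine.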

After learning this, the first question I asked myself was: what can we do to grab the name and version details?

I came to know that module/package version information in Python can live in very different places depending on the case:

  • for modules and packages, in the optional __version__ attribute, as recommended by PEP 396.
  • for distributed modules and packages, in the Version metadata field, as specified by PEP 345, which is located:
    - for built wheel distributions (PEP 427), in the dist-info directory, and also in the dist-info folder name
    - for built egg distributions (legacy format from setuptools), in the egg-info directory, and also in the egg-info folder name
  • for built-in modules and packages, the default version should be inherited from the Python system version, unless overridden
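As an illustration of the "folder name" case in the list above, here is a rough sketch that pulls a name and version out of a dist-info or egg-info directory name. The function name parse_metadata_dirname is hypothetical, and the sketch assumes the common NAME-VERSION[-extra] layout; a real tool should read the metadata file inside the directory rather than trust the folder name.

```python
def parse_metadata_dirname(dirname):
    """Split e.g. 'requests-2.31.0.dist-info' into ('requests', '2.31.0').

    Returns None when the name does not look like a metadata directory.
    """
    for suffix in (".dist-info", ".egg-info"):
        if dirname.endswith(suffix):
            stem = dirname[: -len(suffix)]
            # The first hyphen separates the name from the version.
            name, sep, rest = stem.partition("-")
            if not sep:
                return None
            # Drop trailing tags such as '-py3.8' found on egg-info names.
            version = rest.split("-")[0]
            return name, version
    return None


print(parse_metadata_dirname("requests-2.31.0.dist-info"))
print(parse_metadata_dirname("PyYAML-5.4.1-py3.8.egg-info"))
```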

Using the first method, we try the __version__ attribute; however, as said, there are modules without it.
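A minimal sketch of that first method might look like the following. The helper name module_version is made up for illustration and is not necessarily how Rqmts does it. Keep in mind that it has to import the module, which runs that module's top-level code.

```python
import importlib


def module_version(module_name):
    """Return the module's __version__ as a string, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module is not installed (or cannot be imported) here.
        return None
    version = getattr(module, "__version__", None)
    return str(version) if version is not None else None


# `os` is an example of a module that carries no __version__ attribute.
print(module_version("os"))  # None
```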

The second method uses the pkg_resources module distributed with the setuptools library. Note that the string we pass to the get_distribution method should correspond to the PyPI entry (which, again, is not reliable; see #0x48piraj/rqmts/issues/4).

I employed some of these methods in haste, and it took almost 44 minutes of my life, at least initially; I kept working on it afterwards. Fun fact: the first version contained only 52 lines of code.
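Here is a hedged sketch of the metadata-based lookup. Instead of pkg_resources it uses importlib.metadata from the standard library (Python 3.8+), a swapped-in equivalent that reads the same installed metadata; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if not found.

    `distribution_name` is the name used on the package index (for example
    'PyYAML'), which is not necessarily the name you import ('yaml').
    """
    # Roughly equivalent with setuptools:
    #   pkg_resources.get_distribution(distribution_name).version
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# The result depends on the environment, so no expected output is shown.
print(installed_version("pip"))
```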

Project Technicals

Object Oriented (OO) or Functional Programming (FP) Debate

“What Hogwarts house do you belong to? Are you team Jacob or team Edward? Mayweather or McGregor? Which house in Game of Thrones do you pledge your allegiance to? Real Madrid or Barcelona? White or wheat? Few rivalries have split otherwise nice, normal people into such hostile, frenzied factions, and we have another one to add to the list: Object-oriented vs functional programming.” — Sho Miyata

Quite frankly, I like both, but before moving forward, I would like to clear up a misconception:

Almost all modern programs use functions, but just using functions is not Functional programming.

What is required is functions that are ‘functional’ in the mathematical sense, not programming in the ‘using functions’ sense. A mathematical function, or ‘pure function’, operates on the supplied arguments, returns a result and does nothing else. No ‘side effects’: nothing is changed by the function, and no internal variables are altered in a way that would make a future call of the same function, with the same arguments, return a different value.

Coming back, I thought this project needed separation of data and methods, as well as a high level of abstraction to leave less room for errors. I thus went with a procedural style.
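A tiny illustration of the difference, with made-up function names: the first function is pure, while the second uses a function too but mutates hidden state, so it is not ‘functional’ in the mathematical sense.

```python
def add_tax(price, rate):
    """Pure: the result depends only on the arguments; nothing else changes."""
    return price * (1 + rate)


_running_total = 0


def add_to_total(amount):
    """Impure: reads and mutates module-level state on every call."""
    global _running_total
    _running_total += amount
    return _running_total


# Same arguments, same result, every time.
print(add_tax(100, 0.5), add_tax(100, 0.5))
# Same argument, different results, because hidden state changed in between.
print(add_to_total(5), add_to_total(5))
```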

I think Python excels in implementing this particular paradigm. It was made modular from the beginning so that employing new methods is extremely easy.

This was the first project where I used GitHub’s Create Releases feature to create Rqmts releases and track development progress efficiently. And as always, building and publishing a Python module encourages documentation, proper code structure, a consistent commit message style and many more good software development practices. So hey, do not hesitate to open an issue or a PR if the current version of Rqmts doesn’t work for you. (Please do!)

Naming the repository

I’m bad at naming projects, so I literally searched for an abbreviation for “requirements” and, as expected, a thread on English Language & Usage Stack Exchange popped up with the name that was going to become the repository name.

… hey, what about a nice logo for the project?

This task was accomplished by a veteran graphic designer named Tushar Sadana (be sure to check out his work) who is a senior of mine at my college. He hacked away the logo in a night or so resulting in,

I don’t know, but I liked it very much. What do you guys think about it?

Final word

I take the stars as a measure of awareness, meaning that projects with lots of stars might be known to many people and projects with only a few stars may be relatively unknown, nothing else.

See Curating GitHub for engineered software projects to understand how useful a metric it is, or just read the gist below,

We used reaper to measure the dimensions of 1,857,423 GitHub repositories. We then used manually classified data sets of repositories to train classifiers capable of predicting if a given GitHub repository contains an engineered software project. […] The performance of the classifiers was evaluated using a set of 200 repositories with known ground truth classification. We also compared the performance of the classifiers to other approaches to classification (e.g. number of GitHub Stargazers) and found our classifiers to outperform existing approaches. We found stargazers-based classifier (with 10 as the threshold for number of stargazers) to exhibit high precision (97%) but an inversely proportional recall (32%). On the other hand, our best classifier exhibited a high precision (82%) and a high recall (86%).

The stargazer-based criteria offers precision but fails to recall a significant portion of the population.

So, the moral of the story and my lesson was —

Don’t get scared of GitHub stars. Build crazy things which help solve big problems.

Coming back again, Rqmts is an open source project and is in no way complete, but it is constantly improving, and I would be happy to see contributors who report bugs, file feature requests and submit pull requests. Simply put, I would love to see YOU on board!

A little about me

Security researcher at night, developer in the morning & tinkerer at noon.
As usual, you can connect with me over LinkedIn, Twitter, Instagram.

Google Code-In C. Winner. GsOCer ‘19. Independent Security Researcher. Have hacked Medium, Mozilla, Opera & many more. Personal Website: https://0x48piraj.com