I spend a lot of time doing Python development, and in the process, I make pretty heavy use of virtual environments. If you’ve spent any time there, you know that Python virtual environment tooling is a bit of a mess. Depending on which part of the problem space you’re working in, you might prefer:
As with many topics in open source land, there’s a lot of fighting that goes on about just which virtual environment manager is the right one to use.
I hear a lot of opinions in this space, stated as facts. It’s like the mantra “Strong opinions, loosely held,” except without the second part.
While these aren’t myths exactly, I’m going to bust a few of these opinions. A given tool may be a perfectly good solution for your use case, but that doesn’t mean it solves my problems, or those of my customers.
Pipenv is the officially recommended Python packaging tool from Python.org
Nope. Not even close. Pipenv is aimed at managing application dependencies only. Not library development. Not packaging. Also, it doesn’t handle multiple versions of Python. As someone who does a lot of reproducibility work, being able to pick and choose which version of Python I use is absolutely essential.
I will admit, what pipenv does do nicely is the pipenv run command, which makes scripting things in a Makefile a ton easier. It also separates dependencies into my desired dependencies (Pipfile) and the complete set of resolved dependencies (Pipfile.lock). Then again, just about every tool other than pip does this.
Also, I’d be remiss not to point out that the pipenv project appears to be in quite a bit of flux. To borrow a cheesy movie line, it’s not dead, but it’s pretty badly hurt.
You don’t need the ability to split packages between production and development
This claim, made for example by Chris Warrick, is that installing development dependencies in production is fine because it:
…should only be a waste of HDD space and nothing more in a well-architected system.
Not even close. I write Python all over the place, including embedded systems, mobile devices, and other bizarre architectures. I also use things like jupyter to develop my code. I use numpy to generate test data. Some of these packages are pretty insane to compile and ship on non-mainline architectures, so in that case, I really need to split my development and deployment environments. It’s not a matter of disk space. It’s a vast investment of developer time getting development dependencies working someplace where they will never get used.
(There’s also the whole issue of attack surface, but this is Python, so you probably don’t want to open that particular Pandora’s box.)
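As a rough illustration of what the split buys you: a lockfile that keeps the two groups apart lets a deployment script ignore development-only packages entirely. The sketch below assumes the Pipfile.lock layout with top-level "default" and "develop" sections. The package names and versions are made up for the example, and production_pins is a hypothetical helper of mine, not part of pipenv.

```python
import json

# Hypothetical, heavily trimmed Pipfile.lock contents. Real lockfiles also
# carry a "_meta" section and per-package hashes.
LOCK_TEXT = """
{
  "default": {
    "requests": {"version": "==2.31.0"},
    "urllib3": {"version": "==2.0.7"}
  },
  "develop": {
    "jupyter": {"version": "==1.0.0"},
    "numpy": {"version": "==1.26.4"}
  }
}
"""


def production_pins(lock_text):
    """Return sorted 'name==version' pins from the 'default' section only."""
    lock = json.loads(lock_text)
    default = lock.get("default", {})
    return sorted(f"{name}{info['version']}" for name, info in default.items())


# Only the production packages show up; jupyter and numpy never get built
# on the target device.
pins = production_pins(LOCK_TEXT)
print("\n".join(pins))
```

The point is not the helper itself, just that the information needed to skip the expensive packages is already sitting in the lockfile when the tool records it.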
pip freeze is good enough
I’ll admit, pip freeze is much faster than dependency resolution. Since it dumps the current list of installed packages, it can sometimes be used to reproduce an environment (though probably not across platforms). The problem is, it doesn’t separate the dependencies in a key way: those that I want vs. the supporting dependencies that I need to get there. This is the idea behind lockfiles, which just about every package manager other than pip has chosen to adopt.
As a human being, I only ever want to maintain the first list: dependencies that I want, or need. I need jupyter, but I don’t want to maintain a list of all its dependencies. I really don’t want to know what it takes to compile scipy or scrapy, and please never make me peek under the hood of what it takes to get sage to compile.
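To make the "want" vs. "need" distinction concrete, here is a minimal sketch of what a lockfile-producing tool does for me. The dependency graph below is invented for illustration (it is not jupyter’s real dependency tree), and resolve is a hypothetical helper that ignores version constraints, extras, and platform markers.

```python
# Hypothetical dependency graph: package -> packages it directly requires.
REQUIRES = {
    "jupyter": ["notebook", "ipykernel"],
    "notebook": ["tornado", "jinja2"],
    "ipykernel": ["tornado", "traitlets"],
    "jinja2": ["markupsafe"],
    "tornado": [],
    "traitlets": [],
    "markupsafe": [],
}


def resolve(wanted, requires):
    """Return every package needed to install `wanted`, sorted by name."""
    needed = set()
    stack = list(wanted)
    while stack:
        name = stack.pop()
        if name in needed:
            continue  # already visited via another dependency path
        needed.add(name)
        stack.extend(requires.get(name, []))
    return sorted(needed)


wanted = ["jupyter"]                 # the list a human maintains
locked = resolve(wanted, REQUIRES)   # the list a tool should maintain
supporting = [p for p in locked if p not in wanted]

print("wanted:    ", wanted)
print("supporting:", supporting)
```

A pip freeze dump corresponds to the locked list alone. Once the wanted list is gone, there is no way to tell which entries I asked for and which just came along for the ride.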
pyproject.toml is the standardized way to do Python packaging.
Well, there is PEP 518, sure, and that is a standard. Unfortunately, the way that package managers (like Poetry) use these files is to put all the configuration in a custom section ([tool.poetry]). So it’s standard, but everyone’s standard is different. Cue Andrew Tanenbaum:
The nice thing about standards is that you have so many to choose from.
So take your pick. requirements.txt, setup.cfg, MANIFEST.in, Pipfile, environment.yml. They could all be thought of as standard.
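Here is a small sketch of what I mean about pyproject.toml. It parses a made-up Poetry-style file and separates the standardized [build-system] table from the tool-specific part. The project name, version, and dependency pins are hypothetical, and the example assumes Python 3.11+ for tomllib (or the tomli backport on older versions).

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: the tomli backport is installed

# Hypothetical pyproject.toml for a Poetry-managed project.
PYPROJECT = """
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry]
name = "example"
version = "0.1.0"

[tool.poetry.dependencies]
python = "^3.9"
requests = "^2.31"
"""

config = tomllib.loads(PYPROJECT)

# The standardized part: PEP 518 defines `requires`; PEP 517 adds `build-backend`.
standard_keys = sorted(config["build-system"])

# Everything under [tool.*] is owned by whichever tool claims that name.
tool_sections = sorted(config.get("tool", {}))
poetry_deps = config["tool"]["poetry"]["dependencies"]

print("standard keys:", standard_keys)
print("tool sections:", tool_sections)
print("dependencies, in Poetry's own syntax:", poetry_deps)
```

The file name and the [build-system] table are shared. The dependency list, which is the part I actually care about, lives in a table whose schema only one tool understands.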
In Short: Know Your Use Case
There are a lot of ways to do Python packaging. This is because there are a lot of different use cases out there. I think it’s important to know the strengths and weaknesses of each of the choices before heading out and advocating for one or the other. For instance, we standardized on conda in EasyData for a variety of reasons. It’s the best tool for our particular application right now, but we’re always ready to revisit that assumption. Hopefully you will be, too.
And for goodness’ sake, don’t go and invent your own environment manager until you’re positive the existing choices won’t do the trick for you. This list is long enough already.