Modern Python Environmenting with Pyenv and Pipenv

One of my favorite things about Python is the environmenting. I first learned virtualenv, then virtualenvwrapper, and now just pipenv, the current favored method to put a special location on top of the $PATH for a specific purpose.

PATH and venv/virtualenv explanation

Let’s talk a bit about how Python creates “environments.” People who are new to Python, and to code isolation in general, often compare a virtualenv to a virtual machine, a docker container, or a remote server. The truth is that nothing is actually isolated at all. A symlink to the Python interpreter you created the virtualenv with, as well as the location of any Python modules you have installed, are given to your current session as the FIRST place to look when executing or running anything.

We’ll get there, but let’s talk briefly about the $PATH, because I see it misunderstood even by professionals who have been working in technology for many years, and it’s important. The $PATH variable is an environment variable (a variable that has a value in your current session, like the one you get when you open up a new terminal) that is a _list of paths_, or a list of locations. Think of street addresses: if my address is 123 Main Street, and my friend’s address is 125 Main Street, you would have a reasonable guess as to where to find me. So in this analogy, let’s say that PATH="123 Main Street:125 Main Street". Your Mac or Linux machine works the same way, except the locations are directories like /usr/local/bin/ and /home/rachel/.local/bin and /sbin. The order of the directories is crucial: if the same executable exists in the first directory and in the third, your computer will STOP looking as soon as it finds the first one!
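To see the ordering rule in action, here’s a small sketch. The scratch directory and the command name whereami are made up for the demo; the only real tool involved is command -v, which reports the file the shell would actually run.

```shell
#!/bin/bash
# Sketch: two directories each hold an executable with the same (made-up) name.
scratch=$(mktemp -d)
mkdir -p "$scratch/first" "$scratch/third"

printf '#!/bin/sh\necho "I live in first"\n' > "$scratch/first/whereami"
printf '#!/bin/sh\necho "I live in third"\n' > "$scratch/third/whereami"
chmod +x "$scratch/first/whereami" "$scratch/third/whereami"

# Put both directories on the front of the PATH; "first" comes before "third".
PATH="$scratch/first:$scratch/third:$PATH"

# command -v reports which file the shell would run for this name.
found=$(command -v whereami)
output=$(whereami)
echo "$found"
echo "$output"
```

The copy in third/ is never reached, because the search stops at the first match.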

So, the way that all Python environmenting, to my knowledge, works, is to change the $PATH variable to put the Python stuff, for the desired venv, at the very front of the $PATH. We would also say “on top of the $PATH,” because this is absolutely analogous to a stack. So with the Python interpreter you’ve specified, and the Python modules you’ve installed, and the $PATH changed to look for these FIRST, we have a venv! It’s important to understand that it’s really all on the same machine and with the same level of access – but we just artificially make the desired stuff MORE available.
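Here’s a minimal sketch of that idea, not the real activate script (which also handles the prompt and a few other variables). The fake venv directory, its stand-in python, and the two function names are all invented for the demo.

```shell
#!/bin/bash
# Sketch of the core of "activating" an environment: remember the old PATH,
# then put the environment's bin/ directory at the very front.
# fake_venv and its "python" are stand-ins made up for this demo.
fake_venv=$(mktemp -d)
mkdir -p "$fake_venv/bin"
printf '#!/bin/sh\necho "python from the fake venv"\n' > "$fake_venv/bin/python"
chmod +x "$fake_venv/bin/python"

original_path="$PATH"

activate_sketch () {
    OLD_PATH="$PATH"          # keep the old value so it can be restored
    PATH="$1/bin:$PATH"       # the venv goes on top of the stack
}

deactivate_sketch () {
    PATH="$OLD_PATH"
    unset OLD_PATH
}

activate_sketch "$fake_venv"
active_python=$(command -v python)   # now resolves inside the fake venv
echo "$active_python"

deactivate_sketch
restored_path="$PATH"
```

Nothing was copied, sandboxed, or locked down. The only thing that changed, and then changed back, was the search order.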

Automate a bunch of virtualenvs at once

So, at my job, I’m working on improving onboarding. This is a really tall order with a lot of different vectors, but the technical aspects are all pretty exciting. First up, for whatever reason, I chose to work on how challenging it is to get a new engineer on our team going with a ton of different venvs. There’s a boatload of Python in our day-to-day, and laboriously setting up the venv for 10+ places is a pain, and error-prone. On top of that, we’ve been doing it the old-school way, with virtualenvwrapper (a tool I’ve loved for a long time but which is starting to go stale with newer Pythons [and newer versions of macOS]!), so weird things have begun to go wrong.

It’s not only time for an update to our methodology, it’s time to just abstract this problem away entirely. Folks on my team interact with Python but don’t write much of it. Still, we must use venvs all day every day for various tasks, because there’s just tons of tooling that’s been written in various versions of Python. So I wanted to write something that would just… DO IT. Something to take all the places where there should be a virtualenv, and make each one based on spec.

I went to a lot of places to research this. I thought about Makefiles (really not the right tool), thought about Docker (make an image, not committed to the repo*, for every single one??? no), looked up some Python docs, and finally stumbled across this fabulous blog post, https://www.rootstrap.com/blog/how-to-manage-your-python-projects-with-pipenv-pyenv/ by Bruno Michetti, detailing how to make pyenv and pipenv play nicely together. Edit: I assembled it all in bash, because that’s the system scripting language I’m most comfortable with. It’s not in Python, sorry! Mentally I consider this level of abstraction to be another level above Python, though it may be possible for this to bootstrap itself – I don’t really know.

Pyenv is a really lovely abstraction to download whaaaaatever version of Python you want, and switch amongst them very easily without having to know its location, whether or not it’s a symlink, yadda yadda.

Pipenv is a tool that bundles pip, which is the Python module install tool, with virtualenv! How sensible, given that the workflow previous to the existence of pipenv was to make a virtualenv, activate the virtualenv, and then pip install everything in requirements.txt. Pipenv condenses that with some nice bells and whistles besides.

The remaining challenge was getting it all into bash so the user could just run one thing, and organizing the script comprehensibly.

*not committed to the repo because this is not a task where I’m trying to convince fifteen teams to generate their module in a completely new way

The code

I typed this out by hand, and I hope you like it! It’s for you if you too have a ton of virtualenvs to install at once, or want to give this to a team to use once you fill in the blanks: the repo location, and the repos themselves (Python modules using various Python interpreters). The only thing that’s not real are the repo names. It would be cool to get these dynamically – if you know how this could be done, based on a requirements file or something else, please holler in the comments!

#!/bin/bash

repo_location=$HOME/repos

# brew install pyenv & pipenv
brewing () {
    brew install pyenv
    brew install pipenv
    brew update && brew upgrade pyenv && brew upgrade pipenv
}

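# some tasks for pyenv to feel happy
pyenv_setup () {
    # note the spaces around ==; without them the test is always true
    if [[ "$PIPENV_VENV_IN_PROJECT" == "1" ]]; then
        echo "environment set up for pyenv already, probably"
    else
        # single quotes: these lines land in ~/.zshrc literally, for future zsh sessions
        echo 'if command -v pyenv 1>/dev/null 2>&1; then' >> ~/.zshrc
        echo '  eval "$(pyenv init -)"' >> ~/.zshrc
        echo 'fi' >> ~/.zshrc
        echo 'export PIPENV_VENV_IN_PROJECT=1' >> ~/.zshrc
        # this script runs under bash, so set up the current session directly
        # rather than sourcing a zsh config file
        eval "$(pyenv init -)"
        export PIPENV_VENV_IN_PROJECT=1
    fi

    # if these already exist, the script will ask if you want to download them again
    pyenv install 2.7.18
    pyenv install 3.8.13
    pyenv install 3.9.13
}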

# now the juicy stuff!  repo names are fake.  eat your vegetables!
venv_creation () {
    cd "$repo_location" || exit 1

    # py 3.8 envs:
    three_eights="zucchini rutabaga cabbage"
    for i in $three_eights; do
        cd "$i" || continue   # skip a missing repo rather than install in the wrong dir
        pyenv local 3.8.13
        pipenv install --python 3.8 -r requirements.txt
        cd "$repo_location" || exit 1
    done
    cd "$repo_location"   # can you tell I've been scarred by bad dir management

    # py 3.9 envs:
    three_nines="cauliflower radicchio spinach"
    for i in $three_nines; do
        cd "$i" || continue
        pyenv local 3.9.13
        pipenv install --python 3.9 -r requirements.txt
        cd "$repo_location" || exit 1
    done
    cd "$repo_location"
}

brewing
pyenv_setup
venv_creation
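
On the question of getting the repo list dynamically, here is one possible sketch. It assumes each repo already has a requirements.txt and a committed .python-version file (the file that pyenv local writes); that is an assumption about your repos, not something the script above guarantees. The function name discover_envs and the demo directory tree are made up.

```shell
#!/bin/bash
# Sketch: print "<repo> <python version>" for every directory under a root
# that has both requirements.txt and .python-version.
discover_envs () {
    local root=$1 dir
    for dir in "$root"/*/; do
        [ -f "${dir}requirements.txt" ] || continue
        [ -f "${dir}.python-version" ] || continue
        echo "$(basename "$dir") $(head -n 1 "${dir}.python-version")"
    done
}

# Demo against a throwaway tree (names are fake, like the vegetables above).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/zucchini" "$demo_root/spinach" "$demo_root/not-python"
touch "$demo_root/zucchini/requirements.txt" "$demo_root/spinach/requirements.txt"
echo "3.8.13" > "$demo_root/zucchini/.python-version"
echo "3.9.13" > "$demo_root/spinach/.python-version"

discovered=$(discover_envs "$demo_root")
echo "$discovered"
```

Each output line could then drive one pass of the loop body, reading the repo name and version with something like while read -r repo version; do ... done.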

Mongo migration

For the past few months I’ve been at a terrific job, doing devops at a small SaaS company. Real quick, SaaS means “Software as a Service” & refers to companies that have a webapp that they either sell access to and/or set up a version of for their customers. There are a lot of challenges with doing devops for a company like this, trying to find the balance between the heavyweight solutions and the latest and greatest to find what’s right for us, all the while (personally speaking) doing a LOT of learning on the topic. That’s not to say that heavyweight versus the latest&greatest are opposed; there are a few more weights on that spinning disk, not the least of which is “what we were doing before was …”.

So what I’ve been working on for the last few weeks, somewhere between the old solution & the new hotness, has been a Mongo problem. We deal in data that must be scrubbed before we analyze it. So the way that works, is that the host captures data, then scrubs ALL data there, and then sends it on to our long-term storage database, and then all local data on that host is removed after a couple days. What we’ll do with all of this in five, ten years will hopefully be the subject of another post, but for now we are only dealing with about 30GB of data in the long-term storage DB, collected over the last couple years. Let’s call that “Storeo,” and the hosts that they come from “partner databases,” which is true enough.

We’ve developed a couple of schemas for Storeo, and we only upgrade our partners from one to the next with code releases. So we have a couple old versions of Storeo kicking around. The next piece of this story is that we have an analytics dashboard set up for each partner, which pulls from Storeo, based on a domain field in the data we get from each partner. There’s one for each version of Storeo that they (and we) have to refer to, which means multiple dashboards just to get all the info! So that’s foolish, yeah? As a result, a previous engineer wrote a Mongo migration script to migrate all data from version 1 to 2, and then from version 2 to 3, the current version. So there are two steps to this – first, to migrate all the legacy data up to the current version so everything can be analyzed in the same way, and second, to do this regularly so even if partners are using older versions, we roll that data up so there is ONE source of truth for all their data.

As happens occasionally, no one can quite remember how I got this project, but it’s been a ride. Mostly good, occasionally “how the hell does Mongo even work?”. Some of the problems I’ve gone through have been of a Mongo nature, some of them of a sysadmin nature, some of them just basic DBA. Many of these steps might make you scream, but I’m cataloguing them because I want to try to get down what all I’ve done and learned. When you are self-taught, your education comes in fits and starts, in no particular order, and sometimes in an infuriatingly out-of-order one. So I’m going to do my best to show you all the things I did wrong, too.

Problem 1 – Where to Test

I wanted to test the migration locally, not on the production Storeo server, which continues to receive data from all our partner databases. First, I fired up the mongodump docs and tried that. Well, I nearly immediately ran out of room, and deleted that dump/ directory with those contents. When I looked around with df -h /, a command which shows disk usage for the root filesystem in human-readable units, the output was that there were only a couple gigs left. Well, I knew that dumping a 15GB database wasn’t going to work locally. So I investigated a lot of other options, like sending the mongodump to another server (technically possible), or SSHing into the server but sending all dumped data to my local machine, which has plenty of space on it. This probably took a couple days of investigation between other tasks.

None of this really panned out (but I still think it should have), and my boss let me know that there’s a 300GB volume attached to Storeo, and I said, wait, but I didn’t see that, I looked for something like that, and they gently let me know not to give df any arguments in order to see all disks mounted on a server. With that, a df -h showed me the 300GB volume, mounted on /var/lib! Excellent. On a practical note, it’s extremely sensible to have all the data for your application stored on a volume rather than on some enormously provisioned server. When you use AWS, one volume is much the same as the next, so putting databases on their own volumes is pretty sensible. Keep your basic server’s disk very bare bones, put more complex stuff on modular disks that you can move around if you need.
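
The difference between the two invocations is easy to see side by side. This is just a sketch to run on any machine; the output depends entirely on what is mounted there.

```shell
#!/bin/bash
# df with a path reports only the filesystem that holds that path.
root_only=$(df -h /)

# df with no path lists every mounted filesystem, which is where an
# attached volume (like the one on /var/lib in this story) shows up.
everything=$(df -h)

echo "$root_only"
echo "---"
echo "$everything"
```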

So with that!! I made a directory for myself there to separate from the production stuff, confirmed that mongodump/mongorestore do NOT interrupt read/write operations, and made mongodumps of versions 1, 2 and 3. This took.. maybe an hour. Then, because they were still quite large (Mongo is very jealous of disk space), I tarballed & gzipped them to reduce them down to half a gig or so. We use magic-wormhole all the time at work (available with a quick pip install magic-wormhole [assuming you have Python and pip installed {but it doesn’t have to be just a Python thing, just like I use ag and that’s a super Perl-y tool}]) so I sent these tarballs to my local machine, untarred/ungzipped, and mongorestored to the versions of Storeo 1, 2, & 3 that I have locally to run our app on my own machine. This probably, with carefulness and lots of reading, took another couple hours. At this point we’re probably a week in.
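
The compress, transfer, unpack round trip looks roughly like this. It’s a sketch with a throwaway directory standing in for a real mongodump output, and a plain cp standing in for the magic-wormhole transfer (wormhole send on one side, wormhole receive on the other); the file names are invented.

```shell
#!/bin/bash
# Sketch of the round trip, using a throwaway directory in place of a
# real mongodump output directory.
work=$(mktemp -d)
mkdir -p "$work/server/dump/storeo1" "$work/laptop"
echo "pretend this is BSON" > "$work/server/dump/storeo1/collection_1.bson"

# On the server: tar and gzip the dump directory in one step.
tar -czf "$work/server/storeo1.tar.gz" -C "$work/server" dump

# The real transfer went over magic-wormhole; cp stands in for it here.
cp "$work/server/storeo1.tar.gz" "$work/laptop/"

# On the local machine: unpack, ready for mongorestore.
tar -xzf "$work/laptop/storeo1.tar.gz" -C "$work/laptop"
restored=$(cat "$work/laptop/dump/storeo1/collection_1.bson")
echo "$restored"
```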

Problem 2 – How to Test

At this point, I finally started testing the migration itself since everything was a safe copy and totally destructible. Also I retained the tarballs in case I ended up wanting to drop the database or fiddle with it in some unrecoverable way. I took a count of the documents being migrated, and of the space taken up by each DB (which was different than on prod – I thought until this week that those sizes should be constant from prod-mongodump-tarball-mongorestore, but that’s not true – apparently most databases are wiggly with their sizing). The migration script is a javascript script (how do you even say that) that you feed into mongo like so mongo migration1-to-2.js, within which you define dbSource and dbTarget. The source, in this case, is version 1 of Storeo, and the target is version 2. Each of these is a distinct database managed by Mongo. With great trepidation, I did iiiit. Ok, I’ve left a piece out. I, um, didn’t know how to run JS. Googling said “oh just give the path to the browser!” so I did and, uh – that didn’t work. You may be saying “Duh.” Look, I’ve never done any front-end at all, and have never touched javascript outside that Codecademy series I did on here a couple years back. With my tail between my legs I asked my boss again, & was told about the above, just mongo filename.js.

The script took three hours!! Gah! So I ran the next one, which took SEVEN (since it contained everything from the first one, too), and regular attention to the ssh session so I didn’t lose the process (don’t worry, linux-loving friends, I’ll get there, just keep reading). These two migrations took two business days. At this point, we started talking to the team who manages the data analysis dashboards for our partners to talk about some of the complexities. Because a) this isn’t a tool from Mongo, there are no public docs on it and b) you can only test Storeo performance after the data has been scrubbed and sent, even locally, we decided to set up a few demo servers to point to test versions of the database.

Remember the volume attached to Storeo on production? Whoo! I logged onto Storeo and learned a ton more about mongodump & mongorestore, and made teststoreo1, teststoreo2, and teststoreo3, exact mongodump/restore copies of versions 1, 2 & 3 of Storeo. Their sizes, again, were different, but we’ve learned that that’s ok! Mongo has a lot of guarantees, but space management isn’t one of them, so pack extra disk and we’ll be fine. This took a lot of googling and careful testing, because the last thing I wanted to do was mongorestore back into the place I’d mongodumped from. At the time I wasn’t sure if mongorestore overwrites the disk entirely, and wanted to be cautious about potential lost data. So, make the directory, then mongodump into it while specifying the database. Then restore into a new database (with the same name as the directory you’ve just made – this isn’t mandatory but made it easier to trace) while feeding it the path where the mongodump lives.

mkdir teststoreo1 # make the directory
mongodump -d storeo1 -o teststoreo1/ # dump the database named storeo1 into the dir we just made
... # this takes some time, depending of course on the size
mongorestore -d teststoreo1 teststoreo1/storeo1 # without -o the dump lands in dump/, so this path would start with dump/

So after doing this for the other two Storeo databases as well, a show dbs command in the Mongo shell outputs all three production Storeos, as well as all three test Storeos. This meant we were in a good place to do some final testing. There were a few more meetings assessing risk and the complexity of all the pieces of our infrastructure that touch Storeo, as you do. Because the function of Storeo is to continually take in stripped data, I had to ensure that we weren’t going to lose information being sent during the migration. The migration script isn’t an officially supported tool but something that we wrote in-house (I hadn’t been able to find a tool that moves data from one mongo DB to another), so it’s hard to know what will and won’t impact production. So I set up one of our demo servers to send its stripped data to teststoreo1, and then kicked off the migration from teststoreo1 to teststoreo2 to make sure there was no data loss. On that demo server, while the migration was migratin’, I made a bunch of new dummy data that I’d be able to trace back to this demo server. A few hours later, when the 1-to-2 migration was complete, sure enough there were a handful of documents in teststoreo1 that were new – they’d been held & NOT sent! With this, I was very happy with the migration script.

So I kicked off the following script with mongo migrate1-2.js, suspended the process with ctrl-z, and resumed it in the background (after identifying it as job 1) with bg %1, so it wouldn’t be interrupted by my leaving the session (see?).

'use strict';

var dbSource = connect("localhost/storeo1");
var dbTarget = connect("localhost/storeo2");

// The migration process could take so long that new documents may be created
// while the script is still running. We will move only the ones created
// before the start of the process
var now = new ISODate();

dbSource.collection_1.find().forEach(function(elem){
    elem.schemaVersion = 2; // this means each element is given the NEW schema version
    dbTarget.collection_1.insert(elem);
});

dbSource.collection_2.find({createTime: {$lt: now}}).forEach(function(elem){
    elem.schemaVersion = 2;
    dbTarget.collection_2.insert(elem);
});

dbSource.collection_3.find({timestamp: {$lt: now}}).forEach(function(elem){
    elem.schemaVersion = 2;
    dbTarget.collection_3.insert(elem);
});


dbSource.collection_1.remove({}); // this collection did not have a timestamp, so anything added to it after the copy above is removed too
dbSource.collection_2.remove({createTime: {$lt: now}});
dbSource.collection_3.remove({timestamp: {$lt: now}});

The second script was the same but for the definitions of dbSource and dbTarget to storeo2 and storeo3, respectively. As with the testing, the first one took about three hours, the second, seven. With each one, I kicked it off, then put it in the background, then checked on it… later. Because it’d been backgrounded (that’s a verb, sure), it wasn’t quiiiiite possible to tell when it was done. That could be fixed with some kind of output at the end of the script, but that’s not how I did it!
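
For the “is it done yet?” problem, one option is to start the job under nohup with its output sent to a log, and have it write a final line when it finishes. This is a sketch of that alternative, not what I ran: sleep 1 stands in for the real mongo migrate1-2.js command, and the log file is a temp file made up for the demo.

```shell
#!/bin/bash
# Sketch: run a long job so a hangup does not kill it, and leave a visible
# marker in a log when it finishes. "sleep 1" stands in for the real
# migration command; the log location is made up.
log=$(mktemp)

nohup sh -c 'sleep 1; echo "migration finished with status $?"' > "$log" 2>&1 &
job_pid=$!

# The demo just waits for the job. In real life you would log out, come
# back later, and look at the end of the log instead.
wait "$job_pid"
last_line=$(tail -n 1 "$log")
echo "$last_line"
```

For a job that is already running in the background, bash’s disown builtin is the usual way to detach it from the session after the fact.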

Then I set up a lil cron job there at the end to regularly move data from 1 to 2, and once that had run for the first time, then I set up the second cron job to move it from 2 to 3.
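
Since each migration can run for hours, a cron job like this is worth guarding so that a second run can’t start on top of the first. Below is a sketch of a wrapper that a crontab entry could call; the function name run_once, the lock location, and the stand-in command are all assumptions for the demo, and the actual schedule I used isn’t shown here.

```shell
#!/bin/bash
# Sketch of a wrapper a cron entry could call. A lock directory keeps a
# second run from starting while a long migration is still going.
# A real setup would use a fixed lock path; the demo uses a temp dir.
lock_root=$(mktemp -d)
lock_dir="$lock_root/migration.lock"

run_once () {
    # mkdir is atomic: it succeeds for exactly one caller.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "skipped: previous run still in progress"
        return 0
    fi
    "$@"                      # e.g. mongo migrate1-2.js
    local status=$?
    rmdir "$lock_dir"
    return "$status"
}

first=$(run_once echo "migrating")
mkdir "$lock_dir"             # simulate a run that is still in progress
second=$(run_once echo "migrating")
rmdir "$lock_dir"
echo "$first"
echo "$second"
```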

Who wants to talk about Mongo????????