Mongo migration

For the past few months I’ve been at a terrific job, doing devops at a small SaaS company. Real quick: SaaS means “Software as a Service” and refers to companies with a webapp that they sell access to and/or set up a version of for their customers. There are a lot of challenges in doing devops for a company like this: finding the balance between the heavyweight solutions and the latest and greatest, figuring out what’s right for us, all the while (personally speaking) doing a LOT of learning on the topic. That’s not to say that heavyweight and latest & greatest are opposed; there are a few more weights on that spinning disk, not the least of which is “what we were doing before was …”.

So what I’ve been working on for the last few weeks, somewhere between the old solution and the new hotness, has been a Mongo problem. We deal in data that must be scrubbed before we analyze it. The way that works is that each host captures data, scrubs ALL of it there, and sends it on to our long-term storage database; all local data on that host is removed after a couple of days. What we’ll do with all of this in five or ten years will hopefully be the subject of another post, but for now we are only dealing with about 30GB of data in the long-term storage DB, collected over the last couple of years. Let’s call that database “Storeo,” and the hosts the data comes from “partner databases,” which is true enough.

We’ve developed a couple of schemas for Storeo, and we only upgrade our partners from one to the next with code releases. So we have a couple of old versions of Storeo kicking around. The next piece of this story is that we have an analytics dashboard set up for each partner, which pulls from Storeo based on a domain field in the data we get from each partner. There’s one dashboard for each version of Storeo that they (and we) have to refer to, which means multiple dashboards just to get all the info! So that’s foolish, yeah? As a result, a previous engineer wrote a Mongo migration script to migrate all data from version 1 to 2, and another from version 2 to 3, the current version. So there are two steps to this – first, migrate all the legacy data up to the current version so everything can be analyzed in the same way; second, do this regularly, so that even if partners are using older versions, we roll that data up and there is ONE source of truth for all their data.

As happens occasionally, no one can quite remember how I got this project, but it’s been a ride. Mostly good, occasionally “how the hell does Mongo even work?”. Some of the problems I’ve gone through have been of a Mongo nature, some of a sysadmin nature, some just basic DBA. Many of these steps might make you scream, but I’m cataloguing them because I want to get down everything I’ve done and learned. When you are self-taught, your education comes in fits and starts, in no particular, and sometimes infuriating (out of), order. So I’m going to do my best to show you all the things I did wrong, too.

Problem 1 – Where to Test

I wanted to test the migration locally, not on the production Storeo server, which continues to receive data from all our partner databases. First, I fired up the mongodump docs and tried that. Well, I nearly immediately ran out of room, and deleted that dump/ directory and its contents. When I looked around with df -h /, a command which shows you the disk usage of the root filesystem, human-readable, the output showed there were only a couple of gigs left. I knew that dumping a 15GB database wasn’t going to work locally. So I investigated a lot of other options, like sending the mongodump to another server (technically possible), or SSHing into the server but sending all the dumped data to my local machine, which has plenty of space on it. This probably took a couple of days of investigation between other tasks.

None of this really panned out (but I still think it should have), and my boss let me know that there’s a 300GB volume attached to Storeo. I said, wait, I didn’t see that, I looked for something like that, and they gently let me know not to give df any arguments if you want to see all the disks mounted on a server. With that, a df -h showed me the 300GB volume, mounted on /var/lib! Excellent. On a practical note, it’s extremely sensible to have all the data for your application stored on a volume rather than on some enormously provisioned server. When you use AWS, one volume is much the same as the next, so putting databases on their own volumes is pretty sensible. Keep your basic server’s disk very bare-bones, and put the more complex stuff on modular disks that you can move around if you need to.
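
To spell out the difference that tripped me up:

df -h /    # just the root filesystem: this is what I ran, and why the volume was invisible to me
df -h      # no argument: every mounted filesystem, including the 300GB volume on /var/lib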

So with that!! I made a directory for myself there to keep separate from the production stuff, confirmed that mongodump/mongorestore do NOT interrupt read/write operations, and made mongodumps of versions 1, 2, and 3. This took… maybe an hour. Then, because they were still quite large (Mongo is very jealous of disk space), I tarballed and gzipped them down to half a gig or so. We use magic-wormhole all the time at work (available with a quick pip install magic-wormhole [assuming you have Python and pip installed {but it doesn’t have to be just a Python thing, just like I use ag and that’s a super Perl-y tool}]), so I sent these tarballs to my local machine, untarred/ungzipped them, and mongorestored them into the versions of Storeo 1, 2, & 3 that I have locally to run our app on my own machine. This probably, with carefulness and lots of reading, took another couple of hours. At this point we’re probably a week in.
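
The whole round trip looked roughly like this (directory and database names here are illustrative):

# on the Storeo server, in my own directory on the big volume
mongodump -d storeo1 -o storeo1-dump/
tar czf storeo1-dump.tgz storeo1-dump/
wormhole send storeo1-dump.tgz        # prints a code phrase for the receiving end

# on my local machine
wormhole receive                      # paste the code phrase, get the tarball
tar xzf storeo1-dump.tgz
mongorestore -d storeo1 storeo1-dump/storeo1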

Problem 2 – How to Test

At this point, I finally started testing the migration itself, since everything was a safe copy and totally destructible. I also retained the tarballs in case I ended up wanting to drop the database or fiddle with it in some unrecoverable way. I took a count of the documents being migrated, and of the space taken up by each DB (which was different than on prod – I thought until this week that those sizes should be constant from prod to mongodump to tarball to mongorestore, but that’s not true – apparently most databases are wiggly with their sizing). The migration script is a javascript script (how do you even say that) that you feed into mongo like so: mongo migration1-to-2.js. Within it you define dbSource and dbTarget. The source, in this case, is version 1 of Storeo, and the target is version 2. Each of these is a distinct database managed by Mongo. With great trepidation, I did iiiit. Ok, I’ve left a piece out. I, um, didn’t know how to run JS. Googling said “oh just give the path to the browser!” so I did and, uh – that didn’t work. You may be saying “Duh.” Look, I’ve never done any front-end at all, and have never touched javascript outside that Codecademy series I did on here a couple years back. With my tail between my legs I asked my boss again, and was told about the above: just mongo filename.js.
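
Getting those counts is quick in the mongo shell, for the record (the collection names are the same ones the migration script uses):

use storeo1
db.collection_1.count()
db.collection_2.count()
db.collection_3.count()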

The script took three hours!! Gah! So I ran the next one, which took SEVEN (since it contained everything from the first one, too), and required regular attention to the ssh session so I didn’t lose the process (don’t worry, linux-loving friends, I’ll get there, just keep reading). These two migrations took two business days. At this point, we started talking to the team who manages the data-analysis dashboards for our partners about some of the complexities. Because a) this isn’t a tool from Mongo, so there are no public docs on it, and b) you can only test Storeo performance after the data has been scrubbed and sent, even locally, we decided to set up a few demo servers to point to test versions of the database.

Remember the volume attached to Storeo on production? Whoo! I logged onto Storeo, learned a ton more about mongodump and mongorestore, and made teststoreo1, teststoreo2, and teststoreo3: exact mongodump/mongorestore copies of versions 1, 2, and 3 of Storeo. Their sizes, again, were different, but we’ve learned that that’s ok! Mongo has a lot of guarantees; space management isn’t one of them, so pack extra disk and we’ll be fine. This took a lot of googling and careful testing, because the last thing I wanted to do was mongorestore back into the place I’d mongodumped from – at the time I wasn’t sure whether mongorestore overwrites the disk entirely, and I wanted to be cautious about potential data loss. So: make the directory, mongodump into it while specifying the database, then restore into a new database (with the same name as the directory you’ve just made – this isn’t mandatory, but it made things easier to trace) while feeding it the path where the mongodump lives.

mkdir teststoreo1 # make the directory
mongodump -d storeo1 -o teststoreo1/ # dump the database named storeo1 into the dir we just made (-o is the output directory)
... # this takes some time, depending of course on the size
mongorestore -d teststoreo1 teststoreo1/storeo1 # there could be a dump/ in front of this end path

So after doing this for the other two Storeo databases as well, a show dbs command in the Mongo shell output all three production Storeos, as well as all three test Storeos. This meant we were in a good place to do some final testing. There were a few more meetings assessing risk and the complexity of all the pieces of our infrastructure that touch Storeo, how you do. Because the function of Storeo is to continually take in stripped data, I had to ensure that we weren’t going to lose information sent during the migration. Because it’s not an officially supported tool but something we wrote in-house (I hadn’t been able to find an existing tool that moves data from one Mongo DB to another), it’s hard to know what will and won’t impact production. So I set up one of our demo servers to send its stripped data to teststoreo1, and then kicked off the migration from teststoreo1 to teststoreo2 to make sure there was no data loss. On that demo server, while the migration was migratin’, I made a bunch of new dummy data that I’d be able to trace back to that demo server. A few hours later, when the 1-to-2 migration was complete, sure enough there were a handful of documents in teststoreo1 that were new – they’d been held & NOT sent! With this, I was very happy with the migration script.

So I kicked off the following script with mongo migrate1-2.js, suspended the process with ctrl-z, and put it in the background (after identifying it as job 1) with bg %1, so I could keep an eye on it without tying up my session (see?).

'use strict';

var dbSource = connect("localhost/storeo1");
var dbTarget = connect("localhost/storeo2");

// The migration process could take so long that new documents may be created
// while the script is still running. We will move only the ones created
// before the start of the process
var now = new ISODate();

dbSource.collection_1.find().forEach(function(elem){
    elem.schemaVersion = 2; // this means each element is given the NEW schema version
    dbTarget.collection_1.insert(elem);
});

dbSource.collection_2.find({createTime: {$lt: now}}).forEach(function(elem){
    elem.schemaVersion = 2;
    dbTarget.collection_2.insert(elem);
});

dbSource.collection_3.find({timestamp: {$lt: now}}).forEach(function(elem){
    elem.schemaVersion = 2;
    dbTarget.collection_3.insert(elem);
});


dbSource.collection_1.remove({}); // no timestamp on this collection, so everything is moved and removed (careful: anything inserted mid-run could be removed without having been copied)
dbSource.collection_2.remove({createTime: {$lt: now}});
dbSource.collection_3.remove({timestamp: {$lt: now}});

The second script was the same, but with dbSource and dbTarget defined as storeo2 and storeo3, respectively. As with the testing, the first one took about three hours; the second, seven. With each one, I kicked it off, put it in the background, then checked on it… later. Because it’d been backgrounded (that’s a verb, sure), it wasn’t quiiiiite possible to tell when it was done. That could be fixed with some kind of output at the end of the script, but that’s not how I did it!
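
If I were doing it again, something like this would keep the job alive and leave a trail (a sketch, not what I actually ran):

nohup mongo migrate1-2.js > migrate1-2.log 2>&1 &   # immune to the ssh session hanging up
tail -f migrate1-2.log                              # plus a print() at the end of the script to mark completion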

Then I set up a lil cron job there at the end to regularly move data from 1 to 2, and once that had run for the first time, I set up the second cron job to move it from 2 to 3.
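
The crontab entries looked something like this (schedule and paths illustrative, not our real ones):

0 2 * * * /usr/bin/mongo /opt/storeo/migrate1-2.js >> /var/log/migrate1-2.log 2>&1
0 6 * * * /usr/bin/mongo /opt/storeo/migrate2-3.js >> /var/log/migrate2-3.log 2>&1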

Who wants to talk about Mongo????????

Exploring Dockerfiles

I’d like to continue the previous entry on Docker a little further. Last time we talked about the installation process & a little more, so this time we’re going to talk about the next part of getting started with Docker – writing a Dockerfile.

Here’s what we talked about last time, and with one odd little exception (why did I promise to talk about load testing…) we’re going to cover all these things!

So, next steps, make the container persistent – it isn’t yet, and play around with Dockerfiles, and just do a little more spying on the produced container itself & probably try to do some babby’s frist load testing things in there & spy on the container as a process without the box & all its processes within!

First let’s take a look at the Docker process we created last time. Just like at your native command line, docker commands resemble low-level Linux commands, so just as you’d use ps to look at the processes running at any given time on your machine, you can use docker ps to see all the Docker processes it is managing at any given time. If you followed along last time you’ll see some that have been exited but which you don’t have access to – each time you run docker run -it <image> bash you get a new process. But the old ones are still there! The -a (all) flag will show us these Exited boxes: docker ps -a.

rachel $ docker ps -a
CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS                     PORTS               NAMES
b5de9583d7b3        fedora                     "bash"                   10 minutes ago      Exited (0) 3 seconds ago                       pedantic_morse
35192bfa05d4        images/cowsay-dockerfile   "/usr/games/cowsay *P"   2 hours ago         Exited (0) 2 hours ago                         gigantic_goldberg
a0e40d55125a        images/cowsayimage         "/usr/games/cowsay 'D"   3 hours ago         Exited (0) 3 hours ago                         jovial_mcnulty
d32381833772        debian                     "bash"                   3 hours ago         Exited (0) 3 hours ago                         cowsay

You’ll notice a few things: first, that the names are all adjective_noun pairs, except one – the cowsay container example is from the excellent Using Docker, where I’ve gained a lot of my recent Docker information. Their status is all Exited. Some of the container-specific commands are similar to the init.d service commands, like start, stop, and rm, so let’s start the desired container in that list up there. The container we’re going to start up, pedantic_morse, is similar to the one we made before, a Fedora box, though it’s true that I only made it ~10 minutes ago!

docker start pedantic_morse

So now the output of docker ps includes the container we just started. So how do we keep it? We commit it, just like with Git! Replace pedantic_morse with whatever name yours has been assigned beneath the NAMES column.

rachel $ docker commit pedantic_morse images/morse
sha256:b398fe28d7fd26a52e0947fc8eebb7614b8a8d6d19a5332359df167c9296c04f

So what we’ve done here is create an image from which we can create containers. images/morse is the image; pedantic_morse is the Docker process we crafted it from. Every time we run the image images/morse, Docker creates a new container, so at this point we still don’t have ONE persistent, evolving container; HOWEVER, we can use this image to perform one-offs.
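
For instance (the echo here is just my stand-in for any one-off command):

docker run images/morse echo "hello from a throwaway container"   # new container: runs, prints, exits
docker run -it images/morse bash                                  # also a brand-new container, now interactive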

Clearly we’re not getting into the strength of Docker, yet. So now it’s time for a very basic Dockerfile. Just like Vagrantfile and Procfile & probably a few other similarly intended setup files, the D in Dockerfile is capitalized and there’s no extension to it, because remember – Linux doesn’t care about file extensions!

The main piece to know with Dockerfiles is that their syntax can be as minimal as you like, and personally I recommend keeping them non-complex: major structural pieces only, and insert kickoff scripts or use some config management in the container itself for anything much more complicated. I reserve the right to change my mind on this later! And this is also more for next time to learn. But the way it looks, the RUN instruction will run any shell command you put in it, though if you need anything more complex, the contents become a lot murkier, in my opinion. Simple is better than complex, but complex is better than complicated, so let’s do what we need to here.

For posterity and a simplistic example, here’s the first Dockerfile I ever wrote. (ed note: I trimmed this down because each instruction in a Dockerfile creates a new filesystem layer – try to keep the number of Dockerfile lines down as much as possible)

FROM fedora:23
RUN /bin/bash
RUN echo "the dockerfile took!"

RUN dnf install -y wget tar man

MAINTAINER Rachel!

The output of this, which is a bit long to post, pulls down version 23 of Fedora, runs /bin/bash (which, in a RUN instruction, just starts and exits – it doesn’t make the later commands run under bash, since RUN uses /bin/sh -c by default), prints “the dockerfile took!” to stdout, and then installs those three packages. I was unsure at first why those aren’t present in a base Fedora image (base images are kept deliberately minimal), but it doesn’t appear to be related to what I’m working on in this blog post, so we’ll leave it be for now.
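
To see it in action, build an image from the Dockerfile and run a one-off against it (the tag name here is just my choice):

docker build -t images/morse-file .            # run from the directory containing the Dockerfile
docker run images/morse-file wget --version    # proof that the dnf install took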

This is about ten times longer than I thought it would be, woohoo! I hope you learned something, please please let me know if I’ve missed the mark on anything, cheers!

Tune in next time and we’ll talk about a more complicated Dockerfile, and syncing it up to… something 🙂 come back and you’ll find out what!

Aspirations toward Gittry

Over the last month, we’ve been working hard to finish up the coding and environment for, and finally the filming of, a series of tutorial videos. It’s something we’ve been working on at my job since, solidly, May, and but for a few release-based loose ends (“will our requirements.txt file really work with pip? why isn’t it working over … THERE?” etc etc), the project is over, and my contract is coming to a close. So I have a few projects I’d like to work on, AND NATURALLY, document for you!

First, let me point you to my website, which I have hugely upgraded. I’ve got a style sheet, which I first applied just to the main page, then to the resume, which I also updated a bit, though that’s tricky since each position I apply for calls for a subtly different set of information.
therachelkelly.com
Regardless, I’m proud of the small, attractive changes I’ve made. Next up is to get a handle on some bootstrap and cherry-pick pieces of it, like the nav bar and a few other nice ideas.

Next, I’d like to run through another codecademy class, maybe the advanced web design one, but what’s more likely is the API-manipulation course. At some point soon I’d like to begin a project where I get a couple of the open APIs out there to talk to each other. My intention is not to re-invent the wheel, but to get a look at its inner workings myself!

I’m also about 60 pages through Jon Loeliger and Matthew McCullough’s Version Control with Git, 2nd ed., which is only about a year old, so quite up to date. As I’ve said, I want to be a Git wizard, and to earn that pointy, star-covered hat it’s time to take a deep dive. It’s extremely exciting to me that I can read this book: when he (it seems like it’s mostly Loeliger’s game) says “The first number, 100644, represents the file attributes of the object… [and] should be familiar to anyone who has used the Unix chmod command,” well, I am familiar! I am familiar with chmod, Unix, and so much more relating to this topic! Wow! This is not to say that chmod is a particularly difficult concept, but it is NOT a terribly entry-level topic either – I am rather beyond entry-level knowledge in many topics, and that is enormously gratifying.

brief aside: chmod refers to the command which sets the permissions of a given file or directory. want to write more on this in the future, because I still haven’t found a super terse explanation.
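
the tersest version i’ve got so far, for what it’s worth:

chmod 644 yourfile   # 6 = read+write for the owner, 4 = read-only for the group, 4 = read-only for everyone else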

So this Git book is great. They’ve already referenced someone that I KNOW, so that’s charming and a bit surreal. I suppose living in (the extremely small town of) Portland and being as active in the communit(y/ies) as I am, it’s bound to happen that I’ll meet or already know some People. But on to it – the book practically begins with SHA1s, the hash that Git assigns to a unique object. Did you know that if your file says the same thing that my file says, the SHA1 of the two blobs will be identical, regardless of what either file is named (the name lives in the tree, not the blob)? WILD. Mind = blown. Apparently there is (very infrequently) a concern of “collision,” two different objects yielding the same hash, but as a SHA1 has approximately 2^160 possible values, that’s pretty unlikely.
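
You can check this yourself with git’s plumbing, anywhere git is installed:

echo "same contents" > yourfile
git hash-object yourfile   # the same 40-character SHA1 on my machine, yours, anyone’s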

So for now, the plan is to write about whatever I’m learning in the Git book, because the Git book is awesome. SEE YOU NEXT TIME!

Lesson 43

Ok, it’s starting to get pretty complex! As best as I can understand it at the moment, a class is a sort of super-structure that can be used more flexibly than a simple function, and other variables and structures can be flexibly swapped in and out of it. I am fairly lost on a good amount of the syntax, like the __init__ and the self.name = name, to name a few. But I’ve decided to go ahead and start copying in the code for this exercise, even though I don’t entirely understand it yet. Evidently it always takes a long time to really grok OOP. So that’s ok.

Some things I notice: the classes that take the generic object seem to be hierarchically above the others, e.g. class Scene(object), as opposed to class Death(Scene), which inherits from Scene. I don’t know if this is important or not, but it is how Zed’s exercise is going.
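
Here’s a tiny sketch (my own toy names, not Zed’s actual code) of what those bits of syntax are doing:

class Scene(object):            # "hierarchically above": inherits straight from the generic object
    def __init__(self, name):   # runs automatically when an instance is created
        self.name = name        # stores the argument on the instance itself

class Death(Scene):             # inherits from Scene, so it gets __init__ for free
    pass

ending = Death("the bad ending")
print(ending.name)              # prints: the bad ending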

And at this point, I’ve pushed through, even not being totally sure of what I’m doing, punched in all the code, learning snippets here and there, checked it all over, and it doesn’t run. I am frustrated. I have read the last few exercises, where he starts discussing classes and object-oriented programming, several times, and I’m really not getting the hang of it. Someone told me, when I shared that I was trying to learn what the meat and potatoes of modern programming, OOP, really is, that it took learning another language before he really grokked what the deal was. Barring that, I just keep trying to read about it. I know I can get it, but I don’t have it right now, and I can’t finish this lesson of LPTHW, and I’m frustrated. So I shall move on and try to come back later.

EOY Recap and 2014 Goals

I love end of year posts, like Scalzi’s and probably others whose blogs I do actually read but can’t think of, ha ha! This has been the year of transferring my nerdiness from generic pop-culture things, like reading a ton of scifi and guzzling Star Trek, Doctor Who, Firefly, etc., to some real gritty stuff, like some actual facility with computer systems. Very exciting.

I have made a zillion miles of progress and Big Life Decisions this year which, programming-wise, were and are incredibly impactful. The goal I had been working toward for the last few years was to teach high school math; I figure I have enough of an aptitude, and teenagers are (don’t panic!) actually pretty cool to work with. And I knew I needed to be studying toward something that would get me an actual, adult, non-entry-level job, in contrast to the kind I worked in for four years. Let me tell you, hearing “well, we think you do deserve a raise, but there’s just no money” is pretty hard to hear/believe when you work for one of the biggest and most successful companies in the world, earning record profits quarter after quarter even in the most difficult of times (see 2009, when I saw coworkers’ 401(k)s slashed to 20-40% of what they had been).

Rant over, getting back to it: part of the coursework for the high school math teaching program at one of the graduate schools I was looking at included an introductory programming course. As I’ve mentioned elsewhere, this really struck me, and I joined a ladies’ programming group and stuck with it. After hearing that one definition of a master is 10,000 hours of work put in, I tried to come up with a figure for how much I’ve put into Python and related projects, and I think I’m right around 250 hours, give or take. So I have a long way to go, but I’m very pleased with having allotted that much over the course of a year. It’s 250 more than I had by last year, and that’s the only metric that’s important right now!

And now, for your handy-dandy list-format of what I’ve started and accomplished this year, under the cut. Wahoo!

Lesson 41

Oh, wow, Exercise 41. This is great. The code is complicated and I’m still working through it, with gems like:

for sentence in snippet, phrase:
    result = sentence[:]

Wow-ee, why is there a colon between the two brackets!
EDIT: This was a CSQ! Here’s what Zed sez:

What does result = sentence[:] do?
That’s a Python way of copying a list. You’re using the list slice syntax [:] to effectively make a slice from the very first element to the very last one.

Also, don’t forget this lil guy:

PHRASE_FIRST = False
if len(sys.argv) == 2 and sys.argv[1] == "english":
    PHRASE_FIRST = True

argv is used to import a file, so I can only guess that it is related to the url that the module urlopen imports. the len operation is easy enough to parse, but I don’t really know why sys is in there. [Edit: duh-doy, sys is just the imported module and sys.argv is how you access it.] The rest of it, honestly, I have not much idea about, other than the fact that we’re using some Boolean conditionals.
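
A quick illustration of what sys.argv actually holds (nothing to do with urlopen, it turns out):

import sys

print(sys.argv)   # run as `python ex41.py english` and this prints ['ex41.py', 'english']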

At any rate, while some of the code is confusing, it became significantly less so once I ran oop_test.py. It’s a test! As in, a guess-the-right-answer test! It scrapes the text of this learn code the hard way list of words to make up class names, in order to quiz you on the relations amongst them.

I am going to spend some time with this! The test itself is enjoyable to run. And THEN I shall try to really scrape some understanding of the code into my noggin : )

Edit: THIS IS MY TWENTIETH POST!! HOORAY!! wow!

Lesson 36

Lessons 32, 33, and 34 were list comprehension and conditionals like if/elif/else, for, and while, all of which I feel fairly comfortable with. Exercise 35 was an application of these ideas in the form of, woohoo, a text-based adventure game! This one came in a bit longer at 76 lines, which I honestly love copying in. He had a few handy tricks that he hadn’t talked about before, like exit(0), which kicks you out of the program, and a few clever nestings and conditionals, like the combination of a boolean in one of the “rooms” with a while and an if/else, sketched below.
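
That room pattern looks roughly like this (my own sketch, not Zed’s exact code):

from sys import exit

def gold_room():
    found_gold = False                # the boolean the while loop keeps checking
    while not found_gold:
        choice = raw_input("> ")      # LPTHW is Python 2, hence raw_input
        if choice == "take gold":
            found_gold = True
            print("you win!")
            exit(0)                   # kicks you out of the program entirely
        else:
            print("keep looking.")

gold_room()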

Rather than give each of those its own post, since I don’t have much to add, having learned those functionalities long ago, I moved through them so I could advance a bit more quickly. In exercise 36, he asks for a unique game! So I made one! Woo-hoo! After a not-too-long trial and error period, I came up with a game, and below is probably the most representative output I got when I had someone else play it. Check the repo if you’re interested in the code! The output is more fun anyway : )

WELCOME TO THE DUNDJEL
you can go north and south. maybe other directions?
> north
You carry on your dumb way north, dummy
You went north! That's probably fine.
You see an A and a B, and also a button.
> a
yeah, you only have so many options, here.
nice that you think I'm smart enough to come
up with more stuff, but I haven't.
> b
yeah, you only have so many options, here.
nice that you think I'm smart enough to come
up with more stuff, but I haven't.
> A
consider all of your options my friend.
but not today, for today you die.
you lose. you see a howling hag:
SHE'S A WIIIIIITCH
BOOOOO. BOOOOOOOO!!

It makes me laugh, which is really all that I’m looking for in my own text-based adventures. There IS a way to win, but it doesn’t look like they made it!

Hey, quick sidenote: have you seen Depression Quest? It’s eligible for Steam Greenlight, and, well, my description won’t do it justice; the concept is amazing & you should just go take a look.