Faster Than Grep

It was looking like a boring day. I got in late, almost 11 (B-A-D, I usually work 10:30-18:30), and by 11:30 I had nailed all the daily maintenance stuff and was looking at a series of deadlocked and waiting-for-other-guys jobs, plus a few epic jobs that couldn't really be furthered today. Then the ops manager walked up to me... "What's faster than grep?"

If you aren't a geek, grep is a really versatile and REALLY FAST tool for searching text files. If somebody important wants something faster than grep, they are either in an unholy hurry, or they have a truckload of data. Or, worst case, like this case, both. Forty-four gig of data all told, and a 180k list of some 13,000 records to pull out of it. Variable-length text data: a worst-case scenario. And we need the information extracted by tomorrow morning. Oh, and you need to match two different fields. Great.

The good news? The first field has a lot of duplicates. It's 13,000 records, but only 220 unique values in the first field. I flag that as "interesting". The problem with a looped grep is that you're spinning the Big Wheel fast and the little wheel slow: you're cycling through 44 gig of data 13,000 times, which is on the order of 570 terabytes of reading. Nothing's going to stay in RAM (this is 2006, and I still don't have a box with 64GB of RAM, darn it... I stuck one in at my last job with 32, though). A repeated grep command was literally taking days; it had been tried before they handed me the problem. So I come at it from the other direction.

I've written something faster than grep before. But I cheated then, and indexed the data using something called CDB, the fastest database system I've ever seen (look it up, it is truly annihilative). Today, though, I had no time to index anything... one pass was going to have to be enough. How'd I do it? I wrote it in C. No kidding: with that volume of data, I need the fastest _performing_ tool I can find, or it's gonna take weeks. Compile with -O3! PHP would take way less time to code, saving an hour or two even, but it would take days longer to run.

The first thing I did was load the 13,000 sets of two fields into an array. This meant the data doing most of the work was sitting in RAM, and we only had to cycle through the 44GB of junk once. This is a good thing.
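Something along those lines, as a minimal sketch rather than the actual program (the tab-delimited layout, the field positions, the buffer sizes and the file names are all assumptions here), looks like this:

    /* A minimal sketch, not the original program.  Assumes both the wanted
     * list and the big file are tab-delimited with the two fields of
     * interest first.  Build with something like: gcc -O3 -o matcher matcher.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_WANTED 13000
    #define FIELD_LEN  128
    #define LINE_LEN   4096

    static char want1[MAX_WANTED][FIELD_LEN];   /* first field of each wanted record  */
    static char want2[MAX_WANTED][FIELD_LEN];   /* second field of each wanted record */
    static int  nwanted = 0;

    /* Load the ~13,000 wanted field pairs into RAM once. */
    static void load_wanted(const char *path)
    {
        FILE *fp = fopen(path, "r");
        char line[LINE_LEN];

        if (!fp) { perror(path); exit(1); }
        while (nwanted < MAX_WANTED && fgets(line, sizeof line, fp))
            if (sscanf(line, "%127[^\t]\t%127[^\t\n]",
                       want1[nwanted], want2[nwanted]) == 2)
                nwanted++;
        fclose(fp);
    }

    int main(int argc, char **argv)
    {
        char line[LINE_LEN], f1[FIELD_LEN], f2[FIELD_LEN];
        long nlines = 0;
        int i;

        if (argc != 2) {
            fprintf(stderr, "usage: %s wanted.txt < bigfile > matches\n", argv[0]);
            return 1;
        }
        load_wanted(argv[1]);

        /* One pass over the 44GB, streamed in on stdin. */
        while (fgets(line, sizeof line, stdin)) {
            if (++nlines % 1000000 == 0)
                fprintf(stderr, "%ld million lines\n", nlines / 1000000);
            if (sscanf(line, "%127[^\t]\t%127[^\t\n]", f1, f2) != 2)
                continue;
            for (i = 0; i < nwanted; i++)
                if (strcmp(f1, want1[i]) == 0 && strcmp(f2, want2[i]) == 0)
                    fputs(line, stdout);    /* matched: emit the record (no break
                                               here yet; that fix comes later) */
        }
        return 0;
    }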
Lots of bugs ensued. The usual C stuff: you hack something together and it segfaults the first 100 times you run it, while you madly run around putting print statements everywhere trying to figure out where the fire is. Meanwhile, the clock is ticking and my boss's boss is sitting next to me or pacing around his office looking worried. Tick tock.

I get the thing working and move it over to the HP-UX box that has all the data on it. I didn't even have vim on that box, so I coded it first up on my Linux desktop. I start it up there on a small sample of the data, the same sample that had been working fine on my desktop. It crashes. And crashes. And crashes. I run into bug after bug in the awful HP libraries; sscanf doesn't work anything like it does under GNU libc. I'm about ready to move the data off onto a heavy Linux server when my Boss (not my Boss's Boss) mentions that he's installed the GNU compiler and libraries on the HP-UX machine. Sweet! It compiles first go and runs first go now.

Except it's too slow. "What CPU's in this thing, man?" "Uhhh, I think it's 4x 360MHz." "$%#%!" I start transferring the data onto the aforementioned heavy Linux server, which has about 10 times the processor power. I have a brief discussion with IP Engineering about LETTING ME THROUGH THE FIREWALL NOW PLZ. Then it starts... 5MB a second, anyone? This is the kind of thing, I mention, that it would be USEFUL to UPGRADE TO GIGABIT ETHERNET for. Tick tock.

Hours later... The hours are good in a way: they give me time to do a bit of other work and think about how to make my hacked-together program go faster. The first thing is, I realise, smacking myself in the head, that I'm matching stuff and then continuing to compare the rest of the 13,000 wanted records to a line even after it's been matched! I fix this, and then I remember the "interesting" thing from earlier: only 220 unique values in the first field. I write a loop (I hate C, this would be another one-liner in PHP :() to find these and put them into an array. I use this array to "screen" each record before scanning it. If the first field isn't in the 220, the line doesn't get compared to the 13,000; we skip it and move on.
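Again only a sketch, on the same assumptions as before (it reuses want1, want2, nwanted and the constants from the earlier sketch), but the two changes amount to something like this: build the list of unique first-field values up front, skip any line whose first field isn't one of them, and break out of the comparison loop as soon as a line matches.

    /* Sketch of the two speed-ups, extending the earlier sketch (same assumed
     * tab-delimited layout; want1, want2, nwanted etc. are defined there). */
    static char uniq1[MAX_WANTED][FIELD_LEN];   /* unique first-field values, ~220 of them */
    static int  nuniq = 0;

    /* Collect the unique first-field values from the wanted list. */
    static void build_screen(void)
    {
        int i, j;
        for (i = 0; i < nwanted; i++) {
            for (j = 0; j < nuniq; j++)
                if (strcmp(want1[i], uniq1[j]) == 0)
                    break;                      /* already have this one */
            if (j == nuniq)
                strcpy(uniq1[nuniq++], want1[i]);
        }
    }

    /* Is this first field one of the ~220 we actually care about? */
    static int screened_in(const char *f1)
    {
        int j;
        for (j = 0; j < nuniq; j++)
            if (strcmp(f1, uniq1[j]) == 0)
                return 1;
        return 0;
    }

    /* The single pass with both speed-ups applied; call build_screen() first. */
    static void scan(FILE *in)
    {
        char line[LINE_LEN], f1[FIELD_LEN], f2[FIELD_LEN];
        long nlines = 0;
        int i;

        while (fgets(line, sizeof line, in)) {
            if (++nlines % 1000000 == 0)
                fprintf(stderr, "%ld million lines\n", nlines / 1000000);
            if (sscanf(line, "%127[^\t]\t%127[^\t\n]", f1, f2) != 2)
                continue;
            if (!screened_in(f1))
                continue;                       /* most of the real data bails out here */
            for (i = 0; i < nwanted; i++)
                if (strcmp(f1, want1[i]) == 0 && strcmp(f2, want2[i]) == 0) {
                    fputs(line, stdout);
                    break;                      /* matched: stop comparing this line */
                }
        }
    }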
This actually slows down my sample data noticeably. But I know that my sample data has an unusually high number of matches; I figure my program is just doing an extra 220 comparisons on each line for that data, whereas the bulk of the REAL data will be skipped over quickly by this code. How do I test that theory? I'm able to run the program on the incomplete file as it downloads. Don't try this on Windows :) I'm getting what looks like good results. It's fast, and it's Getting the Data. I'm running out of hours.

The first half of the data (glad it's in two separate files now!) finishes downloading and I get to work on it for real. Looks good... it's matching stuff, and the counter I've got in the program that prints a line when it gets to a million is clipping past fairly quickly. The dead spots in the data, without any records in the 220 list, fly past, and I estimate it's going to take just 90 minutes to get through the first file! 285 million records in 90 minutes, a bit over 50,000 records a second, is just fine by me. WE ARE GOING TO WIN.

The rest is easy. A few hours later, the second file finishes downloading. I start it up and walk home. It's done by the time I get there and VPN in. Sweet. Home by 9pm, and the work done! Today reminded me why I love my job.

James Hicks owns and operates http://isnerd.net and has ten years' experience in the Information Technology / Information Services industry, including eight as a Linux systems administrator. He has worked as a senior Unix administrator for Primus Telecom Australia (a large Australian telco/ISP) and is currently Production Support Manager at AusRegistry, the infrastructure company that maintains the com.au, net.au, org.au (and other) domain spaces. He became a Red Hat Certified Engineer in 2004 and currently lives in Melbourne, Australia.
