How do you search billions of records?

A method for searching text files on a large scale.

When providing services to clients, one of your roles may be that of an archival service: simply keeping track of what happened and when it happened, essentially a record or log of events.  Oftentimes these records will have retention requirements, set either internally by the client or by various laws.  If all goes well for your client, these records will never do anything more than collect the proverbial digital dust and take up what feels like an insurmountable volume of wasted hard drive space.  But what happens when things don't go well?  What happens when you receive a request from your client asking you to find all records of events pertaining to XYZ?  And, as luck would have it, what if that request means you need to search through billions of records?

Not so long ago, we had to deal with such a request.  One of our clients receives an incredible amount of web traffic, accruing billions of web hits each year.  Each hit is recorded at the CDN in a series of text log files.  These files are then archived for future reference per the client's request.  As they are rarely, if ever, referenced, they have simply been compressed and stored as raw text log files.  However, that one rainy day finally came along: this particular client handed us a list of a couple thousand IP addresses and asked us to find all entries pertaining to those IPs over a given date range.

Now is about the time you think to yourself, "wouldn't this be so easy if all that data was in a database?"  The problem is, setting up a database and ingesting that much data into it doesn't happen overnight.  Further aggravating the problem, the compressed log files already consume terabytes of storage.  Decompressing them and loading them into a searchable database presents the real possibility that the database would swell to petabyte-scale storage requirements.  That is not altogether desirable for a once-in-a-rainy-day situation.

So, a solution needed to be devised that efficiently met the client's request, didn't take up petabytes of storage space, and could return results in a timely fashion.  Our first thought was to see how a command-line grep search of the files would perform.  Unfortunately, the search was taking so long to return results that we had to kill it and find another approach.  Our next method of attack was to create a custom Python application that could do the task at hand more efficiently.

At the outset, Python doesn't seem like the best solution because it is an interpreted language, and certainly some performance is lost because of this.  But it is an easy language to work with and still quite powerful if you know how to leverage some of its features.  So, how do you make it powerful and efficient enough to chew through billions of records?  There are actually two steps to making that work.  The first is simply leveraging multiple processors.  A Python process is effectively single threaded (the Global Interpreter Lock keeps it pinned to one core at a time), but the standard library's multiprocessing module can be imported to make full use of the server's CPU capabilities.  When each line of each log file is read, it doesn't matter whether the lines are processed in order; it only matters that each is checked for a match.  So, we start with a single process that reads the files line by line.  Each line is then passed in round-robin fashion to a series of search processes, each of which runs on an independent CPU core.  If no match is found, the search process simply discards the line and moves on to the next record assigned to it.  Should a match be found, the line is handed to yet another process established to receive the matched data, again running on its own CPU core.  Any additional last-minute processing is performed there and the result is written to an output file of search results.
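As a rough illustration (not the client's code), here is a minimal sketch of that layout using Python's multiprocessing module.  The file list, the IP list, the queue sizes, and the substring match test are all placeholders.

    import multiprocessing as mp

    def searcher(work_q, result_q, targets):
        # Each search process runs on its own core, pulling lines until it
        # sees the None sentinel.  Misses are discarded; hits are forwarded.
        for line in iter(work_q.get, None):
            if any(t in line for t in targets):      # placeholder match test
                result_q.put(line)
        result_q.put(None)

    def writer(result_q, out_path, n_searchers):
        # A single process collects matches and writes the results file.
        finished = 0
        with open(out_path, "w") as out:
            while finished < n_searchers:
                item = result_q.get()
                if item is None:
                    finished += 1
                else:
                    out.write(item)

    if __name__ == "__main__":
        paths = ["access-log-001.txt"]                  # placeholder file list
        targets = ["203.0.113.7", "198.51.100.23"]      # placeholder IPs
        n_search = max(1, mp.cpu_count() - 2)           # leave cores for reader and writer
        work_qs = [mp.Queue(maxsize=10000) for _ in range(n_search)]
        result_q = mp.Queue()
        procs = [mp.Process(target=searcher, args=(q, result_q, targets))
                 for q in work_qs]
        procs.append(mp.Process(target=writer, args=(result_q, "matches.txt", n_search)))
        for p in procs:
            p.start()
        i = 0                                           # reader: deal lines out round robin
        for path in paths:
            with open(path) as f:
                for line in f:
                    work_qs[i % n_search].put(line)
                    i += 1
        for q in work_qs:
            q.put(None)                                 # tell each searcher to stop
        for p in procs:
            p.join()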

Should you only be required to find a handful of matches, this alone would be enough.  However, our request was to match thousands of IP addresses.  So, while multiprocessing did a great deal to improve speed, even further gains are realized by optimizing the matching process itself.  In a very basic traditional search, you simply run through the list from top to bottom looking for a match; if you don't find one, you have just wasted thousands of comparisons.  If you sort the data first, you could do a double check on each item: one for a match, and one to see whether your current position is already greater than the item being searched for.  But that means performing two checks instead of one, so even if on average you jump out at the halfway point, you've still used up roughly twice as many cycles doing the double check and effectively taken just as long as running through the full list with a single check for a match.  Other matching techniques let you divide and conquer through a series of jumps through the list, but these too proved to chew up considerable processing time.  The solution we used instead was to fold the search data into an array.

To fold the search data, you first sort it and then sequentially fill it into a two-dimensional array.  Ideally, the array would be perfectly square, so simply take the square root of the total data set size and build from there.  Obviously, to contain the full data set, the resulting dimensions need to be rounded up.  Then pad the remaining cells with values that can never match a search key.  This completely fills the array, similar to the example table below.

 1  8 15 22 29 36 43
 2  9 16 23 30 37 44
 3 10 17 24 31 38 45
 4 11 18 25 32 39 46
 5 12 19 26 33 40 47
 6 13 20 27 34 41  0
 7 14 21 28 35 42  0
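One way to build such a grid in Python is sketched below; it is a generic illustration rather than the code we shipped.  The pad value of 0 mirrors the example table and assumes 0 can never be a legitimate search key.

    import math

    def fold(values, pad=0):
        # Sort the keys, then fill a side x side grid column by column,
        # padding the tail with a value that can never match.
        data = sorted(values)
        side = math.ceil(math.sqrt(len(data)))
        data += [pad] * (side * side - len(data))
        # grid[row][col] holds the (col * side + row)-th sorted value
        return [[data[col * side + row] for col in range(side)]
                for row in range(side)]

    grid = fold(range(1, 48))
    for row in grid:
        print(*row)          # reproduces the 7 x 7 example table above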

Now, when it comes to searching, start by scanning across the top row from right to left.  As soon as the search key is greater than or equal to the value at the top of a column, dive into searching that column.  Work down the column, checking for a match, until the bottom of the column is reached.  If a match is found, set a match flag and break out of the loop.  If the bottom of the column is reached without a match, break out of the top-row loop as well; the key cannot be anywhere else.  In this manner, instead of searching the full data set, a complete miss costs at most twice the square root of the data set size.  A data set of, say, 5,000 items becomes a grid of 71 x 71, so instead of a miss taking up 5,000 comparisons, it takes at most 71 + 71 = 142.  Had the data set been significantly larger, say 100,000 items, the array could have been folded into a cube and the same process conducted once more.  The cube root of 100,000 is 46.4, so the data could be arranged as 47 x 47 x 46, which means a missed search would at worst take 47 + 47 + 46 = 140 comparisons.  A vast improvement over searching through the full 100,000 items!
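A sketch of that column dive, operating on the grid built above, might look like this (again, an illustration rather than the production code):

    def folded_search(grid, key):
        # Scan the top row right to left; once a column's top value is less
        # than or equal to the key, the key can only live in that column.
        for col in range(len(grid[0]) - 1, -1, -1):
            if key >= grid[0][col]:
                for row in range(len(grid)):
                    if grid[row][col] == key:
                        return True      # match found
                return False             # bottom of the column reached: a miss
        return False                     # key is smaller than every column head

    print(folded_search(grid, 25))       # True  - a short walk down one column
    print(folded_search(grid, 50))       # False - at most 7 + 7 = 14 checks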

This process of folding can be conducted as many times as needed.  For visual purposes, think of a 3rd fold as producing a row of cubes, a 4th as a two-dimensional array of cubes, a 5th as a cubed array of cubes, and so on.  High levels of folding are probably only practical for extremely large data sets, although I'm sure someone will send me an e-mail proving me wrong.
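To put rough numbers on that, the worst-case cost of a miss is one pass along each axis.  A quick back-of-the-envelope calculation (assuming every axis is rounded up to the same length, which is slightly more pessimistic than the 47 x 47 x 46 arrangement above) looks like this:

    import math

    def worst_case_checks(n, dimensions):
        # One pass along each axis of a grid whose sides are all
        # ceil(n ** (1 / dimensions)) items long.
        side = math.ceil(n ** (1.0 / dimensions))
        return dimensions * side

    print(worst_case_checks(5000, 2))     # 2 * 71 = 142
    print(worst_case_checks(100000, 3))   # 3 * 47 = 141 (140 with a 47 x 47 x 46 layout)
    print(worst_case_checks(100000, 4))   # 4 * 18 = 72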

So, with our tools in hand and coding complete, how did we do?  We ran the process on a 12-core server, and in a matter of hours the search completed and we had our results in hand.  For log files that were very small (some containing only a single line), the process spent almost as much time opening and closing files as it did running the search.  This made minimal use of the multiprocessing capabilities, and searches were being conducted at rates in the tens of thousands of records per second.  For portions of the run where significantly larger files were being searched, searches were coming in at over five times the small-file speed.  However, after observing the CPU utilization of the Python processes, further optimization could have been made.  During the processing of the larger files, the Python process that was reading and handing out work to the search processes was running full tilt at 100% of a CPU core, while the 10 search processes were only running at about 30% utilization each.  In hindsight, this means the file list could have been split and sent to two file-reader processes, each with only 4-5 search processes assigned to it.  Assuming the underlying file system could keep up with the data requests, this would have effectively doubled the speed of the search.

So, there you have it!  Now you have a starting point for creating your own custom application to perform those rainy-day search requests for your clients.  I would share the entire program as written (it was less than 600 lines of code); however, as some functionality was hard coded and it was written specifically for a particular client, it would be inappropriate to share those details.  All the same, I hope the details and sketches provided here are enough to put you on the right track.  Happy coding!

Need help processing logs or data events?  Contact us; we'd be happy to help with all your web infrastructure projects.
