On the Goodness of Binary Search
by Tim Bray
Anyone who regards themselves as a serious programmer has internalized a
lot of different ways of searching: hash tables, binary, and many
different kinds of trees. I've used pretty well all of these seriously at some
point, but for a decade or so, as far as I can recall, I've used almost
exclusively binary search, and I see no reason to change that. Herewith an essay
for programmers on why.
..
I'll start out with a discussion of the performance of binary search, look at
some of the reasons why people don't use it and try to poke some holes in them,
and conclude with a small tutorial on how to code binary search properly, with
working Java examples.
..
Motivation
If you've already fully internalized binary search and make heavy regular use
of it, you can skip what, after all, is basically a rehash of something you
took in second-year CS. This note is provoked by a recent conversation where
someone said "There are five million records here, and we need to do quick
lookup; why don't we stick them in an Oracle table?" I had a better idea.
..
Also, I think binary search is aesthetically pleasing.
..
Logarithmic Time... mmm, Tasty
Everybody knows that in Computer Science jargon, binary search is O(log2(N)).
Let's just look at what that means. If it takes you 1 millisecond to search a
million (say 2^20) items, then it'll take you about 1.05 to search 2 million,
1.10 to search 4 million, and a whole 1.30 when you get up to 64 million.
Those are awfully nice numbers.
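..
For the record, here's a tiny illustration of where those numbers come from
(my own, not part of the original argument): the worst-case probe count is
roughly log2(N), so a lookup costs one more probe every time the data doubles.

public class LogSteps
{
    public static void main(String [] args)
    {
        int [] sizes = { 1_000_000, 2_000_000, 4_000_000, 64_000_000 };
        for (int n : sizes)
        {
            // ceil(log2(n)) is the worst-case number of probes
            int steps = (int) Math.ceil(Math.log(n) / Math.log(2));
            // 20 probes is the 1-million-item baseline
            System.out.println(n + " items: ~" + steps + " probes ("
                               + (steps / 20.0) + "x the 1M cost)");
        }
    }
}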
..
The Alternatives
Now, everyone knows that a hashtable has O(1)
search time, and
trees can be updated in place, so why would you use something as primitive as
binary search when you have these excellent alternatives? Also, with trees and
so on you can potentially search really large data sets, too large to fit into
memory, with persistent on-disk structures.
..
All that granted, here are some reasons you might want to use binary search
anyhow:
..
Simplicity
Simplicity is good, and binary search, coded properly (see below), is
awfully damn simple. Hashing requires you to deal with collisions and
chaining and all that kind of stuff, and while tree search is pretty
straightforward, tree update is hellishly complex. Less code means less room
for bugs, and better use of your CPU's I-cache, and is just generally good.
..
Memory Overhead
Binary search requires that you store the search items you're going to compare
in an array of some sort, with exactly zero indexing overhead. Zero,
that's a nice number. If you need the kind of performance you get from
in-memory searching, you're going to be able to cram more in with binary than
with any other approach.
..
Code Your Own
Of course, any good programming system library will come with a decent
precooked hashtable function and maybe a persistent on-disk updateable tree
too, so why not use them?
..
Well, since they're coded portably they'll require you to implement
hashkey() and/or compare() methods for your objects, so you're going to be
doing a whole lot of method dispatching or procedure calling, which is a
reasonable price for using pre-written code. On the other hand, binary search
is so simple and robust that you can code up five subtly different binary
searches for different places in your code, all using optimized direct
references into your internal structures, and not suffer any software
engineering angst.
..
Performance
In fact, in practical terms, O(1) is in many cases not noticeably faster than
O(log2(N)). The difference, if it isn't eaten up in all that collision
handling and chaining code, tends to vanish in the static. Run some
measurements; I have.
..
Size
Suppose you have way too much data to fit that array into memory. To start
with, are you really sure? Check out the cost of stacking up the necessary
memory versus the cost of writing and maintaining the code to manage on-disk
data structures; memory prices have probably gone down since the last time you
looked.
..
Suppose it's still too big. Consider trying to stuff it in anyhow. Modern
operating systems have these wonderful system calls that tell the system to
consider a file to be mapped into memory; on Unix and Linux systems, it's
called mmap(). You'll be surprised the first time you call this on a file
that's eight times the size of your memory and the system comes back a couple
of nanoseconds later and says "OK, what next?" What's happening is, the job's
been turned over to the operating system's Virtual Memory subsystem. This
code tends to have been written by people like Dennis Ritchie and Bill Joy
and Linus Torvalds, i.e. better programmers than you or I, and to be really
terrifically smart about making the best use of memory.
..
Look at it this way: you can write a bunch of complicated code to manage the
on-disk vs. in-memory parts of your data, or you can pretend it's all in memory
and use that Dennis/Bill/Linus code to take care of the details.
..
Although I am given to frequent (ab)use of mmap(), I continue to be
astonished how well it holds up under ridiculous stress.
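..
If you want to play with this from Java rather than C, the same trick is
available through java.nio; the sketch below is my own illustration with a
made-up file name, not anything from the downloadable source. It memory-maps
a file of sorted 32-bit integers and runs the same binary-search loop over
it. Note that a single MappedByteBuffer tops out at 2 GB, so really huge
files would need to be mapped in chunks.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedSearch
{
    // Binary search over a memory-mapped file of sorted big-endian ints.
    static int search(MappedByteBuffer buf, int target)
    {
        int high = buf.capacity() / 4, low = -1, probe;
        while (high - low > 1)
        {
            probe = (high + low) / 2;
            if (buf.getInt(probe * 4) > target)
                high = probe;
            else
                low = probe;
        }
        if (low == -1 || buf.getInt(low * 4) != target)
            return -1;
        else
            return low;
    }

    public static void main(String [] args) throws IOException
    {
        // "sorted-ints.dat" is a hypothetical name; any file of ascending ints will do.
        try (FileChannel ch = FileChannel.open(Path.of("sorted-ints.dat"),
                                               StandardOpenOption.READ))
        {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(search(buf, 42));
        }
    }
}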
..
Finally, in a couple of years, when the cost of RAM has dropped to even more
ridiculously small levels, and the current practice of putting half-a-gig of
RAM on a laptop seems quaint, your data will have all smoothly migrated into
RAM without you having to change anything.
..
Update
Yes (gasp), it is perfectly possible to update large contiguous arrays in RAM.
GNU Emacs, one of the fastest and most robust editing systems on the planet,
just blasts your file into a great big buffer and addresses it as an array of
bytes using a low-level C primitive it calls CharAt, which addresses the
character at any given integer offset. The trick is, it keeps the buffer in
two extents with a "gap" in the middle, and moves the gap around to let you
edit; CharAt checks where the gap is, does a little bit of
ridiculously-simple arithmetic and lets you pretend it's all contiguous. Last
time I checked (er, maybe ten years ago) CharAt was a macro and you'd see
things like:
..
CharAt(curPos++) = whateverTheUserJustTyped;
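..
To make the gap trick concrete, here's a minimal sketch of the idea in Java
(my own illustration, not the Emacs source): the text lives in one array with
a movable gap at the editing point, and charAt() hides the gap with one
comparison and one addition.

class GapBuffer
{
    char [] buf = new char[8];
    int gapStart = 0;            // first free slot (the editing point)
    int gapEnd = buf.length;     // first used slot after the gap

    int length() { return buf.length - (gapEnd - gapStart); }

    char charAt(int pos)         // pretend the buffer is contiguous
    {
        return pos < gapStart ? buf[pos] : buf[pos + (gapEnd - gapStart)];
    }

    void insert(char c)          // insert at the gap, growing it if it's full
    {
        if (gapStart == gapEnd)
        {
            char [] bigger = new char[buf.length * 2];
            System.arraycopy(buf, 0, bigger, 0, gapStart);
            int tail = buf.length - gapEnd;
            System.arraycopy(buf, gapEnd, bigger, bigger.length - tail, tail);
            gapEnd = bigger.length - tail;
            buf = bigger;
        }
        buf[gapStart++] = c;
    }
}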
..
There are also tricks like page tables which let you monkey with arrays at
run-time; for example, as a programmer you are allowed to believe that your
process memory is a big contiguous array, but that's all a pleasant illusion
brought about by that same virtual-memory code mentioned above.
..
Of course, you have to worry about concurrency, but if your data structure is,
well, an array, or an array with a gap, or an array with a page table, the
locking and mutexing you have to do is way simpler than with a tree.
..
How To Code a Binary Search
There's a story I heard from a prof, that the basic Binary Search idea was
around when Alan Turing was still in diapers, but that the first 18 published
implementations were buggy. And I know for a fact that the first few times I
coded it I got it wrong.
..
At this point I should say that I wasn't smart enough to devise a non-buggy
implementation, nor smart enough to see that in a lot of places where people
use other things, they should really be using binary search. I learned both
those things in the late Eighties from Gaston Gonnet, my mentor back in the
days of the New Oxford English Dictionary Project at the University of
Waterloo, who is both a frighteningly good mathematician and a world-class
programmer, responsible for much of the implementation of the Maple algebra
language.
..
All that said, the core idea of binary search is mind-bogglingly simple; you
have a sorted array, you look at the middle element, which tells you whether
what you're looking for is in the top or bottom half, then you take that
half, rinse and repeat until done.
..
Binary Search is a Two-Step Process
..
The easiest way to go off the rails with binary search is checking whether
you've found your target while you're doing the cutting-down-by-half. This
adds all kinds of complexity for the benefit of cutting down the number of
steps from O(log2(N)) to one or two less, which is stupid.
..
So the recipe is:
- Take a "high" pointer and a "low" pointer and keep cutting the space
between them in half until they run into each other.
- One of them points to what you're looking for, or it's not
there.
..
That said, here's the basic code. I've been mostly in Perl-and-C land since
about 1999, but decided to outline this in Java because, of all the computer
languages I know, Java is the most readable. People whom I respect say Python
is good too, but I've never got around to learning it. The following routine
is part of a fully-worked-out .java file with a test driver and so on that
you can download; we'll get to that. If you don't like my coding style, too
bad.
 1 static public int search1(int [] array, int target)
 2 {
 3     int high = array.length, low = -1, probe;
 4     while (high - low > 1)
 5     {
 6         probe = (high + low) / 2;
 7         if (array[probe] > target)
 8             high = probe;
 9         else
10             low = probe;
11     }
12     if (low == -1 || array[low] != target)
13         return -1;
14     else
15         return low;
16 }
..
Some commentary:
- Line 3: Note that the high and low bounds are set one off the ends of the
array; neither of them is initially a valid array index. This makes all sorts
of corner cases go away.
- Line 4: This loop runs till the high and low bounds are next to each other;
there is no loop breakout.
- Line 6: Check and satisfy yourself that probe can never fail to be a valid
array index, nor can it end up outside the closed high-low interval.
- Lines 7-10: The loop invariant is that high is either one off the end of
the array or points at something greater than target; low is either -1 or
points to something <= target.
- Lines 12-15: If the search succeeded, low will end up pointing at the
target. We can't test for that directly, because if we started with something
lower than the lowest value in the array, low will be left at -1.
..
Now there's a little problem with this; suppose the array contains [1,2,2,3]
and you search for 2. At the end of the search loop, high will be pointing at
the 3 and low at the second 2; that is to say, when there are stretches of
equal values in the array, this will always find the last one. Suppose this
isn't what you want; if you're comfortable with the idiom, it's easy enough
to turn it around so it finds the first of a repeated sequence:
static public int search(int [] array, int target)
{
    int high = array.length, low = -1, probe;
    while (high - low > 1)
    {
        probe = (high + low) / 2;
        if (array[probe] < target)
            low = probe;
        else
            high = probe;
    }
    if (high == array.length || array[high] != target)
        return -1;
    else
        return high;
}
..
With any kind of a decent compiler, this is going to generate hellaciously
tight code, particularly the search loop. And, it can't break.
..
And, once you get used to it, you can code it up without having to think too
much. I got the body of the search1 method up there right, character for
character, the first time I typed it in; it compiled and the tests came out
right.
..
Getting Fancy
Once you get comfortable with this idiom, you realize you can fiddle the
algorithm depending on what you're trying to accomplish. Suppose, for
example, you want to search for a range in that array; take two input numbers
and find the index of the last entry that is smaller than the lesser of them,
and of the first entry that is larger than the greater. For example, if the
array is [2,4,6,8] and I pass in 5 and 6 as arguments, I want the indices 1
and 3 back.
..
Return the answer in a 2-integer array; the values -1 and array.length are
perfectly reasonable results when the range runs off either end of the array.
static public int [] range(int [] array, int floor, int ceiling)
{
    int [] answer = new int[2];
    int high, low, probe;

    // work on floor
    high = array.length; low = -1;
    while (high - low > 1)
    {
        probe = (high + low) / 2;
        if (array[probe] < floor)
            low = probe;
        else
            high = probe;
    }
    answer[0] = low;

    // work on ceiling
    high = array.length; low = -1;
    while (high - low > 1)
    {
        probe = (high + low) / 2;
        if (array[probe] > ceiling)
            high = probe;
        else
            low = probe;
    }
    answer[1] = high;
    return answer;
}
..
I'll leave out the detailed commentary; I just wanted to make the point
that once you get comfy with it, you can do all sorts of useful things with
this efficient, compact, simple, robust programming idiom.
..
Measurements
I ran some performance tests, which reminded me how aggravating Java can be,
there's basically no good way to measure CPU consumption without linking in
OS-specific code and going the JNI route. However, you can measure wall-clock
time quite easily. I ran ten passes each of three searches, looking up ten
thousand targets in arrays of size one, five, and ten million integers. I lack
the patience to aggregate the numbers and run proper stats, but here's a
summary of what I typically saw:
..
                    1M            5M            10M
min, max (sec)      .021, .034    .033, .035    .037, .039
..
In other words, the performance difference in going from a million to ten
million basically vanishes in the static. This is a good thing.
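..
For what it's worth, a wall-clock harness along these lines might look like
the sketch below; this is my own reconstruction, not the downloadable test
driver, and the sizes and random seed are arbitrary.

import java.util.Arrays;
import java.util.Random;

public class SearchTiming
{
    // same routine as search1 above
    static public int search1(int [] array, int target)
    {
        int high = array.length, low = -1, probe;
        while (high - low > 1)
        {
            probe = (high + low) / 2;
            if (array[probe] > target)
                high = probe;
            else
                low = probe;
        }
        if (low == -1 || array[low] != target)
            return -1;
        else
            return low;
    }

    public static void main(String [] args)
    {
        Random r = new Random(42);
        int size = 10_000_000, lookups = 10_000;

        // build and sort a large array, then pick targets known to be present
        int [] array = new int[size];
        for (int i = 0; i < size; i++)
            array[i] = r.nextInt();
        Arrays.sort(array);
        int [] targets = new int[lookups];
        for (int i = 0; i < lookups; i++)
            targets[i] = array[r.nextInt(size)];

        // time the lookups with System.nanoTime(), the portable option
        long start = System.nanoTime();
        int found = 0;
        for (int target : targets)
            if (search1(array, target) >= 0)
                found++;
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println(lookups + " lookups in " + size + " entries: "
                           + seconds + " sec (" + found + " found)");
    }
}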
..
You can grab the
Java source if you want to take a closer look.
Updated: 2003/03/23
..
I work at Sun Microsystems. The opinions expressed here are my
own, and neither Sun nor any other party necessarily agrees with
them.