Binary tree, is something you guys should have studied in college and how it is very good for finding an element in terms of time complexity.
Now all of that is true and it is awesome, but if you’re trying to find the set of values in a range, it gets harder. In comes the B+ tree.
In essence it is the binary tree + plus a sorted linked list of element at the lowermost row.
Now the nodes at the lower most part of the tree are the actual values. Everything above is placeholder values for pointing the search to the right pointer in the lower most linked list.
Given how the binary tree works, the values of upper nodes will be in the actual values, generally speaking, but need not be.
That is to say during insertion, we’d be inserting the value in one of the upper nodes(for routing), but during deletion, we don’t need to delete it. (unless of course, the tree has become too unbalanced, and we’re balancing it).
Of course there’s a downside. After all there’s no free lunch. I’d guessed it above with comment on balancing, but more specifically, see below:
But there are new problems (again!). If you add or remove a row in a database (and therefore in the associated B+Tree index):
you have to keep the order between nodes inside the B+Tree otherwise you won’t be able to find nodes inside the mess.
you have to keep the lowest possible number of levels in the B+Tree otherwise the time complexity in O(log(N)) will become O(N).
I other words, the B+Tree needs to be self-ordered and self-balanced. Thankfully, this is possible with smart deletion and insertion operations. But this comes with a cost: the insertion and deletion in a B+Tree are in O(log(N)). This is why some of you have heard that using too many indexes is not a good idea. Indeed, you’re slowing down the fast insertion/update/deletion of a row in a table since the database needs to update the indexes of the table with a costly O(log(N)) operation per index. Moreover, adding indexes means more workload for the transaction manager (we will see this manager at the end of the article).
Now let’s go and dig deeper into specific. Let’s pick an open source database say, postgresql.
GiST (Generalized Search Tree) indexing is an advanced system which brings together a wide array of different sorting and searching algorithms including B-tree, B+-tree, R-tree, partial sum trees, ranked B+-trees and many others. It also provides an interface which allows both the creation of custom data types as well as extensible query methods with which to search them. Thus, GiST offers the flexibility to specify what you store, how you store it, and the ability to define new ways to search through it — ways that far exceed those offered by standard B-tree, R-tree and other generalized search algorithms.
GiST serves as a foundation for many public projects that use PostgreSQL such as OpenFTS and PostGIS. OpenFTS (Open Source Full Text Search engine) provides online indexing of data and relevance ranking for database searching. PostGIS is a project which adds support for geographic objects in PostgreSQL, allowing it to be used as a spatial database for geographic information systems (GIS), much like ESRI’s SDE or Oracle’s Spatial extension.
Ok so that looks like there isn’t this theoretically nice and elegant code available at the implementation layer. But let’s now take a look into that GIST source code. See here.
Ah.. a quick read of the README inside the gist folder suggests it’s a whole lot more complicated, and the idea I started off the blog post with (search tree) is old and research has progressed a lot further. This is not one blog post, but a series of blog post I should do tracking the evolution of trees and reading papers mentioned in the README. (And that’s gonna take time seeing how awfully I’m out of touch with the field research. ).