cpython hashing function implementation for long int object.

This is a follow up to this post.

So went on opening up the Objects/dictobject.c file and start reading up the comments. It’s extremely well documented. I read through the first set of comments(about 50 lines) to learn that
python’s dict has four type of slots. i.e: an entry in the python’s dict object can be one of 4 types:

1. Active : active (key,value pair)
2. Unused: key=value=NULL
3. Dummy: key=dummy value=NULL — was previously active, but was deleted.
4. Pending not yet inserted or deleted. key !=dummy and key != null

As for further implications and usage of these different states i refer you to go here.

Moving on to the comments on hashing function.

From the comment section:

Major subtleties ahead: Most hash schemes depend on having a “good” hash
function, in the sense of simulating randomness. Python doesn’t: its most
important hash functions (for strings and ints) are very regular in common
cases:
To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints.
The same is approximately true when keys are “consecutive” strings. So this
gives better-than-random behavior in common cases, and that’s very desirable.

Also, the comment goes on to talk about how the simple implementation has problems with collisions and the collision resolution strategy is crucial. But that’s stuff for another post.

Ok clearly, my question can’t be answered by looking at this file. i’ll have to look deeper

So i look up where the actual hash function is found and realize it’s part of the builtin modules. Therefore found in Python/bltinmodule.c . Now the builtin hash function implementation here just takes a PyObject(the C-representation of a Python object) and calls its’ hash function. I was initiall confused by the PyObject var and was searching for a similar variable anywhere else in the file and implemented. Dang, my C is rusty and broken… Anyway, after asking some awkward newbie questions on the python-dev irc, some developers there clarified that the hash function implementation is type-specific.

Again realization hits pile screw driver. I was not thinking about it and somehow had assumed a generic hash function. Curse you anand, you’re letting the dynamic typing of python let form bad habits in your thinking wake up. there’s no magic in the real world.
Only people willingly believing there is for whatever reason (pain-avoidance/laziness/sanity etc…)

2100 hrs.
And stupid me looks up Objects/listobject.c. Ofcourse it’s not implemented because well it(hashing)’s not needed for a python list. Conveniently, being just post dinner mind/body state is a good excuse. moving on.

Next stop is the long int object source at Objects/longobject.c

File Size: 154K, Lines: 5K.
Whoa, i could spend a week writing about this cpython long int type implementation. May be i will write more blog posts after this one on hash function.

static Py_hash_t
long_hash(PyLongObject *v)
{
    Py_uhash_t x;
    Py_ssize_t i;
    int sign;

    i = Py_SIZE(v);
    switch(i) {
    case -1: return v->ob_digit[0]==1 ? -2 : -(sdigit)v->ob_digit[0];
    case 0: return 0;
    case 1: return v->ob_digit[0];
    }
    sign = 1;
    x = 0;
    if (i < 0) {         sign = -1;         i = -(i);     }     while (--i >= 0) {

        x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) |             (x >> (_PyHASH_BITS - PyLong_SHIFT));
        x += v->ob_digit[i];
        if (x >= _PyHASH_MODULUS)
            x -= _PyHASH_MODULUS;
    }
    x = x * sign;
    if (x == (Py_uhash_t)-1)
        x = (Py_uhash_t)-2;
    return (Py_hash_t)x;
}

I have removed some of the comments for the sake of space and anyway, am going to read them para phrase them. but before i start with the comments paraphrasing, i’ll try to summarize the function with my guesses. It’ll be good practice to refresh my C and will be fairly simple, given the code base is old enough and has meaningful variable names.

Py_uhash_t x; this is a unsigned type of variable(probably a long int or int in C) renamed.
Py_ssize_t i; this is a short int type of variable(used to store sizes of vars/objects).
int sign; sign of the long object being passed.
i = Py_SIZE(v); a function to find the size of a python object PyObj

Ok this is feeling ridiculous thankfully variables are over.

welcome to the switch case.
If the size of the object is -1,0 or 1 this switch-case returns.
In case of, -1 it checks ob_digit(i am assuming this is the first digit)
and returns 1

           if it's 0:
              return 1
           else:
              return -(sdigit)v ->ob_digit[0];

here comes sign. it is being assigned value = 1. Ok i assume that means Py_Size() return value is greater than 1(ignoring < -2 possibility) if the long object is unsigned. or signed. whatever that sign variable was meant to represent :-)

Oops hasty thinking that reasoning is meaningless as the next if condition (i ob_digit[i] modulo
_PyHASH_MODULUS.

x * 2**PyLong_SHIFT % _PyHASH_MODULUS.

while (–i >= 0) {

x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) | (x >> (_PyHASH_BITS – PyLong_SHIFT));
x += v->ob_digit[i];
if (x >= _PyHASH_MODULUS)
x -=

Advertisements