Python list vs dictionary.

Was talking to a colleague(.Net developer) and ended up lecturing him about how array(list in python) is a specific type of data structure and is a specific type of associated array. Now. the logic goes like this. Associated array(dict in python) is an key-value store. It is a method to store data in the format of a key  mapped to a value. it is usually implemented  with the help of a good hash function.

Anyway, the only constraint being any new insertion has to be of the format key, value , where both the key, value are hashable* values.

Add one more condition that the key values have to be in the increasing order of whole numbers(numbers starting at 0) and you have an array/list.

This discussion/lecture got me thinking about how it would be implemented at the python core language level. I promise to actually check the open-source code base and write a summary later, but for now here’s my thought process/guesses after some pruning over a walk.

1. A list by virtue of having whole numbers for key values will be easier to access. i.e: it can be stored in constant interval  locations in the memory. (I know python being dynamic typed and python lists being capable of storing different type of values in a list, complicates things, but only the implementation. In theory, i can just add a pointer in that memory segment to the real memory where it is stored(you know in case of a huge wall of text that doesn’t fit in the memory intervals.)).   Effect? Accessing a list can be done in Const. Time O(1).**

2. A dictionary since it can have an arbitrary data type as key, cannot be assumed to have const. memory spaces. but then we have a hash function. i.e , we pass our key to a function that is guaranteed to return a unique hash value for any  unique key.  Now the lookup becomes two fold. First

a, a Hashing to get the hash key,

b, Search the table for an entry with the same key as the hashed key value.

Now what is the Big O time for this. My first thought is well it depends on

a. the hashing function implementation

b.  The table size or rather the hashed value size w.r.to  the dictionary size.

Anyway, this reminded me of an older post i had made and the excursions i had made into the cpython source at that time. And i clearly remember some comment from that Objects/dictobject.c/h file about the hashing function being special enough to make O(1) look up. Now i did not really get it at that time and will need to check the code + comment again in context. but the basic reasoning as i remember is by avoiding most of the outlier cases and assuming most simplest/popularly used distribution of the keys, they can simplify the hashing function to O(1).  Will update more some time.

 

 

** — Turns out not exactly const time, but invariable time with respect to the number of elements in the list.  In cases of pointers, there will be variation in time depending on the size of the element stored, but the first lookup is a simple const. time lookup table.

* — by our hash function. but generally, not a file stream, socket handler, port handler , device file,  IPC pipe etc…

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s