Pandas DataFrame performance

I ran into something interesting at work.
pandas.read_csv has an option called skiprows that skips the given rows while reading a file.

Quite accidentally (a silly mistake), I initially passed skiprows only sampleSize = 3000 rows to skip, instead of len(file) - sampleSize rows.
I was trying to compare it against reading the whole file and then using random.sample and dataframe.iloc on the resulting dataframe object.
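
The two approaches being compared could look roughly like the sketch below. This is not the code from the original test script; the function names, the seed parameter and the assumption of a single header line are all hypothetical, and n_rows is assumed to be known in advance (for example from a line count).

```python
import random

import pandas as pd


def count_data_rows(path):
    """Count data rows, similar to `wc -l` minus the header line.

    Assumes one record per line (no newlines inside quoted fields).
    """
    with open(path, "rb") as handle:
        return sum(1 for _ in handle) - 1


def sample_with_skiprows(path, n_rows, sample_size, seed=None):
    """Read only sample_size data rows by skipping all the others."""
    rng = random.Random(seed)
    # Line 0 is the header and is never skipped; data rows are lines 1..n_rows.
    keep = set(rng.sample(range(1, n_rows + 1), sample_size))
    skip = [i for i in range(1, n_rows + 1) if i not in keep]
    # The skip list has n_rows - sample_size entries, not sample_size.
    return pd.read_csv(path, skiprows=skip)


def sample_after_load(path, sample_size, seed=None):
    """Read the whole file, then pick sample_size rows by position."""
    rng = random.Random(seed)
    df = pd.read_csv(path)
    positions = rng.sample(range(len(df)), sample_size)
    return df.iloc[positions]
```

Both functions should return a dataframe of sample_size rows, which is what makes their timings comparable.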

These were the loading times, in seconds:

- skiprows: 0.426070928574
- random.sample: 1.00320601463

Of course, in this case the comparison is meaningless, as the output dataframe lengths are different (sampleSize in one case and len(file) - sampleSize in the other).
At this point I had a hunch that skiprows may not be fast, even if it uses less memory (RAM), because it basically comes down to hard disk seeks. How many seeks would depend on the pandas implementation: either contiguous rows are clubbed together into a single skip (which makes sense), or each row is skipped separately.
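
One way to probe this hunch without reading any pandas internals would be to time two skip patterns that keep the same number of rows: one contiguous block of skipped lines versus skipped lines scattered through the file. The sketch below is only an assumption about how such an experiment could be set up; the function names are hypothetical, and it says nothing about what pandas actually does under the hood.

```python
import random
import timeit

import pandas as pd


def skip_contiguous(path, n_rows, sample_size):
    """Keep the first sample_size data rows; skip one contiguous block."""
    skip = list(range(sample_size + 1, n_rows + 1))
    return pd.read_csv(path, skiprows=skip)


def skip_scattered(path, n_rows, sample_size, seed=0):
    """Keep sample_size randomly chosen data rows; skips are scattered."""
    rng = random.Random(seed)
    keep = set(rng.sample(range(1, n_rows + 1), sample_size))
    skip = [i for i in range(1, n_rows + 1) if i not in keep]
    return pd.read_csv(path, skiprows=skip)


def compare_skip_patterns(path, n_rows, sample_size, runs=10):
    """Return total seconds over `runs` calls for each skip pattern.

    Both timings include the cost of building the skip list.
    """
    contiguous = timeit.timeit(
        lambda: skip_contiguous(path, n_rows, sample_size), number=runs
    )
    scattered = timeit.timeit(
        lambda: skip_scattered(path, n_rows, sample_size), number=runs
    )
    return {"contiguous": contiguous, "scattered": scattered}
```

If the two timings turn out to be similar, the skip pattern is probably not what makes skiprows slow; if they differ a lot, the hunch deserves a closer look.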

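So let's look at the stats after finding the file size (with wc) and then passing len(file) - sampleSize rows to skiprows.
Note that so far I've been using Python's timeit module, which runs the code a given number of times and reports the time taken. Keep in mind that timeit.timeit returns the total time for all runs, not the average. These figures were all for 300 runs.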
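
For reference, a minimal sketch of this kind of timeit measurement is shown below. The helper name time_call and the stand-in workload are hypothetical; in the real test the callable would be the CSV-loading function being measured.

```python
import timeit


def time_call(func, runs=300):
    """Return (total_seconds, seconds_per_run) for `runs` calls of func."""
    # timeit.timeit returns the total elapsed time for all `runs` calls.
    total = timeit.timeit(func, number=runs)
    return total, total / runs


def stand_in_workload():
    # Hypothetical placeholder for the CSV-loading function being measured.
    return sum(range(1000))


total, per_run = time_call(stand_in_workload, runs=300)
print(f"total: {total:.6f}s  per run: {per_run:.6f}s")
```

Dividing the total by the number of runs gives a per-run figure, which is easier to compare against a single-run profile.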

- skiprows: 1.80713701248
- random.sample: 0.965622901917

Voilà, skiprows now shoots up to nearly double that of random.sample. Previously it was 0.42; now it is 1.8.

OK. At this point I'm beginning to suspect that pandas might issue multiple seek calls for skipping rows instead of simply skipping rows contiguously. Now how do we find out?
Option 1: Patiently dig through the pandas source code and find out. (Not a bad idea, as so far I'm impressed by pandas' ability to handle files with 7 million records.)
Option 2: See if you can get more detailed, low-level profiling of the Python program*. Enter hotshot.
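
hotshot only exists in Python 2 and was removed in Python 3. On a current interpreter, a similar report could be produced with the standard cProfile and pstats modules. The sketch below is an assumption about how that might look, not the script used for this post; profile_call is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def profile_call(func, top=40):
    """Profile one call of func and return the pstats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by internal time, then call count, and keep the top entries.
    stats.strip_dirs().sort_stats("time", "calls").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in workload; replace with the CSV-loading function under test.
    print(profile_call(lambda: sorted(range(100000), reverse=True)))
```

The report lists ncalls, tottime and cumtime per function, the same columns as a hotshot statistics dump.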

Here's the output of running the hotshot profiler on the same code I had used for pandas.

119821 function calls (115989 primitive calls) in 0.373 seconds

Ordered by: internal time, call count
List reduced from 2228 to 40 due to restriction

ncalls tottime percall cumtime percall filename:lineno(function)
44 0.011 0.000 0.017 0.000 zipfile.py:803(_RealGetContents)
705/187 0.011 0.000 0.024 0.000 sre_parse.py:388(_parse)
18 0.009 0.001 0.010 0.001 collections.py:288(namedtuple)
2 0.008 0.004 0.373 0.187 __init__.py:4()
22 0.008 0.000 0.091 0.004 __init__.py:1()
1381/169 0.007 0.000 0.014 0.000 sre_compile.py:64(_compile)
273 0.006 0.000 0.007 0.000 function_base.py:2945(add_newdoc)
4194 0.006 0.000 0.006 0.000 posixpath.py:68(join)
566 0.005 0.000 0.019 0.000 pkg_resources.py:1888(find_on_path)
423 0.005 0.000 0.018 0.000 pkgutil.py:176(find_module)
575 0.005 0.000 0.007 0.000 __init__.py:79(open_resource)
1 0.005 0.005 0.006 0.006 numeric.py:1()
8130 0.005 0.000 0.005 0.000 sre_parse.py:191(__next)
2281 0.005 0.000 0.006 0.000 posixpath.py:139(islink)
1 0.005 0.005 0.005 0.005 socket.py:45()
1 0.005 0.005 0.007 0.007 _lxml.py:2()
1 0.004 0.004 0.005 0.005 polynomial.py:55()
1 0.004 0.004 0.005 0.005 legendre.py:83()
537 0.004 0.000 0.005 0.000 sre_compile.py:256(_optimize_charset)
6 0.004 0.001 0.005 0.001 string.py:148(substitute)
1 0.004 0.004 0.005 0.005 hermite_e.py:59()
1 0.004 0.004 0.005 0.005 chebyshev.py:87()
1 0.004 0.004 0.005 0.005 hermite.py:59()
1 0.004 0.004 0.005 0.005 laguerre.py:59()
1 0.004 0.004 0.006 0.006 util.py:214(_findSoname_ldconfig)
233 0.004 0.000 0.005 0.000 pkg_resources.py:2494(insert_on)
1942/760 0.004 0.000 0.004 0.000 sre_parse.py:149(getwidth)
464 0.003 0.000 0.011 0.000 posixpath.py:387(_joinrealpath)
538 0.003 0.000 0.010 0.000 pkg_resources.py:2294(from_location)
1 0.003 0.003 0.025 0.025 __init__.py:101()
4 0.003 0.001 0.158 0.039 api.py:3()
2196 0.003 0.000 0.003 0.000 zipfile.py:287(__init__)
5775 0.003 0.000 0.003 0.000 sre_parse.py:139(__getitem__)
1441 0.003 0.000 0.049 0.000 re.py:230(_compile)
687 0.003 0.000 0.003 0.000 genericpath.py:15(exists)
3 0.003 0.001 0.011 0.004 __init__.py:9()
1 0.002 0.002 0.004 0.004 machar.py:113(_do_init)
6804 0.002 0.000 0.006 0.000 sre_parse.py:210(get)
377/358 0.002 0.000 0.003 0.000 :1()
1 0.002 0.002 0.003 0.003 groupby.py:3136(DataFrameGroupBy)

481 function calls in 0.004 seconds

Ordered by: internal time, call count
List reduced from 138 to 40 due to restriction

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.002 0.002 0.002 0.002 parsers.py:1148(read)
1 0.000 0.000 0.000 0.000 random.py:293(sample)
1 0.000 0.000 0.000 0.000 parsers.py:1040(__init__)
14 0.000 0.000 0.000 0.000 numeric.py:392(asarray)
2 0.000 0.000 0.000 0.000 internals.py:2296(_rebuild_blknos_and_blklocs)
4 0.000 0.000 0.000 0.000 common.py:731(take_nd)
1 0.000 0.000 0.004 0.004 pandasPerfTest.py:18(readCsvWrandom)
4 0.000 0.000 0.000 0.000 internals.py:3743(_stack_arrays)
3 0.000 0.000 0.000 0.000 index.py:123(__new__)
1 0.000 0.000 0.000 0.000 indexing.py:1388(_is_valid_list_like)
1 0.000 0.000 0.000 0.000 index.py:896(__getitem__)
7 0.000 0.000 0.000 0.000 _methods.py:31(_any)
9 0.000 0.000 0.000 0.000 index.py:4471(_ensure_index)
1 0.000 0.000 0.000 0.000 internals.py:3585(form_blocks)
1 0.000 0.000 0.000 0.000 numeric.py:462(asanyarray)
8 0.000 0.000 0.000 0.000 internals.py:63(__init__)
7 0.000 0.000 0.000 0.000 index.py:882(__contains__)
7 0.000 0.000 0.000 0.000 series.py:2561(_sanitize_array)
4 0.000 0.000 0.000 0.000 numeric.py:586(require)
3 0.000 0.000 0.000 0.000 common.py:1978(_possibly_infer_to_datetimelike)
8 0.000 0.000 0.000 0.000 internals.py:2094(make_block)
1 0.000 0.000 0.000 0.000 frame.py:4692(extract_index)
1 0.000 0.000 0.003 0.003 parsers.py:221(_read)
1 0.000 0.000 0.000 0.000 _methods.py:27(_prod)
21 0.000 0.000 0.000 0.000 common.py:59(_check)
1 0.000 0.000 0.001 0.001 indexing.py:1441(_getitem_axis)
4 0.000 0.000 0.000 0.000 internals.py:850(take_nd)
1 0.000 0.000 0.003 0.003 parsers.py:329(parser_f)
1 0.000 0.000 0.000 0.000 indexing.py:1688(_maybe_convert_indices)
2 0.000 0.000 0.000 0.000 common.py:2275(_asarray_tuplesafe)
1 0.000 0.000 0.000 0.000 internals.py:3295(take)
1 0.000 0.000 0.000 0.000 index.py:973(take)
1 0.000 0.000 0.000 0.000 _methods.py:15(_amax)
1 0.000 0.000 0.001 0.001 generic.py:1304(take)
2 0.000 0.000 0.000 0.000 common.py:1903(_possibly_cast_to_datetime)
2 0.000 0.000 0.000 0.000 internals.py:3709(_multi_blockify)
8 0.000 0.000 0.000 0.000 internals.py:140(mgr_locs)
7 0.000 0.000 0.000 0.000 internals.py:2242(shape)
2 0.000 0.000 0.000 0.000 internals.py:2201(__init__)
2 0.000 0.000 0.000 0.000 internals.py:2422(_verify_integrity)

1.80713701248
0.965622901917

Whoa there, what is this? How do I make sense of it, and how is it going to help? Well, to begin with, I'm being too lazy to read through the pandas source, so I'm just trying out random, easy stuff. This may not help at all. Second, this doesn't seem to give any resolution at the C function level, so if you need that, you'll have to try something more low-level.

* All code used for this blog post can be found here.
