>>> import _dictinfo
>>> from timeit import timeit
>>> def bits(n):
...    n += 2**32
...    return bin(n)[-32:]  # remove '0b'
...
>>> print bits(1)
00000000000000000000000000000001
>>> print bits(-1)
11111111111111111111111111111111
# All timeit() metrics were performed with
# Python 2.6 as packaged on Ubuntu 9.10, and
# run on my 2GHz Dell Latitude D630.

The Mighty Dictionary

Brandon Craig Rhodes
PyCon 2010 Atlanta
Q: How can Python lists access
every one of their items
with equal speed?
timeit('mylist[0]', 'mylist = [1] * 9000')
# --> 0.053692102432250977
#     ~50 ns per getitem

timeit('mylist[7000]', 'mylist = [1] * 9000')
# --> 0.051460027694702148
#     ~50 ns per getitem
A: Python lists use segments of RAM
and RAM acts like a Python list (!)
  • RAM is a vast array
  • Addressed by sequential integers
  • Its first address is zero!
  • Easy to implement a list atop memory

The Dictionary

Uses keys instead of indexes
and keys can be almost anything
>>> d = {
...    'Brandon': 35,
...    3.1415: 'pi',
...    'flickr.com': '68.142.214.24',
...    (2, 6, 4): 'Python version',
...    }
How can we turn
the keys dictionaries use
into indexes that reach memory quickly?

The Three Rules

#1 A dictionary is really a list
>>> # An empty dictionary is an 8-element list!
>>> d = {}
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert0.png
>>> # This “list” of “items” is managed
>>> # as a “hash table” containing “slots”
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert0.png

The Three Rules

#1 A dictionary is really a list
#2 Keys are hashed to produce indexes
Python lets you see hashing
in action through the hash() builtin
>>> for key in 'Monty', 3.1415, (2, 6, 4):
...     print bits(hash(key)), key
01100111100110010110110011111110 Monty
01101010101011010000100100000010 3.1415
01000111010110111010001100110111 (2, 6, 4)
Quite similar values often have
very different hashes
>>> k1 = bits(hash('Monty'))
>>> k2 = bits(hash('Money'))
>>> diff = ('^ '[a==b] for a,b in zip(k1, k2))
>>> print k1; print k2; print ''.join(diff)
01100111100110010110110011111110
01100110101101001000101011101001
       ^  ^ ^^ ^^^^  ^^    ^ ^^^
Hashes look crazy, but the same value
always returns the same hash!
>>> for key in 3.1415, 3.1415, 3.1415:
...     print bits(hash(key)), key
01101010101011010000100100000010 3.1415
01101010101011010000100100000010 3.1415
01101010101011010000100100000010 3.1415

Keys and Indexes

To build an index, Python uses
the bottom n bits of the hash
>>> d['ftp'] = 21
>>> b = bits(hash('ftp'))
>>> print b
11010010011111111001001010100001
>>> print b[-3:]  # last 3 bits = 8 combinations
001
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert1a.png
>>> d['ftp'] = 21
>>> b = bits(hash('ftp'))
>>> print b
11010010011111111001001010100001
>>> print b[-3:]  # last 3 bits = 8 combinations
001
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert1b.png
>>> d['ssh'] = 22
>>> print bits(hash('ssh'))[-3:]
101
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert2a.png
>>> d['ssh'] = 22
>>> print bits(hash('ssh'))[-3:]
101
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert2b.png
>>> d['smtp'] = 25
>>> print bits(hash('smtp'))[-3:]
100
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert3a.png
>>> d['smtp'] = 25
>>> print bits(hash('smtp'))[-3:]
100
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert3b.png
>>> d['time'] = 37
>>> print bits(hash('time'))[-3:]
111
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert4a.png
>>> d['time'] = 37
>>> print bits(hash('time'))[-3:]
111
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert4b.png
>>> d['www'] = 80
>>> print bits(hash('www'))[-3:]
010
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert5a.png
>>> d['www'] = 80
>>> print bits(hash('www'))[-3:]
010
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert5b.png
d = {'ftp': 21, 'ssh': 22,
     'smtp': 25, 'time': 37,
     'www': 80}
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert6.png

Lookup: same 3 steps

  • Compute the hash
  • Truncate it
  • Look in that slot
>>> print d['smtp']
25
>>> print bits(hash('smtp'))[-3:]
100
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/lookup1a.png

Consequence #1

Dictionaries tend to return their
contents in a crazy order
>>> # Different than our insertion order:
>>> print d
{'ftp': 21, 'www': 80, 'smtp': 25, 'ssh': 22,
 'time': 37}
>>> # But same order as in the hash table!
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert6.png
>>> # keys and values also in table order
>>> d.keys()
['ftp', 'www', 'smtp', 'ssh', 'time']
>>> d.values()
[21, 80, 25, 22, 37]
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert6.png

The Three Rules

#1 A dictionary is really a list
#2 Keys are hashed to produce indexes
#3 If at first you don't
succeed, try, try again

“Collision”

When two keys in a dictionary
want the same slot
>>> # start over with a new dictionary
>>> d = {}
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert0.png
>>> # first item inserts fine
>>> d['smtp'] = 21
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide1a.png
>>> # first item inserts fine
>>> d['smtp'] = 21
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide1b.png
>>> # second item collides!
>>> d['dict'] = 2628
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide2a.png
>>> # second item collides!
>>> d['dict'] = 2628
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide2b.png
>>> # third item also finds empty slot
>>> d['svn'] = 3690
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide3a.png
>>> # third item also finds empty slot
>>> d['svn'] = 3690
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide3b.png
>>> # fourth item has multiple collisions
>>> d['ircd'] = 6667
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide4a.png
>>> # fourth item has multiple collisions
>>> d['ircd'] = 6667
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide4b.png
>>> # fifth item collides, but less deeply
>>> d['zope'] = 9673
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5a.png
>>> # fifth item collides, but less deeply
>>> d['zope'] = 9673
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5b.png
# Only ⅖ of the keys in this dictionary
# can be found in the right slot
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5b.png

Consequence #2

Because collisions move keys
away from their natural hash values,
key order is quite sensitive
to dictionary history
>>> d = {'smtp': 21, 'dict': 2628,
...   'svn': 3690, 'ircd': 6667, 'zope': 9673}
>>> d.keys()
['svn', 'dict', 'zope', 'smtp', 'ircd']
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/keyorder1.png
>>> e = {'ircd': 6667, 'zope': 9673,
...   'smtp': 21, 'dict': 2628, 'svn': 3690}
>>> e.keys()
['ircd', 'zope', 'smtp', 'svn', 'dict']
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/keyorder2.png

The same yet different

Although these two dictionaries
are considered equal, their different
histories put their keys
in a different order
>>> d == e
True
>>> d.keys()
['svn', 'dict', 'zope', 'smtp', 'ircd']
>>> e.keys()
['ircd', 'zope', 'smtp', 'svn', 'dict']

Consequence #3

The lookup algorithm is actually
more complicated than
“hash, truncate, look”

Consequence #3

It's more like “until you find
an empty slot, keep looking,
it could be here somewhere!”
>>> # Successful lookup, length 1
>>> # Compares HASHES then compares VALUES
>>> d['svn']
3690
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5d.png
>>> # Successful lookup, length 4
>>> d['ircd']
6667
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5e.png
>>> # Unsuccessful lookup, length 1
>>> d['nsca']
Traceback (most recent call last):
  ...
KeyError: 'nsca'
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5f.png
>>> # Unsuccessful lookup, length 4
>>> d['netstat']
Traceback (most recent call last):
  ...
KeyError: 'netstat'
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5g.png

Consequence #4

Not all lookups are created equal.
Some finish at their first slot
Some loop over several slots

Stupid Dictionary Trick #1

# Because integers hash as themselves,
# we can create unlimited collisions!
threes = {3: 1, 3+8: 2, 3+16: 3,
          3+24: 4, 3+32: 5}
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/stupid1.png

Stupid Dictionary Trick #1

# Thanks to piling collisions atop each
# other, we can make lookup more expensive
timeit('d[3]', 'd=%r' % threes)    # -> 0.078
timeit('d[3+32]', 'd=%r' % threes) # -> 0.082
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/stupid1.png

Consequence #5

When deleting a key,
you need to leave
“dummy” keys
del d['smtp']

# Can we simply make its slot empty?
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5c.png
del d['smtp']

# But what would happen to d['ircd']?
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5e.png
When a key is deleted,
its slot cannot simply
be marked as empty
Otherwise, any keys
that collided with it would
now be impossible to find!
So we create a dummy key instead
>>> # Creates a <dummy> slot that
>>> # can be re-used as storage
>>> del d['smtp']
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5h.png
>>> # That way, we can still find d['ircd']
>>> d['ircd']
6667
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5i.png

Stupid Dictionary Trick #2

>>> del d['svn'], d['dict'], d['zope']
>>> d['ircd']
6667
>>> # Still requires 4 steps!
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5j.png

Dicts refuse to get full

To keep collisions rare,
dicts resize when only ⅔ full
When < 50k entries, size ×4
When > 50k entries, size ×2
Let's watch a dictionary in action
against words pulled from the standard
dictionary on my Ubuntu box
>>> wordfile = open('/usr/share/dict/words')
>>> text = wordfile.read().decode('utf-8')
>>> words = [ w for w in text.split()
...     if w == w.lower() and len(w) < 6 ]
>>> words
[u'a', u'abaci', u'aback', u'abaft', u'abase',
 ..., u'zoom', u'zooms', u'zoos', ...]
d = {}
# Again, an empty dict has 8 slots
# Let's start filling it with keys
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert0.png
d = dict.fromkeys(words[:5])
# collision rate 40%
# but now ⅔ full — on verge of resizing!
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words5.png
d['abash'] = None
# Resizes ×4 to 32, collision rate drops to 0%
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words6.png
d = dict.fromkeys(words[:21])
# ⅔ full again — collision rate 29%
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words21.png
d['abode'] = None
# Resizes ×4 to 128, collision rate drops to 9%
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words22.png
d = dict.fromkeys(words[:85])
# ⅔ full again — collision rate 33%
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words85.png
Life cycle as dictionary fills:
Gradually more crowded as keys are added
Then suddenly less as dict resizes

Consequence #6

Average dictionary
performance is excellent

Real-life collisions

A dictionary of common words:

>>> wfile = open('/usr/share/dict/words')
>>> words = wfile.read().split()[:1365]
>>> print words
['A', "A's", ..., "Backus's", 'Bacon', "Bacon's"]

We can examine which keys collide:

>>> pmap = _dictinfo.probe_all_steps(words)

Some keys are in the first slot probed:

>>> pmap['Ajax']
[1330]
>>> pmap['Agamemnon']
[2020]

While some keys collided several times:

>>> pmap['Aristarchus']  # requires 5 probes
[864, 1089, 801, 1108, 74]
>>> pmap['Baal']         # requires 16 probes!
[916, 1401, 250, 1359, 399, 1156, 1722, 420, 53,
 266, 1331, 512, 513, 518, 543, 668]
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/average_probes.png

But probes are very fast

>>> setup = "d=dict.fromkeys(%r)" % words
>>> fast = timeit("d['Ajax']", setup)
>>> slow = timeit("d['Baal']", setup)
>>> '%.1f' % (slow/fast)
'1.7'
http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/average_time.png

Consequence #7

Because of resizing,
a dictionary can completely reorder
during an otherwise innocent insert
>>> d = {'Double': 1, 'double': 2, 'toil': 3,
...      'and': 4, 'trouble': 5}
>>> d.keys()
['toil', 'Double', 'and', 'trouble', 'double']
>>> d['fire'] = 6
>>> d.keys()
['and', 'fire', 'Double', 'double', 'toil',
 'trouble']

Consequence #8

Because an insert can radically
reorder a dictionary, key insertion
is prohibited during iteration
>>> d = {'Double': 1, 'double': 2, 'toil': 3,
...     'and': 4, 'trouble': 5}
>>> for key in d:
...     d['fire'] = 6
Traceback (most recent call last):
  ...
RuntimeError: dictionary changed size during
  iteration

Take-away #1

Hopefully “the rules”
now make a bit more sense
and seem less arbitrary
  • Don't rely on order
  • Don't insert while iterating
  • Can't have mutable keys

Take-away #2

Dictionaries trade space for time
If you need more space,
there are alternatives
  • Tuples or namedtuples (Python 2.6)
  • Give classes __slots__

Take-away #3

If your class needs its own __hash__()
method you now know how hashes
should behave
  • Scatter bits like crazy
  • Equal instances must have equal hashes
  • Must also implement __eq__() method
  • Make hash and equality quick!
(You can often get away with ^ xor'ing
the hashes of your instance variables)

Hashing your own classes

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, p):
        return self.x == p.x and self.y == p.y

    def __hash__(self):
        return hash(self.x) ^ hash(self.y)

Take-away #4

Equal values
should have equal hashes
regardless of their type!
>>> hash(9)
9
>>> hash(9.0)
9
>>> hash(complex(9, 0))
9

The End

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert5b.png
May your hashes be unique,
Your hash tables never full,
And may your keys rarely collide

Other material

How much time does malloc take? Both on going bigger and smaller!

  • emphasize that hashes compared before objects
  • make steps of dictionary lookup clearer at beginning (HASH then COMPARE)
  • talk about how setdefault() does only one lookup