>>> import _dictinfo
>>> from timeit import timeit
>>> def bits(n):
...    n += 2**32
...    return bin(n)[-32:]  # remove '0b'
...
>>> print bits(1)
00000000000000000000000000000001
>>> print bits(-1)
11111111111111111111111111111111

# All timeit() metrics were performed with
# Python 2.6 as packaged on Ubuntu 9.10, and
# run on my 2GHz Dell Latitude D630.

The Mighty Dictionary

Brandon Craig Rhodes

PyCon 2010 Atlanta

Q: How can Python lists access
every one of their items
with equal speed?

timeit('mylist[0]', 'mylist = [1] * 9000')
# --> 0.053692102432250977
#     ~50 ns per getitem

timeit('mylist[7000]', 'mylist = [1] * 9000')
# --> 0.051460027694702148
#     ~50 ns per getitem

A: Python lists use segments of RAM

and RAM acts like a Python list (!)

RAM is a vast array
Addressed by sequential integers
Its first address is zero!
Easy to implement a list atop memory

The Dictionary

Uses keys instead of indexes

and keys can be almost anything

>>> d = {
...    'Brandon': 35,
...    3.1415: 'pi',
...    'flickr.com': '68.142.214.24',
...    (2, 6, 4): 'Python version',
...    }

How can we turn
the keys dictionaries use
into indexes that reach memory quickly?

The Three Rules

#1 A dictionary is really a list

>>> # An empty dictionary is an 8-element list!
>>> d = {}

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert0.png

>>> # This “list” of “items” is managed
>>> # as a “hash table” containing “slots”

The Three Rules

#1 A dictionary is really a list

#2 Keys are hashed to produce indexes

Python lets you see hashing

in action through the hash() builtin

>>> for key in 'Monty', 3.1415, (2, 6, 4):
...     print bits(hash(key)), key
01100111100110010110110011111110 Monty
01101010101011010000100100000010 3.1415
01000111010110111010001100110111 (2, 6, 4)

Quite similar values often have

very different hashes

>>> k1 = bits(hash('Monty'))
>>> k2 = bits(hash('Money'))
>>> diff = ('^ '[a==b] for a,b in zip(k1, k2))
>>> print k1; print k2; print ''.join(diff)
01100111100110010110110011111110
01100110101101001000101011101001
       ^  ^ ^^ ^^^^  ^^    ^ ^^^

Hashes look crazy, but the same value

always returns the same hash!

>>> for key in 3.1415, 3.1415, 3.1415:
...     print bits(hash(key)), key
01101010101011010000100100000010 3.1415
01101010101011010000100100000010 3.1415
01101010101011010000100100000010 3.1415

Keys and Indexes

To build an index, Python uses

the bottom n bits of the hash

>>> d['ftp'] = 21

>>> b = bits(hash('ftp'))
>>> print b
11010010011111111001001010100001
>>> print b[-3:]  # last 3 bits = 8 combinations
001

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert1a.png

>>> d['ftp'] = 21

>>> b = bits(hash('ftp'))
>>> print b
11010010011111111001001010100001
>>> print b[-3:]  # last 3 bits = 8 combinations
001

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert1b.png

>>> d['ssh'] = 22

>>> print bits(hash('ssh'))[-3:]
101

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert2a.png

>>> d['ssh'] = 22

>>> print bits(hash('ssh'))[-3:]
101

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert2b.png

>>> d['smtp'] = 25

>>> print bits(hash('smtp'))[-3:]
100

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert3a.png

>>> d['smtp'] = 25

>>> print bits(hash('smtp'))[-3:]
100

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert3b.png

>>> d['time'] = 37

>>> print bits(hash('time'))[-3:]
111

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert4a.png

>>> d['time'] = 37

>>> print bits(hash('time'))[-3:]
111

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert4b.png

>>> d['www'] = 80

>>> print bits(hash('www'))[-3:]
010

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert5a.png

>>> d['www'] = 80

>>> print bits(hash('www'))[-3:]
010

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert5b.png

d = {'ftp': 21, 'ssh': 22,
     'smtp': 25, 'time': 37,
     'www': 80}

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/insert6.png

Lookup: same 3 steps

Compute the hash
Truncate it
Look in that slot

>>> print d['smtp']
25

>>> print bits(hash('smtp'))[-3:]
100

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/lookup1a.png

Consequence #1

Dictionaries tend to return their

contents in a crazy order

>>> # Different than our insertion order:
>>> print d
{'ftp': 21, 'www': 80, 'smtp': 25, 'ssh': 22,
 'time': 37}
>>> # But same order as in the hash table!

>>> # keys and values also in table order
>>> d.keys()
['ftp', 'www', 'smtp', 'ssh', 'time']
>>> d.values()
[21, 80, 25, 22, 37]

The Three Rules

#1 A dictionary is really a list

#2 Keys are hashed to produce indexes

#3 If at first you don't

succeed, try, try again

“Collision”

When two keys in a dictionary

want the same slot

>>> # start over with a new dictionary
>>> d = {}

>>> # first item inserts fine
>>> d['smtp'] = 21

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide1a.png

>>> # first item inserts fine
>>> d['smtp'] = 21

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide1b.png

>>> # second item collides!
>>> d['dict'] = 2628

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide2a.png

>>> # second item collides!
>>> d['dict'] = 2628

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide2b.png

>>> # third item also finds empty slot
>>> d['svn'] = 3690

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide3a.png

>>> # third item also finds empty slot
>>> d['svn'] = 3690

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide3b.png

>>> # fourth item has multiple collisions
>>> d['ircd'] = 6667

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide4a.png

>>> # fourth item has multiple collisions
>>> d['ircd'] = 6667

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide4b.png

>>> # fifth item collides, but less deeply
>>> d['zope'] = 9673

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5a.png

>>> # fifth item collides, but less deeply
>>> d['zope'] = 9673

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5b.png

# Only ⅖ of the keys in this dictionary
# can be found in the right slot

Consequence #2

Because collisions move keys
away from their natural hash values,
key order is quite sensitive
to dictionary history

>>> d = {'smtp': 21, 'dict': 2628,
...   'svn': 3690, 'ircd': 6667, 'zope': 9673}
>>> d.keys()
['svn', 'dict', 'zope', 'smtp', 'ircd']

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/keyorder1.png

>>> e = {'ircd': 6667, 'zope': 9673,
...   'smtp': 21, 'dict': 2628, 'svn': 3690}
>>> e.keys()
['ircd', 'zope', 'smtp', 'svn', 'dict']

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/keyorder2.png

The same yet different

Although these two dictionaries
are considered equal, their different
histories put their keys
in a different order

>>> d == e
True
>>> d.keys()
['svn', 'dict', 'zope', 'smtp', 'ircd']
>>> e.keys()
['ircd', 'zope', 'smtp', 'svn', 'dict']

Consequence #3

The lookup algorithm is actually
more complicated than
“hash, truncate, look”

Consequence #3

It's more like “until you find
an empty slot, keep looking,
it could be here somewhere!”

>>> # Successful lookup, length 1
>>> # Compares HASHES then compares VALUES
>>> d['svn']
3690

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5d.png

>>> # Successful lookup, length 4
>>> d['ircd']
6667

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5e.png

>>> # Unsuccessful lookup, length 1
>>> d['nsca']
Traceback (most recent call last):
  ...
KeyError: 'nsca'

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5f.png

>>> # Unsuccessful lookup, length 4
>>> d['netstat']
Traceback (most recent call last):
  ...
KeyError: 'netstat'

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5g.png

Consequence #4

Not all lookups are created equal.

Some finish at their first slot

Some loop over several slots

Stupid Dictionary Trick #1

# Because integers hash as themselves,
# we can create unlimited collisions!
threes = {3: 1, 3+8: 2, 3+16: 3,
          3+24: 4, 3+32: 5}

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/stupid1.png

Stupid Dictionary Trick #1

# Thanks to piling collisions atop each
# other, we can make lookup more expensive
timeit('d[3]', 'd=%r' % threes)    # -> 0.078
timeit('d[3+32]', 'd=%r' % threes) # -> 0.082

Consequence #5

When deleting a key,
you need to leave
“dummy” keys

del d['smtp']

# Can we simply make its slot empty?

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5c.png

del d['smtp']

# But what would happen to d['ircd']?

When a key is deleted,
its slot cannot simply
be marked as empty

Otherwise, any keys
that collided with it would
now be impossible to find!

So we create a dummy key instead

>>> # Creates a <dummy> slot that
>>> # can be re-used as storage

>>> del d['smtp']

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5h.png

>>> # That way, we can still find d['ircd']

>>> d['ircd']
6667

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5i.png

Stupid Dictionary Trick #2

>>> del d['svn'], d['dict'], d['zope']
>>> d['ircd']
6667
>>> # Still requires 4 steps!

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/collide5j.png

Dicts refuse to get full

To keep collisions rare,

dicts resize when only ⅔ full

When < 50k entries, size ×4

When > 50k entries, size ×2

Let's watch a dictionary in action
against words pulled from the standard
dictionary on my Ubuntu box

>>> wordfile = open('/usr/share/dict/words')
>>> text = wordfile.read().decode('utf-8')
>>> words = [ w for w in text.split()
...     if w == w.lower() and len(w) < 6 ]
>>> words
[u'a', u'abaci', u'aback', u'abaft', u'abase',
 ..., u'zoom', u'zooms', u'zoos', ...]

d = {}
# Again, an empty dict has 8 slots
# Let's start filling it with keys

d = dict.fromkeys(words[:5])
# collision rate 40%
# but now ⅔ full — on verge of resizing!

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words5.png

d['abash'] = None
# Resizes ×4 to 32, collision rate drops to 0%

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words6.png

d = dict.fromkeys(words[:21])
# ⅔ full again — collision rate 29%

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words21.png

d['abode'] = None
# Resizes ×4 to 128, collision rate drops to 9%

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words22.png

d = dict.fromkeys(words[:85])
# ⅔ full again — collision rate 33%

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/words85.png

Life cycle as dictionary fills:
Gradually more crowded as keys are added
Then suddenly less as dict resizes

Consequence #6

Average dictionary

performance is excellent

Real-life collisions

A dictionary of common words:

>>> wfile = open('/usr/share/dict/words')
>>> words = wfile.read().split()[:1365]
>>> print words
['A', "A's", ..., "Backus's", 'Bacon', "Bacon's"]

We can examine which keys collide:

>>> pmap = _dictinfo.probe_all_steps(words)

Some keys are in the first slot probed:

>>> pmap['Ajax']
[1330]
>>> pmap['Agamemnon']
[2020]

While some keys collided several times:

>>> pmap['Aristarchus']  # requires 5 probes
[864, 1089, 801, 1108, 74]
>>> pmap['Baal']         # requires 16 probes!
[916, 1401, 250, 1359, 399, 1156, 1722, 420, 53,
 266, 1331, 512, 513, 518, 543, 668]

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/average_probes.png

But probes are very fast

>>> setup = "d=dict.fromkeys(%r)" % words
>>> fast = timeit("d['Ajax']", setup)
>>> slow = timeit("d['Baal']", setup)
>>> '%.1f' % (slow/fast)
'1.7'

http://rhodesmill.org/brandon/slides/2010-03-pycon/figures/average_time.png

Consequence #7

Because of resizing,
a dictionary can completely reorder
during an otherwise innocent insert

>>> d = {'Double': 1, 'double': 2, 'toil': 3,
...      'and': 4, 'trouble': 5}
>>> d.keys()
['toil', 'Double', 'and', 'trouble', 'double']
>>> d['fire'] = 6
>>> d.keys()
['and', 'fire', 'Double', 'double', 'toil',
 'trouble']

Consequence #8

Because an insert can radically
reorder a dictionary, key insertion
is prohibited during iteration

>>> d = {'Double': 1, 'double': 2, 'toil': 3,
...     'and': 4, 'trouble': 5}
>>> for key in d:
...     d['fire'] = 6
Traceback (most recent call last):
  ...
RuntimeError: dictionary changed size during
  iteration

Take-away #1

Hopefully “the rules”
now make a bit more sense
and seem less arbitrary

Don't rely on order
Don't insert while iterating
Can't have mutable keys

Take-away #2

Dictionaries trade space for time

If you need more space,

there are alternatives

Tuples or namedtuples (Python 2.6)
Give classes __slots__

Take-away #3

If your class needs its own __hash__()
method you now know how hashes
should behave

Scatter bits like crazy
Equal instances must have equal hashes
Must also implement __eq__() method
Make hash and equality quick!

(You can often get away with ^ xor'ing

the hashes of your instance variables)

Hashing your own classes

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, p):
        return self.x == p.x and self.y == p.y

    def __hash__(self):
        return hash(self.x) ^ hash(self.y)

Take-away #4

Equal values
should have equal hashes
regardless of their type!

>>> hash(9)
9
>>> hash(9.0)
9
>>> hash(complex(9, 0))
9

The End

May your hashes be unique,
Your hash tables never full,
And may your keys rarely collide

Other material

How much time does malloc take? Both on going bigger and smaller!

emphasize that hashes compared before objects
make steps of dictionary lookup clearer at beginning (HASH then COMPARE)
talk about how setdefault() does only one lookup