How Do We Store Data, Then?

J. Waldmann, March 9, 2017.

This is an addendum to the List Or Not List post, which says, “in most cases, you don’t want lists.” So how do we store data, then?

That’s described in any textbook on data structures and algorithms, and if it’s in any such book, then there’s bound to be a Haskell implementation for it. I do not aim to replace textbooks here, so I just give a summary, and pointers to typical implementations, and I recommend you go visit the library for detail. (A library is a house with rooms with books. Yes?)

A data store provides a the abstract type set (just data) or a map (from keys to data, with finite domain). What concrete type to use for the store’s implementation, depends on the operations that you want to do with the set or map, as a whole and with single elements.

Ordered Keys

If you want to access data by key, and you have a total order on the keys, use a balanced search tree (Data.Set, Data.Map). You can expect all operations on individual elements to take logarithmic cost in the number of elements stored, and that’s much better than the linear cost you will pay for lists.

Internally, these libraries use size-balanced binary trees:

    data Map k a  = Bin Size k a (Map k a) (Map k a)
                | Tip

Yes, it’s as simple as that.

What’s more, these libraries provide bulk operations (intersection, union, difference). Use these! This will make your programs more readable and more efficient. Consider this: you could compute the union of (the sets represented by) A and B by inserting into A, each element of B one after another. This already works if B is a list, and is the idea behind java.util.Collections.addAll. But we know more about B - it is represented by a balanced tree. So the library can use the balanced tree structure of A and of B to speed up the process. A sequence of individual inserts will only use the structure of A (and the trees built from it).

For some applications, access to the collection is restricted. For example, in a stack, you will only operate with its top element. We learned earlier that Haskell lists are stacks. An extension of this is a priority queue, which provides quick access to the entry with highest priority (but only to this one), plus arbitrary inserts.

Integer Keys

If your keys are (or can be mapped to) integers, then use a Patricia tree (Data.IntMap) or a hash table (Data.HashMap). Then you can expect effectively constant cost for individual operations. But a Patricia tree is still a tree? Yes, but the depth is at most the bit size of the key, and often smaller. Hashing is even better. The actual cost depends on how good your hash function avoids collisions.

Contiguous Integer Keys

If the set of keys forms a contiguous interval of integers, then you want to use them as indices into a contiguous storage area, which is then called an array in some languages, or a vector in a typical Haskell library (Data.Vector).

With that, access operations have no noticeable cost: to get an element by its index, you need one addition (of the index to the base address), and one memory access (which might of course incur a cache miss). There are even some no-cost bulk operations, because of slicing: A slice is a contiguous sub-vector that is denoted simply by using an existing vector with a different base address and length. The internal representation is indeed

    data Vector a = Vector Int Int (Array a)

so a vector consist of (a pointer to) an underlying array that holds actual data, plus an offset (first component) and a length (second). On the other hand, there is no free lunch - update and concatenation of vectors requires fresh allocation (of an area the size of the result) and bulk copies. Looks expensive, but since all areas are contiguous, it might not be too bad in some cases.

Catenable Sequences

What if keys are contiguous integers, and we need efficient access and efficient concatenation? Use Data.Sequence, which gives us logarithmic cost for access and for concatenation, and, as a bonus, constant time access to first and last element of the sequence. How does this work? It’s a balanced tree where the root has extra pointers to leftmost and rightmost leaf.

The code uses a nice type-level trick:

    newtype Seq a = Seq (FingerTree (Elem a))
  
    data FingerTree a
      = EmptyT
      | Single a
      | Deep Int (Digit a) (FingerTree (Node a)) (Digit a)

    data Node a
      = Node2 Int a a
      | Node3 Int a a a

FingerTree is a non-regular type constructor: notice that that the recursion changes the type argument (from a to Node a). For implementation details, RTFC.

Strings

Conceptually, a string is a sequence of characters. So, you could use a list, if you access them in order, and in all other cases, it’s something like a vector. Recommended libraries are Data.Text (if encodings are an issue), and Data.ByteString (for raw binary data). Both come in two variants: a strict one, where each string is actually one vector, as described above, and a lazy one, where each string is a lazy singly linked list of vectors.

There is much more to be said about the (mis-)use of strings in Haskell. The Haskell standard definition String = [Char] uses lists, so all the problems of lists mentioned above do apply directly, and then some - coming from the implied assumption that characters are basically numbers, which they are not, in today’s world of Unicode. Also, the standard prescribes that String be used in parsing (class Read) and printing (class Show). All this makes it a suitable topic for a separate rant, but we’ll keep our focus on lists for now.

Practice Problem

As an exercise, browse the function signatures and specifications in Data.List and mark those that are misguided and misguiding by the above reasoning.

Hint: a lot of them are. It starts with the index operator

(!!) :: [a] -> Int -> a

Each time you use the !! operator, there’s a 99 percent chance that you’re using a list in a non-linear, and thus inefficient, way. Check the arguments of xs !! i - if there will be another call xs !! j with the same left argument, then you will be traversing the list xs twice.

Data Structures in Other Languages’ Libraries

Take your favourite language and check their standard libraries:

  • what abstractions (Set, Map) and what implementations (Trees, Vectors, …) do they provide,
  • and how does this compare to the Haskell containers framework?
  • In particular, how you would they (or you) handle bulk operations (union, intersection)?
  • And - do they avoid lists where they should?

Be aware of differences in naming. For instance, java.util uses the name List not for a concrete datatype, but for an interface. This comes with two main implementations: LinkedList and ArrayList.

Now the linked lists are actually doubly linked, which means that their iterator can go back and forth. Other than that, all the reservations about linked lists apply. If you use the iterator of a linked list, then you’re fine, if you access it in any other way, then you’re not. In particular, access by index slow.

An ArrayList is essentially what we called a vector. Exercise: do they provide efficient slicing? Answer: yes, but not exactly at this type. Use API doc and source to find out.

Also, check how API authors document resource usage. Example: Java does mention logarithmic cost in the preamble for TreeSet but not for individual operations. Praise the authors of the Haskell containers library. They document resource usage so well that it deserves to be linked again.