There is a saying to the effect of
First get your data structures right, then the program will write itself.
The following rant is about the question: When are (singly linked) lists the right structure to use? My answer is: almost never, despite appearances that would suggest the contrary.
At first glance, the designers of the programming language Haskell were madly in love with singly linked lists. They even granted notational privileges: e.g., the data constructor for a non-empty list ("cons") uses just one letter (`:`).
Is one concrete data type (singly linked lists) more important than the concept of static typing? Looking at the Haskell standard, the answer seems to be Yes. When declaring the type of an identifier, we must use two letters (`::`), because the natural (that is, mathematical) notation `:` is taken.
Also, note that the list type constructor has just two letters (`[]`). Can you guess Haskell's type with the next shortest name (hint: three letters)?
One goal of language design is to make typical expected usage easy to write. Once the design is cast in stone, this becomes self-fulfilling: unsuspecting users of the language will write programs in what they think is idiomatic style, as suggested by built-in syntactical convenience. So it seems that it is typical Haskell to: not declare types for identifiers, and use machine numbers and lists all over the place.
This, with the lists, I mean, needs to stop. Now.
Lists do have their purpose, but it's rather limited. In fact, there is just one: iteration, that is, traversing the elements of a sequence in order, front to back.
If you use lists for anything else, in particular as a data store that you access out of order or by index, then you're doing it wrong. I will explain why, and what you should do instead. This turned out to grow into a rather large text. As an appetizer, here are some bumper-sticker slogans extracted from the text:

- If your program accesses a list by index (the `!!` operator), then your program is wrong.
- If your program uses the `length` function, then your program is wrong.
- Where a function of yours merely iterates over its input, generalize its type from lists to `Foldable`.

Sounds unbelievable? Read on!
Certainly Haskell builds on the legacy of LISP, the very first functional programming language - which has lists in its name already. Indeed, singly linked lists were the only way to structure nested data in LISP.
This is a huge advantage (it's so uniform), but also a huge disadvantage (you cannot distinguish data - they all look like nested lists). For discriminating between different shapes of nested data, people then invented algebraic data types, pattern matching, and static typing. The flip side of static typing is that it seems to hinder uniform data processing. So, language designers invented polymorphism (type constructors) and generic programming, to get back flexibility and uniformity, while keeping static safety.
This was all well known at the time of Haskell's design, so why are lists so prominent in the language? The one major design goal for Haskell was: to define a functional programming language with static typing and lazy evaluation. For functional programming with static typing and strict evaluation, the language ML was already well established, cf. [D. A. Turner: Some History of Functional Programming Languages](https://www.cs.kent.ac.uk/people/staff/dat/tfp12/tfp12.pdf). Haskell has lazy evaluation for function calls, including constructor applications, and a prime showcase for this is that we can handle apparently infinite data, like streams:
data Stream a = Cons a (Stream a)
Such streams never end, but we can still do reasonable things with them by evaluating finite prefixes on demand - in other words, lazily.
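To make this concrete, here is a small sketch (my own example, repeating the Stream type from above for self-containment) of consuming a finite prefix of an infinite stream on demand:

-- Sketch (not from the original text): an infinite stream of naturals,
-- and a function that lazily extracts a finite prefix.
data Stream a = Cons a (Stream a)

nats :: Stream Integer
nats = go 0 where go n = Cons n (go (n + 1))

takeS :: Int -> Stream a -> [a]
takeS n (Cons x xs)
  | n <= 0    = []
  | otherwise = x : takeS (n - 1) xs

-- takeS 5 nats == [0,1,2,3,4]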
Now a standard Haskell list is just such a stream that can end, because there is an extra constructor for the empty stream
data Stream a = Nil | Cons a (Stream a)
and the problem with Haskell lists is that people tend to confuse what the type `[a]` stands for: a stream, that is, a sequence of elements produced on demand and traversed in order, or a collection, that is, a container that stores elements for later access.
If you have an object-oriented background, then this can be phrased as the confusion between Iterator and Collection.
A Haskell list is a very good iterator - we might even say, it is the essence of iteration - but it makes for a very bad implementation of a collection.
The (historic) first impression of "functional programming is inherently list processing" seems hard to erase. Indeed it is perpetuated by some Haskell teaching texts, some of which are popular. If a text(book) aimed at beginners presents `Int`, `String` and `[Int]` on pages 1, 2, 3, but the keywords `data` and `case` (and the assorted concepts of algebraic data types and pattern matching) appear only after the first two hundred pages, well, then something's wrong.
For example, the shiny new http://haskell.org/ gets this exactly wrong. At the very top of the page, it presents example code that is supposed to highlight typical features of the language. We see a version of Eratosthenes' sieve that generates an infinite list of primes. Declarative it certainly is. But see - Lists, numbers. Must be typical Haskell.
Also, that example code features list comprehensions - that's more syntactic sugar to misguide beginners. And it claims "statically typed code" in the previous line, but - there is no type annotation in the code. Yes, I know, the types are still there, because they are inferred by the compiler, but what does a beginner think?
The icing on all this is that the one thing that this example code really exemplifies, namely: infinite streams with lazy evaluation, is not mentioned anywhere in the vicinity.
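For reference, the front-page example is roughly the following (a sketch from memory, not a verbatim copy), with the type annotation that the page omits added:

primes :: [Integer]
primes = filterPrime [2 ..]
  where
    filterPrime (p : xs) =
      p : filterPrime [x | x <- xs, x `mod` p /= 0]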
So, when should we use lists to store data? The answer is simple: never use lists for data storage with out-of-order access.
Accessing an element of a singly linked list at some arbitrary position i takes time proportional to i, and you do not want that.
The only operations that are cheap with a singly linked list are accessing its first element (the head of the list) and prepending a fresh first element.
So can we access any of the other elements in any reasonable way? Yes, we can access the second element, but only after we drop the first one. The second element becomes the "new head", and the "old head" is then gone for good. We are singly linked, we cannot go back. If we want to access any element in a list, we have to access, and then throw away, all the elements to the left of it.
In other words, Haskell lists are very similar to what object-oriented people call the iterator pattern (e.g., the Iterator interface in Java).
Or, we can think of a list as a stack, and going to the next element is its "pop" operation.
Can we really never go back? Well, if `xs` is an iterator for some collection, then `x : xs` is an iterator that will first provide `x`, and after that, all elements of `xs`. So, thinking of a list as a stack, the list constructor (`:`) realizes the "push" operation.
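To spell out the stack analogy (the function names push and pop are mine, not from any library):

-- Sketch: a list viewed as a stack.
push :: a -> [a] -> [a]
push = (:)

pop :: [a] -> Maybe (a, [a])
pop []       = Nothing
pop (x : xs) = Just (x, xs)

-- pop (push 1 [2,3]) == Just (1, [2,3])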
On the other hand, the "merge" function (of "mergesort" fame, it merges two monotonically increasing sequences) is a good use case for Haskell lists, since both inputs are traversed in order (individually - globally, the two traversals are interleaved). Its source code is
merge :: Ord a => [a] -> [a] -> [a]
merge as@(a:as') bs@(b:bs')
  | compare a b == GT = b : merge as bs'
  | otherwise         = a : merge as' bs
merge [] bs = bs
merge as [] = as
We never copy lists, and we only ever pattern match on their beginning.
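For example, `merge [1,3,5] [2,4,6]` evaluates to `[1,2,3,4,5,6]`, and thanks to laziness we can even merge two infinite monotone lists and then consume a finite prefix of the result.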
Indeed you can easily (in Java or C#) write a `merge` function that takes two iterators for monotone sequences, and produces one. Try it, it's a fun exercise. Use it to learn about `yield return` in C#.
Back to Haskell: how should we implement merge sort then, given `merge`? The truthful answer is: we should not, see below. But for the moment, let's try anyway, as an exercise. A straightforward approach is to split long input lists at (or near) the middle:
msort :: Ord a => [a] -> [a]
msort [] = [] ; msort [x] = [x]
msort xs = let (lo, hi) = splitAt (div (length xs) 2) xs
           in  merge (msort lo) (msort hi)
but this is already wrong. Here, `xs` is not used in a linear way, but twice: first in `length xs` (to compute the position of the split), and secondly in `splitAt ... xs`, to actually do the split. This is inefficient, since the list is traversed twice.
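A one-pass alternative (the names halve and msort' are mine) is to deal the elements alternately into two halves; that each half loses the original order does not matter, since we sort them anyway:

-- Sketch: split a list in one traversal by dealing elements alternately.
halve :: [a] -> ([a], [a])
halve []           = ([], [])
halve [x]          = ([x], [])
halve (x : y : zs) = let (xs, ys) = halve zs in (x : xs, y : ys)

msort' :: Ord a => [a] -> [a]
msort' []  = []
msort' [x] = [x]
msort' xs  = let (lo, hi) = halve xs
             in  merge (msort' lo) (msort' hi)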
Indeed, the Haskell standard library contains an implementation of mergesort that does something different. The basic idea is
msort :: Ord a => [a] -> [a]
msort xs = mergeAll (map (\x -> [x]) xs)
  where
    mergeAll []  = []            -- needed so that msort [] does not crash
    mergeAll [x] = x
    mergeAll xs  = mergeAll (mergePairs xs)

    mergePairs (a : b : xs) = merge a b : mergePairs xs
    mergePairs xs           = xs
But the actual implementation has an extra provision for handling monotone segments of the input more efficiently. Make sure to read the discussion mentioned in the comments atop the source.
But - if you sort a list in your application at all, then you're probably doing something wrong already. There's a good chance that your data should have never been stored in a list from the start.
What to use instead is described in any textbook on data structures and algorithms, and if it's in any such book, then there's bound to be a Haskell implementation of it. I do not aim to replace textbooks here, so I just give a summary, and pointers to typical implementations, and I recommend you go visit the library for details. (A library is a house with rooms with books. Yes?)
A data store provides the abstract type of a set (just data) or a map (from keys to data). Which concrete type to use for the store's implementation depends on the operations that you want to perform on the set or map, as a whole and on single elements.
If you want to access data by key, and you have a total order on the keys, use a balanced search tree (Data.Set, Data.Map). You can expect all operations on individual elements to take logarithmic cost in the number of elements stored, and that's much better than the linear cost you will pay for lists.
Internally, these libraries use size-balanced binary trees:
data Map k a = Bin Size k a (Map k a) (Map k a)
             | Tip
Yes, it's as simple as that.
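A minimal sketch of what using such a map looks like in client code (the example data is made up):

import qualified Data.Map.Strict as M

phone :: M.Map String Int
phone = M.fromList [("alice", 1234), ("bob", 5678)]

bobsNumber :: Maybe Int
bobsNumber = M.lookup "bob" phone         -- O(log n)

withCarol :: M.Map String Int
withCarol = M.insert "carol" 9999 phone   -- O(log n), persistent update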
What's more, these libraries provide bulk operations (intersection, union, difference). Use these! They will make your programs more readable and more efficient. Consider this: you could compute the union of (the sets represented by) `A` and `B` by inserting each element of `B` into `A`, one after another. This already works if `B` is a list, and is the idea behind java.util.Collections.addAll. But we know more about `B` - it is represented by a balanced tree. So the library can use the balanced tree structure of both `A` and `B` to speed up the process. A sequence of individual inserts will only use the structure of `A` (and of the trees built from it).
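A sketch of the difference (function names are mine; Data.Set is real):

import qualified Data.Set as S

unionBulk :: Ord a => S.Set a -> S.Set a -> S.Set a
unionBulk a b = S.union a b                        -- exploits both balanced trees

unionOneByOne :: Ord a => S.Set a -> S.Set a -> S.Set a
unionOneByOne a b = foldr S.insert a (S.toList b)  -- only uses the structure of a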
For some applications, access to the collection is restricted. For example, in a stack, you will only operate with its top element. We learned earlier that Haskell lists are stacks. An extension of this is a priority queue, which provides quick access to the entry with highest priority (but only to this one), plus arbitrary inserts.
If your keys are (or can be mapped to) integers, then use a Patricia tree (Data.IntMap) or a hash table (Data.HashMap). Then you can expect effectively constant cost for individual operations. But a Patricia tree is still a tree? Yes, but its depth is at most the bit size of the key, and often smaller. Hashing is even better. The actual cost depends on how well your hash function avoids collisions.
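A small sketch with Data.IntMap (the keys here are made-up user ids):

import qualified Data.IntMap.Strict as IM

users :: IM.IntMap String
users = IM.fromList [(17, "alice"), (42, "bob")]

name42 :: Maybe String
name42 = IM.lookup 42 users   -- effectively constant cost per operation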
If the set of keys forms a contiguous interval of integers, then you want to use them as indices into a contiguous storage area, which is then called an array in some languages, or a vector in a typical Haskell library (Data.Vector).
With that, access operations have no noticeable cost: to get an element by its index, you need one addition (of the index to the base address), and one memory access (which might of course incur a cache miss). There are even some no-cost bulk operations, because of slicing: A slice is a contiguous sub-vector that is denoted simply by using an existing vector with a different base address and length. The internal representation is indeed
data Vector a = Vector Int Int (Array a)
so a vector consists of (a pointer to) an underlying array that holds the actual data, plus an offset (first component) and a length (second). On the other hand, there is no free lunch - updating or concatenating vectors requires fresh allocation (of an area the size of the result) and bulk copies. That looks expensive, but since all areas are contiguous, it might not be too bad in some cases.
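A sketch of constant-time indexing and slicing with Data.Vector:

import qualified Data.Vector as V

v :: V.Vector Int
v = V.fromList [0 .. 9]

atFive :: Int
atFive = v V.! 5          -- O(1): one addition, one memory access

middle :: V.Vector Int
middle = V.slice 3 4 v    -- O(1): same array, different offset and length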
What if keys are contiguous integers, and we need efficient access and efficient concatenation? Use Data.Sequence, which gives us logarithmic cost for access and for concatenation, and, as a bonus, constant time access to first and last element of the sequence. How does this work? It's a balanced tree where the root has extra pointers to leftmost and rightmost leaf. The code uses a nice type-level trick:
newtype Seq a = Seq (FingerTree (Elem a))
data FingerTree a = EmptyT | Single a | Deep Int (Digit a) (FingerTree (Node a)) (Digit a)
data Node a = Node2 Int a a | Node3 Int a a a
`FingerTree` is a non-regular type constructor: notice that the recursion changes the type argument (from `a` to `Node a`). For implementation details, RTFC.
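A sketch of the operations mentioned above, using Data.Sequence:

import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>), (><))

s :: Seq Char
s = Seq.fromList "abc" |> 'd'    -- append at the end: O(1)

t :: Seq Char
t = s >< Seq.fromList "efgh"     -- concatenation: O(log (min n m))

atTwo :: Char
atTwo = Seq.index t 2            -- indexed access: O(log (min i (n - i)))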
What about strings? Conceptually, a string is a sequence of characters. So, you could use a list if you access the characters in order; in all other cases, it's something like a vector. Recommended libraries are Data.Text (if encodings are an issue) and Data.ByteString (for raw binary data). Both come in two variants: a strict one, where each string is actually one vector, as described above, and a lazy one, where each string is a lazy singly linked list of vectors.
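A tiny sketch with the strict Data.Text variant:

import qualified Data.Text as T

greeting :: T.Text
greeting = T.toUpper (T.pack "hello, world")   -- packed text, not a [Char]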
There is much more to be said about the (mis-)use of strings in Haskell. The Haskell standard definition `String = [Char]` uses lists, so all the problems of lists mentioned above apply directly, and then some - coming from the implied assumption that characters are basically numbers, which they are not, in today's world of Unicode. Also, the standard prescribes that `String` be used in parsing (class `Read`) and printing (class `Show`). All this makes it a suitable topic for a separate rant, but we'll keep our focus on lists for now.
As an exercise, browse the function signatures and specifications in Data.List and mark those that are misguided and misguiding by the above reasoning.
Hint: a lot of them are. It starts with the index operator
(!!) :: [a] -> Int -> a
Each time you use the `!!` operator, there's a 99 percent chance that you're using a list in a non-linear, and thus inefficient, way. Check the arguments of `xs !! i` - if there will be another call `xs !! j` with the same left argument, then you will be traversing the list `xs` twice.
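A sketch of the anti-pattern, and a fix that traverses the list only once (function names are mine):

-- wrong: each (!!) walks the list from the front again
adjacentPairsBad :: [a] -> [(a, a)]
adjacentPairsBad xs = [ (xs !! i, xs !! (i + 1)) | i <- [0 .. length xs - 2] ]

-- better: one linear traversal, no indices, no length
adjacentPairsGood :: [a] -> [(a, a)]
adjacentPairsGood xs = zip xs (drop 1 xs)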
As an aside, what about lists in other languages' standard libraries? Well, it depends. For instance, java.util uses the name List not for a concrete datatype, but for an interface. This comes with two main implementations: LinkedList and ArrayList. Now, the linked lists are actually doubly linked, which means that their iterator can go back and forth. Other than that, all the reservations about linked lists apply. If you use the iterator of a linked list, then you're fine; if you access it in any other way, then you're not. In particular, access by index is slow.
An ArrayList is essentially what we called a vector. Exercise: do they provide efficient slicing? Answer: yes, but not exactly at this type. Use the API docs and source to find out. Also, lament the fact that their API docs do not give expected runtime costs (leaving you no choice but to read the source), and praise the authors of the Haskell containers library. They document resource usage so well that it deserves to be linked again.
There are two exemptions from the rules stated above.
If you are positively certain that your collections have a size bounded by, say, three, then you can implement them in any way you want, because all operations will run in constant time anyway.
Then, education. With lists, you can exemplify pattern matching and recursion (with streams, you exemplify co-recursion). Certainly each student should see and do recursion early on in their first term. But the teacher must make very clear that this is in the same ballpark as exercises like computing Fibonacci numbers, or implementing multiplication and exponentiation for Peano numbers: the actual output of the computation is largely irrelevant. What is relevant are the techniques they teach (algebraic data types, pattern matching, recursion, induction), because these have broad applications. So, indeed, when I teach algebraic data types, I do also use singly linked lists, among other examples (Bool, Maybe, Either, binary trees), but I always write (and have students write)
data List a = Nil | Cons a (List a)
and I avoid Haskell's built-in lists altogether. But, back to "real code".
We learned above that the one legitimate use case for lists is iteration. This is an important one. It's so important that the Glasgow Haskell compiler applies clever program transformations in order to generate efficient machine code for programs that combine iterators (that is, lists) in typical ways.
The most efficient code is the one that is not there. Indeed, several program transformations in the compiler remove intermediate iterators (lists). Consider the expression
sum $ map (\x -> x*x) [1 .. 1000]
It contains two iterations: the list `[1 .. 1000]` (the producer), and another list (of squares), built by `map`, which is a transformer. Finally, we have `sum` as a consumer (its output is not a list, but a number).
We note that a transformer is just a consumer (of the input list) coupled to a producer (of the output list). Now the important observation is that we can remove the intermediate list between a producer and the following consumer (where each of them may actually be part of a transformer).
This removal is called fusion. With fusion, lists are used prominently in the source code, but the compiler will transform this into machine code that does not mention lists at all! In particular, the above program will run in constant space. It allocates nothing on the heap, and it needs no garbage collector.
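To get an intuition for the result, here is a hand-written sketch of (roughly) the loop that the fused program boils down to; note that no list appears:

-- roughly the shape of the fused code: a strict counting loop, no lists
sumOfSquares :: Int -> Int
sumOfSquares n = go 1 0
  where
    go i acc
      | i > n     = acc
      | otherwise = go (i + 1) (acc + i * i)

-- sumOfSquares 1000 == sum (map (\x -> x * x) [1 .. 1000])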
For historic reference and detail, see Phil Wadler: Deforestation: transforming programs to eliminate trees (1990), and Duncan Coutts, Roman Leshchinskiy, Don Stewart: Stream Fusion: From Lists to Streams to Nothing at All (2007).
Now I will present a more recent development (in the style of writing libraries for Haskell) that aims to remove lists from the actual source code.
We have, e.g., `and [True, False, True] == False`, and you would think that `and` has type `[Bool] -> Bool` (it maps a list of Booleans to a Boolean). But no, the actual type is `and :: Foldable t => t Bool -> Bool`, and indeed we can apply `and` to any object of a type `t a` where `t` implements `Foldable`.
As a first approximation, `Foldable t` means that `t` is a type constructor (it has kind `* -> *`) and there must be a function `toList :: t a -> [a]`. So, `c` of type `t a` with `Foldable t` can be thought of as some kind of collection, and we can get its elements by calling `toList`. This is similar to Java's `Iterable<A>`: for a collection `c`, `toList c` plays the role of `c.iterator()`.
For example, Data.Set (mentioned above) does this, so we can make a set `s = S.fromList [False, True]` and then write `and s`. It will produce the same result as `and (toList s)`, where this time `and :: [Bool] -> Bool`.
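Spelled out as a compilable snippet (assuming the usual qualified import for the S alias used above):

import qualified Data.Set as S

s :: S.Set Bool
s = S.fromList [False, True]

result :: Bool
result = and s    -- False, same as and (S.toList s)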
Why all this? Actually, this is not about `toList`. Quite the contrary - the `Foldable` class exists to make it easier to avoid lists. Typically, we would not implement or call the `toList` function, but use `foldMap` or `foldr` instead.
We give another example. Ancient standard libraries used to have a function `sum :: Num a => [a] -> a`, but now its actual type is `sum :: (Num a, Foldable t) => t a -> a`, and so we can take the sum of all elements of a set `s :: Set Int` simply via `sum s`, instead of writing `sum (toList s)`. See - the lists are gone from the source code.
Let us look at the type of `foldMap`, and the class `Monoid` that it uses. (And, if you're new to Haskell, remember that `class` here is more like `interface` in Java - it contains method signatures.)
class Foldable t where
  foldMap :: Monoid m => (a -> m) -> t a -> m

class Monoid m where
  mempty  :: m
  mappend :: m -> m -> m
This is fancy syntax for the following (slightly re-arranged)
foldMap :: m -> (a -> m) -> (m -> m -> m) -> t a -> m
So when you call `foldMap` over some collection (of type `t a`), then you have to supply the result value (`mempty :: m`) for the empty collection, the mapping (of type `a -> m`) to be applied to individual elements, and a binary operation (`mappend :: m -> m -> m`) to be used to combine results for sub-collections.
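For instance, here is a sketch of a Foldable instance for a plain binary tree (my own simplified type; the real Data.Set instance follows below):

data Tree a = Leaf | Node (Tree a) a (Tree a)

instance Foldable Tree where
  foldMap _ Leaf         = mempty
  foldMap f (Node l k r) = foldMap f l `mappend` (f k `mappend` foldMap f r)

-- sum (Node (Node Leaf 1 Leaf) 2 (Node Leaf 3 Leaf)) == 6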
Let us see how this is used to implement `sum` over Data.Set, the type mentioned above, repeated here for convenience:
data Set a = Bin Size a (Set a) (Set a)
           | Tip
instance Foldable Set where
  foldMap f t = go t
    where
      go Tip           = mempty
      go (Bin 1 k _ _) = f k
      go (Bin _ k l r) = go l `mappend` (f k `mappend` go r)
This means that `foldMap f s`, for `s :: Set a` of size larger than one, does the following: each branch node `n` is replaced by two calls of `mappend` that combine the results of the recursive calls on the subtrees `l` and `r` of `n`, and `f k` for the key `k` of `n`; and we make sure to traverse the left subtree, then the key at the root, then the right subtree.
The definition of `sum` is
class Foldable t where
  sum :: Num a => t a -> a
  sum = getSum #. foldMap Sum
  -- (#.) is a library-internal, coercion-friendly variant of (.)
with the following auxiliary types
newtype Sum a = Sum { getSum :: a }

instance Num a => Monoid (Sum a) where
  mempty = Sum 0
  Sum x `mappend` Sum y = Sum (x + y)
So, when we take the `sum` of some `Set a`, we do a computation that adds up all elements of the set, but without an explicit iterator - there is no list in sight.
Now, when the types of `sum`, `and`, and many more functions were generalized from lists to `Foldable` a while back, some over-the-top `Foldable` instances were introduced that result in funny-looking semantics like `maximum (2,1) == 1`, which is only "natural" if you think of the curried type constructor application `(,) a b`, remove the last argument, and demand that `(,) a` "obviously" must be `Functor`, `Foldable`, and so on. Well, this might save some people some keystrokes, but it also moves Haskell into the Matlab league of languages where any code is accepted, with unpredictable results. I know Henning said that, but where exactly? Alas, it's a separate topic, and this page here is long enough already.