What is datatype-generic programming?

Familiarity with a statically-typed functional language makes this post easier to follow. The examples are written in PureScript.

Some introductions to datatype-generic programming begin with technical details like the Generic typeclass. That can be confusing if the premise isn’t clear. This is an attempt to explain datatype-generic programming or “generics” in terms of the big picture, addressing my own points of confusion.

What does “generic” mean?

The term “generic” might evoke associations with parametric polymorphism. Types like forall a. a -> a are sometimes said to be generic. But datatype-generic programming is not a reference to parametric polymorphism. They are only related as two varieties of polymorphism, two ways to generalize about types.

Parametric and ad hoc are perhaps the most common varieties of polymorphism. Parametric polymorphism expresses a universal generalization. In the type forall a. a -> a, the variable a stands for any type. Ad hoc polymorphism is a more limited generalization, where a is any type that satisfies a constraint.

Differences of generalization reflect differences of abstraction. A function of any type a abstracts over logic that does not need to know (and cannot know) the type of a. The function is implemented once for all types. Type constraints abstract over interfaces but not implementations. The implementations are type-specific.

Ad hoc and parametric are not the only varieties of polymorphism. Some logic is concerned exclusively with the structure of a datatype. That logic is not universally polymorphic; it depends on some information about the type. However, the implementation need not be type-specific, since datatypes all share the same structural components. Neither ad hoc nor parametric, this is a third variety called structural polymorphism.¹

What is structural polymorphism?

The beauty of all of these patterns of computation is the direct relationship between their shape and that of the data they manipulate…
—Jeremy Gibbons²

The show function is often an example of latent structural polymorphism. show returns a string representation of a datatype. Those strings often follow the same pattern: the name of the data constructor followed by its arguments.

> show (Just 1)
"(Just 1)"

> show (Just (Just 1))
"(Just (Just 1))"

This logic, while not universally general, is not type-specific. It depends only on knowing the structure of the type; not the type itself.

Datatype-generic programming is an approach to writing functions, like show, that abstract over the structure of a datatype. To do this, we need a “generic” representation of that structure. A generic representation decouples information about the structure from the type itself. All types given a generic representation can share abstractions over structure.

What is a generic structural representation?

This is more approachable than it might sound. We already use generic language to describe datatypes. We say that:

Datatypes have constructors
Constructors have names
Some constructors have one argument
Some constructors have more than one argument
Some constructors have no arguments
Some datatypes have more than one constructor

With this limited and generic vocabulary, we can describe the structure of any datatype.

A toy example

We want to implement show once for values of type Maybe String and values of type Tuple Number String. We begin with a generic representation for values of those types, a generalization of their structure.

data Rep
  = Num Number
  | Str String
  | NoArgs String
  | OneArg String Rep
  | ManyArgs String (Array Rep)

The structure of any value inhabiting either type can be expressed in terms of this Rep type. Here are some examples:

foo :: Maybe String
foo = Nothing

fooRep :: Rep
fooRep = NoArgs "Nothing"

bar :: Maybe String
bar = Just "foo"

barRep :: Rep
barRep = OneArg "Just" (Str "bar")

baz :: Tuple Number String
baz = Tuple 1.0 "baz"

bazRep :: Rep
bazRep = ManyArgs "Tuple" [Num 1.0, Str "baz"]

Rep enables us to write functions that are structurally polymorphic for types Maybe String and Tuple Number String, and any other type described in terms of the Rep constructors. Our purpose is to write a generic show function, so next we write a Show instance for Rep.

instance showRep :: Show Rep where
  show = case _ of
    Num n ->
      show n
    Str s ->
      "\"" <> s <> "\""
    NoArgs name ->
      name
    OneArg name arg ->
      format name (show arg)
    ManyArgs name args ->
      format name (joinWith " " $ map show args)
    where format name args =
            "(" <> name <> " " <> args <> ")"

Now we can stringify any value given this generic representation. But the question remains: how to write a show function that is polymorphic for values of types Maybe String and Tuple Number String? We’ve only implemented a monomorphic show function for Rep.

The trick is to convert from the original types to Rep, and then leverage the Show instance. The Generic class specifies a conversion interface.

class Generic a where
  from :: a -> Rep

instance foo :: Generic (Maybe String) where 
  from Nothing = NoArgs "Nothing"
  from (Just str) = OneArg "Just" (Str str)

instance bar :: Generic (Tuple Number String) where
  from (Tuple num str) = 
    ManyArgs "Tuple" [Num num, Str str]

This interface enables us to convert at the boundary of our generic functions without knowing what type we’re converting from.

genericShow :: forall a. Generic a => a -> String
genericShow = show <<< from

And thus we have a structurally polymorphic implementation of show.

> genericShow $ (Nothing :: Maybe String)
"Nothing"

> genericShow $ Just "foo"
"(Just \"foo\")"

> genericShow $ Tuple 1.0 "foo"
"(Tuple 1.0 \"foo\")"

This example illustrates the principle of datatype-generic programming. The approach taken by libraries is often a little different.

Datatype-generic programming libraries

A generics library has two essential components: a generic representation of a datatype, and conversions to and from the data and the representation. In the last example, we used Rep for the representation. But this is not the only, or even the most popular, approach.

A more common approach is to encode structure on the type level. The type Rep contains no structural information. Structural information is only found on the value level, encoded by the Rep constructors. When structure is encoded on the type level, the type of a generic representation describes the structure of the datatype it represents.

Here’s a bit of the last example to demonstrate this approach using PureScript’s generics-rep library.

import Data.Generic.Rep

foo :: Maybe String
foo = Nothing

fooRep :: Sum (Constructor "Nothing" NoArguments)
              (Constructor "Just" (Argument String))
fooRep = Inl (Constructor NoArguments)

bar :: Maybe String
bar = Just "foo"

barRep :: Sum (Constructor "Nothing" NoArguments)
              (Constructor "Just" (Argument String))
barRep = Inr (Constructor (Argument "foo"))

baz :: Tuple Number String
baz = Tuple 1.0 "baz"

bazRep :: Constructor "Tuple"
                      (Product (Argument Number)
                               (Argument String))
bazRep = Constructor (Product (Argument 1.0)
                              (Argument "baz"))

Type-level encoding is achieved by giving each structural component its own type. Type constructors, like data constructors, are named for the corresponding structural components. Those type constructors are combined, just as the data constructors are combined, to describe the structure of a datatype.

Notice that the generic representations of Maybe String and Tuple Number String now have different types. Maybe is represented as type Sum and Tuple as type Constructor. This changes two things: the Generic class and the implementation of our genericShow function.

Our Generic class assumed one representation type, Rep. Now there are several representation types. Depending on the structure of the datatype, the type of from could be

a -> Constructor ...

or it could be

a -> Product ...

from must be a function from a to any generic type. So we must parameterize Generic not only by the datatype but also by the representation.

class Generic a rep | a -> rep where
  from :: a -> rep

The notation a -> rep states that there is only one type rep for each type a.

Our implementation of genericShow also expected one type, Rep. With multiple types, we can no longer use one implementation for all representations. We must implement genericShow for each structural type we support. Then a typeclass is used to wrap all implementations in one generic interface.

In the last example, the type of genericShow was:

genericShow :: forall a. Generic a => a -> String

In this approach, that type is modified to include a GenericShow constraint:

genericShow 
  :: forall a rep
   . Generic a rep 
  => GenericShow rep 
  => a 
  -> String

An instance of GenericShow defines how to stringify a structural type. NoConstructors, Constructor, and Product all have instances of GenericShow. While more than one, that is still far fewer than one implementation for each datatype.

These are implementation details. The strategy remains the same—convert from a datatype to its generic representation and then pass the generic representation to the structurally polymorphic genericShow.

Generic programming is made convenient by the compiler

Datatype-generic programming, as a way of reducing boilerplate, would not succeed without help from the compiler. Generic representations are verbose and tedious to write by hand. Doing so would mean exchanging one kind of boilerplate for another. Fortunately, this is not required.

The Generic class specifies a functional dependency between datatype and generic representation. This enables the compiler to infer the generic representation from the datatype. Thus we can derive the Generic instance without specifying the generic type explicitly. The boilerplate associated with generics is work we can delegate to the compiler.

Generic programming is structural programming

All datatypes are formed of the same structural elements. Logic concerned with structure should not be specific to one datatype. The same logic applies to all datatypes because they all share the same structural components. Such logic is structurally polymorphic.

Structural polymorphism is achieved by giving a generic representation to the structure of a datatype. A generic representation describes structure in general terms like “constructor” and “argument”. Generic functions are implemented for any datatype given this generic representation. The result is a new level of expressivity.

Thanks to Vaibhav for reading a draft of this post.

The approach extends the familiar notion of parametric polymorphism by allowing the definition of functions which are generic with respect to data structures as well as to individual types. For example, structural polymorphism accommodates generalizations of the usual length and map functions which may be applied not only to lists, but also to trees, binary trees or similar algebraic structures. Under traditional polymorphic type systems, these functions may be defined for arbitrary component types, but must be (laboriously) re-defined for every distinct data structure.

—Fritz Ruehr, Structural Polymorphism , page 1

↵
Datatype-Generic Programming by Jeremy Gibbons, page 18. ↵