M-x compile - Faster Recursive Template Compilation

Faster Recursive Template Compilation

30 March 2013 -

bstamour brought an interesting point to my attention. The take away from it was a faster implementation of applyTuple from here that compiled quicker than my own implementation while still maintaining the same run-time equivalence.

This brings up an interesting angle to meta-programming in C++. While run-time performance is usually the highest priority, compile-time performance is an important metric to use too when designing a meta-programming algorithm.

What made Paul’s implementation so much faster at compile time than mine was that while both used recursion, his recursive implementation was dependent only on the indices to be passed to the tuple function, which meant the compiler was able to preform the recursive operation once, and reuse it many times. In contrast, the method I used was dependent on the functor type, as well as the arguments in the tuple. Essentially, mine did the recursive unroll each time I did an applyTuple, while his did it once, for all tuples of that same size.

There is still a constant cost to each invocation of applyTuple, but it certainly beats the linear cost in terms of tuple size of the inefficient method. For instance, with this code:

template <typename... T> static auto make_2(T&... t) RETURN( std::make_tuple(t..., t...))
template <typename... T> static auto make_4(T&... t) RETURN( make_2(t..., t...))
template <typename... T> static auto make_8(T&... t) RETURN( make_4(t..., t...))
template <typename... T> static auto make_16(T&... t) RETURN( make_8(t..., t...))
template <typename... T> static auto make_32(T&... t) RETURN( make_16(t..., t...))
template <typename... T> static auto make_64(T&... t) RETURN( make_32(t..., t...))

struct F {
  template <typename... Args>
  void operator()(Args... args) const {
    std::cout << sizeof...(args) << std::endl;
  }
};

int main() {
  auto tuple = make_64("A");
  
  printTuple(tuple); 
  
  printTuple(tail(tuple)); 
  printTuple(append(tuple, std::string("String"))); 
  
  F f;
  applyTuple(f, tuple);
  return 0;
}

Compile time with applyTuple defined in the inefficient way took roughly 28 seconds on my machine, compared to ~6.5 seconds with Paul’s implementation. Both compile times are roughly identical if only the first printTuple is done at ~6.5 seconds.

Generalizing this Approach

Now that we know there is a much more efficient method available, it’d be nice to generalize this to our other recursive algorithms.

Lets begin with the principle type used. If this type is the only recursively template instantiation in a variadic algorithm, we will be in good shape:

  template <size_t... N>
  struct indices {}; 

  template < size_t begin, size_t end, typename I = void> 
    struct range {
      using type = typename range<begin, end, indices<> >::type;
    };

  template < size_t begin, size_t end, size_t... N > 
    struct range<begin, end, indices<N...> > {
    using type = typename range<begin+1, end, indices<N..., begin> >::type;
  };

  template < size_t end, size_t... N > 
    struct range<end, end, indices<N...> > {
    using type = indices<N...>;
  };

Again, attribution for this goes to Paul Preney. I refactored his implementation slightly to be slightly more readable/concise.

For reference, here is our new applyTuple implementation:

  template < typename F, typename Tuple,  size_t... N>
    static inline auto applyTuple( F&& f, Tuple& t, const indices<N...>& ) 
    RETURN( f( std::get<N>(t)... ));
  
  template < typename F, typename Tuple>
    static inline auto applyTuple( F&& f, Tuple& t ) 
    RETURN( applyTuple(f, t, typename range<0, std::tuple_size<Tuple>::value>::type() ));

Note the implementation is more concise owing to us not having to define any wrapping structs to specialize.

Since variadic packs aren’t first class members in C++ (ie, we can’t pull the N parameters from the indices type directly), we ‘bounce’ the variadic pack into the first function by forcing the compiler to infer them from the passed in indices type. From there, we can use the expansion operator ‘…’ to expand the entire ‘std::get(t)’ once for each indices, which expands precisely into the form we want. To be honest, I didn’t know until I saw it done that the expansion operator was this powerful.

We can apply this method to our other algorithms with a similar approach. However, there are some limitations we need to develop work arounds for with the expansion operator. For instance, lets look at the ‘iterate’ algorithm. We may (and I did) expect the following to work:

    template <typename F, typename Tuple, size_t... N>
      inline auto iterate(F&& f, const Tuple& t, const indices<N...>&) -> void {
      f(std::get<N>(t))...; // Does NOT work. 
    }

    template <typename F, typename Tuple>
      inline auto iterate(F&& f, const Tuple& t) -> void {
      iterate(std::forward<F>(f), t, typename range<0, std::tuple_size<Tuple>::value>::type());
    }

But it doesn’t — the expansion operator only works in certain contexts, and while we could reasonably expect this form to expand to

f(std::get<0>(t)), f(std::get<1>(t)), ... f(std::get<N>(t))

the compiler complains at once that it expected a semi-colon before …, and that no expansion operator was used on the packed type of N. What is going on here is that, broadly speaking, if you use a variadicly defined type/argument, it expects it to produce a single statement. So why not pass our invocations as arguments into a function:

    template <typename... T> static inline void noop(T&&...){}
    
    template <typename F, typename Tuple, size_t... N>
      inline auto iterate(F&& f, const Tuple& t, const indices<N...>&) -> void {
      noop( f(std::get<N>(t))...); // Won't work... mostly. 
    }

So we simply define a noop function that takes in whatever and does nothing with it(which will be optimized out), and expand into that just like we did with applyTuple.

One problem: For the vast majority of iterates callers, the passed in functor will return void, and the compiler flags this as an error — passing a void argument into a function is clearly non-sense. Luckily, the comma operator can be used to work around this:

    template <typename F, typename Tuple, size_t... N>
      inline auto iterate(F&& f, const Tuple& t, const indices<N...>&) -> void {
      noop( (f(std::get<N>(t)), 0)...); // Works... but out of order possibly.
    }

For those not familiar with the comma operator, it essentially evaluates the expression to the left, and then uses the one on the right. So this will effectively call ‘noop(0, 0, 0 … 0)’, but as a side effect calls all our functions.

One final problem: Function argument evaluation is unspecified, so we have no guarantees about the ordering of our function calls. It’d be nice to have that — especially since they are normally called right-to-left, which is counter-intuitive for an iterate function. Luckily, the new initializer list syntax is defined left-to-right, so this works:

    static inline void noop(const std::initializer_list<int>&){}

    template <typename F, typename Tuple, size_t... N>
      inline auto iterate(F&& f, const Tuple& t, const indices<N...>&) -> void {
      noop({ (f(std::get<N>(t)),0)... });
    }

You can find the code for all of this in the mxcomp repo, namely under tuple_ext.h, apply.h, and range.h. This contains most of the variants of iterate and apply with the faster implementation in place.

Note that I’m not sure it’s possible to implement fold in this new way. I haven’t spent much time puzzling over it since I don’t use it in any of my code, but if anyone has ideas on how to implement it, please let me know.

Prev: Working with tuples part II