Skip to content

The Joy Of Benchmarks

October 15, 2011

Introduction

This has happened to all of us. You are sitting at your desk one day, when your boss bursts in:

"I was reading this article on the web  last night and it says that in C++ iterators are far more efficient in terms of speed than using indexes to access arrays! I think we ought to change all our code to take advantage of this, immediately!"

There are two basic responses to this. One is to roll your eyes and go back to whatever you are doing, secure in the knowledge that the guy will have forgotten about the whole thing by tomorrow, and the other, which is to find out if there is objectively something in what he is saying, unlikely as that may be. In the second case, you need to do some performance benchmarking.

This article describes how to do simple performance benchmarking, explains some of the pitfalls you can encounter, and provides the truth about iterators and indexes. To follow it, you will need some C++ experience and a C++ compiler of some sort.

Timing Things

In order to benchmark code, you need to be able to time the execution of the benchmark code. There are several ways of doing this. If you are on a UNIX-like platform, or have the bash shell installed on Windows, you will have access to the time command, which times program execution.  A typical use of time looks like this:

$ time ls > /dev/null

real    0m0.284s
user    0m0.030s
sys     0m0.155s

This times the execution of the ls command (list files) in the current directory. Output is redirected to the throw-away device /dev/null to avoid console scrolling performance interfering with the benchmark. The outputs are the elapsed time taken to execute the command, the time spent in user code, and the time spent in system calls.

The time command is very convenient, but it can only time the execution of an entire program. To compare iterators and indexes you would have to write two different programs, and compile and run them separately. This is a bit of a pain, and is prone to errors – you compile the two programs with different optimisations, for example. It’s normally best to have all the benchmark code in the same program, and to insert timers into the code.

As you may know, C++ provides a time() function, which returns the number of seconds since a certain date (or at least, that’s what the vast majority of C++ implementations do – the Standard is somewhat ambiguous about what is actually required) at the moment the function is called. Unfortunately, for timing code execution you often want granularity of less than a second, and the time() function does not provide this.

C++ also provides a slightly less well-known function, clock(), which is closer to what you need. This function returns the number of clock "ticks" that a program has consumed since it started. The Standard makes no guarantees about the granularity of this timer, but for most platforms it will be millisecond or better. Here’s an invocation of clock() for Windows:

#include <windows.h>
#include <iostream>
#include <ctime>
using namespace std;

int main() {
  clock_t start = clock();
  Sleep( 500 );
  clock_t end = clock();
  cout << "Sleep(500) took " << end - start << " ticks\n";
}

The clock() function and the clock_t data type (which is just a typedef for a suitable integer type) are declared in the ctime header. The Sleep() function is Windows-specific, and suspends the current thread for the specified number of milliseconds.

Compiled with the GCC compiler, the above program produced this output:

Sleep(500) took 500 ticks

which means that the granularity of this clock() function is one millisecond. Rather than determining it yourself, you can (and should) get this information from your C++ implementation – it is defined by the CLOCKS_PER_SEC macro. So if you rewrote the above code as:

cout << "Sleep(500) took " 
     << (end - start)/(double)CLOCKS_PER_SEC 
     << " secs\n";

you would get the output:

Sleep(500) took 0.5 secs

It’s important to realise that the granularity of the clock() function is compiler implementation specific – if you replaced GCC with another compiler, you might well get a different value for CLOCKS_PER_SEC. Note also that CLOCKS_PER_SEC is an integer value (as are the clock_t values) so a cast to a double was needed in order to get floating point arithmetic performed, otherwise the number of seconds would have been zero.

This subtracting and casting is a bit of a pain, and most experienced C++ programmers have a small class they use to do the timing. The one I use (for which I make no great claims) looks something like this:

class Stopwatch {
  public:
    Stopwatch() : mClock( std::clock() ) {}

    std::clock_t Ticks() const {
      return std::clock() - mClock;
    }

    double Secs() const {
      double t = std::clock() - mClock;
      return t / CLOCKS_PER_SEC;
    }

    void Restart() {
      mClock = std::clock();
    }

  private:
    std::clock_t mClock;
};

inline std::ostream & operator<<( std::ostream & os, 
                                  const Stopwatch & c ) {
  return os << c.Secs();
}

With a class like this in place, you are ready to start writing your benchmark, which is where the fun really begins!

A First Attempt

In order to benchmark very fast operations like indexing of vectors, you need to execute a lot of them if your timer only has millisecond granularity. So a first attempt at a benchmark might look like this:

#include <iostream> #include <vector> #include "stopwatch.h"

using namespace std; const unsigned int BIG = 10000000; int main() { Stopwatch sw; vector <int> v( BIG ); int n = 0; sw.Restart(); for (unsigned int i = 0; i < v.size(); i++ ) { n += v[i]; } cout << "index " << sw << "\n"; n = 0; sw.Restart(); vector <int>::iterator it = v.begin(); while( it != v.end() ) { n += *it++; } cout << "iterator " << sw << "\n"; }

The code creates a suitably large vector and then iterates over it once using an integer index, and once using an iterator, timing both. What could possibly go wrong? Well, lots – here are the timings on my laptop:

index 0.105
iterator 0.349

The indexed access appears to be over 3x faster than using the iterator!

Given this result, your first impulse might be to run off to your boss and denounce his assertion that iterators were the performance kings as sadly deluded. This would be a mistake though, and not just because you presumably want to keep your job.

The first thing you should always do when benchmarking is to run the code multiple times. This makes sure that all the pages for the program are in the disk cache (along with any data from files) and that everything will be running as fast as the OS allows. When I did that with the above code, there was no noticeable difference in the average results, but that’s because this is a small program which does not itself read from disk.

The second thing you should do is do an optimised build.

An Optimised Build

If you care about performance (and presumably you do, else why are you bothering with this benchmark), you should let the compiler optimise the build as much as it can for release code. I’ve emphasised the "for release code" because debugging heavily optimised code in a debugger can be very difficult, so you don’t want to use optimisations for all your builds.

To compile the benchmark program with GCC from the command line with common optimisations turned on is easy (the program is called bm.cpp):

$ g++ -O2 bm.cpp

The -O2 option tells the compiler to optimise the code. If you run the program now, the results are somewhat different:

index 0
iterator 0

This is impressive! Both operations were equally fast, and neither took any time at all!  So what is going on? Well, if you look at the assembler output of the compiler, you will see that whole chunks of code have simply been removed. For example, there will be no representation in the assembler output for:

for (unsigned int i = 0; i < v.size(); i++ ) {
  n += v[i];
}

or for its iterator equivalent.

Why is this? It’s because the code, from the compilers point of view, does nothing. Yes, you accumulate all the values in the vector into n, but then nothing is done with n. So the compiler feels justified in removing n, and the code that accumulates it, completely. Thus, both the index and the iterator versions of the code generate the same assembler output – none. and executing that non-existent code takes no time!

Forcing Side-Effects

Unfortunately, for real code you presumably are are accumulating values into n to some purpose, and having the compiler gaily junk them is not what you want. To force the compiler not to junk the code, you need a side-effect. A side-effect is  something that the compiler cannot get rid of during optimisation and have the program run correctly. A popular side-effect is to print out something – the compiler is mostly (in C++ copy-construction elision is a counter example) not free to optimise away side-effects. So you should modify your loop:

for (unsigned int i = 0; i < v.size(); i++ ) {
  n += v[i];
}
cout << sw << "\n";
cout << n << "\n";

Do the same for the iterator loop. Now the output is:

index 0.038
0
iterator 0.047
0

Subsequent runs produce similar result (I should point out I did some extra testing, not described here, to ensure that use of the iostream outputs were not distorting the results). It turns out that index access is typically around 10% faster than iterator access, for the case benchmarked here.

A Really Clever Compiler

The above code forces the side-effect  of printing out the accumulated value of the contents of the vector, which is zero, because the vector is zero-initialised by its constructor (look it up, if you don’t believe me). A really clever compiler could see this zero-initialisation and realise that the accumulated value will always be zero. This is possible because std::vector is a header-only template, so the compiler has complete information about its implementation.

Having said that, it seems the GCC compiler at least is not that clever (though it is pretty smart about arrays).  But it’s only a question of time before such smart compilers arrive, so how would you force the compiler to read the vector?

Well, one way of doing it would be to initially populate the vector with random numbers, so that the accumulated value cannot be known at compile time. In fact, populating objects with random values is a common benchmarking ploy to prevent the compiler from completely removing code when optimising.

So What Can Be Learned?

Firstly, these benchmarks only cover the very simple case laid out here, for code compiled using GCC 4.6, on the Windows platform, on my specific Dell laptop. Different combinations of these factors can (and probably will) produce different results.

It’s important to realise that I haven’t tested (for example) read access to the vector, random access using iterators, or the use of indexes and iterators with other types (like std:;deque, for example) that support random access. Also, not much real code involves summing large vectors – if you want to do that, maybe one of the Standard Library algorithms would be a better bet.

Another thing to learn is that benchmarking can be complicated, and there are lots of nooks and crannies, and things you might get wrong – I’m far from sure that everything in this article is correct. So that people can spot your mistakes easily, and help you to correct them:

  • always post/publish/make available the full code for your benchmark
  • always publish all your results, specifying how you obtained them, on which hardware, OS etc.
  • publish the compiler version, and the command line options you used to compile your code

One last big thing to learn here is that intuition is a bad guide to performance. Vector iterators in C++ are typically implemented using pointers, and many programmers seem to equate pointers with performance. A programmer’s (or at least your boss’s) intuition has turned out to be faulty – this is very often the case.

About these ads

From → c++

4 Comments
  1. Adding a volatile to the array over which your are iterating will likely destroy the timings completely. That is, the benchmark will no longer be indicative of non-volatile optimized code. The approach to using random data is far better. Alternately you could modify the volatile approach. Store the vector pointer to a volatile pointer, then assign back to a non-volatile pointer. The compiler will be forced to load each memory location, but no longer needs to assume it is volatile.

    If you don’t want to output the result, or some temporary in the middle, again you can simply assign the value to a volatile. Thus you don’t need to output it. If you do enough internal operations the initial and final assign to/from volatile won’t have a significant impact on the loop speed.

    • Adding a volatile to the array over which your are iterating will likely destroy the timings completely. That is, the benchmark will no longer be indicative of non-volatile optimized code.

      That’s true, but the comparative timings for the index and iterator access should stay the same. But I must admit, I normally go the random data way myself.

      • I’d suspect access through a volatile would actually totally mess with the comparative timings as well. Consider at a minimum the vectorization optimization (to use those vector calculation functions on the chip). With a normal index it would be able to do that, with a volatile it would not. With an iterator I’m not sure it could ever do it. Thus by making it volatile you might make the index comparatively worse.

        In the same vein the volatile forces ordering constraints on the indexed version that might not otherwise be there, but might be there in the iterator version. I can’t say for sure, but logically there is good reason to believe access through a volatile will screw with the comparative timings — that is, it won’t slow them down equally.

      • Actually, the whole thing is a bit academic! I just tried making the vector volatile, and then the type it contains volatile, and GCC is having none of it! I’ll edit the article to reflect this.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: