Things About Strings – Part 3

Introduction

In the previous article in this series, I looked at how you can iterate over the characters in a string. To do that, you often need some idea of how big the string is. In this article I’ll look at the various ways you can find out how big a string is, and how you can manipulate its capacity.

Not Like This

First, I’d like to look at one way of finding the length of a string that is guaranteed not to work. This code:

string s( "foobar" );
cout << sizeof( s ) << endl;

will not tell you how many characters there are in the string – it tells you the size of the string object. For example, in Visual C++ 2012, this prints out 28.

Now, you probably knew that wouldn’t work, but many beginner programmers forget that when it comes to serialising class objects. For example, suppose you have this simple struct:

struct Person {
    string forename, surname;
};

You then try to write an instance of Person out to disk like this:

Person p = { "John", "Doe" };
write( f, &p, sizeof( p ) );

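But sizeof(p) is simply the sum of the sizeofs of its member variables (plus any padding the compiler adds), so once again you get the wrong answer. What is written is the internal representation of the two string objects (typically pointers and sizes), not a usable copy of the forename and surname values.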
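One common way round this is to write each string as its length followed by its characters. The following is only a minimal sketch under stated assumptions: it uses iostreams rather than the write() call shown above, the helper names (write_string, read_string, write_person, read_person, roundtrip) are hypothetical, and the 32-bit length prefix is written in the machine's native byte order, so the output is not portable between different machines.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

struct Person {
    std::string forename, surname;
};

// Hypothetical helper: writes the length first, then the characters.
void write_string( std::ostream & out, const std::string & s ) {
    assert( s.size() <= 0xFFFFFFFFu );   // must fit in the 32-bit prefix
    const std::uint32_t n = static_cast<std::uint32_t>( s.size() );
    out.write( reinterpret_cast<const char *>( &n ), sizeof n );
    out.write( s.data(), n );
}

// Hypothetical helper: reads back a string written by write_string.
std::string read_string( std::istream & in ) {
    std::uint32_t n = 0;
    in.read( reinterpret_cast<char *>( &n ), sizeof n );
    std::string s( n, '\0' );
    if ( n > 0 ) {
        in.read( &s[0], n );
    }
    return s;
}

void write_person( std::ostream & out, const Person & p ) {
    write_string( out, p.forename );
    write_string( out, p.surname );
}

Person read_person( std::istream & in ) {
    Person p;
    p.forename = read_string( in );
    p.surname = read_string( in );
    return p;
}

// Writes p to an in-memory stream and reads it straight back.
Person roundtrip( const Person & p ) {
    std::stringstream buf;
    write_person( buf, p );
    return read_person( buf );
}
```

Note that nothing here uses sizeof on the Person or on the strings themselves; the only sizeof is on the fixed-size length prefix.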

The size() and length() functions

To find the length of the sequence of characters held in a string, use either the size() or the length() function. These have identical functionality, and are implemented in exactly the same way. They are declared as:

size_t size() const;
size_t length() const;

The previous code, re-written:

string s( "foobar" );
cout << s.size() << endl;

will print 6. In other words, these functions return what the C-style function strlen would have – they do not count the null-terminating character that, as the previous article in this series explained, is at the end of a C++11 std::string.
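The relationship with strlen can be checked directly. This is a small sketch; lengths_agree and demo_lengths are hypothetical names used only for illustration. It also shows the one case where the two differ: a string with an embedded null character.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Returns true if size(), length() and strlen() all agree for s.
// This holds only when s contains no embedded null characters.
bool lengths_agree( const std::string & s ) {
    return s.size() == s.length()
        && s.size() == std::strlen( s.c_str() );
}

void demo_lengths() {
    std::string s( "foobar" );
    assert( s.size() == 6 );
    assert( lengths_agree( s ) );

    // An embedded null makes strlen stop early, but size() still counts it.
    std::string t( "foo\0bar", 7 );
    assert( t.size() == 7 );
    assert( !lengths_agree( t ) );
}
```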

Why two functions? Well, historical reasons, probably. Which one you use is a matter of taste, but you should aim to be consistent. I personally always use size(), largely because it is fewer characters to type!

String capacity

As well as supporting the concept of length/size, std::strings also have the concept of capacity. The capacity of a string is the number of characters it could hold without having to re-allocate its internal storage. To see why such a concept is needed, suppose we have this code:

string s( "foobar" );
s += 'x';

For a naive string implementation, the constructor could allocate 7 characters to hold “foobar” (plus the null-terminator). Then when we append the character ‘x’ to the string, it could allocate 8 characters, copy the 7-character string to this 8-character allocation, tack on the ‘x’, and then free the 7-character allocation. I think it’s obvious that doing this would be horrendously inefficient, and would pretty much preclude building up strings by appending single characters, which is a popular way of using strings.

Instead of allocating space for contents a character at a time, std::string implementations allocate the storage in larger chunks. This results in some wasted storage if we never increase the size of a string, but makes increasing the size much faster when we do.  The C++ Library provides a capacity() function, which tells you how big the current chunk is. For example, this code:

string s( "foobar" );
cout << s.capacity() << endl;

will print 15 for Visual C++ 2012. The capacity() function does not take into account the null-terminator, so VC++ has actually allocated 16 characters for a 6 character string. The capacity won’t change until we add the 16th character, when we see this:

s.append( 10, 'x' );    // size goes from 6 to 16 characters
cout << s.capacity() << endl;

printing out 31. When the capacity was exceeded, the string allocated a new chunk of twice the size (32 characters, including the null-terminator) and copied the contents across, as described above. This doubling of capacity is a popular strategy, but different compilers can use other strategies – you might like to investigate what your specific compiler does.

Note that reducing the size of a string does not change the capacity – we’ll look at how you do that next.
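If you want to investigate what your own compiler does, something like the following sketch will record each distinct capacity as characters are appended one at a time. The function name capacity_steps is hypothetical, and the actual numbers it returns depend entirely on the library implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Appends n characters one at a time and records each distinct capacity seen.
std::vector<std::size_t> capacity_steps( std::size_t n ) {
    std::string s;
    std::vector<std::size_t> steps;
    steps.push_back( s.capacity() );
    for ( std::size_t i = 0; i < n; ++i ) {
        s += 'x';
        assert( s.capacity() >= s.size() );
        if ( s.capacity() != steps.back() ) {
            steps.push_back( s.capacity() );
        }
    }
    return steps;
}
```

Printing the returned values for, say, 1000 characters should show only a handful of steps, which is the whole point of allocating in chunks.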

Changing the capacity

If you know that you are going to be increasing the length of a string many times, you can somewhat optimise things by setting the string’s capacity before you start. For example, suppose you know that you are going to end up with strings of around 200 characters for the processing you are doing. You can write code like this:

string s;
s.reserve( 200 );

This causes a single allocation of a suitable chunk of characters of at least size 200 so that now when you append characters it is unlikely that any further allocations or copying will be needed.
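The effect of reserve() can be checked with a sketch like this one, which reserves space and then reports whether the capacity changed while the string was being filled. The name grew_after_reserve is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds a string of n characters after reserving space up front, and
// reports whether the capacity changed while the characters were appended.
bool grew_after_reserve( std::size_t n ) {
    std::string s;
    s.reserve( n );
    assert( s.capacity() >= n );   // reserve() guarantees at least n
    const std::size_t before = s.capacity();
    for ( std::size_t i = 0; i < n; ++i ) {
        s += 'x';
    }
    return s.capacity() != before;
}
```

Calling grew_after_reserve( 200 ) should return false, because all 200 characters fit in the chunk that reserve() allocated.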

You can also reduce the capacity. It’s not so common to want to do this, but it might be necessary if memory is tight and you want to re-use the string. Prior to C++11, you had to do this with the “swap trick”. For example, this reduces the capacity of the string s to a minimum, discarding its contents in the process:

string().swap( s );

It works by creating a temporary empty string (which will have minimum capacity) and then swapping the “insides” of this string with those of s, so that s ends up empty with the minimum capacity and the temporary ends up with what was in s, which is then disposed of by the temporary’s destructor. If you want to keep the contents, swap with a temporary copy instead, using string( s ).swap( s ), since the copy will normally be given only as much capacity as it needs.

The swap trick is clever, but far from obvious. From C++11, a shrink_to_fit() function is provided:

s.shrink_to_fit();

which notionally does what the copying form of the swap trick does: it keeps the contents and releases the unused capacity. I say notionally, because shrinking is not a binding request on the library implementation, and might be ignored.
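Here is a small sketch comparing a contents-preserving version of the swap trick with shrink_to_fit(). The function names are hypothetical. Because the shrink request is non-binding, the only things the checks rely on are that the contents survive and that the capacity is never smaller than the size.

```cpp
#include <cassert>
#include <string>

// Pre-C++11: copy s into a temporary and swap, keeping the contents.
void shrink_with_swap( std::string & s ) {
    std::string( s ).swap( s );
}

// C++11: ask the library to release unused capacity (non-binding).
void shrink_with_request( std::string & s ) {
    s.shrink_to_fit();
}

void demo_shrink() {
    std::string a( "foobar" ), b( "foobar" );
    a.reserve( 1000 );
    b.reserve( 1000 );
    shrink_with_swap( a );
    shrink_with_request( b );
    // Contents always survive; how much capacity is released is up to
    // the library implementation.
    assert( a == "foobar" && b == "foobar" );
    assert( a.capacity() >= a.size() && b.capacity() >= b.size() );
}
```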

Conclusion

We’ve looked at the ways you can find out the size of a string and at how the capacity of the string is managed by the library implementation for efficiency purposes.

Things About Strings – Part 2

Introduction

In the previous article in this series, I looked at how std::strings can be created. In this article, I’m going to look at how we go about iterating over the individual characters inside std::strings, a topic which has more to it than you might imagine. To illustrate the issues, I’ve written a small C++ program:

#include <string>
#include <iostream>
#include <cassert>
using namespace std;

int countc( const string & s, char c ) {
    int n = 0;
    for( size_t i = 0; i < s.size(); ++i ) {
        if ( s[i] == c ) {
            n++;
        }
    }
    return n;
}

int main() {
    string pc( "the panama canal" );
    int count = countc( pc, 'a');
    assert( count == 5 );
}

This implements the function countc, which returns a count of the number of times a specified character occurs in a string, without changing the string. I’ve used the assert macro to check that the function actually works. Before going any further, I’d like to say a couple of things about writing functions like this in general.

Pass std::strings by reference

The above code passes the string parameter to the countc function via a reference. Why do that? Well, suppose we had written the function like this:

int countc( string s, char c ) {

This is passing the string by value, and if you do this the copy constructor will be used to create a new copy of the string you pass, and when the function exits, the string’s destructor will be called to free the memory used by the copy. Creating new strings and disposing of them is a relatively expensive process, so we would like to avoid it wherever possible. To do that we pass a reference to the string, so no copy is created. You almost always want to pass strings to functions by reference.

Pass std::strings by const reference

The countc function could also have been written like this:

int countc( string & s, char c ) {

Why bother with that const keyword? Well, the spec for the function says “returns a count of the number of times a specified character occurs in a string, without changing the string.” Using a const reference guarantees that if we do try to change the string, perhaps by writing some (incorrect) code like this:

if ( s[i] = c ) {

then the compiler will tell us about the error.

Operator []

The std::string class’s operator[] is used to access individual characters in the std::string via an integer index. There are actually two versions of this operator (here, as in other places, I’ve changed the names of some types to hopefully make the declarations clearer, without being misleading):

char & operator[]( size_t pos );
const char & operator[]( size_t pos ) const;

The first version returns a reference to the character at position pos in the string, where pos is a zero-based index. This is the version that would have been called if we had passed a non-const reference to the function.

The second version is the one that our code actually uses. It returns a const reference which means we can’t modify the character it refers to. The second const, at the end of the declaration, tells the compiler that it is safe to apply this version to const objects.
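To see the non-const version at work, here is a sketch of a function that does change the string, so it has to take a non-const reference. The names replacec and demo_replacec are hypothetical, chosen to match the style of countc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replaces every occurrence of 'from' with 'to'. This needs a non-const
// reference because it writes through the non-const operator[].
void replacec( std::string & s, char from, char to ) {
    for ( std::size_t i = 0; i < s.size(); ++i ) {
        if ( s[i] == from ) {
            s[i] = to;
        }
    }
}

void demo_replacec() {
    std::string pc( "the panama canal" );
    replacec( pc, 'a', 'o' );
    assert( pc == "the ponomo conol" );
}
```

If the parameter were declared as const string &, the assignment s[i] = to would fail to compile, because only the const version of operator[] would be available.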

The pos parameter of both versions is of type size_t. This is an unsigned integer type defined by the C++ Standard Library and is the same type as returned by the size() function. This is why the loop control variable in our code is defined like this:

size_t i = 0;

If we defined it like this:

int i = 0;

and then compiled with reasonable compiler warning levels, the compiler would emit warnings like this:

warning: comparison between signed and unsigned integer expressions

The character at the end of the string

In C++11, a change was made to specify what is lurking at the end of a std::string. The C++11 Standard says it is a character with the integer value zero (not the value ‘0’) and that it is legal to read (but not to write) this value using operator[] (in other words, the storage is pretty much the same as for a C-style string). This means, in C++11, we could rewrite our loop like this (assuming the string contains no embedded zero values):

for( size_t i = 0; s[i] != 0; ++i ) {

However, be aware that this is not the case for C++98 – in that Standard the character at the end of the string and the semantics of accessing it are undefined.
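As a complete example of a loop controlled by the terminator, this sketch measures a string the way strlen would. It assumes a C++11 library and a string with no embedded zero values; length_by_terminator is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts characters by walking until the terminating zero is read.
// Reading s[s.size()] is what stops the loop, which C++11 permits.
std::size_t length_by_terminator( const std::string & s ) {
    std::size_t i = 0;
    while ( s[i] != 0 ) {
        ++i;
    }
    return i;
}

void demo_terminator() {
    assert( length_by_terminator( "foobar" ) == 6 );
    assert( length_by_terminator( "" ) == 0 );
}
```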

The at() function

The [] operator is probably the most natural way of accessing characters in a string using an integer index, but it does have one drawback – it doesn’t check that the index used is valid for the size of the string being accessed. This means that code like this:

string s( "hello world" );
char c = s[100];

has undefined behaviour (we can’t say what a program containing such code will do), because we are accessing characters beyond the end of the string.

C++ does provide checking for index validity, via the at() member function. Like operator[], this comes in two flavours:

char & at( size_t pos );
const char & at( size_t pos ) const;
    

Both at() functions check the validity of their pos parameter, and if it is invalid throw a std::out_of_range exception. If you don’t catch this exception, the application that contains the invalid pos parameter will be terminated. On Windows, you will get a message like this:

This application has requested the Runtime to terminate it in an unusual way.

Note that unlike operator[], it is not allowed to access the string’s null-terminator with the at() function.

So, is it worth using the at() function to check all your string accesses? My personal opinion is “no”. I do use at() when I’m writing some very tricky code that juggles string indexes, but most code isn’t like that – for example, it’s hard to see how our countc function could ever perform an invalid character access.
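If you do use at(), catching the exception looks something like this sketch. The helper name char_at_or and its fallback parameter are hypothetical; the point is only to show std::out_of_range being caught rather than terminating the program.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the character at pos, or fallback if pos
// is not a valid index, instead of letting the exception escape.
char char_at_or( const std::string & s, std::size_t pos, char fallback ) {
    try {
        return s.at( pos );
    }
    catch ( const std::out_of_range & ) {
        return fallback;
    }
}

void demo_at() {
    std::string s( "hello world" );
    assert( char_at_or( s, 0, '?' ) == 'h' );
    assert( char_at_or( s, 100, '?' ) == '?' );
    // Unlike operator[], at() rejects the position of the null-terminator.
    assert( char_at_or( s, s.size(), '?' ) == '?' );
}
```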

Iterators

Strings are not strictly speaking Standard Library containers (historically, they were not part of the Standard Template Library), but in many ways they behave like containers such as std::vector and std::deque. This means we can use iterators to walk over the characters in the string. For example, we can re-write the countc function like this:

int countc( const string & s, char c ) {
    int n = 0;
    for( auto it = begin(s); it != end(s); ++it ) {
        if ( *it == c ) {
            n++;
        }
    }
    return n;
}

Here I’ve used the auto type deduction feature of C++11 to avoid having to write the full name of the iterator.

Why would you want to write code like that? Maybe you wouldn’t. However, the fact that std::strings support iterators means that they can be used by many of the C++ Standard Library algorithms. For example, here’s some code that does away with the countc function altogether:

#include <string>
#include <algorithm>
#include <cassert>
using namespace std;

int main() {
    string s( "the panama canal" );
    int n = count( begin(s), end(s), 'a' );
    assert( n == 5 ); 
}

This uses the std::count algorithm to count the number of occurrences of ‘a’ in the string.
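Other algorithms work in the same way. As a further sketch, std::count_if takes a predicate instead of a value, so it can count characters that match a condition. The predicate is_vowel and the function count_vowels are hypothetical names, and the predicate only recognises lower-case ASCII vowels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical predicate for illustration: true for lower-case ASCII vowels.
bool is_vowel( char c ) {
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

// Counts the vowels in s using std::count_if over the string's iterators.
std::size_t count_vowels( const std::string & s ) {
    return static_cast<std::size_t>(
        std::count_if( s.begin(), s.end(), is_vowel ) );
}

void demo_count_vowels() {
    // Five occurrences of 'a' plus the 'e' in "the".
    assert( count_vowels( "the panama canal" ) == 6 );
}
```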

Pointers

The characters comprising objects of type std::string are stored contiguously in memory and (as of C++11) are null terminated. That means the storage is the same as for a C-style string, and means that you can perform C-style iteration on such strings. Here’s countc re-written yet again to do this:

int countc( const string & s, char c ) {
    int n = 0;
    for( const char * p = &s[0]; *p != 0 ; ++p ) {
        if ( *p == c ) {
            n++;
        }
    }
    return n;
}

 

Conclusion

The std::string class supports multiple methods of iteration and of accessing individual characters. These include:

  • iteration using integer indexes and operator[] or the at() function
  • iterators
  • pointers

Things About Strings – Part 1

Introduction

Of all of the classes in the C++ Standard Library, the std::string class must be the most commonly used, by quite a wide margin. However, many beginner (and even experienced) C++ programmers do not seem to use many of the features of the class, so I thought it would be worthwhile to provide a survey of the features of std::string that you might not be aware of. I’ll be talking about std::strings as defined by the C++11 Standard, and the articles will be aimed at beginner to intermediate C++ programmers (and I promise to do my best to actually complete the series).

In this first article, I consider what std::strings really are, and look at how we go about creating them.

What exactly is std::string?

Actually, it’s a typedef! It’s defined in the C++ Standard Library like this:

typedef basic_string <char> string;


The basic_string class itself is a template:

template<
    class CharType,
    class Traits = char_traits<CharType>,
    class Allocator = allocator<CharType>
> class basic_string

The template parameter CharType says what kind of characters the string will contain – you can theoretically use any integer type to create new string types. The other two parameters specify the features of the character type, and the memory allocator that will be used to dynamically allocate the strings. It would be pretty unusual to use anything but the default values for these parameters, so I won’t consider them further here. To simplify things, I’ll refer to std::string throughout, even though occasionally I’ll be talking about the std::basic_string implementation.

But what’s inside a std::string?

The C++ Standard does not say how the string class must be implemented, and library writers have traditionally produced many different implementation mechanisms. You should never write code that depends on a particular implementation layout.

At heart, the string has to contain a pointer of some sort to the dynamically allocated memory it controls, the size of this memory block (which must be contiguous), and some way of obtaining the size of the sequence of characters currently held by the string in constant time. It is not required that std::string implement reference counting, and modern implementations typically do not do so. The sequence of characters the string holds must be null-terminated (like C-style strings), but may contain embedded null characters (more on this later).

Constructing std::strings

As with all C++ objects, we construct string objects using their constructors. The string class has more constructors than you can shake a stick at; to illustrate their uses I’ve written this small C++ program. I use the assert macro to make sure I get tricky things like offsets and lengths right:

#include <string>
#include <vector>
#include <cassert>
using namespace std;

int main() {
    string s0;
    string s1( "hello world" );
    string s2( s1 );
    string s3 = s1;
    string s4 = "hello world";
    string s5( 4, 'X' );
    assert( s5 == "XXXX" );
    string s6( "hello world", 5 );
    assert( s6 == "hello" );
    string s7( s1, 6, 5 );
    assert( s7 == "world" );
    vector <char> v{ 'a', 'b', 'c' };
    string s8( v.begin(), v.end() );
    assert( s8 == "abc" );
}

I’ll now look at these constructors in turn:

The string s0 is constructed using the default constructor, which takes no parameters. This constructs an empty string of size zero.

The string s1 is constructed via the constructor that takes a const char * as a parameter. The characters pointed to by this parameter are copied into the newly-created string, up to the first null character.

The string s2 is created using the copy constructor, which makes a copy of the string referred to by its parameter, which is a const string &.

Perhaps surprisingly, the string s3 is also created using the copy constructor – this is simply alternate syntax for a copy constructor call. Note that it is not a use of the assignment operator!

The next string creation is a bit complicated. Notionally, what happens is that the same constructor that was used to create s1 is used to create a temporary string, and then this temporary is copied, via the copy constructor, into s4. However, the compiler is allowed to do some removal of constructor calls here, and in fact the copy constructor will almost certainly not be called.

The string s5 is constructed by taking 4 copies of the character ‘X’. This constructor can be useful when you want to do things like creating “ASCII Art” frames.

The string s6 is constructed by taking the first 5 characters of the C-style string “hello world”.

The string s7 is constructed using the substring constructor. This takes a substring of its first parameter, starting at the zero-based offset of its second parameter, of a length of its third parameter. So in this case, starting at offset 6, it takes 5 characters from the string “hello world” to give a resulting string containing only “world”.

The last string, s8, is created using the constructor which takes a pair of iterators over some container holding character data – in this case a vector, but I could have also used a list or a deque, or anything that supports forward iteration.

But that’s not all!

The constructors described above have been part of the C++ Standard since its inception, but C++11 has added some new constructors to the mix. The first of these is the constructor that provides uniform initialisation via an initialiser list. If I added it to my little program above, it would look like this:

string s9 { 'a', 'b', 'c' };
assert( s9 == "abc" );

Frankly, it’s a little hard to see a use for this constructor in human-written code, but it may be needed when writing templates.

Much more useful is the provision of a move-constructor. This is something similar to the copy constructor, except that it does not make a copy. Consider this function:

void addbrackets( vector <string> & v, const string & s ) {
    v.push_back( "(" + s + ")" );
}

This pushes a copy of the string s, enclosed in parentheses, into the vector, so we might end up with the vector containing a sequence like “(quick)”, “(brown)”, “(fox)”. Why would we want such a thing? Who knows.

Prior to C++11, what this function would have done would be to create a nameless temporary string object which contained the bracketed value. It would then have used the copy constructor to create an entirely new copy of that value which would be stored in the vector. The temporary object would then be destroyed. Copying and destroying things are typically expensive operations, and we would like to avoid them, where possible.

In C++11, what happens is that the temporary object is still created, but its implementation (the pointers and other stuff I mentioned previously) is then simply moved across to the vector, leaving an empty temporary object. Because the temporary object is empty, no expensive destruction is necessary. All this is done via the move constructor, which has this rather frightening signature:

basic_string( basic_string&& ) noexcept

Mercifully, all this is handled for you by the compiler and the Standard Library – you don’t have to do anything special to have the move constructor invoked for you. If you build your code with a C++11 implementation, your code using strings will simply automagically get a little faster.
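The automatic case applies to temporaries. As a hedged sketch, the code below repeats addbrackets and adds a second, hypothetical function take, which shows the one situation where you do have to ask: moving from a named string needs std::move, and afterwards the source string is valid but its contents are unspecified.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// The temporary built by the concatenation is an rvalue, so push_back
// can move it into the vector rather than copy it.
void addbrackets( std::vector<std::string> & v, const std::string & s ) {
    v.push_back( "(" + s + ")" );
}

// Hypothetical helper: moving from a named string must be requested
// explicitly with std::move.
std::string take( std::string & from ) {
    return std::string( std::move( from ) );
}

void demo_move() {
    std::vector<std::string> v;
    addbrackets( v, "quick" );
    assert( v.size() == 1 && v[0] == "(quick)" );

    std::string a( "a string long enough to need dynamic allocation" );
    std::string b = take( a );
    assert( b == "a string long enough to need dynamic allocation" );
    // Nothing is assumed about the contents of a after the move.
}
```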

Conclusion

In this article I’ve given a helicopter view of what a std::string actually is, and how you can go about creating them via the wealth of constructors that the C++ Standard Library provides.

Common C++ Error Messages #2 – Undefined reference

Introduction

In this article I’ll be looking at the “undefined reference” error message (or “unresolved external symbol”, for Visual C++ users). This is not actually a message from the compiler, but is emitted by the linker, so the first thing to do is to understand what the linker is, and what it does.

Linker 101

To understand the linker, you have to understand how C++ programs are built. For all but the very simplest programs, the program is composed of multiple C++ source files (also known as “translation units”). These are compiled separately, using the C++ compiler, to produce object code files (files with a .o or a .obj extension) which contain machine code. Each object code file knows nothing about the others, so if you call a function from one object file that exists in another, the compiler cannot provide the address of the called function.

This is where the linker comes in. Once all the object files have been produced, the linker looks at them and works out what the final addresses of functions in the executable will be. It then patches up the addresses the compiler could not provide. It does the same for any libraries (.a and .lib files) you may be using. And finally it writes the executable file out to disk.

The linker is normally a separate program from the compiler (for example, the GCC linker is called ld) but will normally be called for you when you use your compiler suite’s driver program (so the GCC driver g++ will call ld for you).

Traditionally, linker technology has lagged behind compilers, mostly because it’s generally more fun to build a compiler than to build a linker. And linkers do not necessarily have access to the source code for the object files they are linking. Put together, you get a situation where linker errors, and the reasons for them, can be cryptic in the extreme.

Undefined reference

Put simply, the “undefined reference” error means you have a reference (nothing to do with the C++ reference type) to a name (function, variable, constant etc.) in your program that the linker cannot find a definition for when it looks through all the object files and libraries that make up your project. There are any number of reasons why it can’t find the definition – we’ll look at the commonest ones now.

No Definition

Probably the most common reason for unresolved reference errors is that you simply have not defined the thing you are referencing. This code illustrates the problem:

int foo();

int main() {
    foo();
}

Here, we have a declaration of the function foo(), which we call in main(), but no definition. So we get the error (slightly edited for clarity):

a.cpp:(.text+0xc): undefined reference to `foo()'
error: ld returned 1 exit status

The way to fix it is to provide the definition:

int foo();

int main() {
    foo();
}

int foo() {
    return 42;
}

 

Wrong Definition

Another common error is to provide a definition that does not match up with the declaration (or vice versa). For example, if in the code above we had provided a definition of foo() that looked like this:

int foo(int n) {
    return n;
}

then we would still get an error from the linker because the signatures (name, plus parameter list types) of the declaration and definition don’t match, so the definition actually defines a completely different function from the one in the declaration. To avoid this problem, take some care when writing declarations and definitions, and remember that things like references, pointers and const all count towards making a function signature unique.
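The point that the two signatures name different functions can be seen in this sketch, where both are declared and both are defined, so both calls link. The function check_overloads is a hypothetical name added for illustration. Removing the definition of foo() would bring back the undefined reference for the first call, even though foo(int) is still defined.

```cpp
#include <cassert>

int foo();          // one function...
int foo( int n );   // ...and a second, unrelated function with the same name

int foo() {
    return 42;
}

int foo( int n ) {
    return n;
}

// Hypothetical check: both calls link because both overloads are defined.
void check_overloads() {
    assert( foo() == 42 );
    assert( foo( 7 ) == 7 );
}
```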

Didn’t Link Object File

This is another common problem. Suppose you have two C++ source files:

// f1.cpp
int foo();

int main() {
    foo();
}

and:

// f2.cpp
int foo() {
    return 42;
}

If you compile f1.cpp on its own you get this:

f1.cpp:(.text+0xc): undefined reference to `foo()'

and if you compile f2.cpp on its own, you get this even more frightening one:

crt0_c.c:(.text.startup+0x39): undefined reference to `WinMain@16'

In this situation, you need to compile both the source files on the same command line, for example, using GCC:

$ g++ f1.cpp f2.cpp -o myprog

or if you have compiled them separately down to object files:

$ g++ f1.o f2.o -o myprog

For further information on compiling and linking multiple files in C++, particularly with GCC, please see my series of three blog articles starting here.

 

Wrong Project Type

The linker error regarding WinMain above can occur in a number of situations, particularly when you are using a C++ IDE such as CodeBlocks or Visual Studio. These IDEs offer you a number of project types such as “Windows Application” and “Console Application”. If you want to write a program that has an int main() function in it, always make sure that you choose “Console Application”, otherwise the IDE may configure the linker to expect to find a WinMain() function instead.

No Library

To understand this issue, remember that a header file (.h) is not a library. The linker neither knows nor cares about header files – it cares about .a and .lib files. So if you get a linker error regarding a name that is in a library you are using, it is almost certainly because you have not linked with that library. To perform the linkage, if you are using an IDE you can normally simply add the library to your project. If you are using the command line, once again please see my series of blog articles on the GCC command line starting here, which describes some other linker issues you may have.

Conclusion

The unresolved reference error can have many causes, far from all of which have been described here. But it’s not magic – like all errors it means that you have done something wrong, in your code and/or your project’s configuration, and you need to take some time to sit down, think logically, and figure out what.

My C++ Interview Questions

Introduction

Over the years, I’ve done my share (more, it has often seemed) of interviewing candidates for C++ programming posts. During this time I’ve zeroed in on a small subset of questions I ask candidates. These have no correct answers, do not refer to manhole covers, and require no maths ability to answer. They have, however, proved effective in deciding if the candidate actually has some knowledge of C++. I present them for your delectation.

“Tell me about the copy constructor”

This is my start-off question. If people look blank when I ask it (and depressingly, lots do), then I write out the signature of the constructor for them. If they still look blank, the interview terminates – this has saved me a lot of time over the years.

What I’m looking for here is someone that knows how important copying is in C++, where it takes place, and how it can be avoided. I’d expect a decent C++ programmer to be able to talk for around 15 minutes on this, with me asking subsidiary questions.

“What are your favourite C++ books? And why are they your favourites?”

C++ is a very complex language, and to learn it thoroughly you simply have to read several good books – internet resources will not be enough. I don’t particularly care which books the candidate talks about, providing there is more than one of them, and they can come up with some convincing reasons for liking them.

“Write a program to…”

I’m not a big fan of getting candidates to write code in interviews. Often, the problem they are asked to solve is too small to prove anything, and they are unfamiliar with the setup on your specific workstations, so you are really testing how quickly they can get to grips with a toolset. However, sometimes HR or senior management will insist on a coding test. If that’s the case, then I set this problem, or something very like it, which tests the candidate’s familiarity with the C++ Standard Library in several ways:

Write a program that can be called from a command line environment like this:

myprog bible.txt

The file bible.txt, which is provided, contains the text of the King James Bible. Your program should read this file and output the 10 most frequently occurring words, together with their frequency, ignoring any punctuation and character case. Use the style of coding, commenting etc. that you would for a large program.

As with the other questions here, I’m not particularly interested in the details of the solution, provided they show a knowledge of Standard Library classes like strings and maps, and that the program actually works.
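For a sense of scale, the core of one possible answer might look like the sketch below. It is not a model solution: the names WordCount, more_frequent and top_words are hypothetical, it reads from any input stream rather than opening the file named on the command line, and it makes the simplifying assumption that every non-letter character separates words (so an apostrophe splits a word in two).

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::size_t> WordCount;

// Orders by descending frequency, breaking ties alphabetically.
bool more_frequent( const WordCount & a, const WordCount & b ) {
    if ( a.second != b.second ) {
        return a.second > b.second;
    }
    return a.first < b.first;
}

// Reads text from in and returns up to n (word, count) pairs, most
// frequent first. Letters are lower-cased; anything else ends a word.
std::vector<WordCount> top_words( std::istream & in, std::size_t n ) {
    std::map<std::string, std::size_t> freq;
    std::string word;
    char c;
    while ( in.get( c ) ) {
        if ( std::isalpha( static_cast<unsigned char>( c ) ) ) {
            word += static_cast<char>(
                std::tolower( static_cast<unsigned char>( c ) ) );
        }
        else if ( !word.empty() ) {
            ++freq[word];
            word.clear();
        }
    }
    if ( !word.empty() ) {
        ++freq[word];   // the input ended in the middle of a word
    }

    std::vector<WordCount> counts( freq.begin(), freq.end() );
    std::sort( counts.begin(), counts.end(), more_frequent );
    if ( counts.size() > n ) {
        counts.resize( n );
    }
    return counts;
}

void demo_top_words() {
    std::istringstream in( "The cat and the dog. THE END, and... the cat!" );
    std::vector<WordCount> top = top_words( in, 2 );
    assert( top.size() == 2 );
    assert( top[0].first == "the" && top[0].second == 4 );
    assert( top[1].first == "and" && top[1].second == 2 );
}
```

A full answer would wrap this in a main() that opens the file named on the command line with std::ifstream, reports an error if it cannot be opened, and prints the ten pairs.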

“What do you think the three most important features of C++11 are?”

Or similar – if the candidate has professed a knowledge of a library like Boost, I might ask about that instead. Once again, I don’t particularly care which features the candidate thinks are important, only that they can talk about them articulately and knowledgeably.

“If you were interviewing me, what question would you ask me?”

This one is down to a guy I worked with at the now defunct Lehman Brothers (thank you, Mr James). He interviewed me, and after a couple of questions said, “Well, you obviously know a lot more about C++ than I do, so tell me, if you were interviewing me, what questions would you ask?” I thought that was brilliant at the time, and still do – use it if you are interviewing a self-styled guru. Of course, you have to ask them why they would ask that question!

Conclusion

From the above, I think you can see that you do not have to ask candidates trick questions, or puzzles about manhole covers. Instead, you should be trying to get them to talk at length about their knowledge of the C++ language and how it can and should be used.