Skip to content

Five Easy Pieces #3 – Picking From A Stream

April 18, 2012

Introduction

This is the third article in a series of five, in each of which I will present a simple problem and show how it can be solved in C++. The solutions will illustrate various aspects of C++ and its Standard Library in what is hopefully an easily digestible form, and each will build on the solution to the previous problem.

You will need some basic knowledge of C++ in order to follow these articles, and I suggest if you want to learn from them, you equip yourself with a C++ compiler (instructions for how to install one for Windows are here) and try to write and compile code to solve  problem as (or before!)  I present the solution.  This article builds on the second one, so it would be a good idea to read that before proceeding.


Problem

Write a program called spick which picks N  lines at random from standard input. The number of lines to pick (which must be greater than zero) is to be provided on the program’s command line. So, for example, this Linux command line would pick 1 value from the names of file sin the current directory:

ls | spick 1

while this Windows command line would select 3 lines at random from a text file called names.txt:

spick 3 < names.txt

Reading Input

This problem obviously has three main challenges – how to read the lines from standard input, how to read a number from the command line, and how to use the lines and the number to generate the required randomised output. Let’s start with reading lines from standard input.

Most C++ programmers write a program like this when they are beginning to learn the language:

#include <iostream>
using namespace std;

int main() {
    char name[16];
    cin >> name;
    cout << "Hello " << name << "\n";
}

Unfortunately, this program has two problems. Firstly, the >> operator only reads up to the first whitespace character (i.e. it reads a word), so that if you interact with the program like this:

Jane Doe
Hello Jane

you will only get the first name of the person stored in the name[] array.

The second and more serious problem is that it’s very easy to write outside the bounds of the name[] array by entering a long string. Writing past the end of the array results in your program having "undefined behaviour", which is something you want to avoid.

Both problems can be avoided by using the C++ std::string class, which allocates strings dynamically for you and grows them to accept any length of input,  and its associated std::getline() function which reads entire lines of text instead of individual words. Using these, the program above  can be re-written as:

#include <iostream>
#include <string>
using namespace std;

int main() {
    string name;
    getline( cin, name );
    cout << "Hello " << name << "\n";
}

You can adapt this program to read multiple names from standard input by using a while loop:

#include <iostream>
#include <string>
using namespace std;

int main() {
    string name;
    while( getline( cin, name ) ) {
        cout << "Hello " << name << "\n";
    }
}

You need to use the getline() call in the while() test so that you can detect if it actually has read a line or encountered end of file. The getline() function actually returns a stream reference, but this is converted for you into a boolean value that you can test to see if the call worked.

Now you can  run the program  like this:

spick < names.txt

which given a text file containing names like this;

John Doe
Fred Bloggs
Jane Doe

will produce this:

Hello John Doe
Hello Fred Bloggs
Hello Jane Doe

It’s also worth knowing how not to do this. Beginners often come across the eof() stream member function and use it to write code like this:

#include <iostream>
#include <string>
using namespace std;

int main() {
    string name;
    while( ! cin.eof() ) {
        getline( cin, name );
        cout << "Hello " << name << "\n";
    }
}

This produces the following incorrect output – there’s an extra "Hello":

Hello John Doe
Hello Fred Bloggs
Hello Jane Doe
Hello

The reason for this is easy to see once you know how eof() works. It tests a status bit in the stream that is only set after a read operation encounters the end of the file. For the logic in the above code to work, eof() would somehow have to predict that the next read operation was going to  encounter the end of file, and that it cannot do. You should generally never use eof() to control read loops.

You can now read lines from a stream, but are currently discarding them. It’s probably a better idea to hang on to the lines as you read them, so that you can do something with them later (like pick some). To do this you need to read the lines into one of the C++ Standard Library container classes. The default container, used when you perhaps are not quite sure what you are going to do next, is the std::vector, which is effectively a dynamically sized array that grows as you add things to it. Vectors are template classes, so you need to say what sort of thing you want to store in them, in this case strings. Here’s the code with the vector added, and some name changes to reflect the real purpose of variables:

#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector <string> lines;
    string line;
    while( getline( cin, line ) ) {
        lines.push_back( line );
    }
}

The push_back() member function of vector adds its parameter to the end of the vector, growing the vector to fit it. You therefore end up with a vector full of lines in the order they were read.

Getting A Number

You can now input lines from standard input, so let’s leave that aspect of the problem for the moment and consider how to read a number from the command line. If you remember from the the previous article, the command line is accessed via the parameters of the main() function, which looks like this:

int main( int argc, char *argv[] );

so if you ran your program like this:

spick 3

then argc would be 2 (the total number of things on the command line), argv[0] would point at the string "spick"  and argv[1] would point at the string "3".

The problem is of course that you don’t want a string "3", you want an integer value 3. So how to convert from "3" to 3, or from "42" to 42, for example? C++ actually supplies multiple ways of doing this, all of which have advantages and disadvantages.

Firstly, you could use the std::atoi() function, declared in the <cstdlib> header. It looks like this:

int atoi( const char * s );

The function takes a string of characters and converts them to an integer value, if possible. If the conversion cannot be performed, the function returns zero. So:

atoi( "42" );     // returns 42
atoi( "0" );      // returns 0
atoi( "foo" );    // returns 0
atoi( "42foo" );  // returns 42

The problem is you cannot tell if the return value of atoi() is actually an error or not, because both "0" (ok) and "foo" (bad) return a  value of zero. Also, you can’t tell if the input was partly correct, like "42foo". Checking for values like "42foo" is important in well-written code, as it is almost certainly an error on the user’s part that they need to be informed of.

The second possibility is the std::strtol() function. This is a somewhat more imposing function than atoi(), and looks like this:

long strtol( const char * s, char ** p, int base );

The idea of this function is that it converts the string of characters pointed to by the pointer s, and stops when it meets something that is not part of a valid integer, like the ‘f’ character in "42foo". If no invalid characters are found, it stops when it meets the null-terminator at the end of the string. In either case, the pointer p is set to point at the character that caused conversion to terminate. The conversion is done using the numeric base specified by the third parameter.

So, to convert your command line using strtol(), you could write code something like this:

#include <cstdlib>
#include <iostream>
using namespace std;

int main( int argc, char *argv[] )  {
    if ( argc != 2 ) {
        cout << "wrong args\n";
    }
    else {
        char * end;
        long n = strtol( argv[1], & end, 10 );
        if ( * end == 0 ) {
            cout << "ok value " << n << "\n";
        }
        else {
            cout << "invalid value: " << argv[1] << "\n";
        }
    }
}

The strotol() function is quite powerful, and in fact is what atoi() and similar functions like scanf() are actually using under the hood. However, you may feel uncomfortable messing around with pointers and want a more C++-like solution. And there is one – you can use a stringstream.

Think of  what happens if you write C++ code like this:

int n;
cin >> n;

If you enter an integer, all is well, but if you don’t the stream goes bad. You can test for bad streams, so a C++ way to convert integers is to try to convert them using a stream, and then detect if the stream is still OK after the conversion.

Unfortunately, you don’t have a stream, you have a C-style string, and you cannot say things like this:

argv[1] >> n;

You need to convert the string to a stream first, and C++ supplies a type istringstream (declared in <sstream>) which will do the job – it creates an input stream from a string in memory. Let’s write a function to convert a string to an integer using an istringstream:

int StrToInt( const char * s ) {
    istringstream is( s );
    int n;
    s >> n;
    return n;
}

This is OK as far as it goes, but you haven’t detected the error condition. And when you do detect the error, what are you going to return? You have already seen how functions like atoi() cannot return error codes for all conditions. You could take the strtol() approach and return a  pointer to any problem character, but in that case you might as well simply use strtol(). For a C++ solution, you need to do something different – you need to throw an exception. Exceptions are perfect for this kind of situation where no sensible return value is available. With error checking and exceptions, the function looks like this:

int StrToInt( const char * s ) {
    istringstream is( s );
    int n;
    if ( is >> n && is.eof() ) {
        return n;
    }
    else {
        throw "invalid integer";
    }
}

Note the re-appearance of the eof() function that I said you shouldn’t generally be using. However, this is a correct use of the function – you want to say that things are OK if you converted some characters AND that there are no more characters to convert – i.e. if you are at the end of the stream.

Putting Things Together

You can now read lines from a file into a vector, and you can get a number from the command line. You now need to use the number you read (let’s call it N) to pick N values at random from the vector. The simplest way to do that is to loop N times and each time through the loop call the Random() function from the previous article in this series to select a line from the vector. Here’s code that does that:

int main( int argc, char *argv[] ) {

    int n = StrToInt( argv[1] );

    vector <string> lines;
    string line;
    while( getline( cin, line ) ) {
        lines.push_back( line );
    }

    Randomise();
    while( n-- > 0 ) {
        int r = Random( lines.size() );
        cout << lines[ r ] << "\n";
    }
}

Here the random number returned by Random() is used to select one of the lines from the vector using vector’s overloaded operator[], just like you would from a fixed-size C-style array.

This code works, but it does have  a few flaws.  Firstly, if the user specifies a count of zero or less on the command line, there is no output. This may or may not confuse the user, but it would be nice to tell them they need to use a positive number. Also, if they enter a non-number on the command line, they will be dumped back to the operating system with an incomprehensible error message, as you don’t catch the exception thrown by the StrToInt() function. Fixing the first problem is easy – you just need to write some checking code at the start of main():

    if ( argc != 2 ) {
        throw "usage: rpick count";
    }

    int n = StrToInt( argv[1] );
    if ( n <= 0 ) {
        throw "count must be greater than zero"; 
    }

Note that exceptions are used to report the possible errors – this gives a you a unified means of handling all error messages in one location.

To catch the exceptions that are thrown, you need to write a try block and one or more catch blocks. In outline, these will like this:

try {
   // stuff that might throw
}
catch( exception type X ) {
   // handle exceptions of type X
}

If an exception is thrown by code inside a try block, the handling mechanism looks for a catch block that matches the exceptions type, and executes the code in that block. In the case of  the spick program, all the exceptions thrown by user code are string literals which have the type const char *. So with exception handling added the final program looks like this:

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib>
#include <ctime>
using namespace std;

int Random( int n ) {
    return rand() % n;
}

int Randomise() {
    srand( time(0) );
}

int StrToInt( const char * s ) {
    istringstream is( s );
    int n;
    if ( is >> n && is.eof() ) {
        return n;
    }
    else {
        throw "invalid integer";
    }
}

int main( int argc, char *argv[] ) {

    try {

        if ( argc != 2 ) {
            throw "usage: rpick count";
        }

        int n = StrToInt( argv[1] );
        if ( n <= 0 ) {
            throw "count must be greater than zero"; 
        }

        vector <string> lines;
        string line;
        while( getline( cin, line ) ) {
            lines.push_back( line );
        }

        Randomise();
        while( n-- > 0 ) {
            int r = Random( lines.size() );
            cout << lines[ r ] << "\n";
        }
    }
    catch( const char * msg ) {
        cerr << msg << "\n";
    }
}

Now when you run the code correctly, random lines are selected:

$ spick 2 < names.txt
Jane Doe
Fred Bloggs

and if you enter an incorrect count, you get an informative error message:

$ spick -5 < names.txt
count must be greater than zero

Note that the code does not attempt to wrap individual function calls in try/catch blocks. In fact, if you do this, you are misusing exception handling – you should always wrap the largest block of code you possibly can, and do so at the highest possible level of the program.

Conclusion

In this article, you have seen how to read text from a stream into a vector, and how to then access the text in the vector. You have also seen how to validate numbers in C++ and how to use exceptions to handle errors.

Footnote

The algorithm used for picking random lines presented here works well for small to medium files, but for large files has the disadvantage that you have to read everything into memory before picking. There is an algorithm that will read a file of any size and pick from it while keeping only a single line in memory, but it will only work for picking one line at random, not multiple ones. Here’s an implementation of it for you to puzzle over:

#include <iostream>
#include <string>
#include <cstdlib>
#include <ctime>
using namespace std;

int Random( int n ) {
    return rand() % n;
}

int Randomise() {
    srand( time(0) );
}

int main() {
    Randomise();
    string pick, line;
    int n = 0;
    while( getline( cin, line ) ) {
        if ( Random(++n) == 0 ) {
            pick = line;
        }
    }
    cout << pick << "\n";
}
Advertisements

From → c++, tutorial

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: