
Writing a Real C++ Program – Part 4

August 5, 2011

This is the fourth instalment in a series of C++ programming tutorials that started here.

Introduction

In this instalment, I want to look at how to go about getting the words that your spellchecker will check. Recall that words are handed to you as part of a submission, which is an ASCII text document. You need to check the words in the document, reporting the lines at which misspellings occur, together with a bit of context from the line.

For the moment, I suggest you not worry too much about what exactly is meant by a "word" (as you’ll see later, it’s quite complicated), and simply say that it’s a string that you can pass to the dictionary you already implemented for checking. This is a fairly common tactic when programming – put off fully defining things for as long as possible. What you need is some way of getting words out of a submission. Several alternatives come to mind.

  • You could make the submission be able to provide words. This is certainly do-able, but in general you should try to avoid designing things that are both physical entities (in this case text files) and that have complex actions that can be performed on them, in this case extracting words. Make storage entities just concern themselves with storage, and put actions somewhere else.
  • You could write a GetWord function. This was in fact the first idea I had when I was implementing the checker. I soon found out, however, that I would need to remember quite a lot of state (things like position in a line) between calls to such a function. Whenever you have remembered state, a function is not likely to be the right answer.
  • You could implement it as a new class – one we haven’t identified yet. This is, I believe, the correct solution in this case, but it won’t always be so.

So if you need a new class, what to call it? Well, something like WordGetter springs immediately to mind, but when naming things it’s usually a good idea not to go along with your first idea. Good names are important. What you are actually going to be doing here is parsing the text in the submission into words. It looks like Parser might be a good name.

Designing the Parser

What is the interface to the Parser going to look like? There are several requirements that you already know. You are going to have to give it a submission to parse, and it’s going to have to give you the words in the submission, one at a time. You also need to know what the current line number and context are for error reporting.

How to pass the submission to the Parser? You could give it the file name, as you did for the Dictionary class, but in this case you don’t only want to parse files – it would be useful if the Parser could also read standard input, which doesn’t have a name. So it looks like passing in an input stream object would be a good idea. You could do this via a SetStream method, but a better bet is to pass it as a constructor parameter – that way you never have an "empty" Parser with no submission associated with it.

You also need some kind of NextWord function which pulls the next word out of the submission. You already use std::strings for words in the Dictionary, so returning a string seems like a good idea. You also need some way of indicating "no more words" – an empty string would do that, at least for now. For the line number and the context, functions that return an integer and a string respectively should do the job.

Implementing the Parser

Putting this together, you can begin to write your C++ code. Create a new header file in the inc directory called parser.h:

#ifndef INC_SCHECK_PARSER_H
#define INC_SCHECK_PARSER_H

#include <string>
#include <iostream>

class Parser {
    public:
        Parser( std::istream & is );
        std::string NextWord();
        unsigned int LineNo() const;
        std::string Context() const;
};

#endif

 

A few things to note here. Use an istream and not an ifstream, because you want to be able to read stdin, which is not an ifstream. Make the LineNo function return an unsigned integer – you are not going to be dealing with negative line numbers. And make the LineNo and Context functions const – calling them should not change the object they are called on. Writing the public interface for the class out like this is a good cheap design method. At this point you don’t need to worry about the private implementation.
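To see why taking a std::istream & is the more flexible choice, here is a minimal sketch. FirstLine and FirstLineDemo are hypothetical names used only for illustration; they are not part of scheck.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of scheck): returns the first line of any
// input stream, whatever its concrete type.
std::string FirstLine( std::istream & is ) {
    std::string line;
    std::getline( is, line );
    return line;
}

// A string stream is used here so the sketch needs no files, but
// FirstLine( std::cin ) and FirstLine( someIfstream ) compile just as well,
// because std::cin and std::ifstream objects are std::istreams too.
void FirstLineDemo() {
    std::istringstream text( "quick brown fox\nlazy dog" );
    assert( FirstLine( text ) == "quick brown fox" );
}
```

Had the parameter been declared as std::ifstream &, neither the string stream nor std::cin could have been passed in.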

Up until now, you have implemented all the code for the application in header files, with the exception of the main function. However, that is not how real C++ programs are normally written, for a variety of reasons that I’ll discuss in one of the next instalments. Normally, the declarations for the classes are all that go in a header file, and the definitions go in a .cpp source file. From now on, you’ll implement most code in this way, so create a file called parser.cpp in the src directory that looks like this:

#include "parser.h"

You can check that the new code builds, though of course it won’t do anything yet:

$ g++ -I inc src/main.cpp src/parser.cpp -o bin/scheck

You now need some test data. I recommend sticking with the "brown fox" dictionary and producing a test submission file that looks like this:

quick brown fox
lazy dog
jumped
baaadd
good but not in dictionary

Call it sub1.txt and put it in the data subdirectory. Now modify your main function to use this data and your parser:

try {
    cout << "scheck version 0.4" << endl;
    Dictionary d( "data/mydict.dat" );
    
    ifstream sub( "data/sub1.txt" );
    if ( ! sub.is_open() ) {
        throw ScheckError( "cannot open data/sub1.txt" );
    }
    
    Parser p( sub );

    string word;
    while( ( word = p.NextWord() ) != "" ) {
        if ( d.Check( word ) ) {
            cout << word << " is OK\n";
        }
        else {
            cout << word << " is misspelt\n";
        }
    }
}

Check that this change builds – it should compile, but you will get a couple of linker errors because the Parser member functions are declared but not yet defined – let’s do that now. You need to add implementations of the Parser constructor and NextWord method. The NextWord method has to extract a word from the input stream and return it. Well, the C++ extraction operator (that is, the >> operator) will kind of do that. It has a number of issues, which I’ll discuss in a moment, but it can be used to provide a first approximation of what you need. Let’s try it.

First, you need to add an istream reference to the Parser class as a private member:

class Parser {
    public:
        ... as before ...
    private:
        std::istream & mIn;
};

You need a reference, because stream objects cannot be copied, so a simple value would not do. You could have used a pointer, but that would make the remaining code significantly uglier – in general if you have a choice between using a reference and a pointer, prefer the reference. You can now add the constructor implementation to parser.cpp:

#include "parser.h"
using std::istream;

Parser :: Parser( istream & is ) : mIn( is ) {
}

A couple of things to note here. Firstly, a using declaration is provided. This means you don’t have to prefix all mentions of istream with the std:: namespace specifier. Normally, you will want to add using declarations like this for commonly used types. You should only do this in the source code file however, not in the header. The other point of interest is that an initialiser list is used to initialise the mIn istream member variable. In fact, in this case such a list must be used: a reference member has to be bound when the object is constructed, and in any case you cannot assign istreams – this would not compile:

Parser :: Parser( istream & is ) {
    mIn = is;
}
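As a minimal sketch of the working form, here is a stand-alone class that holds a stream by reference in the same way. StreamHolder, Next and StreamHolderDemo are hypothetical names for illustration only, not part of scheck.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical class for illustration: it holds a reference to a stream.
class StreamHolder {
    public:
        // The reference member is bound here, in the initialiser list.
        // It cannot be left unbound and then given a value in the body.
        StreamHolder( std::istream & is ) : mIn( is ) {
        }

        // Reads one whitespace-separated token through the reference.
        std::string Next() {
            std::string s;
            mIn >> s;
            return s;
        }
    private:
        std::istream & mIn;
};

// Because mIn is a reference, the holder and the caller share one stream.
void StreamHolderDemo() {
    std::istringstream in( "lazy dog" );
    StreamHolder h( in );
    assert( h.Next() == "lazy" );
    std::string rest;
    in >> rest;               // continues where the holder left off
    assert( rest == "dog" );
}
```

One consequence of holding a reference is that the stream must outlive the object that refers to it.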

You can now try writing the NextWord function. You should end up with something similar to this:

#include "parser.h"
#include "error.h"

using std::istream;
using std::string;

Parser :: Parser( istream & is ) : mIn( is ) {
}

string Parser :: NextWord() {
    string word;
    if ( mIn >> word ) {
        return word;
    }
    else if ( mIn.eof() ) {
        return "";
    }
    else {
        throw ScheckError( "read error" );
    }
}

 

Despite advice in a previous instalment not to use eof(), here it is being used – what gives? Well, here it is being used correctly – the code first tests whether the input worked, and only then uses eof() to see what caused the failure. If end-of-file was the culprit, that is fine, and the code returns an empty string signifying this. If it wasn’t the culprit, then a general read error exception is thrown – frankly this is unlikely ever to be activated.
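The pattern of testing the read first and asking eof() only afterwards is easier to see with a type that can fail for reasons other than running out of input. The following is a minimal sketch, assuming a hypothetical SumInts helper that is not part of scheck; it reads integers rather than words purely to make the two kinds of failure visible.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper: adds up integers until an extraction fails, then
// reports whether that failure was a genuine end of input.
bool SumInts( std::istream & is, int & sum ) {
    sum = 0;
    int n;
    while ( is >> n ) {       // test the read itself first
        sum += n;
    }
    return is.eof();          // only now ask why the read failed
}

void SumIntsDemo() {
    int sum = 0;
    std::istringstream good( "1 2 3" );
    assert( SumInts( good, sum ) && sum == 6 );   // stopped at end of input

    std::istringstream bad( "1 2 x" );
    assert( ! SumInts( bad, sum ) && sum == 3 );  // stopped at the bad token
}
```

With std::string the second case hardly ever arises, which is why the read error branch in NextWord is so unlikely to run.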

If you compile and run the spellchecker with this version of NextWord:

$ g++ -I inc src/main.cpp src/parser.cpp -o bin/scheck
$ bin/scheck.exe

you should get the somewhat gratifying output:

scheck version 0.4
quick is OK
brown is OK
fox is OK
lazy is OK
dog is OK
jumped is OK
baaadd is misspelt
good is misspelt
but is misspelt
not is misspelt
in is misspelt
dictionary is misspelt

This looks pretty good! It has passed the words that are in the little dictionary, and rejected those that are not. It looks like Parser has been as simple to write as Dictionary was! Unfortunately, that is not the case. Remember that you need the line numbers of the misspelt words and their context. How are you to get them?

It turns out that by using operator >> you have discarded all ideas of file structure like line numbering, as it just sees the file as a character stream, with end-of-line being just another whitespace character. This is one of the many things that make the operator more or less useless when used directly – if you want line numbers, you will have to read lines yourself. This, as it happens, is not too difficult to do.
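A small sketch makes the loss of line structure concrete. AllWords and AllWordsDemo are hypothetical names, not part of scheck.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every token operator >> produces.
std::vector<std::string> AllWords( std::istream & is ) {
    std::vector<std::string> words;
    std::string word;
    while ( is >> word ) {
        words.push_back( word );
    }
    return words;
}

void AllWordsDemo() {
    std::istringstream oneLine( "quick brown fox" );
    std::istringstream threeLines( "quick\nbrown\nfox" );
    // The same three tokens come back either way: operator >> skipped the
    // newlines as whitespace, so nothing records which line a word was on.
    assert( AllWords( oneLine ) == AllWords( threeLines ) );
}
```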

Adding Line Numbers

In previous instalments, you have seen how to read lines of text using the getline function. What you need to do is to read your submission files line by line, keeping track of the line number. You then need to chop each line up into words. In fact, what you need to do is to treat each line as if it were a little stream – you can then use operator >> on it to get the words. Luckily, C++ provides exactly this facility in the istringstream class.
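Before folding the idea into the Parser class, it may help to see it in isolation. This is only a sketch: WordsWithLines and the NumberedWord typedef are hypothetical names, and the function collects everything in one go rather than handing out one word per call as the Parser has to.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<unsigned int, std::string> NumberedWord;

// Hypothetical helper: returns each word paired with its line number.
std::vector<NumberedWord> WordsWithLines( std::istream & is ) {
    std::vector<NumberedWord> result;
    std::string line, word;
    unsigned int lineNo = 0;
    while ( std::getline( is, line ) ) {
        lineNo++;
        std::istringstream words( line );   // the line as a little stream
        while ( words >> word ) {
            result.push_back( NumberedWord( lineNo, word ) );
        }
    }
    return result;
}
```

Blank lines yield no words but are still counted, so the numbers stay in step with the file.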

First, you will need to change the header file. You need to add a string to read lines into, which will also be used to provide context, a line counter and an istringstream. To make life easier, you should also add a private ReadLine function which will encapsulate some of the messy details:

#ifndef INC_SCHECK_PARSER_H
#define INC_SCHECK_PARSER_H
#include <string>
#include <iostream>
#include <sstream>
class Parser {
    public:
        Parser( std::istream & is );
        std::string NextWord();
        unsigned int LineNo() const;
        std::string Context() const;
    private:
        bool ReadLine();
        std::istream & mIn;
        std::string mLine;
        unsigned int mLineNo;
        std::istringstream mIs;
};
#endif

Now you can easily implement the LineNo and Context functions:

unsigned int Parser :: LineNo() const {
    return mLineNo;
}

string Parser :: Context() const {
    return mLine;
}

You need to initialise the line number to zero in the constructor:

Parser :: Parser( istream & is )
    : mIn( is ), mLineNo( 0 ) {
}

Now comes the tricky bit. The logic for NextWord changes to this:

  • try to read a word from the stringstream (not from the file!)
  • if that worked, return the word (as before)
  • if it failed and we are at the end of the stringstream, read a line and use it to populate the stringstream. Then call NextWord recursively to get a word, or return an empty string if there are no more lines
  • otherwise, there was an error reading the stringstream (unlikely)

The code to implement it is:

string Parser :: NextWord() {
    string word;
    if ( mIs >> word ) {
        return word;
    }
    else if ( mIs.eof() ) {
        if ( ReadLine() ) {
            return NextWord();
        }
        else {
            return "";
        }
    }
    else {
        throw ScheckError( "string stream read error" );
    }
}

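Lastly, you need to implement ReadLine. This uses getline to read lines from the submission. If it successfully reads a line, it places it in the stringstream so that NextWord can begin pulling words out of it. ReadLine returns true if it managed to read a line and false if it reached the end of the submission; any other read failure throws an exception. Note that clear() is called before str(): by this point the stringstream has run off the end of its previous contents, and loading new text into it does not reset its error flags. The code looks like this: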

bool Parser :: ReadLine() {
    if ( getline( mIn, mLine ) ) {
        mIs.clear();
        mIs.str( mLine );
        mLineNo++;
        return true;
    }
    else if ( mIn.eof() ) {
        return false;
    }
    else {
        throw ScheckError( "file read error" );
    }
}
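The call to clear() in ReadLine is easy to overlook, so here is a minimal sketch of what happens with and without it. ReuseStream is a hypothetical function written only for this demonstration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical demonstration: runs a string stream off the end of its first
// contents, loads `next` into it, and reports whether a word can then be read.
bool ReuseStream( const std::string & next, bool clearFirst ) {
    std::istringstream is( "first" );
    std::string word;
    is >> word;               // reads "first" and reaches end of stream
    is >> word;               // fails: the stream is now in a failed state
    if ( clearFirst ) {
        is.clear();           // reset the state flags
    }
    is.str( next );           // replaces the contents, but not the flags
    return ! ( is >> word ).fail();
}

void ReuseStreamDemo() {
    assert( ReuseStream( "second", true ) );     // cleared: the read works
    assert( ! ReuseStream( "second", false ) );  // not cleared: still failed
}
```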

NextWord and ReadLine together are the most complicated code presented so far, and there are a couple of things to notice about them. Firstly, adding the ReadLine function makes the code much easier to understand (if you don’t believe this, a nice exercise is to rewrite NextWord without the ReadLine function). Whenever you come across complex code (any function longer than about 20 lines is too complex for my liking), break it down into smaller functions. Secondly, use of recursion is common in C++ – it is not some "special feature" only used by Computer Science types. Here the recursion only goes as deep as the longest run of lines containing no words, because a call that finds a word returns immediately.

But does it work? You need to make one small change to main:

 cout << word << " is misspelt at line " << p.LineNo() << "\n";

If you now rebuild and run, you should get this output:

scheck version 0.4
quick is OK
brown is OK
fox is OK
lazy is OK
dog is OK
jumped is OK
baaadd is misspelt at line 4
good is misspelt at line 5
but is misspelt at line 5
not is misspelt at line 5
in is misspelt at line 5
dictionary is misspelt at line 5

Which looks pretty near to what you want the final output to be.
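The Context function has not been exercised yet. The sketch below shows one way it might be used alongside LineNo. It is an illustration under several assumptions: the Parser is a condensed copy of the one built in this instalment, std::runtime_error stands in for ScheckError, a std::set stands in for the Dictionary class, and Report and its output format are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Condensed copy of this instalment's Parser, with std::runtime_error
// standing in for ScheckError so that the sketch compiles on its own.
class Parser {
    public:
        Parser( std::istream & is ) : mIn( is ), mLineNo( 0 ) {
        }
        std::string NextWord() {
            std::string word;
            if ( mIs >> word ) {
                return word;
            }
            else if ( mIs.eof() ) {
                if ( ReadLine() ) {
                    return NextWord();
                }
                return "";
            }
            throw std::runtime_error( "string stream read error" );
        }
        unsigned int LineNo() const { return mLineNo; }
        std::string Context() const { return mLine; }
    private:
        bool ReadLine() {
            if ( std::getline( mIn, mLine ) ) {
                mIs.clear();
                mIs.str( mLine );
                mLineNo++;
                return true;
            }
            else if ( mIn.eof() ) {
                return false;
            }
            throw std::runtime_error( "file read error" );
        }
        std::istream & mIn;
        std::string mLine;
        unsigned int mLineNo;
        std::istringstream mIs;
};

// Hypothetical report: one line per unknown word, giving the line number,
// the word and the whole line as context.
std::string Report( std::istream & sub, const std::set<std::string> & dict ) {
    Parser p( sub );
    std::ostringstream out;
    std::string word;
    while ( ( word = p.NextWord() ) != "" ) {
        if ( dict.count( word ) == 0 ) {
            out << p.LineNo() << ": " << word
                << " [" << p.Context() << "]\n";
        }
    }
    return out.str();
}
```

Because the submission is passed as a std::istream &, the same Report function could be given std::cin or an open ifstream instead of a string stream.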

Conclusion

On that high note, we’ll call it a day. Unfortunately, in the next instalment you’ll see that things are not quite so easy as that – basically, the definition of a word that this code implies is too simplistic. Still, hopefully you have learned:

  • Using operator >> directly on files usually doesn’t work.
  • But it can work well if used indirectly on the istringstream class.

Coming next: Hyphens, punctuation, numbers and other annoyances.

Sources for this and all other tutorials in the series are available here.


From → c++, linux, tutorial, windows

3 Comments
  1. Duncan

    I know you said that you wouldn’t discuss your naming conventions and things like that but is “mIn” really a good name for the istream. Apart from that I like it. Keep it up.

  2. It seems like you forgot to change the mIn variable to mSubmission in the rest of the code.

    Great write ups btw.
