Skip to content

Writing a Real C++ Program – Part 5

August 9, 2011

This is the fifth instalment in a series of C++ programming tutorials that started here.

Introduction

In the previous instalment,  you saw how to use a mixture of the C++ Standard Library getline function, an istringstream and operator >> to pull words out of the text for a submission to the spelling checker. This was quite easy to do, but unfortunately it has one glaring problem – its idea of what words are, and how they should be constructed, is too simplistic.

The code in the previous instalment assumed that words are separated by whitespace characters. Unfortunately, that is not true – in English text lots of different punctuation characters are used as well as spaces. Also, some of those punctuation characters become part of the words being extracted. Here are some examples:

English Words
light-hearted light hearted
he said, “hello” he said hello
the dog’s lazy the dog’s lazy
one,two,three one two three

Also, you need to handle numbers somehow. Numbers should not be spellchecked, and normally neither should “words” that contain numbers, such as “R2D2”. Lastly, hyphens at the end of lines need to be removed, but need to join the word-parts they hyphenate in the final output. so the hyphenated word in this:

It was a great mis-
take at the time.

Should become “mistake” when it is extracted it from the submission.

The solution in the previous instalment can be adapted to provide these features, but it would be a bit of a hack. Instead, I propose that a better idea is to write your own parser using a state machine.

State Machines

A state machine is program which has a number of states between which the program moves depending on rules called transitions. Here’s a diagram for a simple state machine that implements checking if a string of characters is a signed integer (circles are states, arrows are transitions):

state1

which recognises strings containing numbers like “42”, “0” and “-1”, but rejects things like “+-1” or “12.34”.

The machine starts in the START state and looks at the first character in the string; if it is a + or a – sign, it moves to state 2. If it is a digit, it moves to state 3. If neither of these is the case, we have an error. The machine is now in state 2 or state 3. If it is in state 2, the next character read must be a digit (because we can’t have + or – on its own), in which case we move to state 3. In state 3, we can keep reading as many digits as we want – the only way out of this state without an error is to read an end-of-string (EOS) character, in which case we move to the END state, and the number has been validated.

State machines are widely used in real programs – regular expressions are one common form of state machine (actually all programs are really extremely complex state machines), and its common to create your own to solve specialised tasks.

Words Machine

The state machine for extracting words is not much more complex than that for recognising numbers:

state2

We start in state 1, and if the first character we look at is a letter (alpha) we move to state 2; if it is a digit, we move to state 3. If it is anything else (indicated by the transition arrow without a label) we stay in state 1. This means that we will keep on eating and discarding non alphanumeric characters, skipping over any whitespace or uninteresting punctuation.

If we are in state 2, we have something that might be a word, and we stay in this state so long as the next character is an alpha character or an apostrophe, which lets us build up things like “dog” and “dog’s”. If we ever see a digit, we go to state 3 and never return directly. If we see anything else, we are at the end of a word, and we go back to skipping whitespace and/or punctuation in state 1.

If we are in state 3, we stay in that state until something that is not a digit, an alpha character or an apostrophe (APP)  is seen. This eats things like “1234” or “R2D2” that we don’t want to spellcheck. When something that isn’t one of those comes along, we go back to state 1.

This machine does not deal with what happens when we reach end-of-file. It’s often convenient not to show such transitions, as they can take place from any state and only serve to muddle the diagram. It also does not show that we need to build up the word we are extracting, and to reset it when we are done.

Implementing The Machine

To implement the machine, you will need a NextChar function which extracts a character from the submission. This function should return the next extracted character, or a zero byte if you have reached the end of the submission. It’s convenient to make this function also handle reading lines and maintaining a line count. Here’s what the new parser.h header file for the Parser class needs to change to:

#ifndef INC_SCHECK_PARSER_H
#define INC_SCHECK_PARSER_H

#include <string>
#include <iostream>
#include <sstream>

class Parser {
    public:
        Parser( std::istream & is );
        std::string NextWord();
        unsigned int LineNo() const;
        std::string Context() const;
    private:
        char NextChar();
        bool ReadLine();
        enum State { stInSpace, stInWord, stInNum };
        State mState;
        std::istream & mSubmission;
        std::string mLine;
        unsigned int mLineNo, mPos;
};

#endif

Notice that the public interface to the class has not changed from that used in the previous instalment. This is important – public interfaces should be treated as immutable, so that users of your classes will not have to change your code if you change the implementation.

You should change the constructor to initialise the state variable:

Parser :: Parser( istream & is ) 
           : mSubmission( is ), mLineNo( 0 ), 
              mPos( 0 ), mState( stInSpace ) {
}

The code for getting characters from the submission, in parser.cpp, looks like this:

char Parser :: NextChar() { if ( mPos >= mLine.size() ) { if ( ! ReadLine() ) { return 0; }

} return mLine[ mPos++ ]; } bool Parser :: ReadLine() { if ( getline( mSubmission, mLine ) ) { mPos = 0; mLineNo++; mLine += ‘ ‘; return true; } else if ( mSubmission.eof() ) { return false; } else { throw ScheckError( “file read error” ); } }

This is less complicated than it looks – NextChar looks to see if there is a character in the current line. If there isn’t, it calls ReadLine to get the next line from the submission. If ReadLine fails, we are presumably at end-of-file, so we return the special zero-byte indicator. If it succeeds, or if there was one or more characters in the current line, that character is returned and the current position in the line is bumped.

ReadLine itself is not much more complicated. It uses getline to read  a line, resetting the current position to the line start. The only wrinkle here is that a single space character is always appended to the line end. There are two reasons for doing this – firstly so that NextChar never has to worry about dealing with an empty line, and secondly so that a word at the end of the line is properly separated from the first word on the next line. Adding extra characters to make life easy for yourself (the alternative is to add a lot of special end-of-line processing code) is a common programmer’s trick.

Now for the state machine itself. there are a number of ways of implementing these things. One generic way is to use a table of function pointers (or functors) which get called on the specific transitions. This is worth implementing if you have to implement several similar machines, but would be overkill in this case. Here’s the new code for NextWord:

string Parser :: NextWord() {
    string word;
    while( char c = NextChar() ) {
        switch( mState ) {
            case stInSpace:
                if ( std::isalpha( c ) ) {
                    word += c;
                    mState = stInWord;
                }
                else if ( std::isdigit( c ) ) {
                    mState = stInNum;
                }
                break;
            case stInWord:
                if (  std::isalpha( c ) || c == '\'' ) {
                    word += c;
                }
                else if ( std::isdigit( c ) ) {
                    mState = stInNum;
                }
                else {
                    mState = stInSpace;
                    return word;
                }
                break;
            case stInNum:
                if (  std::isalnum( c ) || c == '\'' ) {
                    word += c;
                }
                else {
                    mState = stInSpace;
                    word = "";
                }
                break;
            default:
                throw ScheckError( "bad state" );
        }
    }
    return word;
}

First thing to notice here is the use of classification functions from the C++ Standard Library. These are:

Function Description
isdigit(c) Returns true if c is a digit character from ‘0’ to ‘9’
isalpha(c) Returns true if c is a letter
isalnum(c) Returns true if isalpha(c) || isdigit(c)

There are a lot of other classification functions like these, and you should always prefer to use them over writing your own character testing code.

The code performs a switch on the current state we are in, which is represented by an enumeration value held in the mState variable. If the state is stInSpace (which is state 1 in the original diagram) the state will only change if the character being considered is a letter or  a digit. If it is a letter, you need to accumulate it in the word variable so it can later be returned.

If the state is stInWord (state 2 in the diagram) the machine remains in this state so long as the next character is a letter or an apostrophe. This means that things like “foo’bar’s” will be treated as a single word, but this seems to me to be acceptable, as they will be spotted later by the spellcheck. If the next character is a digit, the machine goes to the skipping numbers state. If the character is none of these – the end of a word has been reached and the word is returned, first setting the state for the next call of the function back to the first state..

If the state is stInNum (state 3), the machine remains in this state so long as something that is a letter, a digit or an apostrophe is seen. Note that the characters are not accumulated into a word, as the code never returns a word from this state.

Lastly, there is a check that we haven’t reached an unknown state. It is always worth having such a check, as state machines get changed very frequently during development, and it is easy to miss out a case for a particular state.

You should now be able to compile your code:

$ g++ -I inc src/main.cpp src/parser.cpp -o bin/scheck

You will need new test data to test the added functionality. I suggest editing data/sub1.txt so that it contains things like this:

fox-brown
dog-lazy
the "quick brown" fox
dog 42
dog42
dog,fox,brown
dog  ,   fox        brown
the dog's lazy

 

Hyphenation

One last thing you should deal with is hyphenation like this:

It was a great mis-

take at the time.

which should result in “mistake” being extracted. There are several ways of approaching this problem. I intend to mention only a couple, and leave the implementation to you as an exercise – why should I do all the work!

Firstly, you could extend the state machine with new states so that a hyphen followed by a space did not cause the code to to return but did not accumulate the hyphen and the space. This is probably the “cleanest” way of implementing things, but does make the state diagram more complicated.

Secondly, you could simply remember the hyphenated prefix across function calls in a string member variable, which gets tacked on to the front of a new word. This is somewhat more of a hack, and introduces more state into the class, but is probably easier to implement.

Thirdly – you could think of your own way of implementing it; I’m sure the above are not the only solutions to the problem. Either way, I’ll provide my own solution as part of the code for the next instalment (so you won’t be tempted to peek!)

Conclusion

This instalment has introduced the concept of a state machine, which is a pattern  you should consider using whenever you need to extract, recognise or validate strings of characters.

Coming next:  Multi-file programming and the use of make.

Sources for this and all other tutorials in the series available here.

Advertisements

From → c++, linux, tutorial, windows

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: