HELP: Word Count program in C++

Qacer

Platinum Member
Apr 5, 2001
2,721
1
91
Hi all,

I'm trying to help out my friend. She is trying to write a word count program in C++. Basically, she is reading from a text file and then counting how many times certain words (the, an, a, of, etc.) are used. She is currently reading the whole text file into one big string.

I'm quite rusty with my C++, but I told her that she would need a WHILE loop with an EOF condition. What I don't quite see is the part where each word is parsed. I'm picturing that within the WHILE loop I need another WHILE loop with an EOL (end of line) condition, and inside it I would assign each line to one string. Then, using C++ string functions that I don't remember, I can use the space character as my delimiter and parse out the words. I would loop in the WHILE with the EOL condition, removing each word from my string as I go. If a word matches one in my word list, the counter assigned to it would be incremented.
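The nested-loop idea above might look roughly like this in C++. It is only a sketch: the helper name countTargets and the choice of std::getline plus std::istringstream are assumptions made for illustration, not something anyone in the thread settled on. The outer loop reads one line at a time until the stream runs out, and the inner loop splits that line on whitespace.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: counts how often each word in `targets` appears in `in`.
// Returns one counter per target, in the same order as `targets`.
std::vector<int> countTargets(std::istream& in, const std::vector<std::string>& targets)
{
    std::vector<int> counts(targets.size(), 0);
    std::string line;

    // Outer loop: std::getline fails once the input is exhausted,
    // which plays the role of the EOF condition.
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string word;

        // Inner loop: operator>> skips whitespace and yields one word at a time,
        // and fails at the end of the line.
        while (words >> word) {
            for (std::size_t i = 0; i < targets.size(); ++i) {
                if (word == targets[i]) {
                    ++counts[i];
                }
            }
        }
    }
    return counts;
}
```

You would pass it an open std::ifstream and the word list. Note that the comparison is case-sensitive and punctuation stays attached to the word, so "the," would not match "the" without extra cleanup.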

Is my logic right? Is there a way to optimize this? Someone told me to use a HASH table, but I'm not familiar with it. Is there an example program that I can view?

Thanks!
 

NeoGodzilla

Member
Nov 24, 2000
39
0
0
You can use an EOF controlled loop with a switch statement inside. Hope this helps you out...



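inFile >> wordFromFile;    // priming read

while (inFile)             // becomes false once a read fails (EOF or error)
{
    // in order to use toupper you need to #include <cctype>
    // note: this only checks the first two letters, so "and" or "this" are counted too
    switch (toupper(wordFromFile[0]))
    {
    case 'A':
        if (toupper(wordFromFile[1]) == 'N')
            wordCountAn++;
        break;             // break outside the if, otherwise a non-match falls through to case 'T'
    case 'T':
        if (toupper(wordFromFile[1]) == 'H')
            wordCountThe++;
        break;
    }

    inFile >> wordFromFile;
}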



EDIT: I just realized that posting this code took out all the whitespace so it might look confusing :p
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Been a long time since I've done programming, but here goes.
I'm assuming you have to count every instance of every word.
If so, then this is how I would do it.
Take the text file, and parse it into strings of, say, a maximum 60 characters including spaces. Just make sure to end the string at a space, period, comma, whatnot.
Take those strings and queue em up; FIFO, FILO, doesn't really matter.
Create a data tree using characters as your branches. I.E. top level has 26 entries for the first letter of the word which branches to the second level based on the second letter, and so on. Sort of like a branched list sort of thing.
Run a loop that'll pull each string in and shove the words into the tree so it's organized by spelling. The last letter becomes the data cell where you increment the counter for each word. This way, if you have the word 'every' and the word 'everyone', two cells would be at the 'y' and the last 'e' with counters for 'every' and 'everyone' respectively.
Once you've finished the queue, just use a recursive function to scan through the whole thing and output all the counters greater than 0 including the word spellings.
Or, if you need to input a word and have it output, it'll be easier since each branch is based off a letter in the word.
If this is a hash table, well, I never got that far in CS. This was my little pet idea that I wanted to do for my final project. Ended up doing something else.
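The letter-by-letter tree described above is usually called a trie (prefix tree) rather than a hash table. A minimal sketch of it, assuming only the letters a-z matter and case is ignored; the names TrieNode, insertWord and collect are made up for this example:

```cpp
#include <cctype>
#include <map>
#include <memory>
#include <string>

// One node per letter position, with 26 possible branches (a-z).
struct TrieNode {
    int count = 0;                          // times a word ending at this node was seen
    std::unique_ptr<TrieNode> child[26];    // null until that letter is used
};

// Walks (and extends) the tree one letter at a time, then increments the
// counter on the node for the last letter. Non-letters are skipped.
void insertWord(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    bool sawLetter = false;
    for (unsigned char ch : word) {
        if (!std::isalpha(ch)) continue;
        int idx = std::tolower(ch) - 'a';   // assumes an ASCII-compatible character set
        if (idx < 0 || idx >= 26) continue; // ignore letters outside a-z
        if (!node->child[idx]) node->child[idx].reset(new TrieNode());
        node = node->child[idx].get();
        sawLetter = true;
    }
    if (sawLetter) ++node->count;
}

// Recursive scan: records every spelling whose counter is greater than 0.
void collect(const TrieNode& node, std::string& prefix, std::map<std::string, int>& out)
{
    if (node.count > 0) out[prefix] = node.count;
    for (int i = 0; i < 26; ++i) {
        if (node.child[i]) {
            prefix.push_back(static_cast<char>('a' + i));
            collect(*node.child[i], prefix, out);
            prefix.pop_back();
        }
    }
}
```

With this layout 'every' and 'everyone' share the first five nodes, and each has its own counter at its last letter, as in the description above.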

If you already have the list of words you're looking for.. (note this is only a fragment: inFile is assumed to be an open input stream and word the word you're counting)

std::string istring;
int counter = 0;
while (inFile >> istring) {      // parse a word from the stream into istring; stops at EOF
    if (istring == word)
        counter++;
}
std::cout << counter << '\n';    // output counter

That's the simplest way to do it conceptually. I'm sure you know how to parse and how to compare two strings.
 

Turkey

Senior member
Jan 10, 2000
839
0
0
You just need one while loop:

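// std::unordered_map (in <unordered_map>, C++11 and later) is the standard hash table
std::unordered_map<std::string, int> words;
std::string nextWord;

// read from the same stream you test; the loop ends when a read fails (EOF or error)
while (file >> nextWord) {
    words[nextWord]++;    // operator[] creates a zero count the first time a word is seen
}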

Search Google for "hash table tutorial" to learn what a hashtable is.

You may have to make your own hashtable, where the [] operator hashes the string, then either returns or creates a value for the given key. Implementing a hash table can be somewhat tricky though... try just using the STL hash_set or hash_map first (these are non-standard extensions; the standardized equivalents, added in C++11, are std::unordered_set and std::unordered_map). If you use the STL hashtable, you can use iterators to iterate through the data structure and print the keys & values, or if you create your own you can create a .print() function that prints out the values and keys.
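For reference, here is a small sketch of the hash table approach using std::unordered_map. The function names countWordsHashed and formatCounts are hypothetical, and formatCounts is just a stand-in for the .print() idea mentioned above.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Reads whitespace-separated words from `in` and counts each distinct word.
std::unordered_map<std::string, int> countWordsHashed(std::istream& in)
{
    std::unordered_map<std::string, int> words;
    std::string nextWord;
    while (in >> nextWord) {    // stops when extraction fails (EOF or error)
        ++words[nextWord];      // operator[] inserts a 0 for an unseen word
    }
    return words;
}

// Iterates over the table and formats every key with its count.
// The order of the lines is unspecified for a hash table.
std::string formatCounts(const std::unordered_map<std::string, int>& words)
{
    std::ostringstream out;
    for (auto it = words.begin(); it != words.end(); ++it) {
        out << it->first << ": " << it->second << '\n';
    }
    return out.str();
}
```

Since the table is unordered, sort the keys yourself (or use a sorted container) if you need the output in alphabetical order.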
 

Qacer

Platinum Member
Apr 5, 2001
2,721
1
91
Thanks for the tips! :)

Believe it or not, I just found out that there is a function called fgets() (standard C, so it is available in Visual C++) that reads a whole line at a time, which takes care of the line-reading part of this. Heh! Tsk.. tsk..

 

imgod2u

Senior member
Sep 16, 2000
993
0
0
I'm not very familiar with C++, but is there something similar to the StringTokenizer in the Java API? If so, building an array of all the words should be quite easy; then just use a while loop to check them all.
 

jasonroehm

Member
Dec 1, 2001
97
0
0
There is nothing directly comparable to the StringTokenizer class in Java, but by default, streams use whitespace as the delimiter when you read with the stream extraction (>>) operator. That's why Turkey's code would work for hashing each word: reading word by word effectively tokenizes the input.
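To get something close to StringTokenizer, you can wrap a string in a std::istringstream and pull the words out with >>. A rough sketch (tokenize is a made-up name, not a standard function):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits `text` on runs of whitespace (spaces, tabs, newlines) and
// returns the pieces in order. Leading and trailing whitespace is ignored.
std::vector<std::string> tokenize(const std::string& text)
{
    std::istringstream stream(text);
    std::vector<std::string> tokens;
    std::string token;
    while (stream >> token) {    // each extraction yields the next word
        tokens.push_back(token);
    }
    return tokens;
}
```

This only splits on whitespace; for other delimiters such as commas you would need something like std::getline with a delimiter argument instead.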
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
What you actually want to use is an STL map.

std::map<std::string,int> strCount;
std::string nextWord;

// Test the extraction itself: looping on !infile.eof() can process the last word twice.
while(infile >> nextWord)
{
    strCount[nextWord] = strCount[nextWord] + 1; // operator[] value-initializes a missing entry to 0, so this works even when there is no value in the map yet.
}

The actual data structure behind the map in VC++ is a tree, and what is really cool is that when you iterate through it, the data is naturally sorted (in this case using operator<(std::string,std::string)).

Loop through like this to see the string counts in alphabetical order:

for(std::map<std::string,int>::iterator mapIter=strCount.begin();mapIter != strCount.end();++mapIter)
{
    std::cout << mapIter->first << " occurred " << mapIter->second << " times\n";
}
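Putting the two pieces together, a self-contained sketch could look like the following. The function names countWordsSorted and report are assumptions for this example; the report is built as a string here so it can be sent to std::cout or anywhere else.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Counts whitespace-separated words; std::map keeps the keys sorted.
std::map<std::string, int> countWordsSorted(std::istream& in)
{
    std::map<std::string, int> strCount;
    std::string nextWord;
    while (in >> nextWord) {     // ends cleanly at EOF, no extra pass
        ++strCount[nextWord];
    }
    return strCount;
}

// Walks the map in key order and writes one line per word.
std::string report(const std::map<std::string, int>& strCount)
{
    std::ostringstream out;
    for (std::map<std::string, int>::const_iterator it = strCount.begin();
         it != strCount.end(); ++it) {
        out << it->first << " occurred " << it->second << " times\n";
    }
    return out.str();
}
```

The ordering comes from comparing the strings character by character, so with ASCII text capitalized words sort ahead of lowercase ones unless you normalize the case first.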