How Could I Randomize A Dictionary?


Red Squirrel

No Lifer
May 24, 2003
70,040
13,502
126
www.anyf.ca
So I sorta got this working, but my work computer does not have enough RAM to compile it with the entire list, so I have not tested it fully yet.

randword.cpp:
Code:
#include <iostream>
#include <string>
#include <vector>
#include <time.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

using namespace std;

vector<string> m_words;


int main()
{
    //add words from file generated by genwordlist:
    #include "words.cpp"

    //ex entry:  m_words.push_back("word");


    //pick random word:

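    int processid = getpid(); //differs between runs, so two runs in the same second still get different seeds.

    srand (time(NULL)+processid); //seed random number generator

    int size = m_words.size();  //get number of words in the vector

    if (size == 0) return 1;    //nothing to pick from; also avoids a modulo by zero

    int randnum = rand()%size;  //generate the index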



    //display random word:

    cout<<m_words[randnum]<<endl;

    return 0;
}


genwordlist.cpp:
Code:
#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>   //remove()


using namespace std;


void ProcessWordFile(string filename)
{
    ofstream wordlistout;
    ifstream wordlistin;
    string tmpline="";
    string tmpline_clean="";

    wordlistout.open("words.cpp",ios::app);
    wordlistin.open(filename);

    while (getline(wordlistin,tmpline))
    {

        //Sanitize pass 1:

        tmpline_clean="";
        for(size_t i=0;i<tmpline.length();i++)
        {
           //getline already strips the '\n'; a '\r' left by Windows line endings can remain, so remove both to cover our bases:
           if(tmpline[i]=='\r' || tmpline[i]=='\n')
           {
               continue;
           }

           //Remove invalid chars that cause problems with compile:
           if(tmpline[i]=='\\' || tmpline[i]=='"')
           {
               continue;
           }


           //append char to a new string:
           tmpline_clean.push_back(tmpline[i]);
        }




        //Sanitize pass 2:

        tmpline=tmpline_clean;
        tmpline_clean="";
        for(size_t i=0;i<tmpline.length();i++)
        {

           //the list seems to have a white space after each word, this gets rid of it:
           //reasoning to do this as a 2nd pass is so it's easier to know if space is at the end.
           if(tmpline[i]==0x20 && i==(tmpline.length()-1))
           {
               continue;
           }

           //append char to a new string:
           tmpline_clean.push_back(tmpline[i]);
        }




        wordlistout<<"m_words.push_back(\""<<tmpline_clean<<"\");\r\n";


    }


    wordlistout.close();
    wordlistin.close();
}


int main()
{
    remove("words.cpp"); //delete word list cpp file


    ProcessWordFile("csv/Aword.csv");
    ProcessWordFile("csv/Bword.csv");
    ProcessWordFile("csv/Cword.csv");
    ProcessWordFile("csv/Dword.csv");
    //ProcessWordFile("csv/Eword.csv");
    //ProcessWordFile("csv/Fword.csv");
    //ProcessWordFile("csv/Gword.csv");
    //ProcessWordFile("csv/Hword.csv");
    //ProcessWordFile("csv/Iword.csv");
    //ProcessWordFile("csv/Jword.csv");
    //ProcessWordFile("csv/Kword.csv");
    //ProcessWordFile("csv/Lword.csv");
    //ProcessWordFile("csv/Mword.csv");
    //ProcessWordFile("csv/Nword.csv");
    //ProcessWordFile("csv/Oword.csv");
    //ProcessWordFile("csv/Pword.csv");
    //ProcessWordFile("csv/Qword.csv");
    //ProcessWordFile("csv/Rword.csv");
    //ProcessWordFile("csv/Sword.csv");
    //ProcessWordFile("csv/Tword.csv");
    //ProcessWordFile("csv/Uword.csv");
    //ProcessWordFile("csv/Vword.csv");
    //ProcessWordFile("csv/Wword.csv");
    //ProcessWordFile("csv/Xword.csv");
    //ProcessWordFile("csv/Yword.csv");
    //ProcessWordFile("csv/Zword.csv");


    return 0;
}

Essentially there are two parts to this. First there is the genwordlist.cpp program, which generates the code that contains the list of words. It's basically just a bunch of entries that append the word to a global vector. This is included separately as words.cpp.

Once that file is generated, you compile randword.cpp, which creates a binary that works standalone.

The word list I used is the words-only "csv" that was linked here. Turns out it's not really a CSV, just a list of words one per line, but that made it easier to parse. There is a lot of garbage and duplicates though, and I didn't do much to deal with that except remove characters that broke compilation. I only tested up to D, so I'm not sure if there is other weird stuff in the other files that I didn't account for.

There is probably a much better way to do this and trying to pack the data into the executable may not be the best approach or at least not the way I did it, but I was curious to try it anyway. It works, but you just need a computer built in this millennium to compile it on.

Though considering the whole thing is only about 50k lines of very simple code (only going up to D in the list), I am not sure why it's this hard on the RAM to compile. I've compiled much larger projects before, some around 500k lines of code, in a VM with about the same amount of RAM. Maybe it has to do with compiling a single file of that size.

The resulting binary is also kinda big; compiling with -O2 seems to yield a smaller one, 998k for the A list only.

I spent a bit more time than I care to admit on this silly thing, but it passed some time at work and let me brush up on C++ a little. Been a while. I could have used Python or PHP or something, but I was curious to try a compiled language.
 
Jul 27, 2020
25,444
17,644
146
@Red Squirrel

How much RAM does your system have?

Have you tried viewing the compiler's RAM consumption increasing in Task Manager before it says it has run out of RAM?

What is the exact error message you are getting?

Maybe the data structure you are using, was not intended to be packed with this much data by the compiler writers?
 

Red Squirrel

Only 4GB. This is an old corporate laptop that they never picked up and probably never will, so I use it as a surfing machine and actually put Linux on it, since it had Windows XP before. I was watching the memory % climb in top and eventually it would fail. The compile also takes a good 10 minutes before failing lol.

g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

It is odd that it was using that much ram to compile though. I wonder what would happen if I broke down the file into smaller chunks. Maybe the compiler works on one file at a time while loading it in ram and doing all the processing in ram, then cleans itself when it moves to another file. I might actually play with that out of curiosity.
 

Captante

Lifer
Oct 20, 2003
30,340
10,859
136
Any way to up it to 8GB?

In Linux even a semi-current web browser will eat up 4GB quick unless you really keep the open tabs to a bare minimum (like 3-4 max).

If that's not an option, giving the compiler less to load into RAM by breaking up the file MIGHT fix the issue... depends on the exact cause.
 

Red Squirrel

I had to close Firefox so I could compile lmao. This is a work owned machine, but I'm home now so going to try it on my own machine.

If I try to open Facebook and Youtube at the same time on that machine, the fan just revs up to max speed, the cursor starts to be very slow to move and eventually the whole machine locks up. If I'm lucky I can get into a terminal and do a killall firefox but usually it's futile.

I actually wonder if a Raspberry Pi 4 would do better lmao.
 

IronWing

No Lifer
Jul 20, 2001
72,276
32,743
136
Yep, it's a 1913 dictionary. :)

Babylonish (n.) Pertaining to Rome and papal power.



 

Red Squirrel

I copied the wrong file over so don't have access to it here. Will have to figure out why it didn't copy properly tomorrow. But it's a week day so probably won't have time to play with it.
 
Jul 27, 2020
25,444
17,644
146
"If I'm lucky I can get into a terminal and do a killall firefox but usually it's futile."
The issue may be that Firefox on Linux isn't very optimized. Or the cross platform code they are writing is somehow prioritizing Windows performance over the performance in alternative OSes.

Try Opera.
 

Red Squirrel

Firefox is ok on all my other machines. It's just that particular machine is bloody crappy. :p

Also I'm a noob: I posted the source here, so I don't need the files that I tried to copy (although they included a few helper scripts for cleanup etc.).

Compiling the full list right now. Already up to 8gig used by compiler lol. It might actually fail on this machine too at this pace.
 

Red Squirrel

Yep, it failed lol. So ya, I think my technique of building the data in is not liked by the compiler. Got an early shift tomorrow so time for bed, but I am still curious to see how I can make this work. There must be something odd about the way I'm doing it that is causing the RAM to balloon like that, since I've compiled much larger programs on VMs with less RAM. The final generated file does come up to 176k lines of code though.

Edit: I had to give it one more go before bed, with -O2 optimization. Been running for like 20+ minutes and it's sitting at 3.3g of ram used without climbing so it might actually work just take a long time. Will let it go overnight and report back tomorrow.

I actually need to redo this and use a const string[] instead of a vector; that would make more sense given it's static data.
 
Jul 27, 2020
25,444
17,644
146
Guess you need to thank Gizmo for this interesting adventure :)

You have a Github repo, right? Care to share the URL?
 

Red Squirrel

I actually don't. I tend to just store stuff locally and never bothered to make one. So many projects, and not a single one is complete enough to be worth releasing. :p And yeah, I'm still up. I wanted to try the string array idea and just got it working. Compiles in about 10-15 seconds.


New code:



randword.cpp:
Code:
#include <iostream>
#include <string>
#include <time.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

using namespace std;


int main()
{
    //add words from file generated by genwordlist:
    #include "wordarray.cpp"

  

    //pick random word:

    int processid = getpid(); //this should be a diff number each time it's ran.

    srand (time(NULL)+processid); //seed random number generator

    unsigned int randnum = rand()%word_array_size;  //generate the number



    //display random word:

    cout<<word_array[randnum]<<endl;

    return 0;
}

genwordlist.cpp:
Code:
#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>   //remove()
#include <cstdlib>  //system()


using namespace std;


unsigned int ProcessWordFile(string filename, unsigned int arrayindex)
{
    ofstream wordlistout;
    ifstream wordlistin;
    string tmpline="";
    string tmpline_clean="";

    wordlistout.open("wordarray.tmp",ios::app);
    wordlistin.open(filename);

    while (getline(wordlistin,tmpline))
    {

        //Sanitize pass 1:

        tmpline_clean="";
        for(size_t i=0;i<tmpline.length();i++)
        {
           //getline already strips the '\n'; a '\r' left by Windows line endings can remain, so remove both to cover our bases:
           if(tmpline[i]=='\r' || tmpline[i]=='\n')
           {
               continue;
           }

           //Remove invalid chars that cause problems with compile:
           if(tmpline[i]=='\\' || tmpline[i]=='"')
           {
               continue;
           }


           //append char to a new string:
           tmpline_clean.push_back(tmpline[i]);
        }




        //Sanitize pass 2:

        tmpline=tmpline_clean;
        tmpline_clean="";
        for(size_t i=0;i<tmpline.length();i++)
        {

           //the list seems to have a white space after each word, this gets rid of it:
           //reasoning to do this as a 2nd pass is so it's easier to know if space is at the end.
           if(tmpline[i]==0x20 && i==(tmpline.length()-1))
           {
               continue;
           }

           //append char to a new string:
           tmpline_clean.push_back(tmpline[i]);
        }




        wordlistout<<"word_array["<<arrayindex<<"]=\""<<tmpline_clean<<"\";\r\n";
        arrayindex++;


    }


    wordlistout.close();
    wordlistin.close();
   
    return arrayindex;
}


int main()
{
    //delete files to make sure we're starting fresh:
   
    remove("wordarray.tmp");
    remove("wordarray.cpp");
   
    //run through each file list, keeping track of array index so we know how big to make array:

    unsigned int arrayindex=0;

    arrayindex=ProcessWordFile("csv/Aword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Bword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Cword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Dword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Eword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Fword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Gword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Hword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Iword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Jword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Kword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Lword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Mword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Nword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Oword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Pword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Qword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Rword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Sword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Tword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Uword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Vword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Wword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Xword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Yword.csv",arrayindex);
    arrayindex=ProcessWordFile("csv/Zword.csv",arrayindex);
   
   
   
    //combine word list with array declaration and in right order:
   
    ofstream wordlist_final;
    wordlist_final.open("wordarray.cpp",ios::out);   
    wordlist_final<<"string word_array["<<arrayindex<<"];\r\n";   
    wordlist_final<<"const int word_array_size="<<arrayindex<<";\r\n\r\n";   
    wordlist_final.close();
   
    system("cat wordarray.tmp >> wordarray.cpp");
   
    remove("wordarray.tmp");
   
   

    return 0;
}


The binary is 5.8MB without any optimization. Compiling with -O2 now to see how that changes it. That seems to be taking longer, so I'm going to let it do its thing and head to bed for real this time. :p
 

Red Squirrel

So this is an interesting one. Remember how I let it compile with the -O2 option overnight because it was taking long? Well, it looks like I managed to make the compiler segfault lol. Just got home to this:

Code:
$ g++ -O2 randword.cpp -o randword2
g++: internal compiler error: Segmentation fault signal terminated program cc1plus
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-9/README.Bugs> for instructions.

It works fine without any optimization though, so I consider this good enough for this experiment. The last code I posted works with the full dictionary, and you end up with a single standalone binary that spits out a random word when it's run. It just can't be compiled with -O2 optimization.


Ex output: (executing it repeatedly manually)

Code:
$./randword
zyme
$./randword
setulose
$./randword
devow
$./randword
say
$./randword
fatted
$./randword
insulous
$./randword
wheatstone''s bridge
$./randword
watchdog
$./randword
war
$./randword
fit
$./randword
enorthotrope
$./randword
delayment
$./randword
claiming
$./randword
levee
$./randword
marmot
$./randword
corrival
$./randword
steer
$./randword
unget
$./randword
spleened
$./randword
hopping
$./randword
mockbird
$./randword
xanthopous
$./randword
lignite
$./randword
appair
$./randword
emulable
$./randword
retraction
$./randword
sinewed
$./randword
suffragating
$./randword
sesquialter
$./randword
root
$./randword
trough
$./randword
rostrum
$./randword
timocratic
$./randword
vulva
$./randword
muff
$./randword
manbote
$./randword
rowelling
$./randword
sublime
$./randword
selenography
$./randword
imprint
$./randword
rectoress
$./randword
run
$./randword
conjoint
$./randword
shortage
$./randword
embattled
$./randword
tasteless
$
 

Red Squirrel

What I'm trying to do is kinda unconventional, but I am curious to find out what exactly is causing it to crash and whether it can be reproduced at a smaller scale, since I may have found a bug. Maybe something to play with when I'm on night shifts. :p
 

Gizmo j

Golden Member
Nov 9, 2013
1,447
400
136
I was thinking I could pay someone $50 per letter in the alphabet to attach a number to every word and definition so I could just randomize the numbers and thus randomizing the dictionary.
 

nakedfrog

No Lifer
Apr 3, 2001
61,651
17,296
136
"I was thinking I could pay someone $50 per letter in the alphabet to attach a number to every word and definition so I could just randomize the numbers and thus randomizing the dictionary."
This sounds like a job for Fiverr.
 

lxskllr

No Lifer
Nov 30, 2004
59,303
9,813
126
Sounds like a job for a computer. If I were gonna come up with dumb ways to do this dumb project, that would be high on the list. I feel if I wait patiently though, I'll hear some dumber ideas.