Most efficient way to do this exercise?

purbeast0

No Lifer
Sep 13, 2001
53,582
6,424
126
first I just want to tell people that i don't want the actual code on how to solve this. i'm simply looking at other ideas on HOW to solve this, as in, talking in plain english. it's a minor programming exercise i have to do for a potential job. i'll give my current idea (after thinking only like 5 minutes about it) after the instructions. here they are

Code:
You get a document and a lexicon (part of speech dictionary).

The lexicon contains a part-of-speech label for each word it defines.
There are many words that are not defined.  The file is named 'lexicon'.
 There is one definition per line, the format is "WORD<TAB>PartOfSpeech"

The task is to process the document, adding the part of speech
whenever it's defined.  No re-formatting should be done.

For example, if the lexicon is

THE	article
BROWN	adjective
BAG	noun

and the document is 

The small brown bag ripped.

Then your output should be 

The/article small brown/adjective bag/noun ripped.

this is going to be solved in java.

what i'm thinking is so read the 'lexicon' file into a hashmap just reading the file line by line, and probably strsplit using a tab as the delimeter between the two strings. then set the KEY and OBJECT being "THE" and "article" for the first line of the lexicon above.

then after that, simply parsing the 'document' file line by line, and a 'word' will simply be what exists between 2 space characters. then going word by word, seeing if the KEY exists in the hashmap, and if so, then getting the object and appending the "/<OBJECT>" to the word and output it. i know i didn't explain the exact formatting of the output, but that part isn't really a concern to me.

i will also have to special case when there is a period at the end of a sentence. for instance, if the 'document' was...

I hate the brown bag.

I believe the output should be

I hate the/article brown/adjective bag/noun.

but what i'm wondering, is if anyone has any other suggestions that may be more efficient and out of the box than this one. my solution just seems very straight foward. but again i've only thought about this for like a few minutes.
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
I think you're on the right track, it really is pretty simple.

Don't forget about all the different kinds of punctuation marks and different cases for the letters in the words.
 

purbeast0

No Lifer
Sep 13, 2001
53,582
6,424
126
I think you're on the right track, it really is pretty simple.

Don't forget about all the different kinds of punctuation marks and different cases for the letters in the words.

yea i'm thinking i can write a function that will check a regex against it. but at the same time, if it finds a period or comma in the middle of a number, i don't want it to think that is the end of the word. but it would pretty much be like any punctuation mark with a space character after it, which again, pretty sure i could write a regex to handle it.

but it did seem pretty simple/straightforward, which was making me think if i was just blind and overlooked something heh.

and now im wondering if something like 'simple/straightforward' would be considered 1 or 2 'words' in this scenario. they explicitly tell me that a 'word' is not defined, and to just tell them what i consider a word is in my solution. so i can pick, but i definitely don't want to simply do the easy way out.
 

KIAman

Diamond Member
Mar 7, 2001
3,342
23
81
Or, if you are good with regular expression, you can solve this in 1 line.

Edit: Ahh lexicon, disregard my post, probably more than 1 line, LOL.
 
Last edited:

nyker96

Diamond Member
Apr 19, 2005
5,630
2
81
funny I am also working a lexicon based software in java for my personal use. It actually parses a document, look up the words in a tab delimited dictionary, store the definition in the tooltip for the vocab, so when you hover over it, the definition is shown. I think it's kinda neat. My dictionay is exactly like yours:
word[tab]definition

I have pretty much done w/ this one now. One thing I find very helpful is this java lib class:
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html

It does great job extracting sentences and words for you, saves you much programming. Look up is using a Map<String,String>.

Oh another thing is, I also made a deconjugator that try to deconjugate words like 'walking' to 'walk' etc. first since my dictionary only contains original forms. You might not need to do that. but something to consider if you do have a dictionary that doesn't have all the conjugated forms. otherwise, you only gonna get definition on 'walk' but not 'walking' or 'walked' etc.
 

esun

Platinum Member
Nov 12, 2001
2,214
0
0
I wouldn't think about the period as a "special case". It's better to think about it like a white list: the word ends when you no longer see [a-zA-Z] (and perhaps hyphen and 0-9). If you wanted to handle periods or single quotes in words (e.g., "can't", or using a number in a sentence "12.34"), you could have a second class of characters for which you look ahead another character to see if it fits in your original set of characters.

In any case, I highly doubt they're looking for you to catch obscure edge cases. They're most likely looking to see if you understand the basics of writing Java code. I would just make sure you notate your assumptions.

For example, your solution works well assuming that the lexicon file fits in memory. What if, for some reason, the lexicon file is massive (or your memory is constrained)? Your time-efficient hashmap solution will fail, while a slow but memory-efficient implementation (that searches the lexicon each time) will be better.
 

purbeast0

No Lifer
Sep 13, 2001
53,582
6,424
126
so i started the exercise last night and got to the part i pick apart the document.

i forgot that i hadn't done regex since college (over 7 years ago) and trying to relearn it quickly was a lot more difficult than i expected syntax wise.

initially, i know i have to break the "words" apart based on white space. that is just the definition of words 101 in general heh, that spaces are between words.

now after that is where i'm having a little bit of trouble, because i need to account for situations such as the following:

word.
word!
word?
word;
"Word"
(Word)
"word word"
(word word)
(word is a test (word))

and on top of that, i have to also not change the formatting at all, and instead of outputting the following:

"word"/def

i want it to output:

"word/def"

in the instances of "word".

i actually had a solution for the case of "word" that outputted to "word/def" which worked, but it basically had me running a regex on "word" to get the word itself, then finding it in the lexicon, then then using a String.replace() on the "word" string and replacing word with word/def ... if that makes sense. but in a case like "word word" that would not work.

i was thinking about basically stripping a set of characters from the beginning and end of the string, so that if I get "word" i can just get word. and then if i had something like (word i could also just end up with word. and then use the String.replace() function on the original string, with the word/def string. man re-reading this i wonder if this even makes sense to anyone lol.

so does anyone have any suggestions on how I could do something this? ill have some more time to think about this later tonight but i'm kind of thinking about it in the back of my mind as well while at work right now.
 

tatteredpotato

Diamond Member
Jul 23, 2006
3,934
0
76
Probably the cleanest/most flexible approach would be to use a parser generator to pull out all the patterns you want (hand-written parsers are hard to write). Parser generators are cool and useful, but non-trivial to learn (especially if you have a tight deadline for this assignment).
 

atreader

Member
Apr 28, 2010
95
0
0
Search for 'symbol table', 'scanner', 'parser' and use the appropriate available java library.