Most efficient way to do this exercise?

purbeast0 · Apr 3, 2012

first I just want to tell people that i don't want the actual code on how to solve this. i'm simply looking at other ideas on HOW to solve this, as in, talking in plain english. it's a minor programming exercise i have to do for a potential job. i'll give my current idea (after thinking only like 5 minutes about it) after the instructions. here they are

Code:

You get a document and a lexicon (part of speech dictionary).

The lexicon contains a part-of-speech label for each word it defines.
There are many words that are not defined.  The file is named 'lexicon'.
 There is one definition per line, the format is "WORD<TAB>PartOfSpeech"

The task is to process the document, adding the part of speech
whenever it's defined.  No re-formatting should be done.

For example, if the lexicon is

THE	article
BROWN	adjective
BAG	noun

and the document is 

The small brown bag ripped.

Then your output should be 

The/article small brown/adjective bag/noun ripped.

this is going to be solved in java.

what i'm thinking is so read the 'lexicon' file into a hashmap just reading the file line by line, and probably strsplit using a tab as the delimeter between the two strings. then set the KEY and OBJECT being "THE" and "article" for the first line of the lexicon above.

then after that, simply parsing the 'document' file line by line, and a 'word' will simply be what exists between 2 space characters. then going word by word, seeing if the KEY exists in the hashmap, and if so, then getting the object and appending the "/<OBJECT>" to the word and output it. i know i didn't explain the exact formatting of the output, but that part isn't really a concern to me.

i will also have to special case when there is a period at the end of a sentence. for instance, if the 'document' was...

I hate the brown bag.

I believe the output should be

I hate the/article brown/adjective bag/noun.

but what i'm wondering, is if anyone has any other suggestions that may be more efficient and out of the box than this one. my solution just seems very straight foward. but again i've only thought about this for like a few minutes.

Crusty · Apr 3, 2012

I think you're on the right track, it really is pretty simple.

Don't forget about all the different kinds of punctuation marks and different cases for the letters in the words.

purbeast0 · Apr 3, 2012

Crusty said:
I think you're on the right track, it really is pretty simple.

Don't forget about all the different kinds of punctuation marks and different cases for the letters in the words.

yea i'm thinking i can write a function that will check a regex against it. but at the same time, if it finds a period or comma in the middle of a number, i don't want it to think that is the end of the word. but it would pretty much be like any punctuation mark with a space character after it, which again, pretty sure i could write a regex to handle it.

but it did seem pretty simple/straightforward, which was making me think if i was just blind and overlooked something heh.

and now im wondering if something like 'simple/straightforward' would be considered 1 or 2 'words' in this scenario. they explicitly tell me that a 'word' is not defined, and to just tell them what i consider a word is in my solution. so i can pick, but i definitely don't want to simply do the easy way out.

KIAman · Apr 3, 2012

Or, if you are good with regular expression, you can solve this in 1 line.

Edit: Ahh lexicon, disregard my post, probably more than 1 line, LOL.

nyker96 · Apr 3, 2012

funny I am also working a lexicon based software in java for my personal use. It actually parses a document, look up the words in a tab delimited dictionary, store the definition in the tooltip for the vocab, so when you hover over it, the definition is shown. I think it's kinda neat. My dictionay is exactly like yours:
word[tab]definition

I have pretty much done w/ this one now. One thing I find very helpful is this java lib class:
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html

It does great job extracting sentences and words for you, saves you much programming. Look up is using a Map<String,String>.

Oh another thing is, I also made a deconjugator that try to deconjugate words like 'walking' to 'walk' etc. first since my dictionary only contains original forms. You might not need to do that. but something to consider if you do have a dictionary that doesn't have all the conjugated forms. otherwise, you only gonna get definition on 'walk' but not 'walking' or 'walked' etc.

esun · Apr 4, 2012

I wouldn't think about the period as a "special case". It's better to think about it like a white list: the word ends when you no longer see [a-zA-Z] (and perhaps hyphen and 0-9). If you wanted to handle periods or single quotes in words (e.g., "can't", or using a number in a sentence "12.34"), you could have a second class of characters for which you look ahead another character to see if it fits in your original set of characters.

In any case, I highly doubt they're looking for you to catch obscure edge cases. They're most likely looking to see if you understand the basics of writing Java code. I would just make sure you notate your assumptions.

For example, your solution works well assuming that the lexicon file fits in memory. What if, for some reason, the lexicon file is massive (or your memory is constrained)? Your time-efficient hashmap solution will fail, while a slow but memory-efficient implementation (that searches the lexicon each time) will be better.

purbeast0 · Apr 4, 2012

so i started the exercise last night and got to the part i pick apart the document.

i forgot that i hadn't done regex since college (over 7 years ago) and trying to relearn it quickly was a lot more difficult than i expected syntax wise.

initially, i know i have to break the "words" apart based on white space. that is just the definition of words 101 in general heh, that spaces are between words.

now after that is where i'm having a little bit of trouble, because i need to account for situations such as the following:

word.
word!
word?
word;
"Word"
(Word)
"word word"
(word word)
(word is a test (word))

and on top of that, i have to also not change the formatting at all, and instead of outputting the following:

"word"/def

i want it to output:

"word/def"

in the instances of "word".

i actually had a solution for the case of "word" that outputted to "word/def" which worked, but it basically had me running a regex on "word" to get the word itself, then finding it in the lexicon, then then using a String.replace() on the "word" string and replacing word with word/def ... if that makes sense. but in a case like "word word" that would not work.

i was thinking about basically stripping a set of characters from the beginning and end of the string, so that if I get "word" i can just get word. and then if i had something like (word i could also just end up with word. and then use the String.replace() function on the original string, with the word/def string. man re-reading this i wonder if this even makes sense to anyone lol.

so does anyone have any suggestions on how I could do something this? ill have some more time to think about this later tonight but i'm kind of thinking about it in the back of my mind as well while at work right now.

tatteredpotato · Apr 6, 2012

Probably the cleanest/most flexible approach would be to use a parser generator to pull out all the patterns you want (hand-written parsers are hard to write). Parser generators are cool and useful, but non-trivial to learn (especially if you have a tight deadline for this assignment).

atreader · Apr 6, 2012

Search for 'symbol table', 'scanner', 'parser' and use the appropriate available java library.

Search

Most efficient way to do this exercise?

purbeast0

No Lifer

Crusty

Lifer

purbeast0

No Lifer

KIAman

Diamond Member

nyker96

Diamond Member

esun

Platinum Member

purbeast0

No Lifer

tatteredpotato

Diamond Member

atreader

Member

TRENDING THREADS