How can I determine what LANGUAGE an HTML page is written in?

Superwormy

Golden Member
Feb 7, 2001
1,637
0
0
I want to write a computer program which determines what language an HTML page is written in.

If that's too hard, I'd just like to write a program which excludes any page that's not written in English.

Is there a QUICK way to do this? I see stuff about charset= in HTML, and Content-Language... I'll be using PHP / Perl if that helps... I just need somewhere to start, because right now I have no idea...

Anyone?
 

notfred

Lifer
Feb 12, 2001
38,241
4
0
You're going to have to parse the page for <html lang=xx> and see if you can figure out the language that way. If the author of the page left that attribute out, you're going to have to look at the HTTP Content-Language header and see if that's set. If neither of those is set, you make your best guess.
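
A rough sketch of that order of checks, in Python only for illustration (the same regex-and-fallback logic should port to PHP or Perl without much trouble). The function names declared_language and is_declared_english are made up for this example, and it assumes you already have the page body as a string and the response headers as a name/value mapping. It only reads what the page declares about itself; when nothing is declared it returns None and the "best guess" part is left to you.

```python
import re
from typing import Mapping, Optional

# Matches a lang attribute inside the opening <html> tag, quoted or unquoted.
# [^>]*? keeps the search inside that one tag.
HTML_LANG_RE = re.compile(
    r"<html\b[^>]*?\slang\s*=\s*[\"']?([A-Za-z0-9-]+)",
    re.IGNORECASE,
)


def declared_language(html: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return the language the page declares, lowercased, or None.

    Hypothetical helper: checks <html lang=...> first, then the HTTP
    Content-Language header, as described in the post above.
    """
    match = HTML_LANG_RE.search(html)
    if match:
        return match.group(1).lower()

    # Header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            # The header may list several languages; take the first one.
            return value.split(",")[0].strip().lower()

    # Nothing declared: the caller has to make its own best guess.
    return None


def is_declared_english(html: str, headers: Mapping[str, str]) -> bool:
    """True only if the declared language is English (en, en-US, en-GB, ...)."""
    lang = declared_language(html, headers)
    return lang is not None and lang.split("-")[0] == "en"
```

To exclude non-English pages you would call is_declared_english(body, headers) on each page and decide separately what to do with pages that declare nothing at all, since the check returns False for those as well.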