Generate safe filenames from Unicode strings

Leros

Lifer
Jul 11, 2004
21,867
7
81
I have a bunch of names in every language possible. I want to use these names to generate filenames that can be downloaded by any browser on any OS by somebody in any country. I need to make sure these filenames are going to be valid on their system. The filenames need to contain the name.

Up until now, I've only had to deal with English-speaking countries, so I've just stripped all non-alphanumeric characters and replaced spaces with underscores. I got results like this:
"John Smith" -> "John_Smith.ext"
"John O'Henry" -> "John_OHenry.ext"
"John van Smith III" -> "John_van_Smith_III.ext"
This has been good enough...
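
For reference, the stripping described above can be sketched roughly like this. It's a minimal reconstruction, not the actual code: the function name `ascii_filename` and the `ext` parameter are made up for illustration.

```python
import re

def ascii_filename(name, ext="ext"):
    """Hypothetical helper: ASCII-only filename, as in the examples above."""
    # Collapse each run of whitespace into a single underscore.
    joined = re.sub(r"\s+", "_", name.strip())
    # Drop everything except ASCII letters, digits, and underscores.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", joined)
    return f"{cleaned}.{ext}"

print(ascii_filename("John O'Henry"))
print(ascii_filename("高岡和子"))
```

The second call shows the problem: every character of a Japanese name is discarded and only the extension is left.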

Now I need to do the same thing for all countries. Stripping non-alphanumeric characters doesn't work for obvious reasons (e.g. Japanese names).

The best idea I've got so far is to find some sort of global list of characters that are not valid in filenames on any OS. Strip all of these characters. This is problematic because my list might be incomplete.

Any suggestions for how to do this?
 

Aluvus

Platinum Member
Apr 27, 2006
2,913
1
0
Leros said: "The best idea I've got so far is to find some sort of global list of characters that are not valid in filenames on any OS. Strip all of these characters. This is problematic because my list might be incomplete."

If your intent is to include, say, DOS, then a blacklist is not a good solution. DOS allows a very limited set of characters in filenames. And Unicode is an enormous set of characters, getting bigger all the time. If you were to go that route, what you really need is a whitelist of characters allowed by a very restrictive OS like DOS, from which you may need to remove even more characters.

If you want to do this for every language practical and want the filenames to be at least vaguely intelligible, I would suggest you look into accent folding and transliteration. Accent folding lets you work around the diacritic marks common in some Western languages. Transliteration lets you work around the completely different character sets used by some languages.
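
As a rough illustration of accent folding, here is a minimal sketch using Python's standard `unicodedata` module. The name `fold_accents` is hypothetical, and this is only one way to do it: decompose each character, then drop the combining marks.

```python
import unicodedata

def fold_accents(text):
    """Hypothetical helper: strip diacritics by decomposing and dropping marks."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns a non-zero class for combining marks; keep the rest.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("José Müller"))  # → Jose Muller
```

Letters that have no decomposition, such as "Ø", pass through unchanged, and so do kanji. Those cases are where a mapping table or transliteration would have to take over.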

Depending on how robust you want your solution to be, this kind of thing can be really, really painful.
 

Leros

These files will be downloaded by people from their browser, so I feel comfortable assuming they're at least on Windows XP and whatever the equivalents are for Mac and Linux.

I'm actually not too concerned about Western languages. It's pretty easy to use something like the accent folding that you mentioned and then have a whitelist of valid characters. It's the non-Western languages that I'm not sure how to handle.

The name could be in Japanese (e.g. "高岡和子") or Arabic (e.g. "محمد سعيد بن عبد العزيز الفلسطيني"). Of course these names could potentially contain invalid characters ("高?岡和\\子*" or "محمد /سعيد بن عبد ?العزيز :الفلسطيني\\") that need to be stripped.

I really don't like the idea of transliteration. If the person's name is "高?岡和\\子*", then the filename should be something like "高岡和子.ext".

It seems like the only way to do this is to have a comprehensive blacklist of characters.
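
A blacklist along those lines might look something like the sketch below. The assumptions: Windows is the most restrictive target, so the list is its reserved punctuation plus ASCII control characters, along with the Windows reserved device names. The names `safe_filename`, `FORBIDDEN`, `RESERVED`, and the `fallback` parameter are all invented for this example, and the list may well be incomplete, which is exactly the worry raised earlier.

```python
import re
import unicodedata

# Punctuation Windows rejects in filenames, plus ASCII control characters.
# "/" and NUL are also the two characters Linux and macOS reject.
FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f\x7f]')

# Device names Windows reserves, with or without an extension.
RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{dev}{n}" for dev in ("COM", "LPT") for n in range(1, 10)
}

def safe_filename(name, ext="ext", fallback="unnamed"):
    """Hypothetical helper: keep Unicode letters, drop blacklisted characters."""
    # Normalize so the same name always yields the same code points.
    text = unicodedata.normalize("NFC", name)
    text = FORBIDDEN.sub("", text)
    # Collapse whitespace left behind by removals; trim spaces and dots,
    # which Windows does not allow at the end of a name.
    text = re.sub(r"\s+", " ", text).strip(" .")
    if not text:
        text = fallback  # nothing usable survived the stripping
    elif text.upper() in RESERVED:
        text = "_" + text
    return f"{text}.{ext}"

print(safe_filename("高?岡和\\子*"))  # → 高岡和子.ext
```

This does not handle length limits (many filesystems cap a name at around 255 bytes or code units) or invisible formatting characters, so it's a starting point rather than a complete answer.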
 