On the usage of URL Shorteners/Base36/Base62 and bad words

Train

Lifer
Jun 22, 2000
13,583
80
91
I'm building an app. To keep URLs short, I am considering using a notation other than decimal to represent record IDs in the URL. A lot of sites already do this (Imgur comes to mind), but I always wondered: how do they prevent myDomain.com/f*ck or /cu*t, etc.?

Let's explore some options:
Hexadecimal:
FFFF = 65,535 saving 1 char,
FFFFF = 1,048,575 saving 2 chars
Worst possible words to show up in a url: BED, FED, DED?
* No case sensitivity required

Base36:
zzz = 46,655 saving 2 chars
zzzz = 1,679,615 saving 3 chars
Worst possible words in 3 chars: sex, i love you, dik, with variations like f4g, s3x, d1k
Four letter possibilities: a lot of bad ones, with even more variations.
* No case sensitivity required

Base62:
ZZZ = 238,327 saving 3 chars
ZZZZ = 14,776,335 saving 4 chars
Worst possible words: all the same as Base36
* Case sensitivity required
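
For anyone who wants to check the numbers above, here is a minimal sketch of a generic base-N encoder. The names `encode` and `largest` and the digit ordering (0-9, then a-z, then A-Z for Base62) are assumptions for illustration, not something any particular site is known to use.

```python
import string

HEX = string.digits + "ABCDEF"
BASE36 = string.digits + string.ascii_lowercase
# Assumed ordering: digits, lowercase, uppercase, so "Z" is the highest digit.
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode(n: int, alphabet: str) -> str:
    """Write the non-negative integer n using the given digit alphabet."""
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(alphabet)
    digits = []
    while True:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
        if n == 0:
            break
    return "".join(reversed(digits))


def largest(width: int, alphabet: str) -> int:
    """Largest value that fits in `width` characters of this alphabet."""
    return len(alphabet) ** width - 1


for name, alphabet in [("hex", HEX), ("base36", BASE36), ("base62", BASE62)]:
    top = largest(4, alphabet)
    saved = len(str(top)) - 4  # decimal digits minus encoded characters
    print(f"{name}: {encode(top, alphabet)} = {top:,} (saves {saved})")
```

Note that the largest 4-character value is always base^4 - 1, because the count of possible values includes zero.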

Keeping a list of bad words that needs to be checked every time a new ID is generated would be a pain in the ass. I'm also considering letting users choose their own "url slug" when creating a record, just checking for uniqueness, with automated ones just generating something in dec/hex. This would do double duty as it would help with SEO.

How do sites like Imgur prevent this? My app is supposed to be "family friendly" as a lot of the sponsors are family safe brands, I'd hate to have someone complain because their daughter's bookmarked event is myDomain.com/slUt
 

purbeast0

No Lifer
Sep 13, 2001
53,543
6,368
126
i don't know how they do it, but i don't think checking a list of badwords every time a new id is generated would be a pain in the ass at all. the most difficult part would just be generating the list, but even that seems like it would be pretty simple to do.

if it's in the list, then just generate a new id.
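
A rough sketch of that generate-and-retry loop, assuming random IDs and a case-insensitive substring check. `BLOCKLIST`, `is_clean`, and `new_id` are made-up names, and the two list entries are harmless placeholders standing in for a real list.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters
BLOCKLIST = {"darn", "heck"}  # placeholder entries; a real list would be much longer


def is_clean(candidate: str) -> bool:
    """True if no blocklisted word appears anywhere in the candidate."""
    lowered = candidate.lower()  # case-insensitive comparison
    return not any(word in lowered for word in BLOCKLIST)


def new_id(length: int = 6) -> str:
    """Draw random IDs until one passes the blocklist check."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if is_clean(candidate):
            return candidate


print(new_id())
```

Since only a tiny fraction of random strings contain a listed word, the loop almost always returns on the first pass.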
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,620
4,536
75
Here's a bunch of hex words, if you're looking for a list.

Edit: after grepping /usr/dict/words for -v '[^a-foils]' I'm sure that list isn't comprehensive.

Edit2: In fact, I can come up with many words, some of them dirty, using just numbers. For instance, 7177135. ("tittles". :colbert:)
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
My app is supposed to be "family friendly" as a lot of the sponsors are family safe brands, I'd hate to have someone complain because their daughter's bookmarked event is myDomain.com/slUt

Each pseudoname should be a certain minimum length (myDomain.com/XXXXXXXX) and be fully randomized, i.e. the first site assigned should not be 00000001, but rather something like d3Fh7Qu6. As for the bad words, you just filter them out if and when they appear. For example, if you generate a name like "gY6slUt3w" then you just throw it out and generate a new one. But I'm betting you probably wouldn't even need to bother. Who is going to care if the word "slut" randomly appears within a larger string of what is clearly just gibberish?

Also, you shouldn't use base62. If anything, use base64. It ends up being a LOT easier to code if you keep the base a power of two, since each character then maps to exactly 6 bits. (Use hyphen and underscore to round out the 64 chars.)
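
To illustrate why a 64-character set is convenient, here is a sketch that encodes with shifts and masks instead of division. The character ordering in `ALPHABET64` and the function names are assumptions chosen for the example.

```python
import string

# URL-safe set: 62 alphanumerics plus hyphen and underscore (ordering is arbitrary here).
ALPHABET64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def encode64(n: int, width: int = 4) -> str:
    """Encode n as exactly `width` characters, 6 bits per character."""
    if not 0 <= n < 1 << (6 * width):
        raise ValueError("n does not fit in the requested width")
    chars = []
    # Walk from the most significant 6-bit group down to the least.
    for shift in range(6 * (width - 1), -1, -6):
        chars.append(ALPHABET64[(n >> shift) & 0x3F])
    return "".join(chars)


def decode64(s: str) -> int:
    """Inverse of encode64: rebuild the integer 6 bits at a time."""
    n = 0
    for ch in s:
        n = (n << 6) | ALPHABET64.index(ch)
    return n


print(encode64(123456), decode64(encode64(123456)))
```

With base62 the same job needs repeated division by 62, which is not hard either, so this is more a matter of taste than a requirement.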
 

Ken g6

Who is going to care if the word "slut" randomly appears within a larger string of what is clearly just gibberish?
Well, there are certain words, like **** and ****** for instance, that get converted to ****** on these forums. You wouldn't want your string to get converted from d3slUtu6 to d3****u6 someplace.

I'd probably skip any string that contained any three-letter-or-longer English word. For simplicity, it might make sense to skip all strings that contain some number of letters in a row. This also makes it easier for people to propose their own shortened URLs. (And if those get converted to **** it's the creators' fault.)
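
The "some number of letters in a row" rule could be sketched with a regular expression like this. The threshold of three and the name `has_letter_run` are assumptions.

```python
import re

# Three or more consecutive ASCII letters; the threshold is a tunable assumption.
LETTER_RUN = re.compile(r"[A-Za-z]{3,}")


def has_letter_run(candidate: str) -> bool:
    """True if the candidate contains a run of letters long enough to spell a word."""
    return LETTER_RUN.search(candidate) is not None


print(has_letter_run("d3Fh7Qu6"))  # longest letter run is 2
print(has_letter_run("gY6abc3w"))  # contains the 3-letter run "abc"
```

This throws away a lot of perfectly harmless IDs, but it needs no word list at all.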
 

Bubbleawsome

Diamond Member
Apr 14, 2013
4,834
1,204
146
Places like imgur protect against words (probably with a badwords list), but variants using numbers and caps still slip through sometimes. It seems that if a banned word is generated it swaps the characters around. You'll find swapped words in imgur URLs fairly often.
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
If you used a character set for the encoding that used A-Z, a-z, 0-9, symbols but dropped (a, e, i, o, u and sometimes y) then the bad words would only be l33t sp34k versions.
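
A small sketch of that idea, assuming y is always dropped and leaving out the extra symbols for simplicity. `SAFE_ALPHABET` and `random_id` are illustrative names.

```python
import secrets
import string

VOWELS = set("aeiouyAEIOUY")  # treating y as a vowel in every case (an assumption)

# Letters and digits with the vowels removed: 40 consonants + 10 digits.
SAFE_ALPHABET = "".join(
    ch for ch in string.ascii_letters + string.digits if ch not in VOWELS
)


def random_id(length: int = 6) -> str:
    """Random ID drawn only from the vowel-free character set."""
    return "".join(secrets.choice(SAFE_ALPHABET) for _ in range(length))


print(len(SAFE_ALPHABET))  # 50
print(random_id())
```

The cost is a smaller base (50 instead of 62), so IDs need to be slightly longer for the same number of records.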
 

Leros

Lifer
Jul 11, 2004
21,867
7
81
The founder of imgur posted a comment on reddit talking about how he generates URLs. He just takes 6 random characters and checks to see if the result has already been used before. If it has, he generates another one.

If you think about it, collisions are going to be very rare. If your alphabet is A-Z, a-z, 0-9, you have 26+26+10 = 62 characters. Take 6 characters and you have 62^6 = 56,800,235,584 possible URLs. The chance of generating an existing URL is pretty low. If you start getting into the billions of URLs, you could just add another few characters and be good for a long time.
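
The arithmetic can be checked with a few lines. This is only a back-of-the-envelope sketch that assumes IDs are drawn uniformly at random; the function names are made up.

```python
ALPHABET_SIZE = 26 + 26 + 10      # 62
LENGTH = 6
SPACE = ALPHABET_SIZE ** LENGTH   # 56,800,235,584 possible URLs


def collision_chance(existing: int, space: int = SPACE) -> float:
    """Probability that one fresh random ID matches an ID already in use."""
    return existing / space


def expected_draws(existing: int, space: int = SPACE) -> float:
    """Average number of draws until an unused ID turns up (geometric distribution)."""
    return 1 / (1 - existing / space)


for existing in (1_000_000, 1_000_000_000):
    print(f"{existing:,} used: {collision_chance(existing):.4%} per draw, "
          f"{expected_draws(existing):.4f} draws on average")
```

Even with a billion URLs in use, fewer than 2 in 100 draws would need a retry.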

In terms of bad words, there are dictionaries of bad words in all languages. You can check all substrings of the URL against a dictionary pretty quickly.
 

Rakehellion

Lifer
Jan 15, 2013
12,181
35
91
I'd just filter every dictionary word if possible. Google to see if someone created a leet speak filter and that'll handle 99% of cases.
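
A leet speak filter can be as simple as normalizing the digits back to letters before the dictionary check. The substitution table below is a deliberately small assumption (for instance, 1 is mapped only to i, although it can also stand for l), and the names are illustrative.

```python
# Assumed substitutions; real filters use larger tables and multiple mappings per symbol.
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(candidate: str) -> str:
    """Lowercase the candidate and undo common leet substitutions."""
    return candidate.lower().translate(LEET)


def contains_word(candidate: str, words: set[str]) -> bool:
    """True if any listed word appears in the normalized candidate."""
    text = normalize(candidate)
    return any(word in text for word in words)


print(normalize("H3CK"))                    # heck
print(contains_word("x9h3ck", {"heck"}))    # True
```

The word set passed in here is a placeholder; any dictionary file could be loaded into it.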
 

brianmanahan

Lifer
Sep 2, 2006
24,591
5,994
136
i actually wrote a case-insensitive base 10-to-N code generator for a large website of a well-known company which shall remain nameless.

what it did was this - mapped base 10 numbers to a configurable set of digits/letters. i removed all vowels plus a few of the most commonly used consonants. it was still something like base 26 or 27 which was plenty, and after checking against a swear-word dictionary with substitutions, we could avoid any close calls.

but the business analysts and product owner argued so much over which characters to disallow, they just decided to allow ALL OF THEM D:

someone was gonna be a little upset when they got the code 0F*CK generated and sent to them, which i calculated would take a couple of years. plus all the other, shorter swears happening well before that.
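
For sequential codes, estimating when a given code will come up is just a matter of decoding it back to its base-10 number. The sketch below uses a hypothetical 30-character set (digits plus consonants, no vowels or Y), not the actual set from that project, and `ids_per_year` is an assumed rate.

```python
# Hypothetical digit set: 10 digits + 20 consonants, case-insensitive.
ALPHABET = "0123456789BCDFGHJKLMNPQRSTVWXZ"


def decode(code: str, alphabet: str = ALPHABET) -> int:
    """Map a code back to the base-10 number it encodes (case-insensitive)."""
    base = len(alphabet)
    n = 0
    for ch in code.upper():
        n = n * base + alphabet.index(ch)
    return n


def years_until(code: str, ids_per_year: int) -> float:
    """Rough time until a sequential generator reaches this code."""
    return decode(code) / ids_per_year


print(decode("10"))                    # 30, the first two-character rollover
print(years_until("0BCDF", 200_000))
```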

maybe they fixed it in time... i didn't stick around long enough to see :awe:
 

Train

Thanks for all the comments.

here's what I am thinking so far.

4 char random Base64 (a-z, A-Z, 0-9, _ and -). This gives me a clean 24 bits of values (16.7 million), but I'll probably just use a char field anyway, so the number of bits doesn't really matter.
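
One possible way to get such an ID with only the Python standard library: 3 random bytes are exactly 24 bits, which URL-safe Base64 turns into 4 characters with no padding. This is a sketch under that assumption; `random_id` is a made-up name, and the bad-word and uniqueness checks would still have to be applied to the result.

```python
import base64
import secrets


def random_id() -> str:
    """Four URL-safe Base64 characters built from 24 random bits."""
    raw = secrets.token_bytes(3)  # 3 bytes = 24 bits = 4 Base64 characters
    return base64.urlsafe_b64encode(raw).decode("ascii")


print(random_id())
print(64 ** 4)  # 16777216 possible IDs
```

`base64.urlsafe_b64encode` uses `-` and `_` as the two extra characters, matching the character set described above.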

I only have to worry about 3 and 4 letter words showing up. Instead of using a dictionary I think I will hard code a list with l33t speak variants to take care of the major ones.

If I ever get into several million records (at current pace we would be lucky to use 200k per year) I'll maybe add a 5th character. Even at 5 characters it's a relatively short ID and allows us to have over a billion records.
 