Regular Expression to NOT match a list of characters

What's the best way to blacklist a set of characters via regular expression?

1) I want the regular expression to blacklist the set of characters included in the brackets. The following, to my understanding, whitelists them. I need the opposite.

2) How can I escape the brackets [ ] in the list? They are the opening and closing delimiters of the character list in a regular expression, but I also want them included in the list.

Code:
/^[!@##$%^&*()+-=`~{}|[]/:"";'<>?/]
 
I'm guessing you are doing this to prevent injection attacks. If not, ignore the following.

First, don't do this. Injection attacks can be pretty sneaky, so you are better off using a scrubbing library (there are several available). It is the same as rolling your own encryption: there are a thousand situations you aren't thinking of, so it just isn't a wise idea to do it yourself.

But if you do decide that you want to roll your own, your best bet is definitely whitelisting over blacklisting. Blacklists are generally easy to bypass, especially if you are dealing with something like Unicode text in UTF-8, where the same visible character can have many different representations.
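
To make the point about multiple representations concrete, here is a small sketch in Python (my choice for illustration; the thread doesn't name a language). It assumes a naive blacklist that only knows the ASCII angle brackets, and the names BLACKLIST and contains_blacklisted are made up for the example. A fullwidth lookalike slips past the check until the input is normalized.

```python
import unicodedata

# Hypothetical blacklist that only lists the ASCII angle brackets.
BLACKLIST = {"<", ">"}

def contains_blacklisted(text: str) -> bool:
    """Return True if any character of text is in BLACKLIST."""
    return any(ch in BLACKLIST for ch in text)

# U+FF1C and U+FF1E are the fullwidth forms of < and >.
sneaky = "\uff1cscript\uff1e"

raw_hit = contains_blacklisted(sneaky)
# NFKC normalization folds the fullwidth forms into their ASCII equivalents.
normalized_hit = contains_blacklisted(unicodedata.normalize("NFKC", sneaky))

print(raw_hit, normalized_hit)
```

Normalizing first closes this particular gap, but it is only one of many tricks, which is why a whitelist (or a proper library) is usually the safer route.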
 
Now, if you aren't doing injection attack prevention but rather something else:

1) While you can use regexes for this, honestly for something this simple it is probably easier to make a set of the characters you want to blacklist, iterate over each character in the string, and check whether it is in the set. A regex would only be using the most basic features of regex here, so a set might be easier for some future maintainer to grok. If you must do a regex for this, then something like this is what you are after:

Code:
a|b|c|d|\]

| means "or", and \ is the escape character for regex. If you want to match any character that is special in regex, you just put a \ in front of it and you are off.

2) So to match [ and ] you would say \[ and \]
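
The set-based check from point 1 and the escaping from point 2 could look roughly like this in Python. This is only a sketch: the character list is copied from the question, the function names are invented for the example, and re.escape is used so the backslashes don't have to be typed by hand.

```python
import re

# Characters taken from the question's list; adjust to your own needs.
BLACKLISTED = set("!@#$%^&*()+-=`~{}|[]/:\";'<>?")

def has_blacklisted_char(text: str) -> bool:
    """Set-based check: True if any character of text is blacklisted."""
    return any(ch in BLACKLISTED for ch in text)

# The same check as a regex. re.escape adds the backslashes for us,
# so characters such as ], ^ and - are taken literally inside the class.
BLACKLIST_RE = re.compile("[" + re.escape("".join(sorted(BLACKLISTED))) + "]")

def has_blacklisted_char_re(text: str) -> bool:
    """Regex-based check: True if the pattern matches anywhere in text."""
    return BLACKLIST_RE.search(text) is not None
```

Both functions should agree on any input; the set version is arguably easier to read, while the regex version is handy if the rest of the validation is already regex-based.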
 
The intent is for validation of some form variables.

I would prefer to keep it simple and just whitelist characters, but for some instances I think it would be easier to blacklist specific characters.
 
You'll just want to make sure that whatever the user enters into the form either never makes it onto the website or, if it does, goes through some HTML-escaping code first.
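
As a rough idea of what "HTML-escaping code" means, here is a sketch using Python's standard html module. The render_comment helper and the sample input are invented for the example; most languages and web frameworks ship an equivalent function.

```python
import html

def render_comment(user_text: str) -> str:
    """Return an HTML fragment with the user's text safely escaped."""
    # html.escape replaces &, <, > and (by default) both quote characters
    # with character references, so the text cannot open or close tags.
    return "<p>" + html.escape(user_text) + "</p>"

fragment = render_comment('<script>alert("hi")</script>')
print(fragment)
```

The point is that escaping happens when the text is written into the page, regardless of which characters the form validation let through.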
 
May I point out the obvious?

Blacklist:
if (!someString.matches(yourRegexThatMatchesOnBlacklistedCharacters))
{
    // No blacklisted character was found, so the string is "white".
}


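So in the OP's case, follow this pattern (written here as Java; the character list is taken from the question):

import java.util.regex.Pattern;

// Character class listing the blacklisted characters. The hyphen and the
// square brackets are escaped so that they are taken literally.
static final Pattern BLACKLIST_REGEX =
    Pattern.compile("[!@#$%^&*()+\\-=`~{}|\\[\\]/:\";'<>?]");

static boolean doesInputContainBlacklistedCharacters(String input)
{
    // find() is true if a blacklisted character occurs anywhere in the input
    return BLACKLIST_REGEX.matcher(input).find();
}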
 
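Adding a caret (^) at the beginning of a character class negates it (it then matches characters that aren't in the class).

The regular expression you want is:

Code:
[^!@#$%^&*()\/+\-=`~{}|\[\]\\:;"'<>?]+
The hyphen is escaped because an unescaped +-= inside a class is read as the range from + to =, which also covers the digits, the comma and the period. You can try it out on rubular.com.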
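
Here is a hedged Python sketch of the negated-class idea, in case it helps to see it used for whole-string validation. The names BLACKLIST_CLASS, ALLOWED_ONLY and is_clean are made up for the example, and the character list is my reading of the one in the question.

```python
import re

# Blacklisted characters, with -, [, ], ^ and \ escaped inside the class.
BLACKLIST_CLASS = r"!@#$%&*()/+\-=`~{}|\[\]\\:;\"'<>?\^"

# A leading ^ inside the brackets negates the class: "anything not listed".
ALLOWED_ONLY = re.compile("[^" + BLACKLIST_CLASS + "]+")

def is_clean(text: str) -> bool:
    """True if text is non-empty and every character is outside the blacklist."""
    return ALLOWED_ONLY.fullmatch(text) is not None

# Pitfall: an unescaped hyphen between two characters forms a range.
# [+-=] covers every code point from + to =, which includes the digits.
range_class = re.compile(r"[+-=]")
literal_class = re.compile(r"[+\-=]")
```

Note that fullmatch is what turns "matches a run of allowed characters" into "the whole input consists of allowed characters"; without it (or without anchors) a single allowed character anywhere in the input would be enough for a match.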

The problem with all the special characters is that different languages have drastically different ways of handling special characters and escape sequences in strings. Sometimes you have to escape the escape character in the string in order to use it as an escape character for the regexp.

To match c:\temp, you need to use the regex c:\\temp. As a string in C++ source code, this regex becomes "c:\\\\temp". Four backslashes to match a single one indeed.
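
The same double layer of escaping shows up in Python, sketched below with the c:\temp example. Many languages offer a raw or verbatim string literal to avoid the doubling; the variable names are just for illustration.

```python
import re

path = "c:\\temp"  # ordinary literal: the string holds c:\temp

# Ordinary string literal: four backslashes in source -> two in the regex
# -> one literal backslash matched.
pattern_plain = "c:\\\\temp"
# Raw string literal: the source text is passed to the regex engine as-is.
pattern_raw = r"c:\\temp"

same_pattern = pattern_plain == pattern_raw
matched = re.fullmatch(pattern_raw, path) is not None
print(same_pattern, matched)
```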
 
Would it work to use a simple \w?

Phrased another way, are there any non-word characters that should be allowed? The only characters I don't see on your blacklist are the comma, the period and the backslash (, . \).

Code:
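# perl
use strict;
use warnings;

# Placeholder: replace with however you read the form value.
my $input = from_some_form_or_something();
# Matches any character that is not a word character, whitespace, comma, period, or backslash
my $blacklist = qr/[^\w\s,.\\]/;
if ($input =~ $blacklist) {
    print "input contains special characters\n";
} else {
    print "input is acceptable\n";
}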
 