I recently came across a function that looked like this:
def paths_match(working_directory, target_directory): matches =  if re.match(working_directory, target_directory): matches.append('partial') if working_directory == target_directory: matches.append('partial') matches.append('full') return ' '.join(matches)
This compares two file paths, and then returns one of four strings,
'', if the file paths don't match at all,
'partial', if the target directory is a descendant of the current working directory,
'partial partial full', if the target directory is the current working directory, or, crucially
'partial full', in very few select cases, described below.
If we, for a second, pretend that
working_directory is a pre-escaped (using
re.escape) string, then we can translate the entire function into something like
def paths_match(working_directory, target_directory): if working_directory == target_directory: return 'partial full' elif target_directory.startswith(working_directory): return 'partial'
which, at least in my opinion, is much easier to read, understand and fix bugs in. It also removes the weird special case difference between
'partial partial full' and
Unfortunately, it assumes that
working_directory is escaped, which we can't know.
Subtle, Non-Breaking Bug
So basically, what I'm saying is that there could be some user of this function that relies on
working_directory being unescaped. Someone, somewhere, might inject regexes in their path to get the behaviour they want, so we can't remove the
re.match call. And since the "partial" branch is not a consequence of the "full" branch in the presence of regexes, we can't return early either. We have to check for both separately. Something could be a full but not partial match if there are regexes in the path.
You are probably not meant to inject regexes in your paths, because that's just silly. Who would ever think that's a good idea? I have a really hard time seeing anyone planning for that kind of usage scenario. This is a very subtle bug, because it doesn't affect normal use of the function, but it is a bug none the less.
The problem is, we can't fix it either. Someone might use it that way despite all the intentions by the original author. The situation is very reminiscent of the one described by Randall Munroe in xkcd 1172.
So, how do we make sure that doesn't happen again?
Code review. This was probably written by someone who was tired and distracted on a friday afternoon, who reached for the first string matching tool that came to mind –
re.match. They just didn't think that
re.match will actually parse regexes too. Or maybe they realised it, but thought "it won't matter, it will never be used that way anyway" and just didn't care for escaping the input. That's a very human thing to do.
That's where code review comes into play. Before that change gets merged to the development branch, someone else has to read that code. Someone else will read that code and go, "Dude. What the shit? Why are you using regexes here when you clearly just want literal string matching?" The same someone can do something as simple as switch out
str.startswith and the problem doesn't exist anymore.
Then the code is merged and everyone is happy. And I don't have to deal with code that doesn't work properly and can't be fixed.