8.3. Handle Metacharacters

Many systems, such as the command line shell and SQL interpreters, have ``metacharacters'', that is, characters in their input that are not interpreted as data. Such characters might be commands, or might delimit data from commands or other data. If there's a language specification for the interface of the system you're using, then it certainly has metacharacters. If your program invokes those other systems and allows attackers to insert such metacharacters, the usual result is that an attacker can completely control your program.

One of the most pervasive metacharacter problems involves shell metacharacters. The standard Unix-like command shell (stored in /bin/sh) interprets a number of characters specially. If these characters are sent to the shell, then their special interpretation will be used unless they are escaped; this fact can be used to break programs. According to the WWW Security FAQ [Stein 1999, Q37], these metacharacters are:
& ; ` ' \ " | * ? ~ < > ^ ( ) [ ] { } $ \n \r

I should note that in many situations you'll also want to escape the tab and space characters, since they (and the newline) are the default parameter separators. The separator values can be changed by setting the IFS environment variable, but if you can't trust the source of this variable you should have thrown it out or reset it anyway as part of your environment variable processing.

Unfortunately, in real life this isn't a complete list. Here are some other characters that can be problematic:

What makes the shell metacharacters particularly pervasive is that several important library calls, such as popen(3) and system(3), are implemented by calling the command shell, meaning that they will be affected by shell metacharacters too. Similarly, execlp(3) and execvp(3) may cause the shell to be called. Many guidelines suggest avoiding popen(3), system(3), execlp(3), and execvp(3) entirely and calling execve(2) directly in C when trying to spawn a process [Galvin 1998b]. At the least, avoid using system(3) when you can use execve(2); since system(3) uses the shell to expand characters, there is more opportunity for mischief in system(3). In a similar manner the Perl and shell backtick (`) also call a command shell; for more information on Perl see Section 10.2.
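
As a rough illustration of the advice above, here is a minimal Python sketch that starts a child process without involving /bin/sh. It is not taken from the cited guidelines; the helper names run_without_shell and echo_argument are made up for this example, and the running Python interpreter merely stands in for whatever program you would really call. Passing the command as a list (with shell=False) hands each argument to the child unchanged, much as calling execve(2) directly would in C.

```python
import os
import subprocess
import sys


def run_without_shell(program, args, timeout=10):
    """Run `program` with `args` without ever invoking a command shell.

    `program` must be a full path, and each element of `args` becomes one
    argv entry, so shell metacharacters inside them are never interpreted.
    (Hypothetical helper, written only for this illustration.)
    """
    if not os.path.isabs(program):
        raise ValueError("program must be given as a full path")
    completed = subprocess.run(
        [program, *args],
        shell=False,          # the default; stated explicitly for emphasis
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return completed.stdout


def echo_argument(untrusted):
    """Have a child process write its single argument back, unmodified."""
    child_code = "import sys; sys.stdout.write(sys.argv[1])"
    return run_without_shell(sys.executable, ["-c", child_code], ) if False else \
        run_without_shell(sys.executable, ["-c", child_code, untrusted])
```

With this approach, a string such as ``a; echo b `id` $HOME'' should reach the child as one literal argument; no shell is present to expand or execute any part of it. Note that this only removes the shell from the picture; the called program may still interpret its arguments in surprising ways.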

Since SQL also has metacharacters, a similar issue revolves around calls to SQL. When an attacker supplies input containing SQL metacharacters in order to change the meaning of an SQL command, it's often called ``SQL injection''. See SPI Dynamics' paper ``SQL Injection: Are your Web Applications Vulnerable?'' for further discussion on this. As discussed in Chapter 5, define a very limited pattern and only allow data matching that pattern to enter; if you limit your pattern to ^[0-9]$ or ^[0-9A-Za-z]*$ then you won't have a problem. If you must handle data that may include SQL metacharacters, a good approach is to convert it (as early as possible) to some other encoding before storage, e.g., HTML encoding (in which case you'll need to encode any ampersand characters too). Also, prepend and append a quote to all user input, even if the data is numeric; that way, insertions of white space and other kinds of data won't be as dangerous.
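
The pattern-matching advice for SQL might be sketched in Python as follows. This is only an illustration under stated assumptions: the users table, its columns, and the function name lookup_user are hypothetical. The sketch uses re.fullmatch rather than a pattern ending in $, because in Python's re module $ also matches just before a trailing newline. Binding the value through a ``?'' placeholder is an extra precaution of mine that goes beyond what the text above describes.

```python
import re
import sqlite3

# Only ASCII letters and digits are permitted, mirroring ^[0-9A-Za-z]*$.
_ALNUM = re.compile(r"[0-9A-Za-z]*")


def lookup_user(conn, username):
    """Return the id of `username`, or None if there is no such user.

    Input that does not match the limited pattern is rejected before it
    gets anywhere near the SQL interpreter. (Hypothetical schema.)
    """
    if _ALNUM.fullmatch(username) is None:
        raise ValueError("username contains characters outside the permitted pattern")
    # The value is bound as a parameter, never pasted into the SQL text.
    cursor = conn.execute("SELECT id FROM users WHERE name = ?", (username,))
    row = cursor.fetchone()
    return None if row is None else row[0]
```

An input such as x' OR '1'='1 fails the pattern check and raises ValueError, so the query is never run with it.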

Forgetting one of these characters can be disastrous; for example, many programs omit backslash as a shell metacharacter [rfp 1999]. As discussed in Chapter 5, some recommend immediately escaping at least all of these characters when they are input. But again, by far the best approach is to identify which characters you wish to permit, and use a filter to only permit those characters.
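
A filter of the ``permit only what you've identified'' kind can be quite short. The following Python sketch assumes, purely for illustration, that the permitted characters are ASCII letters, digits, underscore, period, and hyphen, and that a value may not begin with a hyphen (so it can't be mistaken for a command option); your own list of permitted characters will depend on your application, and the name is_permitted is made up.

```python
import re

# Assumed policy for this example: first character is a letter, digit, or
# underscore; later characters may also be '.' or '-'. Everything else,
# including every shell metacharacter, whitespace, and newline, is refused.
_PERMITTED = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]*")


def is_permitted(value):
    """Return True only if the whole of `value` matches the permitted pattern."""
    return _PERMITTED.fullmatch(value) is not None
```

Because the test names what is allowed rather than what is forbidden, a metacharacter you forgot about (such as backslash) is rejected along with all the others.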

A number of programs, especially those designed for human interaction, have ``escape'' codes that perform ``extra'' activities. One of the more common (and dangerous) escape codes is one that brings up a command line. Make sure that these ``escape'' commands can't be included (unless you're sure that the specific command is safe). For example, many line-oriented mail programs (such as mail or mailx) use tilde (~) as an escape character, which can then be used to send a number of commands. As a result, apparently-innocent commands such as ``mail admin < file-from-user'' can be used to execute arbitrary programs. Interactive programs such as vi, emacs, and ed have ``escape'' mechanisms that allow users to run arbitrary shell commands from their session. Always examine the documentation of programs you call to search for escape mechanisms. It's best if you call only programs intended for use by other programs; see Section 8.4.

The issue of avoiding escape codes even goes down to low-level hardware components and emulators of them. Most modems implement the so-called ``Hayes'' command set. Unless the command set is disabled, a delay, followed by the phrase ``+++'' and then another delay, forces the modem to interpret any following text as commands to the modem instead. This can be used to implement denial-of-service attacks (by sending ``ATH0'', a hang-up command) or even to force a user to connect to someone else (a sophisticated attacker could re-route a user's connection through a machine under the attacker's control). For the specific case of modems, this is easy to counter (e.g., add "ATS2=255" in the modem initialization string), but the general issue still holds: if you're controlling a lower-level component, or an emulation of one, make sure that you disable or otherwise handle any escape codes built into it.

Many ``terminal'' interfaces implement the escape codes of ancient, long-gone physical terminals like the VT100. These codes can be useful, for example, for bolding characters, changing font color, or moving to a particular location in a terminal interface. However, do not allow arbitrary untrusted data to be sent directly to a terminal screen, because some of those codes can cause serious problems. On some systems an attacker can remap keys (e.g., so that when a user presses "Enter" or a function key it sends a command of the attacker's choosing). On some an attacker can even send codes to clear the screen, display a set of commands for the victim to run, and then send that set ``back'', forcing the victim to run those commands without even waiting for a keystroke. This is typically implemented using ``page-mode buffering''. This security problem is why emulated ttys (represented as device files, usually in /dev/) should only be writable by their owners and never by anyone else: they should never have ``other write'' permission set, and unless the user is the only member of the group (i.e., the ``user-private group'' scheme), the ``group write'' permission should not be set either for the terminal [Filipski 1986]. If you're displaying data to the user at a (simulated) terminal, you probably need to filter out all control characters (characters with values less than 32) from data sent back to the user unless they're identified by you as safe. If worst comes to worst, you can identify tab and newline (and maybe carriage return) as safe, removing all the rest. Characters with their high bits set (i.e., values greater than 127) are in some ways trickier to handle; some old systems implement them as if they weren't set, but simply filtering them inhibits much international use. In this case, you need to look at the specifics of your situation.
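
The fallback filter just described (keep tab and newline, remove every other character with a value less than 32) might look like this in Python. The function name strip_control_chars is hypothetical, and carriage return is treated as unsafe here, which is a choice made for this sketch rather than a rule.

```python
# Control characters this sketch treats as safe; everything else below 32
# (including ESC, value 27, which starts terminal escape sequences) is removed.
_SAFE_CONTROLS = {"\t", "\n"}


def strip_control_chars(text):
    """Remove characters with values less than 32, except tab and newline."""
    return "".join(ch for ch in text if ord(ch) >= 32 or ch in _SAFE_CONTROLS)
```

Note that this sketch deliberately leaves characters with values of 127 and above alone; as the paragraph above says, whether to pass or filter those depends on the specifics of your situation.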

A related problem is that the NUL character (character 0) can have surprising effects. Most C and C++ functions assume that this character marks the end of a string, but string-handling routines in other languages (such as Perl and Ada95) can handle strings containing NUL. Since many libraries and kernel calls use the C convention, the result is that what is checked is not what is actually used [rfp 1999].
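
To make the ``what is checked is not what is actually used'' problem concrete, here is a small Python sketch. Both function names are made up for this example: c_view shows the portion of a string that a routine following the C convention would see, and reject_nul is one simple countermeasure, namely refusing any input that contains character 0.

```python
def c_view(value):
    """Return the part of `value` that a C-convention routine would see,
    i.e., everything before the first NUL character."""
    return value.split("\x00", 1)[0]


def reject_nul(value):
    """Return `value` unchanged, or raise ValueError if it contains NUL."""
    if "\x00" in value:
        raise ValueError("input contains a NUL character")
    return value
```

For instance, the string "passwd\x00.html" ends in ``.html'' as far as a suffix check in the calling language is concerned, yet c_view shows that a C-style routine would see only ``passwd''. Rejecting such input before any check is performed avoids the mismatch.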

When calling another program or referring to a file, always specify its full path (e.g., /usr/bin/sort). For program calls, this will eliminate possible errors in calling the ``wrong'' command, even if the PATH value is incorrectly set. For other file references, this reduces problems from ``bad'' starting directories.