Directory Trees Issue 14

"Linux Gazette...making Linux just a little more lovable!"

Directory Trees in Outline Format

By James T. Dennis jim@starshine.org

Since I frequently post messages to various Unix and Linux newsgroups and mailing lists I often get technical questions mailed to me ``out of the blue.''

I recently received a request for a script to produce the following sort of output:

 
        dir/
	   file1
	   file2
        file
	dir/
	   dir/
	      file

	(etc)

Here was my quick and dirty solution:

 
     	find . | awk -F/ '{for (x=1;x<NF;x++) { printf "\t"}; print $NF}'

... which only does about 80% of the job. The only problem is that the directory entries don't end with the ``/'' to indicate their file type. It was late -- so that's what I sent him.

Here's how that works:

find . just prints a list of full paths (using GNU find). Some non-Linux users may have to use 'find . -print' to accomplish this (or update to the GNU version on their systems).

awk is a text processing language/utility.

The -F (capital ``f'') sets the field separator to '/' (the slash character). Awk defaults to parsing its input into records (lines) of fields (whitespace delimited). Using -F allows me to tell awk to treat each record (still just lines) as a group of fields that are separated by slashes -- allowing me to deal with each directory element as a separate element very easily.
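
As a small illustration of that field splitting, here is a sketch that feeds one made-up path to awk and prints the field count and the last field. The helper name count_fields and the sample path are assumptions for this example only:

```shell
# count_fields is a hypothetical helper, not part of the original script.
# It reports how many "/"-separated fields a line has, then the last field.
count_fields() {
    awk -F/ '{ print NF, $NF }'
}

# The fields here are ".", "usr", "lib" and "file".
echo "./usr/lib/file" | count_fields   # prints: 4 file
```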

The next parameter to awk is a short program -- a for loop (like the C for() construct). It iterates x from 1 up to, but not including, NF.

NF in awk is the ``number of fields'' for each record. This, among many other values, is preset by awk as it parses its input.

Awk defaults to reading its input from a pipe or from each file listed after its script on the command line. We're supplying it with input through the pipe, of course.

In the body of my awk 'for' loop I simply print a tab for each directory named in that line. This has the appearance of "wiping out" all of the leading directory names and indenting my line as desired.
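
To see the indentation without depending on what happens to be in the current directory, the same awk program can be fed a fixed list of paths. This is only a sketch: the function name outline and the three sample paths are made up here, and printf stands in for find.

```shell
# outline is a hypothetical wrapper around the article's awk one-liner.
outline() {
    awk -F/ '{ for (x = 1; x < NF; x++) { printf "\t" }; print $NF }'
}

# One tab is printed for every field before the last one, so "." gets
# no tab, "./dir" gets one, and "./dir/file1" gets two.
printf '%s\n' . ./dir ./dir/file1 | outline
```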

Finally, after the end of the for loop I simply print the last field ($NF). Note how the printf takes a string similar to C's printf -- and it doesn't assume a newline. I could put C-like format specifiers like %s and %f in there -- and I'd have to supply additional parameters to the printf call if I did.

By contrast the awk print command (no trailing ``f'') does add an ORS (output record separator) character to the end of its line and doesn't treat its first argument as a format specification.
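
The difference between the two can be sketched in a BEGIN block, which runs without reading any input. The strings used are arbitrary:

```shell
# printf follows its format string and adds no newline of its own;
# print appends ORS, which is a newline by default.
out=$(awk 'BEGIN { printf "%s-%d", "depth", 2; print ""; print "done" }')
printf '%s\n' "$out"
```

The first line of output is depth-2 (finished off by the empty print), and the second is done.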

This evening, while cleaning up my home directory (and procrastinating on doing paying work and cleaning the house), I happened across a copy of this and decided to fix it.

 
		find . | { while read i ; 
			do 
			   [ -d "$i" ] \
			   && echo "$i/"  \
			   || echo "$i" 
			   done } \
			   | awk -F/ '
			   	/\/$/ { for (x = 1; x < NF -1 ;x++) {  
						printf "\t" }; 
				        print $(NF-1) "/";
					next;
					} 
				{ for (x = 1; x < NF; x++) {  
					printf "\t" } 
				  print $NF }'

Note that the original script: 'find ....| awk -F/ ...' is mostly still there. But the script has gone from one line to eleven -- all to get that silly little slash character on the end of each directory name.

(If anyone has a shorter program -- I'd like to see it -- there's probably a fairly quick way to do this using perl and find2perl)

The main thing I've added is the while loop which works like this:

find's output is piped into a group of commands (that's what the braces are for). That group of commands starts with a bash "while... do" loop. The bash "while...do" loop works like this:

 
			'while'
				some command returns no error
			'do'
				some commands
			'done'

Note that, unlike C or Pascal programming, the ``condition'' for the while loop is actually any command (or group of commands -- enclosed in braces or parentheses). The fact that programs return values (called errorlevels in DOS and some Mainframe OS) makes all commands implicitly ``conditions.'' (Actually C allows a variety of function calls within conditionals -- but we won't go into that).

Note that some commands might not return values that make any sense -- so those would not be suitable for use with any of the conditional contexts in any shell.
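
The return value that the shell treats as a ``condition'' can be looked at directly. In this sketch, status_of is a hypothetical helper that runs whatever command it is given and prints that command's exit status:

```shell
# status_of is a made-up name for this example. It runs its arguments as
# a command, discards the output, and prints the exit status.
status_of() {
    if "$@" >/dev/null 2>&1; then
        echo 0
    else
        echo "$?"      # in the else branch, $? is the failed command's status
    fi
}

status_of true     # prints 0, which the shell treats as "no error"
status_of false    # prints 1, which the shell treats as an error
```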

The command I'm using is bash's built-in ``read'' command, which just takes a variable name as an argument. Note that I don't say ``read $i'' -- the shell would then fill the value of $i into the command (i.e. it would ``dereference'' it), and since $i is still empty the read command would have no arguments. If you give bash's read command no argument it simply reads a line into its default variable, REPLY (no error).

When you set values in bash (or Bourne shell, or zsh etc.) you also don't ``dereference'' the variable. $i=foo is an error: the shell fills in the value of $i first and then tries to run the resulting word as a command, rather than assigning anything. (Setting a variable whose name is stored in $i takes an extra step, such as eval.)
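
A short sketch of the rule, using the throwaway variable names i and j: the bare name goes wherever a value is being stored, and the ``$'' form goes wherever the value is being used.

```shell
# Assignment: bare name, no "$".
i=foo
# Use: "$i" is replaced by the stored value, so this prints foo.
echo "$i"

# read also takes the bare name of the variable to fill in.
read j <<EOF
bar
EOF
echo "$j"
```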

Back to our script. When the find command stops printing filenames into the pipe, the 'read i' command will fail to get any value -- so the body of the do loop will be skipped.
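
The way the loop ends can be sketched with a fixed two-line input in place of find. A here-document is used instead of a pipe (an assumption made for this example) so that the counter is still visible after the loop:

```shell
# Count how many times the loop body runs. read succeeds once per line
# and fails at end of input, which is what ends the loop.
count=0
while read name
do
    count=$((count + 1))
done <<EOF
./file1
./dir
EOF
echo "$count"    # the body ran once for each of the two lines
```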

The 'do' keyword just marks the end of the list of commands in the conditional section and the beginning of the body of the loop (big surprise -- huh?).

The next three lines of the script are another common shell construct --

[ is really an alias for or link to the 'test' command.
-d is a parameter to 'test' that is true if the next parameter ($i) is a directory.
That line ends with a ``\'' (backslash) to mark a continuation character. This causes the shell to treat the next line as an extension of this one.
I could certainly have put all of this on one line. However, for readability I broke it up and formatted it with leading tabs -- otherwise *I* couldn't read it, much less expect anyone else to do so.
The next line (continuation) starts with the '&&' operator. In bash and related shells you have things like the familiar ``|'' (pipe) and ``;'' semicolon which are called operators. This operator means ``if that last command was O.K. -- returned no error -- then ...''
You can think of the '&&' operator as do this ``and'' do that (in the *conditional* sense of the word and).
The next line uses the '||' operator -- which is, as you might expect, similar to the '&&' operator except it means -- ``if the last command executed returned an error then ...'' This is roughly analogous to the English ``or'' (again, in the conditional sense).
Of course I could have wrapped this in an 'if ....; then ....; else...' construct -- but I'm used to the '&&' and '||' as are most shell programmers.
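
For comparison, the 'if' form of the same test might look like the sketch below. The function name mark_dir is made up for this example:

```shell
# mark_dir is a hypothetical helper: it prints its argument, adding a
# trailing slash when the argument names a directory.
mark_dir() {
    if [ -d "$1" ]; then
        echo "$1/"
    else
        echo "$1"
    fi
}

mark_dir .    # "." is a directory, so this prints ./
```

One difference worth knowing: with the '&&' ... '||' form, the '||' branch also runs if the echo after '&&' itself fails, whereas 'else' runs only when the test fails.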
So far all we've done is added a ``/'' character to the end of each directory.
Now I'm left with a print out of full paths with directories ending in ``/'' (slashes) and other files printed normally -- back to replacing all but the last thing with tabs -- so we pipe the 'while' loop's output into the same awk script we were using before.
Ooops! Well, almost the same script -- it turns out that awk -F is happy to consider the trailing slash as a blank field on the end of a line. Hmm. O.K. we add an extra condition to the awk script.
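
The blank field is easy to see with a one-record experiment. The path ./dir/ is made up, and the square brackets are only there to make the empty field visible:

```shell
# "./dir/" splits on "/" into ".", "dir" and an empty third field.
fields=$(echo "./dir/" | awk -F/ '{ print NF; print "[" $NF "]"; print "[" $(NF-1) "]" }')
printf '%s\n' "$fields"
```

This prints 3, then [] for the empty last field, then [dir] for the next-to-last one.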
An awk script consists of condition-action pairs. The most common awk ``conditions'' are patterns. That is to say that they are regular expressions (like the things you use grep to search for). A pattern is usually delimited by slashes (a mnemonic to the users of ed, later upgraded to ex, later upgraded to vi) although you can also ``match'' against strings that are enclosed in quotes.
Actions in awk are enclosed in braces.
Awk is an extremely forgiving language. If you leave out the ``condition'' or ``pattern'' it will execute the action on that line for every record (line) that it comes across. That's what my first script did.
If you leave off the action (i.e. if you have a line that consists just of a condition) then awk will simply print the record. In other words the default action is {print}.
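
Both defaults can be sketched on a three-line sample (the fruit names are arbitrary):

```shell
fruit=$(printf 'apple\nbanana\ncherry')

# Pattern with no action: matching records are simply printed.
matched=$(printf '%s\n' "$fruit" | awk '/an/')

# Action with no pattern: the action runs for every record.
numbered=$(printf '%s\n' "$fruit" | awk '{ print NR ": " $0 }')

printf '%s\n' "$matched" "$numbered"
```

Only banana contains ``an'', so the first program prints that one line; the second prints all three lines with their record numbers.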
When I was a regular in the comp.lang.awk newsgroup (and alt.lang.awk that preceded it) I used to enjoy pointing out that the shortest awk programs in the world are:
```
 
			1

			and 

			.
```
(The first one just prints every line it sees since ``1'' is a ``true'' condition; the second program (a dot) prints every line that has at least one character -- since that is the regular expression for ``any character''. The second program actually does filter out blank lines since awk doesn't count the record separator as part of the line).
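
A quick way to try both is to feed them three lines with a blank one in the middle. As an assumption for portability, the dot is written here with explicit regular-expression slashes, as /./:

```shell
# "1" is always true, so every line is printed, including the blank one.
all=$(printf 'a\n\nb\n' | awk 1)

# /./ needs at least one character, so the blank line is dropped.
nonblank=$(printf 'a\n\nb\n' | awk '/./')

printf '%s\n' "$all"
printf '%s\n' "$nonblank"
```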
So, the modification of my awk script for this purpose is to add a condition that handles any record that *ends* with a slash. In those cases I convert all *but* the next-to-last field to a tab, and print that ``next-to-last'' field. I also have to add the ``/'' character to the end of that since awk doesn't consider the field separator to be part of any field.
Finally I add a 'next' command which tells awk not to look for any more pattern-action pairs with *this* record. If I didn't do that then awk would execute the action for each ``directory'' line -- and also execute the other action for it (i.e. it would print a blank line after printing each directory line).
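
The effect of 'next' can be sketched with a cut-down version of the two rules (the tab loops are left out, and the second rule brackets its output so that the empty field shows up):

```shell
# With next: the directory rule prints and awk moves on to the next record.
with_next=$(echo "./dir/" | awk -F/ '/\/$/ { print $(NF-1) "/"; next } { print "[" $NF "]" }')

# Without next: the second rule also fires on the same record.
without_next=$(echo "./dir/" | awk -F/ '/\/$/ { print $(NF-1) "/" } { print "[" $NF "]" }')

printf '%s\n' "$with_next"
printf '%s\n' "$without_next"
```

The first variable holds just dir/, while the second holds dir/ followed by a line containing [] where the blank line would have been.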
Is the extra 10 lines of code worth it just to add a slash to the end of the directory names in our outline? Depends on how much your customer is willing to pay -- or how much grief it causes you, your boss or your users.
Mostly I decided to work on this as a training example. I think there are some neat constructs that every budding shell programmer might benefit from learning.
The ``find .... | {while read i .... do ... done}'' construct is well worth remembering for other cases. It allows you to do complex operations on large numbers of files without resorting to writing a temporary file and having to clean up after it.
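
As one sketch of the same construct put to another use, the function below prints the size in bytes of every regular file under a directory, with no temporary file involved. The name list_sizes is made up, and 'read -r' is used here (unlike in the script above) so that backslashes in names are left alone:

```shell
# list_sizes is a hypothetical helper: one "size name" line per file.
list_sizes() {
    find "$1" -type f | while read -r f
    do
        # wc -c counts bytes; tr strips the padding some versions add.
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}
```

It would be called as, for example, list_sizes /etc. In most shells the loop on the right side of a pipe runs in a subshell, so variables set inside it are not visible after 'done'.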
When you write scripts that explicitly create temporary files you suddenly have a host of new concerns -- what do I name it? where do I put it? don't forget to remove it! do I have enough space for it? what if my script gets interrupted? etc.
To be sure there are answers to each of these. For example I suggest ~/tmp/$0.`date +%Y%m%d`.$$ for a generic temporary filename for any script -- it gives the name of your script, the date in YYYYMMDD format and the process ID of the current instance of your script as the filename. It puts that into the temporary directory under your home (which no one else should have access to). There is virtually no chance of a name collision using this scheme (particularly if you change the date format to +%s which is the total number of seconds since midnight on Jan. 1, 1970). You can use the 'trap' command to ensure that your temp files are cleaned in all but the most extreme cases etc.
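
A minimal sketch of that naming scheme together with 'trap' follows. Two things here are assumptions rather than part of the suggestion above: the directory is passed in as a parameter (defaulting to $TMPDIR or /tmp so the sketch runs even where ~/tmp does not exist), and ${0##*/} is used to strip any leading path from the script name. The helper name tmp_name is also made up.

```shell
# tmp_name is a hypothetical helper: <dir>/<script name>.<YYYYMMDD>.<PID>
tmp_name() {
    printf '%s/%s.%s.%s\n' "$1" "${0##*/}" "$(date +%Y%m%d)" "$$"
}

tmpfile=$(tmp_name "${TMPDIR:-/tmp}")

# Remove the file whenever the script exits.
trap 'rm -f "$tmpfile"' EXIT

: > "$tmpfile"    # create the (empty) temporary file
```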
However, as I've said, it's worth understanding how to avoid temporary files -- and usually your scripts will execute faster as a result.
The [ ... ] && ... || ... construct is absolutely essential to any Unix sysadmin. Many legacy scripts (particularly those in /etc/rc.d/ -- or its local equivalent) rely on these operators and the test or '[' command.
Finally there is 'awk'. I've heard it argued that awk is a dinosaur and that we should convert all the awk code to perl (and presumably most of the Bourne shell and sed code with it). I won't argue that point here. Suffice it to say that anything you learn how to do in awk will just make learning perl that much easier when you get to it. awk is a much simpler language and is phenomenally easy to integrate into shell scripts (as you can see here).
Jim Dennis, Starshine Technical Services

Copyright © 1997, James T. Dennis
Published in Issue 14 of the Linux Gazette