Wednesday, May 29, 2019

Beware of this find command gotcha

find is a basic, useful command that Linux users run all the time. It searches the file system from a given starting location and returns every file that matches the filters you provide as arguments.

The Gotcha

The gotcha arises when you try to narrow the search by pruning a sub-directory (the directory itself and everything under it). For instance, suppose you want to find all files under the directory /data that are owned by root, excluding the sub-directory /data/keepit and all files underneath.

My first attempt was the following find command.

find /data -path /data/keepit -prune -o -user 0

The -o argument specifies the logical 'or' operator. The expression on the left, '-path /data/keepit -prune' (two parts joined by an implicit 'and'), indicates where to prune the search. When the search reaches /data/keepit, -path matches and -prune stops find from descending into the sub-directory. Furthermore, -prune always returns true. Hence, for /data/keepit the left side is true, and the whole expression is true without find having to evaluate the expression on the right of -o.

The expression to the right of -o tests for root ownership. The -user test accepts either a user name or a numeric user ID, and root is user 0.

I was befuddled to find that the output of the above command includes /data/keepit (but not its descendants). If the search is snipped at /data/keepit, why is the sub-directory itself in the output? After all, /data/keepit is not owned by root.
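The behavior can be reproduced on a throwaway tree. The sketch below is only an illustration under a few assumptions: the file and directory names are made up, the tree lives in a temporary directory rather than /data, and -type f stands in for the ownership test so that it runs without root.

```shell
# Build a scratch tree; every name below is hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/data/keepit" "$demo/data/other"
touch "$demo/data/a.txt" "$demo/data/keepit/secret.txt" "$demo/data/other/b.txt"

# Same shape as the command above, with -type f in place of -user 0.
# No action is given, so find falls back to its default -print.
gotcha=$(cd "$demo" && find data -path data/keepit -prune -o -type f | LC_ALL=C sort)
printf '%s\n' "$gotcha"
# data/a.txt
# data/keepit        <- a directory, so -type f cannot have matched it
# data/other/b.txt
```

The pruned directory shows up even though the right-hand test could never match it, while data/keepit/secret.txt is correctly absent.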

Being unaware of this behavior can have unintended and very bad consequences, because the files named in the find output are often piped to the xargs command for further processing.
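To see why this matters, here is a harmless dry run on a made-up scratch tree, with -type f standing in for the ownership test. The echo in front of rm is deliberate: it prints the command that xargs would build instead of running it.

```shell
# Scratch tree with hypothetical names; nothing outside it is touched.
demo=$(mktemp -d)
mkdir -p "$demo/data/keepit"
touch "$demo/data/a.txt" "$demo/data/keepit/secret.txt"

# echo makes this a dry run: the rm command is printed, never executed.
would_run=$(cd "$demo" &&
  find data -path data/keepit -prune -o -type f | LC_ALL=C sort | xargs echo rm -r)
printf '%s\n' "$would_run"
# rm -r data/a.txt data/keepit
```

Without the echo, the directory that was supposed to be spared would be handed to rm -r along with the real matches.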

The Explanation

Before I present my solution, let's discuss why the point of pruning, i.e., the sub-directory named in -path, is actually included in the output.

The primary purpose of find is to search for file matches. Yet, it can have side effects through actions you specify on the command line. In addition to -print/-print0, there is also the -exec action. Unless you explicitly specify an action, the find command assumes the default action is -print.

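The above example has no explicit -print or -exec action, so find applies the default -print to the expression as a whole, as if the command had been written '( -path /data/keepit -prune -o -user 0 ) -print'. Every file for which the whole expression is true gets printed. For /data/keepit, the left side of -o is true, so it is in the output even though the ownership test was never evaluated for it. Its descendants, on the other hand, were excluded because of pruning.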
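One way to check this reading of the default action is to write the parentheses and -print out by hand and compare the two outputs. This is a minimal sketch on a made-up scratch tree, with -type f standing in for the ownership test so that no root access is needed.

```shell
# Scratch tree with hypothetical names.
demo=$(mktemp -d)
mkdir -p "$demo/data/keepit"
touch "$demo/data/a.txt" "$demo/data/keepit/secret.txt"

# No action given: find supplies -print itself.
implicit=$(cd "$demo" && find data -path data/keepit -prune -o -type f | LC_ALL=C sort)

# The same thing spelled out: parentheses around the whole expression, then -print.
explicit=$(cd "$demo" &&
  find data \( -path data/keepit -prune -o -type f \) -print | LC_ALL=C sort)

[ "$implicit" = "$explicit" ] && echo "identical output"
printf '%s\n' "$explicit"
# identical output
# data/a.txt
# data/keepit
```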

The Solution

My solution is to specify -print explicitly on the command line.

find /data -path /data/keepit -prune -o -user root -print

Lo and behold. When you run the above command, /data/keepit is no longer part of the output.

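By specifying the -print action explicitly, the find command no longer wraps the whole expression in a default -print. Because the implicit 'and' binds more tightly than -o, the explicit -print belongs only to the right side: the expression reads as '-path /data/keepit -prune' or '-user root -print'. For /data/keepit the left side is true, so the right side, -print included, is never evaluated and the directory is not printed. Every other file fails the -path test, falls through to the right side, and is printed only if it is owned by root.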
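The effect of the explicit -print can be checked on a scratch tree. As a sketch under stated assumptions, the names below are made up and -type f stands in for the ownership test so that it runs without root.

```shell
# Scratch tree with hypothetical names.
demo=$(mktemp -d)
mkdir -p "$demo/data/keepit" "$demo/data/other"
touch "$demo/data/a.txt" "$demo/data/keepit/secret.txt" "$demo/data/other/b.txt"

# Explicit -print: it attaches only to the right side of -o.
fixed=$(cd "$demo" && find data -path data/keepit -prune -o -type f -print | LC_ALL=C sort)
printf '%s\n' "$fixed"
# data/a.txt
# data/other/b.txt

# Fully parenthesized form of the same expression, for comparison.
spelled_out=$(cd "$demo" &&
  find data \( -path data/keepit -prune \) -o \( -type f -print \) | LC_ALL=C sort)
[ "$fixed" = "$spelled_out" ] && echo "same result"
```

When the output feeds xargs, the same placement should apply to -print0 (paired with xargs -0), since it is an action just like -print.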

Summary & Conclusion

The pruning logic of the find command is quite confusing. Reading its man page offers some help, but may generate more questions than answers. I hope this article helps. Still, before you use the -prune feature on your production data, I recommend testing it on some dummy data first.

You have been forewarned.