Code release: preg_find() - A recursive file listing tool for PHP

Version 2.1

I originally wrote this a few years ago and never really promoted it beyond the realms of the #php IRC channel on EfNet.  However, it has managed to find its way into applications such as WordPress and many other PHP apps.  It is gratifying to know that others are finding it useful.

So what is preg_find() anyway? A short summary for those who have never encountered it: imagine a recursion-capable glob() that can filter its results with a regex (PCRE) and accepts various arguments that modify the results or bring back additional data.

Well, today I thought I would add one commonly requested feature: sorting. Using the power of PHP's anonymous (lambda-style) functions, preg_find() now creates a custom sort routine based on the arguments passed in. You can sort by filename, dir+filename, last modified, file size, or disk usage (yes, those last two are different), in either ascending or descending order.
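As a rough sketch of the idea only (this is not the code shipped in preg_find.php), a comparator can be built from a sort key and a direction with a closure. A closure is used here because create_function() was deprecated in PHP 7.2 and removed in PHP 8.0. The helper name make_file_sorter() and the sample data are hypothetical.

```php
<?php
// Hypothetical helper: build a comparator for uasort() from a key path and
// a direction. It mirrors the idea only, not preg_find's actual source.
function make_file_sorter(array $keyPath, bool $descending): callable
{
    return function ($a, $b) use ($keyPath, $descending) {
        // Walk into nested arrays, e.g. ['stat', 'size'] reads $a['stat']['size'].
        foreach ($keyPath as $key) {
            $a = $a[$key];
            $b = $b[$key];
        }
        $cmp = $a <=> $b;                  // -1, 0 or 1
        return $descending ? -$cmp : $cmp;
    };
}

// Made-up sample data; the nested 'stat' => 'size' shape is an assumption.
$files = [
    'a.php' => ['stat' => ['size' => 300]],
    'b.php' => ['stat' => ['size' => 900]],
    'c.php' => ['stat' => ['size' => 100]],
];

uasort($files, make_file_sorter(['stat', 'size'], true));
$sorted_names = array_keys($files);        // largest file first
```

Passing an empty key path compares the values themselves, which would suit a plain list of filenames.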

Download preg_find.phps
Download preg_find.php in plain text format


A simple example to get started - we'll work on my PHP miscellaneous code directory:

Example 1: List the files (no directories):

Code:

include 'preg_find.php';
$files = preg_find('/./', '../code');
foreach($files as $file) printf("<br>%s\n", $file);

You can see the result here


Now let us look at a recursive search - this is easy, just pass in the PREG_FIND_RECURSIVE argument.
Example 2: List the files, recursively:

Code:

$files = preg_find('/./', '../code', PREG_FIND_RECURSIVE);
foreach($files as $file) printf("<br>%s\n", $file);

You can see the result here


Let's go further: this time we don't want to see any files, only a directory structure.
Example 3: List the directory tree:

Code:

$files = preg_find('/./', '../code', PREG_FIND_DIRONLY|PREG_FIND_RECURSIVE);
foreach($files as $file) printf("<br>%s\n", $file);

You can see the result here

It should be obvious by now that we are using constants as our modifier arguments. What might not be immediately obvious is that these constants are "bit" values (e.g. 1, 2, 4, 8, 16, ..., 1024, etc.), so PHP's bitwise OR operator "|" lets us combine several modifiers into the single argument passed to the function.
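To make the bit arithmetic concrete, here is a minimal sketch. The DEMO_ constants are invented for illustration and are not the ones defined in preg_find.php.

```php
<?php
// Illustrative flags only: these names and values are made up for the
// example and are not the constants defined in preg_find.php.
define('DEMO_RECURSIVE', 1);   // binary 0001
define('DEMO_DIRMATCH',  2);   // binary 0010
define('DEMO_FULLPATH',  4);   // binary 0100
define('DEMO_NEGATE',    8);   // binary 1000

// Bitwise OR switches on each flag's bit inside a single integer.
$args = DEMO_RECURSIVE | DEMO_NEGATE;             // 1 | 8 = 9

// Bitwise AND isolates one bit, so a non-zero result means "flag is set".
$is_recursive = ($args & DEMO_RECURSIVE) !== 0;   // true
$is_fullpath  = ($args & DEMO_FULLPATH) !== 0;    // false
```

Because each flag owns exactly one bit, any combination of flags fits in one integer and every flag can be tested independently.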


How about a regex? Files starting with str_ and ending in .php
Example 4: Using a regex on the same code as example 1:

Code:

$files = preg_find('/^str_.*?\.php$/D', '../code');
foreach($files as $file) printf("<br>%s\n", $file);

You can see the result here


What about that funky PREG_FIND_RETURNASSOC modifier?
This will change the output dramatically from a simple file/directory array to an associative array where the key is the filename, and the value is lots of information about that file.

Example 5: Use of PREG_FIND_RETURNASSOC

Code:

$files = preg_find('/^str_.*?\.php$/D', '../code', PREG_FIND_RETURNASSOC);
foreach($files as $file => $info) printf("<br>%s => %s\n", $file, print_r($info, true));

You can see the result here
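For a feel of what such an associative entry might look like, here is a small sketch. The describe_file() helper is hypothetical; the 'stat', 'dirname' and 'basename' keys follow the examples and patches on this page, and the real function returns more fields than shown.

```php
<?php
// Hypothetical helper: gather per-file details into one array. This is an
// approximation of the result shape, not preg_find's actual code.
function describe_file(string $path): array
{
    return [
        'stat'     => stat($path),      // size, mtime and friends
        'dirname'  => dirname($path),
        'basename' => basename($path),
    ];
}

// Use a temporary file so the sketch runs anywhere.
$tmp = tempnam(sys_get_temp_dir(), 'pf_');
file_put_contents($tmp, 'hello');
clearstatcache();                       // make sure stat() sees the new size

// Key is the path, value is the information about that file.
$result = [$tmp => describe_file($tmp)];
$size = $result[$tmp]['stat']['size'];  // 5 bytes were written above
unlink($tmp);
```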


As I mentioned earlier, I added sorting capability to the results, so let us look at some examples of that.

Example 6. Sorting the results (of example 1)

Code:

$files = preg_find('/./', '../code', PREG_FIND_SORTKEYS);
foreach($files as $file) printf("<br>%s\n", $file);

You can see the result here


Example 7. And reverse sort.

Code:

$files = preg_find('/./', '../code', PREG_FIND_SORTKEYS|PREG_FIND_SORTDESC);
foreach($files as $file) printf("<br>%s\n", $file);

You can see the result here


OK, that's all well and good, but what about something more interesting?

Example 8. Finding the largest 5 files in the tree, sorted by filesize, descending.

Code:

$files = preg_find('/./', '../code',
  PREG_FIND_RECURSIVE|PREG_FIND_RETURNASSOC|PREG_FIND_SORTFILESIZE|PREG_FIND_SORTDESC);
$i=1;
foreach($files as $file => $stats) {
  printf('<br>%d) %d %s', $i, $stats['stat']['size'], $file);
  $i++;
  if ($i > 5) break;
}

You can see the result here.
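If you only need the first few entries of an already sorted result, array_slice() is an alternative to counting inside the loop. The data below is made up and stands in for a sorted result.

```php
<?php
// Made-up sizes standing in for a result already sorted by size, descending.
$files = ['a' => 900, 'b' => 700, 'c' => 500, 'd' => 300, 'e' => 200, 'f' => 100];

// Keep the first five entries. String keys are always kept by array_slice();
// passing true as the fourth argument also preserves integer keys.
$top5 = array_slice($files, 0, 5, true);
```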

Or what about the 10 most recently modified files?

Example 9.

Code:

$files = preg_find('/./', '../code',
  PREG_FIND_RECURSIVE|PREG_FIND_RETURNASSOC|PREG_FIND_SORTMODIFIED|PREG_FIND_SORTDESC);
$i=1;
foreach($files as $file => $stats) {
  printf('<br>%d) %s - %d bytes - %s', $i,
    date('Y-m-d H:i:s', $stats['stat']['mtime']), $stats['stat']['size'], $file);
  $i++;
  if ($i > 10) break;
}

You can see the result here.

I am keen to receive feedback on what you think of this function.   If you have used it in some other application - great, I would love to know.  Suggestions, improvements, criticisms are also always welcome.

13 Comments


pgregg wrote:

Hi Reid,

Great spot there - I had realised that with recursion came additional sorting, but I did not realise that the memory hit would be so large.   I've patched the code to only sort at the final function exit, however rather than break out a further function call, I used a static variable to record the current recursive depth.

Patch to 2.1 (non-contextual because it is smaller) is below, or 2.2 is now in place.

Thanks for pointing that out
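The static depth counter described in this comment can be sketched roughly as follows. The collect() function is a hypothetical stand-in that walks a nested array instead of the filesystem, so that the example is self-contained.

```php
<?php
// Hypothetical stand-in for a recursive lister: nested arrays play the role
// of directories, scalar values the role of files.
function collect(array $tree, string $prefix = ''): array
{
    static $depth = -1;   // shared by every call of this function
    ++$depth;

    $found = [];
    foreach ($tree as $name => $child) {
        $path = $prefix . $name;
        if (is_array($child)) {
            $found = array_merge($found, collect($child, $path . '/'));
        } else {
            $found[] = $path;
        }
    }

    // Only the outermost call (depth 0) sorts, so the work happens once.
    if ($depth === 0) {
        sort($found);
    }

    --$depth;
    return $found;
}

$tree = ['z.php' => 1, 'lib' => ['b.php' => 1, 'a.php' => 1]];
$list = collect($tree);
```

The counter returns to -1 after the outermost call finishes, so later top-level calls sort again as expected.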



Not a problem. I appreciated having this available when I needed to get an alphabetically sorted recursive listing of directories, preserving the hierarchical order. I kept getting memory exhaustion while testing in interactive mode: each create_function() call seemed to want ~500 KB of memory (in non-interactive mode it was significantly less), and after some research I found out that memory allocated by create_function() cannot be reclaimed by garbage collection.

I'm glad I found something to give back. Thanks very much for sharing your tremendously useful code!

Hey guys -

This function looks immensely useful and I'm looking at implementing it in an object oriented context.  I'm in the process of trying to make the necessary modifications (properly credited, be happy to donate the changes back, blah, blah, blah :-) and, not being a seasoned PHP developer, I'm trying to understand the syntactical meaning of the single ampersand.

Code:

if ($args & self::PREG_FIND_NEGATE)

I'm not familiar with that syntax in php and every search I do seems to return more "&&" references than I'm willing to sort through.  :-)

I'll keep digging around, but any insight would be very much appreciated.

Rob

Oops.  Disregard the "self::" syntax above.  That's one of my changes and I forgot to strip it out after I pasted the code.  Sorry.

Thanks for this, really useful.

I just want to have a list of filenames without the directory path. Is there any option to do this?

Never mind, I just looked at the code and changed array_push($files_matched, $filepath) to array_push($files_matched, $file);

You might be better using basename() on the filename just before you use it.
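A minimal sketch of that suggestion, using made-up paths in place of a real result list:

```php
<?php
// Made-up paths standing in for a preg_find() result list.
$files = ['../code/str_pad.php', '../code/lib/util.php'];

// Strip the directory part at the point of use; the original list, with
// full paths, stays intact for anything else that needs it.
$names = array_map('basename', $files);
```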

Got it.  Bitwise operations...

preg_find's recursive search leaks a large amount of memory when sorting  due to the numerous create_function() calls to build the sorting function, and as a side effect this also incurs extra sorting complexity - as the tree depth of an item increases so does the number of times it is sorted. When searching over a large tree, you can quickly exhaust php's available memory. I overcame this by renaming the preg_find algorithm to _preg_find (and changing the recursive call to _preg_find),  separating the sorting code into a wrapper preg_find function that takes the same argument list. This function first calls _preg_find, then applies the sorting to the result set. This way, the recursion does not affect the sorting, and resource consumption is much more manageable.

patch:

Code:

Index: preg_find.php
===================================================================
--- preg_find.php (revision 11)
+++ preg_find.php (working copy)
@@ -52,12 +52,34 @@
// to use more than one simply seperate them with a | character


+//wrapper function, ensure that we only sort once and only incur the memory hit of create_function once
+function preg_find($pattern, $start_dir='.', $args=NULL) {
+ $files_matched = _preg_find($pattern, $start_dir, $args);
+
+ // Before returning check if we need to sort the results.
+ if ($args & (PREG_FIND_SORTKEYS|PREG_FIND_SORTBASENAME|PREG_FIND_SORTMODIFIED|PREG_FIND_SORTFILESIZE|PREG_FIND_SORTDISKUSAGE)) {
+ $order = ($args & PREG_FIND_SORTDESC) ? 1 : -1;
+ $sortby = '';
+ if ($args & PREG_FIND_RETURNASSOC) {
+ if ($args & PREG_FIND_SORTMODIFIED) $sortby = "['stat']['mtime']";
+ if ($args & PREG_FIND_SORTBASENAME) $sortby = "['basename']";
+ if ($args & PREG_FIND_SORTFILESIZE) $sortby = "['stat']['size']";
+ if ($args & PREG_FIND_SORTDISKUSAGE) $sortby = "['du']";
+ }
+
+ $filesort = create_function('$a,$b', "\$a1=\$a$sortby;\$b1=\$b$sortby; if (\$a1==\$b1) return 0; else return (\$a1<\$b1) ? $order : 0- $order;");
+ uasort($files_matched, $filesort);
+ }
+
+ return $files_matched;

+}
+
// Search for files matching $pattern in $start_dir.
// if args contains PREG_FIND_RECURSIVE then do a recursive search
// return value is an associative array, the key of which is the path/file
// and the value is the stat of the file.
-Function preg_find($pattern, $start_dir='.', $args=NULL) {
+function _preg_find($pattern, $start_dir='.', $args=NULL) {

$files_matched = array();

@@ -94,25 +116,12 @@
}
if ( is_dir($filepath) && ($args & PREG_FIND_RECURSIVE) ) {
$files_matched = array_merge($files_matched,
- preg_find($pattern, $filepath, $args));
+ _preg_find($pattern, $filepath, $args));
}
}

closedir($fh);

- // Before returning check if we need to sort the results.
- if ($args & (PREG_FIND_SORTKEYS|PREG_FIND_SORTBASENAME|PREG_FIND_SORTMODIFIED|PREG_FIND_SORTFILESIZE|PREG_FIND_SORTDISKUSAGE)) {
- $order = ($args & PREG_FIND_SORTDESC) ? 1 : -1;
- $sortby = '';
- if ($args & PREG_FIND_RETURNASSOC) {
- if ($args & PREG_FIND_SORTMODIFIED) $sortby = "['stat']['mtime']";
- if ($args & PREG_FIND_SORTBASENAME) $sortby = "['basename']";
- if ($args & PREG_FIND_SORTFILESIZE) $sortby = "['stat']['size']";
- if ($args & PREG_FIND_SORTDISKUSAGE) $sortby = "['du']";
- }
- $filesort = create_function('$a,$b', "\$a1=\$a$sortby;\$b1=\$b$sortby; if (\$a1==\$b1) return 0; else return (\$a1<\$b1) ? $order : 0- $order;");
- uasort($files_matched, $filesort);
- }
return $files_matched;

}

Code:

9c9,10
< * Version: 2.1
---
> * Updated 9 June 2007 to prevent multiple calls to sort during recursion
> * Version: 2.2
61a63,65
> static $depth = -1;
> ++$depth;
>
104c108
< if ($args & (PREG_FIND_SORTKEYS|PREG_FIND_SORTBASENAME|PREG_FIND_SORTMODIFIED|PREG_FIND_SORTFILESIZE|PREG_FIND_SORTDISKUSAGE)) {
---
> if (($depth==0) && ($args & (PREG_FIND_SORTKEYS|PREG_FIND_SORTBASENAME|PREG_FIND_SORTMODIFIED|PREG_FIND_SORTFILESIZE|PREG_FIND_SORTDISKUSAGE)) ) {
115a120
> --$depth;

I incorporated the changes from Reid above, and added a sort for file extension and the ability not to recurse into links (hitting a bad link that points to the parent directory could cause an infinite recursion).

--- preg_find.php 2009-06-11 23:27:38.000000000 -0400
+++ preg_find.sean 2009-06-11 23:27:16.000000000 -0400
@@ -23,6 +23,7 @@
define('PREG_FIND_FULLPATH', 4);
define('PREG_FIND_NEGATE', 8);
define('PREG_FIND_DIRONLY', 16);
+define('PREG_FIND_IGNORELINKS', 24);
define('PREG_FIND_RETURNASSOC', 32);
define('PREG_FIND_SORTDESC', 64);
define('PREG_FIND_SORTKEYS', 128);
@@ -30,10 +31,12 @@
define('PREG_FIND_SORTMODIFIED', 512); # requires PREG_FIND_RETURNASSOC
define('PREG_FIND_SORTFILESIZE', 1024); # requires PREG_FIND_RETURNASSOC
define('PREG_FIND_SORTDISKUSAGE', 2048); # requires PREG_FIND_RETURNASSOC
+define('PREG_FIND_SORTEXTENSION', 4096); # requires PREG_FIND_RETURNASSOC

// PREG_FIND_RECURSIVE - go into subdirectorys looking for more files
// PREG_FIND_DIRMATCH - return directorys that match the pattern also
// PREG_FIND_DIRONLY - return only directorys that match the pattern (no files)
+// PREG_FIND_IGNORELINKS - Do not follow links
// PREG_FIND_FULLPATH - search for the pattern in the full path (dir+file)
// PREG_FIND_NEGATE - return files that don't match the pattern
// PREG_FIND_RETURNASSOC - Instead of just returning a plain array of matches,
@@ -58,7 +61,29 @@
// if args contains PREG_FIND_RECURSIVE then do a recursive search
// return value is an associative array, the key of which is the path/file
// and the value is the stat of the file.
-Function preg_find($pattern, $start_dir='.', $args=NULL) {
+function preg_find($pattern, $start_dir='.', $args=NULL) {
+ $start_dir = chop($start_dir,'\/');
+ $files_matched = _preg_find($pattern, $start_dir, $args);
+
+ //Before returning check if we need to sort the results
+ if($args & (PREG_FIND_SORTKEYS|PREG_FIND_SORTBASENAME|PREG_FIND_SORTMODIFIED|PREG_FIND_SORTFILESIZE|PREG_FIND_SORTDISKUSAGE)) {
+ $order = ($args & PREG_FIND_SORTDESC) ? 1 : -1;
+ $sortby = '';
+ if ($args & PREG_FIND_RETURNASSOC) {
+ if ($args & PREG_FIND_SORTMODIFIED) $sortby = "['stat']['mtime']";
+ if ($args & PREG_FIND_SORTBASENAME) $sortby = "['basename']";
+ if ($args & PREG_FIND_SORTFILESIZE) $sortby = "['stat']['size']";
+ if ($args & PREG_FIND_SORTDISKUSAGE) $sortby = "['du']";
+ if ($args & PREG_FIND_SORTEXTENSION) $sortby = "['extension']";
+ }
+
+ $filesort = create_function('$a,$b', "\$a1=\$a$sortby;\$b1=\$b$sortby; if (\$a1==\$b1) return 0; else return (\$a1<\$b1) ? $order : 0- $order;");
+ uasort($files_matched, $filesort);
+ }
+ return $files_matched;
+}
+
+function _preg_find($pattern, $start_dir='.', $args=NULL) {

static $depth = -1;
++$depth;
@@ -91,32 +116,22 @@
if (function_exists('dirname')) $fileres['dirname'] = dirname($filepath);
if (function_exists('basename')) $fileres['basename'] = basename($filepath);
if (isset($fileres['uid']) && function_exists('posix_getpwuid')) $fileres['owner'] = posix_getpwuid ($fileres['uid']);
+ if (function_exists('end')) $fileres['extension'] = pathinfo($filepath, PATHINFO_EXTENSION);
$files_matched[$filepath] = $fileres;
} else
array_push($files_matched, $filepath);
}
}
if ( is_dir($filepath) && ($args & PREG_FIND_RECURSIVE) ) {
- $files_matched = array_merge($files_matched,
- preg_find($pattern, $filepath, $args));
+ if (!is_link($filepath) && !($args & PREG_FIND_IGNORELINKS) ) {
+ $files_matched = array_merge($files_matched,
+ _preg_find($pattern, $filepath, $args));
+ }
}
}

closedir($fh);

- // Before returning check if we need to sort the results.
- if (($depth==0) && ($args & (PREG_FIND_SORTKEYS|PREG_FIND_SORTBASENAME|PREG_FIND_SORTMODIFIED|PREG_FIND_SORTFILESIZE|PREG_FIND_SORTDISKUSAGE)) ) {
- $order = ($args & PREG_FIND_SORTDESC) ? 1 : -1;
- $sortby = '';
- if ($args & PREG_FIND_RETURNASSOC) {
- if ($args & PREG_FIND_SORTMODIFIED) $sortby = "['stat']['mtime']";
- if ($args & PREG_FIND_SORTBASENAME) $sortby = "['basename']";
- if ($args & PREG_FIND_SORTFILESIZE) $sortby = "['stat']['size']";
- if ($args & PREG_FIND_SORTDISKUSAGE) $sortby = "['du']";
- }
- $filesort = create_function('$a,$b', "\$a1=\$a$sortby;\$b1=\$b$sortby; if (\$a1==\$b1) return 0; else return (\$a1<\$b1) ? $order : 0- $order;");
- uasort($files_matched, $filesort);
- }
--$depth;
return $files_matched;

Hi Sean,

This is a good idea to add a sort-by-extension, however your implementation is flawed as the value used must represent a single bit of a 32bit integer. The value "24" won't work as that just represents both PREG_FIND_DIRONLY and PREG_FIND_NEGATE turned on at the same time.

I'll add "ext" functionality to the code shortly and post it up tonight.

Regards,
PG
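To illustrate the single-bit requirement, here is a small check. The is_single_bit() helper is hypothetical and not part of preg_find.

```php
<?php
// Hypothetical helper: true when $value has exactly one bit set,
// i.e. when it is a power of two.
function is_single_bit(int $value): bool
{
    return $value > 0 && ($value & ($value - 1)) === 0;
}

$ok_16 = is_single_bit(16);   // true:  binary 10000
$ok_24 = is_single_bit(24);   // false: binary 11000
$same  = (16 | 8) === 24;     // true: 24 is just the 16 and 8 flags combined
```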

New diff Sean - please note that the previous code had implemented Reid's suggestion of preventing the multiple sorts - however I did it via a static variable which I would argue is better than splitting the routine into two functions. Thus you should have started with my base code instead of Reid's.

10c10,12
< * Version: 2.2
---
> * Updated 12 June 2009 to allow for sorting by extension and prevent following
> * symlinks by default
> * Version: 2.3
32a35,36
> define('PREG_FIND_SORTEXTENSION', 4096); # requires PREG_FIND_RETURNASSOC
> define('PREG_FIND_FOLLOWSYMLINKS', 8192);
40a45,48
> // PREG_FIND_FOLLOWSYMLINKS - Recursive searches (from v2.3) will no longer
> // traverse symlinks to directories, unless you
> // specify this flag. This is to prevent nasty
> // endless loops.
52a61
> // PREG_FIND_SORTEXTENSION - Sort based on the filename extension
92a102
> if (($i=strrpos($fileres['basename'], '.'))!==false) $fileres['ext'] = substr($fileres['basename'], $i+1); else $fileres['ext'] = '';
100c110,111
< $files_matched = array_merge($files_matched,
---
> if (!is_link($filepath) || ($args & PREG_FIND_FOLLOWSYMLINKS))
> $files_matched = array_merge($files_matched,
115a127
> if ($args & PREG_FIND_SORTEXTENSION) $sortby = "['ext']";

The published preg_find has been bumped to v2.3 and includes this change - you can get it via the link at the top of this article.
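The extension logic in the v2.3 diff can be tried in isolation with a sketch like this. The file_ext() wrapper is hypothetical; only the strrpos()/substr() approach comes from the patch.

```php
<?php
// Hypothetical wrapper around the strrpos()/substr() approach: everything
// after the last dot is the extension, and no dot means no extension.
function file_ext(string $basename): string
{
    $i = strrpos($basename, '.');
    return $i === false ? '' : substr($basename, $i + 1);
}

$ext_gz   = file_ext('archive.tar.gz');   // 'gz'
$ext_none = file_ext('README');           // ''
$ext_dot  = file_ext('.htaccess');        // 'htaccess'
```

Note that a dotfile such as .htaccess is treated as having an extension and an empty name, which may or may not be what you want when sorting.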

About this Entry

This page contains a single entry by Paul Gregg published on April 18, 2007 11:48 PM.
