How to Use glob() in Python

Posted: 2023-07-28 | Tags: Python, File

In Python, the glob module allows you to get a list or an iterator of file and directory paths that satisfy certain conditions using special characters like the wildcard *.

glob — Unix style pathname pattern expansion — Python 3.11.4 documentation

Contents

Basic usage of glob()
Wildcards available in glob()
Get paths recursively: recursive
Set the root directory: root_dir
Get only file names
Get only directory names
Specify conditions with regex
Get as an iterator: iglob()

This article uses the glob and os modules in the sample code. Both are part of the standard library, so no additional installation is necessary.

import glob
import os

source: glob_usage.py

The following files and directories are used as examples.

temp/
├── 1.txt
├── 12.jpg
├── 123.txt
├── [x].txt
├── aaa.jpg
└── dir/
    ├── 987.jpg
    ├── bbb.txt
    ├── sub_dir1/
    │   ├── 98.txt
    │   └── ccc.jpg
    └── sub_dir2/
        └── ddd.jpg

source: glob_usage.py

Basic usage of `glob()`

The first argument to glob() is a path string, which can include special characters like the wildcard *.

The function returns a list of path strings meeting the criteria.

l = glob.glob('temp/*.txt')
print(l)
# ['temp/[x].txt', 'temp/1.txt', 'temp/123.txt']

print(type(l))
# <class 'list'>

source: glob_usage.py
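The order of the list returned by glob() depends on the file system and is not guaranteed, which is why the sample outputs in this article are not alphabetical. A minimal sketch of getting a stable order with sorted(), assuming a throwaway copy of the top-level sample files created with tempfile (the temporary location is only for illustration):

```python
import glob
import os
import tempfile

# Build a throwaway copy of the top level of the sample tree
# (the temporary location is an assumption made for this sketch).
with tempfile.TemporaryDirectory() as base:
    temp = os.path.join(base, 'temp')
    os.makedirs(temp)
    for name in ['1.txt', '12.jpg', '123.txt', '[x].txt', 'aaa.jpg']:
        open(os.path.join(temp, name), 'w').close()

    # glob() makes no promise about order, so sort explicitly when it matters.
    paths = sorted(glob.glob(os.path.join(temp, '*.txt')))
    names = [os.path.basename(p) for p in paths]

print(names)
# ['1.txt', '123.txt', '[x].txt']
```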

Wildcards available in `glob()`

In glob(), you can use wildcards such as * and ? that are used in the Unix shell.

fnmatch — Unix filename pattern matching — Python 3.11.4 documentation

`*` matches everything

* matches any string, regardless of its length, including zero characters.

print(glob.glob('temp/*'))
# ['temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir', 'temp/1.txt', 'temp/123.txt']

print(glob.glob('temp/*.jpg'))
# ['temp/12.jpg', 'temp/aaa.jpg']

print(glob.glob('temp/dir/*/*.jpg'))
# ['temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir2/ddd.jpg']

source: glob_usage.py

`?` matches any single character

? matches any single character.

For example, to extract paths with file names (excluding extensions) of three characters, use ???.*.

print(glob.glob('temp/???.*'))
# ['temp/[x].txt', 'temp/aaa.jpg', 'temp/123.txt']

source: glob_usage.py

`[seq]` matches any character in `seq`

[seq] matches any character in seq. For example, [aZ1] matches either a, Z, or 1.

Additionally, you can define a range of characters using a hyphen -. For example, [0-9] matches any digit from 0 to 9, and [a-z] matches any lowercase letter from a to z.

print(glob.glob('temp/[0-9].*'))
# ['temp/1.txt']

print(glob.glob('temp/[0-9][0-9].*'))
# ['temp/12.jpg']

print(glob.glob('temp/[a-z][a-z][a-z].*'))
# ['temp/aaa.jpg']

source: glob_usage.py

Prefixing the sequence with ! matches any single character not in the brackets. For example, [!a-z] matches any single character that is not a lowercase letter.

print(glob.glob('temp/[!a-z].*'))
# ['temp/1.txt']

source: glob_usage.py
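To try out a pattern without touching the file system, the fnmatch module linked above can check plain file names against the same wildcards. The following is a rough sketch using fnmatch.fnmatchcase() on the top-level sample names; matching() is a hypothetical helper written for this example. Note that fnmatch does not treat the path separator specially, so this is only a fair comparison for bare file names, not for full paths.

```python
from fnmatch import fnmatchcase

# Top-level file names from the sample tree.
names = ['1.txt', '12.jpg', '123.txt', '[x].txt', 'aaa.jpg']


def matching(pattern):
    """Return the sample names that match a shell-style pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]


print(matching('???.*'))
# ['123.txt', '[x].txt', 'aaa.jpg']

print(matching('[0-9].*'))
# ['1.txt']

print(matching('[!a-z].*'))
# ['1.txt']

print(matching('[!0-9]*'))
# ['[x].txt', 'aaa.jpg']
```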

Escape wildcards

To match a wildcard character literally, wrap it in []. For example, [[] matches a literal [, and [*] matches a literal *.

print(glob.glob('temp/[[]*'))
# ['temp/[x].txt']

source: glob_usage.py
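When a whole name may contain special characters, escaping each one by hand is error-prone. The standard library provides glob.escape(), which wraps every *, ?, and [ in brackets. A small sketch, assuming a temporary directory that holds both [x].txt and a hypothetical x.txt to show the difference:

```python
import glob
import os
import tempfile

print(glob.escape('[x].txt'))
# [[]x].txt

with tempfile.TemporaryDirectory() as base:
    # 'x.txt' is not part of the sample tree; it is added for contrast.
    for name in ['[x].txt', 'x.txt']:
        open(os.path.join(base, name), 'w').close()

    # Unescaped, [x] is a character class and matches the file named x.txt.
    unescaped = [os.path.basename(p)
                 for p in glob.glob(os.path.join(base, '[x].txt'))]

    # Escaped, the brackets are literal and match the file named [x].txt.
    escaped = [os.path.basename(p)
               for p in glob.glob(os.path.join(base, glob.escape('[x].txt')))]

print(unescaped)
# ['x.txt']

print(escaped)
# ['[x].txt']
```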

Get paths recursively: `recursive`

When the recursive argument of glob() is set to True, the pattern ** matches any files and zero or more directories and subdirectories.

While * only matches names within a single directory level, ** can match across multiple directory levels.

print(glob.glob('temp/*/*.jpg'))
# ['temp/dir/987.jpg']

print(glob.glob('temp/**/*.jpg', recursive=True))
# ['temp/12.jpg', 'temp/aaa.jpg', 'temp/dir/987.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir2/ddd.jpg']

source: glob_usage.py
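A common pitfall is writing ** but forgetting recursive=True. In that case ** behaves like a plain *, so only one directory level is searched. The sketch below illustrates this on a temporary copy of the .jpg files from the sample tree; rel() is a hypothetical helper that turns the results into sorted paths relative to temp with forward slashes.

```python
import glob
import os
import tempfile

jpg_files = ['12.jpg', 'aaa.jpg', 'dir/987.jpg',
             'dir/sub_dir1/ccc.jpg', 'dir/sub_dir2/ddd.jpg']


def rel(paths, start):
    """Sorted paths relative to start, using forward slashes."""
    return sorted(os.path.relpath(p, start).replace(os.sep, '/')
                  for p in paths)


with tempfile.TemporaryDirectory() as base:
    temp = os.path.join(base, 'temp')
    for f in jpg_files:
        path = os.path.join(temp, *f.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()

    pattern = os.path.join(temp, '**', '*.jpg')

    # Without recursive=True, ** is treated like *: exactly one level down.
    without_flag = rel(glob.glob(pattern), temp)

    # With recursive=True, ** spans zero or more directory levels.
    with_flag = rel(glob.glob(pattern, recursive=True), temp)

print(without_flag)
# ['dir/987.jpg']

print(with_flag)
# ['12.jpg', 'aaa.jpg', 'dir/987.jpg', 'dir/sub_dir1/ccc.jpg', 'dir/sub_dir2/ddd.jpg']
```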

You can recursively get a list of all files and directories within a specific directory.

print(glob.glob('temp/**', recursive=True))
# ['temp/', 'temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir', 'temp/dir/987.jpg', 'temp/dir/sub_dir1', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir1/98.txt', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2', 'temp/dir/sub_dir2/ddd.jpg', 'temp/1.txt', 'temp/123.txt']

source: glob_usage.py

However, using ** can take a long time when the tree contains many files and directories. When possible, narrow the search with a more specific pattern, such as a longer fixed prefix or a file extension.

Set the root directory: `root_dir`

The root_dir argument of glob(), added in Python 3.10, specifies the root directory for the search. glob() behaves as if the current directory had been switched to root_dir before execution, although the actual current directory remains unchanged.

The default value is root_dir=None, which keeps the current directory as the base.

print(glob.glob('temp/*.txt'))
# ['temp/[x].txt', 'temp/1.txt', 'temp/123.txt']

print(glob.glob('*.txt', root_dir='temp'))
# ['[x].txt', '1.txt', '123.txt']

source: glob_usage.py

When root_dir is set, the result is a relative path from root_dir. So, be careful when passing it to functions expecting a relative path from the current directory. You may need to concatenate root_dir and the result.

Get only file names

To retrieve only files, use os.path.isfile() as the condition of a list comprehension to check whether each path is a file.

print([p for p in glob.glob('temp/**', recursive=True) if os.path.isfile(p)])
# ['temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir/987.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir1/98.txt', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2/ddd.jpg', 'temp/1.txt', 'temp/123.txt']

source: glob_usage.py

If retaining information about the parent directory is not necessary, you can utilize os.path.basename() to extract solely the file name.

Get the filename, directory, extension from a path string in Python

print([os.path.basename(p) for p in glob.glob('temp/**', recursive=True)
       if os.path.isfile(p)])
# ['[x].txt', '12.jpg', 'aaa.jpg', '987.jpg', 'ccc.jpg', '98.txt', 'bbb.txt', 'ddd.jpg', '1.txt', '123.txt']

source: glob_usage.py

If you want to retain information about the intermediate directory, specify the root_dir argument. As mentioned above, when root_dir is set, the result is a relative path from root_dir, so pass it to os.path.isfile() after concatenating with root_dir.

print([p for p in glob.glob('**', recursive=True, root_dir='temp')
       if os.path.isfile(os.path.join('temp', p))])
# ['[x].txt', '12.jpg', 'aaa.jpg', 'dir/987.jpg', 'dir/sub_dir1/ccc.jpg', 'dir/sub_dir1/98.txt', 'dir/bbb.txt', 'dir/sub_dir2/ddd.jpg', '1.txt', '123.txt']

source: glob_usage.py

Get only directory names

To obtain only directory names, you can use os.path.isdir(), or simply append a directory separator at the end of ** for easier implementation.

print(glob.glob('temp/**/', recursive=True))
# ['temp/', 'temp/dir/', 'temp/dir/sub_dir1/', 'temp/dir/sub_dir2/']

source: glob_usage.py

To exclude the specified directory itself, use */**/ as shown below, or specify the root_dir argument.

print(glob.glob('temp/*/**/', recursive=True))
# ['temp/dir/', 'temp/dir/sub_dir1/', 'temp/dir/sub_dir2/']

print(glob.glob('**/', recursive=True, root_dir='temp'))
# ['dir/', 'dir/sub_dir1/', 'dir/sub_dir2/']

source: glob_usage.py

If you don't need the separator at the end of the result, you can remove it with rstrip(). The separator for each OS can be obtained with os.sep.

Remove a part of a string (substring) in Python

print([p.rstrip(os.sep) for p in glob.glob('temp/**/', recursive=True)])
# ['temp', 'temp/dir', 'temp/dir/sub_dir1', 'temp/dir/sub_dir2']

source: glob_usage.py

If the parent directory information is unnecessary, os.path.basename() can be used.

print([os.path.basename(p.rstrip(os.sep)) for p
       in glob.glob('temp/**/', recursive=True)])
# ['temp', 'dir', 'sub_dir1', 'sub_dir2']

print([os.path.basename(p.rstrip(os.sep)) + os.sep for p
       in glob.glob('temp/**/', recursive=True)])
# ['temp/', 'dir/', 'sub_dir1/', 'sub_dir2/']

source: glob_usage.py
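The os.path.isdir() alternative mentioned at the start of this section is not shown above. A minimal sketch, assuming a temporary copy of the sample directories plus two of the sample files; the results are converted to paths relative to the temporary base so they are comparable to the outputs above:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    temp = os.path.join(base, 'temp')
    for d in ['dir/sub_dir1', 'dir/sub_dir2']:
        os.makedirs(os.path.join(temp, *d.split('/')))
    # A couple of files, to confirm they are filtered out.
    open(os.path.join(temp, '1.txt'), 'w').close()
    open(os.path.join(temp, 'dir', 'bbb.txt'), 'w').close()

    # Keep only the entries that are directories.
    dirs = sorted(os.path.relpath(p, base).replace(os.sep, '/')
                  for p in glob.glob(os.path.join(temp, '**'), recursive=True)
                  if os.path.isdir(p))

print(dirs)
# ['temp', 'temp/dir', 'temp/dir/sub_dir1', 'temp/dir/sub_dir2']
```

Unlike the trailing-separator pattern, this version leaves no separator at the end of each result, at the cost of one extra file system check per path.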

Specify conditions with regex

While wildcards like * and ? can define certain conditions, for more complex criteria, use the re module for regular expressions.

Use glob() to generate a list of files and directories recursively, then apply re.search() to this list.

For example, you can extract files that either have a file name composed solely of numbers with an extension of txt, or files with a file name of 3 characters that are not numbers with an extension of either txt or jpg.

import re

print([p for p in glob.glob('temp/**', recursive=True)
       if re.search(r'\d+\.txt', p)])
# ['temp/dir/sub_dir1/98.txt', 'temp/1.txt', 'temp/123.txt']

print([p for p in glob.glob('temp/**', recursive=True)
       if re.search(r'\D{3}\.(txt|jpg)', p)])
# ['temp/[x].txt', 'temp/aaa.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2/ddd.jpg']

source: glob_usage.py

\d matches digits, \D matches non-digit characters, {n} matches n repetitions, + matches one or more repetitions, and (a|b) matches either a or b. Since . is also a special character, it needs to be escaped as \. See the following article for more details.

Regular expressions with the re module in Python

Since glob() returns a list of strings, you can also extract elements using the in operator, string methods, etc., in addition to regular expressions.

Extract and replace elements that meet the conditions of a list of strings in Python
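Note that re.search() matches anywhere in the path, so a pattern like \d+\.txt also accepts a name such as a1.txt. If the file name must consist solely of digits, one option is to apply re.fullmatch() to the file name only. The sketch below works on a plain list of path strings, as glob() might return them; temp/a1.txt is a hypothetical extra file added to show the difference.

```python
import os
import re

# Paths as glob() might return them; 'temp/a1.txt' is a hypothetical
# extra file that is not part of the sample tree.
paths = ['temp/1.txt', 'temp/a1.txt', 'temp/123.txt', 'temp/12.jpg']

digits_txt = re.compile(r'\d+\.txt')

# search() accepts a match anywhere in the path.
loose = [p for p in paths if digits_txt.search(p)]

# fullmatch() on the base name requires the whole file name to fit.
strict = [p for p in paths if digits_txt.fullmatch(os.path.basename(p))]

print(loose)
# ['temp/1.txt', 'temp/a1.txt', 'temp/123.txt']

print(strict)
# ['temp/1.txt', 'temp/123.txt']
```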

Get as an iterator: `iglob()`

As shown in the previous examples, glob() generates a list.

If you are processing the extracted paths with a for loop, it can be more memory efficient to use an iterator instead of a list.

The iglob() function accepts the same arguments as glob() and returns an iterator.

print(type(glob.iglob('temp/*.txt')))
# <class 'generator'>

for p in glob.iglob('temp/*.txt'):
    print(p)
# temp/[x].txt
# temp/1.txt
# temp/123.txt

source: glob_usage.py
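Because iglob() returns an iterator, it can be consumed partially, and each path is yielded only once. A rough sketch using itertools.islice() to take the first two matches, assuming five hypothetical .txt files in a temporary directory (which two come first depends on the file system, so no specific names are shown):

```python
import glob
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Five placeholder files created only for this sketch.
    for i in range(5):
        open(os.path.join(base, f'{i}.txt'), 'w').close()

    it = glob.iglob(os.path.join(base, '*.txt'))

    # Take only the first two matches without building the full list.
    first_two = [os.path.basename(p) for p in itertools.islice(it, 2)]

    # The iterator remembers its position; three paths are left.
    remaining = sum(1 for _ in it)

print(len(first_two))
# 2

print(remaining)
# 3
```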
