How to use glob() in Python

In Python, the glob module allows you to get a list or an iterator of file and directory paths that satisfy certain conditions using special characters like the wildcard *.

This article uses the glob and os modules in the sample code. As they are part of the standard library, no additional installation is necessary.

import glob
import os

The following files and directories are used as examples.

temp/
├── 1.txt
├── 12.jpg
├── 123.txt
├── [x].txt
├── aaa.jpg
└── dir/
    ├── 987.jpg
    ├── bbb.txt
    ├── sub_dir1/
    │   ├── 98.txt
    │   └── ccc.jpg
    └── sub_dir2/
        └── ddd.jpg

Basic usage of glob()

The first argument to glob() is a path string, which can include special characters like the wildcard *.

The function returns a list of path strings meeting the criteria.

l = glob.glob('temp/*.txt')
print(l)
# ['temp/[x].txt', 'temp/1.txt', 'temp/123.txt']

print(type(l))
# <class 'list'>
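The order of the returned list depends on the file system, so the sample outputs in this article may differ from what you see. The following is a minimal sketch, assuming a throwaway directory created with tempfile that holds a few of the example file names, of how sorted() can give a stable order. The variable names are illustrative only.

```python
import glob
import os
import tempfile

# Build a throwaway copy of part of the example tree (names from the article).
with tempfile.TemporaryDirectory() as root:
    for name in ['1.txt', '12.jpg', '123.txt', '[x].txt', 'aaa.jpg']:
        open(os.path.join(root, name), 'w').close()

    # glob.escape() protects any special characters in the temporary path itself.
    pattern = os.path.join(glob.escape(root), '*.txt')

    # sorted() gives a stable order regardless of the file system.
    txt_names = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(txt_names)
# ['1.txt', '123.txt', '[x].txt']
```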

Wildcards available in glob()

In glob(), you can use wildcards such as * and ? that are used in the Unix shell.

* matches everything

* matches any string, regardless of its length, including zero characters.

print(glob.glob('temp/*'))
# ['temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir', 'temp/1.txt', 'temp/123.txt']

print(glob.glob('temp/*.jpg'))
# ['temp/12.jpg', 'temp/aaa.jpg']

print(glob.glob('temp/dir/*/*.jpg'))
# ['temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir2/ddd.jpg']

? matches any single character

? matches any single character.

For example, to extract paths with file names (excluding extensions) of three characters, use ???.*.

print(glob.glob('temp/???.*'))
# ['temp/[x].txt', 'temp/aaa.jpg', 'temp/123.txt']

[seq] matches any character in seq

[seq] matches any character in seq. For example, [aZ1] matches either a, Z, or 1.

Additionally, you can define a range of characters using a hyphen -. For example, [0-9] matches any digit from 0 to 9, and [a-z] matches any lowercase letter from a to z.

print(glob.glob('temp/[0-9].*'))
# ['temp/1.txt']

print(glob.glob('temp/[0-9][0-9].*'))
# ['temp/12.jpg']

print(glob.glob('temp/[a-z][a-z][a-z].*'))
# ['temp/aaa.jpg']

Prefixing the sequence with ! matches any character not in the brackets. For example, [!a-z] matches any single character that is not a lowercase letter.

print(glob.glob('temp/[!a-z].*'))
# ['temp/1.txt']
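To experiment with these wildcards without touching the file system, you can test patterns against plain names with the fnmatch module, which glob relies on for matching. This is only a sketch: the match() helper is hypothetical, and fnmatch differs from glob in that its * also matches path separators and leading dots, so it is applied here to bare file names only.

```python
import fnmatch

# File and directory names directly under temp/ in the article's example tree.
names = ['1.txt', '12.jpg', '123.txt', '[x].txt', 'aaa.jpg', 'dir']


def match(pattern):
    """Return the names that match pattern, using case-sensitive fnmatch rules."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


print(match('*.jpg'))         # ['12.jpg', 'aaa.jpg']
print(match('???.*'))         # ['123.txt', '[x].txt', 'aaa.jpg']
print(match('[0-9][0-9].*'))  # ['12.jpg']
print(match('[!a-z].*'))      # ['1.txt']
```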

Escape wildcards

To match a wildcard character literally, wrap it in []. For example, [[] matches a literal [, and [*] matches a literal *.

print(glob.glob('temp/[[]*'))
# ['temp/[x].txt']
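Instead of writing the brackets by hand, the standard library also provides glob.escape(), which escapes *, ? and [ in a string. The sketch below uses fnmatch.fnmatchcase() purely as a way to check the patterns without creating files; the variable names are illustrative.

```python
import fnmatch
import glob

literal_name = '[x].txt'

# Unescaped, [x] is a character set, so the pattern matches 'x.txt' instead.
print(fnmatch.fnmatchcase('x.txt', literal_name))    # True
print(fnmatch.fnmatchcase('[x].txt', literal_name))  # False

# glob.escape() wraps each of *, ? and [ in brackets.
escaped = glob.escape(literal_name)
print(escaped)                                       # [[]x].txt
print(fnmatch.fnmatchcase('[x].txt', escaped))       # True
```

This is handy when part of a pattern, such as a user-supplied directory name, must be treated as literal text.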

Get paths recursively: recursive

When the recursive argument of glob() is set to True, ** matches any files and zero or more directories and subdirectories.

While * matches names only within a single directory level, ** can match across multiple directory levels.

print(glob.glob('temp/*/*.jpg'))
# ['temp/dir/987.jpg']

print(glob.glob('temp/**/*.jpg', recursive=True))
# ['temp/12.jpg', 'temp/aaa.jpg', 'temp/dir/987.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir2/ddd.jpg']

You can recursively get a list of all files and directories within a specific directory.

print(glob.glob('temp/**', recursive=True))
# ['temp/', 'temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir', 'temp/dir/987.jpg', 'temp/dir/sub_dir1', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir1/98.txt', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2', 'temp/dir/sub_dir2/ddd.jpg', 'temp/1.txt', 'temp/123.txt']

However, using ** might take a long time when there are numerous files and directories. So, if possible, it's recommended to use other special characters to define the conditions.
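One way to narrow a recursive search is to start ** lower in the tree. The following sketch assumes a temporary copy of part of the example tree; the rel() helper is hypothetical and only makes the output easier to read.

```python
import glob
import os
import tempfile


def rel(paths, root):
    """Return sorted paths relative to root, using / as the separator."""
    return sorted(os.path.relpath(p, root).replace(os.sep, '/') for p in paths)


with tempfile.TemporaryDirectory() as root:
    # Recreate part of the article's example tree.
    for d in ['dir/sub_dir1', 'dir/sub_dir2']:
        os.makedirs(os.path.join(root, *d.split('/')))
    for f in ['12.jpg', 'aaa.jpg', 'dir/987.jpg', 'dir/bbb.txt',
              'dir/sub_dir1/ccc.jpg', 'dir/sub_dir2/ddd.jpg']:
        open(os.path.join(root, *f.split('/')), 'w').close()

    base = glob.escape(root)
    # Walks the whole tree.
    everything = rel(
        glob.glob(os.path.join(base, '**', '*.jpg'), recursive=True), root)
    # Starts lower down, so only dir/ and its subdirectories are walked.
    only_dir = rel(
        glob.glob(os.path.join(base, 'dir', '**', '*.jpg'), recursive=True), root)

print(everything)
# ['12.jpg', 'aaa.jpg', 'dir/987.jpg', 'dir/sub_dir1/ccc.jpg', 'dir/sub_dir2/ddd.jpg']
print(only_dir)
# ['dir/987.jpg', 'dir/sub_dir1/ccc.jpg', 'dir/sub_dir2/ddd.jpg']
```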

Set the root directory: root_dir

The root_dir argument of glob(), available in Python 3.10 and later, specifies the root directory for the search. glob() behaves as if the current directory were changed to root_dir before the call, although the actual current directory remains unchanged.

The default value is root_dir=None, keeping the current directory as the base.

print(glob.glob('temp/*.txt'))
# ['temp/[x].txt', 'temp/1.txt', 'temp/123.txt']

print(glob.glob('*.txt', root_dir='temp'))
# ['[x].txt', '1.txt', '123.txt']

When root_dir is set, the result is a relative path from root_dir. So, be careful when passing it to functions expecting a relative path from the current directory. You may need to concatenate root_dir and the result.

Get only file names

To retrieve only file names, use os.path.isfile() in the condition of a list comprehension to check whether each path is a file.

print([p for p in glob.glob('temp/**', recursive=True) if os.path.isfile(p)])
# ['temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir/987.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir1/98.txt', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2/ddd.jpg', 'temp/1.txt', 'temp/123.txt']

If retaining information about the parent directory is not necessary, you can utilize os.path.basename() to extract solely the file name.

print([os.path.basename(p) for p in glob.glob('temp/**', recursive=True)
       if os.path.isfile(p)])
# ['[x].txt', '12.jpg', 'aaa.jpg', '987.jpg', 'ccc.jpg', '98.txt', 'bbb.txt', 'ddd.jpg', '1.txt', '123.txt']

If you want to retain information about the intermediate directory, specify the root_dir argument. As mentioned above, when root_dir is set, the result is a relative path from root_dir, so pass it to os.path.isfile() after concatenating with root_dir.

print([p for p in glob.glob('**', recursive=True, root_dir='temp')
       if os.path.isfile(os.path.join('temp', p))])
# ['[x].txt', '12.jpg', 'aaa.jpg', 'dir/987.jpg', 'dir/sub_dir1/ccc.jpg', 'dir/sub_dir1/98.txt', 'dir/bbb.txt', 'dir/sub_dir2/ddd.jpg', '1.txt', '123.txt']

Get only directory names

To obtain only directory names, you can use os.path.isdir(), or simply append a directory separator at the end of ** for easier implementation.

print(glob.glob('temp/**/', recursive=True))
# ['temp/', 'temp/dir/', 'temp/dir/sub_dir1/', 'temp/dir/sub_dir2/']
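For comparison, the os.path.isdir() approach might be sketched as follows. It assumes a temporary directory that mimics part of the example tree, and it prints paths relative to that directory, so the root itself appears as '.'.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'dir', 'sub_dir1'))
    os.makedirs(os.path.join(root, 'dir', 'sub_dir2'))
    open(os.path.join(root, 'aaa.jpg'), 'w').close()
    open(os.path.join(root, 'dir', 'bbb.txt'), 'w').close()

    pattern = os.path.join(glob.escape(root), '**')
    # Keep only the paths that are directories.
    dirs = [p for p in glob.glob(pattern, recursive=True) if os.path.isdir(p)]

    # Show the result relative to the temporary root; '.' is the root itself.
    dir_names = sorted(
        os.path.relpath(p, root).replace(os.sep, '/') for p in dirs)

print(dir_names)
# ['.', 'dir', 'dir/sub_dir1', 'dir/sub_dir2']
```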

To exclude the specified directory itself, use */**/ as shown below, or specify the root_dir argument.

print(glob.glob('temp/*/**/', recursive=True))
# ['temp/dir/', 'temp/dir/sub_dir1/', 'temp/dir/sub_dir2/']

print(glob.glob('**/', recursive=True, root_dir='temp'))
# ['dir/', 'dir/sub_dir1/', 'dir/sub_dir2/']

If you don't need the separator at the end of the result, you can remove it with rstrip(). The separator for each OS can be obtained with os.sep.

print([p.rstrip(os.sep) for p in glob.glob('temp/**/', recursive=True)])
# ['temp', 'temp/dir', 'temp/dir/sub_dir1', 'temp/dir/sub_dir2']

If the parent directory information is unnecessary, os.path.basename() can be used.

print([os.path.basename(p.rstrip(os.sep)) for p
       in glob.glob('temp/**/', recursive=True)])
# ['temp', 'dir', 'sub_dir1', 'sub_dir2']

print([os.path.basename(p.rstrip(os.sep)) + os.sep for p
       in glob.glob('temp/**/', recursive=True)])
# ['temp/', 'dir/', 'sub_dir1/', 'sub_dir2/']

Specify conditions with regex

While wildcards like * and ? can define certain conditions, for more complex criteria, use the re module for regular expressions.

Use glob() to generate a list of files and directories recursively, then apply re.search() to this list.

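For example, the first call below extracts paths that contain one or more digits followed by .txt, and the second extracts paths that contain three non-digit characters followed by .txt or .jpg. With the example files, these are the files whose names consist solely of digits with the txt extension, and the files whose names are three non-digit characters with the txt or jpg extension.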

import re

print([p for p in glob.glob('temp/**', recursive=True)
       if re.search(r'\d+\.txt', p)])
# ['temp/dir/sub_dir1/98.txt', 'temp/1.txt', 'temp/123.txt']

print([p for p in glob.glob('temp/**', recursive=True)
       if re.search(r'\D{3}\.(txt|jpg)', p)])
# ['temp/[x].txt', 'temp/aaa.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2/ddd.jpg']

\d matches digits, \D matches non-digit characters, {n} matches n repetitions, + matches one or more repetitions, and (a|b) matches either a or b. Since . is also a special character, it needs to be escaped as \.
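Note that re.search() accepts a match anywhere in the path, so a name such as a1.txt would also pass a digits-then-.txt pattern. If the whole file name must fit, one option is to apply re.fullmatch() to the base name. The sketch below works on a plain list of strings; 'temp/a1.txt' is a hypothetical path added for illustration and is not part of the example tree.

```python
import os
import re

# Sample paths; 'temp/a1.txt' is a hypothetical extra file, not in the example tree.
paths = ['temp/1.txt', 'temp/123.txt', 'temp/a1.txt',
         'temp/dir/sub_dir1/98.txt', 'temp/aaa.jpg']

digits_txt = re.compile(r'\d+\.txt')

# search() accepts a match anywhere in the path, so 'a1.txt' slips through.
unanchored = [p for p in paths if digits_txt.search(p)]

# fullmatch() on the base name requires the whole file name to fit the pattern.
anchored = [p for p in paths if digits_txt.fullmatch(os.path.basename(p))]

print(unanchored)
# ['temp/1.txt', 'temp/123.txt', 'temp/a1.txt', 'temp/dir/sub_dir1/98.txt']
print(anchored)
# ['temp/1.txt', 'temp/123.txt', 'temp/dir/sub_dir1/98.txt']
```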

Since glob() returns a list of strings, you can also extract elements using the in operator, string methods, etc., in addition to regular expressions.

Get as an iterator: iglob()

As shown in the previous examples, glob() generates a list.

If you are processing the extracted paths with a for loop, it can be more memory efficient to use an iterator instead of a list.

The iglob() function accepts the same arguments as glob() and returns an iterator.

print(type(glob.iglob('temp/*.txt')))
# <class 'generator'>

for p in glob.iglob('temp/*.txt'):
    print(p)
# temp/[x].txt
# temp/1.txt
# temp/123.txt
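An iterator is also convenient when you only need the first match. The following sketch, assuming a temporary directory with a few of the example file names, takes a single path with next(). Which file comes first depends on the file system.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    for name in ['1.txt', '123.txt', 'aaa.jpg']:
        open(os.path.join(root, name), 'w').close()

    matches = glob.iglob(os.path.join(glob.escape(root), '*.txt'))
    # next() pulls one path from the iterator; None is returned if nothing matches.
    first = next(matches, None)
    first_name = os.path.basename(first) if first else None

print(first_name)  # either '1.txt' or '123.txt', depending on the file system
```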
