How to use glob() in Python
In Python, the glob module allows you to get a list or an iterator of file and directory paths that match a pattern containing special characters such as the wildcard *.
This article uses the glob and os modules in the sample code. As they are part of the standard library, no additional installation is necessary.
import glob
import os
The following files and directories are used as examples.
temp/
├── 1.txt
├── 12.jpg
├── 123.txt
├── [x].txt
├── aaa.jpg
└── dir/
    ├── 987.jpg
    ├── bbb.txt
    ├── sub_dir1/
    │   ├── 98.txt
    │   └── ccc.jpg
    └── sub_dir2/
        └── ddd.jpg
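To follow along, the tree above can be recreated with a short script. This is a minimal sketch: the names EXAMPLE_FILES and make_example_tree are hypothetical helpers, not part of the glob module, and the files are created empty inside a throwaway directory from tempfile.

```python
import os
import tempfile

# Relative file paths from the tree above; parent directories are created as needed.
EXAMPLE_FILES = [
    '1.txt', '12.jpg', '123.txt', '[x].txt', 'aaa.jpg',
    'dir/987.jpg', 'dir/bbb.txt',
    'dir/sub_dir1/98.txt', 'dir/sub_dir1/ccc.jpg',
    'dir/sub_dir2/ddd.jpg',
]


def make_example_tree(base):
    """Create temp/ with the example files under `base` and return its path."""
    root = os.path.join(base, 'temp')
    for rel in EXAMPLE_FILES:
        path = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'w'):
            pass  # an empty file is enough for path matching
    return root


scratch = tempfile.mkdtemp()  # throwaway parent directory (an assumption of this sketch)
example_root = make_example_tree(scratch)
print(sorted(os.listdir(example_root)))
# ['1.txt', '12.jpg', '123.txt', '[x].txt', 'aaa.jpg', 'dir']
```

Calling make_example_tree('.') instead would create temp/ in the current directory, which is the layout the relative patterns in this article assume.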
Basic usage of glob()
The first argument to glob() is a path string, which can include special characters like the wildcard *.
The function returns a list of the matching path strings. The order of the list is arbitrary and depends on the file system.
l = glob.glob('temp/*.txt')
print(l)
# ['temp/[x].txt', 'temp/1.txt', 'temp/123.txt']
print(type(l))
# <class 'list'>
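Because the order of the returned list depends on the file system, code that needs a predictable order can wrap the call in sorted(). The following is a minimal sketch that builds its own throwaway directory with tempfile (an assumption made so it runs anywhere), rather than relying on the temp/ tree.

```python
import glob
import os
import tempfile

# Build a small throwaway directory so the snippet is self-contained.
base = tempfile.mkdtemp()
for name in ['1.txt', '123.txt', '[x].txt', '12.jpg']:
    open(os.path.join(base, name), 'w').close()

# glob.escape() guards against special characters in the temporary path itself.
pattern = os.path.join(glob.escape(base), '*.txt')

# sorted() gives a stable, lexicographic order regardless of the file system.
names = sorted(os.path.basename(p) for p in glob.glob(pattern))
print(names)
# ['1.txt', '123.txt', '[x].txt']
```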
Wildcards available in glob()
In glob(), you can use wildcards such as * and ? that are used in the Unix shell.
* matches everything
* matches any string of any length, including the empty string. It does not match across directory separators, and by default it does not match names that begin with a dot.
print(glob.glob('temp/*'))
# ['temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir', 'temp/1.txt', 'temp/123.txt']
print(glob.glob('temp/*.jpg'))
# ['temp/12.jpg', 'temp/aaa.jpg']
print(glob.glob('temp/dir/*/*.jpg'))
# ['temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir2/ddd.jpg']
? matches any single character
? matches any single character.
For example, to extract paths whose file names (excluding extensions) are three characters long, use ???.*.
print(glob.glob('temp/???.*'))
# ['temp/[x].txt', 'temp/aaa.jpg', 'temp/123.txt']
[seq] matches any character in seq
[seq] matches any single character in seq. For example, [aZ1] matches either a, Z, or 1.
Additionally, you can define a range of characters using a hyphen -. For example, [0-9] matches any digit from 0 to 9, and [a-z] matches any lowercase letter from a to z.
print(glob.glob('temp/[0-9].*'))
# ['temp/1.txt']
print(glob.glob('temp/[0-9][0-9].*'))
# ['temp/12.jpg']
print(glob.glob('temp/[a-z][a-z][a-z].*'))
# ['temp/aaa.jpg']
Prefixing the sequence with ! matches any character not in the brackets. For example, [!a-z] matches any single character that is not a lowercase letter.
print(glob.glob('temp/[!a-z].*'))
# ['temp/1.txt']
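Patterns can also be tried out against plain strings, without touching the file system, using the standard library's fnmatch module, which applies the same wildcard syntax. This is a sketch for experimenting only: the names list and the matching() helper are hypothetical, and unlike glob(), fnmatch does not treat directory separators or leading dots specially.

```python
import fnmatch

# File names from the top level of the example tree.
names = ['1.txt', '12.jpg', '123.txt', '[x].txt', 'aaa.jpg']


def matching(pattern):
    """Return the names that match a shell-style pattern (case-sensitive)."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


print(matching('???.*'))
# ['123.txt', '[x].txt', 'aaa.jpg']
print(matching('[0-9]*.txt'))
# ['1.txt', '123.txt']
print(matching('[!0-9]*'))
# ['[x].txt', 'aaa.jpg']
```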
Escape wildcards
To match a special character literally, wrap it in []. For example, [[] matches a literal [.
print(glob.glob('temp/[[]*'))
# ['temp/[x].txt']
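When a whole file name or directory path should be taken literally, the glob module also provides glob.escape(), which applies this bracket wrapping to every *, ? and [ in a string. The sketch below uses a throwaway directory from tempfile (an assumption so that it runs anywhere) to compare an unescaped and an escaped pattern.

```python
import glob
import os
import tempfile

base = tempfile.mkdtemp()
for name in ['[x].txt', 'x.txt']:
    open(os.path.join(base, name), 'w').close()

safe_base = glob.escape(base)  # the directory part is always taken literally

# Unescaped, [x] is read as a character class and matches the single character "x".
unescaped_names = [os.path.basename(p)
                   for p in glob.glob(os.path.join(safe_base, '[x].txt'))]
print(unescaped_names)
# ['x.txt']

# glob.escape() rewrites each of *, ? and [ as a one-character class.
print(glob.escape('[x].txt'))
# [[]x].txt

escaped_names = [os.path.basename(p)
                 for p in glob.glob(os.path.join(safe_base, glob.escape('[x].txt')))]
print(escaped_names)
# ['[x].txt']
```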
Get paths recursively: recursive
By setting the recursive argument of glob() to True and using **, the pattern matches any files and zero or more directories and subdirectories.
While * matches names within a single directory level only, ** can match across multiple directory levels.
print(glob.glob('temp/*/*.jpg'))
# ['temp/dir/987.jpg']
print(glob.glob('temp/**/*.jpg', recursive=True))
# ['temp/12.jpg', 'temp/aaa.jpg', 'temp/dir/987.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir2/ddd.jpg']
You can recursively get a list of all files and directories within a specific directory.
print(glob.glob('temp/**', recursive=True))
# ['temp/', 'temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir', 'temp/dir/987.jpg', 'temp/dir/sub_dir1', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir1/98.txt', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2', 'temp/dir/sub_dir2/ddd.jpg', 'temp/1.txt', 'temp/123.txt']
However, using ** can take a long time on large directory trees. Where possible, narrow the search with a more specific pattern instead.
Set the root directory: root_dir
The root_dir argument of glob(), available in Python 3.10 and later, specifies the root directory for the search. glob() then behaves as if the current directory had been switched to root_dir before execution, although the actual current directory remains unchanged.
The default value is root_dir=None, which keeps the current directory as the base.
print(glob.glob('temp/*.txt'))
# ['temp/[x].txt', 'temp/1.txt', 'temp/123.txt']
print(glob.glob('*.txt', root_dir='temp'))
# ['[x].txt', '1.txt', '123.txt']
When root_dir is set, each result is a path relative to root_dir. Be careful when passing it to functions that expect a path relative to the current directory; you may need to join root_dir and the result first.
Get only file names
To retrieve only files, filter the results with os.path.isfile() in the condition of a list comprehension.
print([p for p in glob.glob('temp/**', recursive=True) if os.path.isfile(p)])
# ['temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir/987.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/sub_dir1/98.txt', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2/ddd.jpg', 'temp/1.txt', 'temp/123.txt']
If you do not need the parent directory, use os.path.basename() to extract only the file name.
print([os.path.basename(p) for p in glob.glob('temp/**', recursive=True)
       if os.path.isfile(p)])
# ['[x].txt', '12.jpg', 'aaa.jpg', '987.jpg', 'ccc.jpg', '98.txt', 'bbb.txt', 'ddd.jpg', '1.txt', '123.txt']
If you want to retain information about the intermediate directories, specify the root_dir argument. As mentioned above, when root_dir is set, the result is a path relative to root_dir, so join it with root_dir before passing it to os.path.isfile().
print([p for p in glob.glob('**', recursive=True, root_dir='temp')
       if os.path.isfile(os.path.join('temp', p))])
# ['[x].txt', '12.jpg', 'aaa.jpg', 'dir/987.jpg', 'dir/sub_dir1/ccc.jpg', 'dir/sub_dir1/98.txt', 'dir/bbb.txt', 'dir/sub_dir2/ddd.jpg', '1.txt', '123.txt']
Get only directory names
To obtain only directory names, you can use os.path.isdir(), or simply append a directory separator to the end of **, which is easier.
print(glob.glob('temp/**/', recursive=True))
# ['temp/', 'temp/dir/', 'temp/dir/sub_dir1/', 'temp/dir/sub_dir2/']
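The os.path.isdir() approach mentioned above works like the os.path.isfile() filter in the previous section. The sketch below is illustrative only: it builds a small throwaway tree with tempfile (an assumption so that it runs anywhere) and reports directories relative to that tree, with separators normalized to / for display.

```python
import glob
import os
import tempfile

base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'dir', 'sub_dir1'))
os.makedirs(os.path.join(base, 'dir', 'sub_dir2'))
open(os.path.join(base, 'aaa.jpg'), 'w').close()
open(os.path.join(base, 'dir', 'bbb.txt'), 'w').close()

pattern = os.path.join(glob.escape(base), '**')

# Keep only the paths that are directories; '.' stands for the starting directory itself.
dirs = sorted(
    os.path.relpath(p, base).replace(os.sep, '/')
    for p in glob.glob(pattern, recursive=True)
    if os.path.isdir(p)
)
print(dirs)
# ['.', 'dir', 'dir/sub_dir1', 'dir/sub_dir2']
```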
To exclude the specified directory itself, use */**/ as shown below, or specify the root_dir argument.
print(glob.glob('temp/*/**/', recursive=True))
# ['temp/dir/', 'temp/dir/sub_dir1/', 'temp/dir/sub_dir2/']
print(glob.glob('**/', recursive=True, root_dir='temp'))
# ['dir/', 'dir/sub_dir1/', 'dir/sub_dir2/']
If you don't need the separator at the end of the result, you can remove it with rstrip(). The separator for each OS can be obtained with os.sep.
print([p.rstrip(os.sep) for p in glob.glob('temp/**/', recursive=True)])
# ['temp', 'temp/dir', 'temp/dir/sub_dir1', 'temp/dir/sub_dir2']
If the parent directory information is unnecessary, os.path.basename() can be used.
print([os.path.basename(p.rstrip(os.sep)) for p
       in glob.glob('temp/**/', recursive=True)])
# ['temp', 'dir', 'sub_dir1', 'sub_dir2']
print([os.path.basename(p.rstrip(os.sep)) + os.sep for p
       in glob.glob('temp/**/', recursive=True)])
# ['temp/', 'dir/', 'sub_dir1/', 'sub_dir2/']
Specify conditions with regex
While wildcards like * and ? can define certain conditions, for more complex criteria, use the re module for regular expressions.
Use glob() to generate a list of files and directories recursively, then apply re.search() to this list.
For example, the first pattern below extracts txt files whose names consist solely of digits, and the second extracts txt or jpg files whose names are three non-digit characters.
import re
# Raw strings (r'...') keep backslash sequences such as \d intact.
print([p for p in glob.glob('temp/**', recursive=True)
       if re.search(r'\d+\.txt', p)])
# ['temp/dir/sub_dir1/98.txt', 'temp/1.txt', 'temp/123.txt']
print([p for p in glob.glob('temp/**', recursive=True)
       if re.search(r'\D{3}\.(txt|jpg)', p)])
# ['temp/[x].txt', 'temp/aaa.jpg', 'temp/dir/sub_dir1/ccc.jpg', 'temp/dir/bbb.txt', 'temp/dir/sub_dir2/ddd.jpg']
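\d matches digits, \D matches non-digit characters, {n} matches n repetitions, + matches one or more repetitions, and (a|b) matches either a or b. Since . is also a special character in regular expressions, a literal dot must be written as \. in the pattern. Note that re.search() matches anywhere in the path, so these patterns are not anchored to the whole file name.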
Since glob() returns a list of strings, you can also extract elements using the in operator, string methods, etc., in addition to regular expressions.
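As a rough illustration of these alternatives, the sketch below filters a hard-coded list of paths (copied from the example tree, so no files are needed) with str.endswith(), the in operator, and re.fullmatch() applied to the base name. Using re.fullmatch() on the base name is one possible way to anchor a pattern to the whole file name; it is a choice made for this sketch, not something glob requires.

```python
import os
import re

# Paths as glob() might return them for the example tree.
paths = [
    'temp/[x].txt', 'temp/12.jpg', 'temp/aaa.jpg', 'temp/dir/987.jpg',
    'temp/dir/sub_dir1/98.txt', 'temp/dir/bbb.txt', 'temp/1.txt', 'temp/123.txt',
]

# str.endswith() accepts a tuple, which covers "one of several extensions" checks.
jpgs = [p for p in paths if p.endswith(('.jpg', '.jpeg'))]
print(jpgs)
# ['temp/12.jpg', 'temp/aaa.jpg', 'temp/dir/987.jpg']

# The in operator tests for a substring anywhere in the path.
in_sub_dir1 = [p for p in paths if 'sub_dir1' in p]
print(in_sub_dir1)
# ['temp/dir/sub_dir1/98.txt']

# re.fullmatch() on the base name anchors the pattern to the whole file name.
digits_txt = [p for p in paths if re.fullmatch(r'\d+\.txt', os.path.basename(p))]
print(digits_txt)
# ['temp/dir/sub_dir1/98.txt', 'temp/1.txt', 'temp/123.txt']
```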
Get as an iterator: iglob()
As shown in the previous examples, glob() generates a list.
If you are processing the extracted paths with a for loop, it can be more memory efficient to use an iterator instead of a list.
The iglob() function accepts the same arguments as glob() and returns an iterator.
print(type(glob.iglob('temp/*.txt')))
# <class 'generator'>
for p in glob.iglob('temp/*.txt'):
    print(p)
# temp/[x].txt
# temp/1.txt
# temp/123.txt
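An iterator also makes it possible to consume only part of the results. The following sketch combines iglob() with itertools.islice() to take the first few matches; the throwaway directory and the count of 100 files are assumptions chosen for illustration.

```python
import glob
import itertools
import os
import tempfile

base = tempfile.mkdtemp()
for i in range(100):
    open(os.path.join(base, f'{i:03}.txt'), 'w').close()

matches = glob.iglob(os.path.join(glob.escape(base), '*.txt'))

# islice() pulls only the first three paths from the iterator.
first_three = list(itertools.islice(matches, 3))
print(len(first_three))
# 3

# The iterator resumes where it stopped, so the other 97 paths are still available.
remaining = sum(1 for _ in matches)
print(remaining)
# 97
```

Like any iterator, the object returned by iglob() can be consumed only once; call iglob() again to start over.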