NumPy: Read and write CSV files (np.loadtxt, np.genfromtxt, np.savetxt)
In NumPy, you can use np.loadtxt() or np.genfromtxt() to read a CSV file as an array (ndarray), and np.savetxt() to write an ndarray as a CSV file.
For clarity, while the title and headings specifically mention CSV, this functionality is not limited to comma-separated values; it also extends to any text files separated by delimiters like TSV (tab-separated values).
As discussed later, pandas is more convenient for reading and writing files that contain headers or have both numeric and string columns.
Additionally, when interoperability with other applications is unnecessary, saving data in NumPy's own binary formats (npy and npz) is a practical choice. For more information, refer to the following article.
The NumPy version used in this article is as follows. Note that functionality may vary between versions.
import numpy as np
print(np.__version__)
# 1.26.1
Note that not all arguments are covered in this article, so please refer to the official documentation for more details.
Read CSV files as arrays: np.loadtxt()
Basic usage
To read a text file separated by an arbitrary character as a NumPy array (ndarray), use np.loadtxt().
Consider the following file separated by spaces. For explanation purposes, the file's contents are shown using open(). See the following article for more about open().
with open('data/src/sample.txt') as f:
print(f.read())
# 11 12 13 14
# 21 22 23 24
# 31 32 33 34
Specify the file path as the first argument. By default, the data type (dtype) is float, whose bit size depends on the environment.
a = np.loadtxt('data/src/sample.txt')
print(a)
# [[11. 12. 13. 14.]
# [21. 22. 23. 24.]
# [31. 32. 33. 34.]]
print(type(a))
# <class 'numpy.ndarray'>
print(a.dtype)
# float64
You can specify either a path string or a pathlib.Path object as the first argument.
Specify delimiter: delimiter
To read a comma-separated file (CSV file), specify a comma (',') for the delimiter argument.
with open('data/src/sample.csv') as f:
print(f.read())
# 11,12,13,14
# 21,22,23,24
# 31,32,33,34
print(np.loadtxt('data/src/sample.csv', delimiter=','))
# [[11. 12. 13. 14.]
# [21. 22. 23. 24.]
# [31. 32. 33. 34.]]
The default value of delimiter is a space (' '), so omitting it will result in an error with CSV files.
# print(np.loadtxt('data/src/sample.csv'))
# ValueError: could not convert string '11,12,13,14' to float64 at row 0, column 1.
For a tab-separated file (TSV file), set delimiter='\t'.
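As a minimal sketch of reading tab-separated data (io.StringIO stands in for an actual file path here; np.loadtxt() accepts file-like objects as well):

```python
import numpy as np
from io import StringIO

# Two rows of tab-separated values; in practice, pass a file path instead.
tsv_data = StringIO('11\t12\t13\n21\t22\t23\n')
a = np.loadtxt(tsv_data, delimiter='\t')
print(a)
# [[11. 12. 13.]
#  [21. 22. 23.]]
```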
Specify data type: dtype
By default, the data type (dtype) is float, whose bit size depends on the environment. Any data type can be specified with the dtype argument.
a = np.loadtxt('data/src/sample.csv', delimiter=',', dtype='int64')
print(a)
# [[11 12 13 14]
# [21 22 23 24]
# [31 32 33 34]]
print(a.dtype)
# int64
Specify rows and columns to read: skiprows, max_rows, usecols
If the file contains unwanted data, use the skiprows, max_rows, and usecols arguments to specify which rows and columns to read.
skiprows
- Specify how many rows to skip from the beginning as an integer value
- Empty lines and comment lines are also counted
max_rows
- Specify the number of rows to read after skiprows as an integer value
- Empty lines and comment lines are not counted (from NumPy version 1.23 onwards)
usecols
- Specify the indexes (0-based) of the columns to read as a list or other sequence object
- If only one column is to be read, it can also be specified as an integer value
By default, lines starting with # are ignored as comments. You can specify the characters to be treated as comment indicators with the comments argument, either as a single string or a list of strings.
By specifying these arguments, you can read only the required data from files.
with open('data/src/sample_header_index.csv') as f:
print(f.read())
# ,a,b,c,d
# ONE,11,12,13,14
# TWO,21,22,23,24
# THREE,31,32,33,34
a = np.loadtxt('data/src/sample_header_index.csv', delimiter=',', dtype='int64',
skiprows=1, usecols=[1, 2, 3, 4])
print(a)
# [[11 12 13 14]
# [21 22 23 24]
# [31 32 33 34]]
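The max_rows and comments arguments can be sketched the same way (io.StringIO stands in for a file; the '%' comment character is just an assumption for illustration):

```python
import numpy as np
from io import StringIO

# Lines starting with '%' are skipped via the comments argument;
# max_rows counts only data rows (NumPy 1.23 and later).
data = StringIO('% header comment\n11,12\n21,22\n31,32\n')
a = np.loadtxt(data, delimiter=',', comments='%', max_rows=2)
print(a)
# [[11. 12.]
#  [21. 22.]]
```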
As discussed later, such files are more easily handled using pandas.
Read complex CSV files as arrays: np.genfromtxt()
np.genfromtxt() allows you to read complex CSV files with missing values or various data types.
However, for files with multiple data types, pandas is often more convenient; thus, this article offers only a brief introduction. For more details, refer to the official documentation.
Basic usage
The basic usage of np.genfromtxt() is similar to np.loadtxt().
Specify the file path as the first argument, the delimiter as the delimiter argument, and the data type as the dtype argument. Additionally, specify which rows and columns to read using arguments like skip_header (equivalent to skiprows in np.loadtxt()), max_rows, and usecols.
with open('data/src/sample_header_index.csv') as f:
print(f.read())
# ,a,b,c,d
# ONE,11,12,13,14
# TWO,21,22,23,24
# THREE,31,32,33,34
a = np.genfromtxt('data/src/sample_header_index.csv',
delimiter=',', dtype='int64',
skip_header=1, usecols=[1, 2, 3, 4])
print(a)
# [[11 12 13 14]
# [21 22 23 24]
# [31 32 33 34]]
Handle missing values
Consider a file with missing values, which would cause an error if read using np.loadtxt().
with open('data/src/sample_nan.csv') as f:
print(f.read())
# 11,12,,14
# 21,,,24
# 31,32,33,34
# a = np.loadtxt('data/src/sample_nan.csv', delimiter=',')
# ValueError: could not convert string '' to float64 at row 0, column 3.
Using np.genfromtxt(), missing values are read as np.nan.
a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',')
print(a)
# [[11. 12. nan 14.]
# [21. nan nan 24.]
# [31. 32. 33. 34.]]
The filling_values argument allows specifying a value to fill in missing values.
a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',',
filling_values=0)
print(a)
# [[11. 12. 0. 14.]
# [21. 0. 0. 24.]
# [31. 32. 33. 34.]]
For methods to replace missing values with the average of non-missing values, as well as other techniques for handling missing data, refer to the following articles.
- NumPy: Replace NaN (np.nan) using np.nan_to_num() and np.isnan()
- NumPy: Remove NaN (np.nan) from an array
- NumPy: Functions ignoring NaN (np.nansum, np.nanmean, etc.)
Handle different data types
Consider the following file with different data types (strings and numbers) in each column.
with open('data/src/sample_pandas_normal.csv') as f:
print(f.read())
# name,age,state,point
# Alice,24,NY,64
# Bob,42,CA,92
# Charlie,18,CA,70
# Dave,68,TX,70
# Ellen,24,CA,88
# Frank,30,NY,57
Although not mentioned previously, np.loadtxt() can also read such files as structured arrays if an appropriate dtype is specified.
a = np.loadtxt('data/src/sample_pandas_normal.csv', delimiter=',', skiprows=1,
dtype={'names': ('name', 'age', 'state', 'point'),
'formats': ('<U7', '<i8', '<U2', '<i8')})
print(a)
# [('Alice', 24, 'NY', 64) ('Bob', 42, 'CA', 92) ('Charlie', 18, 'CA', 70)
# ('Dave', 68, 'TX', 70) ('Ellen', 24, 'CA', 88) ('Frank', 30, 'NY', 57)]
print(type(a))
# <class 'numpy.ndarray'>
print(a.dtype)
# [('name', '<U7'), ('age', '<i8'), ('state', '<U2'), ('point', '<i8')]
In np.genfromtxt(), setting the names argument to True and the dtype argument to None reads the file as a structured array, with field names taken from the first line and data types automatically determined for each column.
a = np.genfromtxt('data/src/sample_pandas_normal.csv', delimiter=',',
names=True, dtype=None, encoding='utf-8')
print(a)
# [('Alice', 24, 'NY', 64) ('Bob', 42, 'CA', 92) ('Charlie', 18, 'CA', 70)
# ('Dave', 68, 'TX', 70) ('Ellen', 24, 'CA', 88) ('Frank', 30, 'NY', 57)]
print(type(a))
# <class 'numpy.ndarray'>
print(a.dtype)
# [('name', '<U7'), ('age', '<i8'), ('state', '<U2'), ('point', '<i8')]
For more details on structured arrays, refer to the official documentation.
Again, handling such files is easier with pandas.
Write arrays to CSV files: np.savetxt()
To save a NumPy array (ndarray) as a text file separated by an arbitrary string, use np.savetxt().
Consider the following ndarray.
a = np.arange(6).reshape(2, 3)
print(a)
# [[0 1 2]
# [3 4 5]]
Basic usage
Specify the file path as the first argument and the ndarray to save as the second argument.
np.savetxt('data/temp/np_savetxt.txt', a)
A file with the following contents will be created.
with open('data/temp/np_savetxt.txt') as f:
print(f.read())
# 0.000000000000000000e+00 1.000000000000000000e+00 2.000000000000000000e+00
# 3.000000000000000000e+00 4.000000000000000000e+00 5.000000000000000000e+00
Specify format: fmt
The fmt argument allows you to specify any format.
It's possible to specify the number of decimal places. However, be aware that if values are rounded when saved as text, they cannot be converted back to their original precision.
The default format is '%.18e', which uses 18-decimal-place scientific notation, as shown in the example above. In this notation, the number following the . specifies the decimal places, while e signifies scientific notation.
np.savetxt('data/temp/np_savetxt_5e.txt', a, fmt='%.5e')
with open('data/temp/np_savetxt_5e.txt') as f:
print(f.read())
# 0.00000e+00 1.00000e+00 2.00000e+00
# 3.00000e+00 4.00000e+00 5.00000e+00
Since scientific notation can be read directly by np.loadtxt(), unless you have a specific preference, the default format should be fine.
print(np.loadtxt('data/temp/np_savetxt.txt'))
# [[0. 1. 2.]
# [3. 4. 5.]]
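To illustrate the earlier warning about rounding, here is a minimal sketch showing that precision discarded at save time cannot be recovered (io.StringIO stands in for a file):

```python
import numpy as np
from io import StringIO

# Saving with only two decimal places discards the remaining digits for good.
buf = StringIO()
np.savetxt(buf, np.array([1.23456789]), fmt='%.2f')

buf.seek(0)
restored = np.loadtxt(buf)
print(restored)
# 1.23
```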
f is for fixed-point notation.
np.savetxt('data/temp/np_savetxt_5f.txt', a, fmt='%.5f')
with open('data/temp/np_savetxt_5f.txt') as f:
print(f.read())
# 0.00000 1.00000 2.00000
# 3.00000 4.00000 5.00000
d is for decimal integers.
np.savetxt('data/temp/np_savetxt_d.txt', a, fmt='%d')
with open('data/temp/np_savetxt_d.txt') as f:
print(f.read())
# 0 1 2
# 3 4 5
x is for hexadecimal notation. Zero padding is also possible: for example, 04 means a total width of 4 digits, padded with 0. For clarity, values are multiplied by 10 before saving in the following example.
print(a * 10)
# [[ 0 10 20]
# [30 40 50]]
np.savetxt('data/temp/np_savetxt_x.txt', a * 10, fmt='%04x')
with open('data/temp/np_savetxt_x.txt') as f:
print(f.read())
# 0000 000a 0014
# 001e 0028 0032
Since np.loadtxt() cannot directly read hexadecimal notation, it's advisable to avoid this format when planning to reuse the data in NumPy.
For more details on format specification, refer to the official documentation.
Specify delimiter: delimiter
Like np.loadtxt() and np.genfromtxt(), the default delimiter in np.savetxt() is a space (' ').
You can specify any delimiter using the delimiter argument. For saving as CSV (comma-separated values), use delimiter=',', and for TSV (tab-separated values), use delimiter='\t'.
np.savetxt('data/temp/np_savetxt.csv', a, delimiter=',', fmt='%d')
with open('data/temp/np_savetxt.csv') as f:
print(f.read())
# 0,1,2
# 3,4,5
np.savetxt('data/temp/np_savetxt.tsv', a, delimiter='\t', fmt='%d')
with open('data/temp/np_savetxt.tsv') as f:
print(f.read())
# 0 1 2
# 3 4 5
Only 1D and 2D arrays can be output
np.savetxt() can only write one-dimensional and two-dimensional arrays. Attempting to write arrays with more dimensions results in an error.
a_3d = np.arange(24).reshape(2, 3, 4)
print(a_3d)
# [[[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
#
# [[12 13 14 15]
# [16 17 18 19]
# [20 21 22 23]]]
# np.savetxt('data/temp/np_savetxt_3d.txt', a_3d)
# ValueError: Expected 1D or 2D array, got 3D array instead
Arrays of three or more dimensions can be converted to two or fewer dimensions using flatten() or reshape() before saving. However, to restore the original ndarray after loading with np.loadtxt(), you will need to reshape it back to its original shape using reshape(). This requires saving the shape information separately, which is not always practical.
Alternatively, saving arrays in NumPy's binary formats (npy and npz) preserves their data type and shape as is. Since arrays of three or more dimensions can be saved directly this way, binary files may be a simpler option than text files if text format is not required.
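For comparison, a minimal sketch of the binary round trip with np.save() and np.load() (using a temporary directory rather than a real project path):

```python
import os
import tempfile

import numpy as np

a_3d = np.arange(24).reshape(2, 3, 4)

# .npy files preserve both dtype and shape, so no reshaping is needed.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'a_3d.npy')
    np.save(path, a_3d)
    loaded = np.load(path)

print(loaded.shape)
# (2, 3, 4)
print(np.array_equal(loaded, a_3d))
# True
```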
Read and write CSV files with pandas
Using pandas makes reading and writing complex files easier. numpy.ndarray and pandas.DataFrame can be converted to each other.
This article briefly introduces a few examples. For more details on argument settings and other information, refer to the following articles.
Read and write CSV files with header and index
Consider the following CSV file.
import numpy as np
import pandas as pd
with open('data/src/sample_header_index.csv') as f:
print(f.read())
# ,a,b,c,d
# ONE,11,12,13,14
# TWO,21,22,23,24
# THREE,31,32,33,34
With pd.read_csv(), by default, the first row is treated as the header, and the column specified by the index_col argument is treated as the index.
df = pd.read_csv('data/src/sample_header_index.csv', index_col=0)
print(df)
# a b c d
# ONE 11 12 13 14
# TWO 21 22 23 24
# THREE 31 32 33 34
To convert a DataFrame to an ndarray, use the values attribute of the DataFrame.
a = df.values
print(a)
# [[11 12 13 14]
# [21 22 23 24]
# [31 32 33 34]]
print(type(a))
# <class 'numpy.ndarray'>
To save an ndarray with a header and index, first create a DataFrame by specifying the index and columns arguments in its constructor, and then write it with the to_csv() method.
a = np.arange(6).reshape(2, 3)
print(a)
# [[0 1 2]
# [3 4 5]]
df = pd.DataFrame(a, index=['ONE', 'TWO'], columns=['a', 'b', 'c'])
print(df)
# a b c
# ONE 0 1 2
# TWO 3 4 5
df.to_csv('data/temp/sample_pd.csv')
with open('data/temp/sample_pd.csv') as f:
print(f.read())
# ,a,b,c
# ONE,0,1,2
# TWO,3,4,5
Handle missing values
Consider the following CSV file with missing data.
with open('data/src/sample_nan.csv') as f:
print(f.read())
# 11,12,,14
# 21,,,24
# 31,32,33,34
With pd.read_csv(), missing values are treated as nan even without any special settings. As mentioned above, the first row is processed as the header by default, so if there is no header, as in this example, set the header argument to None.
df = pd.read_csv('data/src/sample_nan.csv', header=None)
print(df)
# 0 1 2 3
# 0 11 12.0 NaN 14
# 1 21 NaN NaN 24
# 2 31 32.0 33.0 34
For more on handling missing values in pandas, refer to the following articles.
- Missing values in pandas (nan, None, pd.NA)
- pandas: Interpolate NaN (missing values) with interpolate()
Handle different data types
Consider the following CSV file containing both numeric and string columns.
with open('data/src/sample_pandas_normal.csv') as f:
print(f.read())
# name,age,state,point
# Alice,24,NY,64
# Bob,42,CA,92
# Charlie,18,CA,70
# Dave,68,TX,70
# Ellen,24,CA,88
# Frank,30,NY,57
Each column in a DataFrame has its own data type (dtype). In pd.read_csv(), the data type of each column is automatically inferred and set by default.
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
print(df.dtypes)
# name object
# age int64
# state object
# point int64
# dtype: object
For more on data types in pandas, refer to the following article.
The select_dtypes() method of DataFrame can be used to extract columns of a specific data type.
print(df.select_dtypes('int'))
# age point
# 0 24 64
# 1 42 92
# 2 18 70
# 3 68 70
# 4 24 88
# 5 30 57
You can extract only the numeric columns from a CSV file containing extra data like strings and convert them to an ndarray.
a = pd.read_csv('data/src/sample_pandas_normal.csv').select_dtypes('int').values
print(a)
# [[24 64]
# [42 92]
# [18 70]
# [68 70]
# [24 88]
# [30 57]]
print(type(a))
# <class 'numpy.ndarray'>
print(a.dtype)
# int64