NumPy: Read and write CSV files (np.loadtxt, np.genfromtxt, np.savetxt)

Posted: | Tags: Python, NumPy, CSV

In NumPy, you can use np.loadtxt() or np.genfromtxt() to read a CSV file as an array (ndarray), and np.savetxt() to write an ndarray as a CSV file.

Note that, although the title and headings mention CSV specifically, this functionality is not limited to comma-separated values; it applies to any delimiter-separated text file, such as TSV (tab-separated values).

As discussed later, pandas is more convenient for reading and writing files that contain headers or have both numeric and string columns.

Additionally, when interoperability with other applications is unnecessary, saving arrays in NumPy's own binary formats (npy and npz) is a practical choice. For more information, refer to the following article.

The NumPy version used in this article is as follows. Note that functionality may vary between versions.

import numpy as np

print(np.__version__)
# 1.26.1

Note that not all arguments are covered in this article, so please refer to the official documentation for more details.

Read CSV files as arrays: np.loadtxt()

Basic usage

To read any text file separated by an arbitrary character as a NumPy array (ndarray), use np.loadtxt().

Consider the following file separated by spaces. For explanation purposes, the file's contents are shown using open(). See the following article for more about open().

with open('data/src/sample.txt') as f:
    print(f.read())
# 11 12 13 14
# 21 22 23 24
# 31 32 33 34

Specify the file path as the first argument. By default, the data type (dtype) is float, whose bit size depends on the environment.

a = np.loadtxt('data/src/sample.txt')
print(a)
# [[11. 12. 13. 14.]
#  [21. 22. 23. 24.]
#  [31. 32. 33. 34.]]

print(type(a))
# <class 'numpy.ndarray'>

print(a.dtype)
# float64

You can specify either a path string or a pathlib.Path object as the first argument.
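For instance (a sketch assuming a writable temporary directory; the file name here is made up), a pathlib.Path works the same as a string path:

```python
import tempfile
from pathlib import Path

import numpy as np

# Write a small whitespace-separated file to a temporary location
p = Path(tempfile.gettempdir()) / 'np_loadtxt_demo.txt'
p.write_text('11 12 13 14\n21 22 23 24\n')

# np.loadtxt() accepts a pathlib.Path as well as a plain string
a = np.loadtxt(p)
print(a.shape)
# (2, 4)
```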

Specify delimiter: delimiter

To read a comma-separated file (CSV file), specify a comma (',') for the delimiter argument.

with open('data/src/sample.csv') as f:
    print(f.read())
# 11,12,13,14
# 21,22,23,24
# 31,32,33,34

print(np.loadtxt('data/src/sample.csv', delimiter=','))
# [[11. 12. 13. 14.]
#  [21. 22. 23. 24.]
#  [31. 32. 33. 34.]]

The default value of delimiter is None, which splits values on whitespace, so omitting it will result in an error with CSV files.

# print(np.loadtxt('data/src/sample.csv'))
# ValueError: could not convert string '11,12,13,14' to float64 at row 0, column 1.

For a tab-separated file (TSV file), set delimiter='\t'.
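np.loadtxt() also accepts file-like objects, so a TSV read can be sketched without creating a file by using io.StringIO (the data here is made up for illustration):

```python
import io

import numpy as np

# Simulate a tab-separated file with an in-memory text stream
tsv = io.StringIO('11\t12\t13\n21\t22\t23\n')

a = np.loadtxt(tsv, delimiter='\t')
print(a)
# [[11. 12. 13.]
#  [21. 22. 23.]]
```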

Specify data type: dtype

By default, the data type (dtype) is float, whose bit size depends on the environment. Any data type can be specified with the dtype argument.

a = np.loadtxt('data/src/sample.csv', delimiter=',', dtype='int64')
print(a)
# [[11 12 13 14]
#  [21 22 23 24]
#  [31 32 33 34]]

print(a.dtype)
# int64

Specify rows and columns to read: skiprows, max_rows, usecols

If the file contains unwanted data, use the skiprows, max_rows, usecols arguments to specify which rows and columns to read.

  • skiprows
    • Specify how many rows to skip from the beginning as an integer value
    • Empty lines and comment lines are also counted
  • max_rows
    • Specify the number of rows to read after skiprows as an integer value
    • Empty lines and comment lines are not counted (from NumPy version 1.23 onwards)
  • usecols
    • Specify the indexes (0-based) of the columns to read as a list or other sequence object
    • If only one column is to be read, it can also be specified as an integer value

By default, lines starting with # are ignored as comments. You can specify the characters to be treated as comment indicators in the comments argument, either as a single string or a list of strings.

By specifying these arguments, you can read only the required data from files.

with open('data/src/sample_header_index.csv') as f:
    print(f.read())
# ,a,b,c,d
# ONE,11,12,13,14
# TWO,21,22,23,24
# THREE,31,32,33,34

a = np.loadtxt('data/src/sample_header_index.csv', delimiter=',', dtype='int64',
               skiprows=1, usecols=[1, 2, 3, 4])
print(a)
# [[11 12 13 14]
#  [21 22 23 24]
#  [31 32 33 34]]
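The example above covers skiprows and usecols; here is a minimal sketch of max_rows together with comment handling, using made-up in-memory data (and assuming NumPy 1.23 or later, where comment lines do not count toward max_rows):

```python
import io

import numpy as np

data = io.StringIO(
    '# generated by a hypothetical tool\n'
    '11,12\n'
    '21,22\n'
    '31,32\n'
)

# The '#' line is ignored by the default comments setting,
# and max_rows=2 reads only the first two data rows
a = np.loadtxt(data, delimiter=',', dtype='int64', max_rows=2)
print(a)
# [[11 12]
#  [21 22]]
```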

As discussed later, such files are more easily handled using pandas.

Read complex CSV files as arrays: np.genfromtxt()

np.genfromtxt() allows you to read complex CSV files with missing values or various data types.

However, for files with multiple data types, pandas is often more convenient; thus, this article offers only a brief introduction. For more details, refer to the official documentation.

Basic usage

The basic usage of np.genfromtxt() is similar to np.loadtxt().

Specify the file path as the first argument, the delimiter as the delimiter argument, and the data type as the dtype argument. Additionally, specify which rows and columns to read using arguments like skip_header (equivalent to skiprows in np.loadtxt()), max_rows, and usecols.

with open('data/src/sample_header_index.csv') as f:
    print(f.read())
# ,a,b,c,d
# ONE,11,12,13,14
# TWO,21,22,23,24
# THREE,31,32,33,34

a = np.genfromtxt('data/src/sample_header_index.csv',
                  delimiter=',', dtype='int64',
                  skip_header=1, usecols=[1, 2, 3, 4])
print(a)
# [[11 12 13 14]
#  [21 22 23 24]
#  [31 32 33 34]]

Handle missing values

Consider a file with missing values, which would cause an error if read using np.loadtxt().

with open('data/src/sample_nan.csv') as f:
    print(f.read())
# 11,12,,14
# 21,,,24
# 31,32,33,34

# a = np.loadtxt('data/src/sample_nan.csv', delimiter=',')
# ValueError: could not convert string '' to float64 at row 0, column 3.

Using np.genfromtxt(), missing values are read as np.nan.

a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',')
print(a)
# [[11. 12. nan 14.]
#  [21. nan nan 24.]
#  [31. 32. 33. 34.]]

The filling_values argument allows specifying a value to fill in missing values.

a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',',
                  filling_values=0)
print(a)
# [[11. 12.  0. 14.]
#  [21.  0.  0. 24.]
#  [31. 32. 33. 34.]]

For methods to replace missing values with the average of non-missing values, as well as other techniques for handling missing data, refer to the following articles.
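As one possible sketch of the mean-replacement approach (not necessarily the exact method of the linked articles), np.nanmean() ignores nan when computing column means, and np.where() substitutes them in:

```python
import io

import numpy as np

# Same missing-value data as above, as an in-memory stream
a = np.genfromtxt(io.StringIO('11,12,,14\n21,,,24\n31,32,33,34'),
                  delimiter=',')

# Column means computed over non-missing values only
col_means = np.nanmean(a, axis=0)

# Replace each nan with the mean of its column
filled = np.where(np.isnan(a), col_means, a)
print(filled)
# [[11. 12. 33. 14.]
#  [21. 22. 33. 24.]
#  [31. 32. 33. 34.]]
```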

Handle different data types

Consider the following file with different data types (strings and numbers) in each column.

with open('data/src/sample_pandas_normal.csv') as f:
    print(f.read())
# name,age,state,point
# Alice,24,NY,64
# Bob,42,CA,92
# Charlie,18,CA,70
# Dave,68,TX,70
# Ellen,24,CA,88
# Frank,30,NY,57

Although not mentioned previously, np.loadtxt() can also read such files as structured arrays if an appropriate dtype is specified.

a = np.loadtxt('data/src/sample_pandas_normal.csv', delimiter=',', skiprows=1,
               dtype={'names': ('name', 'age', 'state', 'point'),
                      'formats': ('<U7', '<i8', '<U2', '<i8')})
print(a)
# [('Alice', 24, 'NY', 64) ('Bob', 42, 'CA', 92) ('Charlie', 18, 'CA', 70)
#  ('Dave', 68, 'TX', 70) ('Ellen', 24, 'CA', 88) ('Frank', 30, 'NY', 57)]

print(type(a))
# <class 'numpy.ndarray'>

print(a.dtype)
# [('name', '<U7'), ('age', '<i8'), ('state', '<U2'), ('point', '<i8')]

In np.genfromtxt(), setting the names argument to True and the dtype argument to None reads the file as a structured array with field names taken from the first line and automatically determined types for each column.

a = np.genfromtxt('data/src/sample_pandas_normal.csv', delimiter=',',
                  names=True, dtype=None, encoding='utf-8')
print(a)
# [('Alice', 24, 'NY', 64) ('Bob', 42, 'CA', 92) ('Charlie', 18, 'CA', 70)
#  ('Dave', 68, 'TX', 70) ('Ellen', 24, 'CA', 88) ('Frank', 30, 'NY', 57)]

print(type(a))
# <class 'numpy.ndarray'>

print(a.dtype)
# [('name', '<U7'), ('age', '<i8'), ('state', '<U2'), ('point', '<i8')]
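Fields of the resulting structured array can then be accessed by name. A sketch with a shortened in-memory copy of the same data:

```python
import io

import numpy as np

csv = io.StringIO('name,age,state,point\n'
                  'Alice,24,NY,64\n'
                  'Bob,42,CA,92\n')

a = np.genfromtxt(csv, delimiter=',', names=True, dtype=None,
                  encoding='utf-8')

# Each field behaves as a 1D array of that column
print(a['age'])
# [24 42]

print(a['age'].mean())
# 33.0
```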

For more details on structured arrays, refer to the official documentation.

Again, handling such files is easier with pandas.

Write arrays to CSV files: np.savetxt()

To save a NumPy array (ndarray) as a text file separated by an arbitrary string, use np.savetxt().

Consider the following ndarray.

a = np.arange(6).reshape(2, 3)
print(a)
# [[0 1 2]
#  [3 4 5]]

Basic usage

Specify the file path as the first argument and the original ndarray as the second argument.

np.savetxt('data/temp/np_savetxt.txt', a)

A file with the following contents will be created.

with open('data/temp/np_savetxt.txt') as f:
    print(f.read())
# 0.000000000000000000e+00 1.000000000000000000e+00 2.000000000000000000e+00
# 3.000000000000000000e+00 4.000000000000000000e+00 5.000000000000000000e+00

Specify format: fmt

The fmt argument allows you to specify any format.

It's possible to specify the number of decimal places. However, be aware that if values are rounded when saved as text, they cannot be converted back to their original precision.

The default format is '%.18e': scientific notation with 18 decimal places, as shown in the example above. The number after the . specifies the decimal places, and e signifies scientific notation.
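The precision-loss caveat above can be illustrated with a round trip through an in-memory stream (a made-up sketch, not one of the article's sample files):

```python
import io

import numpy as np

a = np.array([[1/3, 2/3]])

buf = io.StringIO()
np.savetxt(buf, a, fmt='%.2f')  # rounded to two decimal places
buf.seek(0)

b = np.loadtxt(buf)
print(b)
# [0.33 0.67]

# The rounded values no longer match the originals
print(np.array_equal(a[0], b))
# False
```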

np.savetxt('data/temp/np_savetxt_5e.txt', a, fmt='%.5e')

with open('data/temp/np_savetxt_5e.txt') as f:
    print(f.read())
# 0.00000e+00 1.00000e+00 2.00000e+00
# 3.00000e+00 4.00000e+00 5.00000e+00

Since scientific notation can be read directly by np.loadtxt(), unless you have a specific preference, the default format should be fine.

print(np.loadtxt('data/temp/np_savetxt.txt'))
# [[0. 1. 2.]
#  [3. 4. 5.]]

f is for fixed-point notation.

np.savetxt('data/temp/np_savetxt_5f.txt', a, fmt='%.5f')

with open('data/temp/np_savetxt_5f.txt') as f:
    print(f.read())
# 0.00000 1.00000 2.00000
# 3.00000 4.00000 5.00000

d is for decimal integers.

np.savetxt('data/temp/np_savetxt_d.txt', a, fmt='%d')

with open('data/temp/np_savetxt_d.txt') as f:
    print(f.read())
# 0 1 2
# 3 4 5

x is for hexadecimal notation. Zero padding is also possible: for example, 04 pads each value with zeros to a total of 4 digits. For clarity, values are multiplied by 10 before saving in the following example.

print(a * 10)
# [[ 0 10 20]
#  [30 40 50]]

np.savetxt('data/temp/np_savetxt_x.txt', a * 10, fmt='%04x')

with open('data/temp/np_savetxt_x.txt') as f:
    print(f.read())
# 0000 000a 0014
# 001e 0028 0032

Since np.loadtxt() cannot directly read hexadecimal notation, it's advisable to avoid this format when planning to reuse the data in NumPy.
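If you do need to read hex data back, one workaround (a sketch assuming NumPy 1.23 or later, where converters may be a single callable applied to every column) is to parse each field with int(s, 16):

```python
import io

import numpy as np

# Same hex output as the example above, as an in-memory stream
hex_data = io.StringIO('0000 000a 0014\n001e 0028 0032\n')

# Parse every field as a base-16 integer
a = np.loadtxt(hex_data, converters=lambda s: int(s, 16), dtype='int64')
print(a)
# [[ 0 10 20]
#  [30 40 50]]
```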

For more details on format specification, refer to the official documentation.

Specify delimiter: delimiter

By default, np.savetxt() uses a space (' ') as the delimiter, producing whitespace-separated output that np.loadtxt() and np.genfromtxt() can read without specifying delimiter.

You can specify any delimiter using the delimiter argument. For saving as a CSV (comma-separated values), use delimiter=',', and for TSV (tab-separated values), use delimiter='\t'.

np.savetxt('data/temp/np_savetxt.csv', a, delimiter=',', fmt='%d')

with open('data/temp/np_savetxt.csv') as f:
    print(f.read())
# 0,1,2
# 3,4,5

np.savetxt('data/temp/np_savetxt.tsv', a, delimiter='\t', fmt='%d')

with open('data/temp/np_savetxt.tsv') as f:
    print(f.read())
# 0 1   2
# 3 4   5

Only 1D and 2D arrays can be output

np.savetxt() can only write one-dimensional and two-dimensional arrays. Attempting to write arrays with higher dimensions will result in an error.

a_3d = np.arange(24).reshape(2, 3, 4)
print(a_3d)
# [[[ 0  1  2  3]
#   [ 4  5  6  7]
#   [ 8  9 10 11]]
# 
#  [[12 13 14 15]
#   [16 17 18 19]
#   [20 21 22 23]]]

# np.savetxt('data/temp/np_savetxt_3d.txt', a_3d)
# ValueError: Expected 1D or 2D array, got 3D array instead

Arrays of three or more dimensions can be converted to two or fewer dimensions using flatten() or reshape() before saving.

However, to revert them to their original ndarray form after loading with loadtxt(), you will need to reshape them back to their original shape using reshape(). This requires separately saving the shape information, which might not always be practical.
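The steps above can be sketched as a round trip through an in-memory stream, assuming the original shape is stored separately:

```python
import io

import numpy as np

a_3d = np.arange(24).reshape(2, 3, 4)
shape = a_3d.shape  # must be stored somewhere to restore the array

buf = io.StringIO()
np.savetxt(buf, a_3d.reshape(shape[0], -1), fmt='%d')  # save as 2D
buf.seek(0)

# Load as 2D, then restore the original shape
restored = np.loadtxt(buf, dtype='int64').reshape(shape)
print(np.array_equal(a_3d, restored))
# True
```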

Alternatively, saving arrays in binary files (npy and npz) preserves their data type and shape as is. Since multidimensional arrays with three or more dimensions can be directly saved this way, choosing binary files over text files might be a simpler option if text format is unnecessary.

Read and write CSV files with pandas

Using pandas makes reading and writing complex files easier. numpy.ndarray and pandas.DataFrame can be converted to each other.

This article briefly introduces a few examples. For more details on argument settings and other information, refer to the following articles.

Read and write CSV files with header and index

Consider the following CSV file.

import numpy as np
import pandas as pd

with open('data/src/sample_header_index.csv') as f:
    print(f.read())
# ,a,b,c,d
# ONE,11,12,13,14
# TWO,21,22,23,24
# THREE,31,32,33,34

With pd.read_csv(), by default, the first row is treated as the header, and the column specified by the index_col argument is treated as the index.

df = pd.read_csv('data/src/sample_header_index.csv', index_col=0)
print(df)
#         a   b   c   d
# ONE    11  12  13  14
# TWO    21  22  23  24
# THREE  31  32  33  34

To convert a DataFrame to an ndarray, use the to_numpy() method or the values attribute of DataFrame.

a = df.values
print(a)
# [[11 12 13 14]
#  [21 22 23 24]
#  [31 32 33 34]]

print(type(a))
# <class 'numpy.ndarray'>

To save an ndarray with a header and index, first create a DataFrame by specifying the index and columns arguments in its constructor, and then use the to_csv() method to write it.

a = np.arange(6).reshape(2, 3)
print(a)
# [[0 1 2]
#  [3 4 5]]

df = pd.DataFrame(a, index=['ONE', 'TWO'], columns=['a', 'b', 'c'])
print(df)
#      a  b  c
# ONE  0  1  2
# TWO  3  4  5

df.to_csv('data/temp/sample_pd.csv')

with open('data/temp/sample_pd.csv') as f:
    print(f.read())
# ,a,b,c
# ONE,0,1,2
# TWO,3,4,5

Handle missing values

Consider the following CSV file with missing data.

with open('data/src/sample_nan.csv') as f:
    print(f.read())
# 11,12,,14
# 21,,,24
# 31,32,33,34

With pd.read_csv(), missing values are treated as nan even without any special settings. As mentioned above, since the first row is processed as the header by default, if there is no header as in this example, set the header argument to None.

df = pd.read_csv('data/src/sample_nan.csv', header=None)
print(df)
#     0     1     2   3
# 0  11  12.0   NaN  14
# 1  21   NaN   NaN  24
# 2  31  32.0  33.0  34

For more on handling missing values in pandas, refer to the following articles.

Handle different data types

Consider the following CSV file containing both numeric and string columns.

with open('data/src/sample_pandas_normal.csv') as f:
    print(f.read())
# name,age,state,point
# Alice,24,NY,64
# Bob,42,CA,92
# Charlie,18,CA,70
# Dave,68,TX,70
# Ellen,24,CA,88
# Frank,30,NY,57

Each column in a DataFrame has its own data type (dtype). In pd.read_csv(), the data type of each column is automatically inferred and set by default.

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

print(df.dtypes)
# name     object
# age       int64
# state    object
# point     int64
# dtype: object

For more on data types in pandas, refer to the following article.

The select_dtypes() method of DataFrame can be used to extract columns of a specific data type.

print(df.select_dtypes('int'))
#    age  point
# 0   24     64
# 1   42     92
# 2   18     70
# 3   68     70
# 4   24     88
# 5   30     57

You can thus extract only the numeric columns from a CSV file that also contains string data and convert them to an ndarray.

a = pd.read_csv('data/src/sample_pandas_normal.csv').select_dtypes('int').values
print(a)
# [[24 64]
#  [42 92]
#  [18 70]
#  [68 70]
#  [24 88]
#  [30 57]]

print(type(a))
# <class 'numpy.ndarray'>

print(a.dtype)
# int64
