pandas: Write DataFrame to CSV with to_csv()

Posted: | Tags: Python, pandas, CSV

You can write data from pandas.DataFrame and pandas.Series to CSV files using the to_csv() method. This method also allows appending to an existing CSV file. By altering the delimiter, the data can be saved as a TSV (Tab-separated values) file.

Not all arguments are covered in this article. For a comprehensive understanding of all arguments, please refer to the official documentation linked above.

The pandas.read_csv() function allows you to read CSV files and load them into DataFrame objects.

The sample code in this article uses pandas version 2.0.3. Consider the following DataFrame as an example.

import pandas as pd

print(pd.__version__)
# 2.0.3

df = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0).head(3)
print(df)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70

The following examples use DataFrame but are equally applicable to Series.

Write to CSV file with to_csv()

to_csv() is provided as a method for both pandas.DataFrame and pandas.Series.

By specifying the file path as the first argument, the data can be written to that path in CSV format.

df.to_csv('data/dst/to_csv_out.csv')
name,age,state,point
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70

The path can be either absolute or relative. For how to check and change the current directory, see the following article.

Write only specific columns: columns

To write only specific columns, specify a list of column names to the columns argument.

df.to_csv('data/dst/to_csv_out_columns.csv', columns=['age', 'point'])
name,age,point
Alice,24,64
Bob,42,92
Charlie,18,70

By default, columns is set to None, and all columns are included in the output.

Write column/row names: header, index

You can control whether to write column names (columns) and row names (index) by setting the header and index arguments to True or False.

df.to_csv('data/dst/to_csv_out_header_index.csv', header=False, index=False)
24,NY,64
42,CA,92
18,CA,70

By default, both are set to True, and both columns (header) and index are included in the output, as demonstrated in the previous examples.

Encoding: encoding

The encoding argument allows you to define the encoding of the output CSV file. By default, this is utf-8. You can change the encoding as needed, for example, to encoding='shift_jis' or encoding='cp932'.

Delimiter: sep

The sep argument allows you to set the delimiter. The default delimiter is a comma ,, which produces a CSV file.

To create a TSV file, which uses a tab character \t as the delimiter, you can simply set sep='\t'.

df.to_csv('data/dst/to_csv_out.tsv', sep='\t')
name    age state   point
Alice   24  NY  64
Bob 42  CA  92
Charlie 18  CA  70

Write mode (write/overwrite or append): mode

The mode argument allows you to specify the write mode, similar to the built-in open() function.

By default, mode is set to 'w'. If the specified path does not exist, it creates a new file; if the path already exists, it overwrites the existing file.

To prevent overwriting an existing file, use mode='x'. If the specified path does not exist, a new file is created; if it does, an error is returned, and the file is not overwritten.

# df.to_csv('data/dst/to_csv_out.csv', mode='x')
# FileExistsError: [Errno 17] File exists: 'data/dst/to_csv_out.csv'

For appending data, use mode='a'. This will append the DataFrame to the end of the existing file as new rows. Note that the header (columns) will also be appended unless header=False is specified.

df.to_csv('data/dst/to_csv_out_a.csv')
df.to_csv('data/dst/to_csv_out_a.csv', mode='a', header=False)
name,age,state,point
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70

mode='a' is used to append rows. If you want to add columns, you should read the target file, add the columns, and then overwrite the original file. You can use mode='w' for overwriting the file, but this argument can be omitted because 'w' is the default mode.

df.to_csv('data/dst/to_csv_out_a_new_column.csv')

df_new = pd.read_csv('data/dst/to_csv_out_a_new_column.csv', index_col=0)
print(df_new)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70

df_new['new_col'] = 'new data'
print(df_new)
#          age state  point   new_col
# name                               
# Alice     24    NY     64  new data
# Bob       42    CA     92  new data
# Charlie   18    CA     70  new data

df_new.to_csv('data/dst/to_csv_out_a_new_column.csv')
name,age,state,point,new_col
Alice,24,NY,64,new data
Bob,42,CA,92,new data
Charlie,18,CA,70,new data

Handle missing values NaN: na_rep

Consider the following DataFrame containing missing values NaN.

df_nan = df.copy()
df_nan.iat[0, 1] = float('nan')
df_nan.iat[1, 2] = float('nan')

print(df_nan)
#          age state  point
# name                     
# Alice     24   NaN   64.0
# Bob       42    CA    NaN
# Charlie   18    CA   70.0

By default, missing values are represented as empty strings in the output.

df_nan.to_csv('data/dst/to_csv_out_nan.csv')
name,age,state,point
Alice,24,,64.0
Bob,42,CA,
Charlie,18,CA,70.0

You can replace missing values with a specific string using the na_rep argument.

df_nan.to_csv('data/dst/to_csv_out_nan_rep.csv', na_rep='NaN')
name,age,state,point
Alice,24,NaN,64.0
Bob,42,CA,NaN
Charlie,18,CA,70.0

Refer to the following articles for information on handling missing values in DataFrame.

Format for floating-point numbers (float): float_format

Consider the following DataFrame.

df = pd.DataFrame({'col1': [0.123456789, 1000000000.0],
                   'col2': [123456789.0, 0.0],
                   'col3': [123456789, 0]})
print(df)
#            col1         col2       col3
# 0  1.234568e-01  123456789.0  123456789
# 1  1.000000e+09          0.0          0

print(df.dtypes)
# col1    float64
# col2    float64
# col3      int64
# dtype: object

When you use print(), the output appears in scientific notation. However, this is due to the display settings and does not indicate that the actual value has been rounded off.

print(df.iat[0, 0])
# 0.123456789

print(df.iat[1, 0])
# 1000000000.0

By default, when using to_csv(), the value is saved as is.

df.to_csv('data/dst/to_csv_out_float_default.csv')
,col1,col2,col3
0,0.123456789,123456789.0,123456789
1,1000000000.0,0.0,0

With the float_format argument, you can control the formatting of floating-point numbers (float) when saving. You can either provide a format string in style used by printf (with a %), or use a callable object like format().

Here is an example using the printf-style, where the number of digits after the decimal point is set to three.

print('%.3f' % 0.123456789)
# 0.123

print('%.3f' % 123456789)
# 123456789.000

df.to_csv('data/dst/to_csv_out_float_format_3f.csv', float_format='%.3f')
,col1,col2,col3
0,0.123,123456789.000,123456789
1,1000000000.000,0.000,0

Here is an example using a callable object. This outputs in scientific notation with three digits after the decimal point.

print('{:.3e}'.format(0.123456789))
# 1.235e-01

print('{:.3e}'.format(123456789))
# 1.235e+08

df.to_csv('data/dst/to_csv_out_float_format_3e.csv', float_format='{:.3e}'.format)
,col1,col2,col3
0,1.235e-01,1.235e+08,123456789
1,1.000e+09,0.000e+00,0

Please note that specifying the number of digits may lead to loss of information beyond the specified number of digits during saving.

Write with custom formats

The float_format argument, as the name suggests, only applies to columns with the float data type. Columns with the integer (int) type remain as they are. Moreover, you can't specify different formats for each column.

If you want to specify a format for int columns, or use a different format for each float column, you should convert the data in the original DataFrame to strings in the desired format before saving it.

df = pd.DataFrame({'col1': [0.123456789, 1000000000.0],
                   'col2': [123456789.0, 0.0],
                   'col3': [123456789, 0]})

df['col1'] = df['col1'].map('{:.3f}'.format)
df['col2'] = df['col2'].map('{:.3e}'.format)
df['col3'] = df['col3'].map('{:#010x}'.format)

print(df)
#              col1       col2        col3
# 0           0.123  1.235e+08  0x075bcd15
# 1  1000000000.000  0.000e+00  0x00000000

df.to_csv('data/dst/to_csv_out_float_format_str.csv')
,col1,col2,col3
0,0.123,1.235e+08,0x075bcd15
1,1000000000.000,0.000e+00,0x00000000

In the example above, note that when saving integers in hexadecimal form, pandas.read_csv() will interpret them as strings. If you want them to be treated as numerical data, you'll need to convert them after loading.

df = pd.read_csv('data/dst/to_csv_out_float_format_str.csv', index_col=0)
print(df)
#            col1         col2        col3
# 0  1.230000e-01  123500000.0  0x075bcd15
# 1  1.000000e+09          0.0  0x00000000

print(df.dtypes)
# col1    float64
# col2    float64
# col3     object
# dtype: object

df['col3'] = df['col3'].map(lambda x: int(x, 16))
print(df)
#            col1         col2       col3
# 0  1.230000e-01  123500000.0  123456789
# 1  1.000000e+09          0.0          0

print(df.dtypes)
# col1    float64
# col2    float64
# col3      int64
# dtype: object

Related Categories

Related Articles