pandas: Write DataFrame to CSV with to_csv()
You can write data from pandas.DataFrame
and pandas.Series
to CSV files using the to_csv()
method. This method also allows appending to an existing CSV file. By altering the delimiter, the data can be saved as a TSV (Tab-separated values) file.
- pandas.DataFrame.to_csv — pandas 2.0.3 documentation
- pandas.Series.to_csv — pandas 2.0.3 documentation
Not all arguments are covered in this article. For a comprehensive understanding of all arguments, please refer to the official documentation linked above.
- Write to CSV file with
to_csv()
- Write only specific columns:
columns
- Write column/row names:
header
,index
- Encoding:
encoding
- Delimiter:
sep
- Write mode (write/overwrite or append):
mode
- Handle missing values
NaN
:na_rep
- Format for floating-point numbers (
float
):float_format
- Write with custom formats
The pandas.read_csv()
function allows you to read CSV files and load them into DataFrame
objects.
The sample code in this article uses pandas version 2.0.3
. Consider the following DataFrame
as an example.
import pandas as pd
print(pd.__version__)
# 2.0.3
df = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0).head(3)
print(df)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
The following examples use DataFrame
but are equally applicable to Series
.
Write to CSV file with to_csv()
to_csv()
is provided as a method for both pandas.DataFrame
and pandas.Series
.
By specifying the file path as the first argument, the data can be written to that path in CSV format.
df.to_csv('data/dst/to_csv_out.csv')
name,age,state,point
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70
The path can be either absolute or relative. For how to check and change the current directory, see the following article.
Write only specific columns: columns
To write only specific columns, specify a list of column names to the columns
argument.
df.to_csv('data/dst/to_csv_out_columns.csv', columns=['age', 'point'])
name,age,point
Alice,24,64
Bob,42,92
Charlie,18,70
By default, columns
is set to None
, and all columns are included in the output.
Write column/row names: header
, index
You can control whether to write column names (columns
) and row names (index
) by setting the header
and index
arguments to True
or False
.
df.to_csv('data/dst/to_csv_out_header_index.csv', header=False, index=False)
24,NY,64
42,CA,92
18,CA,70
By default, both are set to True
, and both columns
(header) and index
are included in the output, as demonstrated in the previous examples.
Encoding: encoding
The encoding
argument allows you to define the encoding of the output CSV file. By default, this is utf-8
. You can change the encoding as needed, for example, to encoding='shift_jis'
or encoding='cp932'
.
Delimiter: sep
The sep
argument allows you to set the delimiter. The default delimiter is a comma ,
, which produces a CSV file.
To create a TSV file, which uses a tab character \t
as the delimiter, you can simply set sep='\t'
.
df.to_csv('data/dst/to_csv_out.tsv', sep='\t')
name age state point
Alice 24 NY 64
Bob 42 CA 92
Charlie 18 CA 70
Write mode (write/overwrite or append): mode
The mode
argument allows you to specify the write mode, similar to the built-in open()
function.
By default, mode
is set to 'w'
. If the specified path does not exist, it creates a new file; if the path already exists, it overwrites the existing file.
To prevent overwriting an existing file, use mode='x'
. If the specified path does not exist, a new file is created; if it does, an error is returned, and the file is not overwritten.
# df.to_csv('data/dst/to_csv_out.csv', mode='x')
# FileExistsError: [Errno 17] File exists: 'data/dst/to_csv_out.csv'
For appending data, use mode='a'
. This will append the DataFrame
to the end of the existing file as new rows. Note that the header (columns
) will also be appended unless header=False
is specified.
df.to_csv('data/dst/to_csv_out_a.csv')
df.to_csv('data/dst/to_csv_out_a.csv', mode='a', header=False)
name,age,state,point
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70
mode='a'
is used to append rows. If you want to add columns, you should read the target file, add the columns, and then overwrite the original file. You can use mode='w'
for overwriting the file, but this argument can be omitted because 'w'
is the default mode.
df.to_csv('data/dst/to_csv_out_a_new_column.csv')
df_new = pd.read_csv('data/dst/to_csv_out_a_new_column.csv', index_col=0)
print(df_new)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
df_new['new_col'] = 'new data'
print(df_new)
# age state point new_col
# name
# Alice 24 NY 64 new data
# Bob 42 CA 92 new data
# Charlie 18 CA 70 new data
df_new.to_csv('data/dst/to_csv_out_a_new_column.csv')
name,age,state,point,new_col
Alice,24,NY,64,new data
Bob,42,CA,92,new data
Charlie,18,CA,70,new data
Handle missing values NaN
: na_rep
Consider the following DataFrame
containing missing values NaN
.
df_nan = df.copy()
df_nan.iat[0, 1] = float('nan')
df_nan.iat[1, 2] = float('nan')
print(df_nan)
# age state point
# name
# Alice 24 NaN 64.0
# Bob 42 CA NaN
# Charlie 18 CA 70.0
By default, missing values are represented as empty strings in the output.
df_nan.to_csv('data/dst/to_csv_out_nan.csv')
name,age,state,point
Alice,24,,64.0
Bob,42,CA,
Charlie,18,CA,70.0
You can replace missing values with a specific string using the na_rep
argument.
df_nan.to_csv('data/dst/to_csv_out_nan_rep.csv', na_rep='NaN')
name,age,state,point
Alice,24,NaN,64.0
Bob,42,CA,NaN
Charlie,18,CA,70.0
Refer to the following articles for information on handling missing values in DataFrame
.
- pandas: Remove NaN (missing values) with dropna()
- pandas: Replace NaN (missing values) with fillna()
Format for floating-point numbers (float
): float_format
Consider the following DataFrame
.
df = pd.DataFrame({'col1': [0.123456789, 1000000000.0],
'col2': [123456789.0, 0.0],
'col3': [123456789, 0]})
print(df)
# col1 col2 col3
# 0 1.234568e-01 123456789.0 123456789
# 1 1.000000e+09 0.0 0
print(df.dtypes)
# col1 float64
# col2 float64
# col3 int64
# dtype: object
When you use print()
, the output appears in scientific notation. However, this is due to the display settings and does not indicate that the actual value has been rounded off.
print(df.iat[0, 0])
# 0.123456789
print(df.iat[1, 0])
# 1000000000.0
By default, when using to_csv()
, the value is saved as is.
df.to_csv('data/dst/to_csv_out_float_default.csv')
,col1,col2,col3
0,0.123456789,123456789.0,123456789
1,1000000000.0,0.0,0
With the float_format
argument, you can control the formatting of floating-point numbers (float
) when saving. You can either provide a format string in style used by printf (with a %
), or use a callable object like format()
.
Here is an example using the printf-style, where the number of digits after the decimal point is set to three.
print('%.3f' % 0.123456789)
# 0.123
print('%.3f' % 123456789)
# 123456789.000
df.to_csv('data/dst/to_csv_out_float_format_3f.csv', float_format='%.3f')
,col1,col2,col3
0,0.123,123456789.000,123456789
1,1000000000.000,0.000,0
Here is an example using a callable object. This outputs in scientific notation with three digits after the decimal point.
print('{:.3e}'.format(0.123456789))
# 1.235e-01
print('{:.3e}'.format(123456789))
# 1.235e+08
df.to_csv('data/dst/to_csv_out_float_format_3e.csv', float_format='{:.3e}'.format)
,col1,col2,col3
0,1.235e-01,1.235e+08,123456789
1,1.000e+09,0.000e+00,0
Please note that specifying the number of digits may lead to loss of information beyond the specified number of digits during saving.
Write with custom formats
The float_format
argument, as the name suggests, only applies to columns with the float
data type. Columns with the integer (int
) type remain as they are. Moreover, you can't specify different formats for each column.
If you want to specify a format for int
columns, or use a different format for each float
column, you should convert the data in the original DataFrame
to strings in the desired format before saving it.
df = pd.DataFrame({'col1': [0.123456789, 1000000000.0],
'col2': [123456789.0, 0.0],
'col3': [123456789, 0]})
df['col1'] = df['col1'].map('{:.3f}'.format)
df['col2'] = df['col2'].map('{:.3e}'.format)
df['col3'] = df['col3'].map('{:#010x}'.format)
print(df)
# col1 col2 col3
# 0 0.123 1.235e+08 0x075bcd15
# 1 1000000000.000 0.000e+00 0x00000000
df.to_csv('data/dst/to_csv_out_float_format_str.csv')
,col1,col2,col3
0,0.123,1.235e+08,0x075bcd15
1,1000000000.000,0.000e+00,0x00000000
In the example above, note that when saving integers in hexadecimal form, pandas.read_csv()
will interpret them as strings. If you want them to be treated as numerical data, you'll need to convert them after loading.
df = pd.read_csv('data/dst/to_csv_out_float_format_str.csv', index_col=0)
print(df)
# col1 col2 col3
# 0 1.230000e-01 123500000.0 0x075bcd15
# 1 1.000000e+09 0.0 0x00000000
print(df.dtypes)
# col1 float64
# col2 float64
# col3 object
# dtype: object
df['col3'] = df['col3'].map(lambda x: int(x, 16))
print(df)
# col1 col2 col3
# 0 1.230000e-01 123500000.0 123456789
# 1 1.000000e+09 0.0 0
print(df.dtypes)
# col1 float64
# col2 float64
# col3 int64
# dtype: object