pandas: Write DataFrame to CSV with to_csv()
You can write data from pandas.DataFrame and pandas.Series to CSV files using the to_csv() method. This method also allows appending to an existing CSV file. By altering the delimiter, the data can be saved as a TSV (Tab-separated values) file.
- pandas.DataFrame.to_csv — pandas 2.0.3 documentation
- pandas.Series.to_csv — pandas 2.0.3 documentation
Not all arguments are covered in this article. For a comprehensive understanding of all arguments, please refer to the official documentation linked above.
- Write to CSV file with
to_csv() - Write only specific columns:
columns - Write column/row names:
header,index - Encoding:
encoding - Delimiter:
sep - Write mode (write/overwrite or append):
mode - Handle missing values
NaN:na_rep - Format for floating-point numbers (
float):float_format - Write with custom formats
The pandas.read_csv() function allows you to read CSV files and load them into DataFrame objects.
The sample code in this article uses pandas version 2.0.3. Consider the following DataFrame as an example.
import pandas as pd
print(pd.__version__)
# 2.0.3
df = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0).head(3)
print(df)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
The following examples use DataFrame but are equally applicable to Series.
Write to CSV file with to_csv()
to_csv() is provided as a method for both pandas.DataFrame and pandas.Series.
By specifying the file path as the first argument, the data can be written to that path in CSV format.
df.to_csv('data/dst/to_csv_out.csv')
name,age,state,point
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70
The path can be either absolute or relative. For how to check and change the current directory, see the following article.
Write only specific columns: columns
To write only specific columns, specify a list of column names to the columns argument.
df.to_csv('data/dst/to_csv_out_columns.csv', columns=['age', 'point'])
name,age,point
Alice,24,64
Bob,42,92
Charlie,18,70
By default, columns is set to None, and all columns are included in the output.
Write column/row names: header, index
You can control whether to write column names (columns) and row names (index) by setting the header and index arguments to True or False.
df.to_csv('data/dst/to_csv_out_header_index.csv', header=False, index=False)
24,NY,64
42,CA,92
18,CA,70
By default, both are set to True, and both columns (header) and index are included in the output, as demonstrated in the previous examples.
Encoding: encoding
The encoding argument allows you to define the encoding of the output CSV file. By default, this is utf-8. You can change the encoding as needed, for example, to encoding='shift_jis' or encoding='cp932'.
Delimiter: sep
The sep argument allows you to set the delimiter. The default delimiter is a comma ,, which produces a CSV file.
To create a TSV file, which uses a tab character \t as the delimiter, you can simply set sep='\t'.
df.to_csv('data/dst/to_csv_out.tsv', sep='\t')
name age state point
Alice 24 NY 64
Bob 42 CA 92
Charlie 18 CA 70
Write mode (write/overwrite or append): mode
The mode argument allows you to specify the write mode, similar to the built-in open() function.
By default, mode is set to 'w'. If the specified path does not exist, it creates a new file; if the path already exists, it overwrites the existing file.
To prevent overwriting an existing file, use mode='x'. If the specified path does not exist, a new file is created; if it does, an error is returned, and the file is not overwritten.
# df.to_csv('data/dst/to_csv_out.csv', mode='x')
# FileExistsError: [Errno 17] File exists: 'data/dst/to_csv_out.csv'
For appending data, use mode='a'. This will append the DataFrame to the end of the existing file as new rows. Note that the header (columns) will also be appended unless header=False is specified.
df.to_csv('data/dst/to_csv_out_a.csv')
df.to_csv('data/dst/to_csv_out_a.csv', mode='a', header=False)
name,age,state,point
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70
Alice,24,NY,64
Bob,42,CA,92
Charlie,18,CA,70
mode='a' is used to append rows. If you want to add columns, you should read the target file, add the columns, and then overwrite the original file. You can use mode='w' for overwriting the file, but this argument can be omitted because 'w' is the default mode.
df.to_csv('data/dst/to_csv_out_a_new_column.csv')
df_new = pd.read_csv('data/dst/to_csv_out_a_new_column.csv', index_col=0)
print(df_new)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
df_new['new_col'] = 'new data'
print(df_new)
# age state point new_col
# name
# Alice 24 NY 64 new data
# Bob 42 CA 92 new data
# Charlie 18 CA 70 new data
df_new.to_csv('data/dst/to_csv_out_a_new_column.csv')
name,age,state,point,new_col
Alice,24,NY,64,new data
Bob,42,CA,92,new data
Charlie,18,CA,70,new data
Handle missing values NaN: na_rep
Consider the following DataFrame containing missing values NaN.
df_nan = df.copy()
df_nan.iat[0, 1] = float('nan')
df_nan.iat[1, 2] = float('nan')
print(df_nan)
# age state point
# name
# Alice 24 NaN 64.0
# Bob 42 CA NaN
# Charlie 18 CA 70.0
By default, missing values are represented as empty strings in the output.
df_nan.to_csv('data/dst/to_csv_out_nan.csv')
name,age,state,point
Alice,24,,64.0
Bob,42,CA,
Charlie,18,CA,70.0
You can replace missing values with a specific string using the na_rep argument.
df_nan.to_csv('data/dst/to_csv_out_nan_rep.csv', na_rep='NaN')
name,age,state,point
Alice,24,NaN,64.0
Bob,42,CA,NaN
Charlie,18,CA,70.0
Refer to the following articles for information on handling missing values in DataFrame.
- pandas: Remove NaN (missing values) with dropna()
- pandas: Replace NaN (missing values) with fillna()
Format for floating-point numbers (float): float_format
Consider the following DataFrame.
df = pd.DataFrame({'col1': [0.123456789, 1000000000.0],
'col2': [123456789.0, 0.0],
'col3': [123456789, 0]})
print(df)
# col1 col2 col3
# 0 1.234568e-01 123456789.0 123456789
# 1 1.000000e+09 0.0 0
print(df.dtypes)
# col1 float64
# col2 float64
# col3 int64
# dtype: object
When you use print(), the output appears in scientific notation. However, this is due to the display settings and does not indicate that the actual value has been rounded off.
print(df.iat[0, 0])
# 0.123456789
print(df.iat[1, 0])
# 1000000000.0
By default, when using to_csv(), the value is saved as is.
df.to_csv('data/dst/to_csv_out_float_default.csv')
,col1,col2,col3
0,0.123456789,123456789.0,123456789
1,1000000000.0,0.0,0
With the float_format argument, you can control the formatting of floating-point numbers (float) when saving. You can either provide a format string in style used by printf (with a %), or use a callable object like format().
Here is an example using the printf-style, where the number of digits after the decimal point is set to three.
print('%.3f' % 0.123456789)
# 0.123
print('%.3f' % 123456789)
# 123456789.000
df.to_csv('data/dst/to_csv_out_float_format_3f.csv', float_format='%.3f')
,col1,col2,col3
0,0.123,123456789.000,123456789
1,1000000000.000,0.000,0
Here is an example using a callable object. This outputs in scientific notation with three digits after the decimal point.
print('{:.3e}'.format(0.123456789))
# 1.235e-01
print('{:.3e}'.format(123456789))
# 1.235e+08
df.to_csv('data/dst/to_csv_out_float_format_3e.csv', float_format='{:.3e}'.format)
,col1,col2,col3
0,1.235e-01,1.235e+08,123456789
1,1.000e+09,0.000e+00,0
Please note that specifying the number of digits may lead to loss of information beyond the specified number of digits during saving.
Write with custom formats
The float_format argument, as the name suggests, only applies to columns with the float data type. Columns with the integer (int) type remain as they are. Moreover, you can't specify different formats for each column.
If you want to specify a format for int columns, or use a different format for each float column, you should convert the data in the original DataFrame to strings in the desired format before saving it.
df = pd.DataFrame({'col1': [0.123456789, 1000000000.0],
'col2': [123456789.0, 0.0],
'col3': [123456789, 0]})
df['col1'] = df['col1'].map('{:.3f}'.format)
df['col2'] = df['col2'].map('{:.3e}'.format)
df['col3'] = df['col3'].map('{:#010x}'.format)
print(df)
# col1 col2 col3
# 0 0.123 1.235e+08 0x075bcd15
# 1 1000000000.000 0.000e+00 0x00000000
df.to_csv('data/dst/to_csv_out_float_format_str.csv')
,col1,col2,col3
0,0.123,1.235e+08,0x075bcd15
1,1000000000.000,0.000e+00,0x00000000
In the example above, note that when saving integers in hexadecimal form, pandas.read_csv() will interpret them as strings. If you want them to be treated as numerical data, you'll need to convert them after loading.
df = pd.read_csv('data/dst/to_csv_out_float_format_str.csv', index_col=0)
print(df)
# col1 col2 col3
# 0 1.230000e-01 123500000.0 0x075bcd15
# 1 1.000000e+09 0.0 0x00000000
print(df.dtypes)
# col1 float64
# col2 float64
# col3 object
# dtype: object
df['col3'] = df['col3'].map(lambda x: int(x, 16))
print(df)
# col1 col2 col3
# 0 1.230000e-01 123500000.0 123456789
# 1 1.000000e+09 0.0 0
print(df.dtypes)
# col1 float64
# col2 float64
# col3 int64
# dtype: object