pandas: How to use astype() to cast dtype of DataFrame
pandas.Series has a single data type (dtype), while pandas.DataFrame can have a different data type for each column.
You can specify dtype in various contexts, such as when creating a new object using a constructor or when reading from a CSV file. Additionally, you can cast an existing object to a different dtype using the astype() method.
See the following article on how to extract columns by dtype.
See the following article about dtype and astype() in NumPy.
Please note that the sample code used in this article is based on pandas version 2.0.3 and behavior may vary with different versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.0.3
List of basic data types (dtype) in pandas
The following is a list of basic data types (dtype) in pandas.
| dtype | character code | description |
|---|---|---|
| int8 | i1 | 8-bit signed integer |
| int16 | i2 | 16-bit signed integer |
| int32 | i4 | 32-bit signed integer |
| int64 | i8 | 64-bit signed integer |
| uint8 | u1 | 8-bit unsigned integer |
| uint16 | u2 | 16-bit unsigned integer |
| uint32 | u4 | 32-bit unsigned integer |
| uint64 | u8 | 64-bit unsigned integer |
| float16 | f2 | 16-bit floating-point number |
| float32 | f4 | 32-bit floating-point number |
| float64 | f8 | 64-bit floating-point number |
| float128 | f16 | 128-bit floating-point number |
| complex64 | c8 | 64-bit complex floating-point number |
| complex128 | c16 | 128-bit complex floating-point number |
| complex256 | c32 | 256-bit complex floating-point number |
| bool | ? | Boolean (True or False) |
| unicode | U | Unicode string |
| object | O | Python objects |
Note that the numbers in dtype represent bits, whereas those in character codes represent bytes. The character code for the bool type is ?. It does not mean unknown; rather, ? is literally assigned.
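As a quick check of this convention, the character codes can be resolved with NumPy's np.dtype() (a minimal sketch; nothing here is specific to pandas):

```python
import numpy as np

# 'f8' means an 8-byte (64-bit) float; '?' is the character code for bool
print(np.dtype('f8'))
# float64

print(np.dtype('?'))
# bool

# itemsize is the size of one element in bytes
print(np.dtype('f8').itemsize)
# 8
```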
You can specify dtype in various ways. For example, any of the following representations can be used for float64:
- np.float64
- 'float64'
- 'f8'
s = pd.Series([0, 1, 2], dtype=np.float64)
print(s.dtype)
# float64
s = pd.Series([0, 1, 2], dtype='float64')
print(s.dtype)
# float64
s = pd.Series([0, 1, 2], dtype='f8')
print(s.dtype)
# float64
You can also specify data types using Python types like int, float, or str, without specifying bit-precision.
In such cases, the type is converted to the equivalent dtype. The following examples assume a 64-bit Python 3 environment. Although uint is not a native Python type, it is included in the table for convenience.
| Python type | Example of equivalent dtype |
|---|---|
| int | int64 |
| float | float64 |
| str | object (each element is str) |
| (uint) | uint64 |
You can use the types themselves, such as int and float, or the strings 'int' and 'float'. Note that the bare name uint cannot be used, since it is not defined in Python, but the string 'uint' can.
s = pd.Series([0, 1, 2], dtype='float')
print(s.dtype)
# float64
s = pd.Series([0, 1, 2], dtype=float)
print(s.dtype)
# float64
s = pd.Series([0, 1, 2], dtype='uint')
print(s.dtype)
# uint64
You can check the range of possible values (minimum and maximum) for integer and floating-point types with np.iinfo() and np.finfo().
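For example (a minimal sketch; the values below are fixed by the IEEE and two's-complement representations, not by the environment):

```python
import numpy as np

# Range of values representable by integer dtypes
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
# -128 127

print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)
# 0 255

# finfo() provides the same information for floating-point dtypes
print(np.finfo(np.float64).max)
# 1.7976931348623157e+308
```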
The data types discussed here are primarily based on NumPy, but pandas also provides some extension data types of its own.
object type and string
This section explains the object type and the string (str).
Note that StringDtype was introduced in pandas version 1.0.0 as a data type for strings. It may become the standard string type in the future, but it is not covered in detail here. See the official documentation for details.
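For reference, a minimal sketch of StringDtype (behavior shown for pandas 2.0.3; the details may change in future versions):

```python
import pandas as pd

# dtype='string' uses the dedicated StringDtype instead of object,
# and missing values are represented by pd.NA rather than NaN
s_string = pd.Series(['a', 'bb', None], dtype='string')
print(s_string)
# 0       a
# 1      bb
# 2    <NA>
# dtype: string

# string methods propagate <NA> instead of falling back to float NaN
print(s_string.str.len())
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64
```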
The special data type: object
The object type is a special data type that can store references to any Python objects. Each element may be of a different type.
The data type for Series and DataFrame columns containing strings is object. However, each element can have its own distinct type, meaning not all elements need to be strings.
Here are some examples. The built-in function type() is applied to each element using the map() method to check its type. np.nan represents a missing value.
- pandas: Apply functions to values, rows, columns with map(), apply()
- Get and check the type of an object in Python: type(), isinstance()
- Missing values in pandas (nan, None, pd.NA)
s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_object.map(type))
# 0 <class 'int'>
# 1 <class 'str'>
# 2 <class 'float'>
# dtype: object
If str is specified in the astype() method (see below for details), all elements, including NaN, are converted to str. The dtype remains object.
s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0 0
# 1 abcde
# 2 nan
# dtype: object
print(s_str_astype.map(type))
# 0 <class 'str'>
# 1 <class 'str'>
# 2 <class 'str'>
# dtype: object
If str is specified in the dtype argument of the constructor, NaN remains float. Note that, in version 0.22.0, NaN was converted to str.
s_str_constructor = pd.Series([0, 'abcde', np.nan], dtype=str)
print(s_str_constructor)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_str_constructor.map(type))
# 0 <class 'str'>
# 1 <class 'str'>
# 2 <class 'float'>
# dtype: object
Note: String methods
Note that even when the dtype is object, the result of string methods (accessed via the str accessor) can differ based on the type of each element.
For example, when str.len(), which returns the number of characters, is applied, elements of numeric type return NaN.
s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_object.str.len())
# 0 NaN
# 1 5.0
# 2 NaN
# dtype: float64
If the result of a string method includes NaN, some elements may not be of type str, even if the column's data type is object. In such cases, you can apply astype(str) before using the string method.
s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0 0
# 1 abcde
# 2 nan
# dtype: object
print(s_str_astype.str.len())
# 0 1
# 1 5
# 2 3
# dtype: int64
See also the following articles for string methods.
- pandas: Handle strings (replace, strip, case conversion, etc.)
- pandas: Extract rows that contain specific strings from a DataFrame
- pandas: Split string columns by delimiters or regular expressions
Note: NaN
You can determine the missing value NaN with isnull() or remove it with dropna().
- pandas: Detect and count NaN (missing values) with isnull(), isna()
- pandas: Remove NaN (missing values) with dropna()
s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_object.map(type))
# 0 <class 'int'>
# 1 <class 'str'>
# 2 <class 'float'>
# dtype: object
print(s_object.isnull())
# 0 False
# 1 False
# 2 True
# dtype: bool
print(s_object.dropna())
# 0 0
# 1 abcde
# dtype: object
Note that when cast to str, NaN becomes the string 'nan' and is no longer treated as a missing value.
s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0 0
# 1 abcde
# 2 nan
# dtype: object
print(s_str_astype.map(type))
# 0 <class 'str'>
# 1 <class 'str'>
# 2 <class 'str'>
# dtype: object
print(s_str_astype.isnull())
# 0 False
# 1 False
# 2 False
# dtype: bool
print(s_str_astype.dropna())
# 0 0
# 1 abcde
# 2 nan
# dtype: object
You can handle it as a missing value before casting, or replace the string 'nan' with NaN afterward using replace().
s_str_astype_nan = s_str_astype.replace('nan', np.nan)
print(s_str_astype_nan)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_str_astype_nan.map(type))
# 0 <class 'str'>
# 1 <class 'str'>
# 2 <class 'float'>
# dtype: object
print(s_str_astype_nan.isnull())
# 0 False
# 1 False
# 2 True
# dtype: bool
Cast data type (dtype) with astype()
You can cast the data type (dtype) with the method astype() of DataFrame and Series.
- pandas.DataFrame.astype — pandas 2.0.3 documentation
- pandas.Series.astype — pandas 2.0.3 documentation
astype() returns a new Series or DataFrame with the specified dtype. The original object is not changed.
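A minimal sketch confirming that the original object is left untouched:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s_f = s.astype('float64')

# astype() returns a new Series with the requested dtype
print(s_f.dtype)
# float64

# the original Series keeps its dtype
print(s.dtype)
# int64
```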
Cast data type of pandas.Series
You can specify the data type (dtype) to astype().
s = pd.Series([1, 2, 3])
print(s)
# 0 1
# 1 2
# 2 3
# dtype: int64
s_f = s.astype('float64')
print(s_f)
# 0 1.0
# 1 2.0
# 2 3.0
# dtype: float64
As mentioned above, you can specify dtype in various forms.
s_f = s.astype('float')
print(s_f.dtype)
# float64
s_f = s.astype(float)
print(s_f.dtype)
# float64
s_f = s.astype('f8')
print(s_f.dtype)
# float64
Cast data type of all columns of pandas.DataFrame
DataFrame has the data type (dtype) for each column. You can check each dtype with the dtypes attribute.
df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})
print(df)
# a b c
# 0 11 12 13
# 1 21 22 23
# 2 31 32 33
print(df.dtypes)
# a int64
# b int64
# c int64
# dtype: object
If you specify the data type (dtype) to astype(), the data types of all columns are changed.
df_f = df.astype('float64')
print(df_f)
# a b c
# 0 11.0 12.0 13.0
# 1 21.0 22.0 23.0
# 2 31.0 32.0 33.0
print(df_f.dtypes)
# a float64
# b float64
# c float64
# dtype: object
Cast data type of any column of pandas.DataFrame individually
You can change the data type (dtype) of any column individually by specifying a dictionary of {column name: data type} to astype().
df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})
print(df)
# a b c
# 0 11 12 13
# 1 21 22 23
# 2 31 32 33
print(df.dtypes)
# a int64
# b int64
# c int64
# dtype: object
df_fcol = df.astype({'a': float})
print(df_fcol)
# a b c
# 0 11.0 12 13
# 1 21.0 22 23
# 2 31.0 32 33
print(df_fcol.dtypes)
# a float64
# b int64
# c int64
# dtype: object
df_fcol2 = df.astype({'a': 'float32', 'c': 'int8'})
print(df_fcol2)
# a b c
# 0 11.0 12 13
# 1 21.0 22 23
# 2 31.0 32 33
print(df_fcol2.dtypes)
# a float32
# b int64
# c int8
# dtype: object
Specify data type (dtype) when reading CSV files with read_csv()
In pandas, pd.read_csv() is used to read CSV files, and you can set data types using the dtype argument.
Use the following CSV file as an example.
,a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z
If the dtype argument is omitted, a data type is automatically chosen for each column.
df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df)
# a b c d
# ONE 1 1 100.0 x
# TWO 2 20 NaN y
# THREE 3 300 300.0 z
print(df.dtypes)
# a int64
# b int64
# c float64
# d object
# dtype: object
Specify the same data type (dtype) for all columns
If you specify a data type for the dtype argument, all columns are converted to that type. If there are columns that cannot be converted to the specified data type, an error will be raised.
# pd.read_csv('data/src/sample_header_index_dtype.csv',
# index_col=0, dtype=float)
# ValueError: could not convert string to float: 'ONE'
If you set dtype=str, all columns are converted to strings. However, in this case, the missing value (NaN) will still be of type float.
df_str = pd.read_csv('data/src/sample_header_index_dtype.csv',
index_col=0, dtype=str)
print(df_str)
# a b c d
# ONE 1 001 100 x
# TWO 2 020 NaN y
# THREE 3 300 300 z
print(df_str.dtypes)
# a object
# b object
# c object
# d object
# dtype: object
print(df_str.applymap(type))
# a b c d
# ONE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# TWO <class 'str'> <class 'str'> <class 'float'> <class 'str'>
# THREE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
If you read the file without specifying dtype and then cast it to str with astype(), NaN values are also converted to the string 'nan'.
df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df.astype(str))
# a b c d
# ONE 1 1 100.0 x
# TWO 2 20 nan y
# THREE 3 300 300.0 z
print(df.astype(str).applymap(type))
# a b c d
# ONE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# TWO <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# THREE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
Specify data type (dtype) for each column
As with astype(), you can use a dictionary to specify the data type for each column in read_csv().
df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
index_col=0, dtype={'a': float, 'b': str})
print(df_col)
# a b c d
# ONE 1.0 001 100.0 x
# TWO 2.0 020 NaN y
# THREE 3.0 300 300.0 z
print(df_col.dtypes)
# a float64
# b object
# c float64
# d object
# dtype: object
The dictionary keys can also be column numbers. Note that when an index column is specified with index_col, the column numbers are counted including the index column; in this example, column 0 is the index, so a and b are columns 1 and 2.
df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
index_col=0, dtype={1: float, 2: str})
print(df_col)
# a b c d
# ONE 1.0 001 100.0 x
# TWO 2.0 020 NaN y
# THREE 3.0 300 300.0 z
print(df_col.dtypes)
# a float64
# b object
# c float64
# d object
# dtype: object
Implicit type conversions
In addition to explicit type conversions using astype(), data types may also be converted implicitly during certain operations.
Consider a DataFrame with columns of integer (int) and columns of floating point (float) as an example.
df_mix = pd.DataFrame({'col_int': [0, 1, 2], 'col_float': [0.0, 0.1, 0.2]}, index=['A', 'B', 'C'])
print(df_mix)
# col_int col_float
# A 0 0.0
# B 1 0.1
# C 2 0.2
print(df_mix.dtypes)
# col_int int64
# col_float float64
# dtype: object
Implicit type conversion by arithmetic operations
For example, the result of addition by the + operator of an int column to a float column is a float.
print(df_mix['col_int'] + df_mix['col_float'])
# A 0.0
# B 1.1
# C 2.2
# dtype: float64
Similarly, operations with scalar values implicitly convert the data type. The result of division by the / operator is float.
print(df_mix / 1)
# col_int col_float
# A 0.0 0.0
# B 1.0 0.1
# C 2.0 0.2
print((df_mix / 1).dtypes)
# col_int float64
# col_float float64
# dtype: object
For arithmetic operations like +, -, *, //, and **, operations involving only integers return int, while those involving at least one floating-point number return float. This is equivalent to the implicit type conversion of the NumPy array ndarray.
print(df_mix * 1)
# col_int col_float
# A 0 0.0
# B 1 0.1
# C 2 0.2
print((df_mix * 1).dtypes)
# col_int int64
# col_float float64
# dtype: object
print(df_mix * 1.0)
# col_int col_float
# A 0.0 0.0
# B 1.0 0.1
# C 2.0 0.2
print((df_mix * 1.0).dtypes)
# col_int float64
# col_float float64
# dtype: object
Implicit type conversion by transposition, etc.
The data type may change when you select a row as a Series using loc or iloc, or when you transpose a DataFrame with T or transpose().
print(df_mix.loc['A'])
# col_int 0.0
# col_float 0.0
# Name: A, dtype: float64
print(df_mix.T)
# A B C
# col_int 0.0 1.0 2.0
# col_float 0.0 0.1 0.2
print(df_mix.T.dtypes)
# A float64
# B float64
# C float64
# dtype: object
Implicit type conversion by assignment to elements
The data type may also be implicitly converted when assigning a value to an element.
For example, assigning a float value to an element of an int column converts the whole column to float. Assigning an int value to an element of a float column converts the assigned value itself to float, so the column remains float.
df_mix.at['A', 'col_int'] = 10.1
df_mix.at['A', 'col_float'] = 10
print(df_mix)
# col_int col_float
# A 10.1 10.0
# B 1.0 0.1
# C 2.0 0.2
print(df_mix.dtypes)
# col_int float64
# col_float float64
# dtype: object
When a string value is assigned to an element in a numeric column, the data type of the column is cast to object.
df_mix.at['A', 'col_float'] = 'abc'
print(df_mix)
# col_int col_float
# A 10.1 abc
# B 1.0 0.1
# C 2.0 0.2
print(df_mix.dtypes)
# col_int float64
# col_float object
# dtype: object
print(df_mix.applymap(type))
# col_int col_float
# A <class 'float'> <class 'str'>
# B <class 'float'> <class 'float'>
# C <class 'float'> <class 'float'>
The sample code above is based on version 2.0.3. In version 0.22.0, the column type remained unchanged after assigning an element of a different type, though the type of the assigned element itself changed. Note that the behavior might differ depending on the version.