Calculate Mean, Median, Mode, Variance, Standard Deviation in Python
The Python statistics module provides various statistical operations, such as the computation of mean, median, mode, variance, and standard deviation.
This article does not cover every function in the module, such as those for the harmonic and geometric means. Refer to the official documentation for details.
NumPy, which must be installed separately, also supports operations along the rows and columns of two-dimensional arrays, among other features.
The sample code in this article uses the statistics and math modules. Both are included in the standard library and do not require additional installation.
import statistics
import math
Mean (arithmetic mean): statistics.mean()
statistics.mean() calculates the arithmetic mean, which is the sum of the elements divided by their count. It accepts iterable objects, such as lists and tuples, as arguments. The same applies to the functions presented in the following sections.
l = [1, 3, 8, 15]
print(statistics.mean(l))
# 6.75
You can also calculate the mean with the built-in functions sum() and len().
print(sum(l) / len(l))
# 6.75
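As a brief illustration of the note that iterables other than lists are accepted, the following sketch passes a tuple and a generator expression to statistics.mean(). It also shows that empty data raises statistics.StatisticsError. The variable name t and the printed message are chosen here only for illustration.

```python
import statistics

# A tuple works the same way as a list.
t = (1, 3, 8, 15)
print(statistics.mean(t))
# 6.75

# So does a generator expression.
print(statistics.mean(x for x in range(1, 5)))
# 2.5

# Empty data has no mean, so StatisticsError is raised.
try:
    statistics.mean([])
except statistics.StatisticsError:
    print("cannot take the mean of empty data")
# cannot take the mean of empty data
```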
Median: statistics.median(), statistics.median_low(), statistics.median_high()
statistics.median(), statistics.median_low(), and statistics.median_high() find the median, the middle value of the data in sorted order. The data does not need to be sorted beforehand.
- statistics.median() — Mathematical statistics functions — Python 3.11.4 documentation
- statistics.median_low() — Mathematical statistics functions — Python 3.11.4 documentation
- statistics.median_high() — Mathematical statistics functions — Python 3.11.4 documentation
If the number of data points is odd, all three functions return the middle value directly.
l = [3, 1, 8]
print(statistics.median(l))
# 3
print(statistics.median_low(l))
# 3
print(statistics.median_high(l))
# 3
If the number of data points is even, statistics.median() returns the arithmetic mean of the two middle values, statistics.median_low() returns the smaller of the two, and statistics.median_high() returns the larger.
l = [3, 1, 8, 15]
print(statistics.median(l))
# 5.5
print(statistics.median_low(l))
# 3
print(statistics.median_high(l))
# 8
To sort the data yourself, you can use the built-in sorted() function or the sort() method of lists.
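To make the role of sorting concrete, here is a minimal sketch of a median computed by hand with sorted(). The function name median_sketch is hypothetical and is not part of the statistics module. It is meant only to mirror the behavior described above for odd and even numbers of data points.

```python
import statistics


def median_sketch(data):
    """Hypothetical helper: median via sorted(), for illustration only."""
    values = sorted(data)  # sorted() returns a new list; the input is untouched
    n = len(values)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return values[mid]
    # Even count: the mean of the two middle values.
    return (values[mid - 1] + values[mid]) / 2


l = [3, 1, 8, 15]
print(median_sketch(l))
# 5.5
print(median_sketch(l) == statistics.median(l))
# True
print(l)
# [3, 1, 8, 15]
```

Because sorted() returns a new list, the original list keeps its order. Calling l.sort() would sort it in place instead.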
Mode: statistics.mode(), statistics.multimode()
statistics.mode() and statistics.multimode() find the mode, which is the most frequently occurring value.
- statistics.mode() — Mathematical statistics functions — Python 3.11.4 documentation
- statistics.multimode() — Mathematical statistics functions — Python 3.11.4 documentation
statistics.multimode() always returns the modes as a list, even if there is only one.
l = [3, 2, 3, 2, 1, 2]
print(statistics.mode(l))
# 2
print(statistics.multimode(l))
# [2]
If multiple modes exist, statistics.mode() returns the first one encountered in the data. In Python versions before 3.8, it raises StatisticsError instead.
l = [3, 2, 3, 2, 1, 2, 3]
print(statistics.mode(l))
# 3
print(statistics.multimode(l))
# [3, 2]
You can also use the Counter class from the collections module to count the frequency of each element and sort the elements by frequency.
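The following sketch shows one way the Counter approach might look. most_common() lists (element, count) pairs in descending order of count, and the modes are the elements whose count equals the maximum. The variable names counts, max_count, and modes are chosen here for illustration.

```python
from collections import Counter

l = [3, 2, 3, 2, 1, 2, 3]

# Count how often each element occurs.
counts = Counter(l)

# (element, count) pairs, most frequent first.
print(counts.most_common())
# [(3, 3), (2, 3), (1, 1)]

# Collect every element that reaches the highest count.
max_count = max(counts.values())
modes = [value for value, count in counts.items() if count == max_count]
print(modes)
# [3, 2]
```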
Variance
Population variance: statistics.pvariance()
statistics.pvariance() computes the population variance, which is the appropriate measure when the data represents the entire population.
l = [10, 1, 3, 7, 1]
print(statistics.pvariance(l))
# 12.64
The population variance $\sigma^2$ is calculated as follows for a population consisting of $n$ data points with mean $\mu$.
$$ \sigma^2=\frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)^2 $$
By default, the mean is calculated automatically. However, the optional second argument, mu, lets you specify the mean directly. For example, if you have already calculated the mean, passing it as mu avoids recalculating it.
mu = statistics.mean(l)
print(statistics.pvariance(l, mu))
# 12.64
You can also calculate the population variance with the built-in functions sum() and len().
print(sum((x - sum(l) / len(l)) ** 2 for x in l) / len(l))
# 12.64
A generator expression is passed to sum(). Note that this one-liner recomputes the mean, sum(l) / len(l), for every element.
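For longer data, the same formula can be written so that the mean is computed only once. The helper below is a minimal sketch: the name population_variance is hypothetical, not a function of the statistics module. The result is compared with statistics.pvariance() using math.isclose(), because plain float arithmetic may differ in the last digits.

```python
import math
import statistics


def population_variance(data):
    """Hypothetical helper: population variance with the mean computed once."""
    values = list(data)
    n = len(values)
    if n == 0:
        raise ValueError("population variance requires at least one data point")
    mean = sum(values) / n
    # Mean of the squared deviations from the mean.
    return sum((x - mean) ** 2 for x in values) / n


l = [10, 1, 3, 7, 1]
result = population_variance(l)
print(math.isclose(result, statistics.pvariance(l)))
# True
```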
Sample variance: statistics.variance()
statistics.variance() computes the sample variance, which is the appropriate measure when the data is a sample from a larger population.
l = [10, 1, 3, 7, 1]
print(statistics.variance(l))
# 15.8
This function calculates the unbiased sample variance, whose denominator is $n-1$ rather than $n$. This adjustment, known as Bessel's correction, corrects the bias that arises when the population variance is estimated from a sample.
The unbiased sample variance $s^2$ is calculated as follows for a sample of $n$ data points with sample mean $\overline{x}$.
$$ s^2=\frac{1}{n-1} \sum_{i=1}^{n} (x_i-\overline{x})^2 $$
By default, the mean is calculated automatically. However, the optional second argument, xbar, lets you specify the mean directly. For example, if you have already calculated the sample mean, passing it as xbar avoids recalculating it.
xbar = statistics.mean(l)
print(statistics.variance(l, xbar))
# 15.8
You can also calculate the sample variance with the built-in functions sum() and len().
print(sum((x - sum(l) / len(l)) ** 2 for x in l) / (len(l) - 1))
# 15.8
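Since the two formulas differ only in the denominator, the sample variance equals the population variance multiplied by $n/(n-1)$. The sketch below checks this relationship with math.isclose(). It also shows that statistics.variance() raises StatisticsError for a single data point, where the denominator $n-1$ would be zero. The variable name scaled is chosen here for illustration.

```python
import math
import statistics

l = [10, 1, 3, 7, 1]
n = len(l)

# Rescale the population variance from denominator n to n - 1.
scaled = statistics.pvariance(l) * n / (n - 1)
print(math.isclose(scaled, statistics.variance(l)))
# True

# A single data point has a population variance of zero ...
print(statistics.pvariance([5]) == 0)
# True

# ... but its sample variance is undefined.
try:
    statistics.variance([5])
except statistics.StatisticsError as e:
    print("StatisticsError:", e)
```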
Standard deviation
Population standard deviation: statistics.pstdev()
statistics.pstdev() returns the population standard deviation.
l = [10, 1, 3, 7, 1]
print(statistics.pstdev(l))
# 3.5552777669262356
The population standard deviation is the square root of the population variance.
print(math.sqrt(statistics.pvariance(l)))
# 3.5552777669262356
Sample standard deviation: statistics.stdev()
statistics.stdev() returns the sample standard deviation.
l = [10, 1, 3, 7, 1]
print(statistics.stdev(l))
# 3.9749213828703582
The sample standard deviation is the square root of the sample variance.
print(math.sqrt(statistics.variance(l)))
# 3.9749213828703582
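Like the variance functions, statistics.pstdev() and statistics.stdev() accept an optional second argument (mu and xbar, respectively) for a mean that has already been calculated. The following is a small sketch of that usage. The results are compared with math.isclose() rather than shown as exact digits.

```python
import math
import statistics

l = [10, 1, 3, 7, 1]

# Compute the mean once and reuse it for both standard deviations.
mean = statistics.mean(l)

print(math.isclose(statistics.pstdev(l, mean), statistics.pstdev(l)))
# True
print(math.isclose(statistics.stdev(l, mean), statistics.stdev(l)))
# True
```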