Understanding Central Tendencies in Data Science

5 min read · Sep 19, 2022

Many of us have been perplexed by terms like mean, median, and mode, and by the contexts in which each is used. Although all three are measures of central tendency and are sometimes used interchangeably, they represent distinct things. This article explains what each one means and how it is used in Data Science.

Central Tendencies

The term “central tendencies” refers to the way data in a distribution tend to cluster around a central value, hence the name. In layman’s terms, the study of central tendencies is the study of the arithmetic mean, the median, and the mode. Comparing these three values also reveals how skewed the data are. Let us look at the definition of each term.

Mean

The mean, or arithmetic mean, is the sum of all numerical values divided by the number of values. The values must be quantitative, that is, numeric values that can be used in computation; the mean is generally not meaningful for ordinal data, whose categories are ranked but not evenly spaced. Quantitative values can be either continuous or discrete. A continuous quantity can take any value between a lower and an upper bound (for example, height). A discrete quantity can take only separate, countable values (for example, the number of students in a class).

The mean of a data distribution can be calculated using the following formula:
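For n values x_1, ..., x_n (notation assumed here for illustration), the standard definition of the arithmetic mean can be written as:

```latex
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]
```

For instance, the values 2, 4, and 9 give a mean of (2 + 4 + 9) / 3 = 5.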

Usage in Data Science:

A common application of the mean is imputing missing values in data that are approximately normally distributed. For example, if a dataset of students from the same class has some missing ages, we can reasonably assume that each missing age lies within a narrow range around the class average. Note that the mean is most useful when the variance in the data is relatively low or the data are roughly symmetrically distributed. In other words, there should not be many outliers in the data, because extreme values pull the mean toward them.
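The class-age example can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production imputation routine; the helper name `impute_with_mean` and the sample ages are hypothetical.

```python
from statistics import mean


def impute_with_mean(values):
    """Return a copy of `values` with None replaced by the mean of the rest."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages for one class; None marks a missing entry.
ages = [14, 15, None, 14, 15, None, 16]
imputed_ages = impute_with_mean(ages)
print(imputed_ages)
```

Because the observed ages are tightly grouped, the imputed value of 14.8 is a plausible stand-in. With a few extreme ages in the list, the same approach would give a much less representative fill value.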

Median

The term “median” refers to the middle value of a distribution once its values have been sorted. The term literally means “middle point”: half of the observations lie at or below the median and half lie at or above it.

The formula for calculating the median depends on whether the total number of observations in the distribution is odd or even. The formula is as shown below:
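Writing the n observations in ascending order as x_(1) ≤ x_(2) ≤ ... ≤ x_(n) (notation assumed here for illustration), the standard definition of the median for the odd and even cases is:

```latex
\[
\text{Median} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[6pt]
\dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even}
\end{cases}
\]
```

For instance, the sorted values 1, 3, 7 have median 3, while 1, 3, 7, 9 have median (3 + 7) / 2 = 5.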

Usage in Data Science:

As the name implies, the median is located at the center of the distribution and is therefore robust to extreme values. In other words, when there are a lot of outliers in the data, the median is a safer choice than the mean for filling in missing values. Note that the values must be sorted in ascending (or descending) order before the middle value is picked; taking the middle of the data in the order in which it appears does not, in general, give the median. Typical applications include finding the median pay of employees in a corporation, real estate prices, and ad revenue.
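A short sketch makes the contrast with the mean concrete. The salary figures below are made up for illustration, and `median_manual` is a hypothetical helper that spells out the sort-then-pick-the-middle procedure; in practice the standard library's `statistics.median` does the same job.

```python
from statistics import mean, median


def median_manual(values):
    """Median computed by hand: sort first, then take the middle."""
    if not values:
        raise ValueError("median of empty data")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical salaries, deliberately unsorted, with one extreme outlier.
salaries = [50_000, 1_000_000, 42_000, 47_000, 52_000, 45_000]

salary_mean = mean(salaries)
salary_median = median(salaries)
print(salary_mean, salary_median)
```

Here the single 1,000,000 salary drags the mean up to 206,000, while the median stays at 48,500, much closer to what a typical employee in this sample earns.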

Mode

The mode is the most frequently occurring value in a data distribution. In other words, the mode of a discrete data distribution is the value with the highest chance of being picked at random. For continuous data grouped into class intervals, however, we must first identify the modal class, that is, the class with the highest frequency. Then, to estimate the mode, we need the frequencies of the classes immediately preceding and succeeding the modal class, as well as the modal class's own frequency, lower boundary, and width. The formula is shown below:
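The commonly used textbook formula for the mode of grouped data is given here, with symbol names assumed for illustration: L is the lower boundary of the modal class, h its width, f_1 its frequency, and f_0 and f_2 the frequencies of the preceding and succeeding classes.

```latex
\[
\text{Mode} = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h
\]
```

For instance, with a modal class of 20 to 30 (so L = 20 and h = 10), f_1 = 12, f_0 = 8, and f_2 = 6, the estimate is 20 + (4 / 10) × 10 = 24.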

Usage of Mode in Data Science:

The mode can be used for any type of data and is especially useful for nominal (qualitative) data, where neither the mean nor the median can be computed. For example, it tells us which value of a categorical variable to use when filling in missing entries. Furthermore, because the mode depends only on the most frequent value in the distribution, it is largely unaffected by outliers. Its main drawbacks are that it does not lend itself to further mathematical computation and that a distribution can have more than one mode. Common uses include determining the most frequently sold product, as well as analyses of job vacancies and marketing data.
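As a rough sketch of the mode on categorical data, the snippet below uses Python's standard library. The product and color lists are invented for illustration, and `impute_with_mode` is a hypothetical helper name.

```python
from collections import Counter
from statistics import mode, multimode

# Hypothetical sales log: the mode is the most frequently sold product.
products_sold = ["pen", "notebook", "pen", "eraser", "pen", "notebook"]
top_product = mode(products_sold)
counts = Counter(products_sold)
print(top_product, counts[top_product])

# A distribution can have several modes; multimode returns all of them.
tied = multimode(["red", "blue", "red", "blue", "green"])
print(tied)


def impute_with_mode(values):
    """Return a copy of `values` with None replaced by the most common value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical categorical column with missing entries.
colors = ["red", None, "blue", "red", None]
filled_colors = impute_with_mode(colors)
print(filled_colors)
```

No arithmetic is performed on the categories themselves; only their frequencies are counted, which is why the mode works where the mean and median do not.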

Data Skew using Central Tendencies

If the data are negatively skewed, meaning that most observations lie at the higher end with a long tail toward lower values, the mean typically falls to the left of the median and the mode to the right. This is because the few low values in the tail pull the mean down, while the median and mode stay close to where the majority of the samples lie.

For a normal distribution, which is symmetric, the mean, median, and mode coincide at the distribution’s center.

If the data are positively skewed, that is, most observations lie at the lower end with a long tail toward higher values, the mode typically falls to the left of the median and the mean to the right. This is because the few high values in the tail pull the mean up, while the mode stays where most of the lower-valued observations lie.
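The three cases can be checked on small samples. The sketch below uses invented data and a deliberately crude heuristic (comparing the mean with the median) in a hypothetical helper `skew_direction`; it is an illustration of the ordering described above, not a formal skewness statistic.

```python
from statistics import mean, median, mode


def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(values), median(values), mode(values)


def skew_direction(values, tol=1e-9):
    """Crude heuristic: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m - med > tol:
        return "positive"
    if med - m > tol:
        return "negative"
    return "roughly symmetric"


# Hypothetical samples chosen to show each case.
positive_sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]          # long right tail
negative_sample = [15, 14, 14, 14, 13, 13, 12, 11, 7, 1]   # long left tail
symmetric_sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]

for name, sample in [
    ("positive", positive_sample),
    ("negative", negative_sample),
    ("symmetric", symmetric_sample),
]:
    print(name, summarize(sample), skew_direction(sample))
```

In the positively skewed sample the order is mode (2) < median (3) < mean (4.6); in the negatively skewed sample it reverses to mean (11.4) < median (13) < mode (14); in the symmetric sample all three equal 3. Real data do not always follow this ordering exactly, so the comparison is best treated as a quick diagnostic.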

Conclusion

By now, you should have a good understanding of the measures of central tendency that are widely used in statistics, the main differences between them, and some examples of how they are applied.

Feel free to follow me for more such related content on Data Science, Artificial Intelligence and Machine Learning!!

Written by Debanjan Saha
