Understanding Skewness: Measures & Interpretation in AI
Learn about skewness, a key statistical measure of asymmetry in a data distribution, and why it matters in AI, ML, and data analysis for informed decision-making and forecasting.
4. Skewness: Measures and Interpretation
Skewness is a statistical metric used to quantify the asymmetry in a data distribution. It indicates whether the data values are concentrated on one side of the mean, causing the distribution to lean towards the left or right. Understanding skewness is crucial for analyzing data distribution patterns, making informed decisions, assessing risk, and forecasting future trends.
4.1 What is Skewness?
Skewness describes the degree and direction of asymmetry in a probability distribution. A symmetrical distribution has a skewness of zero. In contrast, skewed distributions deviate from symmetry.
- Positive Skewness (Right Skew): The tail on the right side of the distribution is longer or fatter than the left side. The bulk of the data is concentrated on the left, and the mean is typically greater than the median.
- Negative Skewness (Left Skew): The tail on the left side of the distribution is longer or fatter than the right side. The bulk of the data is concentrated on the right, and the mean is typically less than the median.
- Zero Skewness (Symmetrical Distribution): The distribution is perfectly symmetrical around the mean. The mean, median, and mode are all equal.
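As a quick illustration of these sign conventions, the sketch below computes the sample skewness of a right-skewed and a left-skewed sample. It assumes NumPy and SciPy are available; the data are simulated and the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import skew  # moment-based sample skewness

rng = np.random.default_rng(42)

right_skewed = rng.exponential(scale=2.0, size=5_000)  # long right tail
left_skewed = -right_skewed                            # mirror image: long left tail

print(f"right-skewed sample: skewness = {skew(right_skewed):+.2f}")  # positive value
print(f"left-skewed sample:  skewness = {skew(left_skewed):+.2f}")   # negative value
```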
4.2 Measurement of Skewness
Several methods exist to measure skewness. The most common ones include Karl Pearson's coefficient of skewness, Bowley's coefficient of skewness, and Kelly's coefficient of skewness.
4.3 Karl Pearson's Measure of Skewness
Karl Pearson's coefficient of skewness is one of the most widely used measures. It is based on the relationship between the mean, median, and standard deviation.
Formula:
There are two common formulas for Karl Pearson's coefficient of skewness:
- Using Mean, Mode, and Standard Deviation: $$ \text{Skewness} (SK_1) = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} $$ Note: This formula is less reliable when the mode is not clearly defined or when the distribution is multimodal.
- Using Mean, Median, and Standard Deviation (more general): $$ \text{Skewness} (SK_1) = \frac{3 \times (\text{Mean} - \text{Median})}{\text{Standard Deviation}} $$ This formula is generally preferred because the median is more stable than the mode and is always defined.
Interpretation:
- $SK_1 = 0$: Perfectly symmetrical distribution.
- $SK_1 > 0$: Positively skewed (right tail is longer).
- $SK_1 < 0$: Negatively skewed (left tail is longer).
Example: Consider a dataset with Mean = 10, Median = 8, and Standard Deviation = 4. $$ SK_1 = \frac{3 \times (10 - 8)}{4} = \frac{3 \times 2}{4} = \frac{6}{4} = 1.5 $$ This indicates a pronounced positive skew (the right tail is substantially longer than the left).
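A minimal sketch of the median-based formula, assuming NumPy is available; `pearson_skewness` is a hypothetical helper name, and using the sample standard deviation (ddof=1) is an assumption rather than a requirement of the formula.

```python
import numpy as np

def pearson_skewness(data):
    """Karl Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    data = np.asarray(data, dtype=float)
    std = data.std(ddof=1)  # sample standard deviation (assumption: ddof=1)
    return 3 * (data.mean() - np.median(data)) / std

# Illustrative sample with a long right tail
sample = [2, 3, 3, 4, 5, 6, 9, 12, 18]
print(f"Pearson's SK1 = {pearson_skewness(sample):.2f}")  # positive => right skew
```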
4.4 Bowley's Measure of Skewness (Coefficient of Skewness based on Quartiles)
Bowley's coefficient of skewness is based on quartiles and is less sensitive to extreme values than Pearson's measure.
Formula:
$$ \text{Skewness} (SK_2) = \frac{Q_3 + Q_1 - 2 \times \text{Median}}{Q_3 - Q_1} $$ Where:
- $Q_1$ is the first quartile (25th percentile).
- $Q_3$ is the third quartile (75th percentile).
- Median is the second quartile ($Q_2$, 50th percentile).
Interpretation:
- $SK_2 = 0$: Symmetrical distribution.
- $SK_2 > 0$: Positively skewed.
- $SK_2 < 0$: Negatively skewed.
Example: Consider a dataset where $Q_1 = 20$, Median ($Q_2$) = 30, and $Q_3 = 45$. $$ SK_2 = \frac{45 + 20 - 2 \times 30}{45 - 20} = \frac{65 - 60}{25} = \frac{5}{25} = 0.2 $$ This indicates a slight positive skewness.
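A corresponding sketch for the quartile-based coefficient, again assuming NumPy; `bowley_skewness` is a hypothetical helper name, and NumPy's default percentile interpolation is an assumption (textbook quartile conventions can differ slightly).

```python
import numpy as np

def bowley_skewness(data):
    """Bowley's coefficient: (Q3 + Q1 - 2 * Q2) / (Q3 - Q1)."""
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Illustrative sample with a mild right skew
sample = [12, 15, 18, 20, 22, 25, 30, 38, 45, 60]
print(f"Bowley's SK2 = {bowley_skewness(sample):.2f}")
```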
4.5 Kelly's Measure of Skewness
Kelly's coefficient of skewness generalizes Bowley's approach by using percentiles further out in the tails (the 10th and 90th), so it captures asymmetry in the outer portions of the distribution that the middle 50% of the data may miss.
Formula:
$$ \text{Skewness} (SK_3) = \frac{P_{90} + P_{10} - 2 \times P_{50}}{P_{90} - P_{10}} $$ Where:
- $P_{10}$ is the 10th percentile.
- $P_{50}$ is the 50th percentile (which is the median).
- $P_{90}$ is the 90th percentile.
Interpretation:
As with Bowley's measure, a value of zero indicates symmetry, a positive value indicates positive skew, and a negative value indicates negative skew; the magnitude reflects the degree of asymmetry.
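The same pattern extends to the percentile-based measure. The sketch below assumes NumPy and uses a simulated right-skewed (lognormal) sample; `kelly_skewness` is a hypothetical helper name.

```python
import numpy as np

def kelly_skewness(data):
    """Kelly's coefficient: (P90 + P10 - 2 * P50) / (P90 - P10)."""
    p10, p50, p90 = np.percentile(data, [10, 50, 90])
    return (p90 + p10 - 2 * p50) / (p90 - p10)

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.8, size=1_000)  # right-skewed sample
print(f"Kelly's SK3 = {kelly_skewness(sample):.2f}")      # positive value expected
```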
4.6 Interpretation of Skewness
The value of the skewness coefficient provides insights into the shape of the distribution:
- Skewness close to 0: Indicates a relatively symmetrical distribution.
- Positive Skewness: The tail extends towards higher values. The mean is generally greater than the median. This implies that there are some unusually high values that are pulling the mean upwards.
- Negative Skewness: The tail extends towards lower values. The mean is generally less than the median. This implies that there are some unusually low values that are pulling the mean downwards.
General Guidelines for Interpretation (values are approximate and context-dependent):
- Between -0.5 and 0.5: fairly symmetrical.
- Between 0.5 and 1 (or between -1 and -0.5): moderately skewed.
- Greater than 1 (or less than -1): highly skewed.
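To make these thresholds concrete, here is a small helper that maps a coefficient to the rough labels above; the function name is hypothetical, and the cut-offs (0.5 and 1) follow the guideline values, which are approximate and context-dependent.

```python
def describe_skewness(value):
    """Map a skewness coefficient to the rough guideline labels."""
    magnitude = abs(value)
    if magnitude < 0.5:
        shape = "fairly symmetrical"
    elif magnitude <= 1:
        shape = "moderately skewed"
    else:
        shape = "highly skewed"
    direction = "positive (right)" if value > 0 else "negative (left)" if value < 0 else "none"
    return f"{shape}; direction: {direction}"

print(describe_skewness(1.5))   # highly skewed; direction: positive (right)
print(describe_skewness(-0.3))  # fairly symmetrical; direction: negative (left)
```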
4.7 Difference between Dispersion and Skewness
While both dispersion and skewness describe aspects of a data distribution, they measure different characteristics:
| Feature | Dispersion | Skewness |
|---|---|---|
| What it measures | The spread or variability of data points. | The asymmetry of the data distribution. |
| Key metrics | Range, Variance, Standard Deviation, IQR. | Karl Pearson's, Bowley's, Kelly's coefficients. |
| Indication | How far apart data points are from each other or the mean. | Whether the data is concentrated on one side of the mean. |
| Symmetry impact | Does not directly measure symmetry; a symmetrical distribution can have high or low dispersion. | Directly measures the lack of symmetry. |
Example: Two datasets can have the same mean and standard deviation but different skewness values, indicating different shapes of distribution. Conversely, datasets with different dispersions could have similar skewness if their asymmetry is comparable.
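To see this distinction in practice, the sketch below (assuming NumPy and SciPy are available) rescales a skewed sample so that it shares the mean and standard deviation of a symmetric one: the dispersion statistics match while the skewness values differ. The data are simulated for illustration only.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)

symmetric = rng.normal(loc=10, scale=4, size=10_000)
raw = rng.exponential(scale=1.0, size=10_000)
# Rescale the skewed sample to share the symmetric sample's mean and std.
skewed = (raw - raw.mean()) / raw.std() * symmetric.std() + symmetric.mean()

for name, x in [("symmetric", symmetric), ("skewed", skewed)]:
    print(f"{name:>9}: mean={x.mean():.2f}  std={x.std():.2f}  skewness={skew(x):+.2f}")
```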
4.8 Tests of Skewness
While visual inspection and calculation of skewness coefficients are common, statistical tests can formally assess whether a distribution significantly deviates from symmetry. Some common tests include:
- D'Agostino's K-squared test: A comprehensive test that combines measures of skewness and kurtosis.
- Bowley's skewness test: Based on the sign of Bowley's coefficient of skewness.
- Jarque-Bera test: Tests if sample data has skewness and kurtosis matching a normal distribution.
These tests are often implemented in statistical software packages.
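For example, SciPy exposes all three ideas: `scipy.stats.normaltest` implements D'Agostino's K-squared test, `scipy.stats.skewtest` tests the sample skewness alone, and `scipy.stats.jarque_bera` implements the Jarque-Bera test. The sketch below runs them on a simulated right-skewed sample; the data and seed are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=500)  # clearly right-skewed

k2, p_k2 = stats.normaltest(sample)    # D'Agostino's K-squared (skewness + kurtosis)
z, p_sk = stats.skewtest(sample)       # test based on sample skewness alone
jb, p_jb = stats.jarque_bera(sample)   # Jarque-Bera test

print(f"D'Agostino K^2: p = {p_k2:.4f}")
print(f"Skewness test:  p = {p_sk:.4f}")
print(f"Jarque-Bera:    p = {p_jb:.4f}")
# Small p-values indicate a significant departure from a symmetric, normal shape.
```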
4.9 Positive and Negative Skewness
4.10 Positive Skewness (Right Skew)
- Description: The right tail of the distribution is longer than the left tail. The bulk of the data is concentrated on the left side of the distribution.
- Relationship between Mean, Median, and Mode: Mean > Median > Mode (though the mode might not be distinct or present in all distributions).
- Visual Representation: the curve rises quickly, peaks over the lower values, and trails off in a long tail to the right.
- Examples: Income distributions (a few very high earners pull the mean up), age at retirement.
4.11 Negative Skewness (Left Skew)
- Description: The left tail of the distribution is longer than the right tail. The bulk of the data is concentrated on the right side of the distribution.
- Relationship between Mean, Median, and Mode: Mean < Median < Mode (again, with caveats about mode presence).
- Visual Representation: the curve peaks over the higher values, with a long tail stretching to the left.
- Examples: Test scores where most students score high but a few score very low; the lifespan of a product where most units last close to their design life but a few fail early.
4.12 Zero Skewness (Symmetrical Distribution)
- Description: The distribution is perfectly balanced around the mean. The left and right tails are mirror images of each other.
- Relationship between Mean, Median, and Mode: Mean = Median = Mode.
- Visual Representation: a symmetric, bell-like curve whose left and right tails are mirror images of each other.
- Examples: Normal distribution, heights of a large population.
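To tie the three cases together, the sketch below (assuming NumPy and SciPy are available) simulates a positively skewed, a negatively skewed, and a roughly symmetric sample, then prints the mean, median, and skewness of each so the orderings described above can be checked empirically; the chosen distributions are illustrative assumptions.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
samples = {
    "positive skew (exponential)": rng.exponential(scale=2.0, size=10_000),
    "negative skew (mirrored)   ": -rng.exponential(scale=2.0, size=10_000),
    "near-zero skew (normal)    ": rng.normal(loc=0.0, scale=1.0, size=10_000),
}

for name, x in samples.items():
    print(f"{name}: mean={x.mean():+.2f}  median={np.median(x):+.2f}  skewness={skew(x):+.2f}")
```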
4.13 Difference between Dispersion and Skewness
This section has been covered in Section 4.7.