STATISTICS 101

A histogram is a graph that organises and displays numerical data in picture form. Unlike in a bar graph, the bars of a histogram touch each other. The height of each bar represents either the number of data points (frequency) or the percentage of data points (relative frequency) in each group. Each data point from a data set falls into one and only one bar of the histogram. It is possible to make a histogram from any numerical data set; however, it is not possible to determine the actual values of the data set from a histogram.^[1,p.108]

A histogram illustrates the shape of numerical data: how the data falls into groups, how many values are close to or far from the mean, where the center is and how many outliers there are. For example, all the data may be exactly the same (no variability), in which case the histogram is just one tall bar. Or the data might have an equal number in each group, in which case the shape is flat, which indicates a fair amount of variability. The idea of a flat histogram indicating some variability may be counter-intuitive, especially when a histogram is confused with a time chart, where single numbers are plotted over time. But histograms do not show data over time; they show all the data at the same point in time.^[1,p.79,111,114] A histogram is not about history.

Equally confusing is the idea that a histogram with a big lump in the middle and tails sloping sharply down on each side actually has less variability than a histogram that is straight across. The curves that look like hills in a histogram represent clumps of data that are close together; a flat histogram shows data equally dispersed, with more variability. Variability in a histogram is higher when there are more bars of more equal height spread further out from the mean, and lower when there are taller bars closer to the mean.^[1,p.114]
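The claim that a flat histogram has more variability than a peaked one can be checked numerically. The following is a minimal sketch using three made-up data sets (`constant`, `peaked` and `flat` are hypothetical, chosen so that all three have a mean of 3); it draws a rough text histogram of each and compares their standard deviations. The helper `text_histogram` is an illustration only and assumes integer data.

```python
from collections import Counter
from statistics import pstdev

# Hypothetical data sets, each with 10 values and a mean of 3.
constant = [3] * 10                        # no variability: one tall bar
peaked = [1, 2, 3, 3, 3, 3, 3, 3, 4, 5]    # big lump in the middle
flat = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]      # equal number in each group


def text_histogram(values):
    """Draw one row of '#' marks per integer value (frequency histogram)."""
    counts = Counter(values)
    rows = []
    for value in range(min(values), max(values) + 1):
        # Counter returns 0 for values that never occur.
        rows.append(f"{value} | {'#' * counts[value]}")
    return "\n".join(rows)


for name, data in [("constant", constant), ("peaked", peaked), ("flat", flat)]:
    print(name, "standard deviation:", round(pstdev(data), 3))
    print(text_histogram(data))
    print()
```

In this sketch the flat data set has the largest standard deviation, the peaked one a smaller value, and the constant one a standard deviation of zero, which matches the ordering described above.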

There are no fixed rules for how to create a histogram; the person making the graph can choose the grouping on the x-axis as well as the scale and the starting and ending points on the y-axis. However, not every choice is appropriate; in fact, a histogram can be made misleading in several ways:^[1,p.117,119,339]

If the grouping interval of the numerical variable is very small, there will be too many bars in the histogram; the data may be hard to interpret because the heights of the bars look more variable than they should. On the other hand, if the ranges are very large, there are too few bars and something interesting in the data may be missed.
The y-axis of a histogram shows how many individuals are in each group, using counts or percents. A histogram can be misleading if it has a deceptive scale and/or inappropriate starting and ending points on the y-axis. If the scale goes by large increments and has an ending point that is much higher than needed, there will be a lot of white space above the histogram. The bars will be squeezed down, making their heights look more uniform than they should. If the scale goes by small increments and ends at the smallest value possible, the bars become stretched vertically, exaggerating the differences and suggesting that they are bigger than they really are.
If the vertical axis reports relative frequency, sample size must be supplied along with the graph.
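The effect of the grouping interval can be seen by counting the same data with different bar widths. The sketch below is an illustration under stated assumptions: `ages` is a made-up data set, and `bin_counts` and `relative_frequencies` are hypothetical helpers, not functions from any particular library.

```python
from collections import Counter

# Hypothetical data set of 20 ages.
ages = [21, 22, 22, 23, 25, 27, 28, 31, 33, 34,
        35, 35, 36, 38, 41, 44, 47, 52, 58, 63]


def bin_counts(values, width, start=0):
    """Count values in equal-width groups [start + k*width, start + (k+1)*width).

    Returns a dict mapping the lower edge of each non-empty group to its count.
    """
    counts = Counter((v - start) // width for v in values)
    return {start + k * width: counts[k] for k in sorted(counts)}


def relative_frequencies(counts):
    """Return the sample size together with each group's share of the data."""
    n = sum(counts.values())
    return n, {edge: c / n for edge, c in counts.items()}


print(bin_counts(ages, 1))     # very narrow groups: many bars of height 1 or 2
print(bin_counts(ages, 10))    # moderate groups: the shape becomes visible
print(bin_counts(ages, 100))   # very wide groups: everything in one bar

# Relative frequencies are reported together with the sample size n.
print(relative_frequencies(bin_counts(ages, 10)))
```

With a width of 1 almost every bar has a height of 1 or 2 and no pattern is visible, while with a width of 100 all 20 values land in a single bar. Returning the sample size alongside the relative frequencies keeps the information a reader needs to judge how much data each percentage represents.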

Tips for setting up a histogram well are:^[1,p.110]

Each data set requires different ranges for grouping, but ranges that are too wide or too narrow should be avoided:
- A histogram that has too wide ranges for its groups places all the data into a very small number of bars that make meaningful comparisons impossible.
- A histogram that has too narrow ranges for its groups looks like a big series of tiny bars with no clear pattern.
Groups should have equal width. If one bar is wider than the others, it may contain more data than it should.
Borderline data points should all be consistently put either into their respective lower bar or their respective upper bar.
Both the x-axis and the y-axis should have good descriptive labels to help with interpreting the histogram.
Since it is not possible to calculate measures of center and variability from the histogram without knowing the exact values, basic statistics of center and variation should be calculated and presented along with the histogram.^[1,p.115]
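Several of these tips can be sketched in a few lines: equal-width groups, one consistent rule for borderline values, and summary statistics reported next to the histogram. In the example below, `scores` is a made-up data set and `bin_index` and `bar_counts` are hypothetical helpers written for this illustration; the `borderline` parameter is an assumed way of expressing the choice between the lower and the upper bar.

```python
from statistics import mean, median, stdev

# Hypothetical test scores; 60, 70, 80 and 90 sit exactly on group boundaries.
scores = [50, 55, 60, 60, 62, 70, 70, 75, 80, 88, 90]


def bin_index(value, start, width, borderline="upper"):
    """Return the 0-based bar for a value, using bars of equal width.

    borderline="upper": a value on a boundary goes into the bar above it.
    borderline="lower": a value on a boundary goes into the bar below it
    (the starting value itself always stays in the first bar).
    """
    offset = value - start
    index = int(offset // width)
    if borderline == "lower" and offset % width == 0 and index > 0:
        index -= 1
    return index


def bar_counts(values, start, width, borderline="upper"):
    """Count how many values fall into each bar, applying one rule throughout."""
    counts = {}
    for v in values:
        i = bin_index(v, start, width, borderline)
        counts[i] = counts.get(i, 0) + 1
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


print("upper rule:", bar_counts(scores, start=50, width=10, borderline="upper"))
print("lower rule:", bar_counts(scores, start=50, width=10, borderline="lower"))

# Statistics of center and variation to present along with the histogram.
print("mean:", round(mean(scores), 2))
print("median:", median(scores))
print("standard deviation:", round(stdev(scores), 2))
```

The two rules produce different bar heights from the same data, which is why the rule has to be applied consistently. Either rule is acceptable as long as it is used for every borderline value.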

Skewness

Data sets can have many different possible shapes. Three shapes that are commonly discussed in introductory statistics courses are right skewed, left skewed and symmetric data.^[1,p.79]

Symmetric data has about the same shape on either side of the middle. If the histogram is cut down the middle, the left-hand and right-hand sides resemble mirror images of each other. When data is symmetric, the mean and the median are close together.^[1,p.79,111] "Close" is defined in the context of the data; for example, the numbers 50 and 55 are said to be close if all the values lie between 0 and 1000, but are considered to be farther apart if all the values lie between 49 and 56.^[1,p.116] The "tails" on both sides of the graph are of approximately equal length. As long as the shape is approximately the same on both sides, the data is said to be symmetric, but not all symmetric data has the bell shape of a normal distribution.^[1,p.80] Likewise, not all non-symmetrical data is skewed in one direction or the other; many data sets have no distinct shape at all.^[1,p.113]
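One way to make "close in the context of the data" concrete is to express the gap between two numbers as a fraction of the range of the data. This ratio is an assumption made for illustration, not a definition given by the source, and `relative_gap` is a hypothetical helper.

```python
def relative_gap(a, b, low, high):
    """Gap between a and b as a fraction of the data range [low, high]."""
    return abs(a - b) / (high - low)


# The same gap of 5 judged against two different ranges of data.
wide = relative_gap(50, 55, low=0, high=1000)
narrow = relative_gap(50, 55, low=49, high=56)

print(f"values between 0 and 1000: gap is {wide:.1%} of the range")
print(f"values between 49 and 56:  gap is {narrow:.1%} of the range")
```

In the first case the gap covers half a percent of the range, so 50 and 55 are close; in the second it covers roughly 71 percent of the range, so they are far apart.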

Data is said to be skewed to the right if most of the data is on the left side of the histogram, but a few larger values are on the right. When the data is skewed right, the mean is larger than the median. The "tail" on the right side of the graph is longer than on the left.^[1,p.79] In other words, the outliers on the right side skew (distort) the histogram and pull the mean towards larger values.

Data is said to be skewed to the left if most of the data is on the right side of the histogram, but a few smaller values are on the left. When the data is skewed left, the mean is smaller than the median. The "tail" on the left side of the graph is longer than on the right.^[1,p.79] In other words, the outliers on the left side skew (distort) the histogram and pull the mean towards smaller values.
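The relationship between the mean and the median described above can be sketched as a rough rule of thumb. The data sets below are made up, and `skew_hint` with its `tolerance` parameter is a hypothetical helper: it only compares the mean with the median relative to the range, and is not a formal measure of skewness. A mean close to the median does not by itself prove that the data is symmetric.

```python
from statistics import mean, median

# Hypothetical data sets with a single outlier on one side.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
left_skewed = [1, 17, 17, 18, 18, 18, 19, 19, 20]
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]


def skew_hint(values, tolerance=0.05):
    """Compare mean and median; tolerance is a fraction of the data range."""
    gap = mean(values) - median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(gap) <= tolerance * spread:
        return "no clear skew"
    # The mean is pulled towards the longer tail.
    return "skewed right" if gap > 0 else "skewed left"


for name, data in [("right_skewed", right_skewed),
                   ("left_skewed", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, "mean:", round(mean(data), 2),
          "median:", median(data), "->", skew_hint(data))
```

In the right-skewed set the single value 20 pulls the mean above the median of 3, and in the left-skewed set the single value 1 pulls the mean below the median of 18.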
