Monday, November 5

Grouped Frequency Distributions

Problem: You've collected data, but there are too many unique values to make a frequency table practical.

Answer: Consider a Grouped Frequency Distribution. A grouped frequency distribution splits your data into groups, called classes. A range of values defines each class, so grouped frequency tables are appropriate for both discrete and continuous data.


The most difficult part of using a grouped frequency distribution is often deciding on the size of each class. Since the individual values in your original data are "lost" once they are grouped, it's important to partition (split) your data into classes that make patterns within the data easy to find. Typically, ten classes is considered a "good" number of groups.


Random.org is a pretty cool website where you can learn all about different facets of "randomness." But how random are their results? To test the ability of Random.org to generate a truly random list of 50 decimals, we'll analyze the results with a grouped frequency distribution. Using a standard frequency table is probably not a good idea, since the random decimals will range from 0 to 1 (which, at two decimal places, would potentially require 101 rows).

Here are some "random" results:


0.41 0.95 0.35 0.17 0.93 0.32 0.86 0.92 0.20 0.98
0.51 0.37 0.23 0.80 0.07 0.51 0.14 0.04 0.03 0.09
0.11 0.50 0.85 0.30 0.16 0.97 0.06 0.17 0.38 0.69
0.58 0.59 0.82 0.95 0.22 0.76 0.16 0.23 0.74 0.40
0.23 0.91 0.48 0.98 0.93 0.84 0.27 0.32 0.64 0.52

To make a grouped frequency distribution, we'll first need to divide our data into classes. How many classes should we use? It depends on the experiment... For this experiment, our expected outcome is that decimals occur about equally often across the entire range from 0 to 1. So, splitting our data into two classes is probably not enough. Let's use ten classes, each with a class width of 0.1. Class width is the "range" of each group in your grouped frequency table.

Typically, class width is calculated with the following formula:

class width = ( largest data value - smallest data value ) / number of classes

For us, this would be:

class width = (0.98 - 0.03)/10 = 0.95/10 = 0.095
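The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `class_width` and its parameter names are made up for illustration, and the smallest and largest values (0.03 and 0.98) are read off the list of 50 decimals.

```python
def class_width(largest, smallest, num_classes):
    """(largest data value - smallest data value) / number of classes."""
    return (largest - smallest) / num_classes

# Largest and smallest values taken from the 50 decimals listed above.
width = class_width(0.98, 0.03, 10)

# Round for display, since floating-point subtraction is not exact.
print(round(width, 3))
```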



But, we'll use 0.1 as a class width, mostly because it's much easier to work with:
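One way to tally the 50 decimals into ten classes of width 0.1 is sketched below. It assumes each class includes its lower boundary but not its upper one (so 0.20 lands in the 0.2 class); that convention is an assumption, and a different choice of boundaries would shift a few counts. The names `class_index` and `NUM_CLASSES` are illustrative only.

```python
from collections import Counter

# The 50 decimals listed above.
data = [
    0.41, 0.95, 0.35, 0.17, 0.93, 0.32, 0.86, 0.92, 0.20, 0.98,
    0.51, 0.37, 0.23, 0.80, 0.07, 0.51, 0.14, 0.04, 0.03, 0.09,
    0.11, 0.50, 0.85, 0.30, 0.16, 0.97, 0.06, 0.17, 0.38, 0.69,
    0.58, 0.59, 0.82, 0.95, 0.22, 0.76, 0.16, 0.23, 0.74, 0.40,
    0.23, 0.91, 0.48, 0.98, 0.93, 0.84, 0.27, 0.32, 0.64, 0.52,
]

NUM_CLASSES = 10  # class width of 0.1 over the range 0 to 1


def class_index(value):
    """Return 0 for 0.0 <= x < 0.1, 1 for 0.1 <= x < 0.2, and so on."""
    # Work in whole hundredths to avoid floating-point surprises.
    hundredths = round(value * 100)
    # A value of exactly 1.00 would be placed in the last class.
    return min(hundredths // 10, NUM_CLASSES - 1)


counts = Counter(class_index(x) for x in data)

for i in range(NUM_CLASSES):
    lower, upper = i / 10, (i + 1) / 10
    print(f"{lower:.1f} to < {upper:.1f}: {counts[i]}")
```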

So, are the results really "random"? The observed outcome from this experiment seems to say "NO." Why? Well, since there are 10 classes, and 50 pieces of data, a random distribution of data would place about 5 numbers in each class. However, some intervals have only 2 numbers, while others have as many as 9.
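The comparison between observed and expected counts can be sketched as follows. The `observed` list is tallied by hand from the 50 decimals above, assuming classes that include their lower boundary but not their upper one (0.0 up to but not including 0.1, and so on); the variable names are illustrative.

```python
# Observed counts per class, from 0.0-0.1 up to 0.9-1.0 (assumed boundaries).
observed = [5, 6, 6, 6, 3, 6, 2, 2, 5, 9]

# With equally likely classes, each one is expected to hold the same share.
expected = sum(observed) / len(observed)

for i, count in enumerate(observed):
    lower = i / 10
    print(f"class starting at {lower:.1f}: observed {count}, "
          f"difference from expected {count - expected:+.0f}")

print("expected per class:", expected)
print("smallest and largest class:", min(observed), max(observed))
```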

But, is testing only 50 numbers a fair test? Why not?

Wednesday, October 31

Representing Discrete Data - Frequency Distributions

A store owner takes a survey of the customers that come in to try on shoes one Saturday afternoon. Trying to decide which sizes would be the best to stock, he compiles the following list:
6, 6, 6.5, 7, 8, 9, 7, 5.5, 8, 11, 7.5, 9, 9, 9.5, 6.5, 7.5, 8, 9, 7, 8, 10, 10, 9.5, 8, 7.5

His son, preparing for the S1 A-level exam, suggests that a frequency distribution table might be an easy way for his father to analyze the results of his survey. "Poppa," he says, "why don't you count how many times someone tries on each shoe size? Then, you can use statistics to figure out what percentage of your stock each size should be!"

Together, they make a frequency distribution table:
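A frequency table like theirs could be built with a short Python sketch. It assumes the 25 responses that the percentages quoted in this post are based on, with no size 8.5 among them; the variable names are illustrative.

```python
from collections import Counter

# 25 survey responses (assumed: no size 8.5, matching the quoted percentages).
sizes = [6, 6, 6.5, 7, 8, 9, 7, 5.5, 8, 11, 7.5, 9, 9, 9.5, 6.5,
         7.5, 8, 9, 7, 8, 10, 10, 9.5, 8, 7.5]

frequency = Counter(sizes)  # how many times each size was tried on
total = len(sizes)

for size in sorted(frequency):
    share = 100 * frequency[size] / total
    print(f"size {size:>4}: frequency {frequency[size]}  ({share:.0f}%)")
```

Sizes that nobody tried on, such as 10.5, simply do not appear in the output; `frequency[10.5]` returns 0.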

"I understand why it would be important to figure out how frequently each shoe size is selected," says the store owner, "but why should I care about the cumulative frequency?"
After his son shows him the following table, the store owner begins to smile:
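The cumulative frequencies can be computed by keeping a running total of the counts in order of size. The sketch below again assumes the 25 responses that the quoted percentages are based on (no size 8.5); the names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# 25 survey responses (assumed: no size 8.5, matching the quoted percentages).
sizes = [6, 6, 6.5, 7, 8, 9, 7, 5.5, 8, 11, 7.5, 9, 9, 9.5, 6.5,
         7.5, 8, 9, 7, 8, 10, 10, 9.5, 8, 7.5]

frequency = Counter(sizes)
ordered = sorted(frequency)

# Running total: customers who tried on this size or any smaller one.
running = list(accumulate(frequency[s] for s in ordered))
total = running[-1]

for size, cumulative in zip(ordered, running):
    percent = 100 * cumulative / total
    print(f"size {size:>4}: cumulative frequency {cumulative:>2}  ({percent:.0f}%)")
```

Under these assumptions, the cumulative percentage reaches 80% at size 9, which leaves 20% of customers at size 9.5 and above.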

"It seems that 20% of people tried on sizes 9.5 and above, while the same percentage tried on size 8 alone! Yes!" says the store owner. "I will carry 20% of my stock in size 8!"
...which may actually be a bit hasty. You see, only a sample of 25 customers was surveyed. A more prudent shopkeeper would gather a bit more data before translating his statistical analysis into real-world decisions. Furthermore, since shoe sizes 8.5 and 10.5 were tried on by 0% of customers, would it be wise for him to stock absolutely no shoes sized 8.5 or 10.5?
When conducting a statistical analysis, it is very important to collect an appropriate amount of data. Surveying every single one of his customers may be a bit impractical, but it would be very easy for the shop owner to take a more carefully chosen sample of his customers. Good sampling gathers data in a thoughtful way and considers not only the amount of data that should be collected, but also the portion of the population that should be surveyed. For instance, why should the shopkeeper arbitrarily choose Saturday? What if most of his Saturday customers are children?

Tuesday, October 30

Different Types of Data

Statistics revolves around the analysis of variables. A variable is the thing which we are measuring when we collect data. Suppose a teacher takes a survey amongst his students. If he wants to analyze their shoe size, the variable being measured would be shoe size. Often, experiments involve several variables, so it is important (especially when using calculators and software) to indicate which variable you are analyzing.

Qualitative data cannot be defined directly with numbers.
For shoes, qualitative variables could include: color, brand, smell, material, style...

Quantitative data inherently involves the use of numbers.
For shoes, quantitative variables could include: size, price, height, heel length, weight...

Though shoe size and weight are both quantitative variables that can be used to describe a shoe, the two are fundamentally different types of measurements. Shoe size is a discrete measurement because shoes come only in specific sizes, with nothing in between. For instance, you may buy either a size 7 or a size 7.5 shoe, but not a size 7.25... A shoe's weight is a continuous measurement. There are in fact an infinite number of possible weights for a shoe, even between 7 kg and 7.5 kg...

Shoe size moves in well-defined steps, so it is a good example of a discrete variable.

A shoe's weight may take any value within a range, so it is a good example of a continuous variable.

Creating a Mathematical Model

Every mathematical model begins with a real-world observation. Someone says, "Hmmm... I think I notice a pattern here!" If the pattern is relatively simple, coming up with a mathematical model could be straightforward...

For instance, imagine that a biologist is observing a newly discovered type of (very simple) alien organism. Slowly, the organism begins to multiply. Being an intelligent observer, the biologist notes the time of each spawn:

Translating what she sees into mathematics, the thoughtful biologist makes a table of observed outcomes:

"Hmmmm...," she thinks. "The alien doubles every hour. That sure does remind me of something!" And, without a second to lose, the biologist scribbles a formula onto a sheet of paper:

number of aliens = 2^(number of hours)

So, now the biologist has a mathematical model. The next step for her is to use her model to make a prediction: "I think that in the fourth hour there will be 16 aliens!" Before she can be sure that her model is accurate, the biologist will then have to wait until she has checked a sufficient number of predictions (expected outcomes) against observed outcomes.
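The doubling model and its prediction can be sketched in a few lines of Python. This assumes a single alien at hour 0, which is consistent with the prediction of 16 aliens in the fourth hour; the function name `predicted_aliens` and its parameters are made up for illustration.

```python
def predicted_aliens(hour, initial=1):
    """Doubling model: the population doubles once per hour."""
    # initial=1 is an assumption: one alien at hour 0.
    return initial * 2 ** hour


# Predictions for the first few hours, including the fourth.
for hour in range(5):
    print(f"hour {hour}: {predicted_aliens(hour)} aliens predicted")
```

Each prediction can then be compared with the number of aliens actually counted at that hour.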

Now, what if the alien starts acting a little funny? What if, during the fifth hour, the alien were to only sort of double...

Would the biologist be completely wrong in her mathematical model? Well, it's hard to say... but statistics is one of the tools by which we can gauge the accuracy of a mathematical model. If the alien creatures continue to spawn erratically, the biologist would have no choice but to refine her model.

Try this exercise to test your understanding of how a mathematical model is developed!

Developing a Mathematical Model

Albert Einstein, perhaps one of the most well-known scientists of the twentieth century, developed a mathematical model in the form of

E = mc^2

However, before his idea that an object's energy and mass are related could be accepted, it first needed to be experimentally verified. In other words, scientists had to independently verify whether Einstein's mathematical model provided an accurate representation of reality. Even though Einstein developed his famous formula in 1905, it wasn't experimentally verified until the early 1930s.

Statistics plays a huge role in verifying how well someone's mathematical model matches up with the real world. By making predictions, gathering data, and then analyzing the data, it is possible to say whether or not a mathematical model is a good representation of reality.