Archive for the ‘Statistics’ Category.

A Son Named Luigi

Suppose that we choose all families with two children, such that one of them is a son named Luigi. Given that a boy is named Luigi with probability p, what is the probability that the other child is a son?

Here is a potential “solution.” Luigi is a younger brother’s name in one of the most popular video games: Super Mario Bros. Probably the parents loved the game and decided to name their first son Mario and the second Luigi. Hence, if one of the children is named Luigi, then he must be a younger son. The second child is certainly an older son named Mario. So, the answer is 1.

The solution above is not mathematical, but it reflects the fact that children’s names are highly correlated with each other.

Let’s try some mathematical models that describe how the parents might name their children and see what happens. It is common to assume that the names of siblings are chosen independently. In this case the first son (as well as the second son) will be named Luigi with probability p. Therefore, the answer to the puzzle above is (2-p)/(4-p).
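The independent-naming answer can be checked with exact arithmetic. The sketch below assumes equally likely genders, that only boys are named Luigi, and that each son is named Luigi independently with probability p; the function name and the sample value of p are illustrative only.

```python
from fractions import Fraction

def other_child_is_son(p):
    """P(both children are sons | at least one son is named Luigi), independent naming."""
    p = Fraction(p)
    quarter = Fraction(1, 4)
    # Two sons (prior 1/4): at least one of the two is named Luigi.
    two_sons = quarter * (1 - (1 - p) ** 2)
    # One son and one daughter, in either order (prior 1/2): the son is named Luigi.
    one_son = 2 * quarter * p
    return two_sons / (two_sons + one_son)

p = Fraction(1, 100)  # illustrative value, not taken from any data
assert other_child_is_son(p) == (2 - p) / (4 - p)
```

As p goes to 0 the value approaches 1/2, and at p = 1 it is 1/3, the classic "at least one boy" answer.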

The problem with this model is that there is a noticeable probability that the family has two sons, both named Luigi.

As parents usually want to give different names to their children, many researchers suggest the following naming model to avoid naming two children in the same family with the same name. A potential family picks a child’s name at random from a distribution list. Children are named independently of each other. Families in which two children are named the same are crossed out from the list of families.

There is a problem with this approach. When we cross out families we may disturb the balance in the family gender distributions. If we assume that boys’ and girls’ names are different then we will only cross out families with children of the same gender. Thus, the ratio of different-gender families to same-gender families will stop being 1/1. Moreover, it could happen that the number of boy-boy families will differ from the number of girl-girl families.
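To see how crossing out families shifts the gender balance, here is a small sketch. The name lists and their probabilities are made up purely for illustration, and the code assumes equally likely genders and no name shared between boys and girls.

```python
from fractions import Fraction

def surviving_weights(boy_names, girl_names):
    """Weights of family types left after crossing out same-name families."""
    same_boy = sum(q * q for q in boy_names.values())    # P(two sons share a name)
    same_girl = sum(q * q for q in girl_names.values())  # P(two daughters share a name)
    quarter = Fraction(1, 4)
    return {
        "boy-boy": quarter * (1 - same_boy),
        "girl-girl": quarter * (1 - same_girl),
        "mixed": 2 * quarter,  # never crossed out, since the name lists do not overlap
    }

# Hypothetical distributions, chosen only to make the arithmetic visible.
boys = {"Mario": Fraction(1, 2), "Luigi": Fraction(1, 2)}
girls = {name: Fraction(1, 4) for name in ("Anna", "Bella", "Clara", "Dora")}
w = surviving_weights(boys, girls)
print(w)
```

With these toy lists the mixed-to-same-gender ratio becomes 8/5 instead of 1/1, and boy-boy families (weight 1/8) are rarer than girl-girl families (weight 3/16), because the boys' list is more concentrated.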

There are several ways to adjust the model. Suppose there is a probability distribution of names that is used for the first son. If another son is born, the name of the first son is crossed out from the distribution and following that we proportionately adjust the probabilities of all other names for this family. In this model the probability that the second son receives a given name differs from the probability that the first son receives it. For example, the most popular name’s probability decreases with consecutive sons, while the least popular name’s probability increases.

I like this model, because I think that it reflects real life.

Here is another model, suggested by my son Alexey. Parents give names to their children independently of each other from a given distribution list. If they give the same name to both children, the family is crossed out and replaced with another family with children of the same genders. The advantage of this model is that the first child and the second child are named independently from each other with the same probability distribution. The disadvantage is that the probability distribution of names in the resulting set of families will be different from the probability distribution of names in the original preference list.

I would like my readers to comment on the models and how they change the answer to the original problem.

Mr. Jones

The following two problems appeared together in Martin Gardner’s Scientific American column in 1959.

Mr. Smith has two children. At least one of them is a boy. What is the probability that both children are boys?

Mr. Jones has two children. The older child is a girl. What is the probability that both children are girls?

Many people, including me and Martin Gardner, wrote a lot about Mr. Smith. In his original column Martin Gardner argued that the answer to the first problem is 1/3. Later he wrote a column titled “Probability and Ambiguity,” where he corrected himself about Mr. Smith.

… the answer depends on the procedure by which the information “at least one is a boy” is obtained.

This time I would like to ignore Mr. Smith, as I wrote a whole paper about him that is now under consideration for publication at the College Mathematics Journal. I would rather get back to Mr. Jones.

Mr. Jones failed to stir a controversy from the start and was forgotten. Olivier Leguay asked me about Mr. Jones in a private email, reminding me that the answer to the problem about his children also depends on the procedure.

One of the reasons Mr. Jones was forgotten is that for many natural procedures the answer is 1/2. For example, the following procedures will produce an answer of 1/2:

  • We ask Mr. Jones whether his older child is a daughter and he says “yes.”
  • Mr. Jones flips a coin deciding which child to talk about. After that he has to tell us the gender and whether the child is the oldest.
  • Mr. Jones is asked to say nothing if he doesn’t have a daughter, to select the daughter if he has just one, or to pick one at random if he has two daughters. After that he has to tell us whether the daughter he has selected is the oldest.

There are many other procedures that lead to the answer 1/2. However, there are many procedures that lead to other answers.
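Each procedure can be turned into a likelihood, the probability that Mr. Jones ends up telling us his older child is a girl, and fed through Bayes' rule. The sketch below is one literal reading of the three procedures listed above, with equally likely genders; the function names are illustrative.

```python
from fractions import Fraction

# (older, younger); the four families are equally likely, so the prior cancels.
FAMILIES = [(older, younger) for older in "GB" for younger in "GB"]

def posterior_two_girls(likelihood):
    """P(two girls | statement), where likelihood(family) = P(statement | family)."""
    total = sum(Fraction(likelihood(f)) for f in FAMILIES)
    return Fraction(likelihood(("G", "G"))) / total

# Procedure 1: asked "Is your older child a daughter?", he answers "yes".
def asked_directly(family):
    return 1 if family[0] == "G" else 0

# Procedure 2: a coin picks the child; he reports "the older one, a girl".
def coin_flip(family):
    return Fraction(1, 2) if family[0] == "G" else 0

# Procedure 3: he selects a daughter (at random if he has two)
# and reports "she is the older one".
def selects_daughter(family):
    if family == ("G", "G"):
        return Fraction(1, 2)
    return 1 if family[0] == "G" else 0

for procedure in (asked_directly, coin_flip, selects_daughter):
    print(procedure.__name__, posterior_two_girls(procedure))
```

Under this reading the first two procedures give 1/2. The third comes out as 1/3 in this sketch, since a father of two daughters reports "older" only half the time, so its answer appears to be sensitive to details of the procedure that the description may leave open.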

Suppose I know Mr. Jones, and also know that he has two children. I meet Mr. Jones at a mall, and he tells me that he is buying a gift for his older daughter. Most probably I would assume that the other child is a daughter, too. In my experience, people who have a son and a daughter would say that they are buying a gift for “my daughter.” Only people with two daughters would bother to specify that they are buying a gift for “my older daughter.”

In some sense I didn’t forget about Mr. Jones. I wrote about him implicitly in my essay Two Coins Puzzle. His name was Carl and he had two coins instead of two children.

The Random Sequence

Fifteen years ago I attended Silvio Micali’s cryptography course. During one of the lectures, he asked me to close my eyes. When I did, he wrote a random sequence of coin flips of length six on the board and invited me to guess it.

I am a teacher at heart, so I imagined a random sequence I would write for my students. Suppose I start with 0. I will not continue with zero, because 00 looks like a constant sequence, which is not random enough. So my next step would be sequence 01. For the next character I wouldn’t say zero, because 010 seems to promise a repetitive pattern 010101. So my next step would be 011. After that I do not want to say one, because I will have too many ones. So I would follow up with 0110. I need only two more characters. I do not want to end this with 11, because the result would be periodic. I do not want to end this with 00, because I would have too many zeroes. I do not want to end this with 01, because the sequence 011001 has a symmetry: reversing and negating this sequence produces the same sequence. That leaves 10.

During the lecture all these considerations happened in the blink of an eye in my mind. I just said: 011010. I opened my eyes and saw that Micali had written HTTHTH on the board. He was not amused and may even have thought that I was cheating.

Many teachers, when writing a random sequence, do not flip a coin. They choose a sequence that looks “random”: it doesn’t have too many repetitions and the number of ones and zeroes is balanced (that is, approximately the same). When they write it character by character on the board, they often choose a sequence so that any prefix looks “random” too.

As a result, the sequence they choose stops being random. Actually, they’re making a choice from a small set of sequences that look “random”. So the fact that I guessed Micali’s sequence is not surprising at all.

If you have gone to many math classes, you’ve seen a lot of professors choosing very similar-looking “random” sequences. This discriminates against sequences that do not look “random”. To restore fairness to those under-represented sequences, I have decided that the next time I need a random sequence, I will choose 000000.

Women, Science and The Right Tail of a Bell Curve

by Rebecca Frankel

The article Daring to Discuss Women in Science by John Tierney in the New York Times on June 7, 2010 purports to present a dispassionate scientific defense of Larry Summers’s claims, in particular by reviewing and expanding his argument that observed differences in the length of the extreme right tail of the bell curves of men’s and women’s test scores indicate real differences in their innate ability. But in fact any argument like this has to acknowledge a serious difficulty: it is problematic to assume without comment that the abilities of a group can be inferred from the tail of a bell curve. We are so used to invoking bell curves to talk about group abilities, we don’t notice that such arguments usually use only the mean of the curve. Using the tail is a totally different story.

Think about it: it is reasonable to question whether a single data point (the test score of an individual person) is a true indication of his/her ability. It might not be. Maybe a single test score represents a dunce with hyper-overachieving parents who push him to study all the time. So does that single false reading destroy the validity of the curve? No, of course not, because some other kid might have been a super-genius who was drunk last night and can barely keep his eyes open during the test. One is testing above his “true ability” and the other is testing below his “true ability,” and the effect cancels out. Thus the means of curves are a good way to measure the ability of large groups, because all the random false readings average out.

But tails are not. On the tail this “canceling out” effect doesn’t work. Look at the extreme right tail. The relatively slow but hyper-motivated kids are not canceled out by the horde of far-above-the-mean super-geniuses who had drunken revels the night before. There just aren’t that many super-geniuses and they just don’t party that much.

Or let’s look at it another way: imagine that you had a large group which you divided in half totally at random. At this point their bell curves of test scores look exactly the same. Let’s call one of the groups “boys” and the other group “girls”. But they are two utterly randomly selected groups. Now let’s inject the “boys” with a chemical that gives the ones who are very good already a burning desire to dominate any contest they enter into. And let us inject the “girls” with a chemical that makes the ones who are already good nonetheless unwilling to make anyone feel bad by making themselves look too good. What will happen to the two bell curves? Of course the upper tail of the “boys'” curve will stretch out, while the “girls'” tail will shrink in. It will look like the “boys” whipped the “girls” on the right tail of ability hands down, no contest. But the tail has nothing to do with ability. Remember they started out with the same distribution of abilities, before they got their injections. It is only the effect of the chemicals on motivation that makes it look like the “boys” beat the “girls” at the tail.

So, when you see different tails, you can’t automatically conclude that this is caused by difference in underlying innate ability. It is possible that other factors are at play — especially since if we were looking to identify these hypothetical chemicals we might find obvious candidates like “testosterone” and “estrogen”.

The possibility of alternative explanations for these findings calls into question Tierney and Summers’ claims to superior dispassionate scientific objectivity. Moving from the mean to the tail of a bell curve makes systematic effects on averages irrelevant, true, but it is instead susceptible to systematic effects on deviations, which are irrelevant at the mean. An argument that uses this trick to dodge gender differences in averages cannot claim the mantle of scientific responsibility without accounting for gender differences in deviations. I am deeply disappointed that Tierney and Summers did not accompany their assertions with a suitable reminder of this fact.

How to Live Longer

I just received a mass email on how to live longer and it made these points:

Tip 1. Delay your retirement. Studies show that people who retire at 65 live longer than people who retire at 60.

Tip 2. Sex makes you younger. Studies show that older people who have sex twice a week look ten years younger than their peers who do not have sex at all.

People who draw conclusions from such studies usually do not understand statistics. Correlation doesn’t mean causality. Let me use the above-mentioned studies to reach different conclusions by reversing the causality assumption of the unknown writer of the mass email. You can compare results and make your own decisions.

Case 1. Studies show that people who retire at 65 live longer than people who retire at 60. Reversed causality: People who live longer are healthier, so they are able to keep working and to retire later in life.

Case 2. Studies show that older people who have sex twice a week look ten years younger than their peers who do not have sex at all. Reversed causality: Older people who look ten years younger than their peers can get laid easier, so they have sex more often.

Shannon Entropy Rescues the Tuesday Child

My son Alexey Radul and I were discussing the Tuesday’s child puzzle:

You run into an old friend. He has two children, but you do not know their genders. He says, “I have a son born on a Tuesday.” What is the probability that his second child is also a son?

Here is a letter he wrote me on the subject. I liked it because unlike many other discussions, Alexey not only asserts that different interpretations of the conditions in the puzzle form different mathematical problems, but also measures how different they are.

by Alexey Radul

If you assume that boys and girls are symmetric, and that days of the week are symmetric (and if you have no information to the contrary, assuming anything else would be sheer presumption on your part), then you can be in one of at least two states.

1) You say that “at least one son born on a Tuesday” is all the information you have, in which case your distribution including this information is uniform over consistent cases, in which case your answer is 13/27 boy and your information entropy is

$-\sum_{i=1}^{27} \frac{1}{27} \log \frac{1}{27} = -\log \frac{1}{27} = 3.2958.$

2) You say that the information you have is “The guy might have said any true thing of the form ‘I have at least one {boy/girl} born on a {day of the week}’, and he said ‘boy’, ‘Tuesday’.” This is a different mathematical problem with a different solution. The solution: By a symmetry argument (see below [*]) we must assign uniform probability of him making any true statement in any particular situation. Then we proceed by Bayes’ Rule: the statement we heard is $d$, and for each possible collection of children $h$, the posterior is given by $p(h \mid d) = p(h)\,p(d \mid h)/p(d)$. Here, $p(h) = 1/14^2 = 1/196$; $p(d) = 1/14$; and $p(d \mid h)$ is either 1 or 1/2 according as whether his other child is or is not another boy also born on a Tuesday (or $p(d \mid h) = 0$ if neither child is a boy born on a Tuesday). There are 1 and 26 of these situations, respectively. The answer they lead to is of course 1/2; but the entropy is

$-\sum p \log p = -\frac{1}{14} \log \frac{1}{14} - \frac{26}{28} \log \frac{1}{28} = 3.2827$

Therefore that assumed additional structure really is more information, which is only present at best implicitly in the original problem. How much more information? The difference in entropies is $3.2958 - 3.2827 = 0.0131$ nats (a nat is to a bit what the natural log is to the binary log). How much information is that? Well, the best I can do is to reproduce an argument of E.T. Jaynes’, which may or may not really apply to this situation. Suppose you have some repeatable experiment with some discrete set of possible outcomes, and suppose you assign probabilities to those outcomes. Then the number of ways those probabilities can be realized as frequencies counted over $N$ trials is proportional to $e^{NH}$, where $H$ is the entropy of the distribution in nats. Which means that the ratio by which one distribution is easier to realize is approximately $e^{N(H_1 - H_2)}$. In the case of $N = 1000$ and $H_1 - H_2 = 0.0131$, that’s circa $5 \times 10^5$. For each way to get a 1000-trial experiment to agree with version 2, there are half a million ways to get a 1000-trial experiment to agree with version 1. So that’s a nontrivial assumption.
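The two posteriors, their entropies, and the Jaynes-style ratio can be reproduced by brute-force enumeration. This is only a sketch under the symmetry assumptions stated in the letter (14 equally likely kinds of child, and in state 2 a father who utters one of his true statements uniformly at random); the variable and function names are illustrative.

```python
import math
from fractions import Fraction
from itertools import product

CHILDREN = [(sex, day) for sex in "BG" for day in range(7)]  # 14 equally likely kinds of child
TARGET = ("B", 1)  # "a boy born on a Tuesday"; the day index is arbitrary

# The 27 ordered families consistent with "at least one boy born on a Tuesday".
families = [f for f in product(CHILDREN, repeat=2) if TARGET in f]

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def entropy_nats(probs):
    return -sum(float(p) * math.log(float(p)) for p in probs)

def p_two_boys(probs):
    return sum(p for f, p in zip(families, probs) if f[0][0] == "B" and f[1][0] == "B")

# State 1: uniform over the consistent families.
state1 = normalize([Fraction(1) for _ in families])
# State 2: p(d|h) is 1 when both children are Tuesday boys and 1/2 otherwise.
state2 = normalize([Fraction(1) if f == (TARGET, TARGET) else Fraction(1, 2)
                    for f in families])

h1, h2 = entropy_nats(state1), entropy_nats(state2)
ratio = math.exp(1000 * (h1 - h2))  # e^{N(H1 - H2)} for N = 1000
print(p_two_boys(state1), p_two_boys(state2), round(h1, 4), round(h2, 4), f"{ratio:.1e}")
```

The enumeration gives 13/27 and 1/2 for the two states, and entropies that agree with the rounded values quoted above; the ratio lands near half a million, with the exact figure depending on how far the entropy difference is rounded.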

[*] The symmetry argument: We are faced with the following probability assignment problem:

Suppose our subject’s first child is a boy born on a Tuesday, and his second child is a girl born on a Friday. What probability must we assign to him asserting that he has at least one boy born on a Tuesday?

Good question. Let’s transform our coordinates: Let Tuesday’ be Friday, Friday’ be Tuesday, boy’ be girl, girl’ be boy, first’ be second and second’ be first. Then our problem becomes:

Suppose our subject’s second’ child is a girl’ born on a Friday’, and his first’ child is a boy’ born on a Tuesday’. What probability must we assign to him asserting that he has at least one girl’ born on a Friday’?

Our transformation necessitates p(boy Tuesday) = p(girl’ Friday’), and likewise p(girl Friday) = p(boy’ Tuesday’). But our state of complete ignorance about what’s going on with respect to the man’s attitudes about boys, girls, Tuesdays, Fridays, and first and second children has the symmetry that question and question’ are the same question, and must, by the desideratum of consistency, have the same answer. Therefore p(boy Tuesday) = p(boy’ Tuesday’) = p(girl Friday) = 1/2.

A Tuesday Quiz

I recently wrote two pieces about the puzzle relating to sons born on a Tuesday: A Son born on Tuesday and Sons and Tuesdays. I also posted a beautiful essay on the subject by Peter Winkler: Conditional Probability and “He Said, She Said”. Here is the problem:

You run into an old friend. He has two children, but you do not know what their gender is. He says, “I have a son born on a Tuesday.” What is the probability that his second child is also a son?

A side note. My son Alexey explained to me that I made an English mistake in the problem in those previous posts. It is better to say “born on a Tuesday” than “born on Tuesday.” I apologize.

Despite this error, I was gratified to hear from a number of people who told me that I had converted them from their solution to my solution. To ensure that the conversion is substantial, I’ve created a new version of the puzzle on which my readers can test out their new-found understanding. Here it is:

You run into an old friend. He has two children, but you do not know what their gender is. He says, “I have a son born on a Tuesday.” What is the probability that his second child is born on a Wednesday?

Conditional Probability and “He Said, She Said”

by Peter Winkler

As a writer of books on mathematical puzzles I am often faced with delicate issues of phrasing, none more so than when it comes to questions about conditional probability. Consider the classic “X has two children and at least one is a boy. What is the probability that the other is a boy?”

It is reasonable to interpret this puzzle as asking you “What is the probability that X has two boys, given that at least one of the children is a boy?” In that case the answer is unambiguously 1/3, given the usual assumptions about no twins and equal gender frequency.

This puzzle confounds people *legitimately*, however, because most of the ways in which you are likely to find out that X has at least one boy contain an implicit bias which changes the answer. For example, if you happen to meet one of X’s children and it’s a boy, the answer changes to 1/2.

Suppose the puzzle is phrased this way: X says “I have two children and at least one is a boy.” What is the probability that the other is a boy?

Put this way, the puzzle is highly ambiguous. Computer scientists, cryptologists and others who must deal carefully with message-passing know that what counts is not what a person says (even if she is known never to lie) but *under what circumstances would she have said it.*

Here, there is no context and thus no way to know what prompted X to make this statement. Could he instead have said “At least one is a girl”? Could he have said “Both are boys”? Could he have said nothing? If you, the one faced with solving the puzzle, are desperate to disambiguate it, you’d probably have to assume that what really happened was: X (for some reason unconnected with X’s identity) was asked whether it was the case that he had at least one son, and, after being warned (by a judge?) that he had to give a yes-or-no answer, said “yes.” An unlikely scenario, to say the least, but necessary if you want to claim that the solution to the puzzle is 1/3.

Consider the puzzle presented (according to Alex Bellos) by Gary Foshee at the recent 9th Gathering for Gardner:

I have two children. One is a boy born on a Tuesday. What is the probability I have two boys?

If the puzzle was indeed put exactly this way, and your life depended on defending any particular answer, God help you. You cannot answer without knowing, for example, what the speaker would have said if he had one boy and one girl, and the boy was born on Wednesday. Or if he had two boys, one born on Tuesday and one on Wednesday. Or two girls, both born on Tuesday. Et cetera.

Now, there is nothing mathematically wrong (given the usual assumptions, including X being random) about saying that “The probability that X has two sons, given that at least one of X’s two children is a boy born on Tuesday, is 13/27.” But if that is to be turned into an unambiguous puzzle attached to a presumed situation, some serious hypothesizing is necessary. For instance: you get on the phone and start calling random people. Each is asked if he or she has two children. If so, is it the case that at least one is a boy born on a Tuesday? And if the answer is again yes, are the children both boys? Theoretically, of the times you reach the third question, the fraction of pollees who say “yes” should tend to 13/27.

Kind of takes the fun out of the puzzle, though, doesn’t it? Kudos to Gary for stirring up controversy with a quickie.

Sons and Tuesdays

I recently discussed the following problem:

You run into an old friend. He has two children, but you do not know their genders. He says, “I have a son born on Tuesday.” What is the probability that his second child is also a son?

I had heard this problem at the Gathering for Gardner 9 in a private conversation. My adversary had been convinced that the answer to the problem is 13/27. I came back to Boston from the gathering and wrote my aforementioned essay in which I disagreed with his conclusion.

I will tell you my little secret: when I started writing I substituted Wednesday for Tuesday. Then I checked my sons’ birthdays and they were born on Saturday and Tuesday. So I changed my essay back to Tuesday.

After I published it people sent me several links to other articles discussing the same problem, such as those of Keith Devlin and Alex Bellos, both of whom think the answer is 13/27. So I invented a fictional opponent — Jack, and here is my imaginary conversation with him.

Jack: The probability that a father with two sons has a son born on Tuesday is 13/49. The probability that a father with a son and a daughter has a son born on Tuesday is 1/7. A dad with a son and a daughter is encountered twice as often as a dad with just two sons. Hence, we compare 13/49 with 14/49, and the probability of the father having a second son is 13/27.

Me: What if the problem is about Wednesday?

Jack: It doesn’t matter. The particular day in question was random. The answer should be the same: 13/27.

Me: Suppose the father says, “I have a son born on *day.” He mumbles the day, so you do not hear it exactly.

Jack: Well, as the answer is the same for any day, it shouldn’t matter. The probability that his second child will also be a son is still 13/27.

Me: Suppose he says, “I have a son born …”. So he might have continued and mentioned the day, he might not have. What is the probability?

Jack: We already decided that it doesn’t depend on the day, so it shouldn’t matter. The probability is still 13/27.

Me: Suppose he says, “I have a son and I do not remember when he was born.” Isn’t that the same as just saying, “I have a son”? And by your arguments the probability that his second child is also a son is 13/27.

Jack: Hmm.

Me: Do you remember your calculation? If we denote the number of days in a week as d, then the probability of him having a second son is (2d−1)/(4d−1). My point is that this probability depends on the number of days in the week. So, if tomorrow we change the length of the week to another number, his probability of having a second son changes. Right?
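Jack's counting argument generalizes to a week of any length, and the formula (2d−1)/(4d−1) can be confirmed by enumeration. The sketch below assumes equally likely genders and days and simply counts families with at least one son born on a fixed day; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product

def two_sons_given_yes(d):
    """P(two sons | at least one son born on day 0), for a week of d days."""
    children = [(sex, day) for sex in "BG" for day in range(d)]
    target = ("B", 0)
    qualifying = [f for f in product(children, repeat=2) if target in f]
    both_sons = [f for f in qualifying if f[0][0] == "B" and f[1][0] == "B"]
    return Fraction(len(both_sons), len(qualifying))

for d in (1, 7, 30):
    assert two_sons_given_yes(d) == Fraction(2 * d - 1, 4 * d - 1)
    print(d, two_sons_given_yes(d))
```

A one-day week (d = 1) carries no day information and gives 1/3, d = 7 gives 13/27, and the value creeps toward 1/2 as d grows.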

At this point my imaginary conversation stops and I do not know whether I have convinced Jack or not.

Now let me give you another probability problem, where the answer is 13/27:

You pick a random father of two children and ask him, “Yes or no, do you have a son born on Tuesday?” Let’s make a leap and assume that all fathers know the day of the births of their children and that they answer truthfully. If the answer is yes, what is the probability of the father having two sons?

Jack’s argument works perfectly in this case.

My homework for the readers is: Explain the difference between these two problems. Why is the second problem well-defined, while the first one is not?

A Son Born on Tuesday

Suppose you meet a friend who you know for sure has two children, and he says: “I have a son born on Tuesday.” What is the probability that the second child of this man is also a son?

People argue about this problem a lot. Although I’ve discussed similar problems in the past, this particular problem has several interesting twists. See if you can identify them.

First, let us agree on some basic assumptions:

  1. Sons and daughters are equally probable. This is not exactly true, but it is a reasonable approximation.
  2. For our purposes, twins do not exist. Not only is the proportion of twins in the population small, but because they are born on the same day, twins might complicate the calculation.
  3. All days of the week are equally probable birthdays. While this can’t actually be true — for example, assisted labors are unlikely to be scheduled for weekends — it is a reasonable approximation.

Now let us consider the first scenario. A father of two children is picked at random. He is instructed to choose a child by flipping a coin. Then he has to provide information about the chosen child in the following format: “I have a son/daughter born on Mon/Tues/Wed/Thurs/Fri/Sat/Sun.” If his statement is, “I have a son born on Tuesday,” what is the probability that the second child is also a son?

The probability that a father of two daughters will make such a statement is zero. The probability that a father of differently-gendered children will produce such a statement is 1/14. Indeed, with a probability of 1/2 the son is chosen over the daughter and with a probability of 1/7 Tuesday is the birthday.

The probability that a father of two sons will make this statement is 1/7. Among the fathers with two children, twice as many have a son and a daughter as have two sons. Plugging these numbers into the formula for calculating the conditional probability will give us a probability of 1/2 for the second child to also be a son.

Now let us consider the second scenario. A father of two children is picked at random. If he has two daughters he is sent home and another one is picked at random until a father is found who has at least one son. If he has one son, he is instructed to provide information on his son’s day of birth. If he has two sons, he has to choose one at random. His statement will be, “I have a son born on Mon/Tues/Wed/Thurs/Fri/Sat/Sun.” If his statement is, “I have a son born on Tuesday,” what is the probability that the second child is also a son?

The probability that a father of differently-gendered children will produce such a statement is 1/7. If he has two sons, the probability will likewise be 1/7. Among the fathers with two children, twice as many have a son and a daughter as have two sons. Plugging these numbers into the formula for calculating the conditional probability gives us a probability of 1/3 for the second child to also be a son.

Now let us consider the third scenario. A father of two children is picked at random. If he doesn’t have a son who is born on Tuesday, he is sent home and another is picked at random until one who has a son that was born on Tuesday is found. He is instructed to tell you, “I have a son born on Tuesday.” What is the probability that the second child is also a son?

The probability that a father of two daughters will have a son born on Tuesday is zero. The probability that a father of differently-gendered children will have a son who is born on Tuesday is 1/7. The probability that a father of two sons will have a son born on Tuesday is 13/49. Among the fathers with two children, twice as many have a son and a daughter as have two sons. Plugging these numbers into the formula for calculating the conditional probability will give us a probability of 13/27 for the second child to also be a son.
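The three scenarios differ only in the probability that a given family produces the statement “I have a son born on Tuesday.” The sketch below encodes each scenario as such a likelihood and applies Bayes' rule over the 196 equally likely families, under the three basic assumptions listed earlier; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

CHILDREN = [(sex, day) for sex in "BG" for day in range(7)]
FAMILIES = list(product(CHILDREN, repeat=2))  # 196 equally likely ordered families
TUESDAY_SON = ("B", 1)                        # which index means Tuesday is arbitrary

def p_two_sons(likelihood):
    """P(two sons | statement), where likelihood(family) = P(statement | family)."""
    total = sum(likelihood(f) for f in FAMILIES)
    sons = sum(likelihood(f) for f in FAMILIES if f[0][0] == "B" and f[1][0] == "B")
    return Fraction(sons) / Fraction(total)

def scenario_1(family):
    # A coin picks one of the two children; the father describes that child.
    return Fraction(family.count(TUESDAY_SON), 2)

def scenario_2(family):
    # Fathers without sons are sent home; the others describe a randomly chosen son.
    sons = [c for c in family if c[0] == "B"]
    return Fraction(sons.count(TUESDAY_SON), len(sons)) if sons else Fraction(0)

def scenario_3(family):
    # Only fathers who have a son born on Tuesday are kept.
    return Fraction(1 if TUESDAY_SON in family else 0)

for scenario in (scenario_1, scenario_2, scenario_3):
    print(scenario.__name__, p_two_sons(scenario))
```

The same prior and the same heard statement yield 1/2, 1/3 and 13/27; only the selection procedure changes.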

Now let’s go back to the original problem. Suppose you meet your friend who you know has two children and he tells you, “I have a son born on Tuesday.” What is the probability that the second child is also a son?

What puzzles me is that I’ve never run into a similar problem about daughters or mothers. I’ve discussed this math problem about these probabilities with many people many times. But I keep stumbling upon men who passionately defend their wrong solution. When I dig into why their solution is wrong, it appears that they implicitly assume that if a man has a daughter and a son, he won’t bother talking about his daughter’s birthday at all.

I’ve seen this so often that I wonder if this is a mathematical mistake or a reflection of their bias.

How to solve the original problem? The problem is under-defined. The solution depends on the reason the father only mentions one child, or the Tuesday.

The funny part of this story is that I, Tanya Khovanova, have two children. And the following statement is true: “I have a son born on Tuesday.” What is the probability that my second child is a son?
