13 September 2017

Now even second-hand books are fake

First off, my excuses for the shameless plug in this post.  I co-edited a book and it's due to appear this week:

Actually, I should say it "was" due to appear this week, because as you can see, Amazon thought it would be available yesterday (Tuesday 12 September).  My colleagues and I have submitted the final proof corrections and been told that it's in production, but I guess these things can always slip a bit.  So Amazon's robots have detected that the book is officially available, but they don't have any copies in the warehouse.  Hence the (presumably automatically-generated) message saying "Temporarily out of stock".

As of today the book costs £140 at Amazon UK, $187.42 at Amazon.com, €202.16 at Amazon.de, €158.04 at Amazon.es, and Amazon.fr don't have a price for it.  This is a lot of money and I can't really recommend the average student to buy a personal copy, although I hope that anyone who is even peripherally interested in positive psychology will pester their librarian to acquire several!  (That said, I've recently reviewed some academic books that are about one-quarter the size and which cost £80-100.  So perhaps our book isn't too bad value in relative terms.  Plus at 590 pages thick you can probably use it to fend off attackers, although obviously this is not professional security advice and you should definitely call the police first.)

But all of that is a bit academic (ha ha) today, because the book is out of stock.  Or is it?  Look at that picture again: "1 Used from £255.22".

Now, I guess it makes sense that a used book would sometimes be more expensive than the new one, if the latter is out of stock.  Maybe the seller has a copy and hopes that someone really, really needs the book quickly and is prepared to pay a premium for that (assuming that a certain web site *cough* doesn't have the full PDF yet).  Wonderful though I obviously think our book is, though, I can't imagine that anyone has been assigned it as a core text for their course just yet (hint hint).  But I guess this stuff is all driven by algorithms, which presumably have to cope with books like this and the latest from J. K. Rowling, so maybe that's OK.

However, alert readers may already have spotted the bigger problem here.  There is no way that the seller of the used book can actually have a copy of it in stock, because the book does not exist yet.  I clicked on the link and got this:

So not only is the book allegedly used, but it's in the third-best possible condition ("Good", below "Like New" and "Very Good").

The seller, Tundra Books, operates out of what appears to be a residential address in Sevilla, Spain.  Of course, plenty of people work from home, but it doesn't look like you could fit a great deal of stock in one of those maisonettes.  I wonder what magical powers they possess that enable them to beam slightly dog-eared academic handbooks back from the future?  Or is it just possible that if I order this book for over £100 more than the list price, I will receive it in about four weeks (say, the time it takes to order a new one from Amazon.es, ruffle the pages a bit, and send it on to me)?

- My attention was first drawn to the "out of stock" issue by my brother-in-law, Tony Douglas, who also painted the wonderful picture on the cover.  Seriously, even if you don't open the book it's worth the price for the cover alone.  (And remember, "it's not nepotism if the work is really good".)
- Matti Heino spotted the "1 Used from £255.22" in the screen shot.

07 June 2017

Exploring John Carlisle's "bombshell" article about fabricated data in RCTs

For the past couple of days, this article by John Carlisle has been causing a bit of a stir on Twitter. The author claims that he has found statistical evidence that a surprisingly high proportion of randomised controlled trials (RCTs) contain data patterns that cannot have arisen by chance.  Given that he has previously been instrumental in uncovering high-profile fraud cases, and also that he used data from articles that are known to be fraudulent (because they have been retracted) to calibrate his method, the implication is that some percentage of these impossible numbers are the result of fraud.  The title of the article is provocative, too: "Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals".  So yes, there are other reasons, but the implication is clear (and has been picked up by the media): There is a bit / some / a lot of data fabrication going on.

Because I anticipate that Carlisle's article is going to have quite an impact once more of the mainstream media decide to run with it, I thought I'd spend some time trying to understand exactly what Carlisle did.  This post is a summary of what I've found out so far.  I offer it in the hope that it may help some people to develop their own understanding of this interesting forensic technique, and perhaps as part of the ongoing debate about the limitations of such "post publication analysis" techniques (which also include things such as GRIM and statcheck).

[[Update 2017-06-12 19:35 UTC: There is now a much better post about this study by F. Perry Wilson here.]]

How Carlisle identified unusual patterns in the articles

Carlisle's analysis method was relatively simple.  He examined the baseline differences between the groups in the trial on (most of) the reported continuous variables.  These could be straightforward demographics like age and height, or they could be some kind of physiological measure taken at the start of the trial.  Because participants have been randomised into these groups, any difference between them is (by definition) due to chance.  Thus, we would expect a certain distribution of the p values associated with the statistical tests used to compare the groups; specifically, we would expect to see a uniform distribution (all p values are equally likely when the null hypothesis is true).
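To make the "uniform distribution" claim concrete, here is a minimal simulation sketch.  It is my own illustration rather than anything taken from Carlisle's article, and the sample size, mean, and SD are arbitrary assumptions: one baseline variable is generated for 200 people, they are randomly split into two groups of 100, and the t test p value is recorded, 5,000 times over.

```r
# Sketch: under genuine randomisation, baseline p values are roughly uniform.
# All numbers here (n = 200, mean 43, SD 8.5, 5000 replications) are arbitrary.
set.seed(1)
p <- replicate(5000, {
  x <- rnorm(200, mean = 43, sd = 8.5)        # one baseline variable, e.g. age
  g <- sample(rep(c(TRUE, FALSE), each = 100)) # random allocation to two groups
  t.test(x[g], x[!g], var.equal = TRUE)$p.value
})
mean(p < .05)                 # should be close to .05
mean(p)                       # should be close to .50
hist(p, breaks = 20)          # should look roughly flat
```

If the histogram is roughly flat, about 5% of the p values fall below .05 and about half fall above .50, which is the pattern that fabricated baseline tables may fail to show.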

Not all of the RCTs report test statistics and/or p values for the difference between groups at baseline (it is not clear what a p value would mean, given that the null hypothesis is known to be true), but they can usually be calculated from the reported means and standard deviations.  In his article, Carlisle gives a list of the R packages and functions that he used to reconstruct the test statistics and perform his other analyses.

Carlisle's idea is that, if the results have been fabricated (for example, in an extreme case, if the entire RCT never actually took place), then the fakers probably didn't pay too much attention to the p values of the baseline comparisons.  After all, the main reason for presenting these statistics in the article is to show the reader that your randomisation worked and that there were no differences between the groups on any obvious confounders.  So most people will just look at, say, the ages of the participants, and see that in the experimental condition the mean was 43.31 with an SD of 8.71, and in the control condition it was 42.74 with an SD of 8.52, and think "that looks pretty much the same".  With 100 people in each condition, the p value for this difference is about .64, but we don't normally worry about that very much; indeed, as noted above, many authors wouldn't even provide a p value here.
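The p value of about .64 for the age example can be recovered from the summary statistics alone.  The helper below is a sketch of my own (the name t_from_summary is made up, and it assumes an ordinary pooled-variance two-sample t test, which may not be exactly what Carlisle's code does):

```r
# Two-sample t test reconstructed from means, SDs and group sizes only.
# t_from_summary() is a hypothetical helper name, not a function from any package.
t_from_summary <- function(m1, sd1, n1, m2, sd2, n2) {
  df  <- n1 + n2 - 2
  sp2 <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / df   # pooled variance
  t   <- (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
  p   <- 2 * pt(-abs(t), df)                           # two-tailed p value
  c(t = t, df = df, p = p)
}

res <- t_from_summary(43.31, 8.71, 100, 42.74, 8.52, 100)
round(res, 2)
```

The p element of the result comes out at about .64, matching the figure quoted above.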

Now consider what happens when you have ten baseline statistics, all of them fabricated.  People are not very good at making up random numbers, and the fakers here probably won't even realise that as well as inventing means and SDs, they are also making up p values that ought to be randomly distributed.  So it is quite possible that they will make up mean/SD combinations that imply differences between groups that are either collectively too small (giving large p values) or too large (giving small p values).

Reproducing Carlisle's analyses

In order to better understand exactly what Carlisle did, I decided to reproduce a few of his analyses.  I downloaded the online supporting information (two Excel files which I'll refer to as S1 and S2, plus a very short descriptive document) here.  The Excel files have one worksheet per journal with the worksheet named NEJM (corresponding to articles published in the New England Journal of Medicine) being on top when you open the file, so I started there.

Carlisle's principal technique is to take the p values from the various baseline comparisons and combine them.  His main way of doing this is with Stouffer's formula, which is what I've used in this post.  Here's how that works:
1. Convert each p value into a z score.
2. Sum the z scores.
3. If there are k scores, divide the sum of the z scores from step 2 by the square root of k.
4. Calculate the one-tailed p value associated with the overall z score from step 3.

In R, that looks like this.  Note that you just have to create the vector with your p values (first line) and then you can just copy/paste the second line, which implements the entire Stouffer formula.

plist = c(.95, .84, .43, .75, .92)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist))) 
[1] 0.02110381

That is, the p value associated with the test that these five p values arose by chance is .02.  Now if we start suggesting that something is untoward based on the conventional significance threshold of .05 we're going to have a lot of unhappy innocent people, as Carlisle notes in his article (more than 1% of the articles he examined had a p value < .00001), so we can probably move on from this example quite rapidly.  On the other hand, if you have a pattern like this in your baseline t tests:

plist = c(.95, .84, .43, .75, .92, .87, .83, .79, .78, .84)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 0.00181833

then things are starting to get interesting.  Remember that the p values should be uniformly distributed from 0 to 1, so we might wonder why all but one of them are above .50.
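A much cruder check than Stouffer's formula (and not, as far as I know, part of Carlisle's method) is to ask how often nine or more of ten uniformly distributed p values would land above .50.  That is just a binomial tail probability:

```r
# P(at least 9 of 10 p values exceed .50) if each is uniform on [0, 1].
# This sign-test-style check is my own illustration, not Carlisle's procedure.
p_sign <- pbinom(8, size = 10, prob = 0.5, lower.tail = FALSE)
p_sign   # 11/1024, i.e. about .011
```

This ignores how far above .50 the values are, which is why Stouffer's formula gives a smaller number for the same list.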

In Carlisle's model, suspicious distributions are typically those with too many high p values (above 0.50) or too many low ones, which give overall p values that are close to either 0 or 1, respectively.  For example, if you subtract all five of the p values in the first list I gave above from 1, you get this:

plist = c(.05, .16, .57, .25, .08)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 0.9788962

and if you subtract that final p value from 1, you get the value of 0.0211038 that appears above.
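For anyone who prefers to see the four steps spelled out, the one-liner can be unpacked into a small function.  The name stouffer is my own choice; the calculation is the same as in the snippets above, and the last line checks the symmetry just described.

```r
# The same Stouffer calculation, one step per line.
stouffer <- function(p) {
  z <- qnorm(p)                           # step 1: p values to z scores
  z_overall <- sum(z) / sqrt(length(z))   # steps 2-3: sum, divide by sqrt(k)
  1 - pnorm(z_overall)                    # step 4: one-tailed p value
}

plist <- c(.95, .84, .43, .75, .92)
stouffer(plist)                           # about .0211
stouffer(1 - plist)                       # about .9789
stouffer(plist) + stouffer(1 - plist)     # 1, because qnorm(1 - p) = -qnorm(p)
```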

To reduce the amount of work I had to do, I chose three articles that were near the top of the NEJM worksheet in the S1 file (in principle, the higher up the worksheet the study is, the bigger the problems) and that had not too many variables in them.  I have not examined any other articles at this time, so what follows is a report on a very small sample and may not be representative.

Article 1

The first article I chose was by Jonas et al. (2002), "Clinical Trial of Lamivudine in Children with Chronic Hepatitis B", which is on line 8 of the NEJM worksheet in the S1 file.  The trial number (cell B8 of the NEJM worksheet in the S1 file) is given as 121, so I next looked for this number in column A of the NEJM worksheet of the S2 file and found it on lines 2069 through 2074.  Those lines allow us to see exactly which means and SDs were extracted from the article and used as the basis for the calculations in the S1 file.  (The degree to which Carlisle has documented his analyses is extremely impressive.)  In this case, the means and SDs correspond to the three baseline variables reported in Table 1 of Jonas et al.'s article:
By combining the p values from these variables, Carlisle arrived at an overall inverted (see p. 4 of his article) p value of .99997992. This needs to be subtracted from 1 to give a conventional p value, which in this case is .00002.  That would appear to be very strong evidence against the null hypothesis that these numbers are the product of chance. However, there are a couple of problems here.

First, Carlisle made the following note in the S1 file (cell A8):
Patients were randomly assigned. Variables correlated (histologic scores). Supplementary baseline data reported as median (range). p values noted by authors.
But in cells O8, P8, and Q8, he gives different p values from those in the article, which I presume he calculated himself.  After subtracting these p values from 1 (to take into account the inversion of the input p values that Carlisle describes on p. 4 of his article), we can see that the third p value in cell Q8 is rather different (.035 has become approximately .10).  This is presumably because the original p values were derived from a non-parametric test which would be impossible to reproduce without the data, so presumably Carlisle assumed a parametric model (for example, p values can be calculated for a t test from the mean, SD, and sample sizes of the two groups).  Note that in this case, the difference in p values actually works against a false positive, but the general point is that not all p value analyses can be reproduced from the summary statistics.

Second, and much more importantly, the three baseline variables here are clearly not independent.  The first set of numbers ("Total score") is merely the sum of the other two, and arguably these other two measures of liver deficiencies are quite likely to be related to one another.  Even if we ignore that last point and only remove "Total score", considering the other two variables to be completely independent, the overall p value for this RCT would change from .00002 to .001.

plist = c(0.997916597, 0.998464969, 0.900933333)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 2.007969e-05
plist = c(0.998464969, 0.900933333)     #omitting "Total score"
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 0.001334679
Carlisle discusses the general issue of non-independence on p. 7 of his article, and in the quote above he actually noted that the liver function scores in Jonas et al. were correlated.  That makes it slightly unfortunate that he didn't take some kind of measures to compensate for the correlation.  Leaving the raw numbers in the S1 file as if the scores were uncorrelated meant that Jonas et al.'s article appeared to be the seventh most severely problematic article in NEJM.
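To get a feel for how much a "total score" variable can matter, here is a simulation sketch.  Everything in it is an assumption of mine (two independent standard-normal subscores, 100 people per group, 2,000 simulated trials per setting); it is not a reanalysis of Jonas et al.'s data.  It compares how often the combined Stouffer p value lands in the extreme 1% at either end when the sum of the two subscores is or is not included as a third variable.

```r
# Sketch: adding a variable that is the sum of two others makes extreme
# combined p values more common, even though nothing has been fabricated.
set.seed(2)
stouffer <- function(p) 1 - pnorm(sum(qnorm(p)) / sqrt(length(p)))

one_trial <- function(n = 100, include_total = TRUE) {
  a <- rnorm(2 * n)                        # first subscore
  b <- rnorm(2 * n)                        # second subscore, independent of a
  g <- rep(c(TRUE, FALSE), each = n)       # two randomised groups
  vars <- list(a, b)
  if (include_total) vars <- c(vars, list(a + b))
  p <- sapply(vars, function(v) t.test(v[g], v[!g], var.equal = TRUE)$p.value)
  stouffer(p)
}

with_total    <- replicate(2000, one_trial(include_total = TRUE))
without_total <- replicate(2000, one_trial(include_total = FALSE))

extreme <- function(p) mean(p < .01 | p > .99)
extreme(without_total)   # close to the nominal .02
extreme(with_total)      # noticeably larger
```

With independent variables the extreme tails are hit about 2% of the time, as they should be; with the redundant total included, the rate is clearly higher.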

(* Update 2017-06-08 10:07 UTC: In the first version of this post, I wrote "it is slightly unfortunate that [Carlisle] apparently didn't spot the problem in this case".  This was unfair of me, as the quote from the S1 file shows that Carlisle did indeed spot that the variables were correlated.)

Article 2

Looking further down file S1 for NEJM articles with only a few variables, I came across Glauser et al. (2010), "Ethosuximide, Valproic Acid, and Lamotrigine in Childhood Absence Epilepsy" on line 12.  This has just two baseline variables for participants' IQs measured with two different instruments.  The trial number is 557, which leads us to lines 10280 through 10285 of the NEJM worksheet in the S2 file.  Each of the two variables has three values, corresponding to the three treatment groups.

Carlisle notes in his article (p. 7) that the p value for the one-way ANOVA comparing the groups for the second variable is misreported.  The authors stated that this value is .07, whereas Carlisle calculates (and I concur, using ind.oneway.second() from the rpsychi package in R) that this should be around .0000007.  Combining this p value with the .47 from the first variable, Carlisle obtains an overall (conventional) p value of .0004 to test the null hypothesis that these group differences are the result of chance.

But it seems to me that there may be a more parsimonious explanation for these problems.  The two baseline variables are both measures of IQ, and one would expect them to be correlated.  Inspection of the group means in Glauser et al.'s Table 2 (a truncated version of which is shown above) suggests that the value for the Lamotrigine group on the WPPSI measure is a lot lower than might be expected, given that this group scored slightly higher on the WISC measure.  Indeed, when I replaced the value of 92.0 with 103.0, I obtained a p value for the one-way ANOVA of almost exactly .07.  Of course, there is no direct evidence that 92.0 was the result of a finger slip (or, perhaps, copying the wrong number from a printed table), but it certainly seems like a reasonable possibility.  A value of 96.0 instead of 92.0 would also give a p value close to .07.

xsd = c(16.6, 14.5, 14.8)
xn = c(155, 149, 147)
an1 = ind.oneway.second(m=c(99.1, 92.0, 100.0), sd=xsd, n=xn)
[1] 12.163 
1 - pf(q=12.163, df1=2, df2=448)
[1] 7.179598e-06

an2 = ind.oneway.second(m=c(99.1, 103.0, 100.0), sd=xsd, n=xn)
[1] 2.668
1 - pf(q=2.668, df1=2, df2=448)
[1] 0.0704934
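If you don't have rpsychi installed, the F statistics above can be reproduced in base R from the means, SDs, and group sizes.  The function below is a sketch of my own (the name f_from_summary is made up) using the textbook between-groups and within-groups sums of squares.

```r
# One-way ANOVA F statistic and p value from summary statistics only.
# f_from_summary() is a hypothetical helper, not part of rpsychi.
f_from_summary <- function(m, sd, n) {
  k     <- length(m)
  N     <- sum(n)
  grand <- sum(n * m) / N                     # weighted grand mean
  msb   <- sum(n * (m - grand)^2) / (k - 1)   # between-groups mean square
  msw   <- sum((n - 1) * sd^2) / (N - k)      # within-groups mean square
  f     <- msb / msw
  c(F = f, df1 = k - 1, df2 = N - k, p = 1 - pf(f, k - 1, N - k))
}

xsd <- c(16.6, 14.5, 14.8)
xn  <- c(155, 149, 147)
as_reported <- f_from_summary(c(99.1,  92.0, 100.0), xsd, xn)
with_103    <- f_from_summary(c(99.1, 103.0, 100.0), xsd, xn)
with_96     <- f_from_summary(c(99.1,  96.0, 100.0), xsd, xn)
round(rbind(as_reported, with_103, with_96), 4)
```

The first two rows should agree with the F values of 12.163 and 2.668 shown above, and the third row covers the alternative value of 96.0.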
It also seems slightly strange that someone who was fabricating data would choose to make up this particular pattern.  After having invented the numbers for the WISC measure, one would presumably not add a few IQ points to two of the values and subtract a few from the other, thus inevitably pushing the groups further apart, given that the whole point of fabricating baseline IQ data would be to show that the randomisation had succeeded; to do the opposite would seem to be very sloppy.  (However, it may well be the case that people who feel the need to fabricate data are typically not especially intelligent and/or statistically literate; the reason that there are very few "criminal masterminds" is that most masterminds find a more honest way to earn a living.)

Article 3

Continuing down the NEJM worksheet in the S1 file, I came to Heydendael et al. (2003) "Methotrexate versus Cyclosporine in Moderate-to-Severe Chronic Plaque Psoriasis" on line 33.  Here there are three baseline variables, described on lines 3152 through 3157 of the NEJM worksheet in the S2 file.  These variables turn out to be the patient's age at the start of the study, the extent of their psoriasis, and the age at which psoriasis first developed, as shown in Table 1, which I've reproduced here.

Carlisle apparently took the authors at their word that the numbers after the ± symbol were standard errors, as he seems to have converted them to standard deviations by multiplying them by the square root of the sample size (cells F3152 through F3157 in the S2 file).  However, it seems clear that, at least in the case of the ages, the "SEs" can only have been SDs.  The values calculated by Carlisle for the SDs are around 80, which is an absurd standard deviation for human ages; in contrast, the values ranging from 12.4 through 14.5 in the table shown above are quite reasonable as SDs.  It is not clear whether the "SE" values in the table for the psoriasis area-and-severity index are in fact likely to be SDs, or whether Carlisle's calculated SD values (23.6 and 42.8, respectively) are more likely to be correct.
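The conversion in question is simply SD = SE × √n.  As a quick sanity check, here it is applied to the two "SE" values that the table gives for age at the start of the study (13.0 and 12.4, with group sizes of 43 and 42); the variable names are my own.

```r
# If the numbers after the ± really were standard errors, the implied SDs
# for age would be SE * sqrt(n).
reported   <- c(13.0, 12.4)   # values after the ± for age, from Table 1
n          <- c(43, 42)       # group sizes
implied_sd <- reported * sqrt(n)
round(implied_sd, 1)
```

Both implied SDs come out above 80 years, which is not a plausible spread for the ages of adult patients.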

Carlisle calculated an overall p value of .005959229 for this study.  Assuming that the SDs for the age variables are in fact the numbers listed as SEs in the above table, I get an overall p value of around .79 (with a little margin for error due to rounding error on the means and SDs, which are given to only one decimal place).

xn = c(43, 42)

an1 = ind.oneway.second(m=c(41.6, 38.3), sd=c(13.0, 12.4), n=xn)
[1] 1.433
1 - pf(q=1.433, df1=1, df2=83)
[1] 0.2346833 
an2 = ind.oneway.second(m=c(25.1, 24.3), sd=c(14.5, 13.3), n=xn)
[1] 0.07
1 - pf(q=0.07, df1=1, df2=83)
[1] 0.7919927
# Keep value for psoriasis from cell P33 of S1 file
plist = c(0.2346833, 0.066833333, 0.7919927)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 0.7921881

The fact that Carlisle apparently did not spot this issue is slightly ironic given that he wrote about the general problem of confusion between standard deviations and standard errors in his article (pp. 5–6) and also included comments about possible mislabelling by authors of SDs and SEs in several of the notes in column A of the S1 spreadsheet file.


The above analyses show how easy it can be to misinterpret published articles when conducting systematic forensic analyses. I can't know what was going through Carlisle's mind when he was reading the articles that I selected to check, but having myself been through the exercise of reading several hundred articles over the course of a few evenings looking for GRIM problems, I can imagine that obtaining a full understanding of the relations between each of the baseline variables may not always have been possible.

I want to make it very clear that this post is not intended as a "debunking" or "takedown" of Carlisle's article, for several reasons.  First, I could have misunderstood something about his procedure (my description of it in this post is guaranteed to be incomplete).  Second, Carlisle has clearly put a phenomenal amount of effort (thousands of hours, I would guess) into these analyses, for which he deserves a vast amount of credit (and does not deserve to be the subject of nitpicking).  Third, Carlisle himself noted in his article (p. 8) that it was inevitable that he had made a certain number of mistakes.  Fourth, I am currently in a very similar line of business myself at least part of the time, with GRIM and the Cornell Food and Brand Lab saga, and I know that I have made multiple errors, sometimes in public, where I was convinced that I had found a problem and someone gently pointed out that I had missed something (and that something was usually pretty straightforward).  I should also point out that the quotes around the word "bombshell" in the title of this post are not meant to belittle the results of Carlisle's article, but merely to indicate that this is how some media outlets will probably refer to it (using a word that I try to avoid like the plague).

If I had a takeaway message, I think it would be that this technique of examining the distribution of p values from baseline variable comparisons is likely to be less reliable as a predictor of genuine problems (such as fraud) when the number of variables is small.  In theory the overall probability that the results are legitimate and correctly reported is completely taken care of by the p values and Stouffer's formula for combining them, but in practice when there are only a few variables it only takes a small issue—such as a typo, or some unforeseen non-independence—to distort the results and make it appear as if there is something untoward when there probably isn't.
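The sensitivity to the number of variables is easy to illustrate.  In the sketch below, one input p value is extreme (as might happen with a typo in a table) and all the others are a perfectly unremarkable .50.  These are invented numbers, not values from any of the articles discussed, and the extreme value is entered in the "inverted" form used in the Article 1 example.

```r
# Sketch: one extreme input dominates the combined p value when k is small.
# The inputs are hypothetical and chosen only to illustrate the point.
stouffer <- function(p) 1 - pnorm(sum(qnorm(p)) / sqrt(length(p)))

k <- c(2, 5, 10, 20)   # number of baseline variables in the trial
overall <- sapply(k, function(kk) stouffer(c(1 - 1e-6, rep(.5, kk - 1))))
data.frame(k = k, overall_p = signif(overall, 2))
```

With two variables, the single odd value is enough to push the overall p value well below .001; with twenty, the same value is diluted to something that nobody would look at twice.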

I would also suggest that when looking for fabrication, clusters of small p values—particularly those below .05—may not be as good an indication as clusters of large p values.  This is just a continuation of my argument about the p value of .07 (or .0000007) from Article 2, above.  I think that Carlisle's technique is very clever and will surely catch many people who do not realise that their "boring" numbers showing no difference will produce p values that need to follow a certain distribution, but I question whether many people are fabricating data that (even accidentally) shows a significant baseline difference between groups, when such differences might be likely to attract the attention of the reviewers.

To conclude: One of the reasons that science is hard is that it requires a lot of attention to detail, which humans are not always very good at.  Even people who are obviously phenomenally good at it (including John Carlisle!) make mistakes.  We learned when writing our GRIM article what an error-prone process the collection and analysis of data can be, whether this be empirical data gathered from subjects (some of the stories about how their data were collected or curated that were volunteered by the authors whom we contacted to ask for their datasets were both amusing and slightly terrifying) or data extracted from published articles for the purposes of meta-analysis or forensic investigation.  I have a back burner project to develop a "data hygiene" course, and hope to get round to actually developing and giving it one day!

27 April 2017

An open letter to Dr. Todd Shackelford

To the editor of Evolutionary Psychological Science:

Dear Dr. Shackelford,

On April 24, 2017, in your capacity as editor of Evolutionary Psychological Science, you issued an Editorial Note [PDF] that referenced the article "Eating Heavily: Men Eat More in the Company of Women," by Kevin M. Kniffin, Ozge Sigirci, and Brian Wansink (Evolutionary Psychological Science, 2016, Vol. 2, No. 1, pp. 38–46).

The key point of the note is that the "authors report that the units of measurement for pizza and salad consumption were self-reported in response to a basic prompt 'how many pieces of pizza did you eat?' and, for salad, a 13-point continuous rating scale."

For comparison, here is the description of the data collection method from the article (p. 41):
Consistent with other behavioral studies of eating in naturalistic environments (e.g., Wansink et al. 2012), the number of slices of pizza that diners consumed was unobtrusively observed by research assistants and appropriate subtractions for uneaten pizza were calculated after waitstaff cleaned the tables outside of the view of the customers. In the case of salad, customers used a uniformly small bowl to self-serve themselves and, again, research assistants were able to observe how many bowls were filled and, upon cleaning by the waitstaff, make appropriate subtractions for any uneaten or half-eaten bowls at a location outside of the view of the customers.
It is clear that this description was, to say the least, not an accurate representation of the research record.  Nobody observed the number of slices of pizza.  Nobody counted partial uneaten slices when the plates were bussed.  Nobody made any surreptitious observations of salad either.  All consumption was self-reported.  It is difficult to imagine how this 100-plus word description could have accidentally slipped into an article.

Even if we ignore what appears to have been a deliberately misleading description of the method, there is a further very substantial problem now that the true method is known.  That is, the entire study would seem to depend on the amounts of food consumed having been accurately and objectively measured. Hence, the use of self-report measures of food consumption (which are subject to obvious biases, including questions around desirability), when the entire focus of the article is on how much food people actually (and perhaps unconsciously, due to the influence of evolutionarily-determined forces) consumed in various social situations, would seem to cast severe doubt on the validity of the study.  The methods described in the Editorial Note and the article itself are thus contradictory, as they describe substantially different methodologies. The difference between real-time unobtrusive observations by others, versus post hoc self-reports, is both practically and theoretically significant in this case. 

Hence, we are surprised that you apparently considered that issuing an "Editorial Note" was the appropriate response to the disclosure by the authors that they had given an incorrect description of their methods in the article.  Anyone who downloads the article today will be unaware that the study simply did not take place as described, nor that the results are probably confounded by the inevitable limitations of self-reporting.

Your note also fails to address a number of other discrepancies between the article and the dataset.  These include: (1) The data collection period, which the article reports as two weeks, but which the cover page for the dataset states was seven weeks; (2) The number of participants excluded for dining alone, which is reported as eight in the article but which appears to be six in the dataset; (3) The overall number of participants, which the article reports as 105, a number that is incompatible with the denominator degrees of freedom reported on five F tests on pp. 41–42 (109, 109, 109, 115, and 112).

In view of these problems, we believe that the only reasonable course of action in this case is to retract the article, and to invite the authors, if they wish, to submit a new manuscript with an accurate description of the methods used, including a discussion of the consequences of their use of self-report measures for the validity of their study.

Please note that we have chosen to publish this e-mail as an open letter here.   If you do not wish your reply to be published there, please let us know, and we will, of course, respect your wishes.


Nicholas J. L. Brown
Jordan Anaya
Tim van der Zee
James A. J. Heathers
Chris Chambers

12 April 2017

The final (maybe?) two articles from the Food and Brand Lab

It's been just over a week since Cornell University, and the Food and Brand Lab in particular, finally started to accept in public that there was something majorly wrong with the research output of that lab.  I don't propose to go into that in much detail here; it's already been covered by Retraction Watch and by Andrew Gelman on his blog.  As my quote in the Retraction Watch piece says, I'm glad that the many hours of hard, detailed, insanely boring work that my colleagues and I have put into this are starting to result in corrections to the scientific record.

The statement by Dr. Wansink contained a link to a list of articles for which he states that he has "reached out to the six journals involved to alert the editors to the situation".  When I clicked on that list, I was surprised to see two articles that neither my colleagues nor I had looked at yet.  I don't know whether Dr. Wansink decided to report these articles to the journals by himself, or perhaps someone else did some sleuthing and contacted him.  In any case, I thought that for completeness (and, of course, to oblige Tim van der Zee to update his uberpost yet again) I would have a look at what might be causing a problem with these two articles.

Wansink, B. (1994). Antecedents and mediators of eating bouts. Family and Consumer Sciences Research Journal, 23, 166–182. http://dx.doi.org/10.1177/1077727X94232005

Wansink, B. (1994). Bet you can’t eat just one: What stimulates eating bouts. Journal of Food Products Marketing, 1(4), 3–24. http://dx.doi.org/10.1300/J038v01n04_02

First up, there is a considerable overlap in the text of these two articles.  I estimate that 35–40% of the text from "Antecedents" had been recycled verbatim into "Bet", as shown in this image of the two articles side by side (I apologise for the small size of the page images from "Bet"):

The two articles present what appears to be the same study, from two different viewpoints (especially in the concluding sections, which as you can see above do not have any overlapping text) and with a somewhat different set of results reported. In "Antecedents", the theme is about education: broadly speaking, getting people to understand why they embark on phases of eating the same food, and the implications for dietary education.  In "Bet", by contrast, the emphasis is placed on food marketers; the aim is to get them to understand how they can encourage people to consume more of their product.  I suppose that, like the arms export policy of a country that sells arms to both sides in the same conflict, this could be viewed as hypocrisy or blissful neutrality.

The Method and Results sections show some curious discrepancies.  I assume the two articles must be describing the same study since the basic (212) and final (178) sample sizes are the same, and where the same item responses are reported in both articles, the numbers are generally identical, with one exception that I will mention below.  Yet some details differ for no obvious reason.  Thus, in "Antecedents", participants typically took 35 minutes to fill out a 19-page booklet, whereas in "Bet" they took 25 minutes to fill out an 11-page booklet.  In "Antecedents", the reported split between the kinds of food that participants discussed eating was 41% sweet, 29% salty, 16% dairy, and 14% "other".  In "Bet" the split was 52% sweet, 36% salty, and 12% "other".  The Cronbach's alpha reported for coder agreement was .87 in "Antecedents" but .94 in "Bet".

There are further inconsistencies in the main tables of results (Table 2 in "Antecedents", Table 1 in "Bet").  The principal measured variable changes from consumption intensity (i.e., the amount of the "eating bout" food that was consumed) to consumption frequency (the number of occasions on which the food was consumed), although the numbers remain the same.  The ratings given in response to the item "I enjoyed the food" are 0.8 lower in both conditions in "Bet" compared to "Antecedents".  On p. 14 of "Bet", the author reuses some text from "Antecedents" to describe the mean correlation between nutritiousness and consumption frequency, but inexplicably manages to copy the two correlations incorrectly from Table 2 and then calculate their mean incorrectly.

Finally, the F statistics and associated p values on p. 175 of "Antecedents" and pp. 12–13 of "Bet" have incorrectly reported degrees of freedom (177 should be 176) and in several cases, the p value is not, as claimed in the article, below .05.

Is this interesting?  Well, less than six months ago it would have been major news.  But today, so much has changed that I don't expect many people to want to read a story saying "Cornell professor blatantly recycled sole-authored empirical article", just as you can't get many people to click on "President of the United States says something really weird".  Even so, I think this is important.  It shows, as did James Heathers' post from a couple of weeks ago, that the same problems we've been finding in the output of the Cornell Food and Brand Lab go back more than 20 years, past the period when that lab was headquartered at UIUC (1997–2005), through its brief period at Penn (1995–1997), to Dr. Wansink's time at Dartmouth.  When Tim gets round to updating his summary of our findings, we will be up to 44 articles and book chapters with problems, over 23 years.  That's a fairly large problem for science, I think.

You can find annotated versions of the article discussed in this post here.

30 March 2017

More problematic articles from the Food and Brand Lab

If you've been following my posts, and those of my co-authors, on the problems with the research from the Cornell Food and Brand Lab, there probably won't be very much new here.  This post is mainly intended to collect a few problems in other articles that haven't been published yet, and which don't show any particularly new problem.

If you're trying to keep track of all of the problems, I recommend Tim van der Zee's excellent blog post entitled "The Wansink Dossier: An Overview", which he is updating from time to time to include new discoveries (including, hopefully, the ones below).

Apparent duplication of text without appropriate attribution

Wansink, B., & van Ittersum, K. (2007). Portion size me: Downsizing our consumption norms. Journal of the American Dietetic Association, 107, 1103–1106. http://dx.doi.org/10.1016/j.jada.2007.05.019

Wansink, B. (2010).  From mindless eating to mindlessly eating better. Physiology & Behavior, 100,  454–463. http://dx.doi.org/10.1016/j.physbeh.2010.05.003

Wansink, B., & van Ittersum, K. (2013). Portion size me: Plate-size induced consumption norms and win-win solutions for reducing food intake and waste.  Journal of Experimental Psychology: Applied, 19, 320–332. http://dx.doi.org/10.1037/a0035053

The 2010 article contains about 500 words (in the sections entitled "1.1. Consumption norms are determined by our environment", p. 455, and "1.2. Consumption monitoring — do people really know when they are full?", p. 456) that have been copied verbatim (with only very minor differences) from the sections entitled "Portion Sizes Create Our Consumption Norms" (p. 1104) and "We Underestimate the Calories in Large Portions" (pp. 1104–1105) in the 2007 article.

The 2013 article contains about 300 words (in the section entitled "Consumption Norms", p. 321) that have been copied verbatim (with only very minor differences) from the section entitled "Portion Sizes Create Our Consumption Norms" (p. 1104) in the 2007 article.  An indication that this text has been merely copied and pasted can be found in the text "For instance, larger kitchenware in homes all [sic] suggest a consumption norm...", which appears in the 2013 article; in the 2007 article, "larger kitchenware" was one of three items in a list, so that the word "all" was not inappropriate in that case.  (Remarkably, the 2013 article has a reference to the 2007 article in the middle of the text that was copied, without attribution, from that earlier article.)

The annotated versions of these articles, showing the apparently duplicated text, can be found here.

Unusual distributions of terminal digits in data

Wansink, B. (2003).  Profiling nutritional gatekeepers: Three methods for differentiating influential cooks.  Food Quality and Preference, 14, 289–297. http://dx.doi.org/10.1016/S0950-3293(02)00088-5

This is one of several studies from the Food and Brand Lab where questionnaires were sent out to different-sized samples of people chosen from different populations, and exactly 770 replies were received in each case, as I mentioned last week here.

I aggregated the reported means and F statistics from Tables 1 through 4 of this article, giving a total of 415 numbers reported to two decimal places.  Here is the distribution of the last digits of these numbers:

I think it is reasonable to assume that these final digits ought, in principle, to be uniformly distributed. Following Mosimann, Dahlberg, Davidian, and Krueger (2002), we can calculate the chi-square goodness-of-fit statistic for the counts of each of the 10 different final digits across the four tables:

> chisq.test(c(28, 41, 54, 59, 39, 48, 38, 26, 40, 42))
X-squared = 22.855, df = 9, p-value = 0.006529

It appears that we can reject the null hypothesis that the last digits of these numbers resulted from random processes.
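For readers who want to repeat this kind of check on another table, here is a minimal sketch of how the terminal digits might be tallied in R. The function name `last_digit_counts` and the six example values are made up for illustration and are not taken from the article; the values are handled as strings so that trailing zeroes are preserved.

```r
# Tally the terminal digits of numbers reported to a fixed number of decimals.
# Working from strings keeps trailing zeroes and avoids floating-point surprises.
last_digit_counts <- function(reported) {
  digits <- substr(reported, nchar(reported), nchar(reported))
  table(factor(digits, levels = as.character(0:9)))
}

# Made-up values for illustration only; NOT taken from the article.
example <- c("4.10", "3.27", "5.93", "2.48", "6.33", "1.07")
counts <- last_digit_counts(example)
print(counts)

# With a realistic number of values, the counts can be passed to chisq.test(),
# which by default tests against equal expected frequencies:
# chisq.test(as.vector(counts))
```

With only six values the test itself would be meaningless, which is why that call is left commented out.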

Another surprising finding in this article is that in Table 4, the personality traits "Curious" and "Imaginative" load identically on eight of the ten different categories of cook that are described.  The factor loadings for these traits are described in two consecutive lines of the table.  It's not clear if this is a copy/paste error, a "retyping from paper" error, or if these numbers are actually correct (which seems like quite a coincidence).

This article, annotated with the above near-duplicate line highlighted, can be found here.

Test statistics inconsistent with reported means and standard deviations

Wansink, B., Cardello, A., & North, J. (2005). Fluid consumption and the potential role of canteen shape in minimizing dehydration. Military Medicine, 170, 871–873. http://dx.doi.org/10.7205/MILMED.170.10.871

All of the reported test statistics in Table 1 are inconsistent with the means and standard deviations to which they are meant to correspond:

Actual ounces poured: reported F=21.2; possible range 24.39 to 24.65
Estimated ounces poured: reported F=2.3; possible range 2.57 to 2.63
Actual ounces consumed: reported F=16.1; possible range 17.77 to 17.97

Additionally, the degrees of freedom for these F tests (and others in the article) are consistently misreported as (1,49) instead of (1,48).
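The "possible range" figures come from recomputing F from the reported means and SDs while allowing for rounding. A rough sketch of that calculation for a two-group design follows. The function names and all of the input values are hypothetical (the article's actual means and SDs are not reproduced here), and the sketch assumes that every reported value may be off by up to half a unit in its last decimal place.

```r
# F statistic for a two-group one-way ANOVA, computed from summary statistics.
f_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  pooled_var <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
  (m1 - m2)^2 / (pooled_var * (1 / n1 + 1 / n2))
}

# Smallest and largest F consistent with values rounded to `digits` decimals.
f_range <- function(m1, s1, n1, m2, s2, n2, digits = 1) {
  h <- 0.5 * 10^(-digits)              # half-width of the rounding interval
  gap <- abs(m1 - m2)
  # Smallest F: narrowest possible gap between means, largest possible SDs.
  lo <- f_from_summary(0, s1 + h, n1, max(gap - 2 * h, 0), s2 + h, n2)
  # Largest F: widest possible gap between means, smallest possible SDs.
  hi <- f_from_summary(0, max(s1 - h, 0), n1, gap + 2 * h, max(s2 - h, 0), n2)
  c(min = lo, max = hi)
}

# Hypothetical inputs (NOT from the article): two groups of 25 participants.
f_point <- f_from_summary(12.0, 3.0, 25, 9.0, 3.0, 25)
f_limits <- f_range(12.0, 3.0, 25, 9.0, 3.0, 25)
f_point
f_limits
```

A reported F that falls outside the interval returned by `f_range` cannot be reconciled with the reported means and SDs, whatever the unrounded values were.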

On p. 873, the following is reported: "A second study involving 37 military police cadets in basic training at Fort Leonard Wood, Missouri, indicated that there was a similar tendency to pour more water into a short, wide opaque canteen than into a tall, narrow prototype canteen bottle (11.6 vs. 10.2 ounces; F(1,35) = 4.02; p < 0.05)".  Here, the degrees of freedom appear to be correctly reported (assuming that each participant used only one type of canteen), but the correct p value for F(1,35) is .053.  (This is one of the very rare problems in the Food and Brand Lab's output that statcheck might have been expected to detect.  However, it seems that the "=" sign in the reported statistic is a graphic, not an ASCII = character, and so statcheck can't read it.)
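The p value quoted above can be checked with base R's `pf()`; this short sketch simply recomputes the upper-tail probability for the reported statistic and, for comparison, the critical value that F would have needed to reach.

```r
# Upper-tail p value for F = 4.02 with (1, 35) degrees of freedom.
p <- pf(4.02, df1 = 1, df2 = 35, lower.tail = FALSE)
round(p, 3)

# Smallest F that would give p < .05 with these degrees of freedom.
f_crit <- qf(0.95, df1 = 1, df2 = 35)
f_crit
```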

The annotated version of this article, showing the inconsistencies described above, can be found here.

22 March 2017

Strange patterns in some results from the Food and Brand Lab

Regular readers of this blog, and indeed the news media, will be aware that there has recently been some scrutiny of the work of Dr. Brian Wansink, Director of the Cornell Food and Brand Lab. We have seen what appear to be impossible means and test statistics; inconsistent descriptions of the same research across articles; and recycling of text (and even, it would appear, a table of results) from one article or book chapter to another. (Actually, Dr. Wansink has since claimed that this table of results was not recycled; apparently, the study was rerun with a completely different set of participants, and yet almost all of the measured results—17 out of 18—were identical, including the decimal places.  This seems quite remarkable.)

In this post I'm going to explore some mysterious patterns in the data of three more articles that were published when Dr. Wansink was still at the University of Illinois at Urbana-Champaign (UIUC).  These articles appear to form a family because they all discuss consumer attitudes towards soy products; Dr. Wansink's CV here [PDF] records that in 2001 he received $3,000 for “Disseminating soy-based research to developing countries”. The articles are:

Wansink, B., & Chan, N. (2001). Relation of soy consumption to nutritional knowledge. Journal of Medicinal Food, 4, 145–150. http://dx.doi.org/10.1089/109662001753165729

Wansink, B., & Cheong, J. (2002). Taste profiles that correlate with soy consumption in developing countries. Pakistan Journal of Nutrition, 1, 276–278.

Wansink, B., & Westgren, R. (2003). Profiling taste-motivated segments. Appetite, 41, 323–327.

For brevity, I'll mostly refer to these articles by the names of their co-authors, as "Chan", "Cheong", and "Westgren", respectively.

Wansink & Chan (2001)

Chan describes a study of people's attitudes towards "medicinal and functional foods such as soy". It's not clear what a "functional food" (or, indeed, a "non-functional food") might be, and it might come as a surprise to people in many Asian countries who have been consuming soy all their life to hear it described as a "medicinal food", but I guess this article was written from an American perspective. Exactly what is categorised under "soy" is not made clear in the article, but one of the items asked people how many times in the past year they purchased "tofu or soy milk", so I presume we're talking about those kinds of soy products that tend to be associated in Western countries with vegetarian or vegan diets, rather than soy sauce or processed foods containing soy lecithin.

Of interest to us here is Table 2 from Chan. This shows the responses of 770 randomly-selected Americans, split by their knowledge of "functional foods" (apparently this knowledge was determined by asking them to define that term, with the response being coded in some unspecified way) to a number of items about their attitudes and purchasing habits with respect to soy products. Here is that table:

The authors' stated aim in this study was to see whether "a basic (even nominal) level of functional foods knowledge is related to soy consumption" (p. 148). To this end, they conducted a one-way ANOVA between the two groups (people with either no or some knowledge of functional foods), with the resulting F statistic being shown in the right-hand column of the table. You can see that with one exception ("Soy has an aftertaste"), all of the F statistics have at least one star by them, indicating that they are significant at the .05 or .01 level. Here is our first error, because as every undergraduate knows, F(1, D) is never significant at the .05 level below a value of 3.84 no matter how large the denominator degrees of freedom D are, thus making three of those stars (for 3.1, 3.6, and 2.9) wrong.
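The 3.84 threshold is easy to confirm in R. The sketch below shows the .05 critical value of F(1, D) shrinking towards its limit as D grows, and then computes p values for the three starred statistics. The denominator degrees of freedom of 405 are an assumption based on the group sizes of 138 and 269 printed at the top of the table's columns.

```r
# Critical value of F(1, D) at alpha = .05 for increasing D. The limit as
# D goes to infinity is the chi-square(1) critical value, about 3.84.
crit <- sapply(c(30, 100, 405, 10000, Inf),
               function(d) qf(0.95, df1 = 1, df2 = d))
crit
qchisq(0.95, df = 1)

# p values for the three starred statistics, assuming (1, 405) degrees of
# freedom (138 + 269 - 2 = 405).
starred <- c(3.1, 3.6, 2.9)
p_values <- pf(starred, df1 = 1, df2 = 405, lower.tail = FALSE)
round(p_values, 3)
```

Because the critical value only ever decreases towards 3.84, no choice of denominator degrees of freedom can rescue a star on an F(1, D) statistic below that value.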

Also wrong are the reported degrees of freedom for the F test, which with the sample sizes at the top of the columns (138 and 269) should be (1, 405). Furthermore, the number of participants who answered the question about their knowledge of functional foods seems to be inconsistently reported: first as 363 on p. 147 of the article, then as 190 in the footnote to Table 1, which also appears to claim that 138 + 269 = 580. (It's also slightly surprising that out of 770 participants, either 363 or 190 didn't give a simple one-line answer to the question about their knowledge of functional foods; the word "none" would apparently have sufficed for them to be included.)

However, if you have been following this story for the past couple of months, you will know that these kinds of rather sloppy errors are completely normal in articles from this lab, and you might have guessed that I wouldn't be writing yet another post about such "minor" problems unless there was quite a lot more to come.

It would be nice to be able to check the F statistics in the above table, but that requires knowledge of the standard deviations (SDs) of the means in each case, which are not provided. However, we can work provisionally with the simplifying assumption that the SDs are equal for each mean response to the same item. (If the SDs are unequal, then one will be smaller than the pooled value and the other will be larger, which actually exacerbates the problems reported below.) Using this assumption, we can try a number of candidate pooled SDs in an iterative process and calculate an approximation to the SD for the two groups. That gives these results:
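For the two-group case there is also a closed-form shortcut, which should agree with the iterative search described above. With equal SDs, F = (M1 − M2)² / (SD² × (1/n1 + 1/n2)), so the pooled SD can be solved for directly. In this sketch the function name is made up, the group sizes are those from Table 2 of Chan, and the means and F statistic are hypothetical values chosen only to show the mechanics.

```r
# Pooled SD implied by two group means, their sample sizes, and the reported F
# from a two-group one-way ANOVA (equal-SD assumption).
pooled_sd_from_f <- function(m1, n1, m2, n2, f) {
  abs(m1 - m2) / sqrt(f * (1 / n1 + 1 / n2))
}

# Group sizes from Table 2 of Chan; the means and F are HYPOTHETICAL.
implied_sd <- pooled_sd_from_f(m1 = 5.0, n1 = 138, m2 = 6.0, n2 = 269, f = 4.0)
implied_sd

# Round trip: recomputing F from the implied SD should return the input F.
f_check <- (5.0 - 6.0)^2 / (implied_sd^2 * (1 / 138 + 1 / 269))
f_check
```

The point of the hypothetical numbers is that a modest F attached to a one-point difference between groups of this size already implies a pooled SD of well over 4.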

The items on lines 1–3 and 26–28 had open-ended response formats, but those on lines 4–25 were  answered on 9-point Likert scales, from 1="strongly disagree" to 9="strongly agree". This means that the absolute maximum possible SD for the means on these lines is about 4.01 (where 4 is half the  difference between the highest and lowest value, and .01 is a little bonus for the fact that the formula for the sample SD has N−1, rather than N, in the denominator). You would get that maximum SD if half of your participants responded with 1 and half responded with 9. And that is only possible with a mean of 5.0; as the mean approaches 1 or 9, the maximum SD becomes smaller, so that for example with a mean of 7.0 or 3.0 the maximum SD is around 3.5.  (Again, it is possible that one of the SDs is smaller and the other larger. But if we can show that the pooled SD is impossible with either mean, then any split into two different SDs will result in one SD being even higher, making one of the means "even more impossible".)
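These limits can be sketched with a standard upper bound on the variance of a bounded variable (the Bhatia–Davis inequality), scaled by N/(N−1) to give the sample SD. The function name `max_sd` is made up for illustration, and N = 138 is used as an example group size. With integer-only responses the exact maximum can be fractionally lower than this bound, so the bound errs on the side of leniency.

```r
# Upper bound on the sample SD of n responses on a scale from lo to hi,
# given their mean m: population variance <= (m - lo) * (hi - m).
max_sd <- function(m, lo, hi, n) {
  sqrt((m - lo) * (hi - m) * n / (n - 1))
}

sd_mid  <- max_sd(5, lo = 1, hi = 9, n = 138)  # mean at the scale midpoint
sd_high <- max_sd(7, lo = 1, hi = 9, n = 138)  # mean nearer the top
sd_low  <- max_sd(3, lo = 1, hi = 9, n = 138)  # mirror image of the above
round(c(sd_mid, sd_high, sd_low), 2)
```

The bound is symmetric around the midpoint of the scale, which is why means of 3.0 and 7.0 share the same maximum.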

In the above image, I have highlighted in orange (because red makes the numbers hard to read) those SDs that are impossible, either because they exceed 4.01, or because they exceed the largest possible SD for the corresponding means. In a couple of cases the SD is possible for one of the means (M1), but not the other (M2), and if the SD of M2 were reduced to allow it to be (just) possible, the SD of M1 would become impossible.

I have also highlighted in yellow the SDs that, while not strictly impossible, are highly implausible. For example, the most moderate of these cases ("Soy will fulfill my protein requirements", M=4.8, SD=3.4) requires well over half of the 138 participants in the "no knowledge of functional food" group to have responded with either 1 or 9 to an item that, on average, they had no very strong opinion about (as shown by the mean, which is almost exactly midway between "strongly disagree" and "strongly agree"). The possible distribution of these responses shown below reminds me of the old joke about the statistician with one foot in a bucket of ice and another in a bucket of boiling water, who reports being "very comfortable" on average.

Thus, around half of the results—either the means, or the F statistics, or both—for the 22 items in the middle of Table 2 of Chan cannot be correct, either due to mathematical impossibility or because it would require extreme response patterns that simply don't happen in this kind of survey (and which, if they had occurred by some strange chance, the authors ought to have detected and reported).

A demonstration that several of the results for the open-ended (first and last three) items of the table are also extremely implausible is beyond the scope of this blog post (hint: some people spend a lot of time at the store checking that there is no soy in the food they buy, and some people apparently eat dinner more than once a day), but my colleague James Heathers will probably be along to tell you all about this very soon as part of his exciting new tool/method/mousetrap that he calls SPRITE, which he kindly deployed to make the above image, and the three other similar images that appear later in this post.

One more point on the sample size of 770 in this study.  The article reports that questionnaires were mailed to "a random national sample (obtained from U.S. Census data) of 1,002 adults", and 770 were returned, for a payment of $6.  This number of responses seems to be very common in research from this lab.  For example, in this study 770 questionnaires out of 1,600 were returned by a sample of "North Americans", a term which (cf. the description of the sample in Westgren, below) presumably means US and Canadian residents, who were paid $5. Meanwhile, in this study, 770 questionnaires out of 2,000 mailed to "a representative sample from 50 US states" were returned in exchange for $3.  One might get the impression from those proportions that paying more brings a higher response rate, but in this study when 2,000 questionnaires were mailed to "North Americans", even a $6 payment was not sufficient to change the number of responses from 770.  Finally, it is unclear whether the sample of 770 mentioned in this article and (in almost-identical paragraphs) in this article and this book chapter represents yet another mailing, or if it is the same as one of those just listed, because the references do not lead anywhere; this article gives slightly more details, but again refers back to one of the others.  If any readers can find a published description of this "loyalty program survey of current customers of a Kraft product" then I would be interested to see it. (A couple of people have also mentioned to me that a response rate of 77%, or even half that, is remarkably high for a randomly mailed survey, even with an "honor check" as an incentive.)
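For reference, the response rates implied by the four mailings described above work out as follows. This is just arithmetic on the figures quoted in the paragraph, and the column names are my own.

```r
# Mailings described above: questionnaires sent and payment offered (USD).
mailed   <- c(1002, 1600, 2000, 2000)
payment  <- c(6, 5, 3, 6)
returned <- 770                       # the same count in every case

rates <- round(100 * returned / mailed, 1)
data.frame(mailed, payment, response_rate_pct = rates)
```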

Now let's look at the other two articles out of the three that are the main subject of this blog post. As we'll see, it makes sense to read them together.

Wansink & Cheong (2002); Wansink & Westgren (2003)

Cheong (available here [PDF]) reports a study of the attitudes and soy consumption habits of a sample of 132 Indians and Pakistanis who were living in the United States (thus making the article's title, "Taste profiles that correlate with soy consumption in developing countries [emphasis added]", perhaps a little inaccurate) and associated in some way with UIUC. Westgren describes the results of a very similar study, with almost exactly the same items, among 606 randomly-selected North Americans (US and Canadian residents, selected from phone records).

The first thing one notices in reading these two articles is that about 40% of the text of Cheong has been duplicated verbatim in Westgren, making up about 20% of the latter article. We have seen this before with the lead author of these articles, but apparently he considers it not to be a big deal to "re-emphasize" his work in this way. Some of the duplicated text is in the Methods section, which a few people claim is not a particularly egregious form of self-plagiarism, but the majority comes from the Results and Discussion sections, which is unusual, to say the least, for two different empirical articles. This image shows the extent of the duplication; Cheong is on the left, Westgren on the right.

The evolution of Cheong into Westgren can be followed by downloading two drafts of the latter article from here (dated May 2003) and here (dated July 2003). The July version is very close to the final published text of Westgren. Interestingly, the Properties field of both of these PDFs reveals that the working title of the manuscript was "Profiling the Soy Fanatic". The co-author on the May draft is listed as JaeHak Cheong, but by July this had been changed to Randall Westgren.

As with Chan, the really interesting element of each of these articles is their respective tables of results, which are presented below, with Cheong first and Westgren second. The first seven items were answered on a 1–9 Likert-type scale; the others are expressed as a number of evening meals per week and so one would normally expect people to reply with a number in the range 0–7.

(For what it's worth, there is another incorrect significance star on the F statistic of 3.6 on the item "In general, I am an adventurous person" here.)

[[ Update 2017-03-23 22:10 UTC: As pointed out by an anonymous commenter, the above statement is incorrect. I hadn't taken into account that the numerator degrees of freedom are 2 here, or that the threshold for a star is p < 0.10, not 0.05.  I apologise for this sloppiness on my part.  However, this means that there are in fact two errors in the above table, because both this 3.6 and 5.9 ("Number of evening meals with which you drink wine during the average week") should have two stars.  It's particularly strange that 5.9 doesn't have two stars, since 5.3 ("I am traditional") does. ]]

Just as an aside here: I'm not an expert on the detailed eating habits of Indians and Pakistanis, but as far as I know, soy is not a major component of the traditional diet of citizens of either nation. Pakistanis are mostly Muslims, almost all of whom eat meat, and Indians are mostly Hindus and Sikhs (who tend to consume a lot of dairy if they are vegetarians) or Muslims. So I was quite surprised that out of 132 people from those two countries surveyed in Cheong, 91 claimed to eat soy.  Maybe people from the Indian sub-continent make more radical adaptations to their diet when they join an American academic community than just grabbing the occasional lunch at Subway.

OK, back to the tables.  Once again, it's instructive to examine the F statistics here, as they tell us something about the possible SDs for the samples.

For ease (I hope) of comparison, I have included blank lines for the two items that appeared in Cheong but not in Westgren. It is not clear why these two items were not included, since the May 2003 draft version of Westgren contains results (means and F statistics) for these items that are not the same as those in the published Cheong article, and so presumably represent the results of the second survey (rather than the remnants of the recycling exercise that was apparently involved in the generation of the Westgren manuscript). There are also four means that differ between the May draft of Westgren and the final article: "I live with (or am) a great cook"/"Health-related" (6.1 instead of 5.8), and all three means for "Number of evening meals eaten away from home during the average week" (0.9, 1.2, and 2.0 instead of 0.7, 1.7, and 1.7). However, the F statistics in the draft version for these two items are the same as in the published article.

The colour scheme in the tables above is the same as in the corresponding table for Chan, shown earlier. Again, a lot of the SDs are just impossible, and several others are, to say the least, highly implausible. As an example of the latter, consider the number of evening meals containing a soy-related food per week in the Indian-Pakistani group. If the SD for the first mean of 0.6 is indeed equal to the pooled value of 2.8, then the only possible distribution of integers giving that mean and SD suggests that two of these "non soy-eaters" must be eating soy for dinner eleven and fourteen times per week, respectively:

If instead the SD for this mean is half of that pooled value at 1.4, then a couple of the non-soy eaters must be having soy for dinner four or five times a week. This is one of several possible distributions, but they all look fairly similar:

This "more reasonable" SD of 1.4 for the non-soy eaters would also imply a pooled SD of 3.2 for the other two means, which would mean that about a third of the people who were "unambiguously categorized as eating soy primarily for health reasons" actually reported never eating soy for dinner at all:

To summarise: For five out of 11 items in Cheong, and seven out of nine items in Westgren, the numbers in the tables cannot—either due to mathematical limitations, or simply because of our prior knowledge about how the world works—be correct representations of the responses given by the participants, because the means and F statistics imply standard deviations that either cannot exist, or require crazy distributions.

Similarities between results in Cheong and Westgren

It is also interesting to note that the items in Cheong that had problems with impossible or highly implausible SDs in their study of 132 Indians and Pakistanis also had similar problems in Westgren with a sample of 606 random North Americans. This might suggest that whatever is causing these problems might not be an entirely random factor.

Two items from Cheong were not repeated in Westgren (in fact, as noted previously, it seems from the May 2003 draft of Westgren that these two items were apparently included in the questionnaire, but the responses were dropped at some point during drafting), but most of the answers to the remaining nine items seem to be quite similar. As an exercise, I took the sets of 27 means corresponding to the items that appear in both tables and treated them as vectors of potentially independent results.  The scatterplot of these two vectors looks like this:

This seems to me like a remarkably straight line.  As noted above, some of the variables have a range of 1–9 and others 0–7, but I don't think that changes very much.

I also calculated the correlation coefficient for these 27 pairs of scores.  I'm not going to give a p value for this because, whatever the sample, there is likely to be some non-zero degree of correlation at a few points in these data anyway due to the nature of certain items (e.g., for "Number of evening meals in which you eat a soy-related food during the average week", we would expect the people who "never eat soy" to have lower values than those who stated that they consumed soy for either taste or health reasons, whatever the sample), so it's not quite clear what our null hypothesis ought to be.

> cheong = c(2.8,5.6,7.1,5.6,5.7,4.5,4.9,5.7,7.9,4.9,4.3,5.4,
> westgren = c(2.3,5.8,7.2,5.3,4.2,3.1,3.8,6.3,7.8,4.1,4.6,5.8,
> cor(cheong, westgren)
[1] 0.9731398

Even allowing for the likely non-zero within-item correlation across the two samples mentioned in the preceding paragraph, this seems like a very high value.  We already know from earlier analyses that a considerable number of either the means or the F statistics (or both) in these two articles are not correct. If the problem is in the means, then something surprising has happened for these incorrect means to correlate so highly across the two studies. If, however, these means are correct, then as with the brand loyalty articles discussed here (section E), the authors seem to have discovered a remarkably stable effect in two very different populations.

Uneven distribution of last digits

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results (in the Cheong, Chan, and Westgren articles). Specifically, there is a curious absence of zeroes among these last digits.  Intuitively, we would expect 10% of the means of measured random variables, and the F statistics that result from comparing those means, to end in a zero at the last decimal place (which in most cases in the articles we are examining here is the only decimal place). A mathematical demonstration that this is indeed a reasonable expectation can be found here.

Here is a plot of the number of times each decimal digit appears in the last position in these tables:

For each table of results, here is the number of zeroes:

Chan: 84 numbers, 3 ending in zero (item 21/F statistic, 22/"Some knowledge", and 28/F statistic).
Cheong: 44 numbers, 1 ending in zero ("Number of evening meals which contain a meat during the average week"/"Taste").
Westgren: 36 numbers, none ending in zero.

Combining these three counts, we have four zeroes out of a total of 164 numbers. We can compute (using R, or Excel's BINOMDIST function, or the online calculator here) the binomial probability of this number of zeroes (or fewer) occurring by chance, if the last digits of the numbers in question are indeed random (either because they are the last digits of correctly calculated item means or F statistics, or because the errors—which we know, from the above analysis of the SDs, that some of them must represent—are also random).

> pbinom(4, size=164, prob=0.1)
[1] 0.0001754387
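To make it clearer what `pbinom()` is doing here, the same probability can be built up term by term from the binomial formula. This is only a cross-check of the call above, under the same assumption that each last digit independently has a 10% chance of being zero.

```r
# P(X <= 4) for X ~ Binomial(164, 0.1), summed term by term.
n <- 164
k <- 0:4
terms <- choose(n, k) * 0.1^k * 0.9^(n - k)
p_manual <- sum(terms)
p_manual

# For comparison: the number of zeroes we would expect on average.
expected_zeroes <- n * 0.1
expected_zeroes
```

So against an expectation of about 16 zeroes, the tables contain four.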

Alternatively, as suggested by Mosimann, Dahlberg, Davidian, and Krueger (2002), we can calculate the chi-square statistics for the counts of each of the 10 different final digits in these tables, to see how [im]probable the overall distribution of all of the final digits is:

> chisq.test(c(4,15,14,17,15,11,19,20,26,23))
X-squared = 21.244, df = 9, p-value = 0.01161
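The chi-square statistic can also be computed by hand, which shows how much each digit contributes to it. This sketch uses the same ten counts as the call above, with the expected count for each digit being one tenth of the total.

```r
# Observed counts of final digits 0 through 9, as in the chisq.test() call.
observed <- c(4, 15, 14, 17, 15, 11, 19, 20, 26, 23)
expected <- rep(sum(observed) / 10, 10)   # uniform expectation: 16.4 each

contributions <- (observed - expected)^2 / expected
x2 <- sum(contributions)
p  <- pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)
c(x2 = x2, p = p)

# Per-digit contributions; the first element corresponds to the digit zero.
round(contributions, 2)
```

The shortage of zeroes accounts for the largest single share of the statistic, with the excess of eights and nines making up most of the rest.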

Either way, to put this in terms of a statistical hypothesis test in the Fisherian tradition, we would seem to have good reasons to reject the null hypothesis that the last digits of these numbers resulted from random processes.


Ignoring the "minor" problems that we left behind a couple of thousand words ago, such as the unwarranted significance stars, the inconsistently-reported sample sizes, and the apparent recycling of substantial amounts of text from one article to another, we have the following:

1. Around half of the F statistics reported in these three articles cannot be correct, given the means that were reported. Either the means are wrong, or the F statistics are wrong, or both.

2. The attitudes towards soy products reported by the participants in the Cheong and Westgren studies are remarkably similar, despite the samples having been drawn from very different populations. This similarity also seems to apply to the items for which the results give impossible test statistics.

3. The distribution of the digits after the decimal point in the numbers representing the means and F statistics does not appear to be consistent with these numbers representing the values of measured random variables (or statistical operations performed on such variables); nor does it appear to be consistent with random transcription errors.

I am trying hard to think of possible explanations for all of this.

All of the relevant files from this article are available here, if the links given earlier don't work and/or your institution doesn't have a subscription to the relevant journal.