30 Bananas a Day!

Its purpose was to re-articulate the limitations of her analysis, but also to inform.  Good science should prevail, after all. 
********

Hi Denise,


As promised, I'm posting my response to your email [yes, she emailed me] on your site.  You asked that I provide some tips on where to start and how to proceed.  BTW, you mentioned "epidemiology secrets" and I just want to say: no "secrets"!!  Epidemiology is just critical thinking, but with numbers.  It's no different from many other disciplines.  Maybe some time you can help me with writing (scientists are generally terrible writers, hehe).  


Note: I've included some comments on what went wrong and how it can be corrected merely for demonstrative purposes - not at all malicious attacks, OK?  This is how we all learn after all.  In caps, I will highlight steps in the action plan for you.


STEP 0: Do a literature search.  I find it helpful to keep an Excel spreadsheet with columns for author, title, journal, year, summary of paper, strengths of the study, weaknesses, and concluding remarks.  This is essential, as one shouldn't just blindly go into an analysis without having at least some background information on the subject matter.  No need to be an expert, but it's good to know what's already out there, and what needs to be done.  
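
If you'd rather script the log than keep it in Excel, here is a minimal sketch of the same spreadsheet as a CSV file.  The column names come straight from the list in STEP 0; the helper name `new_literature_log`, the file name in the comment, and the placeholder row are made up for illustration.

```python
import csv
import io

# Columns from STEP 0. Everything else here is a placeholder.
COLUMNS = ["author", "title", "journal", "year", "summary",
           "strengths", "weaknesses", "concluding_remarks"]


def new_literature_log(handle):
    """Write the header row and return a writer for adding papers."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    return writer


# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("literature_log.csv", "w", newline="") instead (hypothetical name).
buffer = io.StringIO()
log = new_literature_log(buffer)
log.writerow({
    "author": "PLACEHOLDER", "title": "PLACEHOLDER", "journal": "PLACEHOLDER",
    "year": 0, "summary": "...", "strengths": "...",
    "weaknesses": "...", "concluding_remarks": "...",
})
header_line = buffer.getvalue().splitlines()[0]
print(header_line)
```

One row per paper is usually enough; the point is just that strengths and weaknesses get written down at the time you read the paper.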


1. Correlations:

For this discussion, the outcome will be colorectal cancer, since you used it on your post.  Similarly, the primary exposure of interest will be total cholesterol.  By basing your conclusions on uncorrected correlations alone, you've made a huge leap that doesn't have much ground to stand on.  The simple correlations are biased, as you yourself pointed out when evaluating total cholesterol, schistosomiasis, and colorectal cancer.  As such, if you don't adjust for potential confounders via multiple regression, the association you observe is biased.  We almost always need to adjust for confounders, and this is very true in your case.


STEP 1: It's a good habit to evaluate the correlations between all exposures and also between all exposures and the outcome *at the individual level*.  So, for *every* analysis you plan on doing, create scatterplots for every X against every other X and every X against Y, using the *individual* data (where possible), and provide the correlation + 95% confidence interval for each.
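
Here is a rough sketch of the "correlation + 95% confidence interval" part of STEP 1.  It uses the Fisher z-transformation, which is one common way to get the interval (not the only one) and assumes roughly bivariate-normal data with more than 3 observations.  The function names and all the numbers are invented for illustration; they are not from the China data.

```python
import math
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_ci(x, y, level=0.95):
    """Return (r, lower, upper) using the Fisher z-transformation."""
    n = len(x)                       # needs n > 3
    r = pearson_r(x, y)
    z = math.atanh(r)                # Fisher z; undefined if |r| == 1
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return r, math.tanh(z - crit * se), math.tanh(z + crit * se)


# Made-up individual-level values, for illustration only.
cholesterol = [150, 162, 171, 180, 195, 204, 210, 223, 238, 250]
fat_intake = [40, 38, 52, 47, 60, 55, 71, 64, 80, 77]

r, lo, hi = correlation_ci(cholesterol, fat_intake)
print(f"r = {r:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

You would call `correlation_ci` once per pair of variables (every X against every other X, and every X against Y) and keep the results next to the matching scatterplot.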


STEP 2: Create histograms for every exposure that is categoric and density plots (or you can create histograms with very narrow bars) for every exposure that is continuous.  This will tell you how the variables are distributed and what the appropriate summary statistics for them would be.  For example, if total cholesterol is not normally distributed (i.e., doesn't follow a bell curve) then *median* total cholesterol might be a better summary statistic than *mean* total cholesterol (good to know when you present descriptive statistics of the data you're using).  Sometimes it's useful to present different stats for a single variable.
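
To see why the shape of the distribution matters for STEP 2, here is a small sketch with made-up, right-skewed values.  It prints a crude text histogram and then the mean, median, and quartiles side by side.  The names `text_histogram` and `describe` are just for this example, and a real plot from a stats package would of course be nicer.

```python
import math
from statistics import mean, median, quantiles

# Made-up, right-skewed values standing in for one continuous exposure.
values = [3.1, 3.4, 3.6, 3.8, 4.0, 4.1, 4.3, 4.6, 5.2, 12.9]


def text_histogram(xs, bin_width):
    """Return one line per non-empty bin, labelled by the bin's left edge."""
    counts = {}
    for x in xs:
        left = math.floor(x / bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return [f"{left:6.1f} | {'#' * counts[left]}" for left in sorted(counts)]


def describe(xs):
    """Summary statistics worth reporting together when data are skewed."""
    q1, _, q3 = quantiles(xs, n=4)
    return {"mean": mean(xs), "median": median(xs), "iqr": (q1, q3)}


for line in text_histogram(values, 1.0):
    print(line)
summary = describe(values)
print(summary)
```

The single large value pulls the mean well above the median here, which is the situation where reporting the median (with the interquartile range) is the safer choice.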


2. Individual data vs. aggregated data:

You stated you didn't see much curvature, but keep in mind that you were presenting with aggregated data (eg. average total cholesterol for all individuals) instead of including individual-level data (the exposure and outcome for a single individual).  Consequently, there was a big loss in information, and you can't make accurate decisions on how to model your data if you plot aggregated data.  Related to this, your analysis was ecologic (used aggregated/grouped data) but you made individual-level conclusions when you used the term "risk factor."  This is referred to as an ecologic fallacy - and it's just that.  A fallacy.  For example, all we can say based on your cholesterol-colorectal cancer example (the one that doesn't account for schistosomiasis) is that the counties with higher mean total cholesterol tend to have higher incidence rates of colorectal cancer.  We can't make the leap to calling cholesterol a *risk factor* for colorectal cancer.


STEP 3: Don't aggregate your data in your analysis.  Why?  You lose A LOT of information when you aggregate data and you can bias your results.  So keep that data at the individual level.  For descriptive tables, by all means, aggregated data is necessary for obvious reasons.  But in your analysis, individual-level data (when you've got it) is essential.  
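
A toy illustration of how much aggregation can hide.  The data below are invented and deliberately extreme: within every county the outcome goes *down* as the exposure goes up, yet the county means line up perfectly in the opposite direction.  The county labels and numbers are hypothetical.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up (exposure, outcome) pairs for individuals in three counties.
counties = {
    "A": [(1, 5), (2, 4), (3, 3)],
    "B": [(4, 8), (5, 7), (6, 6)],
    "C": [(7, 11), (8, 10), (9, 9)],
}

# Ecologic analysis: one data point per county (the county means).
mean_x = [mean(x for x, _ in rows) for rows in counties.values()]
mean_y = [mean(y for _, y in rows) for rows in counties.values()]
r_ecologic = pearson_r(mean_x, mean_y)

# Individual-level analysis within each county.
r_within = {
    name: pearson_r([x for x, _ in rows], [y for _, y in rows])
    for name, rows in counties.items()
}

print(f"county means:    r = {r_ecologic:+.2f}")   # +1.00
for name, r in r_within.items():
    print(f"within county {name}: r = {r:+.2f}")   # -1.00 each
```

Real data won't be this clean, but the sketch shows why a correlation between county averages says nothing reliable about what happens to individuals.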


3. The right regression model:

One of your outcomes was incidence rates of colorectal cancer.  When you do your analysis with individual-level data, with incidence rates of colorectal cancer as your outcome, linear regression = WRONG model.  Make sure you know which models to use and when.  To start - when modeling "raw" rates (case counts and person time), we almost always use Poisson regression, and often we need to account for overdispersion as well.  Get to know some of the other common regression models as well.    
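
To make the Poisson point concrete, here is a bare-bones sketch of a Poisson rate model with one exposure, fitted by Newton-Raphson with log person-time as the offset.  Everything here is an assumption for illustration: the function name `fit_poisson_rate`, the binary exposure, and the case counts and person-time are all made up.  It does not handle overdispersion or additional covariates, so for a real analysis you would use the Poisson regression routine in your stats package.

```python
import math


def fit_poisson_rate(cases, person_time, x, iterations=25):
    """Fit log(E[cases]) = b0 + b1*x + log(person_time) by Newton-Raphson."""
    # Start from the crude overall rate with no exposure effect.
    b0 = math.log(sum(cases) / sum(person_time))
    b1 = 0.0
    for _ in range(iterations):
        mu = [t * math.exp(b0 + b1 * xi) for t, xi in zip(person_time, x)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(c - m for c, m in zip(cases, mu))
        g1 = sum((c - m) * xi for c, m, xi in zip(cases, mu, x))
        # Information matrix (negative Hessian), 2 x 2.
        h00 = sum(mu)
        h01 = sum(m * xi for m, xi in zip(mu, x))
        h11 = sum(m * xi * xi for m, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up records: x = 0 (lower cholesterol group) or 1 (higher group).
cases = [5, 7, 14, 16]
person_time = [4000, 6000, 7000, 8000]   # person-years of follow-up
x = [0, 0, 1, 1]

b0, b1 = fit_poisson_rate(cases, person_time, x)
rate_ratio = math.exp(b1)
print(f"baseline rate = {math.exp(b0):.5f} per person-year")
print(f"rate ratio    = {rate_ratio:.3f}")
```

With a single binary exposure the fitted rate ratio is just the crude rate ratio between the two groups, which makes this a handy sanity check before adding confounders.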


STEP 4: Write out all of the primary exposures of interest you want to investigate and the corresponding outcome of interest and how you're setting up your outcome variable (are you interested in colorectal cancer *incidence rates*, *prevalence*, a simple yes/no the person has colorectal cancer?)


STEP 5: Write out what the appropriate regression model would be for the different analyses you plan to conduct.  


4. Confounders:

These are factors that are related to the exposure and the outcome of interest such that *not* adjusting for them will produce a biased association between exposure and outcome.  As you saw, schistosomiasis might be a confounder.  And in fact, county might be too - and is actually upstream of schistosomiasis in some sense, right?  Two confounders that almost *always* must be included in a model are AGE and SEX (provided your analysis isn't restricted to one sex).  This is especially true for chronic disease (eg. cardiovascular disease and cancer).  In this particular case, body mass index (BMI) would be very important to include as well.  County may also be important.  


STEP 6: For every analysis you do, write out all potential confounders you can think of and why.  You know the data better than I do as you've worked with it extensively.  And, from STEP 0, you'll know your context.  


STEP 7: Write out *how* the confounders are related to the exposure and outcome.  Is the confounder protective (i.e. decrease risk) for the outcome?  Or is it a risk factor?  How is it associated with the primary exposure of interest?  This is where those scatterplots in STEP 1 come in handy!  The purpose of this is to give you an idea of *how* an observed association might be biased if you *don't* adjust for certain confounders.  It is tedious, but thorough and, like STEP 6, will allow you to approach your analyses with more contextual background.


5.  "Cleaning" and "recoding" your data:

Raw data is not *in and of itself* a bad thing.  It is simply the data in its original form.  But in order to be useful for analysis we often need to "clean" it and "recode" it.  When I say "clean" it, I mean setting up a *dataset* that is free (to the greatest extent possible) of unnecessary data (for example, if you're interested in ovarian cancer, you wouldn't include men), or mistakes (for example, if an individual in the data was coded as being a man with ovarian cancer, this is clearly wrong).  In the latter case, you might either omit that record, since you don't have a way to check which value is correct, or, based on other data for that individual, choose to change "man" to "woman" or "ovarian cancer" to "no ovarian cancer."  "Recoding" means setting up the *variables* to be useful.  For example, we might recode BMI in categories of underweight, normal, overweight, and obese rather than leave it as continuous.  Some variables may already be categoric, if the corresponding data were collected that way.


STEP 8: Clean your data.  You will likely need to set up multiple datasets.  


STEP 9: Write out *how* you've cleaned your data.  (This is good record keeping.)
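
STEPS 8 and 9 can be done in one pass if the cleaning code writes its own log.  This is a minimal sketch built on the ovarian cancer example from above; the field names, the records, and the choice to drop (rather than correct) the impossible record are all assumptions made for illustration.

```python
# Made-up records; the field names are hypothetical.
records = [
    {"id": 1, "sex": "F", "ovarian_cancer": 1},
    {"id": 2, "sex": "M", "ovarian_cancer": 0},
    {"id": 3, "sex": "M", "ovarian_cancer": 1},   # impossible combination
    {"id": 4, "sex": "F", "ovarian_cancer": 0},
]


def clean_for_ovarian_analysis(rows):
    """Return (kept_rows, log) where log records why each row was dropped."""
    kept, log = [], []
    for row in rows:
        if row["sex"] == "M" and row["ovarian_cancer"] == 1:
            log.append((row["id"], "dropped: coded male with ovarian cancer"))
        elif row["sex"] == "M":
            log.append((row["id"], "dropped: men are outside the study population"))
        else:
            kept.append(row)
    return kept, log


cleaned, cleaning_log = clean_for_ovarian_analysis(records)
for record_id, reason in cleaning_log:
    print(record_id, reason)
```

Keeping the raw records untouched and producing `cleaned` as a separate dataset means anyone can rerun the cleaning and see exactly what was excluded and why.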


STEP 10: Recode your data.  This might include combining variables too.  


STEP 11: Create a "data dictionary" similar to the one on the Oxford site.  But in addition, include a description of how you've coded your data (eg. 1=underweight, 2=normal, 3=overweight, 4=obese).  Again, good for record keeping, but also "keeps you honest" so others know how you set up your data.  This will often be apparent when you present your results, but not always.  It's a good habit to keep track of this, in any event. 
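
A small sketch of STEPS 10 and 11 together: recoding continuous BMI into the four categories and recording the coding in a data dictionary entry.  The 1-4 coding is the one from the example above.  The cut-points (18.5, 25, 30) are the commonly used adult BMI cut-offs and are my assumption here, as are the variable names; use whatever cut-points suit your population and write them down.

```python
# Coding from the example in STEP 11.
BMI_CODES = {1: "underweight", 2: "normal", 3: "overweight", 4: "obese"}


def recode_bmi(bmi):
    """Map continuous BMI (kg/m^2) to a category code (assumed cut-points)."""
    if bmi < 18.5:
        return 1
    if bmi < 25:
        return 2
    if bmi < 30:
        return 3
    return 4


# One data dictionary entry per variable; "bmi" and "bmi_cat" are
# hypothetical variable names.
data_dictionary = {
    "bmi_cat": {
        "description": "BMI category, recoded from continuous BMI (kg/m^2)",
        "source_variable": "bmi",
        "coding": BMI_CODES,
        "cut_points": "<18.5, 18.5 to <25, 25 to <30, >=30",
    }
}

codes = [recode_bmi(b) for b in (17.0, 22.0, 27.5, 31.0)]
print([BMI_CODES[c] for c in codes])
```

Because the labels live in one place (`BMI_CODES`), the dictionary entry and the analysis can't quietly drift apart.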


STEP 12: Replot all newly *categorized* variables against the outcome(s) of interest.  Why?  Because the categorized data may reveal non-linear relationships with the outcome (in fact, this is a strength of categorizing data - that we can account for some non-linear relationships).  For example, underweight might be a risk for something, whereas normal BMI is protective, while overweight and obese are a risk ("U-shaped").  


6. Exploration of your data through descriptive statistics:

Almost all scientific papers start out with a "Table 1" which presents a description of the data.  It tells us things like: what's the % of women and men in our data?  What is the proportion of people with and without the exposure, and with and without the outcome?


STEP 13: Create descriptive tables of all relevant variables.  This includes your primary exposure of interest, confounders, and outcome.  Obviously, you will have different tables for each analysis as you're interested in different primary exposures (cholesterol? meat? total caloric intake?) and outcomes (cardiovascular disease? colorectal cancer? bladder cancer?).  To save time, you might include all relevant exposures and confounders in rows, and cross-classify them with all outcomes of interest in columns.  
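
Here is one way the cross-classification in STEP 13 might look in code: each variable's levels in rows, split by outcome in columns, with the percentage who have the outcome.  The individuals are made up, and `cross_classify` is a hypothetical helper name.

```python
from collections import Counter

# Made-up individuals: (sex, bmi_category, has_colorectal_cancer).
people = [
    ("F", "normal", 0), ("F", "overweight", 1), ("M", "normal", 0),
    ("M", "obese", 1), ("F", "normal", 0), ("M", "overweight", 0),
    ("F", "obese", 0), ("M", "normal", 1),
]


def cross_classify(rows, column_index):
    """Count each level of one variable, split by outcome (the last field)."""
    counts = Counter((row[column_index], row[-1]) for row in rows)
    levels = sorted({row[column_index] for row in rows})
    table = {}
    for level in levels:
        no, yes = counts[(level, 0)], counts[(level, 1)]
        table[level] = {
            "no_outcome": no,
            "outcome": yes,
            "pct_outcome": 100 * yes / (no + yes),
        }
    return table


table_sex = cross_classify(people, 0)
table_bmi = cross_classify(people, 1)
for level, cells in table_sex.items():
    print(level, cells)
```

Running the same helper over every exposure and confounder gives you the rows of a "Table 1" for one outcome; repeat per outcome for the extra columns.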


7.  Analysis: 

The fun part.  


STEP 14: Run your models.  Keep track of what you include in your models b/c oftentimes we will evaluate several models for each analysis depending on what's called "fit statistics."  Since you are familiar with p-values and I assume interpretation of beta coefficients, use these to help you decide which variables to include in your final model *within the context of the analysis at hand* (this is key - if you have reason to believe that a confounder is important to include, keep it in the model even if it's non-significant).  


STEP 15: Create tables for results from *all* analyses (including the models you decide to drop in favor of another one) and what regression model was used.  This is much more transparent than simply producing your final model.


There's more "post-analysis" stuff that should be done, but really Steps 1-15 are pretty thorough. 


8. Publish:

I can't stress this enough.  This is a long-term goal for sure, especially as you will likely end up with multiple papers!  But once you think you've got the data set-up and analyses down, you need to write it up and send it on for peer-review.  Peer-review is not perfect for sure, but it is the best measure we have for good science.  It gives credibility to your efforts.  Besides, you *do* want to be acknowledged for your efforts, right?  By publishing in a peer-reviewed journal, you're more likely to gain more widely publicized attention, which I think should be the goal of most epidemiological studies; we want to improve public health through informing not only our peers, but also the public.  


As a last note, I know this is a huge undertaking, but these are steps to a thorough analysis.  I have no doubt you're capable of tackling it.  


Best wishes.


PS. I'm sure you already planned to do this, but make all of the above available.  With your large readership you can make this a collaborative effort.


Replies to This Discussion

Wow, nice work veganmama! Very thorough. You are being extremely helpful to Denise. Sounds fair to me!
wow, I shudder to think how long it would take to do all this properly then have it peer reviewed. thanks so much for your reply and posting here :)
this is fantastic!
there is much we all can learn from your instructions to denise!
you are furthering your goal to do good science for everyone!!

in friendship,
prad
thank you all for your feedback. i genuinely hope she will give it a try.
REBUTTAL TO THE REBUTTALS

A number of people have pointed out that the criticisms of Denise's analysis apply to Campbell's as well, and since they seem to be at least somewhat familiar with statistics, I'll expand on my initial critique.

First and foremost, Denise did not take into account potential confounders. I think everyone understands at this point that confounders can bias the observed correlation towards or away from the null (i.e., correlation=0). While she took schistosomiasis into account by restricting her analysis to counties without schistosomiasis, it doesn’t tell us whether schistosomiasis really is a confounder – it simply removed the “effect” of schistosomiasis. Furthermore, her p-values only reflect the test of whether the correlation was significantly different from zero, not whether there was a statistically significant change in the exposure-outcome correlation after taking schistosomiasis into account.

Let me repeat that. The p-values Denise provides reflect whether correlation=0. They do not tell us whether or not schistosomiasis is a potential confounder. To help us determine this, we need to know if the correlation of +33 for all counties was statistically significantly different from the correlation of +13 for just the counties without schistosomiasis. This is where 95% confidence intervals would be helpful, but Denise doesn't provide these. Nor does she tell us what the correlation is only among counties with schistosomiasis. There are several ways to tease out whether we should include a factor in our final analysis, but here are two commonly used methods, using the schistosomiasis/cholesterol/colorectal cancer example:

Method 1:
1. Calculate correlation for entire sample
--> Denise calculated this to be +33.

2. Now stratify on the variable you think is a potential confounder, i.e., schistosomiasis, and calculate the correlation within each stratum.
--> Denise stratified on county but we'll let this slide b/c this was probably her only choice. For counties with no schistosomiasis, the correlation was +13. What about the correlation for counties with schistosomiasis? Denise does not provide this.

3. Compare the within-strata correlations (+13 and ??) to the correlation for the entire sample (+33), and test whether they are statistically significantly different from each other (not whether they are significantly different from 0). One should first perform a global test, and if the result is significant, proceed with pair-wise tests.
--> Denise did not do this.

4. If the correlations are significantly different from each other, then there is evidence that there may be confounding. If they are not significantly different from each other, there is evidence for no confounding.
--> Denise did not do this.

5. Bonus step: if the pair-wise tests between the stratum-specific correlations are significant, this is evidence that schistosomiasis is an *effect modifier*, not a confounder.
--> Denise did not do this.
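
For the pair-wise comparison in Method 1, one standard option is a z-test on Fisher-transformed correlations.  The sketch below compares the correlations from two *independent* strata (counties with vs. without schistosomiasis); it is not appropriate for comparing a stratum against the full sample, since those overlap.  All inputs are placeholders: the post does not give the correlation among counties with schistosomiasis or the number of counties in each stratum, and I am assuming "+13" means r = 0.13.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for equality of two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)       # Fisher z-transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # needs n1, n2 > 3
    z = (z1 - z2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Placeholder inputs: r = 0.13 assumed for counties without schistosomiasis;
# the second correlation and both county counts are invented.
z_stat, p_value = compare_correlations(0.13, 40, 0.45, 25)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```

A small p-value here would point towards the stratum-specific correlations genuinely differing, which is the effect-modification scenario in the bonus step.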

Method 2:
1. Run a full model that includes cholesterol and schistosomiasis as exposures (ideally, the model would include more than just this, but we'll keep it simple) and colorectal cancer as the outcome. Obtain the adjusted correlation, and make a note of the residual deviance or log likelihood for the model.

2. Run a reduced model that does not include the variable you think is a potential confounder, i.e., just include cholesterol as an exposure. Make a note of the residual deviance or log likelihood for this reduced model.

3. Now take the difference of the deviances (reduced minus full) or, equivalently, -2 times the difference in the log likelihoods (reduced minus full). This is your chi-square test statistic with k degrees of freedom, where k is the number of variables dropped from the full model (in our example, the degrees of freedom=1). Calculate the corresponding p-value. A significant/small p-value strongly suggests that we should stick with the full model (i.e., the one with cholesterol and schistosomiasis). A large/non-significant p-value suggests that the full model doesn't add much more information and therefore we would opt for the more parsimonious model. In other words, the reduced model (i.e., the one with cholesterol only) is probably sufficient.
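
The arithmetic in Method 2 is short enough to sketch.  This version only covers the 1 degree of freedom case from our example, using the fact that a chi-square with 1 df is a squared standard normal; for more than one dropped variable you would use the chi-square function in your stats package.  The log likelihoods plugged in at the bottom are placeholders, not results from any real model.

```python
import math


def likelihood_ratio_test_1df(loglik_full, loglik_reduced):
    """Return (chi-square statistic, p-value) for nested models, df = 1."""
    # -2 * (reduced - full); never negative for properly nested models.
    stat = max(-2 * (loglik_reduced - loglik_full), 0.0)
    # Upper tail of chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat/2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


# Placeholder log likelihoods for the full model (cholesterol +
# schistosomiasis) and the reduced model (cholesterol only).
stat, p = likelihood_ratio_test_1df(loglik_full=-120.4, loglik_reduced=-123.1)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

If your software reports residual deviance instead, the statistic is simply the reduced model's deviance minus the full model's deviance, and the p-value step is the same.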

I'm assuming Denise did none of this since there was no mention of it. To her credit, Denise does mention why she took a look at schistosomiasis, but didn't follow through with a complete analysis. Therefore, there isn't much ground for her to stand on.

When people criticize Campbell for not including schistosomiasis, it is very possible that, upon further inspection, it turned out not to be a confounder, contrary to what Denise concluded based on her results. A factor is a confounder if and only if it:
1. Is associated with the exposure (cholesterol)
2. Is a risk factor or protective factor for the outcome (colorectal cancer), and
3. Is not on the causal pathway between the exposure and outcome.

Perhaps criterion 1 was not met, and schistosomiasis was therefore not included in Campbell's final analysis. Only Campbell and colleagues know for sure what the detailed analyses were; a final presentation will always include only the most salient points.

As for many of Campbell's conclusions being drawn from purely ecologic data, I think this ignores the fact that while the China-Cornell-Oxford Project was a large component of the book "The China Study," the book's thesis is based on *hundreds* (in fact, nearly 1000) of additional references that corroborate the Project's findings.
You're ok if I blog this Veganmama? Can't let your brilliance go unnoticed.
absolutely! i'd be honored. ;-)
Veganmama--I have to say, this was excellent advice and an insightful analysis of Denise's work. Also, I know you were very diplomatic and polite, but really, with the number and magnitude of the flaws in Denise's work, it pretty much invalidates most of her conclusions. Even point #1 alone is enough to invalidate much of her analysis. Thanks, though, for keeping the tone respectful and positive, while still making very important points :)
thanks for the nice feedback, courtney!
Veganmama18--Just curious, if you don't mind me asking :), what's your scientific background?
hi courtney, my background is in biochemistry and cancer epidemiology. do you have an interest in either of these?
So cool :) That's awesome!

I did my undergrad degrees in biology (biochem/molecular/cellular) and in physics, but am now doing my PhD in physics.

But in recent years, I've been wondering if I should change course, go back to my biology roots, and do a postdoc in nutrition and/or epidemiology. I'm reading through Willett's nutritional epidemiology book at the moment and I have a friend who works in epidemiology at the CDC :) What sort of cancer epidemiology do you do?
