30 Bananas a Day!

Its purpose was to re-articulate the limitations of her analysis, but also to inform.  Good science should prevail, after all. 

Hi Denise,

As promised, I'm posting my response to your email [yes, she emailed me] on your site.  You asked that I provide some tips on where to start and how to proceed.  BTW, you mentioned "epidemiology secrets" and I just want to say: no "secrets"!!  Epidemiology is just critical thinking, but with numbers.  It's no different from many other disciplines.  Maybe some time you can help me with writing (scientists are generally terrible writers, hehe).  

Note: I've included some comments on what went wrong and how it can be corrected merely for demonstrative purposes - not at all malicious attacks, OK?  This is how we all learn after all.  In caps, I will highlight steps in the action plan for you.

STEP 0: Do a literature search.  I find it helpful to keep an Excel spreadsheet with columns for author, title, journal, year, summary of paper, strengths of the study, weaknesses, and concluding remarks.  This is essential, as one shouldn't just blindly go into an analysis without having at least some background information on the subject matter.  No need to be an expert, but it's good to know what's already out there, and what needs to be done.  

1. Correlations:

For this discussion, the outcome will be colorectal cancer, since you used it in your post.  Similarly, the primary exposure of interest will be total cholesterol.  By basing your conclusions on uncorrected correlations alone, you've made a huge leap that doesn't have much ground to stand on.  The simple correlations are biased, as you yourself pointed out when evaluating total cholesterol, schistosomiasis, and colorectal cancer.  As such, if you don't adjust for potential confounders via multiple regression, the association you observe is biased.  We almost always need to adjust for confounders, and this is very true in your case.

STEP 1: It's a good habit to evaluate the correlations between all exposures and also between all exposures and the outcome <i>at the individual level</i>.  So, for *every* analysis you plan on doing, create scatterplots for every X against every other X and every X against Y, using the *individual* data (where possible), and provide the correlation + 95% confidence interval for each.
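One way to get the correlation + 95% confidence interval in STEP 1 is the Fisher z-transformation.  Here is a minimal sketch in Python, assuming a Pearson correlation is appropriate for the pair of variables.  The numbers and the names `cholesterol` and `bmi` are made up for illustration and are not taken from the actual dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_ci(x, y, z_crit=1.96):
    """Return (r, lower, upper) via the Fisher z-transformation (needs n > 3)."""
    n = len(x)
    r = pearson_r(x, y)
    z = math.atanh(r)               # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return r, math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical individual-level data: total cholesterol (mg/dL) and BMI.
cholesterol = [150, 162, 171, 180, 188, 195, 204, 210, 223, 240]
bmi = [19.5, 21.0, 20.2, 23.1, 22.4, 25.0, 24.1, 27.3, 26.0, 29.2]

r, lo, hi = correlation_ci(cholesterol, bmi)
print(f"r = {r:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

With only a handful of individuals the interval will be wide, which is itself useful to see before reading too much into a single correlation.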

STEP 2: Create histograms for every exposure that is categorical and density plots (or you can create histograms with very narrow bars) for every exposure that is continuous.  This will tell you how the variables are distributed and what the appropriate summary statistics for them would be.  For example, if total cholesterol is not normally distributed (i.e. does not follow a bell curve) then *median* total cholesterol might be a better summary statistic than *mean* total cholesterol (good to know when you present descriptive statistics of the data you're using).  Sometimes it's useful to present different stats for a single variable.
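To see why the choice between mean and median matters, here is a small sketch in Python using only the standard library.  The values are invented to be right-skewed (a few very high readings), and the text histogram is just a stand-in for a proper plot.

```python
import statistics

# Hypothetical right-skewed measurements (made-up numbers, arbitrary units).
values = [3.1, 3.4, 3.6, 3.8, 4.0, 4.1, 4.3, 4.6, 9.8, 14.5]

mean_value = statistics.mean(values)
median_value = statistics.median(values)
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles; q2 estimates the median

print(f"mean   = {mean_value:.2f}")
print(f"median = {median_value:.2f} (IQR {q1:.2f} to {q3:.2f})")

# Crude text histogram with bins of width 2, to eyeball the shape.
for lower in range(2, 16, 2):
    count = sum(lower <= v < lower + 2 for v in values)
    print(f"{lower:>2}-{lower + 2:<2} {'#' * count}")
```

The two high values pull the mean well above the median, so the median (with its interquartile range) describes the typical individual better here.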

2. Individual data vs. aggregated data:

You stated you didn't see much curvature, but keep in mind that you were presenting aggregated data (eg. average total cholesterol for all individuals) instead of individual-level data (the exposure and outcome for a single individual).  Consequently, there was a big loss of information, and you can't make accurate decisions on how to model your data if you plot aggregated data.  Related to this, your analysis was ecologic (used aggregated/grouped data) but you made individual-level conclusions when you used the term "risk factor."  This is referred to as an ecologic fallacy - and it's just that.  A fallacy.  For example, all we can say based on your cholesterol-colorectal cancer example (the one that doesn't account for schistosomiasis) is that the counties with higher mean total cholesterol tend to have higher incidence rates of colorectal cancer.  We can't make the leap to calling cholesterol a *risk factor* for colorectal cancer.

STEP 3: Don't aggregate your data in your analysis.  Why?  You lose A LOT of information when you aggregate data and you can bias your results.  So keep that data at the individual level.  For descriptive tables, by all means, aggregated data is necessary for obvious reasons.  But in your analysis, individual-level data is essential when you've got it.  
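The ecologic fallacy is easy to reproduce with a toy example.  The sketch below, in Python, uses three invented "counties" with deliberately extreme made-up scores: the county means line up perfectly, while within every county the individual-level relationship runs the opposite way.  Nothing here comes from real data.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical individual-level scores, grouped by county.
counties = {
    "A": {"exposure": [1, 2, 3], "outcome": [3, 2, 1]},
    "B": {"exposure": [4, 5, 6], "outcome": [6, 5, 4]},
    "C": {"exposure": [7, 8, 9], "outcome": [9, 8, 7]},
}

# Ecologic analysis: one point per county (the county means).
county_r = pearson_r(
    [mean(c["exposure"]) for c in counties.values()],
    [mean(c["outcome"]) for c in counties.values()],
)

# Individual-level analysis within each county.
within_r = {name: pearson_r(c["exposure"], c["outcome"]) for name, c in counties.items()}

print(f"correlation of county means: {county_r:+.2f}")
for name, r in within_r.items():
    print(f"within county {name}: {r:+.2f}")
```

Real data will never be this clean, but the point stands: a correlation between group averages says nothing reliable about what happens to individuals.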

3. The right regression model:

One of your outcomes was incidence rates of colorectal cancer.  When you do your analysis with individual-level data, with incidence rates of colorectal cancer as your outcome, linear regression = WRONG model.  Make sure you know which models to use and when.  To start - when modeling "raw" rates (case counts and person time), we almost always use Poisson regression, and often we need to account for overdispersion as well.  Get to know some of the other common regression models as well.    
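To make the Poisson idea concrete, here is a bare-bones sketch in Python of a rate model with one binary exposure, where person-time enters as an offset.  It is fitted by Newton-Raphson written out by hand so that it needs no extra libraries; in practice you would use a statistics package.  The function name `fit_poisson_rate`, the variable `high_exposure`, and all the counts are hypothetical, and the sketch does not handle overdispersion or additional confounders.

```python
import math

def fit_poisson_rate(cases, person_time, exposure, iterations=25):
    """Fit log(rate) = b0 + b1 * exposure by Newton-Raphson.

    Person-time is an offset: expected cases = person_time * exp(b0 + b1 * x).
    Returns (b0, b1, standard error of b1).
    """
    b0 = math.log(sum(cases) / sum(person_time))  # start at the overall log rate
    b1 = 0.0
    for _ in range(iterations):
        mu = [t * math.exp(b0 + b1 * x) for t, x in zip(person_time, exposure)]
        # Score vector (gradient of the log-likelihood).
        u0 = sum(y - m for y, m in zip(cases, mu))
        u1 = sum(x * (y - m) for x, y, m in zip(exposure, cases, mu))
        # Information matrix (negative Hessian), a symmetric 2x2.
        i00 = sum(mu)
        i01 = sum(x * m for x, m in zip(exposure, mu))
        i11 = sum(x * x * m for x, m in zip(exposure, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve information * step = score.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    se_b1 = math.sqrt(i00 / det)  # from the inverse information matrix
    return b0, b1, se_b1

# Hypothetical rows (each could be an individual or a small stratum).
cases = [12, 8, 30, 15]
person_time = [6000, 4000, 5000, 4000]   # person-years
high_exposure = [0, 0, 1, 1]             # 1 = exposed, 0 = unexposed

b0, b1, se_b1 = fit_poisson_rate(cases, person_time, high_exposure)
rate_ratio = math.exp(b1)
ci_low = math.exp(b1 - 1.96 * se_b1)
ci_high = math.exp(b1 + 1.96 * se_b1)
print(f"rate ratio = {rate_ratio:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
```

With a single binary exposure the fitted rate ratio matches the crude ratio of the two rates, which is a handy sanity check before adding confounders to the model.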

STEP 4: Write out all of the primary exposures of interest you want to investigate, the corresponding outcome of interest, and how you're setting up your outcome variable (are you interested in colorectal cancer *incidence rates*, *prevalence*, or a simple yes/no for whether the person has colorectal cancer?)

STEP 5: Write out what the appropriate regression model would be for the different analyses you plan to conduct.  

4. Confounders:

These are factors that are related to the exposure and the outcome of interest such that *not* adjusting for them will produce a biased association between exposure and outcome.  As you saw, schistosomiasis might be a confounder.  And in fact, county might be too - and is actually upstream of schistosomiasis in some sense, right?  Two confounders that almost *always* must be included in a model are AGE and SEX (provided your analysis isn't restricted to one sex).  This is especially true for chronic disease (eg. cardiovascular disease and cancer).  In this particular case, body mass index (BMI) would be very important to include as well.  County may also be important.  
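Here is a small illustration of how a confounder can manufacture an association.  It is a sketch in Python with invented counts, and it uses stratification with a Mantel-Haenszel summary risk ratio rather than multiple regression, simply because that is easier to show in a few lines.  In this made-up example age is the confounder: older people are both more likely to be exposed and more likely to get the disease.

```python
# Each stratum: (exposed cases, exposed total, unexposed cases, unexposed total).
# All counts are hypothetical.
strata = {
    "younger": (10, 100, 40, 400),
    "older": (120, 400, 30, 100),
}

def crude_risk_ratio(strata):
    """Risk ratio after collapsing over the strata (ignores the confounder)."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return (a / n1) / (c / n0)

def mantel_haenszel_risk_ratio(strata):
    """Summary risk ratio adjusted for the stratifying variable."""
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata.values():
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator

crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude risk ratio        = {crude:.2f}")
print(f"age-adjusted risk ratio = {adjusted:.2f}")
```

Within each age group the exposed and unexposed have the same risk, yet the crude comparison makes the exposure look harmful.  That gap between crude and adjusted is the bias from not adjusting.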

STEP 6: For every analysis you do, write out all potential confounders you can think of and why.  You know the data better than I do as you've worked with it extensively.  And, from STEP 0, you'll know your context.  

STEP 7: Write out *how* the confounders are related to the exposure and outcome.  Is the confounder protective (i.e. does it decrease risk) for the outcome?  Or is it a risk factor?  How is it associated with the primary exposure of interest?  This is where those scatterplots in STEP 1 come in handy!  The purpose of this is to give you an idea of *how* an observed association might be biased if you *don't* adjust for certain confounders.  It is tedious, but thorough and, like STEP 6, will allow you to approach your analyses with more contextual background.

5.  "Cleaning" and "recoding" your data:

Raw data is not *in and of itself* a bad thing.  It is simply the data in its original form.  But in order to be useful for analysis we often need to "clean" it and "recode" it.  When I say "clean" it, I mean setting up a *dataset* that is free (to the greatest extent possible) of unnecessary data (for example, if you're interested in ovarian cancer, you wouldn't include men), or mistakes (for example, if an individual in the data was coded as being a man with ovarian cancer, this is clearly wrong).  In this case, you might either omit the record since you don't have a way to check which value is correct or, based on other data for that individual, choose to change "man" to "woman" or "ovarian cancer" to "no ovarian cancer."  "Recoding" means setting up the *variables* to be useful.  For example, we might recode BMI into categories of underweight, normal, overweight, and obese rather than leave it as continuous.  Some variables may already be categorical, if the corresponding data were collected that way.

STEP 8: Clean your data.  You will likely need to set up multiple datasets.  

STEP 9: Write out *how* you've cleaned your data.  (This is good record keeping.)

STEP 10: Recode your data.  This might include combining variables too.  

STEP 11: Create a "data dictionary" similar to the one on the Oxford site.  But in addition, include a description of how you've coded your data (eg. 1=underweight, 2=normal, 3=overweight, 4=obese).  Again, good for record keeping, but also "keeps you honest" so others know how you set up your data.  This will often be apparent when you present your results, but not always.  It's a good habit to keep track of this, in any event. 
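STEPS 8-11 can be sketched in a few lines of Python.  Everything below is illustrative: the records, the field names (`sex`, `ovarian_cancer`, `bmi`, `bmi_cat`), and the decision to drop rather than correct the impossible record are all assumptions, and the BMI cut-offs are the commonly used 18.5 / 25 / 30 boundaries.

```python
# Data dictionary entry for the recoded variable (STEP 11).
BMI_CODES = {1: "underweight", 2: "normal", 3: "overweight", 4: "obese"}

def recode_bmi(bmi):
    """Recode continuous BMI into the category codes above (STEP 10)."""
    if bmi < 18.5:
        return 1
    if bmi < 25:
        return 2
    if bmi < 30:
        return 3
    return 4

def is_consistent(record):
    """Flag impossible combinations, e.g. a man coded with ovarian cancer."""
    return not (record["sex"] == "M" and record["ovarian_cancer"] == 1)

# Hypothetical raw records.
records = [
    {"id": 1, "sex": "F", "ovarian_cancer": 1, "bmi": 31.2},
    {"id": 2, "sex": "M", "ovarian_cancer": 1, "bmi": 24.0},  # impossible
    {"id": 3, "sex": "F", "ovarian_cancer": 0, "bmi": 17.9},
    {"id": 4, "sex": "F", "ovarian_cancer": 0, "bmi": 27.5},
]

# STEP 9: keep a written record of every cleaning decision.
cleaning_log = [
    f"dropped id {r['id']}: male coded with ovarian cancer"
    for r in records if not is_consistent(r)
]

# STEP 8 + STEP 10: cleaned dataset with the recoded variable added.
clean = [dict(r, bmi_cat=recode_bmi(r["bmi"])) for r in records if is_consistent(r)]

for r in clean:
    print(r["id"], r["bmi"], BMI_CODES[r["bmi_cat"]])
print(cleaning_log)
```

Keeping the original `bmi` next to `bmi_cat` means the recoding can always be checked or redone with different cut-offs.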

STEP 12: Replot all newly *categorized* variables against the outcome(s) of interest.  Why?  Because the categorized data may reveal non-linear relationships with the outcome (in fact, this is a strength of categorizing data - that we can account for some non-linear relationships).  For example, underweight might be a risk for something, whereas normal BMI is protective, while overweight and obese are a risk ("U-shaped").  
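As a rough numerical stand-in for the plot in STEP 12, the Python sketch below tabulates the proportion with the outcome in each BMI category.  The counts are invented to produce a "U-shaped" pattern; the category codes follow the 1-4 example coding from STEP 11.

```python
from collections import defaultdict

BMI_LABELS = {1: "underweight", 2: "normal", 3: "overweight", 4: "obese"}

# Hypothetical individuals as (bmi_category, has_outcome) pairs, built from
# made-up (cases, total) counts purely to keep the example short.
made_up_counts = {1: (6, 40), 2: (10, 200), 3: (12, 120), 4: (9, 60)}
individuals = []
for category, (cases, total) in made_up_counts.items():
    individuals += [(category, 1)] * cases + [(category, 0)] * (total - cases)

def risk_by_category(pairs):
    """Proportion with the outcome within each category."""
    cases = defaultdict(int)
    totals = defaultdict(int)
    for category, outcome in pairs:
        cases[category] += outcome
        totals[category] += 1
    return {c: cases[c] / totals[c] for c in sorted(totals)}

risks = risk_by_category(individuals)
reference = risks[2]  # normal BMI as the reference group
for category, risk in risks.items():
    ratio = risk / reference
    print(f"{BMI_LABELS[category]:<12} risk = {risk:.2f}  ratio vs normal = {ratio:.1f}")
```

A single straight-line term for BMI would miss this pattern entirely, since risk is elevated at both ends.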

6. Exploration of your data through descriptive statistics:

Almost all scientific papers start out with a "Table 1" which presents a description of the data.  It tells us things like: What's the % of women and men in our data?  What is the proportion of people with and without the exposure, and with and without the outcome?

STEP 13: Create descriptive tables of all relevant variables.  This includes your primary exposure of interest, confounders, and outcome.  Obviously, you will have different tables for each analysis as you're interested in different primary exposures (cholesterol? meat? total caloric intake?) and outcomes (cardiovascular disease? colorectal cancer? bladder cancer?).  To save time, you might include all relevant exposures and confounders in rows, and cross-classify them with all outcomes of interest in columns.  
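A "Table 1" is mostly counts and percentages.  Here is a minimal Python sketch of one cross-classification (sex in rows, outcome in columns, with column percentages).  The eight individuals and the field names `sex` and `colorectal_cancer` are hypothetical.

```python
from collections import Counter

# Hypothetical individual-level records.
people = [
    {"sex": "F", "colorectal_cancer": 0},
    {"sex": "F", "colorectal_cancer": 0},
    {"sex": "F", "colorectal_cancer": 1},
    {"sex": "M", "colorectal_cancer": 0},
    {"sex": "M", "colorectal_cancer": 1},
    {"sex": "M", "colorectal_cancer": 1},
    {"sex": "F", "colorectal_cancer": 0},
    {"sex": "M", "colorectal_cancer": 0},
]

def cross_tab(rows, row_var, col_var):
    """Map (row value, column value) -> (count, percent of that column)."""
    counts = Counter((r[row_var], r[col_var]) for r in rows)
    col_totals = Counter(r[col_var] for r in rows)
    return {key: (n, 100.0 * n / col_totals[key[1]]) for key, n in counts.items()}

table = cross_tab(people, "sex", "colorectal_cancer")
for (sex, outcome), (n, pct) in sorted(table.items()):
    print(f"sex={sex} outcome={outcome}: n={n} ({pct:.1f}% of column)")
```

The same function can be reused for each exposure and confounder in the rows, which is the time-saving layout described in STEP 13.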

7.  Analysis: 

The fun part.  

STEP 14: Run your models.  Keep track of what you include in your models because oftentimes we will evaluate several models for each analysis depending on what's called "fit statistics."  Since you are familiar with p-values and, I assume, interpretation of beta coefficients, use these to help inform you of which variables to include in your final model *within the context of the analysis at hand* (this is key - if you have reason to believe that a confounder is important to include, keep it in the model even if it's non-significant).  

STEP 15: Create tables for results from *all* analyses (including the models you decide to drop in favor of another one) and note what regression model was used.  This is much more transparent than simply presenting your final model.

There's more "post-analysis" stuff that should be done, but really Steps 1-15 are pretty thorough. 

8. Publish:

I can't stress this enough.  This is a long-term goal for sure, especially as you will likely end up with multiple papers!  But once you think you've got the data set-up and analyses down, you need to write it up and send it on for peer-review.  Peer-review is not perfect for sure, but it is the best measure we have for good science.  It gives credibility to your efforts.  Besides, you *do* want to be acknowledged for your efforts, right?  By publishing in a peer-reviewed journal, you're more likely to gain more widely publicized attention, which I think should be the goal of most epidemiological studies; we want to improve public health through informing not only our peers, but also the public.  

As a last note, I know this is a huge undertaking, but these are steps to a thorough analysis.  I have no doubt you're capable of tackling it.  

Best wishes.

PS. I'm sure you already planned to do this, but make all of the above available.  With your large readership you can make this a collaborative effort.


Replies to This Discussion

how cool!! i do a lot of work with human tissue repositories and linking them with clinical (eg. treatment) and risk factor (eg. oral contraceptive use) data. right now i'm working on ovarian and endometrial cancers. we use the tissues for immunohistochemistry and genotyping and evaluate whether these can better assist us in diagnosis and determining prognosis.

walter willett is an authority on nutritional cancer!! there was a guy i used to work with who you might find interesting. he has a background in physics as well and then became very interested in cervical cancer etiology and detection:


you could bring a lot to the field with your unique background!
(ok so I can't figure out why i can't post this post in the right place, so i'll just post it here for now :))

Wow, that's REALLY important work that you are doing, veganmama18!! I'm so glad to hear that this is being investigated. Very cool :)

I did some basic immunohistochemistry when I was studying a novel calcium-binding brain protein at NIH, but that was several years ago :)

I'll definitely check out Philip Castle--thanks for the lead!

I teeter back and forth pretty much every day about whether to stay in physics (theoretical cosmology) or to change fields within science and do something more applied and more relevant to health.

How did you decide on studying cancer epidemiology? I'd love to hear what your long-term career goals are.

I've been vegan for over a decade--done it both the wrong ways and right way--and raw on and off for 5 years, and am doing LFRV (not quite down to 10% fat or 100% raw consistently yet, but am on my way, as high fruit + high greens makes me feel best).


courtney states:
(ok so I can't figure out why i can't post this post in the right place, so i'll just post it here for now :))

the reason you can't is because of the ning forum structure which is more suited as a dating service than for serious discussion. as a result, you have only so many replies possible by which time i'm sure the designers figured either a hook-up would have been made or the parties would go their separate ways.

in order to continue discussions which have 'run off the edge', what i usually do is start a new reply to the original post and provide the link to the item i am replying to. you can get this actual link by right-clicking on the 'infinity' symbol to the left of the words "Reply by" at the top of every post.

this way, you enable a new series of replies ... at least for a while.

in friendship,

Ah, it did show up in the right place eventually...great! Thanks so much, prad, for explaining the flaw with ning (lol! your explanation was too funny :))

how funny!! we should email about this! the fellowship is open to any PhD, so it's definitely something to consider! my former mentor went through the program and i'm happy to get you two connected if you'd like. :-)


