As promised, I'm posting my response to your email [yes, she emailed me] on your site. You asked that I provide some tips on where to start and how to proceed. BTW, you mentioned "epidemiology secrets" and I just want to say: no "secrets"!! Epidemiology is just critical thinking, but with numbers. It's no different from many other disciplines. Maybe sometime you can help me with writing (scientists are generally terrible writers, hehe).
Note: I've included some comments on what went wrong and how it can be corrected merely for demonstrative purposes - not at all malicious attacks, OK? This is how we all learn after all. In caps, I will highlight steps in the action plan for you.
STEP 0: Do a literature search. I find it helpful to keep an Excel spreadsheet with columns for author, title, journal, year, summary of paper, strengths of the study, weaknesses, and concluding remarks. This is essential, as one shouldn't just blindly go into an analysis without having at least some background information on the subject matter. No need to be an expert, but it's good to know what's already out there, and what needs to be done.
For this discussion, the outcome will be colorectal cancer, since you used it in your post. Similarly, the primary exposure of interest will be total cholesterol. By basing your conclusions on uncorrected correlations alone, you've made a huge leap that doesn't have much ground to stand on. The simple correlations are biased, as you yourself pointed out when evaluating total cholesterol, schistosomiasis, and colorectal cancer. As such, if you don't adjust for potential confounders via multiple regression, the association you observe is biased. We almost always need to adjust for confounders, and that is certainly true in your case.
STEP 1: It's a good habit to evaluate the correlations between all exposures, and between each exposure and the outcome, *at the individual level*. So, for *every* analysis you plan on doing, create scatterplots for every X against X and every X against Y, using the *individual* data (where possible), and provide the correlation + 95% confidence interval for each.
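If it helps, here's a minimal sketch of that correlation-plus-CI computation in plain Python (the Fisher z-transformation is a standard way to get an approximate 95% CI; the sample data are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_confint95(r, n):
    """Approximate 95% CI for r via the Fisher z-transformation."""
    fz = math.atanh(r)               # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(fz - 1.96 * se), math.tanh(fz + 1.96 * se)

# Made-up individual-level data for an exposure x and outcome y:
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)                  # 0.8 for these numbers
lo, hi = r_confint95(r, len(x))
```

With only five points the interval is (appropriately) very wide - which is exactly the kind of thing the CI is there to tell you.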
STEP 2: Create histograms for every exposure that is categorical, and density plots (or histograms with very narrow bars) for every exposure that is continuous. This will tell you how the variables are distributed and what the appropriate summary statistics for them would be. For example, if total cholesterol is not normally distributed (doesn't follow a bell curve), then *median* total cholesterol might be a better summary statistic than *mean* total cholesterol (good to know when you present descriptive statistics of the data you're using). Sometimes it's useful to present different stats for a single variable.
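To see why the choice matters, here's a toy illustration (made-up, right-skewed values) of how a few extreme readings pull the mean away from the median:

```python
import statistics

# Made-up, right-skewed "total cholesterol" values (mg/dL); the two
# very high readings drag the mean upward but barely move the median.
chol = [160, 165, 170, 172, 175, 178, 180, 185, 320, 400]

mean_chol = statistics.mean(chol)      # 210.5
median_chol = statistics.median(chol)  # 176.5
```

Here the median is clearly the better "typical value"; reporting both (plus, say, an interquartile range) is often the honest choice.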
2. Individual data vs. aggregated data:
You stated you didn't see much curvature, but keep in mind that you were working with aggregated data (e.g. average total cholesterol across all individuals in a county) instead of individual-level data (the exposure and outcome for a single individual). Consequently, there was a big loss of information, and you can't make accurate decisions on how to model your data if you only plot aggregated data. Related to this, your analysis was ecologic (it used aggregated/grouped data), but you made individual-level conclusions when you used the term "risk factor." This is referred to as the ecologic fallacy - and it's just that. A fallacy. For example, all we can say based on your cholesterol-colorectal cancer example (the one that doesn't account for schistosomiasis) is that counties with higher mean total cholesterol tend to have higher incidence rates of colorectal cancer. We can't make the leap to calling cholesterol a *risk factor* for colorectal cancer.
STEP 3: Don't aggregate your data in your analysis. Why? You lose A LOT of information when you aggregate data, and you can bias your results. So keep the data at the individual level. For descriptive tables, by all means, aggregated data is necessary for obvious reasons. But in your analysis, using individual-level data when you've got it is essential.
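Here's a tiny made-up example of how badly aggregation can mislead: within each "county" the individual-level association is negative, yet the correlation of the county *means* is perfectly positive.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# Three made-up counties; (exposure, outcome) per individual.
counties = [
    [(1, 5), (2, 3), (3, 1)],   # within-county trend: negative
    [(2, 6), (3, 4), (4, 2)],
    [(3, 7), (4, 5), (5, 3)],
]

# Individual-level correlation (pooled over everyone):
xs = [x for c in counties for x, _ in c]
ys = [y for c in counties for _, y in c]
r_individual = pearson_r(xs, ys)     # negative

# Ecologic correlation of the county *means*:
mx = [sum(x for x, _ in c) / len(c) for c in counties]
my = [sum(y for _, y in c) / len(c) for c in counties]
r_ecologic = pearson_r(mx, my)       # +1.0
```

The ecologic analysis would have you conclude the exposure is harmful, while the individual-level data say the opposite. The numbers are contrived, but the mechanism is exactly the ecologic fallacy described above.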
3. The right regression model:
One of your outcomes was incidence rates of colorectal cancer. When you do your analysis with individual-level data, with incidence rates of colorectal cancer as your outcome, linear regression = WRONG model. Make sure you know which models to use and when. To start - when modeling "raw" rates (case counts and person time), we almost always use Poisson regression, and often we need to account for overdispersion as well. Get to know some of the other common regression models as well.
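For intuition: a Poisson regression with a single binary exposure and a log(person-time) offset reproduces the crude incidence rate ratio, which you can compute by hand. A sketch (the numbers are made up; the Wald CI on the log scale is the standard large-sample approximation):

```python
import math

def rate_ratio(cases1, pt1, cases0, pt0):
    """Crude incidence rate ratio (exposed vs. unexposed) with a
    Wald 95% CI on the log scale, from case counts and person-time."""
    irr = (cases1 / pt1) / (cases0 / pt0)
    se = math.sqrt(1 / cases1 + 1 / cases0)   # SE of log(IRR)
    lo = math.exp(math.log(irr) - 1.96 * se)
    hi = math.exp(math.log(irr) + 1.96 * se)
    return irr, lo, hi

# Made-up numbers: 30 cases over 10,000 person-years exposed,
# 10 cases over 10,000 person-years unexposed.
irr, lo, hi = rate_ratio(30, 10_000, 10, 10_000)   # IRR = 3.0
```

Once you add covariates, you hand this over to a proper Poisson (or, with overdispersion, negative binomial) regression routine rather than computing it by hand.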
STEP 4: Write out all of the primary exposures of interest you want to investigate, the corresponding outcome of interest, and how you're setting up your outcome variable (are you interested in colorectal cancer *incidence rates*, *prevalence*, or a simple yes/no of whether the person has colorectal cancer?).
STEP 5: Write out what the appropriate regression model would be for the different analyses you plan to conduct.
4. Confounders:
These are factors that are related to both the exposure and the outcome of interest, such that *not* adjusting for them will produce a biased association between exposure and outcome. As you saw, schistosomiasis might be a confounder. And in fact, county might be too - and is actually upstream of schistosomiasis in some sense, right? Two confounders that almost *always* must be included in a model are AGE and SEX (provided your analysis isn't restricted to one sex). This is especially true for chronic diseases (e.g. cardiovascular disease and cancer). In this particular case, body mass index (BMI) would be very important to include as well. County may also be important.
STEP 6: For every analysis you do, write out all potential confounders you can think of and why. You know the data better than I do as you've worked with it extensively. And, from STEP 0, you'll know your context.
STEP 7: Write out *how* the confounders are related to the exposure and outcome. Is the confounder protective (i.e. decrease risk) for the outcome? Or is it a risk factor? How is it associated with the primary exposure of interest? This is where those scatterplots in STEP 1 come in handy! The purpose of this is to give you an idea of *how* an observed association might be biased if you *don't* adjust for certain confounders. It is tedious, but thorough and, like STEP 6, will allow you to approach your analyses with more contextual background.
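To make the "adjustment" idea concrete, here's a stratified (Mantel-Haenszel) rate ratio on made-up data: within each age stratum the rate ratio is exactly 2.0, but the crude (unstratified) estimate is inflated to 2.5 by confounding.

```python
def mh_rate_ratio(strata):
    """Mantel-Haenszel rate ratio across confounder strata.
    Each stratum is (exposed cases, exposed person-time,
                     unexposed cases, unexposed person-time)."""
    num = sum(a * pt0 / (pt1 + pt0) for a, pt1, b, pt0 in strata)
    den = sum(b * pt1 / (pt1 + pt0) for a, pt1, b, pt0 in strata)
    return num / den

# Made-up age strata; within each, the rate ratio is exactly 2.0,
# but the older stratum has both more exposure and more disease.
strata = [
    (10, 1000, 10, 2000),   # younger: 10/1000 vs 10/2000
    (40, 2000, 10, 1000),   # older:   40/2000 vs 10/1000
]
adjusted = mh_rate_ratio(strata)     # 2.0

# Crude rate ratio, ignoring age - confounded upward:
crude = (sum(a for a, *_ in strata) / sum(pt1 for _, pt1, *_ in strata)) / \
        (sum(b for *_, b, _ in strata) / sum(pt0 for *_, pt0 in strata))
```

Regression adjustment does the same job more flexibly, but stratification is the clearest way to *see* confounding at work.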
5. "Cleaning" and "recoding" your data:
Raw data is not *in and of itself* a bad thing. It is simply the data in its original form. But in order to be useful for analysis, we often need to "clean" it and "recode" it. When I say "clean" it, I mean setting up a *dataset* that is free (to the greatest extent possible) of unnecessary data (for example, if you're interested in ovarian cancer, you wouldn't include men) or mistakes (for example, if an individual in the data was coded as a man with ovarian cancer, this is clearly wrong). In that case, you might either omit the record, since you don't have a way to check which value is correct, or, based on other data for that individual, choose to change "man" to "woman" or "ovarian cancer" to "no ovarian cancer." "Recoding" means setting up the *variables* to be useful. For example, we might recode BMI into categories of underweight, normal, overweight, and obese rather than leave it as continuous. Some variables may already be categorical, if the corresponding data were collected that way.
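As a sketch of that recoding step, here's the BMI example as a small Python function (the cutpoints are the standard WHO ones):

```python
def bmi_category(bmi):
    """Recode continuous BMI (kg/m^2) into the standard WHO categories."""
    if bmi < 18.5:
        return "underweight"
    elif bmi < 25.0:
        return "normal"
    elif bmi < 30.0:
        return "overweight"
    else:
        return "obese"
```

Writing recodes as small, named functions like this (rather than ad hoc edits) makes STEP 9's record keeping nearly automatic.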
STEP 8: Clean your data. You will likely need to set up multiple datasets.
STEP 9: Write out *how* you've cleaned your data. (This is good record keeping.)
STEP 10: Recode your data. This might include combining variables too.
STEP 11: Create a "data dictionary" similar to the one on the Oxford site. But in addition, include a description of how you've coded your data (eg. 1=underweight, 2=normal, 3=overweight, 4=obese). Again, good for record keeping, but also "keeps you honest" so others know how you set up your data. This will often be apparent when you present your results, but not always. It's a good habit to keep track of this, in any event.
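One lightweight way to keep such a dictionary is right in your analysis code (variable names here are hypothetical):

```python
# A minimal data dictionary: one entry per variable, with its
# description, type, and (for categorical variables) the coding.
data_dictionary = {
    "bmi_cat": {
        "description": "Body mass index, recoded into WHO categories",
        "type": "categorical",
        "coding": {1: "underweight", 2: "normal", 3: "overweight", 4: "obese"},
    },
    "tot_chol": {
        "description": "Total serum cholesterol (mg/dL)",
        "type": "continuous",
        "coding": None,   # left as measured
    },
}
```

Because it lives with the code, it can't drift out of date the way a separate document can, and it's trivial to dump to a table for readers.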
STEP 12: Replot all newly *categorized* variables against the outcome(s) of interest. Why? Because the categorized data may reveal non-linear relationships with the outcome (in fact, this is a strength of categorizing data - that we can account for some non-linear relationships). For example, underweight might be a risk for something, whereas normal BMI is protective, while overweight and obese are a risk ("U-shaped").
6. Exploration of your data through descriptive statistics:
Almost all scientific papers start out with a "Table 1," which presents a description of the data. It tells us things like: What's the percentage of women and men in our data? What proportion of people have (and don't have) the exposure and the outcome?
STEP 13: Create descriptive tables of all relevant variables. This includes your primary exposure of interest, confounders, and outcome. Obviously, you will have different tables for each analysis as you're interested in different primary exposures (cholesterol? meat? total caloric intake?) and outcomes (cardiovascular disease? colorectal cancer? bladder cancer?). To save time, you might include all relevant exposures and confounders in rows, and cross-classify them with all outcomes of interest in columns.
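A toy cross-classification of made-up individual-level records, in the spirit of those tables:

```python
from collections import Counter

# Made-up records: (sex, exposed?, has outcome?)
records = [
    ("F", True, True), ("F", True, False), ("F", False, False),
    ("M", True, True), ("M", False, False), ("M", False, True),
]

# Cell counts for an exposure-by-outcome table:
cells = Counter((exposed, outcome) for _, exposed, outcome in records)

# A typical "Table 1" descriptive: percent female.
pct_female = 100 * sum(1 for sex, _, _ in records if sex == "F") / len(records)
```

In practice you'd do this with your stats package's table commands, but the logic is exactly this: count individuals in each cross-classified cell.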
The fun part.
STEP 14: Run your models. Keep track of what you include in your models b/c oftentimes we will evaluate several models for each analysis depending on what's called "fit statistics." Since you are familiar with p-values and I assume interpretation of beta coefficients, use these to help inform you of which variables to include in your final model *within the context of the analysis at hand* (this is key - if you have reason to believe that a confounder is important to include, keep it in the model even if it's non-significant).
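Two fit statistics you'll run into are AIC and the likelihood-ratio test for nested models. A sketch (the log-likelihood values are made up; for 1 degree of freedom the chi-square tail probability has a closed form):

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * loglik

def lrt_1df(loglik_reduced, loglik_full):
    """Likelihood-ratio test comparing nested models (1 df).
    Returns (test statistic, p-value), using the identity
    P(chi2_1 > x) = erfc(sqrt(x / 2))."""
    stat = 2 * (loglik_full - loglik_reduced)
    return stat, math.erfc(math.sqrt(stat / 2))

# Made-up log-likelihoods: does adding one confounder improve fit?
stat, p = lrt_1df(loglik_reduced=-520.0, loglik_full=-515.0)   # stat = 10.0
```

A small p here says the extra variable genuinely improves fit - but remember the caveat above: a known confounder stays in the model even when this test says it's "non-significant."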
STEP 15: Create tables of results from *all* analyses (including the models you decide to can in favor of another one), noting which regression model was used. This is much more transparent than simply presenting your final model.
There's more "post-analysis" stuff that should be done, but Steps 1-15 are a pretty thorough start.
I can't stress this enough: publish your findings. This is a long-term goal for sure, especially as you will likely end up with multiple papers! But once you think you've got the data set-up and analyses down, you need to write it up and send it on for peer review. Peer review is not perfect, for sure, but it is the best measure we have of good science. It gives credibility to your efforts. Besides, you *do* want to be acknowledged for your efforts, right? By publishing in a peer-reviewed journal, you're more likely to gain widely publicized attention, which I think should be the goal of most epidemiological studies; we want to improve public health by informing not only our peers, but also the public.
As a last note, I know this is a huge undertaking, but these are steps to a thorough analysis. I have no doubt you're capable of tackling it.
PS. I'm sure you already planned to do this, but make all of the above available. With your large readership you can make this a collaborative effort.