Oh Stats…. how I love thee. I love thee like seals love the great white shark, like the moon loves the sun, or like the air loves the water. We are intricately connected to each other, and yet if you weren’t in my life I feel like I would be fine – nay, happier even. So thanks for that.
My winter months are spent doing a lot of statistical analysis in between field seasons, which sounds like I am just doing the same thing all the time. The reality, however, is that I’m actually doing an incredibly diverse range of things – all of which are related to statistical analyses. I’ve had a few blog posts on this process already, so I’ll try not to repeat myself. There are new things I’ve learned this week though… (as always).
Fools Rush In…
And I am sometimes a fool.
It takes so long to clean data and get it ready for analysis that when it’s finally done, it’s so tempting to just run into a stats program and start clicking buttons and looking for that ever-sought-after p-value that will say whether or not the results are significant. I’m working in SPSS, which is a stats program that basically allows you to click a bunch of buttons telling it what test to run and what statistics to provide and then it gives an output with all the answers. Yeah! Great! Easy.
Unless I look at the output and don’t have a freaking clue what it means.
Just over a week ago, I took my cleaned and prepped remote camera data to run the regression analyses I’ve been planning. And I got this output that made no sense to me. I’m trying to see how the number of grizzly bear photos captured on the remote cameras (dependent variable) is related to a series of independent variables (time of day, season, and level of human use). The output looked like it was telling me that the grizzly bear photos were positively correlated with time of day, which in this case meant that there were more pictures of bears at night than during the day or at dawn/dusk. Ok fine. But when I looked at the number of grizzly bear photos in each time category, that was clearly not the case – there are hardly any pictures of grizzly bears at night.
So… what the heck?
In a meeting with my supervisor, who doesn’t use SPSS, we sat looking at the output and brainstorming why it seemed to be saying what it was. He was drawing equations on paper explaining regressions and that’s when I realized that I didn’t know what the hell I was doing at all. What’s a regression? What’s a bear? Who am I?
Back to the books
I left that meeting knowing I needed to brush up on a few things and I had to go back to basics. That is so often my strategy with this PhD. If things are getting too complex and I’m a little lost, then it’s time to take it back to basics. (Funnily enough, that attitude is expanding to other parts of my life, like my yoga practice, which has lately focused more on getting a more perfect down-dog than on doing some hard twisty arm balance. But I digress). I found myself sitting down with a stats textbook reading a chapter called: Introduction to Regression Analysis. It started at the very beginning, with equations and variable definitions. And it was great!
Then I literally spent hours watching about a dozen videos on SPSS about regression analysis outputs. Sounds super fun, right? I don’t know how there are so many people who have time in their lives to film videos about regression in SPSS and put them on YouTube, but I want to give them all a big fat hug right now. (Those of you who care, check out the how2stats channel on YouTube – great videos!) I watched every video I could about every kind of regression analysis.
And it worked. I ran my regression analysis again and I understood the output… mostly.
Talking through it
Off I went to a meeting with my supervisor yesterday feeling like the Queen of Regression Analysis. I literally felt like jumping up and down yelling “I got this! I understand! … mostly”.
I ran two different regressions with my data to see which would work better: a backwards stepwise linear regression and a binomial logit regression. Sounds pretty smart, eh? Oh yeah it was. Still, the second regression showed that weird positive relationship between time of day and number of grizzly photos. The stepwise linear regression worked better, but that method can sometimes overestimate how significant your results are. Sometimes it’s good to run a couple of different but similar tests to see if they produce similar outputs.
After some brainstorming, we realized that the second regression wasn’t considering how different the overall number of pictures taken in each time period is. There are over 75,000 pictures taken in the daytime (of people and bears), but only a few hundred taken at night. An analysis that isn’t considering this discrepancy is going to suggest that there’s a higher chance of capturing a bear at night because there are fewer pictures in the first place – the ratios are just way off. So the backwards stepwise is the better analysis because it considers sample size and provides a more accurate measure of significance.
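To make that ratio problem concrete, here is a minimal sketch in Python. The counts are invented for illustration and are not my camera data; only the rough scale (tens of thousands of daytime photos, a few hundred at night) comes from the numbers above. It shows how the period with the fewest bear photos can still end up with the highest share of bear photos.

```python
# Hypothetical photo counts per time period. These are NOT the study's data;
# only the rough scale (75,000 by day, a few hundred at night) follows the post.
photos = {
    "day":       {"total": 75000, "bear": 150},
    "dawn/dusk": {"total": 5000,  "bear": 40},
    "night":     {"total": 300,   "bear": 3},
}


def bear_share(counts):
    """Fraction of all photos in a period that show a bear."""
    return counts["bear"] / counts["total"]


for period, counts in photos.items():
    print(f"{period:>9}: {counts['bear']:>4} bear photos, "
          f"{bear_share(counts):.2%} of {counts['total']} photos")

# Raw counts and per-photo shares can point in opposite directions.
most_bears = max(photos, key=lambda p: photos[p]["bear"])
highest_share = max(photos, key=lambda p: bear_share(photos[p]))
print("Most bear photos:", most_bears)
print("Highest share of bear photos:", highest_share)
```

With these made-up numbers, daytime has by far the most bear photos, but night has the highest share, simply because so few photos of anything are taken at night. A model that works on the share alone will happily report that night is the best time to catch a bear.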
And all that for…
From the first “What the Heck” output to figuring this out yesterday took over a week. And now I can say that:
Season and Human Use Category significantly affected the probability of capturing a grizzly bear on camera on a hiking trail. More bears were captured in the spring and summer than fall; and more bears were captured on low and medium use trails than high use trails.
Or can I? Is that really all there is?
The analysis is getting there, but it’s not done done. There are still two very important variables that I haven’t included in my analysis – the camera’s distance to a road, and the habitat quality of the area where the camera is. So I can’t say anything for sure yet.
Research has shown that bears are impacted by roads, so I hypothesize that camera distance to the road will be inversely related to the probability of capturing a bear on camera (the closer the camera to the road, the less likely a bear will be there).
We also know that bears are driven by their bellies. So I hypothesize that cameras in high quality habitats will be more likely to capture bears than cameras in low quality habitats.
So how do I do that?
ArcGIS. Yeah! Meanwhile, in mapping land, I’ve also been taking a couple of online courses to learn ArcGIS and how to calculate the distance of the cameras to the roads and assign a habitat quality score to each camera. Once I do that in ArcGIS, I’ll be able to take those variables into the regression and run it again.
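For anyone curious what "distance of the camera to the road" boils down to, here is a rough sketch in Python of the geometry only. It is not how ArcGIS does it and not my actual workflow. It assumes flat map coordinates in metres, and the road, the camera names and their positions are all made up.

```python
import math


def point_to_segment(p, a, b):
    """Shortest distance from point p to the straight segment a-b (planar coordinates)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_road(camera, road):
    """Distance from a camera to the nearest segment of a road polyline."""
    return min(point_to_segment(camera, a, b) for a, b in zip(road, road[1:]))


# Hypothetical road (a polyline) and camera locations, in metres.
road = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1000.0)]
cameras = {"cam_A": (500.0, 300.0), "cam_B": (1400.0, 1300.0)}

distances = {name: distance_to_road(xy, road) for name, xy in cameras.items()}
for name, d in distances.items():
    print(f"{name}: {d:.0f} m from the road")
```

Each camera ends up with one number, its distance to the nearest stretch of road, which is the kind of variable that can then sit alongside season and human use in the regression.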
The regression will be able to tell me which of all these variables is the most important when it comes to predicting whether or not a grizzly bear will end up on my trail camera.
Each step is one step closer, but there are a lot of steps and many of them are happening simultaneously.
Multitasking to the Max
Last week, I was doing these things:
- reading a stats book and re-learning regression analyses;
- watching a million videos about SPSS on YouTube;
- taking an ArcGIS course;
- chasing down and cleaning the last of the data from summer 2014 to get ready for analysis;
- playing with data in Excel (making graphs to see what the data looks like) and in SPSS;
- some contract work, because I had nothing else going on; and
- a grant application, thrown in for good measure.
So even though I’m focused on one specific project, I am actually working on multiple aspects of that project all the time.
When I first started this PhD, I told my friends that I wasn’t going to have kids. I’m going to give birth to a thesis instead.
I truly believe that young mothers are the best multi-taskers in the history of the Universe. I watch my friends keeping up to their toddlers and juggling work and cooking dinner and cleaning the house and staying sane. They are the best multi-taskers ever.
And grad students like me.