Jumping (into) the Turnstile (Data)

- September 29, 2019

This blog post marks the completion of my first week at a data science bootcamp. In our first week, we were assigned a group project: use NYC MTA turnstile data to approximate the best times of day and locations to place an engagement street team.

The approach to the project was very open-ended, which meant I made a lot of useful mistakes.

EDUCATIONAL MISTAKE #1 - I tried to transform data before really looking at it

When I first started the project, I downloaded the data and immediately started trying to do operations on it. I knew that cleaning data and exploring it beforehand are important - but surely the MTA cleaned their data and made it ready for easy public use before publishing it?

I spent four hours figuring out how to filter data, make subsets, aggregate, and convert datatypes before I ever ran a single .describe(). The second I did, I saw that column names had whitespace, counts were randomly negative every once in a while, and there were a ton of strange outlier data points.
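Here is a minimal sketch of the first look I should have taken, using pandas. The tiny DataFrame is made up for illustration (the real turnstile files have more columns and different values), and first_look is a hypothetical helper name, not anything from the MTA or from my project code.

```python
import pandas as pd

# Made-up miniature of a turnstile file; real column names and values may differ.
raw = pd.DataFrame({
    "STATION": ["59 ST", "59 ST", "59 ST"],
    "ENTRIES": [1000, 1250, 900],
    "EXITS   ": [500, 620, 700],  # note the trailing whitespace in the header
})

def first_look(df):
    """Return a copy with whitespace stripped from column names, plus summary stats."""
    cleaned = df.copy()
    cleaned.columns = cleaned.columns.str.strip()
    # describe() summarizes the numeric columns (count, mean, min, max, quartiles)
    return cleaned, cleaned.describe()

cleaned, summary = first_look(raw)
print(list(cleaned.columns))
print(summary)
```

Even on toy data, the min and max rows of the summary are where impossible values tend to jump out.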

Not only did I jump into working with unexplored data, I spent barely any time reading the documentation that accompanies the data and explains the different columns. Sure, it seemed a bit strange that billions of people were apparently walking through a single turnstile in a week, but New York has a lot of people, right? (Oh, those numbers are cumulative? Ohhhhhh..... oh.)

So, after an especially conscientious classmate came over to ask me how I was dealing with the cumulative data, I actually looked at descriptions of the data. And then I remembered the lecture we'd had about exploratory data analysis, and the notes I'd taken about the kinds of things you do before working with data. And I felt very embarrassed. But research has indicated that memories associated with pain and negativity can be more easily accessible than positive memories, so I'm going to assume this means great things for my future explorations.
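For the record, here is a rough sketch of one way to turn cumulative counter readings into per-interval counts with pandas. The column names, the sample readings, and the 10,000 cap are all assumptions chosen for illustration, and interval_counts is a hypothetical helper rather than what my group actually ran.

```python
import pandas as pd

# Made-up cumulative ENTRIES readings for two turnstiles.
readings = pd.DataFrame({
    "TURNSTILE": ["A", "A", "A", "A", "B", "B", "B"],
    "TIME": pd.to_datetime([
        "2019-09-21 00:00", "2019-09-21 04:00", "2019-09-21 08:00", "2019-09-21 12:00",
        "2019-09-21 00:00", "2019-09-21 04:00", "2019-09-21 08:00",
    ]),
    "ENTRIES": [1000, 1040, 1200, 15, 500, 530, 5_000_530],
})

MAX_PER_INTERVAL = 10_000  # assumed plausibility cap, not an official MTA figure

def interval_counts(df, cap=MAX_PER_INTERVAL):
    """Difference the cumulative counter per turnstile and keep plausible rows."""
    out = df.sort_values(["TURNSTILE", "TIME"]).copy()
    # diff() within each turnstile gives entries since the previous reading
    out["INTERVAL_ENTRIES"] = out.groupby("TURNSTILE")["ENTRIES"].diff()
    # Drop first readings (NaN), negative jumps, and implausibly large jumps
    plausible = out["INTERVAL_ENTRIES"].between(0, cap)
    return out[plausible]

counts = interval_counts(readings)
print(counts[["TURNSTILE", "TIME", "INTERVAL_ENTRIES"]])
```

Dropping the odd rows is the bluntest possible choice; whether to drop, cap, or repair them depends on what the documentation says about how the counters behave.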

EDUCATIONAL MISTAKE #2 - Stopping as soon as it works: aesthetics gtfo

The graph above, which I made, was the first graph featured in my group's eventual presentation. Too many stations, too many unnecessary colors, station names left in their column heading shorthand.... it's a mess.

Look, I am not unfamiliar with visual communication. I'd argue I have 8 years of experience as a science teacher trying to distill complex topics into the cleanest, clearest presentation possible. BUT - when I'm coding - I spend so much time wrestling with errors and typos and problems I don't understand.... and so when I finally get some decent graphical output, my reaction is "IT WORKED! I'm a genius!!!! I deserve a snack. What's next?" and it doesn't really occur to me that a few quick changes will make my hard work significantly more legible.

One week down, many more (great) mistakes to come!

Endless Forms Most Aggregated - A Data Science Blog