How have you dealt with messy data in the past? (Two Sigma)

Up to 80% of a data analyst’s time can be spent cleaning data, which makes this a very important concept to understand. It matters even more when you consider that unclean data can produce inaccurate insights, which in turn can lead to costly company actions based on false information. Yikes. That could mean trouble for you.

You need to demonstrate not only that you understand the difference between messy data and clean data but also that you have used that knowledge to clean the data. This article shows the sort of workflow you might want to describe in your response, as well as some methods for identifying inconsistent data and cleaning it.

Just as with any other question where you’re asked to describe a situation you’ve encountered in the past, it’s a good time to employ the STAR method: situation, task, action, result.

A client of ours was unhappy with our staffing reports, so I needed to pore over one to see what was causing their chagrin. The spreadsheet recorded when our call center employees went on break, took lunch, and so on, and I noticed that the time stamps were inconsistent: some had a.m., some had p.m., and some didn’t specify morning or night at all. Worst of all, many of these employees were located in different time zones, so the times needed to be standardized across zones as well.

To solve the a.m./p.m. dilemma, I made sure all times were specified in military (24-hour) time. This had two benefits: first, it eliminated the strings in the data and made the whole column numeric; second, it removed any need to specify morning or night, as military time does this inherently. Next, I converted all times to UTC so that all of the data was in the same time zone. This was important for the report I was working on because otherwise the data would be presented out of order, which could confuse our client. Reorganizing the report’s data this way helped improve our relationship with the client, who, due to the time discrepancies, previously believed we were understaffed at specific times of day.
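
If the interviewer asks how you would do this in code rather than in a spreadsheet, a minimal Python sketch of the same two steps might look like the one below. Everything in it is illustrative rather than taken from the original report: the site names, the `SITE_OFFSETS` table, the `to_utc` helper, and the sample rows are all hypothetical. It also assumes that a time with no a.m./p.m. marker is already on a 24-hour clock, which in real data you would need to confirm with whoever produced it. Fixed UTC offsets are used to keep the example self-contained; they ignore daylight saving time, so real data would more likely call for named time zones.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical UTC offsets per call center site (fixed offsets ignore DST).
SITE_OFFSETS = {
    "new_york": timezone(timedelta(hours=-5)),
    "denver": timezone(timedelta(hours=-7)),
    "london": timezone.utc,
}


def to_utc(date_str, time_str, site):
    """Parse a local time stamp and return it as a timezone-aware UTC datetime."""
    cleaned = time_str.strip().lower().replace(".", "")
    if cleaned.endswith(("am", "pm")):
        # Normalize "9:30 a.m." or "9:30PM" to "9:30 AM" / "9:30 PM".
        cleaned = f"{cleaned[:-2].strip()} {cleaned[-2:].upper()}"
        fmt = "%Y-%m-%d %I:%M %p"
    else:
        # Assumption: unmarked times are already on a 24-hour clock.
        fmt = "%Y-%m-%d %H:%M"
    local = datetime.strptime(f"{date_str} {cleaned}", fmt)
    # Attach the site's offset, then convert to UTC.
    return local.replace(tzinfo=SITE_OFFSETS[site]).astimezone(timezone.utc)


# Hypothetical rows: (date, local time as entered, site).
rows = [
    ("2023-03-01", "9:30 a.m.", "new_york"),
    ("2023-03-01", "1:15 PM", "denver"),
    ("2023-03-01", "14:00", "london"),
]

# Once every value is in UTC, sorting puts the events in true chronological order.
utc_times = sorted(to_utc(*row) for row in rows)
for ts in utc_times:
    print(ts.strftime("%Y-%m-%d %H:%M"))
```

Keeping the date alongside the time matters here, because a late-evening break in one time zone can fall on the next calendar day in UTC.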