You are given a cancer detection data set. Let’s suppose when you build a classification model you achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

You can do the following: Add more data Treat missing outlier values Feature Engineering Feature Selection Multiple Algorithms Algorithm Tuning Ensemble Method Cross-Validation While achieving a 96% accuracy rate on a cancer detection dataset might seem impressive at first glance, there are several reasons why one shouldn’t be entirely satisfied with this result: Class Imbalance: … Read more

Suppose you are given a data set which has missing values spread along 1 standard deviation from the median. What percentage of data would remain unaffected and Why?

Since the data is spread across the median, let’s assume it’s a normal distribution. As you know, in a normal distribution, ~68% of the data lies in 1 standard deviation from mean (or mode, median), which leaves ~32% of the data unaffected. Therefore, ~32% of the data would remain unaffected by missing values. If the … Read more

A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head?

There are two ways of choosing a coin. One is to pick a fair coin and the other is to pick the one with two heads. Probability of selecting fair coin = 999/1000 = 0.999 Probability of selecting unfair coin = 1/1000 = 0.001 Selecting 10 heads in a row = Selecting fair coin * … Read more

How do you map nicknames (Pete, Andy, Nick, Rob, etc) to real names?

This problem can be solved in n number of ways. Let’s assume that you’re given a data set containing 1000s of twitter interactions. You will begin by studying the relationship between two people by carefully analyzing the words used in the tweets. This kind of problem statement can be solved by implementing Text Mining using … Read more

How would you predict who will renew their subscription next month? What data would you need to solve this? What analysis would you do? Would you build predictive models? If so, which algorithms?

Let’s assume that we’re trying to predict renewal rate for Netflix subscription. So our problem statement is to predict which users will renew their subscription plan for the next month. Next, we must understand the data that is needed to solve this problem. In this case, we need to check the number of hours the … Read more