What do you understand by Type I vs. Type II error?

A Type I error is committed when the null hypothesis is true but we reject it; it is also known as a ‘false positive’. A Type II error is committed when the null hypothesis is false but we fail to reject it; it is also known as a ‘false negative’. In the context of a confusion matrix, a Type I error occurs when we classify a negative observation as positive (a false positive), and a Type II error occurs when we classify a positive observation as negative (a false negative).
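A minimal sketch (with assumed toy labels) showing how the two error types map onto a confusion matrix using scikit-learn:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # 1 = positive class, 0 = negative class (assumed data)
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]   # model predictions

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"Type I errors (false positives):  {fp}")   # null hypothesis true, but rejected
print(f"Type II errors (false negatives): {fn}")   # null hypothesis false, but accepted
```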

‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?

The basic idea behind this kind of recommendation engine comes from collaborative filtering. A collaborative filtering algorithm considers “user behavior” for recommending items: it exploits the behavior of other users and items in terms of transaction history, ratings, and selection and purchase information. Other users’ behavior and preferences over the items are used to recommend items to new users.
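A rough item-based collaborative filtering sketch, using a hypothetical co-purchase matrix (the users, items, and purchases below are made up for illustration): items similar to the one a user bought are ranked by how often other users bought them together.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 1 = purchased, 0 = not purchased (assumed toy data)
ratings = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 0, 0, 1]],
    index=["u1", "u2", "u3", "u4"],
    columns=["book", "kindle", "lamp", "charger"],
)

# Item-item similarity derived from the co-purchase behavior of all users
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)

# For a given item, recommend the most similar other items
item = "book"
print(item_sim[item].drop(item).sort_values(ascending=False).head(2))
```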

You are given a data set consisting of variables with more than 30% missing values. Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

We can deal with them in the following ways: (1) assign a unique category to the missing values, since the missing values might themselves reveal a trend; (2) remove those variables outright; or (3) sensibly check their distribution against the target variable and, if we find a pattern, keep the missing values and assign them a new category while removing the others.
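A short pandas sketch of these options, using a hypothetical column with heavy missingness (names and data are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income_band": ["low", np.nan, "high", np.nan, "mid"],   # >30% missing (toy data)
    "target":      [0, 1, 0, 1, 1],
})

# Option 1: treat "Missing" as its own category -- the missingness may carry signal
df["income_band_filled"] = df["income_band"].fillna("Missing")

# Option 2: drop the heavily-missing variable outright
df_dropped = df.drop(columns=["income_band"])

# Option 3: check whether missingness is related to the target before deciding
print(df.groupby(df["income_band"].isna())["target"].mean())
```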

What cross validation technique would you use on time series data set? Is it k-fold or LOOCV?

Neither. In time series problems, k-fold can be troublesome because there might be a pattern in year 4 or 5 that is not present in year 3. Resampling the data set will separate these trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward chaining strategy: each fold trains on all data up to a point in time and validates on the period that immediately follows, so the model is always evaluated on data that comes after its training window.
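A minimal sketch of forward chaining with scikit-learn’s `TimeSeriesSplit` on synthetic, time-ordered data: each fold trains on the past and tests on the future, so no future information leaks into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # 10 time-ordered observations (assumed data)
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {fold}: train {train_idx.tolist()} -> test {test_idx.tolist()}")
```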

We know that one hot encoding increases the dimensionality of a data set, but label encoding doesn’t. How?

Don’t get baffled by this question. It’s a simple question asking about the difference between the two. With one hot encoding, the dimensionality (i.e. the number of features) of a data set increases because it creates a new variable for each level present in the categorical variables. For example, say we have a variable ‘color’ with three levels: red, blue, and green. One hot encoding ‘color’ generates three new variables, one per level, whereas label encoding simply replaces the levels with integer codes (0, 1, 2), so the number of variables stays the same.
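A short sketch contrasting the two encodings on the assumed ‘color’ example: one hot encoding adds one column per level, label encoding keeps a single column of integer codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One hot encoding: one new column per level, original column dropped
one_hot = pd.get_dummies(df, columns=["color"])
print(one_hot.shape)               # (4, 3) -- dimensionality grows with the number of levels

# Label encoding: a single column of integer codes, dimensionality unchanged
df["color_label"] = LabelEncoder().fit_transform(df["color"])
print(df[["color_label"]].shape)   # (4, 1)
```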