What are some key business metrics for (SaaS startup | Retail bank | e-commerce site)?

Thinking about key business metrics, often called KPIs (Key Performance Indicators), is an essential part of a data scientist’s job. Here are a few examples, but you should practice brainstorming your own. Tip: When in doubt, start with the easier question of “how does this business make money?” SaaS startup: Customer lifetime value, new … Read more

Explain bagging

Bagging, or Bootstrap Aggregating, is an ensemble method in which multiple bootstrap samples are first drawn from the dataset by resampling with replacement. Then, each sample is used to train a model, and the final predictions are made by voting over or averaging the predictions of the component models. Because the component models do not depend on one another, they can be trained in parallel. In the context of machine learning, bagging, short … Read more
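The resample, train, and vote steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the base learner is a toy 1-D threshold classifier (a "stump"), and the names `fit_stump`, `bagging_fit`, `bagging_predict`, and the tiny dataset are all hypothetical, not taken from any library.

```python
import random
from collections import Counter

def fit_stump(points):
    """Toy base learner: pick the threshold t (from the observed x values)
    that minimizes training errors for the rule 'predict 1 when x > t'."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample (same size, drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Aggregate the component models by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (feature, label) pairs, class 1 for larger x.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0.0), bagging_predict(models, 5.0))
```

Each stump in the loop is fit independently of the others, which is why the training step could be distributed across workers. For regression, the vote would be replaced by an average of the component predictions.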

Why are ensemble methods superior to individual models?

Depending on the method, ensembles can average out the biases of individual models, reduce variance, and are less likely to overfit. A common line in machine learning is: “ensemble and get 2%.” This implies that you can build your models as usual and typically expect a small performance boost from ensembling. Ensemble methods are often superior to individual models due to several … Read more
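The variance-reduction part of this claim can be illustrated with a small simulation. It is a sketch under a strong assumption: every "model" is an unbiased predictor whose errors are independent Gaussian noise. Real models trained on the same data have correlated errors, so the gain in practice is usually smaller. The helper names below are hypothetical.

```python
import random
import statistics

def noisy_model(truth, rng, noise_sd=1.0):
    """Stand-in for one model: the true value plus independent Gaussian error."""
    return truth + rng.gauss(0.0, noise_sd)

def ensemble_prediction(truth, rng, n_models):
    """Average the predictions of n_models independent noisy models."""
    return statistics.fmean([noisy_model(truth, rng) for _ in range(n_models)])

rng = random.Random(42)
truth = 10.0
n_trials = 5000

single = [noisy_model(truth, rng) for _ in range(n_trials)]
ensemble = [ensemble_prediction(truth, rng, n_models=10) for _ in range(n_trials)]

var_single = statistics.pvariance(single)
var_ensemble = statistics.pvariance(ensemble)
print(f"single-model variance: {var_single:.3f}")
print(f"10-model ensemble variance: {var_ensemble:.3f}")
```

Under the independence assumption, averaging 10 models should cut the variance to roughly one tenth of the single-model variance, while leaving any bias shared by all models untouched.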

What is the ROC Curve and what is AUC (a.k.a. AUROC)?

The ROC (receiver operating characteristic) curve is a performance plot for binary classifiers that shows True Positive Rate (y-axis) vs. False Positive Rate (x-axis) as the decision threshold varies. AUC is the area under the ROC curve, and it’s a common performance metric for evaluating binary classification models. It’s equivalent to the probability that a uniformly drawn random positive is ranked before … Read more
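A small sketch can make both views of AUC concrete: the area under the ROC points obtained by sweeping the threshold, and the fraction of (positive, negative) pairs in which the positive receives the higher score. The function names and the six-example dataset are hypothetical, chosen only for illustration.

```python
def roc_points(labels, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) points.
    Examples with tied scores are added together as one step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_pairwise(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

area = auc_trapezoid(roc_points(labels, scores))
pairwise = auc_pairwise(labels, scores)
print(area, pairwise)  # both equal 8/9 for this data
```

Here 8 of the 9 positive/negative pairs are ordered correctly (only the positive scored 0.6 falls below the negative scored 0.7), which matches the area computed from the curve.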

Explain Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or grouping documents by subject matter. LDA is a generative model that represents documents as mixtures of topics, each of which has its own probability distribution over possible words. The “Dirichlet” distribution is simply a distribution over distributions. In LDA, documents are distributions of … Read more
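The generative story can be sketched directly: draw a word distribution for each topic from a Dirichlet, draw a topic mixture for each document from another Dirichlet, then pick a topic and a word for every position in the document. This is a minimal illustration of the sampling process only, not of inference (fitting topics to real documents). The vocabulary, hyperparameter values, and function names are assumptions made for the example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_dists, vocab, alpha, n_words, rng):
    """Generate one document: a topic mixture (theta), then one topic
    and one word per position."""
    n_topics = len(topic_word_dists)
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(n_topics), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word_dists[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(7)
vocab = ["goal", "match", "team", "vote", "senate", "bill"]  # hypothetical
n_topics = 2
beta = 0.5   # assumed Dirichlet prior on each topic's word distribution
alpha = 0.5  # assumed Dirichlet prior on each document's topic mixture

# Each topic is itself a probability distribution over the vocabulary.
topic_word_dists = [sample_dirichlet([beta] * len(vocab), rng)
                    for _ in range(n_topics)]

theta, words = generate_document(topic_word_dists, vocab, alpha, 12, rng)
print([round(p, 2) for p in theta])
print(" ".join(words))
```

Each Dirichlet draw is a probability vector, which is the sense in which the Dirichlet acts as a distribution over distributions. Fitting LDA to a corpus runs this story in reverse, inferring the topic mixtures and word distributions from observed words.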