MuSigma Data Scientist Interview Questions

Q1. How to control leaf height and pruning in a Decision Tree?
Ans: To control the size of the tree and of its leaves, we can set the following parameters:

  1. Maximum depth:
    Maximum tree depth is a limit that stops further splitting of nodes once the specified depth has
    been reached while building the initial decision tree.
    Depth is a coarse control, because it cuts every branch at the same level regardless of how many
    observations the branch holds. Avoid relying on maximum depth alone to limit splitting: set it
    generously and let the size limits below do most of the work.
  2. Minimum split size:
    Minimum split size is a limit that stops further splitting of a node when the number of observations
    in the node is lower than the minimum split size.
    This is a good way to limit the growth of the tree. When a node contains too few observations, further
    splitting will result in overfitting (modeling of noise in the data).
  3. Minimum leaf size:
    Minimum leaf size is a limit that blocks a split when the number of observations in one of the
    resulting child nodes would be lower than the minimum leaf size.

Pruning is mostly done to reduce the chances of overfitting the tree to the training data and to reduce
the overall complexity of the tree.

There are two types of pruning: Pre-pruning and Post-pruning.

  1. Pre-pruning is also known as early stopping. As the name suggests, the stopping criteria
    are set as parameter values before building the model. The tree stops growing when it meets
    any of these pre-pruning criteria, or when every leaf is pure (contains a single class).

  2. In post-pruning, the idea is to allow the decision tree to grow fully and observe the CP
    (Complexity Parameter) value. Next, we prune the tree back using the optimal CP value as the
    parameter.
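
The pre-pruning limits can be illustrated with a minimal sketch. It is a toy tree on a single numeric feature, written for this answer and not taken from any library; the names `build`, `max_depth`, `min_split` and `min_leaf` are hypothetical, the data is made up, and `min_leaf` is assumed to be at least 1. Post-pruning is not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(points, min_leaf):
    """Best split of (x, label) pairs sorted by x, or None if no split is allowed."""
    n = len(points)
    best = None
    # i is the size of the left child; both children must hold >= min_leaf rows.
    for i in range(min_leaf, n - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot separate identical x values
        left = [label for _, label in points[:i]]
        right = [label for _, label in points[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[0]:
            best = (score, (points[i - 1][0] + points[i][0]) / 2, i)
    return best

def build(points, depth=0, max_depth=None, min_split=2, min_leaf=1):
    """Grow a tree on one numeric feature; `points` must be sorted by x."""
    labels = [label for _, label in points]
    leaf = {"label": Counter(labels).most_common(1)[0][0], "n": len(points)}
    if len(set(labels)) == 1:                         # pure node: nothing to gain
        return leaf
    if max_depth is not None and depth >= max_depth:  # maximum depth reached
        return leaf
    if len(points) < min_split:                       # below minimum split size
        return leaf
    split = best_threshold(points, min_leaf)          # enforces minimum leaf size
    if split is None:
        return leaf
    _, threshold, i = split
    limits = dict(max_depth=max_depth, min_split=min_split, min_leaf=min_leaf)
    return {"threshold": threshold,
            "left": build(points[:i], depth + 1, **limits),
            "right": build(points[i:], depth + 1, **limits)}

def leaves(node):
    """All leaf nodes of the tree, left to right."""
    if "label" in node:
        return [node]
    return leaves(node["left"]) + leaves(node["right"])

def height(node):
    """Number of splits on the longest root-to-leaf path."""
    if "label" in node:
        return 0
    return 1 + max(height(node["left"]), height(node["right"]))

# Made-up noisy labels on x = 1..10.
data = sorted(zip(range(1, 11), "aabababbab"))
full = build(data)                            # no limits: grows until leaves are pure
shallow = build(data, max_depth=2)            # limited by depth
coarse = build(data, min_split=4, min_leaf=2) # limited by node and leaf size

for name, tree in [("full", full), ("shallow", shallow), ("coarse", coarse)]:
    print(name, "height:", height(tree), "leaves:", len(leaves(tree)))
```

In scikit-learn, the corresponding `DecisionTreeClassifier` parameters are `max_depth`, `min_samples_split` and `min_samples_leaf`, and cost-complexity post-pruning is controlled by `ccp_alpha`, which plays a role similar to the CP value mentioned above.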

Q2. What is Variance and Bias tradeoff?
Ans: In predictive models, the reducible part of the prediction error is composed of two different errors:

  1. Bias
  2. Variance

It is important to understand the bias-variance trade-off, which describes how to balance bias against
variance in the prediction so that the model neither overfits nor underfits.

Bias: It is the difference between the expected (average) prediction of the model and the correct
value which we are trying to predict. Imagine building many models, each on a different data set
drawn from the same process; their predictions at a given point will differ. Bias measures how far
the average of these predictions is from the correct value. High bias leads to a high error on both
training and test data (underfitting).

Variance: It is the variability of the model's prediction for a given data point. If we build the model
multiple times on different data sets, the variance is how much the predictions for that point vary
between the different realizations of the model. High variance typically shows up as a low training
error but a high test error (overfitting).
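
The "build the model multiple times" idea can be simulated directly. The sketch below is only an illustration under stated assumptions: the true function `x ** 2`, the noise level, the sample size and the two toy models (`mean_model`, `nearest_neighbour_model`) are all made up for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

def true_f(x):
    """Assumed ground truth; in practice it is unknown."""
    return x ** 2

def sample_training_set(n=20, noise_sd=0.3):
    """Draw a fresh noisy training set from the same process."""
    xs = [random.random() for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, noise_sd) for x in xs]
    return xs, ys

def mean_model(xs, ys, x0):
    """Ignores x0 and predicts the training mean: rigid, so high bias."""
    return statistics.fmean(ys)

def nearest_neighbour_model(xs, ys, x0):
    """Copies the y of the closest training x: flexible, so high variance."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias_and_variance(model, x0, repeats=2000):
    """Refit the model on many training sets and summarize its predictions at x0."""
    predictions = []
    for _ in range(repeats):
        xs, ys = sample_training_set()
        predictions.append(model(xs, ys, x0))
    bias = statistics.fmean(predictions) - true_f(x0)  # average prediction vs truth
    variance = statistics.pvariance(predictions)       # spread across realizations
    return bias, variance

x0 = 0.9
mean_bias, mean_var = bias_and_variance(mean_model, x0)
nn_bias, nn_var = bias_and_variance(nearest_neighbour_model, x0)
print(f"mean model:        bias={mean_bias:+.3f}  variance={mean_var:.3f}")
print(f"nearest neighbour: bias={nn_bias:+.3f}  variance={nn_var:.3f}")
```

The rigid model should show the larger bias and the flexible model the larger variance, which is the trade-off in miniature.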

Q3. What is the Confusion Matrix?
Ans: A confusion matrix is a table that is often used to describe the performance of a classification model
(or "classifier") on a set of test data for which the true values are known. It allows the visualization
of the performance of an algorithm.

A confusion matrix is a summary of prediction results on a classification problem. The numbers of
correct and incorrect predictions are summarized as counts and broken down by each class.
This breakdown is the key to the confusion matrix: it gives us insight not only into how many errors a
classifier makes but, more importantly, into the types of errors that are being made.
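
A minimal sketch of how the counts are built, using only the standard library. The function name `confusion_matrix` and the example labels are made up for illustration; rows are the true classes and columns the predicted classes.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true labels[i] predicted as labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))  # missing pairs count as 0
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

# Hypothetical test-set labels and classifier outputs.
y_true = ["cat", "cat", "dog", "dog", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]

labels, matrix = confusion_matrix(y_true, y_pred)
# Correct predictions sit on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy)
```

The off-diagonal cells are the error types: here one cat was predicted as dog and one dog was predicted as cat.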

Q4. How do you treat heteroscedasticity in regression?
Ans: Heteroscedasticity means unequal scatter. In regression analysis, we generally talk about
heteroscedasticity in the context of the error term: it is a systematic change in the
spread of the residuals (errors) over the range of measured values. Heteroscedasticity is a problem
because ordinary least squares (OLS) regression assumes that all residuals are drawn from a
population that has a constant variance (homoscedasticity).

What causes heteroscedasticity?
Heteroscedasticity occurs more often in datasets that have a large range between the largest and
the smallest observed values. There are many reasons why heteroscedasticity can exist, and a generic
explanation is that the error variance changes proportionally with a factor.
We can categorize heteroscedasticity into two general types:
Pure heteroscedasticity: It refers to cases where we specify the correct model and yet observe
non-constant variance in the residual plots. Common treatments are to transform the dependent variable
(for example, take its logarithm), to use weighted least squares, or to report
heteroscedasticity-robust standard errors.
Impure heteroscedasticity: It refers to cases where we incorrectly specify the model, and that causes
the non-constant variance. When we leave an important variable out of a model, the omitted effect is
absorbed into the error term. If the effect of the omitted variable varies throughout the observed range of
data, it can produce the telltale signs of heteroscedasticity in the residual plots. The treatment here is
to fix the model specification, for example by adding the omitted variable.
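
A small simulation can make both the symptom and one common remedy concrete. This is a sketch under assumptions: the data is simulated so that the error standard deviation is proportional to `x`, the helper `fit_line` is hypothetical, and the remedy shown is weighted least squares with weights `1 / x**2`, which is appropriate only when the error variance really is proportional to `x**2`.

```python
import random
import statistics

random.seed(1)  # fixed seed so the simulation is repeatable

# Simulated data: true line y = 2 + 3x, error spread grows in proportion to x.
xs = [float(x) for x in range(1, 101)]
ys = [2.0 + 3.0 * x + random.gauss(0, 0.5 * x) for x in xs]

def fit_line(xs, ys, weights=None):
    """Weighted least squares for y = intercept + slope * x (plain OLS if weights is None)."""
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Symptom: OLS residuals are more spread out for large x than for small x.
ols_intercept, ols_slope = fit_line(xs, ys)
residuals = [y - (ols_intercept + ols_slope * x) for x, y in zip(xs, ys)]
spread_low = statistics.pstdev(residuals[:50])   # x from 1 to 50
spread_high = statistics.pstdev(residuals[50:])  # x from 51 to 100

# Remedy (assumed variance model): down-weight the noisy observations.
wls_intercept, wls_slope = fit_line(xs, ys, weights=[1.0 / x ** 2 for x in xs])

print(f"residual spread, low x: {spread_low:.1f}  high x: {spread_high:.1f}")
print(f"OLS slope: {ols_slope:.3f}  WLS slope: {wls_slope:.3f}")
```

Comparing the residual spread across the range of `x` is a crude numerical stand-in for looking at a residual plot; with real data the weights have to be justified by what is known about the error variance.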

Q5. What is Principal Component Analysis (PCA), and why do we use it?
Ans: The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set
consisting of many variables correlated with each other, either heavily or lightly, while retaining as
much of the variation present in the dataset as possible. This is done by transforming the
variables to a new set of variables, known as the principal components (or simply, the PCs), which
are orthogonal and ordered such that the variation retained from the original variables decreases
as we move down the order. In this way, the 1st principal component retains the maximum variation
that was present in the original variables. The principal components are the eigenvectors of the
covariance matrix, which is symmetric, and hence they are orthogonal.

The main steps are:

  1. Normalize the data
  2. Calculate the covariance matrix
  3. Calculate the eigenvalues and eigenvectors
  4. Choose the components and form a feature vector
  5. Form the principal components
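
The five steps can be sketched in plain Python for the special case of two variables, where the eigenvalues of the 2x2 covariance matrix have a closed form. The function `pca_2d` and the toy data are hypothetical and for illustration only; the sketch assumes neither column is constant. For real data with more variables, a library routine such as `numpy.linalg.eigh` would normally do step 3.

```python
import math

def standardize(column):
    """Step 1: centre to mean 0 and scale to sample standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def covariance(u, v):
    """Sample covariance of two columns that are already centred."""
    return sum(a * b for a, b in zip(u, v)) / (len(u) - 1)

def pca_2d(x, y):
    zx, zy = standardize(x), standardize(y)
    # Step 2: covariance matrix [[a, b], [b, c]].
    a, b, c = covariance(zx, zx), covariance(zx, zy), covariance(zy, zy)
    # Step 3: closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    mid, half_gap = (a + c) / 2, math.hypot((a - c) / 2, b)
    eigenvalues = [mid + half_gap, mid - half_gap]
    if abs(b) < 1e-12:
        # Uncorrelated columns: the axes themselves are the eigenvectors.
        components = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        components = []
        for lam in eigenvalues:
            vx, vy = b, lam - a  # solves (A - lam * I) v = 0
            norm = math.hypot(vx, vy)
            components.append((vx / norm, vy / norm))
    # Steps 4 and 5: use the ordered eigenvectors as the feature vector and project.
    scores = [[zx_i * vx + zy_i * vy for vx, vy in components]
              for zx_i, zy_i in zip(zx, zy)]
    return eigenvalues, components, scores

# Made-up, strongly correlated toy data.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

eigenvalues, components, scores = pca_2d(x, y)
explained = eigenvalues[0] / sum(eigenvalues)
print("eigenvalues:", eigenvalues)
print("first component:", components[0])
print("share of variance kept by PC1:", round(explained, 3))
```

Keeping only the first column of `scores` reduces the data from two dimensions to one while retaining the share of variance printed on the last line.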