1. What is the use of a training set, a validation set, and a test set? What is the difference between a validation set and a test set?
In Machine Learning, there are three separate sets of data when training a model:
Training Set: this data set is used to adjust the weights of the ML model.
Validation Set: this data set is used to detect and limit overfitting, and more generally to compare candidate models. You do not adjust the weights of the model with this data set; you only verify that any increase in accuracy on the training data set also yields an increase in accuracy on data the model has not trained on. If the accuracy on the training data set increases while the accuracy on the validation data set stays the same or decreases, the model is overfitting and you should stop training.
Testing Set: this data set is used only for testing the final solution in order to confirm the actual predictive power of the model.
Difference between a validation set and a test set
The validation data set is a set of examples of the function you want to learn that you do not use directly to train the network. You train the network on the training data set. If you use a gradient-based algorithm, the error surface and the gradient at any point depend entirely on the training data set, so the training data set is what directly adjusts the weights. To make sure you don’t overfit the model, you feed the validation data set to the model and check whether the error stays within an acceptable range. Because the validation set is not used directly to adjust the weights, its error is a better indicator of generalization than the training error. It is still used indirectly, though, because you decide when to stop training and which model to keep based on it. The test set plays no part in training or model selection. Once a model has been selected based on the validation set, the test set is applied to it and the error on this set is measured. This error is representative of the error we can expect from entirely new data for the same problem.
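As a rough illustration of the three roles, the sketch below splits a data set into disjoint training, validation, and test subsets and adds a simple early-stopping check on the validation error. The function names, the 70/15/15 proportions, and the `patience` parameter are assumptions made for this example, not part of the original answer.

```python
import random


def train_val_test_split(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the samples and cut them into three disjoint subsets."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                # touched only once, at the very end
    val = shuffled[n_test:n_test + n_val]   # monitors overfitting, selects models
    train = shuffled[n_test + n_val:]       # the only part that adjusts weights
    return train, val, test


def should_stop_early(val_errors, patience=3):
    """Return True if the validation error has not improved in the last `patience` epochs."""
    if len(val_errors) <= patience:
        return False
    best_before = min(val_errors[:-patience])
    return min(val_errors[-patience:]) >= best_before


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))

# Hypothetical validation errors recorded after each epoch.
print(should_stop_early([0.50, 0.40, 0.41, 0.42, 0.43]))
```

The test subset is never consulted inside the training loop; only the validation errors drive the stopping decision.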
2. What is stratified cross-validation and where is it used?
Cross-validation is a family of model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used when the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In the stratified variant of this approach, the random samples are generated in such a way that the mean response value (i.e., the dependent variable in the regression) is approximately equal in the training and testing sets; for classification, this means each fold keeps roughly the same class proportions as the full data set. This is particularly useful if the responses are dichotomous with an unbalanced representation of the two response values in the data.
Stratified cross-validation can be used in the following scenarios:
A dataset with multiple categories. When the dataset is small and the categories are imbalanced, stratified cross-validation ensures that every fold contains examples of each category.
A dataset with data of different distributions. When random splitting cannot guarantee that each kind of data appears in both training and validation, stratifying on that grouping ensures that it does.
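A minimal sketch of stratified k-fold splitting, assuming a plain list of class labels: the indices of each class are shuffled and dealt out round-robin so every fold receives its share. The function name `stratified_k_fold` and the 9:1 toy data are made up for illustration; library implementations handle more edge cases.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs whose class mix mirrors the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Made-up, imbalanced labels: 90 negatives and 10 positives.
labels = ["neg"] * 90 + ["pos"] * 10
for train, test in stratified_k_fold(labels, k=5):
    print(Counter(labels[i] for i in test))
```

With plain random splitting, a fold of 20 samples could easily contain no positive example at all; here each test fold keeps the 9:1 ratio.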
3. Why are ensembles typically considered better than individual models?
Ensemble models have been used extensively in credit scoring applications and other areas because they are considered more stable and, more importantly, tend to predict better than single classifiers. Depending on the method, they can also reduce model bias, variance, or both.
Individual classifiers pursue different objectives to develop a (single) classification model. Statistical methods either estimate the posterior probability p(+|x) directly (e.g., logistic regression), or estimate class-conditional probabilities p(x|y), which they then convert into posterior probabilities using Bayes’ rule (e.g., discriminant analysis). Semi-parametric methods, such as neural networks (NN) or support vector machines (SVM), operate in a similar manner, but support different functional forms and require the modeler to select one specification a priori. The parameters of the resulting model are estimated using nonlinear optimization. Tree-based methods recursively partition a data set so as to separate good and bad loans through a sequence of tests (e.g., is loan amount > threshold). This produces a set of rules that facilitate assessing new loan applications.
Ensemble classifiers, in contrast, pool the predictions of multiple base models, so the errors of individual models can partly cancel out. Much empirical and theoretical evidence has shown that model combination increases predictive accuracy. Ensemble learners create base models in an independent manner (e.g., bagging) or a dependent manner (e.g., boosting).
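One way to see why pooling predictions can help is the idealized case below: if each base model is right with the same probability and the models err independently, the chance that a majority vote is right follows from the binomial distribution. The independence assumption is a simplification introduced for this sketch (real base models are correlated, so real gains are smaller), and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the label predicted by the most base models."""
    return Counter(predictions).most_common(1)[0][0]


def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of n independent models is right.

    Assumes every model is correct with probability p_correct and that
    the models' errors are independent (an idealization).
    """
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


print(majority_vote(["good", "bad", "good"]))
for n in (1, 5, 11, 21):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```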
4. What is regularization? Give some examples of regularization techniques.
Regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting.
Some techniques of regularization:
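L1 and L2 are the most common types of regularization. Both update the general cost function by adding another term, known as the regularization term: L1 adds a multiple of the sum of the absolute values of the weights, and L2 adds a multiple of the sum of their squares.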
Cost function = Loss (say, binary cross entropy) + Regularization term
Due to the addition of this regularization term, the values of the weight matrices decrease, because the penalty makes large weights costly. The underlying assumption is that a neural network with smaller weights is a simpler model. Therefore, regularization also reduces overfitting to quite an extent.
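The cost function above can be sketched as follows, using binary cross entropy as the loss and a flat list of weights. The strength parameter `lam` and the exact scaling of the penalty are assumptions for this example; some formulations divide the penalty by 2 or by the number of samples.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def regularized_cost(y_true, y_prob, weights, lam=0.01, kind="l2"):
    """Cost function = loss + regularization term."""
    if kind == "l1":
        penalty = lam * sum(abs(w) for w in weights)   # sum of |w|
    elif kind == "l2":
        penalty = lam * sum(w * w for w in weights)    # sum of w^2
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return binary_cross_entropy(y_true, y_prob) + penalty


y_true, y_prob = [1, 0], [0.9, 0.2]
weights = [3.0, -4.0]  # made-up weights
print(regularized_cost(y_true, y_prob, weights, lam=0.5, kind="l1"))
print(regularized_cost(y_true, y_prob, weights, lam=0.5, kind="l2"))
```

Because the penalty grows with the size of the weights, minimizing this cost pushes the optimizer towards smaller weights than minimizing the loss alone.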
5. What is the curse of dimensionality? How to deal with it?
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
Dimensionality reduction is an important technique for overcoming the curse of dimensionality in data science and machine learning. As the number of predictors (or dimensions, or features) in the dataset increases, analysis becomes computationally more expensive (i.e., more storage space and longer computation time), and the amount of data needed to produce accurate predictions in classification or regression models grows rapidly. Moreover, it is hard to visualize data points in more than three dimensions.
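One well-known phenomenon of this kind is distance concentration: in high dimensions, the nearest and farthest neighbors of a point end up at almost the same distance, which weakens distance-based methods. The sketch below measures this on uniformly random points in the unit cube; the function name, point count, and chosen dimensions are assumptions for illustration only.

```python
import math
import random


def distance_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over distances from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


# The contrast shrinks as the number of dimensions grows.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(200, n_dims), 3))
```

Reducing the number of dimensions before applying a distance-based model is one way to restore a meaningful contrast between near and far points.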
6. What is an imbalanced dataset? How to overcome its challenges?
Imbalanced datasets are a special case of classification problems in which the class distribution is not uniform among the classes. Typically, they are composed of two classes: the majority (negative) class and the minority (positive) class. Such datasets pose a challenge for data mining because standard classification algorithms usually assume a balanced training set, which biases them towards the majority class.
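Ways to overcome the challenges of imbalanced datasets:
1. Data Level approach: Resampling Techniques (e.g., oversampling the minority class or undersampling the majority class)
2. Algorithmic Ensemble Techniques (e.g., bagging-based or boosting-based methods)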
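The simplest resampling technique is random oversampling, sketched below: minority-class samples are duplicated at random until every class is as large as the biggest one. The function name and the 95:5 toy data are assumptions for this example; in practice, resampling would normally be applied to the training split only, so that duplicates do not leak into validation or test data.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Made-up data: 95 majority (0) and 5 minority (1) samples.
samples = list(range(100))
labels = [0] * 95 + [1] * 5
new_samples, new_labels = random_oversample(samples, labels)
print(Counter(new_labels))
```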
7. What is the difference between supervised, unsupervised, and reinforcement learning?
Here is the difference:
In a supervised learning model, the algorithm learns from a labeled dataset, which provides an answer key that the algorithm can use to evaluate its accuracy on the training data. An unsupervised model, in contrast, is given unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own.
Semi-supervised learning takes a middle ground: it uses a small amount of labeled data to bolster a larger set of unlabeled data. Reinforcement learning trains an algorithm with a reward system, providing feedback when an artificial intelligence agent performs the best action in a particular situation.
8. What are some factors determining the success and recent rise of deep learning?
Here are some of the success factors of deep learning:
1. Gnarly data: large volumes of complex, unstructured data such as images, audio, and text
2. Built-in feature engineering
3. Topology design process
4. Adoption of GPUs
5. Availability of purpose-built open source libraries
9. What is data augmentation? Provide some examples.
In data management, data augmentation adds value to base data by adding information derived from internal and external sources within an enterprise. Data is one of the core assets of an enterprise, which makes data management essential. In this sense, data augmentation can be applied to any form of data but may be especially useful for customer data, sales patterns, and product sales, where additional information can provide more in-depth insight. In machine learning, the term usually refers to enlarging a training set by creating modified copies of existing examples.
Computer vision is one of the fields where data augmentation can be used. We can make various modifications to the images:
- Add noise
- Modify colors
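The two modifications listed above can be sketched on a tiny grayscale image stored as nested lists of 0-255 pixel values. This representation, the function names, and the choice of brightness scaling as the simplest color change are assumptions for illustration; real pipelines typically operate on image arrays through an imaging library.

```python
import random


def clamp(value):
    """Round a pixel value and keep it inside the valid 0-255 range."""
    return min(255, max(0, round(value)))


def add_noise(image, sigma=10.0, seed=0):
    """Return a copy of the image with Gaussian noise added to every pixel."""
    rng = random.Random(seed)
    return [[clamp(px + rng.gauss(0, sigma)) for px in row] for row in image]


def adjust_brightness(image, factor=1.2):
    """Return a copy of the image with every pixel scaled by `factor`."""
    return [[clamp(px * factor) for px in row] for row in image]


# Made-up 2x3 grayscale image.
image = [[0, 128, 255], [64, 64, 64]]
augmented = [add_noise(image), adjust_brightness(image, 2.0)]
for variant in augmented:
    print(variant)
```

Each variant keeps the label of the original image, so the training set grows without any new labeling effort.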
10. What are convolutional neural networks? What are their applications?
A convolutional neural network is a class of deep neural network, most commonly applied to analyzing visual imagery. Its applications include:
1. Image recognition
2. Video analysis
3. Natural Language Processing (NLP)
4. Drug discovery
5. Health risk assessment
6. Checkers (game playing)
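The operation that gives these networks their name can be sketched in a few lines: a small kernel slides over the image, and each output value is the sum of element-wise products under the kernel. The version below assumes no padding and a stride of 1, and, like most deep learning libraries, does not flip the kernel (strictly speaking, a cross-correlation). The function name and the toy edge-detection kernel are made up for illustration; in a trained network the kernel values are learned.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 3x4 image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]
print(conv2d_valid(image, vertical_edge))  # [[0, 2, 0], [0, 2, 0]]
```

The output is nonzero only where the kernel straddles the dark-to-bright boundary, which is how a convolutional layer localizes features such as edges.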