What does the term ‘Generalization’ signify in machine learning?

When you feed data to a machine learning model, it learns the underlying patterns that describe the relationships among the data points of the given dataset. Some of these are general patterns, while others are inherent to the data points of the training dataset.

General patterns are those which would still be present when new data is fed to the model, while patterns inherent to the data points of the training dataset are classified as noise. Thus the generalization of a model is characterized by its capability to identify the general patterns and ignore the noise.

What is ‘Goodness of Fit’?

The goodness of fit of a machine learning model describes how well the model fits the data, i.e. how accurately it is able to identify the patterns in the data. In other words, the better the fit, the closer the curve described by your model is to the data points, and hence the smaller the error rate and the residuals.

There are many methods to check the goodness of fit of a model; the most common ones are MSE, RMSE, MAE, the R-squared statistic and the adjusted R-squared statistic.
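
As a quick illustration, here is how these metrics could be computed with scikit-learn; the y_true and y_pred arrays below are made-up example values, and the predictor count p is an assumption for the adjusted R-squared.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true values and model predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.3, 7.0, 10.4])

mse = mean_squared_error(y_true, y_pred)    # mean squared error
rmse = np.sqrt(mse)                         # root mean squared error
mae = mean_absolute_error(y_true, y_pred)   # mean absolute error
r2 = r2_score(y_true, y_pred)               # R-squared statistic

# Adjusted R-squared penalizes R-squared for the number of predictors p
n, p = len(y_true), 2                       # sample size, assumed predictor count
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(mse, rmse, mae, r2, adj_r2)
```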

Describe polynomial regression in a few words

In polynomial regression, the regression equation has independent variables raised to powers of 2 or more, i.e. the relationship is not linear. It is used when the dependent and independent variables share a relationship that can't be adequately represented by a straight line, i.e. a linear model would suffer from high bias.
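
A minimal sketch of polynomial regression with scikit-learn, using synthetic data invented purely for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.2, size=100)

# Degree-2 features turn the linear model into a polynomial one
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))
```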

Why do we do a train test split?

We do a train test split to ensure that the model's predictive performance is due to actual learning and not overfitting. We train the model on the training set and then make predictions on the test set using the equation learned from the training set. If the prediction accuracy is similar for both the training and test sets, then the model is working properly and has not overfitted.
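
A minimal sketch of such a split with scikit-learn, on synthetic data made up for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features and a linear target with a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Comparable scores on both sets suggest the model has not overfitted
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
```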

Explain Lasso Regression.

In lasso regression the loss function, in addition to RSS (or another error function), has lambda (a constant) times the summation of the absolute values of the coefficients of the independent variables. By doing this we make the loss proportional to the magnitude of the coefficients, and using gradient descent we get new values of the coefficients which are lower than their previous values, thus reducing overfitting.

It is generally used when the multicollinearity is relatively high. Unlike ridge regression, lasso can reduce the value of some coefficients exactly to zero, thus performing feature selection and treating multicollinearity. It is especially useful in descriptive models. The value of lambda is a hyperparameter and is found during cross validation, i.e. hyperparameter tuning has to be performed to arrive at the optimal value.
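
As a sketch, scikit-learn's LassoCV can tune lambda (called alpha there) via cross validation; the data below is synthetic, with only the first two features informative:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: 10 features, but only 2 actually drive the target
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# LassoCV picks the penalty strength via cross validation
model = LassoCV(cv=5).fit(X, y)
print("chosen alpha:", model.alpha_)
print("coefficients:", model.coef_)  # most coefficients shrink exactly to zero
```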

Explain Ridge Regression.

In ridge regression the loss function, in addition to RSS (or another error function), has lambda (a constant) times the summation of the squares of the coefficients of the independent variables. By doing this we make the loss proportional to the magnitude of the coefficients, and using gradient descent we get new values of the coefficients which are lower than their previous values, thus reducing overfitting.

It is generally used when the multicollinearity is relatively low. The value of lambda is a hyperparameter and is found during cross validation.
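
A similar sketch with scikit-learn's RidgeCV, again on synthetic data; the candidate alphas are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Same kind of synthetic data as in the lasso example
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# RidgeCV selects the best alpha among the candidates via cross validation
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
print("chosen alpha:", model.alpha_)
print("coefficients:", model.coef_)  # shrunk toward, but not exactly to, zero
```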

How to handle categorical values in the data?

It depends on the type of categorical variable. If the categorical variables are ordinal, i.e. they have an inherent hierarchy, they can be represented using a single column with numeric values corresponding to the hierarchy. If they aren't ordinal, we use one hot encoding and create n-1 columns [n being the number of distinct values the categorical variable has].
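
A minimal sketch of both encodings with pandas; the column names and category ordering below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordinal (has a hierarchy)
    "city": ["Tokyo", "Osaka", "Tokyo", "Kyoto"],   # nominal (no order)
})

# Ordinal variable: map the hierarchy to numeric values in a single column
df["size"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# Nominal variable: one hot encode, dropping the first level to get n-1 columns
df = pd.get_dummies(df, columns=["city"], drop_first=True)
print(df)
```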

What is normalization?

Normalization basically means rescaling the data so that it falls in a smaller range. In machine learning problems, normalization is used to bring the values of numeric columns in the dataset onto a common scale, without distorting the differences in the ranges of values or losing information. Suppose you're working on a problem where one attribute is in kilograms, another in meters, while yet another is in hours. Owing to the difference in units they will differ in magnitude; for example, a change of 1 unit in the hours column may correspond to a change of 2000 m in the distance column.

Machine learning algorithms operate on certain basic assumptions, and differences in the magnitude of variance across input variables can adversely affect their performance. For example, distance-based algorithms and gradient descent based algorithms malfunction when input variables are on significantly different scales. By malfunction we mean that the learning is dominated by the variables whose magnitude of variation is higher than the others.
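
As an illustration, min-max scaling with scikit-learn's MinMaxScaler brings such mixed-unit columns onto a common 0-1 scale; the values below are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: weight in kilograms, distance in meters, duration in hours
X = np.array([
    [70.0, 1500.0, 0.5],
    [55.0, 4200.0, 2.0],
    [90.0,  800.0, 1.0],
])

scaler = MinMaxScaler()              # rescales each column to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)                      # all columns now share the same scale
```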

What is abstraction?

Abstraction is defined as the process of making something easier to understand by ignoring the details that may be unimportant. Abstraction makes our day-to-day tasks doable. For example, something as simple as checking what time your watch shows would become a humongous task if you had to take in all the details of the watch, such as the length and width of the hands and the intricate internal mechanism with small and large gears interlocking through their teeth. But you ignore all these unnecessary details and just read the time.

Similarly, our neural networks use abstraction to solve complex problems. Let us take a simple route-finding problem to drive home the point.

Let us assume that you have decided to visit a city in Japan. Your mission is to reach the district headquarters as soon as you leave the airport. The person who gave you the mission brief, say Mr M, forgot an important detail that was conveyed to him by a local of the city, say Mr T. Mr T had told Mr M that one only had to follow the road signs, which not only tell you which turn to take to reach the city center but also how far you are from it. Mr M realized his mistake and told you that, though he didn't remember the exact advice of Mr T, he could provide you with a dataset through which you could figure out the route to the city center. Being a proactive employee and an AI engineer, you readily accepted the challenge of building a model that can accurately predict the route to the city center. But there is one problem: the dataset is in a language which neither you nor Google Translate understands. You have no clue which column denotes what, and the categorical columns are an absolute nightmare.

Further, the dataset has umpteen features, ranging from the colour of the houses on the way, traffic info, weather info and street vendors to the trees planted along the road and the species and age of those trees. So not only do you not know what each column denotes, most of the columns are also useless.

If you use linear regression to build your model, it would take an enormous amount of effort and time in feature selection, model tuning, validation, etc., before you arrived at accurate predictions. If the data from the road signs were in the form of images, it would be practically impossible to achieve the desired result.

On the other hand, if you choose to build a neural network with nonlinear activation functions, you'll be able to make accurate predictions in a reasonable amount of time and with much more ease.

As you would have guessed by now, we create neural networks to solve problems with a level of accuracy that other algorithms simply can't deliver; even if they could, they would require an excessive amount of time and effort. Thus having a linear activation function simply defeats the purpose of a neural network.
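
A tiny numpy sketch of that last point: with a linear (identity) activation, two stacked layers collapse into a single linear layer, so the network gains no expressive power. The weights and shapes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1 (made-up shapes)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2

x = rng.normal(size=3)

# Two layers with a linear (identity) activation...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one linear layer:
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: no extra expressive power
```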