To incorporate categorical features, also known as qualitative predictors, into a linear regression model, they must first be converted into a numerical format.
There are two types of categorical features:
- Nominal features: These are features with no inherent ordering among their values, for example gender or color. They can be represented using one-hot encoding or dummy encoding. The primary disadvantage of these techniques is the explosion of the feature set, especially when the number of unique values is large.
- Ordinal features: These are categorical features that have an inherent ordering, for example the size of a t-shirt (Small/Medium/Large). They can be represented using ordinal encoding. The disadvantage of ordinal encoding is that the difference between two representative integers may not faithfully reflect the true distance between the underlying categories.
There are three common approaches to dealing with categorical or qualitative predictors: (a) dummy encoding, (b) one-hot encoding, and (c) ordinal encoding.
Dummy Encoding
The classical approach to dealing with nominal categorical or qualitative predictors is dummy encoding, which represents the different levels of a feature using binary 1's and 0's. If a predictor has k categories, only k-1 dummy variables are needed to uniquely represent that attribute in the model: one level (the reference level) is represented by setting all of the dummy variables to 0, so a kth dummy variable would be redundant.
The table below depicts dummy encoding using an Auto Loan example with three types of loan: New Auto, Used Auto, and Signature. Dummy encoding produces the following transformation, creating two new dummy variables; the third category ('Signature' in this example) is inferred when both dummy variables are 0, and is sometimes called the reference level.
LoanID | Loan Type (Original) | New Auto (Dummy variable) | Used Auto (Dummy variable) | Explanation |
---|---|---|---|---|
L1 | New Auto | 1 | 0 | The value of dummy variable New Auto is 1 and that of Used Auto is 0 |
L2 | Used Auto | 0 | 1 | The value of dummy variable New Auto is 0 and that of Used Auto is 1 |
L3 | Used Auto | 0 | 1 | Same as L2 |
L4 | Signature | 0 | 0 | The value of both dummy variables New Auto and Used Auto is 0 |
L5 | New Auto | 1 | 0 | Same as L1 |
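The transformation in the table above can be sketched with pandas, using a hypothetical DataFrame that mirrors the Auto Loan data. `pd.get_dummies` creates one column per level; explicitly dropping the 'Signature' column makes it the reference level, leaving k-1 = 2 dummy variables.

```python
import pandas as pd

# Hypothetical Auto Loan data mirroring the table above.
loans = pd.DataFrame({
    "LoanID": ["L1", "L2", "L3", "L4", "L5"],
    "LoanType": ["New Auto", "Used Auto", "Used Auto", "Signature", "New Auto"],
})

# One column per level, then drop 'Signature' so it becomes
# the reference level (all-zero row), leaving k-1 = 2 dummies.
dummies = pd.get_dummies(loans["LoanType"]).drop(columns=["Signature"]).astype(int)
print(dummies)
```

Note that `pd.get_dummies(..., drop_first=True)` achieves the same k-1 result more concisely, but it drops the alphabetically first level rather than letting you choose the reference level.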
One-hot Encoding
One-hot encoding is a similar but slightly different technique from dummy encoding, also used for nominal categorical features. For the same scenario of a predictor with k categories, k binary columns are created, each taking the value 1 if the observation belongs to that particular category and 0 otherwise. For the same dataset, one-hot encoding would transform the loan type variable as follows:
LoanID | Loan Type (Original) | New Auto (One-hot variable) | Used Auto (One-hot variable) | Signature (One-hot variable) | Explanation |
---|---|---|---|---|---|
L1 | New Auto | 1 | 0 | 0 | The value of one-hot variable New Auto is 1 and others are 0 |
L2 | Used Auto | 0 | 1 | 0 | The value of one-hot variable Used Auto is 1 and others are 0 |
L3 | Used Auto | 0 | 1 | 0 | Same as L2 |
L4 | Signature | 0 | 0 | 1 | The value of one-hot variable Signature is 1 and others are 0. In Dummy encoding, 'Signature' does not appear as a separate variable |
L5 | New Auto | 1 | 0 | 0 | Same as L1 |
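The same hypothetical Auto Loan data can be one-hot encoded with pandas; here no column is dropped, so all k = 3 categories get their own binary column (the column order is alphabetical by default).

```python
import pandas as pd

# The same hypothetical Auto Loan data as in the table above.
loans = pd.DataFrame({
    "LoanID": ["L1", "L2", "L3", "L4", "L5"],
    "LoanType": ["New Auto", "Used Auto", "Used Auto", "Signature", "New Auto"],
})

# One-hot encoding: k = 3 categories -> 3 binary columns, one per level.
onehot = pd.get_dummies(loans["LoanType"]).astype(int)
print(onehot)
```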
In one-hot encoding, each category has its own regression coefficient in the model. This is in contrast to dummy encoding, where only k-1 levels have coefficients representing the effect of that level. Under either encoding, an observation with a missing value for the original variable is represented by 0 in every column created by the transformation, since a 1 appears only to indicate the presence of a value.
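The all-zeros behavior for missing values can be observed directly in pandas: with the default `dummy_na=False`, a missing entry produces a row of zeros across every encoded column.

```python
import pandas as pd

# A hypothetical loan type column containing one missing value.
loan_type = pd.Series(["New Auto", None, "Signature"])

# With the default dummy_na=False, the missing-value row gets 0
# in every encoded column -- no 1 appears anywhere for it.
encoded = pd.get_dummies(loan_type).astype(int)
print(encoded)
```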
Ordinal Encoding
When dealing with ordinal categorical features, where the levels of a categorical variable have a natural and consistently spaced ordering, such as temperature recorded on a scale of low, medium, or high, it can make sense to map the values to the integer values 1, 2, and 3, respectively. The transformation of such a variable would look like the following:
ObservationID | Temperature (Original) | Temperature_coded (Ordinal variable) |
---|---|---|
O1 | Medium | 2 |
O2 | High | 3 |
O3 | High | 3 |
O4 | Low | 1 |
O5 | Medium | 2 |
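The mapping in the table above can be implemented with an explicit dictionary, which keeps the ordering under the analyst's control rather than relying on alphabetical order (hypothetical data matching the table):

```python
import pandas as pd

# Temperature observations matching the table above.
temps = pd.Series(["Medium", "High", "High", "Low", "Medium"])

# Explicit mapping from ordered category to integer code.
order = {"Low": 1, "Medium": 2, "High": 3}
temps_coded = temps.map(order)
print(temps_coded)
```

An explicit dictionary also makes the assumed ordering visible in the code, which matters here because a generic label encoder would sort the levels alphabetically (High, Low, Medium) and destroy the intended order.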
It should be clearly noted that this is not a viable approach if there is no intrinsic order to the variable's categories. It is also questionable when the original categories are unevenly spaced. For example, if Low represented 0 degrees, Medium 10 degrees, and High 50 degrees, the practical meaning of that spacing would be difficult to preserve after an ordinal transformation, possibly leading to information loss. If ordinal encoding is not used, dummy encoding would likely be more suitable than one-hot encoding, since the Low category naturally lends itself to being the reference level, and the order would still be preserved among the three levels.