How can categorical predictors be incorporated in linear regression?

To incorporate categorical features, also known as qualitative predictors, into a linear regression model, they must first be converted into a numerical format.

There are two types of categorical features:

  1. Nominal features: These are features whose values have no inherent ordering, for example gender or color. They can be represented using one-hot encoding or dummy encoding. The primary disadvantage of encoding nominal features is the explosion of the feature set, especially when the number of unique values is large.
  2. Ordinal features: These are categorical features that have an inherent ordering, for example the size of a t-shirt (Small/Medium/Large). They can be represented using ordinal encoding. The disadvantage is that the difference between two assigned integers may not faithfully reflect the true gap between the underlying categories.

There are three common approaches for dealing with categorical or qualitative predictors: (a) dummy encoding, (b) one-hot encoding, and (c) ordinal encoding.

Dummy Encoding

The classical approach to dealing with nominal categorical or qualitative predictors is dummy encoding, which numerically represents the different levels of a feature using binary 1’s and 0’s. If a predictor has k categories, only k-1 dummy variables are needed to uniquely represent that attribute in the model: one level, the reference level, is represented by setting all of the dummy variables to 0. Adding a k-th dummy would be redundant, and in a linear regression with an intercept it would also make the design matrix perfectly collinear (the so-called dummy variable trap).

The table below depicts dummy encoding using an auto loan example with three types of loan: New Auto, Used Auto, and Signature. Dummy encoding creates two new dummy variables; the third category (‘Signature’ in this example) is inferred when both dummy variables are 0. This omitted category is also called the reference level.

| LoanID | Loan Type (Original) | New Auto (Dummy variable) | Used Auto (Dummy variable) | Explanation |
|--------|----------------------|---------------------------|---------------------------|-------------|
| L1 | New Auto | 1 | 0 | The value of dummy variable New Auto is 1 and that of Used Auto is 0 |
| L2 | Used Auto | 0 | 1 | The value of dummy variable New Auto is 0 and that of Used Auto is 1 |
| L3 | Used Auto | 0 | 1 | Same as L2 |
| L4 | Signature | 0 | 0 | The value of both dummy variables New Auto and Used Auto is 0 |
| L5 | New Auto | 1 | 0 | Same as L1 |

An example for Dummy Encoding. Source: AIML.com
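
As a concrete illustration, here is a minimal sketch that reproduces the dummy-encoded table with pandas. The LoanID and Loan Type values come from the table above; setting an explicit category order is a choice made here so that pandas drops ‘Signature’ as the reference level (by default it would drop the alphabetically first level):

```python
import pandas as pd

loans = pd.DataFrame({
    "LoanID": ["L1", "L2", "L3", "L4", "L5"],
    "LoanType": ["New Auto", "Used Auto", "Used Auto", "Signature", "New Auto"],
})

# Listing "Signature" first makes it the level dropped by drop_first=True,
# i.e., the reference level represented by all-zero dummies.
loans["LoanType"] = pd.Categorical(
    loans["LoanType"], categories=["Signature", "New Auto", "Used Auto"]
)

# k = 3 categories -> k - 1 = 2 dummy columns
dummies = pd.get_dummies(loans["LoanType"], prefix="LoanType",
                         drop_first=True, dtype=int)
print(loans.join(dummies))
```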

One-hot Encoding

One-hot encoding is a similar but slightly different technique from dummy encoding, also used for nominal categorical features. In this approach, for a predictor with k categories, k different binary columns are created, where each takes the value 1 if the observation belongs to that particular category and 0 otherwise. For the same dataset, one-hot encoding would transform the loan type variable as follows:

| LoanID | Loan Type (Original) | New Auto (One-hot variable) | Used Auto (One-hot variable) | Signature (One-hot variable) | Explanation |
|--------|----------------------|-----------------------------|------------------------------|------------------------------|-------------|
| L1 | New Auto | 1 | 0 | 0 | The value of one-hot variable New Auto is 1 and the others are 0 |
| L2 | Used Auto | 0 | 1 | 0 | The value of one-hot variable Used Auto is 1 and the others are 0 |
| L3 | Used Auto | 0 | 1 | 0 | Same as L2 |
| L4 | Signature | 0 | 0 | 1 | The value of one-hot variable Signature is 1 and the others are 0. In dummy encoding, ‘Signature’ does not appear as a separate variable |
| L5 | New Auto | 1 | 0 | 0 | Same as L1 |

An example for One-hot Encoding. Source: AIML.com

In one-hot encoding, each category has its own regression coefficient in the model. This is in contrast to dummy encoding, where only k-1 levels have coefficients, each representing the effect of that level relative to the reference level. Under either scheme, an observation with a missing value for the original variable is represented by 0 in every encoded column, since a 1 appears only to indicate the presence of a category value.
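
A minimal sketch of one-hot encoding with pandas follows; a sixth, hypothetical row (L6) with a missing loan type is added here purely to demonstrate the all-zeros behavior just described:

```python
import pandas as pd

loans = pd.DataFrame({
    "LoanID": ["L1", "L2", "L3", "L4", "L5", "L6"],
    # L6 is a hypothetical extra row with a missing loan type
    "LoanType": ["New Auto", "Used Auto", "Used Auto",
                 "Signature", "New Auto", None],
})

# No drop_first here: all k = 3 categories get their own binary column.
one_hot = pd.get_dummies(loans["LoanType"], prefix="LoanType", dtype=int)
print(loans.join(one_hot))
# The missing value in L6 yields 0 in every one-hot column: get_dummies
# ignores NaN unless dummy_na=True is passed to create a NaN indicator.
```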

Ordinal Encoding

When dealing with ordinal categorical features, where there is a natural and roughly evenly spaced ordering to the levels, such as temperature recorded on a scale of Low, Medium, or High, it can make sense to map the values to the integers 1, 2, and 3, respectively. The transformation of such a variable would look like the following:

| ObservationID | Temperature (Original) | Temperature_coded (Ordinal variable) |
|---------------|------------------------|--------------------------------------|
| O1 | Medium | 2 |
| O2 | High | 3 |
| O3 | High | 3 |
| O4 | Low | 1 |
| O5 | Medium | 2 |

An example for Ordinal Encoding. Source: AIML.com
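
A minimal sketch of ordinal encoding with an explicit mapping (column names follow the table above); spelling out the dictionary keeps full control over which integer each level receives:

```python
import pandas as pd

temps = pd.DataFrame({
    "ObservationID": ["O1", "O2", "O3", "O4", "O5"],
    "Temperature": ["Medium", "High", "High", "Low", "Medium"],
})

# Explicit level -> integer mapping encodes the natural ordering Low < Medium < High.
order = {"Low": 1, "Medium": 2, "High": 3}
temps["Temperature_coded"] = temps["Temperature"].map(order)
print(temps)
```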

It should be noted that if there is no intrinsic order to the variable’s categories, ordinal encoding is not a viable approach. It can also be questionable when the categories are spaced at uneven intervals. For example, if Low represents 0 degrees, Medium 10 degrees, and High 50 degrees, mapping them to 1, 2, and 3 fails to preserve the practical meaning of that spacing, possibly leading to information loss. If ordinal encoding is ruled out, dummy encoding would likely be more suitable than one-hot encoding, as the Low category naturally lends itself to being the reference level, with the coefficients for Medium and High then interpreted relative to it.
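
If ordinal encoding is rejected for the temperature example, a dummy encoding with Low as the reference level could look like the sketch below; explicitly ordering the categories is an assumption made here so that pandas drops Low rather than an arbitrary alphabetically first level:

```python
import pandas as pd

temps = pd.DataFrame({"Temperature": ["Medium", "High", "High", "Low", "Medium"]})

# Putting "Low" first in the category list makes it the dropped reference level.
temps["Temperature"] = pd.Categorical(
    temps["Temperature"], categories=["Low", "Medium", "High"], ordered=True
)
print(pd.get_dummies(temps["Temperature"], prefix="Temp",
                     drop_first=True, dtype=int))
```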
