Sometimes a data set contains both continuous and categorical variables. We often skip the categorical variables, but in some cases they carry information that we don't want to lose. In those cases we have to prepare the categorical data for the model, because categorical variables can't go into a regression or classification equation in their raw form. They must be encoded first. Also, most ML algorithms expect numeric input.
Recently I was building a predictive model. It worked fine on continuous data but could not be used on categorical data. To deal with categorical data, you either have to convert it to numeric values or switch to another model. So the question was how to convert my categorical data into numeric values using Python. Let's get started.
In a data set, the value of a categorical column/variable is not represented by a number. Instead it is represented by a string. The distinct string values form the levels of that variable.
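As a small illustration of levels, the sketch below lists the distinct string values of a column with pandas. The `sex` column and its values are made up for the example and are not read from any file.

```python
import pandas as pd

# Hypothetical sample column; the values are made up for illustration.
sex = pd.Series(['male', 'female', 'female', 'male', 'female'])

# Each distinct string is one level of the variable.
levels = sorted(sex.unique())
print(levels)               # ['female', 'male']
print(sex.value_counts())   # how many rows fall in each level
```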
Now on to the programming part! The data set is the Titanic data set.
Let's first import our modules.
import pandas as pd
import numpy as np
Load our data.
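A minimal sketch of the loading step. The original file name is not shown in this article, so `train.csv` in the comment is an assumption. To keep the snippet runnable on its own, it parses a two-row in-memory stand-in instead of reading from disk.

```python
import io
import pandas as pd

# With the real file you would call something like:
#   data = pd.read_csv('train.csv')   # 'train.csv' is an assumed file name
# Here a two-row in-memory stand-in with a subset of columns is parsed.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Embarked\n"
    "1,0,3,male,22.0,S\n"
    "2,1,1,female,38.0,C\n"
)
data = pd.read_csv(sample_csv)
print(data.shape)  # (2, 6)
```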
Let's take a look at our data.
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Sometimes a small Python script is enough to convert continuous, non-leveled data into levels that a classification model can use more easily. Let's check this on the Age column.
The ages in this data run from under one year to 80. We can divide this range into 5 levels, where each level covers an age group 15 years wide (0 to 15, 15 to 30, and so on), with the last level taking everyone aged 60 and over. If a value is 'NaN', I am going to put it into the level for ages 30 to 45.
age = data['Age']
age_processed = []
for i in age:
    if i < 15.0:
        age_processed.append(1)
    elif 15.0 <= i < 30.0:
        age_processed.append(2)
    elif 30.0 <= i < 45.0:
        age_processed.append(3)
    elif 45.0 <= i < 60.0:
        age_processed.append(4)
    elif i >= 60.0:
        age_processed.append(5)
    else:
        # NaN fails every comparison above; use level 3 (age 30 to 45)
        age_processed.append(3)
With this simple Python script, the Age data has been transformed into a 5-level variable.
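The same binning can also be sketched with `pandas.cut`. This is a different technique from the loop above, not the method used in this article. The sample ages are made up, and missing values are assigned level 3 (age 30 to 45) to follow the rule stated earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
age = pd.Series([0.8, 22.0, 38.0, 47.0, 80.0, np.nan])

# Left-closed bins: [0, 15), [15, 30), [30, 45), [45, 60), [60, inf)
bins = [0, 15, 30, 45, 60, np.inf]
levels = pd.cut(age, bins=bins, labels=[1, 2, 3, 4, 5], right=False)

# NaN stays NaN after cut; put it in level 3 (age 30 to 45).
age_processed = levels.astype(float).fillna(3).astype(int).tolist()
print(age_processed)  # [1, 2, 3, 4, 5, 3]
```

Passing `right=False` makes each bin include its lower edge, so an age of exactly 15 lands in level 2, as in the loop.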
Python's scikit-learn (sklearn) provides a way to deal with categorical data: label encoding. It is used to pre-process data before passing it to a model.
First import our library.
from sklearn import preprocessing
Now define our label encoder.
Now tell the encoder which labels exist. Suppose our categorical variable (Country) has the values India, China, USA, UK, Bangladesh, Japan, and Spain.
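The two steps above can be sketched as follows. The list of country names is taken from the class listing printed further down, and `le` is the variable name that the later `le.transform` call uses.

```python
from sklearn import preprocessing

# Create the encoder, then show it every label it should know about.
le = preprocessing.LabelEncoder()
le.fit(['India', 'China', 'USA', 'UK', 'Bangladesh', 'Japan', 'Spain'])
```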
The encoder has been fitted to the data. Now check which classes it has learned.
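A sketch of that check, with the encoder refitted so the snippet runs on its own. `classes_` is the fitted attribute that holds the unique labels in sorted order.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['India', 'China', 'USA', 'UK', 'Bangladesh', 'Japan', 'Spain'])

# The position of a label in classes_ is the integer it is encoded as.
print(le.classes_)
print(len(le.classes_))  # 7
```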
array(['Bangladesh', 'China', 'India', 'Japan', 'Spain', 'UK', 'USA'], dtype='|S10')
Now let's transform some data from categorical to numeric values.
Country = ['India', 'China', 'USA', 'India', 'China', 'USA', 'UK',
           'Bangladesh', 'Japan', 'Spain', 'UK', 'Bangladesh', 'India',
           'China', 'India', 'China', 'USA', 'UK', 'Bangladesh', 'Japan',
           'Spain', 'USA', 'UK', 'Bangladesh', 'Japan', 'Spain', 'Japan',
           'Spain']
country_numeric = le.transform(Country)
The transformation is done. Now check what the array country_numeric looks like.
array([2, 1, 6, 2, 1, 6, 5, 0, 3, 4, 5, 0, 2, 1, 2, 1, 6, 5, 0, 3, 4, 6, 5, 0, 3, 4, 3, 4])
Our data has been transformed, and each integer is the index of the corresponding label in the le.classes_ output.
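To make that correspondence concrete, here is a small sketch that maps a few codes back to their labels with `inverse_transform`. The encoder is refitted so the snippet is self-contained.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['Bangladesh', 'China', 'India', 'Japan', 'Spain', 'UK', 'USA'])

# Encode three labels, then decode the integers again.
codes = le.transform(['India', 'China', 'USA'])
print(codes.tolist())                        # [2, 1, 6]
print(le.inverse_transform(codes).tolist())  # ['India', 'China', 'USA']
```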
Python's data manipulation library Pandas also provides a method to deal with categorical data.
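A minimal sketch of that method, `pd.get_dummies`, applied to an `Embarked` column. The small frame is a made-up stand-in for the Titanic data, and `dtype=int` is passed because newer pandas versions otherwise return True/False instead of 1/0.

```python
import pandas as pd

# Stand-in for the Titanic frame: only the column being encoded,
# with made-up values that cover all three ports.
data = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q', 'S']})

# One new 0/1 column per level of Embarked.
embarked_dummies = pd.get_dummies(data['Embarked'], dtype=int)
print(embarked_dummies)
```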
Data has been transformed into new variables.
Here the Embarked variable has three values (levels): 'C', 'Q', and 'S'. The pandas function get_dummies creates three new variables in the data frame and assigns the values accordingly. For a single row, only one column will have the active value (1).
So if the categorical variable has a limited number of distinct values, this method can be helpful. The new data can now be joined with the existing data or treated separately.
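Joining the dummy columns back onto the original frame could look like the sketch below, which uses `pd.concat`. The two-column frame, its values, and the `Embarked` prefix are illustrative choices, not taken from the article.

```python
import pandas as pd

# Made-up stand-in for the Titanic frame.
data = pd.DataFrame({'Age': [22.0, 38.0, 26.0],
                     'Embarked': ['S', 'C', 'Q']})

dummies = pd.get_dummies(data['Embarked'], prefix='Embarked', dtype=int)

# Attach the new columns side by side and drop the original string column.
data_encoded = pd.concat([data.drop(columns='Embarked'), dummies], axis=1)
print(list(data_encoded.columns))
# ['Age', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
```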
It is difficult to say which method fits you best; you have to check that for yourself. Go get your hands dirty with data. Corrections and suggestions are welcome in the comments.
Published on: 2016-12-01