import pandas as pd
import numpy as np
import re
from fastcore.utils import *
Extracting and grouping text info by category
About the text file:
The text file we will work on is a data description file which is part of the Kaggle dataset. The file is used to describe all of the 79 explanatory variables (features) and the meaning of each corresponding category variable’s code.
The goal
The goal of this project is to extract the information about the explanatory variables and category variables’ codes, then create a seperate pandas dataframe for each one and save as csv
file for downstream tasks.
Import packages
Load the data
The path of the Kaggle dataset.
= Path('./house-prices-advanced-regression-techniques/') path
Have a look at the data
= pd.read_csv(path/'train.csv')
train_df train_df.shape
(1460, 81)
-5:] train_df.tail().iloc[:,
MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|
1455 | 8 | 2007 | WD | Normal | 175000 |
1456 | 2 | 2010 | WD | Normal | 210000 |
1457 | 5 | 2010 | WD | Normal | 266500 |
1458 | 4 | 2010 | WD | Normal | 142125 |
1459 | 6 | 2008 | WD | Normal | 147500 |
= pd.read_csv(path/'test.csv')
test_df test_df.shape
(1459, 80)
-5:] test_df.tail().iloc[:,
MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|
1454 | 0 | 6 | 2006 | WD | Normal |
1455 | 0 | 4 | 2006 | WD | Abnorml |
1456 | 0 | 9 | 2006 | WD | Abnorml |
1457 | 700 | 7 | 2006 | WD | Normal |
1458 | 0 | 11 | 2006 | WD | Normal |
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
From the above, we know that train set (except the dependent valiable SalePrice
) and test set both have 79 explanatory variables (excluding Id
).
All the independent variables are:
= train_df.columns[1:-1]
ind_vars ind_vars
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
'MoSold', 'YrSold', 'SaleType', 'SaleCondition'],
dtype='object')
len(ind_vars)
79
Among the independent variables, some are numerical variables, others are category variables.
= train_df.select_dtypes(include=np.number).columns # numeric columns
num_cols num_cols
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
dtype='object')
= train_df.select_dtypes(include='object').columns # category columns
cat_cols cat_cols
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
'SaleType', 'SaleCondition'],
dtype='object')
Let have a look at the data description file: data_description.txt
The first 30 lines is as below:
!head -n 30 {path/'data_description.txt'}
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
Let grep the lines with MS
!grep MS {path/'data_description.txt'}
MSSubClass: Identifies the type of dwelling involved in the sale.
MSZoning: Identifies the general zoning classification of the sale.
!rg MS {path/'data_description.txt'}
1:MSSubClass: Identifies the type of dwelling involved in the sale.
20:MSZoning: Identifies the general zoning classification of the sale.
Read the text file and split lines into a list
= (path/'data_description.txt').read_text().splitlines()
desc 30] desc[:
['MSSubClass: Identifies the type of dwelling involved in the sale.\t',
'',
' 20\t1-STORY 1946 & NEWER ALL STYLES',
' 30\t1-STORY 1945 & OLDER',
' 40\t1-STORY W/FINISHED ATTIC ALL AGES',
' 45\t1-1/2 STORY - UNFINISHED ALL AGES',
' 50\t1-1/2 STORY FINISHED ALL AGES',
' 60\t2-STORY 1946 & NEWER',
' 70\t2-STORY 1945 & OLDER',
' 75\t2-1/2 STORY ALL AGES',
' 80\tSPLIT OR MULTI-LEVEL',
' 85\tSPLIT FOYER',
' 90\tDUPLEX - ALL STYLES AND AGES',
' 120\t1-STORY PUD (Planned Unit Development) - 1946 & NEWER',
' 150\t1-1/2 STORY PUD - ALL AGES',
' 160\t2-STORY PUD - 1946 & NEWER',
' 180\tPUD - MULTILEVEL - INCL SPLIT LEV/FOYER',
' 190\t2 FAMILY CONVERSION - ALL STYLES AND AGES',
'',
'MSZoning: Identifies the general zoning classification of the sale.',
'\t\t',
' A\tAgriculture',
' C\tCommercial',
' FV\tFloating Village Residential',
' I\tIndustrial',
' RH\tResidential High Density',
' RL\tResidential Low Density',
' RP\tResidential Low Density Park ',
' RM\tResidential Medium Density',
'\t']
Remove all the empty lines.
= [i for i in desc if len(i.strip()) != 0]
desc 30] desc[:
['MSSubClass: Identifies the type of dwelling involved in the sale.\t',
' 20\t1-STORY 1946 & NEWER ALL STYLES',
' 30\t1-STORY 1945 & OLDER',
' 40\t1-STORY W/FINISHED ATTIC ALL AGES',
' 45\t1-1/2 STORY - UNFINISHED ALL AGES',
' 50\t1-1/2 STORY FINISHED ALL AGES',
' 60\t2-STORY 1946 & NEWER',
' 70\t2-STORY 1945 & OLDER',
' 75\t2-1/2 STORY ALL AGES',
' 80\tSPLIT OR MULTI-LEVEL',
' 85\tSPLIT FOYER',
' 90\tDUPLEX - ALL STYLES AND AGES',
' 120\t1-STORY PUD (Planned Unit Development) - 1946 & NEWER',
' 150\t1-1/2 STORY PUD - ALL AGES',
' 160\t2-STORY PUD - 1946 & NEWER',
' 180\tPUD - MULTILEVEL - INCL SPLIT LEV/FOYER',
' 190\t2 FAMILY CONVERSION - ALL STYLES AND AGES',
'MSZoning: Identifies the general zoning classification of the sale.',
' A\tAgriculture',
' C\tCommercial',
' FV\tFloating Village Residential',
' I\tIndustrial',
' RH\tResidential High Density',
' RL\tResidential Low Density',
' RP\tResidential Low Density Park ',
' RM\tResidential Medium Density',
'LotFrontage: Linear feet of street connected to property',
'LotArea: Lot size in square feet',
'Street: Type of road access to property',
' Grvl\tGravel\t']
After looking at the above text file, we noticed that the data types of some independent variables are incorrect.
Let fix it.
Define a function to extract all the descriptions about the 79 explanatory variables and check the data type.
def var_desc(txt, ind_vars):
= {}
d = {}
cat for i, t in enumerate(txt):
= re.findall(r'(\S+): ([^\t?]+)', t)
rt if rt:
if rt[0][0] in ind_vars:
0][0]] = rt[0][1]
d[rt[if txt[i+1].startswith(' '):
0][0]] = 'Category'
cat[rt[= pd.DataFrame(d.items(), columns=['Variable', 'Variabe_Description'])
df 'Variable_Type'] = df.Variable.map(cat).fillna('Numerical')
df[return df
= var_desc(desc, ind_vars)
df df.head()
Variable | Variabe_Description | Variable_Type | |
---|---|---|---|
0 | MSSubClass | Identifies the type of dwelling involved in th... | Category |
1 | MSZoning | Identifies the general zoning classification o... | Category |
2 | LotFrontage | Linear feet of street connected to property | Numerical |
3 | LotArea | Lot size in square feet | Numerical |
4 | Street | Type of road access to property | Category |
len(df)
77
/f'vars_desc.csv', index=False) df.to_csv(path
= df['Variable'][df['Variable_Type'] == 'Category'].values
cat_vars len(cat_vars) cat_vars,
(array(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape',
'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle',
'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence',
'MiscFeature', 'SaleType', 'SaleCondition'], dtype=object),
46)
len(cat_cols) cat_cols,
(Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
'SaleType', 'SaleCondition'],
dtype='object'),
43)
for i in cat_cols if i not in cat_vars] [i
[]
for i in cat_vars if i not in cat_cols] [i
['MSSubClass', 'OverallQual', 'OverallCond']
After comparing the results of data types from the original train set and text file, we get the category variables.
cat_vars
array(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape',
'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle',
'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence',
'MiscFeature', 'SaleType', 'SaleCondition'], dtype=object)
Define a function to extract all the information about each category variable’s codes.
def cat_desc(txt, col_name):
= {}
d # find the column name match the category variable
for i in range(len(txt)):
= re.match(col_name, txt[i])
rt1 if rt1:
# print(rt1.group(0), i) # debugging
= i+1
j # extract the info about the the corresponding category variable
for t in txt[j:]:
if t.startswith(' '): # check if the text start with white space
# find two parts (the code and code description) using regex
# the code might be 'NA' , digits or any word characters
= re.findall(r'\s+(\d+\.\d+\w+|NA\s+|\w+\&?\w*?\s*?)\t(.*)', t, flags=re.IGNORECASE)
rt2 0][0].strip()] = (rt2[0][1]).strip()
d[rt2[else:
break
# return a dataframe of the category variable
return pd.DataFrame(d.items(), columns=[col_name, col_name + '_Description'])
Test the function with category variable MSSubClass
and SaleCondition
= 'MSSubClass'
col_name cat_desc(desc, col_name)
MSSubClass | MSSubClass_Description | |
---|---|---|
0 | 20 | 1-STORY 1946 & NEWER ALL STYLES |
1 | 30 | 1-STORY 1945 & OLDER |
2 | 40 | 1-STORY W/FINISHED ATTIC ALL AGES |
3 | 45 | 1-1/2 STORY - UNFINISHED ALL AGES |
4 | 50 | 1-1/2 STORY FINISHED ALL AGES |
5 | 60 | 2-STORY 1946 & NEWER |
6 | 70 | 2-STORY 1945 & OLDER |
7 | 75 | 2-1/2 STORY ALL AGES |
8 | 80 | SPLIT OR MULTI-LEVEL |
9 | 85 | SPLIT FOYER |
10 | 90 | DUPLEX - ALL STYLES AND AGES |
11 | 120 | 1-STORY PUD (Planned Unit Development) - 1946 ... |
12 | 150 | 1-1/2 STORY PUD - ALL AGES |
13 | 160 | 2-STORY PUD - 1946 & NEWER |
14 | 180 | PUD - MULTILEVEL - INCL SPLIT LEV/FOYER |
15 | 190 | 2 FAMILY CONVERSION - ALL STYLES AND AGES |
= 'SaleCondition'
col_name cat_desc(desc, col_name)
SaleCondition | SaleCondition_Description | |
---|---|---|
0 | Normal | Normal Sale |
1 | Abnorml | Abnormal Sale - trade, foreclosure, short sale |
2 | AdjLand | Adjoining Land Purchase |
3 | Alloca | Allocation - two linked properties with separa... |
4 | Family | Sale between family members |
5 | Partial | Home was not completed when last assessed (ass... |
Create pandas dataframe and save them as csv
files.
for cat_var in cat_vars:
= cat_desc(desc, cat_var)
cat_df /f'{cat_var}.csv', index=False) cat_df.to_csv(path
!ls {path}
Alley.csv Foundation.csv MiscFeature.csv
BldgType.csv Functional.csv Neighborhood.csv
BsmtCond.csv GarageCond.csv OverallCond.csv
BsmtExposure.csv GarageFinish.csv OverallQual.csv
BsmtFinType1.csv GarageQual.csv PavedDrive.csv
BsmtFinType2.csv GarageType.csv PoolQC.csv
BsmtQual.csv Heating.csv RoofMatl.csv
CentralAir.csv HeatingQC.csv RoofStyle.csv
Condition1.csv HouseStyle.csv SaleCondition.csv
Condition2.csv KitchenQual.csv SaleType.csv
Electrical.csv LandContour.csv Street.csv
ExterCond.csv LandSlope.csv Utilities.csv
ExterQual.csv LotConfig.csv data_description.txt
Exterior1st.csv LotShape.csv sample_submission.csv
Exterior2nd.csv MSSubClass.csv test.csv
Fence.csv MSZoning.csv train.csv
FireplaceQu.csv MasVnrType.csv vars_desc.csv