Extracting and grouping text info by category

data cleaning
regex
This notebook shows matching text pattern and extract text and processing strings using Regex.
Published

October 23, 2022

About the text file:

The text file we will work on is a data description file which is part of the Kaggle dataset. The file is used to describe all of the 79 explanatory variables (features) and the meaning of each corresponding category variable’s code.

The goal

The goal of this project is to extract the information about the explanatory variables and category variables’ codes, then create a seperate pandas dataframe for each one and save as csv file for downstream tasks.

Import packages

import pandas as pd
import numpy as np
import re
from fastcore.utils import *

Load the data

The path of the Kaggle dataset.

path = Path('./house-prices-advanced-regression-techniques/')

Have a look at the data

train_df = pd.read_csv(path/'train.csv')
train_df.shape
(1460, 81)
train_df.tail().iloc[:, -5:]
MoSold YrSold SaleType SaleCondition SalePrice
1455 8 2007 WD Normal 175000
1456 2 2010 WD Normal 210000
1457 5 2010 WD Normal 266500
1458 4 2010 WD Normal 142125
1459 6 2008 WD Normal 147500
test_df = pd.read_csv(path/'test.csv')
test_df.shape
(1459, 80)
test_df.tail().iloc[:, -5:]
MiscVal MoSold YrSold SaleType SaleCondition
1454 0 6 2006 WD Normal
1455 0 4 2006 WD Abnorml
1456 0 9 2006 WD Abnorml
1457 700 7 2006 WD Normal
1458 0 11 2006 WD Normal
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

From the above, we know that train set (except the dependent valiable SalePrice) and test set both have 79 explanatory variables (excluding Id).

All the independent variables are:

ind_vars = train_df.columns[1:-1]
ind_vars
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition'],
      dtype='object')
len(ind_vars)
79

Among the independent variables, some are numerical variables, others are category variables.

num_cols = train_df.select_dtypes(include=np.number).columns # numeric columns
num_cols
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')
cat_cols = train_df.select_dtypes(include='object').columns # category columns
cat_cols
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')

Let have a look at the data description file: data_description.txt

The first 30 lines is as below:

!head -n 30 {path/'data_description.txt'}
MSSubClass: Identifies the type of dwelling involved in the sale.   

        20  1-STORY 1946 & NEWER ALL STYLES
        30  1-STORY 1945 & OLDER
        40  1-STORY W/FINISHED ATTIC ALL AGES
        45  1-1/2 STORY - UNFINISHED ALL AGES
        50  1-1/2 STORY FINISHED ALL AGES
        60  2-STORY 1946 & NEWER
        70  2-STORY 1945 & OLDER
        75  2-1/2 STORY ALL AGES
        80  SPLIT OR MULTI-LEVEL
        85  SPLIT FOYER
        90  DUPLEX - ALL STYLES AND AGES
       120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150  1-1/2 STORY PUD - ALL AGES
       160  2-STORY PUD - 1946 & NEWER
       180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190  2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
        
       A    Agriculture
       C    Commercial
       FV   Floating Village Residential
       I    Industrial
       RH   Residential High Density
       RL   Residential Low Density
       RP   Residential Low Density Park 
       RM   Residential Medium Density
    

Let grep the lines with MS

!grep MS {path/'data_description.txt'}
MSSubClass: Identifies the type of dwelling involved in the sale.   
MSZoning: Identifies the general zoning classification of the sale.
!rg MS {path/'data_description.txt'}
1:MSSubClass: Identifies the type of dwelling involved in the sale. 
20:MSZoning: Identifies the general zoning classification of the sale.

Read the text file and split lines into a list

desc = (path/'data_description.txt').read_text().splitlines()
desc[:30]
['MSSubClass: Identifies the type of dwelling involved in the sale.\t',
 '',
 '        20\t1-STORY 1946 & NEWER ALL STYLES',
 '        30\t1-STORY 1945 & OLDER',
 '        40\t1-STORY W/FINISHED ATTIC ALL AGES',
 '        45\t1-1/2 STORY - UNFINISHED ALL AGES',
 '        50\t1-1/2 STORY FINISHED ALL AGES',
 '        60\t2-STORY 1946 & NEWER',
 '        70\t2-STORY 1945 & OLDER',
 '        75\t2-1/2 STORY ALL AGES',
 '        80\tSPLIT OR MULTI-LEVEL',
 '        85\tSPLIT FOYER',
 '        90\tDUPLEX - ALL STYLES AND AGES',
 '       120\t1-STORY PUD (Planned Unit Development) - 1946 & NEWER',
 '       150\t1-1/2 STORY PUD - ALL AGES',
 '       160\t2-STORY PUD - 1946 & NEWER',
 '       180\tPUD - MULTILEVEL - INCL SPLIT LEV/FOYER',
 '       190\t2 FAMILY CONVERSION - ALL STYLES AND AGES',
 '',
 'MSZoning: Identifies the general zoning classification of the sale.',
 '\t\t',
 '       A\tAgriculture',
 '       C\tCommercial',
 '       FV\tFloating Village Residential',
 '       I\tIndustrial',
 '       RH\tResidential High Density',
 '       RL\tResidential Low Density',
 '       RP\tResidential Low Density Park ',
 '       RM\tResidential Medium Density',
 '\t']

Remove all the empty lines.

desc = [i for i in desc if len(i.strip()) != 0]
desc[:30]
['MSSubClass: Identifies the type of dwelling involved in the sale.\t',
 '        20\t1-STORY 1946 & NEWER ALL STYLES',
 '        30\t1-STORY 1945 & OLDER',
 '        40\t1-STORY W/FINISHED ATTIC ALL AGES',
 '        45\t1-1/2 STORY - UNFINISHED ALL AGES',
 '        50\t1-1/2 STORY FINISHED ALL AGES',
 '        60\t2-STORY 1946 & NEWER',
 '        70\t2-STORY 1945 & OLDER',
 '        75\t2-1/2 STORY ALL AGES',
 '        80\tSPLIT OR MULTI-LEVEL',
 '        85\tSPLIT FOYER',
 '        90\tDUPLEX - ALL STYLES AND AGES',
 '       120\t1-STORY PUD (Planned Unit Development) - 1946 & NEWER',
 '       150\t1-1/2 STORY PUD - ALL AGES',
 '       160\t2-STORY PUD - 1946 & NEWER',
 '       180\tPUD - MULTILEVEL - INCL SPLIT LEV/FOYER',
 '       190\t2 FAMILY CONVERSION - ALL STYLES AND AGES',
 'MSZoning: Identifies the general zoning classification of the sale.',
 '       A\tAgriculture',
 '       C\tCommercial',
 '       FV\tFloating Village Residential',
 '       I\tIndustrial',
 '       RH\tResidential High Density',
 '       RL\tResidential Low Density',
 '       RP\tResidential Low Density Park ',
 '       RM\tResidential Medium Density',
 'LotFrontage: Linear feet of street connected to property',
 'LotArea: Lot size in square feet',
 'Street: Type of road access to property',
 '       Grvl\tGravel\t']

After looking at the above text file, we noticed that the data types of some independent variables are incorrect.

Let fix it.

Define a function to extract all the descriptions about the 79 explanatory variables and check the data type.

def var_desc(txt, ind_vars):
    d = {}
    cat = {}
    for i, t in enumerate(txt):
        rt = re.findall(r'(\S+): ([^\t?]+)', t)
        if rt:
            if rt[0][0] in ind_vars:
                d[rt[0][0]] = rt[0][1]
                if txt[i+1].startswith(' '):
                    cat[rt[0][0]] = 'Category'
    df = pd.DataFrame(d.items(), columns=['Variable', 'Variabe_Description'])
    df['Variable_Type'] = df.Variable.map(cat).fillna('Numerical')
    return df
df = var_desc(desc, ind_vars)
df.head()
Variable Variabe_Description Variable_Type
0 MSSubClass Identifies the type of dwelling involved in th... Category
1 MSZoning Identifies the general zoning classification o... Category
2 LotFrontage Linear feet of street connected to property Numerical
3 LotArea Lot size in square feet Numerical
4 Street Type of road access to property Category
len(df)
77
df.to_csv(path/f'vars_desc.csv', index=False)
cat_vars = df['Variable'][df['Variable_Type'] == 'Category'].values
cat_vars, len(cat_vars)
(array(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape',
        'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
        'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
        'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle',
        'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
        'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
        'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
        'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
        'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
        'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence',
        'MiscFeature', 'SaleType', 'SaleCondition'], dtype=object),
 46)
cat_cols, len(cat_cols)
(Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
        'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
        'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
        'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
        'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
        'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
        'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
        'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
        'SaleType', 'SaleCondition'],
       dtype='object'),
 43)
[i for i in cat_cols if i not in cat_vars]
[]
[i for i in cat_vars if i not in cat_cols]
['MSSubClass', 'OverallQual', 'OverallCond']

After comparing the results of data types from the original train set and text file, we get the category variables.

cat_vars
array(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape',
       'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
       'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence',
       'MiscFeature', 'SaleType', 'SaleCondition'], dtype=object)

Define a function to extract all the information about each category variable’s codes.

def cat_desc(txt, col_name):
    d = {}
    # find the column name match the category variable
    for i in range(len(txt)):
        rt1 = re.match(col_name, txt[i])
        if rt1:
#             print(rt1.group(0), i) # debugging
            j = i+1
    # extract the info about the the corresponding category variable
    for t in txt[j:]:
        if t.startswith(' '): # check if the text start with white space
            # find two parts (the code and code description) using regex
            # the code might be 'NA' , digits or any word characters
            rt2 = re.findall(r'\s+(\d+\.\d+\w+|NA\s+|\w+\&?\w*?\s*?)\t(.*)', t, flags=re.IGNORECASE) 
            d[rt2[0][0].strip()] = (rt2[0][1]).strip()
        else:
            break
    # return a dataframe of the category variable
    return pd.DataFrame(d.items(), columns=[col_name, col_name + '_Description'])

Test the function with category variable MSSubClass and SaleCondition

col_name = 'MSSubClass'
cat_desc(desc, col_name)
MSSubClass MSSubClass_Description
0 20 1-STORY 1946 & NEWER ALL STYLES
1 30 1-STORY 1945 & OLDER
2 40 1-STORY W/FINISHED ATTIC ALL AGES
3 45 1-1/2 STORY - UNFINISHED ALL AGES
4 50 1-1/2 STORY FINISHED ALL AGES
5 60 2-STORY 1946 & NEWER
6 70 2-STORY 1945 & OLDER
7 75 2-1/2 STORY ALL AGES
8 80 SPLIT OR MULTI-LEVEL
9 85 SPLIT FOYER
10 90 DUPLEX - ALL STYLES AND AGES
11 120 1-STORY PUD (Planned Unit Development) - 1946 ...
12 150 1-1/2 STORY PUD - ALL AGES
13 160 2-STORY PUD - 1946 & NEWER
14 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
15 190 2 FAMILY CONVERSION - ALL STYLES AND AGES
col_name = 'SaleCondition'
cat_desc(desc, col_name)
SaleCondition SaleCondition_Description
0 Normal Normal Sale
1 Abnorml Abnormal Sale - trade, foreclosure, short sale
2 AdjLand Adjoining Land Purchase
3 Alloca Allocation - two linked properties with separa...
4 Family Sale between family members
5 Partial Home was not completed when last assessed (ass...

Create pandas dataframe and save them as csv files.

for cat_var in cat_vars:
    cat_df = cat_desc(desc, cat_var)
    cat_df.to_csv(path/f'{cat_var}.csv', index=False)
!ls {path}
Alley.csv             Foundation.csv        MiscFeature.csv
BldgType.csv          Functional.csv        Neighborhood.csv
BsmtCond.csv          GarageCond.csv        OverallCond.csv
BsmtExposure.csv      GarageFinish.csv      OverallQual.csv
BsmtFinType1.csv      GarageQual.csv        PavedDrive.csv
BsmtFinType2.csv      GarageType.csv        PoolQC.csv
BsmtQual.csv          Heating.csv           RoofMatl.csv
CentralAir.csv        HeatingQC.csv         RoofStyle.csv
Condition1.csv        HouseStyle.csv        SaleCondition.csv
Condition2.csv        KitchenQual.csv       SaleType.csv
Electrical.csv        LandContour.csv       Street.csv
ExterCond.csv         LandSlope.csv         Utilities.csv
ExterQual.csv         LotConfig.csv         data_description.txt
Exterior1st.csv       LotShape.csv          sample_submission.csv
Exterior2nd.csv       MSSubClass.csv        test.csv
Fence.csv             MSZoning.csv          train.csv
FireplaceQu.csv       MasVnrType.csv        vars_desc.csv