Pre-Processing

Step-by-step

Data pre-processing

Census Data

The data cover all of Mexico City and include a sub-unit (manzana, city block) that we do not consider. After importing, select the AGEB level, replace the special character * with NaN values, convert the non-string columns to the float data type, and save the file.

import pandas as pd
import numpy as np

The INEGI portal, inegi.org.mx, has been down for a while, so use the local copy of the data.

df = pd.read_csv('data/RESAGEBURB_09XLSX20.csv')
df.head(3)
CVEGEO ENTIDAD NOM_ENT MUN NOM_MUN LOC NOM_LOC AGEB MZA POBTOT ... VPH_TELEF VPH_CEL VPH_INTER VPH_STVP VPH_SPMVPI VPH_CVJ VPH_SINRTV VPH_SINLTC VPH_SINCINT VPH_SINTIC
0 0900000000000 9 Ciudad de México 0 Total de la entidad Ciudad de México 0 Total de la entidad 0000 0 9209944 ... 1898265 2536523 2084156 1290811 957162 568827 46172 77272 561128 10528
1 0900200000000 9 Ciudad de México 2 Azcapotzalco 0 Total del municipio 0000 0 432205 ... 96128 123961 105899 66399 50965 31801 1661 2869 22687 322
2 0900200010000 9 Ciudad de México 2 Azcapotzalco 1 Total de la localidad urbana 0000 0 432205 ... 96128 123961 105899 66399 50965 31801 1661 2869 22687 322

3 rows × 231 columns

df = df[df['NOM_LOC'] == 'Total AGEB urbana']   # keep AGEB-level rows only
df = df.replace('*', np.nan)                    # '*' marks suppressed values
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2433 entries, 3 to 68915
Columns: 231 entries, CVEGEO to VPH_SINTIC
dtypes: int64(6), object(225)
memory usage: 4.3+ MB

Most of the columns are objects. We need to convert them to numbers.

df1 = df.iloc[:, :9]                  # keep the identifier columns as strings
df2 = df.iloc[:, 9:]                  # numeric indicators
df2 = df2.fillna(0).astype(float)     # NaN -> 0, then cast to float
df = pd.concat([df1, df2], axis=1)
#save file
df.to_csv('data/census_data2020.csv', index = False)
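The replace-then-cast pattern above can be exercised in isolation. This is a minimal self-contained sketch with toy data (the column names and values here are hypothetical, not the real INEGI file):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the census layout: '*' masks suppressed values.
toy = pd.DataFrame({
    'AGEB': ['0010', '0025'],
    'POBTOT': ['1500', '*'],      # suppressed cell marked with '*'
    'VPH_CEL': ['820', '430'],
})

id_cols = toy[['AGEB']]                       # keep identifiers as strings
num_cols = (toy.drop(columns=['AGEB'])
               .replace('*', np.nan)          # '*' -> NaN
               .fillna(0)
               .astype(float))                # everything numeric now
clean = pd.concat([id_cols, num_cols], axis=1)
print(clean['POBTOT'].tolist())
```

Keeping the identifier columns out of the cast matters: codes like AGEB have leading zeros that a numeric conversion would silently destroy.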

VAW Data

The VAW data for Mexico City are part of a general crime-victim dataset. This means that I have to identify which rows are relevant to the project based on the description of the crime and the victim's gender.

# Read data
vaw = pd.read_csv('data/victimas_completa_marzo_2022.csv')
vaw.head()
idCarpeta Año_inicio Mes_inicio FechaInicio Delito Categoria Sexo Edad TipoPersona CalidadJuridica ... Mes_hecho FechaHecho HoraHecho HoraInicio AlcaldiaHechos ColoniaHechos Calle_hechos Calle_hechos2 latitud longitud
0 8324429.0 2019.0 Enero 04/01/2019 FRAUDE DELITO DE BAJO IMPACTO Masculino 62.0 FISICA OFENDIDO ... Agosto 29/08/2018 12:00:00 12:19:00 ALVARO OBREGON GUADALUPE INN INSUGENTES SUR NaN 19.36125 -99.18314
1 8324430.0 2019.0 Enero 04/01/2019 PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI... DELITO DE BAJO IMPACTO Femenino 38.0 FISICA VICTIMA Y DENUNCIANTE ... Diciembre 15/12/2018 15:00:00 12:20:00 AZCAPOTZALCO VICTORIA DE LAS DEMOCRACIAS AV. CUATLAHUAC NaN 19.47181 -99.16458
2 8324431.0 2019.0 Enero 04/01/2019 ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE... ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ... Masculino 42.0 FISICA VICTIMA Y DENUNCIANTE ... Diciembre 22/12/2018 15:30:00 12:23:00 COYOACAN COPILCO UNIVERSIDAD ISSSTE COPILCO NaN 19.33797 -99.18611
3 8324435.0 2019.0 Enero 04/01/2019 ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI... ROBO DE VEHÍCULO CON Y SIN VIOLENCIA Masculino 35.0 FISICA VICTIMA Y DENUNCIANTE ... Enero 04/01/2019 06:00:00 12:27:00 IZTACALCO AGRÍCOLA PANTITLAN CALLE 6 ENTRE PRIVADA DEL VALLE Y PRIVADA GONZALEZ 19.40327 -99.05983
4 8324438.0 2019.0 Enero 04/01/2019 ROBO DE MOTOCICLETA SIN VIOLENCIA ROBO DE VEHÍCULO CON Y SIN VIOLENCIA Masculino NaN FISICA VICTIMA ... Enero 03/01/2019 20:00:00 12:35:00 IZTAPALAPA PROGRESISTA UNIVERSIDAD NaN 19.35480 -99.06324

5 rows × 22 columns

Inspect the data with the info() method to get a better overview.

vaw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 780439 entries, 0 to 780438
Data columns (total 22 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   idCarpeta        780439 non-null  float64
 1   Año_inicio       780438 non-null  float64
 2   Mes_inicio       780438 non-null  object 
 3   FechaInicio      780438 non-null  object 
 4   Delito           780439 non-null  object 
 5   Categoria        780439 non-null  object 
 6   Sexo             628287 non-null  object 
 7   Edad             495406 non-null  float64
 8   TipoPersona      774405 non-null  object 
 9   CalidadJuridica  780438 non-null  object 
 10  competencia      780439 non-null  object 
 11  Año_hecho        780120 non-null  float64
 12  Mes_hecho        780120 non-null  object 
 13  FechaHecho       780120 non-null  object 
 14  HoraHecho        780130 non-null  object 
 15  HoraInicio       780438 non-null  object 
 16  AlcaldiaHechos   779234 non-null  object 
 17  ColoniaHechos    745330 non-null  object 
 18  Calle_hechos     778237 non-null  object 
 19  Calle_hechos2    280532 non-null  object 
 20  latitud          745544 non-null  float64
 21  longitud         745542 non-null  float64
dtypes: float64(6), object(16)
memory usage: 131.0+ MB

There are 780,439 observations, which is quite a lot! Notice that the non-null counts for latitude and longitude are lower than the total number of entries. Therefore, drop all rows that have a NaN value in these columns and run a preliminary sub-selection, keeping female victims and the latest year available.

From the data structure, I know that the “Año_inicio” field corresponds to the date the investigation was opened, while “Año_hecho” represents the date on which the crime was committed. So go with the latter.

# drop rows with NaN coordinates
vaw = vaw.dropna(subset=['latitud', 'longitud'])

# sort unique years in the data
year = sorted(vaw['Año_hecho'].unique())
print("Last year available: ", year[-1])
Last year available:  2022.0

Filter the data accordingly:

vaw = vaw[vaw['Año_hecho']== 2021]
vaw = vaw[vaw['Sexo']=='Femenino']
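An alternative to filtering on the precomputed “Año_hecho” column is to derive the year from “FechaHecho” directly. A minimal sketch, assuming the dd/mm/yyyy format visible in the sample rows above (the dates here are illustrative):

```python
import pandas as pd

# FechaHecho uses day/month/year, so an explicit format avoids
# pandas guessing month-first.
s = pd.Series(['04/01/2019', '22/12/2018', '17/05/2021'])
dates = pd.to_datetime(s, format='%d/%m/%Y')
print(dates.dt.year.tolist())
```

Parsing the full date also lets you filter by month or weekday later without extra columns.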

Abortion in Mexico: a Crime?

Since 2021, abortion is no longer a federal crime in Mexico, although its legal status still varies by state (source).

Yet, in our data it is still labelled as a crime:

vaw[vaw['Delito']== 'ABORTO'].sample(1)
idCarpeta Año_inicio Mes_inicio FechaInicio Delito Categoria Sexo Edad TipoPersona CalidadJuridica ... Mes_hecho FechaHecho HoraHecho HoraInicio AlcaldiaHechos ColoniaHechos Calle_hechos Calle_hechos2 latitud longitud
566672 8929948.0 2021.0 Mayo 17/05/2021 ABORTO DELITO DE BAJO IMPACTO Femenino NaN FISICA VICTIMA ... Mayo 17/05/2021 12:00:00 17:49:00 VENUSTIANO CARRANZA AMPLIACION ADOLFO LOPEZ MATEOS EJE 4 OTE. RIO CHURUBUSCO ESQUINA CON LA CALLE... NaN 19.416515 -99.073591

1 rows × 22 columns

Whether abortion should be treated as a crime is beyond the scope of this project, but you may find it interesting too.

Back to Data Preparation

Select the VAW crimes whose description matches the Spanish terms for “sexual”, “domestic” (“familiar”) and “femicide” (“feminicidio”), then print the results.

vaw = vaw[vaw['Delito'].str.contains('sexual|familiar|feminicidio', case=False)]

print('List of crimes after text query:\n', vaw['Delito'].unique())
List of crimes after text query:
 ['VIOLENCIA FAMILIAR' 'ABUSO SEXUAL' 'ACOSO SEXUAL'
 'CONTRA LA INTIMIDAD SEXUAL' 'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES'
 'FEMINICIDIO' 'FEMINICIDIO POR ARMA BLANCA' 'TENTATIVA DE FEMINICIDIO'
 'FEMINICIDIO POR DISPARO DE ARMA DE FUEGO'
 'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)'
 'FEMINICIDIO POR GOLPES']

Using str.contains() as a filtering method in data preparation is generally not recommended, because the chance of missing important information is real: strings can contain special characters or accents that the pattern fails to match. In this case, I didn’t find any viable alternative.
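One way to make the pattern matching more robust is to strip diacritics before matching, so that accented spellings do not slip through. A small sketch with hypothetical crime labels (not the project’s filter, just an illustration of the caveat):

```python
import unicodedata
import pandas as pd

def strip_accents(text: str) -> str:
    """Remove diacritics so 'VIOLACIÓN' matches the pattern 'violacion'."""
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if not unicodedata.combining(c))

s = pd.Series(['ABUSO SEXUAL', 'VIOLACIÓN', 'ROBO'])
mask = s.map(strip_accents).str.contains('sexual|violacion', case=False)
print(s[mask].tolist())
```

The mask is built on the normalised text, but the selection keeps the original accented strings intact.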

Let’s now begin to translate the DataFrame from Spanish to English, starting with the column names:

# select columns and rename
vaw = vaw[['FechaHecho', 'Delito', 'latitud', 'longitud']]
vaw.columns = ['Time', 'Crime', 'latitude', 'longitude']
vaw
Time Crime latitude longitude
480286 01/01/2021 VIOLENCIA FAMILIAR 19.198153 -99.026277
480289 01/01/2021 VIOLENCIA FAMILIAR 19.193027 -99.095006
480302 01/01/2021 VIOLENCIA FAMILIAR 19.336963 -99.152265
480306 01/01/2021 VIOLENCIA FAMILIAR 19.561525 -99.146590
480320 01/01/2021 VIOLENCIA FAMILIAR 19.395045 -99.061060
... ... ... ... ...
779813 08/12/2021 ABUSO SEXUAL 19.282468 -99.010266
780165 31/03/2021 VIOLENCIA FAMILIAR 19.371821 -99.071966
780166 01/12/2021 VIOLENCIA FAMILIAR 19.259511 -99.101052
780286 19/07/2021 ABUSO SEXUAL 19.352580 -99.019935
780320 12/12/2021 ABUSO SEXUAL 19.325796 -99.069686

31613 rows × 4 columns

and continuing with the values. List all the unique values of ‘Crime’:

vaw['Crime'].unique()
array(['VIOLENCIA FAMILIAR', 'ABUSO SEXUAL', 'ACOSO SEXUAL',
       'CONTRA LA INTIMIDAD SEXUAL',
       'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES', 'FEMINICIDIO',
       'FEMINICIDIO POR ARMA BLANCA', 'TENTATIVA DE FEMINICIDIO',
       'FEMINICIDIO POR DISPARO DE ARMA DE FUEGO',
       'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)',
       'FEMINICIDIO POR GOLPES'], dtype=object)

Several types of femicide are present. I use a simple regular expression to collapse them into a single category and replace it with the English word.

vaw['Crime'] = vaw['Crime'].str.replace(r'.*FEMINICIDIO.*', r'FEMICIDE', regex = True)

Conclude the translation for the other crime types.

from_text = ['VIOLENCIA FAMILIAR', 'ABUSO SEXUAL', 'ACOSO SEXUAL',
             'CONTRA LA INTIMIDAD SEXUAL', 'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES',
             'FEMICIDE', 'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)']

replace_with = ['DOMESTIC VIOLENCE', 'SEXUAL ABUSE', 'SEXUAL HARASSMENT',
        'AGAINST SEXUAL INTIMACY', 'AGGRAVATED SEXUAL HARASSMENT AGAINST MINORS', 'FEMICIDE',
        'DEPRIVATION OF PERSONAL LIBERTY (PERFORMING A SEXUAL ACT)']

vaw = vaw.replace(dict(zip(from_text, replace_with)))
vaw.head()
Time Crime latitude longitude
480286 01/01/2021 DOMESTIC VIOLENCE 19.198153 -99.026277
480289 01/01/2021 DOMESTIC VIOLENCE 19.193027 -99.095006
480302 01/01/2021 DOMESTIC VIOLENCE 19.336963 -99.152265
480306 01/01/2021 DOMESTIC VIOLENCE 19.561525 -99.146590
480320 01/01/2021 DOMESTIC VIOLENCE 19.395045 -99.061060
#save file
vaw.to_csv('data/vaw_filtered.csv', index = False)
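A quick way to confirm that the dictionary-based translation covered every label is to check for values that are not in the target vocabulary. A self-contained sketch with a reduced, hypothetical version of the mapping:

```python
import pandas as pd

# Mini-version of the crime column after the femicide merge (toy data).
crimes = pd.Series(['VIOLENCIA FAMILIAR', 'ABUSO SEXUAL', 'FEMICIDE'])
mapping = {'VIOLENCIA FAMILIAR': 'DOMESTIC VIOLENCE',
           'ABUSO SEXUAL': 'SEXUAL ABUSE',
           'FEMICIDE': 'FEMICIDE'}

translated = crimes.replace(mapping)
# Any label that slipped through the mapping ends up here.
unmapped = sorted(set(translated) - set(mapping.values()))
print(translated.tolist(), unmapped)
```

An empty `unmapped` list means no Spanish label survived the replacement; if new crime types appear in a future data release, they show up here immediately.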

Neighbourhood

Add the total population per unit to the neighbourhood variables by merging with the ‘Population per Neighbourhood’ table.

import geopandas as gpd
# Read the data
gdf = gpd.read_file('data/alcaldías_cdmx/alcaldias_cdmx.shp')
gdf = gdf.rename(columns = {'nomgeo':'NOMGEO'})
#gdf.head(5)

Population per Neighbourhood

df = pd.read_csv('data/poblacion_total_tasa_crecimiento_1.1.csv')
df = df[df['Año'] == 2020]
df = df.reset_index(drop=True)
#df
# drop the last summary row
df = df.drop([16])
df = df.rename(columns = {'Alcaldia':'NOMGEO', 'Población total': 'POBTOT' })
df = df.apply(lambda x: x.astype(str).str.upper())  # uppercase names to match the shapefile
merge = gdf.merge(df, on='NOMGEO')
merge = merge.rename(columns = {'cvegeo':'CVEGEO'})
merge['POBTOT'] = merge['POBTOT'].str.replace(',', '').astype(int)  # strip thousands separators
gdf = merge[['CVEGEO', 'NOMGEO', 'POBTOT', 'geometry']]
# Write to GeoJSON

gdf.to_file("data/CDMX_municipalities.gjson", driver="GeoJSON")
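The uppercase-and-merge pattern used above can be demonstrated with plain pandas, without the shapefiles or geopandas. The names and population figures below are toy values for illustration only:

```python
import pandas as pd

# Stand-in for the shapefile attribute table (already uppercase).
gdf_like = pd.DataFrame({'NOMGEO': ['AZCAPOTZALCO', 'COYOACAN']})
# Stand-in for the population CSV: mixed case names, comma separators.
pop = pd.DataFrame({'NOMGEO': ['Azcapotzalco', 'Coyoacan'],
                    'POBTOT': ['432,205', '614,447']})

pop['NOMGEO'] = pop['NOMGEO'].str.upper()            # align the join keys
merged = gdf_like.merge(pop, on='NOMGEO')
merged['POBTOT'] = merged['POBTOT'].str.replace(',', '').astype(int)
print(merged['POBTOT'].tolist())
```

Normalising the case of the join key before merging is what prevents silently dropped rows: a default inner merge keeps only names that match exactly.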