Pre-Processing

Step-by-step

Data pre-processing

Census Data

The data cover all of Mexico City and include a sub-unit (manzana, city block) that we do not consider. After importing, select the AGEB level, replace the special character * with NaN values, convert the non-string columns to the float data type, and save the file.

import pandas as pd
import numpy as np

The INEGI portal, inegi.org.mx, has been down for a while, so use the local copy of the data.

df = pd.read_csv('data/RESAGEBURB_09XLSX20.csv')
df.head(3)
CVEGEO ENTIDAD NOM_ENT MUN NOM_MUN LOC NOM_LOC AGEB MZA POBTOT ... VPH_TELEF VPH_CEL VPH_INTER VPH_STVP VPH_SPMVPI VPH_CVJ VPH_SINRTV VPH_SINLTC VPH_SINCINT VPH_SINTIC
0 0900000000000 9 Ciudad de México 0 Total de la entidad Ciudad de México 0 Total de la entidad 0000 0 9209944 ... 1898265 2536523 2084156 1290811 957162 568827 46172 77272 561128 10528
1 0900200000000 9 Ciudad de México 2 Azcapotzalco 0 Total del municipio 0000 0 432205 ... 96128 123961 105899 66399 50965 31801 1661 2869 22687 322
2 0900200010000 9 Ciudad de México 2 Azcapotzalco 1 Total de la localidad urbana 0000 0 432205 ... 96128 123961 105899 66399 50965 31801 1661 2869 22687 322

3 rows × 231 columns

df = df[df['NOM_LOC'] == 'Total AGEB urbana']   # keep AGEB-level rows only
df = df.replace('*', np.nan)                    # '*' marks suppressed values
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2433 entries, 3 to 68915
Columns: 231 entries, CVEGEO to VPH_SINTIC
dtypes: int64(6), object(225)
memory usage: 4.3+ MB

Most of the columns are objects. We need to convert them to numbers.

df1 = df.iloc[:, :9]                  # keep the identifier columns as strings
df2 = df.iloc[:, 9:]                  # numeric indicators
df2 = df2.fillna(0).astype(float)     # NaN -> 0, then cast to float
df = pd.concat([df1, df2], axis=1)
#save file
df.to_csv('data/census_data2020.csv', index = False)
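The replace-then-cast pattern above can be exercised in isolation. This is a minimal self-contained sketch with toy data (the column names and values here are hypothetical, not the real INEGI file):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the census layout: '*' masks suppressed values.
toy = pd.DataFrame({
    'AGEB': ['0010', '0025'],
    'POBTOT': ['1500', '*'],      # suppressed cell marked with '*'
    'VPH_CEL': ['820', '430'],
})

id_cols = toy[['AGEB']]                       # keep identifiers as strings
num_cols = (toy.drop(columns=['AGEB'])
               .replace('*', np.nan)          # '*' -> NaN
               .fillna(0)
               .astype(float))                # everything numeric now
clean = pd.concat([id_cols, num_cols], axis=1)
print(clean['POBTOT'].tolist())
```

Keeping the identifier columns out of the cast matters: codes like AGEB have leading zeros that a numeric conversion would silently destroy.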

VAW Data

The VAW data for Mexico City are part of a general crime-victim dataset. This means that I have to identify which rows are relevant to the project based on the description of the crime and the victim's gender.

# Read data
vaw = pd.read_csv('data/victimas_completa_marzo_2022.csv')
vaw.head()
idCarpeta Año_inicio Mes_inicio FechaInicio Delito Categoria Sexo Edad TipoPersona CalidadJuridica ... Mes_hecho FechaHecho HoraHecho HoraInicio AlcaldiaHechos ColoniaHechos Calle_hechos Calle_hechos2 latitud longitud
0 8324429.0 2019.0 Enero 04/01/2019 FRAUDE DELITO DE BAJO IMPACTO Masculino 62.0 FISICA OFENDIDO ... Agosto 29/08/2018 12:00:00 12:19:00 ALVARO OBREGON GUADALUPE INN INSUGENTES SUR NaN 19.36125 -99.18314
1 8324430.0 2019.0 Enero 04/01/2019 PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI... DELITO DE BAJO IMPACTO Femenino 38.0 FISICA VICTIMA Y DENUNCIANTE ... Diciembre 15/12/2018 15:00:00 12:20:00 AZCAPOTZALCO VICTORIA DE LAS DEMOCRACIAS AV. CUATLAHUAC NaN 19.47181 -99.16458
2 8324431.0 2019.0 Enero 04/01/2019 ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE... ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ... Masculino 42.0 FISICA VICTIMA Y DENUNCIANTE ... Diciembre 22/12/2018 15:30:00 12:23:00 COYOACAN COPILCO UNIVERSIDAD ISSSTE COPILCO NaN 19.33797 -99.18611
3 8324435.0 2019.0 Enero 04/01/2019 ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI... ROBO DE VEHÍCULO CON Y SIN VIOLENCIA Masculino 35.0 FISICA VICTIMA Y DENUNCIANTE ... Enero 04/01/2019 06:00:00 12:27:00 IZTACALCO AGRÍCOLA PANTITLAN CALLE 6 ENTRE PRIVADA DEL VALLE Y PRIVADA GONZALEZ 19.40327 -99.05983
4 8324438.0 2019.0 Enero 04/01/2019 ROBO DE MOTOCICLETA SIN VIOLENCIA ROBO DE VEHÍCULO CON Y SIN VIOLENCIA Masculino NaN FISICA VICTIMA ... Enero 03/01/2019 20:00:00 12:35:00 IZTAPALAPA PROGRESISTA UNIVERSIDAD NaN 19.35480 -99.06324

5 rows × 22 columns

Inspect the data with the info() method to get a better overview.

vaw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 780439 entries, 0 to 780438
Data columns (total 22 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   idCarpeta        780439 non-null  float64
 1   Año_inicio       780438 non-null  float64
 2   Mes_inicio       780438 non-null  object 
 3   FechaInicio      780438 non-null  object 
 4   Delito           780439 non-null  object 
 5   Categoria        780439 non-null  object 
 6   Sexo             628287 non-null  object 
 7   Edad             495406 non-null  float64
 8   TipoPersona      774405 non-null  object 
 9   CalidadJuridica  780438 non-null  object 
 10  competencia      780439 non-null  object 
 11  Año_hecho        780120 non-null  float64
 12  Mes_hecho        780120 non-null  object 
 13  FechaHecho       780120 non-null  object 
 14  HoraHecho        780130 non-null  object 
 15  HoraInicio       780438 non-null  object 
 16  AlcaldiaHechos   779234 non-null  object 
 17  ColoniaHechos    745330 non-null  object 
 18  Calle_hechos     778237 non-null  object 
 19  Calle_hechos2    280532 non-null  object 
 20  latitud          745544 non-null  float64
 21  longitud         745542 non-null  float64
dtypes: float64(6), object(16)
memory usage: 131.0+ MB

There are 780,439 observations, which is quite a lot! Notice that the non-null counts for latitude and longitude are lower than the total number of entries. Therefore, drop all rows that have a NaN value in these columns and run a preliminary sub-selection, keeping female victims and the latest year available.

From the data structure, I know that the “Año_inicio” field corresponds to the date the investigation was opened, while “Año_hecho” represents the date on which the crime was committed. So go with the latter.

# drop rows with NaN coordinates
vaw = vaw.dropna(subset=['latitud', 'longitud'])

# sort unique years in the data
year = sorted(vaw['Año_hecho'].unique())
print("Last year available: ", year[-1])
Last year available:  2022.0

Filter the data accordingly:

vaw = vaw[vaw['Año_hecho']== 2021]
vaw = vaw[vaw['Sexo']=='Femenino']
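An alternative to filtering on the precomputed “Año_hecho” column is to derive the year from “FechaHecho” directly. A minimal sketch, assuming the dd/mm/yyyy format visible in the sample rows above (the dates here are illustrative):

```python
import pandas as pd

# FechaHecho uses day/month/year, so an explicit format avoids
# pandas guessing month-first.
s = pd.Series(['04/01/2019', '22/12/2018', '17/05/2021'])
dates = pd.to_datetime(s, format='%d/%m/%Y')
print(dates.dt.year.tolist())
```

Parsing the full date also lets you filter by month or weekday later without extra columns.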

Abortion in Mexico: a Crime?

Since 2021, abortion is no longer a federal crime in Mexico, although its legal status still varies by state (source).

Yet, in our data it is still labelled as a crime:

vaw[vaw['Delito']== 'ABORTO'].sample(1)
idCarpeta Año_inicio Mes_inicio FechaInicio Delito Categoria Sexo Edad TipoPersona CalidadJuridica ... Mes_hecho FechaHecho HoraHecho HoraInicio AlcaldiaHechos ColoniaHechos Calle_hechos Calle_hechos2 latitud longitud
566672 8929948.0 2021.0 Mayo 17/05/2021 ABORTO DELITO DE BAJO IMPACTO Femenino NaN FISICA VICTIMA ... Mayo 17/05/2021 12:00:00 17:49:00 VENUSTIANO CARRANZA AMPLIACION ADOLFO LOPEZ MATEOS EJE 4 OTE. RIO CHURUBUSCO ESQUINA CON LA CALLE... NaN 19.416515 -99.073591

1 rows × 22 columns

Whether abortion should be treated as a crime is beyond the scope of this project, but you may find it interesting too.

Back to Data Preparation

Select the VAW crimes whose description matches the Spanish terms for “sexual”, “domestic” (“familiar”) and “femicide” (“feminicidio”), then print the results.

vaw = vaw[vaw['Delito'].str.contains('sexual|familiar|feminicidio', case=False)]

print('List of crimes after text query:\n', vaw['Delito'].unique())
List of crimes after text query:
 ['VIOLENCIA FAMILIAR' 'ABUSO SEXUAL' 'ACOSO SEXUAL'
 'CONTRA LA INTIMIDAD SEXUAL' 'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES'
 'FEMINICIDIO' 'FEMINICIDIO POR ARMA BLANCA' 'TENTATIVA DE FEMINICIDIO'
 'FEMINICIDIO POR DISPARO DE ARMA DE FUEGO'
 'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)'
 'FEMINICIDIO POR GOLPES']

Using str.contains() as a filtering method in data preparation is generally not recommended, because the chance of missing important information is real: strings can contain special characters or accents that the pattern fails to match. In this case, I didn’t find any viable alternative.
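One way to make the pattern matching more robust is to strip diacritics before matching, so that accented spellings do not slip through. A small sketch with hypothetical crime labels (not the project’s filter, just an illustration of the caveat):

```python
import unicodedata
import pandas as pd

def strip_accents(text: str) -> str:
    """Remove diacritics so 'VIOLACIÓN' matches the pattern 'violacion'."""
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if not unicodedata.combining(c))

s = pd.Series(['ABUSO SEXUAL', 'VIOLACIÓN', 'ROBO'])
mask = s.map(strip_accents).str.contains('sexual|violacion', case=False)
print(s[mask].tolist())
```

The mask is built on the normalised text, but the selection keeps the original accented strings intact.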

Let’s now begin to translate the DataFrame from Spanish to English, starting with the column names:

# select columns and rename
vaw = vaw[['FechaHecho', 'Delito', 'latitud', 'longitud']]
vaw.columns = ['Time', 'Crime', 'latitude', 'longitude']
vaw
Time Crime latitude longitude
480286 01/01/2021 VIOLENCIA FAMILIAR 19.198153 -99.026277
480289 01/01/2021 VIOLENCIA FAMILIAR 19.193027 -99.095006
480302 01/01/2021 VIOLENCIA FAMILIAR 19.336963 -99.152265
480306 01/01/2021 VIOLENCIA FAMILIAR 19.561525 -99.146590
480320 01/01/2021 VIOLENCIA FAMILIAR 19.395045 -99.061060
... ... ... ... ...
779813 08/12/2021 ABUSO SEXUAL 19.282468 -99.010266
780165 31/03/2021 VIOLENCIA FAMILIAR 19.371821 -99.071966
780166 01/12/2021 VIOLENCIA FAMILIAR 19.259511 -99.101052
780286 19/07/2021 ABUSO SEXUAL 19.352580 -99.019935
780320 12/12/2021 ABUSO SEXUAL 19.325796 -99.069686

31613 rows × 4 columns

and continuing with the values. List all the unique values of ‘Crime’:

vaw['Crime'].unique()
array(['VIOLENCIA FAMILIAR', 'ABUSO SEXUAL', 'ACOSO SEXUAL',
       'CONTRA LA INTIMIDAD SEXUAL',
       'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES', 'FEMINICIDIO',
       'FEMINICIDIO POR ARMA BLANCA', 'TENTATIVA DE FEMINICIDIO',
       'FEMINICIDIO POR DISPARO DE ARMA DE FUEGO',
       'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)',
       'FEMINICIDIO POR GOLPES'], dtype=object)

Several types of femicide are present. I use a simple regular expression to collapse them into a single category and replace it with the English word.

vaw['Crime'] = vaw['Crime'].str.replace(r'.*FEMINICIDIO.*', r'FEMICIDE', regex = True)

Conclude the translation for the other crime types.

from_text = ['VIOLENCIA FAMILIAR', 'ABUSO SEXUAL', 'ACOSO SEXUAL',
             'CONTRA LA INTIMIDAD SEXUAL', 'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES',
             'FEMICIDE', 'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)']

replace_with = ['DOMESTIC VIOLENCE', 'SEXUAL ABUSE', 'SEXUAL HARASSMENT',
        'AGAINST SEXUAL INTIMACY', 'AGGRAVATED SEXUAL HARASSMENT AGAINST MINORS', 'FEMICIDE',
        'DEPRIVATION OF PERSONAL LIBERTY (PERFORMING A SEXUAL ACT)']

vaw = vaw.replace(dict(zip(from_text, replace_with)))
vaw.head()
Time Crime latitude longitude
480286 01/01/2021 DOMESTIC VIOLENCE 19.198153 -99.026277
480289 01/01/2021 DOMESTIC VIOLENCE 19.193027 -99.095006
480302 01/01/2021 DOMESTIC VIOLENCE 19.336963 -99.152265
480306 01/01/2021 DOMESTIC VIOLENCE 19.561525 -99.146590
480320 01/01/2021 DOMESTIC VIOLENCE 19.395045 -99.061060
#save file
vaw.to_csv('data/vaw_filtered.csv', index = False)
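A quick way to confirm that the dictionary-based translation covered every label is to check for values that are not in the target vocabulary. A self-contained sketch with a reduced, hypothetical version of the mapping:

```python
import pandas as pd

# Mini-version of the crime column after the femicide merge (toy data).
crimes = pd.Series(['VIOLENCIA FAMILIAR', 'ABUSO SEXUAL', 'FEMICIDE'])
mapping = {'VIOLENCIA FAMILIAR': 'DOMESTIC VIOLENCE',
           'ABUSO SEXUAL': 'SEXUAL ABUSE',
           'FEMICIDE': 'FEMICIDE'}

translated = crimes.replace(mapping)
# Any label that slipped through the mapping ends up here.
unmapped = sorted(set(translated) - set(mapping.values()))
print(translated.tolist(), unmapped)
```

An empty `unmapped` list means no Spanish label survived the replacement; if new crime types appear in a future data release, they show up here immediately.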

Neighbourhood

Add the total population per unit to the neighbourhood variables by merging with the ‘Population per Neighbourhood’ table.

import geopandas as gpd
# Read the data
gdf = gpd.read_file('data/alcaldías_cdmx/alcaldias_cdmx.shp')
gdf = gdf.rename(columns = {'nomgeo':'NOMGEO'})
#gdf.head(5)

Population per Neighbourhood

df = pd.read_csv('data/poblacion_total_tasa_crecimiento_1.1.csv')
df = df[df['Año'] == 2020]
df = df.reset_index(drop=True)
#df
# drop the last summary row
df = df.drop([16])
df = df.rename(columns = {'Alcaldia':'NOMGEO', 'Población total': 'POBTOT' })
df = df.apply(lambda x: x.astype(str).str.upper())  # uppercase names to match the shapefile
merge = gdf.merge(df, on='NOMGEO')
merge = merge.rename(columns = {'cvegeo':'CVEGEO'})
merge['POBTOT'] = merge['POBTOT'].str.replace(',', '').astype(int)  # strip thousands separators
gdf = merge[['CVEGEO', 'NOMGEO', 'POBTOT', 'geometry']]
# Write to GeoJSON

gdf.to_file("data/CDMX_municipalities.gjson", driver="GeoJSON")
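The uppercase-and-merge pattern used above can be demonstrated with plain pandas, without the shapefiles or geopandas. The names and population figures below are toy values for illustration only:

```python
import pandas as pd

# Stand-in for the shapefile attribute table (already uppercase).
gdf_like = pd.DataFrame({'NOMGEO': ['AZCAPOTZALCO', 'COYOACAN']})
# Stand-in for the population CSV: mixed case names, comma separators.
pop = pd.DataFrame({'NOMGEO': ['Azcapotzalco', 'Coyoacan'],
                    'POBTOT': ['432,205', '614,447']})

pop['NOMGEO'] = pop['NOMGEO'].str.upper()            # align the join keys
merged = gdf_like.merge(pop, on='NOMGEO')
merged['POBTOT'] = merged['POBTOT'].str.replace(',', '').astype(int)
print(merged['POBTOT'].tolist())
```

Normalising the case of the join key before merging is what prevents silently dropped rows: a default inner merge keeps only names that match exactly.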