The data cover all of Mexico City, but include a sub-unit (manzana, the city block) that we do not consider. After importing, select the AGEB level, replace the special character * with NaN values, convert the non-string columns to the float data type, and save the file.
import pandas as pd
import numpy as np
The INEGI portal, inegi.org.mx, has been down for a while. Use the local copy instead.
df = pd.read_csv('data/RESAGEBURB_09XLSX20.csv')
df.head(3)
CVEGEO | ENTIDAD | NOM_ENT | MUN | NOM_MUN | LOC | NOM_LOC | AGEB | MZA | POBTOT | ... | VPH_TELEF | VPH_CEL | VPH_INTER | VPH_STVP | VPH_SPMVPI | VPH_CVJ | VPH_SINRTV | VPH_SINLTC | VPH_SINCINT | VPH_SINTIC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0900000000000 | 9 | Ciudad de México | 0 | Total de la entidad Ciudad de México | 0 | Total de la entidad | 0000 | 0 | 9209944 | ... | 1898265 | 2536523 | 2084156 | 1290811 | 957162 | 568827 | 46172 | 77272 | 561128 | 10528 |
1 | 0900200000000 | 9 | Ciudad de México | 2 | Azcapotzalco | 0 | Total del municipio | 0000 | 0 | 432205 | ... | 96128 | 123961 | 105899 | 66399 | 50965 | 31801 | 1661 | 2869 | 22687 | 322 |
2 | 0900200010000 | 9 | Ciudad de México | 2 | Azcapotzalco | 1 | Total de la localidad urbana | 0000 | 0 | 432205 | ... | 96128 | 123961 | 105899 | 66399 | 50965 | 31801 | 1661 | 2869 | 22687 | 322 |
3 rows × 231 columns
df = df[df['NOM_LOC']=='Total AGEB urbana']
df = df.replace('*', np.nan)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2433 entries, 3 to 68915
Columns: 231 entries, CVEGEO to VPH_SINTIC
dtypes: int64(6), object(225)
memory usage: 4.3+ MB
Most of the entries are objects. We need to convert them into numbers.
df1 = df.iloc[:,:9]
df2 = df.iloc[:,9:]
df2 = df2.fillna(0).astype(float)
df = pd.concat([df1,df2], axis=1)
#save file
df.to_csv('data/census_data2020.csv', index = False)
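As a side note, the replace/fillna/astype chain above can also be expressed with `pd.to_numeric`, which coerces any non-numeric token (such as the * that INEGI uses for suppressed values) to NaN in a single step. A minimal sketch on toy data:

```python
import pandas as pd

# Toy frame mimicking the census layout: '*' marks a suppressed value
toy = pd.DataFrame({'AGEB': ['0010', '0011'],
                    'POBTOT': ['120', '*']})

# errors='coerce' turns '*' into NaN; fill with 0 as in the cell above
toy['POBTOT'] = pd.to_numeric(toy['POBTOT'], errors='coerce').fillna(0)
print(toy['POBTOT'].tolist())  # [120.0, 0.0]
```

This avoids splitting the DataFrame in two, at the cost of applying the conversion column by column.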
The VAW data for Mexico City are included in the general crime-victim dataset. This means I have to identify which rows are relevant to the project based on the description of the crime and the victim’s gender.
# Read data
vaw = pd.read_csv('data/victimas_completa_marzo_2022.csv')
vaw.head()
idCarpeta | Año_inicio | Mes_inicio | FechaInicio | Delito | Categoria | Sexo | Edad | TipoPersona | CalidadJuridica | ... | Mes_hecho | FechaHecho | HoraHecho | HoraInicio | AlcaldiaHechos | ColoniaHechos | Calle_hechos | Calle_hechos2 | latitud | longitud | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8324429.0 | 2019.0 | Enero | 04/01/2019 | FRAUDE | DELITO DE BAJO IMPACTO | Masculino | 62.0 | FISICA | OFENDIDO | ... | Agosto | 29/08/2018 | 12:00:00 | 12:19:00 | ALVARO OBREGON | GUADALUPE INN | INSUGENTES SUR | NaN | 19.36125 | -99.18314 |
1 | 8324430.0 | 2019.0 | Enero | 04/01/2019 | PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI... | DELITO DE BAJO IMPACTO | Femenino | 38.0 | FISICA | VICTIMA Y DENUNCIANTE | ... | Diciembre | 15/12/2018 | 15:00:00 | 12:20:00 | AZCAPOTZALCO | VICTORIA DE LAS DEMOCRACIAS | AV. CUATLAHUAC | NaN | 19.47181 | -99.16458 |
2 | 8324431.0 | 2019.0 | Enero | 04/01/2019 | ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE... | ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ... | Masculino | 42.0 | FISICA | VICTIMA Y DENUNCIANTE | ... | Diciembre | 22/12/2018 | 15:30:00 | 12:23:00 | COYOACAN | COPILCO UNIVERSIDAD ISSSTE | COPILCO | NaN | 19.33797 | -99.18611 |
3 | 8324435.0 | 2019.0 | Enero | 04/01/2019 | ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI... | ROBO DE VEHÍCULO CON Y SIN VIOLENCIA | Masculino | 35.0 | FISICA | VICTIMA Y DENUNCIANTE | ... | Enero | 04/01/2019 | 06:00:00 | 12:27:00 | IZTACALCO | AGRÍCOLA PANTITLAN | CALLE 6 | ENTRE PRIVADA DEL VALLE Y PRIVADA GONZALEZ | 19.40327 | -99.05983 |
4 | 8324438.0 | 2019.0 | Enero | 04/01/2019 | ROBO DE MOTOCICLETA SIN VIOLENCIA | ROBO DE VEHÍCULO CON Y SIN VIOLENCIA | Masculino | NaN | FISICA | VICTIMA | ... | Enero | 03/01/2019 | 20:00:00 | 12:35:00 | IZTAPALAPA | PROGRESISTA | UNIVERSIDAD | NaN | 19.35480 | -99.06324 |
5 rows × 22 columns
Inspect with the info() method to get a better sense of the data.
vaw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 780439 entries, 0 to 780438
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 idCarpeta 780439 non-null float64
1 Año_inicio 780438 non-null float64
2 Mes_inicio 780438 non-null object
3 FechaInicio 780438 non-null object
4 Delito 780439 non-null object
5 Categoria 780439 non-null object
6 Sexo 628287 non-null object
7 Edad 495406 non-null float64
8 TipoPersona 774405 non-null object
9 CalidadJuridica 780438 non-null object
10 competencia 780439 non-null object
11 Año_hecho 780120 non-null float64
12 Mes_hecho 780120 non-null object
13 FechaHecho 780120 non-null object
14 HoraHecho 780130 non-null object
15 HoraInicio 780438 non-null object
16 AlcaldiaHechos 779234 non-null object
17 ColoniaHechos 745330 non-null object
18 Calle_hechos 778237 non-null object
19 Calle_hechos2 280532 non-null object
20 latitud 745544 non-null float64
21 longitud 745542 non-null float64
dtypes: float64(6), object(16)
memory usage: 131.0+ MB
There are 780,439 observations, which is quite a lot! Notice that latitude and longitude have fewer non-null values than the total number of entries. Therefore, drop all rows with a NaN value in these columns, then run a preliminary sub-selection keeping only female victims and the latest year available.
From the data structure, I know that the “Año_inicio” field corresponds to the opening date of the investigation, while “Año_hecho” is the date on which the crime was committed. So use the latter.
# drop Nan values
vaw = vaw.dropna(subset=['latitud', 'longitud'])
# sort unique years in the data
year = sorted(vaw['Año_hecho'].unique())
print("Last year available: ", year[-1])
Last year available: 2022.0
Filter the data accordingly:
vaw = vaw[vaw['Año_hecho']== 2021]
vaw = vaw[vaw['Sexo']=='Femenino']
Abortion in Mexico: a Crime?
Since 2021, abortion in Mexico is no longer a crime, although its legalisation still varies by state (source).
Yet in our data it is still labeled as a crime:
vaw[vaw['Delito']== 'ABORTO'].sample(1)
idCarpeta | Año_inicio | Mes_inicio | FechaInicio | Delito | Categoria | Sexo | Edad | TipoPersona | CalidadJuridica | ... | Mes_hecho | FechaHecho | HoraHecho | HoraInicio | AlcaldiaHechos | ColoniaHechos | Calle_hechos | Calle_hechos2 | latitud | longitud | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
566672 | 8929948.0 | 2021.0 | Mayo | 17/05/2021 | ABORTO | DELITO DE BAJO IMPACTO | Femenino | NaN | FISICA | VICTIMA | ... | Mayo | 17/05/2021 | 12:00:00 | 17:49:00 | VENUSTIANO CARRANZA | AMPLIACION ADOLFO LOPEZ MATEOS | EJE 4 OTE. RIO CHURUBUSCO ESQUINA CON LA CALLE... | NaN | 19.416515 | -99.073591 |
1 rows × 22 columns
Examining abortion’s status as a crime is beyond the scope of this project, but you too may find this interesting.
Back to Data Preparation
Select VAW crimes whose description matches the Spanish terms for “sexual”, “domestic”, and “femicide”, then print the results.
vaw = vaw[vaw['Delito'].str.contains('sexual|familiar|feminicidio', case=False)]
print('List of crimes after text query:\n', vaw['Delito'].unique())
List of crimes after text query:
['VIOLENCIA FAMILIAR' 'ABUSO SEXUAL' 'ACOSO SEXUAL'
'CONTRA LA INTIMIDAD SEXUAL' 'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES'
'FEMINICIDIO' 'FEMINICIDIO POR ARMA BLANCA' 'TENTATIVA DE FEMINICIDIO'
'FEMINICIDIO POR DISPARO DE ARMA DE FUEGO'
'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)'
'FEMINICIDIO POR GOLPES']
Using str.contains() as a filter in data preparation is generally not recommended, because there is a real chance of missing important information: strings can contain special characters, accents, or spelling variants. In this case, however, I didn’t find a viable alternative.
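One partial mitigation, if accented variants are a concern, is normalising the text before matching. A sketch of the idea (the crime strings here are purely illustrative):

```python
import unicodedata
import pandas as pd

def strip_accents(s: str) -> str:
    # Decompose accented characters (NFKD), then drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if not unicodedata.combining(c))

# 'VIOLACIÓN' would slip past a plain str.contains('VIOLACION') filter
crimes = pd.Series(['VIOLACIÓN', 'VIOLACION EQUIPARADA', 'ROBO'])
mask = crimes.map(strip_accents).str.contains('VIOLACION', case=False)
print(mask.tolist())  # [True, True, False]
```

This still misses misspellings, but at least makes the match accent-insensitive.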
Let’s now begin translating the DataFrame from Spanish to English, starting with the column names:
# select columns and rename
vaw = vaw[['FechaHecho', 'Delito', 'latitud', 'longitud']]
vaw.columns = ['Time', 'Crime', 'latitude', 'longitude']
vaw
Time | Crime | latitude | longitude | |
---|---|---|---|---|
480286 | 01/01/2021 | VIOLENCIA FAMILIAR | 19.198153 | -99.026277 |
480289 | 01/01/2021 | VIOLENCIA FAMILIAR | 19.193027 | -99.095006 |
480302 | 01/01/2021 | VIOLENCIA FAMILIAR | 19.336963 | -99.152265 |
480306 | 01/01/2021 | VIOLENCIA FAMILIAR | 19.561525 | -99.146590 |
480320 | 01/01/2021 | VIOLENCIA FAMILIAR | 19.395045 | -99.061060 |
... | ... | ... | ... | ... |
779813 | 08/12/2021 | ABUSO SEXUAL | 19.282468 | -99.010266 |
780165 | 31/03/2021 | VIOLENCIA FAMILIAR | 19.371821 | -99.071966 |
780166 | 01/12/2021 | VIOLENCIA FAMILIAR | 19.259511 | -99.101052 |
780286 | 19/07/2021 | ABUSO SEXUAL | 19.352580 | -99.019935 |
780320 | 12/12/2021 | ABUSO SEXUAL | 19.325796 | -99.069686 |
31613 rows × 4 columns
and continue with the body. List all the values of ‘Crime’:
vaw['Crime'].unique()
array(['VIOLENCIA FAMILIAR', 'ABUSO SEXUAL', 'ACOSO SEXUAL',
'CONTRA LA INTIMIDAD SEXUAL',
'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES', 'FEMINICIDIO',
'FEMINICIDIO POR ARMA BLANCA', 'TENTATIVA DE FEMINICIDIO',
'FEMINICIDIO POR DISPARO DE ARMA DE FUEGO',
'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)',
'FEMINICIDIO POR GOLPES'], dtype=object)
Several types of femicide are present. I use a simple regular expression to merge them and replace them with the English word.
vaw['Crime'] = vaw['Crime'].str.replace(r'.*FEMINICIDIO.*', r'FEMICIDE', regex = True)
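To see what the pattern does: any value containing FEMINICIDIO anywhere in the string collapses to the single label FEMICIDE, while other values pass through untouched. A quick check using values from the list above:

```python
import pandas as pd

s = pd.Series(['FEMINICIDIO POR GOLPES', 'TENTATIVA DE FEMINICIDIO',
               'ABUSO SEXUAL'])
# .* on both sides makes the whole string match, so the entire value is replaced
out = s.str.replace(r'.*FEMINICIDIO.*', 'FEMICIDE', regex=True)
print(out.tolist())  # ['FEMICIDE', 'FEMICIDE', 'ABUSO SEXUAL']
```

Note that this also folds attempted femicide into the same label, which is intentional here.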
Conclude the translation for the other crime types.
from_text = ['VIOLENCIA FAMILIAR', 'ABUSO SEXUAL', 'ACOSO SEXUAL',
'CONTRA LA INTIMIDAD SEXUAL', 'ACOSO SEXUAL AGRAVADO EN CONTRA DE MENORES',
'FEMICIDE', 'PRIVACION DE LA LIBERTAD PERSONAL (REALIZAR ACTO SEXUAL)']
replace_with = ['DOMESTIC VIOLENCE', 'SEXUAL ABUSE', 'SEXUAL HARASSMENT',
'AGAINST SEXUAL INTIMACY', 'AGGRAVATED SEXUAL HARASSMENT AGAINST MINORS', 'FEMICIDE',
'DEPRIVATION OF PERSONAL LIBERTY (PERFORMING A SEXUAL ACT)']
vaw = vaw.replace(dict(zip(from_text, replace_with)))
vaw.head()
Time | Crime | latitude | longitude | |
---|---|---|---|---|
480286 | 01/01/2021 | DOMESTIC VIOLENCE | 19.198153 | -99.026277 |
480289 | 01/01/2021 | DOMESTIC VIOLENCE | 19.193027 | -99.095006 |
480302 | 01/01/2021 | DOMESTIC VIOLENCE | 19.336963 | -99.152265 |
480306 | 01/01/2021 | DOMESTIC VIOLENCE | 19.561525 | -99.146590 |
480320 | 01/01/2021 | DOMESTIC VIOLENCE | 19.395045 | -99.061060 |
#save file
vaw.to_csv('data/vaw_filtered.csv', index = False)
Add the total population per unit to the neighbourhood variables by merging with ‘Population per Neighbourhood’.
Data source:
Data Structure:
The structure of the data is described in a separate Excel file download here
import geopandas as gpd
# Read the data
gdf = gpd.read_file('data/alcaldías_cdmx/alcaldias_cdmx.shp')
gdf = gdf.rename(columns = {'nomgeo':'NOMGEO'})
#gdf.head(5)
Population per Neighbourhood
Data source:
CDMX Population per Neighbourhood - Mexico City Open Data Portal
Data Structure:
The structure of the data is described in a separate Excel file download here
df = pd.read_csv('data/poblacion_total_tasa_crecimiento_1.1.csv')
df = df[df['Año'] == 2020]
df = df.reset_index(drop=True)
#df
# drop the last summary row
df = df.drop([16])
df = df.rename(columns = {'Alcaldia':'NOMGEO', 'Población total': 'POBTOT' })
df = df.apply(lambda x: x.astype(str).str.upper())
merge = gdf.merge(df, on='NOMGEO')
merge = merge.rename(columns = {'cvegeo':'CVEGEO'})
merge['POBTOT'] = merge['POBTOT'].str.replace(',', '').astype(int)
gdf = merge[['CVEGEO', 'NOMGEO', 'POBTOT', 'geometry']]
# Write to GeoJSON
gdf.to_file("data/CDMX_municipalities.gjson", driver="GeoJSON")
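The merge-and-clean step can be illustrated with plain pandas on toy rows (the borough name is real; the code and population figure are illustrative):

```python
import pandas as pd

# One-row stand-ins for the shapefile table and the population table
gdf_toy = pd.DataFrame({'NOMGEO': ['COYOACAN'], 'cvegeo': ['09003']})
pop_toy = pd.DataFrame({'NOMGEO': ['COYOACAN'], 'POBTOT': ['614,447']})

m = gdf_toy.merge(pop_toy, on='NOMGEO')      # join on the borough name
m = m.rename(columns={'cvegeo': 'CVEGEO'})
# Population comes in as a string with thousands separators; strip and cast
m['POBTOT'] = m['POBTOT'].str.replace(',', '').astype(int)
print(m.to_dict('records'))
```

Because the join key is a plain text name, both sides must agree on case, which is why the population table was upper-cased before merging.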