library(tidyverse)
library(sf)
library(knitr)
crosswalk <- readRDS("../data/crosswalk_ine_igm.rds")

INE–IGM Community Crosswalk: Methods and Results
Bolivia Community Geography Project
This report was generated using AI under general human direction. At the time of generation, the contents have not been comprehensively reviewed by a human analyst.
1 Introduction
This document describes the construction of a crosswalk linking the Bolivian National Statistics Institute (INE) community registry to point locations from the Instituto Geográfico Militar (IGM). The goal is to assign each of the 19,418 communities in the INE registry a geographic identifier (id_unico) from the IGM settlement point dataset, enabling downstream linkage to coordinates and Wikidata entities.
The two datasets do not share a common key. Matching proceeds by community name within department, with a multi-stage spatial disambiguation pipeline to resolve cases where the same name appears in more than one location.
2 Data Sources
Four datasets feed the pipeline.
INE community registry (CLASIF_UB_GEOG_COMUNIDAD.xlsx). The official INE classification of geographical locations (clasificación de ubicaciones geográficas), containing 19,418 communities. Each row carries an 11-digit Codigo that encodes department, province, municipality, canton, and a community sequence number (see Section 3). Administrative hierarchy columns (DEPARTAMENTO, PROVINCIA, MUNICIPIO) and their numeric segment codes (DEP, PRO, MUN) are also present.
IGM settlement points (localizacion_poblaciones_2016.json). A GeoJSON point dataset of 23,891 Bolivian settlements compiled by the IGM in 2016. The key fields used here are id_unico (a unique point identifier), nombre_dep (department name), nombre_c_1 (settlement name, uppercase), and tipo_area (settlement type: cp = population centre, cpd = dispersed population centre, ci = urban centre, dis = dispersed).
GADM Bolivia level-3 boundaries (gadm41_BOL_3.gpkg). Municipality (level-3) polygon boundaries from the Global Administrative Areas database (GADM v4.1). Used to spatially assign each IGM point to a municipality and thereby resolve name ambiguities.
USCA community polygons (etnicidad_tenencia/usca_final.shp). Community-level multipolygon boundaries from the Unidades Socio-Culturales Agrarias (USCA) dataset, covering 14,426 features. Each polygon carries a 10-digit INE code (cod10dig), which maps to Codigo after prepending a leading zero. Used as a final spatial disambiguation layer for cases that GADM could not resolve.
3 INE Code Structure
Each INE community code (Codigo) is an 11-character string with a fixed leading zero followed by five numeric fields:
0 | DD | PP | MM | C | SSS
dept prov mun canton seq
| Position | Width | Field | Description |
|---|---|---|---|
| 1 | 1 | Leading zero | Fixed padding |
| 2–3 | 2 | DD | Department code |
| 4–5 | 2 | PP | Province code |
| 6–7 | 2 | MM | Municipality code |
| 8 | 1 | C | Canton code |
| 9–11 | 3 | SSS | Community sequence number |
The USCA shapefile uses a 10-digit form (the leading zero is absent). To join USCA codes to INE Codigo values, prepend a zero: paste0("0", cod10dig).
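As an illustration of the layout tabulated above, the following base R sketch splits a Codigo into its fields and converts a USCA code to the 11-character form. The helper names parse_codigo() and usca_to_codigo(), and the sample code value, are hypothetical and are not part of the pipeline.

```r
# Hypothetical helper: split an 11-character Codigo into the fields
# described in the table above (positions are 1-indexed).
parse_codigo <- function(codigo) {
  stopifnot(is.character(codigo), all(nchar(codigo) == 11))
  data.frame(
    codigo = codigo,
    dept   = substr(codigo, 2, 3),
    prov   = substr(codigo, 4, 5),
    mun    = substr(codigo, 6, 7),
    canton = substr(codigo, 8, 8),
    seq    = substr(codigo, 9, 11),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: USCA cod10dig -> INE Codigo (prepend the leading zero).
usca_to_codigo <- function(cod10dig) paste0("0", cod10dig)

# Made-up 10-digit code used purely for illustration.
parts <- parse_codigo(usca_to_codigo("0203041015"))
print(parts)
```

Keeping the codes as character strings throughout matters here: reading either column as numeric would silently drop leading zeros and shift every field.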
4 Matching Pipeline
The pipeline proceeds in eight stages, each producing a more complete and resolved crosswalk. The central challenge is that the IGM dataset contains no municipality field — only department and name — so names that recur within a department cannot be disambiguated by name alone.
4.1 Name normalization
normalize_match_key <- function(x) {
  x |>
    str_trim() |>
    str_squish() |>
    str_to_upper() |>
    str_replace_all("[^\x20-\x7E]", "") |>
    str_replace_all("[^A-Z0-9 ]", " ") |>
    str_squish()
}

Both datasets are normalized before any join. The normalize_match_key() function strips non-ASCII characters (encoding artifacts introduced by accented letters in different encodings), collapses punctuation and symbols to spaces, squishes runs of whitespace, and uppercases everything. This handles the most common source of spurious non-matches: encoding differences between the INE Excel file and the IGM GeoJSON.
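To make the effect of the normalization concrete, the sketch below repeats the function definition from above and applies it to two made-up names. The input strings are illustrative only and are not taken from either dataset.

```r
library(stringr)

# Same definition as in the report, repeated so the example is self-contained.
normalize_match_key <- function(x) {
  x |>
    str_trim() |>
    str_squish() |>
    str_to_upper() |>
    str_replace_all("[^\x20-\x7E]", "") |>
    str_replace_all("[^A-Z0-9 ]", " ") |>
    str_squish()
}

# Made-up inputs: one with an accented letter and a hyphen,
# one with punctuation and parentheses.
keys <- normalize_match_key(c(
  "  Villa   Vaca-Guzm\u00e1n ",
  "S. Pedro (Disperso)"
))
print(keys)
```

Note that accented letters are removed rather than transliterated (the first key loses its "A"), so a key is only comparable with another key produced by the same function.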
A small number of INE names are known to differ from their IGM counterparts for reasons beyond encoding — abbreviations, historical name changes, or alternative place names. These are corrected with explicit overrides before the normalized key is computed:
| INE name | IGM name | Reason |
|---|---|---|
| OKINAWA 1 | OKINAWA UNO | Abbreviated ordinal |
| TAIPIPLAYA | TANIPLAYA | Alternative spelling |
| MUYUPAMPA | VILLA VACA GUZMAN | Historical name change |
| VACAS K UCHU | VACAS KUCHU | Spacing artifact |
| VALLE DE CONCEPCION | CONCEPCION | Short-form name in IGM |
4.2 Initial join: department + name
ine_geog_2013 <- readxl::read_excel("../data/CLASIF_UB_GEOG_COMUNIDAD.xlsx") |>
  rename(
    department = DEPARTAMENTO,
    province = PROVINCIA,
    municipality = MUNICIPIO
  ) |>
  mutate(
    cod.prov = paste0(DEP, PRO),
    cod.mun = paste0(DEP, PRO, MUN),
    cod.com = Codigo,
    cod.dep = DEP
  )

geo <- st_read("../data/localizacion_poblaciones_2016.json", quiet = TRUE)

geo_norm <- geo |>
  st_drop_geometry() |>
  mutate(
    nombre_c_1 = str_trim(nombre_c_1),
    match_key = normalize_match_key(nombre_c_1)
  )

ine_norm <- ine_geog_2013 |>
  mutate(
    com_name = str_trim(`CIUDAD/COMUNIDAD`),
    match_key = case_when(
      com_name == "OKINAWA 1" ~ normalize_match_key("OKINAWA UNO"),
      com_name == "3o GRUPO VALLE HERMOSO" ~ normalize_match_key("3O GRUPO VALLE HERMOSO"),
      com_name == "TAIPIPLAYA" ~ normalize_match_key("TANIPLAYA"),
      com_name == "SINDICATO AGRARIO IMILLA IMILLA" ~ normalize_match_key("SINDICATO AGRARIO IMILLA"),
      com_name == "VALLE SACTA (Disperso)" ~ normalize_match_key("VALLE SACTA (DISPERSO)"),
      com_name == "VALLE DE CONCEPCION" ~ normalize_match_key("CONCEPCION"),
      com_name == "VACAS K UCHU" ~ normalize_match_key("VACAS KUCHU"),
      com_name == "MUYUPAMPA" ~ normalize_match_key("VILLA VACA GUZMAN"),
      TRUE ~ normalize_match_key(com_name)
    )
  )

crosswalk_raw <- ine_norm |>
  select(Codigo, department, municipality, com_name, match_key) |>
  left_join(
    geo_norm |> select(nombre_dep, nombre_c_1, id_unico, match_key),
    by = c("department" = "nombre_dep", "match_key"),
    relationship = "many-to-many"
  )

match_counts <- crosswalk_raw |>
  filter(!is.na(id_unico)) |>
  group_by(Codigo) |>
  summarise(n_geo = n_distinct(id_unico), .groups = "drop")

The initial join matches on the pair (department, normalized_name). Because the IGM dataset carries no municipality information, any community name that appears in two or more municipalities within the same department produces multiple candidate matches. These are flagged as ambiguous and carried forward to the disambiguation stages.
After this stage, records fall into three preliminary categories:
- Unique — exactly one IGM point matches the name within the department.
- Ambiguous — two or more IGM points match.
- Unmatched — no IGM point with that name exists in the department.
4.3 Stage 1: Direct unique matches
crosswalk_geo_ine_s1 <- crosswalk_raw |>
  left_join(match_counts, by = "Codigo") |>
  mutate(
    n_geo = coalesce(n_geo, 0L),
    match_status = case_when(
      is.na(id_unico) ~ "unmatched",
      n_geo == 1 ~ "unique",
      n_geo > 1 ~ "ambiguous"
    )
  ) |>
  select(Codigo, department, municipality, com_name, id_unico, match_status, n_geo)

Records with exactly one matching IGM point are immediately assigned match_status = "unique". This accounts for 13,998 of the 19,418 INE codes (72.1%; see Section 5).
4.4 Stage 2: GADM spatial municipality join
gadm_path <- "../data/gadm41_BOL_3.gpkg"
mun_boundaries <- st_read(gadm_path, layer = "ADM_ADM_3", quiet = TRUE)

gadm_lookup <- mun_boundaries |>
  st_drop_geometry() |>
  select(gadm_dep = NAME_1, gadm_prov = NAME_2, gadm_mun = NAME_3) |>
  mutate(
    dep_key = normalize_match_key(gadm_dep),
    mun_key = normalize_match_key(gadm_mun)
  )

ine_mun_lookup <- ine_geog_2013 |>
  select(department, province, municipality, cod.mun) |>
  distinct() |>
  mutate(
    dep_key = normalize_match_key(department),
    mun_key = normalize_match_key(municipality)
  )

To resolve ambiguous matches, each IGM point is spatially assigned to a GADM municipality polygon using st_within. Points that fall on or outside polygon boundaries (e.g. border settlements) are snapped to the nearest polygon using st_nearest_feature as a fallback.
Municipality names differ substantially between the INE and GADM datasets — different spellings, abbreviations, historical renames, and cases where GADM uses a province name where INE uses a municipality name. A manual crosswalk of 44 municipality name pairs resolves these discrepancies, covering all 339 INE municipalities.
Once each IGM point is tagged with a municipality code, the ambiguous crosswalk entries are filtered to keep only the candidate whose assigned municipality matches the INE record’s municipality. If this filter reduces the candidate set to exactly one point, the record is resolved as unique_via_spatial.
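The containment join with a nearest-polygon fallback can be sketched on toy data as follows. The two square "municipalities", the three points, and the column name mun_name are invented for illustration; the geometries carry no CRS, so sf treats them as planar. The real pipeline operates on the GADM polygons and IGM points described above and may differ in detail.

```r
library(sf)

# Helper for a unit square with lower-left corner at (x0, 0); toy data only.
sq <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

mun <- st_sf(mun_name = c("A", "B"), geometry = st_sfc(sq(0), sq(1)))

pts <- st_sf(
  id_unico = c("p1", "p2", "p3"),
  geometry = st_sfc(
    st_point(c(0.5, 0.5)),  # inside A
    st_point(c(1.5, 0.5)),  # inside B
    st_point(c(2.2, 0.5))   # outside both polygons, closest to B
  )
)

# Step 1: containment join; points outside every polygon get NA.
tagged <- st_join(pts, mun, join = st_within)

# Step 2: fallback, assign unmatched points to the nearest polygon.
unassigned <- is.na(tagged$mun_name)
if (any(unassigned)) {
  nearest <- st_nearest_feature(tagged[unassigned, ], mun)
  tagged$mun_name[unassigned] <- mun$mun_name[nearest]
}

print(st_drop_geometry(tagged))
```

A point lying exactly on a shared boundary is not "within" either polygon, so it would also pass through the fallback step.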
4.5 Stage 3: Name-proximity fallback
ine_name_to_mun <- ine_norm |>
  select(department, match_key, municipality, cod.mun) |>
  distinct() |>
  group_by(department, match_key) |>
  filter(n_distinct(municipality) == 1) |>
  ungroup() |>
  select(department, match_key, municipality_ine = municipality, cod.mun_ine = cod.mun)

mun_sf <- mun_boundaries |>
  mutate(
    dep_key = normalize_match_key(NAME_1),
    mun_key = normalize_match_key(NAME_3)
  ) |>
  inner_join(
    bind_rows(
      ine_mun_lookup |>
        inner_join(
          gadm_lookup |> select(dep_key, mun_key, gadm_mun, gadm_prov),
          by = c("dep_key", "mun_key")
        ) |>
        select(cod.mun, department, municipality, gadm_mun),
      tribble(
        ~department, ~municipality, ~gadm_mun,
        "La Paz", "La Paz", "Nuestra Señora de La Paz",
        "La Paz", "Callapa", "Santiago de Callapa",
        "Beni", "Rurrenabaque", "Puerto Menor de Rurrenabaque",
        "Chuquisaca", "Muyupampa", "Villa Vaca Guzmán"
      ) |>
        left_join(
          ine_mun_lookup |> select(department, municipality, cod.mun),
          by = c("department", "municipality")
        )
    ) |>
      mutate(
        dep_key = normalize_match_key(department),
        mun_key = normalize_match_key(gadm_mun)
      ),
    by = c("dep_key", "mun_key")
  ) |>
  select(cod.mun, department, municipality)

A small number of IGM points are placed outside their true GADM polygon by the spatial join, typically because they lie very close to a border or because the GADM boundaries do not perfectly align with the IGM point locations. For INE records whose community name is unambiguous within the department (i.e., the name maps to only one municipality in the INE list), the correct municipality is known even without spatial containment. In these cases, the candidate IGM point closest to that municipality’s centroid is selected, resolving the record as unique_via_name.
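The centroid-distance selection can be illustrated with a toy polygon and two candidate points. All coordinates and identifiers below are invented, and the geometries carry no CRS; this is a minimal sketch of the idea, not the pipeline's own implementation.

```r
library(sf)

# Toy municipality polygon (a 2 x 2 square); its centroid is (1, 1).
target_mun <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)
))))

# Two candidate IGM points sharing the same name; neither lies inside
# the polygon, but p1 sits just across its border.
cands <- st_sf(
  id_unico = c("p1", "p2"),
  geometry = st_sfc(st_point(c(2.1, 1)), st_point(c(6, 1)))
)

centroid <- st_centroid(target_mun)

# st_distance() returns an n x 1 matrix: one distance per candidate.
d <- st_distance(cands, centroid)
chosen <- cands$id_unico[which.min(d[, 1])]
print(chosen)
```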
4.6 Stage 4: USCA polygon containment
usca_valid <- st_make_valid(
  st_read("../data/etnicidad_tenencia/usca_final.shp", quiet = TRUE)
)

igm_spatial_ine <- st_join(
  geo |> select(id_unico),
  usca_valid |> select(cod10dig),
  join = st_within
) |>
  st_drop_geometry() |>
  distinct(id_unico, .keep_all = TRUE) |>
  mutate(
    spatial_ine = if_else(!is.na(cod10dig), paste0("0", cod10dig), NA_character_)
  ) |>
  select(id_unico, spatial_ine)

For records still ambiguous after Stages 2 and 3, the USCA community polygon layer provides a finer spatial filter. Each IGM point is matched to the USCA polygon it falls within (using st_within), and the resulting USCA code is compared directly to the INE Codigo. If exactly one of the candidate points falls inside the USCA polygon corresponding to the INE code, that point is assigned as unique_via_usca.
The USCA dataset uses 10-digit codes; a leading zero is prepended before comparison.
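The comparison step can be sketched on a small hand-made candidate table. The codes and point identifiers are invented, and the table stands in for the result of joining the ambiguous candidates to igm_spatial_ine; the actual pipeline code may be organised differently.

```r
library(dplyr)

# Toy candidate pool: two ambiguous INE codes, two candidate points each.
# cod10dig is the USCA code of the polygon containing the point (NA if none).
cands <- tibble(
  Codigo   = c("00203041015", "00203041015", "00203042007", "00203042007"),
  id_unico = c("p1", "p2", "p3", "p4"),
  cod10dig = c("0203041015", "0203049999", NA, NA)
)

resolved <- cands |>
  mutate(
    spatial_ine = if_else(!is.na(cod10dig), paste0("0", cod10dig), NA_character_)
  ) |>
  group_by(Codigo) |>
  # Keep candidates whose containing USCA polygon carries this INE code ...
  filter(!is.na(spatial_ine), spatial_ine == Codigo) |>
  # ... and accept the result only if exactly one candidate survives.
  filter(n() == 1) |>
  ungroup() |>
  mutate(match_status = "unique_via_usca")

print(resolved)
```

In this toy pool the first code resolves to p1, while the second code has no candidate inside any USCA polygon and therefore stays ambiguous.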
4.7 Stage 5: Canton-split detection
id_set_sig <- crosswalk |>
  filter(match_status == "ambiguous_canton_split" | match_status == "ambiguous") |>
  filter(!is.na(id_unico)) |>
  group_by(Codigo) |>
  summarise(id_key = paste(sort(unique(id_unico)), collapse = "|"), .groups = "drop")

canton_split_map <- id_set_sig |>
  group_by(id_key) |>
  filter(n() > 1) |>
  mutate(canton_split_group = min(Codigo)) |>
  ungroup() |>
  select(Codigo, canton_split_group)

Bolivia’s canton boundaries were reorganised after the 2001 census. As a result, some communities appear in the INE registry under two or more Codigo values that differ only in the canton digit (position 8), with the department, province, municipality, and community sequence number unchanged. These sibling codes all match the same pool of IGM candidate points.
To detect these groups, each remaining ambiguous Codigo is fingerprinted by the sorted set of its candidate id_unico values. Any two codes that share an identical fingerprint are placed in the same canton-split group, identified by the lowest Codigo in the group (canton_split_group). These records are labelled ambiguous_canton_split.
The many-to-one mapping (multiple INE codes → same IGM points) is expected and correct for these groups. To build a deduplicated community key, use:
community_key <- coalesce(canton_split_group, Codigo)

4.8 Stage 6: Dispersed settlement reclassification
pure_dis_codes <- crosswalk |>
  filter(match_status %in% c("ambiguous", "ambiguous_no_spatial")) |>
  filter(!is.na(id_unico)) |>
  left_join(
    geo |> st_drop_geometry() |> select(id_unico, tipo_area),
    by = "id_unico"
  ) |>
  group_by(Codigo) |>
  filter(all(tipo_area == "dis")) |>
  pull(Codigo) |>
  unique()

The IGM dataset uses tipo_area = "dis" to flag dispersed settlements: rural communities spread across a wide area rather than concentrated at a single point. For such communities, the IGM frequently records multiple points, each representing a cluster or hamlet within the broader settlement.
When all candidate IGM points for an ambiguous INE record carry tipo_area == "dis", no single point is more authoritative than the others. These records are reclassified as ambiguous_dispersed, signalling to downstream users that the full candidate pool should be treated as a coordinate envelope rather than forcing a single selection.
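One possible way for a downstream user to work with such a candidate pool is to summarise it as a bounding box plus a central point. The report does not prescribe this treatment; the sketch below is only an assumption about how the "coordinate envelope" might be consumed, and the three coordinates are made up.

```r
library(sf)

# Made-up candidate pool for one ambiguous_dispersed code (no CRS set).
pool <- st_sfc(
  st_point(c(-68.1, -16.5)),
  st_point(c(-68.3, -16.7)),
  st_point(c(-68.2, -16.4))
)

# Envelope of the pool: xmin, ymin, xmax, ymax.
bb <- st_bbox(pool)

# Central point: centroid of the combined multipoint,
# i.e. the mean of the candidate coordinates.
ctr <- st_coordinates(st_centroid(st_union(pool)))

print(bb)
print(ctr)
```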
4.9 Stage 7: Cross-municipality adjacency splitting
cross_mun_summary <- crosswalk |>
  filter(match_status == "ambiguous_no_spatial") |>
  group_by(department, com_name) |>
  summarise(n_mun = n_distinct(municipality), .groups = "drop") |>
  count(n_mun)

Some records are labelled ambiguous_no_spatial because the same community name appears in two municipalities within the same department and the GADM spatial join could not assign the points unambiguously. However, if the two municipalities are geographically non-adjacent, the candidate IGM points can be partitioned by which municipality polygon they fall within, and that partition resolves the ambiguity.
For each ambiguous_no_spatial group spanning exactly two municipalities, the pipeline tests whether those municipalities are adjacent (st_touches). For non-adjacent pairs, the candidate points are spatially split between the two municipality polygons. Each INE record then receives only the points assigned to its municipality:
- If the split leaves a single point: unique_via_spatial
- If multiple points remain (all dispersed): ambiguous_dispersed
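The adjacency test and the subsequent split can be sketched with three toy squares, two of which share an edge. Municipality names, point identifiers, and coordinates are invented, and the geometries carry no CRS; the real pipeline applies the same idea to the GADM polygons.

```r
library(sf)

# Unit square with lower-left corner at (x0, 0); toy data only.
sq <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# A and B share an edge; C is separated from both.
mun <- st_sf(municipality = c("A", "B", "C"),
             geometry = st_sfc(sq(0), sq(1), sq(3)))

# Logical adjacency matrix between all pairs of polygons.
adjacent <- st_touches(mun, mun, sparse = FALSE)
dimnames(adjacent) <- list(mun$municipality, mun$municipality)

# Candidate points for a name that occurs in municipalities A and C.
cands <- st_sf(
  id_unico = c("p1", "p2", "p3"),
  geometry = st_sfc(st_point(c(0.2, 0.5)),
                    st_point(c(3.5, 0.5)),
                    st_point(c(3.8, 0.2)))
)

# A and C are not adjacent, so the pool is split by containment.
if (!adjacent["A", "C"]) {
  pair <- mun[mun$municipality %in% c("A", "C"), ]
  assigned <- st_join(cands, pair, join = st_within)
  print(table(assigned$municipality))
}
```

Here the record in A is left with a single point, while the record in C keeps two candidates and would only be labelled ambiguous_dispersed if both carry tipo_area == "dis".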
4.10 Stage 8: Mixed-type resolution
type_resolution_note <- crosswalk |>
  filter(match_status == "unique_via_spatial") |>
  nrow()

For any ambiguous records still remaining after Stage 7, the tipo_area field provides a final resolution heuristic. Named population centres (cp, cpd) and urban centres (ci) are more specific geographic entities than dispersed points (dis). If a candidate pool contains exactly one point of the highest-priority type, that point is selected:
| Priority | tipo_area | Description |
|---|---|---|
| 1 | cp, cpd | Population centre / dispersed population centre |
| 2 | ci | Urban/city centre |
| 3 | dis | Dispersed settlement |
A code is resolved as unique_via_spatial when the top-ranked type appears exactly once. If multiple candidates share the top rank, the record remains ambiguous.
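The priority rule can be expressed compactly with a lookup vector. The sketch below applies it to an invented candidate table; the object names (priority, cands, resolved) and the toy codes are illustrative and are not taken from the pipeline.

```r
library(dplyr)

# Priority ranks from the table above (lower rank = higher priority).
priority <- c(cp = 1L, cpd = 1L, ci = 2L, dis = 3L)

# Toy candidate pools: code X has one cp point and two dis points,
# code Y has two ci points.
cands <- tibble(
  Codigo    = c("X", "X", "X", "Y", "Y"),
  id_unico  = c("a", "b", "c", "d", "e"),
  tipo_area = c("cp", "dis", "dis", "ci", "ci")
)

resolved <- cands |>
  mutate(rank = unname(priority[tipo_area])) |>
  group_by(Codigo) |>
  # Keep only the candidates of the best type present in the pool ...
  filter(rank == min(rank)) |>
  # ... and resolve only when that type occurs exactly once.
  filter(n() == 1) |>
  ungroup()

print(resolved)
```

Code X resolves to point a; code Y has two candidates tied at the top rank and stays ambiguous.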
5 Results
status_counts <- crosswalk |>
  distinct(Codigo, match_status) |>
  count(match_status, name = "n_codes") |>
  arrange(desc(n_codes))

status_counts |>
  mutate(
    pct = scales::percent(n_codes / sum(n_codes), accuracy = 0.1)
  ) |>
  rename(
    `Match status` = match_status,
    `INE codes` = n_codes,
    `Share` = pct
  ) |>
  kable(align = c("l", "r", "r"))

| Match status | INE codes | Share |
|---|---|---|
| unique | 13998 | 72.1% |
| unique_via_spatial | 4138 | 21.3% |
| ambiguous_canton_split | 710 | 3.7% |
| ambiguous_dispersed | 401 | 2.1% |
| unique_via_usca | 142 | 0.7% |
| unique_via_name | 23 | 0.1% |
| unmatched | 5 | 0.0% |
| ambiguous_no_spatial | 1 | 0.0% |
status_counts |>
  mutate(
    match_status = fct_reorder(match_status, n_codes),
    resolved = str_starts(match_status, "unique")
  ) |>
  ggplot(aes(x = n_codes, y = match_status, fill = resolved)) +
  geom_col() +
  scale_fill_manual(
    values = c("TRUE" = "#2a9d8f", "FALSE" = "#e9c46a"),
    labels = c("TRUE" = "Resolved", "FALSE" = "Ambiguous / unmatched"),
    name = NULL
  ) +
  scale_x_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.05))) +
  labs(
    x = "Number of INE codes",
    y = NULL,
    title = "INE–IGM crosswalk match status",
    subtitle = "19,418 communities across Bolivia"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")
n_resolved <- crosswalk |>
  distinct(Codigo, match_status) |>
  filter(str_starts(match_status, "unique")) |>
  nrow()
n_total <- n_distinct(crosswalk$Codigo)

Of the 19,418 INE community codes, 18,301 (94.2%) are resolved to a unique IGM point by at least one matching stage.
6 Match Status Reference
status_ref <- tribble(
  ~Status, ~Description,
  "unique", "Direct 1-to-1 name match within department.",
  "unique_via_spatial", "Ambiguous name resolved by GADM municipality boundary, cross-municipality adjacency splitting, or population-centre type priority.",
  "unique_via_name", "Ambiguous name resolved by selecting the IGM point closest to the correct municipality centroid.",
  "unique_via_usca", "Ambiguous name resolved by USCA community polygon containment.",
  "ambiguous_canton_split", "Same community listed under multiple canton codes due to administrative reorganisation. All sibling codes share the same IGM candidate pool; many-to-one mapping is expected.",
  "ambiguous_dispersed", "All IGM candidates are tipo_area == 'dis' (dispersed). The IGM recorded multiple points for the same settlement. Treat all candidates as a valid coordinate pool.",
  "ambiguous", "Genuinely repeated name within the same municipality. The correct point cannot be determined automatically.",
  "ambiguous_no_spatial", "Name recurs across municipalities; no spatial disambiguation succeeded.",
  "unmatched", "No IGM point with a matching name found."
)

status_ref |>
  left_join(
    crosswalk |>
      distinct(Codigo, match_status) |>
      count(match_status, name = "N"),
    by = c("Status" = "match_status")
  ) |>
  mutate(N = coalesce(N, 0L)) |>
  rename(`Match status` = Status, `N (INE codes)` = N) |>
  kable()

| Match status | Description | N (INE codes) |
|---|---|---|
| unique | Direct 1-to-1 name match within department. | 13998 |
| unique_via_spatial | Ambiguous name resolved by GADM municipality boundary, cross-municipality adjacency splitting, or population-centre type priority. | 4138 |
| unique_via_name | Ambiguous name resolved by selecting the IGM point closest to the correct municipality centroid. | 23 |
| unique_via_usca | Ambiguous name resolved by USCA community polygon containment. | 142 |
| ambiguous_canton_split | Same community listed under multiple canton codes due to administrative reorganisation. All sibling codes share the same IGM candidate pool; many-to-one mapping is expected. | 710 |
| ambiguous_dispersed | All IGM candidates are tipo_area == 'dis' (dispersed). The IGM recorded multiple points for the same settlement. Treat all candidates as a valid coordinate pool. | 401 |
| ambiguous | Genuinely repeated name within the same municipality. The correct point cannot be determined automatically. | 0 |
| ambiguous_no_spatial | Name recurs across municipalities; no spatial disambiguation succeeded. | 1 |
| unmatched | No IGM point with a matching name found. | 5 |
Canton-split groups can be identified and deduplicated as follows:
# Build a deduplicated community key (one per physical place)
community_key <- coalesce(canton_split_group, Codigo)

7 Usage
7.1 Loading the crosswalk
crosswalk <- readRDS("../data/crosswalk_ine_igm.rds")
# or
crosswalk <- read_csv("../data/crosswalk_ine_igm.csv")

7.2 Joining to the INE community list
library(tidyverse)

ine_geog_2013 |>
  left_join(
    crosswalk |> select(Codigo, id_unico, match_status, canton_split_group),
    by = "Codigo"
  )

7.3 Joining to IGM coordinates
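library(sf)

igm_sf <- st_read("../data/localizacion_poblaciones_2016.json", quiet = TRUE)

# Resolved records only.
# Note: st_drop_geometry() keeps the IGM attribute columns but discards the
# point geometry. If the result should remain an sf object with coordinates,
# omit it and make igm_sf the left-hand table of the join.
crosswalk |>
  filter(str_starts(match_status, "unique")) |>
  left_join(igm_sf |> st_drop_geometry(), by = "id_unico")

7.4 Looking up a single INE code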
The ine_match_status() function (defined in ine_community_mapping.R) returns a human-readable summary for any INE code, accepting both the 11-digit Codigo form and the 10-digit USCA cod10dig form:
source("ine_community_mapping.R")
ine_match_status("01010101001")