How I added a new property to Wikidata

A how-to guide to putting a unique identifier onto the global knowledge graph

Author

Carwil Bjork-James

Published

May 22, 2026

This post walks through how I added a new property to Wikidata: the codes assigned to Bolivian geographic features by the Instituto Nacional de Estadística (INE, National Statistical Institute). You can read more below about just what unique identifiers like INE codes are good for, and how Wikidata has become a repository for them.

1. Property proposal and approval

For a unique ID to be attached to a Wikidata item, the property must be officially defined and the ID included as a statement on that item’s page. Anyone can add a place to Wikidata, where it gets assigned a unique QID, but properties must be approved by the editing community through a proposal process. Another user had started the INE code proposal in August 2025, but little had happened. I was able to help move the process along by supplying sample Wikidata statements. One user marked the proposal as ready a day later, and a Wikidata property creator with the username Tinker Bell activated it the next morning.

2. Finding entities that need INE codes

INE codes are assigned to Bolivia’s 9 departments, 112 provinces, and 340 municipalities, as well as to sub-municipal cantons and rural communities. These are listed on Spanish Wikipedia. For now, this page walks through how I got the codes for the first three kinds of entities onto Wikidata.

INE codes are structured strings of digits, laid out like this.

Department: DD (2 digits)
Province: DDPP (4 digits)
Municipality: DDPPMM (6 digits)
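
Because each longer code begins with the code of its parent, the levels can be recovered with simple substring operations. The following is a minimal sketch in base R, assuming the DD / DDPP / DDPPMM layout above; `split_ine_code()` is a hypothetical helper written for illustration, not one of the functions used in this post.

```r
# Split INE codes into the codes of each administrative level they contain.
# Hypothetical helper; assumes the DD / DDPP / DDPPMM layout described above.
split_ine_code <- function(code) {
  code <- as.character(code)
  n <- nchar(code)
  # Only 2-, 4-, and 6-digit all-numeric strings are valid at these levels
  stopifnot(all(n %in% c(2, 4, 6)), all(grepl("^[0-9]+$", code)))
  data.frame(
    code = code,
    level = c("department", "province", "municipality")[n / 2],
    cod.dep = substr(code, 1, 2),
    cod.prov = ifelse(n >= 4, substr(code, 1, 4), NA_character_),
    cod.mun = ifelse(n == 6, code, NA_character_),
    stringsAsFactors = FALSE
  )
}

parts <- split_ine_code(c("08", "0804", "080401"))
print(parts)
```

The codes need to stay character strings throughout: read as numbers, "08" would lose its leading zero and no longer pass the length check.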

I had a complete list of INE municipalities and their codes: the left four columns of a Census table.

library(dplyr)  # provides %>%, rename(), mutate(), select(), distinct()

ine_geog_2013 <- readxl::read_excel("data/CLASIF_UB_GEOG_COMUNIDAD.xlsx")
ine_geog_2013 <- ine_geog_2013 %>%
  rename(
    department = DEPARTAMENTO,
    province = PROVINCIA,
    municipality = MUNICIPIO) %>%
  mutate(cod.prov = paste0(DEP, PRO),
         cod.mun = paste0(DEP, PRO, MUN),
         cod.com = Codigo) %>%
  rename(cod.dep = DEP)

ine_geog_provinces <- ine_geog_2013 %>%
  select(cod.prov, province, department) %>%
  distinct()

ine_geog_municipalities <- ine_geog_2013 %>%
  select(cod.mun, municipality, province, department) %>%
  distinct()

All I needed to do was check whether each of the entities involved was already on Wikidata and write QuickStatements to assign the property.

I did the data wrangling in R. In the process, I had to create new functions for working with Wikidata: retrieving Wikidata items into data tables, crafting QuickStatements to add to Wikidata, etc. Happily, because I did this, you won’t have to.

To find all Wikidata items, I used the function get_wikidata_instances(), available here.

# retrieve instances of "department of Bolivia" (Q250050)
#   with their QIDs and labels in English and Spanish
departments_wd <- get_wikidata_instances("Q250050", languages = c("en", "es"))

Matching these to their INE codes required one last use of string-based matching, and to do that I also simplified the labels to a common format along the way, making sure they each appeared as “<name> Department” in English and “Departamento de <nombre>” in Spanish.

For provinces and municipalities, names aren’t always enough (there are six pairs of municipalities that share the same names: El Puente, Entre Ríos, San Javier, San Pedro, San Ramón, and Santa Rosa).

3. Facilitating string matches while cleaning up Wikidata: Provinces

As I worked, I kept making Wikidata a bit more uniform, for example by adding “Province” at the end of each English label for the province items.

# visual inspection shows that not all have "Province" in label_en, but
# it's just 6 that don't.
provinces_without_suffix <- provinces %>%
  filter(!str_detect(label_en, "Province") | str_detect(label_en, "Cercado"))
# next step is to relabel these, but create_quick_statements has
# broken logic for labels.
provinces_without_suffix <- provinces_without_suffix %>%
  add_quick_statement_column(qid, property="L", value=str_c(.data$label_en, " Province (", department, ")"), lang="en")
writeLines(provinces_without_suffix$quick_statement)

This involves functions that create QuickStatements that modify Wikidata. An online QuickStatements tool lets Wikidata users upload these lists of text commands, preview the changes they will make and carry them out all at once.
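
To make the format concrete, here is a minimal sketch of how one such command could be assembled in base R. It assumes the QuickStatements version 1 syntax (tab-separated item, property, and value, with source properties written with an S prefix) and assumes that the reference is given as "stated in" (P248). `make_quick_statement()` is a hypothetical stand-in, not the `add_quick_statement_column()` function used in this post; the item, property, code, and reference item are the ones that appear later in this post.

```r
# Build one QuickStatements (version 1 syntax) command per input row.
# Hypothetical helper for illustration only.
make_quick_statement <- function(qid, property, value, reference_qid = NULL) {
  # String and external-identifier values are wrapped in double quotes
  statement <- paste(qid, property, paste0('"', value, '"'), sep = "\t")
  if (!is.null(reference_qid)) {
    # Assumption: cite the source with S248, the source form of P248 ("stated in")
    statement <- paste(statement, "S248", reference_qid, sep = "\t")
  }
  statement
}

statements <- make_quick_statement(
  qid = "Q685208", property = "P14142", value = "080401",
  reference_qid = "Q138354774"
)
writeLines(statements)
```

Because the function is vectorized over its arguments, passing whole columns of a data table yields one command per row, ready to paste into the QuickStatements tool.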

My modifications to province names were performed as QuickStatements here. After several moves to clean up the provinces, we had a new list.

provinces <- get_wikidata_instances("Q1062593", c("P131", "P17"), c("located_in", "country"))
# get department names from the table we've already imported
provinces <- provinces %>%
  left_join(select(departments, qid, department), by=join_by("located_in"=="qid")) %>%
  relocate(department, .after="located_in")
# Note that if we recreate provinces_without_suffix, it now has four items,
# just the Cercado provinces, which now have their department in parentheses in
# the label.
provinces <- provinces |>
  mutate(province = str_extract(label_en, "^(.+)\\s+Province", group = 1))

provinces_ine <- provinces %>% left_join(ine_geog_provinces, by=c("province", "department"))

After cleaning up the province labels, I pulled a fresh copy of all Wikidata province items and joined them back to the INE province table. At that point the easy matches were done, but dozens of provinces still did not line up automatically. Some differences were minor spelling variants, some reflected accent marks, and others came from cases where Wikidata and the INE source preferred different versions of the same name. In a few places the province item on Wikidata was labeled with only part of the official name.

I started by re-importing the provinces and extracting a cleaner province name from the English label:

provinces <- get_wikidata_instances("Q1062593", c("P131", "P17"),
                                    c("located_in", "country"))

provinces <- provinces |>
  left_join(select(departments, qid, department), 
            by = join_by("located_in" == "qid")) |>
  relocate(department, .after = "located_in") |> 
  mutate(province = str_extract(label_en, "^(.+)\\s+Province", group = 1))

provinces_ine <- provinces |>
  left_join(ine_geog_provinces, by = c("province", "department"))

This made it possible to compare the province names coming from Wikidata with the names in the INE table. The remaining non-matches then became a manageable list for inspection. To handle those exceptions, I built a small manual match table. For some provinces this also meant adding English aliases on Wikidata, so that the item would carry a more recognizable or complete province name in addition to its main label.
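
The two-step pattern described here (an automatic join, then a hand-built table for the leftovers) might look roughly like the sketch below. It uses base R and toy tables in which every name, code, and QID is invented for illustration; the column names `cod.prov`, `province`, and `department` follow the ones used above, while `cod.manual` is a hypothetical helper column.

```r
# Toy stand-ins for the Wikidata and INE province tables (all values invented)
wd <- data.frame(
  qid = c("Q_A", "Q_B", "Q_C"),
  province = c("Alpha", "Beta", "Gama"),
  department = c("North", "North", "South"),
  stringsAsFactors = FALSE
)
ine <- data.frame(
  cod.prov = c("9901", "9902", "9801"),
  province = c("Alpha", "Beta", "Gamma"),
  department = c("North", "North", "South"),
  stringsAsFactors = FALSE
)

# Round 1: exact match on name plus department; all.x keeps unmatched items
joined <- merge(wd, ine, by = c("province", "department"), all.x = TRUE)

# Rows still lacking a code form the list to inspect by hand
unmatched <- joined[is.na(joined$cod.prov), ]

# Round 2: a hand-built table pairing QIDs with codes fills the gaps
manual <- data.frame(qid = "Q_C", cod.manual = "9801", stringsAsFactors = FALSE)
joined <- merge(joined, manual, by = "qid", all.x = TRUE)
joined$cod.prov <- ifelse(is.na(joined$cod.prov), joined$cod.manual, joined$cod.prov)
joined$cod.manual <- NULL
```

Keying the manual table on QID rather than on name means a later label cleanup on Wikidata cannot silently break the hand-made matches.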

Once those exceptions were resolved, I could join the province items to the INE list and generate a full set of QuickStatements assigning the new INE property to each province item, with citations to the data source.

4. String matches and Wikidata cleanup: Municipalities

Municipalities took much more work. In principle, municipalities should have been easier: there are only a few hundred, and the INE source gives a complete list. In practice, municipality items on Wikidata were much less standardized. Some were linked to provinces, others directly to departments, and some overlapped with separate city items of the same name. There were also many label differences between Wikidata and the INE table: missing accents, alternate spellings, abbreviated names, and cases where one source used a longer ceremonial or historical form while the other used the shorter everyday name.

I began by retrieving all Wikidata municipality items together with their administrative-location properties:

# Q1062710 = municipality of Bolivia
municipalities_wd <- get_wikidata_instances( "Q1062710", c("P131", "P17"), 
                                             c("located_in", "country") )

# This function expands the located_in property so that each item gets 
# detailed location information in a tabular form.
municipalities_wd_exp <- expand_located_in_pd(municipalities_wd)

The custom expansion function let me see which province and department each municipality was “located in” on Wikidata.

qid	label_en	loc_prov_qid	loc_prov_en	loc_dep_qid	loc_dep_en
Q783480	San Lucas	Q1420337	Nor Cinti Province	Q235110	Chuquisaca Department
Q916322	San Ignacio de Velasco Municipality	Q1215518	José Miguel de Velasco Province	Q235106	Santa Cruz Department
Q920327	Yamparáez	Q1420314	Yamparáez Province	Q235110	Chuquisaca Department
Q955703	Villa Alcalá	Q1420320	Tomina Province	Q235110	Chuquisaca Department
Q993949	General Saavedra	Q1215579	Obispo Santistevan Province	Q235106	Santa Cruz Department

To figure out where each municipality belonged, I used the province table I had already cleaned to recover department names whenever possible. A few items still lacked enough geographic context, so I assigned their departments manually.

prov_to_dept <- provinces |>
  select(prov_qid = qid, dep_from_prov = department) |>
  mutate(dep_from_prov = str_replace(dep_from_prov, "Potosí", "Potosi"))

wd_clean <- municipalities_wd_exp |>
  left_join(prov_to_dept, by = c("loc_prov_qid" = "prov_qid")) |>
  mutate(dept_final = coalesce(dep_direct, dep_from_prov))

Matching then happened in several rounds. First I tried a direct join between the Spanish Wikidata label and the municipality name in the INE table, while also matching on department so that duplicate names in different parts of the country would not be confused.

For the items that remained unmatched, I created cleaned versions of the names by stripping prefixes like “Municipio de” and suffixes like “Municipality,” then tried the join again. That resolved many routine formatting differences.
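
As a rough sketch of that cleaning step, the base R function below strips the two forms named above so that labels from both sources reduce to a bare name. `clean_municipality_name()` is a hypothetical name, the patterns cover only this prefix and this suffix, and the label "Municipio de Yotala" is an illustrative input rather than a confirmed Wikidata label.

```r
# Reduce a municipality label to its bare name by removing the
# administrative prefix (Spanish) or suffix (English). Hypothetical helper.
clean_municipality_name <- function(x) {
  x <- sub("^Municipio de\\s+", "", x)
  x <- sub("\\s+Municipality$", "", x)
  trimws(x)
}

cleaned <- clean_municipality_name(c(
  "San Ignacio de Velasco Municipality",
  "Municipio de Yotala",
  "Sucre"
))
print(cleaned)
```

Labels without either form pass through unchanged, so the function can be applied to both tables before retrying the join on name and department.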

The remaining municipalities needed explicit hand matching. Here I wrote out a table pairing specific Wikidata QIDs with INE municipal codes. This covered a wide range of cases: accent and spelling variation, word-order differences, abbreviations, municipalities whose Wikidata item used an older or alternative official name, and a few items whose labels were so incomplete that I had to identify them from related information such as their linked Wikipedia articles.

match_manual <- tribble(
  ~qid,         ~cod.mun,
  "Q685208",    "080401",  # Santa Ana del Yacuma -> Santa Ana de Yacuma
  "Q721682",    "071501",  # Ascencion de Guarayos -> Ascension de Guarayos
  "Q647771",    "040503"   # Cruz De Machacamarca -> Cruz de Machacamarca
  # and so on, for a total of 24 manual matches
)

One further complication was duplication. Wikidata sometimes had both a municipality item and a city item that described nearly the same place and could easily be confused in matching. In those cases I excluded the city item and kept the municipality item, especially when the latter was clearly marked as a municipality.

With those matches in place, I joined the INE municipal codes back onto the municipality table and generated the QuickStatements to add the new property to each item. This is the resulting batch.

municipalities_matched <- municipalities_wd_exp |>
  left_join(all_matches, by = "qid") |>
  add_quick_statement_column(
    qid, "P14142", cod.mun,
    reference_qid = "Q138354774"
  )

I also used the resulting code assignments to identify municipalities whose administrative parent on Wikidata was still a department rather than a province. Since the first four digits of a municipality’s INE code identify its province, I could derive the correct province and prepare two more batches of QuickStatements: first removing the department as the “located in” value, then adding the proper province item instead.
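
A minimal sketch of that derivation, in base R, is given below. The QIDs are placeholders rather than real items, the two municipal codes are borrowed from the manual match table above, and the removal syntax (a leading minus sign) assumes QuickStatements version 1 commands; P131 is the “located in” property retrieved earlier.

```r
# Toy rows: municipalities whose parent on Wikidata is a department.
# All QIDs here are placeholders, not real Wikidata items.
munis <- data.frame(
  qid = c("Q_M1", "Q_M2"),
  cod.mun = c("080401", "071501"),
  located_in = c("Q_DEP8", "Q_DEP7"),
  stringsAsFactors = FALSE
)
prov_lookup <- data.frame(
  cod.prov = c("0804", "0715"),
  prov_qid = c("Q_P0804", "Q_P0715"),
  stringsAsFactors = FALSE
)

# The first four digits of the municipal code are the province code
munis$cod.prov <- substr(munis$cod.mun, 1, 4)
munis <- merge(munis, prov_lookup, by = "cod.prov")

# Batch 1: remove the department as the "located in" (P131) value;
# a leading minus sign marks a removal
remove_batch <- paste0("-", munis$qid, "\tP131\t", munis$located_in)

# Batch 2: add the province item in its place
add_batch <- paste(munis$qid, "P131", munis$prov_qid, sep = "\t")

writeLines(c(remove_batch, add_batch))
```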

By the end of this process, adding a new property had turned into a broader cleanup of Bolivia’s territorial data on Wikidata. The property itself was only one part of the task. To make the identifiers genuinely useful, I also had to standardize labels, add aliases, resolve ambiguous names, eliminate mistaken overlaps between cities and municipalities, and improve administrative relationships. A well-formed set of Wikidata entries then allows for mass addition of new data from nationwide sources.

Altogether, this process took 637 automated QuickStatements to reorganize and assign codes to Bolivia’s first-, second-, and third-level administrative subdivisions. While I had to think through the coding, I didn’t have to type any of them in manually (though I think I did type one in, just to see how the process worked).

That’s the whole process. Read on for why any of this matters.

Unique identifiers: the key to data interoperability

Unique identifiers solve problems. In the course of building Ultimate Consequences, an interactive database of lives lost in Bolivian political conflict, I’ve had to coordinate the data I’m generating with data from a number of other sources: Census tables, data on Indigenous identification, shapefiles that show the size, location, and outlines of municipalities, and geographic datasets mapping every rural community in the country.

Essentially, each of these forms of data is a big data table whose rows are places. And I need to make them work with my own data table of deaths, each row of which contains a set of place information, laid out across six variables like this:

place	address	community	municipality	province	department
Place	Address/ Intersection	Community/Neighborhood	Municipality	Province	Department
Lugar	Dirección/Intersección	Comunidad/Barrio	Municipalidad	Provincia	Departamento

Now if I want to create a municipal map like this one, I have to count the number of deaths in each municipality and also figure out which item on the map shares that name. But nowhere in the world is there a consistent unambiguous spelling for all places. And in Bolivia in particular names have shifted dramatically, as the country’s municipal map was radically decentralized, adding three hundred new municipalities between 1994 and 2010. Some municipalities have multiple names, but each data source either only lists one, or occasionally both or all three.

For a while I was writing custom functions for each pair of data sources, a process that gets tedious (and/or unreliable) fast. Fortunately, two sources introduced me to a simpler solution: unique identifiers in the form of INE codes, produced by the National Statistical Institute that runs the Bolivian Census. A table like this one can handle all the interchanges, with one column for each data source. And building the table gets easier each time, since I can cross-reference each new set of names against all that came before.

The two sources where I first encountered INE codes were this table of Bolivian municipalities on Spanish Wikipedia and this census population table.

id_muni	muni_gb2014	muni_anexo	muni_ine	muni_census	department	n_unique
010101	Sucre	Sucre	Sucre	Sucre	Chuquisaca	1
010102	Yotala	Yotala	Yotala	Yotala	Chuquisaca	1
010103	Poroma	Poroma	Poroma	Poroma	Chuquisaca	1
010201	Azurduy	Azurduy	Azurduy	Azurduy	Chuquisaca	1

Now I can just use the lookup table to add an INE code to any dataset, with a command add_id_for_municipality() that I just throw into my workflow:

locations$municipality_centroid <- lookup$municipality %>%
  add_id_for_municipality() %>%
  left_join(select(gadm_muni_centroids, cod.mun, lon, lat),
            by = c("id_muni" = "cod.mun"))
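
For readers curious how a lookup of this kind can work, here is a minimal base R sketch. It is not the code behind add_id_for_municipality(); the column names `muni_a` and `muni_b`, the function `lookup_id()`, and the variant spelling "Villa Yotala" are invented for illustration, while the two codes come from the table above.

```r
# A crosswalk with one name column per source (the second Yotala
# spelling is an invented variant)
crosswalk <- data.frame(
  id_muni = c("010101", "010102"),
  muni_a = c("Sucre", "Yotala"),
  muni_b = c("Sucre", "Villa Yotala"),
  stringsAsFactors = FALSE
)

# Reshape to one row per (code, name variant), dropping repeated variants
name_cols <- setdiff(names(crosswalk), "id_muni")
variants <- unique(data.frame(
  id_muni = rep(crosswalk$id_muni, times = length(name_cols)),
  name = unlist(crosswalk[name_cols], use.names = FALSE),
  stringsAsFactors = FALSE
))

# Return the code for whatever spelling a dataset happens to use;
# unknown names come back as NA
lookup_id <- function(name) variants$id_muni[match(name, variants$name)]

ids <- lookup_id(c("Villa Yotala", "Sucre", "Nowhere"))
print(ids)
```

A real version would also need to key on department, since several municipality names occur in more than one department.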

Wikidata: A global index of unique identifiers

Put most simply, Wikidata is an open knowledge graph that anyone can edit and approve. Where Wikipedia primarily hosts articles, Wikidata is filled with structured data, something familiar to most people only through the infoboxes at the upper right of Wikipedia pages and the similar tables of properties served alongside Google searches (which have become much less common there recently, supplanted by paragraph-long AI summaries). A popular single-domain knowledge graph is the Internet Movie Database.

Arguably, what Wikidata is best used for is as an index. Started as a way of aligning the various Wikipedia articles with one another across all of the online encyclopedia’s editions, Wikidata has become a massive repository of entities and their properties, stored in a structured data format whose basic elements are items and their properties.

Each item is a single entity, typically a noun and prototypically one element in a large conceptual set, whether an administrative entity, a species, a human being, or a work of art. And such categories are merely the root values of long conceptual chains of categories. For example, New York City (assigned the sequential code Q60) is an instance of a city (Q515), which is a subclass of a human settlement (Q486972), through several intermediate categories (city or town, urban settlement). Embedded in this chain of connections is a complex conceptual answer about what New York City is. And this is only one of three chains extending upwards from New York City’s Q60 page, which also lead through “municipality in the United States” (Q3327870) to “administrative territorial entity of a single country” (Q12076836) and ultimately to “type of region” (Q137022846).

These are the kinds of descriptions that librarians, archivists, encyclopedia designers, and computer programmers might all love. Given a well-organized system, you can do a lot with such indices. And if nothing else, Wikidata has become a global repository for this kind of index. Once the meeting place just for Wikipedia entries, it now archives hundreds (perhaps thousands) of such indices.

One might know a lot about New York City (my former home of seven years) and only encounter a few such indices for the city in the course of a year: its six area codes and a couple dozen ZIP codes, and the IATA codes of its three major airports (none of which directly link to Q60, since they all describe smaller entities within it). You might come across its top-level Internet domain .nyc, which is on Q60. But there you’ll also find scores of other index IDs: codes used to identify the city in library catalogs, in online encyclopedias, and even as a subreddit.

Where this really matters is in research applications, like genomics or data science, where knowing which precise entity is being named and measured across different sources is absolutely crucial. In any one source, we use unique identifier codes to tell things apart. Wikidata makes it possible to link those codes together and thereby to combine different forms of data without going line-by-line through the datasets to find matches. (Possible, but not easy, it must be said.)