Center of Population by Ethnicity
Introduction
Census data is a big part of what I do and what I’m interested in. Particularly, the Mean Center of Population is something that’s always intrigued me. Consolidating every person in the U.S. to one point is perhaps the peak of descriptive statistics: squishing all of the countless personalities, names, and relationships that make up the nation into a singular point located near the unassuming town of Hartville, Missouri. In fact, if you asked Americans what they think of as the “center of America,” I’d wager many guesses wouldn’t be too far from there.
Obviously, though, that isn’t everyone’s center of the country. There are myriad ways to slice a population, only some of which are captured in the Census: age, gender, income, marital status, race and ethnicity, just to name a few. The last of these categories, however, is often the one that defines a person’s experience in the U.S., for better or for worse, and these experiences often start with the distribution of a group across space.
Whether discussing the enslavement of African American communities, largely in the Southeast; the immigration of Hispanic families, many of whom end up in Texas, California, Arizona, and New Mexico; or the “Manifest Destiny” of the White man moving westward – all of these happened somewhere, and impact the distribution of descendants today.
This project seeks to elaborate on the artifacts of these histories by investigating the Mean Center of Population, split out by the summarized race and ethnicity categories collected in the decennial Census. Though this example focuses on only one survey question, the methodology can easily be applied to any of the other ways the Census Bureau stratifies the American population, across any smaller geography.
Methods
Luckily, the Census Bureau is quite transparent with the methods used to calculate the Mean Center: there’s a whole PDF explaining the method from start to finish, including changes over time. The information is dense, so here’s the important stuff.
From 1960 onward, the formula by which the Mean Center is calculated is as follows:
\[ \bar{\phi} = \frac{\sum w_i \phi_i}{\sum w_i} \]
where \(\bar{\phi}\) is the mean latitude of the population, and
\[ \bar{\lambda} = \frac{\sum w_i \lambda_i \cos\left(\phi_i \frac{\pi}{180}\right)}{\sum w_i \cos\left(\phi_i \frac{\pi}{180}\right)} \]
where \(\bar{\lambda}\) is the mean longitude.
\(\phi_i\), \(\lambda_i\), and \(w_i\) are the individual latitude, longitude, and weight (in this case, population) of each of the small units used in computation, respectively.
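To make the formulas concrete, here's a toy check in base R with three hypothetical blocks (all numbers invented for illustration):

# three hypothetical blocks: populations (w), latitudes (phi), longitudes (lambda)
w      <- c(100, 250, 50)
phi    <- c(37.2, 40.1, 35.8)    # degrees latitude
lambda <- c(-92.5, -89.9, -95.3) # degrees longitude

# mean latitude: a simple population-weighted average
mean_lat <- sum(w * phi) / sum(w)
# mean longitude: additionally weighted by cos(latitude), since meridians
# converge toward the poles and a degree of longitude shrinks with latitude
mean_lon <- sum(w * lambda * cos(phi * pi/180)) / sum(w * cos(phi * pi/180))
c(mean_lat, mean_lon)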
Now that the mathematical basis for our computations has been established, analysis can begin. Here are the packages we’ll need, and some handy options:
library(tidyverse)
library(tidycensus)
library(tigris)
library(sf)
library(tmap)
library(tmaptools)
library(showtext)
library(furrr)
library(data.table)
library(dtplyr)
options(tigris_use_cache = TRUE)
options(scipen = 999)
extrafont::loadfonts(device = "win", quiet = TRUE)
Parallelization
Operating on Census blocks nationally is no small feat. There are over 11,000,000 blocks; multiply that by 8 population categories and we’re operating on nearly 90 million rows! Parallelization is a must in this situation.
fips_list <- unique(fips_codes$state)[1:51] # create list to apply over
num_cores <- availableCores() - 1 # keep one core free while running to avoid crashes
plan(multisession, workers = num_cores)
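As an aside: when the parallel work is finished, switching plans back is the documented way to release the workers (an alternative to the internal `future` call used later in this post):

# once the heavy lifting is done, return to sequential processing;
# this releases the multisession workers
plan(sequential)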
Population groupings
The variable codes of interest can change as different summary files are released, but the population categories should be identical across decades. Because this analysis was done before the full 2020 Census tabulations were released, the preliminary redistricting data were used for population counts, as shown below:
# use PL 94-171 data for recent vintages
pop_vars <- c(white = "P2_005N",
              black = "P2_006N",
              aian  = "P2_007N",
              asian = "P2_008N",
              hipi  = "P2_009N",
              other = "P2_010N",
              two_p = "P2_011N",
              hisp  = "P2_002N")
The groupings I’m using here are tabulated below:
| Group | Variable | Census Sub-groups |
|---|---|---|
| Hispanic | P2_002N | “Hispanic or Latino” |
| NH White | P2_005N | “Not Hispanic or Latino; White alone” |
| NH African-American | P2_006N | “Not Hispanic or Latino; Black or African-American alone” |
| NH American Indian and Alaska Native | P2_007N | “Not Hispanic or Latino; American Indian and Alaska Native alone” |
| NH Asian | P2_008N | “Not Hispanic or Latino; Asian alone” |
| NH Hawaiian and Pacific Islander | P2_009N | “Not Hispanic or Latino; Native Hawaiian and Other Pacific Islander alone” |
| NH Other + NH Two or more | P2_010N + P2_011N | “Not Hispanic or Latino; Some Other Race alone” + “Not Hispanic or Latino; Two or more races” |
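Because these codes can drift between summary files, it’s worth double-checking them against the official variable list, which `tidycensus::load_variables()` returns; a quick sketch:

# pull the PL 94-171 variable list for 2020 and peek at the P2 table codes
pl_vars <- load_variables(2020, "pl", cache = TRUE)
pl_vars %>% filter(str_detect(name, "P2_"))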
Data collection
Now, things start to get real. This looks tricky, but isn’t too hard to understand once you get past the density of it:
yr <- 2020 # census vintage; this analysis was also run for 2010

b <- reduce(future_map(fips_list, function(x){
  get_decennial(geography = "block",
                variables = pop_vars,
                year = yr,
                state = x,
                output = "tidy",
                geometry = FALSE)}, .progress = TRUE), rbind)

c <- reduce(future_map(fips_list, function(x){blocks(state = x, year = yr) %>%
    st_drop_geometry() %>%
    mutate(across(.cols = starts_with("INTPTLON"), .fns = as.numeric, .names = "lon")) %>%
    mutate(across(.cols = starts_with("INTPTLAT"), .fns = as.numeric, .names = "lat")) %>%
    select(c(5, last_col(offset = 1), last_col()))}, .progress = TRUE), rbind)

future:::ClusterRegistry("stop")

fwrite(as.data.table(b), paste0("../data/pop_data_", yr, ".csv"))
fwrite(as.data.table(c), paste0("../data/latlon_", yr, ".csv"))
- Iterate `get_decennial()` over each of our states in `fips_list`, using `future_map()` to take advantage of multiple CPU cores
  - Collect population data for each of our `pop_vars` defined earlier, and keep the data in tidy format for grouping and summarization
- `reduce()` the resulting set of dataframes into one long dataframe, using `rbind` as the `.f` argument
- Iterate `blocks()` over each of our states in `fips_list`, again using `future_map()`
  - Clean up the internal point coordinates with `as.numeric()`, and select only the important columns
- Again, `reduce()` the set of dataframes into one, with an eighth as many rows as `b`

Finally, we can stop our cluster and write the two objects to disk, using `data.table`’s `fwrite()`. These can be read back for subsequent analyses, or chunked to avoid memory issues in the computation steps; a sketch of one chunking approach follows.
Computation
Now, we need to join our tables in order to weight the internal point of each block with its respective population. Note the usage of `{dtplyr}` here: it allows us to keep writing in tidyverse syntax while gaining the performance of `{data.table}`.

Since block-level population centers aren’t available through `{tigris}`, we will use the “internal point” latitude and longitude. The internal point of a geography is often the centroid: if a geography is shaped such that the centroid would fall outside its boundary, the internal point is placed as close to the internal centroid of the geography as possible, preferably on land. We’ll be using the formulas from earlier, after translating them from mathematician to R.
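If you haven’t used `{dtplyr}` before, the pattern in miniature looks like this (a toy example, using the packages loaded earlier): wrap a table in `lazy_dt()`, pipe dplyr verbs as usual, and collect the result with `as.data.table()`.

# wrap a data.table, pipe dplyr verbs, then collect with as.data.table()
toy <- lazy_dt(data.table(g = c("a", "a", "b"), x = 1:3))
toy %>%
  group_by(g) %>%
  summarize(total = sum(x)) %>%
  as.data.table()
# returns one row per group: total is 3 for "a" (1 + 2) and 3 for "b"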
b <- lazy_dt(fread(paste0("../data/pop_data_", yr, ".csv")))
c <- lazy_dt(fread(paste0("../data/latlon_", yr, ".csv")))

o <- b %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  mutate(other = other + two_p) %>%
  select(-c(two_p)) %>%
  pivot_longer(cols = 3:9, names_to = "variable")

j <- left_join(o, c, by = c("GEOID" = paste0("GEOID", str_sub(yr, 3, 4)))) %>%
  mutate(elat = value * lat,
         elon = value * lon * cos(lat * (pi/180)),
         denom = value * cos(lat * (pi/180)))

g <- j %>%
  group_by(variable) %>%
  summarize(tlat = sum(elat, na.rm = TRUE)/sum(value, na.rm = TRUE),
            tlon = sum(elon, na.rm = TRUE)/sum(denom, na.rm = TRUE),
            pop = sum(value))

cc <- as.data.table(j) %>%
  summarize(tlat = sum(elat, na.rm = TRUE)/sum(value, na.rm = TRUE),
            tlon = sum(elon, na.rm = TRUE)/sum(denom, na.rm = TRUE),
            pop = sum(value))

g <- as.data.table(g)
I chose to read `b` and `c` back in from disk, so that I could skip querying the Census API in the future. If you still have `b` and `c` in your environment, and have enough memory to keep using them, that’s also an option. Otherwise, you’ll have to get creative with chunking your files on disk to allow for smaller object sizes in memory. Note that because of how the Mean Center formula is written, it can nest again and again: one way to do this with less memory would be to compute the center of each state, then aggregate those into regions, then aggregate those to the national level. Luckily for me, 32GB of RAM was just about enough to hold all of the objects in memory.
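Here’s a sketch of that nesting idea (not code from the original pipeline; it assumes the `j` table built above and derives the state from the first two digits of each block GEOID). The trick is to carry the numerator and denominator sums upward, rather than the already-averaged coordinates:

# stage 1: collapse blocks to per-state sums (state FIPS = first two GEOID digits)
state_sums <- j %>%
  mutate(state = str_sub(GEOID, 1, 2)) %>%
  group_by(state, variable) %>%
  summarize(num_lat = sum(elat, na.rm = TRUE),
            num_lon = sum(elon, na.rm = TRUE),
            w = sum(value, na.rm = TRUE),
            d = sum(denom, na.rm = TRUE))

# stage 2: combine the state-level sums into national means; this matches
# computing directly from blocks because the weighted sums are associative
national <- state_sums %>%
  group_by(variable) %>%
  summarize(tlat = sum(num_lat) / sum(w),
            tlon = sum(num_lon) / sum(d))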
The main computation chunk, thanks to the magic of `dtplyr`, is a bit easier to read than the collection code, combining the performance of `data.table` with the readability of `dplyr`. The steps are, generally, as follows:
- Pivot `b` wider so that each population group has its own column, sum `other` and `two_p`, drop `two_p`, then pivot longer again to create `o`
- Join `o` and `c` by their GEOIDs, then calculate their individual components of weight to create `j`
- Group `j` by each category variable, then summarize to `tlat` and `tlon` (total mean latitude and longitude, respectively)
- Create `cc` based on `j`, but without grouping by variable; this yields the total Mean Center of Population
- Coerce `g` to a `data.table`, which executes the steps queued on the `lazy_dt` object and creates an in-memory object
These two objects, `g` and `cc`, contain the coordinates of the Mean Center of Population for each ethnicity category we’ve defined. Now, there’s only one thing left to do before we can map them: we need to make them into `sf` objects.
Creating spatial data
s <- g %>%
  st_as_sf(coords = c("tlon", "tlat")) %>% st_set_crs(4326) %>% st_transform(6350)

cc <- cc %>%
  st_as_sf(coords = c("tlon", "tlat")) %>% st_set_crs(4326) %>% st_transform(6350)

lso <- vector(mode = 'list', length = nrow(s))

for (row in 1:nrow(s)){
  cs <- data.frame(st_coordinates(cc)[1], st_coordinates(cc)[2]) %>%
    rename(p_x = 1,
           p_y = 2) %>%
    mutate(seq = 1)
  cs <- cbind(g[, c(1, 4)], cs)

  ss <- s[row, ] %>%
    rowwise() %>%
    mutate(p_x = unlist(geometry)[1],
           p_y = unlist(geometry)[2]) %>%
    mutate(seq = 2) %>%
    st_drop_geometry() %>%
    ungroup()

  ap <- rbind(ss, cs[row, ]) %>%
    arrange(variable, seq)

  m <- as.matrix(ap[order(ap$seq), c("p_x", "p_y")])

  ls <- st_linestring(m) %>%
    st_sfc() %>%
    st_sf() %>%
    mutate(variable = ap$variable[1],
           pop = ap$pop[1]) %>%
    bind_cols()

  lso <- lso %>% bind_rows(ls)
}

lsp <- lso %>% st_set_crs(6350)
I really wanted to have lines connecting each of my points (in `g`) to the total Mean Center of Population (`cc`). To do so, I had to create a `LINESTRING` for each point, with the start point at `cc` and the endpoint at each point in `g`. After converting `g` to an `sf` object (`s`), I created a list the same length as `s`. By pre-allocating an empty object the same length as the object I’m iterating over, I spare R from repeatedly allocating more space for the incoming objects. Josiah Parry has an awesome YouTube video discussing this topic, for those who think looping is slow in R (like I used to!).
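For what it’s worth, `sf` can also build these connectors without a loop: `st_nearest_points()` returns the segment between each pair of geometries, and for two point layers that segment is exactly the connecting line. A loop-free sketch (my own alternative, not the approach used above):

# one LINESTRING from the total center (cc) to each group center in s;
# both layers are already in EPSG:6350, which the result inherits
lsp_alt <- st_sf(variable = s$variable,
                 pop = s$pop,
                 geometry = st_nearest_points(cc, s))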
Plotting
Finally, it’s time to plot our information. Now that everything is an `sf` object, this is pretty trivial. I chose `tmap` for my plotting, but you could just as easily use `ggplot2` for static maps or, if you prefer interactive maps, `mapgl` or `mapview`.

I chose to symbolize my points by color, representing a population group; and by size, representing the population of that group. However, the values for the minimum population and the maximum population are really far apart! The points barely show up on the low end. So, first, I’ll scale the population values in `s` to reduce the magnitude of the difference between the most- and least-populous groups.
s <- s %>% mutate(pop_pct = pop/sum(pop),
                  log_pct = abs(log(pop_pct)),
                  normalized_log_pct = 0.1 + (log_pct - max(log_pct)) / (min(log_pct) - max(log_pct)) * (0.7 - 0.1))
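To see what this transform does, here’s a quick check with two hypothetical population shares; the most-populous group lands at the top of the 0.1 to 0.7 size range and the least-populous at the bottom:

# two hypothetical shares: 60% and 0.2% of the total population
demo <- tibble(pop_pct = c(0.60, 0.002)) %>%
  mutate(log_pct = abs(log(pop_pct)),
         size = 0.1 + (log_pct - max(log_pct)) / (min(log_pct) - max(log_pct)) * (0.7 - 0.1))
demo$size # ~0.7 for the 60% group, 0.1 for the 0.2% group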
Let’s put it on a map!
ilh <- palette(c(rgb(114, 153, 67, maxColorValue = 255),
                 rgb(148, 79, 161, maxColorValue = 255),
                 rgb(76, 196, 144, maxColorValue = 255),
                 rgb(185, 74, 115, maxColorValue = 255),
                 rgb(193, 158, 60, maxColorValue = 255),
                 rgb(104, 123, 210, maxColorValue = 255),
                 rgb(185, 85, 61, maxColorValue = 255)))

plot_fips <- unique(fips_codes$state_code)[1:51]
plot_fips <- plot_fips[!plot_fips %in% c("02", "15", "72", "78")]

us <- states(cb = TRUE, year = yr) %>%
  filter(if (yr == 2010) STATE %in% plot_fips else STATEFP %in% plot_fips) %>%
  st_transform(6350) # weird artifact in tigris means the column names don't match across years
t <- tm_shape(us, bbox = bb(us, ext = 1.1)) +
  tm_polygons(border.col = "#aaaaaa", lwd = 0.75, col = "#5b5b5b") +
  tm_shape(lsp) +
  tm_lines(col = "variable", palette = ilh, legend.col.show = FALSE, lwd = 1.5, legend.lwd.show = FALSE) +
  tm_shape(s) +
  tm_symbols(col = "variable", title.col = "Race/Ethnicity", size = "normalized_log_pct",
             border.col = "#bdbdbd", palette = ilh, border.lwd = 1,
             legend.col.show = FALSE, legend.size.show = FALSE) +
  tm_shape(cc) +
  tm_dots(col = "#1286c4", shape = 24, title = "Total center of population", size = 0.6,
          legend.show = FALSE, border.lwd = 1, border.col = "#bdbdbd") +
  tm_add_legend(type = "symbol",
                labels = c("American/Alaskan Native", "Asian", "African-American",
                           "Hawaiian/Pacific Islander", "Hispanic", "Other/Two or more", "White"),
                col = ilh,
                border.col = "#bdbdbd",
                title = "Ethnicity",
                size = 0.4) +
  tm_add_legend(type = "symbol",
                shape = 24,
                col = "#1286c4",
                size = 0.6,
                border.col = "#bdbdbd",
                labels = "Total center of population") +
  tm_layout(main.title = "Center of population by race/ethnicity",
            main.title.fontfamily = "Manrope",
            main.title.fontface = 2,
            main.title.size = 2,
            main.title.color = "#bdbdbd",
            bg.color = "#3b3b3b",
            outer.bg.color = "#3b3b3b",
            frame = FALSE,
            legend.outside = TRUE,
            legend.position = c(0, 0.25),
            legend.text.color = "#bdbdbd",
            legend.text.size = 0.75,
            legend.text.fontfamily = "Manrope",
            legend.title.color = "#bdbdbd",
            legend.title.fontface = 2,
            legend.title.size = 1.5,
            legend.title.fontfamily = "Manrope") +
  tm_credits(paste0("Decennial census data, ", yr, "\nGraphic by Harrison DeFord (@oonuttlet)"),
             position = c(0, 0.08),
             col = "#bdbdbd",
             fontfamily = "Open Sans",
             size = 0.62)

t
if (!file.exists(paste0("../bin/center_of_pop_natl_",yr,"_intpt.png"))){
tmap_save(t, paste0("../bin/center_of_pop_natl_",yr,"_intpt.png"), dpi = 1200)
}
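As an aside, if you just want a quick interactive look at the centers before (or instead of) the full `tmap` styling, the `mapview` option mentioned earlier is a one-liner (not part of my workflow above):

# one-line interactive check of the group centers (requires the mapview package)
library(mapview)
mapview(s, zcol = "variable")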
Discussion
The completed map is visible below. Some trends are immediately apparent: the African-American point is dragged to the southeast, and the Hawaiian/Pacific Islander point is in the Pacific. I invite you to look over the map and think about how the historic distribution of these populations could shape their modern-day Mean Center of Population.
Finally, I’d like to investigate the change in Mean Centers from 2010 to 2020. I’ve computed this analysis only for 2010 and 2020, though in theory it would be trivial to calculate for earlier years (at least back to 1960). The GIF below shows the shift in Mean Centers between the two decades. Much of the movement is probably systemic: 2020 was the first year in which the decennial census was administered mostly online. Again, I invite you to think about what other variables might drive the shift between decades. Of note: COVID-19 didn’t impact the U.S. until around March 12, 2020, and the Census Bureau aims to count every person as of April 1 of the given vintage. COVID-19 may therefore have affected Census response rates, but so soon after its arrival it is unlikely to have caused real net gains or losses in any given area.
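If you’d like to build a similar animation yourself, one possible approach (the tooling I used isn’t shown here) is to stitch the saved PNGs together with the `magick` package:

# stitch the 2010 and 2020 maps into a simple two-frame GIF (assumes both
# PNGs exist at these paths; magick is not loaded in the setup chunk)
library(magick)
frames <- image_read(c("../bin/center_of_pop_natl_2010_intpt.png",
                       "../bin/center_of_pop_natl_2020_intpt.png"))
image_write(image_animate(frames, fps = 1), "../bin/center_shift_2010_2020.gif")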