This week's class covers spatial regression models. These models take into account that spatial units, such as neighborhoods, may influence each other. Most traditional regression techniques assume independence between the units. In this class we will mainly go over one model, the endogenous spatial lag model, which explicitly estimates how different spatial units influence one another.

As a reminder, the readings for this week are:

- Bernasco & Elffers (2010) Statistical analysis of spatial crime data. In *Handbook of Quantitative Criminology*, pgs. 699-724.
- Land & Deane (1992) On the large-sample estimation of regression models with spatial- or network-effects terms: A two-stage least squares approach. *Sociological Methodology* 22: 221-248.
- Lesage and Pace (2010) *The biggest myth in spatial econometrics*. SSRN http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1725503

So why do we care about spatial autocorrelation anyway? Consider estimating a regression model, say the effect of the number of bars in a neighborhood on the number of crimes:

\[\text{Crime} = \beta_0 + \beta_1(\text{Bars}) + \epsilon\]

To make inferences about the size of the effect of bars on crime, \(\beta_1\), we need to make some assumptions. One of those assumptions is that the errors, \(\epsilon\), are independent of one another. This assumption is often violated when you have contiguous data over one city. Other examples you may be familiar with in social science applications are repeated observations of the same individuals, or a series of the same measurements over time (i.e. a time series).

In particular, here are the implications of what happens to our inferences about \(\beta_1\) when there is residual spatial auto-correlation in the model:

- the standard error around \(\beta_1\) is too small. So if you have a p-value of just below 0.05, but there is residual autocorrelation, that effect is probably not statistically significant at the 0.05 level
- \(\beta_1\) is biased - that means as the sample size grows it does not necessarily get closer to the right value. This is a case though where we can make a good guess at the direction of the bias. If the residual auto-correlation is positive (which it almost always is)^{1}, the coefficient will be biased *away from zero* (J. P. Lesage and Pace 2009). What that means is that the absolute size of the coefficient will be too large. (In the rare case of negative spatial autocorrelation the opposite happens: it is biased towards zero.)
- and finally, the model does not represent reality

For the last point, regression models in general are abstractions of reality. While no model is perfect, we want them to be as close to the actual data generating process as we can get. A model that is closer to that process is surely better than one that is further away. Taking spatial autocorrelation into account tends to make the model closer to how we actually think the process operates. Spatial autocorrelation essentially represents another aspect of the data that we can exploit to better explain the outcome of interest.
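To make the bias point concrete, here is a small Monte Carlo sketch in Python (the chain-shaped weights matrix and all parameter values are made up for illustration). Data are generated from a spatial lag process, and plain OLS that ignores the spatial structure recovers a slope noticeably larger than the true \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, beta = 100, 0.7, 2.0

# Made-up chain of 100 areas, binary contiguity, row-normalized
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)
M = np.linalg.inv(np.eye(n) - rho * W)  # spatial multiplier

slopes = []
for _ in range(500):
    x = rng.normal(size=n)
    eps = rng.normal(size=n)
    y = M @ (beta * x + eps)  # data generated with spatial dependence
    # OLS slope of y on x, ignoring the spatial process
    slopes.append(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

print(np.mean(slopes))  # noticeably larger than the true beta of 2
```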

But how do we take spatial autocorrelation into account? To determine what counts as nearby in space, we will use the same spatial weights matrix, \(W\), that we discussed in the week 6 lecture. We will also be using the same spatial lag operation, \(Wx\), as we used last week. So if either of those concepts is confusing to you, you should go back and re-read last week's lecture notes. With that in mind, the workhorse of spatial regression models is the *endogenous spatial lag model*, which can be written as:

\[y = \rho Wy + X\beta\]

For this model, the outcome, \(y\), is a function of its own spatial lag, \(Wy\), as well as the independent variables and their coefficients, \(X\beta\) (that is, there can be more than one independent variable, but you can pretend it is only one for simplicity if you want). In this equation, \(\rho\) is the endogenous effect of \(y\) on itself. (Endogenous just means that \(y\) is both on the left hand side of the equation and on the right hand side. Variables on the left hand side are endogenous.) You could interpret \(\rho\) as crime begets more crime - if an area has higher counts of crime, areas nearby will also have higher counts of crime.
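As a quick refresher on the lag operation, here is a minimal sketch in Python (the 4-area contiguity matrix and values are made up for illustration):

```python
import numpy as np

# Made-up example: 4 areas in a line, binary contiguity (1 = shares a border)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Row-normalize so each row sums to 1 -- the spatial lag Wy is then
# the average of the neighboring values
W = W / W.sum(axis=1, keepdims=True)

y = np.array([10.0, 20.0, 30.0, 40.0])  # e.g. crime counts per area
lag_y = W @ y  # spatial lag: each element is the mean of its neighbors
print(lag_y)   # [20. 20. 30. 30.]
```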

This model cannot be easily estimated though; see Land and Deane (1992). You cannot simply include \(Wy\) on the right hand side of the regression equation and estimate the equation with ordinary least squares. (Despite that Land and Deane paper being well cited and over 20 years old, this mistake is still quite common.^{2}) The way to estimate the model via maximum likelihood is to rewrite the equation so \(y\) is only on the left hand side, where \(I\) is the identity matrix:

\[y - \rho Wy = X\beta\] \[(I - \rho W)y = X\beta\] \[y = (I - \rho W)^{-1} X\beta\]

Now the term \((I - \rho W)^{-1}\) may seem pretty mysterious, so how do we interpret it? It is actually not that hard, and implies indirect spillovers of the independent variables. First take note that \((I - \rho W)^{-1}\) can be written as an infinite regress:

\[(I - \rho W)^{-1} = I + \rho W + \rho^2 W^2 + \rho^3 W^3 \dots\]
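We can verify this identity numerically, since with a row-normalized \(W\) and \(|\rho| < 1\) the partial sums converge to the inverse. A sketch using a made-up 4-area contiguity matrix:

```python
import numpy as np

# Made-up 4-area chain, binary contiguity, row-normalized
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)
rho = 0.5

direct = np.linalg.inv(np.eye(4) - rho * W)

# Partial sums of I + rho*W + rho^2*W^2 + ... converge to the inverse
series = np.zeros((4, 4))
term = np.eye(4)
for _ in range(50):
    series += term
    term = rho * (W @ term)

print(np.max(np.abs(direct - series)))  # ~0, since |rho| < 1
```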

So for the first iteration, this term is simply \(I\) (the identity matrix), multiplied by \(X\beta\). This is just like a normal regression equation. The second term multiplies that normal regression equation by the spatial lag and by \(\rho\). Then subsequent terms take higher powers of the spatial weights matrix and of \(\rho\). Higher powers of the spatial weights matrix basically look at the neighbors of neighbors. So if \(W\) is a binary contiguity matrix, then \(W^2\) would be positive for the neighbors of your first order neighbors (i.e. the second order neighbors).

It may be easier to illustrate with an example. Say we estimate our bar equation on crime, and also include an endogenous lag. Let's say that the equation ends up as (where \(y\) is the number of crimes, and \(x\) is the number of bars):

\[y = 0.5Wy + 2x\]

So let's say for a location we add one bar. For the first iteration of our infinite series, we then have:

\[I \cdot (2 \cdot 1) = 2\]

So crime at the location where we added one bar is predicted to go up by 2, and this is called the *local* effect of bars on crime. But because places with crime are expected to neighbor places with more crime, this diffuses crime to neighboring areas. So then the spillover effects for the sets of neighbors are:

- first order neighbors, \(0.5 \cdot W (2 \cdot 1) = 1 \cdot W\)
- second order neighbors, \(0.5^2 \cdot W^2 (2 \cdot 1) = 0.5 \cdot W^2\)
- third order neighbors, \(0.5^3 \cdot W^3 (2 \cdot 1) = 0.25 \cdot W^3\)
- etc.

The series geometrically decays with higher order spatial lags. The diagram below shows this process, with the local effect the square in the middle, the first spatial effect as the area around the initial location, and the second order spatial effect as the area beyond that. So adding one bar diffuses into neighboring areas, but that diffusion effect is smaller the further out in space you go, and the rate of decay is dictated by \(\rho\).

This diagram is not quite right, because the second order neighbors reflect some crime back into the original location, but you should get the idea behind this. The equation for the total spatial impact (over infinitely many spatial spillovers) is the same as for time series models with one auto-regressive term:

\[\frac{1}{(1-\rho)}\cdot \beta\]

So for our example equation, with \(\beta = 2\) and \(\rho = 0.5\), the cumulative total impact of bars is \(\frac{1}{(1-0.5)}\cdot 2 = 4\). Since the local effect is 2, we then know the total spatial spillover effect is 2 as well. That is, although the beta estimate for the effect of bars on crime is 2, if you add one bar it will effectively increase crimes by 4 over the entire study area, not just in the place where the bar is added. The idea behind the J. Lesage and Pace (2010) paper is that even if you use different definitions of \(W\), oftentimes the total effects (local and spatial) end up being very similar. You cannot simply compare the regression coefficients though, as they do not speak to the total diffusion effects.
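This arithmetic can be checked directly: with a row-normalized \(W\), every row of \((I - \rho W)^{-1}\beta\) sums to exactly \(\beta / (1 - \rho)\). A sketch with a made-up 4-area contiguity matrix:

```python
import numpy as np

# Made-up 4 areas in a line, binary contiguity, row-normalized
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)

rho, beta = 0.5, 2.0

# Effect of a one-unit change in x on every area's outcome
M = np.linalg.inv(np.eye(4) - rho * W) * beta

# With row-normalized W each row of M sums to beta / (1 - rho)
print(M.sum(axis=1))  # every entry equals 4.0
```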

The endogenous spatial lag model can be extended to include spatial lags of the independent variables as well. This model is referred to as the Spatial Durbin Model:

\[y = \rho W y + X\beta + W X \lambda\]

Where \(WX\) are the spatially lagged independent variables and \(\lambda\) are their effects. You can also generally include spatial lags of *independent* variables in any type of regression equation you want. (If the local values are exogenous, the spatially lagged values should be exogenous as well.)

For some quick additional notes about endogenous spatial lag models:

- To ensure the spatial term is not explosive, \(W\) for endogenous spatial lags are almost always row normalized (binary contiguity can be problematic).
- \((I - \rho W)^{-1}\) is very hard to estimate for large \(W\), hence why two stage least squares is nice sometimes.
- You can estimate the effects of multiple endogenous lags; see Deane et al. (2008) for one example in which Glenn has two spatial weights matrices, one based on spatial distance and one based on cities being in the same state.

As a final point, this model is basically limited to linear outcomes. There are extensions for binary (0,1) data, but you cannot estimate an endogenous spatial lag in a Poisson regression model. This last point is a stickler - since estimating crime outcomes at small places typically involves low count data, endogenous spatial lag models are often not feasible. But in the next section I will go over alternative models that take spatial autocorrelation into account for these low count crime models.

There are three additional models that take into account spatial autocorrelation. I will only touch on them briefly, as I prefer the endogenous spatial model due to its spatial diffusion interpretation in most circumstances, but other circumstances may dictate what model you choose.

The first is the spatial error model. In this model, instead of an endogenous spatial lag, the errors are autocorrelated (Loftin and Ward 1983).

\[y = X\beta + u\] \[u = \rho Wu + \epsilon\]

So in this notation you can see that \(u\) has a spatial lag effect, meaning the errors are correlated in space. This does not result in spatial diffusion effects though, as the endogenous spatial lag model does.
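To build some intuition, the error process can be simulated by solving \((I - \rho W)u = \epsilon\) for \(u\). A minimal sketch with a made-up chain of areas:

```python
import numpy as np

# Made-up chain of 10 areas, binary contiguity, row-normalized
n = 10
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)

rho = 0.7
rng = np.random.default_rng(0)
eps = rng.normal(size=n)  # i.i.d. innovations

# Solve (I - rho W) u = eps rather than forming the inverse explicitly
u = np.linalg.solve(np.eye(n) - rho * W, eps)

# u satisfies the defining equation u = rho W u + eps, so neighboring
# errors are now correlated even though eps is i.i.d.
print(np.allclose(u, rho * (W @ u) + eps))  # True
```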

In comparison to the endogenous spatial lag model, these are equivalent to the difference between auto-regressive and moving average processes in time series models. It is very hard to tell the difference between these two processes in practice, but if you must choose between them Luc Anselin has a model selection procedure outlined in his GeoDa workbook on page 199.

I have only seen this model estimated for linear outcomes as well. In spirit it is more similar to spatial filtering approaches.

The simplest spatial filtering approach is to include the spatial coordinates as explanatory variables on the right hand side of the regression equation.

\[\text{Outcome}_i = X\beta + f(x_i,y_i)\]

Here the \(f(x_i,y_i)\) is just generic notation that means the outcome is some function of the x and y spatial coordinates of the units. This has the benefit that you can estimate this using any type of regression model, including Poisson generalized linear models. It simply involves including the non-linear terms on the right hand side of the regression equation.
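As a sketch of what this looks like in practice, the following simulates counts whose intensity varies smoothly over space and fits a Poisson model with polynomial coordinate terms via iteratively reweighted least squares in plain numpy (the coordinates, coefficients, and data are all made up; in real work you would use a canned GLM routine):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
xc, yc = rng.uniform(0, 10, n), rng.uniform(0, 10, n)  # made-up coordinates

# Design matrix: intercept plus polynomial terms of the coordinates
X = np.column_stack([np.ones(n), xc, yc, xc**2, yc**2, xc * yc])

# Simulate counts whose intensity varies smoothly over space
true_beta = np.array([0.5, 0.2, -0.1, -0.02, 0.01, 0.005])
y = rng.poisson(np.exp(X @ true_beta))

# IRLS for a Poisson GLM with log link
beta = np.zeros(X.shape[1])
for _ in range(25):
    mu = np.exp(X @ beta)
    z = X @ beta + (y - mu) / mu  # working response
    XtW = X.T * mu                # X^T diag(mu)
    beta = np.linalg.solve(XtW @ X, XtW @ z)

print(np.round(beta, 2))  # estimated coefficients
```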

This is what I did for my dissertation predicting crime at micro places in Washington, D.C. using negative binomial regression equations. Basically you just include the X and Y coordinates on the right hand side of the regression equation, and you can make the effects non-linear by including splines or polynomial terms. Below is an example from my dissertation. The map on the left shows the kernel density estimate of crime, and the map on the right shows how well tensor splines of the XY coordinates of street units replicates those overall hot spots.

Again this does not result in spatial diffusion effects, but it effectively controls for the background rate of crime. I liken it to Shaw and McKay’s zonal model - just estimated based on data instead of arbitrarily drawing circles over the city. There are other ways to do spatial filtering though; see Patuelli et al. (2011) for an example.

A final type of model I want to mention is conditional auto-regressive models. (These are frequently abbreviated CAR, whereas the endogenous spatial lag model is abbreviated SAR for spatial auto-regressive and the spatial error model is abbreviated SER.) Instead of estimating the effects of spatial lags, the CAR model is more like estimating the effect of the difference between the local value and the spatial lag. One way to write the model is:

\[\mathbb{E}[y_i \mid y_j] = g\left[\mu_i + \rho \sum_j w_{ij}(y_j - \mu_j)\right]\]

The last term is the spatial differences term, and the notation \(\mathbb{E}[y_i|y_j]\) means the expected value of \(y_i\) conditional on the neighboring values of \(y_j\). These models are almost always fit using Bayesian Markov Chain Monte Carlo techniques (and for technical reasons are easier to estimate than SAR models using MCMC). See Wheeler and Waller (2009) for one example (no relation). CAR models can be fit to Poisson outcomes, and subsequently are more popular among health researchers. (SAR and SER models tend to be more popular among economists, and I prefer them as well because they are much easier to estimate and interpret.)

As a final note, no matter what spatial model you choose to fit, you need to assess whether there is spatial autocorrelation in the model residuals. If you happen to fit a non-spatial model to spatial data, and there is not any residual spatial autocorrelation, you do not need to worry about fitting any more complicated spatial data model. Some people make the mistake that if the outcome is spatially autocorrelated, you *need* to fit a spatial model. That is not the case - we only care about the residuals in regression, not the marginal outcome.

But even if you do fit a spatial model, you still need to check whether spatial autocorrelation is still evident in your model. You do this by calculating the residuals from your regression model, and then looking at the global Moran’s I value (see week 6). I will show an example of this in this week's tutorial. When looking at regression residuals, Moran’s I does not have a well behaved null distribution, so I suggest using a permutation approach (Anselin 1995). Basically, the permutation approach takes the original data and randomly shuffles the values across spatial locations. Since the shuffling is random, the resulting Moran’s I values are what would happen with non-spatially autocorrelated data. You do this permutation say 99 times, and if your observed value of Moran’s I for the actual data or residuals is higher than all of the simulations, the p-value is 0.01, so you would reject the null of no spatial autocorrelation. (You can do more simulations if you want a lower or more specific p-value.)
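A minimal sketch of the permutation approach (the weights matrix and residuals here are made up; in practice you would plug in your actual \(W\) and the residuals from your fitted model):

```python
import numpy as np

def morans_i(e, W):
    """Global Moran's I of values e under weights matrix W."""
    z = e - e.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

# Made-up chain of 20 areas, binary contiguity, row-normalized
n = 20
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
resid = rng.normal(size=n)  # stand-in for regression residuals
obs = morans_i(resid, W)

# Null distribution: shuffle the residuals across locations 99 times
perms = np.array([morans_i(rng.permutation(resid), W) for _ in range(99)])
p = (1 + np.sum(perms >= obs)) / (99 + 1)  # pseudo p-value
print(round(obs, 3), p)
```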

In most applications I am aware of in criminology, just fitting a regression model will typically take care of the majority of spatial auto-correlation. If the residual Moran’s I is lower than 0.05 I am typically pretty happy. With bigger samples I have had Moran’s I values of less than 0.01 be statistically significant, but with that little residual autocorrelation it is surely not a big problem for bias in the coefficients.

This week's homework has you use the R software to fit the endogenous spatial lag model to a dataset of crime rates in rural Appalachian counties. If you are interested, I also have an example of fitting the non-linear spatial filtering terms with a Poisson regression model. Each of these examples also shows calculating Moran’s I to evaluate the regression residuals. I have you estimate the models and subsequently interpret the rate models in terms of actual crimes increased.

Anselin, Luc. 1995. “Local Indicators of Spatial Association- LISA.” *Geographical Analysis* 27 (2): 93–115.

Deane, Glenn, Steven F. Messner, Thomas D. Stucky, Kelly McGeever, and Charis Kubrin. 2008. “Not ’Islands, Entire of Themselves’: Exploring the Spatial Context of City-Level Robbery Rates.” *Journal of Quantitative Criminology* 24 (4): 363–80.

Land, Kenneth C., and Glenn Deane. 1992. “On the Large-Sample Estimation of Regression Models with Spatial- or Network-Effects Terms: A Two-Stage Least Squares Approach.” *Sociological Methodology* 22: 221–48.

Lesage, James P., and R. Kelley Pace. 2009. *Introduction to Spatial Econometrics*. Boca Raton, FL: CRC Press.

Lesage, James, and R. Kelley Pace. 2010. “The Biggest Myth in Spatial Econometrics.” *Available at SSRN 1725503*.

Loftin, Colin, and Sally K. Ward. 1983. “A Spatial Autocorrelation Model of the Effects of Population Density on Fertility.” *American Sociological Review* 48 (1): 121–28.

Patuelli, Roberto, Daniel A. Griffith, Michael Tiefelsdorf, and Peter Nijkamp. 2011. “Spatial Filtering and Eigenvector Stability: Space-Time Models for German Unemployment Data.” *International Regional Science Review* 34 (2): 253–80.

Tolnay, Stewart E., Glenn Deane, and E M. Beck. 1996. “Vicarious Violence: Spatial Effects on Southern Lynchings, 1890-1919.” *American Journal of Sociology* 102 (3): 788–815.

Wheeler, David C., and Lance A. Waller. 2009. “Comparing Spatially Varying Coefficient Models: A Case Study Examining Violent Crime Rates and Their Relationships to Alcohol Outlets and Illegal Drug Arrests.” *Journal of Geographical Systems* 11 (1): 1–22.

The only example of negative spatial autocorrelation in a social science application I am familiar with is Tolnay, Deane, and Beck (1996). It is possible that negative autocorrelation could happen with competing events or displacement, e.g. if crime goes up in one area it goes down in another. This does not bear out though in most applications.↩

When one does this - simply including \(Wy\) on the right hand side and estimating the equation - if there is positive spatial autocorrelation it results in the other coefficients being *too small* and biased towards zero (J. P. Lesage and Pace 2009). In effect, it overestimates the spatial lag effect (that was elaborated on earlier) and underestimates the direct effect.↩