Geospatial Data Formats And Bad Geospatial Education

Let's Talk About Geospatial Data 101

A few weeks back I was invited to talk at one of a series of round tables the Israeli planning administration was setting up about the "Future of Planning Systems of Israel". I was originally scheduled to speak at the first one, which was about technology, but I had a scheduling issue and postponed my lecture to the last one, which took place yesterday and was part summary and part whatever lecturers they still wanted to hear from.

After about two hours of half-listening, at best, to the other lectures (I enjoy solving problems with tools, I don't particularly care about urban planning) and hearing a lot of questions, a few things drove me mad. My lecture was about using triggers in PostGIS (though they would work in other databases) for spatial tests and calculations. I built it on a small part of Cliff Patterson's lecture at last year's PostGIS Day, "Leveraging PostGIS in an Enterprise Environment", and used a similar example with some open data layers and a new parks layer, where I set up the trigger to find the nearest shelter, check whether there's a gas station within 100 meters and, combined with pgRouting, summarize an attribute from buildings within a driving distance.

During the lecture, and the questions that followed, I had to explain (multiple times) that it doesn't matter what software you used to create or edit the data. This is what drove me mad. This is something that anyone who took any introductory GIS course (which was mandatory for any geography student when I was one) should know. Since this is a rant, I'm gonna write down the basis for understanding what geospatial information is, and you can either read it or not, but this needs to get out.

We should all know that there are two types of spatial information, most commonly named:

1. Vector, which is tabular data with a geometry.

2. Raster, which is a spatially referenced image, meaning we know where the corners of that image should be.

Rasters

I'll start with raster data since it can be harder to understand. When we have a raster image, we basically have a picture and its metadata. The picture itself can be taken from a plane or a satellite, or generated from data which isn't technically an image at all.

We can understand a picture from a plane or satellite easily, but what does a generated image mean? Every image is made out of pixels, which are little squares that together form the complete image, and every pixel has values. When we look at a picture, any picture, even one you took with your phone, it has a resolution, which tells us how many pixels that image is made of: the higher the resolution, the more pixels we have, and vice versa.
Every pixel has a position in the image, and in GIS (maybe elsewhere too, I don't know nor care) we count positions from the top left corner. Every pixel also has a value. It can have one value or multiple values, which are divided into bands.
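To make the "grid of pixels with bands" idea concrete, here is a minimal Python sketch. The tiny raster, the `raster[band][row][col]` index order and the helper names are all made up for illustration. Real raster libraries have their own layouts, but the idea is the same.

```python
# A tiny raster, 2 rows by 3 columns, with two bands, as plain nested lists.
# Index order is an assumption of this sketch: raster[band][row][col],
# where row 0, col 0 is the top-left pixel.
raster = [
    [[10, 20, 30],
     [40, 50, 60]],   # band 1
    [[1, 2, 3],
     [4, 5, 6]],      # band 2
]


def resolution(raster):
    """Return (columns, rows), i.e. the width and height in pixels."""
    first_band = raster[0]
    return len(first_band[0]), len(first_band)


def pixel_values(raster, row, col):
    """Return the value of every band at one pixel position."""
    return [band[row][col] for band in raster]


print(resolution(raster))          # (3, 2)
print(pixel_values(raster, 0, 0))  # top-left pixel: [10, 1]
print(pixel_values(raster, 1, 2))  # bottom-right pixel: [60, 6]
```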


So what are these values? This is part of my point: all of these values are numbers. Yep, just numbers, nothing scary or fancy. Depending on the image type, in quite a lot of cases each value is between 0 and 255 (because most images are stored with 8 bits per value), which tells us how strong a signal is. For grayscale images, 0 can mean black and 255 white, with all the numbers in between being shades of gray.
When we look at color images, the best-known scheme uses three bands, Red, Green and Blue (RGB), each holding a number between 0 and 255. Combining the three band values lets us specify most colours (256*256*256 = 16,777,216 possible colours).
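The arithmetic above can be checked in a few lines of Python. This is only a sketch: `pack_rgb` and `unpack_rgb` are hypothetical helpers that squeeze three 8-bit band values into one integer, which is one common way to store a colour, not the only one.

```python
BITS_PER_BAND = 8
LEVELS = 2 ** BITS_PER_BAND   # 256 possible values per band: 0..255
RGB_COLOURS = LEVELS ** 3     # three bands combined


def pack_rgb(r, g, b):
    """Pack three 8-bit band values into a single integer."""
    for value in (r, g, b):
        if not 0 <= value < LEVELS:
            raise ValueError(f"band value {value} is outside 0..{LEVELS - 1}")
    return (r << 16) | (g << 8) | b


def unpack_rgb(packed):
    """Split a packed integer back into its (r, g, b) band values."""
    return (packed >> 16) & 0xFF, (packed >> 8) & 0xFF, packed & 0xFF


print(LEVELS)                            # 256
print(RGB_COLOURS)                       # 16777216
print(unpack_rgb(pack_rgb(12, 34, 56)))  # (12, 34, 56)
```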

Read more: Raster graphics - Wikipedia, https://en.wikipedia.org/wiki/Raster_images

But again, these are just numbers that we use to tell the computer how to draw our image. Most GIS software that can read them can display them in different ways, depending on how you tell the software to display the image.

Our software can draw a raster image once it knows where the corners of the raster are, its resolution and its pixel values. These are the building blocks of displaying raster data. Since geospatial data is inherently relational, we also want a reference system to tell us where our corners are in relation to the earth (or any other planet or imaginary plane we want to draw the raster on). That's what Coordinate Reference Systems, or CRSs, are for.
They tell us where our coordinates are in relation to a specific model of our world.
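As a rough illustration of how a corner plus a pixel size places every pixel in the world, here is a small Python sketch. It assumes a north-up raster with no rotation, and the origin and the 10-unit pixel size are made-up numbers in some projected CRS, not values from any real dataset.

```python
def pixel_to_world(col, row, origin_x, origin_y, pixel_width, pixel_height):
    """Return the CRS coordinates of the top-left corner of pixel (col, row).

    Assumes a north-up raster without rotation: x grows to the right,
    and y shrinks as rows count downward from the top-left corner.
    """
    x = origin_x + col * pixel_width
    y = origin_y - row * pixel_height
    return x, y


# Hypothetical raster: top-left corner at (200000, 600000), 10 x 10 unit pixels,
# 100 columns wide and 50 rows high.
ORIGIN_X, ORIGIN_Y = 200000.0, 600000.0
PIXEL_SIZE = 10.0
COLS, ROWS = 100, 50

top_left = pixel_to_world(0, 0, ORIGIN_X, ORIGIN_Y, PIXEL_SIZE, PIXEL_SIZE)
some_pixel = pixel_to_world(3, 2, ORIGIN_X, ORIGIN_Y, PIXEL_SIZE, PIXEL_SIZE)
bottom_right = pixel_to_world(COLS, ROWS, ORIGIN_X, ORIGIN_Y, PIXEL_SIZE, PIXEL_SIZE)

print(top_left)      # (200000.0, 600000.0)
print(some_pixel)    # (200030.0, 599980.0)
print(bottom_right)  # (201000.0, 599500.0)
```

Whatever the file format, these few numbers plus the CRS are the georeferencing information it has to carry.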

Now, there are a lot of file formats for storing these building blocks of raster data. Some are faster to create and some are built for serving the data fast, but, and this is important, they all serve the same data!!! Understanding this can save you time and money, and make sure you don't spout nonsense at meetings with people who listened during the lecture about rasters.

* An important note: this is an oversimplification of the subject, and not necessarily a good one at that. If you want to learn more about how imagery works, including the physical basis of remote sensing, go take a class, watch videos of people who claim to have a good understanding of the subject, and read actual papers and books, don't just skim through blog posts. You can start by reading about raster data in QGIS's A Gentle Introduction to GIS.


Vectors

The important thing to understand here is that there are a lot of spatial data formats for vector data, and just like with raster formats, they all do the same thing.
The geometry in vector data is simply a set of coordinates. It can be a single X and Y for a point, or a few sets of X and Y for lines and polygons. You can make them three dimensional by, and this will shock you, adding another dimension, like height (most commonly referred to as a Z coordinate). And just like raster pixels, every coordinate is a number, nothing more. And while this isn't strictly true when looking for locations on a globe, coordinates are just like numbers on a Cartesian plane, the type of plane you used in high school to solve linear equations.



All vector data formats are just different ways of "encoding" this set of coordinates. You might have a faster or more efficient way of encoding and parsing those coordinates, but in the end, what we have is a set of 2-4 numbers that tell us where stuff is.
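To show what "different ways of encoding the same coordinates" looks like, here is a minimal Python sketch that writes one 2D point as WKT text, as GeoJSON and as little-endian WKB bytes, then reads the same two numbers back from each. The helper names and the example coordinates are arbitrary, and the parsers handle only a plain 2D point, nothing more.

```python
import json
import struct


def to_wkt(x, y):
    """Encode a 2D point as Well-Known Text."""
    return f"POINT ({x} {y})"


def from_wkt(text):
    """Decode a 2D point written by to_wkt."""
    x, y = text[len("POINT ("):-1].split()
    return float(x), float(y)


def to_geojson(x, y):
    """Encode a 2D point as a GeoJSON geometry string."""
    return json.dumps({"type": "Point", "coordinates": [x, y]})


def from_geojson(text):
    """Decode a GeoJSON Point geometry string."""
    x, y = json.loads(text)["coordinates"]
    return x, y


def to_wkb(x, y):
    """Encode a 2D point as little-endian Well-Known Binary.

    Layout: 1 byte for byte order (1 = little-endian), a 4-byte geometry
    type (1 = Point), then X and Y as 8-byte doubles.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def from_wkb(data):
    """Decode a little-endian WKB 2D point."""
    _byte_order, _geom_type, x, y = struct.unpack("<BIdd", data)
    return x, y


x, y = 34.78, 32.08  # arbitrary example coordinates

print(to_wkt(x, y))
print(to_geojson(x, y))
print(to_wkb(x, y).hex())

# Three encodings, one and the same pair of numbers.
assert from_wkt(to_wkt(x, y)) == (x, y)
assert from_geojson(to_geojson(x, y)) == (x, y)
assert from_wkb(to_wkb(x, y)) == (x, y)
```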

The set of coordinates can be a simple point, a line, a polygon, a multipoint, multiline or multipolygon, a 3D geometry, a 4D geometry or a point cloud, but it all comes down to having specific locations, each of which is a set of between 2 and 4 numbers in 99.9999% of cases.
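A quick Python sketch of that claim, using made-up geometries stored as plain tuples and lists. The nesting convention (a tuple is one coordinate, a list is a container of coordinates or of other lists) is an assumption of this example, not any format's rule.

```python
# Each coordinate is a tuple of numbers; lists only group coordinates together.
point = (34.78, 32.08)
point_3d = (34.78, 32.08, 15.0)                 # X, Y and a Z for height
line = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
polygon = [[(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]]  # one closed ring
multipoint = [(1.0, 2.0), (3.0, 4.0)]


def iter_coordinates(geometry):
    """Yield every coordinate tuple found in a nested geometry."""
    if isinstance(geometry, tuple):
        yield geometry
    else:
        for part in geometry:
            yield from iter_coordinates(part)


geometries = [point, point_3d, line, polygon, multipoint]
all_coordinates = [c for g in geometries for c in iter_coordinates(g)]

# However the geometry is nested, every location is 2 to 4 plain numbers.
print(len(all_coordinates))
print(all(2 <= len(c) <= 4 for c in all_coordinates))
```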

And yes, that applies to things like Uber H3, or What3Words or whatever the next new fad will be, it's always a fancy new way to encode and parse a coordinate.

And yes, that also applies to CAD data, though the data you get may not be referenced to a CRS. And as for your precious point cloud, it's a fast and efficient way of displaying and handling a lot of single points. But each point in it still has a coordinate for X, Y and Z, and all the rest of its data is just an attribute table.

You thought the vector part was going to be long? Why? I wrote that it's the easier part to understand. It's just sets of coordinates, nothing more.

Just one last important thing: the next time someone sends you a DWG and you are having the worst time possible converting it, tell them to send you a proper vector data format. AutoCAD can connect to PostGIS, and there is no reason to work with proprietary formats when converting data to and from PostGIS is as easy as running ogr2ogr.
