Page 1 of 15 (298 posts)

  • talks about »

Last update:
Mon Mar 18 18:55:10 2019

A Django site.

QGIS Planet

Stand-alone PyQGIS scripts with OSGeo4W

PyQGIS scripts are great to automate spatial processing workflows. It’s easy to run these scripts inside QGIS but it can be even more convenient to run PyQGIS scripts without even having to launch QGIS. To create a so-called “stand-alone” PyQGIS script, there are a few things that need to be taken care of. The following steps show how to set up PyCharm for stand-alone PyQGIS development on Windows10 with OSGeo4W.

An essential first step is to ensure that all environment variables are set correctly. The most reliable approach is to go to C:\OSGeo4W64\bin (or wherever OSGeo4W is installed on your machine), make a copy of qgis-dev-g7.bat (or any other QGIS version that you have installed) and rename it to pycharm.bat:

Instead of launching QGIS, we want that pycharm.bat launches PyCharm. Therefore, we edit the final line in the .bat file to start pycharm64.exe:

In PyCharm itself, the main task to finish our setup is configuring the project interpreter:

First, we add a new “system interpreter” for Python 3.7 using the corresponding OSGeo4W Python installation.

To finish the interpreter config, we need to add two additional paths pointing to QGIS\python and QGIS\python\plugins:

That’s it! Now we can start developing our stand-alone PyQGIS script.

The following example shows the necessary steps, particularly:

  1. Initializing QGIS
  2. Initializing Processing
  3. Running a Processing algorithm
import sys

from qgis.core import QgsApplication, QgsProcessingFeedback
from qgis.analysis import QgsNativeAlgorithms

QgsApplication.setPrefixPath(r'C:\OSGeo4W64\apps\qgis-dev', True)
qgs = QgsApplication([], False)
qgs.initQgis()

# Add the path to processing so we can import it next
sys.path.append(r'C:\OSGeo4W64\apps\qgis-dev\python\plugins')
# Imports usually should be at the top of a script but this unconventional 
# order is necessary here because QGIS has to be initialized first
import processing
from processing.core.Processing import Processing

Processing.initialize()
QgsApplication.processingRegistry().addProvider(QgsNativeAlgorithms())
feedback = QgsProcessingFeedback()

rivers = r'D:\Documents\Geodata\NaturalEarthData\Natural_Earth_quick_start\10m_physical\ne_10m_rivers_lake_centerlines.shp'
output = r'D:\Documents\Geodata\temp\danube3.shp'
expression = "name LIKE '%Danube%'"

danube = processing.run(
    'native:extractbyexpression',
    {'INPUT': rivers, 'EXPRESSION': expression, 'OUTPUT': output},
    feedback=feedback
    )['OUTPUT']

print(danube)

Easy Processing scripts comeback in QGIS 3.6

When QGIS 3.0 was release, I published a Processing script template for QGIS3. While the script template is nicely pythonic, it’s also pretty long and daunting for non-programmers. This fact didn’t go unnoticed and Nathan Woodrow in particular started to work on a QGIS enhancement proposal to improve the situation and make writing Processing scripts easier, while – at the same time – keeping in line with common Python styles.

While the previous template had 57 lines of code, the new template only has 26 lines – 50% less code, same functionality! (Actually, this template provides more functionality since it also tracks progress and ensures that the algorithm can be cancelled.)

from qgis.processing import alg
from qgis.core import QgsFeature, QgsFeatureSink

@alg(name="split_lines_new_style", label=alg.tr("Alg name"), group="examplescripts", group_label=alg.tr("Example Scripts"))
@alg.input(type=alg.SOURCE, name="INPUT", label="Input layer")
@alg.input(type=alg.SINK, name="OUTPUT", label="Output layer")
def testalg(instance, parameters, context, feedback, inputs):
    """
    Description goes here. (Don't delete this! Removing this comment will cause errors.)
    """
    source = instance.parameterAsSource(parameters, "INPUT", context)

    (sink, dest_id) = instance.parameterAsSink(
        parameters, "OUTPUT", context,
        source.fields(), source.wkbType(), source.sourceCrs())

    total = 100.0 / source.featureCount() if source.featureCount() else 0
    features = source.getFeatures()
    for current, feature in enumerate(features):
        if feedback.isCanceled():
            break
        out_feature = QgsFeature(feature)
        sink.addFeature(out_feature, QgsFeatureSink.FastInsert)
        feedback.setProgress(int(current * total))

    return {"OUTPUT": dest_id}

The key improvement are the new decorators that turn an ordinary function (such as testalg in the template) into a Processing algorithm. Decorators start with @ and are written above a function definition. The @alg decorator declares that the following function is a Processing algorithm, defines its name and assigns it to an algorithm group. The @alg.input decorator creates an input parameter for the algorithm. Similarly, there is a @alg.output decorator for output parameters.

For a longer example script, check out the original QGIS enhancement proposal thread!

For now, this new way of writing Processing scripts is only supported by QGIS 3.6 but there are plans to back-port this improvement to 3.4 once it is more mature. So give it a try and report back!

Movement data in GIS #20: Trajectools v1 released!

In previous posts, I already wrote about Trajectools and some of the functionality it provides to QGIS Processing including:

There are also tools to compute heading and speed which I only talked about on Twitter.

Trajectools is now available from the QGIS plugin repository.

The plugin includes sample data from MarineCadastre downloads and the Geolife project.

Under the hood, Trajectools depends on GeoPandas!

If you are on Windows, here’s how to install GeoPandas for OSGeo4W:

  1. OSGeo4W installer: install python3-pip
  2. Environment variables: add GDAL_VERSION = 2.3.2 (or whichever version your OSGeo4W installation currently includes)
  3. OSGeo4W shell: call C:\OSGeo4W64\bin\py3_env.bat
  4. OSGeo4W shell: pip3 install geopandas (this will error at fiona)
  5. From https://www.lfd.uci.edu/~gohlke/pythonlibs/#fiona: download Fiona-1.7.13-cp37-cp37m-win_amd64.whl
  6. OSGeo4W shell: pip3 install path-to-download\Fiona-1.7.13-cp37-cp37m-win_amd64.whl
  7. OSGeo4W shell: pip3 install geopandas
  8. (optionally) From https://www.lfd.uci.edu/~gohlke/pythonlibs/#rtree: download Rtree-0.8.3-cp37-cp37m-win_amd64.whl and pip3 install it

If you want to use this functionality outside of QGIS, head over to my movingpandas project!

Dealing with delayed measurements in (Geo)Pandas

Yesterday, I learned about a cool use case in data-driven agriculture that requires dealing with delayed measurements. As Bert mentions, for example, potatoes end up in the machines and are counted a few seconds after they’re actually taken out of the ground:

Therefore, in order to accurately map yield, we need to take this temporal offset into account.

We need to make sure that time and location stay untouched, but need to shift the potato count value. To support this use case, I’ve implemented apply_offset_seconds() for trajectories in movingpandas:

    def apply_offset_seconds(self, column, offset):
        self.df[column] = self.df[column].shift(offset, freq='1s')

The following test illustrates its use: you can see how the value column is shifted by 120 second. Geometry and time remain unchanged but the value column is shifted accordingly. In this test, we look at the row with index 2 which we access using iloc[2]:

    def test_offset_seconds(self):
        df = pd.DataFrame([
            {'geometry': Point(0, 0), 't': datetime(2018, 1, 1, 12, 0, 0), 'value': 1},
            {'geometry': Point(-6, 10), 't': datetime(2018, 1, 1, 12, 1, 0), 'value': 2},
            {'geometry': Point(6, 6), 't': datetime(2018, 1, 1, 12, 2, 0), 'value': 3},
            {'geometry': Point(6, 12), 't': datetime(2018, 1, 1, 12, 3, 0), 'value':4},
            {'geometry': Point(6, 18), 't': datetime(2018, 1, 1, 12, 4, 0), 'value':5}
        ]).set_index('t')
        geo_df = GeoDataFrame(df, crs={'init': '31256'})
        traj = Trajectory(1, geo_df)
        traj.apply_offset_seconds('value', -120)
        self.assertEqual(traj.df.iloc[2].value, 5)
        self.assertEqual(traj.df.iloc[2].geometry, Point(6, 6))

Movement data in GIS #19: splitting trajectories by date

Many current movement data sources provide more or less continuous streams of object locations. For example, the AIS system provides continuous locations of vessels (mostly ships). This continuous stream of locations – let’s call it track – starts when we first record the vessel and ends with the last record. This start and end does not necessarily coincide with the start or end of a vessel voyage from one port to another. The stream start and end do not have any particular meaning. Instead, if we want to see what’s going on, we need to split the track into meaningful segments. One such segmentation – albeit a simple one – is to split tracks by day. This segmentation assumes that day/night changes affect the movement of our observed object. For many types of objects – those who mostly stay still during the night – this will work reasonably well.

For example, the following screenshot shows raw data of one particular vessel in the Boston region. By default, QGIS provides a Points to Path to convert points to lines. This tool takes one “group by” and one “order by” field. Therefore, if we want one trajectory per ship per day, we’d first have to create a new field that combines ship ID and day so that we can use this combination as a “group by” field. Additionally, the resulting lines loose all temporal information.

To simplify this workflow, Trajectools now provides a new algorithm that creates day trajectories and outputs LinestringM features. Using the Day trajectories from point layer tool, we can immediately see that our vessel of interest has been active for three consecutive days: entering our observation area on Nov 5th, moving to Boston where it stayed over night, then moving south to Weymouth on the next day, and leaving on the 7th.

Since the resulting trajectories are LinestringM features with time information stored in the M value, we can also visualize the speed of movement (as discussed in part #2 of this series):

From CSV to GeoDataFrame in two lines

Pandas is great for data munging and with the help of GeoPandas, these capabilities expand into the spatial realm.

With just two lines, it’s quick and easy to transform a plain headerless CSV file into a GeoDataFrame. (If your CSV is nice and already contains a header, you can skip the header=None and names=FILE_HEADER parameters.)

usecols=USE_COLS is also optional and allows us to specify that we only want to use a subset of the columns available in the CSV.

After the obligatory imports and setting of variables, all we need to do is read the CSV into a regular DataFrame and then construct a GeoDataFrame.

import pandas as pd
from geopandas import GeoDataFrame
from shapely.geometry import Point

FILE_NAME = "/temp/your.csv"
FILE_HEADER = ['a', 'b', 'c', 'd', 'e', 'x', 'y']
USE_COLS = ['a', 'x', 'y']

df = pd.read_csv(
    FILE_NAME, delimiter=";", header=None,
    names=FILE_HEADER, usecols=USE_COLS)
gdf = GeoDataFrame(
    df.drop(['x', 'y'], axis=1),
    crs={'init': 'epsg:4326'},
    geometry=[Point(xy) for xy in zip(df.x, df.y)])

It’s also possible to create the point objects using a lambda function as shown by weiji14 on GIS.SE.

PyQGIS101 part 10 published!

PyQGIS 101: Introduction to QGIS Python programming for non-programmers has now reached the part 10 milestone!

Beyond the obligatory Hello world! example, the contents so far include:

If you’ve been thinking about learning Python programming, but never got around to actually start doing it, give PyQGIS101 a try.

I’d like to thank everyone who has already provided feedback to the exercises. Every comment is important to help me understand the pain points of learning Python for QGIS.

I recently read an article – unfortunately I forgot to bookmark it and cannot locate it anymore – that described the problems with learning to program very well: in the beginning, it’s rather slow going, you don’t know the right terminology and therefore don’t know what to google for when you run into issues. But there comes this point, when you finally get it, when the terminology becomes clearer, when you start thinking “that might work” and it actually does! I hope that PyQGIS101 will be a help along the way.

Movement data in GIS #18: creating evaluation data for trajectory predictions

We’ve seen a lot of explorative movement data analysis in the Movement data in GIS series so far. Beyond exploration, predictive analysis is another major topic in movement data analysis. One of the most obvious movement prediction use cases is trajectory prediction, i.e. trying to predict where a moving object will be in the future. The two main categories of trajectory prediction methods I see are those that try to predict the actual path that a moving object will take versus those that only try to predict the next destination.

Today, I want to focus on prediction methods that predict the path that a moving object is going to take. There are many different approaches from simple linear prediction to very sophisticated application-dependent methods. Regardless of the prediction method though, there is the question of how to evaluate the prediction results when these methods are applied to real-life data.

As long as we work with nice, densely, and regularly updated movement data, extracting evaluation samples is rather straightforward. To predict future movement, we need some information about past movement. Based on that past movement, we can then try to predict future positions. For example, given a trajectory that is twenty minutes long, we can extract a sample that provides five minutes of past movement, as well as the actually observed position five minutes into the future:

But what if the trajectory is irregularly updated? Do we interpolate the positions at the desired five minute timestamps? Do we try to shift the sample until – by chance – we find a section along the trajectory where the updates match our desired pattern? What if location timestamps include seconds or milliseconds and we therefore cannot find exact matches? Should we introduce a tolerance parameter that would allow us to match locations with approximately the same timestamp?

Depending on the duration of observation gaps in our trajectory, it might not be a good idea to simply interpolate locations since these interpolated locations could systematically bias our evaluation. Therefore, the safest approach may be to shift the sample pattern along the trajectory until a close match (within the specified tolerance) is found. This approach is now implemented in MovingPandas’ TrajectorySampler.

def test_sample_irregular_updates(self):
    df = pd.DataFrame([
        {'geometry':Point(0,0), 't':datetime(2018,1,1,12,0,1)},
        {'geometry':Point(0,3), 't':datetime(2018,1,1,12,3,2)},
        {'geometry':Point(0,6), 't':datetime(2018,1,1,12,6,1)},
        {'geometry':Point(0,9), 't':datetime(2018,1,1,12,9,2)},
        {'geometry':Point(0,10), 't':datetime(2018,1,1,12,10,2)},
        {'geometry':Point(0,14), 't':datetime(2018,1,1,12,14,3)},
        {'geometry':Point(0,19), 't':datetime(2018,1,1,12,19,4)},
        {'geometry':Point(0,20), 't':datetime(2018,1,1,12,20,0)}
        ]).set_index('t')
    geo_df = GeoDataFrame(df, crs={'init': '4326'})
    traj = Trajectory(1,geo_df)
    sampler = TrajectorySampler(traj, timedelta(seconds=5))
    past_timedelta = timedelta(minutes=5)
    future_timedelta = timedelta(minutes=5)
    sample = sampler.get_sample(past_timedelta, future_timedelta)
    result = sample.future_pos.wkt
    expected_result = "POINT (0 19)"
    self.assertEqual(result, expected_result)
    result = sample.past_traj.to_linestring().wkt
    expected_result = "LINESTRING (0 9, 0 10, 0 14)"
    self.assertEqual(result, expected_result)

The repository also includes a demo that illustrates how to split trajectories using a grid and finally extract samples:

 

Movement data in GIS #17: Spatial analysis of GeoPandas trajectories

In Movement data in GIS #16, I presented a new way to deal with trajectory data using GeoPandas and how to load the trajectory GeoDataframes as a QGIS layer. Following up on this initial experiment, I’ve now implemented a first version of an algorithm that performs a spatial analysis on my GeoPandas trajectories.

The first spatial analysis algorithm I’ve implemented is Clip trajectories by extent. Implementing this algorithm revealed a couple of pitfalls:

  • To achieve correct results, we need to compute spatial intersections between linear trajectory segments and the extent. Therefore, we need to convert our point GeoDataframe to a line GeoDataframe.
  • Based on the spatial intersection, we need to take care of computing the corresponding timestamps of the events when trajectories enter or leave the extent.
  • A trajectory can intersect the extent multiple times. Therefore, we cannot simply use the global minimum and maximum timestamp of intersecting segments.
  • GeoPandas provides spatial intersection functionality but if the trajectory contains consecutive rows without location change, these will result in zero length lines and those cause an empty intersection result.

So far, the clip result only contains the trajectory id plus a suffix indicating the sequence of the intersection segments for a specific trajectory (because one trajectory can intersect the extent multiple times). The following screenshot shows one highlighted trajectory that intersects the extent three times and the resulting clipped trajectories:

This algorithm together with the basic trajectory from points algorithm is now available in a Processing algorithm provider plugin called Processing Trajectory.

Note: This plugin depends on GeoPandas.

Note for Windows users: GeoPandas is not a standard package that is available in OSGeo4W, so you’ll have to install it manually. (For the necessary steps, see this answer on gis.stackexchange.com)

The implemented tests show how to use the Trajectory class independently of QGIS. So far, I’m only testing the spatial properties though:

def test_two_intersections_with_same_polygon(self):
    polygon = Polygon([(5,-5),(7,-5),(7,12),(5,12),(5,-5)])
    data = [{'id':1, 'geometry':Point(0,0), 't':datetime(2018,1,1,12,0,0)},
        {'id':1, 'geometry':Point(6,0), 't':datetime(2018,1,1,12,10,0)},
        {'id':1, 'geometry':Point(10,0), 't':datetime(2018,1,1,12,15,0)},
        {'id':1, 'geometry':Point(10,10), 't':datetime(2018,1,1,12,30,0)},
        {'id':1, 'geometry':Point(0,10), 't':datetime(2018,1,1,13,0,0)}]
    df = pd.DataFrame(data).set_index('t')
    geo_df = GeoDataFrame(df, crs={'init': '31256'})
    traj = Trajectory(1, geo_df)
    intersections = traj.intersection(polygon)
    result = []
    for x in intersections:
        result.append(x.to_linestring())
    expected_result = [LineString([(5,0),(6,0),(7,0)]), LineString([(7,10),(5,10)])]
    self.assertEqual(result, expected_result) 

One issue with implementing the algorithms as QGIS Processing tools in this way is that the tools are independent of one another. That means that each tool has to repeat the expensive step of creating the trajectory objects in memory. I’m not sure this can be solved.

TimeManager 3.0.2 released!

Bugfix release 3.0.2 fixes an issue where “accumulate features” was broken for timestamps with milliseconds.

If you like TimeManager, know your way around setting up Travis for testing QGIS plugins, and want to help improve TimeManager stability, please get in touch!

Movement data in GIS #16: towards pure Python trajectories using GeoPandas

Many of my previous posts in this series [1][2][3] have relied on PostGIS for trajectory data handling. While I love PostGIS, it feels like overkill to require a database to analyze smaller movement datasets. Wouldn’t it be great to have a pure Python solution?

If we look into moving object data literature, beyond the “trajectories are points with timestamps” perspective, which is common in GIS, we also encounter the “trajectories are time series with coordinates” perspective. I don’t know about you, but if I hear “time series” and Python, I think Pandas! In the Python Data Science Handbook, Jake VanderPlas writes:

Pandas was developed in the context of financial modeling, so as you might expect, it contains a fairly extensive set of tools for working with dates, times, and time-indexed data.

Of course, time series are one thing, but spatial data handling is another. Lucky for us, this is where GeoPandas comes in. GeoPandas has been around for a while and version 0.4 has been released in June 2018. So far, I haven’t found examples that use GeoPandas to manage movement data, so I’ve set out to give it a shot. My trajectory class uses a GeoDataFrame df for data storage. For visualization purposes, it can be converted to a LineString:

import pandas as pd 
from geopandas import GeoDataFrame
from shapely.geometry import Point, LineString

class Trajectory():
    def __init__(self, id, df, id_col):
        self.id = id
        self.df = df    
        self.id_col = id_col
        
    def __str__(self):
        return "Trajectory {1} ({2} to {3}) | Size: {0}".format(
            self.df.geometry.count(), self.id, self.get_start_time(), 
            self.get_end_time())
        
    def get_start_time(self):
        return self.df.index.min()
        
    def get_end_time(self):
        return self.df.index.max()
        
    def to_linestring(self):
        return self.make_line(self.df)
        
    def make_line(self, df):
        if df.size > 1:
            return df.groupby(self.id_col)['geometry'].apply(
                lambda x: LineString(x.tolist())).values[0]
        else:
            raise RuntimeError('Dataframe needs at least two points to make line!')

    def get_position_at(self, t):
        try:
            return self.df.loc[t]['geometry'][0]
        except:
            return self.df.iloc[self.df.index.drop_duplicates().get_loc(
                t, method='nearest')]['geometry']

Of course, this class can be used in stand-alone Python scripts, but it can also be used in QGIS. The following script takes data from a QGIS point layer, creates a GeoDataFrame, and finally generates trajectories. These trajectories can then be added to the map as a line layer.

All we need to do to ensure that our data is ordered by time is to set the GeoDataFrame’s index to the time field. From then on, Pandas takes care of the time series aspects and we can access the index as shown in the Trajectory.get_position_at() function above.

# Get data from a point layer
l = iface.activeLayer()
time_field_name = 't'
trajectory_id_field = 'trajectory_id' 
names = [field.name() for field in l.fields()]
data = []
for feature in l.getFeatures():
    my_dict = {}
    for i, a in enumerate(feature.attributes()):
        my_dict[names[i]] = a
    x = feature.geometry().asPoint().x()
    y = feature.geometry().asPoint().y()
    my_dict['geometry']=Point((x,y))
    data.append(my_dict)

# Create a GeoDataFrame
df = pd.DataFrame(data).set_index(time_field_name)
crs = {'init': l.crs().geographicCrsAuthId()} 
geo_df = GeoDataFrame(df, crs=crs)
print(geo_df)

# Test if spatial functions work
print(geo_df.dissolve([True]*len(geo_df)).centroid)

# Create a QGIS layer for trajectory lines
vl = QgsVectorLayer("LineString", "trajectories", "memory")
vl.setCrs(l.crs()) # doesn't stop popup :(
pr = vl.dataProvider()
pr.addAttributes([QgsField("id", QVariant.String)])
vl.updateFields() 

df_by_id = dict(tuple(geo_df.groupby(trajectory_id_field)))
trajectories = {}
for key, value in df_by_id.items():
    traj = Trajectory(key, value, trajectory_id_field)
    trajectories[key] = traj
    line = QgsGeometry.fromWkt(traj.to_linestring().wkt)
    f = QgsFeature()
    f.setGeometry(line)
    f.setAttributes([key])
    pr.addFeature(f) 
print(trajectories[1])

vl.updateExtents() 
QgsProject.instance().addMapLayer(vl)

The following screenshot shows this script applied to a sample of the Geolife datasets containing 100 trajectories with a total of 236,776 points. On my notebook, the runtime is approx. 20 seconds.

So far, GeoPandas has proven to be a convenient way to handle time series with coordinates. Trying to implement some trajectory analysis tools will show if it is indeed a promising data structure for trajectories.

My favorite new recipe in QGIS Map Design 2nd ed

If you follow me on Twitter, you have probably already heard that the ebook of “QGIS Map Design 2nd Edition” has now been published and we are expecting the print version to be up for sale later this month. Gretchen Peterson and I – together with our editor Gary Sherman (yes, that Gary Sherman!) – have been working hard to provide you with tons of new and improved map design workflows and many many completely new maps. By Gretchen’s count, this edition contains 23 new maps, so it’s very hard to pick a favorite!

Like the 1st edition, we provide increasingly advanced recipes in three chapters, each focusing on either layer styling, labeling, or creating print layouts. If I had to pick a favorite, I’d have to go with “Mastering Rotated Maps”, one of the advanced recipes in the print layouts chapter. It looks deceptively simple but it combines a variety of great QGIS features and clever ideas to design a map that provides information on multiple levels of detail. Besides the name inspiring rotated map items, this design combines

  • map overviews
  • map themes
  • graduated lines and polygons
  • a rotated north arrow
  • fancy leader lines

all in one:

“QGIS Map Design 2nd Edition” provides how-to instructions, as well as data and project files for each recipe. So you can jump right into it and work with the provided materials or apply the techniques to your own data.

The ebook is available at LocatePress.

Geocoding with Geopy

Need to geocode some addresses? Here’s a five-lines-of-code solution based on “An A-Z of useful Python tricks” by Peter Gleeson:

from geopy import GoogleV3
place = "Krems an der Donau"
location = GoogleV3().geocode(place)
print(location.address)
print("POINT({},{})".format(location.latitude,location.longitude))

For more info, check out geopy:

geopy is a Python 2 and 3 client for several popular geocoding web services.
geopy includes geocoder classes for the OpenStreetMap Nominatim, ESRI ArcGIS, Google Geocoding API (V3), Baidu Maps, Bing Maps API, Yandex, IGN France, GeoNames, Pelias, geocode.earth, OpenMapQuest, PickPoint, What3Words, OpenCage, SmartyStreets, GeocodeFarm, and Here geocoder services.

Plotting GPS Trajectories with error ellipses using Time Manager

This is a guest post by Time Manager collaborator and Python expert, Ariadni-Karolina Alexiou.

Today we’re going to look at how to visualize the error bounds of a GPS trace in time. The goal is to do an in-depth visual exploration using QGIS and Time Manager in order to learn more about the data we have.

The Data

We have a file that contains GPS locations of an object in time, which has been created by a GPS tracker. The tracker also keeps track of the error covariance matrix for each point in time, that is, what confidence it has in the measurements it gives. Here is what the file looks like:

data.png

Error Covariance Matrix

What are those sd* fields? According to the manual: The estimated standard deviations of the solution assuming a priori error model and error parameters by the positioning options. What it basically means is that the real GPS location will be located no further than three standard deviations across north and east from the measured location, most of (99.7%) the time. A way to represent this visually is to create an ellipse that maps this area of where the real location can be.ellipse_ab

An ellipse can be uniquely defined from the lengths of the segments a and b and its rotation angle. For more details on how to get those ellipse parameters from the covariance matrix, please see the footnote.

Ground truth data

We also happen to have a file with the actual locations (also in longitudes and latitudes) of the object for the same time frame as the GPS (also in seconds), provided through another tracking method which is more accurate in this case.

actual_data

This is because, the object was me running on a rooftop in Zürich wearing several tracking devices (not just GPS), and I knew exactly which floor tiles I was hitting.

The goal is to explore, visually, the relationship between the GPS data and the actual locations in time. I hope to get an idea of the accuracy, and what can influence it.

First look

Loading the GPS data into QGIS and Time Manager, we can indeed see the GPS locations vis-a-vis the actual locations in time.

actual_vs_gps

Let’s see if the actual locations that were measured independently fall inside the ellipse coverage area. To do this, we need to use the covariance data to render ellipses.

Creating the ellipses

I considered using the ellipses marker from QGIS.

ellipse_marker.png

It is possible to switch from Millimeter to Map Unit and edit a data defined override for symbol width, height and rotation. Symbol width would be the a parameter of the ellipse, symbol height the b parameter and rotation simply the angle. The thing is, we haven’t computed any of these values yet, we just have the error covariance values in our dataset.

Because of the re-projections and matrix calculations inherent into extracting the a, b and angle of the error ellipse at each point in time, I decided to do this calculation offline using Python and relevant libraries, and then simply add a WKT text field with a polygon representation of the ellipse to the file I had. That way, the augmented data could be re-used outside QGIS, for example, to visualize using Leaflet or similar. I could have done a hybrid solution, where I calculated a, b and the angle offline, and then used the dynamic rendering capabilities of QGIS, as well.

I also decided to dump the csv into an sqlite database with an index on the time column, to make time range queries (which Time Manager does) run faster.

Putting it all together

The code for transforming the initial GPS data csv file into an sqlite database can be found in my github along with a small sample of the file containing the GPS data.

I created three ellipses per timestamp, to represent the three standard deviations. Opening QGIS (I used version: 2.12, Las Palmas) and going to Layer>Add Layer>Add SpatialLite Layer, we see the following dialog:

add_spatialite2.png

After adding the layer (say, for the second standard deviation ellipse), we can add it to Time Manager like so:

add_to_tm

We do the process three times to add the three types of ellipses, taking care to style each ellipse differently. I used transparent fill for the second and third standard deviation ellipses.

I also added the data of my  actual positions.

Here is an exported video of the trace (at a place in time where I go forward, backwards and forward again and then stay still).

gps

Conclusions

Looking at the relationship between the actual data and the GPS data, we can see the following:

  • Although the actual position differs from the measured one, the actual position always lies within one or two standard deviations of the measured position (so, inside the purple and golden ellipses).
  • The direction of movement has greater uncertainty (the ellipse is elongated across the line I am running on).
  • When I am standing still, the GPS position is still moving, and unfortunately does not converge to my actual stationary position, but drifts. More research is needed regarding what happens with the GPS data when the tracker is actually still.
  • The GPS position doesn’t jump erratically, which can be good, however, it seems to have trouble ‘catching up’ with the actual position. This means if we’re looking to measure velocity in particular, the GPS tracker might underestimate that.

These findings are empirical, since they are extracted from a single visualization, but we have already learned some new things. We have some new ideas for what questions to ask on a large scale in the data, what additional experiments to run in the future and what limitations we may need to be aware of.

Thanks for reading!

Footnote: Error Covariance Matrix calculations

The error covariance matrix is (according to the definitions of the sd* columns in the manual):

sde * sde sign(sdne) * sdne * sdne
sign(sdne) * sdne * sdne sdn * sdn

It is not a diagonal matrix, which means that the errors across the ‘north’ dimension and the ‘east’ dimension, are not exactly independent.

An important detail is that, while the position is given in longitudes and latitudes, the sdn, sde and sdne fields are in meters. To address this in the code, we convert the longitude and latitudes using UTM projection, so that they are also in meters (northings and eastings).

For more details on the mathematics used to plot the ellipses check out this article by Robert Eisele and the implementation of the ellipse calculations on my github.

Movement data in GIS #15: writing a PL/pgSQL stop detection function for PostGIS trajectories

Do you sometimes start writing an SQL query and around at line 50 you get the feeling that it might be getting out of hand? If so, it might be useful to start breaking it down into smaller chunks and wrap those up into custom functions. Never done that? Don’t despair! There’s an excellent PL/pgSQL tutorial on postgresqltutorial.com to get you started.

To get an idea of the basic structure of a PL/pgSQL function and to proof that PostGIS datatypes work just fine in this context, here’s a basic function that takes a trajectory geometry and outputs its duration, i.e. the difference between its last and first timestamp:

CREATE OR REPLACE FUNCTION AG_Duration(traj geometry) 
RETURNS numeric LANGUAGE 'plpgsql'
AS $BODY$ 
BEGIN
RETURN ST_M(ST_EndPoint(traj))-ST_M(ST_StartPoint(traj));
END; $BODY$;

My end goal for this exercise was to implement a function that takes a trajectory and outputs the stops along this trajectory. Commonly, a stop is defined as a long stay within an area with a small radius. This leads us to the following definition:

CREATE OR REPLACE FUNCTION AG_DetectStops(
   traj geometry, 
   max_size numeric, 
   min_duration numeric)
RETURNS TABLE(sequence integer, geom geometry) 
-- implementation follows here!

Note how this function uses RETURNS TABLE to enable it to return all the stops that it finds. To add a line to the output table, we need to assign values to the sequence and geom variables and then use RETURN NEXT.

Another reason to use PL/pgSQL is that it enables us to write loops. And loops I wanted for my stop detection function! Specifically, I wanted to go through all the points in the trajectory:

FOR pt IN SELECT (ST_DumpPoints(traj)).geom LOOP
-- here comes the magic!
END LOOP;

Eventually the function should go through the trajectory and identify all segments that stay within an area with max_size diameter for at least min_duration time. To test for the area size, we can use:

IF ST_MaxDistance(segment,pt) <= max_size THEN is_stop := true; 

Putting everything together, my current implementation looks like this:

CREATE OR REPLACE FUNCTION AG_DetectStops(
   traj geometry,
   max_size numeric,
   min_duration numeric)
RETURNS TABLE(sequence integer, geom geometry) 
LANGUAGE 'plpgsql'
AS $BODY$
DECLARE 
   pt geometry;
   segment geometry;
   is_stop boolean;
   previously_stopped boolean;
   stop_sequence integer;
   p1 geometry;
BEGIN
segment := NULL;
sequence := 0;
is_stop := false;
previously_stopped := false;
p1 := NULL;
FOR pt IN SELECT (ST_DumpPoints(traj)).geom LOOP
   IF segment IS NULL AND p1 IS NULL THEN 
      p1 := pt; 
   ELSIF segment IS NULL THEN 
      segment := ST_MakeLine(p1,pt); 
      p1 := NULL;
      IF ST_Length(segment) <= max_size THEN is_stop := true; END IF; ELSE segment := ST_AddPoint(segment,pt); -- if we're in a stop, we want to grow the segment, otherwise we remove points to the specified min_duration IF NOT is_stop THEN WHILE ST_NPoints(segment) > 2 AND AG_Duration(ST_RemovePoint(segment,0)) >= min_duration LOOP
            segment := ST_RemovePoint(segment,0); 
         END LOOP;
      END IF;
      -- a stop is identified if the segment stays within a circle of diameter = max_size
      IF ST_Length(segment) <= max_size THEN is_stop := true; ELSIF ST_Distance(ST_StartPoint(segment),pt) > max_size THEN is_stop := false;
      ELSIF ST_MaxDistance(segment,pt) <= max_size THEN is_stop := true; ELSE is_stop := false; END IF; -- if we found the end of a stop, we need to check if it lasted long enough IF NOT is_stop AND previously_stopped THEN IF ST_M(ST_PointN(segment,ST_NPoints(segment)-1))-ST_M(ST_StartPoint(segment)) >= min_duration THEN
            geom := ST_RemovePoint(segment,ST_NPoints(segment)-1); 
            RETURN NEXT;
            sequence := sequence + 1;
            segment := NULL;
            p1 := pt;
         END IF;
      END IF;
   END IF;
   previously_stopped := is_stop;
END LOOP;
IF previously_stopped AND AG_Duration(segment) >= min_duration THEN 
   geom := segment; 
   RETURN NEXT; 
END IF;
END; $BODY$;

While this function is not really short, it’s so much more readable than my previous attempts of doing this in pure SQL. Some of the lines for determining is_stop are not strictly necessary but they do speed up processing.

Performance still isn’t quite where I’d like it to be. I suspect that all the adding and removing points from linestring geometries is not ideal. In general, it’s quicker to find shorter stops in smaller areas than longer stop in bigger areas.

Let’s test! 

Looking for a testing framework for PL/pgSQL, I found plpgunit on Github. While I did not end up using it, I did use its examples for inspiration to write a couple of tests, e.g.

CREATE OR REPLACE FUNCTION test.stop_at_beginning() RETURNS void LANGUAGE 'plpgsql'
AS $BODY$
DECLARE t0 integer; n0 integer;
BEGIN
WITH temp AS ( SELECT AG_DetectStops(
   ST_GeometryFromText('LinestringM(0 0 0, 0 0 1, 0.1 0.1 2, 2 2 3)'),
   1,1) stop 
)
SELECT ST_M(ST_StartPoint((stop).geom)), 
       ST_NPoints((stop).geom) FROM temp INTO t0, n0;	
IF t0 = 0 AND n0 = 3
   THEN RAISE INFO 'PASSED - Stop at the beginning of the trajectory';
   ELSE RAISE INFO 'FAILED - Stop at the beginning of the trajectory';
END IF;
END; $BODY$;

Basically, each test is yet another PL/pgSQL function that doesn’t return anything (i.e. returns void) but outputs messages about the status of the test. Here I made heavy use of the PERFORM statement which executes the provided function but discards the results:


Update: The source code for this function is now available on https://github.com/anitagraser/postgis-spatiotemporal

Movement data in GIS #14: updates from GI_Forum 2018

Last week, I traveled to Salzburg to attend the 30th AGIT conference and co-located English-speaking GI_Forum. Like in previous year, there were a lot of mobility and transportation research related presentations. Here are my personal highlights:

This year’s keynotes touched on a wide range of issues, from Sandeep Singhal (Google Cloud Storage) who – when I asked about the big table queries he showed – stated that they are not using a spatial index but are rather brute-forcing their way through massive data sets, to Laxmi Ramasubramanian @nycplanner (Hunter College City University of New York) who cautioned against tech arrogance and tendency to ignore expertise from other fields such as urban planning:

One issue that Laxmi particularly highlighted was the fact that many local communities are fighting excessive traffic caused by apps like Waze that suggest shortcuts through residential neighborhoods. Just because we can do something with (mobility) data, doesn’t necessarily mean that we should!

Not limited to mobility but very focused on open source, Jochen Albrecht (Hunter College City University of New York) invited the audience to join his quest for a spatial decision support system based on FOSS only at bit.ly/FiltersAndWeights and https://github.com/geojochen/fosssdss

The session Spatial Perspectives on Healthy Mobility featured multiple interesting contributions, particularly by Michelle P. Fillekes who presented a framework of mobility indicators to assess daily mobility of study participants. It considers both spatial and temporal aspects of movement, as well as the movement context:

Figure from Michelle Pasquale Fillekes, Eleftheria Giannouli, Wiebren Zijlstra, Robert Weibel. Towards a Framework for Assessing Daily Mobility using GPS Data. DOI: 10.1553/giscience2018_01_s177 (under cc-by-nd)

It was also good to see that topics we’ve been working on in the past (popularity routing in this case) continue to be relevant and have been picked up in the German-speaking part of the conference:

Of course, I also presented some new work of my own, specifically my research into PostGIS trajectory datatypes which I’ve partially covered in a previous post on this blog and which is now published in Graser, A. (2018) Evaluating Spatio-temporal Data Models for Trajectories in PostGIS Databases. GI_Forum ‒ Journal of Geographic Information Science, 1-2018, 16-33. DOI: 10.1553/giscience2018_01_s16.

My introduction to GeoMesa talk failed to turn up any fellow Austrian GeoMesa users. So I’ll keep on looking and spreading the word. The most common question – and certainly no easy one at that – is how to determine the point where it becomes worth it to advance from regular databases to big data systems. It’s not just about the size of the data but also about how it is intended to be used. And of course, if you are one of those db admin whizzes who manages a distributed PostGIS setup in their sleep, you might be able to push the boundaries pretty far. On the other hand, if you already have some experience with the Hadoop ecosystem, getting started with tools like GeoMesa shouldn’t be too huge a step either. But that’s a topic for another day!

Since AGIT&GI_Forum are quite a big event with over 1,000 participants, it was not limited to movement data topics. You can find the first installment of English papers in GI_Forum 2018, Volume 1. As I understand it, there will be a second volume with more papers later this year.


This post is part of a series. Read more about movement data in GIS.

PyQGIS for non-programmers

If you’re are following me on Twitter, you’ve certainly already read that I’m working on PyQGIS 101 a tutorial to help GIS users to get started with Python programming for QGIS.

I’ve often been asked to recommend Python tutorials for beginners and I’ve been surprised how difficult it can be to find an engaging tutorial for Python 3 that does not assume that the reader already knows all kinds of programming concepts.

It’s been a while since I started programming, but I do teach QGIS and Python programming for QGIS to university students and therefore have some ideas of which concepts are challenging. Nonetheless, it’s well possible that I overlook something that is not self explanatory. If you’re using PyQGIS 101 and find that some points could use further explanations, please leave a comment on the corresponding page.

PyQGIS 101 is a work in progress. I’d appreciate any feedback, particularly from beginners!

Scalable spatial vector data processing

Working with movement data analysis, I’ve banged my head against performance issues every once in a while. For example, PostgreSQL – and therefore PostGIS – run queries in a single thread of execution. This is now changing, with more and more functionality being parallelized. PostgreSQL version 9.6 (released on 2016-09-29) included important steps towards parallelization, including parallel execution of sequential scans, joins and aggregates. Still, there is no parallel processing in PostGIS so far (but it is under development as described by Paul Ramsey in his posts “Parallel PostGIS II” and “PostGIS Scaling” from late 2017).

At the FOSS4G2016 in Bonn, I had the pleasure to chat with Shoaib Burq who ran the “An intro to Apache PySpark for Big Data GeoAnalysis” workshop. Back home, I downloaded the workshop material and gave it a try but since I wanted a scalable system for storing, analyzing, and visualizing spatial data, it didn’t really seem to fit the bill.

Around one year ago, my search grew more serious since we needed a solution that would support our research group’s new projects where we expected to work with billions of location records (timestamped points and associated attributes). I was happy to find that the fine folks at LocationTech have some very promising open source projects focusing on big spatial data, most notably GeoMesa and GeoWave. Both tools take care of storing and querying big spatio-temporal datasets and integrate into GeoServer for publication and visualization. (A good – if already slightly outdated – comparison of the two has been published by Azavea.)

My understanding at the time was that GeoMesa had a stronger vector data focus while GeoWave was more focused on raster data. This lead me to try out GeoMesa. I published my first steps in “Getting started with GeoMesa using Geodocker” but things only really started to take off once I joined the developer chats and was pointed towards CCRI’s cloud-local “a collection of bash scripts to set up a single-node cloud on your desktop, laptop, or NUC”. This enabled me to skip most of the setup pains and go straight to testing GeoMesa’s functionality.

The learning curve is rather significant: numerous big data stack components (including HDFS, Accumulo, and GeoMesa), a most likely new language (Scala), as well as the Spark computing system require some getting used to. One thing that softened the blow is the fact that writing queries in SparkSQL + GeoMesa is pretty close to writing PostGIS queries. It’s also rather impressive to browse hundreds of millions of points by connecting QGIS TimeManager to a GeoServer WMS-T with GeoMesa backend.

Spatial big data stack with GeoMesa

One of the first big datasets I’ve tested are taxi floating car data (FCD). At one million records per day, the three years in the following example amount to a total of around one billion timestamped points. A query for travel times between arbitrary start and destination locations took a couple of seconds:

Travel time statistics with GeoMesa (left) compared to Google Maps predictions (right)

Besides travel time predictions, I’m also looking into the potential for predicting future movement. After all, it seems not unreasonable to assume that an object would move in a similar fashion as other similar objects did in the past.

Early results of a proof of concept for GeoMesa based movement prediction

Big spatial data – both vector and raster – are an exciting challenge bringing new tools and approaches to our ever expanding spatial toolset. Development of components in open source big data stacks is rapid – not unlike the development speed of QGIS. This can make it challenging to keep up but it also holds promises for continuous improvements and quick turn-around times.

If you are using GeoMesa to work with spatio-temporal data, I’d love to hear about your experiences.

Movement data in GIS #13: Timestamp labels for trajectories

In Movement data in GIS #2: visualization I mentioned that it should be possible to label trajectory segments without having to break the original trajectory feature. While it’s not a straightforward process, it is indeed possible to create timestamp labels at desired intervals:

The main point here is that we cannot use regular labels because there would be only one label for the whole trajectory feature. Instead, we are using a marker line with a font marker:

By default, font markers only display one character from a given font but by using expressions we can make it display longer text, including datetime strings:

If you want to have a label at every node of the trajectory, the expression looks like this:

format_date( 
   to_datetime('1970-01-01T00:00:00Z')+to_interval(
      m(start_point(geometry_n(
         segments_to_lines( $geometry ),
         @geometry_part_num)
      ))||' seconds'
   ),
   'HH:mm:ss'
)

You probably remember those parts of the expression that extract the m value from previous posts. Note that – compared to 2016 – it is now necessary to add the segments_to_lines() function.

The m value (which stores time as seconds since Unix epoch) is then converted to datetime and finally formatted to only show time. Of course you can edit the datetime format string to also include the date.

If we only want a label every 30 seconds, we can add a case statement around that:

CASE WHEN 
m(start_point(geometry_n(
   segments_to_lines( $geometry ),
   @geometry_part_num)
)) % 30 = 0
THEN
format_date( 
   to_datetime('1970-01-01T00:00:00Z')+to_interval(
      m(start_point(geometry_n(
         segments_to_lines( $geometry ),
         @geometry_part_num)
      ))||' seconds'
   ),
   'HH:mm:ss'
)
END

This works well if the trajectory sampling interval is fairly regular. This is not always the case and that means that the above case statement wouldn’t find many nodes with a timestamp that ends in :30 or :00. In such a case, we could resort to labeling nodes based on their order in the linestring:

CASE WHEN 
 @geometry_part_num  % 30 = 0
THEN
...

Thanks a lot to @JuergenEFischer for providing a solution for converting seconds since Unix epoch to datetime without a custom function!

Note that expressions using @geometry_part_num currently suffer from the following issue: Combination of segments_to_lines($geometry) and @geometry_part_num gives wrong segment numbers


This post is part of a series. Read more about movement data in GIS.

Movement data in GIS #12: why you should be using PostGIS trajectories

In short: both writing trajectory queries as well as executing them is considerably faster using PostGIS trajectories (as LinestringM) rather than the commonly used point-based approach.

Here are a couple of examples to give you an impression of the differences.

Spoiler alert! Trajectory queries are up to 500 times faster than comparable point-based queries.

A quick look at indexing

In both cases, we have indexed the tracker id, geometry, and time columns to speed up query processing.

The trajectory table has 3 indexes

  • gist (time_range)
  • gist (track gist_geometry_ops_nd)
  • btree (tracker)

The point-based table has 4 indexes

  • gist (pt)
  • btree (trajectory_id)
  • btree (tracker)
  • btree (t)

Length

First, let’s see how to determine trajectory length for all observed moving objects (identified by a tracker id).

Using the point-based approach, we first need to ensure that the points are in the correct temporal order, create the lines, and finally sum up their length:

WITH ordered AS (
 SELECT trajectory_id, tracker, t, pt
 FROM geolife.trajectory_pt
 ORDER BY t
), tmp AS (
 SELECT trajectory_id, tracker, st_makeline(pt) traj
 FROM ordered 
 GROUP BY trajectory_id, tracker
)
SELECT tracker, round(sum(ST_Length(traj::geography)))
FROM tmp
GROUP BY tracker 
ORDER BY tracker

With trajectories, we can go right to computing lengths:

SELECT tracker, round(sum(ST_Length(track::geography)))
FROM geolife.trajectory_ext
GROUP BY tracker
ORDER BY tracker

On my test system, the trajectory query run time is 22.7 sec instead of 43.0 sec for the point-based approach:

Duration

Compared to trajectory length, duration is less complicated in the point-based approach:

WITH tmp AS (
 SELECT trajectory_id, tracker, min(t) start_time, max(t) end_time
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
)
SELECT tracker, sum(end_time - start_time)
FROM tmp
GROUP BY tracker
ORDER BY tracker

Still, the trajectory query is less complex and much faster at 31 ms instead of 6.0 sec:

SELECT tracker, sum(upper(time_range) - lower(time_range))
FROM geolife.trajectory_ext
GROUP BY tracker
ORDER BY tracker

Temporal filter

Extracting trajectories that occurred during a certain time frame is another common use case:

WITH tmp AS (
 SELECT trajectory_id, tracker, min(t) start_time, max(t) end_time
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
)
SELECT trajectory_id, tracker, start_time, end_time
FROM tmp
WHERE end_time > '2008-11-26 11:00'
AND start_time < '2008-11-26 15:00'
ORDER BY tracker

This point-based query takes 6.0 sec while the shorter trajectory query finishes in 12 ms:

SELECT id, tracker, time_range
FROM geolife.trajectory_ext
WHERE time_range && '[2008-11-26 11:00+1,2008-11-26 15:00+01]'::tstzrange

or equally fast (12 ms) by making use of the n-dimensional index:

WHERE track &&&	ST_Collect(
 ST_MakePointM(-180, -90, extract(epoch from '2008-11-26 11:00'::timestamptz)),
 ST_MakePointM(180, 90, extract(epoch from '2008-11-26 15:00'::timestamptz))
)

Spatial filter

Finally, of course, let’s have a look at spatial filters, for example, trajectories that start in a certain area:

WITH my AS ( 
 SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(116.31894,39.97472),4326),0.0005) areaA
), tmp AS (
 SELECT trajectory_id, tracker, min(t) t 
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
)
SELECT distinct traj.tracker, traj.trajectory_id 
FROM tmp
JOIN geolife.trajectory_pt traj
ON tmp.trajectory_id = traj.trajectory_id AND traj.t = tmp.t
JOIN my
ON ST_Within(traj.pt, my.areaA)

This point-based query takes 6.0 sec while the shorter trajectory query finishes in 488 ms:

WITH my AS ( 
 SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(116.31894, 39.97472),4326),0.0005) areaA
)
SELECT id, tracker, ST_AsText(track)
FROM geolife.trajectory_ext
JOIN my
ON areaA && track
AND ST_Within(ST_StartPoint(track), areaA)

For more generic “does this trajectory intersect another geometry”, the points can also be aggregated to a linestring on the fly but that takes 21.9 sec:

I’ll be presenting more work on PostGIS trajectories at GI_Forum in Salzburg in July. In the talk, I’ll also have a look at the custom PG-Trajectory datatype. Here’s the full open-access paper:

Graser, A. (2018) Evaluating Spatio-temporal Data Models for Trajectories in PostGIS Databases. GI_Forum ‒ Journal of Geographic Information Science, 1-2018, 16-33. DOI: 10.1553/giscience2018_01_s16.

You can find my fork of the PG-Trajectory project – including all necessary fixes – on Bitbucket.


This post is part of a series. Read more about movement data in GIS.

  • Page 1 of 15 ( 298 posts )
  • >>

Back to Top

Sponsors