What value resides in geo-located data, and how easy is it to unlock? In this post, we explore one of the first uses of geo-located data to bring about a policy change that arguably saved many lives. We also discuss using one of the latest ML techniques to extract value from geo-located data and describe how we helped Highways England make better use of their datasets; sometimes just bringing the data together and putting it on a map can be a game changer.
In our experience, we are seeing more and more geo-located data among the datasets available to clients. This is partly due to the proliferation of smart-phone devices and their use as a means of collecting data, which means GPS location data is a lot easier to capture. Clients are also realising the value that geo-located data can hold and so are also collecting these attributes as part of the continuous drive to improve data quality.
One of the most famous uses of geo-located data is probably also one of the oldest. On 28 August 1854 the baby daughter of Thomas and Sarah Lewis, who lived at 40 Broad Street (now Broadwick Street), was taken ill with what transpired to be cholera. As far as we know, she was the index patient of a cholera outbreak that went on to kill 616 people. While they waited for the doctor, Sarah washed the dirty cloth nappies in water and emptied that water into a cesspit that lay in front of the house. Later investigation would reveal deficiencies in the construction of a well next to this cesspit, which was used to supply drinking water to the now infamous Broad Street pump.
John Snow was a surgeon and pioneering anaesthetist who lived in Soho around the time of the outbreak. He took it upon himself to investigate by going door-to-door and speaking with local residents in an effort to ascertain its cause. As part of this effort, Snow compiled a dot distribution map of the outbreak, an example of which is replicated below:
The map was carefully compiled to reduce complexity and show only what’s needed – the street outlines, the location of water pumps and the cholera cases. Where multiple cases occurred at a given location, Snow drew multiple lines so that clusters of cases are visually identifiable.
At this time, the cause of bacterial infections was unknown and there were competing theories. The most popular was that infections were caused by miasma, or bad smells, of which there were plenty in mid-19th-century London. Snow’s analysis clearly shows a cluster of cases surrounding the Broad Street pump, lending weight to the idea that this was a waterborne disease potentially linked to the city’s drinking water supplies. It was enough to convince The London Board of Health to remove the handle of the pump on Broad Street, thus curtailing the further spread of the outbreak.
It’s important to note that dot maps are not without criticism and the results need to be interpreted carefully, as the following obligatory XKCD illustration shows:
Despite these limitations, rendering geo-located data on a map can be very useful, particularly when you add the ability to overlay additional datasets. There is a plethora of open-source GIS and mapping data that has been made available by the UK government at data.gov.uk. It is useful, for example, to overlay geo-located property data with flood-risk data, as demonstrated below in a view taken from a system that we helped build for Highways England:
This system brought geo-located data from a multitude of systems together for the first time and allowed for the intuitive retrieval and navigation of this data via a map-based interface. Being able to dynamically change the layers and data displayed achieves a similar result to John Snow’s map, where only the pertinent detail is displayed and unnecessary clutter is removed.
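As a rough illustration of the kind of overlay described above, the sketch below checks which properties fall inside a flood-risk polygon using a standard ray-casting test. It is not the code behind the Highways England system: the function names, the property records and the polygon are all hypothetical, and longitude and latitude are treated as flat x/y coordinates, which is only a reasonable approximation over small areas.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(lon: float, lat: float, polygon: List[Point]) -> bool:
    """Ray-casting test: count how many polygon edges a ray cast in the
    +longitude direction from the point crosses. An odd count means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def properties_at_risk(properties: Dict[str, Point],
                       flood_zone: List[Point]) -> List[str]:
    """Return the ids of properties located inside the flood-risk polygon."""
    return [
        prop_id
        for prop_id, (lon, lat) in properties.items()
        if point_in_polygon(lon, lat, flood_zone)
    ]


# Hypothetical data: a square flood zone and three made-up properties.
flood_zone = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
properties = {"A": (1.0, 1.0), "B": (3.0, 1.0), "C": (0.5, 1.5)}
print(properties_at_risk(properties, flood_zone))
```

In practice a GIS library or a spatial database would normally handle this, along with coordinate reference systems and polygons with holes; the sketch only shows the underlying idea.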
Showing points and polygons on a map, while simple, has proved incredibly useful, but can more value be obtained from geo-located data using some of the modern machine-learning techniques that are currently in vogue? One of the issues with geo-located data is that a location is often split over two feature parameters: longitude and latitude. A technique that treats these as independent features might, for example, judge latitude to be important but not longitude, when it is the combination of the two that identifies a location.
When we built Safescape, a tool for making predictions on road accident risks based on publicly available Stats 19 data, we developed a proprietary geo-aware decision tree implementation that combines longitude and latitude into a single feature. This results in a more powerful model that is able to make more accurate predictions and classifications based on geo-located data.
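The geo-aware decision tree itself is proprietary, so the following is only a sketch of one well-known way to fold latitude and longitude into a single feature: a Z-order (Morton) code, which quantises each coordinate to a grid and interleaves the bits. This is an illustrative stand-in, not the Safescape approach, and the function names and the 16-bit resolution are assumptions.

```python
def quantise(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a value in [lo, hi] to an integer grid index in [0, 2**bits - 1]."""
    scale = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return round((clamped - lo) / (hi - lo) * scale)


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order code.
    Bits of x land in even positions, bits of y in odd positions."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def geo_feature(lat: float, lon: float, bits: int = 16) -> int:
    """Combine latitude and longitude into a single integer feature."""
    x = quantise(lon, -180.0, 180.0, bits)
    y = quantise(lat, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)


# Two nearby points in central London (approximate coordinates).
print(geo_feature(51.5133, -0.1367))
print(geo_feature(51.5140, -0.1360))
```

Codes that share a long run of leading bits fall within the same grid cell, so a threshold on this single value constrains both coordinates at once. The encoding is not perfect: points that are close on the ground can straddle a cell boundary and receive quite different codes.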
Within hours of the removal of the handle of the pump on Broad Street, Thomas Lewis also fell ill, and he succumbed to cholera 11 days later. The cesspit at 40 Broad Street was once again contaminated. Had the pump handle not been removed, it is likely that the outbreak would have been extended and the death toll higher. John Snow’s map played an integral role in convincing the authorities to remove the pump handle at a time when there were many competing theories as to the true cause of the disease.
There’s an abundance of value contained within geo-located datasets. Are you realising the true potential of yours?