Introduction
Slightly related to a previous article, which was my first dive into building and publishing custom docker images.
I bought a data-set of UK places years ago and never really did anything with it.
In this post I will detail how I extract the data from the source CSV, transform the data very slightly to construct a target CSV, and finally load the data into a database. After all of that I will give a brief example of how this data could be used to calculate ‘as the crow flies’ distances between two points.
Dockerfile
Only the bottom part is relevant. It doesn’t need a stage prior, but when learning that’s what I did. I now learned that there’s better tools than using PHP and a composer package to slightly transform the source data.
FROM php:7.4-cli as builder
COPY --from=composer:1.10 /usr/bin/composer /usr/bin/composer
COPY . /app
RUN apt-get update && \
apt-get install -y git zip unzip libzip-dev && \
docker-php-ext-configure zip && \
docker-php-ext-install zip && \
rm -rf /var/lib/apt/lists/partial && rm -rf /var/lib/apt/lists/* && \
chmod +x /app/gazetteer
WORKDIR /app
ENV COMPOSER_AUTH='{"github-oauth": {"github.com": "redacted"}}'
RUN composer require league/csv
RUN ./gazetteer
# ---
FROM mysql:8.0
COPY --from=builder /app/output.csv /var/lib/mysql-files/gazetteer.csv
COPY structure.sql /docker-entrypoint-initdb.d/00-structure.sql
COPY import.sql /docker-entrypoint-initdb.d/01-import.sql
# --local-infile=1
# https://github.com/mysql/mysql-docker/blob/mysql-server/8.0/Dockerfile
# unquoted path is important
# [Note] [Entrypoint]: Starting temporary server
#mysqld: Error on realpath() on ''/var/lib/mysql-files'' (Error 2 - No such file or directory)
CMD ["mysqld", "--local-infile=1", "--secure-file-priv=/var/lib/mysql-files"]
The ./gazetteer
script is performing the transform part of the build. This is a simple PHP script
which forms part of the multi-stage build and isn’t really required. A much simpler bash script does the job.
Building
Building the image will require a few steps.
Extract
I won’t be able to distribute the csv, but the script will require the input in the format which is given from the download.
head -n1 Gazetteer_201407.csv
Place Name,Grid Reference,Latitude,Longitude,County,Admin County,District,Unitary Authority,Police Area,Country
Since I’m not a masochist, having spaces in column names is a no for me. We’ll transform these in the next step.
Transform
For this, I wrote a fairly simple PHP script to transform the data slightly.
#!/usr/bin/env php
<?php
require __DIR__ . '/vendor/autoload.php';
use League\Csv\Reader;
use League\Csv\Writer;
// input
$csv = Reader::createFromPath('/app/Gazetteer_201407.csv', 'r');
$csv->setHeaderOffset(0);
#$header = $csv->getHeader(); //returns the CSV header record
$records = $csv->getRecords();
$output = Writer::createFromPath('/app/output.csv', 'w');
$header = [
'place_name',
'grid_reference',
'latitude',
'longitude',
'county',
'administrative_county',
'district',
'unitary_authority',
'police_area',
'country',
];
$output->insertOne($header);
foreach ($records as $k => $record) {
$output->insertOne([
'id' => 100000 + $k, // annoyingly the local infile thing doesn't add auto increment thing itself
'place_name' => $record['Place Name'],
'grid_reference' => $record['Grid Reference'],
'latitude' => $record['Latitude'],
'longitude' => $record['Longitude'],
'county' => $record['County'],
'administrative_county' => $record['Admin County'],
'district' => $record['District'],
'unitary_authority' => $record['Unitary Authority'],
'police_area' => $record['Police Area'],
'country' => $record['Country'],
]);
}
But, I think we can do better.
echo 'id,place_name,grid_reference,latitude,longitude,county,administrative_county,district,unitary_authority,police_area,country' > output.csv
Need to increment the number of lines by 100,000 to give place names a consistent length ID.
csvtool drop 1 Gazetteer_201407.csv | awk '{print NR+100000 "," $s}' > output.csv
Now we don’t really need the ‘multi-stage build’ part of it. Oh well.
Load
There are a few steps to load the data, but all in a couple of .sql
scripts.
Structure
First, will start off with the structure of the database.
Everything is relatively straightforward.
structure.sql
:
|
|
latitude
andlongitude
are stored asdecimal(11,8)
- this is gives us 8 places after decimal point, which is the highest resolution this data-set provides.- Later on we will add an additional column for better distance calculations between two poitns.
Import
import.sql
:
load data infile '/var/lib/mysql-files/gazetteer.csv'
into table uk_places
fields terminated by ','
enclosed by '"'
lines terminated by '\n'
ignore 1 rows;
The path is important, we start the container with the flag --secure-file-priv
with this path /var/lib/mysql-files
, so we can load the data from this file.
Cleansing
Now, the empty columns are interpreted as an empty string. The way for the load data infile
command to interpret these as null is \N
.
update uk_places set administrative_county = null where administrative_county = '';
update uk_places set district = null where district = '';
update uk_places set unitary_authority = null where unitary_authority = '';
Adding POINT
column
For better performance and easier calculations, we will add a new column to the table.
alter table uk_places add latitude_longitude point null;
Setting POINT
data
Now setting the values:
update uk_places
set latitude_longitude = st_srid(point(longitude, latitude), 4326)
where true;
Using set_srid
will give the index we add later better performance. Note order of coordinates.
We will update the column to not null
, can’t add an index without that!
alter table uk_places modify latitude_longitude point not null;
POINT
index
create index uk_places_latitude_longitude_index on uk_places (latitude_longitude);
Query Usage
Using st_distance_sphere
between the two points will give you the distance in meters.
select st_distance_sphere(
(select latitude_longitude from uk_places where id = 117854), # glasgow
(select latitude_longitude from uk_places where id = 127889) # london
) as 'distance_in_m'
from dual;
Result:
554522.8318811639
Building Image
I won’t go in to too much detail here. I went into more detail in my previous article here including publishing to registry, etc.
I’ll just leave a Makefile
which might have some issues.
.PHONY: rmi build
.ONESHELL:
# https://www.gnu.org/software/make/manual/make.html#Errors
# keeps going on error, i.e. if no image was removed, it's not the end of the world for the build stage :)
.IGNORE:
rmi:
docker image rm --force $$(docker image ls -a -q --filter reference=gazetteer) 2>/dev/null
build:
docker build --force-rm --tag=gazetteer:latest --tag=gazetteer:2014-07 .
down:
docker container stop $$(docker container ls -a -q --filter name=db_gazetteer) 2>/dev/null
server: down
docker run --rm -p 3333:3306 --name db_gazetteer -e MYSQL_ROOT_PASSWORD=password gazetteer:latest
image:
docker image ls -a --filter reference=gazetteer
container:
docker container ls -a -q --filter name=db_gazetteer
shell:
docker run -it gazetteer:latest bash
In the future, I might add a basic application around this.