Scripts
A list of scripts which load data into Elasticsearch for use in the Search API.
A list of scripts
Retrieve CMD Datasets
This script retrieves a list of datasets stored in a MongoDB instance and checks that the URL to the dataset resource on the ONS website exists before storing the data in a CSV file.
You can run the script either via the Makefile or directly with go run (see the sketch below).
If you do not set the flags or environment variables for the MongoDB bind address and filename, the script will use the default values localhost:27017 and cmd-datasets.csv respectively.
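As a rough sketch, a direct invocation might look like the following (the script path and flag names here are assumptions based on the other scripts in this repository; check the Makefile for the exact target and flags):

go run retrieve-cmd-datasets/main.go -mongodb-bind-addr=localhost:27017 -filename=cmd-datasets.csv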
Load Datasets
This script reads a CSV file, defined by flag/environment variable or default value, and stores the dataset data in Elasticsearch. The CSV must contain particular headers (though not in any particular order).
You can use the Retrieve CMD Datasets script to generate a new CSV file, or use the pre-generated one stored as cmd-datasets.csv.
- Use Makefile
- Set the dataset_index, filename and/or elasticsearch_url environment variables with:
export dataset_index=<elasticsearch index>
export filename=<file name and location>
export elasticsearch_url=<elasticsearch bind address>
- Optionally set the dimensions_filename environment variable (the filename should end with .json):
export dimensions_filename=<filename and location>
- Optionally set the taxonomy_filename environment variable (the filename should end with .json):
export taxonomy_filename=<filename and location>
- Use the go run command, with or without the flags -dataset-index, -filename, -dimensions-filename, -taxonomy-filename and/or -elasticsearch_url being set:
go run upload-datasets/main.go -dataset-index=<elasticsearch index> -filename=<file name and location> -dimensions-filename=<dimensions file name and location> -taxonomy-filename=<taxonomy file name and location> -elasticsearch_url=<elasticsearch bind address>
Taxonomy and dimensions are stored in JSON files that are read into memory by the dataset search API on start-up; these file names and locations should match the environment configurations for TAXONOMY_FILENAME and DIMENSIONS_FILENAME respectively. For ease of use, just run the make commands without editing flags or setting environment variables.
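For illustration, here is a minimal sketch of reading such a taxonomy JSON file into memory on start-up. The Taxonomy and Topic types below are assumptions for the example, not the dataset search API's actual types:

package main

import (
	"encoding/json"
	"log"
	"os"
)

// Taxonomy is a hypothetical shape for the taxonomy JSON file;
// the real structure is defined in the dataset search API.
type Taxonomy struct {
	Topics []Topic `json:"topics"`
}

// Topic is a hypothetical node in the taxonomy hierarchy.
type Topic struct {
	Title       string  `json:"title"`
	ChildTopics []Topic `json:"child_topics"`
}

func main() {
	// TAXONOMY_FILENAME should match the search API's configuration.
	filename := os.Getenv("TAXONOMY_FILENAME")
	if filename == "" {
		filename = "../taxonomy/taxonomy.json"
	}

	b, err := os.ReadFile(filename)
	if err != nil {
		log.Fatalf("failed to read taxonomy file: %v", err)
	}

	var taxonomy Taxonomy
	if err := json.Unmarshal(b, &taxonomy); err != nil {
		log.Fatalf("failed to unmarshal taxonomy: %v", err)
	}

	log.Printf("loaded %d top-level topics", len(taxonomy.Topics))
}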
Retrieve Dataset Taxonomy
This script scrapes the ONS website to pull out the taxonomy hierarchy by iterating through its pages.
You can run the script either via the Makefile or directly with go run (see the sketch below).
If you do not set the flag or environment variable for filename, the script will use the default value ../taxonomy/taxonomy.json.
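As a sketch (the script path here is an assumption; check the Makefile for the exact target):

go run retrieve-taxonomy/main.go -filename=../taxonomy/taxonomy.json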
Load Postcode
This script loads postcode data for all postcodes across the UK as of February 2020, from a CSV file downloaded from the geo portal here.
Once the file is downloaded (from the above link), unzip it. The layout of the postcode data should look like this:
- NSPL_FEB_2020_UK
- Data
- NSPL_FEB_2020_UK.csv
Upload the postcode data to the Elasticsearch index with:
make postcode
This will take approximately 4 minutes and 20 seconds, and the documents will be stored in the test_postcode index.
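Once the load has finished, you can sanity-check the number of documents with a standard Elasticsearch count query (assuming Elasticsearch is bound to localhost:9200):

curl -X GET "localhost:9200/test_postcode/_count?pretty"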
Load data from GEOJSON files
This script loads geographical boundaries for 2011 census data. This includes lower and middle layer super output areas (LSOA/MSOA), as well as other output areas (OA), and towns and cities (TCITY), across England and Wales only.
Files can be downloaded from the geoportal -> boundaries -> census boundaries -> select geography layer. This will tend to open a search of all relevant boundaries; select the data you would like to view/import. The new screen will have a drop-down list titled APIs to the right of the webpage; click the drop-down and copy the GEOJSON URL. Paste the URL into the browser and it will automatically download the data; be patient, as this may take some time. Below is a list of the URLs used for the geojson scripts (these might break if the geoportal moves the geojson file location):
Once the above files have downloaded, move them to the root of this repository and store them under a geojson folder.
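For example (assuming the files landed in your downloads folder; the .json extension is an assumption, as the geoportal may serve .geojson instead):

mkdir -p geojson
mv ~/Downloads/*.json geojson/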
Upload all the data to the Elasticsearch index with:
make geojson
This will take a long time, as it populates 150,000+ records with full polygon boundaries into the Elasticsearch area_profiles index and creates a hierarchy.json file containing a list of hierarchies by which an API user can filter the area profile data type.
There are actually five separate scripts which handle generating data for the COUNTRIES, LSOA, MSOA, OA and TCITY files. These can be run separately using make countries, make lsoa, make msoa, make oa and make tcity respectively. Be aware that if you are running this for the first time you will need to create the area_profiles index; you can do this by running make refreshgeojson. You can rebuild the list of hierarchies using make hierarchies.
The refresh script deletes the index and recreates it with no data.
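In Elasticsearch terms, the refresh is roughly equivalent to the following (the mappings.json file here is a hypothetical placeholder; the actual index settings live in this repository's scripts):

curl -X DELETE "localhost:9200/area_profiles"
curl -X PUT "localhost:9200/area_profiles" -H 'Content-Type: application/json' -d @mappings.json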
Build Hierarchies JSON
As described at the bottom of the Load data from GEOJSON files section, you can rebuild the hierarchy JSON file by running make hierarchies. The file is a list of hierarchies based on the geojson scripts that exist; if the scripts are extended to incorporate new levels of geographical hierarchies, the hardcoded list in the hierarchies script will also need updating.
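For illustration, the hardcoded list that would need extending might look something like this (a hypothetical sketch; the names are assumptions based on the geography levels above, not the script's actual code):

// hierarchies is a hypothetical list, one entry per geography level
// covered by the geojson scripts.
var hierarchies = []string{"country", "lsoa", "msoa", "oa", "tcity"}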