GeoIndex of the JISC UK Web Domain Dataset (1996-2010).
The ~2.5 billion 200 OK responses in the JISC UK Web Domain Dataset (1996-2010) have been scanned for geographic references - specifically postcodes. This set of postcode citations found at particular URLs, crawled at particular times, forms an historical GeoIndex of the UK web.
The GeoIndex is composed of some 700,641,549 lines of TSV data, each asserting that a given web page, crawled at a given date, contained one or more references to a given postcode. Uncompressed, this comes to 61 GB of text, so care should be taken before downloading or attempting to use this dataset.
The GeoIndex is available for download in a compressed format (total download size is approximately 8 GB). For more details about how the data was created, its format, and how to use it, please refer to the full description of this dataset.
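Given the size of the data, it is usually easier to stream a compressed file line by line than to unpack all of it first. The following is a minimal Python sketch of that approach, not a description of how the dataset was built or must be read. It makes two assumptions that are not confirmed here: that the download is gzip-compressed, and that the postcode sits in the third tab-separated column (index 2). The function names are illustrative only. Check the full description of the dataset for the real compression format and column order before relying on it.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, List, Optional


def open_part(path: str):
    """Open one compressed file as text without unpacking it to disk.

    ASSUMPTION: the file is gzip-compressed; swap in bz2.open or lzma.open
    if the download uses a different format.
    """
    return gzip.open(path, "rt", encoding="utf-8", errors="replace")


def iter_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each non-blank TSV line as a list of fields."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield line.split("\t")


def count_by_column(
    lines: Iterable[str], column: int, limit: Optional[int] = None
) -> Counter:
    """Count how often each value occurs in one column.

    `column` is a zero-based index; which index holds the postcode is an
    ASSUMPTION to verify against the dataset documentation.
    `limit` stops after that many rows, which is useful for a first look.
    """
    counts: Counter = Counter()
    for n, fields in enumerate(iter_rows(lines)):
        if limit is not None and n >= limit:
            break
        if column < len(fields):  # ignore rows that are too short
            counts[fields[column]] += 1
    return counts


# Example use (the file name is hypothetical):
#
#     with open_part("geoindex-part.tsv.gz") as handle:
#         top = count_by_column(handle, column=2, limit=1_000_000)
#         for value, n in top.most_common(10):
#             print(value, n, sep="\t")
```

Because rows are consumed one at a time, memory use depends on the number of distinct values counted rather than on the 61 GB total. Setting `limit` lets you sample the first rows of a file to confirm the column layout before committing to a full pass.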
If you make use of this dataset, or have any questions, ideas or requests for alternative datasets, please get in touch.