Dataset and initial visualisation results from a format analysis of the JISC UK Web Domain Dataset (1996-2010).
The dataset is a format profile, summarising the data formats (MIME types) contained within all of the HTTP 200 OK responses in the JISC UK Web Domain Dataset (1996-2010). It was generated by running modified versions of DROID and Apache Tika on every resource, and so allows the two tools to be compared as well as providing format information over time.
The dataset is hosted in our GitHub repository and available for download, along with a full description of the dataset.
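As a rough illustration of the tool comparison a format profile makes possible, the sketch below computes, for each year, the fraction of resources where both tools report the same MIME type. The row layout `(year, tika_type, droid_type, count)`, the function name `agreement_by_year`, and the sample numbers are assumptions made for this example; check the dataset description for the actual column layout.

```python
from collections import Counter

# Hypothetical rows: (year, tika_mime_type, droid_mime_type, resource_count).
# The layout and the numbers are illustrative only, not taken from the dataset.
SAMPLE_ROWS = [
    (1996, "image/gif", "image/gif", 120),
    (1996, "text/html", "text/html", 900),
    (1996, "text/html", "application/octet-stream", 30),
    (2010, "image/png", "image/png", 400),
    (2010, "application/xhtml+xml", "text/html", 50),
]


def agreement_by_year(rows):
    """Return {year: fraction of resources where both tools agree on the MIME type}."""
    agreed = Counter()
    total = Counter()
    for year, tika_type, droid_type, count in rows:
        total[year] += count
        if tika_type == droid_type:
            agreed[year] += count
    return {year: agreed[year] / total[year] for year in sorted(total)}


if __name__ == "__main__":
    for year, fraction in agreement_by_year(SAMPLE_ROWS).items():
        print(f"{year}: {fraction:.1%} agreement")
```

Weighting by resource count rather than by distinct type pairs keeps a rare disagreement from counting as much as a common one.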
Below are example visualisations generated using the formats dataset, showing usage trends for popular image formats and HTML versions over time.
Popular Image Formats
Here we show overall usage trends for a range of popular image formats, illustrating how this dataset can be used to explore changes in format usage over time.
This shows that JPEG usage has remained stable over the years, but that TIFF, GIF and XBM images have become rarer. The decline in usage of the XBM format is particularly striking. The graph also shows significant gains over the same period for the PNG format, and for the ICO format commonly used to create favicons.
HTML Version Usage Over Time
The following graph shows the number of resources that used a given version of the HTML format, as a fraction of all HTML resources, by year.
Each new version comes to dominate the picture, then slowly fades away. Over time, more versions are present in each crawl, with HTML 2.0-4.01 and XHTML 1.0-1.1 all present in the 2010 crawl data.
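The per-year fractions described above can be derived from raw counts roughly as follows. This is a minimal sketch: the row layout `(year, html_version, count)`, the function name `version_share_by_year`, and the sample numbers are assumptions made for illustration, not values from the dataset.

```python
from collections import defaultdict

# Hypothetical rows: (year, html_version, resource_count).
# The layout and the numbers are illustrative only, not taken from the dataset.
SAMPLE_ROWS = [
    (1996, "HTML 2.0", 80),
    (1996, "HTML 3.2", 20),
    (2010, "HTML 4.01", 50),
    (2010, "XHTML 1.0", 150),
]


def version_share_by_year(rows):
    """Return {year: {version: fraction of that year's HTML resources}}."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for year, version, count in rows:
        totals[year] += count
        counts[year][version] += count
    return {
        year: {version: n / totals[year] for version, n in versions.items()}
        for year, versions in counts.items()
    }


if __name__ == "__main__":
    for year, shares in sorted(version_share_by_year(SAMPLE_ROWS).items()):
        for version, share in sorted(shares.items()):
            print(f"{year}  {version:<10} {share:.1%}")
```

Normalising within each year makes crawls of different sizes comparable, which is why the shares for any single year sum to 1.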
If you make use of this dataset, or have any questions, ideas or requests for alternative datasets, please get in touch.