back

Notes

The process of indicator production

All the data on this website comes from official government sources.  For an overview of these sources, see the page on data sources.

The sources are of three main types, namely:

  • Large government survey datasets.
  • Published tables, available from the web.
  • Requests placed by us to the relevant government department.

The processes of indicator production is different for these different sources and are therefore discussed separately below.

Large government survey datasets

A list of the major government surveys used by this website is provided in the survey release calendar page.

A survey dataset typically comprises a record for each individual surveyed and is typically around 30,000 records each year.  It is usually in SPSS, STATA or TAB format (these being three of the most commonly used formats for statistical analysis) and is typically around 50Mb in size for each year covered.  Some are published on the web whilst others are only available by request to the relevant government department.

There are two main ways of analysing data from these datasets:

  • Directly, using a package such as SPSS.
  • Using Excel, extracting the relevant fields from the dataset as required.

In either case, the results are placed in an Excel spreadsheet together with any technical notes on definitions, methods used, etc.  These notes are always sufficient to allow replication of the results from the source data at a later date.

The spreadsheet is then used to manipulate the data as required and to produce the relevant graphs.

So, the underlying 'database' contains three elements:

  • The datasets themselves in the format they were downloaded in, typically a series of files for each year.
  • The datasets in Excel.
  • An Excel spreadsheet for each indicator which contains the results required to produce the graphs.

Collectively, this database is enormous, currently 70Gb in size.

To help maintain a proper audit trail, each spreadsheet is divided into three linked sections, the first containing the extracted results, the second containing any intermediate calculations, and the third containing the data used for the graphs.

These graphs then need to be converted to a format suitable for the web.  Two formats are produced, namely:

  • pngs, which can easily be embedded into a web page but are not resizeable;
  • pdfs, which are resizeable but for which there are no open standards for their inclusion on a web page.

Producing these various formats is a two-stage process:

  • Conversion from Excel to pdf, using Acrobat Writer software.
  • Conversion from pdf to gif, using Photoshop software.
  • Conversion from gif to png, using Irfanview software.

top

Published tables, available from the web

A list of the website data sources used by this website is provided in the website data release calendar page.

A wide variety of government departments produce a voluminous array of periodic publications.  Nowadays these are nearly always available on the web, mostly as pdf files.  Increasingly, they are accompanied by Excel files which contain copies of the tables in the pdf.

If the tables are available in Excel, they are simply placed in a spreadsheet and the graphs produced as for the datasets above.

If the tables are only available within pdfs then these are converted to Excel using OCR software.  This data is then combined with the equivalent data for earlier years and the graphs are then produced as for the datasets above.

So, the underlying 'database' contains two elements:

  • The published reports or Excel files, of which there are hundreds.
  • An Excel spreadsheet for each indicator which contains the results required to produce the graphs.

Note that:

  • It is important that there is a thorough audit trail from data source to graph.  As well as appropriate documentation, this is achieved by ensuring that the Excel spreadsheet always starts from a precise replica of the published data.
  • When the government publishes new data, it often revises the data for earlier years.  To make sure that this revised data is always captured, the updating process for this website includes the overwriting of all the data for all the previous years.

top

Requests placed by us to the relevant government department

Requests are necessary when the government does not routinely publish the data (e.g. in the area of health data) or when we require different definitions than those used in the published data.

Requests are made by email, including a copy of previous years' data plus any technical notes so that the government department knows precisely what we want.  The government responds by sending an Excel spreadsheet with the latest data.  This is then combined with the equivalent spreadsheets for previous years' data and the graphs produced as for the datasets above.

So, the underlying 'database' contains two elements:

  • The email correspondence and associated attachments.
  • An Excel spreadsheet for each indicator which contains the results required to produce the graphs.

As with the other types of data source, a thorough audit trail is fundamental but, in this case, is achieved through notes and links with the email correspondence.

top

Managing the process

In total, there are around 900 graphs on this website, nearly all of which have to be updated at least annually.  This process is managed using a spreadsheet which contains a record for each graph.  Inter alia, the data for each graph includes:

  • A number which uniquely identifies the graph.
  • A description of the graph, including where it sits on the website.
  • Where the underlying data came from.
  • When it was last updated and when it is next scheduled to be updated.

Each week, a report is produced listing those graphs which require updating.  The relevant data is obtained and the graphs updated, together with the website text with which they are associated.

At least monthly, the website is updated to the latest version.

top

© Guy Palmer | info@poverty.org.uk