CREST Data Suite
Enabling a UK charity to understand their user base and grow their flagship STEM awards programme with a flexible and cost effective reporting tool.
Data modelling • Database design • Python development
With the introduction of new digital platforms for the CREST awards, the BSA also needed a new analysis and reporting capability. The reporting suite needed to consolidate award data from four different systems, including historical sales from legacy systems.
A flat-file database, rebuilt weekly using Python scripts, will be cheap and easy to maintain - and plenty big enough for their needs over the next few years. Tableau provides a flexible front-end for ad-hoc analysis as well as routine reporting.
Emily O’Byrne, Inga Umblija
While the client had some ideas about the kind of reports they wanted, we took a top-down and bottom-up approach. Inga ran workshops with the team, encouraging them to think about funder needs (eg reaching underserved communities), operational needs (eg how many awards we sold last month), management information (eg long-term trends), and marketing (eg targeting dormant users). Meanwhile, I researched open data sets that might help us to enrich the data we had in the various source systems. I also created data models for each source, to understand what was available.
My primary concern for the technical architecture was ease of maintenance. Relational databases need regular care to avoid corruption - and the expertise to do that wasn’t available within the BSA. It would also limit flexibility for the client: changing the data structure would require data conversion work, increasing the cost of any enhancements.
These considerations led us to choose a flat file architecture, rebuilt from scratch each week. For the scripts, we chose Python: it is widely used and has excellent support for data manipulation via the pandas framework. For analysis and reporting, the client chose Tableau, which works well with flat files as well as heavier-weight data sources.
As Inga developed a list of desired attributes, I mapped across the source data. We ended up with a spreadsheet listing each reporting database attribute, and the corresponding field in each of the four data sources. Where data needed to come from other sources, we added that too, and also agreed on any transformation rules.
To make reporting as easy as possible for the client, I chose to build a fully-denormalised single table database. This was effectively a traditional star-schema, but the lookup tables existed only for the benefit of the transformation scripts; we didn't use them in the Tableau configuration.
I took an iterative approach to development, passing new versions to Inga for testing every couple of days, and incorporating feedback along with new features. We used Github for version control and distribution - this also supported handover to the client at the end of the project.
Development threw up some interesting challenges. We couldn’t find a single public source for deprivation measures on UK schools: England, Scotland, Wales and Northern Ireland all have different ways to assess if a school might need additional support, from counting free school meal places to the number of pupils in community care. I needed to check different tables depending on the country (based on the school postcode). To enable comparison across the UK, I calculated a standardised deprivation index to sit alongside the specific measures for each country.
To look up additional data for a school (such as deprivation measures), we first needed to identify the school. Postcode was our best option, but many UK schools share a postcode, so I developed a semi-automated matching routine: if I could create a unique match using the postcode I would; otherwise, I wrote the school out to a holding file for manual matching.
Testing and implementation
The high number of helper functions for data matching and enrichment made testing very slow, so I decided to implement automated testing for these modules. This approach, using the unittest package, allowed us to make changes quickly, and resulted in far fewer bugs in the code.
I also used the Anaconda Project framework to create a repeatable installation process. Anaconda is a standard way for the scientific community to share data analysis packages. As I was working with a highly educated user group who shared a keen interest in science, I was confident the client team would pick this up - and they did.
I’ve continued to work with the BSA on maintaining and expanding the data suite - for instance, automating the production of certificate order files, and creating regional data extracts.