Scaling the Home Office Knowledge Graph by Refactoring the Application
85% reduction in processing time and cost
No aborted runs in the pipeline
A detailed roadmap for future development
Want to find out more? Get in touch with our team today to learn more about how we could help your business.
Challenge
Butterfly Data, working as a sub-contractor to the prime contractor 6point6 (now part of Accenture), supported the Home Office’s Data Assurance Service (DAS), which is responsible for the development and maintenance of infrastructure and tools that enable large-scale profiling for data quality issues.
A crucial part of these operations is a knowledge graph that represents a comprehensive summary of data quality and data lineage statistics for every field in the database. It feeds into a self-service user interface that can be used to answer complex questions about the quality of specific database entries and their relationships with each other.
The knowledge graph is updated monthly. Over time, the underlying raw data grew considerably in size and new statistics were added to the graph’s entities.
As a result, the processing time ballooned, making the pipeline increasingly unstable and costly to run, and delaying the ingestion of new insights into the user interface.
To handle the continued growth of the database and future extensions with new statistics, the entire code base had to be refactored with a focus on performance.
Solution
Discovery work identified the main bottleneck of the knowledge graph generator application: it was designed to use an S3 bucket for the storage of the entity JSONs during processing. This meant that each JSON file had to be read and written millions of times, causing an overhead that became increasingly costly as the size of the entities grew with each update to the database.
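To make that overhead concrete, the minimal sketch below shows the kind of per-entity read-modify-write round trip that, repeated millions of times, dominates runtime when every update goes through object storage. The bucket, key and function names are hypothetical, and boto3 is assumed as the S3 client; the real application's interfaces will differ.

import json
import boto3

s3 = boto3.client("s3")

def update_entity_stat(bucket: str, key: str, stat_name: str, value) -> None:
    """Read-modify-write of a single entity JSON via S3.

    Performed for every statistic on every entity, this network round trip
    is the kind of per-file overhead identified as the bottleneck.
    """
    obj = s3.get_object(Bucket=bucket, Key=key)        # network read
    entity = json.loads(obj["Body"].read())
    entity.setdefault("statistics", {})[stat_name] = value
    s3.put_object(                                      # network write
        Bucket=bucket,
        Key=key,
        Body=json.dumps(entity).encode("utf-8"),
    )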
The Python-based application was re-engineered to hold all JSONs in memory during processing. An efficient in-memory data structure was devised to handle the entities with low latency. Furthermore, a caching mechanism for metadata files ensures that each file is only loaded and processed once, dynamically handling data cleaning rules. Dozens of class methods were updated or added to make the code base compatible with the new approach.
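A minimal sketch of what such an in-memory store and metadata cache could look like is shown below; the class, method names and file layout are illustrative assumptions rather than the application's actual API.

import json
from functools import lru_cache
from pathlib import Path

class EntityStore:
    """Sketch of an in-memory entity store (hypothetical design).

    Entities are loaded once, mutated in memory during processing, and
    written out in a single pass at the end, replacing millions of
    per-entity reads and writes against object storage.
    """

    def __init__(self) -> None:
        self._entities: dict[str, dict] = {}

    def load(self, source_dir: Path) -> None:
        # One-off bulk load of all entity JSONs into memory.
        for path in source_dir.glob("*.json"):
            self._entities[path.stem] = json.loads(path.read_text())

    def update_stat(self, entity_id: str, stat_name: str, value) -> None:
        # In-memory update: no I/O per statistic.
        self._entities[entity_id].setdefault("statistics", {})[stat_name] = value

    def flush(self, target_dir: Path) -> None:
        # Single write pass once processing is complete.
        target_dir.mkdir(parents=True, exist_ok=True)
        for entity_id, entity in self._entities.items():
            (target_dir / f"{entity_id}.json").write_text(json.dumps(entity))

@lru_cache(maxsize=None)
def load_metadata(path: str) -> dict:
    """Each metadata file (e.g. data cleaning rules) is parsed only once."""
    return json.loads(Path(path).read_text())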
Logging features were added to capture the runtimes of the sub-processes, and detailed logs now cover every part of the pipeline, helping to identify further performance improvements and providing insights for troubleshooting.
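As an illustration, runtime logging of this kind can be implemented with a small timing context manager along the following lines; the logger name, step names and log format are assumptions, not the project's actual code.

import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("kg_generator")

@contextmanager
def timed_step(name: str):
    """Log the wall-clock runtime of a pipeline sub-process."""
    start = time.perf_counter()
    logger.info("Starting step: %s", name)
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("Finished step: %s in %.1fs", name, elapsed)

# Usage (hypothetical step):
# with timed_step("build entity relationships"):
#     build_relationships(store)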
All changes were tested extensively to ensure that the results matched those of the previous version, and unit and integration tests were updated or extended. A suite of diagnostic functions that partially automates the quality assurance was created.
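A simple diagnostic of this kind might compare the refactored output against a baseline run, roughly as sketched below; the directory layout and function name are hypothetical.

import json
from pathlib import Path

def compare_outputs(baseline_dir: Path, candidate_dir: Path) -> list[str]:
    """Report entities whose refactored output differs from the baseline.

    Part of a hypothetical diagnostic suite for a refactor that must not
    change results: every baseline entity JSON should have an identical
    counterpart produced by the new code.
    """
    differences = []
    for baseline_file in baseline_dir.glob("*.json"):
        candidate_file = candidate_dir / baseline_file.name
        if not candidate_file.exists():
            differences.append(f"missing: {baseline_file.name}")
            continue
        baseline = json.loads(baseline_file.read_text())
        candidate = json.loads(candidate_file.read_text())
        if baseline != candidate:
            differences.append(f"mismatch: {baseline_file.name}")
    return differences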
Impact
Refactoring the knowledge graph generator application at a macro level resulted in a reduction of processing time and cost by 85%.
The pipeline, which runs on Kubernetes clusters on AWS, is significantly more robust and has not suffered any aborted runs, a problem that occurred frequently before the refactoring.
During the refactoring work, a range of secondary areas for improvement was identified, along with ideas for new features and organisational enhancements. These were turned into a detailed development roadmap that helps the DAS Team stay ahead of the ever-increasing requirements for the knowledge graph.
Ready to transform your data?
Book your free discovery call and find out how our bespoke data services and solutions could help you uncover untapped potential and maximise ROI.
