Here is a list of some data projects I’ve lead throughout my career (for other aspects see my LinkedIn, personal data projects, downloadable CV) It briefly shows the benefits of the project, the tools used and the technical approach. It will give you an idea of the technical challenges, but not too much detail to ensure commercial in confidence. You may want to read this in conjunction with my summary on data science.
Senior Data Engineer, JetStar (ThoughtWorks), 2018
Extended the data pipeline (ETL) for ‘Next Best Destination’ machine learning prediction engine
- Increased the number of customers in the model from 2 million to 6.5 million through experimentation in AWS with: S3, Redshift, Pandas and Jenkins.
- Solved the difficult analytical problem of linking high volume Adobe Analytics web traffic data to core customer through various customer IDs from disparate systems.
- Optimised algorithm to configure the level of parallel and series data processing in Redshift clusters to increase back-fill speed.
- Migrated data platform from Jenkins scheduling to Airflow for both daily processing and back-filling.
Built Redshift database to enable data exploration for web analytics
- Built database and ETL from scratch with: ~ 1000 columns, 4 billion records to enable various functional teams to explore data for modelling that drive personalisation initiatives
Drove lean data engineering strategy
- Through deep diving into the complexities of the data storage and processing of the data platform whilst understanding business needs, outlined and championed detailed strategy to reduce costs by keeping complexity to a minimum.
- Teaching various commercial and technical teams the use of various cloud patterns, the requirements of a data platform and how we can achieve commercial goals.
Senior Data Engineer, AGL Energy, 2017 – 2018
Enabled unstructured customer IoT data to be usable for analytics and reporting in SQL
- Built ETL (Extract, Transform, Load) for 1000s of homes’ unstructured solar and load monitoring data (IoT) from Azure Blob and Table Storage, from an unknown and undocumented data model.
Improved analysis capability for advanced analytics team
- Built extensions on pandas in python to perform automated data manipulations including: filtering/mining/searching dataframes, counting value combinations, csv handling, split summary statistics.
- Built object relational mapper (ORM) for extracting and mining data using python, whereby the ORM creates SQL at run time.
- Extended ORM to work in conjunction with: Hadoop, SQL Server, Postgres, SAP HANA and other classes derived from these, for domain specific applications using inheritance.
Created data strategy and analytics for new energy plan partnered with Amazon
- Defined data strategy and risks to create the contract between AGL and Amazon to launch AGL Skills (voice control) on Amazon Alexa.
- Built propensity model to predict which customers would purchase the EDM Smart Home energy plan using data from Customer Analytic Record (feeding Pega) and Adobe Analytics.
- Built automated daily reports for AGL Skills and Smart Home product suite, drawing in and manipulating disparate sources from: Azure Table Storage, SAP HANA and network folder csvs, running on an Azure virtual machine.
Increased AGL’s technical understanding of our virtual power plant through analysis
- Conducted a study on spurious electricity grid disconnection issues on 300 battery customers to estimate through experimental modelling the value lost to each customer.
- Analysed battery data to model and understand the value created during bulk battery control operations (orchestration) on 100s of homes in the AGL battery network.
- Analysed raw battery data from disparate files to calculate real-life energy efficiency for new Tesla, LG and SolarEdge energy storage devices (battery & inverter)
Data Engineer, Marketplacer, 2017
Transformed 100s of tables of data across 15 e-Commerce sites to create insight.
- Built ETL software running on AWS to manipulate 100s of tables of data across 15 e-Commerce sites to deliver insight reporting in Tableau.
Data analyst and product lead, GrantTree, 2015-2016
Estimating cost per R&D claim to drive new product development
Analysed 5 years of historical R&D Tax claim data to support a business case for servicing small clients . The business case needed to determine the cost of doing any claim, since small clients have low profit margins. A cost for each claim was successfully estimated and the subsequent business case drove the company’s decision to invest in this project. The project was coded in VBA (in Excel), with graphing and statistics in R. Here are the main challenges:
- Analysing CRM data to determine of the plethora of which fields, which could be relied upon and hence useful.
- Cleaning data, and estimating missing data.
- Building rules to determine the most statistically likely value for missing values, based on the correct ones. Missing data was in the form of erroneous GUI time-stamps for various claim tasks.
- Finding and dealing with dozens of edge cases with claim data (missing claim years, void records, non chronological time-stamps, duplicate records etc)
- Building a model for time spent per employee per month merged with payroll data.
- Fine-tuning a weighted model for claim tasks.
- Automatically calculating the cost of all claim tasks per month.
- Finally, establishing an estimate of cost per claim, on a monthly basis for all claims.
- Visualising the claim cost over time, as a weighted average time series graph to show trends.
- Presenting these results and business case to the whole company to get buy in.
- Establishing a task-force and leading the project.
Building web portal with pandas analytics capability
Built and tested two versions an intuitive web application for clients to upload claim data, known as the portal. This was tested on roughly 30 clients, 7 paid. The information on client’s usage of the application was tracked through google analytics. This data would be merged with others sources and analysed to improve user experience. The minimum viable product was created using Typeforms, with analysis and data manipulation in pandas.
- Building two versions of the portal for client testing, carefully considering what is the minimum viable product. Analysis was required at the tax level, knowing how and what clients typically have difficulties dealing with, and building the software accordingly.
- Connecting company’s bespoke CRM, and google sheets tracking data to google analytics through pandas.
- Merging above data sources, and manipulated into a format for analysis at the client and claim level. Variables looked at included: time spent on page, session duration, time since last interaction etc.
Data mining to find sales leads
Data mining sales leads to find small clients for project ‘the portal’. A list of roughly 100 leads was found from 10s of thousands of unclassified leads. They were to be used for the first email campaign. This data mining was coded in pandas.
- Merging two data-sets with different schemas from CRM: one historical back-up migration, and the current migration.
- Creating rules using regular expressions to determine records most likely to be small clients. Refining the rules through several iterations as new edge cases were found.
- Finding and correct missing lead information automatically by searching data migrations. This was difficult since the information for a lead was often spread across the two migrations.
Dynamic R&D claim value visualisations
Analysing the value of claims of all time, to make better decisions on resource allocation. This was coded in pandas and matplotlib.
- Cleaned all historical claim data to show claim value on a per ‘filing’ basis (filing = one piece of continual client work).
- Transforming database of claims (one record) to a database of filings (from one or two claim records) based on time-stamps. Determining algorithmically when to merge records, and which information from each record will be used.
- Creating algorithm to show claim quality data graphically, in multiple plots based on variables: width of the claim value, number of years to go back, minimum claim value, classification of claim quality.
Data modelling and financial analysis of R&D claims
These were various mini-projects used for faster client work and to train other staff. They were coded in pandas and matplotlib.
- Modelling R&D Tax scheme based on the three variables that affect claim size: amount of expenditure, claim period, tax position.
- Creating algorithm to generate plots of this R&D Tax model, separated into relevant sections of the graph curve by tax position.
- Creating a workflow for analysing client financial data, to find & correct anomalies, aggregate, prepare and calculate an R&D Tax claim.
Software and Controls Engineer, ASC & Navantia, 2012-2013
Building naval ship control system database & search capability
Made accessible 100s of megabytes of control system data, and modelled this in hierarchical form. The data was for the naval ships being built which are: 147m long and weigh 7000 tonnes. They have: 3 mission-critical networks, 19 servers, 1000s of pieces of remote controlled equipment, 10s of 1000s of digital & analogue signals. Databases were stored in Access and merged with SQL. Search capability was created with SQL & custom coded VBA algorithms for the more complicated hierarchical data. Here were the main grouped challenges:
- Cleaning and merging 33 differently formatted and structured Access cable & equipment databases from our supplier. No documentation was provided on the databases. Many of the dozens of fields were unknown and knowing which they corresponded to in other databases required experimentation.
- Creating a model of the control system in hierarchical form. There was a separate database which had incomplete hierarchical information, but not linked with cable or equipment data. This also required building a model to find all daisy-chained equipment, correct anomalies and making chains searchable.
- Making search capability of all merged databases, and hierarchical information. Involved creating a model of advanced search criteria with a finite state machine to allow AND/OR expressions and wildcard searching. Also, enabling the filtering of results to search criteria through by clicking on cell of tables.
- Creating a GUI in Access, with bespoke VBA code enabling mouse and key events to create unique views on screen.
Improving reliability and efficiency of PLC code creation software
Much of the Programmable Logic Controller (PLC) software code for the naval ships could be designed by adding data to a database, then running various scripts to generate the code. This process although innovative, still required manual effort transforming databases and files, and checking data as it was prone to human error. I automated a large part of this data manipulation and checking process to increase the reliability of PLC code and delivery speed of work. This was coded in C#, drawing data from network file systems and an Access database. Here were the main challenges:
- Searching by regular expressions, dozens of types of PLC code files to find the relevant sections to be checked against the database.
- Creating from scratch, a highly customised ‘set’ data-structure to enable fast checking of incongruences between PLC code against the database.
- Creating a workflow system to: rename, move, and search files, and: display & correct data incongruences. The workflow system of roughly 10 stages had to be flexible enough to allow the user to have control for adding and removing the stages depending on design requirements.
- Creating a GUI to run the workflow and display the results of each stage. This GUI had a system of stage dependencies coded within, such that pre-requisite stages were forced on the user when needed.