Data Integration Hub
The Problem
Your marketing team asks “how many active subscribers do we have?” and three systems give three different answers. The billing platform says one thing, the email marketing tool says another, the data warehouse disagrees with both. Now try building a retargeting audience from that mess.
That was the reality for this European e-commerce group: subscription data siloed across multiple platforms, no unified view of customer lifecycles, and upstream exports riddled with shifting columns, freetext-encoded business rules, and silent data corruption. I identified this fragmentation and designed the full architecture from scratch.
The Solution
A centralized data integration platform with 5 automated ETL pipelines extracting, normalizing, and enriching data from multiple SaaS platforms via a cloud data lake, delivering structured order documents to downstream CRMs and daily-refreshed retargeting audiences to advertising platforms.
The system evolved through 4 major phases (from a single retargeting query to a 134-file platform with 30+ database tables) as production data revealed new inconsistencies demanding immediate solutions:
- Column-shift bug: an upstream field value caused entire rows to shift by one column. Built automated detection and remapping to silently correct this on import.
- Freetext field parsing: purchase funnel steps (9 types), charge categories (6 types), and subscription classifications (7 types) were all encoded as substrings inside freetext fields. Wrote parsers and classifiers to extract structured data.
- Refund logic rework: “goodwill gesture” refunds followed an entirely different data pattern, discovered only in production. Required a complete rewrite with multi-invoice partial credit note logic and database transactions.
- Test data pollution: multi-layered detection (account flags, email/name blocklists, known test individuals) to prevent test records from corrupting production analytics.
The core pipelines run daily:
- Billing Raw Import (05:00): imports 16 entity types across 3 brands with incremental watermark-based batch history tracking
- Billing JSON Enrichment: transforms raw data into enriched, classified records with business logic and xxh3 checksum-based change detection
- Order Assembly: joins data across accounts, contacts, subscriptions, invoices, payments, and rate plans into structured JSON documents, handling subscription chain resolution and partial refunds
- Email Marketing Import (03:00): imports 4 email engagement event types for subscriber analytics
- Retargeting (06:00): outputs anonymized subscriber lists via hashed emails for advertising platform audiences
Every data mutation is tracked for a complete audit trail.
Tech Stack
- Framework: Laravel (PHP)
- Database: MySQL
- Data Lake: AWS Athena (query engine over S3-backed data warehouse)
- Upstream Systems: Subscription billing platform, email marketing platform
- Infrastructure: Docker, AWS EKS (Kubernetes), AWS ECR
- Audit: Spatie Activity Log + JSON diff (swaggest/json-diff)
- CI/CD: Bitbucket Pipelines
- Monitoring: Sentry, Laravel Telescope
Outcomes
- 100K+ subscription records processed across multiple brands (12-month sample)
- 15,700+ order documents assembled and delivered to downstream CRM
- 20 entity types imported daily (16 from billing, 4 from email marketing)
- 3,900+ data mutations tracked with full audit trails
- 8 daily scheduled jobs across 3 time slots
- Eliminated manual data reconciliation and provided reliable, daily-refreshed retargeting audiences