Giuseppe Santoro

Senior Data Engineer

Career Profile

Senior Data Engineer with 8 years of experience in Big Data, including training and consultancy. Highlights include developing and supporting data pipelines at scale across multiple industries. Key interests: distributed systems and Linux operating systems.

Experience

Big Data Engineer
Sony Interactive Entertainment (Industry: Gaming) October 2018 - August 2021
Game Analytics Team at Worldwide Studios
  • Wrote a standalone application to compress, deduplicate and re-partition historical and daily game analytics data, making it possible to query months' worth of data in tens of seconds. Previously, querying a single month took hours, and longer periods ran out of memory (Golang, AWS S3, AWS Athena, AWS Glue).
  • Wrote a Lambda to modify game analytics events in real time. The Lambda replaced IDs in the events according to custom rules, per game and per event type, stored in AWS S3. Part of this project included writing an in-memory cache for the S3 configs to reduce processing time (Python, AWS Kinesis, AWS Lambda).
  • Wrote a Lambda to offload data from the real-time pipeline to the data warehouse (Golang, AWS Kinesis, AWS Kinesis Firehose, AWS S3)
  • Contributed to the successful launch and ongoing support of 10+ PlayStation first-party games from Worldwide Studios by providing capacity estimations and cost optimizations, modifying configs, fixing bugs and being on call (AWS EMR, AWS Lambda, AWS Kinesis, Golang, Python, Terraform, Jenkins)
  • Contributed to a major refactoring of the data pipeline (Golang, Python, AWS Kinesis, EMR, AWS Lambda, Terraform)
  • Reduced the time Engineers spent manually creating Grafana users by writing an automation script that synchronized Okta users with Grafana users (Python)
  • Wrote a script that would allow Engineers from a Studio to search, download and transcode videos according to different criteria (Python).
  • Maintained and improved 10+ internal microservices, 20+ EMR jobs, 5+ AWS Lambdas to support the data pipeline (Python, Golang, Terraform, Ansible)
Senior Software Engineer
Skimlinks (Industry: Advertising) March 2017 - August 2018
Platform team
  • Reduced the running time of the daily pipeline by ~4x (from 8 hours to 2) by optimizing several Hive queries and replacing the legacy ingestion code with a Google Dataflow job written in Java, loading 120 million documents into Elasticsearch.
  • Achieved 30% savings on my team's infrastructure costs by replacing the legacy Hadoop cluster with a new setup offering the same processing power plus Kerberos security
  • Wrote a Google Dataflow job in Java as part of the migration of the data pipeline from Hive on-premise to Google BigQuery
  • Automated the scheduling of the data pipeline by creating, configuring and supporting an Airflow cluster running on Kubernetes on Google Cloud, and by writing various Airflow jobs in Python
  • Improved various parts of the data pipeline (SQL for Hive queries, Java for Hive UDFs and Hive SerDes)
  • Developed scripts to collect and display data pipeline statistics into InfluxDB and Grafana (Python)
Big Data Engineer
BenevolentAI (Industry: Pharmaceutical research) July 2015 - February 2017
Platform team
  • Reduced the ingestion time of 40 million scientific publications into Elasticsearch by ~8x (from 24 hours to 3) by rewriting one of the steps of the data pipeline (Spark, Hadoop cluster on AWS).
  • Deployed, configured and maintained a variety of Hadoop, Elasticsearch and Cassandra clusters, both bare-metal on-premises and in the cloud (Python, Ansible, AWS)
  • Contributed to the migration of the batch data pipeline from a bare-metal Hadoop cluster to an Amazon EMR cluster by rewriting a Java MapReduce job as a Spark job in Scala
  • Automated the scheduling of the data pipeline by writing multiple Airflow jobs (Python)
  • Improved the data pipeline alerting capabilities (Python and Bash scripting)
  • Contributed to and maintained various apps running on Mesos and Marathon as Docker containers
Senior Big Data Engineer
Big Data Partnership (Industry: Software; acquired by Teradata in 2016) November 2013 - July 2015
  • For 2 consecutive years, delivered the official five-day Hadoop training course at the Hadoop Summit in partnership with Hortonworks
  • Delivered 5+ training courses (3-5 days each) on Hadoop administration and development at client offices across Europe
  • Installed, configured and optimized 3 Hadoop clusters on bare-metal machines for different clients
  • Delivered 5+ consultancy sessions on Hadoop and Big Data for various clients
  • Obtained 3 Hortonworks certifications on Hadoop as Administrator (with grade 83.33/100), Developer (with grade 92/100) and Java Developer (with grade 93/100)
  • Lead Engineer in a team of 3 Engineers on a 4-month greenfield project for a client
  • Contributed to the development of the internal batch data pipeline (HBase, Pig, Hadoop)
  • Attended various training courses, including Spark by Databricks and Cassandra Administrator and Developer training by DataStax
Software Engineer
VisualDNA (Industry: Advertising) January 2013 - October 2013
Integrations Team
  • Developed 5+ integrations with external data providers over FTP (Java)
  • Developed an internal RESTful API to serve data from a NoSQL database (Java, AWS SimpleDB)
  • Wrote various parts of the internal data pipeline (SQL, Hive, Pig)
  • Wrote a client for Kafka (Java)

Education

MSc in Computer Engineering
University of Catania, Italy September 2006 - July 2012
110/110 cum laude (with honours)
BSc in Computer Engineering
University of Palermo, Italy September 2002 - July 2006
110/110 cum laude (with honours)

Skills

Backend: Python, Golang, Java, SQL
AWS: Lambda, Kinesis, Glue, Athena, EMR, S3
Google Cloud Platform: Dataflow, BigQuery
Distributed Systems: Spark, Hadoop, Hive, Pig, Elasticsearch
DevOps: Docker, Terraform, Ansible, Fabric, Bash scripting

Languages
Italian (Native)
English (Fluent)