Installing Tasmota on a 6€ smart socket with energy monitoring

Do you like sharing your home telemetry data with others? I do not, so the first thing I do after buying a new smart socket is re-flash it with open-source firmware, and Tasmota is my favorite. Some smart sockets require a bit of magic to flash Tasmota into these...

Integration tests of Spark applications

Writing Apache Spark applications is no different from writing any other code: you should test it with both unit and integration tests. Unfortunately, despite the huge value they provide, integration tests of big data applications are often skipped...

Anomaly detection in podcasting

Detecting anomalies in incoming data is one of the main tasks of a Data Engineer. Recently I implemented a detection and notification system for one of my clients, Acast. It turned out that simple logarithmic curve fitting solves most of the challenges.

Running Apache Spark on AWS

If you use AWS and deploy big data jobs there with Apache Spark, you probably use AWS EMR. But it is not the only managed service you can use! For one of my clients I deployed a few jobs on AWS Glue and AWS Fargate, and they work much more efficiently than the EMR-based ones. There is no "one tool fits all" for Big Data on AWS.

GPU-based workloads as a part of Airflow DAGs

Apache Airflow allows you to orchestrate not only classic ETL jobs that copy data from one place to another; it is also extremely helpful for running parts of your pipeline on specific hardware. For one of my projects I used it to run a compute-heavy job on GPU nodes in AWS.

Reading JSON, CSV and XML files efficiently in Apache Spark

With Apache Spark you can easily read semi-structured files such as JSON and CSV using the standard library, and XML files with the spark-xml package. Sadly, loading files may take a long time, as Spark needs to infer the schema of the underlying records by reading them. That's why I'm going to explain possible improvements and show an idea for handling semi-structured files in a very efficient and elegant way.