Do you like sharing your home telemetry data with others? I do not, so the first thing I do after buying a new smart socket is re-flash it with open-source firmware, and Tasmota is my favorite. Some smart sockets require a bit of magic to flash Tasmota onto them...
Setting up an Istio-powered cluster is easy, but once it is created, you need to take care of restricting access to your services. One of the most effortless options is to use an external OAuth2 provider, and if you use a recent Istio version, it's only a matter of simple configuration.
Writing Apache Spark applications is no different from writing any other code: you should test it with both unit and integration tests. Unfortunately, despite the huge value they provide, integration tests of big data applications are often skipped...
Detecting anomalies in incoming data is one of the main tasks of a data engineer. Recently I implemented a detection and notification system for one of my clients, Acast. It turned out that simple logarithmic curve fitting solves most of the challenges.
If you use AWS and deploy big data jobs there with Apache Spark, you probably use AWS EMR. But it is not the only managed service you can use! For one of my clients I deployed a few jobs on AWS Glue and AWS Fargate, and they work much more efficiently than the EMR-based ones. There is no "one tool fits all" for big data on AWS.
Apache Airflow lets you orchestrate not only classic ETL jobs that copy data from one place to another; it is also extremely helpful for running part of your pipeline on specific hardware. For one of my projects I used it to run a compute-heavy job on GPU nodes in AWS.
With Apache Spark you can easily read semi-structured files such as JSON and CSV using the standard library, and XML files with the spark-xml package. Sadly, loading files may take a long time, as Spark needs to infer the schema of the underlying records by reading them. That's why I'm going to explain possible improvements and show an idea for handling semi-structured files in a very efficient and elegant way.