Testing & QA — Tech News

Designing a Community Skill for AWS Transform Custom: AWS Glue 5.0 Upgrade Readiness

TL;DR I designed a proposed AWS Transform Custom community skill that prepares Glue 2.0, 3.0, and 4.0 repositories for Glue 5.0. It separates safe mec…

aws opensource dataengineering ai

Entry-Level Data Engineering Is Gone. Here's the Proof.

I've been on both sides of the data engineering hiring table for years. Interviewed candidates, been the candidate, watched the market shift underneat…

dataengineering career beginners python

Databricks Workflows vs Airflow vs Dagster: Picking an Orchestrator

Every data team eventually asks the same question: what runs our pipelines, on what schedule, with what retry logic, and who gets paged when it fails.…

databricks airflow dataengineering orchestration

AI Data Lake and Lakehouse

The word "AI" appears in every data platform pitch now. Most of the time, it means nothing — a chatbot bolted onto a dashboard, an LLM that generates …

data dataengineering programmers lakehouae

AI-Generated SQL Has a Silent Failure Problem. Here's a Way to Catch It.

If you've spent any time close to a data warehouse, you already know the scariest bug isn't the one that throws an error. It's the one that returns a …

sql ai dataengineering machinelearning

One Histogram, Four Engines: Fixing the Statistics Silo Problem

TL;DR — DuckDB, DataFusion, Polars, and Postgres each compute and store table statistics their own way, so a histogram built in your ELT pipeline is i…

rust sql dataengineering iceberg

Why Athena/Iceberg Tends to Make Code the Spec

Every time this comes up, someone credits the same setup: Athena on Iceberg is where "the code is the spec" — where you open Git and read the whole sy…

dataengineering iceberg athena ai

Why AI Projects Fail Even with Great Models

Artificial Intelligence is advancing at an incredible pace. Every week, we see announcements about new Large Language Models (LLMs), improved reasonin…

ai machinelearning dataengineering softwareengineering

What Is a Semantic Layer? A Practical Guide for Data Engineers

Your data warehouse has a table called orders. It has columns like amount, status, created_at, and customer_id. Now three people ask "What was Q1…

semanticlayer dataengineering database data

The AI Cheat Tool Your Interview Cannot See

I've been on hiring panels where we spent 45 minutes convinced a candidate was sharp. Articulate answers. Clean code. Solid reasoning. Then in the deb…

dataengineering interview career beginners

I moved 10 million rows in 9.9 seconds with pip install apitap

Last week I published apitap - an open-source engine that moves whole tables between databases. One function, no config files, no pipeline DAGs: impor…

dataengineering python rust programming

File Encryption for the Lakehouse: The Terminology, the Machinery, and the Hard Problem of Interoperable Encrypted Tables

For years, the open lakehouse had an honest gap that practitioners whispered about and slide decks skipped: encryption. Not the checkbox kind, every c…

database dataengineering opensource security

GitOps for Geospatial Data: Building a Self-Healing, Zero-Cost Data Pipeline with GitHub Actions

Most data engineering and geospatial projects follow a predictable infrastructure blueprint: an ingestion cron job, an enterprise database (like PostG…

dataengineering devops github serverless

Scraping Indian government open data in 2026: what actually works

I maintain Village Finder, an open-source mapping project tracking over 78,000 Indian villages. It works by pulling daily raw updates straight from o…

dataengineering opensource python webscraping

Choose a Columnar Format From the Read Path Backward

The columnar-format ecosystem is moving quickly. Feature tables encourage teams to ask which format has the newest encoding. A system-design decision …

dataengineering architecture database performance

Building a Gold Mart by Consolidating Multiple Wide CSVs into EAV

Real-world data sources do not arrive in a single shape. A value with the same meaning may arrive as 생산수량 (production quantity) in one file, units in another, and made in yet another. Temperature, too, is in Celsius in some places, …

dataengineering python etl learning

Why I Set Schema Drift to warn Instead of fail

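In a data pipeline, the moment a source schema changes is ambiguous. If you ignore it unconditionally, the operator never learns that the input structure changed. Conversely, if you treat every schema change as a failure, even a normal column addition…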

dataengineering python etl learning

Safely Skipping Reprocessing of the Same Input with source_hash

Even for a small data pipeline, the assumption that it runs only once quickly turns out to be false. In practice, the same file can be run again. You might retry a failed run, backfill a past date, or …

dataengineering python etl learning

Federation and the Lakehouse: Two Roads to Unified Data Access, and How to Know Which One to Take

Every data strategy document written this decade contains some version of the same sentence: we need a single place to access all our data. The senten…

architecture data database dataengineering

Your Data Pipeline Is Probably More Fragile Than You Think

Most engineering teams don't think much about their data pipelines until something breaks. That's partly because a healthy pipeline is almost invisibl…

dataengineering data

Debezium vs Managed CDC: How to Actually Decide Between Build and Buy

Most "Debezium vs managed tool" articles get the question wrong. They frame it as a product bake-off, feature grid included, and declare a winner. But…

dataengineering database cdc

Stop Digging Through PDFs: Build a FHIR-Standard EHR Knowledge Base with RAG

We’ve all been there: staring at a stack of printed lab results or a folder full of cryptic report_final_v2_NEW.pdf files, trying to remember if our c…

openai rag python dataengineering

PowerBi is smart but also lazy

Introduction: For the past few days I have been doing data cleaning using Power BI, 34 columns to be precise. Data cleaning and transformation to me alway…

data dataengineering microsoft performance

What does electricity have to do with knowledge graphs?

Christopher Nolan? Crime? Not really... But this is what my first tries for a knowledge graph did. It wasn't hallucinating. There was a real directed_…

ai data database dataengineering

Day 57: Internals of ClickHouse® Data Parts and Merges – A Complete Guide

Introduction: One of the reasons ClickHouse® can handle massive analytical workloads with exceptional speed is its storage architecture. Unlike traditi…

clickhouse devops analytics dataengineering

Day 50 - How to Migrate Data from MySQL to ClickHouse®: A Step-by-Step Guide

Introduction: As applications grow, traditional relational databases such as MySQL may struggle with analytical workloads involving millions of records…

clickhouse devops database dataengineering

Day 48 - Sharding Strategies in ClickHouse®

As data volumes grow from gigabytes to terabytes and eventually petabytes, a single database server often becomes a bottleneck. Storage limitations, C…

clickhouse devops database dataengineering

How to Build a Modern On-Premise Data Lakehouse (Without Vendor Lock-in)

Building a modern analytical platform usually rhymes with one thing: migrating to the cloud. But what happens when operational realities, budget const…

dataengineering architecture opensource devops

Day 41: Monitoring ClickHouse® Performance Metrics

Introduction: Monitoring is a fundamental part of operating a healthy ClickHouse® deployment. As databases g…

clickhouse devops database dataengineering

How to Stream & Flatten 1GB+ JSON to CSV in the Browser Without Memory Leaks

As developers, data engineers, or analysts, we’ve all been there: you download a massive database export, a logging stack dump, or a transaction archi…

dataengineering javascript performance tutorial