Arrowjet is now a Cross-Database Sync Tool in Python (PG, MySQL, Redshift)
I've been building Arrowjet, an open-source Python library for fast bulk data movement. It started as a Redshift speed tool, but it now supports Postg…
Latest Web news from Tech News
The uncomfortable truth about AI infrastructure that nobody is talking about — and why your stack might be optimizing for the wrong metric In February…
We had a slightly reckless idea: what if we let AI do most of our data engineering work? Not "help with a query here and there," but actually build re…
TL;DR: I built an app that audits data pipelines using AI agent sessions. It analyzes tasks, dynamic tables, streams, pipes, and more -- then offers i…
You monitor schema changes in a data warehouse by periodically querying metadata catalogs (like INFORMATION_SCHEMA), subscribing to event-driven noti…
Python is the king of data science, but it charges a heavy price for convenience. When you use pd.read_csv() on a 10GB+ file, Python attempts to load …
Modern data engineering revolves around automation, reliability, and scalability. Writing an ETL script in Python is only the beginning. To transform …
Every Python developer loading data into PostgreSQL hits the same wall. executemany() with 1M rows? 16 minutes. df.to_sql()? Same thing: it generate…
TL;DR Operations/Systems engineer recently moved to the software side via AI collaboration. Built a domain-specific entity resolution tool in a handfu…
EXPLAIN ANALYZE is the standard tool for understanding how PostgreSQL runs a query. It shows the chosen plan, estimated and actual row counts, and exe…
Overview Databricks Genie is designed to let business users ask questions in plain language and receive answers grounded in governed enterprise data i…
The conventional wisdom for data platform modernization goes like this: pick a target system, build ETL pipelines for every source, migrate everything…
Modern data platforms are no longer simple pipelines—they are distributed ecosystems. Data moves across clouds, microservices, event streams, APIs, wa…
Introduction Databricks has become a core platform for data engineering, analytics, and machine learning. It brings flexibility and scalability, but i…
Have you ever wondered how wearable devices manage the massive deluge of data coming from your body every second? We are talking about high-frequency …
Google Cloud Next '26 dropped a lot of flashy headlines — eighth-gen TPUs, Gemini Enterprise Agent Platform, AI-powered security ops. But the announce…
Let’s be honest: our medical history is usually a chaotic mess of scattered PDFs, blurry smartphone photos of prescriptions, and "I think I had a feve…
As a data engineer, there is a myriad of tools to choose from in the quest to make clear data available for analysis. Clean data leads to valuable insights and bus…
Two weeks past the Iceberg Summit, the San Francisco in-person alignments are now translating into formal proposals and code on the dev lists. Iceberg…
DataFrames & SQL in Databricks: Reading, Writing, and Transforming Data This is where things get real. So far we've set up our environment, unders…
So, you know Python and think, "Hey, why don't I get into Data Engineering?" You have your learning checklist ready to go, but there’s a giant, termin…
The Data Titans: Diving Deep into the World of Columnar Databases (ClickHouse & Snowflake) Hey there, fellow data enthusiasts! Ever feel like you'…
Part 7 - Spark Transform Local vs Cloud ⚡ This part continues from the API client layer and explains the transformation job in spark_jobs/air_quality_…
Part 6 - API Client Design and Reliability 🔁 This part continues from the ingestion DAG and explains the reusable client functions in dags/air_quality…
Part 5 - Ingestion DAG and Raw Storage 📥 This part continues from the runtime config and looks at the first real Airflow DAG in the chain: dags/api_in…
Part 4 - Airflow Runtime and Shared Config ⚙️ This part continues from the bootstrap logic and explains the configuration layer that keeps the rest of…
Part 3 - Station Sampling and Cache Building 🗂️ This part continues from the data source overview and focuses on the bootstrap script that prepares th…
Part 2 - Data Sources and Domain Model 📡 This part continues directly from the architecture overview. Now that the overall flow is clear, the next que…
Part 1 - Introduction and End-to-End Architecture 🌍 This project was built as part of the Data Engineering Zoomcamp final project. The goal is simple …
TL;DR ClickHouse has full native JSON support, and has since v25.3. The JSON type stores each path as a separate columnar subcolumn with native type p…