Starting with (py)Spark

In recent years the industry has shifted heavily toward Python for data engineering. Spark, one of the leading distributed data processing engines, has great support for Python via the PySpark library, allowing users to write high-performance distributed data processing jobs on Spark from the comfort of Python. In this article I want to touch on some high-level topics relating to using PySpark which I have seen skimmed over at companies where I have worked, leading to varied and generally poor data pipeline quality....

2024-07-28

Road to Declarative Systems

Over the past few years I have worked at different companies building data lakehouses/meshes, and one constant challenge has been the need to understand the state of things (cloud, platform, pipelines, computers, etc.) and how to document and mutate that state in a straightforward manner. Why declarative systems? Let's say you were tasked with ensuring that all cloud storage is configured in a manner that matches the new standards put in place by the security team....

Python Packaging with Poetry

It’s no surprise that new developers are confused when it comes to developing Python libraries; there have been major changes over the years in how to work with Python packages, and currently you may ask yourself what to use out of setup.py, setup.cfg and pyproject.toml. Before we look at these, it is important to first make sure we properly understand Python modules and packages. Python Modules In Python a module is simply a file with a ....
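The module/package distinction the teaser opens with can be sketched in a few lines — a module is a single .py file, and a package is a directory containing an `__init__.py`. The package and function names below (`mypkg`, `greet`, `hello`) are hypothetical, purely for illustration:

```python
import importlib
import pathlib
import sys
import tempfile

# Build a throwaway package on disk to show the structure.
tmp = tempfile.mkdtemp()
pkg = pathlib.Path(tmp) / "mypkg"            # hypothetical package name
pkg.mkdir()
(pkg / "__init__.py").write_text("")         # marks the directory as a package
(pkg / "greet.py").write_text(              # greet.py is a module inside it
    "def hello():\n    return 'hello'\n"
)

# Make the temporary directory importable, then import the module.
sys.path.insert(0, tmp)
greet = importlib.import_module("mypkg.greet")
print(greet.hello())  # prints: hello
```

Tools like setup.py, setup.cfg and pyproject.toml all exist to describe how directories like `mypkg` above get built and distributed — which is where the confusion the article addresses begins.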