Okay, so I’ve been messing around with this thing called “Iceberg,” and let me tell you, it’s been a journey. I started out thinking, “Oh, this will be a quick little project,” but like an actual iceberg, most of the complexity was hidden beneath the surface.
First, I needed to get my environment set up. I kept it simple, just enough for a basic demonstration:
Installed Spark.
Installed the Iceberg libraries that Spark needs.
Spun up a Jupyter Notebook, ’cause that’s how I like to roll when I’m experimenting.
Next up, I got Spark up and running.
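Here's a minimal sketch of what that step can look like in a notebook. It is not a definitive setup: the catalog name `local`, the warehouse path, the app name, and the helper names `iceberg_conf` and `build_session` are all placeholders chosen for illustration, and it assumes pyspark plus a matching Iceberg Spark runtime jar are already available. The property keys themselves are the standard Iceberg catalog settings for Spark.

```python
def iceberg_conf(catalog="local", warehouse="/tmp/iceberg-warehouse"):
    """Spark settings for a file-based (Hadoop-type) Iceberg catalog.

    `catalog` and `warehouse` are placeholder values, not requirements.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Turns on Iceberg's SQL extensions for Spark.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers a catalog under the chosen name, backed by Iceberg.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # "hadoop" keeps table metadata as plain files under the warehouse path.
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
    }


def build_session(conf):
    """Create a SparkSession from the settings (needs pyspark installed)."""
    # Imported lazily so iceberg_conf() can be inspected without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("iceberg-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = iceberg_conf()
for key, value in conf.items():
    print(f"{key} = {value}")

# In the notebook you would then run:
# spark = build_session(conf)
```

With a named catalog like this, tables are normally addressed as `catalog.namespace.table` (for example `local.db.my_table`) unless the session's default catalog is changed.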
I needed some data to play with, so I decided to create my own table and run a few simple transactions against it.
spark.sql("CREATE TABLE my_table (id INT, name STRING) USING iceberg")
spark.sql("INSERT INTO my_table VALUES (1, 'Alice')")
spark.sql("INSERT INTO my_table VALUES (2, 'Bob')")
Then I started testing update and delete operations.
spark.sql("UPDATE my_table SET name = 'Charlie' WHERE id = 1")
spark.sql("DELETE FROM my_table WHERE id = 2")
I quickly realized that I was only seeing the tip of the iceberg, so I started digging into how it works under the hood.
Iceberg uses a clever system of manifest files and data files: the manifests keep track of which data files make up the table. It’s like a super-organized filing cabinet for your data. When you make changes, it doesn’t rewrite the whole table. Instead, it writes new files and commits a new snapshot whose metadata points to the new stuff. That’s usually way cheaper than rewriting everything, especially when you’re dealing with tons of data.
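To make that idea concrete, here's a toy model in plain Python. It is not Iceberg's actual code or file format: the `ToyTable` and `Snapshot` classes and the file naming are made up, the real metadata layers (metadata file, manifest list, manifests) are collapsed into one snapshot object, and deletes are modeled copy-on-write style only. What it does show is the core trick: data files are never modified, and each change just commits a new snapshot that points at a different set of files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # names of the data files that make up this version


class ToyTable:
    """Toy model: immutable data files plus a list of snapshots."""

    def __init__(self):
        self.files = {}                      # file name -> rows, never changed once written
        self.snapshots = [Snapshot(0, ())]   # snapshot 0 is the empty table

    def _write_file(self, rows):
        name = f"data-{len(self.files):03d}.parquet"  # made-up naming scheme
        self.files[name] = tuple(rows)
        return name

    def _commit(self, data_files):
        snap = Snapshot(len(self.snapshots), tuple(data_files))
        self.snapshots.append(snap)
        return snap

    def append(self, rows):
        # New rows go into a new file; existing files are reused untouched.
        current = self.snapshots[-1].data_files
        return self._commit(current + (self._write_file(rows),))

    def delete_where(self, predicate):
        # Only files that contain matching rows get rewritten.
        kept = []
        for name in self.snapshots[-1].data_files:
            rows = self.files[name]
            if not any(predicate(r) for r in rows):
                kept.append(name)
                continue
            survivors = [r for r in rows if not predicate(r)]
            if survivors:
                kept.append(self._write_file(survivors))
        return self._commit(kept)

    def scan(self, snapshot_id=None):
        # Read the latest snapshot by default, or an older one by id.
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id]
        return [row for name in snap.data_files for row in self.files[name]]


table = ToyTable()
table.append([(1, "Alice")])
before = table.append([(2, "Bob")])
table.delete_where(lambda row: row[0] == 2)

print(table.scan())                    # [(1, 'Alice')]
print(table.scan(before.snapshot_id))  # [(1, 'Alice'), (2, 'Bob')]
```

Notice that the delete never touched Alice's file, and that Bob's old file is still sitting there, which is why reading the earlier snapshot still works.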
I spent a good chunk of time just reading through the Iceberg documentation and some helpful blog posts. It’s not exactly light reading, but it’s worth it if you want to understand what’s going on.
I’m still exploring all the cool things Iceberg can do, like time travel (going back to older versions of your table) and schema evolution (changing the structure of your table without breaking everything). It’s a powerful tool, and I’m excited to keep learning more about it!
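For a taste of those two features, here's a small sketch of the kind of statements involved. The table name `local.db.my_table`, the helper function names, the `email` column, and the snapshot id are all hypothetical; the SQL forms themselves (the `snapshots` metadata table, `VERSION AS OF` in Spark 3.3 and later, and `ALTER TABLE ... ADD COLUMN`) come from the Iceberg Spark documentation. The helpers only build strings, so each one would still need to be passed to `spark.sql(...)` in a real session.

```python
TABLE = "local.db.my_table"  # assumed catalog.namespace.table; adjust to your setup


def list_snapshots_sql(table=TABLE):
    # Every Iceberg table has a `snapshots` metadata table listing its commits.
    return (
        f"SELECT snapshot_id, committed_at, operation "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )


def time_travel_sql(snapshot_id, table=TABLE):
    # Reads the table as it looked at the given snapshot.
    return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"


def add_column_sql(column, col_type, table=TABLE):
    # Schema evolution: a metadata-only change, existing data files stay as they are.
    return f"ALTER TABLE {table} ADD COLUMN {column} {col_type}"


print(list_snapshots_sql())
print(time_travel_sql(1234567890))  # placeholder id; real ids come from the snapshots table
print(add_column_sql("email", "STRING"))
```

Rows written before the new column existed simply read back with a null in that column.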