<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://dataswamp.tech/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dataswamp.tech/" rel="alternate" type="text/html" /><updated>2026-03-15T13:41:17+00:00</updated><id>https://dataswamp.tech/feed.xml</id><title type="html">DataSwamp.Tech</title><subtitle>This website covers Data Swamps - environments which were once clean Data Lakes, but as the time passes, requirements change and the solutions evolve. In the end we&apos;re left with a fragile ecosystem, muddy, filled with bugs and critters, that&apos;s called a Data-Swamp.</subtitle><entry><title type="html">Hater&apos;s guide to Iceberg - Part 1 - Deleting a table</title><link href="https://dataswamp.tech/iceberg/2026/02/10/deleting-tables.html" rel="alternate" type="text/html" title="Hater&apos;s guide to Iceberg - Part 1 - Deleting a table" /><published>2026-02-10T12:00:00+00:00</published><updated>2026-02-10T12:00:00+00:00</updated><id>https://dataswamp.tech/iceberg/2026/02/10/deleting-tables</id><content type="html" xml:base="https://dataswamp.tech/iceberg/2026/02/10/deleting-tables.html"><![CDATA[<p>In most database and data-lake systems, deleting a table is easy: <code class="language-plaintext highlighter-rouge">DROP TABLE tbl_name</code> and it's done… but under Iceberg it's more complicated, since it might leave behind stray files and directories.</p>

<h2 id="basic-sql-approach">Basic SQL approach</h2>

<div style="padding:20px; margin:20px 0; border:1px solid #eee; border-left-color:#428bca; border-left-width: 5px; border-radius: 3px"><b>📌 Note:</b> 
<p>Every sample of code has been tested with JDK 17, <a href="https://ammonite.io/">Ammonite 3.0</a> and Scala 2.13. The libraries used are Spark 4.0.2 and Iceberg 1.10.1. The launch command is below.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">JAVA_OPTS</span><span class="o">=</span><span class="s1">'--add-exports java.base/sun.nio.ch=ALL-UNNAMED'</span> amm example.sc
</code></pre></div></div>
</div>

<p>First, the <code class="language-plaintext highlighter-rouge">DROP TABLE</code> command doesn't delete the data: it only removes the
table's name from the table catalog, leaving the table's files untouched – this behavior
is similar to Spark's <a href="https://spark.apache.org/docs/4.1.1/sql-ref-syntax-ddl-drop-table.html">external table</a> use-case. This is
<a href="https://iceberg.apache.org/docs/1.10.0/spark-ddl/#drop-table">documented</a>, and mostly true (more on that later).</p>

<button class="accordion"><p><strong>Code example (click to expand)</strong></p>
</button>
<div class="panel">
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.spark::spark-core:4.0.2`</span>
<span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.spark::spark-sql:4.0.2`</span>
<span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.iceberg::iceberg-spark-runtime-4.0:1.10.1`</span>
<span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.xerial:sqlite-jdbc:3.51.1.0`</span>

<span class="c1">// Cleanup command:</span>
<span class="c1">//    rm -rf database.db spark-warehouse/ warehouse/</span>

<span class="k">import</span> <span class="nn">org.apache.spark.sql.SparkSession</span>

<span class="nd">@main</span>
<span class="k">def</span> <span class="nf">main</span><span class="o">()</span> <span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
  <span class="k">val</span> <span class="nv">spark</span> <span class="k">=</span> <span class="nc">SparkSession</span>
              <span class="o">.</span><span class="py">builder</span><span class="o">()</span>
              <span class="o">.</span><span class="py">appName</span><span class="o">(</span><span class="s">"Iceberg table deletion"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">master</span><span class="o">(</span><span class="s">"local[*]"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.extensions"</span><span class="o">,</span> <span class="s">"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog"</span><span class="o">,</span> <span class="s">"org.apache.iceberg.spark.SparkSessionCatalog"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog.type"</span><span class="o">,</span> <span class="s">"jdbc"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog.uri"</span><span class="o">,</span> <span class="s">"jdbc:sqlite:database.db"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog.warehouse"</span><span class="o">,</span> <span class="s">"warehouse"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">getOrCreate</span><span class="o">()</span>

  <span class="nv">spark</span><span class="o">.</span><span class="py">sql</span><span class="o">(</span><span class="s">"""
    CREATE TABLE db.table
    USING iceberg
    PARTITIONED BY (bucket(16, id))
    AS (
      VALUES (22, 'aa'), (33, 'bb'), (44, 'cc')
      AS sample_val (id, data)
    )
  """</span><span class="o">)</span>

  <span class="nv">spark</span><span class="o">.</span><span class="py">sql</span><span class="o">(</span><span class="s">"DROP TABLE db.table"</span><span class="o">)</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
|-- database.db
|-- spark-warehouse
|-- test.sc
`-- warehouse
    `-- db
        `-- table
            |-- data
            |   |-- id_bucket=1
            |   |   `-- 00000-3-e97ebdb5-f2cb-4873-9641-5a0073ca6bcf-0-00002.parquet
            |   `-- id_bucket=13
            |       `-- 00000-3-e97ebdb5-f2cb-4873-9641-5a0073ca6bcf-0-00001.parquet
            `-- metadata
                |-- 00000-565b7429-b403-4f8b-b630-24492efaa7e2.metadata.json
                |-- da88a3e1-04d4-4aee-844e-dc6887bde34c-m0.avro
                `-- snap-1690246185083637803-1-da88a3e1-04d4-4aee-844e-dc6887bde34c.avro
</code></pre></div></div>
</div>

<h2 id="better-sql-approach">Better SQL approach</h2>

<p>To delete the tracked files belonging to a table, we have to run
<code class="language-plaintext highlighter-rouge">DROP TABLE tbl_name PURGE</code> instead. This solves most of the problems…</p>

<button class="accordion"><p><strong>Code example (click to expand)</strong></p>
</button>
<div class="panel">
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.spark::spark-core:4.0.2`</span>
<span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.spark::spark-sql:4.0.2`</span>
<span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.iceberg::iceberg-spark-runtime-4.0:1.10.1`</span>
<span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.xerial:sqlite-jdbc:3.51.1.0`</span>

<span class="c1">// Cleanup command:</span>
<span class="c1">//    rm -rf database.db spark-warehouse/ warehouse/</span>

<span class="k">import</span> <span class="nn">org.apache.spark.sql.SparkSession</span>

<span class="nd">@main</span>
<span class="k">def</span> <span class="nf">main</span><span class="o">()</span> <span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
  <span class="k">val</span> <span class="nv">spark</span> <span class="k">=</span> <span class="nc">SparkSession</span>
              <span class="o">.</span><span class="py">builder</span><span class="o">()</span>
              <span class="o">.</span><span class="py">appName</span><span class="o">(</span><span class="s">"Iceberg table deletion"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">master</span><span class="o">(</span><span class="s">"local[*]"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.extensions"</span><span class="o">,</span> <span class="s">"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog"</span><span class="o">,</span> <span class="s">"org.apache.iceberg.spark.SparkSessionCatalog"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog.type"</span><span class="o">,</span> <span class="s">"jdbc"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog.uri"</span><span class="o">,</span> <span class="s">"jdbc:sqlite:database.db"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog.warehouse"</span><span class="o">,</span> <span class="s">"warehouse"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">getOrCreate</span><span class="o">()</span>

  <span class="nv">spark</span><span class="o">.</span><span class="py">sql</span><span class="o">(</span><span class="s">"""
    CREATE TABLE db.table
    USING iceberg
    PARTITIONED BY (bucket(16, id))
    AS (
      VALUES (22, 'aa'), (33, 'bb'), (44, 'cc')
      AS sample_val (id, data)
    )
  """</span><span class="o">)</span>

  <span class="nv">spark</span><span class="o">.</span><span class="py">sql</span><span class="o">(</span><span class="s">"DROP TABLE db.table PURGE"</span><span class="o">)</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
|-- database.db
|-- spark-warehouse
|-- test.sc
`-- warehouse
    `-- db
        `-- table
            |-- data
            |   |-- id_bucket=1
            |   `-- id_bucket=13
            `-- metadata
</code></pre></div></div>
</div>

<p>We see that, yes, the files have been deleted… but the directories remain. This is
a <a href="https://en.wikipedia.org/wiki/Leaky_abstraction">leaky abstraction</a>:
the details of the storage layer are not completely hidden. If the storage
layer for the data-files uses object-storage semantics ("objects are binary blobs
in a key-value storage, directories aren't real"), nothing is left once the objects are gone.
But if the storage layer is a filesystem where empty directories can exist (such as a local filesystem or HDFS),
we're left with a lot of empty directories. This is a known &amp; ignored
<a href="https://github.com/apache/iceberg/issues/9956">issue</a>.</p>
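
<p>On a local filesystem, the leftover directories can be pruned with a few lines of plain JVM code. The sketch below is only an illustration: <code class="language-plaintext highlighter-rouge">pruneEmptyDirs</code> is a hypothetical helper (it's not part of Iceberg or Spark), and it assumes the table location is a local path; for HDFS you'd need the equivalent calls from the Hadoop filesystem API.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.io.File
import java.nio.file.Files

// Hypothetical helper (not part of Iceberg or Spark): removes `dir` and every
// directory below it that holds no files. Returns true when `dir` itself was removed.
def pruneEmptyDirs(dir: File): Boolean =
  if (!dir.isDirectory || Files.isSymbolicLink(dir.toPath)) {
    // Regular files and symbolic links are never touched.
    false
  } else {
    // Prune the children first, so a directory that only held empty directories becomes empty.
    Option(dir.listFiles()).getOrElse(Array.empty[File]).foreach(pruneEmptyDirs(_))
    // File.delete() refuses to remove a non-empty directory, so leftover files survive.
    dir.delete()
  }
</code></pre></div></div>
<p>With the layout from the result above, calling <code class="language-plaintext highlighter-rouge">pruneEmptyDirs(new File("warehouse/db/table"))</code> after the purge should remove the table directory together with its empty <code class="language-plaintext highlighter-rouge">data</code> and <code class="language-plaintext highlighter-rouge">metadata</code> subdirectories. If any file is still present, its parent directories are kept.</p>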

<p>There's also the problem of orphan files: sometimes files have been uploaded
to the storage path, but were never committed as belonging to the table (due to interrupted
transactions, crashed jobs, etc.). Since the Iceberg layer will only delete the files which
it knows about, the orphan files will remain stored even after the table is deleted.</p>
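
<p>A quick way to spot such leftovers on a local filesystem is to list whatever still sits under the table location once the table has been purged. This is a minimal sketch, not an Iceberg feature: <code class="language-plaintext highlighter-rouge">leftoverFiles</code> and <code class="language-plaintext highlighter-rouge">isRealDir</code> are hypothetical helpers, and the table location is assumed to be a local path.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.io.File
import java.nio.file.Files

// Hypothetical helper: true for directories, false for files and symbolic links.
def isRealDir(f: File): Boolean =
  if (Files.isSymbolicLink(f.toPath)) false else f.isDirectory

// Hypothetical helper: every file still present under `dir`, at any depth.
def leftoverFiles(dir: File): List[File] = {
  // listFiles() returns null when `dir` is missing or unreadable; treat that as "nothing here".
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File]).toList
  children.filter(_.isFile) ++ children.filter(isRealDir(_)).flatMap(leftoverFiles(_))
}
</code></pre></div></div>
<p>For example, <code class="language-plaintext highlighter-rouge">leftoverFiles(new File("warehouse/db/table")).foreach(println)</code> prints every file that survived the purge. Under the setup used in this article, anything it prints is a file that Iceberg didn't track.</p>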

<h2 id="adding-a-manual-step">Adding a manual step</h2>

<p>At this point, the steps to properly delete an Iceberg table (and all its associated elements) are:</p>
<ol>
  <li>use the SQL interface to delete the table;</li>
  <li>use the storage layer's tools to delete the rest of the table.</li>
</ol>

<p>Surely it cannot get worse, right?</p>

<p>Well, it gets worse: if you swap the order of the steps above, the table enters an unrecoverable state.</p>

<p>More precisely, if you delete the table's files before running the
<code class="language-plaintext highlighter-rouge">DROP TABLE</code> command, you can no longer remove the table from the catalog,
and the table-name remains allocated forever. It's another known issue
(<a href="https://github.com/apache/iceberg/issues/6785">issue 1</a>, <a href="https://github.com/apache/iceberg/issues/7258">issue 2</a>, <a href="https://github.com/apache/iceberg/issues/12016">issue 3</a>, <a href="https://github.com/apache/iceberg/issues/12062">issue 4</a>) with proposed fixes (<a href="https://github.com/apache/iceberg/pull/6786">proposed fix 1</a>, <a href="https://github.com/apache/iceberg/pull/7228">proposed fix 2</a>) but no merged solution.</p>
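
<p>Since the order matters this much, it may be worth putting a guard in front of the storage cleanup. The sketch below is one possible shape for it, under loud assumptions: <code class="language-plaintext highlighter-rouge">cleanupAfterDrop</code> and <code class="language-plaintext highlighter-rouge">deleteRecursively</code> are hypothetical helpers, the table location is a local path, and the caller supplies <code class="language-plaintext highlighter-rouge">tableStillInCatalog</code> (for instance from Spark's <code class="language-plaintext highlighter-rouge">spark.catalog.tableExists("db.table")</code>, assuming it resolves the Iceberg table in your setup).</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.io.File
import java.nio.file.Files

// Hypothetical helper: deletes `path` and everything below it, without following symbolic links.
def deleteRecursively(path: File): Unit = {
  if (!Files.isSymbolicLink(path.toPath)) {
    // listFiles() returns null for regular files, so this only descends into directories.
    Option(path.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively(_))
  }
  path.delete()
}

// Hypothetical guard: refuses to touch the storage while the catalog still lists the table.
def cleanupAfterDrop(tableStillInCatalog: Boolean, location: File): Unit = {
  require(!tableStillInCatalog, "run DROP TABLE first, then delete the files")
  deleteRecursively(location)
}
</code></pre></div></div>
<p>If the flag is <code class="language-plaintext highlighter-rouge">true</code>, the call fails with an <code class="language-plaintext highlighter-rouge">IllegalArgumentException</code> and the files stay in place, so the table can still be dropped through the SQL interface.</p>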

<h2 id="and-then-theres-hadoop">And then there's Hadoop</h2>

<p>When using a Hadoop catalog, a simple <code class="language-plaintext highlighter-rouge">DROP TABLE</code> (without <code class="language-plaintext highlighter-rouge">PURGE</code>) is sufficient to delete everything belonging to the table: files and directories.</p>

<p>It's strange that changing only the catalog type (which should be just a detail of the
metadata management) completely changes the behavior of <code class="language-plaintext highlighter-rouge">DROP TABLE</code>, but that is what happens.</p>

<button class="accordion"><p><strong>Code example (click to expand)</strong></p>
</button>
<div class="panel">
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.spark::spark-core:4.0.2`</span>
<span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.spark::spark-sql:4.0.2`</span>
<span class="k">import</span> <span class="nn">$ivy.</span><span class="n">`org.apache.iceberg::iceberg-spark-runtime-4.0:1.10.1`</span>

<span class="c1">// Cleanup command:</span>
<span class="c1">//    rm -rf spark-warehouse/ warehouse/</span>

<span class="k">import</span> <span class="nn">org.apache.spark.sql.SparkSession</span>

<span class="nd">@main</span>
<span class="k">def</span> <span class="nf">main</span><span class="o">()</span> <span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
  <span class="k">val</span> <span class="nv">spark</span> <span class="k">=</span> <span class="nc">SparkSession</span>
              <span class="o">.</span><span class="py">builder</span><span class="o">()</span>
              <span class="o">.</span><span class="py">appName</span><span class="o">(</span><span class="s">"Iceberg table deletion"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">master</span><span class="o">(</span><span class="s">"local[*]"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.extensions"</span><span class="o">,</span> <span class="s">"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog"</span><span class="o">,</span> <span class="s">"org.apache.iceberg.spark.SparkSessionCatalog"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog.type"</span><span class="o">,</span> <span class="s">"hadoop"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">config</span><span class="o">(</span><span class="s">"spark.sql.catalog.spark_catalog.warehouse"</span><span class="o">,</span> <span class="s">"warehouse"</span><span class="o">)</span>
              <span class="o">.</span><span class="py">getOrCreate</span><span class="o">()</span>

  <span class="nv">spark</span><span class="o">.</span><span class="py">sql</span><span class="o">(</span><span class="s">"""
    CREATE TABLE db.table
    USING iceberg
    PARTITIONED BY (bucket(16, id))
    AS (
      VALUES (22, 'aa'), (33, 'bb'), (44, 'cc')
      AS sample_val (id, data)
    )
    """</span><span class="o">)</span>

  <span class="nv">spark</span><span class="o">.</span><span class="py">sql</span><span class="o">(</span><span class="s">"DROP TABLE db.table"</span><span class="o">)</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
|-- spark-warehouse
|-- test.sc
`-- warehouse
    `-- db
</code></pre></div></div>
</div>

<h2 id="conclusion">Conclusion</h2>

<p>Just because a table "has disappeared" from the SQL interface, don't assume that
the files and directories themselves have been deleted. Do a manual check, since
there might be leftovers.</p>]]></content><author><name></name></author><category term="iceberg" /><summary type="html"><![CDATA[In most database and data-lake systems, deleting a table is easy: DROP TABLE tbl_name and it's done… but under Iceberg it's more complicated, since it might leave behind stray files and directories.]]></summary></entry></feed>