Why Do We All Talk About Data Like It’s Water?

I was in a meeting recently and had the horror of witnessing someone refer to a data ingestion process as “the data being slurped in”. Once I’d recovered from this moment, it got me thinking: why are we all so obsessed with talking about data like it’s water?

The water metaphors are everywhere: streaming, data lakes, data lakehouses, data pipelines, in bad cases data swamps, and sometimes even data ponds and data puddles. My personal favorite is Amazon Data Firehose. At some point it became cool to freeze the water, and we ended up with Apache Iceberg, Amazon S3 Glacier, and a whole database company called Snowflake.

It’s not exactly clear how this started. It seems likely that it was accelerated by the famous 2006 quote “data is the new oil”(1), which encouraged people to think of data as a liquid. It then probably took off in 2010 with the coining of the term “data lake”(2) by James Dixon, the founder of Pentaho.

There are clearly some positives to these water analogies. I think they can make some processes really understandable, such as how data moves between systems in a similar way to how water flows. They also illustrate that, like water, the contents of your data lake can be of good or poor quality, and a bad one can even end up labeled a data swamp. It takes a lot of work to get drinkable water to homes; similarly, it takes a lot of work to make data usable.

However, there are definitely some limitations to these analogies. One of the most powerful and valuable things about data is that it can be copied and used infinitely without being used up. If millions of people took water from a real lake, we would very quickly run out of water. Structurally, data is nothing like water either: when you store data in the same place, it doesn’t mix together but stays as discrete objects. I would assume, however, that the water analogies aren’t actually confusing anyone; they are just a funny quirk of our industry.

Ultimately, I think it makes sense that these water analogies have become so prevalent, but sometimes the metaphors probably need watering down (sorry, not sorry).

1 https://www.forbes.com/sites/nishatalagala/2022/03/02/data-as-the-new-oil-is-not-enough-four-principles-for-avoiding-data-fires/#:~:text=Generally%20credited%20to%20mathematician%20Clive,entity%20that%20drives%20profitable%20activity.

2 https://www.dataversity.net/brief-history-data-lakes/
