Data is the New Lego

When I was a kid, I loved playing with Lego. My brother and I built all kinds of things with it: animals, cars, houses, and even spaceships. As time went on, our creations became more ambitious and realistic. There were also times when each of us insisted that our Lego was our own, until we realized that pooling our bricks would help us go further. We were growing up too, and as our play became more sophisticated, we learned how to make better models.

As an aspiring data scientist, I have realized that working with data is surprisingly similar to playing with Lego as a child. In this blog post, I want to share some memories that show how playing with Lego and working with data are closer than you might think.

Exploration is the most fun part of the process

When I was a kid, I liked to put all my Lego bricks together in a giant tub, because a lot of the fun in building something was searching through a sea of bricks and trying out patterns I hadn't thought of before. Anyone who deals with data has probably heard that as much as 80% of the process is cleaning up the data and doing exploratory analysis. Personally, that's what I love about working with data. It's where I let my creativity and imagination run wild. Jumping straight into a dataset and exploring various visualizations and correlations in search of patterns brings me back to a childhood spent digging through a pile of Lego.
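To make that digging a little more concrete, here is a minimal sketch in plain Python of one way to hunt for correlations. Everything in it is an assumption made for illustration: the `bricks` table is invented toy data, and `pearson` and `rank_correlations` are hypothetical helpers written for this post, not functions from any library.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy dataset, invented for illustration: one list per column.
bricks = {
    "pieces": [120, 250, 480, 900, 1500],
    "price": [10, 20, 40, 75, 130],
    "min_age": [4, 6, 6, 8, 12],
}


def rank_correlations(table):
    """Return (column_a, column_b, r) for every column pair, strongest first."""
    pairs = []
    for a, b in combinations(table, 2):
        pairs.append((a, b, pearson(table[a], table[b])))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


for a, b, r in rank_correlations(bricks):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Sorting by the absolute value of the correlation is a bit like tipping the tub over and reaching for the most interesting bricks first. On real data, a library such as pandas would do this scan more conveniently.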

To build something useful you need lots of resources

If you don't have enough Lego bricks, chances are the things you build won't look realistic. The model is crude, the colors don't match, and there are gaps. The same goes for machine learning models. If you don't have enough data, your models will likely be poor and make many errors.

However, sometimes I didn't have the right pieces to build a model exactly the way I wanted, so I had to search for alternatives or reconsider how to build it. In doing so, I learned new ways of using what I had. Similarly, as long as you are creative about where you look, there are often insights to be gained from even the most limited data.
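Still, the effect of sheer volume is easy to see in a toy experiment. The sketch below is only an illustration under loud assumptions: the "model" is nothing more than a sample mean, the data is synthetic Gaussian noise around a made-up true value, and `average_estimation_error` and its parameters are hypothetical names chosen for this post.

```python
import random
from statistics import mean


def average_estimation_error(sample_size, trials=200, true_mean=10.0,
                             noise=1.0, seed=0):
    """Average absolute error of the sample mean over repeated synthetic samples.

    All defaults are arbitrary values chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    errors = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, noise) for _ in range(sample_size)]
        errors.append(abs(mean(sample) - true_mean))
    return mean(errors)


# Compare a small, a medium, and a large pile of "bricks".
for n in (5, 50, 500):
    print(f"n = {n:>3}: average error = {average_estimation_error(n):.3f}")
```

The average error should shrink as the sample size grows, which is the data version of finally having enough bricks to fill the gaps. Real machine learning models are far more complicated than a mean, so treat this as intuition rather than evidence.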

A good quality model needs a diversity of resources

To build a good-quality Lego model, you also need a diversity of bricks. Models built with only the basic 2x4 bricks are rough and inaccurate. This is where getting Lego from friends and family was so useful. The more they gave us, the more varied our bricks became, and the more accurate the models we could create.

There is also a harsh childhood truth here: the children with the most Lego, the best pieces, and the most time to play create the best models. The same harsh truth applies to machine learning projects. Projects with the biggest data volumes, the most diverse data, and the best teams to use that data tend to produce the most accurate models.

Both require iterative thinking

The beauty of Lego is that you're not limited to what's on the box. Rebuilding something and refining it each time requires iterative thinking. When it comes to working with data, there are also plenty of opportunities to iterate.

When I get a “decent enough” solution, whether it's a dashboard or a Python script, I still find time to break it, repair it, and keep improving it. It may seem to get the job done at first, but I can usually redesign it into something more effective and scalable.

You get better as you build more

Young children make rough Lego models: the colors don't match and the shapes are wrong. Older children, on the other hand, build models with careful planning of color and shape.

The same happens with data and algorithms. As you get to know your data and algorithms, you come to understand their limitations and strive to build something better. And as the amount of data grows, you may need to fix and adjust your models to keep improving them. In other words, the same learning curve applies to Lego building and machine learning modeling.

Design is important

The name Lego is derived from the Danish phrase ‘leg godt’, which means ‘play well’. Before I started building something with Lego, I would first decide whether it was something I wanted to display or something I wanted to play with. For display-only models, I could get away with a simpler architecture, but if it was something I wanted to play with, I knew I had to make it extra robust. After all, it would be very disappointing if the wings of my spaceship fell off while I was swooshing it around the room.

When it comes to making a dashboard, a Python script, or even a report, I often start by asking myself whether this is something people will actually use (i.e., play with) or something they will look at once and never again. From there, I plan and build accordingly.

Lego has taught me a lot about data and building models. Just like Lego:

“To build something useful you need lots of resources, diversity, and the knowledge to build the right models in the right way.”

Richard Cornelius Suwandi

PhD Student at CUHK-Shenzhen
