Machine Learning on Big Data Workshop
By Yury Zhuk on February 25, 2026 · 1 min read
Materials for the Machine Learning on Big Data workshop for Lumos Student DS Consulting
A hands-on workshop covering machine learning fundamentals applied to large-scale datasets, highlighting practical strategies like reducing data size, chunking, lazy loading, and using efficient file formats such as Parquet.
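The chunking strategy mentioned above can be sketched in a few lines. This is a dependency-free version using only the standard library's csv module; with Pandas the same pattern is the `chunksize=` argument to `pd.read_csv`, and the in-memory data is hypothetical:

```python
import csv
import io
from itertools import islice

def iter_chunks(lines, chunk_size):
    """Yield lists of parsed CSV rows, never holding the full file in memory."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

# Hypothetical data standing in for a large CSV file on disk.
data = io.StringIO("price,qty\n10,2\n20,1\n30,3\n40,1\n")

# Aggregate revenue chunk by chunk instead of loading everything at once.
total = 0.0
for chunk in iter_chunks(data, chunk_size=2):
    total += sum(float(r["price"]) * float(r["qty"]) for r in chunk)

print(total)  # 10*2 + 20*1 + 30*3 + 40*1 = 170.0
```

With Pandas the equivalent is `for chunk in pd.read_csv(path, chunksize=100_000): ...`, and converting the source file once to Parquet (`DataFrame.to_parquet`) typically makes subsequent reads both smaller and faster.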
I compare tools like Pandas, DuckDB, Polars, and Spark, and provide guidance on choosing between local machines, cloud VMs, or distributed systems depending on data size, frequency, and team needs.
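That sizing guidance can be condensed into a rough decision rule. The thresholds below are illustrative assumptions for a sketch, not hard limits from the workshop:

```python
def choose_compute(data_gb: float, runs_per_day: float, team_shares_results: bool) -> str:
    """Rough heuristic for picking a compute setup.

    The cutoffs (16 GB, 500 GB) are assumptions: adjust them to your
    machine's RAM, your budget, and your latency requirements.
    """
    if data_gb <= 16 and runs_per_day <= 1 and not team_shares_results:
        return "local machine"           # fits in RAM, ad-hoc, single user
    if data_gb <= 500:
        return "cloud VM"                # rent memory and cores on demand
    return "distributed system (Spark)"  # data exceeds any single machine

print(choose_compute(2, 1, False))     # -> local machine
print(choose_compute(100, 4, True))    # -> cloud VM
print(choose_compute(2000, 24, True))  # -> distributed system (Spark)
```

The point of the heuristic is the ordering, not the exact numbers: exhaust the cheapest option that fits before adding operational complexity.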
Finally, I cover real-world pipeline considerations such as batch vs. streaming workflows, orchestration tools, and managed platforms, showing how ML projects typically scale from local prototypes to production data pipelines.
Workshop
Open in Google Colab or Download Notebook (.ipynb)
Download Slides PDF
Discussion on LinkedIn: https://www.linkedin.com/feed/update/urn:li:activity:7432726202823610369/