Machine Learning on Big Data Workshop
By Yury Zhuk on February 25, 2026 · 1 min read
Materials for the Machine Learning on Big Data workshop for Lumos Student DS Consulting
A hands-on workshop covering machine learning fundamentals applied to large-scale datasets, highlighting practical strategies like reducing data size, chunking, lazy loading, and using efficient file formats such as Parquet.
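The chunking strategy mentioned above can be sketched in a few lines. This is a dependency-free version using only the standard library's csv module; with Pandas the same pattern is the `chunksize=` argument to `pd.read_csv`, and the in-memory data is hypothetical:

```python
import csv
import io
from itertools import islice

def iter_chunks(lines, chunk_size):
    """Yield lists of parsed CSV rows, never holding the full file in memory."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

# Hypothetical data standing in for a large CSV file on disk.
data = io.StringIO("price,qty\n10,2\n20,1\n30,3\n40,1\n")

# Aggregate revenue chunk by chunk instead of loading everything at once.
total = 0.0
for chunk in iter_chunks(data, chunk_size=2):
    total += sum(float(r["price"]) * float(r["qty"]) for r in chunk)

print(total)  # 10*2 + 20*1 + 30*3 + 40*1 = 170.0
```

With Pandas the equivalent is `for chunk in pd.read_csv(path, chunksize=100_000): ...`, and converting the source file once to Parquet (`DataFrame.to_parquet`) typically makes subsequent reads both smaller and faster.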
I compare tools like Pandas, DuckDB, Polars, and Spark, and provide guidance on choosing between local machines, cloud VMs, or distributed systems depending on data size, frequency, and team needs.
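That sizing guidance can be condensed into a rough decision rule. The thresholds below are illustrative assumptions for a sketch, not hard limits from the workshop:

```python
def choose_compute(data_gb: float, runs_per_day: float, team_shares_results: bool) -> str:
    """Rough heuristic for picking a compute setup.

    The cutoffs (16 GB, 500 GB) are assumptions: adjust them to your
    machine's RAM, your budget, and your latency requirements.
    """
    if data_gb <= 16 and runs_per_day <= 1 and not team_shares_results:
        return "local machine"           # fits in RAM, ad-hoc, single user
    if data_gb <= 500:
        return "cloud VM"                # rent memory and cores on demand
    return "distributed system (Spark)"  # data exceeds any single machine

print(choose_compute(2, 1, False))     # -> local machine
print(choose_compute(100, 4, True))    # -> cloud VM
print(choose_compute(2000, 24, True))  # -> distributed system (Spark)
```

The point of the heuristic is the ordering, not the exact numbers: exhaust the cheapest option that fits before adding operational complexity.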
Finally, I cover real-world pipeline considerations such as batch vs. streaming workflows, orchestration tools, and managed platforms, showing how ML projects typically scale from local prototypes to production data pipelines.
Workshop
Open in Google Colab or Download Notebook (.ipynb)
Download Slides PDF
Discussion on LinkedIn: https://www.linkedin.com/feed/update/urn:li:activity:7432726202823610369/