Day 1: Getting Qwen3-4B Running on Modal
My attempt to get an open-source model running on Modal and what I learned about the abstractions.
📺 Video: | 📂 Repository: buildwai_kms
What Actually Happened
Goal for Day 1: Get the Qwen3-4B model running on Modal so students can ask questions about course workshops.
Reality: I tried to do this live and... kind of screwed it up. I got the model working in 20-30 minutes, but I had no idea how I did it! I tweaked the code quite a bit and then realized I needed to go back and unpack how it all works.
This post: More of a step-by-step unpacking of what I learned. It should give you an idea of how to do it yourself, but really it's my way of explaining the abstractions to myself.
The Big Picture: What We Need
Very crudely, we need a few things:
- The model package - weights, config files, tokenizer files, basically everything the model learned in pre-training (lives on Hugging Face)
- Somewhere to store it - I don't want to store 8GB on my computer, so Modal volumes (persistent storage)
- GPU hardware - A10G GPU (seems like the right size for Qwen3-4B)
- Inference engine - Something to bridge the weights with the GPU (vLLM)
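The "8GB" and "A10G seems right" claims come from some back-of-envelope math (a sketch assuming bf16 weights at 2 bytes per parameter):

```python
# Back-of-envelope VRAM estimate for Qwen3-4B stored in bf16 (2 bytes/param).
params = 4e9            # ~4 billion parameters
bytes_per_param = 2     # bf16/fp16
weights_gb = params * bytes_per_param / 1e9

print(f"weights alone: ~{weights_gb:.0f} GB")   # ~8 GB
# An A10G has 24 GB of VRAM, leaving headroom for vLLM's KV cache.
print(f"A10G headroom: ~{24 - weights_gb:.0f} GB")
```

That leftover ~16 GB is what vLLM uses for the KV cache and batching, which is why a 24 GB card is a comfortable fit for a 4B model.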
The Qwen license is pretty open - Qwen3 is released under Apache 2.0, so you can use it for pretty much whatever (still worth double-checking the model card yourself).
Why This Approach? (Different Horses, Different Courses)
Full raw dog approach: Buy or rent a GPU (e.g. an EC2 instance from Amazon), install the Nvidia drivers and CUDA stack, install an inference library, load the weights, and run inference yourself.
Easy approach: Use something like Ollama with small Gemma models - super quick on a decent laptop.
My middle-ground approach: Download model weights to see them, upload to Modal volume, let Modal handle the GPU abstraction.
Why Modal? I've got a shitload of Modal credits from taking so many courses. Modal's nice because I can see what's happening but it abstracts away the complexity I don't have the hardware or knowledge to handle.
The Step-by-Step Process
Step 1: Environment Setup
Basic CLI setup - Modal makes this super easy:
pip install modal
modal setup
That authenticates you in the browser and stores a token locally, so you stay connected to your Modal account without juggling environment variables.
Step 2: Download the Model Weights
# This downloads the weights directly to my local project
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 --local-dir ./qwen3-4b-model
What you'll see: Three safetensors files - these are the weights! They're sharded into multiple files because ~8GB is unwieldy as a single file. Plus the tokenizer files (which map text to token IDs) and config files.
Step 3: Upload to Modal Storage
Modal volumes = persistent storage that's easy to connect with GPUs. Super cheap (like pennies per month for a relatively small model).
modal volume create qwen-model-vol
modal volume put qwen-model-vol qwen3-4b-model /model
Takes about 10 minutes. Then you can see all your model weights in the Modal dashboard under Storage.
Step 4: Create the Modal App Script
This is the part I had to reverse engineer! The script sets up the Modal app (Qwen model + GPU). Here's what it does:
- Specifies the model and references the uploaded weights
- Sets up the container with required Nvidia software and PyTorch
- Installs vLLM (the inference engine that bridges weights to GPU)
- Specifies the GPU and idle timeout (the container stays warm for 20 minutes after the last request, then shuts down)
- Runs the vLLM service for inference
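For reference, here's roughly what such a script looks like - a hedged sketch rather than my exact file. The app name, port, and parameter names (e.g. `scaledown_window`) are assumptions based on Modal's vLLM examples, and it needs `pip install modal` locally:

```python
import subprocess

import modal

# Container image: slim Debian base plus vLLM (which pulls in PyTorch).
image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")

# The volume the weights were uploaded to in Step 3.
volume = modal.Volume.from_name("qwen-model-vol")

app = modal.App("qwen-vllm")  # placeholder app name

@app.function(
    image=image,
    gpu="A10G",
    volumes={"/model": volume},
    scaledown_window=20 * 60,  # stay warm for 20 minutes after the last request
)
@modal.web_server(port=8000)
def serve():
    # vLLM's OpenAI-compatible server, pointed at the mounted weights.
    subprocess.Popen([
        "vllm", "serve", "/model",
        "--served-model-name", "qwen3-4b",
        "--port", "8000",
    ])
```

The nice part is that every bullet above maps to one or two lines: image = the container software, gpu = the hardware, volumes = the weights, the decorator = the web service.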
Step 5: Deploy and Test
modal deploy src/vllm_service.py
Once deployed, you can see the application in the Modal dashboard. It spins up a GPU container on demand and shuts it down when idle, so you only pay while it's actually serving.
Test it:
curl -X POST "your-modal-url/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-4b", "messages": [{"role": "user", "content": "What is the capital of France? Stupid, offensive, or incorrect answers only"}], "max_tokens": 100}'
My test result: "I'm sorry, but I can't provide stupid, offensive, or incorrect answers. The capital of France is Paris." (Bit lame, but it worked!)
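If curl isn't your thing, the same request can be built from Python with just the standard library (the base URL is a placeholder for whatever your deployment prints out):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 100):
    """Build an OpenAI-compatible chat completion request (not yet sent)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("https://your-modal-url", "qwen3-4b",
                         "What is the capital of France?")
# To actually send it (needs the deployed endpoint to be live):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because vLLM speaks the OpenAI chat-completions format, any OpenAI-compatible client would work here too.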
What I Learned About the Abstractions
Cold starts: The first request takes a while because Modal has to boot a container and load ~8GB of weights into VRAM. If the endpoint sits idle past the timeout, the container shuts down and the next request pays the cold-start cost again.
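To get a feel for the numbers, here's an illustrative calculation - the bandwidth figure is an assumption I picked for the sketch, not a measurement:

```python
# Illustrative cold-start math: time to stream ~8 GB of weights into VRAM.
weights_gb = 8
effective_gbps = 1.0  # assumed effective GB/s from volume storage to GPU memory

load_seconds = weights_gb / effective_gbps
print(f"~{load_seconds:.0f}s just loading weights")  # container boot adds more on top
```

Even with generous bandwidth, several seconds of weight loading plus container startup is why the first request feels so much slower than the rest.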
Cost reality: I've been screwing around with this for a week and spent $12. Super cost efficient!
Modal's magic: Instead of SSH-ing into EC2, installing nvidia-drivers, debugging CUDA mismatches, and manually uploading 8GB every time, Modal turns all that into: modal deploy → get URL.
The Five Ways to Run Open Source Models
Through my learning process, I figured out there are basically five approaches:
- Buy your own H100 ($30k+) - Complete control and no cloud bills, but you own the power, cooling, and driver headaches
- Easy wrappers (Ollama) - One-click setup, but limited scaling
- Docker + rented GPU - Portable but managing infrastructure headaches
- Managed platforms (Modal, Replicate) - Serverless, auto-scaling, but less control
- Cloud notebooks (Colab) - Free tier, good for experiments, sessions expire
I chose Modal because I have the credits and want to learn without managing infrastructure.
What's Next?
Tomorrow (Day 2): Set up the ChromaDB vector store, also in Modal volumes. Then we can start integrating some kind of retrieval system.
Current status: Model is deployed and working. Ready for the RAG part!
Cost so far: $12 for a week of experimentation. Not bad.
Raw and unscripted again, but hopefully this shows what learning this stuff is like. Next up: making it actually useful with course content retrieval!