Day 1: Getting Qwen3-4B Running on Modal
My attempt to get an open-source model running on Modal and what I learned about the abstractions.
📺 Video: | 📂 Repository: buildwai_kms
What Actually Happened
Goal for Day 1: Get the Qwen3-4B model running on Modal so students can ask questions about course workshops.
Reality: I tried to do this live and... kind of screwed it up. I got the model working in 20-30 minutes, but I had no idea how I did it! I tweaked the code quite a bit and then realized I needed to go back and unpack how it all works.
This post: More of a step-by-step unpacking of what I learned. It should give you an idea of how to do it yourself, but really it's my way of explaining the abstractions to myself.
The Big Picture: What We Need
Very crudely, we need a few things:
- The model package - weights, config files, tokenizer files, basically everything the model learned in pre-training (lives on Hugging Face)
- Somewhere to store it - I don't want to store 8GB on my computer, so Modal volumes (persistent storage)
- GPU hardware - A10G GPU (seems like the right size for Qwen3-4B)
- Inference engine - Something to bridge the weights with the GPU (vLLM)
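The "8GB" and "A10G seems right" claims come from some back-of-envelope math (a sketch assuming bf16 weights at 2 bytes per parameter):

```python
# Back-of-envelope VRAM estimate for Qwen3-4B stored in bf16 (2 bytes/param).
params = 4e9            # ~4 billion parameters
bytes_per_param = 2     # bf16/fp16
weights_gb = params * bytes_per_param / 1e9

print(f"weights alone: ~{weights_gb:.0f} GB")   # ~8 GB
# An A10G has 24 GB of VRAM, leaving headroom for vLLM's KV cache.
print(f"A10G headroom: ~{24 - weights_gb:.0f} GB")
```

That leftover ~16 GB is what vLLM uses for the KV cache and batching, which is why a 24 GB card is a comfortable fit for a 4B model.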
The Qwen license is pretty open - Qwen3 is released under Apache 2.0, so you can use it for pretty much whatever (still worth double-checking the model card yourself).
Why This Approach? (Different Horses, Different Courses)
Full raw dog approach: Buy or rent a GPU (e.g. an EC2 instance from Amazon), install the Nvidia drivers and CUDA stack, install an inference library, load the weights, and run inference yourself.
Easy approach: Use something like Ollama with small Gemma models - super quick on a decent laptop.
My middle-ground approach: Download model weights to see them, upload to Modal volume, let Modal handle the GPU abstraction.
Why Modal? I've got a shitload of Modal credits from taking so many courses. Modal's nice because I can see what's happening but it abstracts away the complexity I don't have the hardware or knowledge to handle.
The Step-by-Step Process
Step 1: Environment Setup
Basic CLI setup - Modal makes this super easy:
pip install modal
modal setup
That authenticates you in the browser and stores a token locally, so you stay connected to your Modal account without juggling environment variables.
Step 2: Download the Model Weights
# This downloads the weights directly to my local project
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 --local-dir ./qwen3-4b-model
What you'll see: Three safetensors files - these are the weights! They're sharded into multiple files because ~8GB is unwieldy as a single file. Plus the tokenizer files (which map text to token IDs) and config files.
Step 3: Upload to Modal Storage
Modal volumes = persistent storage that's easy to connect with GPUs. Super cheap (like pennies per month for a relatively small model).
modal volume create qwen-model-vol
modal volume put qwen-model-vol qwen3-4b-model /model
Takes about 10 minutes. Then you can see all your model weights in the Modal dashboard under Storage.
Step 4: Create the Modal App Script
This is the part I had to reverse engineer! The script sets up the Modal app (Qwen model + GPU). Here's what it does:
- Specifies the model and references the uploaded weights
- Sets up the container with required Nvidia software and PyTorch
- Installs vLLM (the inference engine that bridges weights to GPU)
- Specifies the GPU and idle timeout (the container stays warm for 20 minutes after the last request, then shuts down)
- Runs the vLLM service for inference
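For reference, here's roughly what such a script looks like - a hedged sketch rather than my exact file. The app name, port, and parameter names (e.g. `scaledown_window`) are assumptions based on Modal's vLLM examples, and it needs `pip install modal` locally:

```python
import subprocess

import modal

# Container image: slim Debian base plus vLLM (which pulls in PyTorch).
image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")

# The volume the weights were uploaded to in Step 3.
volume = modal.Volume.from_name("qwen-model-vol")

app = modal.App("qwen-vllm")  # placeholder app name

@app.function(
    image=image,
    gpu="A10G",
    volumes={"/model": volume},
    scaledown_window=20 * 60,  # stay warm for 20 minutes after the last request
)
@modal.web_server(port=8000)
def serve():
    # vLLM's OpenAI-compatible server, pointed at the mounted weights.
    subprocess.Popen([
        "vllm", "serve", "/model",
        "--served-model-name", "qwen3-4b",
        "--port", "8000",
    ])
```

The nice part is that every bullet above maps to one or two lines: image = the container software, gpu = the hardware, volumes = the weights, the decorator = the web service.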
Step 5: Deploy and Test
modal deploy src/vllm_service.py
Once deployed, you can see the application in the Modal dashboard. It spins up a GPU container on demand and shuts it down when idle, so you only pay while it's actually serving.
Test it:
curl -X POST "your-modal-url/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-4b", "messages": [{"role": "user", "content": "What is the capital of France? Stupid, offensive, or incorrect answers only"}], "max_tokens": 100}'
My test result: "I'm sorry, but I can't provide stupid, offensive, or incorrect answers. The capital of France is Paris." (Bit lame, but it worked!)
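If curl isn't your thing, the same request can be built from Python with just the standard library (the base URL is a placeholder for whatever your deployment prints out):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 100):
    """Build an OpenAI-compatible chat completion request (not yet sent)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("https://your-modal-url", "qwen3-4b",
                         "What is the capital of France?")
# To actually send it (needs the deployed endpoint to be live):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because vLLM speaks the OpenAI chat-completions format, any OpenAI-compatible client would work here too.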
What I Learned About the Abstractions
Cold starts: The first request takes a while because Modal has to boot a container and load ~8GB of weights into VRAM. If the endpoint sits idle past the timeout, the container shuts down and the next request pays the cold-start cost again.
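To get a feel for the numbers, here's an illustrative calculation - the bandwidth figure is an assumption I picked for the sketch, not a measurement:

```python
# Illustrative cold-start math: time to stream ~8 GB of weights into VRAM.
weights_gb = 8
effective_gbps = 1.0  # assumed effective GB/s from volume storage to GPU memory

load_seconds = weights_gb / effective_gbps
print(f"~{load_seconds:.0f}s just loading weights")  # container boot adds more on top
```

Even with generous bandwidth, several seconds of weight loading plus container startup is why the first request feels so much slower than the rest.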
Cost reality: I've been screwing around with this for a week and spent $12. Super cost efficient!
Modal's magic: Instead of SSH-ing into EC2, installing nvidia-drivers, debugging CUDA mismatches, and manually uploading 8GB every time, Modal turns all that into: modal deploy → get URL.
The Five Ways to Run Open Source Models
Through my learning process, I figured out there are basically five approaches:
- Buy your own H100 ($30k+) - Complete control and no cloud bills, but you own the power, cooling, and driver headaches
- Easy wrappers (Ollama) - One-click setup, but limited scaling
- Docker + rented GPU - Portable but managing infrastructure headaches
- Managed platforms (Modal, Replicate) - Serverless, auto-scaling, but less control
- Cloud notebooks (Colab) - Free tier, good for experiments, sessions expire
I chose Modal because I have the credits and want to learn without managing infrastructure.
What's Next?
Tomorrow (Day 2): Set up the ChromaDB vector store, also in Modal volumes. Then we can start integrating some kind of retrieval system.
Current status: Model is deployed and working. Ready for the RAG part!
Cost so far: $12 for a week of experimentation. Not bad.
Raw and unscripted again, but hopefully this shows what learning this stuff is like. Next up: making it actually useful with course content retrieval!