One Person, Five Stacks: What It Actually Takes to Ship an AI Product Alone
What Kanha does, in 30 seconds
You point it at your website. It crawls your pages, generates QA training pairs, fine-tunes a small language model on your content, converts it to a WebGPU-compatible format, and serves it from Hugging Face Hub. Your customers open a chat widget, the model loads into their browser, and inference happens on their device. No API calls per query. Flat monthly pricing.
The interesting part isn't any one of those steps. It's that one person has to own all of them.
The backend: Rust, and why it's not overkill for a solo project
The backend is Axum + Tokio + SQLx. About 4,000 lines of Rust handling auth, CRUD, crawling, training orchestration, billing, and four background worker queues.
The common take is that Rust is slow to develop in, especially alone. My experience is the opposite: once it compiles, it works. I don't have a QA team. I don't have staging. The compiler is my first line of defense against shipping broken code at 2 AM.
Concrete things Rust gives me:
A web crawler that doesn't fall over. The crawler uses chromiumoxide for headless Chromium when pages need JavaScript rendering. It auto-detects JS-heavy pages (content < 500 chars after static fetch, or content-to-HTML ratio below 2%) and falls back to the browser. Concurrent tabs are semaphore-limited. The whole thing runs as a background Tokio task consuming from a Redis queue.
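The detection heuristic above is simple enough to sketch as a pure function. This is a minimal illustration under the thresholds stated in the text; the function name and signature are mine, not the production crawler's.

```rust
/// Decide whether a page needs headless-browser rendering.
/// `static_text` is the visible text extracted from the plain HTTP fetch;
/// `raw_html` is the full response body.
fn needs_js_rendering(static_text: &str, raw_html: &str) -> bool {
    let text_len = static_text.chars().count();
    let html_len = raw_html.chars().count().max(1); // avoid divide-by-zero
    let ratio = text_len as f64 / html_len as f64;
    // Too little content after a static fetch, or a content-to-HTML ratio
    // below 2%, suggests the page builds its DOM with JavaScript.
    text_len < 500 || ratio < 0.02
}
```

Pages that fail this check get re-fetched through the chromiumoxide browser pool; everything else stays on the cheap static path.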
If I'd built this in Python, I'd be debugging asyncio + Playwright memory leaks. In Rust, the compiler rejects code that accidentally holds a non-Send browser handle across an await point in a spawned task. It just doesn't compile.
Background workers that recover from crashes. I have four Redis reliable queues (crawl, QA generation, training, email) using BRPOPLPUSH. Each has a DB sweep that runs every 60 seconds, checking for orphaned jobs that fell out of the queue. On startup, anything stuck in the processing list gets moved back to the main queue.
This is not sophisticated distributed systems. It's the simplest thing that works: Redis lists with a recovery loop. But Rust's type system means the job structs are always valid when they deserialize, and the async runtime means I can run all four consumers in a single binary.
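The reliable-queue pattern is easier to see in miniature. Here is an in-memory model of it, with `VecDeque` standing in for Redis lists; the struct and method names are mine, and the real code uses BRPOPLPUSH against Redis rather than local collections.

```rust
use std::collections::VecDeque;

/// Simplified model of the Redis reliable-queue pattern: popping a job
/// parks it on a processing list, and a crash leaves it there until a
/// recovery sweep pushes it back.
struct ReliableQueue {
    main: VecDeque<String>,       // pending jobs (the Redis list)
    processing: VecDeque<String>, // jobs popped but not yet acked
}

impl ReliableQueue {
    fn new() -> Self {
        Self { main: VecDeque::new(), processing: VecDeque::new() }
    }

    /// Pop a job for work, parking it on the processing list
    /// (the BRPOPLPUSH step).
    fn take(&mut self) -> Option<String> {
        let job = self.main.pop_back()?;
        self.processing.push_front(job.clone());
        Some(job)
    }

    /// Ack: a finished job leaves the processing list.
    fn ack(&mut self, job: &str) {
        self.processing.retain(|j| j != job);
    }

    /// Startup sweep: anything still in processing was orphaned by a
    /// crash, so move it back onto the main queue for redelivery.
    fn recover(&mut self) {
        while let Some(job) = self.processing.pop_back() {
            self.main.push_front(job);
        }
    }
}
```

The production version adds the 60-second DB sweep on top, but the crash-recovery invariant is exactly this: a job is never only in flight, it is always in one of the two lists.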
A billing system that handles two payment providers. Stripe rejected my India account. Rather than wait, I implemented Razorpay and built a dispatch layer: a BILLING_PROVIDER env var routes checkout, portal, and cancellation to the right provider. Both webhook endpoints are always registered. The internal subscription and quota system is shared.
In a dynamic language, the two-provider abstraction would be a source of subtle runtime bugs: a field that exists on Stripe responses but not on Razorpay's, or a webhook payload shape difference that only shows up when a customer cancels. In Rust, the compiler forces you to handle both variants explicitly. I have zero billing bugs in production.
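The dispatch layer reduces to matching on an enum. A sketch, with hypothetical names and paths; the real code routes checkout, portal, and cancellation calls, but the shape is the same.

```rust
/// The two providers as an exhaustive enum.
#[derive(Debug, PartialEq)]
enum BillingProvider {
    Stripe,
    Razorpay,
}

impl BillingProvider {
    /// Parse the BILLING_PROVIDER env var value; an unknown value is a
    /// hard error rather than a silent default.
    fn from_env_value(value: &str) -> Result<Self, String> {
        match value.to_ascii_lowercase().as_str() {
            "stripe" => Ok(Self::Stripe),
            "razorpay" => Ok(Self::Razorpay),
            other => Err(format!("unknown billing provider: {other}")),
        }
    }

    /// Every operation matches exhaustively, so adding a provider forces
    /// the compiler to flag every call site that must handle it.
    fn checkout_url(&self, plan: &str) -> String {
        match self {
            Self::Stripe => format!("/billing/stripe/checkout?plan={plan}"),
            Self::Razorpay => format!("/billing/razorpay/checkout?plan={plan}"),
        }
    }
}
```

The point isn't the enum itself; it's that `match` without a wildcard arm makes "forgot to handle Razorpay here" a compile error instead of a customer-facing bug.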
The frontend: Next.js, and the parts that are boring on purpose
The marketing site, docs, dashboard, and auth are all Next.js App Router. Tailwind v4, Radix UI, Supabase auth. Nothing surprising here, and that's the point.
When you're solo, you need to pick your battles. The frontend is not where I innovate. It's where I need things to look good and work reliably with minimal maintenance. Shadcn components, Zustand for the little state I have, TanStack Query for server state. Standard tools, used in standard ways.
The one interesting frontend decision: the chat playground uses my own npm package (kanha-ai) for inference. Dogfooding the SDK on my own dashboard means I catch integration bugs before customers do.
The training pipeline: the hardest part nobody sees
This is where the complexity lives. The pipeline goes:
Crawled pages → QA pair generation (LLM) → Dataset assembly (JSONL)
→ QLoRA fine-tune (GPU) → MLC conversion → Upload to Hugging Face Hub
→ SDK fetches model → WebGPU inference in browser
Each arrow is a system boundary where things break differently.
QA generation
After crawling a page, I enqueue it for QA pair generation. An LLM (currently via OpenRouter with Groq fallback) reads the page content and generates 3–8 question-answer pairs. Quality filters reject pairs where the answer is under 50 characters or the question is under 15.
This is the part most people would do at training time. I do it eagerly, right after crawl, so users can review and edit pairs before assembling a dataset. The tradeoff: storage cost for pairs that might never be used. Worth it for the UX.
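The quality gate above is just a length check, sketched here as a pure function (the name and exact trimming behavior are my assumptions, not the production code's):

```rust
/// Keep a QA pair only if it clears the minimum-length thresholds:
/// questions under 15 characters or answers under 50 are rejected.
fn keep_qa_pair(question: &str, answer: &str) -> bool {
    question.trim().chars().count() >= 15 && answer.trim().chars().count() >= 50
}
```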
Training orchestration
Training jobs go through a Redis reliable queue. The consumer pulls a job, wakes up the GPU provider, submits the training request, and waits for a webhook callback.
I currently support two providers:
Lightning AI: a persistent studio with a T4 GPU. The backend embeds worker files in the Rust binary via include_str! and uploads them on startup. The studio auto-sleeps after 10 minutes of idle. Good for development, but the cold-start (waking a sleeping studio) adds minutes.
fal: serverless GPUs. No provisioning needed. I deploy a Python app (fal deploy) and submit jobs to their queue API. The catch: read-only virtualenv at runtime, so MLC (the WebGPU compiler) has to be installed via direct wheel URLs in the requirements list. pip install at runtime doesn't work. Neither does --extra-index-url (pulls stub packages from PyPI instead of the real ones). I spent two days learning this.
The training itself is QLoRA: 4-bit NF4 quantization, LoRA rank 16, 3 epochs, ChatML template. Then MLC conversion to q4f16_1 format: quantized weight shards plus a config that web-llm understands. Everything gets uploaded to a Hugging Face repo.
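Collected as a typed config, those hyperparameters look like this. The struct and field names are mine; the values are the ones stated above.

```rust
/// Training and conversion settings from the pipeline description.
struct TrainConfig {
    quant: &'static str,     // 4-bit NF4 base-model quantization
    lora_rank: u32,          // LoRA adapter rank
    epochs: u32,
    chat_template: &'static str,
    mlc_quant: &'static str, // target format for WebGPU inference
}

impl Default for TrainConfig {
    fn default() -> Self {
        Self {
            quant: "nf4",
            lora_rank: 16,
            epochs: 3,
            chat_template: "chatml",
            mlc_quant: "q4f16_1",
        }
    }
}
```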
The MLC conversion rabbit hole
MLC-LLM compiles model weights into a format that runs on WebGPU via WebAssembly. The conversion step (mlc_llm convert_weight) needs to run on CPU (no nvcc dependency), and the config needs a rope_theta injection because Qwen3 omits it from its config.json but the MLC runtime expects it.
I found this by getting NaN outputs in the browser after a seemingly successful training run. The model loaded, generated tokens, but every token was garbage. Took a full day to trace it to a missing RoPE parameter. Now there's a _patch_config_for_mlc() function that injects it automatically.
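The fix is a one-line insert-if-missing. A sketch in Rust rather than the worker's Python, assuming the config has been parsed into a key-value map; the actual default theta value is the model-specific one, which I'm deliberately not hardcoding here.

```rust
use std::collections::HashMap;

/// Inject rope_theta into the model config if the upstream config.json
/// omitted it. Without this, the MLC runtime falls over and the browser
/// produces NaN outputs despite a "successful" conversion.
fn patch_config_for_mlc(config: &mut HashMap<String, f64>, default_theta: f64) {
    config.entry("rope_theta".to_string()).or_insert(default_theta);
}
```

An existing value is left alone; only the missing-key case is patched.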
These are the kinds of bugs that don't show up in any tutorial or documentation. They only show up when you're the person running the entire pipeline end-to-end.
The SDK: shipping a real npm package
The SDK (kanha-ai on npm) has three integration modes: React component, vanilla JS mount() function, and a <kanha-bot> Web Component. All built from one TypeScript codebase with tsup.
The key design constraint: zero backend dependency at runtime. The SDK resolves model artifacts from Hugging Face Hub URLs, loads them via web-llm, and runs inference entirely in the browser. No API keys, no usage meters, no server round-trips.
The tricky part was URL resolution. web-llm's cleanModelUrl() appends /resolve/main/ to URLs that don't already contain it. Hugging Face artifact URLs already have /resolve/main/, so they pass through untouched. The ndarray-cache.json file uses relative dataPath entries that resolve against the HF base URL. Getting this chain right (training output → HF upload → SDK resolution → web-llm loading → WebGPU execution) required reading web-llm's source code. The docs don't cover custom model hosting.
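The resolution rule itself is small. Here is my approximation of it in Rust rather than web-llm's TypeScript, so treat the exact matching logic as an assumption about cleanModelUrl's behavior, not a transcription of it:

```rust
/// Append "/resolve/main/" to a model URL unless it already carries a
/// /resolve/ segment, mirroring the behavior described for web-llm's
/// cleanModelUrl(). Hugging Face artifact URLs already have it, so they
/// pass through untouched (modulo a trailing slash).
fn normalize_model_url(url: &str) -> String {
    let trimmed = url.trim_end_matches('/');
    if trimmed.contains("/resolve/") {
        format!("{trimmed}/")
    } else {
        format!("{trimmed}/resolve/main/")
    }
}
```

The practical consequence: the SDK must hand web-llm the artifact URL in exactly the already-resolved form, or the appended segment produces a 404.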
Publishing is a GitHub Actions workflow with manual dispatch. Pick patch/minor/major, it publishes to npm, creates a git tag, and jsdelivr auto-mirrors it for the CDN integration mode.
What "solo" actually means
It doesn't mean I built everything from scratch. It means I'm the only person who gets paged.
I lean heavily on managed services: Supabase for Postgres and auth, Upstash Redis for caching and queues, fal for GPU compute, Hugging Face for model hosting, Resend for emails. The build-vs-buy decisions are aggressive: if someone else can run it, they should.
What I do own:
The Rust backend (application logic, crawling, orchestration, billing)
The Next.js frontend (marketing, dashboard, docs)
The training pipeline (QLoRA + MLC conversion)
The npm SDK (three integration modes, WebGPU inference)
The deployment and monitoring
That's five codebases across three languages (Rust, TypeScript, Python). The Python is only in the training worker, about 400 lines total. Everything else is Rust or TypeScript.
What I'd do differently
Start with fal, not Lightning AI. I built the Lightning AI integration first because persistent studios felt simpler. They're not. Managing studio lifecycle (wake, sleep, health checks, file sync) is more operational burden than submitting to a serverless queue. fal's model is "deploy once, submit jobs", which is exactly what a solo developer needs.
Skip the second billing provider. I implemented both Stripe and Razorpay with a clean abstraction layer. Elegant? Sure. But I could have shipped with just Razorpay and added Stripe later when I actually had US customers. The abstraction layer took two days to build. Those two days could have been two more landing page experiments.
Write the SDK earlier. I built the dashboard playground first with raw web-llm calls, then extracted the SDK later. This meant I had to refactor the playground to use the SDK, and I found integration bugs I would have caught sooner if the SDK had been the foundation from day one.
The actual hard part
The hard part isn't any individual technical challenge. It's context-switching across all of them in a single day.
Monday morning I'm debugging a Rust lifetime error in the crawler's browser pool. Monday afternoon I'm tweaking Tailwind spacing on a pricing page. Monday evening I'm reading MLC-LLM source code to figure out why a converted model produces garbage output in Safari but works in Chrome.
There's no team to absorb the cognitive load. Every decision, from database schema to button color to LoRA hyperparameters, routes through one brain. The skill isn't depth in any one area. It's maintaining enough depth across all of them to ship something that works.
If you're considering building something like this alone: it's doable. But the bottleneck isn't technical skill. It's energy management. The code is the easy part. The hard part is not burning out before you find customers.
Kanha is live at kanha.ai. The free tier gives you 50 pages and a production bot, no credit card, no trial timer.