Question 1

What's an AI API and why would I self-host one?

Accepted Answer

An AI API is an HTTP endpoint (typically OpenAI-compatible — same /v1/chat/completions schema) that runs a language or vision model and returns inference. You self-host instead of using OpenAI / Groq / Anthropic when you need (a) per-token cost predictability at scale, (b) data residency in India for compliance, (c) custom or fine-tuned models, (d) latency under 100 ms to Indian users, or (e) freedom from rate limits. AIC Cloud GPUs (RTX 3090 / 4090 / A100) are the compute layer — you bring vLLM, Ollama, TGI, or LM Studio Server on top.

Question 2

Which models can I run on AIC Cloud GPU as an API?

Accepted Answer

Any model that fits the GPU's VRAM. RTX 3090 (24 GB): Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B/14B (quantized), DeepSeek Coder 6.7B, Stable Diffusion XL, Whisper Large v3. RTX 4090 (24 GB): same plus higher throughput. A100 (80 GB): Llama 3.1 70B (FP8 / AWQ quantized), Mixtral 8x7B, Qwen 2.5 72B, full SDXL pipelines with ControlNet stacks. For inference frameworks we recommend vLLM (best throughput) or Ollama (easiest setup).

Question 3

How does AIC AI API hosting compare to OpenAI / Groq / Anthropic on cost?

Accepted Answer

For typical LLM workloads: OpenAI GPT-4o-mini costs $0.15 input / $0.60 output per 1M tokens. Groq Llama 70B costs $0.59/$0.79 per 1M. Self-hosting Llama 3.1 8B on AIC RTX 3090 at ₹27.74/hr with vLLM throughput of ~150 tokens/sec sustained = ~₹50 per 1M tokens output (input near-free relatively). Break-even vs OpenAI is around 500K-1M tokens/day; above that, self-host wins decisively. Plus you get latency advantage for Indian users (sub-100 ms vs OpenAI's ~600 ms from India).

Question 4

Is the API OpenAI-compatible?

Accepted Answer

Yes — vLLM and Ollama both expose OpenAI-compatible endpoints (/v1/chat/completions, /v1/embeddings, /v1/models). Your existing OpenAI SDK code works by changing two lines: the baseURL points to your AIC instance (https://your-gpu-ip:8000/v1) and the model name changes to whatever you loaded (e.g., 'llama-3.1-8b-instruct'). Drop-in replacement in most cases.

Question 5

What about latency for Indian users?

Accepted Answer

AIC GPU instances are deployed close to Indian users (sub-100 ms one-way latency from major Indian cities). For comparison: OpenAI requests from India typically take 400-700 ms in network alone before the model even processes anything. For real-time use cases — voice agents, autocomplete, code completion, customer chat — self-hosted Indian inference is dramatically more responsive.

Question 6

Can I fine-tune a model on AIC and serve it as an API on the same instance?

Accepted Answer

Yes. The typical flow: rent an A100 80 GB for the fine-tuning run (₹163/hr × ~4-12 hrs depending on dataset size and LoRA/QLoRA settings), save the adapter, then either keep that A100 running as your inference server or move the adapter to a smaller RTX 3090/4090 for serving. Same wallet, same INR billing, same SSH access — no separate accounts for training vs inference.

Question 7

Do you offer a managed AI API, or do I have to install vLLM / Ollama myself?

Accepted Answer

Currently self-managed — you SSH in, install your inference framework (vLLM, Ollama, TGI, LM Studio Server), and expose the API on a port. Setup takes 10-30 minutes for most setups; we have pre-built Ollama and vLLM templates you can spin up directly. Managed AI API offering (one-click model deploy with auto-scale) is on the roadmap; contact support for early access.

Question 8

What about Stable Diffusion / image generation APIs?

Accepted Answer

Same idea — rent a GPU, run ComfyUI (with its REST API) or Automatic1111 (--api flag) or SD.Next, and you have a Stable Diffusion API endpoint. RTX 3090 handles SDXL at ~6-8 seconds per 1024×1024 image at 30 steps; RTX 4090 is ~30% faster. Pre-built ComfyUI template available.

Question 9

What about Whisper / speech-to-text APIs?

Accepted Answer

Whisper Large v3 runs well on RTX 3090 (24 GB) — transcribes about 30 seconds of audio in ~5 seconds wall-clock. For an API endpoint, the easiest path is faster-whisper-server or WhisperX behind FastAPI. AIC has a pre-built Whisper template for one-click deploy.

Question 10

How do I pay — INR or USD?

Accepted Answer

INR. Top up your AIC wallet via UPI (PhonePe, GPay, Paytm), net banking, or any Indian debit/credit card via Razorpay. GPU minutes deduct from your wallet in real time as your inference API serves requests. No FX fees, no foreign card needed, no USD subscription. For overseas customers, PayPal (USD) and crypto (USDT) also supported.

Self-host your AI API in India from ₹27.74/hr

When to self-host vs use OpenAI / Groq

📊 Cost crossover at ~500k tokens/day

🇮🇳 Data residency for compliance

⚡ Real-time latency for voice / chat

🎨 Custom or fine-tuned models

Spin up an AI API in 5 minutes

1 — Pick a GPU

2 — Pick a template

3 — Load your model

4 — Point your code

Ready to run AI APIs at Indian-data-center latency?