{"id":28,"date":"2026-03-10T09:00:00","date_gmt":"2026-03-10T03:30:00","guid":{"rendered":"https:\/\/rdp.in\/blog\/?p=28"},"modified":"2026-04-21T18:41:03","modified_gmt":"2026-04-21T13:11:03","slug":"building-your-ai-factory-in-india-a-cios-playbook-for-2026","status":"publish","type":"post","link":"https:\/\/rdp.in\/blog\/building-your-ai-factory-in-india-a-cios-playbook-for-2026\/","title":{"rendered":"Building Your AI Factory in India: A CIO&#8217;s Playbook for 2026"},"content":{"rendered":"\n<p><em><strong>Part 2 of 3 \u00b7 RDP AI Infrastructure Series<\/strong><\/em><\/p>\n\n\n\n<p>In Part 1 of this series, we made the case for <em>why<\/em> Indian enterprises are repatriating AI workloads from cloud to on-prem. This post is about <em>how<\/em> to actually do it \u2014 without overbuilding, underbuilding, or designing yourself into a corner.<\/p>\n\n\n\n<p>An AI factory is not a single product. It\u2019s a stack: compute, networking, storage, power, cooling, software, and the people who run it. Get the architecture right and the deployment compounds for five years. 
Get it wrong and you end up with a very expensive rack that trains one model per quarter.<\/p>\n\n\n\n<p>Here\u2019s how we think about it at RDP \u2014 the questions we walk every customer through, and the decisions that tend to matter most.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-28-1024x683.png\" alt=\"\" class=\"wp-image-355\" srcset=\"https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-28-1024x683.png 1024w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-28-300x200.png 300w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-28-768x512.png 768w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-28.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Start with the Workload, Not the Hardware<\/h2>\n\n\n\n<p>The single most common mistake we see: organizations start by asking \u201chow many H100s do we need?\u201d<\/p>\n\n\n\n<p>Wrong question. Start with the workload.<\/p>\n\n\n\n<p>There are three broad workload shapes, and each wants a different architecture:<\/p>\n\n\n\n<p><strong>Inference-dominant.<\/strong> You\u2019re running trained models in production \u2014 a chatbot, a document classifier, a fraud detector, a RAG pipeline over corporate data. The GPU is usually one of the smaller SKUs (L40S, H100 PCIe, or even A100 for cost-optimized inference). You need throughput, low latency, and high uptime. Networking is less critical \u2014 inference workloads rarely need RDMA-class fabrics. Storage is moderate.<\/p>\n\n\n\n<p><strong>Fine-tuning-dominant.<\/strong> You\u2019re taking open-weight or commercial base models and adapting them to your data. 
Think LoRA and QLoRA on Llama, Mistral, or domain-specific models. You need a mid-size GPU cluster (4\u201316 H100-class), fast local NVMe, and good-but-not-extreme networking. This is the sweet spot for most Indian mid-enterprise deployments.<\/p>\n\n\n\n<p><strong>Pre-training or large-scale training.<\/strong> You\u2019re building foundation models from scratch, or doing full fine-tunes at meaningful scale. This needs real infrastructure \u2014 32 to 128+ GPUs, 400Gb\/s InfiniBand or equivalent, liquid cooling, shared parallel file systems. This is AI Factory territory in the true sense, and it\u2019s where the top of the Indian market \u2014 banks, telcos, research institutions, and sovereign AI initiatives \u2014 is investing.<\/p>\n\n\n\n<p><strong>The honest truth:<\/strong> in our experience, roughly 85% of Indian enterprise deployments are inference-plus-fine-tuning. Very few need pre-training infrastructure. Don\u2019t buy for the glamour workload; buy for the workload you actually run.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-30-1024x683.png\" alt=\"\" class=\"wp-image-357\" srcset=\"https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-30-1024x683.png 1024w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-30-300x200.png 300w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-30-768x512.png 768w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-30.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">The Five Decisions That Matter<\/h2>\n\n\n\n<p>Once you know your workload shape, there are five architectural decisions that drive everything else. 
Get these right and the rest is execution.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. GPU SKU and count<\/h3>\n\n\n\n<p>The hardware conversation has narrowed. For Indian enterprises in 2026, the practical choices are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>L40S \/ L4<\/strong> \u2014 inference and light fine-tuning. Cost-effective. Air-cooled. Good fit for departmental deployments.<\/li>\n\n\n\n<li><strong>H100 (SXM or PCIe)<\/strong> \u2014 the workhorse. Fine-tuning, production inference at scale, mid-size training. SXM variants need liquid or hybrid cooling above 4 GPUs per node.<\/li>\n\n\n\n<li><strong>H200<\/strong> \u2014 higher memory bandwidth and capacity than H100. Increasingly the default for fine-tuning and larger inference models. Same thermal envelope considerations as H100.<\/li>\n\n\n\n<li><strong>B200 \/ Blackwell-class<\/strong> \u2014 top-of-the-line for large-scale training. Needs liquid cooling. Lead times and procurement complexity are non-trivial; only go here if the workload demands it.<\/li>\n<\/ul>\n\n\n\n<p>The count question is answered by your workload mix. A good starting heuristic: size for your 75th-percentile workload, not your peak. The peak is what cloud burst is for.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2. Networking fabric<\/h3>\n\n\n\n<p>This is where \u201cAI factory\u201d diverges from \u201cGPU server with a lot of GPUs.\u201d If your training jobs span multiple nodes, your fabric matters more than your GPUs. Options:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Ethernet with RoCE (100\/200\/400 GbE)<\/strong> \u2014 good for most mid-size deployments. Lower cost, easier operations, compatible with existing data center skills. This is where most Indian enterprises land.<\/li>\n\n\n\n<li><strong>InfiniBand (HDR \/ NDR)<\/strong> \u2014 required for tightly-coupled large-scale training. 
Higher cost, specialized skills, but the performance difference at 32+ GPUs is material.<\/li>\n<\/ul>\n\n\n\n<p>Don\u2019t over-spec here. A 16-GPU cluster running fine-tuning jobs is rarely bottlenecked on fabric.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3. Cooling<\/h3>\n\n\n\n<p>Indian ambient temperatures make this a first-class decision, not an afterthought. Three regimes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Air cooling<\/strong> \u2014 viable up to ~30 kW per rack. Works for edge and smaller departmental deployments. Requires good hot-aisle\/cold-aisle discipline and adequate CRAC capacity.<\/li>\n\n\n\n<li><strong>Rear-door heat exchangers (RDHx)<\/strong> \u2014 bridges to ~45\u201355 kW per rack. Good middle path. Relatively low facility disruption. RDP deploys these often for customers who want rack-scale without going full liquid.<\/li>\n\n\n\n<li><strong>Direct liquid cooling (DLC)<\/strong> \u2014 required for densest H100 SXM, H200, and B200 deployments. 70\u2013120+ kW per rack. Needs facility water loops and CDU infrastructure. Plan for 6\u201312 weeks of facility work if you\u2019re retrofitting.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4. Storage architecture<\/h3>\n\n\n\n<p>Two tiers, both matter:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Hot \/ scratch<\/strong> \u2014 local NVMe on compute nodes, or an all-flash parallel file system (Weka, VAST, DDN, Lustre, BeeGFS). This is where training datasets live during a run.<\/li>\n\n\n\n<li><strong>Warm \/ corpus<\/strong> \u2014 object storage (MinIO, Ceph, or commercial) for the broader training corpus, checkpoints, model artifacts, and logs. Cheap, dense, network-attached.<\/li>\n<\/ul>\n\n\n\n<p>The ratio that matters: for fine-tuning workloads, plan ~2\u20135 TB of hot NVMe per GPU. For pre-training, more.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5. Software stack<\/h3>\n\n\n\n<p>The hardware conversation is the easy part. 
The software stack is where operational burden lives.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Orchestration<\/strong> \u2014 Kubernetes with GPU operator and a scheduler that understands GPU topology (Volcano, Kueue). Slurm remains popular for HPC-style workloads.<\/li>\n\n\n\n<li><strong>Observability<\/strong> \u2014 Prometheus + Grafana + DCGM exporter for GPU telemetry. Log aggregation (Loki, ELK). Alerting for thermal, utilization, and failure events.<\/li>\n\n\n\n<li><strong>Model serving<\/strong> \u2014 Triton, vLLM, TGI, or an MLOps platform that wraps one of them. Pick based on your team\u2019s preference; the platforms are maturing fast.<\/li>\n\n\n\n<li><strong>Security<\/strong> \u2014 perimeter, identity, data-at-rest encryption, and increasingly, model-level access controls. Treat the AI factory like any production system that handles sensitive data, because it does.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Phasing: Don\u2019t Build It All at Once<\/h2>\n\n\n\n<p>The second most common mistake: buying the full three-year footprint upfront.<\/p>\n\n\n\n<p>Here\u2019s the phased approach we recommend:<\/p>\n\n\n\n<p><strong>Phase 1 (Months 0\u20133) \u2014 Pilot pod.<\/strong> Stand up a small cluster (4\u20138 GPUs, air-cooled, single rack). Objectives: prove the workload, validate the software stack, build internal skills, capture baseline metrics. Capex in the \u20b960 lakh \u2013 \u20b91.2 crore range.<\/p>\n\n\n\n<p><strong>Phase 2 (Months 3\u20139) \u2014 Production departmental.<\/strong> Scale to 16\u201332 GPUs. Introduce RDHx or early liquid cooling if needed. Promote pilot workloads to production. Start migrating the heaviest cloud inference workloads on-prem. 
Capex scales to \u20b92\u20134 crore cumulative.<\/p>\n\n\n\n<p><strong>Phase 3 (Months 9\u201324) \u2014 Rack-scale AI factory.<\/strong> Multi-rack, liquid-cooled, parallel file system, proper fabric. This is the inflection from \u201cwe have GPUs\u201d to \u201cwe operate an AI factory.\u201d Only go here if utilization and roadmap justify it. Capex \u20b96\u201315 crore+ depending on scale.<\/p>\n\n\n\n<p>Each phase makes the next one easier to justify. And crucially, each phase produces real business value \u2014 not just a benchmark number.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Pitfalls to Avoid<\/h2>\n\n\n\n<p>From our deployments across Indian enterprises, the recurring failure modes:<\/p>\n\n\n\n<p><strong>Buying GPUs before fixing power and cooling.<\/strong> Hardware arrives; facility can\u2019t support it; rack sits at 40% density for six months. Always qualify power and cooling <em>first<\/em>.<\/p>\n\n\n\n<p><strong>Ignoring network fabric until it bottlenecks.<\/strong> A well-specced GPU cluster on a mediocre fabric is a mediocre AI factory. If your workload needs multi-node training, design fabric into Phase 1, not Phase 3.<\/p>\n\n\n\n<p><strong>Underestimating storage.<\/strong> Training workloads generate checkpoints. Many of them. A 70B parameter fine-tune can produce terabytes of checkpoint data per run. Plan accordingly.<\/p>\n\n\n\n<p><strong>Treating it as an IT project instead of a platform.<\/strong> The AI factory will be used by data scientists, ML engineers, and application developers \u2014 not just the infra team. Self-service access, quotas, chargeback, and good developer ergonomics matter from day one.<\/p>\n\n\n\n<p><strong>Skimping on observability.<\/strong> When a training run silently degrades at 3 AM on day six, you\u2019ll want metrics. 
Build the telemetry stack before you need it.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The India-Specific Considerations<\/h2>\n\n\n\n<p>Three things that shift the calculus for Indian enterprises specifically:<\/p>\n\n\n\n<p><strong>Power quality and cost.<\/strong> Tier-1 metros have decent grid quality; tier-2 often doesn\u2019t. Plan for redundant power, UPS capacity sized for GPU workloads (not office IT), and if relevant, diesel backup with adequate runway. Your power design needs to survive the grid, not just augment it.<\/p>\n\n\n\n<p><strong>Facility lead times.<\/strong> Building out a new rack row with proper cooling in India typically runs 8\u201316 weeks. Longer if you\u2019re retrofitting a space that wasn\u2019t designed for 40+ kW per rack. Factor this into your roadmap \u2014 it\u2019s often the critical path.<\/p>\n\n\n\n<p><strong>Local support and spares.<\/strong> Imported hardware means imported RMAs and imported spares. The difference between 4-hour on-site and 4-week-by-sea matters enormously around year three. India-based manufacturers and integrators compress this to hours, not weeks \u2014 which is a design consideration, not a sales pitch.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Where RDP Fits<\/h2>\n\n\n\n<p>This playbook is deliberately vendor-neutral \u2014 the architecture choices apply regardless of who builds your hardware. That said, if you\u2019re evaluating partners: RDP designs and manufactures AI infrastructure in India, from edge compute to rack-scale AI factories. 
We ship the AI-POD for departmental deployments and rack-scale AI Factory configurations for larger installs, with India-based engineering, warranty, and support.<\/p>\n\n\n\n<p>We tend to get engaged at two points: when an organization is pricing their first serious GPU deployment and wants a second opinion on sizing, or when they\u2019re mid-way through a phased rollout and need a partner who can scale with them through Phase 3. We\u2019re happy to do either. <strong>Reliability is Our Product.<\/strong><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-32-1024x683.png\" alt=\"\" class=\"wp-image-360\" srcset=\"https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-32-1024x683.png 1024w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-32-300x200.png 300w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-32-768x512.png 768w, https:\/\/rdp.in\/blog\/wp-content\/uploads\/2026\/03\/image-32.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">What\u2019s Next<\/h2>\n\n\n\n<p>This was Part 2 of the RDP AI Infrastructure Series \u2014 the tactical <em>how<\/em>.<\/p>\n\n\n\n<p>In <a href=\"https:\/\/rdp.in\/blog\/sovereign-ai-starts-with-sovereign-compute-the-case-for-indias-on-prem-ai-stack\/\"><strong>Part 3<\/strong><\/a>, we\u2019ll cover the <em>why it matters beyond cost<\/em>: how Indian enterprises and government bodies are thinking about <strong>sovereign AI<\/strong> as a strategic imperative \u2014 data sovereignty, model sovereignty, hardware sovereignty, and the policy landscape shaping it.<\/p>\n\n\n\n<p>Missed Part 1? 
Read <a href=\"https:\/\/rdp.in\/blog\/the-real-cost-of-cloud-ai-why-indian-enterprises-are-moving-gpu-workloads-on-prem-in-2026\/\"><strong>The Real Cost of Cloud AI: Why Indian Enterprises Are Moving GPU Workloads On-Prem in 2026<\/strong><\/a>.<\/p>\n\n\n\n<p>If you\u2019re starting to scope an AI factory and want a second opinion on sizing, architecture, or phasing \u2014 <a href=\"https:\/\/rdp.in\/contact\/\">speak with our AI Infrastructure team<\/a>. No sales pitch, just an honest design review.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Read the full series<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/rdp.in\/blog\/the-real-cost-of-cloud-ai-why-indian-enterprises-are-moving-gpu-workloads-on-prem-in-2026\/\"><strong>Part 1: The Real Cost of Cloud AI<\/strong><\/a> \u2014 why Indian enterprises are moving GPU workloads on-prem.<\/li>\n\n\n\n<li><strong>Part 2: Building Your AI Factory in India<\/strong> <em>(you are here)<\/em><\/li>\n\n\n\n<li><a href=\"https:\/\/rdp.in\/blog\/sovereign-ai-starts-with-sovereign-compute-the-case-for-indias-on-prem-ai-stack\/\"><strong>Part 3: Sovereign AI Starts with Sovereign Compute<\/strong><\/a> \u2014 the policy and sovereignty angle.<\/li>\n<\/ul>\n\n\n\n<p><strong>Table: AI Factory Architecture Tiers \u2014 GPU Count, Power, Cooling, Space, and Indicative Budget<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Tier<\/th><th>Use Case<\/th><th>GPU Count (indicative)<\/th><th>Power Draw<\/th><th>Cooling Requirement<\/th><th>Floor Space<\/th><th>Indicative \u20b9 Budget<\/th><\/tr><\/thead><tbody><tr><td>Tier 1 \u2014 Edge Inference<\/td><td>Branch \/ factory floor inference, IoT<\/td><td>1\u20134 GPUs (e.g., RTX 4000 class)<\/td><td>300\u2013800 W<\/td><td>Standard AC; no special infra<\/td><td>1\u20132 rack units<\/td><td>\u20b98\u201325 lakh per node<\/td><\/tr><tr><td>Tier 2 \u2014 Departmental 
AI<\/td><td>Team-level LLM serving, RAG, fine-tuning PoC<\/td><td>8\u201316 GPUs (e.g., H100 SXM or A100)<\/td><td>10\u201325 kW<\/td><td>Precision AC or in-row cooling<\/td><td>1\u20132 racks (~10 sq m)<\/td><td>\u20b93\u20138 cr<\/td><\/tr><tr><td>Tier 3 \u2014 Enterprise AI Cluster<\/td><td>Production LLM serving, multi-model, MLOps pipeline<\/td><td>64\u2013256 GPUs<\/td><td>100\u2013400 kW<\/td><td>Chilled water or rear-door heat exchanger<\/td><td>50\u2013200 sq m<\/td><td>\u20b925\u2013100 cr<\/td><\/tr><tr><td>Tier 4 \u2014 Hyperscale AI Training<\/td><td>Foundation model training, national AI missions<\/td><td>1,000+ GPUs (H100\/B200 class)<\/td><td>1\u201310 MW<\/td><td>Direct liquid cooling (DLC) mandatory<\/td><td>500+ sq m dedicated DC hall<\/td><td>\u20b9300 cr+<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><em>RDP Technologies Limited designs, manufactures, and supports AI infrastructure \u2014 from edge compute to rack-scale AI factories \u2014 for Indian enterprises, government bodies, and research institutions. Make in India. Built for an AI-Ready India. 
Reliability is Our Product.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A CIO&#8217;s practical playbook for designing an AI Factory in India in 2026 \u2014 sizing, architecture, phasing, and the five decisions that matter most.<\/p>\n","protected":false},"author":1,"featured_media":361,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[17],"tags":[24,22,27,28,26,30,29,21],"class_list":["post-28","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-infrastructure","tag-ai-factory","tag-ai-infrastructure-india","tag-cio-playbook","tag-data-center-design","tag-gpu-clusters","tag-liquid-cooling","tag-nvidia-h100","tag-on-prem-gpu"],"acf":[],"_links":{"self":[{"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/posts\/28","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/comments?post=28"}],"version-history":[{"count":7,"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/posts\/28\/revisions"}],"predecessor-version":[{"id":362,"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/posts\/28\/revisions\/362"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/media\/361"}],"wp:attachment":[{"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/media?parent=28"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/categories?post=28"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rdp.in\/blog\/wp-json\/wp\/v2\/tags?post=28"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}