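Designing a Production Environment: A Practical Guide to Service Architecture That Works in the Real World

Hello, growing developers!

“Development is done… so how do I actually run this service?” 🤔

Many junior developers are comfortable writing code, but deploying and operating a real service in a Production environment is a different world. Today I'll walk through what I've learned about Production environment design from the projects I've worked on.

If you've ever thought, “Can't I just put it on one server and run it?”, read this post to the end. Operating a real service takes more than you might expect! 🚀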

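1. What Is a Production Environment? The Gap Between Ideal and Reality

Development Environment vs Production Environment

# Development environment (Local)
- Runs only on my machine
- Data is test data
- If an error occurs, just restart
- The only user is me

# Production environment (Real Service)
- 24/7 uninterrupted service
- Real user data
- Errors translate directly into losses
- Thousands to tens of thousands of concurrent users

What You Actually Have to Consider in Production

Availability: the service must stay up
Scalability: it must hold up as the user base grows
Security: data and systems must be protected
Performance: responses must be fast
Monitoring: problems must be detected quickly
Backup/Recovery: data must never be lost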

2. The Production Project Lifecycle: From Planning to Deployment

Process Overview by Stage

graph LR
    A[Planning] --> B[Kick-off]
    B --> C[Pre Sprint]
    C --> D[Sprint]
    D --> E[QA]
    E --> F[Beta]
    F --> G[Deploy]

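Detailed Look at Each Stage

1️⃣ Planning

Main owners: product planner, product owner
Key deliverables: requirements specification, user stories

# Technical considerations during planning

## 1. Expected user scale
- DAU (Daily Active Users) forecast
- Expected concurrent users
- Traffic pattern analysis (by time of day, by day of week)

## 2. Data volume forecast
- Estimated amount of data to store
- Expected data growth rate
- Classification into critical and ordinary data

## 3. Performance requirements
- Response time target (commonly under 2 seconds)
- Throughput target (requests per second)
- Availability target (e.g. 99.9% uptime)

Practical tip: During planning, always ask “How many people do you expect to use this?” A system for 100 users and a system for 100,000 users are completely different!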
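
An availability target such as 99.9% is easier to discuss once it is converted into a downtime budget. The sketch below does that conversion; the function name `downtimeBudgetMinutes` and the 30-day and 365-day periods are illustrative choices, not part of any standard.

```javascript
// Convert an availability target (in percent) into allowed downtime in minutes.
function downtimeBudgetMinutes(availabilityPercent, periodDays) {
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - availabilityPercent / 100);
}

const perMonth = downtimeBudgetMinutes(99.9, 30);
const perYear = downtimeBudgetMinutes(99.9, 365);

console.log(perMonth.toFixed(1)); // minutes of downtime allowed per 30 days
console.log(perYear.toFixed(1)); // minutes of downtime allowed per 365 days
```

At 99.9% the budget works out to roughly 43 minutes per month, which is a useful number to bring to the planning conversation.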

2️⃣ Kick-off

Main owners: PM, team leads
Key tools: JIRA, Figma, Adobe XD

# Tech stack decisions to make at kick-off

## Frontend stack
- Framework: React, Vue.js, Angular
- UI Library: Material-UI, Ant Design, Bootstrap
- State management: Redux, Vuex, Context API
- Bundler: Webpack, Vite, Parcel

## Backend stack
- Language/Runtime: Node.js, Python, Java, Go
- Framework: Express, Django, Spring Boot, Gin
- Database: PostgreSQL, MySQL, MongoDB
- API design: RESTful, GraphQL

## Infrastructure stack
- Cloud Provider: AWS, GCP, Azure
- Container: Docker, Kubernetes
- CI/CD: GitHub Actions, GitLab CI, Jenkins
- Monitoring: Prometheus, Grafana, ELK Stack

3️⃣ Pre Sprint (the Heart of Infrastructure Design)

Main owners: infrastructure engineer, DBA, architect

This stage is the core of Production design!

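# Example infrastructure capacity calculation

## 1. Server capacity
# Expected concurrent users: 1,000
# Average response time: 500 ms
# Little's Law: throughput = concurrent requests / response time
# Upper bound if every user always has a request in flight = 1,000 / 0.5 s = 2,000 TPS
# (With think time between requests, the real figure is lower.)

## 2. Database capacity
# 10,000 users × 1 MB average data = 10 GB
# With 200% growth over one year = 30 GB
# With 100% headroom = 60 GB SSD

## 3. Network bandwidth
# Average page size: 2 MB (= 16 Mb)
# Concurrent users: 1,000
# Assumption: each user loads one page every 10 seconds = 100 pages/s
# Required bandwidth = 16 Mb × 100 = 1,600 Mbps (before CDN and caching)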
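
The three estimates can be turned into small functions so the inputs are easy to vary. This is a rough sketch, not a sizing tool: throughput uses Little's Law (throughput = concurrent requests / response time), storage uses decimal units, and the rate of 100 page loads per second is an assumed input rather than a measured one. All function names are illustrative.

```javascript
// Rough capacity estimates; replace every input with your own measurements.

// Little's Law: concurrency = throughput x response time.
function requiredThroughput(concurrentRequests, avgResponseSeconds) {
  return concurrentRequests / avgResponseSeconds;
}

// Storage after one year of growth plus headroom (1 GB = 1000 MB).
function storageGb(users, mbPerUser, yearlyGrowthRate, headroomRate) {
  const baseGb = (users * mbPerUser) / 1000;
  const afterGrowth = baseGb * (1 + yearlyGrowthRate);
  return afterGrowth * (1 + headroomRate);
}

// Bandwidth in megabits per second: megabytes are multiplied by 8, not divided.
function bandwidthMbps(pageMegabytes, pagesPerSecond) {
  return pageMegabytes * 8 * pagesPerSecond;
}

console.log(requiredThroughput(1000, 0.5)); // 2000
console.log(storageGb(10000, 1, 2.0, 1.0)); // 60
console.log(bandwidthMbps(2, 100)); // 1600
```

Treat the throughput figure as an upper bound: it assumes every concurrent user has a request in flight at all times.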

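Scale-Up vs Scale-Out Strategy

# Scale Up (vertical scaling)
Pros:
- Simple to implement
- Data consistency is easier to maintain (single node)
- Few things to manage

Cons:
- Usually requires downtime to resize
- Expensive at the high end
- Hits a hardware ceiling

# Scale Out (horizontal scaling)
Pros:
- Capacity grows by adding nodes (no single-machine ceiling)
- Failures can be isolated
- Cost-effective

Cons:
- More complex architecture
- Data synchronization issues
- Higher operational complexity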

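4️⃣ Sprint (Development and Implementation)

Main owner: development team
Key concept: MVP (Minimum Viable Product)

# MVP development strategy

## 1. Prioritize core features
Priority 1: user authentication, basic CRUD
Priority 2: core business logic
Priority 3: extra features, optimization

## 2. Environment setup
- Local development environment
- Development server (Development)
- Test server (Staging)
- Production server (Production)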

3. Production Architecture Design: Patterns from Practice

Basic 3-Tier Architecture

# Traditional three-layer structure
┌─────────────────┐
│   Presentation  │  ← Frontend (React, Vue.js)
│      Layer      │
├─────────────────┤
│   Application   │  ← Backend API (Node.js, Spring)
│      Layer      │
├─────────────────┤
│      Data       │  ← Database (MySQL, PostgreSQL)
│      Layer      │
└─────────────────┘

Architecture for a Small Startup

# Fewer than 1,000 users, simple service

Internet
    ↓
[Load Balancer]  ← AWS ALB, Nginx
    ↓
[Web Server]     ← EC2 t3.medium (Frontend + Backend)
    ↓
[Database]       ← RDS MySQL (Multi-AZ)
    ↓
[File Storage]   ← S3 (images, files)

# Estimated cost: $100-300/month
# Capacity: 100-500 concurrent users

Architecture for a Mid-Sized Service

# 10,000-100,000 users, steady traffic

Internet
    ↓
[CDN]            ← CloudFront (static file caching)
    ↓
[Load Balancer]  ← ALB (health checks, SSL termination)
    ↓
[Frontend]       ← S3 + CloudFront (SPA hosting)
    ↓
[API Gateway]    ← Routing, authentication, rate limiting
    ↓
[Backend Servers] ← Auto Scaling Group (2-5 EC2 instances)
    ├─ User Service
    ├─ Order Service
    └─ Payment Service
    ↓
[Cache Layer]    ← Redis Cluster
    ↓
[Database]       ← RDS (Master-Slave structure)
    ├─ Write: Master
    └─ Read: Read Replica
    ↓
[Message Queue]  ← SQS, RabbitMQ
    ↓
[Monitoring]     ← CloudWatch, Prometheus

# Estimated cost: $1,000-5,000/month
# Capacity: 1,000-10,000 concurrent users

Microservices Architecture for a Large-Scale Service

# 100,000+ users, complex business logic

Internet
    ↓
[Global CDN]     ← Multi-Region CDN
    ↓
[WAF]           ← Web Application Firewall
    ↓
[API Gateway]   ← Kong, AWS API Gateway
    ↓
[Service Mesh]  ← Istio, Consul Connect
    ↓
┌──────────────────────────────────────────┐
│              Microservices               │
├─ User Service       ├─ Product Service   │
├─ Order Service      ├─ Payment Service   │
├─ Notification       ├─ Analytics         │
├─ Auth Service       ├─ Search Service    │
└──────────────────────────────────────────┘
    ↓
[Container Platform] ← Kubernetes, Docker Swarm
    ↓
[Data Layer]
├─ RDBMS (PostgreSQL Cluster)
├─ NoSQL (MongoDB, Cassandra)
├─ Cache (Redis Cluster)
├─ Search (Elasticsearch)
└─ Data Warehouse (BigQuery, Redshift)
    ↓
[Message Systems]
├─ Event Bus (Kafka)
├─ Queue (RabbitMQ)
└─ Pub/Sub (Redis, Google Pub/Sub)

# Estimated cost: $10,000+/month
# Capacity: 100,000+ concurrent users

4. Design Guide for Each Core Component

Load Balancer Design

# Example Nginx load balancer configuration

# limit_req_zone is only valid in the http context, so it is declared outside the server blocks
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

upstream backend {
    # Sticky sessions by client IP (only if needed).
    # Remove ip_hash to fall back to the default weighted round robin.
    ip_hash;

    # Passive health checks: after 3 failures, skip the server for 30s.
    # (The active health_check directive is NGINX Plus only and belongs in a location block.)
    server 10.0.1.10:3000 weight=3 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:3000 weight=2 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:3000 weight=1 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name api.example.com;

    # Redirect HTTP to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;

    # SSL settings
    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Rate limiting (uses the "api" zone declared at the top)
    limit_req zone=api burst=20 nodelay;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
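
To make the `weight` parameters concrete, here is a sketch of smooth weighted round robin, the general technique for spreading requests in proportion to weights without sending bursts to one server. It is an illustration of the idea only, not nginx's source code, and `createBalancer` is a made-up name.

```javascript
// Smooth weighted round robin over a fixed list of servers (illustrative only).
function createBalancer(servers) {
  const peers = servers.map((server) => ({ ...server, current: 0 }));
  const totalWeight = peers.reduce((sum, peer) => sum + peer.weight, 0);

  return function next() {
    let best = null;
    for (const peer of peers) {
      // Every peer gains its weight on each pick...
      peer.current += peer.weight;
      if (best === null || peer.current > best.current) best = peer;
    }
    // ...and the chosen peer pays back the total, so it is not picked repeatedly.
    best.current -= totalWeight;
    return best.host;
  };
}

const next = createBalancer([
  { host: "10.0.1.10:3000", weight: 3 },
  { host: "10.0.1.11:3000", weight: 2 },
  { host: "10.0.1.12:3000", weight: 1 },
]);

const counts = {};
for (let i = 0; i < 6; i++) {
  const host = next();
  counts[host] = (counts[host] || 0) + 1;
}
console.log(counts);
```

Over any six consecutive picks the three servers receive 3, 2, and 1 requests respectively, matching the configured weights.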

Database Design Strategy

Master-Slave Structure

# MySQL master-slave setup

## Master server configuration (Write)
# /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
server-id = 1
log-bin = mysql-bin
binlog-format = ROW
gtid-mode = ON
enforce-gtid-consistency = ON

# Create the replication account (run on the master)
CREATE USER 'replicator'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';

## Slave server configuration (Read)
# /etc/mysql/mysql.conf.d/mysqld.cnf

[mysqld]
server-id = 2
read-only = 1
relay-log = relay-log
gtid-mode = ON
enforce-gtid-consistency = ON

# Connect to the master (run on the slave)
# Newer MySQL 8.0 releases prefer CHANGE REPLICATION SOURCE TO / START REPLICA; the syntax below is deprecated there
CHANGE MASTER TO
MASTER_HOST='master-server-ip',
MASTER_USER='replicator',
MASTER_PASSWORD='password',
MASTER_AUTO_POSITION = 1;

START SLAVE;

Connection Pool Tuning

// Node.js MySQL connection pool setup
const mysql = require("mysql2/promise");

// Settings shared by both pools (mysql2 option names)
const commonOptions = {
  database: "production_db",
  waitForConnections: true, // queue requests when the pool is exhausted
  queueLimit: 0, // 0 = unlimited queue
  connectTimeout: 10000, // ms to wait when opening a new connection
  charset: "utf8mb4",
};

// Write pool (Master)
// DB_WRITE_PASSWORD and DB_READ_PASSWORD are assumed environment variable names
const writePool = mysql.createPool({
  ...commonOptions,
  host: "master.db.example.com",
  user: "app_user",
  password: process.env.DB_WRITE_PASSWORD, // keep secrets out of source code
  connectionLimit: 20, // max concurrent connections
});

// Read pool (Slave)
const readPool = mysql.createPool({
  ...commonOptions,
  host: "slave.db.example.com",
  user: "app_readonly",
  password: process.env.DB_READ_PASSWORD,
  connectionLimit: 50, // reads get more connections
});

// Usage example
class DatabaseService {
  // pool.execute() acquires a connection, runs the prepared statement, and releases it
  static async write(query, params) {
    const [result] = await writePool.execute(query, params);
    return result;
  }

  static async read(query, params) {
    const [rows] = await readPool.execute(query, params);
    return rows;
  }
}

Caching Strategy

// Redis caching layer
const Redis = require("ioredis");

const redis = new Redis.Cluster(
  [
    { host: "redis-1.cache.example.com", port: 6379 },
    { host: "redis-2.cache.example.com", port: 6379 },
    { host: "redis-3.cache.example.com", port: 6379 },
  ],
  {
    // Cluster-level option: ms to wait before retrying a command after a failover
    retryDelayOnFailover: 100,
    redisOptions: {
      password: process.env.REDIS_PASSWORD, // assumed env var name; keep secrets out of source
      maxRetriesPerRequest: 3,
      enableReadyCheck: false,
    },
  }
);

class CacheService {
  // 1. Simple caching
  static async get(key) {
    try {
      const value = await redis.get(key);
      return value ? JSON.parse(value) : null;
    } catch (error) {
      console.error("Cache get error:", error);
      return null;
    }
  }

  static async set(key, value, ttl = 3600) {
    try {
      await redis.setex(key, ttl, JSON.stringify(value));
    } catch (error) {
      console.error("Cache set error:", error);
    }
  }

  // 2. Cache-Aside pattern
  static async getOrSet(key, fetcher, ttl = 3600) {
    let value = await this.get(key);

    if (value === null) {
      value = await fetcher();
      if (value !== null) {
        await this.set(key, value, ttl);
      }
    }

    return value;
  }

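  // 3. Distributed lock (concurrency control)
  static async withLock(lockKey, timeout, callback) {
    // A random token identifies this holder, so we never release a lock someone else now owns
    const lockValue = require("crypto").randomUUID();
    const acquired = await redis.set(lockKey, lockValue, "PX", timeout, "NX");

    if (acquired) {
      try {
        return await callback();
      } finally {
        // Release atomically with a Lua script, only if we still hold the lock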
        await redis.eval(
          `
                    if redis.call('GET', KEYS[1]) == ARGV[1] then
                        return redis.call('DEL', KEYS[1])
                    else
                        return 0
                    end
                `,
          1,
          lockKey,
          lockValue
        );
      }
    } else {
      throw new Error("Could not acquire lock");
    }
  }
}

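// Usage example
app.get("/api/users/:id", async (req, res) => {
  const userId = req.params.id;
  const cacheKey = `user:${userId}`;

  const user = await CacheService.getOrSet(
    cacheKey,
    async () => {
      const rows = await DatabaseService.read("SELECT * FROM users WHERE id = ?", [userId]);
      return rows[0] ?? null; // null is not cached, so a missing user is looked up again next time
    },
    1800 // cache for 30 minutes
  );

  if (!user) return res.status(404).json({ error: "User not found" });
  res.json(user);
});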
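
The cache-aside flow can be tried without a Redis cluster. The sketch below swaps Redis for an in-memory `Map` and injects a fake clock so expiry is deterministic; `MemoryCache`, `getOrSet`, and `loadUser` are illustrative names, and this is a teaching aid rather than a replacement for the Redis-backed service.

```javascript
// Cache-aside with an in-memory Map standing in for Redis (illustrative only).
class MemoryCache {
  constructor(now = () => Date.now()) {
    this.store = new Map();
    this.now = now; // injectable clock, so tests can control expiry
  }

  get(key) {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // expired: treat as a miss
      return null;
    }
    return entry.value;
  }

  set(key, value, ttlSeconds) {
    this.store.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

async function getOrSet(cache, key, fetcher, ttlSeconds) {
  const cached = cache.get(key);
  if (cached !== null) return cached; // hit: skip the source of truth
  const fresh = await fetcher(); // miss: load from the source of truth
  if (fresh !== null && fresh !== undefined) cache.set(key, fresh, ttlSeconds);
  return fresh;
}

// Demo: two lookups inside the TTL, then one after it has passed.
let clock = 0;
let fetcherCalls = 0;
const cache = new MemoryCache(() => clock);
const loadUser = async () => {
  fetcherCalls += 1;
  return { id: 1, name: "demo" };
};

const demo = (async () => {
  await getOrSet(cache, "user:1", loadUser, 60); // miss
  await getOrSet(cache, "user:1", loadUser, 60); // hit
  clock += 61000; // move the fake clock past the 60 s TTL
  await getOrSet(cache, "user:1", loadUser, 60); // expired, so a miss again
  return fetcherCalls;
})();

demo.then((calls) => console.log("fetcher calls:", calls));
```

Only the misses reach the fetcher, which is the whole point of the pattern: the database sees one query per key per TTL window instead of one per request.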

5. Security Design: Defense in Depth

Network Security

# Example AWS Security Group setup
# (sg-web-tier, sg-app-tier, and sg-db-tier stand in for real security group IDs)

## Web Tier Security Group
# Allow only HTTP/HTTPS from the internet
aws ec2 authorize-security-group-ingress \
    --group-id sg-web-tier \
    --protocol tcp \
    --port 80 \
    --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress \
    --group-id sg-web-tier \
    --protocol tcp \
    --port 443 \
    --cidr 0.0.0.0/0

## App Tier Security Group
# Allow access only from the Web Tier
aws ec2 authorize-security-group-ingress \
    --group-id sg-app-tier \
    --protocol tcp \
    --port 3000 \
    --source-group sg-web-tier

## DB Tier Security Group
# Allow access only from the App Tier
aws ec2 authorize-security-group-ingress \
    --group-id sg-db-tier \
    --protocol tcp \
    --port 3306 \
    --source-group sg-app-tier
Application Security

// Express.js security configuration
const express = require("express");
const helmet = require("helmet");
const rateLimit = require("express-rate-limit");
const cors = require("cors");

const app = express();

// 1. Basic security headers
app.use(
  helmet({
    contentSecurityPolicy: {
      directives: {
        defaultSrc: ["'self'"],
        styleSrc: ["'self'", "'unsafe-inline'"],
        scriptSrc: ["'self'"],
        imgSrc: ["'self'", "data:", "https:"],
      },
    },
    hsts: {
      maxAge: 31536000,
      includeSubDomains: true,
      preload: true,
    },
  })
);

// 2. CORS configuration
app.use(
  cors({
    origin: process.env.NODE_ENV === "production" ? ["https://myapp.com", "https://www.myapp.com"] : ["http://localhost:3000"],
    credentials: true,
    optionsSuccessStatus: 200,
  })
);

// 3. Rate Limiting
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // at most 100 requests per window per client
  message: {
    error: "Too many requests, please try again later.",
  },
  standardHeaders: true,
  legacyHeaders: false,
});

app.use("/api/", limiter);

// 4. Stricter limit for sensitive endpoints
const strictLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 5, // login: 5 attempts per 15 minutes (successful requests are not counted)
  skipSuccessfulRequests: true,
});

app.use("/api/auth/login", strictLimiter);

// 5. JWT verification
const jwt = require("jsonwebtoken");

const authenticateToken = (req, res, next) => {
  const authHeader = req.headers["authorization"];
  const token = authHeader && authHeader.split(" ")[1];

  if (!token) {
    return res.sendStatus(401);
  }

  // Pin the accepted algorithm (HS256 is an assumption; match however your tokens are signed)
  jwt.verify(token, process.env.JWT_SECRET, { algorithms: ["HS256"] }, (err, user) => {
    if (err) return res.sendStatus(403);
    req.user = user;
    next();
  });
};

// 6. SQL injection prevention (prepared statements)
// express.json() must be registered earlier for req.body to be populated
app.post("/api/users", authenticateToken, async (req, res) => {
  const { name, email } = req.body;

  // Validate input
  if (!name || !email || !email.includes("@")) {
    return res.status(400).json({ error: "Invalid input" });
  }

  try {
    // Use a prepared statement
    const result = await DatabaseService.write("INSERT INTO users (name, email) VALUES (?, ?)", [name, email]);
    res.json({ id: result.insertId });
  } catch (error) {
    console.error("Database error:", error);
    res.status(500).json({ error: "Internal server error" });
  }
});
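
To show what the `windowMs` and `max` settings mean in practice, here is a minimal fixed-window counter. It is a sketch of the general technique, not the implementation used by `express-rate-limit`; `createRateLimiter` and the injected `now` clock are illustrative, and a real deployment behind several servers would need a shared store.

```javascript
// Fixed-window request counter per client key (illustrative only).
function createRateLimiter({ windowMs, max, now = () => Date.now() }) {
  const windows = new Map(); // key -> { start, count }

  return function allow(key) {
    const time = now();
    const current = windows.get(key);

    // No window yet, or the previous one has ended: start a fresh window.
    if (!current || time - current.start >= windowMs) {
      windows.set(key, { start: time, count: 1 });
      return true;
    }

    if (current.count < max) {
      current.count += 1;
      return true;
    }
    return false; // over the limit until the window ends
  };
}

// Demo with a fake clock: 5 allowed, the 6th rejected, then allowed after the window.
let fakeNow = 0;
const allowLogin = createRateLimiter({ windowMs: 15 * 60 * 1000, max: 5, now: () => fakeNow });

const results = [];
for (let i = 0; i < 6; i++) results.push(allowLogin("203.0.113.7"));

fakeNow += 15 * 60 * 1000;
const afterReset = allowLogin("203.0.113.7");

console.log(results, afterReset);
```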

6. Monitoring and Logging

Monitoring with Prometheus + Grafana

# docker-compose.yml - monitoring stack
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.console.libraries=/etc/prometheus/console_libraries"
      - "--web.console.templates=/etc/prometheus/consoles"
      - "--storage.tsdb.retention.time=200h"
      - "--web.enable-lifecycle"

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
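      # Supply GRAFANA_ADMIN_PASSWORD from the shell or a .env file; never commit a real password
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}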
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/provisioning:/etc/grafana/provisioning

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)"

volumes:
  prometheus_data:
  grafana_data:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus's own metrics
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter (system metrics)
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  # Application metrics
  - job_name: "app"
    static_configs:
      - targets: ["app:3000"]
    metrics_path: "/metrics"
    scrape_interval: 5s

  # Database metrics
  - job_name: "mysql"
    static_configs:
      - targets: ["mysql-exporter:9104"]
Collecting Application Metrics

// Add Prometheus metrics to an Express.js application
const prometheus = require("prom-client");
const express = require("express");

const app = express();

// Collect default metrics (process and Node.js runtime)
prometheus.collectDefaultMetrics();

// Custom metric definitions
const httpRequestDuration = new prometheus.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status"],
});

const httpRequestTotal = new prometheus.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status"],
});

const activeConnections = new prometheus.Gauge({
  name: "active_connections",
  help: "Number of active connections",
});

const databaseConnectionPool = new prometheus.Gauge({
  name: "database_connection_pool_size",
  help: "Current database connection pool size",
  labelNames: ["pool_name", "state"],
});

// Metrics collection middleware
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();

  activeConnections.inc();

  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route ? req.route.path : req.path;

    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);

    httpRequestTotal.labels(req.method, route, res.statusCode).inc();

    activeConnections.dec();
  });

  next();
};

app.use(metricsMiddleware);

// Metrics endpoint
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", prometheus.register.contentType);
  // register.metrics() returns a Promise in current prom-client releases
  res.end(await prometheus.register.metrics());
});

// Update database metrics periodically
setInterval(() => {
  // getConnectionPoolStats() is a placeholder: implement it for your pool library
  const poolStats = getConnectionPoolStats();

  databaseConnectionPool.labels("main", "active").set(poolStats.active);

  databaseConnectionPool.labels("main", "idle").set(poolStats.idle);

  databaseConnectionPool.labels("main", "waiting").set(poolStats.waiting);
}, 30000); // update every 30 seconds
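
Prometheus histograms count observations in cumulative buckets: each bucket holds every observation less than or equal to its upper bound, and `+Inf` holds them all. The sketch below reproduces that counting rule in plain JavaScript so the shape of `http_request_duration_seconds` is easier to read on a dashboard. The bucket bounds are arbitrary example values and `createHistogram` is not a prom-client API.

```javascript
// Minimal cumulative histogram, mirroring how "le" buckets count (illustrative only).
function createHistogram(upperBounds) {
  const bounds = [...upperBounds].sort((a, b) => a - b);
  const bucketCounts = new Array(bounds.length).fill(0);
  let count = 0;
  let sum = 0;

  return {
    observe(value) {
      count += 1;
      sum += value;
      // Cumulative: every bucket whose upper bound is >= value is incremented.
      bounds.forEach((le, i) => {
        if (value <= le) bucketCounts[i] += 1;
      });
    },
    snapshot() {
      const buckets = {};
      bounds.forEach((le, i) => {
        buckets[String(le)] = bucketCounts[i];
      });
      buckets["+Inf"] = count;
      return { buckets, count, sum };
    },
  };
}

const histogram = createHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2].forEach((seconds) => histogram.observe(seconds));

const snapshot = histogram.snapshot();
console.log(snapshot.buckets);
```

Because the buckets are cumulative, the share of requests faster than 0.5 s is simply the `0.5` bucket divided by `+Inf`.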

Centralized Logging (ELK Stack)

# docker-compose.elk.yml
version: "3.8"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
    container_name: elasticsearch
    environment:
      - node.name=elasticsearch
      - cluster.name=es-docker-cluster
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:7.15.0
    container_name: logstash
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
      - ./logstash/pipeline:/usr/share/logstash/pipeline:ro
    ports:
      - "5044:5044"
      - "5000:5000/tcp"
      - "5000:5000/udp"
      - "9600:9600"
    environment:
      LS_JAVA_OPTS: "-Xmx1g -Xms1g"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:7.15.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_URL: http://elasticsearch:9200
      ELASTICSEARCH_HOSTS: '["http://elasticsearch:9200"]'
    depends_on:
      - elasticsearch

volumes:
  elasticsearch_data:

# logstash/pipeline/logstash.conf
input {
  beats {
    port => 5044
  }

  # Receive JSON logs directly
  tcp {
    port => 5000
    codec => json_lines
  }
}

filter {
  # Parse the timestamp
  date {
    match => [ "timestamp", "ISO8601" ]
  }

  # Tag by log level
  if [level] == "ERROR" {
    mutate {
      add_tag => [ "error" ]
    }
  }

  # Parse the User-Agent
  if [user_agent] {
    useragent {
      source => "user_agent"
      target => "ua"
    }
  }

  # Add GeoIP information
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }

  # Error logs also go to a separate index
  if "error" in [tags] {
    elasticsearch {
      hosts => ["elasticsearch:9200"]
      index => "error-logs-%{+YYYY.MM.dd}"
    }
  }

  stdout { codec => rubydebug }
}

7. Building a CI/CD Pipeline

GitHub Actions Example

# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: "18"
  AWS_REGION: "ap-northeast-2"

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      mysql:
        image: mysql:8.0
        env:
          MYSQL_ROOT_PASSWORD: test_password
          MYSQL_DATABASE: test_db
        ports:
          - 3306:3306
        options: --health-cmd="mysqladmin ping" --health-interval=10s --health-timeout=5s --health-retries=3

      redis:
        image: redis:7
        ports:
          - 6379:6379
        options: --health-cmd="redis-cli ping" --health-interval=10s --health-timeout=5s --health-retries=3

    steps:
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"

      - name: Install dependencies
        run: npm ci

      - name: Run linting
        run: npm run lint

      - name: Run type checking
        run: npm run type-check

      - name: Run unit tests
        run: npm run test:unit
        env:
          NODE_ENV: test

      - name: Run integration tests
        run: npm run test:integration
        env:
          NODE_ENV: test
          DB_HOST: localhost
          DB_PORT: 3306
          DB_NAME: test_db
          DB_USER: root
          DB_PASSWORD: test_password
          REDIS_URL: redis://localhost:6379

      - name: Run security audit
        run: npm audit --audit-level high

      - name: Upload coverage reports
        uses: codecov/codecov-action@v3

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'

    steps:
      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build and push Docker image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          ECR_REPOSITORY: myapp-backend
          IMAGE_TAG: ${{ github.sha }}
        run: |
          # Build once and apply both tags
          docker build \
            -t "$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" \
            -t "$ECR_REGISTRY/$ECR_REPOSITORY:latest" .

          # Push to ECR
          docker push "$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG"
          docker push "$ECR_REGISTRY/$ECR_REPOSITORY:latest"

          # Values written to $GITHUB_ENV are visible only to later steps of this job;
          # other jobs can rebuild the tag from github.sha
          echo "IMAGE_URI=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> "$GITHUB_ENV"

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging

    steps:
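      - uses: actions/checkout@v3

      # Each job runs on a fresh runner, so AWS credentials must be configured again
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Deploy to staging
        run: |
          # Update ECS service
          aws ecs update-service \
            --cluster staging-cluster \
            --service myapp-backend \
            --force-new-deployment

      - name: Wait for deployment
        run: |
          aws ecs wait services-stable \
            --cluster staging-cluster \
            --services myapp-backend

      - name: Run smoke tests
        run: |
          # Basic health check
          curl -f https://staging-api.example.com/health || exit 1

          # Key API tests
          npm run test:smoke -- --baseUrl=https://staging-api.example.com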

  deploy-production:
    needs: [build, deploy-staging]
    runs-on: ubuntu-latest
    environment: production
    if: github.ref == 'refs/heads/main'

    steps:
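      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Blue/Green deployment
        run: |
          # Find which target group the listener currently forwards to
          ACTIVE_TG_ARN=$(aws elbv2 describe-listeners \
            --listener-arns "${{ secrets.PROD_LISTENER_ARN }}" \
            --query 'Listeners[0].DefaultActions[0].TargetGroupArn' \
            --output text)

          if [ "$ACTIVE_TG_ARN" = "${{ secrets.BLUE_TG_ARN }}" ]; then
            NEW_COLOR="green"
            NEW_TG_ARN="${{ secrets.GREEN_TG_ARN }}"
            OLD_TG_ARN="${{ secrets.BLUE_TG_ARN }}"
          else
            NEW_COLOR="blue"
            NEW_TG_ARN="${{ secrets.BLUE_TG_ARN }}"
            OLD_TG_ARN="${{ secrets.GREEN_TG_ARN }}"
          fi

          echo "Deploying to $NEW_COLOR environment"

          # Deploy to the idle environment
          aws ecs update-service \
            --cluster production-cluster \
            --service "myapp-backend-$NEW_COLOR" \
            --force-new-deployment

          # Wait for the deployment to finish
          aws ecs wait services-stable \
            --cluster production-cluster \
            --services "myapp-backend-$NEW_COLOR"

          # Health check (fail the job if it never passes)
          # This URL still resolves to the live environment; use one that reaches $NEW_COLOR directly
          HEALTHY=false
          for i in {1..30}; do
            if curl -f https://api.example.com/health; then
              echo "Health check passed"
              HEALTHY=true
              break
            fi
            echo "Health check failed, retrying..."
            sleep 10
          done
          [ "$HEALTHY" = "true" ] || { echo "Health check never passed"; exit 1; }

          # Switch traffic
          aws elbv2 modify-listener \
            --listener-arn "${{ secrets.PROD_LISTENER_ARN }}" \
            --default-actions "Type=forward,TargetGroupArn=$NEW_TG_ARN"

          echo "Traffic switched to $NEW_COLOR"

      - name: Post-deployment verification
        run: |
          # Monitor for 5 minutes
          sleep 300

          # Check the error rate (Prometheus query); the runner must be able to reach this host
          QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
          ERROR_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1] // "0"')

          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "High error rate detected: $ERROR_RATE"
            # Rollback is not automated here: failing the job flags the release for a manual switch back
            exit 1
          fi

          echo "Deployment successful!"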

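  notify:
    needs: [deploy-production]
    runs-on: ubuntu-latest
    # Run even when the deploy failed, but not when it was skipped (e.g. on pull requests)
    if: always() && needs.deploy-production.result != 'skipped'

    steps:
      - name: Notify Slack
        uses: 8398a7/action-slack@v3
        with:
          # job.status would report this notify job itself, so use the deploy job's result
          status: ${{ needs.deploy-production.result }}
          channel: "#deployments"
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}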
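
The two decisions inside the production job, which color to deploy to and whether the error rate warrants a rollback, are easier to reason about as pure functions. The sketch below is a hedged illustration: `planBlueGreen` and `shouldRollback` are hypothetical helpers, the ARNs are shortened placeholders, and the 1% threshold simply mirrors the value used in the workflow.

```javascript
// Decide the next blue/green target from the target group the listener currently uses.
function planBlueGreen(activeTgArn, blueTgArn, greenTgArn) {
  if (activeTgArn === blueTgArn) {
    return { newColor: "green", newTgArn: greenTgArn, oldTgArn: blueTgArn };
  }
  if (activeTgArn === greenTgArn) {
    return { newColor: "blue", newTgArn: blueTgArn, oldTgArn: greenTgArn };
  }
  // Refuse to guess when the listener points somewhere unexpected.
  throw new Error("Listener forwards to an unknown target group");
}

// Roll back when the share of 5xx responses exceeds the threshold.
function shouldRollback(errorCount, totalCount, threshold = 0.01) {
  if (totalCount === 0) return false; // no traffic yet, nothing to judge
  return errorCount / totalCount > threshold;
}

const plan = planBlueGreen("arn:blue", "arn:blue", "arn:green");
console.log(plan.newColor); // green
console.log(shouldRollback(5, 1000)); // false (0.5%)
console.log(shouldRollback(20, 1000)); // true (2%)
```

Keeping the old target group ARN in the plan makes a manual switch back a single `modify-listener` call.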

8. Disaster Recovery and Backup Strategy

Automating Database Backups

#!/bin/bash
# db_backup.sh - database backup script

set -euo pipefail

# Configuration
DB_HOST="prod-db.example.com"
DB_USER="backup_user"
DB_PASSWORD="${DB_PASSWORD:-$(cat /etc/mysql/backup_password)}"
DB_NAME="production_db"
BACKUP_DIR="/backups/mysql"
S3_BUCKET="my-company-db-backups"
RETENTION_DAYS=30

# Backup file name (includes a timestamp)
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${DB_NAME}_${TIMESTAMP}.sql.gz"
BACKUP_PATH="${BACKUP_DIR}/${BACKUP_FILE}"

# Logging helper
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a /var/log/db_backup.log
}

# Create the backup directory
mkdir -p "$BACKUP_DIR"

log "Starting database backup for $DB_NAME"

# Create a compressed MySQL dump
mysqldump \
    --host="$DB_HOST" \
    --user="$DB_USER" \
    --password="$DB_PASSWORD" \
    --single-transaction \
    --routines \
    --triggers \
    --quick \
    --lock-tables=false \
    "$DB_NAME" | gzip > "$BACKUP_PATH"

# Check the backup file size
BACKUP_SIZE=$(ls -lh "$BACKUP_PATH" | awk '{print $5}')
log "Backup completed. File size: $BACKUP_SIZE"

# Upload to S3 (tested inside "if", because set -e would exit before a separate $? check)
if aws s3 cp "$BACKUP_PATH" "s3://$S3_BUCKET/mysql/$(date +%Y/%m/%d)/"; then
    log "Backup uploaded to S3 successfully"
else
    log "ERROR: Failed to upload backup to S3"
    exit 1
fi

# Clean up local files (delete backups older than 7 days)
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +7 -delete

# Clean up old backups in S3 (an S3 lifecycle rule is a simpler alternative)
CUTOFF=$(date -d "-$RETENTION_DAYS days" +%s)
aws s3 ls "s3://$S3_BUCKET/mysql/" --recursive |
    while read -r file_date file_time _size file_name; do
        created=$(date -d "$file_date $file_time" +%s)
        if [[ $created -lt $CUTOFF ]]; then
            aws s3 rm "s3://$S3_BUCKET/$file_name"
            log "Deleted old backup: $file_name"
        fi
    done

log "Database backup process completed"

# Slack notification (success)
curl -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"✅ Database backup completed successfully\\nFile: $BACKUP_FILE\\nSize: $BACKUP_SIZE\"}" \
    "${SLACK_WEBHOOK_URL:?SLACK_WEBHOOK_URL must be set}"

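The retention rule in the backup script (delete anything older than `RETENTION_DAYS`) is worth testing separately, because a mistake there deletes the wrong backups. Below is a small sketch of that rule as a pure function; `selectExpiredBackups` and the shape of the listing objects are assumptions made for illustration, not the output format of any AWS SDK call.

```javascript
// Pick backup objects older than the retention window (illustrative only).
function selectExpiredBackups(objects, retentionDays, now) {
  const cutoffMs = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return objects
    .filter((object) => new Date(object.lastModified).getTime() < cutoffMs)
    .map((object) => object.key);
}

// Hypothetical listing; timestamps are UTC.
const listing = [
  { key: "mysql/2024/02/28/production_db_20240228_030000.sql.gz", lastModified: "2024-02-28T03:00:05Z" },
  { key: "mysql/2024/03/15/production_db_20240315_030000.sql.gz", lastModified: "2024-03-15T03:00:05Z" },
  { key: "mysql/2024/03/30/production_db_20240330_030000.sql.gz", lastModified: "2024-03-30T03:00:05Z" },
];

const expired = selectExpiredBackups(listing, 30, new Date("2024-03-31T00:00:00Z"));
console.log(expired);
```

With a 30-day window evaluated on 31 March, only the 28 February backup falls before the cutoff.
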
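Disaster Recovery Scenarios

#!/bin/bash
# disaster_recovery.sh - disaster recovery script

set -eo pipefail

# 1. Restore the database
# backup_date is the timestamp in the backup file name, e.g. 20240115_030000
restore_database() {
    local backup_date=$1
    local backup_file="production_db_${backup_date}.sql.gz"
    # db_backup.sh uploads to mysql/YYYY/MM/DD/, so rebuild that prefix from the timestamp
    local s3_prefix="mysql/${backup_date:0:4}/${backup_date:4:2}/${backup_date:6:2}"

    echo "Restoring database from $backup_file"

    # Download the backup file from S3
    aws s3 cp "s3://my-company-db-backups/$s3_prefix/$backup_file" /tmp/

    # Restore the database
    gunzip -c "/tmp/$backup_file" | mysql \
        --host="$RECOVERY_DB_HOST" \
        --user="$DB_ADMIN_USER" \
        --password="$DB_ADMIN_PASSWORD" \
        "$DB_NAME"

    echo "Database restore completed"
}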

# 2. Deploy the application
deploy_from_backup() {
    local git_commit=$1

    echo "Deploying application from commit $git_commit"

    # Roll back to the earlier version
    kubectl set image deployment/myapp-backend \
        backend="$ECR_REGISTRY/myapp-backend:$git_commit"

    # Wait for the rollout to finish
    kubectl rollout status deployment/myapp-backend --timeout=600s

    echo "Application deployment completed"
}

# 3. Switch traffic
switch_traffic() {
    local target_cluster=$1

    echo "Switching traffic to $target_cluster"

    # Update the DNS record (Route53)
    aws route53 change-resource-record-sets \
        --hosted-zone-id "$HOSTED_ZONE_ID" \
        --change-batch "file://dns-change-$target_cluster.json"

    echo "Traffic switch completed"
}

# 4. Health check
health_check() {
    local endpoint=$1
    local max_attempts=30
    local attempt=1

    while [ $attempt -le $max_attempts ]; do
        if curl -f "$endpoint/health" > /dev/null 2>&1; then
            echo "Health check passed"
            return 0
        fi

        echo "Health check failed (attempt $attempt/$max_attempts)"
        sleep 10
        ((attempt++))
    done

    echo "Health check failed after $max_attempts attempts"
    return 1
}

# Main recovery process
main() {
    local recovery_type=$1

    case $recovery_type in
        "database")
            restore_database "$2"
            ;;
        "application")
            deploy_from_backup "$2"
            ;;
        "full")
            restore_database "$2"
            deploy_from_backup "$3"
            switch_traffic "backup"
            health_check "https://api.example.com"
            ;;
        *)
            echo "Usage: $0 {database|application|full} [backup_date] [git_commit]"
            exit 1
            ;;
    esac
}

main "$@"
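
The same retry-until-healthy loop is often needed from application code or a deploy tool. Here is a minimal sketch in JavaScript; `waitForHealthy` is a hypothetical helper, the probe and sleep functions are injected so the example runs instantly, and the defaults (30 attempts, 10 seconds apart) mirror the shell function.

```javascript
// Retry a health probe a fixed number of times (illustrative only).
async function waitForHealthy(check, options = {}) {
  const {
    maxAttempts = 30,
    delayMs = 10000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await check()) return attempt; // report how many attempts it took
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  throw new Error(`Health check failed after ${maxAttempts} attempts`);
}

// Demo: a fake probe that becomes healthy on the third call, with a no-op sleep.
let probeCalls = 0;
const fakeProbe = async () => {
  probeCalls += 1;
  return probeCalls >= 3;
};

const healthDemo = waitForHealthy(fakeProbe, { maxAttempts: 5, sleep: async () => {} });
healthDemo.then((attempts) => console.log("healthy after attempts:", attempts));
```

In real use, `check` would wrap an HTTP request to the `/health` endpoint and return `true` only on a successful response.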

9. Performance Optimization Strategy

Database Optimization

-- Index optimization examples
-- 1. Index frequently queried columns
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_id_created_at ON orders(user_id, created_at);

-- 2. Composite index (column order matters!)
-- WHERE user_id = ? AND status = ? ORDER BY created_at DESC
CREATE INDEX idx_orders_composite ON orders(user_id, status, created_at);

-- 3. Partial (conditional) index
-- PostgreSQL syntax; MySQL does not support a WHERE clause on CREATE INDEX
CREATE INDEX idx_active_users ON users(id) WHERE status = 'active';

-- 4. Query optimization
-- BEFORE: N+1 problem
SELECT * FROM orders;
-- Then, for every order: SELECT * FROM users WHERE id = ?

-- AFTER: use a JOIN
SELECT o.*, u.name, u.email
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.created_at >= '2024-01-01';

-- 5. Pagination optimization
-- BEFORE: OFFSET (slow, because the skipped rows are still read)
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 10000;

-- AFTER: cursor-based (keyset) pagination; 10000 is the last id from the previous page
SELECT * FROM products WHERE id > 10000 ORDER BY id LIMIT 20;
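
On the application side, cursor-based pagination means passing the last `id` of each page back as the starting point of the next. The sketch below walks through that loop against an in-memory array that stands in for the `products` table; `fetchPage` is a hypothetical helper and the 45-row data set is made up.

```javascript
// Keyset pagination over an in-memory table sorted by id (illustrative only).
const products = Array.from({ length: 45 }, (_, i) => ({ id: i + 1, name: `product-${i + 1}` }));

// Equivalent of: SELECT * FROM products WHERE id > ? ORDER BY id LIMIT ?
function fetchPage(rows, afterId, limit) {
  const page = rows.filter((row) => row.id > afterId).slice(0, limit);
  // A full page may have more rows after it; a short page is the last one.
  const nextCursor = page.length === limit ? page[page.length - 1].id : null;
  return { page, nextCursor };
}

const pageSizes = [];
let cursor = 0;
while (cursor !== null) {
  const { page, nextCursor } = fetchPage(products, cursor, 20);
  pageSizes.push(page.length);
  cursor = nextCursor;
}

console.log(pageSizes); // [ 20, 20, 5 ]
```

The cursor is opaque to the client: the API returns `nextCursor` with each page and the client sends it back unchanged.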

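Application-Level Optimization

// 1. Database connection pool tuning (mysql2 option names)
const poolConfig = {
  connectionLimit: 20,
  waitForConnections: true,
  connectTimeout: 10000, // ms to establish a new connection
  // Connection reuse
  maxIdle: 10, // keep at most 10 idle connections
  idleTimeout: 300000, // close connections idle for 5 minutes
  enableKeepAlive: true,
  // Deadlocks are detected by InnoDB on the server, not by the pool
};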

// 2. Caching strategy
// memoryCache, redisCache, and db are assumed helpers: an in-process cache,
// a Redis wrapper such as CacheService, and a mysql2/promise pool
class ProductService {
  static async getProduct(id) {
    const cacheKey = `product:${id}`;

    // Level 1: in-memory cache (fastest)
    let product = memoryCache.get(cacheKey);
    if (product) return product;

    // Level 2: Redis cache
    product = await redisCache.get(cacheKey);
    if (product) {
      memoryCache.set(cacheKey, product, 300); // 5 minutes
      return product;
    }

    // Level 3: database
    const [rows] = await db.query("SELECT * FROM products WHERE id = ?", [id]);
    product = rows[0] ?? null;

    if (product) {
      // Store in both caches (Redis TTL: 1 hour)
      await redisCache.set(cacheKey, product, 3600);
      memoryCache.set(cacheKey, product, 300);
    }

    return product;
  }

  // Cache invalidation
  static async updateProduct(id, data) {
    const [result] = await db.query("UPDATE products SET ? WHERE id = ?", [data, id]);

    // Invalidate both cache levels
    const cacheKey = `product:${id}`;
    memoryCache.del(cacheKey);
    await redisCache.del(cacheKey);

    return result;
  }
}

// 3. Asynchronous processing
const processOrder = async (orderData) => {
  try {
    // 1. Create the order first; everything else depends on it
    const order = await Order.create(orderData);

    // 2. Run the follow-up tasks concurrently
    await Promise.all([
      // Deduct inventory
      updateInventory(orderData.items),
      // Send email (enqueue a job)
      emailQueue.add("order-confirmation", {
        orderId: order.id,
        userEmail: orderData.email,
      }),
      // Send analytics data
      analyticsQueue.add("order-event", {
        event: "order_created",
        orderId: order.id,
        amount: orderData.total,
      }),
    ]);

    return order;
  } catch (error) {
    // Compensating transaction
    await rollbackOrder(orderData);
    throw error;
  }
};

// 4. Image optimization service
const sharp = require("sharp");

const optimizeImage = async (inputBuffer, options = {}) => {
  const { width = 800, height = 600, quality = 80, format = "webp" } = options;

  try {
    const optimized = await sharp(inputBuffer)
      .resize(width, height, {
        fit: "inside",
        withoutEnlargement: true,
      })
      .toFormat(format, { quality })
      .toBuffer();

    return optimized;
  } catch (error) {
    console.error("Image optimization failed:", error);
    throw error;
  }
};

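// 5. Request batching (DataLoader pattern)
const DataLoader = require("dataloader");

const userLoader = new DataLoader(async (userIds) => {
  // Fetch many users in one query (db is assumed to be a mysql2/promise pool)
  const [users] = await db.query("SELECT * FROM users WHERE id IN (?)", [userIds]);

  // Return results in the same order as the requested IDs
  const byId = new Map(users.map((user) => [user.id, user]));
  return userIds.map((id) => byId.get(id) || null);
});

// Usage example (wrapped in a function because CommonJS has no top-level await)
async function loadOrdersWithUsers() {
  const orders = await Order.findAll();
  return Promise.all(
    orders.map(async (order) => ({
      ...order,
      user: await userLoader.load(order.user_id), // batched into one query
    }))
  );
}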
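
To see why the individual `load()` calls collapse into a single query, here is a stripped-down batching loader. It is only a sketch in the spirit of DataLoader (no caching, no deduplication, no batch size limit), and `createBatchLoader` plus the fake user table are illustrative names.

```javascript
// Minimal batching loader in the spirit of DataLoader (illustrative only).
function createBatchLoader(batchFn) {
  let pending = [];

  function flush() {
    const batch = pending;
    pending = [];
    batchFn(batch.map((item) => item.key))
      .then((values) => batch.forEach((item, i) => item.resolve(values[i])))
      .catch((error) => batch.forEach((item) => item.reject(error)));
  }

  return function load(key) {
    return new Promise((resolve, reject) => {
      pending.push({ key, resolve, reject });
      // Schedule one flush to run after the current synchronous code finishes.
      if (pending.length === 1) queueMicrotask(flush);
    });
  };
}

// Fake "database" lookup that records how many times it was called.
let batchCalls = 0;
const userTable = new Map([
  [1, { id: 1, name: "Kim" }],
  [2, { id: 2, name: "Lee" }],
]);
const loadUser = createBatchLoader(async (ids) => {
  batchCalls += 1;
  return ids.map((id) => userTable.get(id) || null);
});

const batchDemo = Promise.all([loadUser(1), loadUser(2), loadUser(3)]).then((found) => ({
  names: found.map((user) => (user ? user.name : null)),
  batchCalls,
}));

batchDemo.then((result) => console.log(result));
```

All three `loadUser` calls are queued before the microtask runs, so the batch function is invoked once with `[1, 2, 3]`.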

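Wrapping Up: Core Principles of Production Design

Checklist Summary

📋 Design phase

[ ] Capacity planning: analyze expected user counts and traffic patterns
[ ] Tech stack selection: choose technology that fits the team's skills and the requirements
[ ] Architecture design: a structure that is scalable and maintainable
[ ] Security design: establish a defense-in-depth strategy

🚀 Implementation phase

[ ] CI/CD pipeline: automated build, test, and deployment
[ ] Monitoring system: metrics, logs, and alerting
[ ] Backup strategy: regular backups and a recovery procedure
[ ] Performance optimization: database, caching, asynchronous processing

🔧 Operations phase

[ ] Incident response: fast detection and recovery
[ ] Scaling plan: how to respond to traffic growth
[ ] Security maintenance: regular security reviews and updates
[ ] Cost optimization: monitor and optimize resource usage

Growth Roadmap by Stage

Junior developer (year 1):

Deploy to a single server
Set up basic monitoring
Understand the manual deployment process

Mid-level developer (years 2-3):

Configure a load balancer with multiple servers
Build a CI/CD pipeline
Gain experience with database optimization

Senior developer (year 4+):

Design a microservices architecture
Build a cloud-native environment
Gain experience handling large-scale traffic

Final Advice

There is no perfect architecture. What matters is building a “good enough” design for today's requirements and evolving it incrementally as needs change.

Start small, but build to scale. And always put the user experience first! 🎯

Next time, we'll go deeper into the DevOps journey with “A Practical Guide to CI/CD and Automated Deployment.” Stay tuned!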