# Modern Monitoring with Prometheus and Grafana

In modern microservice architectures, an effective monitoring system is critical for tracking application health and proactively detecting potential problems. In this comprehensive guide you will learn how to build a professional monitoring and alerting stack with Prometheus and Grafana.
## Table of Contents

- [Monitoring Architecture and Principles](#monitoring-mimarisi)
- [Prometheus Installation and Configuration](#prometheus-kurulum)
- [Metrics Collection and Service Discovery](#metrics-toplama)
- [Grafana Dashboards](#grafana-dashboards)
- [AlertManager Configuration](#alertmanager-config)
- [Best Practices and Optimization](#best-practices)
- [Troubleshooting and Debugging](#troubleshooting)
## Monitoring Architecture and Principles {#monitoring-mimarisi}

### The Four Golden Signals

According to Google's SRE methodology, there are four core signals that should be monitored for every service:
1. **Latency**
   - Request processing time
   - P50, P95, and P99 percentiles
   - Separate analysis of successful and failed requests
2. **Traffic**
   - Requests per second (RPS)
   - Number of concurrent connections
   - Bandwidth usage
3. **Errors**
   - Error rates and types
   - HTTP status code distribution
   - Business-logic errors
4. **Saturation**
   - CPU, memory, disk, and network utilization
   - Queue depth and connection pool state
   - Thread pool and goroutine counts
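Assuming the `http_requests_total` and `http_duration_seconds` metrics instrumented later in this guide, the four signals map onto PromQL roughly as follows (a sketch, not production-hardened queries):

```promql
# Latency: P95 request duration over 5m windows
histogram_quantile(0.95, sum by (le) (rate(http_duration_seconds_bucket[5m])))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of 5xx responses
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: e.g. node CPU busy percentage (requires node_exporter)
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```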
### Monitoring Stack Architecture

```yaml
# monitoring-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
```
## Prometheus Installation and Configuration {#prometheus-kurulum}

### Prometheus Server Deployment
```yaml
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  # A single replica: two Prometheus pods cannot share one TSDB volume.
  # For HA, run two independent instances (or use Thanos/Cortex) instead.
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.45.0
          ports:
            - containerPort: 9090
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus/'
            - '--storage.tsdb.retention.time=15d'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-storage
              mountPath: /prometheus
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-storage
          persistentVolumeClaim:
            claimName: prometheus-pvc
```
### Prometheus Configuration
```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'production-k8s'
        region: 'us-west-2'

    rule_files:
      - "alerts/*.yml"

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name

      - job_name: 'kubernetes-services'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
```
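The `kubernetes-pods` job only scrapes pods that opt in through annotations. A minimal sketch of pod metadata the relabel rules would pick up (the port number is illustrative):

```yaml
# Pod template metadata; these annotation keys are exactly what the
# relabel_configs above match on.
metadata:
  labels:
    app: webapp
  annotations:
    prometheus.io/scrape: "true"   # keep rule: only annotated pods are scraped
    prometheus.io/path: "/metrics" # overrides __metrics_path__
    prometheus.io/port: "8080"     # rewrites __address__ to <pod-ip>:8080
```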
## Metrics Collection and Service Discovery {#metrics-toplama}

### Application Metrics

Example metrics instrumentation for a Go application:
```go
// metrics.go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status_code"},
	)
	httpDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
	activeConnections = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
	queueSize = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "queue_size",
			Help: "Size of various queues",
		},
		[]string{"queue_name"},
	)
)

func init() {
	prometheus.MustRegister(httpRequests, httpDuration, activeConnections, queueSize)
}

// responseWriter wraps http.ResponseWriter to capture the status code
// written by downstream handlers.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (w *responseWriter) WriteHeader(code int) {
	w.statusCode = code
	w.ResponseWriter.WriteHeader(code)
}

func prometheusMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Wrap the ResponseWriter to capture the status code.
		ww := &responseWriter{ResponseWriter: w, statusCode: 200}
		next.ServeHTTP(ww, r)
		duration := time.Since(start).Seconds()
		httpDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
		httpRequests.WithLabelValues(r.Method, r.URL.Path,
			fmt.Sprintf("%d", ww.statusCode)).Inc()
	})
}

// Stub handlers; replace with your real application routes.
func healthHandler(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) }
func usersHandler(w http.ResponseWriter, r *http.Request)  { w.Write([]byte("[]")) }

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/api/health", healthHandler)
	http.HandleFunc("/api/users", usersHandler)
	log.Fatal(http.ListenAndServe(":8080", prometheusMiddleware(http.DefaultServeMux)))
}
```
### ServiceMonitor Configuration

```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-metrics
  namespace: monitoring
  labels:
    app: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - production
      - staging
```
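For this ServiceMonitor to match anything, the target Service needs the `app: webapp` label and a port named `metrics`. A minimal sketch of such a Service (the port number is an assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: production
  labels:
    app: webapp          # matched by spec.selector above
spec:
  selector:
    app: webapp
  ports:
    - name: metrics      # matched by endpoints[].port above
      port: 8080
      targetPort: 8080
```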
## Grafana Dashboards {#grafana-dashboards}

### Grafana Deployment
```yaml
# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.1.0
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: "grafana-piechart-panel,grafana-worldmap-panel"
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
            - name: grafana-config
              mountPath: /etc/grafana/provisioning
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: grafana-config
          configMap:
            name: grafana-config
```
### Dashboard Provisioning
```yaml
# grafana-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  # NOTE: Grafana expects datasources.yml under provisioning/datasources/,
  # dashboards.yml under provisioning/dashboards/, and the dashboard JSON
  # under the provider's `path` — mount each key into the right
  # subdirectory (e.g. with `items` + `subPath`) rather than flat.
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
  dashboards.yml: |
    apiVersion: 1
    providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        updateIntervalSeconds: 10
        allowUiUpdates: true
        options:
          path: /var/lib/grafana/dashboards
  # File-provisioned dashboards are the bare dashboard object,
  # not the {"dashboard": ...} envelope used by the HTTP API.
  kubernetes-overview.json: |
    {
      "id": null,
      "title": "Kubernetes Cluster Overview",
      "tags": ["kubernetes"],
      "timezone": "browser",
      "panels": [
        {
          "id": 1,
          "title": "Cluster CPU Usage",
          "type": "stat",
          "targets": [
            {
              "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
              "refId": "A"
            }
          ],
          "fieldConfig": {
            "defaults": {"unit": "percent", "min": 0, "max": 100}
          },
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
        },
        {
          "id": 2,
          "title": "Memory Usage",
          "type": "stat",
          "targets": [
            {
              "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
              "refId": "A"
            }
          ],
          "fieldConfig": {
            "defaults": {"unit": "percent", "min": 0, "max": 100}
          },
          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
        }
      ]
    }
```
## AlertManager Configuration {#alertmanager-config}

### AlertManager Deployment
```yaml
# alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.26.0
          ports:
            - containerPort: 9093
          args:
            - '--config.file=/etc/alertmanager/alertmanager.yml'
            - '--storage.path=/alertmanager'
            # For real HA the replicas must also gossip: add --cluster.peer=
            # flags pointing at each other (typically via a StatefulSet).
            - '--cluster.listen-address=0.0.0.0:9094'
            - '--web.external-url=http://alertmanager.monitoring.svc.cluster.local:9093'
          volumeMounts:
            - name: alertmanager-config
              mountPath: /etc/alertmanager
            - name: alertmanager-storage
              mountPath: /alertmanager
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
      volumes:
        - name: alertmanager-config
          configMap:
            name: alertmanager-config
        - name: alertmanager-storage
          emptyDir: {}
```
### Alert Rules
```yaml
# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  # Mount this ConfigMap at /etc/prometheus/alerts so the key matches the
  # rule_files glob "alerts/*.yml" in prometheus.yml.
  kubernetes.yml: |
    groups:
      - name: kubernetes.rules
        rules:
          - alert: KubernetesNodeReady
            expr: kube_node_status_condition{condition="Ready",status="true"} == 0
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes node not ready (instance {{ $labels.instance }})
              description: "Node {{ $labels.node }} has been unready for more than 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
          - alert: KubernetesMemoryPressure
            expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes memory pressure (instance {{ $labels.instance }})
              description: "{{ $labels.node }} has the MemoryPressure condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
          - alert: KubernetesPodCrashLooping
            expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
              description: "Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
          - alert: KubernetesContainerOomKiller
            expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes container OOM-killed (instance {{ $labels.instance }})
              description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} time(s) in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - name: application.rules
        rules:
          - alert: HighErrorRate
            expr: rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: High error rate detected
              description: "Error rate is {{ $value }} errors per second"
          - alert: HighLatency
            expr: histogram_quantile(0.95, rate(http_duration_seconds_bucket[5m])) > 0.5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: High latency detected
              description: "95th percentile latency is {{ $value }}s"
```
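Alert expressions are easy to get subtly wrong, and `promtool` can unit-test them offline before they ever fire in production. A minimal sketch (file names and the sample series are illustrative; run with `promtool test rules alerts_test.yml`):

```yaml
# alerts_test.yml
rule_files:
  - kubernetes.yml          # the rule group defined above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Node reports Ready=false (condition series stuck at 0) for 16 minutes
      - series: 'kube_node_status_condition{condition="Ready",status="true",node="node-1"}'
        values: '0x15'
    alert_rule_test:
      # After 12m the "for: 10m" window has elapsed, so the alert fires
      - eval_time: 12m
        alertname: KubernetesNodeReady
        exp_alerts:
          - exp_labels:
              severity: critical
              condition: Ready
              status: "true"
              node: node-1
```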
### AlertManager Configuration File
```yaml
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alerts@tektik.com'
      smtp_auth_username: 'alerts@tektik.com'
      smtp_auth_password: 'app_password'  # inject via a Secret in production

    templates:
      - '/etc/alertmanager/templates/*.tmpl'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 5m   # 10s would flood receivers; 5m is a typical value
      repeat_interval: 1h
      receiver: web.hook
      routes:
        - match:
            severity: critical
          receiver: critical-alerts
          continue: true
        - match:
            severity: warning
          receiver: warning-alerts

    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://webhook-service:5000/alerts'
            send_resolved: true
      - name: 'critical-alerts'
        email_configs:
          - to: 'devops-team@tektik.com'
            subject: 'CRITICAL: {{ .GroupLabels.alertname }} - {{ .GroupLabels.cluster }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
              {{ end }}
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts-critical'
            title: 'Critical Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'warning-alerts'
        email_configs:
          - to: 'devops-team@tektik.com'
            subject: 'WARNING: {{ .GroupLabels.alertname }} - {{ .GroupLabels.cluster }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              {{ end }}

    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'cluster', 'service']
```
## Best Practices and Optimization {#best-practices}

### Metric Design Principles
```promql
# ✅ Good metric design: bounded label values
http_requests_total{method="GET", endpoint="/api/users", status_code="200"}
http_request_duration_seconds{method="GET", endpoint="/api/users"}

# ❌ Bad metric design: unbounded cardinality (one series per user)
http_requests_total{method="GET", endpoint="/api/users/123", user_id="123"}
```
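High-cardinality label values usually come from embedding IDs in the URL path. One way to keep the `endpoint` label bounded in the Go middleware shown earlier is to collapse ID-like path segments before using the path as a label value (a sketch; `normalizePath` is an illustrative helper, not part of any library):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numericSegment matches path segments that are purely numeric IDs.
var numericSegment = regexp.MustCompile(`^[0-9]+$`)

// normalizePath collapses numeric path segments into ":id" so that
// /api/users/123 and /api/users/456 produce a single label value.
func normalizePath(path string) string {
	segments := strings.Split(path, "/")
	for i, s := range segments {
		if numericSegment.MatchString(s) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	fmt.Println(normalizePath("/api/users/123"))        // /api/users/:id
	fmt.Println(normalizePath("/api/users/123/orders")) // /api/users/:id/orders
}
```

In the middleware, calling `httpDuration.WithLabelValues(r.Method, normalizePath(r.URL.Path))` then records at most one series per route template instead of one per user.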
### Storage Optimization
```yaml
# prometheus-storage-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-storage-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s   # longer default interval to reduce ingestion
      evaluation_interval: 30s

    scrape_configs:
      # Per-job intervals: scrape critical services more often than batch jobs
      - job_name: 'high-frequency'
        scrape_interval: 15s
        static_configs:
          - targets: ['critical-service:8080']

      - job_name: 'low-frequency'
        scrape_interval: 60s
        static_configs:
          - targets: ['batch-jobs:8080']
        # Metric relabeling: drop noisy runtime series before they are stored
        metric_relabel_configs:
          - source_labels: [__name__]
            regex: '(go_memstats_|go_gc_|process_).*'
            action: drop
```
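A further optimization lever is recording rules: precompute expensive expressions at evaluation time so dashboards query a single cheap series instead of aggregating raw data on every refresh. A sketch, loaded via `rule_files` like the alert rules (rule names follow the `level:metric:operations` convention):

```yaml
groups:
  - name: http_aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
```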
## Troubleshooting and Debugging {#troubleshooting}

### Prometheus Query Debugging
```promql
# Memory usage trend analysis
rate(container_memory_usage_bytes[5m])

# Top 10 pods by CPU usage
topk(10, rate(container_cpu_usage_seconds_total[5m]))

# Network I/O analysis
rate(container_network_receive_bytes_total[5m]) +
rate(container_network_transmit_bytes_total[5m])

# Disk usage prediction: will the filesystem fill within 4 hours?
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0

# Debugging alerts: which are currently firing?
ALERTS{alertstate="firing"}
```
### Performance Tuning
The query-engine and TSDB tuning knobs are command-line flags, not `prometheus.yml` keys (the config file does not accept `query`, `tsdb`, or `wal` sections), so they belong in the container's `args`:

```yaml
# Extra args on the prometheus container (cf. prometheus-deployment.yaml)
args:
  - '--config.file=/etc/prometheus/prometheus.yml'
  # Query engine tuning
  - '--query.timeout=2m'
  - '--query.max-concurrency=20'
  - '--query.max-samples=50000000'
  # TSDB tuning
  - '--storage.tsdb.retention.time=15d'
  - '--storage.tsdb.min-block-duration=2h'
  - '--storage.tsdb.max-block-duration=25h'
  # WAL compression (enabled by default in recent releases)
  - '--storage.tsdb.wal-compression'
```
## Conclusion

A monitoring system built on Prometheus and Grafana lets you track the health of modern microservice architectures effectively. For a successful monitoring strategy:

- **Golden Signals**: prioritize latency, traffic, error, and saturation metrics
- **Service Discovery**: use Kubernetes-native service discovery
- **Alert Fatigue**: create only actionable alerts
- **Dashboard Design**: design dashboards around business goals
- **Retention Strategy**: balance cost against performance

With the configurations in this guide you can build and operate a production-ready monitoring stack.

At TekTık Yazılım, we provide professional support for building and optimizing monitoring and observability infrastructure. Visit our contact page for more information.