Deploying Apicurio Registry for high availability

High availability architecture overview

A highly available Apicurio Registry deployment consists of the following components:

  • Application (backend) pods - Multiple stateless replicas that process REST API requests. These pods can scale horizontally and are distributed across cluster nodes for fault tolerance.

  • UI pods - Multiple stateless replicas serving the web console (static files). Typically fewer replicas are needed compared to the backend.

  • Storage layer - The stateful component requiring HA configuration. Strategy depends on whether you use SQL database or KafkaSQL storage.

The application and UI components are stateless and can be scaled horizontally. High availability is achieved by running multiple replicas distributed across failure domains (availability zones) and implementing a highly available storage layer.

Configuring application component high availability

The Apicurio Registry backend and UI components are stateless and can scale horizontally to provide high availability and increased throughput.

Configuring multiple replicas

Configure the number of replicas for each component in the ApicurioRegistry3 custom resource:

apiVersion: registry.apicur.io/v1
kind: ApicurioRegistry3
metadata:
  name: example-registry-ha
spec:
  app:
    replicas: 3
    storage:
      type: postgresql
      sql:
        dataSource:
          url: jdbc:postgresql://postgresql-ha.my-project.svc:5432/registry
          username: registry_user
          password:
            name: postgresql-credentials
            key: password
    ingress:
      host: registry.example.com
  ui:
    replicas: 2
    ingress:
      host: registry-ui.example.com
Running multiple replicas requires a production-ready storage backend (PostgreSQL, MySQL, or KafkaSQL with persistent Kafka). Do not use in-memory storage with multiple replicas.

Distributing pods across nodes

To ensure high availability, distribute Apicurio Registry pods across different nodes and availability zones:

spec:
  app:
    replicas: 3
    podTemplateSpec:
      spec:
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: example-registry
                      app.kubernetes.io/component: app
                      app.kubernetes.io/part-of: apicurio-registry
                  topologyKey: kubernetes.io/hostname
              - weight: 50
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: example-registry
                      app.kubernetes.io/component: app
                      app.kubernetes.io/part-of: apicurio-registry
                  topologyKey: topology.kubernetes.io/zone

This configuration spreads pods across different nodes (hostname) and availability zones when possible.
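
To verify the placement, list the pods together with the nodes they were scheduled on; a quick check using the app.kubernetes.io/part-of=apicurio-registry label from the selectors above:

# List registry pods with the nodes they were scheduled on
kubectl get pods -l app.kubernetes.io/part-of=apicurio-registry -o wide

# Show the zone label of each node to confirm spread across zones
kubectl get nodes -L topology.kubernetes.io/zone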

Configuring resource requests and limits

Set appropriate resource requests and limits to ensure pod scheduling and prevent resource contention:

spec:
  app:
    podTemplateSpec:
      spec:
        containers:
          - name: apicurio-registry-app
            resources:
              requests:
                memory: "512Mi"
                cpu: "500m"
              limits:
                memory: "1Gi"
                cpu: "1000m"

Configuring PodDisruptionBudget

A PodDisruptionBudget ensures minimum availability during voluntary disruptions such as node drains or cluster upgrades. The operator creates a PodDisruptionBudget with maxUnavailable: 1 by default. To customize it, first disable the operator-managed PodDisruptionBudget in the ApicurioRegistry3 CR:

spec:
  app:
    replicas: 3
    podDisruptionBudget:
      enabled: false

Then create your own PodDisruptionBudget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-registry-app-poddisruptionbudget
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-registry
      app.kubernetes.io/component: app
      app.kubernetes.io/part-of: apicurio-registry
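
After applying the custom PodDisruptionBudget, confirm that it selects the app pods and allows the expected number of disruptions (2 with replicas: 3 and minAvailable: 1):

kubectl get poddisruptionbudget example-registry-app-poddisruptionbudget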

Configuring SQL database storage for high availability

When using PostgreSQL or MySQL storage, high availability depends on your database configuration. The database must be configured with replication and automatic failover.

MySQL storage is not yet supported by the operator and requires manual configuration using environment variables.
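
For example, MySQL storage can be selected through environment variables on the app component. The following is a minimal sketch; the APICURIO_STORAGE_* and APICURIO_DATASOURCE_* variable names and the mysql-ha Service are assumptions, so verify them against the configuration reference for your Apicurio Registry version:

spec:
  app:
    env:
      # Select SQL storage with the MySQL dialect (variable names are assumptions)
      - name: APICURIO_STORAGE_KIND
        value: "sql"
      - name: APICURIO_STORAGE_SQL_KIND
        value: "mysql"
      # Hypothetical highly available MySQL endpoint
      - name: APICURIO_DATASOURCE_URL
        value: "jdbc:mysql://mysql-ha.my-project.svc:3306/registry"
      - name: APICURIO_DATASOURCE_USERNAME
        value: "registry_user"
      - name: APICURIO_DATASOURCE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: mysql-credentials
            key: password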

Database high availability options

Consider these database HA strategies:

  • PostgreSQL with streaming replication - Primary-replica configuration with automatic failover using tools like Patroni, CloudNativePG, or Crunchy PostgreSQL Operator

  • PostgreSQL with synchronous replication - Ensures zero data loss but may impact performance

  • MySQL with Group Replication - Multi-primary or single-primary mode with automatic failover

  • Managed database services - Cloud provider managed databases (RDS, Cloud SQL, Azure Database) with built-in HA

Configuring connection pool for failover

Configure the Agroal connection pool to handle database failover gracefully. Add these environment variables to the app component:

spec:
  app:
    env:
      # Connection pool sizing
      - name: APICURIO_DATASOURCE_JDBC_INITIAL-SIZE
        value: "10"
      - name: APICURIO_DATASOURCE_JDBC_MIN-SIZE
        value: "10"
      - name: APICURIO_DATASOURCE_JDBC_MAX-SIZE
        value: "50"

      # Connection acquisition timeout (5 seconds)
      - name: QUARKUS_DATASOURCE_JDBC_ACQUISITION-TIMEOUT
        value: "5S"

      # Background validation to detect stale connections (every 2 minutes)
      - name: QUARKUS_DATASOURCE_JDBC_BACKGROUND-VALIDATION-INTERVAL
        value: "2M"

      # Foreground validation before use (every 1 minute)
      - name: QUARKUS_DATASOURCE_JDBC_FOREGROUND-VALIDATION-INTERVAL
        value: "1M"

      # Maximum connection lifetime (30 minutes)
      - name: QUARKUS_DATASOURCE_JDBC_MAX-LIFETIME
        value: "30M"

These settings ensure that:

  • Connections are validated regularly to detect database failovers

  • Stale connections are removed and recreated

  • Connection acquisition times out rather than hanging indefinitely

Example: PostgreSQL with CloudNativePG

When using the CloudNativePG operator for PostgreSQL HA:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: registry-db-cluster
spec:
  instances: 3
  storage:
    size: 20Gi
    storageClass: standard
  postgresql:
    parameters:
      max_connections: "200"
  backup:
    barmanObjectStore:
      # Configure backup storage

Then reference the cluster in your ApicurioRegistry3 CR. CloudNativePG creates a read-write Service named registry-db-cluster-rw that always points to the current primary, and a Secret named registry-db-cluster-app containing the credentials for the default app database and user:

spec:
  app:
    storage:
      type: postgresql
      sql:
        dataSource:
          url: jdbc:postgresql://registry-db-cluster-rw:5432/app
          username: app
          password:
            name: registry-db-cluster-app
            key: password

Configuring KafkaSQL storage for high availability

When using KafkaSQL storage, high availability depends on the Kafka cluster configuration. Each Apicurio Registry replica independently consumes all messages from the Kafka journal topic.

Kafka cluster high availability

Configure your Kafka cluster for high availability:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: registry-kafka
spec:
  kafka:
    version: 3.5.0
    replicas: 3
    listeners:
      # Internal listener used by Apicurio Registry and other in-cluster clients
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      # Replication settings for HA
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: persistent-claim
      size: 100Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi

Key configuration for HA:

  • Kafka replicas: 3 - Provides fault tolerance for broker failures

  • Replication factor: 3 - Each partition has 3 copies

  • min.insync.replicas: 2 - Requires at least 2 replicas to acknowledge writes

KafkaSQL topic configuration

Apicurio Registry uses three Kafka topics for various purposes:

  • Journal topic - Stores all changes to the registry data. Named kafkasql-journal by default, this is the most important topic for data durability.

  • Snapshots topic - Stores periodic snapshots of the registry state for faster startup when the snapshotting feature is used. Named kafkasql-snapshots by default, it should be configured with the same replication settings as the journal topic.

  • Events topic - Stores events for the Kafka-based Registry eventing feature. Named registry-events by default, its configuration depends on your event processing needs. Currently, Apicurio Registry sends messages to only one partition of this topic.

Configure the Kafka topics used by Apicurio Registry with appropriate replication:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: kafkasql-journal
  labels:
    strimzi.io/cluster: registry-kafka
spec:
  partitions: 3
  replicas: 3
  config:
    cleanup.policy: delete
    min.insync.replicas: 2
    retention.ms: -1  # Infinite retention
    retention.bytes: -1  # Infinite retention
The journal topic and the snapshots topic must use cleanup.policy: delete with infinite retention (retention.ms: -1 and retention.bytes: -1) to prevent accidental data loss. As of the latest version, Apicurio Registry will check these settings on startup and refuse to start if they are not configured correctly.
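
The snapshots topic can be declared in the same way; a sketch assuming the default kafkasql-snapshots name and the same replication and retention settings as the journal topic:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: kafkasql-snapshots
  labels:
    strimzi.io/cluster: registry-kafka
spec:
  partitions: 3
  replicas: 3
  config:
    cleanup.policy: delete
    min.insync.replicas: 2
    retention.ms: -1  # Infinite retention
    retention.bytes: -1  # Infinite retention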

By default, Apicurio Registry automatically creates the journal, snapshots, and events topics on startup if they do not exist. Relying on automatic creation is not recommended for a high-availability production deployment, but if you choose to use it, the following environment variables provide the equivalent replication configuration:

APICURIO_KAFKASQL_TOPIC_PARTITIONS=3
APICURIO_KAFKASQL_TOPIC_REPLICATION_FACTOR=3
APICURIO_KAFKASQL_TOPIC_MIN_INSYNC_REPLICAS=2
APICURIO_KAFKASQL_SNAPSHOTS_TOPIC_PARTITIONS=3
APICURIO_KAFKASQL_SNAPSHOTS_TOPIC_REPLICATION_FACTOR=3
APICURIO_KAFKASQL_SNAPSHOTS_TOPIC_MIN_INSYNC_REPLICAS=2

Consumer behavior with multiple replicas

When running multiple Apicurio Registry replicas with KafkaSQL storage:

  • Each replica uses a unique consumer group ID (automatically generated as a UUID)

  • Each replica independently consumes all messages from the journal topic

  • There is no consumer group rebalancing between replicas

  • All replicas build the same in-memory state from the Kafka topic

This design ensures that:

  • New replicas can be added without affecting existing replicas

  • Pod restarts only affect the restarting pod, not others

  • Each replica maintains a consistent view of the data
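
You can observe this behavior by listing the consumer groups on the Kafka cluster; with N healthy replicas you should see N UUID-style group IDs. A sketch using the standard Kafka CLI from inside a Strimzi broker pod (the pod name and script path depend on your Kafka distribution):

# Each registry replica appears as its own automatically generated consumer group
kubectl exec -it registry-kafka-kafka-0 -- \
  bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list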

Monitoring high availability deployments

Apicurio Registry exposes Prometheus metrics for monitoring application health and performance.

Enabling metrics

Metrics are enabled by default. Access the metrics endpoint at /q/metrics.
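
To inspect the metrics locally, port-forward to one of the app pods and query the endpoint; a quick check, assuming the default HTTP port 8080 and the app.kubernetes.io/component=app label used earlier in this document:

# Forward port 8080 from one of the app pods
kubectl port-forward pod/$(kubectl get pods -l app.kubernetes.io/component=app \
  -o jsonpath='{.items[0].metadata.name}') 8080:8080

# In another terminal, fetch the Prometheus-format metrics
curl http://localhost:8080/q/metrics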

Key metrics to monitor

Monitor these metrics for HA deployments:

  • REST API metrics:

    • http_server_requests_seconds - Request latency

    • http_server_active_requests - Concurrent requests

    • http_server_requests_total - Total request count

  • Storage metrics:

    • apicurio_storage_operation_seconds - Storage operation latency

    • apicurio_storage_concurrent_operations - Concurrent storage operations

    • apicurio_storage_operation_total - Total storage operations

  • Health check metrics:

    • Readiness probe: /q/health/ready

    • Liveness probe: /q/health/live

  • JVM metrics:

    • jvm_memory_used_bytes - Memory usage

    • jvm_gc_pause_seconds - Garbage collection pauses

Configuring ServiceMonitor for Prometheus Operator

If you are using the Prometheus Operator, create a ServiceMonitor to scrape metrics. The selector and port name must match the Service that exposes the Apicurio Registry app component, so adjust the labels below to your deployment:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: apicurio-registry-metrics
  labels:
    app: apicurio-registry
spec:
  selector:
    matchLabels:
      app: apicurio-registry
  endpoints:
    - port: http
      path: /q/metrics
      interval: 30s

Alerting recommendations

Configure alerts for:

  • Pod restarts or crash loops

  • High error rates (5xx responses)

  • Storage operation timeouts

  • Database connection pool exhaustion

  • Kafka consumer lag (for KafkaSQL storage)
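
For example, a high 5xx error-rate alert can be expressed as a PrometheusRule when using the Prometheus Operator; a sketch with an illustrative 5% threshold, based on the http_server_requests_seconds_count metric exposed at /q/metrics:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apicurio-registry-alerts
  labels:
    app: apicurio-registry
spec:
  groups:
    - name: apicurio-registry
      rules:
        - alert: ApicurioRegistryHighErrorRate
          # Fire when more than 5% of requests return 5xx over 5 minutes
          expr: |
            sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
              /
            sum(rate(http_server_requests_seconds_count[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Apicurio Registry is returning a high rate of 5xx responses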

Performing rolling updates

When updating Apicurio Registry to a new version, use rolling updates to minimize downtime.

Rolling update strategy

The operator performs rolling updates automatically when you update the ApicurioRegistry3 CR. The default strategy ensures:

  • Pods are updated one at a time

  • Each new pod must pass readiness checks before the next pod is updated

  • The PodDisruptionBudget prevents too many pods from being unavailable at the same time
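
You can follow a rolling update as it progresses; a sketch assuming the operator names the backend Deployment <cr-name>-app-deployment (check the actual name with kubectl get deployments):

# Watch the rolling update of the backend Deployment
kubectl rollout status deployment/example-registry-ha-app-deployment

# Watch pods being replaced one at a time
kubectl get pods -l app.kubernetes.io/component=app -w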

Updating Apicurio Registry version

Apicurio Registry and Apicurio Registry Operator are versioned together. To update Apicurio Registry to a new version, update Apicurio Registry Operator, which will automatically update the application.

For production environments, we strongly recommend using Operator Lifecycle Manager (OLM) with manual install plan confirmation enabled. This prevents automatic updates from occurring immediately when a new version becomes available, giving you time to review release notes and prepare for the update.

If you need to update the application without updating Apicurio Registry Operator (not recommended except as a workaround for critical issues), you can override the application image in your pod template:

spec:
  app:
    podTemplateSpec:
      spec:
        containers:
          - name: apicurio-registry-app
            image: quay.io/apicurio/apicurio-registry:3.1.0

Safe update practices

Follow these practices for safe updates:

  • Test in non-production first - Validate the new version in a test environment

  • Monitor during rollout - Watch metrics and logs during the update

  • Maintain minimum replicas - Keep at least 2 replicas to ensure availability during updates

  • Review release notes - Check for breaking changes or migration steps

For patch releases (e.g., 3.1.0 to 3.1.1), rolling updates typically complete without issues. For major or minor version updates, always review the migration guide.

Backup and restore procedures

Regular backups are essential for disaster recovery and data protection.

SQL database backup

For SQL storage (PostgreSQL or MySQL), backup strategies depend on your database setup:

PostgreSQL backup options

  • pg_dump logical backups:

    pg_dump -h postgresql-host -U registry_user -d registry > registry-backup.sql
  • Continuous archiving with WAL - For point-in-time recovery

  • Operator-managed backups - If using CloudNativePG or Crunchy PostgreSQL Operator:

    spec:
      backup:
        barmanObjectStore:
          destinationPath: s3://my-backups/registry-db
          s3Credentials:
            accessKeyId:
              name: backup-credentials
              key: ACCESS_KEY_ID
            secretAccessKey:
              name: backup-credentials
              key: ACCESS_SECRET_KEY
          wal:
            compression: gzip
        retentionPolicy: "30d"

MySQL backup options

  • mysqldump logical backups:

    mysqldump -h mysql-host -u registry_user -p registry > registry-backup.sql
  • Binary backups - Using tools like Percona XtraBackup or MySQL Enterprise Backup

Restoring from SQL backup

To restore from a logical backup:

# PostgreSQL
psql -h postgresql-host -U registry_user -d registry < registry-backup.sql

# MySQL
mysql -h mysql-host -u registry_user -p registry < registry-backup.sql
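
To avoid writes during a restore, it is safest to stop the registry first and start it again once the restore completes; a minimal sketch, assuming the operator scales the backend down when spec.app.replicas is set to 0 (verify this for your operator version):

spec:
  app:
    replicas: 0   # temporarily stop all backend pods during the restore

After the restore finishes, set replicas back to its original value.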

KafkaSQL storage backup

For KafkaSQL storage, back up the Kafka journal topic and snapshots topic.

Backing up Kafka topics

Use Kafka’s MirrorMaker 2 or Strimzi Mirror Maker for topic replication (recommended):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: registry-backup-mirror
spec:
  version: 3.5.0
  replicas: 1
  connectCluster: "backup-cluster"
  clusters:
    - alias: "source-cluster"
      bootstrapServers: registry-kafka-bootstrap:9092
    - alias: "backup-cluster"
      bootstrapServers: backup-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: "source-cluster"
      targetCluster: "backup-cluster"
      topicsPattern: "kafkasql-.*"
      groupsPattern: ".*"

Alternatively, export and import topics using Kafka’s console consumer and producer, or CLI tools such as kcat (formerly kafkacat). Because the topics can contain binary data, which text-file encoding can corrupt, verify that exports preserve the data exactly when using CLI tools. See the Exporting Apicurio Registry Kafka topic data guide for more details.

Testing backup and restore

Regularly test your backup and restore procedures:

  1. Create a test Apicurio Registry deployment in a separate namespace

  2. Restore from backup to the test deployment

  3. Verify that all artifacts and metadata are present

  4. Test API functionality to ensure data integrity

Additional resources