CodeMie Platform - Data Processing & Storage Architecture

Overview

The CodeMie platform is an AI-powered coding assistant that processes user conversations and integrates with external services while maintaining data sovereignty and security through configurable regional deployment.

Data Processing Architecture

Core Principles

1. Regional Data Isolation

Persistent storage regions are independently configurable – PostgreSQL (Keycloak), PostgreSQL (CodeMie), Object Storage, and KMS/Vault can be deployed in the same region or distributed across different regions based on customer requirements
In-cluster components (Elasticsearch, NATS, and other in-cluster services) run within the Kubernetes cluster region
AI processing regions are configurable per LLM provider – customers can select specific regions for each model based on compliance and latency requirements
PostgreSQL (Keycloak) is deployed in the same region as the Kubernetes cluster in default configuration, but can be configured to use an external managed PostgreSQL instance in a different region if required

2. Data Encryption

Data at rest: Encrypted using KMS/Vault (AES-256)
Data in transit: TLS 1.2+ for all external communications

3. External Data Isolation

External service data (Jira, GitHub, etc.) is fetched via integration/datasource; for retrieval-based features it is normalized, embedded, and indexed locally in Elasticsearch
Some tools can query external datasources directly and return results to chat without indexing
LLM providers do not train on customer data under their enterprise agreements. Refer to the provider documentation for details:
- Azure OpenAI: Data Privacy
- AWS Bedrock: Service Terms and Third-Party Models
- Google Vertex AI: Data Governance
- Anthropic Claude: Data Training
- OpenAI: Enterprise Data Privacy
- etc.
Integration/datasource credentials are never sent to LLMs, and LLM models cannot access them

LLM Provider

LLM Provider refers to external AI model services such as AWS Bedrock, Azure OpenAI, Google Vertex AI, OpenAI, Anthropic Claude, etc.

Data Storage Layers

Persistent Storage

PostgreSQL (Cloud SQL / RDS / Azure Database) – Data Stored

Data Types:
- Authorization Data: Role permissions, user-role mappings
- Conversation Data: Chat history, prompts, AI responses
- Configuration Data: Datasource configs, model preferences, system prompts, assistants, workflows, integration/datasource credentials (KMS/Vault-encrypted strings)
Storage & Security:
- Encryption at rest: Data encrypted with KMS/Vault-managed keys
- Encryption in transit: SSL/TLS connections required

User Authentication Storage

Keycloak uses a dedicated PostgreSQL cluster in Kubernetes that stores SSO federation mappings, IDP client secrets, and local user credentials. When SSO is enabled, user authentication (passwords, MFA) is managed by the external IDP (Entra ID, Okta, Google) rather than stored in Keycloak.

Object Storage (GCS / S3 / Azure Blob) – Data Stored

Data Types:
- User Files: Uploaded attachments, documents, images
Storage & Security:
- Retention: Configurable based on customer requirements
- Access: Managed identities/service accounts only, no public access
- Encryption at rest: Server-side KMS/Vault encryption

In-Cluster Storage

Elasticsearch

Vector Embeddings: Semantic search vectors (generated by embeddings models)
Platform Components Logs: Logs from core platform services (see Core Components)

Data Processing Flows

1. User Authentication Flow

Data Stored:

User profile (email, name, roles) in Keycloak PostgreSQL (GKE/AKS/EKS)
External IDP configuration in Keycloak PostgreSQL (GKE/AKS/EKS)
OAuth2 Proxy secrets (client-secret, cookie-secret) in K8s Secrets (encrypted at-rest via etcd)
Session data: Encrypted browser cookie (no server-side storage, TLS 1.2+ in-transit)

OAuth2 Proxy Security

OAuth2 Proxy operates in stateless mode using encrypted browser cookies with HTTPS-only transmission and CSRF protection.

2. Chat Conversation Flow

Data Stored:

Conversation data: PostgreSQL (Cloud SQL/RDS/Azure Database)
Context snippets: Retrieved from Elasticsearch (GKE/AKS/EKS)
Analytics: PostgreSQL (Cloud SQL/RDS/Azure Database)

3. Datasource Indexing Flow

Data Stored:

Integration/Datasource Credentials: PostgreSQL (encrypted string via KMS/Vault, plaintext never persisted)
Indexed Data (Embeddings): Elasticsearch (vector database; GKE/AKS/EKS)
Metadata: PostgreSQL (Cloud SQL/RDS/Azure Database) — datasource config, indexing status, timestamps

Credential Security

Integration/datasource credentials are stored as KMS/Vault-encrypted strings in PostgreSQL. Decryption happens on-demand via KMS/Vault APIs, and plaintext credentials exist only in memory for the duration of the API call, never persisted to disk.

Supported Datasources

For a complete list of supported datasource types and configuration details, see Data Source Overview.

Regional Data Distribution

Component	Data Type	Storage	Region	Encryption
PostgreSQL (Keycloak)	User auth, SSO mappings	K8s cluster (optionally Cloud SQL/RDS/Azure DB)	GKE/AKS/EKS Region (or managed DB region)	StorageClass encryption + TLS
PostgreSQL (CodeMie)	Chat, Config, Roles	Cloud SQL/RDS/Azure DB	Cloud SQL/RDS/Azure Database Region	KMS/Vault (at rest) + TLS (in transit)
Elasticsearch	Knowledge Base, Indices	K8s Persistent Volume	GKE/AKS/EKS Region	StorageClass encryption + TLS
Object Storage	Files, Attachments	GCS/S3/Azure Blob	GCS/S3/Azure Blob Region	KMS/Vault (server-side)
KMS/Vault	Encryption Keys, Secrets	KMS/Vault	KMS/Vault Region (configurable)	HSM-backed
LLM Services	Prompt Processing	External API	LLM Provider Region	TLS
External IDPs	Authentication	External SaaS	Customer tenant	TLS + SAML/OIDC
External Services	Source Data	External SaaS/Self-hosted	Managed by customer	TLS + OAuth 2.0

Data Processing Principles

AI Model Interaction

Input: User prompt + locally indexed context (from Elasticsearch)
Processing: External LLM Provider service – data leaves platform boundary
Output: AI-generated response stored in PostgreSQL (Cloud SQL/RDS/Azure DB)
External Data Usage: External service data is indexed locally for retrieval; some tools can access external sources directly when requested
Embeddings: Generated by AI models but stored in Elasticsearch (GKE/AKS/EKS)

Datasource Data

Fetch: CodeMie API pulls data via OAuth 2.0 authenticated APIs
Transform: Data cleaned, chunked, and prepared for indexing
Index: Elasticsearch stores searchable indices with vector embeddings
Cache: Optional raw data cache in object storage

Authentication & Authorization

SSO Federation: Keycloak acts as SAML/OIDC broker, user data stored locally after federation
API Keys: Integration/datasource credentials stored in PostgreSQL (Cloud SQL/RDS/Azure DB) as KMS/Vault-encrypted strings, never exposed to users
JWT Tokens: Issued by Keycloak, validated by CodeMie API for each request
Role-Based Access: Permissions stored in PostgreSQL, enforced at API layer

Data Retention & Compliance

Data Category	Default Retention	Configurable	Compliance Notes
Chat History (PostgreSQL CodeMie)	Indefinite	No	Right to erasure supported
User Profiles (PostgreSQL Keycloak)	Indefinite	No	Deleted on user account deletion
Elasticsearch Indices	Indefinite	Yes (if ILM enabled)	Can be rebuilt from sources
Object Storage Files	Indefinite	Yes	Can be configured according to requirements
PostgreSQL Backups (Keycloak)	7 days	Yes	K8s PGO point-in-time recovery
PostgreSQL Backups (CodeMie)	7 days	Yes	Cloud SQL/RDS/Azure DB backups
KMS/Vault Keys	Indefinite	Yes	Rotation disabled by default

Security & Privacy

Data Minimization

Only necessary data sent to AI models (prompt + relevant context, not entire databases)
External service credentials scoped to minimum required permissions
User data isolated by tenant/organization

Audit & Monitoring

Cloud-specific audit logging configurations:

Azure: Log Analytics workspace with Container Insights and Key Vault audit events
GCP: PostgreSQL DDL statement logging, default GKE logging
AWS: Application logs via Fluent Bit to Elasticsearch (CloudWatch and CloudTrail could be configured)

Example: End-to-End Flow

Scenario: Authenticated user asks "What's the status of JIRA-123?"

Data Movement:

Jira data: External → CodeMie API → Tools (real-time)
Prompt: User → API → PostgreSQL (Cloud SQL/RDS/Azure DB region) → LLM service
Response: LLM service → API → PostgreSQL (Cloud SQL/RDS/Azure DB region) → User
Credentials: KMS/Vault (configurable region) → CodeMie API (ephemeral memory)

Summary

The CodeMie platform follows a local-first data architecture where:

Persistent storage regions are independently configurable – PostgreSQL (Keycloak), PostgreSQL (CodeMie), Object Storage, and KMS/Vault can be deployed in the same region or distributed across different regions based on customer requirements
External service data is indexed locally for retrieval; some tools can access external sources directly when requested
AI processing regions are configurable per LLM provider – customers can select specific regions for each model based on compliance and latency requirements (optional)
Encryption keys can be geo-separated from data for compliance
External customer services (IDP, datasources) authenticate via OAuth 2.0/SAML/OIDC; cloud infrastructure (compute,databases, storage, LLM services, etc.) uses IAM/managed identities; all communications encrypted with TLS 1.2+

This architecture supports data sovereignty and multi-region AI processing while maintaining security and performance.

Data Residency Configuration

Configure regional settings during deployment to match your organization's data residency and compliance requirements. PostgreSQL (CodeMie), Object Storage, and KMS/Vault regions can be set independently from each other and from the Kubernetes cluster region. PostgreSQL (Keycloak) is deployed in the same region as the Kubernetes cluster by default and can be configured otherwise according to customer requirements.

Overview​

Core Principles​

Data Storage Layers​

Persistent Storage​

In-Cluster Storage​

Data Processing Flows​

1. User Authentication Flow​

2. Chat Conversation Flow​

3. Datasource Indexing Flow​

Regional Data Distribution​

Data Processing Principles​

AI Model Interaction​

Datasource Data​

Authentication & Authorization​

Data Retention & Compliance​

Security & Privacy​

Example: End-to-End Flow​

Summary​

Overview

Core Principles

Data Storage Layers

Persistent Storage

In-Cluster Storage

Data Processing Flows

1. User Authentication Flow

2. Chat Conversation Flow

3. Datasource Indexing Flow

Regional Data Distribution

Data Processing Principles

AI Model Interaction

Datasource Data

Authentication & Authorization

Data Retention & Compliance

Security & Privacy

Example: End-to-End Flow

Summary