Tech stack
Architecture, stack, scripts, deployment, security, observability.
TL;DR
Pharmacy Delivery Platform is a Turborepo monorepo with eleven applications and two shared packages. The backend is written in NestJS 10 (thirty-six modules, forty-four controllers, one WebSocket gateway, forty-three TypeORM entities on top of PostgreSQL 16 and Redis 7). The web storefront is built on Next.js 14 with App Router, ISR caching for the catalog, and i18n for three locales. The admin panel is a separate Vite 6 SPA with fifty pages and RBAC across four roles. Three Telegram mini-apps (sales, dispatch, driver) are built on Vite 7 + React 19 and validate initData via HMAC-SHA-256. Three Python bots on aiogram 3 communicate with the API through Redis pub/sub. Real-time is delivered via Socket.IO with five business events and per-store, per-dispatcher, per-driver, per-customer rooms. Production runs in Docker compose on a single VPS — nine services plus nginx with Let's Encrypt. CI/CD on GitHub Actions with path-based filtering rebuilds only the affected services and rolls them out via GHCR. Security: bcrypt, JWT with short-lived access tokens and refresh rotation, throttler, Helmet, HSTS, HMAC for Telegram, OTP via Twilio A2P and SMS-Gate Android. Observability: a custom event-based analytics layer on client_events (90 days) plus rrweb session replays (30 days), audit trail and checkout/login funnel in the admin panel. Tech debt is openly tracked: there are no tests, the inventory log is not yet written on order creation, and refresh token rotation needs further work.
Contents
- High-level architecture
- Monorepo and its boundaries
- API: NestJS, modules, migrations
- Web storefront: Next.js 14, ISR, i18n
- Admin panel: Vite SPA, RBAC, reporting
- Telegram mini-app: three surfaces
- Bots: Python, aiogram, Redis pub/sub
- Database and cache
- Real-time: WebSocket gateway and rooms
- Order state machine
- Infrastructure: dev and prod
- CI/CD: path-based filtering and GHCR
- Scripts and operational utilities
- Security
- Observability and analytics
- Integrations
- Architectural decisions and why exactly so
- Tech debt as a mature engineering practice
- Links
High-level architecture
The platform is a full last-mile delivery stack for regulated goods with age verification. Inside a single monorepo we have a customer-facing web storefront, an operator admin panel, three Telegram mini-apps (for customers, dispatchers, and drivers), one shared API, and three Python bots. On the outside — nginx with SSL and a set of external services: Twilio for SMS A2P, SMS-Gate Android as a duplicate OTP channel, Pushover for critical operational alerts, the Telegram Bot API, crypto payment wallets, and Nominatim for geocoding. The basic principle is to minimize the number of moving parts and maximize the decoupling between them through clearly defined points of communication: REST endpoints, WebSocket rooms, and Redis pub/sub channels.
Below is the overall system diagram with domain boundaries and data flows.
flowchart LR
subgraph "Clients"
Web[Web storefront<br/>Next.js 14]
AdminUI[Admin panel<br/>React + Vite]
SalesMA[Sales mini-app<br/>Telegram]
DispatchMA[Dispatch mini-app<br/>Telegram]
DriverMA[Driver mini-app<br/>Telegram]
end
subgraph "Edge"
NGINX[nginx + SSL<br/>Let's Encrypt]
end
subgraph "Backend"
API[NestJS API<br/>36 modules]
WS[WebSocket gateway]
end
subgraph "Bots (Python)"
BotC[Customer bot]
BotD[Driver bot]
BotS[Sales bot]
end
subgraph "Data"
PG[(PostgreSQL 16<br/>43 entities)]
Redis[(Redis 7<br/>cache + pub/sub)]
end
subgraph "External"
Twilio[Twilio A2P]
SMSGate[SMS Gate<br/>Android]
Pushover[Pushover]
TG[Telegram Bot API]
Crypto[Crypto wallets]
Maps[Nominatim]
end
Web & AdminUI & SalesMA & DispatchMA & DriverMA --> NGINX
NGINX --> API
NGINX --> WS
API <--> PG
API <--> Redis
WS <--> Redis
Redis -.pub/sub.-> BotC & BotD & BotS
BotC & BotD & BotS --> TG
API --> Twilio
API --> SMSGate
API --> Pushover
API --> Crypto
API --> Maps

The key idea here is clearly visible: the API knows nothing about Telegram messages, and the bots know nothing about the database. Any notification for a customer is simply a publication to a Redis channel, and a separate process is responsible for delivering it to the client. The same is true for the dispatcher and driver channels. This made it possible to decouple operational and product changes from each other: one person can edit the order workflow in the API, and another can edit the UX in the bot, and they don't step on each other's toes.
Monorepo and its boundaries
The project root is structured as simply as possible:
pharmacy-delivery/
├── apps/ 11 applications
│ ├── api/ NestJS + TypeORM + Postgres + Redis + WebSocket
│ ├── web/ Next.js storefront — port 3001
│ ├── admin/ Vite SPA — port 3002
│ ├── telegram-mini-app/ Vite + React (dispatcher) — port 3003
│ ├── driver-miniapp/ Vite + React (driver) — port 3004
│ ├── sales-miniapp/ Vite + React (sales)
│ ├── telegram-bot/ Python + aiogram (dispatcher)
│ ├── telegram-driver-bot/ Python + aiogram (driver)
│ ├── telegram-sales-bot/ Python + aiogram (sales)
│ ├── dispatcher/ Expo / React Native (placeholder)
│ └── driver/ Expo / React Native (placeholder)
└── packages/ 2 shared packages
├── shared/ types, enums, constants
└── ui/ React components (Button, Input, Card, Badge, Spinner)
The build system is Turborepo on top of npm workspaces. This gives us two practically important things. The first is a shared tsconfig and a single set of types in packages/shared (User, Order, Store, OrderStatus, ORDER_TRANSITIONS, SOCKET_EVENTS) used by both the frontends and the API. This means that when I add a new field to order or a new value to a status enum, TypeScript on the next build will show me where this field is not accounted for. The second is that Turbo can compute hashes of affected files and skip building workspaces that the changes did not touch. For CI this translates into significant time savings, because trivial edits, say, in web, do not rebuild the bots and mini-apps.
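To make this concrete, here is a minimal sketch of the kind of contract that lives in packages/shared; the field lists and wire event names are illustrative rather than the real definitions.

```ts
// packages/shared — a minimal sketch of the shared contract; fields and wire names are illustrative.
export enum OrderStatus {
  PENDING = 'PENDING',
  CONFIRMED = 'CONFIRMED',
  READY = 'READY',
  ASSIGNED = 'ASSIGNED',
  PICKED_UP = 'PICKED_UP',
  DELIVERED = 'DELIVERED',
  CANCELLED = 'CANCELLED',
}

// One place for socket event names, so the API and the clients never drift apart.
export const SOCKET_EVENTS = {
  ORDER_CREATED: 'order:created',
  ORDER_STATUS_CHANGED: 'order:status_changed',
  ORDER_ASSIGNED: 'order:assigned',
  DRIVER_LOCATION_UPDATED: 'driver:location_updated',
  DRIVER_STATUS_CHANGED: 'driver:status_changed',
} as const;

export interface Order {
  id: string;
  storeId: string; // multi-tenancy key, see the database section
  userId: string;
  status: OrderStatus;
  total: number;
  createdAt: string;
}
```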
The packages/ui package is intentionally small and contains only neutral elements that make sense in any of the consumer applications: button, input, card, badge, and spinner. This is a compromise between "don't duplicate" and "don't turn the UI package into a shovel that drags the entire design language through it". The actual visual language — Tailwind tokens and more complex compositions — lives separately in each application, because the admin panel and the storefront have fundamentally different jobs and different UX.
The root package.json and turbo.json are configured so that on dev I run npm run api:dev, npm run web:dev, npm run admin:dev once each and get hot reload of every application independently. On the VPS environment, dev servers run via systemd (dev-api, dev-web, dev-admin, dev-sales), and hot reload picks up changes automatically — a restart is only needed when .env changes or a process hangs.
API: NestJS, modules, migrations
The backend is the largest application in the monorepo: thirty-six modules, forty-four controllers, and one WebSocket gateway. Thirty-six modules is not "because it's a nice number", but a reflection of the actual domain decomposition. Each module corresponds to its own area of responsibility and owns a small set of entities, services, and controllers.
The list of modules gives a good idea of what is happening inside the platform:
- order and catalog domain modules: orders, products, categories, brands, cart, time-slots, delivery, stores, reviews;
- identification and access: auth, users, roles, employees, customer-plans;
- operational panel: admin, admin-promotions, dispatcher, driver-personal, vehicles;
- marketing and retention: banners, campaigns, rewards;
- integrations and payments: crypto-payments, email, messaging, sms, whatsapp, notifications;
- infrastructural: database, geocoding, health, realtime, events, uploads.
Forty-three entities cover the entire business model of the platform: users, stores, products with variants, cart, orders, order items, status logs, time slots, promo codes, banners, campaigns, reviews, drivers, shifts, wallets, crypto payment deposits, and so on. The admin subdomain is especially telling — it contains eleven supporting entities: cash-drop, checkout-log, client-event, driver-shift, inventory-log, login-log, order-status-history, otp-log, product-audit-log, session-recording, wallet-transaction. Essentially this is a mini-storage for all operational analytics, audit trail, and observability.
Module files live in apps/api/src/modules/<module>/, and concrete services and controllers inside each module follow the standard NestJS pattern: *.module.ts, *.controller.ts, *.service.ts, entities/*.entity.ts. Dependencies are resolved through DI, which significantly simplifies local replacement of dependencies and substitution during debugging.
TypeORM and migrations
I deliberately did not rely on TypeORM auto-sync for critical schema changes. Auto-sync is fine in the early stages, but in production it is dangerous: one careless column rename turns into a drop of a column with data. So all important schema changes are written as plain SQL migrations in apps/api/src/database/migrations:
20260121-add-cashback.sql
20260329-add-plan-switch-logs.sql
20260330-premium-tier.sql
20260403-premium-tier-settings.sql
20260404-vehicles.sql
20260417-order-item-cost-price.sql
I roll them out by hand in a clear sequence on dev and then on prod. This gives two advantages. The first is that migrations have date-prefixed names that read well in code review and in git. The second is that I know exactly what gets migrated and what does not, and I can stop if something is going wrong. For seed data there is a separate script apps/api/src/database/seeds/seed.ts, which fills the dev database with base categories, test stores, and one dev user.
Controllers and REST
The forty-four controllers do not map one-to-one onto the modules. Some modules have two controllers: a public one (/api/v1/orders) and an admin one (/api/v1/admin/orders). Elsewhere controllers are split by semantics: for example, events has public endpoints POST /events (a batch of client events) and POST /events/replay (rrweb chunks), plus a separate admin controller with filters, statistics, session search, and search analytics.
A global ValidationPipe is set up in main.ts, and validation runs through class-validator on DTOs. This eliminates a whole class of "garbled JSON came in" bugs and at the same time gives the client clear error messages: which exact field failed, what type was expected, and which values are allowed.
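As a hedged sketch of that setup (the DTO and its fields are illustrative, not the real cart contract):

```ts
// Sketch of the global ValidationPipe setup plus a typical DTO; fields are illustrative.
import { ValidationPipe } from '@nestjs/common';
import { NestFactory } from '@nestjs/core';
import { IsInt, IsString, IsUUID, Min } from 'class-validator';
import { AppModule } from './app.module';

export class AddCartItemDto {
  @IsUUID()
  productId!: string;

  @IsInt()
  @Min(1)
  quantity!: number;

  @IsString()
  storeId!: string;
}

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  app.useGlobalPipes(
    new ValidationPipe({
      whitelist: true,            // strip unknown fields from incoming payloads
      forbidNonWhitelisted: true, // reject requests that send them, with a clear message
      transform: true,            // cast plain JSON into typed DTO instances
    }),
  );
  await app.listen(3000);
}
bootstrap();
```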
Web storefront: Next.js 14, ISR, i18n
The storefront is built on Next.js 14 with App Router, and its responsibilities range from fast catalog loading to a full checkout, authentication, wallet, and referral program. Below is what really matters on this 5,000-line page.
App Router and i18n
The page root is apps/web/src/app/[locale]/, and each [locale] value is en, ru, or es. Localization is wired through next-intl, dictionaries live in apps/web/messages/<locale>.json. At the layout level I inject NextIntlClientProvider, and all components get access to translations via useTranslations. Segmenting routes by locale gives correct canonical URLs and hreflang without any hacks, plus it allows setting preload priorities for the default locale.
ISR for the catalog
The catalog is the hot path. Customers open product and category pages more often than anything else, and these pages rarely change between inventory updates. I set revalidate: 60 on catalog segments, and Next.js re-issues HTML once a minute. Combined with the nginx edge cache, this gives near-static speed on popular pages, while a content editor doesn't have to restart the build to edit a product description.
For dynamic pages (cart, checkout, account, wallet) ISR is not used — they are rendered per-request with up-to-date data.
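In code this is a per-segment one-liner; the exact file path and env variable below are assumptions about the layout:

```tsx
// apps/web/src/app/[locale]/product/[slug]/page.tsx — illustrative path.
// Catalog pages are regenerated at most once a minute; cart, checkout, and account stay fully dynamic.
export const revalidate = 60;

export default async function ProductPage({
  params,
}: {
  params: { locale: string; slug: string };
}) {
  const res = await fetch(`${process.env.API_URL}/api/v1/products/${params.slug}`, {
    next: { revalidate: 60 }, // the fetch cache follows the same window
  });
  const product = await res.json();
  return <h1>{product.name}</h1>;
}
```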
Dynamic imports and heavy components
The home page and some product pages use next/dynamic for heavy blocks: TestimonialsSection, QRCodeSVG for referral codes, lazy loading of motion animations. This reduces the initial JS bundle and improves LCP — especially on slow 3G and on desktops with slow CPU. After the broken dynamic import episode (see 0f6e19b fix(web)), I reverted to a static import in one of the blocks because it actually didn't yield any bundle gain.
Maps and geo data
For maps I use Leaflet (an open solution, with no commercial quotas). Tiles come from an OSM provider, geocoding goes through Nominatim. This fully covers our needs: choosing a delivery address, displaying the coverage zone, and a real-time driver marker.
Authentication
The storefront supports three login methods: email + password with OTP, Google OAuth 2.0, and Telegram WebApp. Google OAuth is implemented in such a way that the affiliateCode parameter survives the redirect — this was fixed in 311fce3 fix: pass affiliate code through Google OAuth registration. Telegram WebApp is validated via HMAC-SHA-256: the backend takes the sorted query params, runs them through the bot secret, and compares the result to hash. If they match, the user is considered authenticated, and the API returns access + refresh.
Animations and accessibility
We use the motion/react library. All key animations are wrapped in a prefers-reduced-motion check, so users with reduced-motion enabled get static interfaces. Combined with proper focus management and aria attributes, this gives readable accessibility without overhead on the design.
rrweb and analytics
On the frontend, rrweb is enabled with reasonable settings: recording only for authenticated users, maskAllInputs: true (no passwords or addresses in the recording), and a 15-minute time limit per session. This — see the Observability section below for details — provides replay of problem sessions from the admin panel without the risk of leaking PII.
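A condensed sketch of that recording bootstrap; the chunk size and the sendReplayChunk helper are assumptions:

```ts
// Sketch of the recording bootstrap; the chunk size and sendReplayChunk are assumptions.
import { record } from 'rrweb';

declare function sendReplayChunk(events: unknown[]): void; // posts to POST /events/replay

const MAX_SESSION_MS = 15 * 60 * 1000; // hard cap: 15 minutes of recording per session
const startedAt = Date.now();
let buffer: unknown[] = [];

const stopRecording = record({
  maskAllInputs: true, // never capture what the user types (passwords, addresses)
  emit(event) {
    if (Date.now() - startedAt > MAX_SESSION_MS) {
      stopRecording?.(); // stop once the per-session limit is reached
      return;
    }
    buffer.push(event);
    if (buffer.length >= 50) {
      sendReplayChunk(buffer);
      buffer = [];
    }
  },
});
```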
In parallel, there is custom analytics. The file apps/web/src/lib/analytics.ts stores sessionId in localStorage, batches events every five seconds, and sends them to POST /events. If the user closes the tab, we use navigator.sendBeacon so as not to lose the last batch. On the server, a rate limit of 10 requests per minute is enabled for this endpoint, which both protects against bots and leaves legitimate scenarios trouble-free.
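A condensed sketch of the batching logic itself; the payload shape, flush threshold, and endpoint path are approximations of the real analytics.ts:

```ts
// Condensed sketch of the event batching; payload shape and endpoint path are approximations.
type ClientEvent = { type: string; payload?: Record<string, unknown>; ts: number };

const SESSION_KEY = 'analytics_session_id';
const FLUSH_INTERVAL_MS = 5_000; // batch every five seconds
let queue: ClientEvent[] = [];

function sessionId(): string {
  let id = localStorage.getItem(SESSION_KEY);
  if (!id) {
    id = crypto.randomUUID();
    localStorage.setItem(SESSION_KEY, id);
  }
  return id;
}

export function track(type: string, payload?: Record<string, unknown>): void {
  queue.push({ type, payload, ts: Date.now() });
}

function flush(useBeacon = false): void {
  if (queue.length === 0) return;
  // The server caps a batch at 50 events, so never send more than that at once.
  const body = JSON.stringify({ sessionId: sessionId(), events: queue.splice(0, 50) });
  if (useBeacon && navigator.sendBeacon) {
    // The tab is closing: sendBeacon survives unload, a regular fetch may not.
    navigator.sendBeacon('/events', new Blob([body], { type: 'application/json' }));
  } else {
    fetch('/events', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body,
      keepalive: true,
    });
  }
}

setInterval(() => flush(), FLUSH_INTERVAL_MS);
window.addEventListener('pagehide', () => flush(true));
```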
Admin panel: Vite SPA, RBAC, reporting
The admin panel is a separate Vite 6 SPA on React 18, with fifty pages and an internal routing system on React Router v6. Why a separate application and not shared code with the storefront: the admin panel and the storefront have fundamentally different jobs. The admin panel is the operator's workspace, where data density, complex tables, charts, and actions matter. The storefront is marketing and sales, where aesthetics, load speed, and marketing metrics matter. Splitting them helps both products.
State and server cache
Local state — Zustand, with no excessive slices and boilerplate. Server state — React Query: invalidation, refetching, optimistic updates. This split gives a clean data model: everything coming from the server goes through React Query (with proper staleTime and retry strategies), and everything that controls the UI (table filters, an open modal, the active tab) lives in Zustand.
Charts and analytics
Recharts — for analytics and dashboards: BarChart on the Dashboard and Sales pages, LineChart on the Conversion Analytics page. It's not the most beautiful product on the market, but it's lightweight, extensible, and covers 95% of tasks without external dependencies. Where it falls short, I write custom SVG visualizations on top of Recharts wrappers.
PDF invoices
For invoices I use jsPDF + autoTable. When the operator clicks "Download PDF", the document is generated on the client with the full order contents, taxes, and a signature. No server-side PDF rendering, no headless Chrome — which matters in production, where every additional service means an additional point of failure.
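A minimal sketch of that client-side generation; the column set and tax handling are illustrative:

```ts
// Minimal sketch of client-side invoice generation; columns and totals are illustrative.
import { jsPDF } from 'jspdf';
import autoTable from 'jspdf-autotable';

interface InvoiceLine { name: string; qty: number; price: number }

export function downloadInvoice(orderId: string, lines: InvoiceLine[], taxRate = 0.08): void {
  const doc = new jsPDF();
  doc.setFontSize(14);
  doc.text(`Invoice for order ${orderId}`, 14, 18);

  autoTable(doc, {
    startY: 26,
    head: [['Item', 'Qty', 'Price', 'Subtotal']],
    body: lines.map((l) => [l.name, String(l.qty), l.price.toFixed(2), (l.qty * l.price).toFixed(2)]),
  });

  const subtotal = lines.reduce((sum, l) => sum + l.qty * l.price, 0);
  const finalY = (doc as any).lastAutoTable?.finalY ?? 40; // autoTable records where the table ended
  doc.text(`Tax: ${(subtotal * taxRate).toFixed(2)}`, 14, finalY + 8);
  doc.text(`Total: ${(subtotal * (1 + taxRate)).toFixed(2)}`, 14, finalY + 16);

  doc.save(`invoice-${orderId}.pdf`); // generated entirely in the browser, no server round-trip
}
```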
Layouts and roles
Inside the admin panel there are three different layout contexts: AdminLayout, DriverLayout, DispatcherLayout. Essentially these are three different products in one SPA, and switching between them is determined by the user's role. RBAC is implemented at the route level via RoleGuard: a wrapper component that looks at the current role from the store and either renders children or redirects to 403.
Four roles:
- admin — full access, store configuration, employee management, billing.
- dispatcher — operational work: incoming orders, driver assignment, customer support.
- driver — own layout for viewing their orders, routes, and earnings.
- manager — limited admin, usually without access to billing and system settings.
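A hedged sketch of the RoleGuard described above; the Zustand hook and redirect targets are assumptions:

```tsx
// Sketch of the route-level RBAC wrapper; the store hook and redirect targets are assumptions.
import { Navigate } from 'react-router-dom';
import type { ReactNode } from 'react';

type Role = 'admin' | 'dispatcher' | 'driver' | 'manager';

// Assumed Zustand selector hook; in the real admin panel this lives in its own store module.
declare function useAuthStore<T>(selector: (s: { role: Role | null }) => T): T;

export function RoleGuard({ allow, children }: { allow: Role[]; children: ReactNode }) {
  const role = useAuthStore((s) => s.role);
  if (!role) return <Navigate to="/login" replace />;
  if (!allow.includes(role)) return <Navigate to="/403" replace />; // valid role, wrong area
  return <>{children}</>;
}

// Usage: billing pages are admin-only, the orders board allows two roles.
// <RoleGuard allow={['admin']}><BillingPage /></RoleGuard>
// <RoleGuard allow={['admin', 'dispatcher']}><OrdersBoard /></RoleGuard>
```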
Icons everywhere — lucide-react. The design system is built on Tailwind 3 with our own tokens in tailwind.config.ts. I deliberately do not use a ready-made component kit like Material UI, because the admin panel should look like part of our brand, not yet another Material site.
Telegram mini-app: three surfaces
Inside Telegram the platform has three separate mini-apps, and each solves its own problem:
- sales mini-app — anonymous catalog, lazy-auth, and quick checkout for customers coming in via the bot.
- dispatch mini-app — dispatcher's mobile workspace: order list, filters, actions.
- driver mini-app — driver interface: current delivery, route, pickup/delivered marks.
All three are written in Vite 7 + React 19 + Tailwind 3 and built into ordinary SPAs served by nginx. Not Next.js, because mini-apps are an exclusively client-side context inside a Telegram WebView, and SSR makes no sense here.
Telegram WebApp HMAC
Authentication in mini-apps is built on initData validation. Telegram passes user data in a string signed with HMAC-SHA-256 using the bot secret. Verification algorithm on the server: parse the query string, sort the keys, concatenate them in key=value\n... format, compute the HMAC, and compare with hash. If they match — the data is genuine, you can trust user.id and either auto-register the user or pull up their existing account.
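A minimal sketch of that check with Node's crypto; per the Telegram WebApp spec the signing key is itself an HMAC of the bot token keyed with the string "WebAppData", and the helper name here is illustrative:

```ts
// Minimal sketch of the initData check; verifyTelegramInitData is an illustrative helper name.
import { createHmac } from 'crypto';

export function verifyTelegramInitData(initData: string, botToken: string): boolean {
  const params = new URLSearchParams(initData);
  const receivedHash = params.get('hash') ?? '';
  params.delete('hash');

  // Sorted key=value pairs joined with \n, exactly as described above.
  const dataCheckString = [...params.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([key, value]) => `${key}=${value}`)
    .join('\n');

  // Per the Telegram WebApp spec, the signing key is HMAC(bot_token) keyed with "WebAppData".
  const secretKey = createHmac('sha256', 'WebAppData').update(botToken).digest();
  const computedHash = createHmac('sha256', secretKey).update(dataCheckString).digest('hex');

  return computedHash === receivedHash;
}
```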
Auto-registration is simple: if there is no user for a given tg_id, we create a new one with a phone like tg_{id} (this pattern differs from regular numbers, so there are no collisions) and issue a token. On the storefront side, this user can later attach a real number.
Anonymous catalog and lazy auth
The sales mini-app deliberately gives anonymous access to the catalog. No registrations, no forms at the start. The cart lives in localStorage. Authentication is requested only at checkout — at that point the mini-app validates Telegram.WebApp.initData and obtains a token. This reduces friction in the funnel and produces a healthy conversion.
Real-time via Socket.IO
Inside the mini-app a Socket.IO client is actively used, connecting to the API through the WebSocket gateway. This allows the dispatcher and the driver to receive ORDER_CREATED, ORDER_STATUS_CHANGED, ORDER_ASSIGNED, DRIVER_LOCATION_UPDATED, DRIVER_STATUS_CHANGED events without unnecessary polling. More on this in the Real-time section below.
Bots: Python, aiogram, Redis pub/sub
The three bots are probably the most "polyglot" part of the stack, and that is intentional. Telegram bots are long-polling processes that are easier and cleaner to implement in Python with aiogram than to drag in a separate Node.js process with tg-grammy and a tail of adapters. Each bot is a separate application in apps/telegram-bot, apps/telegram-driver-bot, apps/telegram-sales-bot. The stack is the same for all of them:
- Python 3.12;
- aiogram 3.x — a modern async framework for bots;
- asyncpg — a native async PostgreSQL driver without ORM overhead;
- redis-py for pub/sub.
The virtual environment is built inside the Docker image; in dev, it lives in a separate venv in the bot's directory.
Decoupling via Redis
The main architectural decision here is that the API knows nothing about Telegram messages, and the bots know nothing about the database. When an order is delivered, the API publishes an event to notifications:customer, and the customer bot pulls the message from the channel, formats the text, and sends it to the customer in Telegram. If an operator wants to add a new notification (for example, "Thanks for the rating"), they add the publication to Redis, and the bot starts delivering it within seconds without an API release.
We have three channels:
- notifications:customer — messages to the customer.
- notifications:dispatcher — fan-out to all dispatchers (currently four).
- notifications:driver — driver channel.
In the dispatcher channel, the bot reads the message and broadcasts it to all active dispatcher accounts in Telegram. This is how "all-hands on the loudspeaker" is implemented when an urgent order comes in.
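On the API side this is a single publish call. A sketch under assumptions: ioredis is used here only for illustration, and the payload shape is an assumed version of the contract kept in packages/shared:

```ts
// Sketch of the API-side publish; ioredis and the payload shape are illustrative.
// The real message contract lives in packages/shared/src/notifications.ts.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

interface CustomerNotification {
  type: 'order_delivered' | 'order_status_changed';
  orderId: string;
  userId: string; // the bot maps this to a Telegram chat; the API never touches Telegram
  text?: string;
}

export async function notifyCustomer(payload: CustomerNotification): Promise<void> {
  // The API only knows the channel name; formatting and delivery are the bot's job.
  await redis.publish('notifications:customer', JSON.stringify(payload));
}
```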
Launch
In production each bot runs in its own Docker container, in dev — via PM2 (fork mode, so as not to duplicate long-poll and not get "conflict: terminated by other getUpdates request" from Telegram). This choice is also not accidental: cluster mode for long-poll bots almost always means duplicate messages, and fork is better here.
Database and cache
The database is PostgreSQL 16, a single instance running locally on the VPS. The current data volume and load do not yet require separating reads from writes, and I am not introducing that complexity ahead of time. The schema has forty-three entities plus a few custom views for admin-panel reports.
Key design principles:
- Multi-tenancy via store_id. All main entities (products, orders, carts, store-customer users) have store_id. At the service level, filtering by storeId is a mandatory part of any query. This makes it possible to host several stores on a single instance without separate schemas or databases. More on this in the "Architectural decisions" section.
- Indexes. On hot fields — user_id, order_id, store_id, created_at — there are composite indexes everywhere. This is especially important on client_events: with typical queries by time range and user, index scans stay cheap.
- Transactions for inventory deduction. A race condition with concurrent orders was caught in the early phase and fixed via a DB transaction with the condition WHERE inventory >= qty: in a single transaction we both check and update, and if the condition fails, the order returns 409 Conflict (see the sketch after this list).
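A sketch of that check-and-deduct under the assumptions above. The real code path is the single conditional UPDATE inside a transaction; the explicit row lock here is only to keep the check readable, and table and column names are illustrative.

```ts
// Sketch of atomic inventory deduction; table and column names are illustrative.
import { ConflictException } from '@nestjs/common';
import { DataSource } from 'typeorm';

export async function deductInventory(ds: DataSource, productId: string, storeId: string, qty: number) {
  await ds.transaction(async (manager) => {
    // Lock the row and read the current stock inside the same transaction.
    const rows: Array<{ inventory: number }> = await manager.query(
      `SELECT inventory FROM products WHERE id = $1 AND store_id = $2 FOR UPDATE`,
      [productId, storeId],
    );
    if (rows.length === 0 || rows[0].inventory < qty) {
      throw new ConflictException('Not enough inventory'); // the API answers 409 Conflict
    }
    // The guard in the WHERE clause stays as a second line of defence against overselling.
    await manager.query(
      `UPDATE products SET inventory = inventory - $1
        WHERE id = $2 AND store_id = $3 AND inventory >= $1`,
      [qty, productId, storeId],
    );
  });
}
```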
Redis 7
Redis plays several roles at once:
- Cache. User session data, frequent dictionary requests (for example, active store configs), and cached geocoding results for frequently used addresses.
- Pub/sub. Channels for the bots (see the previous section).
- Rate limiting. The throttler in NestJS uses a Redis store to share limits across multiple worker processes.
- Real-time. The Socket.IO adapter uses Redis to coordinate messages across multiple WebSocket gateway instances.
A single Redis container serves all four scenarios. This works because our load is a small fraction of what Redis can handle.
Real-time: WebSocket gateway and rooms
The WebSocket gateway is built on Socket.IO and lives in apps/api/src/modules/realtime. Five core events cover the entire real-time contract between the server and the clients:
- ORDER_CREATED — order created;
- ORDER_STATUS_CHANGED — status changed;
- ORDER_ASSIGNED — driver assigned;
- DRIVER_LOCATION_UPDATED — driver geolocation updated;
- DRIVER_STATUS_CHANGED — driver status.
Each event flies into one or more rooms on a "the recipient subscribes themselves" basis:
- store:{storeId} — the shared store room for all its employees;
- dispatcher:{userId} — a personal room for a specific dispatcher (for example, for order assignments to them);
- driver:{userId} — a personal room for a driver;
- customer:{userId} — a personal room for the customer for status notifications.
This gives transparent message semantics: I can publish ORDER_ASSIGNED to driver:42 and know that only this driver will receive the message, not the entire team. At the same time I can duplicate it into store:7 so that all dispatchers of that store see the update in their lists.
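A sketch of how the gateway fans one event out to a driver room and a store room; the method name is illustrative, and the real event names come from SOCKET_EVENTS in packages/shared:

```ts
// Sketch of room-targeted emits in the realtime gateway; method and event names are illustrative.
import { WebSocketGateway, WebSocketServer } from '@nestjs/websockets';
import { Server } from 'socket.io';

@WebSocketGateway()
export class RealtimeGateway {
  @WebSocketServer()
  server!: Server;

  notifyOrderAssigned(order: { id: string; storeId: string; driverId: string }) {
    const payload = { orderId: order.id };
    // Only this driver gets the personal notification...
    this.server.to(`driver:${order.driverId}`).emit('ORDER_ASSIGNED', payload);
    // ...and every dispatcher subscribed to the store room sees the update in their list.
    this.server.to(`store:${order.storeId}`).emit('ORDER_ASSIGNED', payload);
  }
}
```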
Geocoding and ETA calculation are a separate service in the API, using Nominatim (OSM). When an order is created with coordinates, ETA is computed from the distance, transport type, and base coefficients, and is then updated on each DRIVER_LOCATION_UPDATED.
Order state machine
Order states are described in packages/shared/src/orders.ts by the ORDER_TRANSITIONS constant, which works as a declarative state machine. Each transition is checked in the orders service before changing the status. If the transition attempt is illegal, a domain error is thrown, and the API responds with 422. This protects against bugs like "the dispatcher accidentally reopened an already delivered order" and against race conditions where two different candidates change the status in parallel.
stateDiagram-v2
[*] --> PENDING: created by client
PENDING --> CONFIRMED: payment confirmed
PENDING --> CANCELLED
CONFIRMED --> READY: packed
READY --> ASSIGNED: driver assigned
ASSIGNED --> PICKED_UP: driver picked up
PICKED_UP --> DELIVERED: delivered
DELIVERED --> [*]
CONFIRMED --> CANCELLED
READY --> CANCELLED
ASSIGNED --> CANCELLED
PICKED_UP --> CANCELLED: refund flow
For each transition, a record with from_status, to_status, actor_id, and actor_role is written to order_status_history. This captures the full chronology of the order, and in the admin panel we render it as a visual timeline. Useful both for analytics (how long an order sits in each status) and for incident investigation.
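A sketch of the guard built on that constant. The real ORDER_TRANSITIONS lives in packages/shared/src/orders.ts and is inlined here for readability; UnprocessableEntityException is what produces the 422 mentioned above:

```ts
// Sketch of the declarative transition map and the guard used in the orders service.
import { UnprocessableEntityException } from '@nestjs/common';

// Mirrors the enum in packages/shared; inlined here for a self-contained example.
type OrderStatus =
  | 'PENDING' | 'CONFIRMED' | 'READY' | 'ASSIGNED'
  | 'PICKED_UP' | 'DELIVERED' | 'CANCELLED';

export const ORDER_TRANSITIONS: Record<OrderStatus, OrderStatus[]> = {
  PENDING: ['CONFIRMED', 'CANCELLED'],
  CONFIRMED: ['READY', 'CANCELLED'],
  READY: ['ASSIGNED', 'CANCELLED'],
  ASSIGNED: ['PICKED_UP', 'CANCELLED'],
  PICKED_UP: ['DELIVERED', 'CANCELLED'],
  DELIVERED: [],
  CANCELLED: [],
};

export function assertTransition(from: OrderStatus, to: OrderStatus): void {
  if (!ORDER_TRANSITIONS[from].includes(to)) {
    // Illegal transition (e.g. DELIVERED -> CONFIRMED): the API answers 422.
    throw new UnprocessableEntityException(`Cannot move order from ${from} to ${to}`);
  }
}
```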
Infrastructure: dev and prod
Dev: VPS7
The developer's environment is a dedicated VPS with Ubuntu, where systemd units live for all key dev servers:
- dev-api — port 3000, NestJS in watch mode;
- dev-web — port 3001, Next.js dev server;
- dev-admin — port 3002, Vite dev server;
- dev-sales — port 3005, Vite for the sales mini-app.
PostgreSQL 16 runs locally on 5432, Redis — inside Docker on 6379. Hot reload works for all applications, so when I change the code I don't restart anything by hand. A restart is only needed when .env changes or a process hangs.
The build does not run locally. At all. On the dev server there is intentionally no TypeScript compiler in production mode and no vite/next build. Any npm run build or tsc blows up memory and slows down other processes. The build is the CI's job.
Prod: VPS6 with Docker compose
The prod server is a separate VPS where everything runs in Docker compose. The file is /opt/pharmacy/docker-compose.prod.yml. There are nine services, all pulled from GHCR:
| Service | Description |
|---|---|
api | NestJS API |
web | Next.js storefront |
admin | Vite admin panel SPA |
sales-app | Vite sales mini-app, port 4005 |
sales-bot | Python sales bot |
dispatch | Vite dispatcher mini-app, port 3003 |
driver-miniapp | Vite driver mini-app, port 3004 |
telegram-bot | Python customer/dispatcher bot |
telegram-driver-bot | Python driver bot |
Plus postgres, redis, and nginx (80/443) containers with Let's Encrypt certificates. Nginx is the reverse proxy to the docker services and the SSL terminator. It also implements HTTP/2 and Brotli for static assets.
A key nuance: docker compose restart does NOT re-read .env. This is a trap that's easy to fall into. After any changes to environment variables you have to run docker compose -f docker-compose.prod.yml down <service> and then up -d <service> for that specific service. I recorded this in the project's CLAUDE.md so as not to trip over it again.
Images and registry
All images are published to the GitHub Container Registry: ghcr.io/<org>/pharmacy-delivery/<service>. Each service is tagged with two labels — latest and sha-<short>. We pin latest in the compose file because for our load and SLA it's a normal trade-off between flexibility and predictability; if it ever becomes a problem, we'll switch to pinning by sha.
Dockerfiles
Each application has its own Dockerfile (nine in total). These are ordinary multi-stage builds: a build stage with node:20-alpine for JS services and python:3.12-slim for bots, then a runtime stage without dev dependencies. The images come out compact and deploy quickly.
Compose files
The repository contains three compose files — for different tasks:
- docker/docker-compose.yml — a minimal dev stack: only PostgreSQL and Redis. Convenient if a developer works locally and wants to bring up only the infrastructure.
- docker-compose.yml (root) — the full local stack: postgres, redis, and all nine applications. Used for integration checks.
- docker-compose.prod.yml — the production config with real ports, healthchecks, and nginx.
Nginx and domains
The nginx config describes seven domains and proxies them to specific docker services:
- app.platform.com — web storefront;
- admin.platform.com — admin panel;
- api.platform.com — API;
- shop.platform.com — sales mini-app;
- dispatch.platform.com — dispatcher mini-app;
- driver.platform.com — driver mini-app.
(Real domains are omitted under NDA. In the actual configuration, each name is a separate server block with a Let's Encrypt SSL certificate and a proxy_pass to the appropriate upstream.)
CI/CD: path-based filtering and GHCR
The file is .github/workflows/docker-build.yml. The trigger is push to main. What's inside:
flowchart TB
Dev[VPS7 dev<br/>Hot reload]
PR[git push to main]
GHA[GitHub Actions<br/>Smart path filter]
CHANGED{Changed services?}
Build[Docker build per service]
GHCR[(GHCR registry)]
SSH[SSH to VPS6<br/>via appleboy/ssh-action]
DC[docker compose pull + up -d<br/>only changed services]
Prod[VPS6 prod<br/>9 containers + nginx]
Dev --> PR
PR --> GHA
GHA --> CHANGED
CHANGED -->|yes| Build
CHANGED -->|no| End([skip])
Build --> GHCR
GHCR --> SSH
SSH --> DC
DC --> Prod

The workflow starts with dorny/paths-filter@v3, which looks at the changes in the commit and tells which of the nine buildable services actually need to be rebuilt (api, web, admin, dispatch, driver-miniapp, telegram-bot, telegram-driver-bot, sales-app, sales-bot). For example, an edit in apps/web/src/... will only affect web, and then CI will rebuild exactly web and leave everything else on the same version. This saves about 5–10 minutes per push and at the same time simplifies rollback: each service is versioned independently.
Key steps after path-filter:
- Build per service. For each affected service, GitHub Actions runs docker build with the correct Dockerfile and the tags ghcr.io/<org>/pharmacy-delivery/<service>:latest plus :sha-<short>.
- Push to GHCR. Images are published to the GitHub Container Registry. CI access to it is configured via GITHUB_TOKEN with packages: write.
- SSH to VPS6. appleboy/ssh-action connects to the prod server by key, runs docker compose -f /opt/pharmacy/docker-compose.prod.yml pull <service> and then up -d <service> — only for the affected services.
- Healthcheck. After the restart, compose gives the service time to perform its healthcheck. If it doesn't go green, we see it in production via Pushover right away.
What keeps this pipeline safe:
- secrets are stored in GitHub Actions (GHCR_TOKEN, VPS_SSH_KEY, DEPLOY_HOST, DEPLOY_USER); nothing is hard-coded in the repo;
- the ssh-action uses an ed25519 key, which is provisioned only into the CI runner and is not exposed anywhere else;
- the path docker-compose.prod.yml and the list of services are the only things the workflow touches in production.
Scripts and operational utilities
The root scripts/ directory contains a small set of useful utilities:
- backfill-geocoding.sh — batch backfill of geocoding data. Takes all orders that don't have lat/lng, hits Nominatim, normalizes the address, and writes the coordinates back. Used once after changes to the address schema or when importing legacy data.
- screenshot.js and help-screenshots.mjs — generation of screenshots for help pages. Connects to the dev environment, snapshots the UI, and saves a PNG.
- inject-help-images.mjs — embedding these images into the help content after rendering.
- rescale-menu-icons.js — processing menu icons to a standard size for all applications.
- setup-api.sh — API setup: migrations, seed, and basic .env configuration.
- test-flows.sh — a set of smoke-test scenarios: create a user, place an order, mark it as delivered.
This doesn't claim to be full automation, but it removes the routine of 10–15 repetitive tasks that come up once a sprint.
Security
Security is layered. I'll break it down by layers, because an unsystematic list of "we have Helmet" usually explains nothing.
Identification and passwords
- Passwords are hashed with bcrypt at cost factor 10. This is a specific trade-off between login speed (~50ms on a modern CPU) and resistance to brute-force.
- JWT access tokens are short-lived, 15 minutes (historically they were 30 days; fixed after the audit, see the "Tech debt" section).
- Refresh tokens are long-lived, with rotation: each use generates a new refresh and invalidates the previous one. This narrows the attack window if a refresh leaks.
- OTP codes via Twilio A2P (a registered campaign under TCR vetting) and SMS-Gate Android as a backup channel. Round-robin between two devices, retry on failure.
- Telegram WebApp HMAC-SHA-256 — described in the mini-app section.
Transport and headers
- HTTPS everywhere, with no exceptions. Let's Encrypt issues certificates for all seven domains.
- HSTS is enabled with max-age=31536000 and includeSubDomains. This locks in HTTPS for a year ahead, and the backend also returns the header on every response via Helmet.
- Helmet is set up in main.ts: X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Strict-Transport-Security, Referrer-Policy: strict-origin-when-cross-origin.
- CORS — a whitelist of specific domains (including dispatch.platform.com and driver.platform.com for mini-apps, plus the admin panel and the storefront). No *.
Rate limit
- @nestjs/throttler — throttling on sensitive endpoints: auth (login/register/forgot-password), public events (10/min), events replay (6/min), and other places where the API could otherwise be flooded with requests.
- The global API limit is more lenient, so legitimate clients don't hit it.
Crypto payments
This is a separate story, because validation and signing in crypto are usually the riskiest part. The approach:
- BTC — derivation from an xpub key on the fly, a new address per deposit. The private key is not stored anywhere in the API; the xpub lies in a protected config section.
- ETH/USDT-ERC20/USDC-ERC20 — wallets are generated programmatically via libraries; the private key is stored encrypted with AES-256 using a master key from an ENV variable.
- TRC-20 (USDT on TRON) — a separate channel, analogous to ETH.
- For each incoming payment, the API monitors the corresponding blockchain via an RPC provider and writes a wallet_transaction with confirmations.
Action logs
Any sensitive action is logged:
- login_log — who, when, from which IP, success/fail;
- otp_log — OTP sending and validation;
- product_audit_log — who changed which fields of a product;
- wallet_transaction — all deposits and withdrawals.
The audit trail in the admin panel uses product_audit_log to render a clean history of product changes with "smart grouping": a series of edits by the same user within a short time is collapsed into a single block, and values are rendered with contextual formatting (price → $12.99, category_id → Tinctures).
Observability and analytics
Observability is its own track that I built up gradually as the project grew. Right now it covers three classes of tasks: product analytics, incident investigation, and operational alerts.
client_events — client tracking (90 days)
The client_events table stores all client events for the last 90 days:
- page_view — navigation;
- product_view — viewing a product card;
- cart_add, cart_remove, cart_update_qty — cart;
- favorite_add, favorite_remove — favorites;
- search, search_no_results — search (the latter is especially valuable for discovering missed queries);
- checkout_start, place_order, success, error, zone_unavailable — checkout funnel;
- api_error, js_error — client-side errors.
This is the foundation for everything else: product analytics, funnel conversion, error monitoring. The public endpoint POST /events accepts batches of up to 50 events and is rate-limited to 10 requests per minute to protect the table from bot traffic. A cron job clears events older than 90 days every day at 3:00.
session_recordings — rrweb (30 days)
session_recordings stores rrweb chunks of authenticated user sessions. Chunks arrive at the endpoint POST /events/replay (rate limit 6/min). All inputs are masked (maskAllInputs: true), and a single recording is limited to 15 minutes of activity. Cron clears anything older than 30 days at 4:00.
In the admin panel there is a /activity-log page with four tabs:
- All events — filtering by user, event type, period;
- Errors — JS errors and API errors with context;
- Search Analytics — the most popular queries and the top 10 queries with no results;
- User Journey + Replay — the path of a specific user and a "Play session" button that opens an rrweb player on chunks from storage.
This dramatically speeds up incident investigation. When an operator says "my order broke", I open their session and see the steps where a js_error or a 422 from the API occurred.
Audit trail and checkout funnel
In addition to client_events, we have several specialized log tables:
- checkout_log — every checkout step with context (address, chosen method, total amount);
- login_log, otp_log — described above;
- product_audit_log — product changes;
- wallet_transaction — financial operations.
In the admin panel, several funnel reports are built on top of this data:
- registration_funnel — conversion from phone entry to OTP confirmation;
- login_funnel — conversion from a login attempt to a successful login;
- checkout_funnel — the main money funnel: cart → address → payment → place_order → success.
I added a separate "Issues monitor" — a screen that aggregates the top reasons for checkout failures (zone unavailable, payment declined, OTP expired), JS errors, and API errors over the last 24 hours. This is a quick way to notice a regression after a deploy.
Pushover
For operational alerts I use Pushover. We hard-code exactly the people who need it — that's four dispatchers and two drivers in apps/api/src/modules/notifications/pushover.service.ts. Triggers:
- New order — priority=2 (loud, rings until acknowledged, 30-second retry). Orders MUST NOT be lost.
- Order assigned to a driver — priority=2 for that specific driver.
- Order delivered — priority=1 (normal) for dispatchers.
This covers operational SLA without a complex alerting system: if an order is created but not acknowledged within 30 seconds, the dispatchers get a wake-up call.
Cron jobs
Inside the API there are two simple daily cron jobs:
- 3:00 — DELETE FROM client_events WHERE created_at < NOW() - INTERVAL '90 days';
- 4:00 — DELETE FROM session_recordings WHERE created_at < NOW() - INTERVAL '30 days'.
This keeps the tables at a reasonable size and predictable speeds.
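A sketch of those two jobs with @nestjs/schedule; the class name is illustrative:

```ts
// Sketch of the two retention jobs; the class name is illustrative.
import { Injectable } from '@nestjs/common';
import { Cron } from '@nestjs/schedule';
import { DataSource } from 'typeorm';

@Injectable()
export class RetentionService {
  constructor(private readonly dataSource: DataSource) {}

  @Cron('0 3 * * *') // every day at 3:00
  async purgeClientEvents(): Promise<void> {
    await this.dataSource.query(
      `DELETE FROM client_events WHERE created_at < NOW() - INTERVAL '90 days'`,
    );
  }

  @Cron('0 4 * * *') // every day at 4:00
  async purgeSessionRecordings(): Promise<void> {
    await this.dataSource.query(
      `DELETE FROM session_recordings WHERE created_at < NOW() - INTERVAL '30 days'`,
    );
  }
}
```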
Integrations
Several external services are wired into the platform, and each closes a specific job. I don't try to substitute one for another: SMS, email, Pushover, and Telegram are different channels with different SLAs and costs.
- Twilio A2P — the SMS channel in production. The campaign is registered and awaiting TCR vetting. After approval it will become the primary route for OTPs over SMS.
- SMS-Gate Android — two devices (2NROCH, M7TXQY) on different numbers. Round-robin to distribute the load and retry on failure of one device. Used for OTP at registration/login. Not used for forgot-password — that goes through email.
- Email (SMTP) — only for forgot-password. The platform has no other email notifications, and that's intentional: every additional channel is additional spam and a point of failure.
- Pushover — described above.
- Telegram Bot API — three bots, all via aiogram.
- Google OAuth 2.0 — register/login for the web. Threads affiliateCode through the redirect.
- Crypto wallets — BTC via xpub derivation, ETH, USDT-ERC20, USDC-ERC20, TRC-20.
- WhatsApp — multi-node gateway: up to 5 nodes, round-robin with sticky sessions (the same dialog goes through the same node). An existsCache: Map holds the phone-has-WhatsApp checks so we don't hit a node on every message. (This is also a known piece of tech debt: the cache has no eviction; see the "Tech debt" section.)
- Geocoding (Nominatim/OSM) — for all addresses and delivery zones. No commercial quotas.
Inside the messaging module there is a unified "WhatsApp first, SMS fallback" logic: if the user has WhatsApp, we send there; if not — SMS. This reduces notification cost and improves delivery rate.
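A hedged sketch of that routing decision; the service shape is an assumption, and the unbounded Map is deliberately shown because it is the tech-debt item mentioned above:

```ts
// Sketch of the "WhatsApp first, SMS fallback" routing; service and method names are assumptions.
type Channel = 'whatsapp' | 'sms';

export class MessagingService {
  // phone -> has WhatsApp; the unbounded Map is the known tech-debt item (no eviction yet).
  private readonly existsCache = new Map<string, boolean>();

  constructor(
    private readonly whatsapp: { exists(phone: string): Promise<boolean>; send(phone: string, text: string): Promise<void> },
    private readonly sms: { send(phone: string, text: string): Promise<void> },
  ) {}

  async send(phone: string, text: string): Promise<Channel> {
    let hasWhatsApp = this.existsCache.get(phone);
    if (hasWhatsApp === undefined) {
      hasWhatsApp = await this.whatsapp.exists(phone); // ask a gateway node only on cache miss
      this.existsCache.set(phone, hasWhatsApp);
    }
    if (hasWhatsApp) {
      await this.whatsapp.send(phone, text);
      return 'whatsapp';
    }
    await this.sms.send(phone, text); // fallback channel
    return 'sms';
  }
}
```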
Architectural decisions and why exactly so
Here are ten key decisions and their rationale. Most of them are decisions I made once and never came back to, because they work.
1. Monorepo on Turborepo
The alternative is several separate repositories for the API, web, admin, and each mini-app. I chose a monorepo for three reasons:
- shared types (Order, User, OrderStatus, SOCKET_EVENTS) live in packages/shared and are automatically synchronized between the API and the frontends; one enum change shows all the TypeScript errors at once;
- Turbo path-filter speeds up CI: edits in web don't rebuild bots and the API;
- a single npm install and consistent dependency versions; no hell with conflicting versions of React and Tailwind.
The price is a slightly heavier repo and the need to teach the team the Turbo commands. At my scale, this pays off in the very first week.
2. TypeORM + manual migrations
TypeORM is the most mature ORM in the Node ecosystem with TypeScript-first typing. I don't use auto-sync: instead — manual SQL migrations. This gives two advantages:
- I know exactly what gets migrated and can stop;
- migration diffs are readable in review.
Alternatives (Prisma, Drizzle) didn't win for me on the combination of factors. Prisma is an excellent builder, but I prefer manual work with migrations. Drizzle is beautiful, but at the time the project started it was still young.
3. Multi-tenant via store_id
The simplest solution for multi-tenant is separate databases or schemas. I chose a store_id column on the main tables, because:
- one server serves several stores and physical isolation is unnecessary;
- filtering by store_id is done at the query level and is easy to test;
- backups and migrations — one command for the entire database.
The price is the need to filter in a disciplined way. I cover this with a guard at the service level: any findOrders takes storeId as a required argument, and you can't forget it.
4. Redis pub/sub for the bots
The bots are separate processes. They could talk to the API via REST, but then I would have to drag in a full auth between services (cross-service tokens) and handle eventual consistency. Pub/sub is simpler:
- the API publishes an event to notifications:customer;
- the bot receives, formats, and sends;
- the API doesn't know about Telegram, the bot doesn't know about the database.
The downside is the need to explicitly describe the message schema. I store it in packages/shared/src/notifications.ts as TypeScript types and pin it on the Python side via a JSON schema.
5. rrweb for session replay
Reproducing bugs from logs and user descriptions is a weak solution. When an operator shows me "look, nothing works for this customer", I open their session in rrweb and see what happened. With maskAllInputs: true it's safe for PII, and the 15-minute time limit prevents gigabyte recordings.
6. Custom audit trail in the admin panel (instead of TypeORM history)
TypeORM can save history through subscribers, but the "raw" history is a log of diffs that's hard to read. I chose my own product_audit_log with smart grouping (combines edits by one user within a short time) and contextual formatting (price → $12.99, category_id → category name). This gives the admin a convenient "here's the history of this product" screen instead of a dump of JSON diffs.
7. Pushover priority=2 for critical events
priority=2 means Pushover will ring the user every 30 seconds until they acknowledge. This is the only channel that genuinely guarantees delivery of an operational alert: SMS messages get lost, Telegram push notifications can be muted, and email is not really an alert at all. We have Pushover priority=2 set on "new order" and "driver assigned", and not a single order has been missed in all this time.
8. Path-based filtering in CI
Without it, CI would rebuild all nine services on every push. With it — only those that actually changed. On web edits this cuts the pipeline from ~12 minutes to ~3.
9. Docker compose with named services and nginx
Docker compose is simple, readable, and sufficient for our scale. Alternatives (Kubernetes, Nomad, ECS) are a different class of complexity, and I see no benefit in them with nine services on a single server. Nginx terminates SSL and proxies to upstreams by domain; docker compose healthchecks provide graceful restart on pushes.
10. NestJS modules + DI
NestJS is essentially Angular-style decomposition for the backend. Each module is isolated, dependencies are passed through DI. This gives:
- easy testing (although there are no tests right now — see tech debt);
- easy mocking of dependencies during debugging;
- clear boundaries: "orders depends on products, not the other way around".
Alternatives (Express + hand-assembled services, Fastify + DIY DI) are cheaper at the start but more expensive at the scale of thirty-six modules.
Tech debt as a mature engineering practice
There is no such thing as a perfect project. A mature team doesn't pretend everything is clean — it keeps an open list of tech debt and prioritizes it. Same here.
Critical (closed)
- JWT access token of 30 days. Was fixed after the audit: now 15 minutes, refresh tokens with rotation.
- OTP code in logs. The console.log with the actual code was removed in auth.service.ts.
- Math.random for passwords. Replaced with crypto.randomBytes in all five places.
- Inventory race condition. Closed via DB transactions with the condition WHERE inventory >= qty.
- PromoStatus enum DELETE → DELETED.
- Duplicate variants on product creation.
High (in progress)
- Refresh token without rotation — partially closed, needs to invalidate the previous token on use.
- Refresh endpoint without rate limit — the only auth endpoint without @Throttle. Simple fix, in the queue.
- WhatsApp existsCache memory leak — a Map without an eviction strategy. Need either LRU or eviction by TTL.
- TypeORM synchronize in some modules — moving to clean migrations.
Medium (debt for discipline)
- No tests. They existed but were removed early on for the sake of iteration speed. This is visible debt, and I openly admit it. The plan is to add unit tests on critical services (auth, orders, inventory) and e2e on checkout.
- Silent .catch(() => {}) in orders/notifications. In some places this is intentional (fire-and-forget notifications), but in orders it masks real errors. Each case needs a review.
- Inventory log on order — the table exists, but the record on deduction is not yet written. This is a visible gap in the audit trail.
- Session TTL of 30 days — too long for security-sensitive accounts. Need to reduce it to at least 14 days or introduce roles with different TTLs.
Principles
I follow a few simple rules when working with tech debt:
- Every debt item is recorded with a priority and context, so the team can pick it up.
- Critical debt is closed before the next release. No "we'll fix it next sprint".
- High and Medium go into the backlog and are prioritized alongside features; you can't postpone them indefinitely.
- Tech debt is publicly visible: I don't hide it from stakeholders in a private tracker. This is part of a transparent engineering culture.
Closing thoughts
To summarize in one line: simple blocks, clear boundaries, protected points of communication. The platform survives not thanks to miracle technologies, but because each layer does its own job and doesn't poke into others. The API doesn't know about Telegram, the bots don't know about the database, the admin panel doesn't duplicate the frontend, the mini-app is a separate SPA on the right surface (Telegram WebView). CI rebuilds only what has changed. Production is nine docker containers behind nginx, and each of them can be restarted independently. Observability collects events and rrweb sessions without excess infrastructure. Security is layered, and each layer is short, understandable code.
This stack fits well with the "boring tech principle": I choose proven tools, minimize the number of moving parts, and pay close attention to what is mature and what is not yet. The fewer surprises in the infrastructure, the more time is left for the product.
Links
- Project business page: Pharmacy Delivery — case
- Admin functionality: Platform admin panel
- This section in the source:
apps/api/src/modules/, apps/web/src/app/[locale]/, .github/workflows/docker-build.yml, docker-compose.prod.yml, package.json, turbo.json