This directory contains Terraform configuration for deploying the ingestion pipeline to Google Cloud.
- Install Terraform: https://www.terraform.io/downloads

  ```bash
  brew install terraform  # macOS
  ```

- Authenticate with Google Cloud:

  ```bash
  gcloud auth application-default login
  ```

- Set your project:

  ```bash
  gcloud config set project YOUR_PROJECT_ID
  ```
Create `terraform.tfvars` from the example:

```bash
cd terraform
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars`:

```hcl
project_id = "your-actual-project-id"
region     = "europe-west1"
```

Then initialize, plan, and apply:

```bash
terraform init
terraform plan
terraform apply
```

Type `yes` when prompted. This will create:

- Service account with appropriate permissions
- Cloud Run Jobs:
  - Individual crawler jobs (for manual triggering)
  - Individual ingest and notify jobs (for manual triggering)
- Pipeline jobs: `pipeline-emergent` and `pipeline-all`
- Cloud Scheduler jobs:
  - `pipeline-emergent-schedule`: every 30 minutes (for emergent crawlers)
  - `pipeline-all-schedule`: 3 times daily (for all crawlers)
- Enablement of the required Google Cloud APIs
After Terraform creates the infrastructure, get the Artifact Registry URL and push the image:

```bash
cd ..  # back to ingest directory

# Get the image URL from terraform
terraform output container_image_url

# Configure Docker for Artifact Registry (replace REGION with your region)
gcloud auth configure-docker [REGION]-docker.pkg.dev

# Build and push the image using the URL from terraform output
gcloud builds submit --tag [CONTAINER_IMAGE_URL]

# Or build locally and push
docker build -t [CONTAINER_IMAGE_URL] .
docker push [CONTAINER_IMAGE_URL]
```

Example:

```bash
# If terraform output shows: europe-west1-docker.pkg.dev/my-project/oborishte-ingest/oborishte-ingest:latest
gcloud auth configure-docker europe-west1-docker.pkg.dev
gcloud builds submit --tag europe-west1-docker.pkg.dev/my-project/oborishte-ingest/oborishte-ingest:latest
```

Automatic cleanup: the repository keeps the `latest` tag indefinitely and removes untagged images after 1 day.
```bash
# Test a job
gcloud run jobs execute pipeline-emergent --region=europe-west1 --wait

# Or test an individual crawler
gcloud run jobs execute crawl-rayon-oborishte --region=europe-west1 --wait

# View all jobs
gcloud run jobs list --region=europe-west1

# View schedules
gcloud scheduler jobs list --location=europe-west1
```

When you make changes to Terraform files:

```bash
terraform plan   # preview changes
terraform apply  # apply changes
```
When you update the application code:

- Build and push a new image:

  ```bash
  # Build and push with Cloud Build (recommended)
  gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/oborishte-ingest:v1.0.1

  # Or build locally
  docker build -t gcr.io/YOUR_PROJECT_ID/oborishte-ingest:v1.0.1 .
  docker push gcr.io/YOUR_PROJECT_ID/oborishte-ingest:v1.0.1
  ```

- Update the Terraform variable (if using a specific tag):

  ```hcl
  # In terraform.tfvars
  image_tag = "v1.0.1"
  ```

- Apply the changes:

  ```bash
  terraform apply
  ```

Alternatively, update jobs directly without Terraform:

```bash
gcloud run jobs update crawl-rayon-oborishte \
  --image=gcr.io/YOUR_PROJECT_ID/oborishte-ingest:v1.0.1 \
  --region=europe-west1
```

All variables are defined in `variables.tf`. Override them in `terraform.tfvars`:
| Variable | Description | Default |
|---|---|---|
| `project_id` | GCP Project ID | required |
| `firebase_project_id` | Firebase Project ID | required |
| `alert_email` | Email for pipeline failure alerts | required |
| `app_url` | Public URL of the web app | required |
| `ci_service_account_email` | CI/CD service account email | required |
| `region` | GCP region | `europe-west1` |
| `locality` | Locality identifier (e.g. `bg.sofia`) | `bg.sofia` |
| `image_name` | Docker image name | `oborishte-ingest` |
| `image_tag` | Docker image tag | `latest` |
| `artifact_registry_repo_id` | Artifact Registry repository ID | `oborishte-ingest` |
| `schedule_timezone` | Timezone for schedules | `Europe/Sofia` |
| `localities` | Locality IDs to deploy crawlers for | `["bg.sofia"]` |
| `crawlers` | Manual override for the crawler map | `{}` (auto-assembled) |
| `gcs_generic_bucket` | GCS bucket for file storage (optional) | `""` |
| `sentry_dsn_secret_id` | Secret Manager ID for Sentry DSN | `""` |
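Putting the required variables together, a minimal `terraform.tfvars` might look like the sketch below. All values are placeholders, not taken from a real deployment:

```hcl
# Required variables (placeholder values)
project_id               = "my-gcp-project"
firebase_project_id      = "my-firebase-project"
alert_email              = "alerts@example.com"
app_url                  = "https://example.com"
ci_service_account_email = "ci-deployer@my-gcp-project.iam.gserviceaccount.com"

# Optional overrides (defaults shown in the table above)
region    = "europe-west1"
image_tag = "v1.0.1"
```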
Edit schedules in `variables.tf`:

```hcl
variable "schedules" {
  type = object({
    pipeline_emergent = string # Emergent crawlers + ingest + notify
    pipeline_all      = string # All crawlers + ingest + notify
  })
  default = {
    pipeline_emergent = "*/30 7-22 * * *"  # Every 30 minutes, 7:00 AM–10:30 PM
    pipeline_all      = "0 10,14,16 * * *" # 3x daily at 10:00, 14:00, 16:00
  }
}
```

Cron format: `minute hour day month weekday`

Examples:

- `0 6 * * *`: daily at 6:00 AM
- `0 6 * * 1`: weekly on Mondays at 6:00 AM
- `0 */6 * * *`: every 6 hours
- `*/30 * * * *`: every 30 minutes
- `30 8 * * 1-5`: weekdays at 8:30 AM

Note: individual crawler, ingest, and notify jobs remain available for manual triggering but are not scheduled. Only the two pipeline jobs run on a schedule.
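The cron fields above can be read mechanically. As an illustration (not part of the pipeline code), this TypeScript sketch expands a single cron field into the values it matches, covering only the `*`, `*/n`, `a-b`, and comma-list forms used in this README:

```typescript
// Expand one cron field (e.g. "*/30", "7-22", "10,14,16", "*") into the
// concrete values it matches. Supports only the forms used in this README,
// not the full cron grammar.
function expandField(field: string, min: number, max: number): number[] {
  const values: number[] = [];
  for (const part of field.split(",")) {
    const [range, stepStr] = part.split("/");
    const step = stepStr ? parseInt(stepStr, 10) : 1;
    let lo = min;
    let hi = max;
    if (range !== "*") {
      const [a, b] = range.split("-");
      lo = parseInt(a, 10);
      hi = b !== undefined ? parseInt(b, 10) : lo;
    }
    for (let v = lo; v <= hi; v += step) values.push(v);
  }
  return values;
}

// "*/30 7-22 * * *": minutes 0 and 30 within hours 7 through 22,
// i.e. every 30 minutes from 07:00 to 22:30.
console.log(expandField("*/30", 0, 59));     // -> minutes 0 and 30
console.log(expandField("7-22", 0, 23));     // -> hours 7 through 22
console.log(expandField("10,14,16", 0, 23)); // -> hours 10, 14, 16
```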
The pipeline uses Google Cloud Workflows for orchestration:

- Workflow definitions: `workflows/all.yaml.tftpl` and `workflows/emergent.yaml.tftpl` (Terraform templates)
- Triggered by: Cloud Scheduler (same schedules as before)
- Execution: crawlers run in parallel via a `parallel: for:` loop, followed by sequential ingest and notify steps
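For orientation, a workflow with a `parallel: for:` loop over Cloud Run jobs looks roughly like the fragment below. This is an illustrative sketch only: the real templates are generated by Terraform, and the project and job names here are placeholders.

```yaml
# Illustrative fragment; the deployed YAML is rendered from the .tftpl
# templates and the crawler list comes from local.crawlers.
main:
  steps:
    - crawl:
        parallel:
          for:
            value: job
            in: ["crawl-rayon-oborishte", "crawl-other"] # placeholder names
            steps:
              - runCrawler:
                  call: googleapis.run.v1.namespaces.jobs.run
                  args:
                    name: ${"namespaces/my-project/jobs/" + job}
                    location: europe-west1
    - ingest:
        call: googleapis.run.v1.namespaces.jobs.run
        args:
          name: namespaces/my-project/jobs/ingest
          location: europe-west1
```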
Workflow changes: the workflow YAML is generated from Terraform templates using `templatefile()`. Crawler lists are derived automatically from `local.crawlers` (assembled from per-locality files in `ingest/terraform/`):

```bash
terraform plan  # Workflows will be updated
terraform apply
```

Adding new crawlers:

- Add an entry to the relevant locality file (e.g. `crawlers.bg.sofia.tf`); the workflow templates pick it up automatically via `local.crawlers`.
- For emergent crawlers (30-minute intervals): also set `emergent = true` in the crawler entry and update `EMERGENT_CRAWLERS` in `pipeline.ts`. This is a manual allowlist keyed by crawler source/directory names (not Terraform job keys). `pipeline.ts` automatically discovers the full list of available crawlers from the filesystem; only membership in the emergent group is controlled manually via `EMERGENT_CRAWLERS`.
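To make the allowlist idea concrete, here is a minimal TypeScript sketch. The set contents and crawler names below are hypothetical, not the real contents of `pipeline.ts`:

```typescript
// Hypothetical allowlist of emergent crawler source names; the real
// EMERGENT_CRAWLERS in pipeline.ts may list different names.
const EMERGENT_CRAWLERS = new Set(["rayon-oborishte"]);

// Stand-in for the crawlers pipeline.ts discovers from the filesystem.
const discovered = ["rayon-oborishte", "some-other-crawler"];

// Only allowlist membership decides which crawlers run every 30 minutes;
// everything discovered still runs in the full (3x daily) pipeline.
const emergent = discovered.filter((name) => EMERGENT_CRAWLERS.has(name));

console.log(emergent); // -> only the allowlisted crawler
```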
Manual workflow testing:

```bash
# Execute a workflow directly
gcloud workflows execute pipeline-emergent --location=europe-west1

# View execution details
gcloud workflows executions describe [EXECUTION_ID] \
  --workflow=pipeline-emergent \
  --location=europe-west1

# List recent workflow executions
gcloud workflows executions list \
  --workflow=pipeline-emergent \
  --location=europe-west1 \
  --limit=10
```

Legacy pipeline jobs: the monolithic `pipeline-emergent` and `pipeline-all` Cloud Run jobs remain available for local testing and emergency manual execution, but are no longer scheduled by Cloud Scheduler.

```bash
# Manual execution of a legacy pipeline job
gcloud run jobs execute pipeline-emergent --region=europe-west1 --wait
```

Local development:

```bash
cd ingest
pnpm pipeline:emergent
pnpm pipeline:all
```

Change memory/CPU in `main.tf`:
```hcl
resources {
  limits = {
    cpu    = "2"   # Change CPU
    memory = "2Gi" # Change memory
  }
}
```

Remote state is configured by default via a GCS backend in `main.tf`. The bucket and prefix cannot use Terraform variables, so override them at init time:

```bash
terraform init \
  -backend-config="bucket=your-terraform-state-bucket" \
  -backend-config="prefix=your-prefix"
```

Create the bucket first if it doesn't exist:
```bash
gsutil mb -l europe-west1 gs://your-terraform-state-bucket
gsutil versioning set on gs://your-terraform-state-bucket
```

Then re-initialize:

```bash
terraform init -migrate-state
```

Other useful commands:

```bash
# Initialize/upgrade providers
terraform init -upgrade

# Format code
terraform fmt

# Validate configuration
terraform validate

# Show current state
terraform show

# List resources
terraform state list

# Get a specific output
terraform output service_account_email

# Destroy all resources (careful!)
terraform destroy
```

Ensure you have the necessary permissions:
```bash
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="user:your-email@example.com" \
  --role="roles/editor"
```

If you get "API not enabled" errors, Terraform should enable the required APIs automatically. If not:

```bash
gcloud services enable run.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable cloudscheduler.googleapis.com
```

If working with a team and the state is locked:

```bash
terraform force-unlock LOCK_ID
```

Check scheduler status:

```bash
gcloud scheduler jobs describe crawl-rayon-oborishte-schedule \
  --location=europe-west1
```

Manually trigger a schedule:

```bash
gcloud scheduler jobs run crawl-rayon-oborishte-schedule \
  --location=europe-west1
```

The infrastructure defined here should cost about $0.20/month:
- Cloud Run Jobs: FREE (within free tier)
- Cloud Scheduler: $0.10/month × 2 pipeline schedules = $0.20/month
To reduce costs further:
- Reduce pipeline frequency (e.g., emergent every hour instead of 30 minutes)
- Reduce full pipeline to 2x daily instead of 3x
- Use smaller memory allocations
- Never commit `terraform.tfvars` or `*.tfstate` files
- Use remote state with locking for team collaboration
- Store secrets in Google Secret Manager, not in Terraform:

  ```hcl
  env {
    name = "API_KEY"
    value_source {
      secret_key_ref {
        secret  = "api-key-secret"
        version = "latest"
      }
    }
  }
  ```

- Use a service account with minimal permissions
- Enable VPC Service Controls for production
Example GitHub Actions workflow:

```yaml
name: Deploy Infrastructure
on:
  push:
    branches: [main]
    paths: ["ingest/terraform/**"]
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init
        working-directory: ingest/terraform
      - name: Terraform Apply
        run: terraform apply -auto-approve
        working-directory: ingest/terraform
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GCP_CREDENTIALS }}
```