MOLT Replicator exposes Prometheus metrics at each stage of the replication pipeline. When using Replicator to perform forward replication or failback, you should monitor the health of each relevant pipeline stage to quickly detect issues.
This page describes Replicator metrics and provides usage guidelines, organized by replication source:
- PostgreSQL
- MySQL
- Oracle
- CockroachDB (during failback)
Replication pipeline
MOLT Replicator replicates data as a pipeline of change events that travel from the source database to the target database where changes are applied. The Replicator pipeline consists of four stages:
- Source read: Connects Replicator to the source database and captures changes via logical replication (PostgreSQL, MySQL), LogMiner (Oracle), or changefeed messages (CockroachDB).
- Staging: Buffers mutations for ordered processing and crash recovery.
- Core sequencer: Processes staged mutations, maintains ordering guarantees, and coordinates transaction application.
- Target apply: Applies mutations to the target database.
Set up metrics
Enable Replicator metrics by specifying the --metricsAddr flag with a port (or host:port) when you start Replicator. This exposes Replicator metrics at http://{host}:{port}/_/varz. For example, the following command exposes metrics on port 30005:
replicator start \
--targetConn $TARGET \
--stagingConn $STAGING \
--metricsAddr :30005
...
To collect Replicator metrics, set up Prometheus to scrape the Replicator metrics endpoint. To visualize Replicator metrics, use Grafana to create dashboards.
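For reference, a minimal Prometheus scrape configuration for the Replicator endpoint might look like the following sketch. The job name, scrape interval, and target address are placeholder values; note that metrics_path points at /_/varz rather than the Prometheus default of /metrics:
scrape_configs:
  - job_name: molt-replicator
    scrape_interval: 15s
    metrics_path: /_/varz
    static_configs:
      - targets: ['localhost:30005']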
Metrics endpoints
The following endpoints are available when you enable Replicator metrics:
| Endpoint | Description |
|---|---|
| /_/varz | Prometheus metrics endpoint. |
| /_/diag | Structured diagnostic information (JSON). |
| /_/healthz | Health check endpoint. |
| /debug/pprof/ | Go pprof handlers for profiling. |
For example, to view the current snapshot of Replicator metrics on port 30005, open http://localhost:30005/_/varz in a browser. To track metrics over time and create visualizations, use Prometheus and Grafana as described in Set up metrics.
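You can also fetch the raw metrics from the command line to spot-check a single series. For example, assuming Replicator is exposing metrics on port 30005 as above:
curl -s http://localhost:30005/_/varz | grep core_source_lag_seconds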
To check Replicator health:
curl http://localhost:30005/_/healthz
OK
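The /debug/pprof/ endpoint exposes the standard Go profiling handlers. For example, the first command below dumps the current goroutine stacks, and the second collects a 30-second CPU profile (the second command assumes the Go toolchain is installed on the machine where you run it):
curl -s 'http://localhost:30005/debug/pprof/goroutine?debug=1' | head -n 20
go tool pprof 'http://localhost:30005/debug/pprof/profile?seconds=30'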
Visualize metrics
Use the Replicator Grafana dashboards bundled with your binary to visualize metrics. The general Replicator dashboard (replicator_grafana_dashboard.json) displays overall replication metrics, and the Oracle-specific dashboard (replicator_oracle_grafana_dashboard.json) displays Oracle source metrics. The bundled dashboards match your binary version. Alternatively, you can download the latest dashboards for Replicator and Oracle source metrics.
Overall replication metrics
High-level performance metrics
Monitor the following metrics to track the overall health of the replication pipeline:
core_source_lag_seconds
- Description: Age of the most recently received checkpoint. Depending on the source, this represents the time from source commit to COMMIT event processing, or the time elapsed since the latest received resolved timestamp.
- Interpretation: If consistently increasing, Replicator is falling behind in reading source changes and cannot keep pace with database changes.
target_apply_mutation_age_seconds
- Description: End-to-end replication lag per mutation, from source commit to target apply. Measures the difference between the current wall time and the mutation's MVCC timestamp.
- Interpretation: Higher values mean that older mutations are being applied, indicating end-to-end pipeline delays. Compare across tables to find bottlenecks.
target_apply_queue_utilization_percent
- Description: Percentage of target apply queue capacity in use.
- Interpretation: Values above 90 percent indicate severe backpressure throughout the pipeline and potential data processing delays. Increase --targetApplyQueueSize or investigate target database performance.
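As a starting point for alerting on these metrics, the following Prometheus alerting rules are a sketch; the thresholds and durations are example values to tune for your workload:
groups:
  - name: molt-replicator
    rules:
      - alert: ReplicatorSourceLagHigh
        expr: core_source_lag_seconds > 300
        for: 5m
        annotations:
          summary: Replicator source lag has exceeded 5 minutes.
      - alert: ReplicatorApplyQueueSaturated
        expr: target_apply_queue_utilization_percent > 90
        for: 5m
        annotations:
          summary: Target apply queue utilization has exceeded 90 percent.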
Replication lag
Monitor the following metric to track end-to-end replication lag:
target_apply_transaction_lag_seconds
- Description: Age of the transaction applied to the target table, measuring the time from source commit to target apply.
- Interpretation: Consistently high values indicate bottlenecks in the pipeline. Compare with core_source_lag_seconds to determine whether the delay is in source read or target apply.
Progress tracking
Monitor the following metrics to track checkpoint progress:
target_applied_timestamp_seconds
- Description: Wall time (Unix timestamp) of the most recently applied resolved timestamp.
- Interpretation: Use to verify continuous progress. Stale values indicate apply stalls.
target_pending_timestamp_seconds
- Description: Wall time (Unix timestamp) of the most recently received resolved timestamp.
- Interpretation: A gap between this metric and target_applied_timestamp_seconds indicates an apply backlog, meaning that the pipeline cannot keep up with incoming changes.
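For example, the apply backlog in seconds can be computed as the gap between the pending and applied timestamps. The following query is a sketch that assumes a Prometheus server at localhost:9090 scraping the Replicator endpoint, and that both series carry the same label set:
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=target_pending_timestamp_seconds - target_applied_timestamp_seconds'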
Replication pipeline metrics
Source read
Source read metrics track the health of connections to source databases and the volume of incoming changes.
CockroachDB source
checkpoint_committed_age_seconds
- Description: Age of the committed checkpoint.
- Interpretation: Increasing values indicate checkpoint commits are falling behind, which affects crash recovery capability.
checkpoint_proposed_age_seconds
- Description: Age of the proposed checkpoint.
- Interpretation: A gap with checkpoint_committed_age_seconds indicates checkpoint commit lag.
checkpoint_commit_duration_seconds
- Description: Amount of time taken to save the committed checkpoint to the staging database.
- Interpretation: High values indicate staging database bottlenecks due to write contention or performance issues.
checkpoint_proposed_going_backwards_errors_total
- Description: Number of times an error condition occurred where the changefeed was restarted.
- Interpretation: Indicates a source changefeed restart or time regression. Requires immediate investigation of source changefeed stability.
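For example, checkpoint commit lag can be approximated as the difference between the committed and proposed checkpoint ages. This query assumes a Prometheus server at localhost:9090 scraping the Replicator endpoint:
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=checkpoint_committed_age_seconds - checkpoint_proposed_age_seconds'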
Oracle source
To visualize the following metrics, import the Oracle Grafana dashboard bundled with your binary (replicator_oracle_grafana_dashboard.json). The bundled dashboard matches your binary version. Alternatively, you can download the latest dashboard.
oraclelogminer_scn_interval_size
- Description: Size of the interval from the start SCN to the current Oracle SCN.
- Interpretation: Values larger than the --scnWindowSize flag value indicate replication lag, or that replication is idle.
oraclelogminer_time_per_window_seconds
- Description: Amount of time taken to fully process an SCN interval.
- Interpretation: Large values indicate an Oracle slowdown, a blocked replication loop, or slow processing.
oraclelogminer_query_redo_logs_duration_seconds
- Description: Amount of time taken to query redo logs from LogMiner.
- Interpretation: High values indicate Oracle is under load or the SCN interval is too large.
oraclelogminer_num_inflight_transactions_in_memory
- Description: Current number of in-flight transactions in memory.
- Interpretation: High counts indicate long-running transactions on the source. Monitor for memory usage.
oraclelogminer_num_async_checkpoints_in_queue
- Description: Number of checkpoints queued for processing against the staging database.
- Interpretation: Values close to the --checkpointQueueBufferSize flag value indicate checkpoint processing cannot keep up with incoming checkpoints.
oraclelogminer_upsert_checkpoints_duration
- Description: Amount of time taken to upsert a checkpoint batch into the staging database.
- Interpretation: High values indicate the staging database is under heavy load or the batch size is too large.
oraclelogminer_delete_checkpoints_duration
- Description: Amount of time taken to delete old checkpoints from the staging database.
- Interpretation: High values indicate staging database load or long-running transactions preventing checkpoint deletion.
mutation_total
- Description: Total number of mutations processed, labeled by source and mutation type (insert/update/delete).
- Interpretation: Use to monitor replication throughput and identify traffic patterns.
MySQL source
mylogical_dial_success_total
- Description: Number of times Replicator successfully started logical replication.
- Interpretation: Multiple successes may indicate reconnects. Monitor for connection stability.
mylogical_dial_failure_total
- Description: Number of times Replicator failed to start logical replication.
- Interpretation: Nonzero values indicate connection issues. Check network connectivity and source database health.
mutation_total
- Description: Total number of mutations processed, labeled by source and mutation type (insert/update/delete).
- Interpretation: Use to monitor replication throughput and identify traffic patterns.
PostgreSQL source
pglogical_dial_success_total
- Description: Number of times Replicator successfully started logical replication (executed the START_REPLICATION command).
- Interpretation: Multiple successes may indicate reconnects. Monitor for connection stability.
pglogical_dial_failure_total
- Description: Number of times Replicator failed to start logical replication (failed to execute the START_REPLICATION command).
- Interpretation: Nonzero values indicate connection issues. Check network connectivity and source database health.
mutation_total
- Description: Total number of mutations processed, labeled by source and mutation type (insert/update/delete).
- Interpretation: Use to monitor replication throughput and identify traffic patterns.
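For any of these sources, the rate of mutation_total gives a quick view of replication throughput. For example, assuming the same Prometheus server at localhost:9090 as in the earlier examples:
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(mutation_total[5m])'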
Staging
Staging metrics track the health of the staging layer where mutations are buffered for ordered processing.
For checkpoint terminology, refer to the MOLT Replicator documentation.
stage_commit_lag_seconds
- Description: Time between writing a mutation to the source and writing it to staging.
- Interpretation: High values indicate delays in getting data into the staging layer.
stage_mutations_total
- Description: Number of mutations staged for each table.
- Interpretation: Use to monitor staging throughput per table.
stage_duration_seconds
- Description: Amount of time taken to successfully stage mutations.
- Interpretation: High values indicate write performance issues on the staging database.
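For example, the staging rate can be derived from stage_mutations_total (again assuming a Prometheus server at localhost:9090 scraping the Replicator endpoint):
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(stage_mutations_total[5m])'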
Core sequencer
Core sequencer metrics track mutation processing, ordering, and transaction coordination.
core_sweep_duration_seconds
- Description: Duration of each schema sweep operation, which looks for and applies staged mutations.
- Interpretation: Long durations indicate that large backlogs, slow staging reads, or slow target writes are affecting throughput.
core_sweep_mutations_applied_total
- Description: Total count of mutations read from staging and successfully applied to the target database during a sweep.
- Interpretation: Use to monitor processing throughput. A flat line indicates no mutations are being applied.
core_sweep_success_timestamp_seconds
- Description: Wall time (Unix timestamp) at which a sweep attempt last succeeded.
- Interpretation: If this value stops updating and becomes stale, it indicates that the sweep has stopped.
core_parallelism_utilization_percent
- Description: Percentage of the configured parallelism that is actively being used for concurrent transaction processing.
- Interpretation: High utilization indicates bottlenecks in mutation processing.
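For example, the sweep apply rate and the time since the last successful sweep can be checked with queries like the following (a sketch that assumes a Prometheus server at localhost:9090 scraping the Replicator endpoint):
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(core_sweep_mutations_applied_total[5m])'
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=time() - core_sweep_success_timestamp_seconds'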
Target apply
Target apply metrics track mutation application to the target database.
target_apply_queue_size
- Description: Number of transactions waiting in the target apply queue.
- Interpretation: High values indicate target apply cannot keep up with incoming transactions.
apply_duration_seconds
- Description: Amount of time taken to successfully apply mutations to a table.
- Interpretation: High values indicate target database performance issues or contention.
apply_upserts_total
- Description: Number of rows upserted to the target.
- Interpretation: Use to monitor write throughput. Should grow steadily during active replication.
apply_deletes_total
- Description: Number of rows deleted from the target.
- Interpretation: Use to monitor delete throughput. Compare with delete operations on the source database.
apply_errors_total
- Description: Number of times an error was encountered while applying mutations.
- Interpretation: A growing error count indicates target database issues or constraint violations.
apply_conflicts_total
- Description: Number of rows that experienced a compare-and-set (CAS) conflict.
- Interpretation: High counts indicate concurrent modifications or stale data conflicts. May require conflict resolution tuning.
apply_resolves_total
- Description: Number of rows that experienced a compare-and-set (CAS) conflict and were successfully resolved.
- Interpretation: Compare with apply_conflicts_total to verify conflict resolution is working. Should be close to or equal to conflicts.
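For example, the following queries surface the apply error rate and the fraction of CAS conflicts that were successfully resolved (a sketch assuming a Prometheus server at localhost:9090 scraping the Replicator endpoint):
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(apply_errors_total[5m])'
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(apply_resolves_total[5m]) / rate(apply_conflicts_total[5m])'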
Userscript metrics
Userscripts allow you to define how rows are transformed, filtered, and routed before Replicator writes them to the target database. Replicator exposes Prometheus metrics that provide insight into userscript activity, performance, and stability.
script_invocations_total (counter)
- Description: Number of times userscript handler functions (such as onRowUpsert, onRowDelete, and onWrite) are invoked.
- Interpretation: Use to confirm that userscripts are actively being called, and detect misconfigurations where scripts filter out all data or never run.
script_rows_filtered_total (counter)
- Description: Number of rows filtered out by the userscript (for example, handlers that returned null or produced no output).
- Interpretation: Use to identify scripts that unintentionally drop incoming data, and confirm that logic for filtering out data rows is working as intended.
script_rows_processed_total (counter)
- Description: Number of rows successfully processed and passed through the userscript.
- Interpretation: Use to measure how many rows are being transformed or routed successfully. Compare with script_rows_filtered_total to understand filtering ratios and validate script logic.
script_exec_time_seconds (histogram)
- Description: Measures the execution time of each userscript function call.
- Interpretation: Use to detect slow or inefficient userscripts that could introduce replication lag, and identify performance bottlenecks caused by complex transformations or external lookups.
script_entry_wait_seconds (histogram)
- Description: Measures the latency between a row entering the Replicator userscript queue and the start of its execution inside the JavaScript runtime.
- Interpretation: Use to detect whether userscripts are queuing up before execution (higher values indicate longer wait times), and monitor how busy the userscript runtime pool is under load.
script_errors_total (counter)
- Description: Number of errors that occurred during userscript execution (for example, JavaScript exceptions or runtime errors).
- Interpretation: Use to surface failing scripts or invalid assumptions about incoming data, and monitor script stability over time to catch regressions early.
Read more about userscript metrics.
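For example, the following queries show the share of incoming rows filtered by the userscript and the 99th-percentile script execution time (a sketch assuming a Prometheus server at localhost:9090 scraping the Replicator endpoint):
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(script_rows_filtered_total[5m]) / (rate(script_rows_filtered_total[5m]) + rate(script_rows_processed_total[5m]))'
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(script_exec_time_seconds_bucket[5m])))'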
Metrics snapshots
When enabled, the metrics snapshotter periodically writes out a point-in-time snapshot of Replicator's Prometheus metrics to a file in the Replicator data directory. Metrics snapshots can help with debugging when direct access to the Prometheus server is not available, and you can bundle snapshots and send them to CockroachDB support to help resolve an issue. A metrics snapshot includes all of the metrics on this page.
Metrics snapshotting is disabled by default, and can be enabled with the --metricsSnapshotPeriod Replicator flag. Replicator metrics must be enabled (with the --metricsAddr flag) in order for metrics snapshotting to work.
If snapshotting is enabled, the snapshot period must be at least 15 seconds. The recommended range for the snapshot period is 15-60 seconds. The retention policy for metrics snapshot files can be determined by time and by the total size of the snapshot data subdirectory. At least one retention policy must be configured. Snapshots can also be compressed to a gzip file.
Changing the snapshotter's configuration requires restarting the Replicator binary with different flags.
Enable metrics snapshotting
Step 1. Run Replicator with the snapshot flags
The following examples show replicator commands for each replication source with metrics snapshotting configured:
replicator pglogical \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--slotName molt_slot \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
replicator mylogical \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--defaultGTIDSet '4c658ae6-e8ad-11ef-8449-0242ac140006:1-29' \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
replicator oraclelogminer \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--scn 26685786 \
--backfillFromSCN 26685444 \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
replicator start \
--targetConn postgres://postgres:postgres@localhost:5432/molt?sslmode=disable \
--stagingConn postgres://root@localhost:26257/_replicator?sslmode=disable \
--bindAddr 0.0.0.0:30004 \
--stagingSchema _replicator \
--stagingCreateSchema \
--disableAuthentication \
--tlsSelfSigned \
--stageMode crdb \
--bestEffortWindow 1s \
--flushSize 1000 \
--metricsAddr :30005 \
--metricsSnapshotPeriod 15s \
--metricsSnapshotCompression gzip \
--metricsSnapshotRetentionTime 168h \
-v
If successful, Replicator will start, and the console output will indicate that the snapshotter has started as well:
INFO [Feb 2 10:20:32] Replicator starting
...
INFO [Feb 2 10:20:32] metrics snapshotter started, writing to replicator-data/metrics-snapshots every 15s, retaining 168h0m0s
Upon interruption of Replicator, the snapshotter will be stopped:
INFO [Feb 2 10:26:45] Interrupted
INFO [Feb 2 10:26:45] metrics snapshotter stopped
INFO [Feb 2 10:26:45] Server shutdown complete
Step 2. Find the snapshot files in the data directory
You can find the snapshot files in the Replicator data directory:
cd replicator-data/metrics-snapshots && ls . | tail -n 5
snapshot-20260202T152405.737Z.txt.gz
snapshot-20260202T152420.736Z.txt.gz
snapshot-20260202T152435.736Z.txt.gz
snapshot-20260202T152450.735Z.txt.gz
snapshot-20260202T152505.735Z.txt.gz
Each uncompressed file lists the metrics collected at the time of the snapshot:
gzcat snapshot-20260202T152505.735Z.txt.gz | head -n 3
# HELP cdc_resolved_timestamp_buffer_size Current size of the resolved timestamp buffer channel which is yet to be processed by Pebble Stager
# TYPE cdc_resolved_timestamp_buffer_size gauge
cdc_resolved_timestamp_buffer_size 0.0 1.770045905735e+09
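Because each snapshot is a point-in-time capture, you can track how a single metric evolves over time by searching across the files. For example, the following loop prints one value of core_source_lag_seconds per snapshot file (the metric name here is only an illustration; substitute any metric of interest):
for f in snapshot-*.txt.gz; do
  echo "$f $(gzcat "$f" | grep '^core_source_lag_seconds ')"
done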
Bundle and send metrics snapshots
The following steps require a Linux system with bash installed.
Step 1. Download the export script
Download the metrics snapshot export script. Ensure it's accessible and can be run by the current user.
Step 2. Run a snapshot export
Run an export, specifying the metrics-snapshots directory within your Replicator data directory. You can also provide start and end timestamps to define a subset of metrics to bundle. Timestamps are specified in UTC and must use the format YYYYMMDDTHHMMSS.
Running the script without timestamps bundles all of the data in the snapshot directory. For example:
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots
Running the script with one timestamp bundles all of the data in the snapshot directory beginning at that timestamp. For example:
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots 20260115T120000
Running the script with two timestamps bundles all of the data in the snapshot directory within the two timestamps. For example:
./export-metrics-snapshots.sh ./replicator-data/metrics-snapshots 20260115T120000 20260115T140000
The resulting output is a .tar.gz file placed in the directory from which you ran the script (or at a path specified as an optional argument).
Step 3. Upload output file to a support ticket
Attach the bundled metrics snapshot file to a support ticket to give CockroachDB support metrics information relevant to your issue.