TIP #93: GoldenGate lag

I have a client with Oracle databases using bi-directional GoldenGate replication in their production and UAT environments. I noticed that the DBs were out of sync. Checking GoldenGate lag with "info all" did not show any major lag, and nothing was reported in the GoldenGate log or the replication discard file. All processes were up and running on the source and target boxes, per the following snapshots. (On node A, EXT01SA is the primary extract, PPSA01SB is the pump that pushes changes to node B, and RESB01SA is the replicat that receives changes from node B. The same set of processes exists on node B.)

------- Node A --------
GGSCI (########) 1> info all

Program     Status      Group       Lag         Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     EXT01SA     00:00:00    00:00:10
EXTRACT     RUNNING     PPSA01SB    00:00:00    00:00:01
REPLICAT    RUNNING     RESB01SA    00:00:00    00:00:02

------ Node B ---------

Program     Status      Group       Lag         Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     EXT01SB     00:00:00    00:00:03
EXTRACT     RUNNING     PPSB01SA    00:00:00    00:00:02
REPLICAT    RUNNING     RESA01SB    00:00:00    00:00:01


Digging further into the GG processes, I found that the pump was behind the primary extract, even though this was not reported by "info all". In more detail, the output below shows that the primary extract is writing to trail sequence 349 while the pump is still reading sequence 340! When there is no lag, the pump should be reading almost the same sequence that the primary extract is writing.

GGSCI > send extract EXT01SA, status
Sending STATUS request to EXTRACT EXT01SA ...
EXTRACT EXT01SA (PID 21599)
Current status: Recovery complete: At EOF
Current read position:
Sequence #: 5297
RBA: 17509888
Timestamp: ############
Current write position:
Sequence #: 349
RBA: 921878
Timestamp: ##########
Extract Trail: ############

GGSCI > send extract PPSA01SB status
Sending STATUS request to EXTRACT PPSA01SB ...

EXTRACT PPSA01SB (PID 22968)
Current status: Recovery complete: Processing data

Current read position:
Sequence #: 340 
RBA: 472796720
Timestamp: #########
Extract Trail: ############

Current write position:
Sequence #: 737
RBA: 2248444
Timestamp: ###########
Extract Trail: ##########



So now we know why the DBs are out of sync, but what is the cause? Checking with logdump, the position the pump stopped at is the end of file for trail 340. (Interesting!)
Further checking in the GG log showed that the primary extract trail had been rolled over manually with "alter extract EXT01SA, etrollover". In GoldenGate, when a trail file is rolled over manually, the pump and replicat stop without detecting the new trail file, and these processes must be rolled over and repositioned as well. Interestingly enough, "info all" does not show any lag, but the moment the pump or replicat is repositioned, the command starts showing the real lag.
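For reference, the remediation after a manual etrollover is to roll over and reposition the downstream processes too. A sketch using the group names above (the <seqno> values are placeholders for the new trail sequence numbers, which you must read from your own environment, e.g. with "info extract <group>, detail", before repositioning):

```
-- On node A: roll the pump's output trail and point its read
-- position at the primary extract's new trail file
GGSCI > stop extract PPSA01SB
GGSCI > alter extract PPSA01SB, etrollover
GGSCI > alter extract PPSA01SB, extseqno <seqno>, extrba 0
GGSCI > start extract PPSA01SB

-- On node B: point the replicat at the pump's new remote trail
GGSCI > stop replicat RESA01SB
GGSCI > alter replicat RESA01SB, extseqno <seqno>, extrba 0
GGSCI > start replicat RESA01SB
```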

Lesson: the "info all" command in GoldenGate does not cover every circumstance when reporting lag. The best way is to find the read and write sequences of each GG process and compare them to determine the lag.
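That manual comparison can be scripted. A minimal sketch, assuming the "send extract <group>, status" output format shown above (the parsing helpers trail_sequences and sequence_lag are my own, not GoldenGate utilities; feed them the captured GGSCI output):

```python
import re

def trail_sequences(status_output):
    """Parse (read_seq, write_seq) from 'send extract <group>, status' output.

    For a primary extract the read position is the redo log sequence,
    so only compare a writer's write_seq against its reader's read_seq.
    """
    read_seq = write_seq = None
    section = None
    for line in status_output.splitlines():
        line = line.strip()
        if line.lower().startswith("current read position"):
            section = "read"
        elif line.lower().startswith("current write position"):
            section = "write"
        else:
            m = re.match(r"Sequence #:\s*(\d+)", line)
            if m and section == "read" and read_seq is None:
                read_seq = int(m.group(1))
            elif m and section == "write" and write_seq is None:
                write_seq = int(m.group(1))
    return read_seq, write_seq

def sequence_lag(writer_status, reader_status):
    """Lag in trail files between what one process writes and the next reads."""
    _, write_seq = trail_sequences(writer_status)
    read_seq, _ = trail_sequences(reader_status)
    return write_seq - read_seq

# Sample output, abbreviated from the snapshots in this tip:
ext_status = """EXTRACT EXT01SA (PID 21599)
Current status: Recovery complete: At EOF
Current read position:
Sequence #: 5297
RBA: 17509888
Current write position:
Sequence #: 349
RBA: 921878
"""

pump_status = """EXTRACT PPSA01SB (PID 22968)
Current status: Recovery complete: Processing data
Current read position:
Sequence #: 340
RBA: 472796720
Current write position:
Sequence #: 737
RBA: 2248444
"""

print(sequence_lag(ext_status, pump_status))  # prints 9: pump is 9 trail files behind
```

Running this against the snapshots above immediately exposes the 9-file gap that "info all" hid.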