Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Trie-constrained beam search, used by many GR methods, is susceptible to early pruning of relevant prefixes. Planning ahead in generative retrieval (PAG) mitigates such prefix pruning by using simultaneous decoding to compute a document-level look-ahead prior that guides subsequent sequential decoding. We reproduce PAG at inference time and stress-test its decoding behavior. Using the authors' released checkpoint and identifier/trie artifacts in the reported decoding setup, we reproduce the main effectiveness results on MS~MARCO Dev and TREC-DL 2019/2020 and corroborate the reported beam-size and latency trade-offs on our hardware. Beyond reproduction, we introduce plan-drift diagnostics that quantify how intent-preserving query variations, including misspellings, reordering, synonym substitutions, paraphrases, naturality shifts, and translation-based variants, change the planner's top-n candidate set and highest-weight tokens. We find that the planning signal is brittle: intent-preserving typos cause ``plan collapse,'' where the look-ahead bonus effectively vanishes and the model reverts to a weaker unguided search. We further evaluate cross-lingual robustness by querying a fixed English index with non-English \textsc{mMARCO} inputs, and assess inference-time mitigations and query-side adaptation that require no re-indexing. Overall, we confirm PAG's reported effectiveness and the benefit of planning-guided decoding, while showing that the planner's sparse token-level scoring mechanism is sensitive to query surface-form variation, a robustness aspect not systematically evaluated in the original work.