WAL in PostgreSQL

2021-10-06

Internal Layout of XLOG Record (version 9.5 or later)

This post on WAL draws on three sources: the interdb chapter Write Ahead Logging, PostgreSQL重启恢复—-XLOG 1.0, and PostgreSQL重启恢复—-XLOG 2.0.

The data portion of an XLOG record can be divided into a header part and a data part.

Fig. 9.9. Common XLOG record format.

Below are examples of XLOG records, using an INSERT operation:

Fig. 9.10. Examples of XLOG records (version 9.5 or later).

In the common case a record carries a single XLogRecordBlockHeader and a single block data section; the XLOG record of a checkpoint carries none at all.

Data structures

XLogRecord

include/access/xlogrecord.h/XLogRecord

/*
 * The overall layout of an XLOG record is:
 *		Fixed-size header (XLogRecord struct)
 *		XLogRecordBlockHeader struct
 *		XLogRecordBlockHeader struct
 *		...
 *		XLogRecordDataHeader[Short|Long] struct
 *		block data
 *		block data
 *		...
 *		main data
 *
 * There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
 * rmgr-specific data not associated with a block.  XLogRecord structs
 * always start on MAXALIGN boundaries in the WAL files, but the rest of
 * the fields are not aligned.
 *
 * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
 * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
 * used to distinguish between block references, and the main data structs.
 */
typedef struct XLogRecord
{
	uint32		xl_tot_len;		/* total len of entire record */
	TransactionId xl_xid;		/* xact id */
	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
	uint8		xl_info;		/* flag bits, see below */
	RmgrId		xl_rmid;		/* resource manager for this record */
	/* 2 bytes of padding here, initialize to zero */
	pg_crc32c	xl_crc;			/* CRC for this record */

	/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

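xl_rmid is the resource manager ID. It identifies which kind of operation the record describes, so that recovery can call the matching function to redo it. For an insert, for example, xl_rmid is RM_HEAP_ID and xl_info is XLOG_HEAP_INSERT, and redo ends up in the heap resource manager's heap_xlog_insert() function.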

RelFileNode

/include/storage/relfilenode.h/RelFileNode

/*
 * RelFileNode must provide all that we need to know to physically access
 * a relation, with the exception of the backend ID, which can be provided
 * separately. Note, however, that a "physical" relation is comprised of
 * multiple files on the filesystem, as each fork is stored as a separate
 * file, and each fork can be divided into multiple segments. See md.c.
 *
 * spcNode identifies the tablespace of the relation.  It corresponds to
 * pg_tablespace.oid.
 *
 * dbNode identifies the database of the relation.  It is zero for
 * "shared" relations (those common to all databases of a cluster).
 * Nonzero dbNode values correspond to pg_database.oid.
 *
 * relNode identifies the specific relation.  relNode corresponds to
 * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
 * to assign new physical files to relations in some situations).
 * Notice that relNode is only unique within a database in a particular
 * tablespace.
 *
 * Note: various places use RelFileNode in hashtable keys.  Therefore,
 * there *must not* be any unused padding bytes in this struct.  That
 * should be safe as long as all the fields are of type Oid.
 */
typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace */
	Oid			dbNode;			/* database */
	Oid			relNode;		/* relation */
} RelFileNode;

A RelFileNode uniquely identifies a relation.

xl_heap_header

Stores part of the heap tuple's header information: t_infomask2, t_infomask, and t_hoff.

/*
 * We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
 * or updated tuple in WAL; we can save a few bytes by reconstructing the
 * fields that are available elsewhere in the WAL record, or perhaps just
 * plain needn't be reconstructed.  These are the fields we must store.
 * NOTE: t_hoff could be recomputed, but we may as well store it because
 * it will come for free due to alignment considerations.
 */
typedef struct xl_heap_header
{
	uint16		t_infomask2;
	uint16		t_infomask;
	uint8		t_hoff;
} xl_heap_header;

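The whole HeapTupleHeaderData does not have to be written to the XLOG: many of its fields can be reconstructed or do not need to be reconstructed at all. Only the necessary fields are stored, and xl_heap_header is the structure that records them.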

main data(xl_heap_insert)

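When an INSERT statement runs, the main data portion of the resulting XLOG record is an xl_heap_insert. It stores the tuple's offset number within the block, that is, the index of its line pointer (ItemIdData) in the page. For example, if the tuple uses the second line pointer in the page, xl_heap_insert stores 2 rather than the byte offset of the tuple body within the page. A log that records positions this way is called a physiological log (physical to a page, logical within a page); a log that records the byte offset of the tuple body is called a physical log.

typedef struct xl_heap_insert
{
	OffsetNumber offnum;		/* inserted tuple's offset number */
	uint8		flags;
} xl_heap_insert;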

Page-oriented Log

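With the structure of XLOG records in hand, we can explain what a page-oriented log is. An XLOG record describes which page a tuple should be written to and where in that page. The heap_insert flow shows the same thing: as soon as a tuple has been written into a data page, an XLOG record is generated for that write and placed in the log buffer. In other words, XLOG describes changes to the data inside pages, and that is what page-oriented logging means. The alternative is a logical log, which usually records just the SQL statement and re-executes it during redo. With a page-oriented log, redo always writes the tuple to the page it was originally written to; with a logical log, where the tuple lands during redo is not fixed.

Page-oriented logs come in two kinds, physical and physiological. As mentioned earlier, a physical log records the physical position at which the tuple was placed in the page (the value of lp_off in its ItemIdData), while a physiological log records only the logical position (the offset number of the ItemIdData itself).

Because a physical log records the tuple's actual byte offset, redo only has to locate that position and overwrite whatever is there, whether or not the tuple had reached disk. The operation is idempotent: redo gives the same result no matter how many times it runs. The drawback is that whenever a block is compacted (by vacuum, for example), the physical positions of its tuples change. Keeping the physical information exact would force compaction to emit a large volume of physical log records, which hurts performance badly.

PostgreSQL therefore uses physiological logging. "Physical" means the record names the actual data page the tuple went into; "logical" means the position within that page is a logical value. Vacuum then only has to keep each ItemIdData in place and the log is unaffected. Physiological log records are not idempotent by themselves, though: replayed repeatedly without any safeguard, they would insert the tuple more than once. So there has to be a way to decide whether a given XLOG record still needs to be redone on its page, and that is the role of the LSN.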
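To make the role of the LSN concrete, here is a minimal sketch of a redo routine with and without an LSN guard. DemoPage, DemoInsertRecord, and both functions are hypothetical names invented for this illustration; the page is reduced to an LSN plus a small slot array, which is far simpler than a real heap page.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_ITEMS 8

/* Hypothetical, heavily simplified page: an LSN plus a slot array. */
typedef struct DemoPage
{
	uint64_t	lsn;					/* LSN of the last record applied */
	int			nitems;					/* number of used slots */
	char		items[DEMO_MAX_ITEMS];	/* one byte stands in for a tuple */
} DemoPage;

/* Hypothetical insert record: "add this tuple to the page". */
typedef struct DemoInsertRecord
{
	uint64_t	lsn;		/* LSN assigned to this record */
	char		tuple;
} DemoInsertRecord;

/* Redo without any guard: applying the record twice adds two tuples. */
void
redo_unguarded(DemoPage *page, const DemoInsertRecord *rec)
{
	page->items[page->nitems++] = rec->tuple;
}

/* Redo guarded by the page LSN: returns 1 if applied, 0 if skipped. */
int
redo_guarded(DemoPage *page, const DemoInsertRecord *rec)
{
	if (rec->lsn <= page->lsn)
		return 0;				/* the change is already on the page */

	page->items[page->nitems++] = rec->tuple;
	page->lsn = rec->lsn;		/* remember that this record was applied */
	return 1;
}
```

Calling redo_guarded() twice with the same record leaves one tuple on the page, while redo_unguarded() leaves two. The guard is what turns a non-idempotent record into a safely repeatable redo step.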

partial write

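The storage stack is generally expected to write a single disk block atomically, but a PostgreSQL page (8 KB by default) usually spans more than one disk block, for example two 4 KB blocks, so a page write as a whole is not guaranteed to be atomic. If the database crashes while a page is being written out, part of the page may reach disk and the rest may not. This is a partial write. A partially written page cannot be repaired from ordinary XLOG records. Why?

First, PostgreSQL uses physiological logging, so log records are not idempotent: replaying the same record twice, say the same insert record, would put two identical tuples into the database.

Second, every page records an LSN, called the page LSN (similar to ARIES). The page LSN means that the operations of all XLOG records with an LSN less than or equal to it are already reflected in the page as written. During restart recovery, no XLOG record with an LSN less than or equal to the page LSN is redone on that page. All of this rests on one precondition: the page must have been written out completely. After a partial write, the page header may have reached disk while the page data did not, so there is no guarantee that every operation up to the page LSN is really on disk. Even when the page is known to be damaged, it cannot be repaired from the log, because of the first point.

To solve this, PostgreSQL uses backup blocks (full-page writes). The idea is that the first modification of a page after a checkpoint logs the entire content of the page in the XLOG record. I think of this as a kind of undo, although one that is only performed in the extreme case where a data block was not written out correctly.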
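The "first modification after a checkpoint" rule comes down to one comparison between the page LSN and the redo point of the latest checkpoint. The sketch below isolates that comparison; the function name and parameters are illustrative, and the real decision in XLogRecordAssemble() also considers per-buffer flags that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a record that modifies a page must carry a full-page image.
 *
 * page_lsn         LSN currently stored in the page header
 * redo_ptr         redo point of the latest checkpoint
 * full_page_writes whether full-page writes are enabled at all
 *
 * All names are hypothetical; only the comparison mirrors the source.
 */
bool
demo_needs_full_page_image(uint64_t page_lsn, uint64_t redo_ptr,
						   bool full_page_writes)
{
	if (!full_page_writes)
		return false;

	/* Page untouched since the redo point: this is the first change. */
	return page_lsn <= redo_ptr;
}
```

After the first logged change the page LSN is set to that record's LSN, which lies past the redo point, so later changes to the same page skip the image until the next checkpoint moves the redo point forward.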

Writing of XLOG Record

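We use an INSERT to walk through how PostgreSQL writes an XLOG record. Issue the following SQL statement.

INSERT INTO tbl VALUES ('A');

Sequence of steps

This statement invokes exec_simple_query(), and the flow continues as follows: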

exec_simple_query() @postgres.c

(1) ExtendCLOG() @clog.c                  /* Write the state of this transaction
                                           * "IN_PROGRESS" to the CLOG.
                                           */
(2) heap_insert()@heapam.c                /* Insert a tuple, creates a XLOG record,
                                           * and invoke the function XLogInsert.
                                           */
(3) XLogInsert() @xlog.c (9.5 or later, xloginsert.c)
                                          /* Write the XLOG record of the inserted tuple
                                           *  to the WAL buffer, and update page's pd_lsn.
                                           */
(4) finish_xact_command() @postgres.c     /* Invoke commit action.*/   
      XLogInsert() @xlog.c  (9.5 or later, xloginsert.c)
                                          /* Write a XLOG record of this commit action 
                                           * to the WAL buffer.
                                           */
(5) XLogWrite() @xlog.c                 /* Write and flush all XLOG records on 
                                           * the WAL buffer to WAL segment.
                                           */
(6) TransactionIdCommitTree() @transam.c  /* Change the state of this transaction 
                                           * from "IN_PROGRESS" to "COMMITTED" on the CLOG.
                                           */

For the CLOG, see this figure:

fig-2-02

XLogWrite() @xlog.c flushes the data in the WAL buffer to the WAL segments. It may be called in any of the following situations:

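One running transaction has committed or has aborted.
The WAL buffer has filled up because many tuples have been written. (The WAL buffer size is set by the parameter wal_buffers.)
A WAL writer process writes periodically.

If one of the above occurs, all WAL records in the WAL buffer are written into a WAL segment file regardless of whether their transactions have been committed or not. So the WAL segments can hold records of transactions that have no commit record yet. PostgreSQL has only a redo log and no undo log, so recovery has to cope with such records when replaying from the redo point: redo replays them like any others, and the tuples of transactions that never committed stay invisible because the CLOG never marks those transactions as committed.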

Write sequence of XLOG Records

Fig. 9.11. Write-sequence of XLOG records.

Fig. 9.12. Write-sequence of XLOG records (continued from Fig. 9.11).

Call stack

1. /backend/access/heap/heapam.c/heap_insert(Relation relation, HeapTuple tup, CommandId cid,int options, BulkInsertState bistate)  
   1. RelationPutHeapTuple(relation, buffer, heaptup,(options & HEAP_INSERT_SPECULATIVE) != 0);
   2. Register each part of the XLOG record in global variables
   3. recptr = XLogInsert(RM_HEAP_ID, info);
      /backend/access/transam/xloginsert.c/XLogInsert(RmgrId rmid, uint8 info)  
      Assembles the registered parts of the XLOG record and writes them to the WAL buffer
      1. rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,&fpw_lsn, &num_fpi);
         /backend/access/transam/xloginsert.c/XLogRecordAssemble(RmgrId rmid, uint8 info,XLogRecPtr RedoRecPtr, bool doPageWrites,XLogRecPtr *fpw_lsn, int *num_fpi) 
      2. EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi);
         /backend/access/transam/xlog.c/XLogInsertRecord(XLogRecData *rdata,XLogRecPtr fpw_lsn,uint8 flags,int num_fpi)
         Copies the assembled XLogRecData chain (holding the four parts of the XLOG record) into the WAL buffer

heap_insert() @heapam.c

/backend/access/heap/heapam.c/heap_insert(Relation relation, 
    HeapTuple tup, CommandId cid,
    int options, BulkInsertState bistate)

   /* XLOG stuff */
if (RelationNeedsWAL(relation))
{
	xl_heap_insert xlrec;
	xl_heap_header xlhdr;
	XLogRecPtr	recptr;
	Page		page = BufferGetPage(buffer);
	uint8		info = XLOG_HEAP_INSERT;
	int			bufflags = 0;

	/*
	 * If this is a catalog, we need to transmit combocids to properly
	 * decode, so log that as well.
	 */
	if (RelationIsAccessibleInLogicalDecoding(relation))
		log_heap_new_cid(relation, heaptup);

	/*
	 * If this is the single and first tuple on page, we can reinit the
	 * page instead of restoring the whole thing.  Set flag, and hide
	 * buffer references from XLogInsert.
	 */
	if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
		PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
	{
		info |= XLOG_HEAP_INIT_PAGE;
		bufflags |= REGBUF_WILL_INIT;
	}

	xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
	xlrec.flags = 0;
	if (all_visible_cleared)
		xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
	if (options & HEAP_INSERT_SPECULATIVE)
		xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
	Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));

	/*
	 * For logical decoding, we need the tuple even if we're doing a full
	 * page write, so make sure it's included even if we take a full-page
	 * image. (XXX We could alternatively store a pointer into the FPW).
	 */
	if (RelationIsLogicallyLogged(relation) &&
		!(options & HEAP_INSERT_NO_LOGICAL))
	{
		xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
		bufflags |= REGBUF_KEEP_DATA;
	}

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);

	xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
	xlhdr.t_infomask = heaptup->t_data->t_infomask;
	xlhdr.t_hoff = heaptup->t_data->t_hoff;

	/*
	 * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
	 * write the whole page to the xlog, we don't need to store
	 * xl_heap_header in the xlog.
	 */
	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
	XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
	XLogRegisterBufData(0,
						(char *) heaptup->t_data + SizeofHeapTupleHeader,
						heaptup->t_len - SizeofHeapTupleHeader);

	/* filtering by origin on a row level is much more efficient */
	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);

	recptr = XLogInsert(RM_HEAP_ID, info);

	PageSetLSN(page, recptr);
}

For a direct view of the registered data, see this figure (1567945832709627).

XLogInsert() @xloginsert.c

/backend/access/transam/xloginsert.c/XLogInsert(RmgrId rmid, uint8 info)

By this point heap_insert() has registered the following data:

/* xlrec is the xl_heap_insert struct */
XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
/* xlhdr is the xl_heap_header struct */
XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
/* heaptup->t_data + SizeofHeapTupleHeader is the actual tuple data after the fixed header */
XLogRegisterBufData(0,
					(char *) heaptup->t_data + SizeofHeapTupleHeader,
					heaptup->t_len - SizeofHeapTupleHeader);

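The registration steps have built the following parts of the XLOG record (each part is marked as built or not yet built):

XLogRecord (not yet built) + XLogRecordBlockHeader (not yet built) + RelFileNode (built) + BlockNumber (built) + mainrdata_len (not yet built) +

xl_heap_header (built) + actual tuple data (built) + xl_heap_insert (built)

xl_heap_insert is in the mainrdata chain.
RelFileNode, BlockNumber, xl_heap_header, and the actual tuple data are held with the registered buffer (regbuf) and its chain.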
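The two chains can be pictured with a small model of the registration step. This is a sketch under simplifying assumptions, not the code in xloginsert.c: DemoRecData is modeled on XLogRecData, while DemoChain, demo_register(), the fixed-size pool, and the two global chains are hypothetical stand-ins for the main data chain and the chain of block 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One link of a data chain, modeled on XLogRecData. */
typedef struct DemoRecData
{
	struct DemoRecData *next;	/* next link, or NULL */
	const char *data;			/* registered bytes (not copied) */
	uint32_t	len;
} DemoRecData;

/* A chain with head and tail pointers and a running length. */
typedef struct DemoChain
{
	DemoRecData *head;
	DemoRecData *tail;
	uint32_t	total_len;
} DemoChain;

#define DEMO_MAX_RDATAS 8
DemoRecData demo_rdatas[DEMO_MAX_RDATAS];	/* pool of links */
int			demo_num_rdatas = 0;

DemoChain	demo_main_chain;	/* stands in for the main data chain */
DemoChain	demo_block0_chain;	/* stands in for block 0's data chain */

/* Append a (pointer, length) pair to a chain. Only the pointer is stored. */
void
demo_register(DemoChain *chain, const char *data, uint32_t len)
{
	DemoRecData *rdata;

	assert(demo_num_rdatas < DEMO_MAX_RDATAS);
	rdata = &demo_rdatas[demo_num_rdatas++];
	rdata->next = NULL;
	rdata->data = data;
	rdata->len = len;

	if (chain->head == NULL)
		chain->head = rdata;
	else
		chain->tail->next = rdata;
	chain->tail = rdata;
	chain->total_len += len;
}
```

Registering xlrec on demo_main_chain and then xlhdr and the tuple bytes on demo_block0_chain reproduces the split described above. Because only pointers are recorded, the registered buffers have to stay valid until the record has been assembled and copied.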

XLogRecordAssemble() @xloginsert.c

/backend/access/transam/xloginsert.c/XLogRecordAssemble(RmgrId rmid, uint8 info,
				   XLogRecPtr RedoRecPtr, bool doPageWrites,
				   XLogRecPtr *fpw_lsn, int *num_fpi)

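XLogRecordAssemble() fills in the parts that registration left unbuilt: XLogRecord, XLogRecordBlockHeader, and mainrdata_len. It then assembles the four parts of the XLOG record (the XLOG header, xl_heap_header, the tuple data, and xl_heap_insert) into an XLogRecData chain.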

// src/include/access/xlog_internal.h

/*
 * The functions in xloginsert.c construct a chain of XLogRecData structs
 * to represent the final WAL record.
 */
typedef struct XLogRecData
{
	struct XLogRecData *next;	/* next struct in chain, or NULL */
	char	   *data;			/* start of rmgr data to include */
	uint32		len;			/* length of rmgr data to include */
} XLogRecData;

/*
 * Assemble a WAL record from the registered data and buffers into an
 * XLogRecData chain, ready for insertion with XLogInsertRecord().
 *
 * The record header fields are filled in, except for the xl_prev field. The
 * calculated CRC does not include the record header yet.
 *
 * If there are any registered buffers, and a full-page image was not taken
 * of all of them, *fpw_lsn is set to the lowest LSN among such pages. This
 * signals that the assembled record is only good for insertion on the
 * assumption that the RedoRecPtr and doPageWrites values were up-to-date.
 */
static XLogRecData *
XLogRecordAssemble(RmgrId rmid, uint8 info,
				   XLogRecPtr RedoRecPtr, bool doPageWrites,
				   XLogRecPtr *fpw_lsn, int *num_fpi)
{
	XLogRecData *rdt;
	uint32		total_len = 0;
	int			block_id;
	pg_crc32c	rdata_crc;
	registered_buffer *prev_regbuf = NULL;
	XLogRecData *rdt_datas_last;
	XLogRecord *rechdr;
	char	   *scratch = hdr_scratch;

	/*
	 * All the modifications we do to the rdata chains below must handle that.
	 */

	/* The record begins with the fixed-size header */
	rechdr = (XLogRecord *) scratch;
	scratch += SizeOfXLogRecord;

	hdr_rdt.next = NULL;
	rdt_datas_last = &hdr_rdt;
	hdr_rdt.data = hdr_scratch;

	/*
	 * Enforce consistency checks for this record if user is looking for it.
	 * Do this before at the beginning of this routine to give the possibility
	 * for callers of XLogInsert() to pass XLR_CHECK_CONSISTENCY directly for
	 * a record.
	 */
	if (wal_consistency_checking[rmid])
		info |= XLR_CHECK_CONSISTENCY;

	/*
	 * Make an rdata chain containing all the data portions of all block
	 * references. This includes the data for full-page images. Also append
	 * the headers for the block references in the scratch buffer.
	 */
	*fpw_lsn = InvalidXLogRecPtr;
	for (block_id = 0; block_id < max_registered_block_id; block_id++)
	{
		registered_buffer *regbuf = &registered_buffers[block_id];
		bool		needs_backup;
		bool		needs_data;
		XLogRecordBlockHeader bkpb;
		XLogRecordBlockImageHeader bimg;
		XLogRecordBlockCompressHeader cbimg = {0};
		bool		samerel;
		bool		is_compressed = false;
		bool		include_image;

		if (!regbuf->in_use)
			continue;

		/* Determine if this block needs to be backed up */
		if (regbuf->flags & REGBUF_FORCE_IMAGE)
			needs_backup = true;
		else if (regbuf->flags & REGBUF_NO_IMAGE)
			needs_backup = false;
		else if (!doPageWrites)
			needs_backup = false;
		else
		{
			/*
			 * We assume page LSN is first data on *every* page that can be
			 * passed to XLogInsert, whether it has the standard page layout
			 * or not.
			 */
			XLogRecPtr	page_lsn = PageGetLSN(regbuf->page);

			needs_backup = (page_lsn <= RedoRecPtr);
			if (!needs_backup)
			{
				if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
					*fpw_lsn = page_lsn;
			}
		}

		/* Determine if the buffer data needs to included */
		if (regbuf->rdata_len == 0)
			needs_data = false;
		else if ((regbuf->flags & REGBUF_KEEP_DATA) != 0)
			needs_data = true;
		else
			needs_data = !needs_backup;

		bkpb.id = block_id;
		bkpb.fork_flags = regbuf->forkno;
		bkpb.data_length = 0;

		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;

		/*
		 * If needs_backup is true or WAL checking is enabled for current
		 * resource manager, log a full-page write for the current block.
		 */
		include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;

		if (include_image)
		{
			Page		page = regbuf->page;
			uint16		compressed_len = 0;

			/*
			 * The page needs to be backed up, so calculate its hole length
			 * and offset.
			 */
			if (regbuf->flags & REGBUF_STANDARD)
			{
				/* Assume we can omit data between pd_lower and pd_upper */
				uint16		lower = ((PageHeader) page)->pd_lower;
				uint16		upper = ((PageHeader) page)->pd_upper;

				if (lower >= SizeOfPageHeaderData &&
					upper > lower &&
					upper <= BLCKSZ)
				{
					bimg.hole_offset = lower;
					cbimg.hole_length = upper - lower;
				}
				else
				{
					/* No "hole" to remove */
					bimg.hole_offset = 0;
					cbimg.hole_length = 0;
				}
			}
			else
			{
				/* Not a standard page header, don't try to eliminate "hole" */
				bimg.hole_offset = 0;
				cbimg.hole_length = 0;
			}

			/*
			 * Try to compress a block image if wal_compression is enabled
			 */
			if (wal_compression)
			{
				is_compressed =
					XLogCompressBackupBlock(page, bimg.hole_offset,
											cbimg.hole_length,
											regbuf->compressed_page,
											&compressed_len);
			}

			/*
			 * Fill in the remaining fields in the XLogRecordBlockHeader
			 * struct
			 */
			bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;

			/* Report a full page image constructed for the WAL record */
			*num_fpi += 1;

			/*
			 * Construct XLogRecData entries for the page content.
			 */
			rdt_datas_last->next = &regbuf->bkp_rdatas[0];
			rdt_datas_last = rdt_datas_last->next;

			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;

			/*
			 * If WAL consistency checking is enabled for the resource manager
			 * of this WAL record, a full-page image is included in the record
			 * for the block modified. During redo, the full-page is replayed
			 * only if BKPIMAGE_APPLY is set.
			 */
			if (needs_backup)
				bimg.bimg_info |= BKPIMAGE_APPLY;

			if (is_compressed)
			{
				bimg.length = compressed_len;
				bimg.bimg_info |= BKPIMAGE_IS_COMPRESSED;

				rdt_datas_last->data = regbuf->compressed_page;
				rdt_datas_last->len = compressed_len;
			}
			else
			{
				bimg.length = BLCKSZ - cbimg.hole_length;

				if (cbimg.hole_length == 0)
				{
					rdt_datas_last->data = page;
					rdt_datas_last->len = BLCKSZ;
				}
				else
				{
					/* must skip the hole */
					rdt_datas_last->data = page;
					rdt_datas_last->len = bimg.hole_offset;

					rdt_datas_last->next = &regbuf->bkp_rdatas[1];
					rdt_datas_last = rdt_datas_last->next;

					rdt_datas_last->data =
						page + (bimg.hole_offset + cbimg.hole_length);
					rdt_datas_last->len =
						BLCKSZ - (bimg.hole_offset + cbimg.hole_length);
				}
			}

			total_len += bimg.length;
		}

		if (needs_data)
		{
			/*
			 * Link the caller-supplied rdata chain for this buffer to the
			 * overall list.
			 */
			bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
			bkpb.data_length = regbuf->rdata_len;
			total_len += regbuf->rdata_len;

			rdt_datas_last->next = regbuf->rdata_head;
			rdt_datas_last = regbuf->rdata_tail;
		}

		if (prev_regbuf && RelFileNodeEquals(regbuf->rnode, prev_regbuf->rnode))
		{
			samerel = true;
			bkpb.fork_flags |= BKPBLOCK_SAME_REL;
		}
		else
			samerel = false;
		prev_regbuf = regbuf;

		/* Ok, copy the header to the scratch buffer */
		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
		scratch += SizeOfXLogRecordBlockHeader;
		if (include_image)
		{
			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
			scratch += SizeOfXLogRecordBlockImageHeader;
			if (cbimg.hole_length != 0 && is_compressed)
			{
				memcpy(scratch, &cbimg,
					   SizeOfXLogRecordBlockCompressHeader);
				scratch += SizeOfXLogRecordBlockCompressHeader;
			}
		}
		if (!samerel)
		{
			memcpy(scratch, &regbuf->rnode, sizeof(RelFileNode));
			scratch += sizeof(RelFileNode);
		}
		memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
		scratch += sizeof(BlockNumber);
	}

	/* followed by the record's origin, if any */
	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) &&
		replorigin_session_origin != InvalidRepOriginId)
	{
		*(scratch++) = (char) XLR_BLOCK_ID_ORIGIN;
		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
		scratch += sizeof(replorigin_session_origin);
	}

	/* followed by main data, if any */
	if (mainrdata_len > 0)
	{
		if (mainrdata_len > 255)
		{
			*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
			memcpy(scratch, &mainrdata_len, sizeof(uint32));
			scratch += sizeof(uint32);
		}
		else
		{
			*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
			*(scratch++) = (uint8) mainrdata_len;
		}
		rdt_datas_last->next = mainrdata_head;
		rdt_datas_last = mainrdata_last;
		total_len += mainrdata_len;
	}
	rdt_datas_last->next = NULL;

	hdr_rdt.len = (scratch - hdr_scratch);
	total_len += hdr_rdt.len;

	/*
	 * Calculate CRC of the data
	 *
	 * Note that the record header isn't added into the CRC initially since we
	 * don't know the prev-link yet.  Thus, the CRC will represent the CRC of
	 * the whole record in the order: rdata, then backup blocks, then record
	 * header.
	 */
	INIT_CRC32C(rdata_crc);
	COMP_CRC32C(rdata_crc, hdr_scratch + SizeOfXLogRecord, hdr_rdt.len - SizeOfXLogRecord);
	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);

	/*
	 * Fill in the fields in the record header. Prev-link is filled in later,
	 * once we know where in the WAL the record will be inserted. The CRC does
	 * not include the record header yet.
	 */
	rechdr->xl_xid = GetCurrentTransactionIdIfAny();
	rechdr->xl_tot_len = total_len;
	rechdr->xl_info = info;
	rechdr->xl_rmid = rmid;
	rechdr->xl_prev = InvalidXLogRecPtr;
	rechdr->xl_crc = rdata_crc;

	return &hdr_rdt;
}

hdr_scratch points to the area that will hold the XLOG header, which includes XLogRecord, XLogRecordBlockHeader, RelFileNode, and BlockNumber. hdr_rdt is the head of the chain, and hdr_rdt.data equals hdr_scratch.
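The last thing written into the scratch area is the main data header, which comes in a short and a long form depending on mainrdata_len. The sketch below isolates that encoding. demo_put_main_data_header() is a hypothetical helper, and the two ID constants are written out as 255 and 254 on the assumption that they match XLR_BLOCK_ID_DATA_SHORT and XLR_BLOCK_ID_DATA_LONG in xlogrecord.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ID_DATA_SHORT	255		/* assumed value of XLR_BLOCK_ID_DATA_SHORT */
#define DEMO_ID_DATA_LONG	254		/* assumed value of XLR_BLOCK_ID_DATA_LONG */

/*
 * Write the main data header at scratch and return the number of bytes
 * written: 0 when there is no main data, 2 for the short form (id byte
 * plus 1-byte length), 5 for the long form (id byte plus 4-byte length).
 */
size_t
demo_put_main_data_header(char *scratch, uint32_t mainrdata_len)
{
	char	   *start = scratch;

	if (mainrdata_len == 0)
		return 0;

	if (mainrdata_len > 255)
	{
		*(scratch++) = (char) DEMO_ID_DATA_LONG;
		memcpy(scratch, &mainrdata_len, sizeof(uint32_t));
		scratch += sizeof(uint32_t);
	}
	else
	{
		*(scratch++) = (char) DEMO_ID_DATA_SHORT;
		*(scratch++) = (char) (uint8_t) mainrdata_len;
	}

	return (size_t) (scratch - start);
}
```

The leading id byte is what lets a reader of the record tell block headers and the main data header apart, as the comment at the top of xlogrecord.h describes.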

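This function also handles the important matter of backup blocks. Once it decides that a page needs to be backed up, it does so, but the page does not have to go into the log in full; only part of it is needed. The figure (1567945832709627) illustrates this.

registered_buffer provides two scratch XLogRecData entries (bkp_rdatas), one for each part of the backup block: the first part is the page header plus the line pointers, and the second part is the tuples.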
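The split into two parts follows from skipping the free space ("hole") between pd_lower and pd_upper. Here is a minimal sketch of that calculation; DemoSegment and demo_split_page() are hypothetical, and the block size of 8192 and page header size of 24 are assumed default values.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ				8192u	/* assumed default block size */
#define DEMO_PAGE_HEADER_SIZE	24u		/* assumed page header size */

/* A contiguous piece of the page image that has to be logged. */
typedef struct DemoSegment
{
	uint32_t	offset;		/* start within the page */
	uint32_t	len;		/* length in bytes */
} DemoSegment;

/*
 * Split a page image into the pieces that must be logged, skipping the
 * free space between lower and upper. Returns the number of pieces
 * written to out (1 when there is no usable hole, otherwise 2).
 */
int
demo_split_page(uint16_t lower, uint16_t upper, DemoSegment out[2])
{
	uint32_t	hole_offset = 0;
	uint32_t	hole_length = 0;

	if (lower >= DEMO_PAGE_HEADER_SIZE && upper > lower && upper <= DEMO_BLCKSZ)
	{
		hole_offset = lower;
		hole_length = (uint32_t) upper - lower;
	}

	if (hole_length == 0)
	{
		/* No hole to remove: log the whole page as one piece. */
		out[0].offset = 0;
		out[0].len = DEMO_BLCKSZ;
		return 1;
	}

	/* Piece 1: page header and line pointers, up to the hole. */
	out[0].offset = 0;
	out[0].len = hole_offset;

	/* Piece 2: tuple data, from the end of the hole to the page end. */
	out[1].offset = hole_offset + hole_length;
	out[1].len = DEMO_BLCKSZ - (hole_offset + hole_length);
	return 2;
}
```

For a nearly empty page the two pieces together are far smaller than 8 KB, which is why removing the hole is worthwhile even before compression is considered.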

ReserveXLogInsertLocation() @xlog.c

Reserves enough XLOG space for a record of the given size.

/*
 * Reserves the right amount of space for a record of given size from the WAL.
 * *StartPos is set to the beginning of the reserved section, *EndPos to
 * its end+1. *PrevPtr is set to the beginning of the previous record; it is
 * used to set the xl_prev of this record.
 *
 * This is the performance critical part of XLogInsert that must be serialized
 * across backends. The rest can happen mostly in parallel. Try to keep this
 * section as short as possible, insertpos_lck can be heavily contended on a
 * busy system.
 *
 * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
 * where we actually copy the record to the reserved space.
 */
static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
						  XLogRecPtr *PrevPtr)
{
	XLogCtlInsert *Insert = &XLogCtl->Insert;
	uint64		startbytepos;
	uint64		endbytepos;
	uint64		prevbytepos;

	size = MAXALIGN(size);

	/* All (non xlog-switch) records should contain data. */
	Assert(size > SizeOfXLogRecord);

	/*
	 * The duration the spinlock needs to be held is minimized by minimizing
	 * the calculations that have to be done while holding the lock. The
	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
	 * page headers. The mapping between "usable" byte positions and physical
	 * positions (XLogRecPtrs) can be done outside the locked region, and
	 * because the usable byte position doesn't include any headers, reserving
	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
	 */
	SpinLockAcquire(&Insert->insertpos_lck);

	startbytepos = Insert->CurrBytePos;
	endbytepos = startbytepos + size;
	prevbytepos = Insert->PrevBytePos;
	Insert->CurrBytePos = endbytepos;
	Insert->PrevBytePos = startbytepos;

	SpinLockRelease(&Insert->insertpos_lck);

	*StartPos = XLogBytePosToRecPtr(startbytepos);
	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);

	/*
	 * Check that the conversions between "usable byte positions" and
	 * XLogRecPtrs work consistently in both directions.
	 */
	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
}

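XLOG is written sequentially in PostgreSQL, and a global XLogCtlInsert struct tracks the insert position. Its CurrBytePos member is the current insert position; CurrBytePos + size gives the end of the record. The current insert position is then stored in XLogCtlInsert as PrevBytePos for the next record. Because XLogCtlInsert is shared by all backends, reading and updating its members requires a lock, and a spinlock is used here.

startbytepos, endbytepos, and prevbytepos are all logical positions.

CurrBytePos advances by the size of each XLOG record. This gives the programmer a convenient abstraction, as if the XLOG buffer held nothing but records laid end to end, which is not actually the case.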
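The reservation step can be modeled in a few lines. This is a sketch under stated assumptions: DemoInsertCtl and the demo_ functions are hypothetical, a C11 atomic_flag stands in for insertpos_lck, and MAXALIGN is assumed to round up to 8 bytes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Round len up to a multiple of 8 (assumed MAXALIGN). */
#define DEMO_MAXALIGN(len) (((uint64_t) (len) + 7) & ~(uint64_t) 7)

typedef struct DemoInsertCtl
{
	atomic_flag	lock;			/* stands in for insertpos_lck */
	uint64_t	curr_byte_pos;	/* end of the last reserved record */
	uint64_t	prev_byte_pos;	/* start of the last reserved record */
} DemoInsertCtl;

void
demo_insert_init(DemoInsertCtl *ctl)
{
	atomic_flag_clear(&ctl->lock);
	ctl->curr_byte_pos = 0;
	ctl->prev_byte_pos = 0;
}

/* Reserve space for one record; all positions are logical byte positions. */
void
demo_reserve(DemoInsertCtl *ctl, uint32_t size,
			 uint64_t *start, uint64_t *end, uint64_t *prev)
{
	uint64_t	aligned = DEMO_MAXALIGN(size);

	while (atomic_flag_test_and_set(&ctl->lock))
		;						/* spin until the lock is free */

	/* The critical section is only a handful of loads and stores. */
	*start = ctl->curr_byte_pos;
	*end = *start + aligned;
	*prev = ctl->prev_byte_pos;
	ctl->curr_byte_pos = *end;
	ctl->prev_byte_pos = *start;

	atomic_flag_clear(&ctl->lock);
}
```

Each caller gets a disjoint range and the start of the record before it, which is all that is needed to fill in xl_prev later. Translating these logical positions into physical ones happens outside the lock.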

Fig. 9.7. Internal layout of a WAL segment file.

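A WAL segment file is 16 MB by default and is divided internally into 8 KB pages. The first page begins with a header defined by XLogLongPageHeaderData; every other page begins with a header defined by XLogPageHeaderData.

The real physical position is computed as follows: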

/*
 * Converts a "usable byte position" to XLogRecPtr. A usable byte position
 * is the position starting from the beginning of WAL, excluding all WAL
 * page headers.
 */
static XLogRecPtr
XLogBytePosToRecPtr(uint64 bytepos)
{
	uint64		fullsegs;
	uint64		fullpages;
	uint64		bytesleft;
	uint32		seg_offset;
	XLogRecPtr	result;

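    /*
     * fullsegs: WAL consists of many segments, so first determine which
     * segment bytepos falls in.
     * bytesleft: the offset within that segment; at this point it is still
     * a logical offset.
     *
     * UsableBytesInSegment is the payload capacity of one segment, i.e. the
     * space left after subtracting the page headers.
     */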
	fullsegs = bytepos / UsableBytesInSegment;		
	bytesleft = bytepos % UsableBytesInSegment;

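    /*
     * Convert bytesleft into a physical offset, seg_offset: find which page
     * of the segment bytesleft falls in and the offset within that page,
     * then add the page headers. The first page of a segment is laid out
     * differently from the others (it starts with XLogLongPageHeaderData,
     * the others with XLogPageHeaderData), so it is handled separately.
     */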
	if (bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD)
	{
		/* 
		 * fits on first page of segment 
		 * First page of the segment: just add its header size,
		 * SizeOfXLogLongPHD (the size of XLogLongPageHeaderData).
		 */
		seg_offset = bytesleft + SizeOfXLogLongPHD;
	}
	else
	{
		/* account for the first page on segment with long header */
		seg_offset = XLOG_BLCKSZ;	/* not on the first page, so start just past it */
		bytesleft -= XLOG_BLCKSZ - SizeOfXLogLongPHD;	/* subtract the first page's payload */
		
        /*
         * Work out which page the remaining bytesleft falls in and the
         * offset within that page. UsableBytesInPage is the payload
         * capacity of one page.
         */
		fullpages = bytesleft / UsableBytesInPage;
		bytesleft = bytesleft % UsableBytesInPage;

        /* add the page headers to get the final offset within the segment */
		seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD;
	}

    /*
     *  #define XLogSegNoOffsetToRecPtr(segno, offset, dest) \
	 *			(dest) = (segno) * XLOG_SEG_SIZE + (offset)
	 *
	 * 	Combine the segment number and the offset within the segment into the final physical position.
	 */	
    
	XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, result);

	return result;
}
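A worked example helps to check the arithmetic. The sketch below repeats the same calculation with concrete numbers: 8 KB WAL pages, 16 MB segments, a 40-byte long page header, and a 24-byte short page header. Those header sizes are assumptions that are typical for 64-bit builds, and every demo_ and DEMO_ name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_XLOG_BLCKSZ	8192u
#define DEMO_SEG_SIZE		(16u * 1024u * 1024u)
#define DEMO_LONG_PHD		40u		/* assumed SizeOfXLogLongPHD */
#define DEMO_SHORT_PHD		24u		/* assumed SizeOfXLogShortPHD */

/* Payload bytes per ordinary page and per segment. */
#define DEMO_USABLE_IN_PAGE	(DEMO_XLOG_BLCKSZ - DEMO_SHORT_PHD)
#define DEMO_USABLE_IN_SEG \
	((DEMO_SEG_SIZE / DEMO_XLOG_BLCKSZ) * DEMO_USABLE_IN_PAGE - \
	 (DEMO_LONG_PHD - DEMO_SHORT_PHD))

/* Convert a usable byte position into a physical WAL position. */
uint64_t
demo_bytepos_to_recptr(uint64_t bytepos)
{
	uint64_t	fullsegs = bytepos / DEMO_USABLE_IN_SEG;
	uint64_t	bytesleft = bytepos % DEMO_USABLE_IN_SEG;
	uint64_t	seg_offset;

	if (bytesleft < DEMO_XLOG_BLCKSZ - DEMO_LONG_PHD)
	{
		/* First page of the segment: skip only the long header. */
		seg_offset = bytesleft + DEMO_LONG_PHD;
	}
	else
	{
		uint64_t	fullpages;

		/* Skip the first page, then count whole pages and the remainder. */
		seg_offset = DEMO_XLOG_BLCKSZ;
		bytesleft -= DEMO_XLOG_BLCKSZ - DEMO_LONG_PHD;
		fullpages = bytesleft / DEMO_USABLE_IN_PAGE;
		bytesleft = bytesleft % DEMO_USABLE_IN_PAGE;
		seg_offset += fullpages * DEMO_XLOG_BLCKSZ + bytesleft + DEMO_SHORT_PHD;
	}

	return fullsegs * DEMO_SEG_SIZE + seg_offset;
}
```

With these numbers the first page holds 8152 payload bytes. Usable position 8151 is the last byte of the first page (physical 8191), and usable position 8152 jumps over the second page's 24-byte header to physical 8216.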

CopyXLogRecordToWAL() @xlog.c

First GetXLogBuffer() is called to get currpos, the position in the XLOG buffer to write to, and then the record is written:

while (rdata != NULL)
	{
		char	   *rdata_data = rdata->data;
		int			rdata_len = rdata->len;

		while (rdata_len > freespace)
		{
			/*
			 * Write what fits on this page, and continue on the next page.
			 */
			Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || freespace == 0);
			memcpy(currpos, rdata_data, freespace);
			rdata_data += freespace;
			rdata_len -= freespace;
			written += freespace;
			CurrPos += freespace;

			/*
			 * Get pointer to beginning of next page, and set the xlp_rem_len
			 * in the page header. Set XLP_FIRST_IS_CONTRECORD.
			 *
			 * It's safe to set the contrecord flag and xlp_rem_len without a
			 * lock on the page. All the other flags were already set when the
			 * page was initialized, in AdvanceXLInsertBuffer, and we're the
			 * only backend that needs to set the contrecord flag.
			 */
			currpos = GetXLogBuffer(CurrPos);
			pagehdr = (XLogPageHeader) currpos;
			pagehdr->xlp_rem_len = write_len - written;
			pagehdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;

			/* skip over the page header */
			if (XLogSegmentOffset(CurrPos, wal_segment_size) == 0)
			{
				CurrPos += SizeOfXLogLongPHD;
				currpos += SizeOfXLogLongPHD;
			}
			else
			{
				CurrPos += SizeOfXLogShortPHD;
				currpos += SizeOfXLogShortPHD;
			}
			freespace = INSERT_FREESPACE(CurrPos);
		}

		Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
		memcpy(currpos, rdata_data, rdata_len);
		currpos += rdata_len;
		CurrPos += rdata_len;
		freespace -= rdata_len;
		written += rdata_len;

		rdata = rdata->next;
	}

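The inner loop handles the case where the record is longer than the free space left on the current page. The part of the record that fits is stored in the remaining space of the current page, and GetXLogBuffer is then called to find a page for the rest, which is simply the page that follows the current one. If the current page is the last page of the log buffer, GetXLogBuffer wraps around to the first page of the log buffer.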
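The page-crossing logic is easier to follow on a toy buffer. The sketch below uses four 32-byte pages with a fixed 8-byte header each. All names and sizes are hypothetical, and it leaves out what the real loop also does at each page boundary (setting xlp_rem_len and XLP_FIRST_IS_CONTRECORD, and telling long headers from short ones).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE	32u		/* tiny pages keep the example readable */
#define DEMO_PAGE_HDR	8u		/* assumed fixed page header size */
#define DEMO_NPAGES		4u

char demo_wal_buffer[DEMO_NPAGES * DEMO_PAGE_SIZE];

/* Map a physical position to its address in the circular buffer. */
char *
demo_get_buffer(uint64_t pos)
{
	return demo_wal_buffer + (pos % (DEMO_NPAGES * DEMO_PAGE_SIZE));
}

/* Bytes left on the page containing pos (0 at an exact page boundary). */
uint32_t
demo_freespace(uint64_t pos)
{
	uint32_t	used = (uint32_t) (pos % DEMO_PAGE_SIZE);

	return used == 0 ? 0 : DEMO_PAGE_SIZE - used;
}

/* Copy len bytes starting at pos; returns the position after the record. */
uint64_t
demo_copy_record(uint64_t pos, const char *data, uint32_t len)
{
	uint32_t	freespace = demo_freespace(pos);

	while (len > freespace)
	{
		/* Fill the rest of this page, then continue on the next one. */
		memcpy(demo_get_buffer(pos), data, freespace);
		data += freespace;
		len -= freespace;
		pos += freespace;

		/* pos is now at a page boundary: skip that page's header. */
		pos += DEMO_PAGE_HDR;
		freespace = demo_freespace(pos);
	}

	memcpy(demo_get_buffer(pos), data, len);
	return pos + len;
}
```

Starting a 30-byte record at position 8 puts 24 bytes on page 0 and the remaining 6 bytes at position 40, just past the header of page 1. Starting the same record on the last page shows the wraparound: the tail lands back on page 0.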

XLOG buffer

Layout of the XLOG buffer

Fig. 9.7. Internal layout of a WAL segment file.

void
XLOGShmemInit(void)
{
	bool		foundCFile,
				foundXLog;
	char	   *allocptr;
	int			i;

#ifdef WAL_DEBUG

	/*
	 * Create a memory context for WAL debugging that's exempt from the normal
	 * "no pallocs in critical section" rule. Yes, that can lead to a PANIC if
	 * an allocation fails, but wal_debug is not for production use anyway.
	 */
	if (walDebugCxt == NULL)
	{
		walDebugCxt = AllocSetContextCreate(TopMemoryContext,
											"WAL Debug",
											ALLOCSET_DEFAULT_SIZES);
		MemoryContextAllowInCriticalSection(walDebugCxt, true);
	}
#endif

    /*
     * step 1: allocate the shared memory
     */
	ControlFile = (ControlFileData *)
		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
	XLogCtl = (XLogCtlData *)
		ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &foundXLog);

	if (foundCFile || foundXLog)
	{
		/* both should be present or neither */
		Assert(foundCFile && foundXLog);

		/* Initialize local copy of WALInsertLocks and register the tranche */
		WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
		LWLockRegisterTranche(LWTRANCHE_WAL_INSERT,
							  &XLogCtl->Insert.WALInsertLockTranche);
		return;
	}
    
     /*
     * step 2: initialize the shared memory
     */
    /* part 1: XLogCtl */
	memset(XLogCtl, 0, sizeof(XLogCtlData));

	/*
	 * Since XLogCtlData contains XLogRecPtr fields, its sizeof should be a
	 * multiple of the alignment for same, so no extra alignment padding is
	 * needed here.
	 */
    
    /* part 2: XLOGbuffers entries of XLogRecPtr (XLogRecPtr is a 64-bit unsigned integer) */
	allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
	XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;


	/* WAL insertion locks. Ensure they're aligned to the full padded size */
    /* part 3: NUM_XLOGINSERT_LOCKS entries of WALInsertLockPadded */
    /* note: WALInsertLockAcquire, called later from XLogInsertRecord, uses these WALInsertLocks */
	allocptr += sizeof(WALInsertLockPadded) -
		((uintptr_t) allocptr) %sizeof(WALInsertLockPadded);
	WALInsertLocks = XLogCtl->Insert.WALInsertLocks =
		(WALInsertLockPadded *) allocptr;
	allocptr += sizeof(WALInsertLockPadded) * NUM_XLOGINSERT_LOCKS;

	XLogCtl->Insert.WALInsertLockTranche.name = "wal_insert";
	XLogCtl->Insert.WALInsertLockTranche.array_base = WALInsertLocks;
	XLogCtl->Insert.WALInsertLockTranche.array_stride = sizeof(WALInsertLockPadded);

	LWLockRegisterTranche(LWTRANCHE_WAL_INSERT, &XLogCtl->Insert.WALInsertLockTranche);
	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
	{
		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
	}

	/*
	 * Align the start of the page buffers to a full xlog block size boundary.
	 * This simplifies some calculations in XLOG insertion. It is also
	 * required for O_DIRECT.
	 * part 4: alignment padding
	 */
	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
	XLogCtl->pages = allocptr;
    // part 5: log buffer (XLOGbuffers pages of XLOG_BLCKSZ bytes each)
	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);

	/*
	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
	 * in additional info.)
	 */
	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
	XLogCtl->SharedRecoveryInProgress = true;
	XLogCtl->SharedHotStandbyActive = false;
	XLogCtl->WalWriterSleeping = false;

	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
	SpinLockInit(&XLogCtl->info_lck);
	SpinLockInit(&XLogCtl->ulsn_lck);
	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);

	/*
	 * If we are not in bootstrap mode, pg_control should already exist. Read
	 * and validate it immediately (see comments in ReadControlFile() for the
	 * reasons why).
	 */
	if (!IsBootstrapProcessingMode())
		ReadControlFile();
}

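The code above shows that the shared memory is divided into five parts:

Part 1: XLogCtl
Part 2: an array of LSNs (xlblocks) with one entry per log buffer page (XLOGbuffers entries)
Part 3: an array of WALInsertLockPadded with NUM_XLOGINSERT_LOCKS entries
Part 4: alignment padding
Part 5: the log buffer, an array of XLOGbuffers pages

XLogCtl is an XLogCtlData struct. It is the central structure controlling XLOG writes: its pages member points to the start of the log buffer, and XLogCacheBlck holds the highest log buffer page index, i.e. the page count minus 1.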
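
The five-part layout can be pictured as a sequence of offsets from the start of the shared segment. The following is only a sketch under stated assumptions: `WalShmemLayout`, `align_up` and `wal_shmem_layout` are hypothetical names, all sizes are passed in as made-up parameters, and the alignment is computed on offsets with a plain round-up, whereas XLOGShmemInit aligns the real address and always advances past the current position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets of each part, relative to the start of the shared segment. */
typedef struct WalShmemLayout
{
    size_t ctl_off;       /* part 1: control struct */
    size_t xlblocks_off;  /* part 2: one end-LSN per buffer page */
    size_t locks_off;     /* part 3: padded insertion locks */
    size_t pages_off;     /* part 5: page buffers (part 4 is the gap before) */
    size_t total;         /* total bytes needed */
} WalShmemLayout;

/* Round 'off' up to the next multiple of 'align'. */
size_t
align_up(size_t off, size_t align)
{
    assert(align > 0);
    return (off + align - 1) / align * align;
}

/* All sizes are illustrative inputs, not PostgreSQL's real values. */
WalShmemLayout
wal_shmem_layout(size_t ctl_size, size_t nbuffers, size_t nlocks,
                 size_t lock_size, size_t blcksz)
{
    WalShmemLayout l;

    l.ctl_off = 0;
    l.xlblocks_off = ctl_size;
    l.locks_off = align_up(l.xlblocks_off + sizeof(uint64_t) * nbuffers,
                           lock_size);
    l.pages_off = align_up(l.locks_off + lock_size * nlocks, blcksz);
    l.total = l.pages_off + blcksz * nbuffers;
    return l;
}
```

With the made-up inputs `wal_shmem_layout(1000, 4, 8, 128, 8192)`, the lock array starts at offset 1152, the page buffers at 8192, and the total is 40960 bytes.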

GetXLogBuffer() @xlog.c

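After XLogRecordAssemble has assembled an XLOG record, the following steps take place:

Call ReserveXLogInsertLocation to obtain the physical write position of the record. This position is also the record's LSN, a monotonically increasing integer.
Call GetXLogBuffer with that LSN as input to find out which log buffer page the LSN belongs to, and get back a pointer to the write position, currpos.
Copy the XLOG record into the log buffer at currpos.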
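
The first step, reserving a position, behaves like a bump allocator over the LSN space. The sketch below is a toy model, not the real ReserveXLogInsertLocation: `ToyInsertPos`, `toy_reserve` and `TOY_MAXALIGN` are hypothetical names, the alignment of 8 is an assumption, and the locking around the position (insertpos_lck is initialized in the code above) and the space taken by page headers are both left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAXALIGN 8  /* assumed alignment for record starts */

typedef struct ToyInsertPos
{
    uint64_t curr;  /* next free position in the toy LSN space */
} ToyInsertPos;

/*
 * Reserve room for a record of 'size' bytes. Returns the start position
 * and stores the end position in *end. Records are padded so that the
 * next one starts on a TOY_MAXALIGN boundary.
 */
uint64_t
toy_reserve(ToyInsertPos *pos, uint32_t size, uint64_t *end)
{
    uint64_t aligned;
    uint64_t start;

    assert(pos != NULL && end != NULL && size > 0);

    aligned = ((uint64_t) size + TOY_MAXALIGN - 1) &
              ~(uint64_t) (TOY_MAXALIGN - 1);
    start = pos->curr;
    pos->curr = start + aligned;
    *end = pos->curr;
    return start;
}
```

Starting from position 0, reserving 30 bytes returns the range [0, 32), and a following 10-byte reservation returns [32, 48), so positions only ever grow.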

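The log buffer is a circular queue built on one contiguous memory region. XLOG is written into it from front to back; once the end is reached, writing wraps around to the head and starts over.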

/*
 * Get a pointer to the right location in the WAL buffer containing the
 * given XLogRecPtr.
 *
 * If the page is not initialized yet, it is initialized. That might require
 * evicting an old dirty buffer from the buffer cache, which means I/O.
 *
 * The caller must ensure that the page containing the requested location
 * isn't evicted yet, and won't be evicted. The way to ensure that is to
 * hold onto a WAL insertion lock with the insertingAt position set to
 * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
 * to evict an old page from the buffer. (This means that once you call
 * GetXLogBuffer() with a given 'ptr', you must not access anything before
 * that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
 * later, because older buffers might be recycled already)
 */
static char *
GetXLogBuffer(XLogRecPtr ptr)
{
	int			idx;
	XLogRecPtr	endptr;
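	static uint64 cachedPage = 0; // cachedPage: logical page number (ptr / XLOG_BLCKSZ) of the page returned last time
	static char *cachedPos = NULL; // cachedPos: address of that same page inside the physical log buffer
                                   // cachedPage relates to cachedPos the way a position in an unbounded queue relates to its slot in a ring buffer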
	XLogRecPtr	expectedEndPtr;

	/*
	 * Fast path for the common case that we need to access again the same
	 * page as last time.
	 */
	if (ptr / XLOG_BLCKSZ == cachedPage)
	{
		Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
		Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
		return cachedPos + ptr % XLOG_BLCKSZ;
	}

	/*
	 * The XLog buffer cache is organized so that a page is always loaded to a
	 * particular buffer.  That way we can easily calculate the buffer a given
	 * page must be loaded into, from the XLogRecPtr alone.
	 */
	idx = XLogRecPtrToBufIdx(ptr);

	/*
	 * See what page is loaded in the buffer at the moment. It could be the
	 * page we're looking for, or something older. It can't be anything newer
	 * - that would imply the page we're looking for has already been written
	 * out to disk and evicted, and the caller is responsible for making sure
	 * that doesn't happen.
	 *
	 * However, we don't hold a lock while we read the value. If someone has
	 * just initialized the page, it's possible that we get a "torn read" of
	 * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
	 * that case we will see a bogus value. That's ok, we'll grab the mapping
	 * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
	 * the page we're looking for. But it means that when we do this unlocked
	 * read, we might see a value that appears to be ahead of the page we're
	 * looking for. Don't PANIC on that, until we've verified the value while
	 * holding the lock.
	 */
	expectedEndPtr = ptr;
	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;

	endptr = XLogCtl->xlblocks[idx]; 
	if (expectedEndPtr != endptr)  // the slot does not hold the page we want yet: it must be initialized, which may require evicting an old XLOG buffer page
	{
		XLogRecPtr	initializedUpto;

		/*
		 * Before calling AdvanceXLInsertBuffer(), which can block, let others
		 * know how far we're finished with inserting the record.
		 *
		 * NB: If 'ptr' points to just after the page header, advertise a
		 * position at the beginning of the page rather than 'ptr' itself. If
		 * there are no other insertions running, someone might try to flush
		 * up to our advertised location. If we advertised a position after
		 * the page header, someone might try to flush the page header, even
		 * though page might actually not be initialized yet. As the first
		 * inserter on the page, we are effectively responsible for making
		 * sure that it's initialized, before we let insertingAt to move past
		 * the page header.
		 */
		if (ptr % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
			XLogSegmentOffset(ptr, wal_segment_size) > XLOG_BLCKSZ)
			initializedUpto = ptr - SizeOfXLogShortPHD;
		else if (ptr % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
				 XLogSegmentOffset(ptr, wal_segment_size) < XLOG_BLCKSZ)
			initializedUpto = ptr - SizeOfXLogLongPHD;
		else
			initializedUpto = ptr;

		WALInsertLockUpdateInsertingAt(initializedUpto);

		AdvanceXLInsertBuffer(ptr, false);
		endptr = XLogCtl->xlblocks[idx];

		if (expectedEndPtr != endptr)
			elog(PANIC, "could not find WAL buffer for %X/%X",
				 (uint32) (ptr >> 32), (uint32) ptr);
	}
	else
	{
		/*
		 * Make sure the initialization of the page is visible to us, and
		 * won't arrive later to overwrite the WAL data we write on the page.
		 */
		pg_memory_barrier();
	}

	/*
	 * Found the buffer holding this page. Return a pointer to the right
	 * offset within the page.
	 */
	cachedPage = ptr / XLOG_BLCKSZ;
	cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;

	Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
	Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));

	return cachedPos + ptr % XLOG_BLCKSZ;
}

#define XLogRecPtrToBufIdx(recptr)	\
	(((recptr) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
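
To see the wrap-around in the macro above, here is a small standalone version. It is a sketch under assumptions: the buffer count is passed in as a parameter instead of being read from XLogCtl, `XLOG_BLCKSZ` is fixed at 8192, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192  /* assumed WAL block size */

/* Which buffer slot does the page containing 'recptr' map to? */
int
rec_ptr_to_buf_idx(uint64_t recptr, int nbuffers)
{
    assert(nbuffers > 0);
    return (int) ((recptr / XLOG_BLCKSZ) % (uint64_t) nbuffers);
}

/* Address inside the buffer area where 'recptr' lives. */
char *
rec_ptr_to_buf_pos(char *pages, uint64_t recptr, int nbuffers)
{
    int idx = rec_ptr_to_buf_idx(recptr, nbuffers);

    return pages + (size_t) idx * XLOG_BLCKSZ + recptr % XLOG_BLCKSZ;
}
```

With 4 buffers, logical pages 0, 1, 2, 3 map to slots 0, 1, 2, 3, and logical page 4 maps back to slot 0, which is exactly the point where an older page has to make room.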

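Because the buffer is a circular queue, once writing wraps around to the head, the data in the head page is overwritten by new XLOG. Before that happens, the old contents of the page must already be on disk. So GetXLogBuffer() has another important job: before a page slot is reused, it detects whether the slot still holds an older page, and if that page has not been written out yet, it is written out first (inside AdvanceXLInsertBuffer). The detection works by comparing expectedEndPtr with the actual endptr stored in xlblocks: XLogRecPtrToBufIdx() yields an idx, and if the slot at idx still holds an older page (possibly a dirty one), its endptr differs from expectedEndPtr.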
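
The comparison can be reproduced in isolation. This is a minimal sketch, assuming a block size of 8192 and a caller-supplied `xlblocks` array; `expected_end_ptr` and `slot_holds_page` are hypothetical helper names, and the locking and retry logic of the real function are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192  /* assumed WAL block size */

/* End LSN of the page that contains 'ptr'. */
uint64_t
expected_end_ptr(uint64_t ptr)
{
    return ptr + (XLOG_BLCKSZ - ptr % XLOG_BLCKSZ);
}

/*
 * True if the slot that 'ptr' maps to currently holds the page
 * containing 'ptr'; false if it still holds some other page.
 */
bool
slot_holds_page(const uint64_t *xlblocks, int nbuffers, uint64_t ptr)
{
    int idx;

    assert(xlblocks != NULL && nbuffers > 0);
    idx = (int) ((ptr / XLOG_BLCKSZ) % (uint64_t) nbuffers);
    return xlblocks[idx] == expected_end_ptr(ptr);
}
```

If the four slots hold logical pages 0 to 3 (end LSNs 8192, 16384, 24576, 32768), an LSN of 100 is found in slot 0, while an LSN on logical page 4 also maps to slot 0 but expects an end LSN of 40960, so the slot has to be re-initialized first.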

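Flushing XLOG to disk

When does XLOG need to be flushed to disk?

Before a transaction commits. By the definition of WAL, a transaction may commit only after its XLOG is on disk, so all XLOG belonging to the transaction must be flushed before the commit.
Before a log buffer page is overwritten. This is the case described above.
Periodically, by a background process. Since a transaction cannot commit until its log is on disk, flushing the log adds latency to commit. To reduce that latency, databases usually have a dedicated background thread or process that flushes the log at regular intervals.

To test the second case, the background process that flushes the log periodically has to be suspended.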

The main call stack of the periodic flush is:

1. WalWriterMain()
   1. XLogBackgroundFlush()
      1. XLogWrite()

Data structures

/*----------
 * Shared-memory data structures for XLOG control
 *
 * LogwrtRqst indicates a byte position that we need to write and/or fsync
 * the log up to (all records before that point must be written or fsynced).
 * LogwrtResult indicates the byte positions we have already written/fsynced.
 * These structs are identical but are declared separately to indicate their
 * slightly different functions.
 *
 *
 * To read XLogCtl->LogwrtResult, you must hold either info_lck or
 * WALWriteLock.  To update it, you need to hold both locks.  The point of
 * this arrangement is that the value can be examined by code that already
 * holds WALWriteLock without needing to grab info_lck as well.  In addition
 * to the shared variable, each backend has a private copy of LogwrtResult,
 * which is updated when convenient.
 *
 * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
 * (protected by info_lck), but we don't need to cache any copies of it.
 *
 * info_lck is only held long enough to read/update the protected variables,
 * so it's a plain spinlock.  The other locks are held longer (potentially
 * over I/O operations), so we use LWLocks for them.  These locks are:
 *
 * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
 * It is only held while initializing and changing the mapping.  If the
 * contents of the buffer being replaced haven't been written yet, the mapping
 * lock is released while the write is done, and reacquired afterwards.
 *
 * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
 * XLogFlush).
 *
 * ControlFileLock: must be held to read/update control file or create
 * new log file.
 *
 * CheckpointLock: must be held to do a checkpoint or restartpoint (ensures
 * only one checkpointer at a time; currently, with all checkpoints done by
 * the checkpointer, this is just pro forma).
 *
 *----------
 *
 */

typedef struct XLogwrtRqst
{
	XLogRecPtr	Write;			/* last byte + 1 to write out */
	XLogRecPtr	Flush;			/* last byte + 1 to flush */
} XLogwrtRqst;

typedef struct XLogwrtResult
{
	XLogRecPtr	Write;			/* last byte + 1 written out */
	XLogRecPtr	Flush;			/* last byte + 1 flushed */
} XLogwrtResult;

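Write

Write means that XLOG before this position has been written to the operating system cache (it is not necessarily on disk yet).

Flush

Flush means that XLOG before this position has reached disk.

Because we only look at synchronous commit here, Write and Flush can be treated as equal once a flush has completed.

XLogCtl->LogwrtRqst and XLogCtl->LogwrtResult live in shared memory and are shared by all processes, so access to them must be locked. As the comment above states: reading LogwrtResult requires either info_lck or WALWriteLock, and updating it requires both; LogwrtRqst is protected by info_lck.
XLogwrtRqst is the LSN up to which XLOG needs to be written and flushed; XLogwrtResult is the LSN up to which that has already happened.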
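
The relationship between the request and result positions can be sketched as follows. This is a single-threaded toy, assuming the caller would hold the locks just described; `ToyWrtPos` and `toy_publish_result` are hypothetical names. It mirrors the bookkeeping at the end of XLogWrite, where the shared request is never allowed to fall behind the shared result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyWrtPos
{
    uint64_t Write;  /* last byte + 1 written (or to write) */
    uint64_t Flush;  /* last byte + 1 flushed (or to flush) */
} ToyWrtPos;

/*
 * Publish a backend's local result into the shared copies.
 * In PostgreSQL this happens under the locks described above;
 * locking is deliberately left out of this sketch.
 */
void
toy_publish_result(ToyWrtPos *shared_result, ToyWrtPos *shared_rqst,
                   ToyWrtPos local_result)
{
    assert(shared_result != NULL && shared_rqst != NULL);
    /* Data cannot be flushed before it has been written. */
    assert(local_result.Flush <= local_result.Write);

    *shared_result = local_result;
    if (shared_rqst->Write < local_result.Write)
        shared_rqst->Write = local_result.Write;
    if (shared_rqst->Flush < local_result.Flush)
        shared_rqst->Flush = local_result.Flush;
}
```

Publishing a local result of Write = 100, Flush = 80 against a shared request of Write = 50, Flush = 120 raises the request's Write to 100 and leaves its Flush at 120.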

XLogFlush() @xlog.c

void XLogFlush(XLogRecPtr record)

Flushes all XLOG up to record to disk.

/*
 * Ensure that all XLOG data through the given position is flushed to disk.
 *
 * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
 * already held, and we try to avoid acquiring it if possible.
 */
void
XLogFlush(XLogRecPtr record)
{
	XLogRecPtr	WriteRqstPtr;
	XLogwrtRqst WriteRqst;

	/*
	 * During REDO, we are reading not writing WAL.  Therefore, instead of
	 * trying to flush the WAL, we should update minRecoveryPoint instead. We
	 * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
	 * to act this way too, and because when it tries to write the
	 * end-of-recovery checkpoint, it should indeed flush.
	 */
	if (!XLogInsertAllowed())
	{
		UpdateMinRecoveryPoint(record, false);
		return;
	}

	/* Quick exit if already known flushed */
	if (record <= LogwrtResult.Flush)
		return;

#ifdef WAL_DEBUG
	if (XLOG_DEBUG)
		elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
			 (uint32) (record >> 32), (uint32) record,
			 (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
			 (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif

	START_CRIT_SECTION();

	/*
	 * Since fsync is usually a horribly expensive operation, we try to
	 * piggyback as much data as we can on each fsync: if we see any more data
	 * entered into the xlog buffer, we'll write and fsync that too, so that
	 * the final value of LogwrtResult.Flush is as large as possible. This
	 * gives us some chance of avoiding another fsync immediately after.
	 */

	/* initialize to given target; may increase below */
	WriteRqstPtr = record;

	/*
	 * Now wait until we get the write lock, or someone else does the flush
	 * for us.
	 */
	for (;;)
	{
		XLogRecPtr	insertpos;

		/* read LogwrtResult and update local state */
		SpinLockAcquire(&XLogCtl->info_lck);
		if (WriteRqstPtr < XLogCtl->LogwrtRqst.Write)
			WriteRqstPtr = XLogCtl->LogwrtRqst.Write;
		LogwrtResult = XLogCtl->LogwrtResult;
		SpinLockRelease(&XLogCtl->info_lck);

		/* done already? */
		if (record <= LogwrtResult.Flush)
			break;

		/*
		 * Before actually performing the write, wait for all in-flight
		 * insertions to the pages we're about to write to finish.
		 */
		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);

		/*
		 * Try to get the write lock. If we can't get it immediately, wait
		 * until it's released, and recheck if we still need to do the flush
		 * or if the backend that held the lock did it for us already. This
		 * helps to maintain a good rate of group committing when the system
		 * is bottlenecked by the speed of fsyncing.
		 */
		if (!LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
		{
			/*
			 * The lock is now free, but we didn't acquire it yet. Before we
			 * do, loop back to check if someone else flushed the record for
			 * us already.
			 */
			continue;
		}

		/* Got the lock; recheck whether request is satisfied */
		LogwrtResult = XLogCtl->LogwrtResult;
		if (record <= LogwrtResult.Flush)
		{
			LWLockRelease(WALWriteLock);
			break;
		}

		/*
		 * Sleep before flush! By adding a delay here, we may give further
		 * backends the opportunity to join the backlog of group commit
		 * followers; this can significantly improve transaction throughput,
		 * at the risk of increasing transaction latency.
		 *
		 * We do not sleep if enableFsync is not turned on, nor if there are
		 * fewer than CommitSiblings other backends with active transactions.
		 */
		if (CommitDelay > 0 && enableFsync &&
			MinimumActiveBackends(CommitSiblings))
		{
			pg_usleep(CommitDelay);

			/*
			 * Re-check how far we can now flush the WAL. It's generally not
			 * safe to call WaitXLogInsertionsToFinish while holding
			 * WALWriteLock, because an in-progress insertion might need to
			 * also grab WALWriteLock to make progress. But we know that all
			 * the insertions up to insertpos have already finished, because
			 * that's what the earlier WaitXLogInsertionsToFinish() returned.
			 * We're only calling it again to allow insertpos to be moved
			 * further forward, not to actually wait for anyone.
			 */
			insertpos = WaitXLogInsertionsToFinish(insertpos);
		}

		/* try to write/flush later additions to XLOG as well */
		WriteRqst.Write = insertpos;
		WriteRqst.Flush = insertpos;

		XLogWrite(WriteRqst, false);

		LWLockRelease(WALWriteLock);
		/* done */
		break;
	}

	END_CRIT_SECTION();

	/* wake up walsenders now that we've released heavily contended locks */
	WalSndWakeupProcessRequests();

	/*
	 * If we still haven't flushed to the request point then we have a
	 * problem; most likely, the requested flush point is past end of XLOG.
	 * This has been seen to occur when a disk page has a corrupted LSN.
	 *
	 * Formerly we treated this as a PANIC condition, but that hurts the
	 * system's robustness rather than helping it: we do not want to take down
	 * the whole system due to corruption on one data page.  In particular, if
	 * the bad page is encountered again during recovery then we would be
	 * unable to restart the database at all!  (This scenario actually
	 * happened in the field several times with 7.1 releases.)	As of 8.4, bad
	 * LSNs encountered during recovery are UpdateMinRecoveryPoint's problem;
	 * the only time we can reach here during recovery is while flushing the
	 * end-of-recovery checkpoint record, and we don't expect that to have a
	 * bad LSN.
	 *
	 * Note that for calls from xact.c, the ERROR will be promoted to PANIC
	 * since xact.c calls this routine inside a critical section.  However,
	 * calls from bufmgr.c are not within critical sections and so we will not
	 * force a restart for a bad LSN on a data page.
	 */
	if (LogwrtResult.Flush < record)
		elog(ERROR,
			 "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
			 (uint32) (record >> 32), (uint32) record,
			 (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
}

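First, record is compared with the backend-local cached LogwrtResult.Flush to decide whether XLOG up to record is already on disk; if it is, the function returns immediately. (This is an optimization: the cached LogwrtResult is never newer than the shared XLogCtl->LogwrtResult, so if record is at or below the cached Flush there is no need to touch XLogCtl at all. Accessing XLogCtl requires a lock, and locking has a cost.)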

/* Quick exit if already known flushed */
if (record <= LogwrtResult.Flush)
	return;

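Next, info_lck is taken and the shared LogwrtResult is read to refresh the local copy. As noted earlier, reading the shared LogwrtResult and LogwrtRqst requires holding info_lck. After the refresh, the function checks again whether XLOG up to record is already on disk.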

/* read LogwrtResult and update local state */
SpinLockAcquire(&XLogCtl->info_lck);
if (WriteRqstPtr < XLogCtl->LogwrtRqst.Write)
    WriteRqstPtr = XLogCtl->LogwrtRqst.Write;
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);

Next, we need to "wait for all in-flight insertions to the pages we're about to write to finish".

/*
 * Before actually performing the write, wait for all in-flight
 * insertions to the pages we're about to write to finish.
 */
insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);

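Then WALWriteLock has to be acquired; it must be held before any XLOG is written to disk. Once the lock is held, the shared LogwrtResult is read once more to check whether XLOG up to record has been flushed in the meantime.

Next, Write and Flush of the temporary WriteRqst are both set to insertpos, meaning that we want all XLOG up to insertpos on disk. insertpos may lie beyond record: other backends may have inserted more XLOG by now, and it is flushed along with ours. XLogWrite is then called to do the actual write.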

/* try to write/flush later additions to XLOG as well */
WriteRqst.Write = insertpos;
WriteRqst.Flush = insertpos;

XLogWrite(WriteRqst, false);

LWLockRelease(WALWriteLock);
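
The overall shape of XLogFlush (cheap local check, refresh from shared state, then flush as far as possible) can be condensed into a toy model. This is a sketch under assumptions, not PostgreSQL code: `ToyWal` and `toy_xlog_flush` are hypothetical, there is no locking or waiting, and an fsync is simulated by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyWal
{
    uint64_t inserted;     /* end of WAL inserted so far (stand-in for insertpos) */
    uint64_t flushed;      /* shared "flushed up to" position */
    int      fsync_calls;  /* number of simulated fsyncs */
} ToyWal;

/* Returns true if everything up to 'record' is flushed on return. */
bool
toy_xlog_flush(ToyWal *wal, uint64_t *local_flushed, uint64_t record)
{
    assert(wal != NULL && local_flushed != NULL);

    /* Quick exit using only the backend-local copy. */
    if (record <= *local_flushed)
        return true;

    /* Refresh the local copy from shared state and re-check. */
    *local_flushed = wal->flushed;
    if (record <= *local_flushed)
        return true;

    /* Piggyback: flush everything inserted so far, not just 'record'. */
    if (wal->inserted > wal->flushed)
    {
        wal->flushed = wal->inserted;
        wal->fsync_calls++;
    }
    *local_flushed = wal->flushed;

    /* False corresponds to the "request is not satisfied" error case. */
    return record <= *local_flushed;
}
```

With 500 bytes inserted and 100 flushed, a request for position 300 triggers one simulated fsync that covers everything up to 500; a later request for 450 is then answered from the local copy without another fsync.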

XLogWrite() @xlog.c

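XLogWrite's main parameter is an XLogwrtRqst, which gives the highest LSN that has to be written. The function also depends on some important context:

XLogCtl: supplies LogwrtResult, pages, xlblocks and other members that are needed.
openLogFile: the file descriptor of the currently open segment file (in the code below it is passed directly to pg_pwrite).
openLogSegNo: the segment number of the currently open segment file.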

/*
 * Write and/or fsync the log at least as far as WriteRqst indicates.
 *
 * If flexible == true, we don't have to write as far as WriteRqst, but
 * may stop at any convenient boundary (such as a cache or logfile boundary).
 * This option allows us to avoid uselessly issuing multiple writes when a
 * single one would do.
 *
 * Must be called with WALWriteLock held. WaitXLogInsertionsToFinish(WriteRqst)
 * must be called before grabbing the lock, to make sure the data is ready to
 * write.
 */
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
	bool		ispartialpage;
	bool		last_iteration;
	bool		finishing_seg;
	bool		use_existent;
	int			curridx;
	int			npages;
	int			startidx;
	uint32		startoffset;

	/* We should always be inside a critical section here */
	Assert(CritSectionCount > 0);

	/*
	 * Update local LogwrtResult (caller probably did this already, but...)
	 */
	LogwrtResult = XLogCtl->LogwrtResult;

	/*
	 * Since successive pages in the xlog cache are consecutively allocated,
	 * we can usually gather multiple pages together and issue just one
	 * write() call.  npages is the number of pages we have determined can be
	 * written together; startidx is the cache block index of the first one,
	 * and startoffset is the file offset at which it should go. The latter
	 * two variables are only valid when npages > 0, but we must initialize
	 * all of them to keep the compiler quiet.
	 */
	npages = 0;
	startidx = 0;
	startoffset = 0;

	/*
	 * Within the loop, curridx is the cache block index of the page to
	 * consider writing.  Begin at the buffer containing the next unwritten
	 * page, or last partially written page.
	 */
	curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);

	while (LogwrtResult.Write < WriteRqst.Write)
	{
		/*
		 * Make sure we're not ahead of the insert process.  This could happen
		 * if we're passed a bogus WriteRqst.Write that is past the end of the
		 * last page that's been initialized by AdvanceXLInsertBuffer.
		 */
		XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];

		if (LogwrtResult.Write >= EndPtr)
			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
				 (uint32) (LogwrtResult.Write >> 32),
				 (uint32) LogwrtResult.Write,
				 (uint32) (EndPtr >> 32), (uint32) EndPtr);

		/* Advance LogwrtResult.Write to end of current buffer page */
		LogwrtResult.Write = EndPtr;
		ispartialpage = WriteRqst.Write < LogwrtResult.Write;

		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
							 wal_segment_size))
		{
			/*
			 * Switch to new logfile segment.  We cannot have any pending
			 * pages here (since we dump what we have at segment end).
			 */
			Assert(npages == 0);
			if (openLogFile >= 0)
				XLogFileClose();
			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
							wal_segment_size);

			/* create/use new log file */
			use_existent = true;
			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
			ReserveExternalFD();
		}

		/* Make sure we have the current logfile open */
		if (openLogFile < 0)
		{
			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
							wal_segment_size);
			openLogFile = XLogFileOpen(openLogSegNo);
			ReserveExternalFD();
		}

		/* Add current page to the set of pending pages-to-dump */
		if (npages == 0)
		{
			/* first of group */
			startidx = curridx;
			startoffset = XLogSegmentOffset(LogwrtResult.Write - XLOG_BLCKSZ,
											wal_segment_size);
		}
		npages++;

		/*
		 * Dump the set if this will be the last loop iteration, or if we are
		 * at the last page of the cache area (since the next page won't be
		 * contiguous in memory), or if we are at the end of the logfile
		 * segment.
		 */
		last_iteration = WriteRqst.Write <= LogwrtResult.Write;

		finishing_seg = !ispartialpage &&
			(startoffset + npages * XLOG_BLCKSZ) >= wal_segment_size;

		if (last_iteration ||
			curridx == XLogCtl->XLogCacheBlck ||
			finishing_seg)
		{
			char	   *from;
			Size		nbytes;
			Size		nleft;
			int			written;

			/* OK to write the page(s) */
			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
			nbytes = npages * (Size) XLOG_BLCKSZ;
			nleft = nbytes;
			do
			{
				errno = 0;
				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
				written = pg_pwrite(openLogFile, from, nleft, startoffset);
				pgstat_report_wait_end();
				if (written <= 0)
				{
					char		xlogfname[MAXFNAMELEN];
					int			save_errno;

					if (errno == EINTR)
						continue;

					save_errno = errno;
					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
								 wal_segment_size);
					errno = save_errno;
					ereport(PANIC,
							(errcode_for_file_access(),
							 errmsg("could not write to log file %s "
									"at offset %u, length %zu: %m",
									xlogfname, startoffset, nleft)));
				}
				nleft -= written;
				from += written;
				startoffset += written;
			} while (nleft > 0);

			npages = 0;

			/*
			 * If we just wrote the whole last page of a logfile segment,
			 * fsync the segment immediately.  This avoids having to go back
			 * and re-open prior segments when an fsync request comes along
			 * later. Doing it here ensures that one and only one backend will
			 * perform this fsync.
			 *
			 * This is also the right place to notify the Archiver that the
			 * segment is ready to copy to archival storage, and to update the
			 * timer for archive_timeout, and to signal for a checkpoint if
			 * too many logfile segments have been used since the last
			 * checkpoint.
			 */
			if (finishing_seg)
			{
				issue_xlog_fsync(openLogFile, openLogSegNo);

				/* signal that we need to wakeup walsenders later */
				WalSndWakeupRequest();

				LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */

				if (XLogArchivingActive())
					XLogArchiveNotifySeg(openLogSegNo);

				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;

				/*
				 * Request a checkpoint if we've consumed too much xlog since
				 * the last one.  For speed, we first check using the local
				 * copy of RedoRecPtr, which might be out of date; if it looks
				 * like a checkpoint is needed, forcibly update RedoRecPtr and
				 * recheck.
				 */
				if (IsUnderPostmaster && XLogCheckpointNeeded(openLogSegNo))
				{
					(void) GetRedoRecPtr();
					if (XLogCheckpointNeeded(openLogSegNo))
						RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
				}
			}
		}

		if (ispartialpage)
		{
			/* Only asked to write a partial page */
			LogwrtResult.Write = WriteRqst.Write;
			break;
		}
		curridx = NextBufIdx(curridx);

		/* If flexible, break out of loop as soon as we wrote something */
		if (flexible && npages == 0)
			break;
	}

	Assert(npages == 0);

	/*
	 * If asked to flush, do so
	 */
	if (LogwrtResult.Flush < WriteRqst.Flush &&
		LogwrtResult.Flush < LogwrtResult.Write)

	{
		/*
		 * Could get here without iterating above loop, in which case we might
		 * have no open file or the wrong one.  However, we do not need to
		 * fsync more than one file.
		 */
		if (sync_method != SYNC_METHOD_OPEN &&
			sync_method != SYNC_METHOD_OPEN_DSYNC)
		{
			if (openLogFile >= 0 &&
				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
								 wal_segment_size))
				XLogFileClose();
			if (openLogFile < 0)
			{
				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
								wal_segment_size);
				openLogFile = XLogFileOpen(openLogSegNo);
				ReserveExternalFD();
			}

			issue_xlog_fsync(openLogFile, openLogSegNo);
		}

		/* signal that we need to wakeup walsenders later */
		WalSndWakeupRequest();

		LogwrtResult.Flush = LogwrtResult.Write;
	}

	/*
	 * Update shared-memory status
	 *
	 * We make sure that the shared 'request' values do not fall behind the
	 * 'result' values.  This is not absolutely essential, but it saves some
	 * code in a couple of places.
	 */
	{
		SpinLockAcquire(&XLogCtl->info_lck);
		XLogCtl->LogwrtResult = LogwrtResult;
		if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write)
			XLogCtl->LogwrtRqst.Write = LogwrtResult.Write;
		if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush)
			XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
		SpinLockRelease(&XLogCtl->info_lck);
	}
}

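The first scenario (the most complex one):

The figure above shows how LSN, log buffer and physical files correspond. Green marks XLOG that is already on disk, blue marks XLOG that is not yet on disk.

LSN: if all of the log (including segment and page management information) is viewed as one long byte sequence, the current LSN is the end of the log written so far (the right end is exclusive).
log buffer: XLOG is first written into the log buffer. As described earlier, GetXLogBuffer determines which buffer page the XLOG at a given LSN goes into. The log buffer is a circular queue with page 1 at its head; once the tail is full, writing wraps around to the head (at which point eviction has to be considered).
disk: the physical files. All XLOG in the log buffer eventually ends up in a physical file. How do we know which physical block a buffer page corresponds to? Again through the LSN. One segment corresponds to one physical file, and one page corresponds to two physical blocks in that file (a page in PG is 8KB, a disk block is 4KB). Segment number = LSN / XLogSegSize, byte offset within the segment = LSN % XLogSegSize.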
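
The two formulas at the end of the list can be tried out directly. The sketch below assumes a block size of 8192 and takes the segment size as a parameter; the function names are hypothetical and only restate the arithmetic from the text.

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192  /* assumed WAL block size */

/* Segment number = LSN / segment size. */
uint64_t
lsn_to_segno(uint64_t lsn, uint64_t seg_size)
{
    assert(seg_size > 0);
    return lsn / seg_size;
}

/* Byte offset within the segment = LSN % segment size. */
uint64_t
lsn_to_seg_offset(uint64_t lsn, uint64_t seg_size)
{
    assert(seg_size > 0);
    return lsn % seg_size;
}

/* Page number within the segment that contains the LSN. */
uint64_t
lsn_to_seg_page(uint64_t lsn, uint64_t seg_size)
{
    return lsn_to_seg_offset(lsn, seg_size) / XLOG_BLCKSZ;
}
```

For a 16MB segment size (16777216 bytes), the LSN 50339856 falls into segment 3 at byte offset 8208, which is 16 bytes into the second page of that file.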

npages = 0;
startidx = 0;
startoffset = 0;

/*
 * Within the loop, curridx is the cache block index of the page to
 * consider writing.  Begin at the buffer containing the next unwritten
 * page, or last partially written page.
 */
curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);

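npages counts the pages that need to be written out; startidx is the index of the first xlog buffer page to write; startoffset is the position in the segment file where writing starts. The log buffer is stored as an array in the pages member of XLogCtlData, with one buffer page per element, and both startidx and curridx are indexes into that array.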

while (LogwrtResult.Write < WriteRqst.Write)
{
    /*
     * Make sure we're not ahead of the insert process.  This could happen
     * if we're passed a bogus WriteRqst.Write that is past the end of the
     * last page that's been initialized by AdvanceXLInsertBuffer.
     */
    XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];

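xlblocks entries correspond one-to-one to buffer pages. An xlblocks entry holds the upper bound (exclusive) of the LSNs that can currently be written to its buffer page. Since a buffer page always has size XLOG_BLCKSZ, subtracting XLOG_BLCKSZ from the entry, i.e. xlblocks[idx] - XLOG_BLCKSZ, gives the lower bound of the LSNs on that page.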
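
The lower bound also explains why XLogWrite computes the file offset from `LogwrtResult.Write - XLOG_BLCKSZ` rather than from the end pointer itself. A minimal sketch, assuming a block size of 8192 and hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192  /* assumed WAL block size */

/* First LSN on the page whose (exclusive) end LSN is 'page_end'. */
uint64_t
page_start_lsn(uint64_t page_end)
{
    assert(page_end >= XLOG_BLCKSZ && page_end % XLOG_BLCKSZ == 0);
    return page_end - XLOG_BLCKSZ;
}

/* Offset in the segment file at which this page has to be written. */
uint64_t
page_file_offset(uint64_t page_end, uint64_t seg_size)
{
    assert(seg_size > 0);
    return page_start_lsn(page_end) % seg_size;
}
```

For the last page of a 16MB segment, the end LSN is exactly the segment boundary. Taking the offset of the end LSN would give 0, which belongs to the next file, while the page start gives 16769024, the correct offset in the current file.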

/* Add current page to the set of pending pages-to-dump */
if (npages == 0)
{
    /* first of group */
    startidx = curridx;
    startoffset = XLogSegmentOffset(LogwrtResult.Write - XLOG_BLCKSZ,
                                    wal_segment_size);
}
npages++;


/*
 * Dump the set if this will be the last loop iteration, or if we are
 * at the last page of the cache area (since the next page won't be
 * contiguous in memory), or if we are at the end of the logfile
 * segment.
 */
last_iteration = WriteRqst.Write <= LogwrtResult.Write;

finishing_seg = !ispartialpage &&
    (startoffset + npages * XLOG_BLCKSZ) >= wal_segment_size;

if (last_iteration ||
    curridx == XLogCtl->XLogCacheBlck ||
    finishing_seg)
{
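
The condition above decides when the collected pages are dumped with a single write. The effect of the "last page of the cache area" rule can be sketched as follows. This is a simplified model with hypothetical names (`WriteChunk`, `plan_writes`): it only handles the last-iteration and wrap-around cases and ignores segment boundaries and partial pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct WriteChunk
{
    int startidx;  /* first buffer slot of the chunk */
    int npages;    /* number of contiguous pages in the chunk */
} WriteChunk;

/*
 * Split 'total_pages' pages starting at slot 'first_idx' of a ring with
 * 'nbuffers' slots into chunks that are contiguous in memory.
 * Returns the number of chunks, or -1 if 'out' is too small.
 */
int
plan_writes(int first_idx, int total_pages, int nbuffers,
            WriteChunk *out, int max_out)
{
    int nchunks = 0;
    int npages = 0;
    int startidx = 0;
    int curridx = first_idx;

    assert(out != NULL && nbuffers > 0);
    assert(first_idx >= 0 && first_idx < nbuffers);

    for (int done = 0; done < total_pages; done++)
    {
        bool last_iteration = (done == total_pages - 1);

        if (npages == 0)
            startidx = curridx;  /* first page of a new group */
        npages++;

        /* Dump at the end of the request or at the end of the ring. */
        if (last_iteration || curridx == nbuffers - 1)
        {
            if (nchunks == max_out)
                return -1;
            out[nchunks].startidx = startidx;
            out[nchunks].npages = npages;
            nchunks++;
            npages = 0;
        }
        curridx = (curridx + 1) % nbuffers;
    }
    return nchunks;
}
```

Writing four pages starting at slot 2 of a four-slot ring produces two chunks, slots 2 to 3 and slots 0 to 1, because the pages after the wrap-around are no longer contiguous in memory.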

checkpoint

WalWriterMain mainly calls XLogBackgroundFlush to write pending XLOG to the WAL files.

References: PostgreSQL启动过程中的那些事十九：walwriter进程一, PostgreSQL启动过程中的那些事十九：walwriter进程二