NVM WAL BUFFER
【1】Using a memory-mapped WAL segment file as the WAL buffer
Non-volatile Memory Logging(PGCon 2016) https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
Persistent Memory(SNIA) https://www.snia.org/PM
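The idea in the heading above (a WAL segment file mapped into memory and used directly as the WAL buffer) can be sketched roughly as follows. This is only an illustration, not the patch discussed in 【1】: the class name `MappedWalSegment`, the length-prefixed record layout, and the 64 KB segment size are all assumptions made for the example, and Python's `mmap.flush()` stands in for whatever persistence step a real PMem implementation would use.

```python
import mmap
import os
import struct

SEGMENT_SIZE = 64 * 1024  # hypothetical size, chosen small for the example


class MappedWalSegment:
    """Toy WAL buffer backed directly by a memory-mapped segment file."""

    def __init__(self, path, size=SEGMENT_SIZE):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, size)          # pre-allocate the whole segment
        self.map = mmap.mmap(self.fd, size)  # shared, read/write mapping
        self.insert_pos = 0                  # next free byte in the segment

    def append(self, payload: bytes) -> int:
        """Copy one length-prefixed record into the mapping; return its offset."""
        record = struct.pack("<I", len(payload)) + payload
        start = self.insert_pos
        end = start + len(record)
        if end > len(self.map):
            raise ValueError("segment full")
        self.map[start:end] = record         # the "buffer" is the file itself
        self.insert_pos = end
        return start

    def flush(self):
        """Ask the OS to write the mapped bytes back to the file."""
        self.map.flush()

    def close(self):
        self.map.close()
        os.close(self.fd)


def read_record(path, offset):
    """Read one record back through ordinary file IO, to check what was stored."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack("<I", f.read(4))
        return f.read(length)
```

The point of the design is that there is no separate in-memory buffer to copy out of: appending a record and writing it to the segment are the same store.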
Reading: article 1
I read the article “The impact of NVM as the main memory on the database management system” (https://www.programmersought.com/article/75573496597/). It is in fact a condensed version of the paper “Implications of Non-Volatile Memory as Primary Storage for Database Management Systems”.
“In addition, there are built-in substances related to leakage and voltage that limit the further expansion of DRAM. Therefore, as the main memory medium, DRAM cannot possibly keep up with the growth of current and future data sets.” In other words, DRAM capacity cannot keep scaling up.
“The simplest design method is to replace NVM with the disk, and use its low latency to obtain performance improvement. However, adapting the DBMS to the characteristics of NVM goes far beyond its low latency.” The reason is that NVM is byte-addressable, whereas a traditional disk is block-addressable.
The article also looks at how to deploy NVM in PostgreSQL (PG):
- “how to include NVM in the current system’s memory structure;”
- “modifying the PostgreSQL storage engine to maximize the dividend of NVM”
- “bypass the slow disk interface while ensuring the robustness of the DBMS” (I do not fully understand this point.)
“the delay of STT-RAM is 1-20ns. Nevertheless, his delay is already very close to DRAM.”; “PC_RAM and R-RAM have higher write latency than DRAM. But write latency is not very important, because it can be alleviated by buffer.” (Does this mean using DRAM as a buffer in front of NVM? It could also mean that the CPU's internal caches act as the buffer for NVM.)
“When using NVM as the main memory, not only the application software but also the system software needs to be modified, so that the advantages of NVM can be fully utilized.”
- Why does the system software need to change? “The traditional file system accesses the storage medium through the block layer. If only the disk is replaced with NVM without any modification, then the NVM storage also needs to pass the block layer to read and write data.” In that case the byte addressability of NVM goes unused.
“PMFS is a POSIX file system developed by Intel and open source”. As a replacement for a traditional file system, PMFS has two notable properties:
- “NVM and memory are addressed uniformly”. This means “there is no need to copy data from NVM to DRAM for application access”.
- “traditional databases access blocks in two ways: file IO; memory mapped IO. PMFS implements file IO in a manner similar to traditional FS. However, the implementation of memory mapped IO is different.” The reason: “In the traditional file system, memory mapped IO first copies pages to DRAM. PMFS does not need this step, it directly maps pages directly to the address space of the process”. (Open question: is PMFS software that runs on top of the operating system, or a change to the operating system itself?)
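The two access paths named in the quote (file IO and memory mapped IO) look like this from an application's point of view. This is a minimal sketch on an ordinary file; the function names are made up for the example, and nothing here is specific to PMFS. On a traditional file system the mapping is backed by DRAM pages, and the quote's claim is that PMFS would map the NVM pages themselves.

```python
import mmap
import os


def read_via_file_io(path, offset, length):
    """File IO: the kernel copies the requested bytes into a new buffer."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def update_via_mmap(path, offset, new_bytes):
    """Memory-mapped IO: the file appears in the process address space
    and is modified in place, without read()/write() calls."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:   # length 0 maps the whole file
            m[offset:offset + len(new_bytes)] = new_bytes
            m.flush()                          # write the change back
```

The application code for the mapped path is the same either way; what differs between file systems is whether a copy into DRAM happens underneath it.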

memory layered design
The article presents three variants of “DBMS memory layered design based on NVM”:

The article advises against replacing DRAM with NVM, because “such changes require a redesign of the current operating system and application software. In addition, as a substitute for DRAM, NVM technology is not mature in terms of durability. Therefore, we advocate that the platform still contains DRAM memory, and the disk is completely or partially replaced with NVM”.
In the DRAM+NVM combination, DRAM handles transient data quickly, while “it allows applications to access the data of the database system through the PMFS file system, and uses the NVM byte addressing feature to avoid the API overhead of the current traditional file system.” (I do not understand which overhead byte addressing avoids here.)
How the key parts of the DBMS change when NVM replaces the disk:
- Avoid block-level access. When NVM is the main memory medium, byte addressing is more efficient than block addressing, but “this reduces the data granularity to the byte level, without data warm-up. A better method needs to balance the advantages of these two aspects.” (Open question: what is data warm-up?)
- Remove the internal buffer cache of the DBMS: “If the address space of NVM can be seen by other processes, there is no need to do block copying for a long time. Direct access to records in NVM will be more efficient. However, this requires an operating system that supports NVM, such as PMFS, which can directly expose the NVM address space to the process.” (I think a buffer cache is probably still needed.)
- Remove redo logs: “if the internal buffer cache is not deployed, when all writes are directly written to NVM, redo log is not needed, but undo log is still needed.” (For the simplest recovery scheme, in which the log record is first written to persistent storage and the corresponding update is then also written to persistent storage, a redo log is indeed unnecessary. This does not apply to PG, however, because PG writes updates into its buffer first.)
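To make the "no redo log, but still an undo log" point concrete, here is a toy sketch of in-place updates protected only by before-images. The class `InPlaceStore` is hypothetical, and a `bytearray` stands in for NVM-resident records, so nothing here is actually persistent; on real NVM each undo record would itself have to be made durable before the in-place write it protects.

```python
class InPlaceStore:
    """Toy byte-addressable store updated in place, protected by an undo log."""

    def __init__(self, size):
        self.data = bytearray(size)  # stands in for NVM-resident records
        self.undo_log = []           # (offset, old_bytes) for the open transaction

    def write(self, offset, new_bytes):
        end = offset + len(new_bytes)
        if end > len(self.data):
            raise ValueError("write past end of store")
        # Save the before-image first, then overwrite in place.
        self.undo_log.append((offset, bytes(self.data[offset:end])))
        self.data[offset:end] = new_bytes

    def commit(self):
        # The new values are already in the store, so nothing needs replaying;
        # the before-images can simply be discarded.
        self.undo_log.clear()

    def abort(self):
        # Restore before-images in reverse order of the writes.
        while self.undo_log:
            offset, old = self.undo_log.pop()
            self.data[offset:offset + len(old)] = old
```

Commit has nothing to redo because every write already landed at its final location; only an unfinished transaction needs the log, to roll its writes back.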
PG read and write architecture

Fig. 3(a) shows the I/O path of the traditional PG database: “when the buffer cache is missed, the storage engine of the native PG will cause two copy operations. When the data set is very large, this will be a big overhead”.
PMEM
Reference: 【3】, which covers mmap, ndctl, clwb, and many related topics.
mmap
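“For persistent memory, a PMem-aware file system allows a memory mapped file to access the PMem directly, a feature known as DAX.” With an ordinary mmap, the program operates on the memory that the file has been mapped into; with PM, mmap lets the program operate on the PM itself, directly through virtual addresses.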
“Unlike memory mapped files on storage, where the OS performs paging to DRAM as necessary, the application is able to access persistent memory data structures in-place, right where they are located in PMem.”
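However, because the CPU has its own caches, a store is not guaranteed to have reached PM. This is why instructions such as clwb, clflush, and ntstore (non-temporal store) exist.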

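A PMem-aware file system is a file system that supports DAX, for example ext4. On such a file system, the standard file API calls mmap() and msync() can be used. In addition, if mmap() succeeds with the MAP_SYNC flag, the user program can use persistence instructions such as clwb directly.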
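The portable mmap() plus msync() path mentioned above can be sketched as follows. Python's `mmap.flush(offset, size)` is used as the msync() equivalent, and it requires a page-aligned offset, so the sketch first rounds the touched byte range out to page boundaries. The helper names are made up for the example. The MAP_SYNC plus clwb path is not shown, since user-level cache flush instructions are not reachable from plain Python.

```python
import mmap


def page_aligned_range(offset, length, page_size=mmap.PAGESIZE):
    """Return (start, size) of the smallest page-aligned range covering the bytes."""
    start = (offset // page_size) * page_size
    end = -(-(offset + length) // page_size) * page_size  # round up to a page
    return start, end - start


def store_and_sync(mapping, offset, payload):
    """Store bytes into a shared file mapping, then sync only the touched pages."""
    mapping[offset:offset + len(payload)] = payload
    start, size = page_aligned_range(offset, len(payload))
    size = min(size, len(mapping) - start)  # do not run past the mapping
    mapping.flush(start, size)
```

The page rounding shows the cost of this path: even a few stored bytes are synced at page granularity, which is the overhead that cache-line-level instructions such as clwb are meant to avoid.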
Device DAX
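Accessing PM through a PMem-aware file system is the fsdax mode; accessing PM without one is the devdax mode. What is the difference between the two?
- 【2】Because fsdax mode exposes a standard POSIX file system interface, it has better compatibility: an application can use PMEM as a very fast disk without any modification. But because it emulates disk behavior, the I/O granularity is 512 bytes. For hardware like PMEM, which delivers ultra-low latency at cache-line-sized I/O, that granularity is still too coarse to exploit PMEM's full capability. Only devdax mode fully exposes PMEM's performance, persistence, and byte addressability. The one drawback is that using devdax requires rewriting existing applications to interface with the PMEM device. Programming on non-volatile memory also differs considerably from traditional programming and raises many issues that did not need handling before. So for many existing applications, devdax “looks great” but cannot be adopted. (Note: this contrasts with the mmap section above, where a file on a DAX-capable file system is mapped and then accessed in place with loads and stores; the 512-byte figure presumably refers to block-style read/write access.)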
DAX FS
See 【2】 and 【4】.
References
【1】Discussion of the NVM WAL BUFFER on the PostgreSQL mailing list: https://blog.csdn.net/yanzongshuai/article/details/111940166
https://mp.ofweek.com/Internet/a956714781157
Yoshimi Ichiyanagi's NVM WAL BUFFER email, on the PostgreSQL website: https://www.postgresql.org/message-id/C20D38E97BCB33DAD59E3A1%40lab.ntt.co.jp
The author zys87 has written many related articles in the “PostgreSQL source code analysis” series, for example:
Non-volatile WAL buffer: https://yanzongshuaidba.blog.csdn.net/article/details/104116169
The corresponding column, PostgreSQL source code analysis: https://yanzongshuaidba.blog.csdn.net/category_9278392.html
【2】dax fs: https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf
MemVerge (PMEM): https://card.weibo.com/article/m/show/id/2309404633085408051536
【3】pmem official site (covers mmap, ndctl, clwb, and many related topics): https://pmem.io/
【4】Linux documentation on DAX: https://www.kernel.org/doc/Documentation/filesystems/dax.txt
Links
Using PMDK to adapt WAL operations to persistent memory: https://blog.51cto.com/yanzongshuai/2428535
yzs appears to be an expert on PostgreSQL internals: https://blog.51cto.com/yanzongshuai