SimpleDB P1：数据库IO的接口

顾建平 · 发表于 2022-9-20 14:37:54

对应于书中的 3.5

<hr/>
小结：

cache 的优化思想！
共享指针 和 vector 优化内存管理
通过 trunc 打开一个文件后，应该重新关闭再打开一次。
write 和 append 都会 extend 文件，在 SimpleDB 中，文件大小一定是4kb 的倍数大小，扩展的大小也一定是 4kb 的倍数大小。
流需要考虑并发的情况
写操作之后，需要记得立即落盘

<hr/>
在 Database Learning L2 - Block & File Level - 知乎 (zhihu.com) 中介绍了 block-level 和 file-level。最后提到过，为了兼顾简洁和性能，许多数据库采用 file-level 来访问磁盘，而对每个文件都看为是一个磁盘，将一个文件分割成若干个逻辑块（logical block），通过实现自己的缓冲池来对 block 进行缓存、访问，就能够提升 IO 执行的效率。
下面，将所有从文件中被复制到内存中的块称为 page，将文件的逻辑块称为 block。SImpleDB 通常使用多个文件来存储数据库，每个表、索引都会有一个文件，每个 DB 还会有一个日志文件。数据库中的各个 page，可以通过文件名(filename) + 在文件中的逻辑块号(logical block number) 对应到一个/几个 block 中。
SimpleDB 通过一个 FileManager 统一来打开文件，并让 BufferManager 可以通过 Page-Level 对文件进行访问。FileManager 的主要功能是：

将内存中的 page 写到磁盘中的 block
将磁盘中的 block 读到内存中的 page

通过这两种方式，就能对文件进行扩展、修改，并将数据库的数据存储到磁盘上，保证服务器崩溃时，数据库能够容灾恢复，不会丢失数据库数据。
本文将实现三个类，分别是：

BlockId：对应于一个磁盘块（感觉可以认为是一个 pageid，这里保存了原书中的叫法）
Page：对应于内存中的一个page
FileManager：控制文件的访问和更新

在下文中，由于测试的代码都是 page大小 = 4kb，所以默认 block_size = 4kb

1. BlockId

private:
// the block belong to which file
std::string file_name_;
// logical block number
int block_num_;
};
一个磁盘块可以由 文件名（file_name_） +  逻辑块号（block_num_） 唯一标识。每个 BlockId 对应一个磁盘块。DB engine 具体是怎么知道有多少个文件、一个文件有多少块，这并不是现在要关心的内容。对于 BlockId 来说，只需要能够维护好 文件名 和 逻辑块 就行了。
/* api */
// 返回文件名
std::string FileName() const;
// 返回逻辑块号
int BlockNum() const;
2. Page

2.1 data member

/* SimpleDB */
public:
std::shared_ptr<std::vector<char>> content();
private:
// Page&#39;s content, stored in heap
std::shared_ptr<std::vector<char>> buffer_page_;
page 中会存储 block 的数据，存储在 buffer_page_。
使用  vector：减少内存释放的可能，防止内存泄露。但是 vector<char> 作为一个私有局部变量，假如 buffer_page_ 的类型仅仅是 vector<char> ，是很难修改他的，只能通过曝露给用户的 public 函数来修改，很不方便。
因此buffer_page_ 类型要定义成 vector<char> * 的形式，这样就能够通过 content 返回 vector<char> 的指针，对 buffer_page_ 存储的内容进行修改。为了方便内存管理，这里将 * 替换成了共享指针。

<hr/>
/* the page of bustub */
private:
char data_[PAGE_SIZE]{};
// or std::shared_ptr<char[]> data_(new char[PAGE_SIZE]);
public:
inline char *GetData() {
      return data_;
}
/* the bufferpool of bustub */
Page *pages_; // array of in-memory pages
如上所示，在一些DB中，content 可以直接作为数组存储。但是声明Page对象时，就会存储在栈上，因此很多缓冲池存储的都是 page*数组。不同的实现方法，实际的思想都是一样的。
2.2 functions member

在SimpleDB中， Page 可以存储三种值：

int
Blob
string

其中比较需要注意的是 如何存取Blob。也就是 Page 类中的 GetBytes 和 SetBytes 函数。

一个blob由 header + byte-array 组成，header 表示 byte-array 的大小。byte-array 存储对应的数据。
// 1. set blob&#39;s header by calling SetInt func
// 2. prepare to set byte-array in blob and check overflow
// 3. set byte-array in blob
/* SetBytes.cc */
void Page::SetBytes(int offset, const std::vector<char> &byte_array) {
char *page_offset;
int blob_size = byte_array.size();

// set the blob&#39;s header
SetInt(offset, blob_size);
// check overflow
if(blob_size + sizeof(int) + offset > buffer_page_->size()) {
      // overflow
      throw std::runtime_error(&#34;Page overflow when SetBytes&#34;);
}
// no overflow
page_offset = &(*buffer_page_)[offset + sizeof(int)];

std::memcpy(page_offset, &byte_array[0], blob_size);
}
将 blob 写到 Page 中：1. 首先需要修改 header；2. 检查是否溢出；3. 将要写的数据复制到 blob 的 byte_array 当中；由于 offset + sizeof(int) 才是 byte_array 的开头，因此这里是从[offset + sizeof(int)] 开始复制。
// a blob consists of a header(4 bytes, store byte-array&#39;s size) and a byte-array
// 1. read the blob&#39;s header,get the blob_size
// 2. check overflow
// 3. read the blob&#39;s byte_array, stores in vector<char>
/* GetBytes.cc */
std::vector<char> Page::GetBytes(int offset) const {
char *page_offset_begin;
char *page_offset_end;
std::vector<char> byte_array;
int blob_size = GetInt(offset);

// the maximum of access address = sizeof(int) + blob_size + offset
if(blob_size + sizeof(int) + offset > buffer_page_->size()) {
      // overflow
      // printf(&#34;blob_size = %d, offset = %d, buffer_page_->size = %d\n&#34;,
         // blob_size, offset, buffer_page_->size());
      throw std::runtime_error(&#34;Page overflow when GetBytes&#34;);
}
// no overflow
page_offset_begin = &(*buffer_page_)[offset + sizeof(int)];
page_offset_end = page_offset_begin + blob_size;

byte_array.insert(byte_array.end(), page_offset_begin,
      page_offset_end);
return byte_array;
}
同理，GetBytes 就需要先从 header 读出 byte-array 的长度，然后才可以读取 byte-array 的数据。
3. FileManager

前面的 blockid 和 page 是如何完成复制的呢？答案就是由 FileManager 控制的，FileManager 可以读取磁盘块 block 到 page 中，也可以将 page 写到 block 中。
// 读取 block 到 page 中
void Read(const BlockId &block, Page &page);
// 将 page 写到 block 中
void Write(const BlockId &block, Page &page);
// 增加文件的长度,扩展文件
BlockId Append(const std::string &file_name);
以上是比较重要的三个函数。在了解他们之前，首先需要了解成员变量和两个基本的函数。
3.1 data memember

// directory path
std::string directory_name_;
// all blocks are the same
int block_size_;
// is this db newly created?
bool is_new_;
// map file_name to pointer which point to their fstream and cache it(optimization)
std::map<std::string, std::shared_ptr<std::fstream>> open_files_;
// mutex latch
std::mutex latch_;
由于同一个文件是通过流进行读写的，流有一个当前读写的文件位置，通过流进行 read、write 都会修改这个文件位置。同一个时刻，流应该只能被一个进程访问。当多个进程同时对一个流进行读，每个进程读到的内容和单进程读到的内容都是不一致的，当多个进程对一个流进行读写，也会导致读写不一致。因此，对于一个文件的读、写是互斥的，同一时刻有且仅有一个进程能对流进程操作，那么就需要 latch_。
通过流来打开一个文件，需要进行 open 系统调用，并且进行一次 IO 操作。由于时间局部性，如果一个文件经常要被访问，这样子的性能损失是很大的，这里做的优化就是每打开一个文件流，都将对应的流指针存储下来，下一次再进行流操作直接使用对应的流指针，就不需要每次都 IO + 系统调用了。
在 SimpleDB ，这一点实现的很简陋，并没有限制文件打开数量，只受限于 Linux 打开文件个数的限制。
3.2 GetFile

如何打开一个文件？左转 GetFile
// 1. first, check whether the file has been opened
// 2. if the file exists but has not opened, we need to open it
// 3. if the file is not created, we also need to create it
// 4. remeber that put it in open_files_
std::shared_ptr<std::fstream> FileManager::GetFile(const std::string &file_name) {
auto file_io = std::make_shared<std::fstream>();
std::string file_path = directory_name_ + &#34;/&#34; + file_name;

// search the hash-table
if(open_files_.find(file_name) != open_files_.end()) {
      // the file has been created
      file_io = open_files_[file_name];
      if(file_io->is_open()) { /* file is being opened */
         return file_io;
      }
}
// file is not opened
file_io->open(file_path, std::ios::binary | std::ios::in | std::ios::out);
if(!file_io->is_open()) { /* the file is not created */
      file_io->clear();
      // create new file, use trunc to make sure to overwrite the original file content
      file_io->open(file_path, std::ios::binary | std::ios::in
         | std::ios::out | std::ios::trunc);
      // reopen the file with original mode
      file_io->close();
      file_io->open(file_path, std::ios::binary | std::ios::in | std::ios::out);

      if(!file_io->is_open()) {
         throw std::runtime_error(&#34;can&#39;t open file &#34; + file_name);
      }
}
// put it in open_files_
open_files_[file_name] = file_io;
return file_io;
}
假设 directory_name 已经是文件目录的路径了，并且在 FileManager 的构造函数中将会创建文件目录。那么这里创建文件都是在文件目录下创建的。该函数需要考虑三个情况：

这个文件已经被创建，并且流也打开
这个文件没有被创建
这个文件已经被创建，但是流没有被打开

有可能 DB 因为某些原因被中断，那么原本创建的文件并没有被打开。因此需要重新打开，并且更新到 open_files_ 里面
// file is not opened
file_io->open(file_path, std::ios::binary | std::ios::in | std::ios::out);
// 使用二进制方式打开，可以读写
if(!file_io->is_open()) { /* the file is not created */
      file_io->clear();
      // create new file, use trunc to make sure to overwrite the original file content
      file_io->open(file_path, std::ios::binary
         | std::ios::out | std::ios::trunc);
      // reopen the file with original mode
      file_io->close();
      file_io->open(file_path, std::ios::binary | std::ios::in | std::ios::out);

      if(!file_io->is_open()) {
         throw std::runtime_error(&#34;can&#39;t open file &#34; + file_name);
      }
}
假如打不开，说明文件并没有被创建，那么 file_io 需要清除所有的状态，否则根据 C++ IO流的实现，如果有状态被设置了，这个流就无法使用。在后面重新调用了两次 open：

第一次open：使用 std::ios::trunc 会创建文件。但是只要不加 in ，有 out 模式也可以创建文件，out 模式默认带有 trunc 模式。
第二次open：由于第一次open 没有 std::ios::in 的模式，第二次打开时需要加上。

3.3 Read

// should consider three things
// 1. read past end of file
// 2. io error
// 3. read less than a block
void FileManager::Read(const BlockId &block, Page &page) {
std::lock_guard<std::mutex> lock(latch_); /* mutex lock! */
std::string file_name = block.FileName();
int block_number = block.BlockNum();
int offset = block_number * block_size_;
auto file_io = GetFile(file_name); // 这里调用 GetFile,防止文件没有被创建没法打开
uint32_t read_count;

// call the GetFileSize func everytime?
// TODO: we need to cache it
if(offset > GetFileSize(file_name)) {
      std::cerr << &#34;read past end of file&#34; << std::endl;
      memset(&((*page.content())[0]),0,block_size_);
      return;
}
// seek to the location you want to read, seekp and seekg work the same
file_io->seekp(offset, std::ios::beg);
file_io->read(&((*page.content())[0]), block_size_);
if(file_io->bad()) {
      throw std::runtime_error(&#34;I/O error when read a block&#34;);
}

read_count = file_io->gcount();
if(read_count < block_size_) {
      std::cerr << &#34;read less than a block&#34; << std::endl;
      file_io->clear();
      // set those to zero
      memset(&((*page.content())[0]) + read_count, 0, block_size_ - read_count);
      return;
}
}
read 需要考虑三个问题：

read 函数的 offset 超过了文件的大小：read 函数是不能读取超过文件大小的内容，应该提前判断这个错误。
io 出错
读到的字节 < 一个page的字节：OS分配文件是按磁盘块为单位分配的，一个磁盘块不会分配给两个文件，给一个文件分配的大小总是 4KB 的倍数。在 SimpleDB 中，文件实际大小也是 4KB的倍数，如果读到的内容 < block_size，可能是哪里出错了（按理说short counts的三个情况都不会发生）

3.4 Write

// Write can extend the file,
// so it can write past the end of file
void FileManager::Write(const BlockId &block, Page &page) {
std::lock_guard<std::mutex> lock(latch_); /* mutex lock! */
std::string file_name = block.FileName();
int block_number = block.BlockNum();
int offset = block_number * block_size_;
auto file_io = GetFile(file_name);

file_io->seekp(offset, std::ios::beg); // 从文件头偏移offset
file_io->write(&((*page.content())[0]), block_size_); // 写一个块的大小
if(file_io->bad()) {
throw std::runtime_error(&#34;I/O error when write a block&#34;);
}
// write immediately
file_io->flush();
}
流向文件中未分配的逻辑块执行写操作的时候，OS 会自动给文件分配新的逻辑块。假如write操作不在未分配的位置，由于 offset 总是 4kb 的倍数，并且 block_size_ = 4kb，那么一次 write 操作之后，文件的 EOF 就一定会在新分配块的末尾，一定还是扩展 4kb的倍数。所以文件的大小一定是4kb的倍数。

Page 的 offset 是参数，这里的 offset 是计算出来的，一定是 4kb 的倍数。
EOF 是你写多少就扩展多少。这里每次写都是 4kb，所以 EOF 一次扩展4kb。

由于标准IO库中存储一个缓冲区，当我们write完之后，对应的数据可能还是在缓冲区里面的，对应的修改没有更新到磁盘上，可能会等到缓冲区满才会落盘。为了保证数据实时修改，当 write 完之后要记得实时 flush 落盘。

为什么保证数据实时修改？假如没有 flush，在后面的 WAL(recovery) 中，假如你和 DB engine 说我 write 之后，数据落盘了，DB engine 会记住并且更新这个信息。如果这时候系统崩溃了！内存中的数据全部会消失，包括标准IO缓冲区的数据，这时候 DB engine 以为数据落盘了，实际上早丢了，FileManager 成功的诈骗了一次 DB engine。

3.5 Append

// By writing to next logical block to file that not belonging to us
// The OS will automatically complete the extension
BlockId FileManager::Append(const std::string &file_name) {
std::lock_guard<std::mutex> lock(latch_); /* mutex lock! */
auto file_io = GetFile(file_name);
std::vector<char> empty_array(block_size_, 0);
int next_block_num = Length(file_name); // 获得下一个逻辑块号
BlockId new_block_id(file_name, next_block_num);
int offset = next_block_num * block_size_; // 新分配块的开头

file_io->seekp(offset, std::ios::beg); // 到下一个未分配的逻辑块开头
file_io->write(&(empty_array[0]), block_size_); // 写一个块的大小
if(file_io->bad()) {
throw std::runtime_error(&#34;I/O error when append a block&#34;);
}
// write immediately
file_io->flush();

return new_block_id;
}
apend 用来将文件扩展一个block — 4KB，并且返回这个block 的 BlockId。和 write 的扩展一样，都是向未分配的逻辑块执行写操作完成扩展。不同的是 write 可以一次性扩展多个逻辑块，append 一次只能扩展一个逻辑块，并且新扩展的逻辑块内容全是0。
最后，由于需要落盘，文件才会实现扩展，所以需要 flush() 即使更新。
4. Questions

潜在的优化点：

Page 和后面的buffer Page似乎可以合并？
open_files_ 写成一个 pool 的形式，防止打开流过多，应该进行置换
多个流可以同时读写
每次都要 GetFileSize，可以把 FileSize 缓存下来（只在 Append 、Write 时修改）

一些问题
1. 使用 shared_ptr 而不是 raw pointer
共享指针、内存管理
2. 使用 vector 而不是 c array
内存管理，vector 长度不变的时候速度 > 数组
3. Page不采用连续存储，而是使用一个指向content的指针
4. cache优化思想
在 filemanager 中，很多优化点都是基于 Cache 实现的，比如 open_files_，或者可以进行优化的 file_size。
5. 为什么文件大小一定是/一定要 block_size 倍数大小？

一定是：每次extend都是 extend block_size 倍数大小
一定要：可以通过 next_block_num 获得下一个逻辑块号

6. 为什么要立即落盘？
需要结合日志进行分析
7. 为什么要互斥文件操作？
防止修改文件位置
8. 读写操作检查：

page：检查是否溢出
io要考虑的问题：读越界，io error，short counts