使用pg_filedump恢复PostgreSQL中无备份的被误删除delete的行数据

如果自己搞不定可以找诗檀软件专业PostgreSQL数据库修复团队成员帮您恢复!

诗檀软件专业数据库修复团队

服务热线 : 13764045638   QQ号:47079569    邮箱:service@parnassusdata.com

 

在postgreSQL中少量数据的删除时可以通过pg_filedump工具恢复被误删除的数据行的。

 

其原理是虽然数据被删除了,但在数据文件中其仅仅是被标记为删除,而并未被清空;只要在auto vaccum之前都可以通过pg_filedump工具恢复这部分数据。

 

下面是一个例子:

 

 



create database testa;
\c testa

create table novels (name varchar(200), id int);

insert into novels values('三国演义',1);
insert into novels values('水浒传',2);
insert into novels values('西游记',3);
insert into novels values('红楼梦',4);


select oid from pg_database where datname='testa';
  oid  
-------
 17420

 
 
select oid,relfilenode from pg_class where relname='novels';

  oid  | relfilenode 
-------+-------------
 17421 |       17421
(1 row)


testa=# delete from novels;
DELETE 4

testa=# select * from novels;
 name | id 
------+----
(0 rows)


postgres@vultr:~$ ls -l /var/lib/postgresql/10/main/base/17420/17421 
-rw------- 1 postgres postgres 8192 Nov 25 05:11 /var/lib/postgresql/10/main/base/17420/17421


root@vultr:~# chmod 700 pg_filedump_rc

root@vultr:~# ./pg_filedump_rc -D charn,int /var/lib/postgresql/10/main/base/17420/17421

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 11.0
*
* File: /var/lib/postgresql/10/main/base/17420/17421
* Options used: -D charn,int 
*
* Dump created on: Mon Nov 25 05:17:40 2019
*******************************************************************

Block    0 ********************************************************
----- Block Offset: 0x00000000 Offsets: Lower 40 (0x0028) Block: Size 8192 Version 4 Upper 8024 (0x1f58) LSN: logid 0 recoff 0x07db2630 Special 8192 (0x2000) Items: 4 Free Space: 7984 Checksum: 0xd24c Prune XID: 0x0000037b Flags: 0x0000 () Length (including item array): 40 ------ Item 1 -- Length: 44 Offset: 8144 (0x1fd0) Flags: NORMAL COPY: 三国演义 1 Item 2 -- Length: 40 Offset: 8104 (0x1fa8) Flags: NORMAL COPY: 水浒传 2 Item 3 -- Length: 40 Offset: 8064 (0x1f80) Flags: NORMAL COPY: 西游记 3 Item 4 -- Length: 40 Offset: 8024 (0x1f58) Flags: NORMAL COPY: 红楼梦 4 *** End of File Encountered. Last Block Read: 0 *** root@vultr:~# ./pg_filedump_rc -D charn,int /var/lib/postgresql/10/main/base/17420/17421|grep COPY COPY: 三国演义 1 COPY: 水浒传 2 COPY: 西游记 3 COPY: 红楼梦 4

PostgreSQL FATAL: could not read block 0 in file Input/output error错误

FATAL: could not read block 0 in file  Input/output error


ERROR:  could not read block 84 in file "base/16386/670007": Input/output error
ERROR:  could not read block 4707 of relation 1663/16384/16564: Success



 

如果PG的登陆出现上述错误 则一般说明PG数据文件出现了磁盘故障,导致无法登陆数据库,可以尝试用以下步骤解决

 

步骤1 设置 zero_damaged_pages 参数 并尝试重建系统索引,zero_damaged_pages加入到postgre配置文件中,并reload或重启pg服务进程:

 

 

 

reindexdb -p 5433 --system dbname

 

步骤2 设置 ignore_system_indexes=true , 以尝试忽略system index的影响:

 

 

pg_ctl -D /data -o '-c ignore_system_indexes=true' start
 

并连接数据库

psql $dbname

 

步骤3 重建全库 索引


reindex database "dbname";


步骤4 vacuum 全表库


vacuum full analyze verbose;
vacuum full verbose;

步骤5 禁用约束


update pg_class set relchecks=0 where relname='tablename';


步骤六 重建索引


reindex database "dbname";

最后建议用pg_dump 导出全库 后 重建数据库。

如果做了上述步骤还不行, 可以考虑用pg_filedump 抽取所有文件内的数据后再重建数据库。

PostgreSQL checksum

文章来源: https://yq.aliyun.com/articles/675942

在计算机系统中,checksum 通常用于校验数据在传输或存取过程中是否发生错误。PostgreSQL 从 9.3 开始支持 checksum,以发现数据因磁盘、 I/O 损坏等原因造成的数据异常。本文介绍 PostgreSQL 中 checksum 的使用及其实现原理。

 

概述

PostgreSQL 从 9.3 开始支持数据页的 checksum,可以在执行 initdb 时指定 -k 或 --data-checksums 参数开启 checksum,但开启 checksum 可能会对系统性能有一定影响,官网描述如下:

Use checksums on data pages to help detect corruption by the I/O system that would otherwise be silent. Enabling checksums may incur a noticeable performance penalty. This option can only be set during initialization, and cannot be changed later. If set, checksums are calculated for all objects, in all databases.

启用 checksum 后,系统会对每个数据页计算 checksum,从存储读取数据时如果检测 checksum 失败,则会发生错误并终止当前正在执行的事务,该功能使得 PostgreSQL 自身拥有了检测 I/O 或硬件错误的能力。

Checksum 引入一个 GUC 参数 ignore_checksum_failure,该参数若设置为 true,checksum 校验失败后不会产生错误,而是给客户端发送一个警告。当然,checksum 失败意味着磁盘上的数据已经损坏,忽略此类错误可能导致数据损坏扩散甚至导致系统奔溃,此时宜尽早修复,因此,若开启 checksum,该参数建议设置为 false

 

实现原理

 

设置 checksum

 

数据页的 checksum 在从 Buffer pool 刷到存储时才设置,当页面再此读取至 Buffer pool 时进行检测。

PostgreSQL 中 Buffer 刷盘的逻辑集中在 FlushBuffer 中,其中设置 checksum 的逻辑如下:

 

/*
 * Update page checksum if desired.  Since we have only shared lock on the
 * buffer, other processes might be updating hint bits in it, so we must
 * copy the page to private storage if we do checksumming.
 */
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);

 

 

比较有意思的是,其他进程可能会在只加 content-lock 共享锁的情况下并发修改 page 的 Hint Bits,从而导致 checksum 值发生变化,为确保 page 的内容及其 checksum 保持一致,PostgreSQL 采用了 先复制页,然后计算 checksum 的方式,如下:

 

 

/*
 * We allocate the copy space once and use it over on each subsequent
 * call.  The point of palloc'ing here, rather than having a static char
 * array, is first to ensure adequate alignment for the checksumming code
 * and second to avoid wasting space in processes that never call this.
 */
if (pageCopy == NULL)
    pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);

memcpy(pageCopy, (char *) page, BLCKSZ);
((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);

 

 

即先将数据页的内容拷贝一份,拷贝的数据自然不会被其他进程修改,然后基于该拷贝页计算并设置 checksum 值。

 

 

checksum 算法

数据页的 checksum 算法基于 FNV-1a hash 改造而来,其结果为 32 位无符号整型。由于 PageHeaderData 中 pd_checksum 是 16 位无符号整型,因此将其截取 16 位作为数据页的 checksum 值,如下:

 

/*
* Save pd_checksum and temporarily set it to zero, so that the checksum
* calculation isn't affected by the old checksum stored on the page.
* Restore it after, because actually updating the checksum is NOT part of
* the API of this function.
*/
save_checksum = cpage->phdr.pd_checksum;
cpage->phdr.pd_checksum = 0;
checksum = pg_checksum_block(cpage);
cpage->phdr.pd_checksum = save_checksum;

/* Mix in the block number to detect transposed pages */
checksum ^= blkno;

/*
* Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
* one. That avoids checksums of zero, which seems like a good idea.
*/
return (checksum % 65535) + 1;

 

pg_checksum_block 函数计算数据页的 32 位 checksum 值,具体算法可以参考源码,在此不详述。

 

 

检测 checksum

 

 

PostgreSQL 会在页面从存储读入内存时检测其是否可用,调用函数为 PageIsVerified,该函数不仅会检测正常初始化过的页(non-zero page),还会检测 全零页(all-zero page)

为什么会出现 全零页 呢?
在特定场景下表中可能出现 全零页,比如有进程扩展了一个表,即在该表中添加了一个新页,但在 WAL 日志写入存储之前,进程崩溃了。此时新加的页可能已经在表文件中,下次重启时就会读取到。

对于 non-zero page,检测其 checksum 是否一致以及 page header 信息是否正确,若 checksum 失败,但 header 信息正确,此时会根据 ignore_checksum_failure 值判断验证是否通过;对于 all-zero page,如果为全零,则验证通过。

若验证失败,两种处理方式:

  • 若读取数据的模式为 RBM_ZERO_ON_ERROR 且 GUC 参数 zero_damaged_pages 为 true,则将该页全部置 0
  • 报错,invalid page

 

checksum 与 Hint bits

 

 

数据页写至存储时,如果写失败,可能会导致破碎的页(torn page),PostgreSQL 通过 full_page_writes 特性解决此类写失败导致数据不可用的问题。

Hint Bits 是数据页中用于标识事务状态的标记位,一般情况下,作为提示位,不是很重要。但如果使用了 checksum,Hint Bits 的变化会导致 checksum 值发生改变。设想如果一个页面发生部分写,恰好把某些 Hint Bits 写错,此页面可能并不影响正常使用,但 checksum 会抛出异常,此时应如何恢复呢?

在 checksum 的实现中,checkpoint 后,如果页面因更新 Hint Bits 第一次被标记为 dirty,需要记录一个 Full Page Image 至 WAL 日志中,以应对以上提到的因 Hint Bits 更新丢失导致 checksum 失败的问题,具体实现可参考 MarkBufferDirtyHint。对于已经是 dirty 的页,更新 Hint Bits 则不需要记录 WAL 日志,因为在 checkpoint 后,第一次将该页标记为 dirty 时已经写入了对应的 Full Page Image

可见,在启用 checksum 的情况下,checkpoint 后页面的第一次修改如果是更新 Hint Bits, 会写 Full Page Image 至 WAL 日志,这会导致 WAL 日志占用更多的存储空间。

关于 PostgreSQL checksum 和 Full Page Image 的关系,可以参考 stackoverflow 上这个问题

 

 

查看 checksum

PostgreSQL 10 在 pageinspect 插件中添加了函数 page_checksum() 用来查看 page 的 checksum,当然使用 page_header() 也可以查看 page 的 checksum,如下:

 

postgres=# SELECT page_checksum(get_raw_page('pg_class', 0), 0);
 page_checksum
---------------
         17448
(1 row)

postgres=# SELECT * FROM page_header(get_raw_page('pg_class', 0));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/78A1E918 |    17448 |     0 |   200 |   368 |    8192 |     8192 |       4 |         0
(1 row)

总结

Checksum 使 PostgreSQL 具备检测因硬件故障或传输导致数据不一致的能力,一旦发生异常,通常会报错并终止当前事务,用户可以尽早察觉数据异常并予以恢复。当然,开启 checksum 也会引入一些开销,体现在两个方面:

  • 计算数据页的 checksum 会引入一些 CPU 开销,具体开销取决于 checksum 算法的效率
  • checkpoint 后,若因更新 Hint Bits 将页面第一次置为 dirty 会写一条记录 Full Page Image 的 WAL 日志,以用于恢复因更新 Hint Bits 产生的破碎页。

对于数据可用性要求较高的场景,通常建议将 full_page_writes 和 checksum 都打开,前者用于避免写失败导致的数据缺失,后者用于尽早发现因硬件或传输导致数据不一致的场景,一旦发现,可以利用 full_page_writes 和 checksum 记录在 WAL 日志中的 Full Page Image 进行数据恢复。

References

postgreSQL 坏块与checksum使用验证


 su - postgres
initdb -k -D $PGDATA
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

fixing permissions on existing directory /var/lib/postgresql/10/main ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /var/lib/postgresql/10/main -l logfile start

	
	
pg_ctl -D /var/lib/postgresql/10/main -l logfile start
waiting for server to start.... done
server started
	
	
psql
psql (10.10 (Ubuntu 10.10-0ubuntu0.18.04.1))
Type "help" for help.

postgres=#



wget https://zcdn.askmac.cn/dellstore2-normal-1.0.tar.gz

tar -zxvf dellstore2-normal-1.0.tar.gz

createdb mac
psql mac -f dellstore2-normal-1.0/dellstore2-normal-1.0.sql  




postgres@vultr:~$ psql mac
psql (10.10 (Ubuntu 10.10-0ubuntu0.18.04.1))
Type "help" for help.

mac=# \d
                    List of relations
 Schema |           Name           |   Type   |  Owner   
--------+--------------------------+----------+----------
 public | categories               | table    | postgres
 public | categories_category_seq  | sequence | postgres
 public | cust_hist                | table    | postgres
 public | customers                | table    | postgres
 public | customers_customerid_seq | sequence | postgres
 public | inventory                | table    | postgres
 public | orderlines               | table    | postgres
 public | orders                   | table    | postgres
 public | orders_orderid_seq       | sequence | postgres
 public | products                 | table    | postgres
 public | products_prod_id_seq     | sequence | postgres
 public | reorder                  | table    | postgres
(12 rows)




psql mac -c "SELECT relname, relpages, reltuples, relfilenode FROM pg_class
  WHERE relkind = 'r' AND relname NOT LIKE 'pg%'
  ORDER BY relpages DESC
  LIMIT 1;"
  
  
  
  relname  | relpages | reltuples | relfilenode 
-----------+----------+-----------+-------------
 customers |      488 |     20000 |       16478
(1 row)



psql mac -c "SELECT datname, oid FROM pg_database;"

postgres@vultr:~$ psql mac -c "SELECT datname, oid FROM pg_database;"
  datname  |  oid  
-----------+-------
 postgres  | 13055
 mac       | 16457
 template1 |     1
 template0 | 13054
(4 rows)




wget https://zcdn.askmac.cn/corrupt.pl


pg_ctl stop -D $PGDATA

waiting for server to shut down.... done


perl corrupt.pl $PGDATA/base/16457/16478 5


pg_ctl start -D $PGDATA


psql mac -c "SELECT * FROM customers;"

postgres@vultr:~$ psql mac -c "SELECT * FROM customers;"
2019-11-20 07:11:42.487 UTC [24934] WARNING:  page verification failed, calculated checksum 56287 but expected 28524
WARNING:  page verification failed, calculated checksum 56287 but expected 28524
2019-11-20 07:11:42.487 UTC [24934] ERROR:  invalid page in block 1 of relation base/16457/16478
2019-11-20 07:11:42.487 UTC [24934] STATEMENT:  SELECT * FROM customers;
ERROR:  invalid page in block 1 of relation base/16457/16478



postgres@vultr:~$ psql mac 
psql (10.10 (Ubuntu 10.10-0ubuntu0.18.04.1))
Type "help" for help.

  
SHOW zero_damaged_pages ;
SHOW ignore_checksum_failure;
SET zero_damaged_pages to on;
SET
set ignore_checksum_failure to on;
SET

postgres=#  SELECT * FROM customers;






or 

设置参数 

SET ignore_checksum_failure to on ;
ALTER SYSTEM SET ignore_checksum_failure = on;
SELECT pg_reload_conf();




postgreSQL but could not open file pg_control: No such file or directory


but could not open file pg_control: No such file or directory


postgres@vultr:~/10/main/global$ pg_ctl start -D $PGDATA
waiting for server to start....postgres: could not find the database system
Expected to find it in the directory "/var/lib/postgresql/10/main",
but could not open file "/var/lib/postgresql/10/main/global/pg_control": No such file or directory
 stopped waiting
pg_ctl: could not start server
Examine the log output.


postgreSQL 使用pg_resetxlog修复PostgreSQL控制文件的方法

文章来源:https://billtian.github.io/digoal.blog/2016/08/14/02.html

 

背景

PostgreSQL 控制文件在$PGDATA/global目录下名为pg_control.

控制文件中记录了以下三部分信息 :

 

 

1. initdb时生成的静态信息 :

pg_control version number:            922      
Catalog version number:               201204301      
Database system identifier:           5831753892046499175      
Maximum data alignment:               8      
Database block size:                  8192      
Blocks per segment of large relation: 131072      
WAL block size:                       16384      
Bytes per WAL segment:                16777216      
Maximum length of identifiers:        64      
Maximum columns in an index:          32      
Maximum size of a TOAST chunk:        1996      
Date/time type storage:               64-bit integers      
Float4 argument passing:              by value      
Float8 argument passing:              by value      
2. postgresql.conf中的配置信息 :       
Current wal_level setting:            hot_standby      
Current max_connections setting:      1000      
Current max_prepared_xacts setting:   10      
Current max_locks_per_xact setting:   64      
3. write-ahead logging以及checkpoint的动态信息 :       
Latest checkpoint location:           96E8/5B000020      
Prior checkpoint location:            96E8/5A0C8CC0      
Latest checkpoint's REDO location:    96E8/5B000020      
Latest checkpoint's TimeLineID:       1      
Latest checkpoint's full_page_writes: on      
Latest checkpoint's NextXID:          0/1183664222      
Latest checkpoint's NextOID:          309701      
Latest checkpoint's NextMultiXactId:  1      
Latest checkpoint's NextMultiOffset:  0      
Latest checkpoint's oldestXID:        1006759584      
Latest checkpoint's oldestXID's DB:   1      
Latest checkpoint's oldestActiveXID:  0      
Time of latest checkpoint:            Fri 11 Jan 2013 07:44:19 AM CST      
Minimum recovery ending location:     0/0      
Backup start location:                0/0      
Backup end location:                  0/0      
End-of-backup record required:        no      

以上信息可以使用pg_controldata从pg_control获取 :

src/bin/pg_controldata/pg_controldata.c      
 * pg_controldata      
 *      
 * reads the data from $PGDATA/global/pg_control      

如果控制文件$PGDATA/global/pg_control损坏或丢失, 数据库将运行异常, 无法启动.

如何修复? 关键在于恢复write-ahead logging以及checkpoint的动态信息.

这些信息可以从pg_xlog, pg_clog, pg_multixact这些目录的文件中解析出来。

pg_xlog的文件名解析可参看, 不同的段大小, 命名大不相同, pg_resetxlog的帮助文件适用16MB的段大小, 如果是其他大小, 需要重新计算名字 :

http://blog.163.com/digoal@126/blog/static/1638770402012914112949546/

接下来介绍一下使用pg_resetxlog重建pg_control的方法.

pg_resetxlog功能如下 :

src/bin/pg_resetxlog/pg_resetxlog.c      
 * pg_resetxlog.c      
 *        A utility to "zero out" the xlog when it's corrupt beyond recovery.      
 *        Can also rebuild pg_control if needed.      
 *      
 * The theory of operation is fairly simple:      
 *        1. Read the existing pg_control (which will include the last      
 *               checkpoint record).  If it is an old format then update to      
 *               current format.      
 *        2. If pg_control is corrupt, attempt to intuit reasonable values,      
 *               by scanning the old xlog if necessary.      
 *        3. Modify pg_control to reflect a "shutdown" state with a checkpoint      
 *               record at the start of xlog.      
 *        4. Flush the existing xlog files and write a new segment with      
 *               just a checkpoint record in it.  The new segment is positioned      
 *               just past the end of the old xlog, so that existing LSNs in      
 *               data pages will appear to be "in the past".      
 * This is all pretty straightforward except for the intuition part of      
 * step 2 ...      

pg_resetxlog的用法 :

ocz@db-172-16-3-150-> pg_resetxlog --help      
pg_resetxlog resets the PostgreSQL transaction log.      
      
Usage:      
  pg_resetxlog [OPTION]... DATADIR      
      
Options:      
  -e XIDEPOCH      set next transaction ID epoch      
  -f               force update to be done      
  -l TLI,FILE,SEG  force minimum WAL starting location for new transaction log      
  -m XID           set next multitransaction ID      
  -n               no update, just show extracted control values (for testing)      
  -o OID           set next OID      
  -O OFFSET        set next multitransaction offset      
  -V, --version    output version information, then exit      
  -x XID           set next transaction ID      
  -?, --help       show this help, then exit      

 

 

 

参数具体含义 :

 

-l timelineid,fileid,seg      
  The WAL starting address (-l) should be larger than any WAL segment file name currently existing in the      
directory pg_xlog under the data directory. These names are also in hexadecimal and have three parts. The      
first part is the “timeline ID” and should usually be kept the same. Do not choose a value larger than 255      
(0xFF) for the third part; instead increment the second part and reset the third part to 0. For example, if      
00000001000000320000004A is the largest entry in pg_xlog, -l 0x1,0x32,0x4B will work; but if the largest      
entry is 000000010000003A000000FF, choose -l 0x1,0x3B,0x0 or more.      
    Note      
    pg_resetxlog itself looks at the files in pg_xlog and chooses a default -l setting beyond the last      
    existing file name. Therefore, manual adjustment of -l should only be needed if you are aware of WAL      
    segment files that are not currently present in pg_xlog, such as entries in an offline archive; or if      
    the contents of pg_xlog have been lost entirely.      
      
-e XIDEPOCH      
  The transaction ID epoch is not actually stored anywhere in the database except in the field that is set by      
pg_resetxlog, so any value will work so far as the database itself is concerned. You might need to adjust      
this value to ensure that replication systems such as Slony-I work correctly - if so, an appropriate value      
should be obtainable from the state of the downstream replicated database.      
      
-x XID      
  A safe value for the next transaction ID (-x) can be determined by looking for the numerically largest file      
name in the directory pg_clog under the data directory, adding one, and then multiplying by 1048576. Note      
that the file names are in hexadecimal. It is usually easiest to specify the option value in hexadecimal      
too. For example, if 0011 is the largest entry in pg_clog, -x 0x1200000 will work (five trailing zeroes      
provide the proper multiplier).      
      
-m XID      
  A safe value for the next multitransaction ID (-m) can be determined by looking for the numerically largest      
file name in the directory pg_multixact/offsets under the data directory, adding one, and then multiplying      
by 65536. As above, the file names are in hexadecimal, so the easiest way to do this is to specify the      
option value in hexadecimal and add four zeroes.      
      
-O OFFSET      
   A safe value for the next multitransaction offset (-O) can be determined by looking for the numerically      
largest file name in the directory pg_multixact/members under the data directory, adding one, and then      
multiplying by 65536. As above, the file names are in hexadecimal, so the easiest way to do this is to      
specify the option value in hexadecimal and add four zeroes.      
      
-o OID      
  There is no comparably easy way to determine a next OID that's beyond the largest one in the database, but      
fortunately it is not critical to get the next-OID setting right.      
      
-n      
  no update, just show extracted control values (for testing)      
-f      
  force      

 

 

 

测试步骤如下(基于PostgreSQL 9.2.1) :

1. 新建测试数据, 用到with oids的表, 因为OID无法确定, 看看是否会有异常.

2. 关闭数据库

3. 记下pg_controldata信息, 方便修复后进行比对

4. 删除$PGDATA/global/pg_control

5. 开启数据库观察报错输出

6. touch $PGDATA/global/pg_control

7. 使用pg_resetxlog修复pg_control

8. 记下pg_controldata信息, 与前面的pg_controldata输出进行比对

9. 启动数据库

10. 查看测试数据是否正常, 新插入数据

11. 关闭数据库, 并记下pg_controldata的信息, 看看有何变化.

测试过程 :

1. 测试数据

 

 

digoal=> create table oid_test(id int primary key) with oids;      
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "oid_test_pkey" for table "oid_test"      
CREATE TABLE      
digoal=> insert into oid_test select generate_series(1,100000);      
INSERT 0 100000      
digoal=> select min(oid),max(oid) from oid_test ;      
  min  |  max         
-------+--------      
 16397 | 116396      
(1 row)      

2. 关闭数据

ocz@db-172-16-3-150-> pg_ctl stop -m fast      
waiting for server to shut down.... done      
server stopped      

3. 记下pg_controldata信息, 方便修复后进行比对

ocz@db-172-16-3-150-> pg_controldata       
pg_control version number:            922      
Catalog version number:               201204301      
Database system identifier:           5832000131111550393      
Database cluster state:               shut down      
pg_control last modified:             Fri 11 Jan 2013 09:48:18 AM CST      
Latest checkpoint location:           96E8/5F000020      
Prior checkpoint location:            96E8/5EE5C698      
Latest checkpoint's REDO location:    96E8/5F000020      
Latest checkpoint's TimeLineID:       1      
Latest checkpoint's full_page_writes: on      
Latest checkpoint's NextXID:          0/1183842312      
Latest checkpoint's NextOID:          116414      
Latest checkpoint's NextMultiXactId:  65536      
Latest checkpoint's NextMultiOffset:  65536      
Latest checkpoint's oldestXID:        1006759584      
Latest checkpoint's oldestXID's DB:   1      
Latest checkpoint's oldestActiveXID:  0      
Time of latest checkpoint:            Fri 11 Jan 2013 09:48:18 AM CST      
Minimum recovery ending location:     0/0      
Backup start location:                0/0      
Backup end location:                  0/0      
End-of-backup record required:        no      
Current wal_level setting:            hot_standby      
Current max_connections setting:      1000      
Current max_prepared_xacts setting:   10      
Current max_locks_per_xact setting:   64      
Maximum data alignment:               8      
Database block size:                  8192      
Blocks per segment of large relation: 131072      
WAL block size:                       16384      
Bytes per WAL segment:                16777216      
Maximum length of identifiers:        64      
Maximum columns in an index:          32      
Maximum size of a TOAST chunk:        1996      
Date/time type storage:               64-bit integers      
Float4 argument passing:              by value      
Float8 argument passing:              by value      

4. 删除$PGDATA/global/pg_control

ocz@db-172-16-3-150-> rm $PGDATA/global/pg_control       
rm: remove regular file `/data05/ocz/pg_root/global/pg_control'? y      

5. 开启数据库观察报错输出

ocz@db-172-16-3-150-> pg_ctl start      
server starting      
ocz@db-172-16-3-150-> postgres: could not find the database system      
Expected to find it in the directory "/data05/ocz/pg_root",      
but could not open file "/data05/ocz/pg_root/global/pg_control": No such file or directory      

接下来进行修复 :

6. touch $PGDATA/global/pg_control

ocz@db-172-16-3-150-> touch $PGDATA/global/pg_control      
ocz@db-172-16-3-150-> chmod 600 $PGDATA/global/pg_control      

7. 使用pg_resetxlog修复pg_control

1、首先确定-l timelineid,fileid,seg的信息 :

ocz@db-172-16-3-150-> cd $PGDATA/pg_xlog      
ocz@db-172-16-3-150-> ll      
total 65M      
-rw------- 1 ocz ocz 16M Jan 11 09:39 00000001000096E80000005C      
-rw------- 1 ocz ocz 16M Jan 11 09:39 00000001000096E80000005D      
-rw------- 1 ocz ocz 16M Jan 11 09:48 00000001000096E80000005E      
-rw------- 1 ocz ocz 16M Jan 11 09:48 00000001000096E80000005F      
drwx------ 2 ocz ocz 44K Jan 11 09:48 archive_status      

-l timelineid,fileid,seg 的数据来自pg_xlog文件名的三个部分, 分别占用8个16进制位.

段大小为16MB, 所以末端最大为0xFF.

得出-l 0x1,0x96E8,0x60

10的版本之后,-l会自动计算,使用已存在的xlog+1即可(这样才是比较安全的,否则可能会导致WAL出现分叉(如00000100是已有的,你设置-l 00000090,那么从090到100中间实际上是出现了wal分叉的)。)。

2、接下来确定-x XID的信息

来自pg_clog

ocz@db-172-16-3-150-> cd $PGDATA/pg_clog      
ocz@db-172-16-3-150-> ll -t|head -n 5      
total 43M      
-rw------- 1 ocz ocz 8.0K Jan 11 09:48 0469      
-rw------- 1 ocz ocz 216K Jan 10 21:00 0468      
-rw------- 1 ocz ocz 256K Jan 10 12:56 0467      
-rw------- 1 ocz ocz 256K Jan 10 09:35 0466      

取最大值加1然后乘以1048576.

转换成16进制的话相当于取最大值加1然后末尾添加5个0

得到-x 0x046A00000

3、接下来确定-m XID的信息

来自pg_multixact/offsets

ocz@db-172-16-3-150-> cd $PGDATA/pg_multixact/offsets      
ocz@db-172-16-3-150-> ll      
total 0      

取最大值加1然后乘以65536.

转换成16进制的话相当于取最大值加1然后末尾添加4个0

没有文件的话使用0加1, 然后末尾添加4个0

得到-m 0x10000

4、接下来确定-O OFFSET的信息

来自pg_multixact/members

ocz@db-172-16-3-150-> cd $PGDATA/pg_multixact/members      
ocz@db-172-16-3-150-> ll      
total 0      

取最大值加1然后乘以65536.

转换成16进制的话相当于取最大值加1然后末尾添加4个0

没有文件的话使用0加1, 然后末尾添加4个0

得到-O 0x10000

5、最后, 不确定的值有2个 :

-e XIDEPOCH    ,如果么有使用 slony或者londiste这种基于触发器的数据同步软件,则-e意义不大,它实际上是在将32位的xid转换为64位的xid时使用的一个转换系数 。   
-o OID    , 系统会自动跳过已经分配的OID,自动容错,例如OID被别的程序使用掉了,PG会自动生成下一个OID,并且继续判断可用性。知道可用为止。    

可以先不管这两个值.

6、执行pg_resetxlog 如下 :

ocz@db-172-16-3-150-> pg_resetxlog -l 0x1,0x96E8,0x60 -x 0x046A00000 -m 0x10000 -O 0x10000 -f $PGDATA      
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it      
Transaction log reset      

8. 记下pg_controldata信息, 与前面的pg_controldata输出进行比对

ocz@db-172-16-3-150-> pg_controldata       
pg_control version number:            922      
Catalog version number:               201204301      
Database system identifier:           5832008033851373032      
Database cluster state:               shut down      
pg_control last modified:             Fri 11 Jan 2013 10:09:44 AM CST      
Latest checkpoint location:           96E8/60000020      
Prior checkpoint location:            0/0      
Latest checkpoint's REDO location:    96E8/60000020      
Latest checkpoint's TimeLineID:       1      
Latest checkpoint's full_page_writes: off      
Latest checkpoint's NextXID:          0/1184890880      
Latest checkpoint's NextOID:          10000      
Latest checkpoint's NextMultiXactId:  65536      
Latest checkpoint's NextMultiOffset:  65536      
Latest checkpoint's oldestXID:        3479858176      
Latest checkpoint's oldestXID's DB:   0      
Latest checkpoint's oldestActiveXID:  0      
Time of latest checkpoint:            Fri 11 Jan 2013 10:09:44 AM CST      
Minimum recovery ending location:     0/0      
Backup start location:                0/0      
Backup end location:                  0/0      
End-of-backup record required:        no      
Current wal_level setting:            minimal      
Current max_connections setting:      100      
Current max_prepared_xacts setting:   0      
Current max_locks_per_xact setting:   64      
Maximum data alignment:               8      
Database block size:                  8192      
Blocks per segment of large relation: 131072      
WAL block size:                       16384      
Bytes per WAL segment:                16777216      
Maximum length of identifiers:        64      
Maximum columns in an index:          32      
Maximum size of a TOAST chunk:        1996      
Date/time type storage:               64-bit integers      
Float4 argument passing:              by value      
Float8 argument passing:              by value      

注意修复后从控制文件读取到的不确定的-e XIDEPOCH和-o OID信息如下 :

也就是initdb后的初始值.

Latest checkpoint's NextXID:          0/1184890880  : XIDEPOCH=0      
Latest checkpoint's NextOID:          10000      

与修复pg_control前发生了变化的值如下 :

修复前

Database system identifier:           5832000131111550393      
pg_control last modified:             Fri 11 Jan 2013 09:48:18 AM CST      
Prior checkpoint location:            96E8/5EE5C698      
Latest checkpoint's full_page_writes: on      
Latest checkpoint's NextXID:          0/1183842312      
Latest checkpoint's NextOID:          116414      
Latest checkpoint's oldestXID:        1006759584      
Latest checkpoint's oldestXID's DB:   1      
Time of latest checkpoint:            Fri 11 Jan 2013 09:48:18 AM CST      
Current wal_level setting:            hot_standby      
Current max_connections setting:      1000      
Current max_prepared_xacts setting:   10      

修复后

Database system identifier:           5832008033851373032      
pg_control last modified:             Fri 11 Jan 2013 10:09:44 AM CST      
Prior checkpoint location:            0/0      
Latest checkpoint's full_page_writes: off      
Latest checkpoint's NextXID:          0/1184890880      
Latest checkpoint's NextOID:          10000      
Latest checkpoint's oldestXID:        3479858176      
Latest checkpoint's oldestXID's DB:   0      
Time of latest checkpoint:            Fri 11 Jan 2013 10:09:44 AM CST      
Current wal_level setting:            minimal      
Current max_connections setting:      100      
Current max_prepared_xacts setting:   0      

控制文件结构参考头文件:

src/include/catalog/pg_control.h

生成控制文件内的systemid的方法参考如下代码

src/backend/access/transam/xlog.c

/*  
 * This func must be called ONCE on system install.  It creates pg_control  
 * and the initial XLOG segment.  
 */  
void  
BootStrapXLOG(void)  
{  
..............  
        uint64          sysidentifier;  
...............  
        /*  
         * Select a hopefully-unique system identifier code for this installation.  
         * We use the result of gettimeofday(), including the fractional seconds  
         * field, as being about as unique as we can easily get.  (Think not to  
         * use random(), since it hasn't been seeded and there's no portable way  
         * to seed it other than the system clock value...)  The upper half of the  
         * uint64 value is just the tv_sec part, while the lower half contains the  
         * tv_usec part (which must fit in 20 bits), plus 12 bits from our current  
         * PID for a little extra uniqueness.  A person knowing this encoding can  
         * determine the initialization time of the installation, which could  
         * perhaps be useful sometimes.  
         */  
        gettimeofday(&tv, NULL);  
        sysidentifier = ((uint64) tv.tv_sec) << 32;  
        sysidentifier |= ((uint64) tv.tv_usec) << 12;  
        sysidentifier |= getpid() & 0xFFF;  
  
..................  
        /* Initialize pg_control status fields */  
        ControlFile->system_identifier = sysidentifier;  
  
.................  
        /* some additional ControlFile fields are set in WriteControlFile() */  
  
        WriteControlFile();  

如果控制文件丢失,那么使用pg_resetwal重新初始化得到的控制文件,system id可能发生变化:

src/bin/pg_resetwal/pg_resetwal.c

.........  
        /*  
         * Attempt to read the existing pg_control file  
         */  
        if (!ReadControlFile())  
                GuessControlValues();  
.......  
/*  
 * Guess at pg_control values when we can't read the old ones.  
 */  
static void  
GuessControlValues(void)  
{  
  
        /*  
         * Create a new unique installation identifier, since we can no longer use  
         * any old XLOG records.  See notes in xlog.c about the algorithm.  
         */  
        gettimeofday(&tv, NULL);  
        sysidentifier = ((uint64) tv.tv_sec) << 32;  
        sysidentifier |= ((uint64) tv.tv_usec) << 12;  
        sysidentifier |= getpid() & 0xFFF;  
  
        ControlFile.system_identifier = sysidentifier;  
.............  

9. 启动数据库

ocz@db-172-16-3-150-> pg_ctl start      
server starting      
ocz@db-172-16-3-150-> LOG:  00000: loaded library "pg_stat_statements"      
LOCATION:  load_libraries, miscinit.c:1249      

10. 查看测试数据是否正常, 然后新插入数据

ocz@db-172-16-3-150-> psql digoal digoal      
psql (9.2.1)      
Type "help" for help.      
digoal=> select min(oid),max(oid),count(*) from oid_test ;      
  min  |  max   | count        
-------+--------+--------      
 16397 | 116396 | 100000      
(1 row)      

数据可以正常访问.

新插入数据 :

digoal=> insert into oid_test select generate_series(100001,200000);      
INSERT 0 100000      
digoal=> select min(oid),max(oid),count(*) from oid_test ;      
  min  |  max   | count        
-------+--------+--------      
 16384 | 116396 | 200000      
(1 row)      
digoal=> select oid,* from oid_test where oid=16397;      
  oid  |   id         
-------+--------      
 16397 |      1      
 16397 | 100014      
(2 rows)      

注意oid出现了重复, 印证了PostgreSQL中的说明, OID不确保唯一性.

11. 关闭数据库, 并记下pg_controldata的信息, 看看有何变化.

ocz@db-172-16-3-150-> pg_ctl stop -m fast      
waiting for server to shut down.... done      
server stopped      
ocz@db-172-16-3-150-> pg_controldata       
pg_control version number:            922      
Catalog version number:               201204301      
Database system identifier:           5832008033851373032      
Database cluster state:               shut down      
pg_control last modified:             Fri 11 Jan 2013 10:16:18 AM CST      
Latest checkpoint location:           96E8/61000020      
Prior checkpoint location:            96E8/60DFF470      
Latest checkpoint's REDO location:    96E8/61000020      
Latest checkpoint's TimeLineID:       1      
Latest checkpoint's full_page_writes: on      
Latest checkpoint's NextXID:          0/1184890883      
Latest checkpoint's NextOID:          116385      
Latest checkpoint's NextMultiXactId:  65536      
Latest checkpoint's NextMultiOffset:  65536      
Latest checkpoint's oldestXID:        1006759584      
Latest checkpoint's oldestXID's DB:   1      
Latest checkpoint's oldestActiveXID:  0      
Time of latest checkpoint:            Fri 11 Jan 2013 10:16:18 AM CST      
Minimum recovery ending location:     0/0      
Backup start location:                0/0      
Backup end location:                  0/0      
End-of-backup record required:        no      
Current wal_level setting:            hot_standby      
Current max_connections setting:      1000      
Current max_prepared_xacts setting:   10      
Current max_locks_per_xact setting:   64      
Maximum data alignment:               8      
Database block size:                  8192      
Blocks per segment of large relation: 131072      
WAL block size:                       16384      
Bytes per WAL segment:                16777216      
Maximum length of identifiers:        64      
Maximum columns in an index:          32      
Maximum size of a TOAST chunk:        1996      
Date/time type storage:               64-bit integers      
Float4 argument passing:              by value      
Float8 argument passing:              by value      

关闭数据库后与刚修复好时的控制文件信息变化如下 :

开库前 :

pg_control last modified:             Fri 11 Jan 2013 10:09:44 AM CST      
Latest checkpoint location:           96E8/60000020      
Prior checkpoint location:            0/0      
Latest checkpoint's REDO location:    96E8/60000020      
Latest checkpoint's full_page_writes: off      
Latest checkpoint's NextXID:          0/1184890880      
Latest checkpoint's NextOID:          10000      
Latest checkpoint's oldestXID:        3479858176      
Latest checkpoint's oldestXID's DB:   0      
Time of latest checkpoint:            Fri 11 Jan 2013 10:09:44 AM CST      
Current wal_level setting:            minimal      
Current max_connections setting:      100      
Current max_prepared_xacts setting:   0      

关库后 :

pg_control last modified:             Fri 11 Jan 2013 10:16:18 AM CST      
Latest checkpoint location:           96E8/61000020      
Prior checkpoint location:            96E8/60DFF470      
Latest checkpoint's REDO location:    96E8/61000020      
Latest checkpoint's full_page_writes: on      
Latest checkpoint's NextXID:          0/1184890883      
Latest checkpoint's NextOID:          116385      
Latest checkpoint's oldestXID:        1006759584      
Latest checkpoint's oldestXID's DB:   1      
Time of latest checkpoint:            Fri 11 Jan 2013 10:16:18 AM CST      
Current wal_level setting:            hot_standby      
Current max_connections setting:      1000      
Current max_prepared_xacts setting:   10      

小结

1. 使用pg_resetxlog后, 先检查数据一致性, 必要时将数据逻辑导出(pg_dump), 使用initdb新建数据库, 再导入(pg_restore).

2. 如果控制文件丢失, 并且没有备份的话, pg_resetxlog你不知道该填啥, 但是可以从pg_xlog目录中获得大概的redo location, 或者pg_resetxlog 会猜测一些值, 直接-f生成控制文件, 启动数据库后, 可能由于XID回归到以前的XID而致使数据”消失”, 你可以使用txid_current()函数不断的消耗XID来得到一致的值.

《Use pg_resetxlog simulate tuple disappear within PostgreSQL》

《PostgreSQL 闪回 – flash back query emulate by trigger》

使用pg_xlogdump从xlog中抽取信息, 包括txid.

http://blog.163.com/digoal@126/blog/static/16387704020134993845555/

pg_resetxlog的版本必须要与数据库集群的版本一致。

注意

控制文件和PostgreSQL的大版本相关,建议用户使用pg_resetxlog前,看一看对应版本的使用说明。

PostgreSQL psql: FATAL: database “” does not exist

postgresql error PANIC: could not locate a valid checkpoint record

postgreSQL 如果没有正常关闭数据库且丢失了xlog 或者说wal日志文件的话 会导致启动时出现上述错误,对于pg而言可以通过pg_resetxlog或pg_resetwal(pg 10以后)来绕过该问题,即跳过这些日志。不像在oracle中,如果是没有clean close database且丢失了acitve或current的重做日志的话,会导致oracle数据库也无法打开,且强制开库的话步骤十分复杂,要比pg困难很多。

 

PG中可以执行如下命令来重置日志,但是注意做这些操作之前应该对$PGDATA目录做一个tarbar备份:

 

 


2019-11-19 06:37:09.459 UTC [2705] FATAL:  terminating autovacuum process due to administrator command
2019-11-19 06:38:35.142 UTC [2791] LOG:  listening on IPv6 address "::1", port 5432
2019-11-19 06:38:35.142 UTC [2791] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2019-11-19 06:38:35.143 UTC [2791] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2019-11-19 06:38:35.162 UTC [2792] LOG:  database system was interrupted; last known up at 2019-11-19 06:36:47 UTC
2019-11-19 06:38:35.162 UTC [2792] LOG:  creating missing WAL directory "pg_wal/archive_status"
2019-11-19 06:38:35.209 UTC [2792] LOG:  invalid primary checkpoint record
2019-11-19 06:38:35.210 UTC [2792] LOG:  invalid secondary checkpoint record
2019-11-19 06:38:35.210 UTC [2792] PANIC:  could not locate a valid checkpoint record
2019-11-19 06:38:37.204 UTC [2791] LOG:  startup process (PID 2792) was terminated by signal 6: Aborted
2019-11-19 06:38:37.206 UTC [2791] LOG:  aborting startup due to startup process failure
2019-11-19 06:38:37.222 UTC [2791] LOG:  database system is shut down
pg_ctl: could not start server
Examine the log output.




tar -jcvf pgdata.tar.bz2 $PGDATA



# Postgres < 10.0
pg_resetxlog -f $PGDATA

# Postgres >= 10.0
pg_resetwal -f $PGDATA



如何找到PostgreSQL中表对应的数据文件

postgreSQL中一般 一个表或索引会对应一个或多个数据文件,可以通过下面的方法获得表对应的数据文件位置:

 

 

postgres=# create table testtab (t1 int);                                                                                                                                                                                                 
CREATE TABLE

postgres=# 

postgres=# SELECT pg_relation_filepath('testtab');
 pg_relation_filepath 
----------------------
 base/13055/16384
(1 row)


PostgreSQL数据目录结构

文章来源:https://www.jianshu.com/p/cd8c5b988e52

PostgreSQL数据目录结构

 

根目录介绍

 

data
├── base                  # use to store database file(SELECT oid, datname FROM pg_database;)
├── global                # under global, all the filenode is hard-code(select oid,relname,relfilenode from pg_class where relfilenode=0 order by oid;)
├── pg_clog               # dir of transaction commit log
│   └── 0000
├── pg_commit_ts
├── pg_dynshmem
├── pg_hba.conf           # client authentication config file
├── pg_ident.conf         # user ident map file
├── pg_logical
│   ├── mappings
│   └── snapshots
├── pg_multixact
│   ├── members
│   │   └── 0000
│   └── offsets
│       └── 0000
├── pg_notify
│   └── 0000
├── pg_replslot
├── pg_serial
├── pg_snapshots         # dir of snapshot file
├── pg_stat
├── pg_stat_tmp          # dir of tmp stat file
│   ├── db_0.stat
│   ├── db_12407.stat
│   ├── db_16384.stat
│   └── global.stat
├── pg_subtrans
│   └── 0000
├── pg_tblspc
├── pg_twophase
├── PG_VERSION           # version file
├── pg_xlog              # dir of xlog file
│   ├── 000000010000000000000001
│   └── archive_status   # status info of xlog archive
├── postgresql.auto.conf
├── postgresql.conf      # config file of postmaster progress
├── postmaster.opts
└── postmaster.pid       # pid file of postmaster progress

global目录介绍

 

global名如其意,存放的文件用于存储全局的系统表信息和全局控制信息。

global下有四种文件:

  1. pg_control
    用于存储全局控制信息
  2. pg_filenode.map
    用于将当前目录下系统表的OID与具体文件名进行硬编码映射(每个用户创建的数据库目录下也有同名文件)。
  3. pg_internal.init
    用于缓存系统表,加快系统表读取速度(每个用户创建的数据库目录下也有同名文件)。
  4. 全局系统表文件
    数字命名的文件,用于存储系统表的内容。它们在pg_class里的relfilenode都为0,是靠pg_filenode.map将OID与文件硬编码映射。(注:不是所有的系统表的relfilenode都为0)

 

data
├── global                # under global, all the filenode is hard-code(select oid,relname,relfilenode from pg_class where relfilenode=0 order by oid;)
│   ├── 1136              # pg_pltemplate
│   ├── 1137              # pg_pltemplate_name_index
│   ├── 1213              # pg_tablespace
│   ├── 1214              # pg_shdepend
│   ├── 1232              # pg_shdepend_depender_index
│   ├── 1233              # pg_shdepend_reference_index
│   ├── 1260              # pg_authid
│   ├── 1261              # pg_auth_members
│   ├── 1262              # pg_database
│   ├── 2396              # pg_shdescription
│   ├── 2397              # pg_shdescription_o_c_index
│   ├── 2671              # pg_database_datname_index
│   ├── 2672              # pg_database_oid_index
│   ├── 2676              # pg_authid_rolname_index
│   ├── 2677              # pg_authid_oid_index
│   ├── 2694              # pg_auth_members_role_member_index
│   ├── 2695              # pg_auth_members_member_role_index
│   ├── 2697              # pg_tablespace_oid_index
│   ├── 2698              # pg_tablespace_spcname_index
│   ├── 2846              # pg_toast_2396
│   ├── 2847              # pg_toast_2396_index
│   ├── 2964              # pg_db_role_setting
│   ├── 2965              # pg_db_role_setting_databaseid_rol_index
│   ├── 2966              # pg_toast_2964
│   ├── 2967              # pg_toast_2964_index
│   ├── 3592              # pg_shseclabel
│   ├── 3593              # pg_shseclabel_object_index
│   ├── 4060              # pg_toast_3592x
│   ├── 4061              # pg_toast_3592_index
│   ├── 6000              # pg_replication_origin
│   ├── 6001              # pg_replication_origin_roiident_index
│   ├── 6002              # pg_replication_origin_roname_index
│   ├── pg_control        # global control file, use pgcheck -pc to see it.
│   ├── pg_filenode.map   # system table (oid -> filenode) mapping file, use pgcheck -pm to see it.
│   └── pg_internal.init  # system table cache file, use pgcheck -pr to see it.

base目录介绍

 

base目录用于存放数据库的所有实体文件。例如,我们创建的第一个库testdb的OID为16384,那么在data/base下就会产生一个名为16384的目录,用于存储testdb的数据文件。

 

testdb=# select oid,datname from pg_database;
  oid  |  datname
-------+-----------
 12407 | postgres
 16384 | testdb
     1 | template1
 12406 | template0
(4 rows)

base目录结构

data
├── base                  # use to store database file(SELECT oid, datname FROM pg_database;)
│   ├── 1                 # template database
│   ├── 12406             # template0 database
│   ├── 12407             # postgres database
│   └── 16384             # testdb, first user database
│   │   ├── 3600
│   │   ├── 3600_fsm
│   │   ├── 3600_vm
│   │   ├── 16385
│   │   ├── pg_filenode.map
│   │   ├── pg_internal.init
│   │   └── PG_VERSION
  1. pg_filenode.map 是pg_class里relfilenode为0的系统表,OID与文件的硬编码映射。
  2. pg_internal.init 是系统表的cache文件,用于加快读取。默认不存在,查询系统表后自动产生。
  3. PG_VERSION 是当前数据库数据格式对应的版本号
  4. 其它文件是需要到pg_class里根据OID查到对应的relfilenode来与文件名匹配的。
    例如:tab1的relfilenode是16385,那么16385这个文件就是tab1的数据文件
    testdb=# select oid,relfilenode,relname from pg_class where relname='tab1';
      oid  | relfilenode | relname
    -------+-------------+---------
     16385 |       16385 | tab1
    (1 row)
  1. 空闲空间映射表
    名字以_fsm结尾的文件是数据文件对应的FSM(free space map)文件,用map方式来标识哪些block是空闲的。用一个Byte而不是bit来标识一个block。对于一个有N个字节的block,它在_fsm文件中第blknum个字节中记录的值是(31+N)/32。通过这种方式标识一个block空闲字节数。FSM中不是简单的数组,而是一个三层的树形结构。FSM文件是在需要用到它时才自动产生的。
  2. 可见性映射表文件
    名字以_vm结尾的文件是数据文件对应的VM(visibility map)。PostgreSQL中在做多版本并发控制时是通过在元组头上标识“已无效”来实现删除或更新的,最后通过VACUUM功能来清理无效数据回收空闲空间。在做VACUUM时就使用VM开快速查找包含无效元组的block。VM仅是个简单的bitmap,一个bit对应一个block。

注:系统表分为全局系统表和库级系统表。
全局系统表位于global下,例如:pg_database,pg_tablespace,pg_auth_members这种存储系统级对象的表。
库级系统表位于数据库目录下,例如:pg_type,pg_proc,pg_attribute这种存储库级对象的表。
值得注意的是pg_class位于库级目录的里,但也包含全局系统表信息,因此研发或运维人员在改动全局系统表信息时需要注意。

表空间目录介绍

 

testdb=# select oid,* from pg_tablespace;
  oid  |  spcname   | spcowner | spcacl | spcoptions
-------+------------+----------+--------+------------
  1663 | pg_default |       10 |        |
  1664 | pg_global  |       10 |        |
 49162 | dbspace    |       10 |        |
(3 rows)

每一个Oid都在data/pg_tblspc下对应一个名为Oid的软链接文件,指向真正的space目录。

 tree ../data/pg_tblspc/
../data/pg_tblspc/
└── 49162 -> /home/postgres/postgresql-9.6.6/postgres/data/dbspace

在space目录是如何组织的呢?

testdb=# create table tab3(a int) tablespace dbspace;
CREATE TABLE

testdb=# select oid,relname,relfilenode from pg_class where relname='tab3';
  oid  | relname | relfilenode
-------+---------+-------------
 57351 | tab3    |       57351
(1 row)

 tree ../data/pg_tblspc/49162
../data/pg_tblspc/49162
└── PG_9.6_201608131
    └── 16384
        └── 57351

沪ICP备14014813号-2

沪公网安备 31010802001379号