Contents:
    Pseudo_Code_Of_ibbackup
    Case_1_Backup
    Case_2_ApplyLog
    MySQL_64Bit_DataTypes
    Compiler_Flags
    Zlib_Compression_Interface
    InnoDB_Utilities
    Important_Global_Variables
    InnoDB_DoubleWrite:
        http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
        http://dimitrik.free.fr/blog/archives/2009/08/mysql-performance-innodb-doublewrite-buffer-impact.html
    How_To_Check_fsync
    innodb_support_xa
    Innodb_Group_Commit
    InnoDB_Code
    Debugging_ibbackup
    mysqlbackup_Implementation

==============================================================================
Pseudo_Code_Of_ibbackup
==============================================================================

Background:

Initializing the system tablespace:

    system_tablespace = back_tablespace_new(spaceid=0, 0, FALSE, FALSE) { }

Data structures:

    typedef struct {
        char                *filename;
        ulint                size_in_mb;
        ulint                space_id;
        ulint                zip_size;   /* compressed page size in bytes */
        back_filestatus_t    status;     /* either exists or DROPPED state */
    } back_datafile_t;

    typedef struct {
        ulint                max_count;
        ulint                count;
        back_datafile_t     *items;
    } back_datafilearray_t;

    typedef struct back_tablespace {
        ulint                    space_id;
        ulint                    file_format;          /* 0 - Antelope */
        ulint                    zip_size;             /* compressed page size in bytes,
                                                          or 0 = uncompressed */
        back_datafilearray_t     data_files;
        back_datafilearray_t     backup_data_files;
        ibool                    is_auto_extending;
        ulint                    max_auto_extend_size;
        ibool                    is_compressed;
        struct back_tablespace  *next;
    } back_tablespace_t;

==============================================================================
Global variables:

    back_first_tablespace ==> first (system) tablespace;
    back_last_tablespace  ==> (updated for per-file tablespaces as well?)

    /* All per-table datafiles collected during backup */
    back_datafilearray_t back_collected_datafiles;

==============================================================================
Case_1_Backup

ibbackup ./etc/my.cnf backup.cnf

main:
    dict_ind_init();    // init some dict structures so InnoDB low-level routines can be used.
    Initialize the global variable fil_system, which is the tablespace memory cache.
    Initialize 2 buffer pool frames;
    Init the data file array;
    Init system tablespace info: back_tablespace_t *system_tablespace;
        [ temporarily sets autoextending & compressed to FALSE, but may change later. ]

    Read 6 parameters from my.cnf and then backup.cnf:

        parse_my_cnf_file(
            system_tablespace,
            &system_tablespace->data_files,   /* ibdata1:32M;ibdata2:32M:autoextend */
            FALSE,
            my_cnf,
            &back_datadir,
            &back_innodb_data_home_dir,
            &back_log_dir,
            &back_n_log_files,
            &back_log_file_size );

    get_pertable_tablespaces(create_backup_database_dirs, is_src_data_incremental):
        collect_pertable_filenames(filenames, create_db_dir, is_src_differential):
            if (back_datadir)
                // Get it from the original data dir.
                collect_pertable_tablespaces(filenames, ...)
            else
                // Get it from the backup dir. Used in apply-log etc.
                back_collect_filenames(back_back_datadir,
                                       uncompress_option ? filter_ibz : filter_ibd,
                                       excluded_names[in], filenames[out]);

    Note: The above function is idempotent for "backup"; i.e. when called multiple
    times, only the new per-table files are identified and tablespaces created for
    them. For operations such as "apply-log" the above function is *not* idempotent:
    a new tablespace is always created for each file found.

    The tablespace contains space->data_files and space->backup_data_files.
    For a backup operation both are updated; for an apply-log operation only
    backup_data_files is updated. (See the sketch below.)
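To make the structures above concrete, here is a minimal sketch of how a discovered
datafile might be appended to a back_datafilearray_t, and how backup vs. apply-log
would pick the target arrays. The helper names and the grow-by-doubling policy are
assumptions for illustration, not the real ibbackup code; it relies only on the
struct definitions above and InnoDB's ulint.

    #include <stdlib.h>

    /* Hypothetical helper: append one entry, growing the array by doubling. */
    static void
    datafilearray_append(back_datafilearray_t *arr, const back_datafile_t *file)
    {
            if (arr->count == arr->max_count) {
                    ulint            new_max = arr->max_count ? arr->max_count * 2 : 8;
                    back_datafile_t *p = realloc(arr->items, new_max * sizeof(*p));

                    if (p == NULL) {
                            abort();        /* out of memory */
                    }
                    arr->items = p;
                    arr->max_count = new_max;
            }
            arr->items[arr->count++] = *file;
    }

    /* Per the note above: backup updates both arrays, apply-log only the
       backup_data_files array. */
    static void
    tablespace_add_datafile(back_tablespace_t *space, const back_datafile_t *file,
                            int is_backup_operation)
    {
            if (is_backup_operation) {
                    datafilearray_append(&space->data_files, file);
            }
            datafilearray_append(&space->backup_data_files, file);
    }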
Q:  Since compressed tablespaces are not compressed again, how does apply-log
    differentiate when collecting the file names of all existing tablespaces?

collect_pertable_tablespaces(filenames):        /* From datadir */
    Any file with an .ibd suffix is a per-table tablespace.
    Note: ibdata1 and ibdata2 do not have the .ibd suffix.

    Get the tablespace id from the .ibd file:
        Read the first page; the space id is available in both the tablespace
        header and the page header of that page; if the two values don't match,
        error out.

    Get the compressed page size (0 = uncompressed) from the first page of the
    tablespace:
        Get the flags from the (first page) tablespace header;
        Look up the compressed page size using dict_table_flags_to_zip_size(flags);

    Create a new per-table tablespace:
        /* For this it again reads the first page of the tablespace! Why?! */
        back_tablespace_t *space = back_tablespace_new(...);

    Get the file format id:
        trx_sys_read_pertable_file_format_id(pathname-pertable, &format_id);
        The file format is 4 bytes at offset page+54.
        mach_read_from_4(ptr) (mach0data.ic:182): assembles a 32-bit value from
        4 bytes, most-significant byte first (big-endian).
        File format = 0 ==> Antelope.

    Populate space->data_files and space->backup_data_files;

    Note: Basically we initialize tablespaces for *all* per-table data files
    even before starting!

system_tablespace_read_format_id(system-table-space);   // Antelope

make_backup():

    check_backup_directories():
        Verify that the backup dir is not a subdir of the source datadir!
        back_are_files_the_same():
            On Windows, check accordingly.
            On Unix, check the stat() syscall result and see if the inodes are the same.

    back_look_for_checkpoint_pos():
        Looks for a checkpoint in the first log file and sets the start of the
        log copying accordingly.
        back_read_cp_info():
            Open ib_logfile0; read 3*512 + 512 bytes. This is guaranteed to
            contain both the 1st and 2nd checkpoint fields.
            Reads the checkpoint info needed in hot backup.
        Find out the log number (i.e. N in ib_logfileN) that holds the last
        checkpoint info.

    back_doublewrite_init();   // Init info about the doublewrite buffer.

    recv_sys_create();   // Initialize the local recovery system, used to parse
                         // log data and see whether the scanned log is corrupt.
        recv_sys->parse_start_lsn = backup_start_checkpoint; etc.
        // Initialize the start LSN as the last checkpoint.

    back_copy_log();

    For each tablespace:
        backup_tablespace(incremental?, start_lsn_incremental, space);

    back_suspend();   // At the end.

    back_check_log_scanned_far_enough();   // Copy remaining log if needed.
        Get new checkpoint info;
        back_copy_log();   // Copy logs from where we left off last time.

    // We must by now have copied at least up to the latest checkpoint;
    // otherwise the backup will be unusable, because we would be missing the
    // log between our last parsed LSN and the latest checkpoint. i.e.

    verify (back_up_log_end_lsn >= last_cp_lsn); else error;

    Assert(back_back_log_file_offset == back_up_log_end_lsn - back_up_log_start_lsn);

    At the end of ibbackup_logfile, append a 512-byte block with the following
    details: back_up_log_start_lsn, back_up_log_end_lsn, back_up_start_checkpoint,
    LOG_END_MAGIC.
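A minimal sketch of reading that trailing 512-byte block back, the way apply-log
does later (see Case_2_ApplyLog below). The field offsets (start_lsn at 0, end_lsn
at 8, start_checkpoint at 16, magic at +24, partial-backup flag at +28) and the
big-endian byte order are taken from these notes; the helper names are made up.

    #include <stdint.h>

    #define BACK_LOG_END_MAGIC 542632761UL    /* 0x2057eb39, per the apply-log notes */

    /* InnoDB-style big-endian reads (equivalents of mach_read_from_4 / mach_read_ull). */
    static uint32_t be_read_4(const unsigned char *b)
    {
            return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
                 | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    }

    static uint64_t be_read_8(const unsigned char *b)
    {
            return ((uint64_t)be_read_4(b) << 32) | be_read_4(b + 4);
    }

    /* Parse the last 512 bytes of ibbackup_logfile.
       Returns 0 on success, -1 if the magic number does not match. */
    static int parse_log_end_mark(const unsigned char end_mark[512],
                                  uint64_t *start_lsn, uint64_t *end_lsn,
                                  uint64_t *start_checkpoint, uint32_t *is_partial)
    {
            if (be_read_4(end_mark + 24) != BACK_LOG_END_MAGIC) {
                    return -1;                  /* not a valid end mark */
            }
            *start_lsn        = be_read_8(end_mark + 0);
            *end_lsn          = be_read_8(end_mark + 8);
            *start_checkpoint = be_read_8(end_mark + 16);
            *is_partial       = be_read_4(end_mark + 28);
            return 0;
    }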
==================== End Backup Logic ======================================

From log/log0recv.c (around line 747):

    /** @return TRUE if success */
    UNIV_INTERN
    ibool
    recv_read_cp_info_for_backup(
    /*=========================*/
        const byte*     hdr,        /*!< in: buffer containing the log group header */
        ib_uint64_t*    lsn,        /*!< out: checkpoint lsn */
        ulint*          offset,     /*!< out: checkpoint offset in the log group */
        ulint*          fsp_limit,  /*!< out: fsp limit of space 0, 1000000000 if the
                                         database is running with < version 3.23.50
                                         of InnoDB */
        ib_uint64_t*    cp_no,      /*!< out: checkpoint number */
        ib_uint64_t*    first_header_lsn)
                                    /*!< out: lsn of the start of the first log file */

The log starts with a header that looks like: +check_sum_for_header+...
Note: the checksummed header region is 288 bytes long. Verify that the checksum
matches. To do that: ut_fold_binary(buf, checkpoint_checksum_1 offset);

The initial 8 bytes of the checkpoint record are the checkpoint number.
In the example session: the checkpoint number is 26 and the checkpoint LSN is 81683811.

If the size of the ib_logfile0 file and the my.cnf parameter do not match,
give an error and abort!

In the example session with 2 rows with UNIQUE_THAVA_DATA:
    log at offset         (0x9d50)   : 40288
    data at offset        (0xc8120)  : 819,488
    double write data     (0x128120) : 1,212,704

/* The back_doublewrite_init function initializes the "back_doublewrite" struct
   with the location of the doublewrite buffer in the system tablespace.
   If this function fails, it does not return. */
static void back_doublewrite_init(...)

    Page 5 of the system tablespace is the transaction system header. It contains
    info about the doublewrite buffer. The last 200 bytes of this page contain the
    doublewrite buffer header. There are 2 blocks; each block is 64 consecutive
    pages dedicated to the doublewrite buffer. Tablespace page 5 contains pointers
    to all of this.

recv_sys_create(): initialize the recovery system.
    Init a red-black tree with 5 MB of memory, i.e. 5*1024*1024.
    Init start_lsn = backup_start_checkpoint (= 81683811  how?!)
    Init the ibbackup log name = "ibbackup_logfile"; (the original = ib_logfile0)

back_copy_log(): copy from the checkpoint LSN onwards.
    Open ib_logfile0 for reading;
    Read BACK_LOG_COPY_SEG_SIZE chunks of 1 MB each.
    recv_scan_log_seg_for_backup():   // log/log0recv.c
        log_block_get_data_len(log_block), which is 355 bytes!
        scanned_checkpoint_no is 19!  how?!
    While copying log records we also parse them to detect corrupted log records.
    Why?! Redundant?
    Copy the chunk to ibbackup_logfile;
    posix_fadvise() call on the output log file; (why for each 1 MB?!)

Standard_Breakpoints:
    back_look_for_checkpoint_pos();

Todo: Remember to remove the many fflush(stdout) calls from the code.
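A minimal standalone sketch of the chunked copy in back_copy_log() described above:
read 1 MB segments, append them to the output file, and hint to the kernel after each
segment. This shows only the I/O pattern; the parsing done by
recv_scan_log_seg_for_backup() is omitted, and the constant name is assumed.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define COPY_SEG_SIZE (1024 * 1024)   /* 1 MB, like BACK_LOG_COPY_SEG_SIZE */

    /* Copy src_fd to dst_fd in 1 MB segments; returns 0 on success, -1 on error. */
    static int copy_log_segments(int src_fd, int dst_fd)
    {
            char   *buf = malloc(COPY_SEG_SIZE);
            off_t   written = 0;
            ssize_t n;

            if (buf == NULL) {
                    return -1;
            }
            while ((n = read(src_fd, buf, COPY_SEG_SIZE)) > 0) {
                    if (write(dst_fd, buf, (size_t)n) != n) {
                            free(buf);
                            return -1;
                    }
                    written += n;
                    /* Hint that we will not reread the range just written.
                       Note DONTNEED only drops clean pages, so without a flush
                       first it may have little effect - possibly what the
                       "(why for each 1 MB?!)" question above is about. */
                    posix_fadvise(dst_fd, written - n, n, POSIX_FADV_DONTNEED);
            }
            free(buf);
            return (n < 0) ? -1 : 0;
    }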
==============================================================================
Case_2_ApplyLog

main {
    Initialize 2 buffer pool frames for apply-log;
    the second frame will be needed in btr_page_reorganize_low().

    recover_backup() {
        Open ibbackup_logfile; it is only 2*1024 bytes!
        size = read the file size: 2K;
        Subtract 512 bytes: size -= 512;
        Read the last 512 bytes of ibbackup_logfile into log_end_mark;
        it contains: start_lsn, end_lsn, start_checkpoint (each 8 bytes).
            ibbackup: *start-lsn* : 15390720, *end-lsn* : 15392144,
            ibbackup: *start-checkpoint* : 15391113.
        (end_mark + 24) should hold BACK_UP_LOG_END_NEW_FORMAT_MAGIC_N;
        it is 542632761 : 0x2057eb39.
        end_mark + 28 : 4 bytes indicating a partial backup;

        recv_sys_create();
        Recovery scan size is 4 pages; set buf_start_lsn;
        fil_space_create('ibdata1') : internal tablespace object creation;
        Load single-table tablespaces info;
        fil_open_log_and_system_tablespace_files() : do the real open of ibdata1, etc;

        While not finished: {
            Read a log segment from ibbackup_logfile: from offset 0 to (2K - 512).
            Read min(remaining_size, 4 pages), which here is 2K - 512 = 1536.

            recv_scan_log_recs(.. buf, buflen, start_lsn, ..., &scanned_up_to); {
                cur_lsn = start_lsn;
                For each LogBlock in (buf .. buf+len):   // process one LogBlock
                {
                    no = log_block_get_hdr_no(buf);  ==> yields 30061.
                    Note: it is the first 4 bytes in the log block header!

                    Map LSN to log block header number:
                        cur_lsn_to_hdr_no = (lsn / 512) + 1;  it is > 0 and <= 1G.
                    Q: Does this mean my LSN can be inaccurate to within 512 bytes????

                    The cur_lsn's header number and the current buffer's header
                    number should match; else break the loop.

                    If the checksum of the LogBlock does not match, break the loop.

                    If the LogBlock has the flush bit (the MSB) set, it is the
                    first LogBlock; then set contiguous_lsn to cur_lsn.

                    If (the LogBlock's checkpoint number < the recovery system's
                    checkpoint number), break the loop to skip this LogBlock.

                    data_len = log_block_get_data_len(log_block);  i.e. 512
                    scanned_lsn += data_len;  i.e. 15390720 + 512 = 15391232

                    recv_sys_add_to_parsing_buf(log_block, scanned_lsn) {
                        /* It is for "adding" to an existing buffer. */
                        Can add only if recv_sys->parse_start_lsn > 0;
                        Q: When was recv_sys initialized?
                           It was initialized by recover_backup() in back0back.c as:
                           recv_sys->parse_start_lsn = back_up_start_checkpoint;
                    }
                }
            }

            If recv_sys->heap size becomes higher than limit-memory:
                recv_apply_log_recs_for_backup();
        }

        Report the binlog number from TRX_SYS_PAGE_NO in ibdata1;
    }
}

Note: No single log record can be greater than 500 KB.
      The log records operate on rows?

==============================================================================
Zlib_Compression_Interface:

compress2(dest, destlen, source, srclen, compression-level):
    z_stream stream;
    set up the stream with input & output;
    deflateInit(&stream);
    deflate(&stream);
    deflateEnd(&stream);
    That's it!
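A caller-side companion to the summary above; this is plain zlib usage
(compressBound() to size the output buffer, then compress2()), not ibbackup code,
though the compressed .ibz backups mentioned earlier presumably go through this
same interface. Link with -lz.

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void)
    {
            const char src[]    = "example page contents, repeated repeated repeated";
            uLong      src_len  = (uLong)sizeof(src);
            uLongf     dest_len = compressBound(src_len);   /* worst-case output size */
            Bytef     *dest     = malloc(dest_len);
            int        rc;

            if (dest == NULL) {
                    return 1;
            }
            rc = compress2(dest, &dest_len, (const Bytef *)src, src_len, 1);  /* level 1 = fastest */
            if (rc != Z_OK) {
                    fprintf(stderr, "compress2 failed: %d\n", rc);
                    free(dest);
                    return 1;
            }
            printf("compressed %lu bytes to %lu bytes\n",
                   (unsigned long)src_len, (unsigned long)dest_len);
            free(dest);
            return 0;
    }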
==============================================================================
MySQL_64Bit_DataTypes:

Notes:
    mysys makes heavy use of size_t; there are no int32 or int64 types!
    Probably it never needs the datatype to be specifically 32 or 64 bits?!!!
    sql/ has only sql_class.cc referring to uint64; otherwise there is no int64 type!
    Only mysqld.cc contains _WIN32 and _WIN64 (both are defined on 64-bit Windows).
    Otherwise the check is mostly #ifdef __WIN__

==============================================================================
Compiler_Flags

Windows Flags:

    WIN32 API: the most important API on Windows for all core services: the
    graphical user interface; access to system resources such as memory and
    devices; displaying graphics and formatted text; audio, video, networking,
    and security. The core DLLs of Win32 are kernel32.dll, user32.dll, and
    gdi32.dll. Win32 was introduced with Windows NT; it provides almost all
    Windows internal services.

    64-bit Windows also uses the same-named DLLs; pointers are 64-bit by default.

    cmake only defines WIN32 for use within CMakeLists.txt; we define the
    following variables (_D...) for C programs internally from CMakeLists.txt.
    Note: all the flags set for win32 are also set for win64.

    General:
        _WIN32 : Automatically set by compilers. Signifies the Win32 API is available.
        WIN32  : Set by the MSVC++ compiler. (And also by a Windows SDK header file?)

    In innobackup-c:
        Flag        32-bit/64-bit/both
        __WIN__     both
        _WINDOWS    both
        _WIN64      64-bit

    Note: InnoDB code ultimately uses __WIN__ in its code base.

    In ibbackup there is include/win/inttypes.h:
        Defines "x", "lX" etc. for fscanf macros on WIN64 systems.
    include/win/stdint.h:
        Defines int8_t, int64_t, etc., intended for MS VC++.
        It also defines ssize_t as signed size_t: long or long long.

    Windows considerations:
        The maximum length for a path is MAX_PATH, which is defined as 260 characters.
        File I/O functions in the Windows API convert "/" to "\".
        InnoDB defines OS_FILE_PATH_SEPARATOR.

=============================================================================
InnoDB_Utilities:

    ut_a(X) : utility for assert, e.g. ut_a(p != NULL);
    ut_align(const void* ptr, ulint align_no);   /* rounds up alignment */
    ut_align_down(...);                          /* rounds down alignment */
    hash_node_t;
    UT_LIST_NODE_T(type);
    back_strarr_t *;   [ array of strings! ]
    os_file_create_simple_no_error_handling(...);
    os_file_get_last_error(TRUE);
    ut_print_timestamp(stderr);
    char* os_file_dirname(..) : returns the parent dirname of the path!
        See also Win32 APIs: PathCanonicalize(), GetFullPathName().
    mem_free_func(); ==> intelligent memory allocation internals used from
        innodb/include/mem0mem.ic !!!

==============================================================================
Important_Global_Variables:

    fil_system : tablespace memory cache. innodb/fil/fil0fil.c
                 "fil" stands for "file" system?!

    Page size = 16 KB

    #define OS_FILE_LOG_BLOCK_SIZE  512
    #define LOG_CHECKPOINT_1        OS_FILE_LOG_BLOCK_SIZE
            /* first checkpoint field in the log header */
    #define LOG_CHECKPOINT_2        (3 * OS_FILE_LOG_BLOCK_SIZE)
            /* second checkpoint field in the log header */

    typedef byte* page_t; !!!
    There is no structure for the page header! There is a sequence of OFFSETS
    defined for the page structure, e.g.:
        unsigned long long page_lsn = mach_read_ull(page + FIL_PAGE_LSN);
    Note: InnoDB usually writes MSB first (i.e. big-endian order!!!)

    include/fil0fil.h defines:
        #define FIL_PAGE_LSN 16
        LSN of the newest modification to the page! For a data page this makes
        sense. What about a log page???

    More about fil0fil.h:
        - Low-level file system.
        - What is a page number? Is it a number within the tablespace or within
          the file?
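A minimal sketch of the "offsets, not structs" access pattern with mach_read-style
big-endian helpers. FIL_PAGE_LSN = 16 is from the notes above; the other two offsets
(page number at 4, space id at 34) are quoted from memory of fil0fil.h and should be
double-checked against the header.

    #include <stdint.h>
    #include <stdio.h>

    typedef unsigned char byte;

    #define FIL_PAGE_OFFSET                    4   /* page number within the space */
    #define FIL_PAGE_LSN                      16   /* newest modification LSN (8 bytes) */
    #define FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID  34   /* space id (4 bytes) */

    /* InnoDB stores multi-byte fields MSB first (big-endian). */
    static uint32_t mach_read_from_4(const byte *b)
    {
            return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
                 | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    }

    static uint64_t mach_read_ull(const byte *b)
    {
            return ((uint64_t)mach_read_from_4(b) << 32) | mach_read_from_4(b + 4);
    }

    /* Usage against a 16 KB page image read from an .ibd or ibdata file. */
    static void dump_page_header(const byte *page)
    {
            printf("page_no=%u space_id=%u lsn=%llu\n",
                   (unsigned)mach_read_from_4(page + FIL_PAGE_OFFSET),
                   (unsigned)mach_read_from_4(page + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID),
                   (unsigned long long)mach_read_ull(page + FIL_PAGE_LSN));
    }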
    typedef struct {
        uint4   pageno;    /* page number - within the file or within the space??? */
        uint2   boffset;   /* byte offset within the page */
    } thava_space_addr_t;

    typedef struct my_page {
        ulint       checksum;       /* checksum of the page (since 4.0.14) */
        ulint       page_offset;    /* page offset inside the space */
        fil_addr_t  previous;       /* offset or fil_addr_t */
        fil_addr_t  next;           /* offset or fil_addr_t */
        dulint      page_lsn;       /* lsn of the end of the newest modification
                                       log record to the page */
        ulint       page_type;      /* file page type */
        dulint      file_flush_lsn; /* the file has been flushed to disk at least
                                       up to this lsn */
        int         space_id;       /* space id of the page */
        char        data[];         /* will grow */
        /* page trailer: */
        ulint       page_lsn_low;   /* the last 4 bytes of page_lsn */
        ulint       checksum2;      /* page checksum, or checksum magic, or 0 */
    }

    #define LOG_START_LSN  (16 * 512)   /* = 8K; counting of LSNs starts from this
                                           value. This is less than 1 InnoDB page
                                           size. Why? */

    LogBlock header contains:
        Header number                          (4 bytes)
        Data length                            (2 bytes)
        Offset to first mtr log record group   (2 bytes)
        Log block checkpoint no                (4 bytes) : this is the "next
                                               checkpoint no" when the log is written.
        Total header size = 12 bytes.

    Log file header size = 4 * 512 = 2 KB.
    Log file header contents:
         0 - Log group ID (4 bytes)
         4 - LSN of the start of data in this log file (8 bytes)
        12 - Log file number (4 bytes) (only in an archived log file)
        16 - 32-byte string, e.g. "Created by ibbackup ..."

    Log start LSN = 8K.

    log_group_calc_size_offset() : converts a real offset in the log file to an LSN
        [ by subtracting LOG_FILE_HDR_SIZE n times ];
    log_group_calc_real_offset() : converts from an LSN to a real offset in the file
        [ by adding LOG_FILE_HDR_SIZE ];
    log_calc_where_lsn_is()      : converts a real LSN into (fileno, offset);
        Main inputs: lsn, first_header_lsn

==============================================================================
InnoDB_DoubleWrite

http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/

Innodb Double Write

One of the very interesting techniques InnoDB uses is the technique called
doublewrite. It means InnoDB writes data twice when it performs tablespace
writes; writes to log files are done only once.

So why is doublewrite needed? It is needed to achieve data safety in case of
partial page writes. InnoDB does not log full pages to the log files, but uses
what is called "physiological" logging, which means log records contain the
page number for the operation as well as the operation data (i.e. update the
row) and log sequence information. Such a logging structure is great as it
requires less data to be written to the log, however it requires pages to be
internally consistent. It does not matter which page version it is: it could
be the "current" version, in which case InnoDB will skip the page update
operation, or the "former" one, in which case InnoDB will perform the update.
If the page is inconsistent, recovery can't proceed.

So how does doublewrite work? You can think of it as one more short-term log
file allocated inside the InnoDB tablespace; it contains space for 100 pages
(1.6 MB). When InnoDB flushes pages from the InnoDB buffer pool it does so in
batches of multiple pages. So several pages are written to the doublewrite
buffer (sequentially), fsync() is called to ensure they make it to disk, then
the pages are written to their real locations and fsync() is called a second
time. Now on recovery InnoDB checks the doublewrite buffer contents and the
pages in their original locations. If a page is inconsistent in the
doublewrite buffer it is simply discarded; if it is inconsistent in the
tablespace it is recovered from the doublewrite buffer.
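A conceptual sketch of the write ordering described above: the batch goes to the
doublewrite area first, then to the pages' home locations, with an fsync() barrier
after each step so a torn write can only ever affect one of the two copies. The
offsets are made up and this is not InnoDB's actual flush code.

    #include <unistd.h>

    #define PAGE_SIZE     16384
    #define DBLWR_OFFSET  (64UL * PAGE_SIZE)   /* made-up location of the doublewrite area */

    /* Flush a batch of dirty pages with doublewrite; returns 0 on success. */
    static int flush_batch_with_doublewrite(int fd,
                                            const unsigned char *pages,
                                            const unsigned long *page_nos,
                                            int n_pages)
    {
            int i;

            /* 1. Write all pages sequentially into the doublewrite area. */
            for (i = 0; i < n_pages; i++) {
                    if (pwrite(fd, pages + (size_t)i * PAGE_SIZE, PAGE_SIZE,
                               (off_t)(DBLWR_OFFSET + (unsigned long)i * PAGE_SIZE)) != PAGE_SIZE) {
                            return -1;
                    }
            }
            if (fsync(fd) != 0) {           /* doublewrite copies are durable */
                    return -1;
            }

            /* 2. Now write each page to its real location. */
            for (i = 0; i < n_pages; i++) {
                    if (pwrite(fd, pages + (size_t)i * PAGE_SIZE, PAGE_SIZE,
                               (off_t)(page_nos[i] * PAGE_SIZE)) != PAGE_SIZE) {
                            return -1;
                    }
            }
            return fsync(fd);               /* home copies are durable */
    }

On recovery, a page whose home copy is torn can be restored from its doublewrite
copy; a torn doublewrite copy is simply ignored, because the home copy is still the
old, consistent version.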
Q: Where does the doublewrite buffer live? In the InnoDB system tablespace,
   or in the tablespace being written to?

How much does the doublewrite buffer affect MySQL performance? Even though
doublewrite requires each page to be written twice, its overhead is far less
than double. The write to the doublewrite buffer is sequential, so it is
pretty cheap. It also allows InnoDB to save on fsync()s: instead of calling
fsync() for each page write, InnoDB submits multiple page writes and calls
fsync() once, which allows the operating system to optimize the order in which
writes are executed and to use multiple devices in parallel. This optimization
could be used without doublewrite though; it was just implemented at the same
time. So in general I would expect no more than a 5-10% performance loss due
to the use of doublewrite.

Can you disable doublewrite? If you do not care about your data (i.e. slaves
on RAID0), or if your file system guarantees that no partial page writes can
exist, you can disable doublewrite by setting innodb_doublewrite=0. It is
however not worth the trouble in most cases.

==============================================================================
How_To_Check_fsync:

http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/

Check whether your OS is doing a real fsync. You should know this anyway if
you care about your data safety. This can be done, for example, by using
SysBench:

    sysbench --test=fileio --file-fsync-freq=1 --file-num=1 \
             --file-total-size=16384 --file-test-mode=rndwr

This writes and fsyncs the same page, and you should see how many requests/sec
it is doing. You might also want to check diskTest from
http://www.faemalia.net/mysqlUtils/ which does some extra tests for fsync()
correctness.

==============================================================================
What is innodb_support_xa?

The parameter provides consistency between the binlog and the InnoDB
transaction log.

==============================================================================
Innodb_Group_Commit:

What is InnoDB group commit? For each InnoDB transaction, 1 fsync (or 2 if XA
is on) is done. If there are N concurrent commits, group commit combines their
fsyncs. If the number of fsyncs remains proportional to the number of
transactions, group commit is broken.

==============================================================================
InnoDB_Code

log0log.c : log_init() {
    OS_FILE_LOG_BLOCK_SIZE = 512 bytes
    log_sys->lsn = LOG_START_LSN = 16 * 512 = 8 K;
    LOG_BUFFER_SIZE >= 4 pages, i.e. 64 KB; typically this will be at least
    32 MB or so.
    Allocate log_sys->buf = LOG_BUFFER_SIZE bytes;
    log_sys->max_buf_free = (LOG_BUFFER_SIZE/2) - 4 pages (i.e. 64 KB) - 4*512;
    log_sys->checkpoint_buf = allocate 512 bytes;

    log_block_init(log_sys->buf, log_sys->lsn (i.e. 8K)):
        no = lsn / 512 = 16;
        log_block_set_hdr_no(buf, 16):
            write 16 (4 bytes) into mem location buf + 0 (offset of the block hdr no!)
        log_block_set_data_len(buf, hdr_size = 12 bytes):
            write 12 (2 bytes) into mem location buf + 4 (offset of the hdr data length)
        log_block_set_first_rec_group(buf, 0):
            write 0 (2 bytes) into buf + 6;   // for mini-transaction log records???

    log_sys->buf_free = LOG_BLOCK_HDR_SIZE = 12 bytes;
    log_sys->lsn = 8 KB + 12 bytes;
}

log_calc_where_lsn_is(first_header_lsn, lsn to search, N_logfiles, filesize) {
    LSN is approximately := amount of real log content + 8K + 12 bytes
    + log file header size (as if the entire log lived in a single log file).
    LSN points to the next byte to write; ????
}
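A rough sketch of the conversion that log_calc_where_lsn_is() performs, under
simplifying assumptions (lsn >= first_header_lsn, equal-sized files,
LOG_FILE_HDR_SIZE = 2 KB per file as noted above); the real function also has to
handle wrap-around of the circular log, so treat this only as the arithmetic idea.

    #include <stdint.h>

    #define LOG_FILE_HDR_SIZE  (4 * 512)    /* 2 KB header per log file, per the notes */

    /* Map an LSN to (file number, byte offset within that file).
       first_header_lsn is the LSN at the start of data in log file 0. */
    static void calc_where_lsn_is(uint64_t lsn,
                                  uint64_t first_header_lsn,
                                  uint64_t file_size,
                                  unsigned n_files,
                                  unsigned *file_no,
                                  uint64_t *offset_in_file)
    {
            uint64_t capacity = file_size - LOG_FILE_HDR_SIZE;  /* payload bytes per file */
            uint64_t delta    = lsn - first_header_lsn;         /* assumes lsn >= first_header_lsn */

            *file_no        = (unsigned)((delta / capacity) % n_files);
            *offset_in_file = LOG_FILE_HDR_SIZE + delta % capacity;
    }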
==============================================================================
Debugging_ibbackup

==============================================================================
mysqlbackup_Implementation

    incremental-backup apply will be similar to ibbackup.
    Won't delete unknown files.
    --pipe won't be supported.

ibbackup_Implementation:

    make_backup():
        Look for the checkpoint position;
        init info about the location of the doublewrite buffer in the system
        tablespace;
        init the recovery system in order to parse the log while we back it up;

        back_copy_log():
            Loop: for each MB of log segment:
                recv_scan_log_seg_for_backup():   // Here we read the segments.
                    recv_scan_log_recs();         // Here we parse to make sure it is OK.
                Apply posix_fadvise for each 1 MB;
            End of copying log;

        Start copying data:
            For each tablespace in all tablespaces linked from back_first_tablespace:
                backup_tablespace();
            If it is the last tablespace, check whether there are new per-table
            datafiles; if there are no more per-table datafiles, we are at the end.

        If suspend-at-end is enabled, back_suspend():
            Until the suspend file is deleted, do every second:
                keep scanning the log, i.e. back_check_log_scanned_far_enough();
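A minimal sketch of that suspend-at-end loop: poll for the suspend file once per
second and keep the log scan going until the file disappears. The file name handling
and the scan callback are placeholders, not the real ibbackup interface.

    #include <sys/stat.h>
    #include <unistd.h>

    /* Stub standing in for back_check_log_scanned_far_enough(); the real one
       copies any log written since the last pass.  Returns 0 on success. */
    static int check_log_scanned_far_enough(void)
    {
            return 0;
    }

    /* Block until the user deletes suspend_file, scanning the log once a second
       so the copied log never falls far behind the running server. */
    static int suspend_until_file_removed(const char *suspend_file)
    {
            struct stat st;

            while (stat(suspend_file, &st) == 0) {      /* file still exists */
                    if (check_log_scanned_far_enough() != 0) {
                            return -1;
                    }
                    sleep(1);
            }
            return 0;
    }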