UniData Recoverable File System (RFS)
15 Dec 2005
Do you administer a UniData (U2) database with the Recoverable File System (RFS) enabled? Then you know that status and error messages logged by UniData's RFS sub-system are sometimes difficult to understand. This article will highlight commonly occuring messages and describe appropriate actions to take. In some cases you may be able to use this information to recover from a crash without additional assistance.
Checking log files ..... ----- SM (98320) is started at Dec 22 2004 10:26:41 ----- Unidata Environment : /usr/ud60/include SM: Restart_Flag = 0.
These are the normal messages logged during a
startup of a UniData database that has previously been shut down
gracefully. This is indicated by the
Restart Flag=0. If
the database was not shut down gracefully, the
Flag will be set to
1 and recovery steps will
ERROR in function U_s_setupsca(), file s_setsca.c: *** file log too small --- called by process 1198038 File log size should be at least 2501 blocks. Please change the file log size and run cntl_install, then startup system.
This set of messages is logged during database
startup if there is a mismatch between the file-level log settings in
logconfig and the NUSERS setting in udtconfig. The file log size must
be at least 1 greater than NUSERS. This typically happens when an
administrator adjusts NUSERS and neglects to reconfigure the logconfig
file and run
cntl_install to reinitialize the logs. Here
0120 (file-level log) line of a small logconfig
file configured for NUSERS=40:
/disk1/ud61/log/filelog_0 0120 4096 0 41
Note: This is a fatal startup error. The database will not start until corrective action is taken.
Checking log files ..... check_rdud:ReadHdr SM exited with code 3.
This message is reported when a log file
header cannot be positioned correctly (using the lseek() system call)
or the content of the log file header cannot be decoded. This could
happen if the logconfig file is changed after UniData has been shut
down, or if the log file header is corrupted. To correct the problem,
check the log configuration file and the specified log files. You may
need to run
cntl_install to reinitialize the log
Note: This is a fatal startup error. The database will not start until corrective action is taken.
----- SM (9778) is started at Dec 28 2004 15:14:36 ---- Unidata Environment : /usr/ud60/include SM: Restart_Flag = 1. Starting: Tue Dec 28 15:14:46 2004 RM: Replication is off recovery: RM started, pid:9779 Starting to restart .......... Undo logfile ..... The logfile has been scanned. Undo logfile has been undone. Undo logfile ..... The logfile has been scanned. Undo logfile has been undone. *****!!! Restart Finished !!!***** restart: Report of Log-file Status. # Type Checkpoint-Number Status 0 after-image 0 0 1 after-image 0 4 2 before-image 0 4 3 before-image 0 4 4 after-image 0 4 5 after-image 0 4 6 before-image 0 4 7 before-image 0 4 Link back all new blocks .... Link new blocks successful ! Start the redo processing .... Finish the redo processing. All undo/redo recovery processing has been finished!! Redo logfiles finished. Step3: Check the file level commands... File level commands checking finished. Restart is successful!!! RM has already been stopped. recovery: RM stopped. Checking log files ..... The system is normally running.
These are the normal messages logged during a
successful recovery of the database after a database or system crash.
This is from a UniData Version 6.0.9 with Replication turned off.
Restart Flag=1. This indicates that recovery is
required. First we undo the updates by applying the before-image logs.
The number of logs will vary depending on your logconfig settings. Then
the log-file status is displayed. Then we redo updates from the
after-image logs and check for file level commands in the logs. The
message "The system is normally running" indicates a successful
TM can only be started by SM. SM checked: tm (udtno=1, pid=75602, udtuid=0, udtgid=0, cur_request= -1 ): Initialization failed.
This set of messages occurs if a user is attempting to start a new UniData session while database recovery is still in progress. These messages will be interspersed with the other messages generated during database recovery. This can result in file corruption. Users should be blocked from starting a new UniData session until the database has completed all recovery steps. Starting at UniData Version 6.1.7, the database will block new user sessions. Prior to that release, the administrator should take steps external to UniData to prevent users from attempting to start new sessions.
addfend_tbl:not in the endfile table undo_recv:Scan_Undo_Log Errno=10001 errno=2,at Tue Dec 21 15:11:01 2004 . U_crashrest:U_undo_recover: 0 Errno=10001 errno=2,at Tue Dec 21 15:11:01 2004 . restart:U_CrashRest Errno=10001 errno=2,at Tue Dec 21 15:11:01 2004 . RM has already been stopped. recovery: RM stopped. SM exited with code 4.
This set of messages was recorded during a failed recovery. This indicates a confusion about the status of the log files. The problem was triggered by crashing the system immediately after a successful recovery (and before the first check point). It is the result of a product defect (e7412). This problem is corrected in UniData Version 6.1.7 and later.
Note: The only way to get the
database restarted after this error was to run
cntl_install to reinitialize the logs.
Undo logfile has been undone. write_back_newblk: undo_addblk Errno=22 errno=22,at Fri Sep 10 02:46:17 2004 . U_crashrest:U_write_back_newblk(). Errno=22 errno=22,at Fri Sep 10 02:46:17 2004 . restart:U_CrashRest Errno=22 errno=22,at Fri Sep 10 02:46:17 2004 . ******* restart: Cannot Restart!!! Link back all new blocks .... write_back_newblk: undo_addblk Errno=22 errno=22,at Fri Sep 10 02:46:17 2004 . U_crashrest:U_write_back_newblk(). Errno=22 errno=22,at Fri Sep 10 02:46:17 2004 . restart:U_CrashRest Errno=22 errno=22,at Fri Sep 10 02:46:17 2004 . SM exited with code 4. U_go_to_sleep(): udt_msgrcv() error: Identifier removed SM checked: bimglog (pid = 58598): Exit code 1, status(0x100) ----- System Crashed at Sep 10 2004 02:46:19 ----- All possible CM & TMs & AIMGLOGs & BIMGLOGs killed
This set of messages was recorded during a failed recovery. The database crash preceding this recovery was forced by the system administrator after user processes appeared to hang and "File udt_system(s) full for file path_to_file, pid - nnnn" messages were recorded in the sm.log. The "file udt_system(s) full" message was actually the result of a UniData static hashed file expanding to a size that was not an even multiple of the file's block size, and becoming inaccessible. The messages in the sm.log during restart were the result of attempts to update this file. Two fixed issues relate to this problem. With the fix for issue e7185, UniData now prohibits an RFS file from expanding beyond an even multiple of its block size and becoming inaccessible. This fix resolves the problem with other processes becoming unresponsive as well as the database recovery issue, by not allowing the original situation to occur. The fix for issue e7187 corrects the "full" message to actually display the specific file name and the PID of the tm process struggling (instead of the PID of the cm or syncd process). These fixes are all available in UniData Versions 6.0.16, 6.1.1, and later.
Index: not a free node address: 651165696 Index: Can not split node
This message indicates that a UniData alternate index cannot be updated during RFS recovery. The specific reason is that the first node of the free chain of the index file is not marked as a "Free Node." (Each index node has a special flag to indicate the type of the node.) The index file can no longer be updated because there is no free index node.
Rocket U2 identified a problem (e6650) in RFS crash recovery handling where an index node is freed and not marked as a "Free Node" during the undo phase of recovery. This problem has been corrected at UniData Versions 6.0.14, 6.1.0, and later.
CM: Unknown msg type rcvd.
For RFS systems with Archiving enabled, this message will appear in the sm.log each time a checkpoint occurs. It is the result of a harmless coding error within the UniData engine. No corrective action is necessary. Starting with UniData Version 6.1.6, this message no longer occurs.
CM, Wed Dec 1 21:23:57 2004:WARNING: The log file [/dev/rbimglv01] is too small to contain the log records. Please enlarge the log files.
This warning is fairly clear. It is generated by the checkpoint manager (CM). This indicates that the specified RFS log is too small to contain all of the log records for a checkpoint. UniData will handle this condition by writing records to the LOG_OVERFLO directory, but you should reconfigure the logs so that this extra I/O overhead is not necessary.
Note: If the logs overflow and UniData is unable to write to the LOG_OVERFLO directory for any reason (for example, if the disk is full), the database will crash.
Error when attaching shm (4980785, 0, 0), errno=24 ==TM(pid:34685034) calling reportIpcState(); logfile: /usr/ud61/bin/TM34685034c0.ipclog; funcname:U_getshmadd(A), errno:24. Current time=Thu Dec 2 12:50:57 2004
get_addr.c:U_mi_addr:U_getshmadd:cannot attach Error when attaching shm (4980777, 0, 0), errno=24 ==TM(pid:8536212) calling reportIpcState(); logfile: /usr/ud61/bin/TM8536212c0.ipclog; funcname:U_getshmadd(A), errno:24. Current time=Thu Dec 2 12:57:22 2004
Both of the above messages (3a and 3b)
tm process failing to attach needed shared
memory segments. The
udt pair will exit
(the error is fatal to them), but the database will not crash. The
message refers to an ipclog file; this file contains details about the
shared memory use by the
udt pair, as well
as overall system use of shared memory at the moment the
tm processes exited. The information in the ipclog can be
used to tune your system kernel or UniData shared memory parameters in
the udtconfig file to eliminate the error.
U_sbb_rlsblk(): mismatched current first put entry and saved first put entry.
This message is produced by
running on a recoverable dynamic file. It is just a warning produced by
a UniData coding error. This is recorded as engineering case e6286.
This message will not occur in versions of UniData that are 6.0.11 or
SM stop_tm: TM 15 has already exited.
The message "SM stop_tm: TM 15 has already
exited" is produced by the
sm daemon while it is
STOP_TM request from the
smm issues the
STOP_TM request on an
RFS-enabled system, if it finds a user process is dead but that process
still occupies an entry in the UniData user process table. If
sm finds that the corresponding
for this dead
udt process has already exited from the
sm prints out the message.
Putting this another way, if both the
udt process and its corresponding
have exited or been killed and the user process table in shared memory
has not been cleaned up, the system produces this message. There are
several scenarios that can produce such a situation:
- During the UniData process login procedure, the
udtprocess gets killed or abnormally exits (for example, core dumps) before its corresponding
tmprocess has been initiated.
- During the UniData process logout procedure, after requesting
tmto exit, the
udtprocess gets killed or exits abnormally.
Note: The above logout
procedure can either be a normal user-initiated logout, a logout as a
result of severe system errors, or logout as a result of being asked to
udt process receiving a kill signal, including the
command won't trigger this message. However, if the
deleteuser process encounters further issues, then the
message can occur. The
deleteuser command will first send
a soft kill (kill -15) to the
udt process. The
udt process will start the logout procedure -- which
includes requesting its
tm process to exit. These
activities may take some time on a very busy system. Meanwhile, the
deleteuser command (or the impatient system
administrator) will issue a hard kill (kill -9) to the same
udt process. The hard kill will immediately kill the
process, but the side effect is that the user process table in shared
memory won't be properly cleaned up (as it is during a normal
udt logout procedure).
archive,Sat Dec 4 15:41:51 2004: The archive file /udlogs/arch/arch11 is full. The Logical Sequence Number (LSN) of this archive is -- 1 ar_backupd, Sat Dec 4 15:41:51 2004: starting to offload archive /udlogs/arch/arch11 (LSN = 1) ar_backupd, Sat Dec 4 15:42:03 2004: file (LSN = 1) off-loaded.
This is a normal set of messages recorded when an RFS system is configured with Archiving. These messages indicate an archive has filled and been successfully off-loaded.
ar_backupd, Sat Dec 4 21:12:15 2004: starting to offload archive /udlogs/arch/arch6 (LSN = 58) The archive file /udlogs/arch/arch6 can not be off loaded. You may want to make sure that this file has been saved. Also, label it Logical Sequence Number (LSN) -- 58 Please create /usr/ud61/include/DONE file (as root) when done... ... ar_backupd, Sat Dec 4 21:34:19 2004: off-loading of archive /udlogs/arch/arch6 (lsn 58) must be completed before system can progress. ARCH: waiting for the file (/udlogs/arch/arch6) to be off-loaded ... ARCH: waiting for the file (/udlogs/arch/arch6) to be off-loaded ... ...
This set of messages indicates that all of the archives are full, and the specified (oldest) archive must be off-loaded before the database can continue. The database will not be responsive to users trying to update records. The last message will repeat until the administrator corrects the situation and signals the database to proceed by creating the DONE file in the specified UniData include directory.
U_tm_login msgsnd error: Invalid argument SM checked: tm (udtno = 6, pid = 63604, udtuid = 4191, udtgid = 300, cur_request = 1 ): Initialization failed. sendback: msgsnd: Invalid argument
This series of messages is reported when
sm daemon is trying to start a
udt process initialization. The first message is
reported by the
tm process when it tries to send a login
success message to its buddy
udt process. Part of the
message structure is the process ID of the
udt process ID recorded in the shared memory
process table is less than 1 (most likely it is -1). This causes the
msgsnd() call failure. Because of the msgsnd() failure, the
tm process exits with error code U_TM_INIT_FAIL_EXIT. The
error code is caught by
sm and reported in the log. When
sm tries to deliver an error message to notify the
udt process, the msgsnd() system fails again because of
the same reason of the first message (the message type is still set to
This problem could happen when a
udt process tries to start its
(through the help of
sm), and the
Local Control Table entry for the
udt process is
inappropriately cleaned or freed while
tm is starting.
This suggests that there may be a timing
issue in the UniData RFS login, logout, and cleanup architecture, but
this problem has not been reproduced. The impact on the user is minimal
(launching a new
udt process fails, but a retry is
generally successful). Starting at UniData Version 6.1.7, additional
details about the specific message command that fails is included in
the sm.log (e7543). This may provide information that would lead to
reproducing this problem.
File udt_system(s) full for file /ud/KEY-DATA/FIN.RPT, pid - 56222
This message is misleading. The "file udt_system(s) full" message was actually the result of a UniData static hashed file expanding to a size that was not an even multiple of the file's block size and becoming inaccessible. Starting with UniData Version 6.0.16 and 6.1.1, this problem can no longer occur. The fix for e7185 UniData now prohibits an RFS file from expanding beyond an even multiple of its block size and becoming inaccessible. If this situation were still to occur in some unanticipated manner, the message in the sm.log will now be more meaningful (e7187).
Note: Although this problem does not specifically trigger a database crash, the database does become unusable. Both the check point manager daemon (cm) and one of the sync daemons (syncd) start spinning and become unresponsive.
==Fri Sep 10 06:21:54 2004 TM, file-corruption detected, call U_fdbg_verify_put(963081)== TM found the file in its Unix-open-table. TM calls U_CheckFileInfo() to report the details. == TM:U_CheckFileInfo(afid:1086, sub_nth:00000000, fd:15, putno:963081 == UnixFileName:/ud/KEY-DATA-PROD/PROD.POINTER TM-fd's (st_dev:3473450, st_ino:19). AFT-fd's (st_dev:3473450, st_ino:19). U_____Report Error____ Funcname: write_record...Syscall: long record check...U_Errno: 4... Filename: /ud/KEY-DATA-PROD/PROD.POINTER ...Key: ...Number: 2852 U_____Report Error____ Funcname: U_add_record...Syscall: long record check...U_Errno: 1... Filename: /ud/KEY-DATA-PROD/PROD.POINTER ...Key: 001!FO1067137!064...Number: 5842944 U_____Report Error____ Funcname: modify_record...Syscall: long record check...U_Errno: 1... Filename: /ud/KEY-DATA-PROD/PROD.POINTER ...Key: 001!FO1067137!064...Number: -1 U_____Report Error____ Funcname: U_append_strtuple...Syscall: long record check...U_Errno: 1... Filename: /ud/KEY-DATA-PROD/PROD.POINTER ... Key: 001!FO1067137!064 ... Number: 2852U_sbr_addrec error in U_tm_wrtrec errno=20: Not a directory
The messages here indicate that a
tm process encountered file corruption in the PROD.POINTER
file. The nature of the corruption was not severe enough to be a fatal
error (which would have brought down the database). The file in
question needs to be repaired using
index: write temporary file error errno=28.
This error is triggered by BUILD.INDEX when the temp directory used by BUILD.INDEX has filled. BUILD.INDEX uses whatever directory is specified by the environment variable $TMP.
Index error when adding 002001665 U_____Report Error____ at Tue Dec 31 14:39:11 2002 Funcname: U_sbr_upd_altkey...Syscall: U_tm_upd_altkey...calltimes: 0... Filename:/mccdata/mcc07/MCC.DATA2/GB.GROUP/GB.GROUP ... Key: ... Number: 2320 U_Errno: 0 errno: 20
This error indicates a
process was unable to update an alternate index on a recoverable file.
This was not a fatal error, but the index for the GB.GROUP file needs
attention. In order to repair it, list the indexes on the file using
LIST.INDEX. Delete the indexes on the file using DELETE.INDEX GB.GROUP
ALL (the ALL keyword results in the complete removal of the index
file). Then recreate the indexes and rebuild the index file. On an RFS
system (prior to UniData Version 7.1), the index file cannot be rebuilt
if any UniData process has the primary file open. At Version 7.1, you
can use the ONLINE keyword, which allows rebuilding the index on an
WARNING:  throw away an illegal message, len=12, cmd=1.
The basic communication between a
udt process and its buddy
tm process is
though operating system message queues (UNIX). There is a dynamic
security code for each message that the receiver side can anticipate.
This warning message indicates that the
tm process is
ignoring a message it just received because of an invalid security
code. This situation can happen when:
udtprocesses share the same UDTNO. This login timing issue has generally been resolved.
- There is a leftover message from the previous
tmpair assigned to the same UDTNO -- typically after a
udtprocess is killed or ends abnormally.
Invalid tm_afid,tm:1, U_tm_close, Sat Jul 7 14:45:33 2004, tm_afid:8, U_tm_aft_n:90.
udt process opens a data
tm buddy sends back a file descriptor called
TM AFID to the
udt process. This
AFID is verified by the
tm process whenever
udt asks the
tm to perform any operations on
the file. This message indicates that
1 is not currently a
TM AFID. This problem is most likely caused by the
udt process double-closing a
TM AFID, but has
not been isolated and reproduced. There does not seem to be any
consequence to the running application.
U_am_os_open(), fname=/data/fin/wbf/sysman/VOC errno:2 TM:U_am_os_open return error: afid:52, sub_nth:0, aborted SM checked: tm (udtno = 175, pid = 511660, udtuid = 51004, udtgid = 1108, cur_request = 4): Stopped because of IOT instruction - core dumped
Any problem that is fatal to a
process will bring down the database. In this case, a
process had a problem reading a record from the VOC file of a UniData
account. This resulted in the
tm process dumping core and
SM checked: tm (udtno = 1, pid = 565412, udtuid = 0, udtgid = 0, cur_request = -1): Stopped because of Kill
This message indicates that a
process was killed with a kill signal (likely -9). This will trigger a
database crash. This could have been done by an administrator or by a
----- System Crashed at Dec 06 2004 16:31:40 ----- All possible CM & TMs & AIMGLOGs & BIMGLOGs killed Dumping the system buffer to "/usr/ud60/bin/rfs.dump"......Done. Journal hasn't been started. RM has already been stopped. CLEANUPD stopped successfully. SBCS stopped successfully. SMM stopped successfully. All user processes killed. SM exited with code 5.
This set of messages represents a typical
database crash. In this particular case, the crash was initiated
tm process dumped core. The cause of the crash
will be reflected in the set of messages just preceding this set of
ar_backupd: msg Q removed! U_go_to_sleep(): udt_msgrcv() error: Identifier removed CM: cm_main(): receive message error Errno=0 errno=22,at Tue Dec 21 14:48:04 2004.
These three messages were recorded on a system experiencing problems with message queues. Specifically, the message queues necessary for UniData to work had been removed unexpectedly. The messages represented three different manifestations of the same problem. The database was unresponsive so the administrator initiated a failover to a hot-standby system.
CM, Mon Dec 30 11:20:55 2002: The 64th Big tree has writing operation unfinished. Please check the tm errno=2.
When the checkpoint manager (
switches a check point, it waits for the current WRITE operations to
finish by checking if the write count on a BIG (Block Index Group) tree
is zero. When the above error is displayed, the write count never
reaches zero for that particular BIG. This causes the RFS database to
Recent cases suggest that some combination of RFS alternate index file-level operations could cause such a deadlock. This problem has not yet been reproduced and isolated.
----- SM (force) Shutdown at Oct 06 2004 14:34:16 ----- SM stopped successfully.
This is the message written into the sm.log
when an administrator has issued the
command to bring the database down immediately (with "force").
Note: If you stop your RFS
database with force, you must next let it recover by running
startud. This will initiate the recovery process and go
through the undo and redo steps that result in clean, uncorrupted files
on disk. After recovery has completed, you can run
-f option. At that point, your UniData files
will be intact. You could then run
reconfigure/reinitialize your logs, or turn off RFS and restart UniData
without RFS enabled. If you were to do either of these operations after
stopud -f without allowing recovery, you would very likely
have corrupted UniData files.
The Last archive file (/log/udt/rfsarch/arch01) is LSN - 0 SM stopped successfully.
This set of messages is the result of a
$UDTBIN/stopud command with Archiving
Special thanks to researchers: Qiang Lin and Bing Lao.
And to technical reviewers: Tim Snyder, John Jenkins, and Jonathan Smith.
Administering the Recoverable File System on UNIX: Review the
UniData 7.2 reference manual for RFS on UNIX.
Administering the Recoverable File System on Windows Platforms:
Review the UniData 7.2 reference manual for RFS on Windows.
- "Zeroing in on
UniData problems" (developerWorks, September 2004) assists with the
problem determination process for someone administering or supporting
- Learn more about U2
products by visiting the Rocket U2 Web site.
Get products and technologies
- Rocket U2 Product Downloads: Download a trial version of UniData for Windows or Linux.
After nine years of developing, deploying and supporting application software for small manufacturing companies, in 1991 Wally Terhune joined the technical support team of a fledgling database company named Unidata. Since that time he has worked exclusively in the technical support group, wearing a variety of hats and in various senior positions including that of manager and director. Wally’s current role is that of U2 Support Architect. He works closely with customers and the U2 development team to get to the root of problems associated with running U2 (UniData and UniVerse) products. The UniData database continues to be his primary focus and passion.