Skip to content. | Skip to navigation

Personal tools
You are here: Home Accelerate Articles UniData Recoverable File System (RFS)

UniData Recoverable File System (RFS)

Database and system administrators who are responsible for a UniData system that has the Recoverable File System (RFS) feature enabled will find this article most helpful. All RFS status and error messages are recorded in the $UDTBIN/sm.log file. You will find examples to help you find and decipher the messages to determine the appropriate actions needed to keep your system running smoothly.

Messages and errors in sm.log

Level: Advanced

Wally Terhune (, U2 Support Architect, Rocket Software

15 Dec 2005



Do you administer a UniData (U2) database with the Recoverable File System (RFS) enabled? Then you know that status and error messages logged by UniData's RFS sub-system are sometimes difficult to understand. This article will highlight commonly occuring messages and describe appropriate actions to take. In some cases you may be able to use this information to recover from a crash without additional assistance.

Messages logged during normal database startup (Restart_flag=0)

Checking log files .....                                   
----- SM (98320) is started at Dec 22 2004 10:26:41 -----

Unidata Environment : /usr/ud60/include

SM: Restart_Flag = 0.

These are the normal messages logged during a startup of a UniData database that has previously been shut down gracefully. This is indicated by the Restart Flag=0. If the database was not shut down gracefully, the Restart Flag will be set to 1 and recovery steps will follow.

ERROR in function U_s_setupsca(), file s_setsca.c:
*** file log too small

--- called by process 1198038
File log size should be at least 2501 blocks.
Please change the file log size and run cntl_install,
then startup system.

This set of messages is logged during database startup if there is a mismatch between the file-level log settings in logconfig and the NUSERS setting in udtconfig. The file log size must be at least 1 greater than NUSERS. This typically happens when an administrator adjusts NUSERS and neglects to reconfigure the logconfig file and run cntl_install to reinitialize the logs. Here is the 0120 (file-level log) line of a small logconfig file configured for NUSERS=40:

/disk1/ud61/log/filelog_0 0120 4096 0 41

Note: This is a fatal startup error. The database will not start until corrective action is taken.

Checking log files .....
check_rdud:ReadHdr SM exited with code 3.

This message is reported when a log file header cannot be positioned correctly (using the lseek() system call) or the content of the log file header cannot be decoded. This could happen if the logconfig file is changed after UniData has been shut down, or if the log file header is corrupted. To correct the problem, check the log configuration file and the specified log files. You may need to run cntl_install to reinitialize the log files.

Note: This is a fatal startup error. The database will not start until corrective action is taken.


Messages logged during normal database recovery (Restart_flag=1)

----- SM (9778) is started at Dec 28 2004 15:14:36 ----

Unidata Environment : /usr/ud60/include

SM: Restart_Flag = 1.
Starting: Tue Dec 28 15:14:46 2004

RM: Replication is off
recovery: RM started, pid:9779
Starting to restart ..........
Undo logfile[2] .....
The logfile[2] has been scanned.
Undo logfile[2] has been undone.
Undo logfile[3] .....
The logfile[3] has been scanned.
Undo logfile[3] has been undone.

*****!!! Restart Finished !!!*****
restart: Report of Log-file Status.
   #     Type          Checkpoint-Number    Status
   0     after-image                   0         0
   1     after-image                   0         4
   2     before-image                  0         4
   3     before-image                  0         4
   4     after-image                   0         4
   5     after-image                   0         4
   6     before-image                  0         4
   7     before-image                  0         4
Link back all new blocks ....
Link new blocks successful !
Start the redo processing ....

Finish the redo processing.
All undo/redo recovery processing has been finished!!

Redo logfiles finished.

Step3: Check the file level commands...

File level commands checking finished.

Restart is successful!!!
RM has already been stopped.
recovery: RM stopped.
Checking log files .....
The system is normally running.

These are the normal messages logged during a successful recovery of the database after a database or system crash. This is from a UniData Version 6.0.9 with Replication turned off. Notice the Restart Flag=1. This indicates that recovery is required. First we undo the updates by applying the before-image logs. The number of logs will vary depending on your logconfig settings. Then the log-file status is displayed. Then we redo updates from the after-image logs and check for file level commands in the logs. The message "The system is normally running" indicates a successful recovery.

TM can only be started by SM.
SM checked: tm (udtno=1, pid=75602, udtuid=0, udtgid=0,
    cur_request= -1 ): Initialization failed.

This set of messages occurs if a user is attempting to start a new UniData session while database recovery is still in progress. These messages will be interspersed with the other messages generated during database recovery. This can result in file corruption. Users should be blocked from starting a new UniData session until the database has completed all recovery steps. Starting at UniData Version 6.1.7, the database will block new user sessions. Prior to that release, the administrator should take steps external to UniData to prevent users from attempting to start new sessions.

addfend_tbl:not in the endfile table
undo_recv:Scan_Undo_Log Errno=10001 errno=2,at Tue Dec 21
    15:11:01 2004
U_crashrest:U_undo_recover: 0 Errno=10001 errno=2,at Tue Dec 21 
    15:11:01 2004
restart:U_CrashRest Errno=10001 errno=2,at Tue Dec 21 15:11:01 2004
RM has already been stopped.
recovery: RM stopped.
SM exited with code 4.

This set of messages was recorded during a failed recovery. This indicates a confusion about the status of the log files. The problem was triggered by crashing the system immediately after a successful recovery (and before the first check point). It is the result of a product defect (e7412). This problem is corrected in UniData Version 6.1.7 and later.

Note: The only way to get the database restarted after this error was to run cntl_install to reinitialize the logs.

Undo logfile[8] has been undone.
write_back_newblk: undo_addblk Errno=22 errno=22,at Fri Sep 10
    02:46:17 2004
Errno=22 errno=22,at Fri Sep 10 02:46:17 2004
restart:U_CrashRest Errno=22 errno=22,at Fri Sep 10
    02:46:17 2004
******* restart: Cannot Restart!!!
Link back all new blocks ....
write_back_newblk: undo_addblk Errno=22 errno=22,at Fri Sep 10
    02:46:17 2004
Errno=22 errno=22,at Fri Sep 10 02:46:17 2004
restart:U_CrashRest Errno=22 errno=22,at Fri Sep 10 02:46:17 2004
SM exited with code 4.
U_go_to_sleep(): udt_msgrcv() error: Identifier removed
SM checked: bimglog (pid = 58598): Exit code 1, status(0x100)
----- System Crashed at Sep 10 2004 02:46:19 -----
All possible CM & TMs & AIMGLOGs & BIMGLOGs killed

This set of messages was recorded during a failed recovery. The database crash preceding this recovery was forced by the system administrator after user processes appeared to hang and "File udt_system(s) full for file path_to_file, pid - nnnn" messages were recorded in the sm.log. The "file udt_system(s) full" message was actually the result of a UniData static hashed file expanding to a size that was not an even multiple of the file's block size, and becoming inaccessible. The messages in the sm.log during restart were the result of attempts to update this file. Two fixed issues relate to this problem. With the fix for issue e7185, UniData now prohibits an RFS file from expanding beyond an even multiple of its block size and becoming inaccessible. This fix resolves the problem with other processes becoming unresponsive as well as the database recovery issue, by not allowing the original situation to occur. The fix for issue e7187 corrects the "full" message to actually display the specific file name and the PID of the tm process struggling (instead of the PID of the cm or syncd process). These fixes are all available in UniData Versions 6.0.16, 6.1.1, and later.

Index: not a free node address: 651165696
Index: Can not split node

This message indicates that a UniData alternate index cannot be updated during RFS recovery. The specific reason is that the first node of the free chain of the index file is not marked as a "Free Node." (Each index node has a special flag to indicate the type of the node.) The index file can no longer be updated because there is no free index node.

Rocket U2 identified a problem (e6650) in RFS crash recovery handling where an index node is freed and not marked as a "Free Node" during the undo phase of recovery. This problem has been corrected at UniData Versions 6.0.14, 6.1.0, and later.

Runtime warnings and normal messages

CM: Unknown msg type rcvd.

For RFS systems with Archiving enabled, this message will appear in the sm.log each time a checkpoint occurs. It is the result of a harmless coding error within the UniData engine. No corrective action is necessary. Starting with UniData Version 6.1.6, this message no longer occurs.

CM, Wed Dec 1 21:23:57 2004:WARNING: The log file [/dev/rbimglv01] is
    too small to contain the log records.
Please enlarge the log files.

This warning is fairly clear. It is generated by the checkpoint manager (CM). This indicates that the specified RFS log is too small to contain all of the log records for a checkpoint. UniData will handle this condition by writing records to the LOG_OVERFLO directory, but you should reconfigure the logs so that this extra I/O overhead is not necessary.

Note: If the logs overflow and UniData is unable to write to the LOG_OVERFLO directory for any reason (for example, if the disk is full), the database will crash.

Error when attaching shm (4980785, 0, 0), errno=24
==TM(pid:34685034) calling reportIpcState(); logfile:
funcname:U_getshmadd(A), errno:24. Current time=Thu Dec 2 12:50:57


get_addr.c:U_mi_addr:U_getshmadd:cannot attach
Error when attaching shm (4980777, 0, 0), errno=24
==TM(pid:8536212) calling reportIpcState(); logfile:
funcname:U_getshmadd(A), errno:24. Current time=Thu Dec 2 12:57:22

Both of the above messages (3a and 3b) represent a tm process failing to attach needed shared memory segments. The tm/udt pair will exit (the error is fatal to them), but the database will not crash. The message refers to an ipclog file; this file contains details about the shared memory use by the tm/udt pair, as well as overall system use of shared memory at the moment the tm processes exited. The information in the ipclog can be used to tune your system kernel or UniData shared memory parameters in the udtconfig file to eliminate the error.

U_sbb_rlsblk(): mismatched current first put entry and saved first
    put entry.

This message is produced by guide running on a recoverable dynamic file. It is just a warning produced by a UniData coding error. This is recorded as engineering case e6286. This message will not occur in versions of UniData that are 6.0.11 or later.

SM stop_tm: TM 15 has already exited.

The message "SM stop_tm: TM 15 has already exited" is produced by the sm daemon while it is processing a STOP_TM request from the smm daemon. smm issues the STOP_TM request on an RFS-enabled system, if it finds a user process is dead but that process still occupies an entry in the UniData user process table. If sm finds that the corresponding tm process for this dead udt process has already exited from the system, sm prints out the message.

Putting this another way, if both the udt process and its corresponding tm process have exited or been killed and the user process table in shared memory has not been cleaned up, the system produces this message. There are several scenarios that can produce such a situation:

  • During the UniData process login procedure, the udt process gets killed or abnormally exits (for example, core dumps) before its corresponding tm process has been initiated.
  • During the UniData process logout procedure, after requesting tm to exit, the udt process gets killed or exits abnormally.

Note: The above logout procedure can either be a normal user-initiated logout, a logout as a result of severe system errors, or logout as a result of being asked to (udt process receiving a kill signal, including the deleteuser command).

Normally, a deleteuser command won't trigger this message. However, if the deleteuser process encounters further issues, then the message can occur. The deleteuser command will first send a soft kill (kill -15) to the udt process. The udt process will start the logout procedure -- which includes requesting its tm process to exit. These activities may take some time on a very busy system. Meanwhile, the impatient deleteuser command (or the impatient system administrator) will issue a hard kill (kill -9) to the same udt process. The hard kill will immediately kill the process, but the side effect is that the user process table in shared memory won't be properly cleaned up (as it is during a normal udt logout procedure).

archive,Sat Dec  4 15:41:51 2004: The archive file
    /udlogs/arch/arch11 is full. 
The Logical Sequence Number (LSN) of this archive is -- 1

ar_backupd, Sat Dec  4 15:41:51 2004: starting to offload archive
        /udlogs/arch/arch11 (LSN = 1)

ar_backupd, Sat Dec  4 15:42:03 2004: file (LSN = 1) off-loaded.

This is a normal set of messages recorded when an RFS system is configured with Archiving. These messages indicate an archive has filled and been successfully off-loaded.

ar_backupd, Sat Dec  4 21:12:15 2004: starting to offload archive
        /udlogs/arch/arch6 (LSN = 58)

The archive file /udlogs/arch/arch6 can not be off loaded.
You may want to make sure that this file has been saved.

Also, label it Logical Sequence Number (LSN) -- 58
Please create /usr/ud61/include/DONE file (as root) when done...
ar_backupd, Sat Dec  4 21:34:19 2004: off-loading of archive
    /udlogs/arch/arch6 (lsn 58)
must be completed before system can progress.

ARCH: waiting for the file (/udlogs/arch/arch6) to be off-loaded ...
ARCH: waiting for the file (/udlogs/arch/arch6) to be off-loaded ...

This set of messages indicates that all of the archives are full, and the specified (oldest) archive must be off-loaded before the database can continue. The database will not be responsive to users trying to update records. The last message will repeat until the administrator corrects the situation and signals the database to proceed by creating the DONE file in the specified UniData include directory.

U_tm_login msgsnd error: Invalid argument
SM checked: tm (udtno = 6, pid = 63604, udtuid = 4191, udtgid = 300,
    cur_request = 1 ):
Initialization failed.
sendback: msgsnd: Invalid argument

This series of messages is reported when the sm daemon is trying to start a tm process during udt process initialization. The first message is reported by the tm process when it tries to send a login success message to its buddy udt process. Part of the message structure is the process ID of the udt process. However, the udt process ID recorded in the shared memory process table is less than 1 (most likely it is -1). This causes the msgsnd() call failure. Because of the msgsnd() failure, the tm process exits with error code U_TM_INIT_FAIL_EXIT. The error code is caught by sm and reported in the log. When sm tries to deliver an error message to notify the udt process, the msgsnd() system fails again because of the same reason of the first message (the message type is still set to -1).

This problem could happen when a udt process tries to start its tm process (through the help of smm and sm), and the Local Control Table entry for the udt process is inappropriately cleaned or freed while tm is starting.

This suggests that there may be a timing issue in the UniData RFS login, logout, and cleanup architecture, but this problem has not been reproduced. The impact on the user is minimal (launching a new udt process fails, but a retry is generally successful). Starting at UniData Version 6.1.7, additional details about the specific message command that fails is included in the sm.log (e7543). This may provide information that would lead to reproducing this problem.

File udt_system(s) full for file /ud/KEY-DATA/FIN.RPT, pid - 56222

This message is misleading. The "file udt_system(s) full" message was actually the result of a UniData static hashed file expanding to a size that was not an even multiple of the file's block size and becoming inaccessible. Starting with UniData Version 6.0.16 and 6.1.1, this problem can no longer occur. The fix for e7185 UniData now prohibits an RFS file from expanding beyond an even multiple of its block size and becoming inaccessible. If this situation were still to occur in some unanticipated manner, the message in the sm.log will now be more meaningful (e7187).

Note: Although this problem does not specifically trigger a database crash, the database does become unusable. Both the check point manager daemon (cm) and one of the sync daemons (syncd) start spinning and become unresponsive.

==Fri Sep 10 06:21:54 2004 TM[431], file-corruption detected,
    call U_fdbg_verify_put(963081)==
TM[431] found the file in its Unix-open-table.
TM[431] calls U_CheckFileInfo() to report the details.
== TM[431]:U_CheckFileInfo(afid:1086, sub_nth:00000000, fd:15,
   putno:963081 ==
TM-fd's (st_dev:3473450, st_ino:19).
AFT-fd's (st_dev:3473450, st_ino:19).

U_____Report Error____
Funcname: write_record...Syscall: long record check...U_Errno: 4...
    Filename: /ud/KEY-DATA-PROD/PROD.POINTER  ...Key: ...Number: 2852

U_____Report Error____
Funcname: U_add_record...Syscall: long record check...U_Errno: 1...
    Filename: /ud/KEY-DATA-PROD/PROD.POINTER  ...Key: 
    001!FO1067137!064...Number: 5842944

U_____Report Error____
Funcname: modify_record...Syscall: long record check...U_Errno: 1...
    Filename: /ud/KEY-DATA-PROD/PROD.POINTER  ...Key: 
    001!FO1067137!064...Number: -1

U_____Report Error____
Funcname: U_append_strtuple...Syscall: long record check...U_Errno:
    1... Filename: /ud/KEY-DATA-PROD/PROD.POINTER ...
    Key: 001!FO1067137!064 ... Number: 2852U_sbr_addrec error
    in U_tm_wrtrec
errno=20: Not a directory

The messages here indicate that a tm process encountered file corruption in the PROD.POINTER file. The nature of the corruption was not severe enough to be a fatal error (which would have brought down the database). The file in question needs to be repaired using guide and fixfile (or dumpgroup and fixgroup).

index: write temporary file error

This error is triggered by BUILD.INDEX when the temp directory used by BUILD.INDEX has filled. BUILD.INDEX uses whatever directory is specified by the environment variable $TMP.

Index error when adding 002001665

U_____Report Error____ at Tue Dec 31 14:39:11 2002
Funcname: U_sbr_upd_altkey...Syscall: U_tm_upd_altkey...calltimes:
    0... Filename:/mccdata/mcc07/MCC.DATA2/GB.GROUP/GB.GROUP   ...
    Key: ... Number: 2320 U_Errno: 0 errno: 20

This error indicates a tm process was unable to update an alternate index on a recoverable file. This was not a fatal error, but the index for the GB.GROUP file needs attention. In order to repair it, list the indexes on the file using LIST.INDEX. Delete the indexes on the file using DELETE.INDEX GB.GROUP ALL (the ALL keyword results in the complete removal of the index file). Then recreate the indexes and rebuild the index file. On an RFS system (prior to UniData Version 7.1), the index file cannot be rebuilt if any UniData process has the primary file open. At Version 7.1, you can use the ONLINE keyword, which allows rebuilding the index on an active file.

WARNING: [3385] throw away an illegal message, len=12, cmd=1.

The basic communication between a udt process and its buddy tm process is though operating system message queues (UNIX). There is a dynamic security code for each message that the receiver side can anticipate. This warning message indicates that the tm process is ignoring a message it just received because of an invalid security code. This situation can happen when:

  • Two udt processes share the same UDTNO. This login timing issue has generally been resolved.
  • There is a leftover message from the previous udt/tm pair assigned to the same UDTNO -- typically after a udt process is killed or ends abnormally.
Invalid tm_afid,tm:1, U_tm_close, Sat Jul 7 14:45:33 2004, tm_afid:8,

When a udt process opens a data file, its tm buddy sends back a file descriptor called TM AFID to the udt process. This TM AFID is verified by the tm process whenever udt asks the tm to perform any operations on the file. This message indicates that 1 is not currently a valid TM AFID. This problem is most likely caused by the udt process double-closing a TM AFID, but has not been isolated and reproduced. There does not seem to be any consequence to the running application.


Run-time errors that bring down the database

U_am_os_open(), fname=/data/fin/wbf/sysman/VOC errno:2
TM[175]:U_am_os_open return error: afid:52, sub_nth:0, aborted
SM checked: tm (udtno = 175, pid = 511660, udtuid = 51004,
    udtgid = 1108, cur_request = 4): 
Stopped because of IOT instruction - core dumped

Any problem that is fatal to a tm process will bring down the database. In this case, a tm process had a problem reading a record from the VOC file of a UniData account. This resulted in the tm process dumping core and exiting abnormally.

SM checked: tm (udtno = 1, pid = 565412, udtuid = 0, udtgid = 0,
    cur_request = -1): Stopped because of Kill

This message indicates that a tm process was killed with a kill signal (likely -9). This will trigger a database crash. This could have been done by an administrator or by a script.

----- System Crashed at Dec 06 2004 16:31:40 -----
All possible CM & TMs & AIMGLOGs & BIMGLOGs killed
Dumping the system buffer to "/usr/ud60/bin/rfs.dump"......Done.
Journal hasn't been started.
RM has already been stopped.
CLEANUPD stopped successfully.
SBCS stopped successfully.
SMM stopped successfully.
All user processes killed.
SM exited with code 5.

This set of messages represents a typical database crash. In this particular case, the crash was initiated because a tm process dumped core. The cause of the crash will be reflected in the set of messages just preceding this set of messages.

ar_backupd: msg Q removed!

U_go_to_sleep(): udt_msgrcv() error: Identifier removed

CM: cm_main(): receive message error Errno=0 errno=22,at Tue
    Dec 21 14:48:04 2004.

These three messages were recorded on a system experiencing problems with message queues. Specifically, the message queues necessary for UniData to work had been removed unexpectedly. The messages represented three different manifestations of the same problem. The database was unresponsive so the administrator initiated a failover to a hot-standby system.

CM, Mon Dec 30 11:20:55 2002: The 64th Big tree has writing operation
Please check the tm

When the checkpoint manager (cm) switches a check point, it waits for the current WRITE operations to finish by checking if the write count on a BIG (Block Index Group) tree is zero. When the above error is displayed, the write count never reaches zero for that particular BIG. This causes the RFS database to become unresponsive.

Recent cases suggest that some combination of RFS alternate index file-level operations could cause such a deadlock. This problem has not yet been reproduced and isolated.

Messages recorded during normal shutdown of the database

----- SM (force) Shutdown at Oct 06 2004 14:34:16 -----
SM stopped successfully.

This is the message written into the sm.log when an administrator has issued the $UDTBIN/stopud -f command to bring the database down immediately (with "force").

Note: If you stop your RFS database with force, you must next let it recover by running startud. This will initiate the recovery process and go through the undo and redo steps that result in clean, uncorrupted files on disk. After recovery has completed, you can run stopud without the -f option. At that point, your UniData files will be intact. You could then run cntl_install to reconfigure/reinitialize your logs, or turn off RFS and restart UniData without RFS enabled. If you were to do either of these operations after stopud -f without allowing recovery, you would very likely have corrupted UniData files.

The Last archive file (/log/udt/rfsarch/arch01) is LSN - 0

SM stopped successfully.

This set of messages is the result of a graceful $UDTBIN/stopud command with Archiving enabled.

Special thanks

Special thanks to researchers: Qiang Lin and Bing Lao.

And to technical reviewers: Tim Snyder, John Jenkins, and Jonathan Smith.




Get products and technologies


About the author


Wally Terhune photo

After nine years of developing, deploying and supporting application software for small manufacturing companies, in 1991 Wally Terhune joined the technical support team of a fledgling database company named Unidata. Since that time he has worked exclusively in the technical support group, wearing a variety of hats and in various senior positions including that of manager and director. Wally’s current role is that of U2 Support Architect. He works closely with customers and the U2 development team to get to the root of problems associated with running U2 (UniData and UniVerse) products. The UniData database continues to be his primary focus and passion.



Document Actions