Database Maintenance

From EOVSA Wiki
Jump to navigation Jump to search

Description of Databases

Initially we had a single database, eOVSA06, with many tables holding varioius system monitoring information at a 1-s cadence, tables with scan header information at a several-per-day cadence, and the calibration table (abin) with descriptors and data for various calibration types. However, that database, operating on the SQLserver Windows operating system, has grown too large and unwieldy (7.5 TB as of early 2022), and is impossible to use at other sites. The power outage associated with the Feb 2022 brush fire caused a major scare when the SQLserver machine could not be rebooted. We have largely recovered the database, but it is clearly dangerous to have only the one copy of this critical data.

In response, I have created a new database called eOVSA in the MySQL open standard, which contains only those columns from the 1-s stateframe needed for calibration (much smaller, at about 120 GB as of early 2022), and those columns needed from the scanheader, as well as the entire abin calibration database. This 125 GB database can be easily copied to other sites so that calibration of raw data can be done from anywhere. In concert with this, I have transitioned the EOVSA control and calibration software to Python3, creating two packages: eovsapy (routines that can be run from anywhere, and will form the main package for sharing with the world) and eovsactl (routines that are strictly for control of the EOVSA system and should only be run from the OVRO site). One issue, of course, is that this smaller database has to be kept up to date as more data are taken.

This wiki page is a convenient location for the specific commands needed to create and maintain these additional databases, as well as some documentation of how to back up the core part of the SQLserver (full) database.

SQLserver

The single eOVSA06 database has some disk errors that cause problems when trying to do a full backup, which is quite annoying. However, the way the complex table structure is filled out is to simply transmit binary data to three tables, abin, fbin, and hbin (calibration, stateframe, and scanheader tables, respectively). There are a number of "stored procedures" and "scalar-valued functions" (SPs and SVFs) that parse these binary records into the many other tables (and views) used in the database, but for saving the raw data it is enough to simply save these binary files and the corresponding code for the SPs and SVFs. The latter have been saved on the NAS4 RAID system under the folder /nas4/Gil/SP_and_SVF_code. The table structure has also been saved in the folder /nas4/Gil/Table_Structure. These binary files take about half the space of the full database. I have created a new database named test (not very original) in /nas4/SQL_LOG to which I have copies all of the recoverable abin, hbin, and fbin records through 2022-Feb-16 when the data stopped being recorded to the SQL database. This can serve as an emergency backup, but of course will need to be kept up to date as well.

Saved "Backup" Files

Contents of /nas4/SQL_LOG (backup "test" database containing tables abin, fbin, hbin)

-rwxr-xr-x 1 sched helios     117506048 Mar  3 00:09 eOVSA_xfer_2_GilInNJ
-rwxr-xr-x 1 sched helios  285778903040 Mar 17 00:20 test_log.ldf
-rwxr-xr-x 1 sched helios 3681028145152 Mar 17 00:20 test.mdf

Contents of /nas4/Gil/SP_and_SVF_code (stored procedures and single-valued functions needed to recreate database)

-rwxr-xr-x 1 sched helios   1179 Mar 13 06:13 SPs_AllOfEm.sql
-rwxr-xr-x 1 sched helios 527572 Mar 13 06:11 SPs_AllOfEm.txt
-rwxr-xr-x 1 sched helios   1503 Mar 14 04:46 SVFs_AllOfEm.sql
-rwxr-xr-x 1 sched helios   6173 Mar 14 04:45 SVFs_AllOfEm.txt

Contents of /nas4/Gil/Table_Structure (SQL script for recreating the tables)

-rwxr-xr-x 1 sched helios   8589 Mar 14 16:34 Tables_AllOfEm.sql
-rwxr-xr-x 1 sched helios 234161 Mar 14 16:32 Tables_AllOfEm.txt

Keeping "test" Up-to-Date

The "test" database was created by copying all hbin, fbin, and abin records from the original database through about 2022-Feb-16 when the SQL system stopped being updated. While copying these records we ran into a small number of read errors (3?) in copying individual fbin records. The way the copy was done is to create the tables from scripts generated from the SQLserver eOVSA06 database, and then run the following script from the SQLserver SSMS (SQL Server Management Studio):

USE [test]
DECLARE @cnt BIGINT = -2147483646
DECLARE @msg VARCHAR(50)
set identity_insert hbin off
set identity_insert fbin on
set textsize 2147483647
WHILE @cnt < -1922969932
BEGIN
  insert into fbin (Id, Bin) select top 100000 * from eOVSA06.dbo.fbin where id between @cnt and @cnt + 100000
  set @cnt = @cnt + 100000
  select @msg = 'CURRENT RECORD NUMBER' + STR(@CNT,15)
  RAISERROR(@msg, 0, 1) WITH NOWAIT
END

This script starts with the first record in the fbin table (record ID -2147483646) and transfers 100000 records at a time, looping until the last record ID (-1922969932). The statements regarding @msg are just to print out the current record number after each 100000 were transferred, in case of an error. This transfer required about 90 h to complete. In practice, the transfer would die at some point due to the above-mentioned dik error, in which case I would reduce the 100000 to a smaller number, e.g. 1000, and change the @cnt declaration to the last ID printed out, and thus pin-point the failing record number (within 1000 records or so) and then set @cnt just beyond the failure point and go back to 100000 records at a time.

Anyway, now that it is done we can use a version of the above to keep the "test" database up to date by running a similar transfer once per week or so. Annoyingly, it seems that the SQLserver SSMS forgets its connection to the \\NAS4\nas4 share. To reestablish it:

EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO

EXEC sp_configure 'xp_cmdshell',1
GO
RECONFIGURE
GO

EXEC XP_CMDSHELL 'net use N: \\nas4\nas4 /user:whatever\<username> <password>'

where the username and password are the credentials for connecting to the share. At this moment, there is an issue connecting to the "test" database from SMSS on the SQLserver. It is complaining that the database is at a higher version level, which is nonsense. What seems to have happened is that I connected a later version of SMSS to the database from another computer, and although I did not write to it I suppose something was noted in the log. I'll have to look for a way around this before we can add more records. Ah, Microsoft...

Updating SQL with Saved Records

If/when the SQL server is not available for any reason, observations are still possible by writing the "raw" stateframe and scanheader records to the disk for later restoration to the main SQL database. Such records are currently written to /nas4/Tables/scanheader/ and /nas4/Tables/stateframe folders. The following commands would restore a list of scanheader logs contained in the files variable:

from stateframedef import scanheader_log2sql
from time import time
for file in files:
    t = time(); scanheader_log2sql('/nas4/Tables/scanheader/'+file); print 'Took',time()-t,'seconds'

Each day takes only 1-2 s to restore.

Similar commands are used to restore stateframe records:

from stateframedef import badlog2sql
from time import time
for file in files:
    t = time(); scanheader_log2sql('/nas4/Tables/stateframe/'+file); print 'Took',time()-t,'seconds'

But note in the case of stateframes that it takes about 40 min to restore a single day.

MySQL