Top-level troubleshooting

This section collects advice and procedures for unusual problem situations.

Restarting everything

After a power failure or power cycle, there are some things that must be restarted.

AOC NFS Server

In an ssh session to aoc, become root with sudo -s. Restart the NFS daemon with service nfs restart.

AOC code

If the aoc code is not running, ssh into aoc. Make sure the power for the WFS is on, and then simply run start.

Occasionally, you will need to restart the AOC code when it is already running. To reset the AO WFS camera and the AOC code, do the following:

Make sure all mirrors are zeroed and powered off:

IDL> zeroall
IDL> tlc_move_power,['AO_TT','WOOFER','MEMS'],/off

Quit the AOC code. The easiest way to do this is from IDL:

IDL> rtc_issue_command, 'quit'

Power cycle the AO WFS (turn its power off, then back on) using either the TLC GUI or tlc_move_power in IDL.

Restart the AOC code: ssh into aoc and run start.

Known Issues

AO WFS pixels look strange

Sometimes the AO WFS loses synchronization. This manifests itself as an unusual pattern when the light source is off, and as the absence of a true pupil when the light source is on. The simplest solution is to restart the AOC code.

Dark Stripes Appear on WFS

_images/blown_fuse_board1.png

Figure: Dispraw image with blown woofer fuse (board 1)

_images/blown_fuse_board2.png

Figure: Dispraw image with blown woofer fuse (board 2)

Congratulations! You’ve blown a woofer fuse. There is no software fix.

AO WFS loses light while loops are closed

Several software and hardware safety mechanisms will prevent the hardware from being damaged in this case, but you should still set the mirrors to zero voltage as soon as possible while you diagnose why the light went away. Note that as soon as the AO registers that there is no light (or insufficient light), the RPC opens all of the AO loops.

In general, there are two primary reasons for losing light under lab conditions:

  1. The tip/tilt destabilizes and pushes the beam off the WFS. This is most likely due to the AO PnC mirrors winding up and saturating, and will be accompanied by an error in the track assembly.
    • Open all AO loops, either with openall from IDL or via the GUI. If using the GUI, make sure to stop CAL corrections to the AO PnCs or take the PnCs out of track.
    • Zero all mirrors (zeroall from IDL).
    • Stop and datum the AO PnCs either using the IDL command (tlc_command_pnc_pc) or via the track assembly GUI.
    • Diagnose why the AO PnCs saturated before reclosing (most likely bad commands from the CAL).
  2. The supercontinuum source shut off due to overheating or a sensor glitch. The supercontinuum laser has software temperature limits that set the laser power to zero if exceeded. These limits are occasionally tripped by a sensor glitch that momentarily reports high temperatures.
    • Open all AO loops, either with openall from IDL or via the GUI. If using the GUI, make sure to stop CAL corrections to the AO PnCs or take the PnCs out of track.
    • Zero all mirrors (zeroall from IDL).
    • If the shutoff was due to a sensor glitch (i.e., the temperature reading in the Source Assembly GUI immediately returns to normal levels, ~30 deg C), init the ASU from the Source Assembly and turn the laser back on.
    • If the laser actually overheated, power off the supercontinuum source at the power bar and wait at least ten minutes before powering it back on and checking the temperature. If working remotely, ask someone on site to check the laser directly.
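The branching in case 2 can be written down as a small decision helper. This is only a sketch, not part of the GPI software: it assumes you have the Source Assembly temperature as a float in deg C, and the 5 deg C tolerance used to decide that the reading has "returned to normal" is a hypothetical choice.

```python
"""Sketch of the supercontinuum-shutoff triage described above.

Assumptions (not from the GPI software): the Source Assembly temperature
is available as a float in deg C, and "returns to normal" means it lies
within a hypothetical tolerance of the ~30 deg C nominal value.
"""

NOMINAL_TEMP_C = 30.0   # normal reading quoted in the procedure
TOLERANCE_C = 5.0       # hypothetical tolerance; adjust as needed
MIN_COOLDOWN_MIN = 10   # minimum power-off time after a real overheat


def laser_recovery_steps(temp_after_shutoff_c: float) -> list[str]:
    """Return the ordered recovery steps for a supercontinuum shutoff."""
    steps = [
        "Open all AO loops (openall, or GUI with CAL corrections stopped)",
        "Zero all mirrors (zeroall)",
    ]
    if abs(temp_after_shutoff_c - NOMINAL_TEMP_C) <= TOLERANCE_C:
        # Reading is already back to normal: treat it as a sensor glitch.
        steps += [
            "Init the ASU from the Source Assembly",
            "Turn the laser back on",
        ]
    else:
        # Reading has not returned to normal: treat it as a real overheat.
        steps += [
            "Power off the supercontinuum source at the power bar",
            f"Wait at least {MIN_COOLDOWN_MIN} minutes",
            "Power back on and check the temperature",
        ]
    return steps


if __name__ == "__main__":
    for step in laser_recovery_steps(31.2):
        print(step)
```

The common first two steps are always returned, which matches the list above: loops open and mirrors zeroed before any decision about the laser.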

Once you are looking at actual photons from the sky, well, there are many more ways in which something can go wrong...

AOC is stuck on a command and refuses to accept any new input

In rare cases, the AOC may take a very long time to complete a command, or may fail in such a way that the system registers as busy and never returns. In these instances, you can abort the current operation by sending an abort command. The simplest way to do this is via IDL:

IDL> aocrpc_passthru,'abort'

If this does not work, you can also attempt to send the command directly via the aoc socket. Start a new IDL session and run:

IDL> start_us_up,/grabsocket
IDL> rtc_issue_command,'abort'

ult_sysalign Hangs During ‘Datuming AO PnCs, Input Fold and flattening Tip/Tilt.’ Step

After being powered on, the AO tip/tilt mirrors will sometimes take a while to respond to commands. This can lead to an infinite loop in ult_task_point, which ult_sysalign calls immediately after powering on the mirrors. The result is a series of tip/tilt commands that produce no movement in the mirrors and a large or diverging error. In these cases, it is simplest to kill the loop (Ctrl+c in the IDL session). You can then either restart ult_sysalign (after issuing a retall command), or reissue ult_task_point (without doing a retall) and, once it finishes, continue by issuing the .continue command.

ult_sysalign Hangs During ‘Aligning System to CAL’ Step

The configuration for this step should be:

  • CAL Tip/Tilt biases set to [0,0]
  • AO Tip/Tilt and Woofer loops closed
  • CAL loop closed
  • AO PnCs in tracking with CAL Correct set to On

If any of these conditions is not met, the alignment will not work. Kill the process (Ctrl+c in the IDL session) and fix the step that did not work (if possible, also diagnose why it failed). Once the system is configured correctly, continue by issuing the .continue command.
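Before issuing .continue, it can help to compare the current state against the required configuration listed above. The sketch below does this for a plain dictionary; the dictionary and its key names are hypothetical stand-ins for however you actually collect the state (GUIs, GMB queries, etc.).

```python
"""Pre-check for the 'Aligning System to CAL' step of ult_sysalign.

The required configuration comes from the list above; the status
dictionary and its keys are hypothetical stand-ins, not GPI names.
"""

REQUIRED = {
    "cal_tt_bias": (0, 0),       # CAL Tip/Tilt biases set to [0,0]
    "ao_tt_loop_closed": True,   # AO Tip/Tilt loop closed
    "woofer_loop_closed": True,  # Woofer loop closed
    "cal_loop_closed": True,     # CAL loop closed
    "ao_pnc_tracking": True,     # AO PnCs in tracking...
    "cal_correct_on": True,      # ...with CAL Correct set to On
}


def misconfigured(status: dict) -> list[str]:
    """Return the keys whose values differ from the required configuration."""
    bad = []
    for key, wanted in REQUIRED.items():
        value = status.get(key)
        if isinstance(wanted, tuple):
            # Compare biases element-wise so lists and tuples both work.
            ok = isinstance(value, (list, tuple)) and tuple(value) == wanted
        else:
            ok = value == wanted
        if not ok:
            bad.append(key)
    return bad


if __name__ == "__main__":
    status = {
        "cal_tt_bias": [0, 0],
        "ao_tt_loop_closed": True,
        "woofer_loop_closed": True,
        "cal_loop_closed": False,
        "ao_pnc_tracking": True,
        "cal_correct_on": True,
    }
    print(misconfigured(status))  # ['cal_loop_closed']
```

A missing key counts as misconfigured, so an incomplete status snapshot errs on the side of flagging a problem rather than hiding one.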

CENTER_PIN Fails to Converge and Drives CAL PZTs to Bad State

Twice during instrument I&T, the CENTER_PIN algorithm in the CAL failed to converge and, in doing so, drove the PZT actuators to positions that steered the beam completely off the HOWFS. At the same time, the PZT starting values in the CAL Defaults file were old enough that they no longer served as sensible starting positions. In at least one of these incidents, the cause of the problem was a silent failure to open one or both of the internal CAL shutters.

The first step in diagnosing this problem is to ensure that you can get light on the HOWFS, which is done with the IDL routine cal_peek. Start by taking an image of the LOWFS with the FPM in place:

IDL> cal_peek,/auto

Assuming ALIGN_FPM executed correctly, you should see a symmetric illumination pattern around the central hole, as in the figure below.

_images/lowfs_align.png

Figure: Auto-scaled CAL LOWFS image after alignment to FPM (output of IDL function cal_peek).

Next, check that you are able to move internal shutters and get light through both the reference and science legs of the interferometer:

IDL> tlc_move_cal,/ref,/open
IDL> tlc_move_cal,/sci,/close
IDL> cal_peek,/howfs,/auto

;;look at the image

IDL> tlc_move_cal,/ref,/close
IDL> tlc_move_cal,/sci,/open
IDL> cal_peek,/howfs,/auto

In the reference leg, you should see a bright fully illuminated pupil, while in the science leg, you should see a much fainter pupil with bright spots corresponding to the bad MEMS actuators.

_images/howfs_ref.png

Figure: Auto-scaled CAL HOWFS image with light through the reference leg (output of IDL function cal_peek).

_images/howfs_sci.png

Figure: Auto-scaled CAL HOWFS image with light through the science leg (output of IDL function cal_peek).

If you are sure that the shutters are opening and closing as expected but are still not getting light on the LOWFS, the PZTs may be completely out of range. First, try reading in the DEFAULTS file (via the CAL Server GUI) to send the PZTs to their default positions. If this still does not get you light on the HOWFS, the DEFAULTS themselves may be out of date, and you must find the last PZT settings generated by a converged CENTER_PIN run. In the CAL log directory ($TLC_ROOT/log), find the log file containing the most recent successful run of CENTER_PIN (if the CAL server has not been restarted in a while, this will probably be the current log file). Look for converged CENTER_PIN runs:

> grep -B 2 "gpCalCenterAlgorithm:  Center has been found." Cal_debugLog_????????_??????

This will return outputs related to both ALIGN_FPM and CENTER_PIN. You are looking for the most recent block of lines that looks like this:

2013-07-08 13:56:56.762230   gpCalCenter.c, 2251: <2> ===> New Tip/Tilt(used by Center Pinhole: (-32662.041016, 42364.410156), gain<0.400000>, err<-663.125610, -333.895355>
2013-07-08 13:56:56.782014   gpCalCenter.c, 1944: <2> gpCalCenterFound: Threshold met <3>#times, need to be met<3>times.
2013-07-08 13:56:56.782019   gpCalCenter.c, 2402: <1> gpCalCenterAlgorithm:  Center has been found. Stop Centering.

The two numbers in the first line (in this case -32662.041016, 42364.410156) are the last found CAL PZT settings. In the CAL config directory (/data/cal/config on the CAL), make a copy of the DEFAULT file (e.g., cp DEFAULT DEFAULT.todaysdate). In the new copy, edit the PztTip and PztTilt lines and enter the values from the log. In the CAL GUI, enter your new filename in the textbox next to the ‘Read Params’ button and click ‘Read Params’. Verify that the PZTs have moved to the desired location via the CAL Show utility, and then check again whether you have light on the HOWFS.
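The grep search above can also be done in a short script that returns the tip/tilt pair directly. This is a sketch that mirrors the grep -B 2 recipe: a Center Pinhole tip/tilt line only counts if the "Center has been found" line follows within two lines. The log format is inferred solely from the excerpt above, and the assumption that ALIGN_FPM blocks can be skipped because their tip/tilt line does not mention "Center Pinhole" is unverified.

```python
"""Find the PZT tip/tilt from the most recent converged CENTER_PIN run.

Mirrors the grep -B 2 recipe: a Center Pinhole tip/tilt line counts only
if "Center has been found" follows within two lines. Assumption: the
log format matches the excerpt in this document, and ALIGN_FPM blocks
do not contain the "used by Center Pinhole" text.
"""
import re
import sys

TT_RE = re.compile(
    r"New Tip/Tilt\(used by Center Pinhole: \(([-+0-9.eE]+),\s*([-+0-9.eE]+)\)"
)
FOUND = "gpCalCenterAlgorithm:  Center has been found."


def last_converged_center_pin(lines):
    """Return (tip, tilt) of the last converged CENTER_PIN run, or None."""
    result = None
    last_tt = None  # (line index, (tip, tilt)) of the latest tip/tilt line
    for i, line in enumerate(lines):
        m = TT_RE.search(line)
        if m:
            last_tt = (i, (float(m.group(1)), float(m.group(2))))
        elif FOUND in line and last_tt is not None and i - last_tt[0] <= 2:
            result = last_tt[1]
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_pzt.py Cal_debugLog_YYYYMMDD_HHMMSS
    with open(sys.argv[1], errors="replace") as f:
        print(last_converged_center_pin(f))
```

Run it against the log file you identified above and compare its output with the grep result before copying the values into the DEFAULT copy.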

Warning

Make sure you use the Read Params button and NOT the Save Params button, as the latter will overwrite whatever file is listed in its textbox.
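The copy-and-edit step can be scripted so that the original DEFAULT file is never touched. The procedure only says to edit the PztTip and PztTilt lines; the "Key value" (or "Key = value") line format matched below is an assumption about the DEFAULT file, so inspect the result before loading it with Read Params.

```python
"""Write a dated copy of the CAL DEFAULT file with new PZT values.

Assumption: PztTip and PztTilt each appear once, as "Key value",
"Key = value" or "Key: value" on their own line. Check the output file
before loading it with Read Params.
"""
import datetime
import re
import shutil
import sys


def set_pzt(text: str, tip: float, tilt: float) -> str:
    """Replace the values on the PztTip and PztTilt lines of text."""
    for key, value in (("PztTip", tip), ("PztTilt", tilt)):
        # Group 1 keeps the key and separator; only the value is replaced.
        pattern = re.compile(rf"^([ \t]*{key}\b[ \t]*[=:]?[ \t]*)\S+", re.MULTILINE)
        text, n = pattern.subn(rf"\g<1>{value:.6f}", text)
        if n != 1:
            raise ValueError(f"expected exactly one {key} line, found {n}")
    return text


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage: python set_pzt.py /data/cal/config/DEFAULT <tip> <tilt>
    src, tip, tilt = sys.argv[1], float(sys.argv[2]), float(sys.argv[3])
    dst = f"{src}.{datetime.date.today():%Y%m%d}"
    shutil.copy2(src, dst)  # the original DEFAULT file is left unchanged
    with open(dst) as f:
        new_text = set_pzt(f.read(), tip, tilt)
    with open(dst, "w") as f:
        f.write(new_text)
    print(f"Wrote {dst}; load it with 'Read Params' in the CAL GUI")
```

Raising an error when a key is missing or duplicated is deliberate: silently writing a file with unchanged PZT values would defeat the purpose of the recovery.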

AOC DISP* Functions and OS X Mountain Lion

The dispraw and other AOC display functions do not render properly when forwarding X11 to a computer running OS X 10.8 (Mountain Lion). This issue stems from a bug in the 8/16-bit rendering implementation of the (2013-era) stable XQuartz release (2.7.4). The current release candidate reportedly solves this problem, but there is a workaround that gets everything working without switching to the RC update channel.

You need to run a separate X server that uses the host X server for its framebuffer, and send displays directly to it:

  1. From a terminal, run:
    • Xephyr :1 -screen 1000x1000 &
    • A blank screen will pop up.
  2. Change your DISPLAY env var to :1.0
    • Bash: export DISPLAY=:1.0
    • C-shell: setenv DISPLAY :1.0
  3. SSH to AOC with X11 forwarding as usual:
    • ssh -X gpi@aoc.gpi.ucolick.org
  4. Start your favorite display program - it will appear in the Xephyr screen.

The 1000x1000 in the Xephyr call sets the screen size; change it as needed.

tlc_observe Hangs/Errors and Detector Does not Show as Exposing

Once in a while, you’ll start an observation using tlc_observe and notice that it hangs after the configuration step while the Detector shows as ready and idle. In these cases, the command event handler on the TLC may have crashed.

To check, run a status command on the TLC. If gpCmdEvent does not register as OK, restart it with:

$TLC_ROOT/bin/linux64/gpCmdEvent -daemon -c $TLC_ROOT/config/CONFIG.CmdEvent

This can also show up as an error, producing a message like:

% TLC_OBSERVE: OBSERVE command failed.

Spawn Errors:
Reply awaited on ActiveMQMessageConsumer {value=ID:gpi-dhcp-17-33.ucolick.org-37017-1361311825143-0:2:1:1, started=true is null, probably the response timeoutd after 500 [ms]
Response Received: [ERROR {Message cannot be null}]

Instrument Sequencer Twt2Lens (MEMS to WFS) Alignment Fails

If the instrument sequencer MEMS to WFS alignment fails (as evidenced by an error message either in the IS GUI or the AO Server Control GUI or returned to IDL), the only way to recover is to init and then datum the AO WFS PnC in the Track Assembly. This will remove all offsets added to the pointing and centering, requiring you to realign the system.

AOC/CAL Command Errors With Message that System is Guiding

When loops are manually opened from IDL (at the RPC level), it takes a few seconds for the top-level GMB variable (tlc.isRpc.guiding) to change state. Therefore, if a command that explicitly checks this variable is issued immediately after the loops are opened, the system might mistakenly think that it is still guiding even though all loops are open. Reissuing the command after a short delay (typically less than 2 seconds) clears the error.
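If you script around this behavior, the "reissue after a short delay" advice amounts to a bounded retry. The sketch below is generic: issue is a hypothetical callable that sends one command and raises RuntimeError with the system's error text, and matching on the word "guiding" and the 2 s default delay are assumptions drawn from the paragraph above, not from the GPI software.

```python
"""Retry a command that fails only because tlc.isRpc.guiding is stale.

`issue` is a hypothetical callable that sends one command and raises
RuntimeError carrying the error text; the 'guiding' substring test and
the 2 s default delay are assumptions based on the text above.
"""
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_if_guiding(issue: Callable[[], T], delay_s: float = 2.0,
                     attempts: int = 3) -> T:
    """Call issue(); on a 'guiding' error, wait delay_s and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for _ in range(attempts - 1):
        try:
            return issue()
        except RuntimeError as err:
            # Any error other than the stale-guiding one is passed through.
            if "guiding" not in str(err).lower():
                raise
            time.sleep(delay_s)
    return issue()  # final attempt: let any error propagate
```

Limiting the number of attempts keeps a genuinely guiding system from being retried forever.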

Change in WFS Camera Rate Produces no Change in Measured Intensity

There exists a (mostly) silent failure mode in which the AOC continues to function normally and accepts CAMERA RATE commands with no errors, but does not actually update the camera rate. This occurs when, for whatever reason, communication with the WFS camera stops working. Whenever serial communication with the camera is attempted (for example, when setting the CAMERA RATE), this produces errors of the following form in the AOC log (see Log Files):

130722 10:28:29: ACK setCamRate
130722 10:28:29: INFO (gpAoSrtCamMngr_SendToCam) sending to WFS camera: @SEQ 0
130722 10:28:31: ERR (gpAoSrtCamMngr_SendToCam) no response from CamWFS 1
130722 10:28:31: INFO (gpAoSrtCamMngr_SendToCam) sending to WFS camera: @RCL 3
130722 10:28:33: ERR (gpAoSrtCamMngr_SendToCam) no response from CamWFS 1
130722 10:28:34: INFO (gpAoSrtCamMngr_SendToCam) sending to WFS camera: @REP 2643
130722 10:28:35: ERR (gpAoSrtCamMngr_SendToCam) no response from CamWFS 1
130722 10:28:35: INFO (gpAoSrtCamMngr_SendToCam) sending to WFS camera: @TMP?
130722 10:28:38: ERR (gpAoSrtCamMngr_SendToCam) no response from CamWFS 1
130722 10:28:38: ERR (gpAoSrtCamMngr_ProcCmd) invalid ccd temperature code: -1
130722 10:28:38: ERR (gpAoSrtCamMngr_ProcCmd) invalid wfsCamCase temp code: -1
130722 10:28:38: INFO (gpAoSrtCamMngr_SendToCam) sending to WFS camera: @SEQ 1
130722 10:28:39: ERR (gpAoSrtCamMngr_SendToCam) no response from CamWFS 1

In addition, the AOC status display in the TLC GUI will report the camera temperature at 0 Kelvin. The simplest way to deal with this problem is to restart the AOC code.
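Because this failure is otherwise silent, it can be worth scanning the AOC log for the "no response from CamWFS" signature shown above. The sketch below matches only that line format, as seen in the excerpt; the rule that three or more such errors indicate a communication failure is a hypothetical threshold, not a documented limit.

```python
"""Flag the silent WFS-camera communication failure in an AOC log.

The regex matches the "no response from CamWFS" lines in the excerpt
above; the min_errors threshold is a hypothetical choice.
"""
import re
import sys

NO_RESP = re.compile(
    r"^(\d{6} \d{2}:\d{2}:\d{2}): ERR \(gpAoSrtCamMngr_SendToCam\) "
    r"no response from CamWFS"
)


def no_response_times(lines):
    """Return timestamps of 'no response from CamWFS' errors, in order."""
    return [m.group(1) for m in map(NO_RESP.match, lines) if m]


def camera_comm_failed(lines, min_errors=3):
    """True if at least min_errors no-response errors appear."""
    return len(no_response_times(lines)) >= min_errors


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as f:
        lines = f.readlines()
    times = no_response_times(lines)
    print(f"{len(times)} no-response errors; failure={camera_comm_failed(lines)}")
```

A positive result here, together with a camera temperature of 0 Kelvin in the TLC GUI, points to the same remedy: restart the AOC code.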

VM VNC Bottom Pane is Filled With Already Closed Windows

This is a known KDE bug. There is no fix (other than upgrading to a modern version of KDE, which apparently will not happen). The best workaround is to recreate the panel's task manager:

  • If there is any free space on the panel, right-click there. Otherwise, click the Cashew (rightmost icon on the panel) and then right-click anywhere on the panel.
  • Select ‘Remove This Task Manager’
  • Right click on the panel and select ‘Add Widgets...’
  • Select ‘Task Manager’ from the Widgets list and click ‘Add Widget’
  • Right click on the panel and select ‘Task Manager Settings’
  • Under Filters, select ‘Only show tasks from the current desktop’ and click Ok.

Gemini SSL VPN Error on OS X Mavericks

When attempting to launch the SSL VPN agent from https://umbral.gemini.edu, you get a blank applet with the message ‘Error. Click for details’. The details say something along the lines of ‘SecurityException - Missing required Permissions manifest attribute in main jar’. To clear this error, you need to whitelist the Gemini SSL gateway:

  • Open System Preferences and click on the Java Icon
  • In the Java Control Panel, click on the Security Tab
  • In the Exception Site List, click ‘Edit Site List...’
  • Click the Add button and in the new line that appears type: https://umbral.gemini.edu
  • Click OK in the Site List and OK in the Java Control Panel
  • Reload the site in your browser