ASU site appears to be down again (early Dec 2020)

Activity

Tim Cartwright
February 8, 2021, 5:54 PM

Things look remarkably back-to-normal at ASU over the past 6–7 days. So… I guess we can call this one fixed, even though the ticket does not record the root cause or solution.

Tim Cartwright
February 2, 2021, 7:33 PM

My main comment w.r.t. priority is that ASU was contributing a little under 200 Khrs per month prior to all this. can sort out priorities.

Jeff Dost
February 2, 2021, 7:09 PM

can we lower the priority of this? is working on setting up at least a few CC* CEs so I think ASU is low prio relative to his other hosted CE work

Brian Bockelman
February 1, 2021, 10:19 PM

This ticket has languished but is marked as critical … either the priority got incorrectly marked or it got forgotten.

Did a quick triage on the situation and see:

  1. From “condor_q” output, there’s plenty of jobs running and using time.

  2. From payload monitoring, payloads are being appropriately reported and look similar to prior usage before the outage.

  3. The CE’s condor_history output looks reasonable, so inputs to the batch reporting look OK.

  4. The raw records in GRACC report very few seconds (near zero) of wall time per job but nonzero CPU time.
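The symptom in item 4 (near-zero wall time alongside nonzero CPU time) could be checked mechanically once the raw records are exported. The following is only a minimal sketch: the `UsageRecord` class, its field names, and the one-second threshold are hypothetical and do not reflect the actual GRACC record schema, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    # Hypothetical stand-in for one raw accounting record;
    # these field names are assumptions, not the GRACC schema.
    job_id: str
    wall_seconds: float
    cpu_seconds: float


def is_suspicious(rec, min_wall_seconds=1.0):
    """True when CPU time is nonzero but wall time is (near) zero."""
    return rec.cpu_seconds > 0 and rec.wall_seconds < min_wall_seconds


def suspicious_records(records, min_wall_seconds=1.0):
    """Return the records that show the item-4 symptom."""
    return [r for r in records if is_suspicious(r, min_wall_seconds)]


# Made-up sample values, for illustration only.
records = [
    UsageRecord("job-1", 3600.0, 3400.0),  # looks normal
    UsageRecord("job-2", 0.0, 2900.0),     # CPU time but no wall time
    UsageRecord("job-3", 0.0, 0.0),        # empty record, not flagged
]
flagged = suspicious_records(records)
```

A large share of flagged records after a given date would be consistent with a probe-side reporting problem rather than a real drop in usage.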

This makes me suspect that the current version of the Gratia probe on the CE is somehow misbehaving. The only change I can think of is a recent Python 3 issue that caused an incorrect number of raw hours to be reported.

- can you record the version of the Gratia probe on this CE? I don’t recall precisely what the problematic RPM version of the Gratia common library was so I’m adding Brian Lin and Carl as watchers.
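One possible way to capture the installed probe versions for the ticket is to filter the host's RPM list. This is a sketch under assumptions: it assumes the CE is an RPM-based host and that the relevant packages have names starting with `gratia-probe`. The helper names are hypothetical.

```python
import subprocess


def gratia_packages(rpm_lines):
    """Keep only lines that look like Gratia probe packages.

    The "gratia-probe" prefix is an assumption about package naming.
    """
    stripped = (line.strip() for line in rpm_lines)
    return sorted(name for name in stripped if name.startswith("gratia-probe"))


def installed_gratia_packages():
    """List installed Gratia probe packages (requires the rpm command)."""
    result = subprocess.run(
        ["rpm", "-qa"], capture_output=True, text=True, check=True
    )
    return gratia_packages(result.stdout.splitlines())
```

Calling `installed_gratia_packages()` on the CE and pasting the result into the ticket would record the name-version-release string of each matching package.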

Jeff Peterson
January 21, 2021, 9:34 PM

The drop in claimed cores is related to the network issues River is having right now; it seems this will affect many of the CEs. We have been moving some over to Tiger.

Assignee

Jeff Peterson

Reporter

Tim Cartwright

Priority

Critical