Live Status Updates


There are no known current issues. Please report anything you find to the VLSCI help desk!


There are currently no upcoming issues.




OUTAGE: Barcoo, Snowy, Merri

5th December 2016

The VLSCI Intel clusters (Barcoo, Snowy and Merri) require an outage for GPFS upgrades and some Slurm changes that cannot be applied while jobs are running.  The clusters will be unavailable all day until the work is complete.

Avoca is not affected by this work; jobs should continue as usual.



5 Sept

All systems have now been returned to service.


It has been a busy and long week.

I'd like to thank our systems administration team for all their hard work.



If you compile your own code that links against Slurm (Open-MPI is a good example) then you will need to recompile to pick up the new shared libraries.  If this means nothing to you then you are highly unlikely to be affected.
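If you are unsure whether one of your binaries links against Slurm, `ldd` lists an executable's shared-library dependencies. A minimal sketch (`/bin/sh` is only a stand-in; point `bin` at your own compiled program):

```shell
# Sketch: does a given executable link against Slurm's shared libraries?
# Replace /bin/sh with the path to your own compiled program.
bin=/bin/sh
if ldd "$bin" 2>/dev/null | grep -qi slurm; then
    echo "links against Slurm: recompile after the upgrade"
else
    echo "no Slurm libraries found: likely unaffected"
fi
```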



During the outage, the following work was completed:

1. Operating systems upgrades on Merri, Barcoo and Snowy

2. GPFS filesystem major upgrade

3. Backup and HSM software major upgrade

4. Slurm queuing system major upgrade on all systems (16.05.4)

5. Hot fixes for Avoca

6. Automation of scratch policy, including deletion of 31 million old files from /scratch

7. Cluster management software major upgrades

8. Many other configuration tweaks to Slurm and other components, including increasing the priority of waiting jobs.

2 September

It has been a busy week with a long list of updates and upgrades achieved.  We are in the final stages of testing and expect to resume full operation by Monday afternoon.

Many thanks for your assistance and we look forward to the return to service.

29 AUGUST 9am
VLSCI is planning a full systems outage starting at 9am on Monday 29 Aug for up to a week.  We will work to minimise the disruption, but we want to emphasise that this outage is necessary: it will apply urgent software upgrades and finalise major storage updates and repairs.
There will be no access to files or compute during the outage.  Jobs can still be submitted before the outage, and the scheduler will manage jobs around the outage.
During the outage we will automate our file removal policy for files in ‘scratch’.  This means that any file that has not been modified or accessed for 60 days or more will automatically be deleted by the system.  There will not be any notification of the files that will be deleted.  For more information on VLSCI storage and data management, please see:
If you have files in scratch space that need to be kept, please make sure you back them up before the outage.  You can use the ‘mystaledata scratch’ command on any of the VLSCI machines to get a list of the largest files that have not been accessed or modified for 60 days or more (the bottom of the mystaledata output gives the location of the full file list).
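`mystaledata` is VLSCI's own tool, but the 60-day rule can be sketched with plain `find` (GNU find assumed). The throwaway directory below exists only to make the demo self-contained:

```shell
# Demo of the 60-day staleness rule using plain GNU find.
demo=$(mktemp -d)
touch -d "90 days ago" "$demo/old_results.dat"   # stale: matches the policy
touch "$demo/fresh_notes.txt"                    # recent: would be kept

# List files whose modification AND access times are both over 60 days old,
# printed as "size-in-bytes path", largest first.
find "$demo" -type f -mtime +60 -atime +60 -printf '%s %p\n' | sort -rn

rm -rf "$demo"
```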
If you compile your own code that links against Slurm (Open-MPI is a good example) then you will need to recompile to pick up the new shared libraries.  If this means nothing to you then you are highly unlikely to be affected.
Please let us know if you have any concerns so we can assist in minimising the disruption to your work.


22 August 2016

AVOCA users

If you ran 'mystaledata scratch' on Avoca, please rerun the command to see any files that will be more than 60 days old as of 5 September (if left as is). Unfortunately, a bug on Avoca meant this command used to report no old files. The bug has been fixed.



12th July 2016

We need to do an important software upgrade to our backup system on Wednesday 13th July from 10am onwards.

There should be no user-visible impact, with one exception: if you try to access a file in the /hsm filesystem that has been migrated out to tape, you won't be able to retrieve it until the work is complete.

You can see if you are using the HSM filesystem with the "mydisk" command - if your project is, then your project's "shared" directory will be stored there.
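A generic way (not the `mydisk` tool itself) to see which filesystem backs a directory is `df`; `$HOME` below is only a stand-in for a project's shared directory:

```shell
# Print the mount point of the filesystem backing a directory.
# If this reported an /hsm mount for your project's shared directory,
# files there could have been migrated out to tape.
dir="$HOME"                                  # stand-in path for the demo
df -P "$dir" | awk 'NR == 2 { print $6 }'    # column 6 is the mount point
```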

During this work we won't be starting any new jobs on the clusters (as a precaution), but running jobs will continue to run and you can still log in and queue new jobs.


  • 10am - work has begun, schedulers are paused so as to not start jobs.
  • 10:40am - upgrade of primary backup server completed OK, upgrade of secondary server starting.
  • 11:10am - upgrade of secondary server completed OK, transferring services back to the primary.
  • 12pm - we've identified a hardware issue on the primary server, which means services will be transferred back to the secondary server.
  • 12:20pm - all planned work complete, clusters are running jobs again.


1 Mar

It has been a busy week of maintenance activities and all systems have now been returned to service.

Projects no longer require a quota. Job scheduling now uses a fair share system, which is designed to balance usage between users, projects, and access schemes (e.g. member institutes, non-member institutes, etc.).
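In Slurm, fair share is typically enabled through the multifactor priority plugin. A hypothetical slurm.conf excerpt (illustrative values only, not VLSCI's actual configuration):

```
PriorityType=priority/multifactor
PriorityWeightFairshare=100000   # fair-share dominates job priority
PriorityWeightAge=1000           # queued jobs slowly gain priority with age
PriorityDecayHalfLife=14-0       # past usage decays with a 14-day half-life
```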

For more information please see:

Thank you for your patience and we look forward to supporting your research.


23 Feb 2pm
VLSCI has seen significant growth in demand for its resources, in particular storage resources.  Unfortunately the file system is reaching the limits of its capacity.
In order to meet these demands, we need to take an outage.
This work will require some significant changes to the backend storage system and it is for this reason that we will require a full week. We appreciate that this is a significant interruption, but this is the most effective way to address this growth and we will endeavour to return to service as soon as possible.
This outage will allow the retirement of the previous quota system and the introduction of fair share, which means no more quarterly quotas or need to manage a project's quota. Please contact the help desk if you would like further information on fair share.
We also ask that you please help with data management. VLSCI data storage is intended only for current work, not long-term archiving. Data storage and backup capacity is becoming a serious problem and we need your assistance. By regularly removing completed data from the system you make the system more responsive, minimise backups, and help prevent us from running out of storage space. Please ensure all project members remove any data not needed for their immediate compute needs.
* Access: there will be no access to VLSCI storage and compute resources.  Please ensure you have all the data you need prior to the shutdown.
* Jobs:  A job scheduling reservation is in place, so no jobs will be allowed to run during the shutdown. All queued jobs will maintain their status in the queue.
* Websites: the VLSCI homepage, user management, and help websites should remain operational. There is a one-hour window on Wednesday 24 Feb from 10:30am during which all power to the systems will be disconnected.  This may impact the websites.
More information on fair share can be found at the following location:


22 Jan 2016
VLSCI has seen significant growth in demand for its resources, in particular storage resources.  Unfortunately the file system is reaching the limits of its capacity.
To address the capacity limits, work was undertaken to migrate data to the larger drives.  However, this has resulted in unpredictable file system performance.
While we work with IBM to identify the underlying cause of the issues triggered by the migration, we have stopped the migration process.  We will still need to do occasional testing, but this should only result in short periods of impact while we collect debugging information and test suggestions from IBM.


18 Jan 2016 13:30

We have just encountered a hardware failure on BARCOO, which has
brought that system down completely. (The nature of the failure is
network related.)

We'll endeavour to fix this as soon as possible.

Thanks again for your patience, and apologies for these unintended interruptions
to the system.


11 Jan 2016

As we make these important changes to the file system, we have come across an
unexpected issue.  Files that have not yet migrated across to the newer disks
are taking much longer than normal to access, which results in commands like
"ls" hanging or taking much longer than expected to complete.  The issue also
affects running jobs that try to access those files and, as a consequence,
logins.  There is no effective workaround at present.  IBM support has been
engaged to help us resolve this issue.

Rather than impose a blanket stop on access to the resources, we'd like to request some
patience while the last of the affected files are migrated.  We anticipate that this
should complete within the next few days, though we can't be definitive at this stage.

Thanks again for your patience, and apologies for these unintended interruptions
to the system.

6 Jan 2016

Due to filesystem work to enlarge the /vlsci partition, some metadata operations may take longer than usual.

You may see "ls" apparently hang; it will eventually complete, but to work around the problem please run:

unalias ls

in all your login sessions to make it run quickly again.
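For context, `ls` is commonly aliased to `ls --color=auto`, which stats every directory entry to choose colours, and each of those stats hits the slow metadata path. If you would rather not remove the alias, a single invocation can bypass it:

```shell
# Two ways to run the un-aliased ls for one command only:
\ls /tmp > /dev/null          # a leading backslash skips alias expansion
command ls /tmp > /dev/null   # `command` bypasses aliases and functions
echo "plain ls ran without the alias"
```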


11 Dec 2015 : 13:15 Update

VLSCI is pleased to inform you that the clusters are available again for use.
As mentioned in the update below, there are some caveats, so please take these
into consideration when restarting jobs. (As always, VLSCI has refunded quota
for all lost jobs.)

Please also note that the RESTORED_FILES file might not yet be in sync with the files
that have actually been restored.  We appreciate that this might initially cause a bit
of confusion, but the file should be properly in sync later in the day.

Restoration of files will continue in the meantime.

VLSCI facilities team.

10 Dec 2015 : 19:00 Update

As you are aware, our current outage is ongoing.  As mentioned earlier,
we have a workaround for getting you back onto the systems.  We hope
to have this enabled tomorrow (an email notification will follow when this happens),
with the following caveats:

(1) Not all affected files will have been restored
(2) Some previously running jobs might be affected due to (1)

You will find a file called AFFECTED_FILES in your project directory.  This
will let you know which files were damaged during this outage.  A second file,
called RESTORED_FILES, will also be available.  It will let you see which of
the affected files have already been restored, so that you can run jobs that
use those file sets.

Restoration of affected files will continue until it is complete.  To give your
jobs the best chance of running successfully, please consider running data sets
other than those which may have been affected.

On a positive note, this outage has enabled us to also do system upgrades and an
upgrade to the SLURM scheduler.  These upgrades were initially slated for mid-Jan 2016,
and could possibly have lasted 4 to 5 days.  We will no longer need that maintenance.

Again, our apologies for this inconvenience, and thank you again for your patience while
we repair this issue.

VLSCI facilities team.

10 Dec 2015 : 12:00 Update

System upgrades have been completed.  There has also been an update of the SLURM scheduler which will enable us to deploy the new allocation scheme for 2016.

For the GPFS issue: we have unfortunately hit speed issues when trying to restore damaged files.  We are currently looking at a workaround to enable users to get back onto the system, and will inform everyone as soon as it is in place.

Thank you again for your patience during this outage.

VLSCI facilities team.

8 Dec 2015 : 17:00 Update

We are currently in the process of restoring damaged files.  It appears that the process is a lot slower than we anticipated.  We are in contact with the vendor to determine if this is the normal rate for a restoration of a large number of files.  (About 4.5 million files were affected.)

We estimate that the return to normal services might still take a couple of days, and we will provide a further update before midday tomorrow with better estimates. (Please watch this space.)

VLSCI facilities team.

7 Dec 2015 : 14:00 Update

The repairs to the file system are ongoing, and damaged files will be restored to the state they were in on Wednesday night (2 Dec).  The restoration will still take some time, but we hope to have the system up as soon as possible (within a few days).

(Note: the facility team is using this opportunity to do system updates that would otherwise have necessitated a downtime in early-to-mid January.)

VLSCI facilities team.

4 Dec 2015 : 12:30 Update

Unfortunately we will have to go to a complete shutdown of all systems to repair the problems with the file system. We hope to have everything back online early next week.

Again, apologies for the inconvenience

VLSCI facilities team.

GPFS: At risk

4 Dec 2015 : 11:30

We have a possible issue with the file system.  The VLSCI team will keep you posted if the issue needs to be escalated.

Apologies in advance for any inconvenience.