Can I get my compute time back?

Who wouldn’t want to use all of their node hours on Levante in the most productive way? After all, this is how we present our project in the proposal. Everything will go according to plan. What could go wrong? A lot, in fact. So can you get the time back when it does? Here is an incomplete list of things we often hear.

  • I found a bug in my model and now I have to run all experiments again.

  • We couldn’t start production for some reason and now our compute time has expired. I’ll never be able to catch up!

  • The model performance is worse than we anticipated in the proposal, so we need more compute time to get through.

  • One of the nodes in my large job had a hardware failure. This isn’t my fault!

  • The system had a long downtime this quarter and now my time expires because I can’t get enough jobs through.

We understand your trouble, and we would like to give you more time, but we can’t. Sorry. Unsatisfactory? Okay, here is the long answer.

Our HPC system can deliver only a fixed amount of compute time each year. To estimate it, we start from the uptime our vendor assures us. We then take into account that the scheduler cannot always achieve 100% utilization because of our diverse mix of job sizes and run lengths. Many more things get factored in, including our experience with past systems. We then tell our scientific steering committee the total number of node hours they can allocate across all proposals. In the end, each accepted project gets its share and nothing is left.
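To make the arithmetic concrete, here is a rough sketch of such an estimate. All numbers, names and the proportional split are made up for illustration. They are not Levante's real node count, uptime or utilization, and the real allocation is decided by the steering committee, not by a formula.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def allocatable_node_hours(nodes, uptime, utilization):
    """Estimate the node hours that can be handed out for one year.

    nodes:       number of compute nodes (hypothetical value below)
    uptime:      fraction of the year the system is expected to be up
    utilization: fraction of the uptime the scheduler can actually fill
    """
    return nodes * HOURS_PER_YEAR * uptime * utilization


def split_proportionally(total, requests):
    """Hypothetical stand-in for the committee's decision:
    scale the accepted requests so that they add up to the total."""
    requested = sum(requests.values())
    return {name: total * amount / requested
            for name, amount in requests.items()}


# Made-up example: 1000 nodes, 95% uptime, 90% achievable utilization.
total = allocatable_node_hours(nodes=1000, uptime=0.95, utilization=0.90)

# Two fictional projects, one asking for three times as much as the other.
shares = split_proportionally(total, {"project_a": 3, "project_b": 1})

# Whatever is not handed out is the "secret stash", and there is none.
leftover = total - sum(shares.values())

print(f"allocatable node hours: {total:,.0f}")
for name, hours in shares.items():
    print(f"{name}: {hours:,.0f}")
print(f"left over: {leftover:,.0f}")
```

The point of the sketch is the last line: once the shares add up to the estimated total, there is nothing left to give out later.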

That is why we cannot give you any more compute time for this year. We have no secret stash of nodes we could fire up to create more node hours. It doesn’t matter whether it’s your fault, ours, or simply hardware failing us: the time is gone.

So is it all bad, with no chance of getting any additional compute time? No, because there are other, generous projects that can give you some of their share. You don’t even have to ask them (most would refuse anyway). For whatever reason, many of them simply don’t use their time, and when that happens, your job can sweep in to make use of those precious node hours.

This is what happens when a project’s time expires. In most cases the time wasn’t wasted; some other project simply got it.

Yes, that means you can spend more time than was allocated to your project. Reviewers of your next proposal will see that you closed with a deficit, but for once this is a good thing. It tells them that you are able to use your resources and more. It is much harder to convince them to allocate a big chunk of the machine to you when you didn’t use your time last year.

The bottom line: always keep one or more jobs in the queue, and by the end of the allocation period you will probably have used your share and even more.