
For me the worst thing about being on-call is not the actual work outside business hours (it’s usually not much), but the potential work: if something happens I need to jump into my laptop within X minutes (changes from company to company, but it’s usually within 10 minutes). This means: I cannot go for a run, I cannot go to the movies, I cannot go for a dinner with family, I cannot even go shopping (shopping mall is further than a 10 min. trip). Basically, all I can do is stay at home and be available. It sucks, and the money is not worth it.


10 min doesn't seem tenable over 24/7. Most likely you need to run errands and so on. In my team, our alerts aren't that critical; I just acknowledge on the phone and make sure I can get to the computer within 30 min. I take it with me if needed.

No, for me, the real pain with oncall is that there are a lot of systems in my team. I understand maybe 30% of them well. I'm clueless about 30%. In between for the rest. I can try to fix issues myself (takes a long time, and issues can add up) or triage (but that means bothering someone else). There are also things that nobody understands, and if they break, it can mean extra days of work for you, with extra stress because you don't know how hard it will be to fix.

Then, some people in the team ship code without adequate testing, because of pressure to ship, which often adds work for the oncall. So there are all these extra tensions with colleagues, which can be hard to deal with for an introvert.

Overall, it "kind of" works for us, but I agree with the conclusion, it sucks. It's really the worst part of my job. I went into software engineering because I like coding. Not because I like monitoring unreliable systems. And I think unreliability is encouraged by management to some extent. That keeps people at work.


This is exactly why I gave up a position as a full stack / devops engineer in favor of going back to low level drivers - there were too many unknowns, and far too many unknown unknowns often paired with expectations of prompt (and cheap) solutions to complicated issues.

Technically it was interesting and challenging, but in terms of stress just not worth it. You could pay me twice my current salary and I still would not go back to it. Now I try to place myself as far away from paying customers as technically possible.


> ...far too many unknown unknowns often paired with expectations of prompt (and cheap) solutions to complicated issues.

That describes pretty much all of my "full-stack" experience.

What sort of job/background do you have where you are writing low level drivers? I'd love to get into that side of things but I don't know where to start.


Telecom and automotive.

I guess hobby robotics, for example, could get you an entry to it if you choose to write the hardware interfacing parts yourself.


How'd you manage the transition (back?) to low-level? I would love to do chip work (or really anything systems-y) but all my experience is fullstack/webdev. Every time I apply I get bounced for insufficient domain experience.


I started off my career in low level stuff and transitioned upwards to web. I’ve always been all over the place in terms of tech so it wasn’t particularly big steps either way. I’ve usually got something low level-ish going on at home. Emulator development, robotics, …


Oncall is becoming popular even for low level. My last few roles have all required it for reasons I've been unable to figure out beyond "all developers need on-call and you're a developer". In my case, a fix often requires hardware access and my commute is longer than the start-work SLA.


When I'm in charge of an on-call rotation I always try to make it very clear that this is not the expectation.

In my preferred model of on-call, you have a primary, then after 5min an escalation to secondary, then after 5min an escalation to something drastic (sometimes "everyone", sometimes a manager).

The expectation is that most of the time you should be able to respond within 5 minutes, but if you can't then that's what the secondary role is for - to catch you. This means it's perfectly acceptable to go for a run, go to a movie, etc.

You relax the responsibility on the individual and let a sensible amount of redundancy solve the problem instead. Everyone is less stressed, and sure you get the occasional 5min delay in response but I'm willing to bet that the overall MTTR is lower since people are well rested and happier to be on call to begin with.
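
A minimal sketch of the escalation chain described above, assuming a 5-minute gap between steps. The names `EscalationStep`, `POLICY`, and `paged_so_far` are made up for illustration and do not come from any particular paging tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    target: str         # who gets paged at this step
    delay_minutes: int  # minutes after the alert fires, if still unacknowledged


# Hypothetical policy: primary at once, secondary after 5 minutes,
# then a "drastic" step (here: everyone) after another 5 minutes.
POLICY = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 5),
    EscalationStep("everyone", 10),
]


def paged_so_far(minutes_unacked: float, policy: List[EscalationStep] = POLICY) -> List[str]:
    """Return every target paged once an alert has sat unacknowledged this long."""
    return [step.target for step in policy if minutes_unacked >= step.delay_minutes]


if __name__ == "__main__":
    for minutes in (0, 4, 5, 9, 10):
        print(minutes, paged_so_far(minutes))
```

The point of the design is that a missed page costs one step of delay rather than an outage, so no single person has to guarantee a response.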


We have a primary/backup setup and I would be pretty pissed if my primary just started going out for movies or a date night during their shift tbh. My job as a backup is to be there for unexpected events, ie they did not wake up or had an accident. Not be on call effectively 2 weeks in a row just because the primary doesn't take it seriously.


Yeah, going for a run or a dinner where you might be able to ack but not actually at keys for 10-20 minutes is one thing. Going to a movie or date where you might not even ack and won't be at keys for hours? Not cool at all.


I don’t see how this changes the problem where there is an expected guarantee of a rapid response except that now two people are expected to be available and would now need to directly coordinate in order to ensure one person’s going for a swim doesn’t interfere with the other’s WoW raid.


That's more or less what my team does. It works well. At least much better than saying you can't go for a swim at all.


I guess to me that seems worse because that’d effectively double the number of off-hours accountability per teammate. Not only do you need to be first on call for your primary hours, therefore severely restricting the quality of your “free time” but now you ALSO have to be secondary on call for that irresponsible coworker that goes afk without properly communicating for 2 hours, dipping twice into your actual free time.


Out of 168 hours in a week, there are maybe up to 8 where I want to do something that interferes with being oncall. There's no real downside to being oncall for the other 160 hours. But I would get a lot of disutility from losing my freedom during those 8.


This is pretty much how it should be done. If the business demands more, they should have a properly manned 24x7 NOC.

You also need *ownership*. There is nothing worse than having to support somebody else's work and not being allowed (either via time or other restrictions) to do things "right" so that you're not always paged for fixable problems. Everywhere I've worked where the techs had ownership (which varied from ops people being allowed to override the backlog to fix issues, to developers being given enough free rein to fix technical debt), oncall has barely been an issue. In my current gig I often forget I'm even on call at all, and the main issues that do crop up are usually external.


Almost all the reliability issues I encounter are due to constraints ordered by people who don't have to deal with on-call.

Things like, running in AWS but you have to use a custom K8S install so they aren't dependent on AWS.

Using self managed Kafka so that you aren't dependent on proprietary tech.

It all sucks because they are always less reliable and generate their own errors and noise for on-calls.

If they had to deal with phone calls every time there's a firewall issue that had absolutely nothing to do with the application, they would soon change their tune.


So it takes 10 min until you've gone to the drastic solution? With this time-frame it would be risky to go to the bathroom, let alone to a movie. Also even the backup sounds like a primary in this scenario.


Sure, but the assumption here is that primary and backup (edit: probably, ie. they're not coordinating this) aren't going to the bathroom at the same time. It's also based on the idea that alerts are extremely rare to begin with. If you're expecting at least one page every rotation, that's way, way too often. Step one is to get alerts under control, step two is a sane on-call rotation.


We want to ack within five minutes, and be at a laptop within 30. So long as I'm within mobile signal when the page goes off, it doesn't really matter what I'm doing — an ack is a button press on a push notification. And I can stay within 30 minutes of my laptop and an Internet connection by carrying said laptop and my phone (with "unlimited" data).

If the primary (paid) on-call doesn't catch the notification, the secondary (unpaid) will be paged. And so on, down a couple more steps, to a senior manager. There's no expectation that anyone other than the primary would actually be available to ack the alert.


Having the primary/secondary rotation is arguably worse. In that model, from the perspective of any one participant, now they're on-call for two weeks each time around instead of one.


> The expectation is that most of the time you should be able to respond within 5 minutes

That's an unreasonable expectation unless it's clearly said in writing and is billable hours.


This is why people used to be paid time-and-a-half or even double-time for being on call. Ask your union to demand that.

https://en.wikipedia.org/wiki/Time-and-a-half


For nurses on-call is a tiny amount - $3/hr and then you get 1.5x if you actually get called in


Pay is all about power. Nurses individually have little power (as they are replaceable), which is why unionising is good for them as the union gives them collective power.

Software engineers are an interesting case; some have a great deal of domain-specific knowledge, giving them leverage over their employers. Many less so, and so a union could help. AI might change this equation too.


> Nurses individually have little power (as they are replaceable)

Replaceable with what, exactly? The local ER is now having to close in the evening because they can't find sufficient nursing staff to keep it operating.


At least locally the experience is that hospital admin has gone delulu by thinking they can replace hiring unionized nursing staff with much more expensive travel nurses


When I was at a small IT consulting shop ~15 years ago, this is roughly how it worked. We'd get paid 24x7 for a week on-call at minimum wage + 1.5x normal wage for any hours we had to log in.
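
As a rough worked example of that scheme, here is a sketch that reads it literally: a standby rate for all 168 hours of the week, plus 1.5x the normal rate for hours actually logged in. The wage figures are hypothetical, and the sketch ignores how regular working hours were handled, which the comment doesn't specify.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def weekly_on_call_pay(standby_rate: float, normal_rate: float, hours_logged_in: float) -> float:
    """Standby pay for the whole week plus time-and-a-half for active hours.

    Assumes the standby rate applies to every hour of the week, including
    the hours that are also paid at 1.5x.
    """
    standby_pay = HOURS_PER_WEEK * standby_rate
    active_pay = hours_logged_in * 1.5 * normal_rate
    return standby_pay + active_pay


if __name__ == "__main__":
    # Hypothetical numbers: 7.25/hour standby, 30/hour normal, 3 hours of incidents.
    print(weekly_on_call_pay(standby_rate=7.25, normal_rate=30.0, hours_logged_in=3))
```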


In most of the country, California unions have largely negotiated for .5x pay while on call making it quite popular.


Had this very same payment scheme as an SRE on call in Europe almost a decade ago.


Fun fact: In the US the concept of time-and-a-half (also minimum wage, and not hiring child labor) was created by the Fair Labor Standards Act. Most tech employees are classified as "exempt" -- the FLSA and its protections don't apply to them.


I have never understood why 'computer' workers are exempt


For the same reason that cashiers and burger flippers at fast food places have to sign non-compete clauses.

It makes everything cheaper.


In Ontario, Canada, IT workers are exempt from most labour laws too. Breaks, shift lengths, time off between shifts, etc. No overtime pay either.

What this means is that on call is often "included in your salary" and good luck.


Are you confusing overtime and on-call?

Nobody's going to pay you 1.5 your hourly rate to just sit and wait until something happens. Is that really a thing?

Now if you are called and spring into action, there may be time-and-a-half, if it's outside of your regular hours.


So for say a weekly on-call rotation you would be paid all 24 hours x 7 days at double rate? (- ~40)

Also most tech companies don't have unions...


I don't believe it's physically possible for anyone to be available at 10 minutes notice for 168 hours straight. And if it's possible, it would be deeply unhealthy. But if they did achieve this, then yes they should be paid some very large amount of money. But that doesn't just happen -- pay isn't fair, it's about power. So a union can help to negotiate this, or ideally, better working conditions.

> Also most tech companies don't have unions...

In Europe, there are plenty of unions that cover tech people. I'm a member of Prospect (https://prospect.org.uk/).


Also in the UK there’s https://utaw.tech/ and https://www.gameworkers.co.uk/

If there isn’t a union recognised in your workplace, you can build one yourself.


In Europe most of the unions apply to the whole industry sector, what you do inside the building doesn't really matter.


Usually (always?) regular working hours are compensated as usual. Then there’s rate for standby periods and another rate for each (half) hour when you get pinged and start doing actual work.


Last time I was on call it was a one hour payment per day, a 4 hour payment for answering the phone.

That lasted about 3 months, just not acceptable.


If I had a union it would demand a bunch of unqualified people join my team (and get paid the same as me) and it would forbid me from doing certain things because, say, moving the computer or plugging in a cable is IT's job, whereas I'm SE. No thanks


While you will find some extreme examples that could go that far, unions don't generally do that. Organisations that fight unions however do like to bring up that example, so... you've been had with anti union propaganda.


So my coworker who was a UAW member who told me stories about sleeping on the roof, and being reprimanded for moving a desk to retrieve a pen...was trying to dupe me?


> While you will find some extreme examples that could go that far,

I don't know how much more clear I could make it to you.


"while I should ignore evidence that contradicts what you claim..."


Why are you getting all of your thinking from one co-worker?


So I should ignore evidence directly from the source of one of the largest unions in the US, because it doesn't support your view? I should only accept evidence from your trusted sources? Ok.

Edit: or my UPS friend who told me how the union box loaders would falsely claim alcoholism or drug addiction before being fired so they could abuse the union "protection" that was given to them? Is he trying to dupe me too?


Think about your own employment experience. Was the work environment always static, or did your employer ever introduce change that wasn't popular? Were you still singing the corporate anthem afterwards?

As far as I can tell, unions only show up after decades of management malfeasance. They're kind of a natural reaction. The line "the only thing worse than a union is no union" is probably a hundred years old.


Aren't hackers supposed to be a curious bunch? Is that really the only way you can imagine unions working? Can you not see the imbalance of power between a single individual and the corporation that employs them? Unions are fundamentally about balancing that power dynamic.


i'm a big union advocate, but i worry that the traditional messaging that unions use doesn't work for tech employees.

things like more pay/better hours/safer working conditions are appealing to people working low-paid, dangerous jobs but don't really click with most tech employees because those aren't the things they hate about their work.

to win over tech employees unions should talk about more ambitious things like codetermination (i.e., getting workers on the board), 4-day work weeks, remote work policies, employee sabbaticals, etc


My wife is part of a union and there’s none of that bullshit. However when her employer wanted to reduce costs across the board the union negotiated a shorter working week for everyone instead of a pay rise next year. They voted on it and accepted it with an overwhelming majority.


The union offering "shrinkflation" as the way for the business to cut costs is an interesting framing. Your wife's union associates must hang out with grocery store executives.


The union spoke to its members and asked them what they wanted, negotiated it and voted on the final result.

If that’s your take away from it I don’t know what to tell you.


That isn't how unions work; it's the usual US fearmongering about having them.


It just dawned on me how this argument runs perfectly parallel to religion if you point out intolerance, misogyny, or violence. It's always "those other people" that do the bad thing, and everyone only reaffirms their own system of worship. You could almost do a 1-1 find/replace of keywords and have the same argument.


Everyone who works for a living understands why unions exist even if they don’t need one locally.


Any particular reason you can't handle incidents while out and about?

I know it varies by situation. When I've been on call I've been able to mostly go about my life. I just had to keep my laptop close, stay in cell signal, and accept I would sometimes have interruptions (typically brief). We fought to keep them infrequent enough that they didn't ruin our lives.


I do long(ish) distance running as a hobby - it's not feasible to take a laptop out on a two hour run.

If I want to go meet a friend for a drink or food, I have to lug around a backpack, keep an eye on it to make sure it's not stolen. If I wanted to have a beer or wine, I can't because I may need to work at any point.

Favourite band is performing? I suppose you could take a backpack and the laptop to the venue, but again there's a chance it's pinched, and they'll make you check it at the cloakroom for the performance.


> If I want to go meet a friend for a drink or food, I have to lug around a backpack, keep an eye on it to make sure it's not stolen. If I wanted to have a beer or wine, I can't because I may need to work at any point.

If this is a stated requirement from your employer, talk to a lawyer. This is a common litmus test for whether you need to be paid while on call, even if you aren't actively working. Depending on the jurisdiction you may be entitled to pay (or trigger a relaxation of your company's policies).


Does that apply to salaried/FLSA-exempt workers?


Depends on the jurisdiction, talk to a lawyer to find out.


I use my pocket computer if something comes up. It's not nearly as pleasant to use, granted, but way more pleasant than carrying a laptop everywhere. But I also wouldn't hesitate to have a beer if the desire arose. Perhaps I'm just not as committed to my work as you.


> If I wanted to have a beer or wine, I can't because I may need to work at any point.

One beer or glass of wine renders you incapable? I'd be totally comfortable having 1-2 drinks on call.


Not the GP, but I was in a similar situation. It was a requirement to be able to get to the office if the situation required the lab to diagnose the problem.


In California, as a non-exempt employee (basically not a manager if you're at a big company), you'd have to be paid for that on-call time with those requirements. The key term is "restricted" and the 10 minute expectation is quite a severe restriction.

https://www.dir.ca.gov/dlse/callbackandstandbytime.pdf https://www.shrm.org/topics-tools/tools/policies/california-...

This doesn't apply to a lot of people reading, but just a PSA for those in CA where it might.


If you're on-call, you're working - it doesn't matter if there's an active incident or not. Unless you're a contractor (in which case, you're unlikely to be on-call) the company you work for pays for your time, not delivery of specific work-items. On-call pay should reflect this.


99% of people on-call are salaried anyways


> all I can do is stay at home and be available

If home is fine then usually all you need is Internet and laptop.

> I cannot go for a run, I cannot go to the movies, I cannot go for a dinner with family, I cannot even go shopping (shopping mall is further than a 10 min. trip)

Sounds more like setting expectations and explaining the situation than a "cannot" (maybe except movies).

You can explain to your family for example that you're on-call and may need to leave urgently. I mean e.g. police do that. It's not that uncommon.

You can take runs within a 10 minute distance back home. The route is up to you. You can start by acknowledging on the phone as someone else commented, which would grant you maybe another few minutes.

There are lots of options. It's on you to work around it. On-call isn't perfect, sure.


I once had a job with a lot of 2-4am wakeup outage calls. The timing was perfect such that you can't fall back asleep generally.

An aggravatingly large percent of them could be resolved by voice over the phone by walking the offshore support team through the same 2-3 runbook items.

"Did you look at the log... I see, OK are you looking at it now? Does it say X? Did you do Y? Good now? Great, goodnight."

"Did you try restarting it.. ok then try that now. Is it good now that you restarted it? Great, I'm going back to sleep"

Ironically we'd have less of these outage calls when the offshore person went on holiday because they'd send one of the competent NY support staff over for 2 weeks. Slept like a log every time.


I agree. The potential work is worse than the actual work. I used to think it wasn't so bad, but then I was so relieved when I left the on-call rotation that I must have been suppressing my feelings about it.

I wrote about my experience here: https://bobbiechen.com/blog/2022/7/20/being-on-call-sucks


>Most of the software I wrote requires mostly no interference or fix-ups, unless of course the requirements have changed

Bug-free software is great, but changing requirements are precisely the reason on-call needs to exist.

Even if you have a whole team of engineers who write bug-free software like this guy, you'll still have failures. Because the world is constantly slipping out from under your assumptions.

Customers never stop changing their usage patterns. They add load at different rates, come up with unexpected requests of all shapes and sizes, and invent new use cases that fly in the face of the original project requirements.

Even if you have created a software system with no bugs that perfectly meets both the functional and non-functional requirements of the project, changes in the state of the world vis a vis customer behavior will come along and change what counts as a bug. If your system has a blanket 60-second database query timeout, and everything's working fine, then there's no bug. But as soon as a new API usage pattern causes certain queries to run on average 10 times longer than before, now you have connection starvation and an urgent bug to fix.
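
To put rough numbers on that, here is a sketch using Little's law (average concurrency = arrival rate x average time in the system). The pool size, query rate, and latencies are hypothetical; the function names are made up for illustration.

```python
def connections_in_use(queries_per_second: float, avg_query_seconds: float) -> float:
    """Little's law: average number of connections busy at any moment."""
    return queries_per_second * avg_query_seconds


def is_starved(queries_per_second: float, avg_query_seconds: float, pool_size: int) -> bool:
    """True when average demand exceeds what the pool can supply."""
    return connections_in_use(queries_per_second, avg_query_seconds) > pool_size


if __name__ == "__main__":
    POOL_SIZE = 50   # hypothetical connection pool size
    QPS = 100.0      # hypothetical steady query rate

    # Before: queries average 0.25 s, so about 25 connections are busy.
    print(connections_in_use(QPS, 0.25), is_starved(QPS, 0.25, POOL_SIZE))

    # After the new usage pattern: 10x longer queries need about 250 connections.
    print(connections_in_use(QPS, 2.5), is_starved(QPS, 2.5, POOL_SIZE))
```

Note that a 60-second timeout never fires in either case, since both averages are far below it. The timeout only caps the worst case, so it does nothing to prevent this kind of starvation.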

I'm not saying that "timely maintenance and improvement" and "a culture of perpetual ownership" won't have positive effects on reliability. But it's unrealistic that any amount of responsible, careful software development will fully eliminate the occurrence of sudden and unexpected failures. Human on-call, as uncomfortable as it is, will remain a requirement as long as reliability is taken seriously.

FWIW my perspective is that of someone that runs an on-call/incident management platform (Rootly).


If you allow yourself to sleep then it's not entirely true. I totally agree that it's lurking and you think twice before starting an activity, but what's the worst that can happen? Run 10 minutes away from your house, take your laptop to the movies, and to your friends and family. I do it all the time and yes sometimes I need to isolate or find a place to work, I still enjoy the rest of the day


I've voiced my opinion on this before¹.

The problem is on-call is an essential and critical part of a managerial role, but toxic to those in a developer role.

Managers must be on-call to ensure the appropriate people and resources are brought to bear on unexpected problems that threaten the business.

Developers must NOT be on-call to ensure appropriate attention is spent designing, developing and maintaining the code that makes the business possible.

The rise of software-as-a-service led to companies promoting "devops" engineering which conflates these roles and unfortunately helps unscrupulous executives unfairly squeeze more work from employees.

The core idea of devops, that managers/operators and developers should understand and be capable of performing each other's role, isn't a bad one. Those who understand how the business works at all levels can do more to make it successful. It goes hand-in-hand with continuous delivery.

The best engineers alternate between these roles in a predictable schedule. When in the managerial role they need to observe, react, delegate and escalate problems as appropriate. When in the development role they need to deliver features that create recurring value for the business.

But businesses should not expect engineers to play both roles at the same time!

This form of "on-call" is a toxic moral hazard. It's a sign of instability. It's a signal of executive grift looking for a quick pop. "on-call" robs developers of attention they need to develop the features and increases risks that schedules will slip.

It doesn't need to be this way. If a business needs software development it should hire or train engineers with that experience. Likewise if it needs managers or operators to deliver software as a service.

As an operator or manager I look forward to working a shift, but as a developer I will never again accept on-call rotation.

¹ https://news.ycombinator.com/item?id=42230215


All companies I have worked for wanted you to answer within minutes, but you had half an hour to actually connect and try fixing the issue, so you could totally go out, and if you were more than a 30min drive/ride/walk away you would just keep a laptop in a vehicle or backpack.

I used to do 3-hour rides on my bicycle and go to dinner or social events with a gpd pocket 2 in a small bag.


A 10 minute response is "On the clock", not "on-call". A symptom of the "Do more with less" cancer in tech.


The airlines pay a certain rate when you go on call - since you're still effectively on company time.


I mean, I'm salary, so I get paid for on-call as part of that, but I don't get any extra pay for picking up my team's on-call rotation.


I was on-call for over a decade, usually in roles where there was no compensation for working out of hours other than maybe TOIL. We're not talking FAANG gigs here - like £20-50k in the UK stuff. It's amazing how much having to carry an extra phone or making sure your laptop is in your car impacts your day-to-day life. Any social thing you're at could be interrupted at zero notice. Heck, I've taken calls in supermarkets and concert venues.

One place I worked had a 1 in 2 rotation. Every other week on call or weeks back to back if your colleague was on holiday. There was no front-line service screening calls which meant you could be woken several times in one night. All for £30 pcm towards broadband costs.

Most places are more sane than that example but suffer from the same core problem. Follow the sun support is incredibly expensive when compared to putting your existing staff to be on call. Here in the UK, so long as your equivalent hourly rate doesn't drop below national minimum wage and you're opted out of the working time directive (a lot of employers slip an opt-out form into your paperwork implying it's normal to sign it), then it's legal.

Unfortunately I'm yet to find anywhere that on-call operational teams have the clout to get code induced issues high up the priority list outside of cases where they've had to drag developers out of bed at 2am. In my experience that also plays out with getting anything infrastructure based into tech debt budgets. Why focus on fixing problems you don't directly suffer from when you can spend the time on a refactor, integrating a cool new library or spaffing out one more feature in the sprint?


This. And alerts are often just a fluke anyway. Sure many of us can step up and pull an all-nighter if it saves some company and you make some minor sacrifice to do something heroic. Being woken up several times at night for no reason but some metric that is a bit off is pretty soul crushing, and then come the knock-on effects in professional and personal life that you're always tired and demotivated during the day.


> This means: I cannot go for a run, I cannot go to the movies, I cannot go for a dinner with family, I cannot even go shopping (shopping mall is further than a 10 min. trip).

I live in rural Texas. The same things apply here, and more: I'm lucky to have good internet (which enables working remotely) but half my home doesn't get cell coverage so being responsive to a text message or phone call means not even going around my own home (for example, no cell signal in the kitchen means I can't cook while on-call); and with large tracts of land, I can't go out to do land maintenance (good luck hearing a phone ring or feeling it vibrate from a call when you're operating heavy machinery, assuming you even have cell signal there); all services are 15 minutes or more away: groceries, doctor, contractors, government, etc etc.

It's important to stress how much being on-call ruins my capability to use the time effectively for my own purposes (Texas Guidebook for Employers [0]; 29 CFR 785.16 [2] and 785.17 [3]). I tried telling this to a previous employer when they started wanting me to be on-call (3+ years after start of employment), and they indicated that those laws are only used for hourly employees but being salary + exempt means I do not qualify for additional pay and falls under "and other duties as assigned" in the employment contract. So the employer effectively started getting 60 hours of work for 40 hours of pay. Oof.

I also absolutely refuse to mix my personal devices with work; just at a minimum, I refuse to make my personal device available to legal discovery related to any legal issues with the employer. So if the employer wanted me to have cell phone availability, then I demanded that the employer provide that cell phone. That was a fun conversation that ended with some relaxed requirements (eg, I don't have to have cell phone availability if I'm responsive at my work desk already) which further reinforced the fact that I couldn't use the time for my own purposes.

Thankfully multiple years in this industry at (what was) fair compensation allows me to be picky for new employment contracts. And lesson learned: I'll be a lot more careful about contract language from now on, and specifically look for (or negotiate) carve-outs around being on-call and work/personal device separation. I recognize that having 10+ years of experience makes me able to handle that, but newcomers to the industry won't yet have that buffer and it sucks for them to not have that safety net for negotiation leverage.

A lot of this disagreement comes from businesses demanding rapid response while insisting on not taking on new hardware/payment obligations. To contrast: take the fireman who's waiting for an alarm (29 CFR 785.15 [1]): they are often idle and can often go out for groceries but they're easily reachable. Ever seen a firetruck in front of a grocery store and the firemen are just inside shopping for groceries? Then see them come running out and turn on the lights & siren and drive off? I have. It's an interesting event, and it sucks for the grocery store that has to put those groceries (for ~15 people) back on the shelves and refrigerators. Nonetheless, those firemen are paid to do so and have special equipment (eg radios or cell phones) to be able to receive those messages, and the firemen generally don't pay for that equipment themselves (the community does either through taxes or donations). I see analogies about on-call software engineers being called to put out (virtual) fires as very apt in this case.

[0]: https://efte.twc.texas.gov/c_waiting_or_on_call_time.html

[1]: https://www.ecfr.gov/current/title-29/subtitle-B/chapter-V/s...

[2]: https://www.ecfr.gov/current/title-29/subtitle-B/chapter-V/s...

[3]: https://www.ecfr.gov/current/title-29/subtitle-B/chapter-V/s...



