PagerDuty Operations Cloud
External reviews
External reviews are not included in the AWS star rating for the product.
Real-time monitoring has reduced downtime and ensures failed jobs are resolved quickly
What is our primary use case?
We receive a notification if there are any failed jobs or operations. We have some Bamboo agents working, so if one of the jobs fails on one of these servers, PagerDuty Operations Cloud creates an incident and notifies us. We use PagerDuty Operations Cloud for monitoring purposes, and it works great for our current needs.
What is most valuable?
The best features PagerDuty Operations Cloud offers include quick access to failed jobs and the ability to add descriptions about the failed job. The quick access allows us to rapidly identify which job or operation has failed because it sends the job name or the operation name that has failed.
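A failed CI job reaching PagerDuty usually arrives as an event carrying the job name and a description. The sketch below builds a trigger event in the shape of PagerDuty's Events API v2 (`POST https://events.pagerduty.com/v2/enqueue`); the routing key, agent and job names, and the `build_trigger_event` helper are hypothetical, and this is not necessarily how the reviewer's Bamboo integration is wired.

```python
import json
from typing import Optional

# Severities accepted by the Events API v2 payload.
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, job_name: str, agent: str,
                        description: str, severity: str = "error",
                        dedup_key: Optional[str] = None) -> dict:
    """Build an Events API v2 'trigger' body for a failed CI job (hypothetical helper)."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Reusing one dedup_key per agent/job lets repeated failures of the
        # same job update a single open alert instead of opening new ones.
        "dedup_key": dedup_key or f"{agent}:{job_name}",
        "payload": {
            "summary": f"Job failed: {job_name}",  # the job name shows up first
            "source": agent,
            "severity": severity,
            "custom_details": {"job_name": job_name, "description": description},
        },
    }


if __name__ == "__main__":
    body = build_trigger_event("YOUR_ROUTING_KEY", "nightly-build",
                               "bamboo-agent-01", "Unit tests failed")
    print(json.dumps(body, indent=2))
```

Sending the body (for example with `urllib.request`) is left out so the sketch stays side-effect free.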
I have heard that integrating our systems with PagerDuty Operations Cloud was easy. In terms of efficiency, we can monitor our deployment process in real time. For incident response, PagerDuty Operations Cloud raises alerts that call the specific person who can handle the issue.
We have fewer missed incidents because it keeps calling regarding the incidents until they are resolved. We have also reduced downtime because we notice errors and failed jobs, and we work to fix them.
What needs improvement?
The system is very smooth right now.
For how long have I used the solution?
We have been using the solution for about one year.
What do I think about the stability of the solution?
We have not experienced any stability issues.
What do I think about the scalability of the solution?
We have not experienced any scalability issues.
How are customer service and support?
We did not try to reach out to customer service because we did not face any issues.
Which solution did I use previously and why did I switch?
I prefer not to use previous solutions.
How was the initial setup?
I joined the team after they had already purchased and configured PagerDuty Operations Cloud, so I did not have knowledge about the setup process.
What about the implementation team?
I do not have any experience with the implementation team.
What was our ROI?
Time saved.
What's my experience with pricing, setup cost, and licensing?
There was no relationship between setup cost and other factors.
Which other solutions did I evaluate?
We did not consider alternate solutions.
What other advice do I have?
PagerDuty Operations Cloud is a great tool that saves time, and it is worth adopting. I rate this product nine out of ten because nothing is perfect.
Essential for Timely Alerts and Incident Response
Its on-call scheduling and escalation policies help teams reduce downtime and respond to incidents quickly.
Overall, it improves incident response and gives better visibility into operational issues.
The interface and configuration can be a bit complex for new users.
On-call automation has reduced downtime and has enabled faster incident response at scale
What is our primary use case?
PagerDuty Operations Cloud is a platform that helps teams manage incidents, automate operations, and ensure system reliability by bringing alerts, on-call schedules, and real-time responses into one place. When we had to push changes to production, we set up PagerDuty schedules on a weekly or biweekly basis. If an issue occurred at night, the engineer on the roster would be paged and would have to handle it.
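A weekly or biweekly rotation like this is essentially round-robin arithmetic over fixed-length shifts. The sketch below is a minimal, hypothetical model of that idea; the engineer names, start date, and `on_call_at` helper are illustrative assumptions rather than PagerDuty's API, and PagerDuty's real schedules also support layers, overrides, and custom handoff times that it does not model.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(engineers, rotation_start, when, shift_length=timedelta(weeks=1)):
    """Return who is on call at `when` in a simple round-robin rotation.

    Each engineer holds one shift of `shift_length` in list order, then the
    rotation wraps around. Hypothetical helper for illustration only.
    """
    if not engineers:
        raise ValueError("the rotation needs at least one engineer")
    if when < rotation_start:
        raise ValueError("`when` is before the rotation starts")
    # Number of complete shifts that have elapsed since the rotation began.
    shifts_elapsed = (when - rotation_start) // shift_length
    return engineers[shifts_elapsed % len(engineers)]


if __name__ == "__main__":
    team = ["asha", "ben", "chen"]  # hypothetical engineers
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # A page at 3:00 a.m. during the second week of the rotation.
    page_time = datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)
    print(on_call_at(team, start, page_time))  # weekly shifts
    print(on_call_at(team, start, page_time, shift_length=timedelta(weeks=2)))  # biweekly
```

Switching between weekly and biweekly coverage only changes `shift_length`; the modulo wraps the roster automatically.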
A specific incident where PagerDuty Operations Cloud helped my team was during the US peak season in December, when hundreds of thousands of orders were being placed and a major S1-severity production issue suddenly occurred. Without a monitoring tool in place, the company would have faced severe consequences and lost hundreds of thousands of dollars. PagerDuty came to our rescue at the last moment, when nobody had yet noticed the problem. At 3:00 a.m. my time, I received a message and then a call while I was asleep, and I learned about the issue. I logged in quickly and fixed it, and within about an hour the issue was resolved with minimal damage. I even received appreciation for my quick response.
PagerDuty Operations Cloud helps in similar situations because whenever an issue occurs that we are not aware of, PagerDuty raises a flag so that we can fix it before it becomes a major problem.
What is most valuable?
Some of the best features PagerDuty Operations Cloud offers are comprehensive incident management, automation, and AI operations, all integrated into one platform. It provides noise reduction and smarter alert grouping through global intelligent alert grouping, which uses machine learning to group and correlate alerts across services. It offers automation to reduce toil and speed up resolution, as well as AI capabilities, including generative AI assistance, to help teams respond faster and smarter. It also has built-in workflows with standardized, repeatable processes, improved visibility and collaboration through a unified operations view, and support for bridging customer-facing teams with engineering and SRE teams. Finally, it scales well for enterprise environments.
The AI-powered alert grouping and automation have made a difference in my day-to-day work by reducing alert noise. The platform automatically groups multiple related alerts into a single incident, so instead of 20 separate alerts, I get one meaningful alert, which keeps on-call engineers from being spammed. It also speeds up root cause understanding, because the AI looks at patterns across systems, including logs, metrics, alarms, and graphs, and then provides a broad summary. This cuts response time, helps with prioritization, and reduces burnout for on-call teams.
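To make the "20 alerts become one incident" effect concrete, here is a deliberately simple, rule-based stand-in: alerts for the same service are folded into one incident while they keep arriving within a time window. This is not PagerDuty's machine-learning grouping; the `Alert`/`Incident` types, the five-minute window, and the service names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str
    summary: str
    at: datetime


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Fold alerts for the same service into one incident while each new
    alert arrives within `window` of that incident's most recent alert."""
    open_incidents = {}  # service -> most recent incident for that service
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.at):
        current = open_incidents.get(alert.service)
        if current is not None and alert.at - current.alerts[-1].at <= window:
            current.alerts.append(alert)
        else:
            # Gap too large (or first alert for the service): start a new incident.
            current = Incident(alert.service, [alert])
            open_incidents[alert.service] = current
            incidents.append(current)
    return incidents


if __name__ == "__main__":
    t0 = datetime(2024, 12, 1, 3, 0)
    burst = [Alert("checkout", f"5xx spike #{i}", t0 + timedelta(seconds=30 * i))
             for i in range(20)]
    grouped = group_alerts(burst)
    print(len(burst), "alerts ->", len(grouped), "incident(s)")
```

A fixed window is easy to reason about but cannot correlate alerts across services the way the reviewer describes; that is the gap the ML-based grouping is meant to fill.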
PagerDuty Operations Cloud has positively impacted my organization through faster incident detection and resolution with less downtime. It has reduced noise and false alerts, so on-call engineers can focus only on real, important issues rather than on duplicate and negligible ones. It has also improved automation and efficiency, collaboration and communication among teams, and post-incident learning and prevention. Beyond operational cost savings and a better return on investment, it has helped with scalability and readiness for growth.
What needs improvement?
Even though PagerDuty Operations Cloud is a strong platform, several things can be improved. Analytics and reporting could go deeper. Noise suppression and alert grouping could be more robust, because the grouping sometimes becomes vague and unclear. The user interface and user experience could be improved, because they can be quite complex for new users. Integration and ecosystem limitations could be addressed, and so could cost, because the solution can become quite expensive for small or mid-sized organizations. It can also be overly complex for smaller teams or simpler needs.
I think we could have richer analytics and better reporting dashboards. More robust noise suppression would help us, as would native support for alert attachments. A simpler user interface and user experience could be implemented, and pricing tiers and models should be more favorable. More accessible documentation and easier onboarding would help a lot.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud since the first year of my job, and I have worked on four projects, using it in all of them.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is one of the most stable platforms.
What do I think about the scalability of the solution?
The scalability of PagerDuty Operations Cloud is very good. I have seen it scale easily and robustly.
PagerDuty Operations Cloud has met my needs as my team and workload have grown. Our workload will keep growing because, as more of our business moves online, more production issues may arise, but PagerDuty has helped reduce that workload.
How are customer service and support?
I have never faced an issue that required me to contact PagerDuty customer support, because it has worked fantastically. If that happens in the future, I would be happy to share my experience.
Which solution did I use previously and why did I switch?
This is my first company, and I have been working here since the beginning of my career, so PagerDuty Operations Cloud is the only solution I have worked with.
How was the initial setup?
Before using PagerDuty Operations Cloud, my team often took longer to identify the root cause of incidents because alerts were scattered across different tools. After moving to PagerDuty Operations Cloud, AI-powered alert grouping and automated flows have helped us detect issues much faster. We now mobilize the right team within minutes, and our overall incident resolution time has dropped significantly, which has directly reduced our downtime and improved service reliability.
What about the implementation team?
Whether to evaluate other options before choosing PagerDuty Operations Cloud was not a team-level decision in my organization.
What was our ROI?
We saved costs by preventing losses, saved time, and reduced response time, along with the other benefits I have already mentioned.
What's my experience with pricing, setup cost, and licensing?
Pricing, setup cost, and licensing were not my concern; the organization had already set everything up for me. I just had to log in and start using it.
Which other solutions did I evaluate?
I did not purchase PagerDuty Operations Cloud through the AWS Marketplace because it is an organization-wide decision, so my company would have done that.
What other advice do I have?
I would definitely recommend trying this solution. If you are planning to go to production in the near future and want to meet tighter service level agreements and spend less time on root cause analysis, definitely give it a try. It is a tool worth working with, and I rate this product 10 out of 10.
Runbook automation has reduced incident response time and now improves uptime and collaboration
What is our primary use case?
Our main use case for PagerDuty Operations Cloud is alerting: whenever a downstream incident or any other problem causes downtime in our application, PagerDuty Operations Cloud alerts us through calls and SMS so that we can quickly remediate the issue.
A unique aspect of our main use case with PagerDuty Operations Cloud is using the Runbook flow. Whenever we experience a specific kind of incident, the Runbook will trigger automation to either remediate the issues or perform root cause analysis, thus enhancing our workflow automations.
What is most valuable?
PagerDuty Operations Cloud helps our team by improving our response time. Whenever there is an incident, we are notified through calls 24/7, which lets us immediately join a call or start investigating and remediate the issue as early as possible. This way, PagerDuty Operations Cloud helps us reduce MTTR and makes our application more reliable and resilient.
We have been using the Runbook automation feature for building automated flows that help us add extra monitoring for specific alerts or incidents and perform remediation tasks autonomously using this Runbook flow.
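The pattern described here, where a specific kind of incident triggers a specific automated action, boils down to a dispatch table from incident type to remediation step. The sketch below is a hypothetical, local model of that routing; the incident types, the `restart_service` and `collect_diagnostics` functions, and the dictionary shape are assumptions, and in a real setup each entry would be a PagerDuty Runbook Automation job rather than a Python function.

```python
from typing import Callable, Dict


# Hypothetical remediation steps standing in for Runbook Automation jobs.
def restart_service(incident: dict) -> str:
    return f"restarted {incident['service']}"


def collect_diagnostics(incident: dict) -> str:
    return f"collected logs and metrics for {incident['service']}"


# Incident type -> automated action. Unknown types fall through to a human.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {
    "health_check_failed": restart_service,  # remediation
    "high_latency": collect_diagnostics,     # root cause analysis support
}


def run_automation(incident: dict) -> str:
    """Run the runbook registered for the incident's type, if any."""
    action = RUNBOOKS.get(incident.get("type", ""))
    if action is None:
        return "no runbook matched; leaving incident for the on-call engineer"
    return action(incident)


if __name__ == "__main__":
    print(run_automation({"type": "health_check_failed", "service": "orders-api"}))
    print(run_automation({"type": "disk_full", "service": "db-01"}))
```

Keeping the mapping explicit makes it easy to review which incident types are allowed to trigger automatic remediation and which always page a person.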
One feature I particularly appreciate about PagerDuty Operations Cloud is that it offers multiple notification options. I receive alerts via call as well as SMS, which is beneficial. If I miss the call, I may still receive the SMS and vice versa.
Through PagerDuty Operations Cloud, our MTTR has been reduced by at least 30% over the last year due to its instant notification features like SMS and calls, which help us jump on calls quickly to remediate issues. This reduction has impacted our application downtime, ensuring an uptime of approximately 99% throughout the year.
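The two figures in this paragraph are simple ratios. MTTR is the mean trigger-to-resolve time, a 30% reduction means the new mean is 70% of the old one, and 99% uptime over a 365-day year still allows about 87.6 hours (roughly 3.65 days) of downtime. The sketch below shows the arithmetic; the incident durations are invented for illustration and are not the reviewer's data.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def mttr(durations):
    """Mean time to resolve: the average trigger-to-resolve duration."""
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


def downtime_budget_hours(uptime, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed in the period at the given uptime fraction."""
    return (1 - uptime) * period_hours


if __name__ == "__main__":
    # Hypothetical incident durations, in minutes.
    last_year = [timedelta(minutes=m) for m in (50, 70, 60)]
    this_year = [timedelta(minutes=m) for m in (40, 44, 42)]
    before, after = mttr(last_year), mttr(this_year)
    print(before, after, round(percent_reduction(before, after), 1))
    print(round(downtime_budget_hours(0.99), 1), "hours of downtime allowed per year")
```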
What needs improvement?
One suggestion for improving PagerDuty Operations Cloud is to provide more insight into incidents, such as root cause analysis or additional context, which could help SRE teams reduce detection and remediation time before they even join a call.
From an integration point of view, everything is functioning well. However, we primarily use the desktop interface as our main tool, and adding more details on incidents directly from PagerDuty Operations Cloud's analysis would enhance the user experience.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for the last three years.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is absolutely stable. We have never experienced any downtime or latency issues from PagerDuty Operations Cloud.
What do I think about the scalability of the solution?
We don't have much insight into scalability, as a separate enterprise PagerDuty Operations Cloud team is responsible for handling all scaling activities.
How are customer service and support?
We have internal enterprise support within the application, which is very interactive. They escalate issues to the external PagerDuty Operations Cloud team when necessary, and they are very supportive.
Which solution did I use previously and why did I switch?
We have not previously used a different solution. PagerDuty Operations Cloud is the first alerting tool I have been using since the beginning.
How was the initial setup?
Onboarding to PagerDuty Operations Cloud is straightforward in our organization: new team members simply need to be added to specific Windows AD groups to complete the onboarding process and gain access.
What about the implementation team?
There are automations in our organization that connect PagerDuty Operations Cloud to other ticketing tools such as Jira and ServiceNow. Whenever an incident occurs, automation that uses the Runbook flow triggers to extract data from the PagerDuty Operations Cloud alert to create incidents and Jira tickets for the development team.
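The core of such a PagerDuty-to-Jira automation is a field mapping from the incident in an incoming webhook to the body of a Jira create-issue request. The sketch below assumes the shape of PagerDuty's V3 webhook payload (`event.event_type`, `event.data.title`, `urgency`, `html_url`) and the Jira REST API v2 `POST /rest/api/2/issue` body; verify both against the official documentation. The project key `OPS` and the `jira_issue_from_webhook` helper are hypothetical, and no HTTP calls are made.

```python
from typing import Optional


def jira_issue_from_webhook(webhook: dict, project_key: str,
                            issue_type: str = "Bug") -> Optional[dict]:
    """Map a PagerDuty V3 'incident.triggered' webhook to a Jira v2 issue body.

    Returns None for other event types so the caller can ignore them.
    """
    event = webhook["event"]
    if event.get("event_type") != "incident.triggered":
        return None
    incident = event["data"]
    # Jira REST API v2 accepts a plain-text description string.
    description = (f"PagerDuty incident: {incident.get('html_url', 'n/a')}\n"
                   f"Urgency: {incident.get('urgency', 'unknown')}")
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": incident["title"],
            "description": description,
            "issuetype": {"name": issue_type},
        }
    }


if __name__ == "__main__":
    sample = {"event": {"event_type": "incident.triggered",
                        "data": {"title": "Checkout API 5xx spike", "urgency": "high",
                                 "html_url": "https://example.pagerduty.com/incidents/ABC"}}}
    print(jira_issue_from_webhook(sample, "OPS"))
```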
What was our ROI?
In terms of return on investment, we have reduced our MTTR by 30% in the last year, indirectly improving our application's uptime to nearly 99%, which enhances client experience and boosts our business.
What's my experience with pricing, setup cost, and licensing?
I have no personal experience with pricing, setup costs, or licensing, as a separate enterprise PagerDuty Operations Cloud team manages those processes.
What other advice do I have?
The escalation policies within PagerDuty Operations Cloud are user-friendly and customizable, allowing us to set up multi-level escalations from SRE engineers to SRE leads and then to management.
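A multi-level escalation policy can be read as a list of levels, each with a timeout: if nobody acknowledges within that time, the next level is notified. The sketch below models who is being paged after a given number of unacknowledged minutes for an engineer, lead, management chain; the 15/15/30-minute timeouts and the `active_level` helper are hypothetical, and PagerDuty's own repeat and reassignment options are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationLevel:
    name: str
    escalate_after_minutes: int  # unacknowledged minutes before moving on


def active_level(levels: List[EscalationLevel], minutes_unacked: int) -> EscalationLevel:
    """Return the level being notified after `minutes_unacked` without an ack.

    Levels are tried in order; once every timeout has elapsed, this sketch
    keeps notifying the last level (it does not model policy repeats).
    """
    if not levels:
        raise ValueError("an escalation policy needs at least one level")
    elapsed = 0
    for level in levels:
        elapsed += level.escalate_after_minutes
        if minutes_unacked < elapsed:
            return level
    return levels[-1]


if __name__ == "__main__":
    policy = [
        EscalationLevel("SRE engineer", 15),
        EscalationLevel("SRE lead", 15),
        EscalationLevel("Management", 30),
    ]
    for minute in (0, 20, 45):
        print(minute, active_level(policy, minute).name)
```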
PagerDuty Operations Cloud helps our team collaborate during incidents by automatically updating incident status based on progress. We have alerting integrated with Slack for this, where incidents show as red when active, yellow when acknowledged, and green when resolved.
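The color coding described here is a direct mapping from PagerDuty's three incident statuses (`triggered`, `acknowledged`, `resolved`) to colors. The sketch below builds a Slack legacy-attachment-style dict; it assumes the reviewer's "active" means `triggered`, and the hex values and `slack_attachment` helper are illustrative choices, not PagerDuty's Slack integration itself. Slack's `color` field accepts hex codes, so plain names like "red" are avoided.

```python
# Incident status -> hex color (red, yellow, green). Illustrative values.
STATUS_COLORS = {
    "triggered": "#FF0000",
    "acknowledged": "#FFD700",
    "resolved": "#008000",
}


def slack_attachment(incident_title: str, status: str) -> dict:
    """Build a color-coded Slack attachment for an incident status update."""
    try:
        color = STATUS_COLORS[status]
    except KeyError:
        raise ValueError(f"unknown incident status: {status}") from None
    return {"color": color, "text": f"[{status.upper()}] {incident_title}"}


if __name__ == "__main__":
    for status in ("triggered", "acknowledged", "resolved"):
        print(slack_attachment("Checkout API 5xx spike", status))
```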
Regarding performance metrics, there is a dedicated enterprise PagerDuty Operations Cloud team that handles monitoring, so as an SRE, I don't need to manage these performance aspects myself.
My advice to others looking into using PagerDuty Operations Cloud is that it is one of the best tools in the market for production support and SRE engineers. It is essential for our operations, functioning as our bread and butter.
We have covered almost everything regarding PagerDuty Operations Cloud. It has been a great tool for SRE and production support teams, and we look forward to more features, especially with trending technologies like AI. I would rate this product an 8 out of 10.
Automated on-call scheduling has reduced manual effort and now keeps holiday coverage reliable
What is our primary use case?
My main use case for PagerDuty Operations Cloud is to set up shifts for people on-call.
A specific example of how I use PagerDuty Operations Cloud for setting up shifts is for when we need to set up shifts for holidays. In our team, we'll assign people who will be on-call and create an Excel sheet and upload it to PagerDuty. It works normally, gives notifications, and everything else functions properly. It is very easy to set up and manage.
I usually discuss with my team who will be on call during holidays and decide how many people are needed. We create an Excel sheet, upload it to PagerDuty, and set the escalation order: who is the first person to reach and, if they miss the alert, whom to escalate to. The web view and website are also very easy to use. I think this is the normal use case; other teams may use it differently, but this works well for us. Before, the process was very manual and quite difficult.
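Before a holiday sheet is uploaded, a quick validation pass can catch the mistakes the reviewers mention, such as two people assigned to the same slot. The sketch below assumes the sheet is exported to CSV with hypothetical `date`, `primary`, and `secondary` columns; it uses Python's standard `csv` module in place of Excel, and the format PagerDuty actually accepts for uploads is not confirmed here.

```python
import csv
import io
from datetime import date


def load_holiday_roster(csv_text):
    """Parse `date,primary,secondary` rows into {date: (primary, secondary)}.

    Rejects a date that appears twice (two assignments for one slot), empty
    levels, and rows where the escalation target is the on-call person.
    """
    roster = {}
    # start=2 because line 1 of the file is the header row.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        day = date.fromisoformat(row["date"].strip())
        primary = row["primary"].strip()
        secondary = row["secondary"].strip()
        if day in roster:
            raise ValueError(f"line {line_no}: {day} is assigned more than once")
        if not primary or not secondary:
            raise ValueError(f"line {line_no}: both levels must be filled in")
        if primary == secondary:
            raise ValueError(f"line {line_no}: primary and secondary are the same person")
        roster[day] = (primary, secondary)
    return roster


if __name__ == "__main__":
    sheet = "date,primary,secondary\n2024-12-25,asha,ben\n2024-12-31,chen,asha\n"
    for day, (first, backup) in sorted(load_holiday_roster(sheet).items()):
        print(day, "on call:", first, "escalate to:", backup)
```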
What is most valuable?
The best features PagerDuty Operations Cloud offers are that it is simple to set up and supports Excel sheet uploads, which has been very helpful. Setting up notifications and the integration with Datadog were excellent, and we can automate many things.
PagerDuty Operations Cloud has positively impacted my organization because the support team is very happy. Before, setting up everything was very difficult. Now, we don't have to think about it. We can simply set it up in PagerDuty and it works. The escalation and everything simply works with the configuration we set up six months to one year ago, and it still functions. We make only minor changes. I think a lot of manual effort has been reduced, and the system is more reliable.
Before we implemented PagerDuty Operations Cloud, the L1 team had to stay online at night, and if someone fell asleep and missed an issue, it would escalate to a manager or someone higher up, creating a lot of fuss. That is almost gone now. Deciding who would be on call and setting it up manually was not foolproof, and someone had to invest around one or two hours every week. Now it takes less than five minutes: every week, we simply discuss it and it's done. This has saved a lot of time and mental effort.
What needs improvement?
I think the website's chart of who is on call at what time could be improved. The escalation line could be more expressive, showing who will be escalated to if someone misses an alert.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is stable; we didn't find any bugs or unintended behavior.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud is scalable; we can easily create and add teams and manage tags. It is very easy to manage, and setting the escalation order to decide who is contacted first was very easy.
How are customer service and support?
The customer support is adequate; usually, they respond and help us fix issues during integration. It was helpful.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
Before using PagerDuty Operations Cloud, there was no solution in place. The L1 team checked the issues and called developers to ask whether each error was related to their work. This involved manually calling fifteen to twenty developers, which could take half an hour, and the issue would persist in the meantime, reducing the reliability of the site. Now it is automatic and very effective.
What was our ROI?
I have seen a return on investment, mainly through time saved. As I mentioned earlier, scheduling used to take a lot of manual effort. Sometimes, by mistake, more than one person would be assigned on-call, and the process was not foolproof. Escalation was not possible at all, which put the L1 team under too much stress: they had to coordinate with many people and call them from their phones whenever they got an error. It was very bad. Now, PagerDuty escalates and calls the relevant people, and if the issue belongs to them, they join. It is much more efficient and much less stressful.
Which other solutions did I evaluate?
We were not involved in evaluating other options; I think the higher team decided to go with PagerDuty, and we are happy with it.
What other advice do I have?
I don't want to add anything else about the features; we use this much, and it's great. We don't need anything more for now, as we are using PagerDuty Operations Cloud to set up on-call duty and it works. I chose a rating of nine because there may be some improvements in the future. My advice to others looking into PagerDuty Operations Cloud is that its on-call duty features are excellent. You can simply proceed with it, and even if teams are big, it will not feel annoying or overwhelming. Just set it up and forget it. If you don't have any special needs, you can adopt it; it is widely accepted and effective. I rate this product nine out of ten.
Reliable Incident Alerts and Seamless Team Coordination
Empowers Incident Management with Reliable Alerts and Seamless Collaboration
Seamless Incident Management with Powerful Integrations
- Integration with tools like Slack, Zoom, and monitoring systems (e.g., Prometheus), plus webhooks for building automated incident workflows
- Can manage multiple teams, services, and global operations
- The mobile app and remote acknowledgement/resolution functionality are called out as strong points
- Pricing is high relative to perceived value, especially for smaller orgs
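Webhook-driven incident workflows like those mentioned in these highlights typically begin by checking that a request really came from PagerDuty. The sketch below follows PagerDuty's V3 webhook signing scheme as we understand it: an `X-PagerDuty-Signature` header holding one or more comma-separated `v1=<hex>` values, each an HMAC-SHA256 of the raw request body keyed with the subscription's signing secret. Confirm the details against PagerDuty's documentation; the secret and body below are placeholders.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Return True if any v1 signature in the header matches the raw body.

    Several signatures can be present (e.g. while a secret is being rotated),
    so each comma-separated entry is checked with a constant-time comparison.
    """
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    expected = "v1=" + digest
    candidates = [part.strip() for part in header_value.split(",") if part.strip()]
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)


if __name__ == "__main__":
    secret = "placeholder-signing-secret"
    body = b'{"event": {"event_type": "incident.triggered"}}'
    good = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    print(verify_signature(body, "v1=stale, " + good, secret))
    print(verify_signature(body + b" ", good, secret))
```

The signature must be computed over the exact bytes received; re-serializing parsed JSON before hashing would break verification.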