We have multiple nodes from our Azure infrastructure and our AKS clusters integrated into Datadog. These nodes are integrated for traces (as APM hosts).
We also have infrastructure hosts integrated so we can see the metrics and resources of each host, mainly for Azure VMs and AKS nodes. Additionally, some of our Azure VMs act as ActiveMQ brokers, and we integrate them as messaging queues so they show up in the ActiveMQ dashboard.
We have recently added ActiveMQ as containers in AKS, and we are integrating those as messaging queues as well so they show up in the ActiveMQ dashboard.
Logs are great. Having all services, owned by different teams, send their logs to Datadog so that everything is in one place is very helpful for understanding what is going on in our app. Filtering the logs is a huge help, adding custom filters is easy, and the filters are fast. The documentation is better than average, with little room for improvement.
Dashboards are simple, and monitors are very easy to configure so we get notified if something is wrong.
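To give a sense of how little setup a monitor needs, here is a minimal sketch using Datadog's official Python API client (datadog-api-client). The metric, threshold, cluster name, and notification handle are illustrative placeholders, not our actual configuration.

```python
# Minimal sketch of creating a pod-restart monitor via the Datadog API.
# All names, thresholds, and the Slack handle below are illustrative only.
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

configuration = Configuration()  # reads DD_API_KEY / DD_APP_KEY from the environment

monitor = Monitor(
    name="[AKS] Pod is restarting",
    type=MonitorType("query alert"),
    # Alert per pod when the restart count in the example cluster is high.
    query=(
        "avg(last_10m):avg:kubernetes.containers.restarts"
        "{kube_cluster_name:my-aks-cluster} by {pod_name} > 3"
    ),
    message=(
        "Pod {{pod_name.name}} has been restarting over the last 10 minutes. "
        "@slack-ops-alerts"  # hypothetical notification handle
    ),
    tags=["team:platform", "env:prod"],
)

with ApiClient(configuration) as api_client:
    created = MonitorsApi(api_client).create_monitor(body=monitor)
    print(f"Created monitor {created.id}")
```

The same monitor can just as easily be created by hand in the UI; the point is that the whole definition boils down to a query, a threshold, and a message.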
With the aggregated logs, we can now see logs from other systems and identify problems in other areas in which we had no visibility before.
Dashboards are the most valuable feature; we need the observability. We have handed the dashboards to a dedicated team that monitors them outside working hours and reports whatever they see going red. This helps us because people without any deep knowledge of the system can tell when there is a problem, when to react, and when to inform others simply by checking whether the monitor (showing the dashboards) turns red.
Traces being connected to each other, so we can see how services are linked through a single API call, is very helpful for understanding how the system works.
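As a rough illustration of how that connection happens (the service names and endpoint below are made up, not our actual code), ddtrace's auto-instrumentation propagates trace context on outbound HTTP calls, so the frontend's span and the downstream service's span end up in the same trace:

```python
# frontend.py -- run with: ddtrace-run python frontend.py
# ddtrace auto-instruments Flask and the requests library, so the outbound HTTP
# call below carries Datadog trace headers; if the downstream "payments" service
# is also instrumented, its spans join the same trace.
import requests
from flask import Flask

app = Flask(__name__)

@app.route("/checkout")
def checkout():
    # Trace context is injected into this request automatically.
    resp = requests.get("http://payments.internal:8080/charge")
    return resp.text, resp.status_code

if __name__ == "__main__":
    app.run(port=8000)
```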
The monitors need improvement. We need easier root cause analysis when a monitor goes red. When we get the email, it's hard to identify why the trigger fired and exactly which pod is to blame when, for example, a pod is restarting.
Pricing is a very difficult thing with Datadog. We have to be very mindful of any changes we make, and we are a bit afraid of using new features since, if we change something, we might get charged a lot. For example, if we enable a network feature on our nodes, we might be charged a lot simply by flipping one flag, even though we only want one small feature for those nodes. Because we have more than 50 nodes, all of them end up billed as "Network Hosts".
This leads us to not fully utilize Datadog's capabilities, which is a shame. Maybe there could be a grace period to test features, like a trial, after which Datadog stops the feature for us so we don't pay more by mistake.
I've used the solution for five years.
The solution is stable enough. We have found it to be down only a few times, which is reasonable.
The solution offers very good scalability. When we added more logs and more hosts, we did not notice any degradation in the service.
Support is very good. They answer all of our questions, and with a few emails, we get what we need.
We previously used Elastic. We had to set up everything and maintain it ourselves.
Datadog has very good support and it is not so complicated to set up.
We set up the solution in-house. We integrated everything on our own.
We found the product to be very valuable.
I'd advise others to start small and then integrate more stuff. Be mindful when using Datadog.
We evaluated Splunk and ELK.
Be careful of the costs. Set up only the important things.