An article from Percona with tips on building dashboards
The title mentions Grafana, but the advice is actually general. Some may find it obvious, but I have seen too many confusing, useless, and overloaded dashboards, so I think it is worth posting.
#observability
Percona Database Performance Blog
Tips for Designing Grafana Dashboards
Peter Zaitsev shares some of his considerations for designing Grafana Dashboards which will allow you to create better dashboards.
A four-part saga by Percona CEO Peter Zaitsev on Linux performance monitoring and typical mistakes:
- Part one: CPU
- Part two: Disk
- Part three: Memory
- Part four: Network
A very good breakdown, recommended reading 👍
#observability
ma.ttias.be
How to measure Linux Performance Avoiding Most Typical Mistakes: CPU
This post is the first in a four-part blog series by Peter Zaitsev, Percona Chief Executive Officer.
Datadog has launched its mobile app, with access to alerts and dashboards, plus integration with services like PagerDuty and OpsGenie.
An open question remains: will they add rotation to it, so as to compete with on-call services as well?
#observability
Introducing the Datadog mobile app
The Datadog mobile app gives you instant access to your dashboards and alerts in a portable package—perfect for being on-call.
An article about Elasticsearch in the style of "a message to myself when I was starting to work with it".
It covers indices, shards, resource consumption, and troubleshooting.
#observability #elk #elasticsearch
Medium
Starter-kit for Elasticsearch operations
The post tries to answer these questions: how to size ES nodes; how to troubleshoot them; where to find deep-dive posts about Elasticsearch
A list of awesome Prometheus alerts, grouped by what they monitor: there are alerts for databases, proxies and load balancers, storage, etc.
You can copy-paste these into your monitoring code. Just keep in mind that the right thresholds may be different for your particular case!
#observability #monitoring #prometheus
Amazon Managed Service for Grafana now supports Grafana Enterprise upgrade, Grafana version 7.5, Open Distro for Elasticsearch integration, and AWS Billing reports
You can upgrade to Grafana Enterprise with a 30-day trial to enable enterprise data sources.
Beginning April 16th, 2021, customers using AMG will receive a 90-day free trial for five free users per account, with additional usage charges.
AMG is currently available in the US East (N. Virginia) and Europe (Ireland) regions.
#aws #observability
Amazon
Amazon Managed Service for Grafana now supports Grafana Enterprise upgrade, Grafana version 7.5, Open Distro for Elasticsearch…
Gatus is a health dashboard written in Go.
It has a minimalistic configuration and lets you set multiple conditions for labeling an endpoint as "healthy".
Also, you can host it yourself inside your private network. So, if your security requirements do not allow external health checkers into the perimeter, this could be a good way to go.
#toolz #observability
GitHub
GitHub - TwiN/gatus: ⛑ Automated developer-oriented status page
⛑ Automated developer-oriented status page. Contribute to TwiN/gatus development by creating an account on GitHub.
A short, neat write-up of lessons learned about incident response
Key takeaways:
- Declare incidents on smaller things. Distinguishing between SEV1 and SEV3 incidents helps you track system health better. A bunch of smaller problems may also add up to a critical failure, and such problems are usually easy to fix one by one.
- Decrease the time between the incident and the postmortem analysis. The analysis will be much more accurate when what happened is still fresh in your memory.
- Alert on symptoms, not causes. Alert only if your users (external or internal) have issues, not when CPU utilization is high.
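To make the last point concrete, here is a rough Python sketch of a symptom-based paging decision. Everything in it is hypothetical and not taken from the write-up: the `WindowStats` fields, the `should_page` name, and the 1% error threshold are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    cpu_utilization: float  # 0.0-1.0, kept for diagnosis, never used for paging


def should_page(stats: WindowStats, max_error_ratio: float = 0.01) -> bool:
    """Page only on a user-visible symptom: too many failed requests.

    CPU utilization is a possible cause, not a symptom, so it is ignored here.
    The 1% default threshold is an assumption; tune it for your service.
    """
    if stats.total_requests == 0:
        return False  # no traffic, so no users are affected
    error_ratio = stats.failed_requests / stats.total_requests
    return error_ratio > max_error_ratio


# Busy CPU but happy users: no page.
print(should_page(WindowStats(total_requests=1000, failed_requests=1, cpu_utilization=0.99)))
# Idle CPU but 5% of requests failing: page.
print(should_page(WindowStats(total_requests=1000, failed_requests=50, cpu_utilization=0.20)))
```

The CPU figure is still worth recording, since it helps during diagnosis once a symptom alert has fired.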
#observability
FireHydrant
Pragmatic Incident Response: 3 Lessons Learned from Failures
Lessons learned from the front line that you actually immediately use in your incident management process.
Recently, I asked my subscribers what topics are interesting to them and a few people mentioned observability.
That’s funny, ’coz yesterday I accidentally bumped into a great series of articles on setting SLAs for your products by Alex Ewerlöf!
- Calculating composite SLA - truly outstanding read!
- Some practical advice when setting SLA - notice, it says SLA, not SLO. So, there are some business-related tips in this article as well. However, the core is technical, ofc.
- Calculating the SLA of a system behind a CDN - I haven’t read this one yet. But given the quality of the previous two, I expect this one to be great as well!
tl;dr for the first article in the list:
> For serial, multiply availability; for parallel, multiply unavailability.
I would personally add that when you try to set a “full” SLO(A) for your service, that is also a composite SLO(A), and you should treat it as a serial one. For example, if you have a 99.8% SLO on error rate and a 99.1% SLO on latency, the “overall” SLO would be 0.998 × 0.991 × 100% ≈ 98.9%.
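Here is a minimal Python sketch of the serial/parallel rule. It assumes that component failures are independent, since the multiplication only holds under that assumption. The function names `serial` and `parallel` are mine, not from the article.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Every component must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one replica must be up: multiply the unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


# The error-rate and latency SLOs from the example, treated as serial.
overall = serial(0.998, 0.991)
print(f"{overall:.1%}")  # 98.9%

# Two redundant replicas at 99% each (made-up numbers for illustration).
print(f"{parallel(0.99, 0.99):.2%}")  # 99.99%
```

Serial composition can only lower the number, while redundancy raises it, which is why a long chain of dependencies quickly eats into the nines.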
Knowing the composite number is useful on its own, but it may also change how you write your marketing materials. There is a difference between:
> We guarantee 99.8% SLO on 5xx error rate and 99.1% SLO on requests not taking longer than X milliseconds.
And
> We guarantee the 98.9% availability of our systems.
I’m not a marketing person, though, and I don’t know which is better. What I do know is that “nines don’t matter if your users are unhappy”.
#observability #slo #sla
Medium
Calculating composite SLA
How serial and parallel dependencies affect the total SLA