CatOps
DevOps and other issues by Yurii Rochniak (@grem1in) - SRE @ Preply && Maksym Vlasov (@MaxymVlasov) - Engineer @ Star. Opinions are our own.

We do not post ads, including event announcements. Please do not bother us with such requests!
An article from Percona with tips on building dashboards

The title mentions Grafana, but the tips are actually general. Some may find them obvious, but I have seen too many confusing, useless, and overloaded dashboards, so I think it is worth posting.

#observability
A four-part saga by Percona CEO Peter Zaitsev on performance monitoring in Linux and common mistakes:

- Part one: CPU
- Part two: Disk
- Part three: Memory
- Part four: Network

A very good breakdown, I recommend reading it 👍

#observability
DataDog has launched its own mobile app, with access to alerts and dashboards, as well as integration with services such as PagerDuty and OpsGenie.

The question remains: will they add rotation to it, in order to compete with on-call services as well?

#observability
An article about ElasticSearch in the style of "a message to myself back when I started working with it".

The article covers indices, shards, resource consumption, and troubleshooting.

#observability #elk #elasticsearch
A list of awesome Prometheus alerts, grouped by category: there are alerts for databases, proxies and load balancers, storage, etc.

You can just copy-paste these into your monitoring code. Just keep in mind that the thresholds may be different for your particular case!

#observability #monitoring #prometheus
Amazon Managed Service for Grafana now supports Grafana Enterprise upgrade, Grafana version 7.5, Open Distro for Elasticsearch integration, and AWS Billing reports

You can upgrade to Grafana Enterprise with a 30-day trial to enable enterprise data sources.

Beginning April 16th, 2021, customers using AMG will receive a 90-day free trial for five free users per account, with additional usage charges.

AMG is currently available in the US East (N. Virginia) and Europe (Ireland) regions.

#aws #observability
Gatus is a health dashboard written in Go.

It has a minimalistic configuration and allows you to set multiple conditions to label an endpoint as "healthy".

Also, you can host it yourself inside your private network. So, if your security requirements forbid letting external health checkers into the perimeter, this could be a good way to go.

#toolz #observability
A neat little write-up on lessons learned about incident response

Key takeaways:
- Declare incidents on smaller things. The division between SEV1 and SEV3 incidents helps you track system health better. Besides, a bunch of smaller problems may lead to a critical failure, and such problems are usually easy to fix one by one.
- Decrease the time between the incident and the postmortem analysis. The analysis will be much more accurate when you have a fresh memory of what happened.
- Alert on symptoms, not causes. Alert only if your users (external or internal) have issues, not when CPU utilization is high.
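
To illustrate the last takeaway, here is a minimal Python sketch of a paging decision driven only by user-facing symptoms. It is not taken from the write-up: the `Snapshot` fields, the `should_page` helper, and the thresholds are all hypothetical and would need tuning for a real service.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of a service."""
    error_rate: float        # fraction of failed user requests, 0.0-1.0
    p99_latency_ms: float    # 99th percentile request latency
    cpu_utilization: float   # 0.0-1.0, kept for diagnosis, not for paging


def should_page(s: Snapshot,
                max_error_rate: float = 0.01,
                max_p99_ms: float = 500.0) -> bool:
    """Page only on user-visible symptoms; CPU is deliberately ignored."""
    return s.error_rate > max_error_rate or s.p99_latency_ms > max_p99_ms


# A busy but healthy service does not page anyone...
busy_but_fine = Snapshot(error_rate=0.001, p99_latency_ms=120.0, cpu_utilization=0.97)
# ...while a mostly idle service that fails user requests does.
idle_but_broken = Snapshot(error_rate=0.05, p99_latency_ms=120.0, cpu_utilization=0.20)
```

CPU utilization still belongs on a dashboard for debugging; in this sketch it just never wakes anyone up on its own.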

#observability
Recently, I asked my subscribers what topics are interesting to them and a few people mentioned observability.

That’s funny, ‘coz yesterday I accidentally bumped into a great series of articles on setting SLAs for your products by Alex Ewerlöf!

- Calculating composite SLA - truly outstanding read!
- Some practical advice when setting SLA - notice, it says SLA, not SLO. So, there are some business-related tips in this article as well. However, the core is technical, ofc.
- Calculating the SLA of a system behind a CDN - I haven’t read this one yet. But given the quality of the previous two, I expect this one to be great as well!

tl;dr for the first article in the list:

For serial dependencies, multiply availabilities; for parallel ones, multiply unavailabilities.


I would personally also add that when you try to set a “full” SLO(A) for your service, that is also a composite SLO(A). You should treat it as a serial one. For example, if you have a 99.8% error rate SLO and a 99.1% latency SLO, the “overall” SLO would be 0.998 × 0.991 × 100% ≈ 98.9%.

That’s not only good to know, but you may also want to write your marketing materials differently. There is a difference between:

> We guarantee 99.8% SLO on 5xx error rate and 99.1% SLO on requests not taking longer than X milliseconds.

And

> We guarantee the 98.9% availability of our systems.

I’m not a marketing person, though. I don’t know what’s better. What I do know is that “nines don’t matter if your users are unhappy”.
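
The composition rule from the tl;dr above can be sketched in a few lines of Python. This is only an illustration, not code from the articles: the function names are made up, and the parallel case assumes that the components fail independently.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Every component must work: multiply the availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Any one component is enough: multiply the unavailabilities.

    Assumes failures are independent of each other.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


# The example from this post: 99.8% error rate SLO and 99.1% latency SLO.
overall = serial_availability([0.998, 0.991])
print(f"{overall:.1%}")  # 98.9%

# Two redundant replicas at 99% each (hypothetical numbers).
redundant = parallel_availability([0.99, 0.99])
```

Note how the serial result is always worse than the weakest component, while redundancy pushes the parallel result above the best one.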

#observability #slo #sla