CatOps


An article from Percona with tips on building dashboards.

The title mentions Grafana, but the tips are actually general. Some may find them obvious, but I have seen too many confusing, useless, and overloaded dashboards, so I think it is worth posting.

#observability

Percona Database Performance Blog

Tips for Designing Grafana Dashboards

Peter Zaitsev shares some of his considerations for designing Grafana dashboards, which will help you create better ones.


Designing Practical Dashboards

CatOps

A four-part saga from Percona CEO Peter Zaitsev on measuring Linux performance and the typical mistakes people make:

- Part one: CPU
- Part two: Disk
- Part three: Memory
- Part four: Network

A very good breakdown, recommended reading 👍

#observability

ma.ttias.be

How to measure Linux Performance Avoiding Most Typical Mistakes: CPU

This post is the first in a four-part blog series by Peter Zaitsev, Percona Chief Executive Officer.


Datadog has launched its mobile app, with access to alerts and dashboards as well as integration with services such as PagerDuty and OpsGenie.

The open question remains: will they add rotation to it, so they can also compete with on-call services?

#observability

Introducing the Datadog mobile app

The Datadog mobile app gives you instant access to your dashboards and alerts in a portable package—perfect for being on-call.


CatOps

An article about Elasticsearch in the style of "a letter to myself when I started working with it".

It covers indices, shards, resource consumption, and troubleshooting.

#observability #elk #elasticsearch

Medium

Starter-kit for Elasticsearch operations

The post tries to answer these questions: how to size ES nodes, how to troubleshoot them, and where to find deep-dive posts about Elasticsearch.


CatOps

A list of awesome Prometheus alerts, grouped by category: there are alerts for databases, proxies and load balancers, storage, etc.

You can just copy-paste these into your monitoring code. Just keep in mind that the thresholds may be different for your particular case!

#observability #monitoring #prometheus


Awesome Prometheus alerts

CatOps

Amazon Managed Service for Grafana now supports Grafana Enterprise upgrade, Grafana version 7.5, Open Distro for Elasticsearch integration, and AWS Billing reports

You can upgrade to Grafana Enterprise with a 30-day trial to enable enterprise data sources.

Beginning April 16th, 2021, customers using AMG will receive a 90-day free trial for five free users per account, with additional usage charges.

AMG is currently available in the US East (N. Virginia) and Europe (Ireland) regions.

#aws #observability

Amazon

Amazon Managed Service for Grafana now supports Grafana Enterprise upgrade, Grafana version 7.5, Open Distro for Elasticsearch…


AWS Blog

CatOps

Gatus is a health dashboard written in Go.

It has a minimalistic configuration and allows you to set multiple conditions for labeling an endpoint as "healthy".

Also, you can host it yourself inside your private network. So, if your security requirements do not allow external health checkers into the perimeter, this could be a good way to go.
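To make the "multiple conditions" idea concrete, here is a minimal Python sketch. It is not Gatus code or Gatus configuration syntax: the `ProbeResult` type, the `is_healthy` helper, and the thresholds are all hypothetical and only illustrate that an endpoint counts as healthy when every condition holds.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    """Hypothetical result of one health probe (not a Gatus type)."""
    status: int
    response_time_ms: float
    body: str


# Each condition is a predicate over the probe result.
Condition = Callable[[ProbeResult], bool]

# Assumed example conditions: status code, latency, and body content.
conditions: List[Condition] = [
    lambda r: r.status == 200,
    lambda r: r.response_time_ms < 300,
    lambda r: '"status":"UP"' in r.body,
]


def is_healthy(result: ProbeResult, checks: List[Condition]) -> bool:
    """An endpoint is healthy only when every condition holds."""
    return all(check(result) for check in checks)


print(is_healthy(ProbeResult(200, 120.0, '{"status":"UP"}'), conditions))  # True
print(is_healthy(ProbeResult(200, 950.0, '{"status":"UP"}'), conditions))  # False
```

The second probe returns HTTP 200 but is too slow, so it is still reported as unhealthy.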

#toolz #observability

GitHub

GitHub - TwiN/gatus: ⛑ Automated developer-oriented status page

⛑ Automated developer-oriented status page. Contribute to TwiN/gatus development by creating an account on GitHub.


Gatus GitHub

CatOps

A small, neat write-up of lessons learned about incident response.

Key takeaways:
- Declare incidents for smaller things. Distinguishing between SEV1 and SEV3 incidents helps you track system health better. A bunch of smaller problems may also add up to a critical failure, and such problems are usually easy to fix one by one.
- Decrease the time between the incident and the postmortem analysis. The analysis will be much more accurate while your memory of what happened is still fresh.
- Alert on symptoms, not causes. Alert only when your users (external or internal) have issues, not when CPU utilization is high.
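The last takeaway can be sketched in a few lines of Python. This is only an illustration, not something from the article: the `should_page` function and the 1% threshold are assumptions, and the point is that the paging decision takes user-visible failures as input and deliberately ignores CPU.

```python
def should_page(failed_requests: int, total_requests: int,
                max_error_ratio: float = 0.01) -> bool:
    """Page when the user-visible error ratio exceeds the (assumed) threshold."""
    if total_requests == 0:
        return False  # no traffic, so no user is affected
    return failed_requests / total_requests > max_error_ratio


# High CPU with happy users is not a reason to wake anyone up...
print(should_page(failed_requests=2, total_requests=10_000))    # False
# ...while failing requests are, whatever the CPU graph says.
print(should_page(failed_requests=500, total_requests=10_000))  # True
```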

#observability

FireHydrant

Pragmatic Incident Response: 3 Lessons Learned from Failures

Lessons learned from the front line that you can actually use immediately in your incident management process.


CatOps

Recently, I asked my subscribers which topics interest them, and a few people mentioned observability.

That’s funny, ‘coz yesterday I accidentally stumbled upon a great series of articles on setting SLAs for your products by Alex Ewerlöf!

- Calculating composite SLA - a truly outstanding read!
- Some practical advice when setting SLA - notice, it says SLA, not SLO. So, there are some business-related tips in this article as well. However, the core is technical, ofc.
- Calculating the SLA of a system behind a CDN - I haven’t read this one yet. But given the quality of the previous two, I expect this one to be great as well!

tl;dr for the first article in the list:

> For serial dependencies, multiply availabilities; for parallel ones, multiply unavailabilities.

I would personally add that when you set a “full” SLO (or SLA) for your service, it is also a composite one, and you should treat it as serial. For example, if you have a 99.8% error-rate SLO and a 99.1% latency SLO, the “overall” SLO would be 0.998 × 0.991 × 100% ≈ 98.9%.
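The tl;dr rule is easy to check in code. Below is a minimal Python sketch: the function names are mine, not from the articles, and the two-replica numbers are made up for illustration. The serial case reuses the 99.8% and 99.1% figures from the example above.

```python
from math import prod


def serial_availability(availabilities):
    """Every component must be up, so availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """The group fails only if every replica fails, so unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# The "overall" SLO from the example above: error rate and latency treated as serial.
overall = serial_availability([0.998, 0.991])
print(f"{overall:.1%}")  # 98.9%

# Hypothetical numbers: two redundant replicas, each 99% available.
redundant = parallel_availability([0.99, 0.99])
print(f"{redundant:.2%}")  # 99.99%
```

Note how serial composition always lowers the total, while redundancy raises it above any single replica.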

That’s not only good to know, but you may also want to write your marketing materials differently. There is a difference between:

> We guarantee a 99.8% SLO on the 5xx error rate and a 99.1% SLO on requests not taking longer than X milliseconds.

And

> We guarantee 98.9% availability of our systems.

I’m not a marketing person, though. I don’t know what’s better. What I do know is that “nines don’t matter if your users are unhappy”.

#observability #slo #sla

Medium

Calculating composite SLA

How serial and parallel dependencies affect the total SLA

