What Are Observability, Monitoring & Alerting? How to Avoid System Downtime

Monitoring alerts the team to a possible issue, while observability helps the team identify and resolve the issue’s root cause.

If you work as a product engineer in the software business, two words you hear frequently are Scalability and Reliability.

While Reliability strengthens your system’s foundation and makes it stable, Scalability allows your system to grow and serve a wider audience.

Have you ever thought about the bigger picture behind the major outages of the last few years that have impacted large businesses?

In 2021, Facebook and its associated services WhatsApp and Instagram were down for approximately six hours following a network outage. This impacted 3.5 billion users of Facebook, WhatsApp and Instagram and cost the firm as much as $60 million in lost advertising revenue, as well as wiping almost 5% off its share value 😨

What could be the potential solution to avoid such losses and outages?

The answer to the above question is Observability, Monitoring & Alerting.

If you don’t know that your website is down, you won’t be able to fix it. A good monitoring tool tells you about the problem right away, so you can start fixing it sooner. You shouldn’t have to wait for users to tell you that your website is experiencing an outage.

We need to strengthen our system with the aid of Alerting & Monitoring, which means keeping the team and the company informed about the system’s current health and status 🙃 before downtime or service unavailability occurs.


What is Observability?

In simple terms, observability is the property of a system that lets you easily identify the root cause of a performance problem by looking at the data the system has produced over a period of time.

Without observability, it’s like going into battle 💣 without a gun 🔫

“Observability” is a superset of “Monitoring,” offering some advantages and insights that “monitoring” techniques fall short of.

Observability can produce a huge amount of data, which can quickly become a mess, so it’s important to understand how to use it.
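One way to keep that data usable is to emit it in a consistent, machine-readable shape. The sketch below is only an illustration: the function name `make_event`, the field names, and the example service are assumptions made for this post, not part of any particular tool.

```python
import json
import time


def make_event(service: str, name: str, **fields) -> str:
    """Build one structured event as a JSON string.

    Every event carries the same core keys (timestamp, service, event name),
    so events can be filtered and grouped later instead of being searched
    as free-form text. All names here are hypothetical.
    """
    event = {"ts": time.time(), "service": service, "event": name}
    event.update(fields)
    return json.dumps(event, sort_keys=True)


# Example: record how long a (hypothetical) checkout request took.
line = make_event("checkout", "request_finished", latency_ms=120, status=200)
print(line)
```

Because every event shares the same core keys, a dashboard or query can slice the data by service or event name without parsing free text.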


Picture a pyramid, which I refer to as the Monitoring Hierarchy, that suggests possible uses for the data that has been gathered.

The bottom of the pyramid holds the data collected from various systems, while the top represents the notifications that people will receive, read, and act upon as necessary.

The middle layer, Monitoring, presents large amounts of separate diagnostic data in the form of dashboards, which helps in quickly identifying and analyzing a problem.

This is why your monitoring system should answer two key questions: “What’s broken?” and “Why is it broken?”

What is Monitoring?

A classic example of something that requires monitoring is a database server running out of disk space, or a proxy server with high CPU utilization.

Creating “monitorable” systems requires understanding the failure domains of the key system components in advance. For monitoring to be effective, it’s very important to identify the failure modes of a system and to define a set of metrics that accurately indicate its health.

One of the main pain points of monitoring is the sheer variety of metrics being collected. Although we make an effort to gather all the data, the vast majority of these metrics are never examined. As a result, the metrics that really matter can be overlooked, which causes real-life problems in the system 😓.
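As a minimal sketch of turning one failure mode (a server running out of disk space) into a health metric, the snippet below uses Python’s standard `shutil.disk_usage`. The function names and the 90% limit are assumptions chosen for illustration; pick a limit that fits your own system.

```python
import shutil


def disk_usage_percent(path: str = ".") -> float:
    """Return the percentage of space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Guard against unusual filesystems that report no capacity.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_disk_healthy(used_percent: float, max_used_percent: float = 90.0) -> bool:
    """Treat the disk as healthy while usage stays below the chosen limit.

    The 90% default is an illustrative assumption, not a recommendation.
    """
    return used_percent < max_used_percent


used = disk_usage_percent(".")
print(f"disk used: {used:.1f}% healthy: {is_disk_healthy(used)}")
```

Keeping the measurement (`disk_usage_percent`) separate from the judgement (`is_disk_healthy`) makes it easy to chart the raw metric on a dashboard and tune the limit independently.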

Common monitoring myths are easier to avoid with these points in mind:

  • When a failure occurs, monitoring data should immediately offer insight into both the effects of the failure and the effects of any fix that is applied.
  • Monitoring provides a good approximation of the health of a system, but it does not ensure that failure can be entirely prevented.

What is Alerting?

If everything is alerted, then nothing is acted upon.

Instead, we should have 4–5 key business metrics for each application that indicate its health. Each metric should have set thresholds that trigger an email, a Slack message, or a page when breached.

A “warning” is fired first, signalling that a soft threshold has been crossed and that a potential problem may be present.
A “critical” alert is fired second. It signifies that a strict threshold has been crossed and that further investigation is required to confirm that a problem exists and, if so, to notify the relevant stakeholders.
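The two alert levels can be sketched as a small threshold check. This is an illustrative example only: the `Threshold` class, the metric name `p95_latency_ms`, and the numeric limits are all assumptions, and the code assumes a metric where higher values are worse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric (higher value = worse)."""
    metric: str
    warning: float
    critical: float


def classify(value: float, threshold: Threshold) -> str:
    """Return "critical", "warning", or "ok" for a measured value."""
    # Check the strict limit first so a critical value is never
    # reported as a mere warning.
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


# Hypothetical limits for a latency metric, in milliseconds.
latency = Threshold(metric="p95_latency_ms", warning=500.0, critical=1000.0)

for measured in (200.0, 600.0, 1500.0):
    print(latency.metric, measured, classify(measured, latency))
```

In a real setup the returned level would decide where the notification goes, for example a Slack message for a warning and a page for a critical alert.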


The next step for us is to combine the monitoring of business indicators, like conversion rate, with technical metrics, like latency. The benefit of this combination is that it lets us better understand how our system impacts our company.

The phrase “the site is down” changes into “the site is down so we lost Y potential customers”. As engineers, it is simpler to remain in the former mindset, but it is in the latter that we apply our knowledge of the business to improve the overall system.

This was just a small attempt on my part to explain Observability, Monitoring and Alerting.
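As a rough sketch of that shift in mindset, the Y in the sentence above can be estimated from the length of the outage, the usual traffic, and the usual conversion rate. The function name and all the numbers below are made up for illustration, and the formula assumes traffic and conversion rate would have stayed constant during the outage.

```python
def estimated_lost_customers(
    downtime_minutes: float,
    visitors_per_minute: float,
    conversion_rate: float,
) -> float:
    """Estimate customers lost to an outage.

    Simplifying assumption: every visitor who would normally have converted
    during the outage is lost, and traffic is constant over time.
    """
    return downtime_minutes * visitors_per_minute * conversion_rate


# Hypothetical figures: a 30-minute outage, 200 visitors per minute,
# and a 2% conversion rate.
lost = estimated_lost_customers(30, 200, 0.02)
print(f"Estimated lost customers: {lost:.0f}")
```

Even a crude estimate like this turns a technical metric (minutes of downtime) into a business one that the rest of the company can act on.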

