You encounter a large number of outages in the production systems you support. You receive alerts for every outage, and they wake you up at night. The alerts are due to unhealthy systems that are automatically restarted within a minute. You want to set up a process that would prevent staff burnout while following Site Reliability Engineering practices. What should you do?
A.
Eliminate unactionable alerts.
B.
Create an incident report for each of the alerts.
C.
Distribute the alerts to engineers in different time zones.
D.
Redefine the related Service Level Objective so that the error budget is not exhausted.
Answer: A. Eliminate unactionable alerts.
Your company is developing applications that are deployed on Google Kubernetes Engine
(GKE). Each team manages a different application. You need to create the development
and production environments for each team, while minimizing costs. Different teams should
not be able to access other teams’ environments. What should you do?
A.
Create one GCP Project per team. In each project, create a cluster for Development and
one for Production. Grant the teams IAM access to their respective clusters.
B.
Create one GCP Project per team. In each project, create a cluster with a Kubernetes
namespace for Development and one for Production. Grant the teams IAM access to their
respective clusters.
C.
Create a Development and a Production GKE cluster in separate projects. In each
cluster, create a Kubernetes namespace per team, and then configure Identity-Aware
Proxy so that each team can only access its own namespace.
D.
Create a Development and a Production GKE cluster in separate projects. In each
cluster, create a Kubernetes namespace per team, and then configure Kubernetes role-based
access control (RBAC) so that each team can only access its own namespace.
Answer: D. Create a Development and a Production GKE cluster in separate projects. In each
cluster, create a Kubernetes namespace per team, and then configure Kubernetes role-based
access control (RBAC) so that each team can only access its own namespace.
You need to define Service Level Objectives (SLOs) for a high-traffic multi-region web application. Customers expect the application to always be available and have fast response times. Customers are currently happy with the application performance and availability. Based on current measurements, you observe that the 90th percentile of latency is 120ms and the 95th percentile of latency is 275ms over a 28-day window. What latency SLO would you recommend that the team publish?
A.
90th percentile – 100ms
95th percentile – 250ms
B.
90th percentile – 120ms
95th percentile – 275ms
C.
90th percentile – 150ms
95th percentile – 300ms
D.
90th percentile – 250ms
95th percentile – 400ms
Answer: B. 90th percentile – 120ms; 95th percentile – 275ms
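To make the percentile figures in this question concrete, here is a minimal Python sketch of how measured latencies could be compared against a two-percentile latency SLO. The function names, the nearest-rank percentile method, and the sample data are assumptions for illustration only; they are not taken from the question or from any Google Cloud API.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number percentiles stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(samples_ms, targets_ms):
    """Return True if every percentile listed in targets_ms is at or
    below its threshold (both in milliseconds)."""
    return all(
        percentile(samples_ms, pct) <= limit
        for pct, limit in targets_ms.items()
    )


# Hypothetical latency samples: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))

# Thresholds taken from the measurements quoted in the question.
slo_targets_ms = {90: 120, 95: 275}

print(meets_latency_slo(latencies_ms, slo_targets_ms))
```

In practice the samples would come from a monitoring system over the 28-day window rather than from a hard-coded list.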
You support a large service with a well-defined Service Level Objective (SLO). The development team deploys new releases of the service multiple times a week. If a major incident causes the service to miss its SLO, you want the development team to shift its focus from working on features to improving service reliability. What should you do before a major incident occurs?
A.
Develop an appropriate error budget policy in cooperation with all service stakeholders.
B.
Negotiate with the product team to always prioritize service reliability over releasing new
features.
C.
Negotiate with the development team to reduce the release frequency to no more than
once a week.
D.
Add a plugin to your Jenkins pipeline that prevents new releases whenever your service
is out of SLO.
Answer: A. Develop an appropriate error budget policy in cooperation with all service stakeholders.
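For context on the error budget mentioned in option A, the following is a minimal Python sketch of how the remaining budget could be computed from an SLO target and used to decide when to pause feature work. The 99.9% target, the request counts, and the function names are hypothetical; an actual error budget policy is an agreement among stakeholders, not a single formula.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is the required success ratio, e.g. 0.999 for 99.9%.
    A result of 1.0 means nothing is spent; 0.0 or less means exhausted.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def should_freeze_features(slo_target, total_requests, failed_requests):
    """Assumed policy rule: stop feature releases once the budget is spent."""
    remaining = error_budget_remaining(
        slo_target, total_requests, failed_requests
    )
    return remaining <= 0


# Hypothetical window: 1,000,000 requests against a 99.9% SLO allows
# about 1,000 failures; 250 failures leaves roughly 75% of the budget.
print(error_budget_remaining(0.999, 1_000_000, 250))
print(should_freeze_features(0.999, 1_000_000, 1_500))
```

A rule like `should_freeze_features` is only one possible policy outcome; teams may instead agree on graduated responses as the budget shrinks.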
Your company experiences bugs, outages, and slowness in its production systems. Developers use the production environment for new feature development and bug fixes. Configuration and experiments are done in the production environment, causing outages for users. Testers use the production environment for load testing, which often slows the
production systems. You need to redesign the environment to reduce the number of bugs
and outages in production and to enable testers to load test new features. What should you
do?
A.
Create an automated testing script in production to detect failures as soon as they occur.
B.
Create a development environment with smaller server capacity and give access only to developers and testers.
C.
Secure the production environment to ensure that developers can't change it and set up one controlled update per year.
D.
Create a development environment for writing code and a test environment for
configurations, experiments, and load testing.
Answer: D. Create a development environment for writing code and a test environment for
configurations, experiments, and load testing.