Google Cloud load balancers down for 18 hours


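A subset of network load balancers in regions across the US, Europe and Asia could not "connect to backends". Google Cloud rolled back a configuration change, and service slowly returned to normal after the load balancers had been impaired for about 18 hours.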

The fault was first reported at 00:52 Pacific Daylight Time on August 30 and was still not fully resolved as of 19:18.

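Google worked to fix the fault. At 06:00 the company said its engineering team had "determined the infrastructure component responsible for the issue and mitigation work is currently underway."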

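By 07:00, however, the message on Google's status page had changed to: "Our previous actions did not resolve the issue. We are pursuing alternative solutions."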

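At 08:30 the message changed again: "We have identified the event that triggers this issue and are rolling back a configuration change to mitigate this issue." Half an hour later the change had been rolled out, and Google said it was "working on further measures to completely resolve the issue."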

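The fix meant that no new instances would be affected, but instances that were already running when the problem occurred remained affected. Google then advised users to do the following:

1) Create a new TargetPool.
2) Add the affected VMs in a region to the new TargetPool.
3) Wait for the VMs to start working in their existing load balancer configuration.
4) Delete the new TargetPool.

Do not delete the existing load balancer configuration, including the old target pool. It is not necessary to create a new ForwardingRule.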
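The workaround above can be sketched as a small shell script. This is only an illustration, not Google's tooling: the gcloud commands mirror the example in the dashboard log quoted later in this article, while the variable names (PROJECT, REGION, ZONE, INSTANCES, POOL, DRY_RUN), the run and workaround helpers, and all default values are assumptions made for the sketch. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the workaround. It only prints the gcloud commands
# unless DRY_RUN=0 is set. PROJECT, REGION, ZONE, INSTANCES, POOL and DRY_RUN
# are hypothetical names, and their default values are placeholders.
set -eu

PROJECT="${PROJECT:-your-project}"
REGION="${REGION:-us-central1}"
ZONE="${ZONE:-us-central1-a}"
INSTANCES="${INSTANCES:-instance1,instance2}"
POOL="${POOL:-dummy-pool}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

workaround() {
  # 1) Create a temporary target pool in the affected region.
  run gcloud compute target-pools create "$POOL" \
    --project="$PROJECT" --region="$REGION"

  # 2) Add the affected VMs to the temporary pool.
  run gcloud compute target-pools add-instances "$POOL" \
    --instances="$INSTANCES" --instances-zone="$ZONE" \
    --project="$PROJECT" --region="$REGION"

  # 3) Wait until the VMs work again behind their existing load balancer.
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "(wait for the VMs to recover)"
  else
    printf '%s' "Press Enter once the VMs are serving traffic again: "
    read -r reply
  fi

  # 4) Delete only the temporary pool; the existing configuration is untouched.
  run gcloud compute target-pools delete "$POOL" \
    --project="$PROJECT" --region="$REGION"
}

workaround
```

Keeping the dry run as the default is a deliberate choice: the printed commands can be reviewed first, and the real calls are made only when the script is run with DRY_RUN=0 and real values for the project, region, zone and instances.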

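This is exactly the kind of thing cloud users pay to avoid worrying about. The instructions were also not clear enough for some users, because half an hour later Google reissued them with "better formatting".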

As of this writing, Google said the issue "should be resolved for all regions except for < 10% of affected Network Load Balancers in us-central1."

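Google has all but admitted that its own update was the root of this mess, which may make this yet another self-inflicted incident: the company previously botched its own cloud in June 2016, August 2016, August 2016 and September 2016.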

Google staff: next time, please do not stage another self-inflicted cloud outage.

Incident report:

Google Cloud Status Dashboard


Google Cloud Networking Incident #17002

Issue with Cloud Network Load Balancers connectivity

Incident began at 2017-08-29 18:35 and ended at 2017-08-30 20:18 (all times are US/Pacific).
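The roughly 18-hour figure quoted in the article and the incident window stated on the dashboard cover different spans. As a rough check, the sketch below computes both, assuming all timestamps are US/Pacific with no daylight-saving change in between. The helper names to_min, span and fmt are hypothetical. The first span runs from the first report at 00:52 to the 19:18 update, which appears to be where the 18-hour figure comes from; the second is the dashboard's own start and end.

```shell
#!/bin/sh
# Elapsed-time arithmetic for the two windows mentioned in the text.
# to_min, span and fmt are hypothetical helper names.
set -eu

# Convert HH:MM to minutes since midnight. A leading zero is stripped so
# that values such as 08 are not read as octal.
to_min() {
  h=${1%%:*}; m=${1##*:}
  h=${h#0}; m=${m#0}
  echo $(( h * 60 + m ))
}

# Minutes between a start time and an end time that falls $3 days later.
span() {
  start=$(to_min "$1"); end=$(to_min "$2")
  echo $(( end + $3 * 1440 - start ))
}

# Format a number of minutes as "Hh Mm".
fmt() {
  echo "$(( $1 / 60 ))h $(( $1 % 60 ))m"
}

echo "article window:   $(fmt "$(span 00:52 19:18 0)")"
echo "dashboard window: $(fmt "$(span 18:35 20:18 1)")"
```

Run as written, this prints 18h 26m for the article's window and 25h 43m for the dashboard's window.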

DATE TIME DESCRIPTION
Aug 30, 2017 20:18 The issue with Network Load Balancers has been resolved for all affected projects as of 20:18 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.
Aug 30, 2017 19:18 The issue with Network Load Balancers should be resolved for all regions except for < 10% of affected Network Load Balancers in us-central1. The last few will be resolved in the upcoming hours. We will provide another status update by 21:00 US/Pacific with current details.
Aug 30, 2017 16:54 The issue with Network Load Balancers should be resolved for all regions except for < 10% of affected Network Load Balancers in us-central1. The last few will be resolved in the upcoming hours. We will provide another status update by 19:00 US/Pacific with current details.
Aug 30, 2017 15:58 The issue with Network Load Balancers should be resolved for all regions except us-central1, for which repairs are almost complete. We expect a full resolution in the next hour, and will provide another status update by 17:00 US/Pacific with current details.
Aug 30, 2017 14:35 The issue with Network Load Balancers should be resolved for all regions except us-central1, for which repairs are ongoing. We expect a full resolution in the next few hours, and will provide another status update by 16:00 US/Pacific with current details.
Aug 30, 2017 13:36 The issue with Network Load Balancers should be resolved for all regions except for us-central1, us-east1, and europe-west1. Repairs for those 3 are underway. We expect a full resolution in the next few hours. We will provide another status update by 16:00 US/Pacific with current details.
Aug 30, 2017 12:00 We have identified all possibly affected instances and are currently testing the fix for these instances. We will be deploying the fix once it has been verified. No additional action is required. Performing the workaround mentioned previously will not cause any adverse effects.

Next update at 14:00 US/Pacific

Aug 30, 2017 11:02 We wanted to send another update with better formatting. We will provide another update on resolving affected instances by 12 PDT.

Affected customers can also mitigate their affected instances with the following procedure (which causes Network Load Balancer to be reprogrammed) using gcloud tool or via the Compute Engine API.

NB: No modification to the existing load balancer configurations is necessary, but a temporary TargetPool needs to be created.

Create a new TargetPool. Add the affected VMs in a region to the new TargetPool. Wait for the VMs to start working in their existing load balancer configuration. Delete the new TargetPool. DO NOT delete the existing load balancer config, including the old target pool. It is not necessary to create a new ForwardingRule.

Example:

1) gcloud compute target-pools create dummy-pool --project=<your_project> --region=<region>

2) gcloud compute target-pools add-instances dummy-pool --instances=<instance1,instance2,…> --project=<your_project> --region=<region> --instances-zone=<zone>

3) (Wait)

4) gcloud compute target-pools delete dummy-pool --project=<your_project> --region=<region>

Aug 30, 2017 10:30 Our first mitigation has completed at this point and no new instances should be affected. We are slowly going through and fixing affected customers. Affected customers can also mitigate their affected instances with the following procedure (which causes Network Load Balancer to be reprogrammed) using gcloud tool or via the Compute Engine API.

NB: No modification to the existing load balancer configurations is necessary, but a temporary TargetPool needs to be created.

Create a new TargetPool. Add the affected VMs in a region to the new TargetPool. Wait for the VMs to start working in their existing load balancer configuration. Delete the new TargetPool. DO NOT delete the existing load balancer config, including the old target pool. It is not necessary to create a new ForwardingRule.

Example:

gcloud compute target-pools create dummy-pool --project=<your_project> --region=<region>

gcloud compute target-pools add-instances dummy-pool --instances=<instance1,instance2,…> --project=<your_project> --region=<region> --instances-zone=<zone>

(Wait)

gcloud compute target-pools delete dummy-pool --project=<your_project> --region=<region>

Aug 30, 2017 09:30 We are experiencing an issue with a subset of Network Load Balancers. The configuration change to mitigate this issue has been rolled out and we are working on further measures to completely resolve the issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 10:30 US/Pacific with current details.
Aug 30, 2017 09:00 We are experiencing an issue with a subset of Network Load Balancer in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. The configuration change to mitigate this issue has been rolled out and we are working on further measures to completely resolve the issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 09:30 US/Pacific with current details.
Aug 30, 2017 08:30 We are experiencing an issue with a subset of Network Load Balancer in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. We have identified the event that triggers this issue and are rolling back a configuration change to mitigate this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 09:00 US/Pacific with current details.
Aug 30, 2017 08:00 We are experiencing an issue with a subset of Network Load Balancer in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. Mitigation work is still in progress. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 08:30 US/Pacific with current details.
Aug 30, 2017 07:30 We are experiencing an issue with a subset of Network Load Balancer in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. Mitigation work is still in progress. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 08:00 US/Pacific with current details.
Aug 30, 2017 07:00 We are experiencing an issue with a subset of Network Load Balancer in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. Our previous actions did not resolve the issue. We are pursuing alternative solutions. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 07:30 US/Pacific with current details.
Aug 30, 2017 06:30 We are experiencing an issue with a subset of Network Load Balancer in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. Mitigation work is currently underway by our Engineering Team. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 07:00 US/Pacific with current details.
Aug 30, 2017 06:00 We are experiencing an issue with a subset of Network Load Balancer in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. Our Engineering Team has determined the infrastructure component responsible for the issue and mitigation work is currently underway. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 06:30 US/Pacific with current details.
Aug 30, 2017 05:30 We are experiencing an issue with a subset of Network Load Balancer in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. Our Engineering Team has reduced the scope of possible root causes and is still investigating. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 06:00 US/Pacific with current details.
Aug 30, 2017 05:00 We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. The investigation is still ongoing. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 05:30 US/Pacific with current details.
Aug 30, 2017 04:30 We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. The investigation is still ongoing. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 05:00 US/Pacific with current details.
Aug 30, 2017 04:00 We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. We have ruled out several possible failure scenarios. The investigation is still ongoing. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 04:30 US/Pacific with current details.
Aug 30, 2017 03:30 We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 04:00 US/Pacific with current details.
Aug 30, 2017 03:00 We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 03:30 US/Pacific with current details.
Aug 30, 2017 02:30 We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 03:00 US/Pacific with current details.
Aug 30, 2017 01:50 We are investigating an issue with network load balancer connectivity. We will provide more information by 02:30 US/Pacific.
Aug 30, 2017 01:20 We are investigating an issue with network connectivity. We will provide more information by 01:50 US/Pacific.
Aug 30, 2017 00:52 We are investigating an issue with network connectivity. We will provide more information by 01:20 US/Pacific.

Source: 云头条


Copyright © 2012-2019 YUNWEIPAI.COM - 运维派 - 粤ICP备14090526号-3