Google Cloud Status

UPDATE: Incident 20009 - Cloud Logging delays on log ingestion affecting 25% of projects.

The issue with Cloud Logging has been resolved for all affected projects as of Thursday, 2020-11-19 21:15 US/Pacific. We thank you for your patience while we worked on resolving the issue.

Last Update: About 7 days ago

UPDATE: Incident 20009 - Cloud Logging delays on log ingestion affecting 25% of projects.

Description: We believe the issue with Cloud Logging is partially resolved. Additional cleanup is ongoing. Some users may experience slow queries. We expect a full resolution within 3 hours. We will provide an update by Thursday, 2020-11-19 22:30 US/Pacific with current details. Diagnosis: Log writes and reads experiencing latency on logs ingested after 2020-11-19 08:41 US/Pacific. 25% of customers are affected and may see incomplete queries. Workaround: None at this time.
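For customers who want to confirm whether their project has caught up, a minimal spot check (assuming the gcloud SDK is installed; PROJECT_ID is a placeholder) is to query for entries ingested after the incident start time of 2020-11-19 08:41 US/Pacific (16:41 UTC):

```
# Hypothetical spot check: list a few recent entries with timestamps after the
# incident start (08:41 US/Pacific = 16:41 UTC). PROJECT_ID is a placeholder.
gcloud logging read 'timestamp>="2020-11-19T16:41:00Z"' \
  --project=PROJECT_ID \
  --limit=10 \
  --order=desc
```

If recent entries are returned promptly, ingestion and queries for the project have likely recovered.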

Last Update: About 7 days ago

UPDATE: Incident 20009 - Cloud Logging delays on log ingestion affecting 25% of projects.

Description: We believe the issue with Cloud Logging is partially resolved. Additional cleanup and prevention efforts are ongoing. Some users may experience slow queries. We do not have an ETA for full resolution at this point. We will provide an update by Thursday, 2020-11-19 19:30 US/Pacific with current details. Diagnosis: Log writes and reads experiencing latency on logs ingested after 2020-11-19 08:41 US/Pacific. 25% of customers are affected and may see incomplete queries. Workaround: None at this time.

Last Update: About 7 days ago

UPDATE: Incident 20009 - Cloud Logging delays on log ingestion affecting 25% of projects.

Description: We believe the issue with Cloud Logging is partially resolved. Additional cleanup and prevention efforts are ongoing, but the majority of users should see their log data available and queries succeeding. We do not have an ETA for full resolution at this point. We will provide an update by Thursday, 2020-11-19 18:30 US/Pacific with current details. Diagnosis: Log writes and reads experiencing latency on logs ingested after 2020-11-19 08:41 US/Pacific. 25% of customers are affected and may see incomplete queries. Workaround: None at this time.

Last Update: About 7 days ago

UPDATE: Incident 20012 - Elevated frequency of Host Maintenance events on GCE instances with an attached GPU(s) and SSD(s)

The issue with Google Compute Engine instances with an attached GPU(s) and SSD(s) is believed to be affecting a very small number of projects and our Engineering Team continues to work on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here. We thank you for your patience while we're working on resolving the issue.

Last Update: About 15 days ago

UPDATE: Incident 20012 - Elevated frequency of Host Maintenance events on GCE instances with an attached GPU(s) and SSD(s)

Description: Mitigation work is still underway by our engineering team. Further investigation of current impact and mitigation timeline is ongoing. We will provide more information by Wednesday, 2020-11-11 13:00 US/Pacific. Diagnosis: Affected customers will experience elevated frequency of Host Maintenance events on GCE instances with an attached GPU(s) and SSD(s). Workaround: Temporarily switch to V100 GPUs, which are unaffected by this issue. https://cloud.google.com/compute/docs/gpus
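As a rough illustration of the workaround above, a replacement instance with a V100 GPU could be created along these lines; the instance name, zone, machine type, and image are placeholders and should be adjusted for your workload:

```
# Hypothetical example of the V100 workaround; all names and the zone are placeholders.
# GPU instances cannot live-migrate, so they use a host maintenance policy of TERMINATE.
gcloud compute instances create example-v100-instance \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-v100,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=debian-10 \
  --image-project=debian-cloud
```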

Last Update: About 16 days ago

UPDATE: Incident 20012 - Elevated frequency of Host Maintenance events on GCE instances with an attached GPU(s) and SSD(s)

Description: We are experiencing an issue with Google Compute Engine that began in 2020-08. A firmware rollout that should address the issue is being prepared. The rollout is currently expected to complete next week, but mitigation efforts are still ongoing. We will provide more information by Tuesday, 2020-11-10 16:30 US/Pacific. Diagnosis: Affected customers will experience elevated frequency of Host Maintenance events on GCE instances with an attached GPU(s) and SSD(s). Workaround: Temporarily switch to V100 GPUs, which are unaffected by this issue. https://cloud.google.com/compute/docs/gpus

Last Update: About 16 days ago

RESOLVED: Incident 20011 - GCE instance creation/start/stop operations failing in us-central1-f

The issue with Google Compute Engine has been resolved for all affected users as of Friday, 2020-10-30 16:35 US/Pacific. We thank you for your patience while we worked on resolving the issue.

Last Update: About 27 days ago

UPDATE: Incident 20011 - GCE instance creation/start/stop operations failing in us-central1-f

Description: We believe the issue with Google Compute Engine operations is partially resolved. All previous operations should have completed, and customers should experience normal latency on new operations. We will provide an update by Friday, 2020-10-30 17:45 US/Pacific with current details. Diagnosis: Affected customers may see delays performing instance operations, but they should eventually complete. Other Cloud Services that create instances on demand may also experience impact such as: Google Kubernetes Engine, Composer, Dataproc, and Dataflow. Workaround: When possible try instance operations in other zones.
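As a sketch of the workaround above, the same create request can simply be retried against a different zone in the region; the instance name, machine type, and alternate zone below are placeholders:

```
# Hypothetical example of the workaround: retry the operation in another zone.
# Instance name, machine type, and zone are placeholders.
gcloud compute instances create example-instance \
  --zone=us-central1-b \
  --machine-type=e2-medium
```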

Last Update: About 27 days ago

UPDATE: Incident 20011 - GCE instance creation/start/stop operations failing in us-central1-f

Description: Other Cloud Services that create instances on demand may be impacted, such as Google Kubernetes Engine, Composer, Dataproc, and Dataflow. We will provide an update by Friday, 2020-10-30 17:00 US/Pacific with current details. Diagnosis: Affected customers may see delays performing instance operations, but they should eventually complete. Other Cloud Services that create instances on demand may also experience impact, such as Google Kubernetes Engine, Composer, Dataproc, and Dataflow. Workaround: When possible, try instance operations in other zones.

Last Update: About 27 days ago

UPDATE: Incident 20011 - GCE instance creation/start/stop operations failing in us-central1-f

Description: We are experiencing an issue with Google Compute Engine starting on Friday, 2020-10-30 07:30 US/Pacific. The backlog is continuing to decrease, and delays on new instance creation have significantly decreased from the peak to an average of 1 minute. We anticipate the backlog will clear in approximately 40 minutes. We will provide an update by Friday, 2020-10-30 17:00 US/Pacific with current details. Diagnosis: Affected customers may see delays performing instance operations, but they should eventually complete. Workaround: When possible, try instance operations in other zones.

Last Update: About 27 days ago

RESOLVED: Incident 20007 - App Engine Standard returning elevated HTTP 500 errors in us-central1

The issue with Google App Engine has been resolved for all affected users as of Tuesday, 2020-10-06 09:13 US/Pacific. We thank you for your patience while we worked on resolving the issue.

Last Update: About 1 month ago

RESOLVED: Incident 20010 - We are experiencing an issue with multiple GCP products.

# BACKGROUND

Google's Global Service Load Balancer (GSLB) is a collection of software and services that load balance traffic across Google properties. There are two main components: a control plane and a data plane. The control plane provides programming to the data plane on how to handle requests. A key component of the data plane is the Google Front End (GFE). The GFE is an HTTP/TCP reverse proxy which is used to serve requests to many Google properties including: Search, Ads, G Suite (Gmail, Chat, Meet, Docs, Drive, etc.), Cloud External HTTP(S) Load Balancing, Proxy/SSL Load Balancing, and many Cloud APIs. Google's Global Load Balancers are implemented using a GFE architecture that has two tiers in some cases. The first tier of GFEs is situated as close to the user as possible to minimize latency during connection setup. First-tier GFEs route requests either directly to applications, or in some cases to a second tier of GFEs providing additional functionality, before routing to applications. This architecture allows clients to have low-latency connections anywhere in the world, while taking advantage of Google's global network to serve requests to backends, regardless of region. The pool of GFE instances which was impacted in this incident is part of the second tier, handling a subset of Google services. Therefore, this incident only impacted services routed through this specific pool.

# ISSUE SUMMARY

On Thursday, 2020-09-24 at 18:00 US/Pacific, one of Google's several second-tier GFE pools experienced intermittent failures resulting in impact to several downstream services. Almost all services recovered within the initial 33 minutes of the incident; exceptions are outlined in the detailed impact section below. Affected customers experienced elevated error rates and latency when connecting to Google APIs. Existing workloads (i.e. running instances on GCE, or containers on GKE) were not impacted unless they needed to invoke impacted APIs. Service impact can be divided into two categories: direct and indirect. Services which have a request path that flows through the impacted GFE pool would have been directly impacted. Calls to these services would have experienced higher latency or elevated errors in the form of HTTP 502 response codes. Alternatively, services which did not directly rely on this pool of impacted GFEs may invoke other services, such as authentication, that depend on this shared pool of GFEs. This indirect impact would have varied between customers. One example of this, which we expect to be one of the most common forms of indirect impact, would be use of an OAuth token that needed to be refreshed or retrieved. While a service such as Cloud Spanner may not have been serving errors, customers using the Cloud Spanner Client may have seen errors when the client attempted to refresh credentials, depending on the API used to refresh/obtain the credential. A detailed description of impact can be found below. To our Cloud customers whose businesses were impacted during this disruption, we sincerely apologize; we have conducted a thorough internal investigation and are taking immediate action to improve the resiliency, performance, and availability of our services.

# ROOT CAUSE

For any given pool of tasks, the GFE control plane has a global view of capacity, service configurations, and network conditions, which are all combined and sent to GFEs to create efficient request serving paths. This global view allows requests to be routed seamlessly to other regions, which is useful in scenarios like failover or for load balancing between regions. GFEs are grouped into pools for a variety of traffic profiles, health checking requirements, and other factors; the impacted second-layer GFE pool was used by multiple services. The GFE control plane picks up service configuration changes and distributes them to GFEs. For this incident, two service changes contained an error that resulted in a significant increase in the number of backends accessed by GFEs in this pool. The particular nature of these changes additionally meant that they would be distributed to all GFEs in this pool globally, instead of being limited to a particular region. While the global aspect was intended, the magnitude of backend increases was not. The greatly increased number of programmed backends caused GFEs to exceed their memory allocation in many locations. GFE has many internal protections which are activated when there is memory pressure, such as closing idle connections or refusing to accept new connections, allowing them to keep running despite a memory shortage. Tasks which exceeded memory limits were terminated. The combination of a reduced number of available GFEs and a reduction in accepted connections meant that traffic to services behind the impacted GFE pool dropped by 50%.

# REMEDIATION AND PREVENTION

Google engineers were alerted to the outage three minutes after impact began, at 2020-09-24 18:03, and immediately began an investigation. At 18:15 the first service change, which significantly increased the number of programmed backends, was rolled back. At 18:18 the second service configuration change was rolled back. Google engineers started seeing recovery at 18:20 and at 18:33 the issue was fully mitigated. GFE is one of the most critical pieces of infrastructure at Google and has multiple lines of defense in depth, both in software and operating procedure. As a result of this outage, we are adding additional protections to both in order to eliminate this class of failure. As an immediate step we have limited the type of configuration changes that can be made until additional safeguards are in place. Those additional safeguards will include stricter validation of configuration changes; specifically, rejecting changes that cause a large increase in backend count across multiple services. In addition to a check in the control plane, we will be augmenting existing protections in the GFE against unbounded growth in any resource dimension, such as backend counts. We will also be performing an audit of existing configurations and converting risky configurations to alternative setups. A restriction will be placed on certain configuration options, only allowing use with additional review and allow lists. Finally, an audit will be performed of services in shared GFE pools, with additional pools being created to reduce impact radius, should an issue in this part of the infrastructure surface again.

# DETAILED DESCRIPTION OF IMPACT

On 2020-09-24 from 18:00 to 18:33 US/Pacific (unless otherwise noted) the following services were impacted globally:

### OAuth

The following OAuth paths were impacted and returned errors for 50% of requests during the impact period. Impact perceived by customers may have been less, as many client libraries make requests to these paths asynchronously to refresh tokens before they expire and retry their requests upon failure, potentially receiving successful responses:

- oauth2.googleapis.com/token
- accounts.google.com/o/oauth2/token
- www.youtube.com/o/oauth2/token
- www.googleapis.com/o/oauth2/token
- www.googleapis.com/oauth2/{v3,v4}/token
- accounts.{google,youtube}.com/o/oauth2/{revoke,device/code,tokeninfo}
- www.googleapis.com/oauth2/v3/authadvice
- www.googleapis.com/oauth2/v2/IssueToken
- oauthaccountmanager.googleapis.com

The following APIs were NOT affected:

- www.googleapis.com/oauth2/{v1,v2,v3}/certs
- contents.googleapis.com/oauth2/{v1,v2,v3}/certs
- contents6.googleapis.com/oauth2/{v1,v2,v3}/certs
- iamcredentials.googleapis.com and accounts.google.com (other than the specific URLs mentioned above) were not affected.

### Chat

Google Chat experienced an elevated rate of HTTP 500 & 502 errors (averaging 34%) between 18:00 and 18:04, decreasing to a 7% error rate from 18:04 to 18:14, with a mean latency of 500ms. This resulted in affected users being unable to load the Chat page or to send Chat messages.

### Classic Hangouts

Classic Hangouts experienced an elevated rate of HTTP 500 errors (reducing Hangouts traffic by 44%) between 18:00 and 18:25. The service error rate was below 1% for Hangouts requests within the product, including sending messages.

### Meet

Google Meet experienced error rates up to 23% of requests between 18:02 and 18:23. Affected users observed call startup failures which affected 85% of session attempts. Existing Meet sessions were not affected.

### Voice

Google Voice experienced a 66% drop in traffic between 18:00 and 18:24. Additionally, the service had an elevated error rate below 1% between 18:01 and 18:14, and an average 100% increase in mean latency between 18:03 and 18:12.

### Calendar

Google Calendar web traffic observed up to a 60% reduction in traffic, and an elevated HTTP 500 error rate of 4.8% between 18:01 and 18:06, which decreased to and remained below 1% for the remainder of the outage. Calendar API traffic observed up to a 53% reduction in traffic, with an average error rate of 2% for the same period. The traffic reduction corresponded with HTTP 500 errors being served to users.

### Groups

Google Groups web traffic dropped roughly 50% for the classic UI, and 30% for the new UI. Users experienced an average elevated HTTP 500 error rate between 0.12% and 3%.

### Gmail

Gmail observed a 35% drop in traffic due to the GFE responding with HTTP 500 errors. The service error rate remained below 1% for the duration of the incident. This affected Gmail page loading and web interactions with the product.

### Docs

Google Docs witnessed a 33% drop in traffic between 18:00 and 18:23, corresponding with the GFE returning HTTP 500 errors to user interactions. Additionally, between 18:01 and 18:06 the service error rate rose to 1.4%, before decreasing and remaining at approximately 0.3% until 18:23.

### Drive

Google Drive observed a 60% traffic drop between 18:00 and 18:23, corresponding with the GFE returning HTTP 500 errors to user interactions. The Drive API experienced a peak error rate of 7% between 18:02 and 18:04, and then between 1% and 2% until 18:25. Google Drive web saw up to a 4% error rate between 18:01 and 18:06. 50th percentile latency was unaffected, but 95th percentile latency rose to 1.3s between 18:02 and 18:06.

### Cloud Bigtable

Some clients using the impacted OAuth authentication methods described above were unable to refresh their credentials and thus unable to access Cloud Bigtable. Clients using alternative authentication methods were not impacted. Impacted clients experienced elevated error rates and latency. The main impact was to clients accessing Cloud Bigtable from outside of GCP; there was a 38% drop in this traffic during the impact period.

### Cloud Build API

Cloud Build API experienced elevated error rates due to a 50% loss of incoming traffic over 26 minutes before reaching the service front end. Additionally, 4% of Cloud Builds failed due to receiving errors from other Google services.

### Cloud Key Management Service (KMS)

Cloud KMS was unable to receive API requests from GFE for ~33 minutes starting at 18:00, impacting non-Customer Managed Encryption Key (CMEK) customers. CMEK customers were not impacted.

### Cloud Logging

Cloud Logging experienced increased error rates (25% average, up to 40% at peak) from 18:05 to 18:50. Customers would have experienced errors when viewing logs in the Cloud Console. Data ingestion was not impacted.

### Cloud Monitoring

Cloud Monitoring API experienced elevated error rates (50% average, up to 80% at peak) of uptime checks and requests from 18:00 - 18:26. This affected cloud uptime workers running uptime checks.

### Cloud Networking API

Cloud Networking API experienced up to a 50% error rate for Network Load Balancer creation from 18:00 to 18:20 due to downstream service errors. Additionally, up to 35% of HTTP(S) Load Balancer or TCP/SSL Proxy Load Balancer creation requests failed from 18:00 - 18:28 due to downstream service errors. Traffic for existing load balancers was unaffected.

### Google Compute Engine API

The Google Compute Engine (GCE) API experienced an error rate of up to 50% from 18:00 - 18:25, with affected users experiencing HTTP 502 error response codes. This would have prevented loading the GCE portion of the Cloud Console as well as listing, modifying, and creating GCE resources via other API clients. This applies only to the GCE API. GCE instance connectivity and availability was not impacted. Please note that some GCP services were served by the impacted GFE pool, so customer workloads running inside compute instances may have seen impact if they depend on other GCP services that experienced impact. Autoscaler continued to function nominally during the outage window.

### Cloud Profiler

Cloud Profiler API experienced an elevated rate of HTTP 502 errors due to an up to 50% reduction in global traffic for all requests.

### Cloud Run API

Cloud Run API experienced an elevated rate of HTTP 502 errors, up to 70%, from 18:00 to 18:30. Existing Cloud Run deployments were unaffected.

### Cloud Spanner

Cloud Spanner clients experienced elevated error rates due to authentication issues which caused a 20% drop in traffic. Impacted customers saw increased latency and errors accessing Cloud Spanner. Clients using alternative authentication methods, such as GKE Workload Identity, were not impacted.

### Game Servers

Game Servers experienced elevated request latencies of up to 4x normal levels during the incident window, resulting in some clients experiencing connection timeouts and increased retry attempts. The service did not experience elevated error rates.

### Google Cloud Console

4.18% of customers experienced "The attempted action failed" error messages when attempting to load pages in the Cloud Console during the incident window. This prevented some customers from viewing the UI of networking, compute, billing, monitoring, and other products and services within the Cloud Console platform.

### Google Cloud SQL

0.08% of Cloud SQL's fleet experienced instance metrics and logging delays from 18:07 - 18:37, for a duration of 30 minutes. The Cloud SQL API did not serve errors during the outage, but incoming traffic dropped by ~30%. No spurious auto-failovers or auto-repairs were executed as a result of the incident. There were no actual instance failures.

### Google Kubernetes Engine

Requests to the Google Kubernetes Engine (GKE) control plane experienced increased timeouts and HTTP 502 errors. Up to 6.6% of cluster masters reported errors during the time of the incident. Up to 5.5% of newly added nodes to clusters may have experienced errors due to issues communicating with impacted cluster masters.

### Firebase Crashlytics

66% of Crashlytics imports from AWS were impacted from 18:01 - 19:01 US/Pacific, for a duration of 60 minutes. This created an import backlog which was quickly worked through 10 minutes after the incident ended.

# SLA CREDITS

If you believe your paid application experienced an SLA violation as a result of this incident, please submit the SLA credit request: https://support.google.com/cloud/contact/cloud_platform_sla

A full list of all Google Cloud Platform Service Level Agreements can be found at https://cloud.google.com/terms/sla/.

For G Suite, please request an SLA credit through one of the Support channels: https://support.google.com/a/answer/104721

The G Suite Service Level Agreement can be found at https://gsuite.google.com/intl/en/terms/sla.html
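As noted in the OAuth section above, many client libraries already retry token refreshes automatically. For code that calls the token endpoint directly, a minimal retry-with-backoff sketch might look like the following; CLIENT_ID, CLIENT_SECRET, and REFRESH_TOKEN are placeholders, and this is illustrative rather than an official client:

```
# Hypothetical sketch: refresh an OAuth 2.0 access token with exponential backoff,
# mirroring what most Google client libraries do automatically.
# CLIENT_ID, CLIENT_SECRET, and REFRESH_TOKEN are placeholders.
for attempt in 1 2 3 4 5; do
  if curl -sf -X POST "https://oauth2.googleapis.com/token" \
       -d "client_id=${CLIENT_ID}" \
       -d "client_secret=${CLIENT_SECRET}" \
       -d "refresh_token=${REFRESH_TOKEN}" \
       -d "grant_type=refresh_token" \
       -o token.json; then
    break                      # success: token.json holds the new access token
  fi
  sleep $((2 ** attempt))      # back off 2s, 4s, 8s, 16s, 32s before retrying
done
```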

Last Update: About 1 month ago

RESOLVED: Incident 20010 - We are experiencing an issue with multiple GCP products.

With all services restored to normal operation, Google's engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors and downstream impact to GCP and G Suite from this incident. The root cause of this disruption is well understood and safeguards have been put in place to prevent any possible recurrence of the issue. At this time we have determined that the following products were affected:

### Cloud Build API

Google Front End (GFE) prevented API requests from reaching the service. CreateBuild requests that did make it to the servers were more likely to fail due to user code calling other GCP services.

### Cloud Firestore and Datastore

Cloud Firestore saw 80% of listen streams become disconnected and a 50% drop in query/get requests across all regions except nam5 and eur3.

### Cloud Key Management Service (KMS)

Google Front End (GFE) prevented API requests from reaching the service.

### Cloud Logging

Unavailable for viewing in the Cloud Console, but data ingestion was not impacted.

### Cloud Monitoring

Elevated error rates of uptime checks and requests to the Cloud Monitoring API.

### Cloud Compute Engine

Requests to compute.googleapis.com would have seen an increase in 502 errors. Existing instances were not impacted.

### Cloud Spanner

Cloud Spanner experienced elevated latency spikes which may have resulted in connection timeouts.

### Game Servers

Minor impact to cluster availability due to dependencies on other services.

### Google Cloud Console

Multiple pages and some core functionality of the Cloud Console were impacted.

### Google Cloud SQL

Minor connectivity problems. Instance log reporting to Stackdriver was delayed. There was a ~50% drop in SqlInstancesService.List API requests.

### Google Kubernetes Engine

Minor impact to cluster availability due to dependencies on other services.

### Firebase Crashlytics

From 18:00 - 18:24, Crashlytics imports from AWS were impacted. This created an import backlog which was quickly worked through 10 minutes after the incident ended.

We are conducting an internal investigation of this issue and will make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a detailed report of this incident, including both GCP and G Suite impact, once we have completed our internal investigation. This detailed report will also contain information regarding SLA credits.

Last Update: A few months ago

UPDATE: Incident 20006 - Cloud Shell Connectivity Issues in asia-southeast1 and an issue with the Pricing UI not loading for some billing accounts with a custom price model has been resolved.

The issue with Cloud Shell has been resolved for all affected users as of Thursday, 2020-09-24 18:30 US/Pacific. We thank you for your patience while we worked on resolving the issue.

Last Update: A few months ago

RESOLVED: Incident 20010 - We are experiencing an issue with multiple GCP products.

We believe the issue with multiple GCP products has been resolved for most traffic as of 2020-09-24 18:33 US/Pacific. Affected products include: Cloud Run, Firestore Watch, Cloud SQL, Cloud Spanner, GKE, Cloud Logging, Cloud Monitoring, Cloud Console, Cloud KMS, Game Server. We thank you for your patience while we worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20010 - We are experiencing an issue with multiple GCP products.

Description: We are experiencing an issue with multiple GCP products, beginning at Thursday, 2020-09-24 17:58 US/Pacific. Symptoms: Increased error rate. Affected products include: Cloud Run, Firestore Watch, Cloud SQL, Cloud Spanner, GKE, Cloud Logging, Cloud Monitoring, Cloud Console, Cloud KMS, Game Server. Mitigation work is currently underway by our engineering team. We will provide an update by Thursday, 2020-09-24 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 20006 - Cloud Shell Connectivity Issues in asia-southeast1 and an issue with the Pricing UI not loading for some billing accounts with a custom price model has been resolved.

Description: We are experiencing an issue with Cloud Shell in asia-southeast1. It has been partially mitigated, but our engineering team has determined that further investigation is required to fully resolve the issue. The investigation will take a few hours. Users might still encounter connectivity issues when starting new Cloud Shell sessions. We will provide an update by Thursday, 2020-09-24 23:30 US/Pacific with current details. Diagnosis: Error message "Cloud Shell is temporarily not available please try after some time", or connectivity errors when attempting to create a new Cloud Shell instance. Workaround: As a workaround, you can use the gcloud SDK on your local command line.
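A minimal sketch of that workaround, assuming the Cloud SDK is installed on your local machine (PROJECT_ID and the listed command are placeholders for whatever you would normally run in Cloud Shell):

```
# Hypothetical sketch of the workaround: run commands from a local terminal with the
# gcloud SDK instead of Cloud Shell. PROJECT_ID is a placeholder.
gcloud auth login
gcloud config set project PROJECT_ID
gcloud compute instances list   # example of a command you would otherwise run in Cloud Shell
```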

Last Update: A few months ago

UPDATE: Incident 20010 - We are experiencing an issue with multiple GCP products.

Description: We are experiencing an issue with multiple GCP products, beginning at Thursday, 2020-09-24 17:58 US/Pacific. Symptoms: Increased error rate. Affected products include: Cloud Run, Firestore Watch, Cloud SQL, Cloud Spanner, GKE, Cloud Logging, Cloud Monitoring, Cloud Console, Cloud KMS, Game Server. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2020-09-24 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 20010 - We are experiencing an issue with multiple GCP products.

Description: We are experiencing an issue with multiple GCP products, beginning at Thursday, 2020-09-24 17:58 US/Pacific. Symptoms: Increased error rate. Affected products include: Cloud Run, Firestore Watch, Cloud SQL, Cloud Spanner, GKE, Cloud Logging, Cloud Monitoring, Cloud Console. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2020-09-24 19:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 20010 - We are experiencing an issue with multiple GCP products.

Description: We are experiencing an issue with multiple GCP products. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2020-09-24 19:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 20006 - Cloud Shell Connectivity Issues in asia-southeast1 and an issue with the Pricing UI not loading for some billing accounts with a custom price model has been resolved.

Description: We are experiencing an issue with Cloud Shell in asia-southeast1. It has been partially mitigated, but our engineering team has determined that further investigation is required to fully resolve the issue. The investigation will take a few hours. Users might still encounter connectivity issues when starting new Cloud Shell sessions. We will provide an update by Thursday, 2020-09-24 18:30 US/Pacific with current details. Diagnosis: Error message "Cloud Shell is temporarily not available please try after some time", or connectivity errors when attempting to create a new Cloud Shell instance. Workaround: As a workaround, you can use the gcloud SDK on your local command line.

Last Update: A few months ago

UPDATE: Incident 20006 - Cloud Shell Connectivity Issues in asia-southeast1 and an issue with the Pricing UI not loading for some billing accounts with a custom price model.

Description: Cloud Shell Issue Description: We are experiencing an issue with Cloud Shell in asia-southeast1. It has been partially mitigated, but our engineering team has determined that further investigation is required to fully resolve the issue. Users might still encounter connectivity issues when starting new Cloud Shell sessions. Cloud Console Billing UI Issue Description: Cloud Console is experiencing an issue with the Pricing UI page not loading for some billing accounts associated with a custom price model globally. We are currently rolling back the code change that is responsible for this issue. We expect to complete this in the next hour. We will provide an update by Thursday, 2020-09-24 15:00 US/Pacific with current details. Diagnosis: Cloud Shell Diagnosis: Error message "Cloud Shell is temporarily not available please try after some time", or connectivity errors when attempting to create a new Cloud Shell instance. Cloud Console Billing UI Diagnosis: Affected customers' Billing UI page may not load properly. Workaround: Cloud Shell Workaround: As a workaround, you can use the gcloud SDK on your local command line. Cloud Billing UI Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 20009 - We're experiencing issues with Google Cloud infrastructure in asia-east2

The issue with multiple GCP products has been resolved for all affected users as of Thursday, 2020-09-17 19:29 US/Pacific. We thank you for your patience while we worked on resolving the issue.

Last Update: A few months ago

RESOLVED: Incident 20008 - We are experiencing issues across multiple GCP products

A detailed incident report has been posted on the G Suite Status Dashboard [1]. [1] https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a45de3b26d6c5872f4cfe8e3424d7a82

Last Update: A few months ago

UPDATE: Incident 20008 - We are experiencing issues across multiple GCP products

The issue with App Engine, Cloud Storage and Cloud Logging has been resolved for all affected users as of Thursday, 2020-08-20 04:12 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 20008 - We are experiencing issues across multiple GCP products

Description: The issue with App Engine, Cloud Storage and Cloud Logging is partially resolved. Our engineers continue to work on restoring services to all users. We will provide an update by Thursday, 2020-08-20 04:45 US/Pacific with current details. Diagnosis: Deployment errors with App Engine, high latencies while accessing GCS buckets, and missing log entries in Cloud Logging. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20008 - We are experiencing issues across multiple GCP products

Description: We believe the issue with App Engine, Cloud Storage and Cloud Logging is partially resolved. The affected services should now be fully operational for the vast majority of users, and we expect a full recovery in the near future. We will provide an update by Thursday, 2020-08-20 04:30 US/Pacific with current details. Diagnosis: Deployment errors with App Engine, high latencies while accessing GCS buckets, and missing log entries in Cloud Logging. Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 20009 - We are experiencing an issue with Google Compute Engine instances running RHEL and CentOS 7 and 8. Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart.

The issue with Google Compute Engine instances running RHEL and CentOS 7 and 8 is being actively investigated by Red Hat. More details on this issue are available in the following article and bugs: - https://access.redhat.com/solutions/5272311 - https://bugzilla.redhat.com/show_bug.cgi?id=1861977 (RHEL 8) - https://bugzilla.redhat.com/show_bug.cgi?id=1862045 (RHEL 7) - https://issuetracker.google.com/162523000 Please follow our public issue tracker posting (https://issuetracker.google.com/162523000) for updates on this issue going forward. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 20009 - We are experiencing an issue with Google Compute Engine instances running RHEL and CentOS 7 and 8. Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart.

Description: We are experiencing an issue with Google Compute Engine instances running RHEL and CentOS 7 and 8. More details on this issue are available in the following article and bugs: https://access.redhat.com/solutions/5272311 https://bugzilla.redhat.com/show_bug.cgi?id=1861977 (RHEL 8) https://bugzilla.redhat.com/show_bug.cgi?id=1862045 (RHEL 7) Symptoms: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart with error messages referring to a combination of: "X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID", "FXSAVE_STATE" or "Find image based on IP". This issue affects instances with specific versions of the shim package installed. To find the currently installed shim version, use the following command: `rpm -q shim-x64` Affected shim versions: CentOS 7: shim-x64-15-7.el7_9.x86_64 CentOS 8: shim-x64-15-13.el8.x86_64 RHEL 7: shim-x64-15-7.el7_8.x86_64 RHEL 8: shim-x64-15-14.el8_2.x86_64 Workaround: Do not update or reboot instances running RHEL or CentOS 7 and 8. If you are on an affected shim version, run `yum downgrade shim\* grub2\* mokutil` to downgrade to the correct version. This command may not work on CentOS 8. If you have already rebooted, you will need to attach the disk to another instance, chroot into the disk, then run the yum downgrade command. We will provide an update by Thursday, 2020-07-30 14:00 US/Pacific with current details. Diagnosis: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart with error messages referring to a combination of: "X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID", "FXSAVE_STATE" or "Find image based on IP". This issue affects instances with specific versions of the shim package installed. To find the currently installed shim version, use the following command: `rpm -q shim-x64` Affected shim versions: CentOS 7: shim-x64-15-7.el7_9.x86_64 CentOS 8: shim-x64-15-13.el8.x86_64 RHEL 7: shim-x64-15-7.el7_8.x86_64 RHEL 8: shim-x64-15-14.el8_2.x86_64 Workaround: Do not update or reboot instances running RHEL or CentOS 7 and 8. If you are on an affected shim version, run `yum downgrade shim\* grub2\* mokutil` to downgrade to the correct version. This command may not work on CentOS 8. If you have already rebooted, you will need to attach the disk to another instance, chroot into the disk, then run the yum downgrade command.
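For instances that have already rebooted and fail to boot, the recovery path described above (attach the boot disk to a healthy instance, chroot, downgrade) could look roughly like the sketch below; the instance name, disk name, zone, and device path are placeholders and will differ per project:

```
# Hypothetical recovery sketch based on the workaround above.
# rescue-instance, broken-boot-disk, the zone, and /dev/sdb1 are placeholders.

# 1. Attach the affected boot disk to a healthy rescue instance.
gcloud compute instances attach-disk rescue-instance \
  --disk=broken-boot-disk --zone=us-central1-a

# 2. On the rescue instance, mount the root partition (confirm the device with lsblk)
#    and chroot into it.
sudo mount /dev/sdb1 /mnt
for d in dev proc sys; do sudo mount --bind /$d /mnt/$d; done
sudo chroot /mnt /bin/bash

# 3. Inside the chroot, downgrade the affected packages as described above.
yum downgrade shim\* grub2\* mokutil

# 4. Exit the chroot, unmount, detach the disk, and start the original instance again.
```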

Last Update: A few months ago

UPDATE: Incident 20009 - We are experiencing an issue with Google Compute Engine RHEL and CentOS 7 and 8 instances. Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart.

Description: We are experiencing an issue with Google Compute Engine RHEL and CentOS 7 and 8 instances. Symptoms: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart with error messages referring to a combination of: "X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID", "FXSAVE_STATE" or "Find image based on IP". This issue affects instances with specific versions of the shim package installed. To find the currently installed shim version, use the following command: `rpm -q shim-x64` Affected shim versions: CentOS 7: shim-x64-15-7.el7_9.x86_64 CentOS 8: shim-x64-15-13.el8.x86_64 RHEL 7: shim-x64-15-7.el7_8.x86_64 RHEL 8: shim-x64-15-14.el8_2.x86_64 Workaround: Do not update or reboot instances running RHEL or CentOS 7 and 8. If you are on an affected shim version, run `yum downgrade shim\* grub2\* mokutil` to downgrade to the correct version. This command may not work on CentOS 8. If you have already rebooted, you will need to attach the disk to another instance, chroot into the disk, then run the yum downgrade command. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2020-07-30 11:00 US/Pacific with current details. Diagnosis: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart with error messages referring to a combination of: "X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID", "FXSAVE_STATE" or "Find image based on IP". This issue affects instances with specific versions of the shim package installed. To find the currently installed shim version, use the following command: `rpm -q shim-x64` Affected shim versions: CentOS 7: shim-x64-15-7.el7_9.x86_64 CentOS 8: shim-x64-15-13.el8.x86_64 RHEL 7: shim-x64-15-7.el7_8.x86_64 RHEL 8: shim-x64-15-14.el8_2.x86_64 Workaround: Do not update or reboot instances running RHEL or CentOS 7 and 8. If you are on an affected shim version, run `yum downgrade shim\* grub2\* mokutil` to downgrade to the correct version. This command may not work on CentOS 8. If you have already rebooted, you will need to attach the disk to another instance, chroot into the disk, then run the yum downgrade command.

Last Update: A few months ago

UPDATE: Incident 20009 - We are experiencing an issue with Google Compute Engine RHEL and CentOS 7 and 8 instances. Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart.

Description: We are experiencing an issue with Google Compute Engine RHEL and CentOS 7 and 8 instances. Symptoms: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart. Customers should not update or reboot instances running RHEL or CentOS 7 and 8. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2020-07-30 10:30 US/Pacific with current details. Diagnosis: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart with error messages referring to a combination of: "X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID", "FXSAVE_STATE" or "Find image based on IP". This issue affects instances with specific versions of the shim package installed. To find the currently installed shim version, use the following command: `rpm -q shim-x64` Affected shim versions: CentOS 7: shim-x64-15-7.el7_9.x86_64 CentOS 8: shim-x64-15-13.el8.x86_64 RHEL 7: shim-x64-15-7.el7_8.x86_64 RHEL 8: shim-x64-15-14.el8_2.x86_64 Workaround: Do not update or reboot instances running RHEL or CentOS 7 and 8. If you are on an affected shim version, run `yum downgrade shim\* grub2\* mokutil` to downgrade to the correct version. This command may not work on CentOS 8. If you have already rebooted, you will need to attach the disk to another instance, chroot into the disk, then run the yum downgrade command.

Last Update: A few months ago

UPDATE: Incident 20009 - We are experiencing an issue with Google Compute Engine RHEL and CentOS 7 and 8 instances. Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart.

Description: We are experiencing an issue with Google Compute Engine RHEL and CentOS 7 and 8 instances. Symptoms: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart. Customers should not update or reboot instances running RHEL or CentOS 7 and 8. Our engineering team continues to investigate the issue. We are also preparing instructions for recovery from this issue for affected instances. We will provide an update by Thursday, 2020-07-30 10:15 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart. Workaround: Do not update or reboot instances running RHEL or CentOS 7 and 8.

Last Update: A few months ago

UPDATE: Incident 20009 - We are experiencing an issue with Google Compute Engine RHEL and CentOS 7 and 8 instances. Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart.

Description: We are experiencing an issue with Google Compute Engine RHEL and CentOS 7 and 8 instances. Symptoms: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart. Customers should not update or reboot instances running RHEL or CentOS 7 and 8. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2020-07-30 09:30 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Instances running RHEL and CentOS 7 and 8 that run yum update may fail to boot after restart. Workaround: Do not update or reboot instances running RHEL or CentOS 7 and 8.

Last Update: A few months ago

RESOLVED: Incident 20008 - Issue with Compute Engine API

The issue with Compute Engine APIs is believed to be affecting a very small fraction of requests from customer projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here. We thank you for your patience while we're working on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20008 - Issue with Compute Engine API

Description: Mitigation work is currently underway by our engineering team. The mitigation is expected to complete by Tuesday, 2020-07-28 05:00 US/Pacific. We will provide more information by Tuesday, 2020-07-28 05:30 US/Pacific. Diagnosis: Calls to setCommonInstanceMetadata, and some other endpoints, fail or take a long time to execute. Workaround: Customers should retry any failed operations.

Last Update: A few months ago

UPDATE: Incident 20008 - Issue with Compute Engine API

Description: We've received a report of an issue with Google Compute Engine as of Tuesday, 2020-07-28 00:57 US/Pacific. About 20% of calls to the API endpoint setCommonInstanceMetadata are failing or taking a long time. This endpoint is commonly used by GKE, which is also affected by this outage. Some other API calls may also be affected. Our engineering team continues to investigate the issue. We will provide more information by Tuesday, 2020-07-28 04:30 US/Pacific. We apologize to all who are affected by the disruption. Diagnosis: Calls to setCommonInstanceMetadata, and some other endpoints, fail or take a long time to execute. Workaround: Customers should retry any failed operations.
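A minimal sketch of the retry workaround, using a project-metadata update (which goes through setCommonInstanceMetadata) as the example call; the metadata key and value are placeholders:

```
# Hypothetical retry sketch for the workaround above. The metadata key/value are
# placeholders; any call that hits setCommonInstanceMetadata can be wrapped this way.
for attempt in 1 2 3; do
  if gcloud compute project-info add-metadata \
       --metadata=example-key=example-value; then
    break
  fi
  echo "Attempt ${attempt} failed; retrying in 30s..." >&2
  sleep 30
done
```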

Last Update: A few months ago

UPDATE: Incident 20008 - Issue with Compute Engine API

Description: We've received a report of an issue with Google Compute Engine as of Tuesday, 2020-07-28 00:57 US/Pacific. About 20% of calls to the API endpoint setCommonInstanceMetadata are failing or taking a long time. This is commonly used by GKE which is also affected by this outage. Some other API calls may also be affected. We will provide more information by Tuesday, 2020-07-28 03:30 US/Pacific. Diagnosis: Calls to setCommonInstanceMetadata, and some other endpoints, fail or take a long time to execute. Workaround: Customers should retry any failed operations.

Last Update: A few months ago

UPDATE: Incident 20005 - Google BigQuery is currently experiencing an elevated rate of job failures.

The issue with Google BigQuery experiencing an elevated rate of job failures is believed to be affecting a very small number of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. We thank you for your patience while we're working on resolving the issue.

Last Update: A few months ago

RESOLVED: Incident 20007 - Cloud Networking L7 Load Balancers May Be Serving Stale Configurations

The issue with Cloud Networking has been resolved for all affected users as of Thursday, 2020-07-23 13:42 US/Pacific. We thank you for your patience while we worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20007 - Cloud Networking L7 Load Balancers May Be Serving Stale Configurations

Description: We believe the issue with Cloud Networking L7 Load Balancers (Internal HTTP(S) Load Balancers), where changes made to these load balancers are not being applied beginning at Thursday, 2020-07-23 11:45 US/Pacific, is partially resolved. We do not have an ETA for full resolution at this point. For regular status updates, please follow: https://status.cloud.google.com/incident/zall/20006 Diagnosis: Changes made to Cloud Networking L7 Load Balancers (Internal HTTP(S) Load Balancers) may not be deployed. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20007 - Cloud Networking L7 Load Balancers May Be Serving Stale Configurations

Description: We believe the issue with Cloud Networking L7 Load Balancers (Internal HTTP(S) Load Balancers), where changes made to these load balancers are not being applied beginning at Thursday, 2020-07-23 11:45 US/Pacific, is partially resolved. We do not have an ETA for full resolution at this point. For regular status updates, please follow: https://status.cloud.google.com/incident/zall/20006 where we will provide the next update by Thursday, 2020-07-23 14:00 US/Pacific. Diagnosis: Changes made to Cloud Networking L7 Load Balancers (Internal HTTP(S) Load Balancers) may not be deployed. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20007 - We are experiencing an issue affecting multiple Cloud services.

Description: We are experiencing an issue with Google Compute Engine. Our engineering team continues to investigate the issue. For regular status updates, please follow: https://status.cloud.google.com/incident/zall/20006 where we will provide the next update by Thursday, 2020-07-23 14:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 20006 - Issue creating and deleting Compute Engine instances

Initially, we believed that this incident affected all instance create and delete operations, which was inaccurate. Approximately 2% of requests globally were failing and the issue was transient, allowing retried operations to succeed. All existing instances continued to operate as expected, with only create or delete operations being affected. Upon evaluation of the incident's scope and severity, this incident's severity has been adjusted from Outage to Disruption.

Last Update: A few months ago

UPDATE: Incident 20006 - Issue creating and deleting Compute Engine instances

The issue with creating and deleting Google Compute Engine instances has been resolved for all affected projects as of Wednesday, 2020-07-15 08:56 US/Pacific. We thank you for your patience while we worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20006 - Issue creating and deleting Compute Engine instances

Description: We are experiencing an issue creating and deleting Google Compute Engine instances, starting at Wednesday, 2020-07-15 05:07 US/Pacific. This is also affecting other products which rely on Compute Engine including Cloud Build, which is in turn used by services including App Engine and Cloud Functions. Other products affected are Cloud Data Fusion, Cloud Dataproc, Cloud Dataflow, and Google Kubernetes Engine. In addition, newly created instances may experience DNS propagation delays. Current data indicates that approximately 2% of requests globally are affected by this issue. Mitigation work has been completed by our engineering team and we are monitoring recovery. Full resolution is expected to complete by Wednesday, 2020-07-15 09:05 US/Pacific. We will provide an update by Wednesday, 2020-07-15 09:20 US/Pacific with current details. Diagnosis: Attempts to create a new instance or to delete an instance fail with an error. Workaround: Retrying the instance operation may succeed.
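Because the failures are transient (roughly 2% of requests), the retry workaround can be as simple as the following sketch; the instance name, zone, and machine type are placeholders:

```
# Hypothetical sketch of the retry workaround; name, zone, and machine type are placeholders.
for attempt in 1 2 3; do
  if gcloud compute instances create example-instance \
       --zone=us-east1-b --machine-type=e2-medium; then
    break
  fi
  echo "Create attempt ${attempt} failed; retrying..." >&2
  sleep 15
done
```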

Last Update: A few months ago

UPDATE: Incident 20006 - Issue creating and deleting Compute Engine instances

Description: We are experiencing an issue creating and deleting Google Compute Engine instances, starting at Wednesday, 2020-07-15 05:07 US/Pacific. This is also affecting other products which rely on Compute Engine including Cloud Build, which is in turn used by services including App Engine and Cloud Functions. Other products affected are Cloud Data Fusion, Cloud Dataproc, Cloud Dataflow, and Google Kubernetes Engine. Current data indicates that approximately 2% of requests globally are affected by this issue. Mitigation work has been completed by our engineering team and we are monitoring recovery. Full resolution is expected to complete by Wednesday, 2020-07-15 09:05 US/Pacific. We will provide an update by Wednesday, 2020-07-15 09:15 US/Pacific with current details. Diagnosis: Attempts to create a new instance or to delete an instance fail with an error. Workaround: Retrying the instance operation may succeed.

Last Update: A few months ago

UPDATE: Incident 20006 - Issue creating and deleting Compute Engine instances

Description: We are experiencing an issue creating and deleting Google Compute Engine instances, starting at Wednesday, 2020-07-15 05:07 US/Pacific. This is also affecting other products which rely on Compute Engine including Cloud Build, which is in turn used by services including App Engine and Cloud Functions. Other products affected are Cloud Data Fusion, Cloud Dataproc, and Google Kubernetes Engine. Mitigation work is still underway by our engineering team. Current data indicates that approximately 2% of requests globally are affected by this issue. We do not have an ETA for mitigation at this point. We will provide an update by Wednesday, 2020-07-15 09:00 US/Pacific with current details. Diagnosis: Attempts to create a new instance or to delete an instance fail with an error. Workaround: Retrying the instance operation may succeed.

Last Update: A few months ago

UPDATE: Incident 20006 - Issue creating and deleting Compute Engine instances

Description: We are experiencing an issue creating and deleting Google Compute Engine instances, starting at Wednesday, 2020-07-15 05:07 US/Pacific. This is also affecting other products which rely on Compute Engine including Cloud Build, which is in turn used by services including App Engine and Cloud Functions. Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide an update by Wednesday, 2020-07-15 08:20 US/Pacific with current details. Diagnosis: Attempts to create a new instance or to delete an instance fail with an error. Workaround: Retrying the instance operation may succeed.

Last Update: A few months ago

RESOLVED: Incident 20005 - We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d

# ISSUE SUMMARY On 2020-06-29 07:47 US/Pacific, Google Cloud experienced unavailability for some services hosted from our us-east1-c and us-east1-d zones. The unavailability primarily impacted us-east1-c but did have a short impact on us-east1-d. For approximately 1 hour and 30 minutes, 22.5% of Google Compute Engine (GCE) instances in us-east1-c, were unavailable. For approximately 7 minutes, 1.8% of GCE instances in us-east1-d, were unavailable. In addition, 0.0267% Persistent Disk (PD) devices hosted in us-east1-c were unavailable for up to 28 hours and the us-east1 region as a whole experienced 5% packet loss between 07:55 and 08:05 for Public IP and Network LB Traffic. We sincerely apologize and are taking steps detailed below to ensure this doesn’t happen again. # BACKGROUND Google Cloud Platform is built on various layers of abstraction in order to provide scale and distinct failure domains. One of those abstractions is Zones and clusters [1]. Zonal services such as Google Compute Engine (GCE) assign projects to one cluster to handle the majority of the compute needs when a project requests resources in a cloud zone. If a cluster backing a zone becomes degraded, services in that zone have resilience built in to handle some level of machine failures. Regional services, depending on the architecture, may see a short degradation before automatically recovering, or see no impact at all. Regional services with tasks in a degraded cluster are generally migrated to other functional clusters in the same region to reduce overall impact. In the Detailed Impact section below, the impact is only to projects and services mapped to the affected clusters, unless otherwise noted. Datacenter power delivery is architected in three tiers. The primary tier of power delivery is utility power, with multiple grid feeds and robust substations. Backing up utility power are generators, each generator powers a different part of each cluster, and additional backup generators and fuel are available if required in the event that a part of this backup power system fails.The fuel supply system for the generators is broken into two parts, storage tanks which store fuel in bulk, and a system which pumps that fuel to generators for consumption. The final tier of power delivery are batteries which provide power conditioning and a short run times when power from the other two tiers is interrupted. [1] https://cloud.google.com/compute/docs/regions-zones#zones_and_clusters # ROOT CAUSE During planned substation maintenance by the site’s electrical utility provider, two clusters supporting the us-east1 region were transferred to backup generator power for the duration of the maintenance, which was scheduled as a four hour window. Three hours into the maintenance window, 17% of the operating generators began to run out of fuel due to fuel delivery system failures even though there was adequate fuel available in the storage tanks. Multiple redundancies built into the backup power system were automatically activated as primary generators began to run out of fuel, however, as more primary generators ran out of fuel the part of the cluster they were supporting shutdown. # REMEDIATION AND PREVENTION Google engineers were alerted to the power issue impacting us-east1-c and us-east1-d at 2020-06-29 07:50 US/Pacific and immediately started an investigation. Impact to us-east1-d was resolved automatically by cluster level services. Other than some Persistent Disk devices, service impact in us-east1-d ended by 08:24. 
Onsite datacenter operators identified a fuel supply issue as the root cause of the power loss and quickly established a mitigation plan. Once a workaround for the fuel supply issue was deployed, the operators began restoring the affected generators to active service at 08:49. Almost at the same time, at 08:55, the planned substation maintenance had concluded and utility power returned to service. Between the restored utility power and recovered generators, power was fully restored to both clusters by 08:59. In a datacenter recovery scenario there is a sequential process that must be followed for downstream service recovery to succeed. By 2020-06-29 09:34, most GCE instances had recovered as the necessary upstream services were restored. All services had recovered by 10:50 except for a small percentage of Persistent Disk impacted instances. A more detailed timeline of individual service impact is included below in the “DETAILED DESCRIPTION OF IMPACT” section below. In the days following this incident the same system was put under load. There was an unplanned utility power outage for the same location on 2020-06-30 (the next day) due to a lightning strike near a substation transformer. The system was again tested on 2020-07-02 when a final maintenance operation was conducted on the site substation. We are committed to preventing this situation from happening again and are implementing the following actions: Resolving the issues identified with the fuel supply system which led to this incident. An audit of sites which have a similar fuel system has been conducted and onsite personnel have been provided updated procedures and training for dealing with this situation should it occur again. # DETAILED DESCRIPTION OF IMPACT On 2020-06-29 from 07:47 to 10:50 US/Pacific, Google Cloud experienced unavailability for some services hosted from cloud zones us-east1-c and us-east1-d as described in detail below: ## Google Compute Engine 22.5% of Google Compute Engine (GCE) instances in the us-east1-c zone were unavailable starting 2020-06-29 07:57 US/Pacific for 1 hour and 30 minutes. Up to 1.8% of instances in the us-east1-d zone were unavailable starting 2020-06-29 08:17 for 7 minutes. A small percentage of the instances in us-east1-c continued to be unavailable for up to 28 hours due to manual recovery of PD devices. ## Persistent Disk Persistent Disk (PD) experienced 23% of PD devices becoming degraded in us-east1-c starting at 2020-06-29 07:53 to 2020-06-29 09:28 US/Pacific for a duration of 1 hour and 35 minutes. 0.0267% of PD devices were unable to recover automatically and required manual recovery which completed at 2020-06-30 09:54 resulting in 26 hours of additional unavailability. The delay in recovery was primarily due to a configuration setting in PD clients that set metadata initialization retry attempts to a maximum value (with exponential backoff). Upon power loss, 0.0267% of PD devices in us-east1-c reached this limit and were unable to recover automatically as they had exhausted their retry attempts before power had been fully restored. To prevent this scenario from recurring, we are significantly increasing the number of retry attempts that will be performed by PD metadata initialization to ensure PD can recover from extended periods of power loss. A secondary factor resulting in the delay of some customer VMs was due to filesystem errors triggered by the PD unavailability. 
PD itself maintains defense-in-depth through a variety of end-to-end integrity mechanisms which prevented any PD corruption during this incident. However, some filesystems are not designed to be robust against cases where some parts of the block device presented by PD fail to initialize while others are still usable. This issue was technically external to PD, and only repairable by customers using filesystem repair utilities. The PD product team assisted affected customers in their manual recovery efforts during the extended incident window. Additionally, up to 0.429% of PD devices in us-east1-d were unhealthy for approximately 1 hour from 2020-06-29 08:15 to 2020-06-29 09:10. All PD devices in us-east1-d recovered automatically once power had been restored.

## Cloud Networking

The us-east1 region as a whole experienced 5% packet loss between 07:55 and 08:05 for Public IP and Network LB traffic as Maglevs [1] servicing the region in the impacted cluster became unavailable. Cloud VPN saw 7% of Classic VPN tunnels in us-east1 reset between 07:57 and 08:07. As tunnels disconnected, they were rescheduled automatically in other clusters in the region. HA VPN tunnels were not impacted. Cloud Router saw 13% of BGP sessions in us-east1 flap between 07:57 and 08:07; Cloud Router tasks in the impacted cluster were rescheduled automatically in other clusters in the region. Cloud HTTP(S) Load Balancing saw a 166% spike in baseline HTTP 500 errors between 08:00 and 08:10.

Starting at 09:38 the network control plane in the impacted cluster began to initialize but ran into an issue that required manual intervention to resolve. Between 09:38 and 10:14, instances continued to be inaccessible until the control plane initialized, and updates were also not being propagated down to the cluster control plane. As a result, resources such as Internal Load Balancers and instances that were deleted during this time period and then recreated at any time between 09:38 and 12:47 would have seen availability issues. This was resolved with additional intervention from the SRE team. To reduce time to recover for similar classes of issues, we are increasing the robustness of the control plane to better handle such exceptional failure conditions. We are also implementing additional monitoring and alerting to more quickly detect update propagation issues under exceptional failure conditions.

[1] https://cloud.google.com/blog/products/gcp/google-shares-software-network-load-balancer-design-powering-gcp-networking

## Cloud SQL

Google Cloud SQL experienced a 7% drop in network connections that resulted in database unavailability from 2020-06-29 08:00 to 10:50 US/Pacific, affecting <1.5% of instances in us-east1 due to power loss. The power loss degraded Cloud SQL’s dependencies (Persistent Disk and GCE) in us-east1-c and us-east1-d for a period of 2 hours and 50 minutes.

## Filestore + Memorystore

Cloud Filestore and Memorystore instances in us-east1-c were unavailable due to power loss from 2020-06-29 07:56 to 10:24 US/Pacific, a duration of 2 hours and 28 minutes. 7.4% of Redis non-HA (zonal) instances in us-east1 were unavailable and 5.9% of Redis HA (regional) standard tier instances in us-east1 failed over. All Cloud Filestore and Cloud Memorystore instances recovered by 10:24.

## Google Kubernetes Engine

Google Kubernetes Engine (GKE) customers were unable to create or delete clusters in either the us-east1-c zone or the us-east1 region from 2020-06-29 8:00 to 2020-06-29 10:30 US/Pacific.
Additionally, customers could not create node pools in the us-east1-c zone during that same period. A maximum of 29% of zonal clusters in us-east1-c and 2% of zonal clusters in us-east1-d could not be administered; all but a small subset of clusters recovered by 11:00, and the remaining clusters recovered by 14:30. Existing regional clusters in us-east1 were not affected, except for customers who had workloads on nodes in us-east1-c that they were unable to migrate.

## Google Cloud Storage

Google Cloud Storage (GCS) in us-east1 experienced 10 minutes of impact during which availability fell to 99.7%, with a 3-minute burst down to 98.6% availability. GCS multi-region experienced a total of 40 minutes of impact, down to 99.55% availability.

# SLA CREDITS

If you believe your paid application experienced an SLA violation as a result of this incident, please submit the SLA credit request: https://support.google.com/cloud/contact/cloud_platform_sla

A full list of all Google Cloud Platform Service Level Agreements can be found at https://cloud.google.com/terms/sla/
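The Persistent Disk delay described above came from a client-side cap on metadata-initialization retry attempts. The Python sketch below is purely illustrative (the function, exception, and parameter names are hypothetical, not PD’s actual implementation); it shows how a bounded exponential-backoff loop only covers a fixed time window, so an outage longer than that window exhausts the attempts and leaves recovery to manual intervention.

```python
import random
import time

class RetriesExhausted(Exception):
    """Raised once every allowed attempt has been used."""

def retry_with_backoff(operation, max_attempts=8, base_delay=1.0, max_delay=300.0):
    """Run `operation`, retrying failures with capped exponential backoff.

    With a fixed `max_attempts`, the retries only span a bounded window
    (roughly the sum of the sleep intervals). An outage that outlasts that
    window leaves the caller permanently failed, which is why raising the
    attempt limit lets a client ride out longer periods of unavailability.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise RetriesExhausted(f"gave up after {max_attempts} attempts")
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, 0.1 * delay))  # small jitter
```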

Last Update: A few months ago

RESOLVED: Incident 20006 - We are experiencing issues with creation of new Google Cloud Load Balancers and updates to existing ones.

The issue with Google Cloud Load Balancer has been resolved for all affected users as of Thursday, 2020-07-02 09:36 US/Pacific. We thank you for your patience while we worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20006 - We are experiencing issues with creation of new Google Cloud Load Balancers and updates to existing ones.

Description: We are experiencing an issue with Google Cloud Load Balancer as of Thursday, 2020-07-02 08:39 US/Pacific. Customers will not be able to create new HTTP services or update certain aspects of existing services (timeouts, health checks, backend port). They will still be able to drain, add groups, and add/remove VMs. We will provide more information by Thursday, 2020-07-02 10:15 US/Pacific. Diagnosis: Customers will not be able to create new HTTP services or update certain aspects of existing services (timeouts, health checks, backend port). Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20005 - We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d

The issue with Cloud Networking and Persistent Disk has been resolved for the majority of affected projects as of Monday, 2020-06-29 10:20 US/Pacific, and we expect full mitigation to occur for remaining projects within the hour. If you have questions or feel that you may be impacted, please open a case with the Support Team and we will work with you until the issue is resolved. We will publish analysis of this incident once we have completed our internal investigation. We thank you for your patience while we're working on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20005 - We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services. Services in us-east1-d have been fully restored. Services in us-east1-c are fully restored except for Persistent Disk which is partially restored. No ETA for full recovery of Persistent Disk yet. Impact is due to power failure. A more detailed analysis will be available at a later time. Our engineering team is working on recovery of impacted services. We will provide an update by Monday, 2020-06-29 13:00 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Some services in us-east1-c and us-east1-d are failing; customers impacted by this incident would likely experience a total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones. Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.

Last Update: A few months ago

UPDATE: Incident 20005 - We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services. Services in us-east1-d have been fully restored. Services in us-east1-c are fully restored except for Persistent Disk which is partially restored. No ETA for full recovery of Persistent Disk yet. Impact is due to power failure. A more detailed analysis will be available at a later time. Our engineering team is working on recovery of impacted services. We will provide an update by Monday, 2020-06-29 12:20 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Some services in us-east1-c and us-east1-d are failing; customers impacted by this incident would likely experience a total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones. Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.

Last Update: A few months ago

RESOLVED: Incident 20005 - We are investigating an issue with elevated error rates across multiple Google Cloud Platform Services

# ISSUE SUMMARY

(All times in US/Pacific daylight time)

On Wednesday 08 April, 2020 beginning at 06:48 US/Pacific, Google Cloud Identity and Access Management (IAM) experienced significantly elevated error rates for a duration of 54 minutes. IAM is used by several Google services to manage user information, and the elevated IAM error rates resulted in degraded performance that extended beyond 54 minutes for the following Cloud services:
- Google BigQuery’s streaming service experienced degraded performance for 116 minutes;
- Cloud IAM’s external API returned elevated errors for 102 minutes;
- 3% of Cloud SQL HA instances were degraded for durations ranging from 54 to 192 minutes.

To our Cloud customers whose businesses were impacted during this disruption, we sincerely apologize – we have conducted a thorough internal investigation and are taking immediate action to improve the resiliency, performance, and availability of our service.

# ROOT CAUSE

Many Cloud services depend on a distributed Access Control List (ACL) in Cloud Identity and Access Management (IAM) for validating permissions, activating new APIs, or creating new Cloud resources. Cloud IAM in turn relies on a centralized and planet-scale system to manage and evaluate access control for data stored within Google, known as Zanzibar [1]. Cloud IAM consists of regional and global instances; regional instances are isolated from each other and from the global instance for reliability. However, some specific IAM checks, such as checking an organizational policy, reference the global IAM instance.

The trigger of this incident was a rarely-exercised type of configuration change in Zanzibar which also impacted Cloud IAM. A typical change to this configuration mutates existing configuration namespaces, and is gradually rolled out through a sequence of canary steps. However, in this case, a new configuration namespace was added, and a latent issue with our canarying system allowed this specific type of configuration change to propagate globally in a rapid manner. As the configuration was pushed to production, the global Cloud IAM service quickly began to experience internal errors. This resulted in downstream operations with a dependency on global Cloud IAM failing.

[1] https://research.google/pubs/pub48190/

# REMEDIATION AND PREVENTION

Google engineers were automatically alerted to elevated error rates affecting Cloud IAM at 2020-04-08 06:52 US/Pacific and immediately began investigating. By 07:27, the engineering team responsible for managing Zanzibar identified the configuration change responsible for the issue, and swiftly reverted the change to mitigate. The mitigation finished propagating by 07:42, partially resolving the incident for a majority of internal services. Specific services such as the external Cloud IAM API, high-availability Cloud SQL, and Google BigQuery streaming took additional time to recover due to complications arising from the initial outage. Services with extended recovery timelines are described in the “DETAILED DESCRIPTION OF IMPACT” section below.

Google’s standard production practice is to push any change gradually, in increments designed to maximize the probability of detecting problems before they have broad impact. Furthermore, we adhere to a philosophy of defense-in-depth: when problems occur, rapid mitigations (typically rollbacks) are used to restore service within service level objectives. In this outage, a combination of bugs resulted in these practices failing to be applied effectively.
In addition to rolling back the configuration change responsible for this outage, we are fixing the issue with our canarying and release system that allowed this specific class of change to rapidly roll out globally; instead, such changes will in the future be subject to multiple layers of canarying, with automated rollback if problems are detected, and a progressive deployment over the course of multiple days. (An illustrative sketch of such a staged rollout follows this report.) Both Cloud IAM and Zanzibar will enter a change freeze to prevent the possibility of further disruption to either service before these changes are implemented. We truly understand how important regional reliability is for our users and deeply apologize for this incident.

# DETAILED DESCRIPTION OF IMPACT

On Wednesday 08 April, 2020 from 6:48 to 7:42 US/Pacific, Cloud IAM experienced an outage, which had varying degrees of impact on downstream services as described in detail below.

## Cloud IAM

Experienced a 100% error rate globally on all internal Cloud IAM API requests from 6:48 - 7:42. Upon the internal Cloud IAM service becoming unavailable (which impacted downstream Cloud services), the external Cloud IAM API also began returning HTTP 500 INTERNAL_ERROR codes. The rate and volume of incoming requests (due to aggressive retry policies) triggered the system’s Denial of Service (DoS) protection mechanism. The automatic DoS protection throttled the service, implementing a rate limit on incoming requests and resulting in query failures and a large volume of retry attempts. Upon the incident’s mitigation, the DoS protection was removed but took additional time to propagate across the fleet. Its removal finished propagating by 8:30, returning the service to normal operation.

## Gmail

Experienced delays receiving and sending emails from 6:50 to 7:39. For inbound emails, 20% of G Suite emails, 21% of G Suite customers, and 0.3% of consumer emails were affected. For outbound emails (including Gmail-to-Gmail), 1.3% of G Suite emails and 0.3% of consumer emails were affected. Message delay periods varied, with the 50th percentile peaking at 3.7 seconds and the 90th percentile at up to 2580 seconds.

## Compute Engine

Experienced a 100% error rate when performing firewall modifications or create, update, or delete instance operations globally from 6:48 to 7:42.

## Cloud SQL

Experienced a 100% error rate when performing instance creation, deletion, backup, and failover operations globally for high-availability (HA) instances from 6:48 - 7:42, due to the inability to authenticate VMs via the Cloud IAM service. Additionally, Cloud SQL experienced extended impact from this outage for 3% of HA instances. Such instances initiated failover when upstream health metrics were not propagated due to the Cloud IAM issues. HA instances automatically failed over in an attempt to recover from what was believed to be failures occurring on the master instances. Upon failing over, these instances became stuck in a failed state: the Cloud IAM outage prevented the master’s underlying data disk from being attached to the failover instance, leaving the failover instance stuck. These stuck instances required manual engineer intervention to bring them back online. Impact to affected instances ranged from 6:48 - 10:00, for a total duration of 3 hours and 12 minutes. To prevent HA Cloud SQL instances from encountering these failures in the future, we will change the auto-failover system to avoid triggering based on IAM issues.
We are also re-examining the auto-failover system more generally to make sure it can distinguish a real outage from a system-communications issue going forward.

## Cloud Pub/Sub

Experienced 100% error rates globally for Topic administration operations (create, get, and list) from 6:48 - 7:42.

## Kubernetes Engine

Experienced a 100% error rate for cluster creation requests globally from 6:49 - 7:42.

## BigQuery

Datasets.get and projects.getServiceAccount experienced nearly 100% failures globally from 6:48 - 7:42. Other dataset operations experienced elevated error rates up to 40% for the duration of the incident. BigQuery streaming was also impacted in us-east1 for 6 minutes, us-east4 for 20 minutes, asia-east1 for 12 minutes, asia-east2 for 40 minutes, europe-north1 for 11 minutes, and the EU multi-region for 52 minutes, with most of these regions experiencing average error rates of up to 30%. The EU multi-region, US multi-region, and us-east2 regions specifically experienced higher error rates, reaching nearly 100% for the duration of their impact windows. Additionally, BigQuery streaming in the US multi-region experienced issues coping with traffic volume once IAM recovered. BigQuery streaming in the US multi-region experienced a 55% error rate from 7:42 - 8:44, for a total impact duration of 1 hour and 56 minutes.

## App Engine

Experienced a 100% error rate when creating, updating, or deleting app deployments globally from 6:48 to 7:42. Public apps did not have HTTP serving affected.

## Cloud Run

Experienced a 100% error rate when creating, updating, or deleting deployments globally from 6:48 to 7:42. Public services did not have HTTP serving affected.

## Cloud Functions

Experienced a 100% error rate when creating, updating, or deleting functions with access control [2] globally from 6:48 to 7:42. Public functions did not have HTTP serving affected.

[2] https://cloud.google.com/functions/docs/concepts/iam

## Cloud Monitoring

Experienced intermittent errors when listing workspaces via the Cloud Monitoring UI from 6:42 - 7:42.

## Cloud Logging

Experienced average and peak error rates of 60% for ListLogEntries API calls from 6:48 - 7:42. Affected customers received INTERNAL_ERRORs. Additionally, create, update, and delete sink calls experienced a nearly 100% error rate during the impact window. Log ingestion and other Cloud Logging APIs were unaffected.

## Cloud Dataflow

Experienced 100% error rates on several administrative operations including job creation, deletion, and autoscaling from 6:55 - 7:42.

## Cloud Dataproc

Experienced a 100% error rate when attempting to create clusters globally from 6:50 - 7:42.

## Cloud Data Fusion

Experienced a 100% error rate for create instance operations globally from 6:48 - 7:42.

## Cloud Composer

Experienced 100% error rates when creating, updating, or deleting Cloud Composer environments globally between 6:48 - 7:42. Existing environments were unaffected.

## Cloud AI Platform Notebooks

Experienced elevated average error rates of 97.2% (peaking to 100%) from 6:52 - 7:48 in the following regions: asia-east1, asia-northeast1, asia-southeast1, australia-southeast1, europe-west1, northamerica-northeast1, us-central1, us-east1, us-east4, and us-west1.

## Cloud KMS

Experienced a 100% error rate for Create operations globally from 6:49 - 7:40.
## Cloud Tasks

Experienced an average error rate of 8% (up to 15%) for CreateTasks, and a 96% error rate for AddTasks, in the following regions: asia-northeast3, asia-south1, australia-southeast1, europe-west1, europe-west6, northamerica-northeast1, southamerica-east1, us-central1, us-east4, and us-west3. Delivery of existing tasks was unaffected, but downstream services may have experienced other issues as documented.

## Cloud Scheduler

Experienced 100% error rates for CreateJob and UpdateJob requests globally from 6:48 - 7:42.

## App Engine Task Queues

Experienced an average error rate of 18% (up to 25% at peak) for UpdateTask requests from 6:48 - 7:42.

## Cloud Build

Experienced no API errors; however, all builds submitted between 6:48 and 7:42 were queued until the issue was resolved.

## Cloud Deployment Manager

Experienced an elevated average error rate of 20%, peaking to 36%, for operations globally between 6:49 and 7:39.

## Data Catalog

Experienced a 100% error rate for API operations globally from 6:48 - 7:42.

## Firebase Real-time Database

Experienced elevated average error rates of 7% for REST API and long-polling requests (peaking to 10%) during the incident window.

## Firebase Test Lab

Experienced elevated average error rates of 85% (peaking to 100%) globally for Android tests running on virtual devices in Google Compute Engine instances. Impact lasted from 6:48 - 7:54, for a duration of 1 hour and 6 minutes.

## Firebase Hosting

Experienced a 100% error rate when creating new versions globally from 6:48 - 7:42.

## Firebase Console

Experienced a 100% error rate for developer resources globally. Additionally, the Firedata API experienced an average error rate of 20% for API operations from 6:48 - 7:42. Affected customers experienced a range of issues related to the Firebase Console and API: API invocations returned empty lists of projects or HTTP 404 errors, and affected customers were unable to create, delete, update, or list many Firebase entities, including apps (Android, iOS, and Web), hosting sites, Real-time Database instances, and Firebase-linked GCP buckets. Firebase developers were also unable to update billing settings. Firebase Cloud Functions could not be deployed successfully. Some customers experienced quota exhaustion errors due to extensive retry attempts.

## Cloud IoT

Experienced a 100% error rate when performing DeleteRegistry API calls from 6:48 - 7:42. Though DeleteRegistry API calls threw errors, the deletions issued did complete successfully.

## Cloud Memorystore

Experienced a 100% error rate for create, update, cancel, delete, and ListInstances operations on Redis instances globally from 6:48 - 7:42.

## Cloud Filestore

Experienced an average error rate of 70% for instance and snapshot creation, update, list, and deletion operations, with a peak error rate of 92%, globally between 6:48 and 7:45.

## Cloud Healthcare and Cloud Life Sciences

Experienced a 100% error rate for CreateDataset operations globally from 6:48 - 7:42.

# SLA CREDITS

If you believe your paid application experienced an SLA violation as a result of this incident, please submit the SLA credit request: https://support.google.com/cloud/contact/cloud_platform_sla

A full list of all Google Cloud Platform Service Level Agreements can be found at https://cloud.google.com/terms/sla/.

For G Suite, please request an SLA credit through one of the Support channels: https://support.google.com/a/answer/104721

The G Suite Service Level Agreement can be found at https://gsuite.google.com/intl/en/terms/sla.html
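One of the preventions above is multi-stage canarying with automated rollback and a progressive, multi-day deployment. The sketch below illustrates that general pattern only, under assumed hooks: `apply_config`, `error_rate`, and `rollback` are hypothetical stand-ins for whatever a real release system exposes, not any Google-internal API.

```python
import time

# Hypothetical stages: fraction of the fleet receiving the new configuration
# at each step, with a soak period between steps (e.g. one day per stage).
STAGES = [0.01, 0.05, 0.25, 1.0]
SOAK_SECONDS = 24 * 60 * 60

def rollout(apply_config, error_rate, rollback, threshold=0.01):
    """Progressively apply a config change, rolling back if errors spike."""
    for fraction in STAGES:
        apply_config(fraction)            # push to this slice of the fleet
        time.sleep(SOAK_SECONDS)          # let the canary soak
        if error_rate() > threshold:      # automated health check
            rollback()                    # undo the change everywhere
            raise RuntimeError(f"rolled back at {fraction:.0%} of the fleet")
    return "fully rolled out"
```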

Last Update: A few months ago

RESOLVED: Incident 20005 - We are investigating an issue with elevated error rates across multiple Google Cloud Platform Services

Our engineers have determined this issue to be linked to a single Google incident. For regular status updates, please visit [https://status.cloud.google.com/incident/zall/20005](https://status.cloud.google.com/incident/zall/20005). No further updates will be made through this incident.

Last Update: A few months ago

UPDATE: Incident 20005 - We are investigating an issue with elevated error rates across multiple Google Cloud Platform Services

Description: We are experiencing an issue in Cloud IAM which is impacting multiple services. Mitigation work is currently underway by our engineering team. We believe that most impact was mitigated at 07:40 US/Pacific, allowing many services to recover. Impact is now believed to be limited more directly to use of the IAM API. We will provide an update by Wednesday, 2020-04-08 09:00 US/Pacific with current details. Diagnosis: App Engine, Dataproc, Cloud Logging, Firebase Console, Cloud Build, Cloud Pub/Sub, BigQuery, Compute Engine, Cloud Tasks, Cloud Memorystore, Firebase Test Lab, Firebase Hosting, Cloud Networking, Cloud Data Fusion, Cloud Kubernetes Engine, Cloud Composer, Cloud SQL, and Firebase Realtime Database may experience elevated error rates. Additionally, customers may be unable to file support cases. Workaround: Customers may continue to file cases using https://support.google.com/cloud/contact/prod_issue or via phone.

Last Update: A few months ago

UPDATE: Incident 20005 - We are investigating an issue with elevated error rates across multiple Google Cloud Platform Services

Description: We are experiencing an issue with Google Cloud infrastructure components beginning on Wednesday, 2020-04-08 06:52 US/Pacific. Symptoms: elevated error rates across multiple products. Customers may be experiencing an issue with Google Cloud Support in which users are unable to create new support cases. Customers may continue to file cases using https://support.google.com/cloud/contact/prod_issue or via phone. Our engineering team continues to investigate the issue. We will provide an update by Wednesday, 2020-04-08 08:15 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers may be unable to create new support cases. Workaround: Customers may continue to file cases using https://support.google.com/cloud/contact/prod_issue or via phone.

Last Update: A few months ago

UPDATE: Incident 20005 - We are investigating an issue with elevated error rates across multiple Google Cloud Platform Services

Description: We are experiencing an issue with Google Cloud infrastructure components. Symptoms: elevated error rates across multiple products. Our engineering team continues to investigate the issue. We will provide an update by Wednesday, 2020-04-08 08:15 US/Pacific with current details. We apologize to all who are affected by the disruption. Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

# ISSUE SUMMARY

On Thursday 26 March, 2020 at 16:14 US/Pacific, Cloud IAM experienced elevated error rates which caused disruption across many services for a duration of 3.5 hours, and stale data (resulting in continued disruption to administrative operations for a subset of services) for a duration of 14 hours. Google's commitment to user privacy and data security means that IAM is a common dependency across many GCP services. To our Cloud customers whose business was impacted during this disruption, we sincerely apologize – this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. We have conducted an internal investigation and are taking steps to improve the resiliency of our service.

# ROOT CAUSE

Many Cloud services depend on a distributed Access Control List (ACL) in Identity and Access Management (IAM) for validating permissions, activating new APIs, or creating new cloud resources. These permissions are stored in a distributed database and are heavily cached. Two processes keep the database up-to-date: one real-time, and one batch. However, if the real-time pipeline falls too far behind, stale data is served, which may impact operations in downstream services. (A toy model of these two update paths appears after this report.)

The trigger of the incident was a bulk update of group memberships that expanded to an unexpectedly high number of modified permissions, which generated a large backlog of queued mutations to be applied in real-time. The processing of the backlog was degraded by a latent issue with the cache servers, which led to them running out of memory; this in turn resulted in requests to IAM timing out. The problem was temporarily exacerbated in various regions by emergency rollouts performed to mitigate the high memory usage.

# REMEDIATION AND PREVENTION

Once the scope of the issue became clear at 2020-03-26 16:35 US/Pacific, Google engineers quickly began looking for viable mitigations. At 17:06, an offline job to build an updated cache was manually started. Additionally, at 17:34, cache servers were restarted with additional memory, along with a configuration change to allow temporarily serving stale data (a snapshot from before the problematic bulk update) while investigation continued; this mitigated the first impact window. A second window of impact began in other regions at 18:49. At 19:13, similar efforts to mitigate with additional memory began, which mitigated the second impact window by 19:42. Additional efforts to fix the stale data continued, and finally the latest offline backfill of IAM data was loaded into the cache servers. The remaining time was spent progressing through the backlog of changes, and live data was slowly re-enabled region-by-region to successfully mitigate the staleness globally at 2020-03-27 05:55.

Google is committed to quickly and continually improving our technology and operations to both prevent service disruptions, and to mitigate them quickly when they occur. In addition to ensuring that the cache servers can handle bulk updates of the kind which triggered this incident, efforts are underway to optimize the memory usage and protections on the cache servers, and allow emergency configuration changes without requiring restarts. To allow us to mitigate data staleness issues more quickly in the future, we will also be sharding out the database batch processing to allow for parallelization and more frequent runs.
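As an aside on that last prevention item: sharding a batch rebuild so that independent slices can run in parallel is a standard pattern. A minimal, purely illustrative Python sketch follows; the `apply_shard` callable and the shard layout are assumptions for this example, not the actual IAM pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_shards(mutations, num_shards, apply_shard):
    """Apply a backlog of permission mutations in parallel shards.

    `mutations` is a list of pending updates and `apply_shard` is a
    hypothetical callable that writes one shard's updates to the store.
    Splitting the backlog lets shards run concurrently, and the smaller
    per-shard inputs also make more frequent batch runs practical.
    """
    shards = [mutations[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        list(pool.map(apply_shard, shards))  # one worker per shard
```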
We understand how important regional reliability is for our users and apologize for this incident.

# DETAILED DESCRIPTION OF IMPACT

On Thursday 26 March, 2020 from 16:14 to Friday 27 March, 2020 06:20 US/Pacific, Cloud IAM experienced out-of-date (stale) data, which had varying degrees of impact as described in detail below. Additionally, multiple services experienced bursts of Cloud IAM errors. These spikes were clustered around 16:35 to 17:45, 18:45 to 19:00, and 19:20 to 19:40; however, the precise timing for each Cloud region differed. Error rates reached up to 100% in the latter two periods as mitigations propagated globally. As a result, many Cloud services experienced concurrent outages in multiple regions, and most regions experienced some impact. Even though error rates recovered after mitigations, Cloud IAM members from Google Groups [1] remained stale until the full incident had been resolved. The staleness varied in severity throughout the incident as new batch processes completed, with an approximate four-hour delay at 16:14, up to a nine-hour delay at 21:13. Users directly granted IAM roles were not impacted by stale permissions.

[1] https://cloud.google.com/iam/docs/overview#google_group

## Cloud IAM

Experienced delays mapping IAM roles from changes in Google Groups membership for users and Service Accounts, which resulted in serving stale permissions globally from 2020-03-26 16:15 to 2020-03-27 05:55. Permissions assigned to individual non-service-account users were not affected.

## App Engine (GAE)

Experienced elevated rates of deployment failures and increased timeouts on serving for apps with access control [2] from 16:22 to 2020-03-27 05:48 in the following regions: asia-east2, asia-northeast1, asia-south1, europe-west1, europe-west2, europe-west3, australia-southeast1, northamerica-northeast1, us-central1, us-east1, us-east4, and us-west2. Public apps did not have HTTP serving affected.

[2] https://cloud.google.com/appengine/docs/standard/python3/access-control

## AI Platform Predictions

Experienced elevated error rates from 16:50 to 19:54 in the following regions: europe-west1, asia-northeast1, us-east4. The average error rate was <1%, with a peak of 2.2% during the impact window.

## AI Platform Notebooks

Experienced elevated error rates and failures creating new instances from 16:34 to 19:17 in the following regions: asia-east1, us-west1, us-east1.

## Cloud Asset Inventory

Experienced elevated error rates globally from 17:00 to 19:56. The average error rate during the first spike from 16:50 to 17:42 was 5%, and 40% during the second spike from 19:34 to 19:43, with a peak of 45%.

## Cloud Composer

Experienced elevated error rates for various API calls in all regions, with the following regions seeing up to a 100% error rate: asia-east2, europe-west1, us-west3. This primarily impacted environment creation, updates, and upgrades; existing healthy environments should have been unaffected.

## Cloud Console

Experienced elevated error rates loading various pages globally from 16:40 to 20:00. 4.2% of page views experienced degraded performance, with a spike between 16:40 and 18:00 and another between 18:50 and 20:00. Some pages may have seen up to 100% degraded page views depending on the service requested.
## Cloud Dataproc

Experienced elevated cluster operation error rates from 16:30 to 19:45 in the following regions: asia-east1, asia-east2, asia-northeast1, asia-northeast2, asia-northeast3, asia-south1, asia-southwest1, australia-southeast1, europe-north1, europe-west1, europe-west2, europe-west3, europe-west4, europe-west6, northamerica-northeast1, southamerica-east1, us-central1, us-east1, us-east4, us-west1, us-west2. The average and peak error rates were <1% in all regions.

## Cloud Dataflow

Experienced elevated error rates creating new jobs between 16:34 and 19:43 in the following regions: asia-east1, asia-northeast1, europe-west1, europe-west2, europe-west3, europe-west4, us-central1, us-east4, us-west1. The error rate varied by region over the course of the incident, averaging 70%, with peaks of up to 100%. Existing jobs may have seen temporarily increased latency.

## Cloud Data Fusion

Experienced elevated error rates creating new pipelines from 17:00 to 2020-03-27 07:00 in the following regions: asia-east1, asia-east2, asia-northeast1, asia-northeast2, asia-northeast3, asia-south1, asia-southeast1, australia-southeast1, europe-north1, europe-west1, europe-west2, europe-west3, europe-west4, europe-west6, northamerica-northeast1, southamerica-east1, us-central1, us-east1, us-east4, us-west1, and us-west2. 100% of create operations failed during the impact window.

## Cloud Dialogflow

Experienced elevated API errors from 16:36 to 17:43 and from 19:36 to 19:43 globally. The error rate averaged 2.6%, with peaks of up to 12% during the impact window.

## Cloud Filestore

Experienced elevated errors on instance operations from 16:44 to 17:53 in asia-east1, asia-east2, and us-west1; from 18:45 to 19:10 in asia-northeast1, australia-southeast1, and southamerica-east1; and from 19:30 to 19:45 in europe-west4, asia-east2, europe-north1, australia-southeast1, us-east4, and us-west1. Globally, projects which had recently activated the Filestore service were unable to create instances.

## Cloud Firestore & Cloud Datastore

Experienced elevated error rates and increased request latency between 16:41 and 20:14. From 16:41 to 17:45 only europe-west1 and asia-east2 were impacted. On average, the availability of Firestore was 99.75%, with a low of 97.3% at 19:38. Datastore had an average error rate of <0.1%, with a peak of 1% at 19:40. From 18:45 to 19:06 the following regions were impacted: asia-east2, asia-northeast1, asia-northeast2, asia-northeast3, asia-south1, australia-southeast1, europe-west1, europe-west2, europe-west3, europe-west4, europe-west5, europe-west6, northamerica-northeast1, southamerica-east1, us-central1, us-east4, us-west2, us-west3, and us-west4. Finally, from 19:27 to 20:15 all regions were impacted.

## Cloud Functions

Experienced elevated rates of deployment failures and increased timeouts when serving functions with access control [3] from 16:22 to 2020-03-27 05:48 in the following regions: asia-east2, europe-west1, europe-west2, europe-west3, asia-northeast1, europe-north1, and us-east4. Public services did not have HTTP serving affected.

[3] https://cloud.google.com/functions/docs/concepts/iam

## Cloud Healthcare API

Experienced elevated error rates in the ‘us’ multi-region from 16:47 to 17:40, with a 12% average error rate and a peak error rate of 25%.

## Cloud KMS

Experienced elevated error rates from 16:30 to 17:46 in the following regions: asia, asia-east1, asia-east2, europe, europe-west1, us-west1, southamerica-east1, europe-west3, europe-north1, europe-west4, and us-east4.
The average error rate during the impact window was 26%, with a peak of 36%.

## Cloud Memorystore

Experienced failed instance operations from 16:44 to 17:53 in asia-east1, asia-east2, and us-west1; from 18:45 to 19:10 in asia-northeast1, australia-southeast1, and southamerica-east1; and from 19:30 to 19:45 in europe-west4, asia-east2, europe-north1, australia-southeast1, us-east4, and us-west1. Globally, projects which had recently activated the Memorystore service were unable to create instances until 2020-03-27 06:00.

## Cloud Monitoring

Experienced elevated error rates for the Dashboards and Accounts API endpoints from 16:35 to 19:42 in the following regions: asia-west1, asia-east1, europe-west1, us-central1, us-west1. Rates fluctuated by region throughout the duration of the incident, with an average of 15% for the Accounts API and 30% for the Dashboards API, and peaks of 26% and 80% respectively. The asia-east1 region had the most significant impact.

## Cloud Pub/Sub

Experienced elevated error rates in all regions from 16:30 to 19:46, with the most significant impact in europe-west1, asia-east1, asia-east2, and us-central1. The average error rate during each impact window was 30%, with a peak of 59% at 19:36. Operations had the following average error rates: Publish: 3.7%, StreamingPull: 1.9%, Pull: 1.4%.

## Cloud Scheduler

Experienced elevated error rates in all regions from 16:42 to 17:42 and from 18:47 to 19:42, with the most significant impact in asia-east2, europe-west1, and us-central1. The error rates varied across regions during the impact window, with peaks of up to 100%.

## Cloud Storage (GCS)

Experienced elevated error rates and timeouts for various API calls from 16:34 to 17:32 and 19:15 to 19:41. Per-region availability dropped as low as 91.4% for asia-east2, 98.55% for europe-west1, 99.04% for us-west1, 98.15% for the ‘eu’ multi-region, and 98.45% for the ‘asia’ multi-region. Additionally, errors in the Firebase Console were seen specifically for first-time Cloud Storage for Firebase users trying to create a project from 17:35 to 2020-03-27 08:18.

## Cloud SQL

Experienced errors creating new instances globally from 2020-03-26 16:22 to 2020-03-27 06:05.

## Cloud Spanner

Cloud Spanner instances experienced failures when managing or accessing databases from 17:03 to 20:40 in the following regions: regional-us-west1, regional-asia-east1, regional-asia-east2, regional-europe-west1, regional-asia-east2, eur3. The average error rate was 2.6% for all regions, with a peak of 33.3% in asia-east2.

## Cloud Tasks

Experienced elevated error rates on new task creations in asia-east2, asia-northeast1, asia-northeast2, asia-northeast3, asia-south1, australia-southeast1, europe-west2, europe-west3, europe-west6, northamerica-northeast1, southamerica-east1, us-central1, us-east4, us-west2. Delivery of existing tasks was unaffected, but downstream services may have experienced other issues as documented.

## Compute Engine (GCE)

Experienced elevated error rates on API operations, and elevated latency for disk snapshot creation, from 19:35 to 19:43 in all regions. The average and peak error rate was 40% throughout the impact window.

## Container Registry

Experienced elevated error rates on the Container Analysis API. Additionally, there was increased latency on Container Scanning and Continuous Analysis requests, which took up to 1 hour. Continuous Analysis was also delayed.
## Cloud Run

Experienced elevated rates of deployment failures and increased timeouts serving deployed services with access control [4] from 16:22 to 2020-03-27 05:48 in the following regions: asia-east1, asia-northeast1, europe-north1, europe-west1, europe-west4, us-east4, us-west1. Public services did not have HTTP serving affected. Newly created Cloud projects (with new IAM permissions) weren't able to complete service deployments because of stale IAM reads on the service account's permissions.

[4] https://cloud.google.com/run/docs/securing/managing-access

## Data Catalog

Experienced elevated error rates on read and write APIs in the following regions: ‘us’ multi-region, ‘eu’ multi-region, asia-east1, asia-east2, asia-south1, asia-southeast1, australia-southeast1, europe-west1, europe-west4, us-central1, us-west1, and us-west2. The exact error rate percentages varied by API method and region, but ranged from 0% to 8%. Errors began at 16:30, saw an initial recovery at 17:50, and were fully resolved by 19:42.

## Firebase ML Kit

Experienced elevated errors from 16:45 to 17:45 globally. The average error rate was 10% globally, with a peak of 14%. However, users located near the Pacific Northwest and Western Europe saw the most impact.

## Google BigQuery

Experienced significantly elevated error rates across many API methods in all regions. The asia-east1 and asia-east2 regions were the most impacted, with 100% of metadata dataset insertion operations failing. The following regions experienced multiple customers with error rates above 10%: asia-east1, asia-east2, asia-northeast1, australia-southeast1, ‘eu’ multi-region, europe-north1, europe-west2, europe-west3, us-east4, and us-west2. The first round of errors occurred between 16:42 and 17:42. The second round of errors occurred between 18:45 and 19:45 and experienced slightly higher average error rates than the first. The exact impact windows differed slightly between APIs.

## Kubernetes Engine (GKE)

Experienced elevated errors on the GKE API from 16:35 - 17:40 and 19:35 - 19:40 in the following regions: asia-east1, asia-east2, us-west1, and europe-west1. This mainly affected cluster operations including creation, listing, upgrades, and node changes. Existing healthy clusters remained unaffected.

## Secret Manager

Experienced elevated error rates from 16:44 to 17:43 on secrets stored globally; the most impacted regions were europe-west1, asia-east1, and us-west1, with an additional spike between 19:35 and 19:42. The average error rate was <1%, with a peak of 4.2%.

# SLA CREDITS

If you believe your paid application experienced an SLA violation as a result of this incident, please submit the SLA credit request: https://support.google.com/cloud/contact/cloud_platform_sla

A full list of all Google Cloud Platform Service Level Agreements can be found at https://cloud.google.com/terms/sla/.

For G Suite, please request an SLA credit through one of the Support channels: https://support.google.com/a/answer/104721

The G Suite Service Level Agreement can be found at https://gsuite.google.com/intl/en/terms/sla.html
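Referring back to the root cause above (a real-time mutation feed plus a periodic batch rebuild, with stale data served when the real-time path lags), here is a deliberately simplified Python model. Every name in it is illustrative; none of it reflects the actual IAM or Zanzibar implementation.

```python
import time

class AclCache:
    """Toy model of the two update paths described in the root cause.

    A real-time feed applies individual permission mutations as they arrive,
    while a batch job periodically installs a complete snapshot. If the
    real-time feed falls too far behind, lookups fall back to the last
    snapshot, trading freshness for availability -- roughly the mitigation
    used during the incident.
    """

    def __init__(self, max_lag_seconds=300):
        self.live = {}                  # (member, resource) -> role, kept current
        self.snapshot = {}              # last full batch rebuild
        self.live_as_of = time.time()   # timestamp of the newest applied mutation
        self.max_lag_seconds = max_lag_seconds

    def apply_mutation(self, member, resource, role, event_time):
        """Real-time path: apply one permission change."""
        self.live[(member, resource)] = role
        self.live_as_of = event_time

    def load_snapshot(self, snapshot):
        """Batch path: install a complete rebuilt permission set."""
        self.snapshot = dict(snapshot)

    def lookup(self, member, resource):
        """Serve live data while it is fresh; otherwise serve the snapshot."""
        lag = time.time() - self.live_as_of
        source = self.live if lag <= self.max_lag_seconds else self.snapshot
        return source.get((member, resource))  # may be stale under heavy lag
```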

Last Update: A few months ago

RESOLVED: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

The issue with Google Cloud infrastructure components has been resolved for all affected users as of Tuesday, 2020-03-31 07:45 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. We will provide more information by Tuesday, 2020-03-31 07:45 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. We will provide more information by Tuesday, 2020-03-31 06:45 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. We will provide more information by Tuesday, 2020-03-31 05:30 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. We will provide more information by Tuesday, 2020-03-31 04:30 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. We will provide more information by Tuesday, 2020-03-31 03:30 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. We will provide more information by Tuesday, 2020-03-31 02:30 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. We will provide more information by Tuesday, 2020-03-31 01:30 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. We will provide more information by Tuesday, 2020-03-31 00:15 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Our current data indicates that error rates have decreased with intermittent spikes. Our engineering team is continuing the mitigation for a full resolution. We will provide more information by Monday, 2020-03-30 22:30 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20004 - We are experiencing an intermittent issue with Google Cloud infrastructure components.

Description: Mitigation work is still underway by our engineering team. Current data indicates that error rates are decreasing globally. We will provide more information by Monday, 2020-03-30 21:15 US/Pacific. Diagnosis: Affected customers will experience intermittent failures in Cloud SQL instance creations and deletions, and increased error rates in related products which include: Cloud SQL, Cloud Data Fusion, and Cloud Composer. Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 20004 - We Are Investigating an Issue with the Cloud Console

The issue with Cloud Console has been resolved for all affected users as of Friday, 2020-03-27 06:32 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

RESOLVED: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

The issue with Google Cloud infrastructure components has been resolved for all affected projects as of Friday, 2020-03-27 06:32 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

Description: The engineering team has tested a fix, which is being gradually rolled out to the affected services. Current data indicates that the issue has been resolved for the majority of users and we expect full resolution within the next hour (by 2020-03-27 07:00 US/Pacific). The estimate is tentative and is subject to change. We will provide an update by Friday, 2020-03-27 07:00 US/Pacific with current details. Diagnosis: Affected customers may experience delayed IAM modifications that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigTable, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud Memorystore, Cloud Filestore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog, Data Fusion, and Cloud Console. Workaround: Retry failed requests with exponential backoff.
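The suggested workaround, retrying failed requests with exponential backoff, can be implemented in a few lines on the client side. A minimal sketch with illustrative defaults follows; the wrapped `client.get_resource` call in the final comment is hypothetical, not a specific Google API.

```python
import random
import time

def call_with_backoff(request, max_attempts=6, base=0.5, cap=32.0):
    """Retry a failing API call with exponential backoff and full jitter.

    `request` is any zero-argument callable that raises on failure; the
    parameters are illustrative defaults, not values recommended by Google.
    Jitter spreads retries out so that many clients do not hammer a
    recovering service at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Example (hypothetical client): wrap an idempotent read so transient
# failures are retried with backoff.
# result = call_with_backoff(lambda: client.get_resource("projects/my-project"))
```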

Last Update: A few months ago

UPDATE: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

Description: The engineering team has tested a fix, which is being gradually rolled out to the affected services. We expect full resolution within the next two hours (by 2020-03-27 06:00 US/Pacific). The estimate is tentative and is subject to change. We will provide an update by Friday, 2020-03-27 06:00 US/Pacific with current details. Diagnosis: Affected customers may experience delayed IAM modifications that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud Memorystore, Cloud Filestore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog, Data Fusion, and Cloud Console. Workaround: Retry failed requests with exponential backoff.

Last Update: A few months ago

UPDATE: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

Description: The engineering team has tested a fix, which is being gradually rolled out to the affected services. We expect full resolution within the next three hours (by 2020-03-27 05:00 US/Pacific). The estimate is tentative and is subject to change. We will provide an update by Friday, 2020-03-27 05:00 US/Pacific with current details. Diagnosis: Affected customers may experience delayed IAM modifications that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud Memorystore, Cloud Filestore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog, Data Fusion, and Cloud Console. Workaround: Retry failed requests with exponential backoff.

Last Update: A few months ago

UPDATE: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

Description: Mitigation work is still underway by our engineering team for a full resolution. The mitigation is expected to complete within the next few hours. We will provide an update by Friday, 2020-03-27 02:00 US/Pacific with current details. Diagnosis: Affected customers may experience delayed IAM modifications that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud Memorystore, Cloud Filestore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog, Data Fusion, and Cloud Console. Workaround: Retry failed requests with exponential backoff.

Last Update: A few months ago

UPDATE: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

Description: Mitigation work is still underway by our engineering team for a full resolution. We will provide an update by Friday, 2020-03-27 00:00 US/Pacific with current details. Diagnosis: Affected customers may experience delayed IAM modifications that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud Memorystore, Cloud Filestore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog, Data Fusion, and Cloud Console. Workaround: Retry failed requests with exponential backoff.

Last Update: A few months ago

UPDATE: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

Description: We believe the issue with Google Cloud infrastructure components is partially resolved. Restoration of IAM modifications to real-time is underway and Cloud IAM latency has decreased. We will provide an update by Thursday, 2020-03-26 23:00 US/Pacific with current details. Diagnosis: Affected customers may experience delayed IAM modifications that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud Memorystore, Cloud Filestore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog, Data Fusion, and Cloud Console. Workaround: Retry failed requests with exponential backoff.

Last Update: A few months ago

UPDATE: Incident 20003 - Elevated error rates across multiple Google Cloud Platform services.

Description: We believe the issue with Google Cloud infrastructure components is partially resolved. The mitigations have been rolled out globally, and the errors should have subsided for all affected users as of 2020-03-26 20:15. There remains a backlog of Cloud IAM modifications, which may still have increased latency before taking effect. We are currently working through the backlog to restore IAM applications to real-time. We will provide an update by Thursday, 2020-03-26 22:00 US/Pacific with current details. Diagnosis: Affected customers may experience elevated error rates that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud MemoryStore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog, Data Fusion and Cloud Console. Workaround: Retry failed requests with exponential backoff.

Last Update: A few months ago

UPDATE: Incident 20003 - Network connectivity issues, and elevated error rates across multiple Google Cloud Platform services.

Description: Modifications to Cloud IAM permissions and service accounts may have significantly increased latency before taking effect. Existing permissions remain enforced. Mitigation work is still underway by our engineering team for the remaining services. Customers may experience intermittent spikes in errors while mitigations are pushed out globally. The following services have recovered at this time: App Engine, Cloud Functions, Cloud Run, BigQuery, Dataflow, Dialogflow, Cloud Console, MemoryStore, Cloud Storage, Cloud Spanner, Data Catalog, Cloud KMS, and Cloud Pub/Sub. Locations that saw the most impact were us-west1, europe-west1, asia-east1, and asia-east2. Services that may still be seeing impact include: Cloud SQL - new instance creation failing; Cloud Composer - new Composer environments are failing to be created; Cloud IAM - significantly increased latency for changes to take effect. We will provide an update by Thursday, 2020-03-26 21:00 US/Pacific with current details. Diagnosis: Affected customers may experience elevated error rates that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud MemoryStore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog and Cloud Console. Workaround: Retry failed requests with exponential backoff.

Last Update: A few months ago

RESOLVED: Incident 20004 - Cloud Composer environment creations are failing globally.

The issue with Cloud Composer has been resolved for all affected projects as of Thursday, 2020-03-26 17:40 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20004 - Cloud Composer environment creations are failing globally.

Description: Our engineers have determined this issue to be linked to a single Google incident. For regular status updates, please visit [https://status.cloud.google.com/incident/zall/20003](https://status.cloud.google.com/incident/zall/20003). An update will be posted here by Thursday, 2020-03-26 21:00 US/Pacific. Diagnosis: Cloud Composer environment creations are failing globally. Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 20003 - We are experiencing an issue with Google Cloud Functions in Europe, beginning at Wednesday, 2020-02-12 09:40 US/Pacific.

The Google Cloud Functions ListLocations requests issue is believed to be affecting a small number of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 20003 - We are experiencing an issue with Google Cloud Functions in Europe, beginning at Wednesday, 2020-02-12 09:40 US/Pacific.

Description: We are experiencing an issue with Google Cloud Functions ListLocations requests in Europe, beginning at Wednesday, 2020-02-12 09:40 US/Pacific. Symptoms: about 80% error rate on ListLocations requests. Our engineering team continues to investigate the issue. We will provide an update by Wednesday, 2020-02-12 13:30 US/Pacific with current details. Diagnosis: Affected customers will get a "permission denied" error with the message "Cloud Functions API has not been used before or it is disabled". Workaround: None at this time.
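Not part of the original advisory: because the error text suggests a disabled API, one quick way to rule out a project-side configuration problem is to confirm that the API is actually enabled. If it is listed, the error during the incident window is more likely incident noise than a real configuration issue:

```
# Sketch: check whether the Cloud Functions API is enabled for the current project.
# An entry here means the "API has not been used before or it is disabled" message
# is unlikely to reflect real project configuration.
gcloud services list --enabled | grep cloudfunctions.googleapis.com
```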

Last Update: A few months ago

UPDATE: Incident 20003 - We've received a report of an issue with Cloud Functions

Description: We are experiencing an issue with Google Cloud Functions beginning at Wednesday, 2020-02-12 09:40 US/Pacific. Symptoms: about 80% error rate on ListLocations requests. Our engineering team continues to investigate the issue. We will provide an update by Wednesday, 2020-02-12 12:30 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Affected customers will get a "permission denied" error with the message "Cloud Functions API has not been used before or it is disabled". Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20003 - We've received a report of an issue with

Description: We are experiencing an issue with Google Cloud Functions beginning at Wednesday, 2020-02-12 09:40 US/Pacific. Symptoms: about 80% error rate on ListLocations requests. Our engineering team continues to investigate the issue. We will provide an update by Wednesday, 2020-02-12 12:30 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Affected customers will get a "permission denied" error with the message "Cloud Functions API has not been used before or it is disabled". Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 20002 - The issue with Cloud Console has been partially resolved as of 09:10 US/Pacific.

The issue with Cloud Console has been resolved for all affected users as of Friday, 2020-02-07 09:10 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20002 - The issue with Cloud Console has been partially resolved as of 09:10 US/Pacific.

Description: We believe that the issue with Cloud Console has been partially resolved. We do not have an ETA for full resolution at this time, however services have begun recovering as of 09:10 US/Pacific. We will provide an update by Friday, 2020-02-07 11:00 US/Pacific with current details. For customers unable to file support cases through the cloud console, please use https://support.google.com/cloud/contact/prod_issue Diagnosis: Customers impacted by this issue may see Cloud Console load with increased latency or not function as expected, such as page actions appearing to do nothing. Workaround: The gcloud command line interface is unaffected and may provide an alternative to the Cloud Console.
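For readers unfamiliar with the suggested workaround, several common Console tasks have direct gcloud equivalents. The commands below are generic examples, not steps from the incident report:

```
# Sketch: everyday Console tasks from the gcloud CLI while the Console is degraded.
gcloud projects list                               # browse your projects
gcloud compute instances list                      # view VM instances
gcloud logging read "severity>=ERROR" --limit=20   # spot-check recent error logs
```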

Last Update: A few months ago

UPDATE: Incident 20002 - The issue with Cloud Console has been partially resolved as of 09:10 US/Pacific.

Description: We believe the issue with Cloud Console is partially resolved. We do not have an ETA for full resolution at this point. We will provide an update by Friday, 2020-02-07 10:30 US/Pacific with current details. For customers unable to file support cases through the Cloud Console, please use https://support.google.com/cloud/contact/prod_issue Diagnosis: Customers impacted by this issue may see Cloud Console load with increased latency or not function as expected, such as page actions appearing to do nothing. Workaround: The gcloud command line interface is unaffected and may provide an alternative to the Cloud Console.

Last Update: A few months ago

RESOLVED: Incident 20001 - The issue with instance DELETE and STOP operations in Google Compute Engine has been mitigated.

The issue with Google Compute Engine operations in us-central1-a has been resolved for all affected projects as of Wednesday, 2020-01-29 16:16 US/Pacific. We thank you for your patience while we’ve worked on resolving the issue.

Last Update: A few months ago

RESOLVED: Incident 20002 - We've received a report of an issue with Stackdriver Monitoring

The issue with Stackdriver Monitoring has been resolved for all affected projects as of Saturday, 2020-01-18 01:53 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20002 - We've received a report of an issue with Stackdriver Monitoring

Description: We are investigating a potential issue with Stackdriver Monitoring affecting the us-west2 region. We will provide more information by Saturday, 2020-01-18 02:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 20002 - We've received a report of an issue with Stackdriver Monitoring

Description: We are investigating a potential issue with Stackdriver Monitoring affecting the us-west2 region. We will provide more information by Friday, 2020-01-17 23:49 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 20001 - BigQuery exports for Stackdriver Logs delayed in europe-west1

The issue with Stackdriver Logging has been resolved for all affected projects as of Monday, 2020-01-13 18:50 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

RESOLVED: Incident 20001 - We've received a report of an issue with Cloud Spanner.

The issue with Cloud Spanner, where customers are unable to change node count for instances in multi-region configuration "nam3", has been resolved for all affected users as of Wednesday, 2020-01-08 19:08 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20001 - We've received a report of an issue with Google Cloud Functions, Google App Engine, and Firebase Cloud Functions deployments failing for some customers.

The issue with Google Cloud Functions, Google App Engine, and Firebase Cloud Functions deployments has been resolved for all affected users as of Tuesday, 2020-01-07 11:05 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 20001 - We've received a report of an issue with Google Cloud Functions, Google App Engine, and Firebase Cloud Functions deployments failing for some customers.

Description: We are experiencing an issue with Google Cloud Functions, Google App Engine, and Firebase Cloud Functions deployments failing for some customers beginning at Tuesday, 2020-01-07 04:45 US/Pacific. Mitigation work is still underway. The engineering team is rolling back a release to attempt to mitigate this issue. We are seeing a reduction in error rates as the rollback progresses. We will provide more information by Tuesday, 2020-01-07 11:30 US/Pacific. Diagnosis: Google App Engine deployments will fail with the error message "The image has failed to build or could not be uploaded to the Google Container Registry." Google Cloud Functions deployments will fail with error message "Build failed." Firebase Cloud Functions deployments will fail. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 20001 - We've received a report of an issue with Google Cloud Functions, Google App Engine, and Firebase Cloud Functions deployments failing for some customers.

Description: Mitigation work is still underway. The engineering team is rolling back a release to attempt to mitigate this issue. An ETA for completion is not available at this time. We will provide more information by Tuesday, 2020-01-07 10:30 US/Pacific. Diagnosis: Google App Engine deployments will fail with the error message "The image has failed to build or could not be uploaded to the Google Container Registry." Google Cloud Functions deployments will fail with error message "Build failed." Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 19023 - Regional network issue

The issue with multiple simultaneous fiber cuts affecting traffic routed through Google's network in Bulgaria has been resolved for all affected users as of 2019-12-19 02:36 US/Pacific. Google services were not reachable for users who were accessing these Google services primarily through our Bulgaria network point of presence. We thank you for your patience whilst we were working on resolving the issue. We have identified the root cause and routed affected traffic around the impacted parts of our network. We are conducting an internal investigation and will provide a detailed public incident summary at a later date.

Last Update: A few months ago

RESOLVED: Incident 19012 - We are investigating an issue with Google Kubernetes Engine where some nodes in recently upgraded clusters (see affected versions) may be experiencing elevated numbers of kernel panics

The issue with Google Kubernetes Engine clusters with node pools experiencing an elevated number of kernel panics has been resolved in a new release of GKE available as of Wednesday, 2019-11-11 16:00 US/Pacific. The fix is contained in the following versions of GKE, which are currently rolling out to node pools with auto upgrade enabled [1]; this rollout should complete by Friday, 2019-11-15. Any customer on manual updates will need to manually upgrade their nodes to one of the following versions: 1.13.11-gke.14, 1.13.12-gke.8, 1.14.7-gke.23, or 1.14.8-gke.12. Please note that this fix downgrades the version of COS to cos-73-11647-293-0 [2] as a temporary mitigation; we expect the next release of GKE to include an upgraded kernel and a fix for the panics seen in the affected versions listed below. Affected versions were: 1.13.11-gke.9, 1.14.7-gke.14, 1.13.12-gke.1, 1.14.8-gke.1, 1.13.11-gke.11, 1.13.12-gke.2, 1.14.7-gke.17, 1.14.8-gke.2, 1.13.12-gke.3, 1.14.8-gke.6, 1.13.11-gke.12, 1.13.12-gke.4, and 1.14.8-gke.7. We thank you for your patience while we've worked on resolving the issue. [1] - https://cloud.google.com/kubernetes-engine/versioning-and-upgrades#rollout_schedule [2] - https://cloud.google.com/container-optimized-os/docs/release-notes#cos-73-11647-293-0
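As a hedged sketch of the manual upgrade path mentioned above (the cluster name, node pool name, and zone are placeholders, and the version is one of the fixed releases listed):

```
# Sketch: manually upgrade a node pool to one of the fixed GKE versions.
# CLUSTER_NAME, POOL_NAME and us-central1-a are placeholders for your own values.
gcloud container clusters upgrade CLUSTER_NAME \
  --node-pool=POOL_NAME \
  --cluster-version=1.14.8-gke.12 \
  --zone=us-central1-a
```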

Last Update: A few months ago

UPDATE: Incident 19012 - We are investigating an issue with Google Kubernetes Engine where some nodes in recently upgraded clusters (see affected versions) may be experiencing elevated numbers of kernel panics

Description: The following fixed versions are now available and should fix the kernel panic issue: 1.13.11-gke.14, 1.13.12-gke.8, 1.14.7-gke.23 and 1.14.8-gke.12. Mitigation work is currently underway by our engineering team to roll out the fixed versions to clusters configured with node auto-update, and is expected to be complete by Wednesday, 2019-11-13. Clusters not configured with node auto-update can be manually upgraded. At this time the following versions are still affected: 1.13.11-gke.9, 1.14.7-gke.14, 1.13.12-gke.1, 1.14.8-gke.1, 1.13.11-gke.11, 1.13.12-gke.2, 1.14.7-gke.17, 1.14.8-gke.2, 1.13.12-gke.3, 1.14.8-gke.6, 1.13.11-gke.12, 1.13.12-gke.4, and 1.14.8-gke.7. We will provide more information as it becomes available or by Wednesday, 2019-11-13 17:00 US/Pacific at the latest. Diagnosis: Affected users may notice elevated levels of kernel panics on nodes running one of the affected versions listed above. Workaround: Users seeing this issue can upgrade to a fixed release.

Last Update: A few months ago

RESOLVED: Incident 19022 - We've received a report of an issue with Cloud Networking.

The issue with Cloud Networking has been resolved for all affected projects as of Monday, 2019-11-11 04:00 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 19022 - We've received a report of an issue with Cloud Networking.

Description: We believe the issue with Cloud Networking is partially resolved. We will provide an update by Monday, 2019-11-11 04:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19012 - We are investigating an issue with Google Kubernetes Engine where some nodes in recently upgraded clusters (see affected versions) may be experiencing elevated numbers of kernel panics

Description: This issue was downgraded to an orange category Service Disruption as the number of projects actually affected is very low. Mitigation work is currently underway by our engineering team and is expected to be complete by Wednesday, 2019-11-13. At this time the following versions are still affected: 1.13.11-gke.9, 1.14.7-gke.14, 1.13.12-gke.1, 1.14.8-gke.1, 1.13.11-gke.11, 1.13.12-gke.2, 1.14.7-gke.17, 1.14.8-gke.2, 1.13.12-gke.3, 1.14.8-gke.6, 1.13.11-gke.12, 1.13.12-gke.4, and 1.14.8-gke.7. We will provide more information as it becomes available or by Wednesday, 2019-11-13 17:00 US/Pacific at the latest. Diagnosis: Affected users may notice elevated levels of kernel panics on nodes upgraded to one of the affected versions listed above. Workaround: Users seeing this issue can downgrade to a previous release (not listed in the affected versions above). Users on a Release Channel affected by this issue should reach out to support for assistance with downgrading their nodes.

Last Update: A few months ago

RESOLVED: Incident 19008 - Some networking update/create/delete operations pending globally

# ISSUE SUMMARY

On Thursday 31 October, 2019, network administration operations on Google Compute Engine (GCE), such as creating/deleting firewall rules, routes, global load balancers, subnets, or new VPCs, were subject to elevated latency and errors. Specific service impact is outlined in detail below.

# DETAILED DESCRIPTION OF IMPACT

On Thursday 31 October, 2019 from 16:30 to 18:00 US/Pacific and again from 20:24 to 23:08 Google Compute Engine experienced elevated latency and errors applying certain network administration operations. At 23:08, the issue was mitigated fully, and as a result, administrative operations began to succeed for most projects. However, projects which saw network administration operations fail during the incident were left stuck in a state where new operations could not be applied. The cleanup process for these stuck projects took until 2019-11-02 14:00.

The following services experienced up to a 100% error rate when submitting create, modify, and/or delete requests that relied on Google Compute Engine’s global (and in some cases, regional) networking APIs between 2019-10-31 16:40 - 18:00 and 20:24 - 23:08 US/Pacific for a combined duration of 4 hours and 4 minutes:

- Google Compute Engine
- Google Kubernetes Engine
- Google App Engine Flexible
- Google Cloud Filestore
- Google Cloud Machine Learning Engine
- Google Cloud Memorystore
- Google Cloud Composer
- Google Cloud Data Fusion

# ROOT CAUSE

Google Compute Engine’s networking stack consists of software which is made up of two components, a control plane and data plane. The data plane is where packets are processed and routed based on the configuration set up by the control plane. GCE’s networking control plane has global components that are responsible for fanning-out network configurations that can affect an entire VPC network to downstream (regional/zonal) networking controllers. Each region and zone has their own control plane service, and each control plane service is sharded such that network programming is spread across multiple shards.

A performance regression introduced in a recent release of the networking control software caused the service to begin accumulating a backlog of requests. The backlog eventually became significant enough that requests timed out, leaving some projects stuck in a state where further administrative operations could not be applied. The backlog was further exacerbated by the retry policy in the system sending the requests, which increased load still further. Manual intervention was required to clear the stuck projects, prolonging the incident.

# REMEDIATION AND PREVENTION

Google engineers were alerted to the problem on 2019-10-31 at 17:10 US/Pacific and immediately began investigating. From 17:10 to 18:00, engineers ruled out potential sources of the outage without finding a definitive root cause. The networking control plane performed an automatic failover at 17:57, dropping the error rate. This greatly reduced the number of stuck operations in the system and significantly mitigated user impact. However, after 18:59, the overload condition returned and error rates again increased. After further investigation from multiple teams, additional mitigation efforts began at 19:52, when Google engineers allotted additional resources to the overloaded components. At 22:16, as a further mitigation, Google engineers introduced a rate limit designed to throttle requests to the network programming distribution service. At 22:28, this service was restarted, allowing it to drop any pending requests from its queue. The rate limit coupled with the restart mitigated the issue of new operations becoming stuck, allowing the team to begin focusing on the cleanup of stuck projects.

Resolving the stuck projects required manual intervention, which was unique to each failed operation type. Engineers worked round the clock to address each operation type in turn; as each was processed, further operations of the same type (from the same project) also began to be processed. 80% of the stuck operations were processed by 2019-11-01 16:00, and all operations were fully processed by 2019-11-02 14:00.

We will be taking these immediate steps to prevent this class of error from recurring:

- We are implementing continuous load testing as part of the deployment pipeline of the component which suffered the performance regression, so that such issues are identified before they reach production in future.
- We have rate-limited the traffic between the impacted control plane components to avoid the congestion collapse experienced during this incident.
- We are further sharding the global network programming distribution service to allow for graceful horizontal scaling under high traffic.
- We are automating the steps taken to unstick administrative operations, to eliminate the need for manual cleanup after failures such as this one.
- We are adding alerting to the network programming distribution service, to reduce response time in the event of a similar problem in the future.
- We are changing the way the control plane processes requests to allow forward progress even when there is a significant backlog.

Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business. If you believe your application experienced an SLA violation as a result of this incident, please contact us (https://support.google.com/cloud/answer/6282346).

Last Update: A few months ago

RESOLVED: Incident 19008 - Some networking update/create/delete operations pending globally

Our engineers have made significant progress unsticking operations overnight and early this morning. At this point in time, the issue with Google Cloud Networking operations being stuck is believed to be affecting a very small number of remaining projects and our Engineering Team is actively working on unsticking the final stuck operations. If you have questions or are still impacted, please open a case with the Support Team and we will work with you directly until this issue is fully resolved. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking update/create/delete operations pending globally

Description: Mitigation efforts have successfully mitigated most types of operations. At this time the backlog consists of mostly network and subnet deletion operations, and a small fraction of create subnet operations. This affects subnets created during the impact window. Subnets created outside of this window remain unaffected. Mitigation efforts will continue overnight to unstick the remaining operations. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we have worked on resolving the issue. We will provide more information by Saturday, 2019-11-02 11:00 US/Pacific. Diagnosis: Google Cloud Networking - Networking-related Compute API operations stuck pending if submitted during the above time. - The affected operations include: deleting and creating subnets, creating networks. Resubmitting similar requests may also enter a pending state as they are waiting for the previous operation to complete. Google Kubernetes Engine - Cluster operations including creation, update, auto scaling may have failed due to the networking API failures mentioned under Google Compute Engine - New Cluster operations are now succeeding and further updates on recovering from this are underway as part of the mitigation mentioned under Google Cloud Networking. Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking update/create/delete operations pending globally

Description: Approximately 25% of global (and regional) route and subnet deletion operations remain stuck in a pending state. Mitigation work is still underway to unblock pending network operations globally. We expect the majority of mitigations to complete over the next several hours, with the long-tail going into tomorrow. Please note, this will allow newer incoming operations of the same type to eventually process successfully. However, resubmitting similar requests may still get stuck in a running state as they are waiting for previously queued operations to complete. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we have worked on resolving the issue. We will provide more information by Friday, 2019-11-01 20:30 US/Pacific. Diagnosis: Google Cloud Networking - Networking-related Compute API operations stuck pending if submitted during the above time. - The affected operations include: [deleting/creating] backend services, subnets, instance groups, routes and firewall rules. - Resubmitting similar requests may also enter a pending state as they are waiting for the previous operation to complete. - Our product team is working to unblock any pending operation Google Compute Engine - 40-80% of Compute Engine API operations may have become stuck pending if submitted during the above time. - Affected operations include any operation which would need to update Networking on affected projects Google Cloud Filestore - Impacts instance creation/deletion Google Kubernetes Engine - Cluster operations including creation, update, auto scaling may have failed due to the networking API failures mentioned under Google Compute Engine - New Cluster operations are now succeeding and further updates on recovering from this are underway as part of the mitigation mentioned under Google Cloud Networking. Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking update/create/delete operations pending globally

Currently, the backlog of pending operations has been reduced by approximately 70%, and we expect the majority of mitigations to complete over the next several hours, with the long-tail going into tomorrow. Mitigation work is still underway to unblock pending network operations globally. To determine whether you are affected by this incident, you may run the following command [1] to view your project’s pending operations: gcloud compute operations list --filter="status!=DONE". If you see global operations (or regional subnet operations) that are running for a long time (or significantly longer than usual), then you are likely still impacted. The remaining 30% of stuck operations are currently either being processed successfully or marked as failed. This will allow newer incoming operations of the same type to be eventually processed successfully; however, resubmitting similar requests may also get stuck in a running state as they are waiting for the queued operations to complete. If you have an operation that does not appear to be finishing, please wait for it to succeed or be marked as failed before retrying the operation. For context: 40-80% of Cloud Networking operations submitted between 2019-10-31 16:41 US/Pacific and 2019-10-31 23:01 US/Pacific may have been affected. The exact percentage of failures is region dependent. We will provide more information by Friday, 2019-11-01 16:30 US/Pacific. [1] https://cloud.google.com/sdk/gcloud/reference/compute/operations/list Diagnosis: As we become aware of products which were impacted we will update this post to ensure transparency. Google Cloud Networking - Networking-related Compute API operations stuck pending if submitted during the above time. - The affected operations include: [deleting/creating] backend services, subnets, instance groups, routes and firewall rules. - Resubmitting similar requests may also enter a pending state as they are waiting for the previous operation to complete. - Our product team is working to unblock any pending operation. Google Compute Engine - 40-80% of Compute Engine API operations may have become stuck pending if submitted during the above time. - Affected operations include any operation which would need to update Networking on affected projects. Google Cloud DNS - Some DNS updates may be stuck pending from the above time. Google Cloud Filestore - Impacts instance creation/deletion. Cloud Machine Learning - Online prediction jobs using Google Kubernetes Engine may have experienced failures during this time. - The team is no longer seeing issues affecting Cloud Machine Learning and we feel the incident for this product is now resolved. Cloud Composer - Create Environment operations during the affected time may have experienced failures. - Customers should no longer be seeing impact. Google Kubernetes Engine - Cluster operations including creation, update, auto scaling may have failed due to the networking API failures mentioned under Google Compute Engine. - New Cluster operations are now succeeding and further updates on recovering from this are underway as part of the mitigation mentioned under Google Cloud Networking. Google Cloud Memorystore - This issue is believed to have affected less than 1% of projects. - The affected projects should find full resolution once the issue affecting Google Compute Engine is resolved. App Engine Flexible - New deployments experienced elevated failure rates during the affected time. - The team is no longer seeing issues affecting new deployment creation.
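Building on the command referenced above (a sketch only; the output fields are standard Compute Engine operation attributes), sorting pending operations by start time makes long-running stragglers easier to spot:

```
# Sketch: list operations that are not yet DONE, oldest first, with their start times.
gcloud compute operations list \
  --filter="status!=DONE" \
  --format="table(name, operationType, status, insertTime)" \
  --sort-by=insertTime
```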

Last Update: A few months ago

UPDATE: Incident 19021 - We've received a report of issues with Google Cloud Networking.

Our engineers have determined that Google Cloud Networking was impacted by the same underlying issue as the Google Compute Engine (GCE) incident. The start & end times have been updated in this incident to reflect the approximate impact period. Please refer to [https://status.cloud.google.com/incident/compute/19008](https://status.cloud.google.com/incident/compute/19008) for the full impact details of the incident. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 19011 - We are experiencing an issue with Google Kubernetes Engine.

Our engineers have determined that Google Kubernetes Engine (GKE) was impacted by the same underlying issue as the Google Compute Engine (GCE) incident. The start & end times have been updated in this incident to reflect the approximate impact period. Please refer to [https://status.cloud.google.com/incident/compute/19008](https://status.cloud.google.com/incident/compute/19008) for the full impact details of the incident. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking update/create/delete operations pending globally

Description: Mitigation work continues to unblock pending network operations globally. 40-80% of Cloud Networking operations submitted between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31 23:01 US/Pacific may have been affected. The exact amount of failures is region dependent. Our team has been able to reduce the number of pending operations by 60% at this time. We expect mitigation to continue over the next 4 hours and are working to clear the pending operations by largest type impacted. We will provide more information by Friday, 2019-11-01 14:30 US/Pacific. Diagnosis: As we become aware of products which were impacted we will update this post to ensure transparency. Google Compute Engine - Networking-related Compute API operations pending to complete if submitted during the above time. - Resubmitting similar requests may fail as they are waiting for the above operations to complete. - The affected operations include: deleting backend services, subnets, instance groups, routes and firewall rules. - Some operations may still show as pending and are being mitigated at this time. We are currently working to address operations around subnet deletion as our next target group Google Kubernetes Engine - Cluster operations including creation, update, auto scaling may have failed due to the networking API failures mentioned under Google Compute Engine - New Cluster operations are now succeeding and further updates on recovering from this are underway as part of the mitigation mentioned under Google Compute Engine. No further updates will be provided for Google Kubernetes Engine in this post. Google Cloud Memorystore - This issue is believed to have affected less than 1% of projects - The affected projects should find full resolution once the issue affecting Google Compute Engine is resolved. No further updates will be provided for Google Cloud Memorystore App Engine Flexible - New deployments experienced elevated failure rates during the affected time. - The team is no longer seeing issues affecting new deployment creation and we feel the incident for this product is now resolved. No further updates will be provided for App Engine Flexible Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking update/create/delete operations pending globally

Description: Mitigation work is currently underway by our product team to unblock stuck network operations globally. Network operations submitted between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31 23:01 US/Pacific may be affected. New operations are succeeding as expected currently and we are currently working to clear a back log of pending operations in our system. We will provide more information by Friday, 2019-11-01 12:30 US/Pacific. Diagnosis: Customer may have encountered errors across the below products if affected. Google Compute Engine - Networking-related Compute API operations pending to complete if submitted during the above time. - Resubmitting similar requests may fail as they are waiting for the above operations to complete. - The affected operations include: deleting backend services, subnets, instance groups, routes and firewall rules. - Some operations may still show as pending and are being mitigated at this time. We expect this current mitigation work to be completed no later than 2019-11-01 12:30 PDT Google Kubernetes Engine - Cluster operations including creation, update, auto scaling may have failed due to the networking API failures mentioned under Google Compute Engine - New Cluster operations are now succeeding and further updates on recovering from this are underway as part of the mitigation mentioned under Google Compute Engine. No further updates will be provided for Google Kubernetes Engine in this post. Google Cloud Memorystore - This issue is believed to have affected less than 1% of projects - The affected projects should find full resolution once the issue affecting Google Compute Engine is resolved. No further updates will be provided for Google Cloud Memorystore App Engine Flexible - New deployments experienced elevated failure rates during the affected time. - The team is no longer seeing issues affecting new deployment creation and we feel the incident for this product is now resolved. No further updates will be provided for App Engine Flexible Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking update/create/delete operations pending globally

Description: Mitigation work is currently underway by our product team to unblock stuck network operations globally. Network operations submitted between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31 23:01 US/Pacific may be affected. New operations are showing a reduction in failures currently and we are currently working to clear a back log of pending operations in our system. We will provide more information by Friday, 2019-11-01 12:00 US/Pacific. Diagnosis: Customer may have encountered errors across the below products if affected. Google Compute Engine - Networking-related Compute API operations pending to complete if submitted during the above time. - The affected operations include: deleting backend services, subnets, instance groups, routes and firewall rules. - Some operations may still show as pending and are being mitigated at this time. We expect this current mitigation work to be completed no later than 2019-11-01 12:30 PDT Google Kubernetes Engine - Cluster operations including creation, update, auto scaling may have failed due to the networking API failures mentioned under Google Compute Engine - New Cluster operations are now succeeding and further updates on recovering from this are underway as part of the mitigation mentioned under Google Compute Engine. No further updates will be provided for Google Kubernetes Engine in this post. Google Cloud Memorystore - This issue is believed to have affected less than 1% of projects - The affected projects should find full resolution once the issue affecting Google Compute Engine is resolved. No further updates will be provided for Google Cloud Memorystore App Engine Flexible - New deployments experienced elevated failure rates during the affected time. - The team is no longer seeing issues affecting new deployment creation and we feel the incident for this product is now resolved. No further updates will be provided for App Engine Flexible Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking update/create/delete operations pending globally

Description: Mitigation work is currently underway by our product team to unblock stuck network operations globally. Network operations submitted between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31 23:01 US/Pacific may be affected. New operations are showing a reduction in failures currently and we are currently working to clear a back log of pending operations in our system. We will provide more information by Friday, 2019-11-01 12:00 US/Pacific. Diagnosis: Customer may have encountered errors across the below products if affected. Google Compute Engine - Networking-related Compute API operations pending to complete if submitted during the above time. - The affected operations include: deleting backend services, subnets, instance groups, routes and firewall rules. - Some operations may still show as pending and are being mitigated at this time. We expect this current mitigation work to be completed no later than 2019-11-01 12:30 PDT Google Kubernetes Engine - Cluster operations including creation, update, auto scaling may have failed due to the networking API failures mentioned under Google Compute Engine - New Cluster operations are now succeeding and further updates on recovering from this can be found [https://status.cloud.google.com/incident/container-engine/19011](https://status.cloud.google.com/incident/container-engine/19011). No further updates will be provided for Google Kubernetes Engine in this post. Google Cloud Memorystore - This issue is believed to have affected less than 1% of projects - The affected projects should find full resolution once the issue affecting Google Compute Engine is resolved. No further updates will be provided for Google Cloud Memorystore App Engine Flexible - New deployments experienced elevated failure rates during the affected time. - The team is no longer seeing issues affecting new deployment creation and we feel the incident for this product is now resolved. No further updates will be provided for App Engine Flexible Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19011 - We are experiencing an issue with Google Kubernetes Engine. Clusters and nodes creation might fail.

The issue with creating new Google Kubernetes Engine clusters has been resolved for all affected projects as of Friday, 2019-11-01 08:50 US/Pacific. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking update/create/delete operations pending globally

Description: Mitigation work is currently underway by our product team to unblock stuck network operations globally. Network operations submitted between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31 23:01 US/Pacific may be affected. New operations are showing a reduction in failures currently and we are currently working to clear a back log of pending operations in our system. We will provide more information by Friday, 2019-11-01 10:00 US/Pacific. Diagnosis: Customer may be seeing errors across the below products if affected. Google Compute Engine - Networking-related Compute API operations failing to complete if submitted during the above time. - This may include deleting backend services, subnets, instance groups, routes and firewall rules. Google Kubernetes Engine - Cluster operations including creation, update, auto scaling may have failed due to the networking API failures mentioned under Google Compute Engine - New Cluster operations are now succeeding and further updates on restoring this can be found [https://status.cloud.google.com/incident/container-engine/19011](https://status.cloud.google.com/incident/container-engine/19011) Google Cloud Memorystore - Create/Delete events failed during the above time App Engine Flexible - Deployments seeing elevated failure rates Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - Some networking operations are failing globally

Description: Mitigation work is currently underway by our product team to address the ongoing issue with some network operations failing globally at this time. These reports started Thursday, 2019-10-31 16:41 US/Pacific. Operations are currently showing a reduction in failures and we are working to clear a backlog of stuck operations in our system. We will provide more information by Friday, 2019-11-01 09:30 US/Pacific. Diagnosis: Customers may experience errors with the below products if affected. Google Compute Engine * Networking-related Compute API operations failing * This may include deleting backend services, subnets, instance groups, routes and firewall rules and more. Google Kubernetes Engine * Cluster operations including creation, update, auto scaling may fail due to the networking API failures Google Cloud Memorystore * Create/Delete events failing App Engine Flexible * Deployments seeing elevated failure rates Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - The issue with the Google Compute Engine networking control plane is ongoing

Description: Mitigation work is still underway by our engineering team. We will provide more information by Friday, 2019-11-01 08:30 US/Pacific. Diagnosis: Customers may experience errors while creating or deleting backend services, subnets, instance groups, routes and firewall rules. Cloud Armor rules might not be updated. Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - The issue with the Google Compute Engine networking control plane is ongoing

Description: Our engineering team is still investigating the issue. We will provide an update by Friday, 2019-11-01 07:00 US/Pacific with current details. Diagnosis: Customers may experience errors while creating or deleting backend services, subnets, instance groups, routes and firewall rules. Cloud Armor rules might not be updated. Workaround: No workaround is available at the moment

Last Update: A few months ago

UPDATE: Incident 19008 - The issue with the Google Compute Engine networking control plane is ongoing

Description: Our engineering team is still investigating the issue. We will provide an update by Friday, 2019-11-01 06:00 US/Pacific with current details. Diagnosis: Customers may experience errors while creating or deleting backend services, subnets, instance groups and firewall rules. New GKE node creation might fail with the NetworkUnavailable status set to True. Cloud Armor rules might not be updated. Workaround: No workaround is available at the moment

Last Update: A few months ago

RESOLVED: Incident 19020 - We've received a report of an issue with Cloud Networking.

ISSUE SUMMARY

On Tuesday 22 October, 2019, Google Compute Engine experienced 100% packet loss to and from ~20% of instances in us-west1-b for a duration of 2 hours, 31 minutes. Additionally, 20% of Cloud Routers and 6% of Cloud VPN gateways experienced equivalent packet loss in us-west1. Specific service impact is outlined in detail below. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT

On Tuesday 22 October, 2019 from 16:20 to 18:51 US/Pacific, the Google Cloud Networking control plane in us-west1-b experienced failures in programming Google Cloud's virtualized networking stack. This means that new or migrated instances would have been unable to obtain network addresses and routes, making them unavailable. Existing instances should have seen no impact; however, an additional software bug, triggered by the programming failure, caused 100% packet loss to 20% of existing instances in this zone. Impact in the us-west1 region for specific services is outlined below:

## Compute Engine

Google Compute Engine experienced 100% packet loss to 20% of instances in us-west1-b. Additionally, creation of new instances in this zone failed, while existing instances that were live migrated during the incident would have experienced 100% packet loss.

## Cloud VPN

Google Cloud VPN experienced failures creating new or modifying existing gateways in us-west1. Additionally, 6% of existing gateways experienced 100% packet loss.

## Cloud Router

Google Cloud Router experienced failures creating new or modifying existing routes in us-west1. Additionally, 20% of existing cloud routers experienced 100% packet loss.

## Cloud Memorystore

<1% of Google Cloud Memorystore instances in us-west1 were unreachable, and operations to create new instances failed. This affected basic tier instances and standard tier instances with the primary node in the affected zone. None of the affected instances experienced a cache flush, and impacted instances resumed normal operations as soon as the network was restored.

## Kubernetes Engine

Google Kubernetes Engine clusters in us-west1 may have reported as unhealthy due to packet loss to and from the nodes and master, which may have triggered unnecessary node repair operations. ~1% of clusters were affected, of which most were zonal clusters in us-west1-b. Some regional clusters in us-west1 may have been briefly impacted if the elected etcd leader for the master was in us-west1-b, until it was re-elected.

## Cloud Bigtable

Google Cloud Bigtable customers in us-west1-b without high availability replication and routing configured would have experienced a high error rate. High Availability configurations had their traffic routed around the impacted zone, and may have experienced a short period of increased latency.

## Cloud SQL

Google Cloud SQL instances in us-west1 may have been temporarily unavailable. ~1% of instances were affected during the incident.

ROOT CAUSE

Google Cloud Networking consists of a software stack which is made up of two components, a control plane and a data plane. The data plane is where packets are processed and routed based on the configuration set up by the control plane. Each zone has its own control plane service, and each control plane service is sharded such that network programming is spread across multiple shards. Additionally, each shard is made up of several leader-elected [1] processes. During this incident, a failure in the underlying leader election system (Chubby [2]) resulted in components in the control plane losing and gaining leadership in short succession. These frequent leadership changes halted network programming, preventing VM instances from being created or modified. Google’s standard defense-in-depth philosophy means that existing network routes should continue to work normally when programming fails. The impact to existing instances was a result of this defense-in-depth failing: a race condition in the code which handles leadership changes caused programming updates to contain invalid configurations, resulting in packet loss for impacted instances. This particular bug has already been fixed, and a rollout of this fix was coincidentally in progress at the time of the outage.

REMEDIATION AND PREVENTION

Google engineers were alerted to the problem at 16:30 US/Pacific and immediately began investigating. Mitigation efforts began at 17:20, which involved a combination of actions including rate limits, forcing leader election, and redirection of traffic. These efforts gradually reduced the rate of packet loss, which eventually led to a full recovery of the networking control plane by 18:51. In order to increase the reliability of Cloud Networking, we will be taking these immediate steps to prevent a recurrence:

- We will complete the rollout of the fix for the race condition during leadership election which resulted in incorrect configuration being distributed.
- We will harden the components which process that configuration such that they reject obviously invalid configuration.
- We will improve incident response tooling used in this particular failure case to reduce time to recover.

Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.

[1] https://landing.google.com/sre/sre-book/chapters/managing-critical-state/#highly-available-processing-using-leader-election
[2] https://ai.google/research/pubs/pub27897

Last Update: A few months ago

RESOLVED: Incident 19020 - We've received a report of an issue with Cloud Networking.

The issue with Cloud Networking has been resolved for all affected projects as of Tuesday, 2019-10-22 19:19 US/Pacific. We will publish analysis of this incident once we have completed our internal investigation. We thank you for your patience while we've worked on resolving the issue.

Last Update: A few months ago

UPDATE: Incident 19020 - We've received a report of an issue with Cloud Networking.

Summary: We've received a report of issues with multiple Cloud products including Google Compute Engine, Cloud Memorystore, Google Kubernetes Engine, Cloud Bigtable and Google Cloud Storage. Description: We've received a report of issues with multiple Cloud products including Google Compute Engine, Cloud Memorystore, Google Kubernetes Engine, Cloud Bigtable and Google Cloud Storage starting around Tuesday, 2019-10-22 16:47 US/Pacific. Mitigation work has been completed by our Engineering Team and services such as Cloud Memorystore, Google Kubernetes Engine, Cloud Bigtable and Google Cloud Storage have recovered. We will provide another status update by Tuesday, 2019-10-22 23:00 US/Pacific with current details. Details: Cloud Networking - Network programming and packet loss levels have recovered, but not all jobs have recovered. Google Compute Engine - Packet loss behavior is starting to recover. There may be some remaining stuck VMs that are not reachable by SSH but the engineering team is working on fixing this.

Last Update: A few months ago

UPDATE: Incident 19020 - We've received a report of an issue with Cloud Networking.

Summary: We've received a report of issues with multiple Cloud products including Google Compute Engine, Cloud Memorystore, Google Kubernetes Engine, Cloud Bigtable and Google Cloud Storage. Description: We've received a report of issues with multiple Cloud products including Google Compute Engine, Cloud Memorystore, Google Kubernetes Engine, Cloud Bigtable and Google Cloud Storage starting around Tuesday, 2019-10-22 16:47 US/Pacific. Mitigation work has been completed by our Engineering Team and some services are starting to recover. We will provide another status update by Tuesday, 2019-10-22 22:00 US/Pacific with current details. Details: Cloud Networking - Network programming and packet loss levels have recovered, but not all jobs have recovered. Google Compute Engine - Packet loss behavior is starting to recover. There may be some remaining stuck VMs that are not reachable by SSH but the engineering team is working on fixing this. Cloud Memorystore - Redis instances are recovering in us-west1-b. Google Kubernetes Engine - Situation has recovered. Cloud Bigtable - We are still seeing elevated levels of latency and errors in us-west1-b. Our engineering team is still working on recovery. Google Cloud Storage - Services in us-west1-b should be recovering.

Last Update: A few months ago

UPDATE: Incident 19020 - We've received a report of an issue with Cloud Networking.

Summary: We've received a report of issues with multiple Cloud products including Google Compute Engine, Cloud Memorystore, Google Kubernetes Engine, Cloud Bigtable and Google Cloud Storage. Description: We've received a report of issues with multiple Cloud products including Google Compute Engine, Cloud Memorystore, Google Kubernetes Engine, Cloud Bigtable and Google Cloud Storage starting around Tuesday, 2019-10-22 16:47 US/Pacific. Mitigation work is currently underway by our Engineering Team. We will provide another status update by Tuesday, 2019-10-22 19:00 US/Pacific with current details. Diagnosis: Google Compute Engine - Unpredictable behavior due to packet loss such as failure to SSH to VMs in us-west1-b. Cloud Memorystore - New instances cannot be created and existing instances might not be reachable in us-west1-b. Google Kubernetes Engine - Regional clusters in us-west1 are impacted, zonal clusters in us-west1-b are impacted. Cloud Bigtable - Elevated latency and errors in us-west1-b. Google Cloud Storage - Service disruption in us-west1-b.

Last Update: A few months ago

UPDATE: Incident 19020 - We've received a report of an issue with Cloud Networking.

Description: We are experiencing an issue with Cloud Networking in us-west1. Our engineering team continues to investigate the issue. We will provide an update by Tuesday, 2019-10-22 17:41 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Affected customers will experience network programming changes fail to complete in us-west1-b. Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 19007 - We've received a report of issues with multiple Cloud products including Compute Engine Internal DNS, Cloud DNS, Stackdriver Logging and Monitoring, Compute Engine, App Engine, Cloud Run, Cloud Build,...

ISSUE SUMMARY On Tuesday 24 September, 2019, the following Google Cloud Platform services were partially impacted by an overload condition in an internal publish/subscribe messaging system which is a dependency of these products: App Engine, Compute Engine, Kubernetes Engine, Cloud Build, Cloud Composer, Cloud Dataflow, Cloud Dataproc, Cloud Firestore, Cloud Functions, Cloud DNS, Cloud Run, and Stackdriver Logging & Monitoring. Impact was limited to administrative operations for a number of these products, with existing workloads and instances not affected in most cases. We apologize to those customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability. DETAILED DESCRIPTION OF IMPACT On Tuesday 24 September, 2019 from 12:46 to 20:00 US/Pacific, Google Cloud Platform experienced a partial disruption to multiple services with their respective impacts detailed below: ## App Engine Google App Engine (GAE) create, update, and delete admin operations failed globally from 12:57 to 18:21 for a duration of 5 hours and 24 minutes. Affected customers may have seen error messages like “APP_ERROR”. Existing GAE workloads were unaffected. ## Compute Engine Google Compute Engine (GCE) instances failed to start in us-central1-a from 13:11 to 14:32 for a duration of 1 hour and 21 minutes, and GCE Internal DNS in us-central1, us-east1, and us-east4 experienced delays for newly created hostnames to become resolvable. Existing GCE instances and hostnames were unaffected. ## Kubernetes Engine Google Kubernetes Engine (GKE) experienced delayed resource metadata and inaccurate Stackdriver Monitoring for cluster metrics globally. Additionally, cluster creation operations failed in us-central1-a from 3:11 to 14:32 for a duration of 1 hour and 21 minutes due to its dependency on GCE instance creation. Existing GKE clusters were unaffected by the GCE instance creation failures. ## Stackdriver Logging & Monitoring Stackdriver Logging experienced delays of up to two hours for logging events generated globally. Exports were delayed by up to 3 hours and 30 minutes. Some user requests to write logs in us-central1 failed. Some logs-based metric monitoring charts displayed lower counts, and queries to Stackdriver Logging briefly experienced a period of 50% error rates. The impact to Stackdriver Logging & Monitoring took place from 12:54 to 18:45 for a total duration of 5 hours and 51 minutes. ## Cloud Functions Cloud Functions deployments failed globally from 12:57 to 18:21 and experienced peak error rates of 13% in us-east1 and 80% in us-central1 from 19:12 to 19:57 for a combined duration of 6 hours and 15 minutes. Existing Cloud Function deployments were unaffected. ## Cloud Build Cloud Build failed to update build status for GitHub App triggers from 12:54 to 16:00 for a duration of 3 hours and 6 minutes. ## Cloud Composer Cloud Composer environment creations failed globally from 13:25 to 18:05 for a duration of 4 hours and 40 minutes. Existing Cloud Composer clusters were unaffected. ## Cloud Dataflow Cloud Dataflow workers failed to start in us-central1-a from 13:11 to 14:32 for a duration of 1 hour and 21 minutes due to its dependency on Google Compute Engine instance creation. Affected jobs saw error messages like “Startup of the worker pool in zone us-central1-a failed to bring up any of the desired X workers. INTERNAL_ERROR: Internal error. Please try again or contact Google Support. (Code: '-473021768383484163')”. 
All other Cloud Dataflow regions and zones were unaffected. ## Cloud Dataproc Cloud Dataproc cluster creations failed in us-central1-a from 13:11 to 14:32 for a duration of 1 hour and 21 minutes due to its dependency on Google Compute Engine instance creation. All other Cloud Dataproc regions and zones were unaffected. ## Cloud DNS Cloud DNS in us-central1, us-east1, and us-east4 experienced delays for newly created or updated Private DNS records to become resolvable from 12:46 to 19:51 for a duration of 7 hours and 5 minutes. ## Cloud Firestore Cloud Firestore API was unable to be enabled (if not previously enabled) globally from 13:36 to 17:50 for a duration of 4 hours and 14 minutes. ## Cloud Run Cloud Run new deployments failed in the us-central1 region from 12:48 to 16:35 for a duration of 3 hours and 53 minutes. Existing Cloud Run workloads, and deployments in other regions were unaffected. ROOT CAUSE Google runs an internal publish/subscribe messaging system, which many services use to propagate state for control plane operations. That system is built using a replicated, high-availability key-value store, holding information about current lists of publishers, subscribers and topics, which all clients of the system need access to. The outage was triggered when a routine software rollout of the key-value store in a single region restarted one of its tasks. Soon after, a network partition isolated other tasks, transferring load to a small number of replicas of the key-value store. As a defense-in-depth, clients of the key-value store are designed to continue working from existing, cached data when it is unavailable; unfortunately, an issue in a large number of clients caused them to fail and attempt to resynchronize state. The smaller number of key-value store replicas were unable to sustain the load of clients synchronizing state, causing those replicas to fail. The continued failures moved load around the available replicas of the key-value store, resulting in a degraded state of the interconnected components. The failure of the key-value store, combined with the issue in the key-value store client, meant that publishers and subscribers in the impacted region were unable to correctly send and receive messages, causing the documented impact on dependent services. REMEDIATION AND PREVENTION Google engineers were automatically alerted to the incident at 12:56 US/Pacific and immediately began their investigation. As the situation began to show signs of cascading failures, the scope of the incident quickly became apparent and our specialized incident response team joined the investigation at 13:58 to address the problem. The early hours of the investigation were spent organizing, developing, and trialing various mitigation strategies. At 15:59 a potential root cause was identified and a configuration change submitted which increased the client synchronization delay allowed by the system, allowing clients to successfully complete their requests without timing out and reducing the overall load on the system. By 17:24, the change had fully propagated and the degraded components had returned to nominal performance. In order to reduce the risk of recurrence, Google engineers configured the system to limit the number of tasks coordinating publishers and subscribers, which is a driver of load on the key-value store. The initial rollout of the constraint was faulty, and caused a more limited recurrence of problems at 19:00. 
This was quickly spotted and completely mitigated by 20:00, resolving the incident. We would like to apologize for the length and severity of this incident. We have taken immediate steps to prevent recurrence of this incident and improve reliability in the future. In order to reduce the chance of a similar class of errors from occurring we are taking the following actions. We will revise provisioning of the key-value store to ensure that it is sufficiently resourced to handle sudden failover, and fix the issue in the key-value store client so that it continues to work from cached data, as designed, when the key-value store fails. We will also shard the data to reduce the scope of potential impact when the key-value store fails. Furthermore, we will be implementing automatic horizontal scaling of key-value store tasks to enable faster time to mitigation in the future. Finally, we will be improving our communication tooling to more effectively communicate multi-product outages and disruptions. NOTE REGARDING CLOUD STATUS DASHBOARD COMMUNICATION Incident communication was centralized on a single product - in this case Stackdriver - in order to provide a central location for customers to follow for updates. We realize this may have created the incorrect impression that Stackdriver was the root cause. We apologize for the miscommunication and will make changes to ensure that we communicate more clearly in the future. SLA CREDITS If you believe your paid application experienced an SLA violation as a result of this incident, please submit the SLA credit request: https://support.google.com/cloud/contact/cloud_platform_sla A full list of all Google Cloud Platform Service Level Agreements can be found at: https://cloud.google.com/terms/sla/.
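
The root cause above describes clients that are designed to keep serving from cached publisher/subscriber state when the key-value store is unavailable, rather than all resynchronizing at once. The sketch below only illustrates that "fail static" pattern under hypothetical names; it is not Google's internal client code.

```python
# Illustrative sketch of the "continue from cached data" client behaviour
# described in the root cause. All names here are hypothetical.
import random
import time


class FailStaticClient:
    def __init__(self, store, max_backoff=64.0):
        self.store = store      # hypothetical key-value store client
        self.cache = {}         # last successfully fetched state snapshot
        self.max_backoff = max_backoff

    def get_state(self, key):
        try:
            value = self.store.read(key)   # assumed store API
            self.cache[key] = value
            return value
        except Exception:
            # Fail static: serve the cached value instead of immediately
            # retrying and adding load to already-degraded replicas.
            if key in self.cache:
                return self.cache[key]
            raise

    def resync(self, keys):
        # If a full resynchronization is unavoidable, jittered exponential
        # backoff keeps every client from hitting the store at once.
        delay = 1.0
        for key in keys:
            while True:
                try:
                    self.cache[key] = self.store.read(key)
                    break
                except Exception:
                    time.sleep(delay + random.uniform(0, delay))
                    delay = min(delay * 2, self.max_backoff)
```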

Last Update: A few months ago

RESOLVED: Incident 19007 - We've received a report of issues with multiple Cloud products including Compute Engine Internal DNS, Cloud DNS, Stackdriver Logging and Monitoring, Compute Engine, App Engine, Cloud Run, Cloud Build,...

Affected Services: Google Compute Engine, Google Kubernetes Engine, Google App Engine, Cloud DNS, Cloud Run, Cloud Build, Cloud Composer, Cloud Firestore, Cloud Functions, Cloud Dataflow, Cloud Dataproc, Stackdriver Logging. Affected Features: Various resource operations, Environment creation, Log event ingestion. Issue Summary: Google Cloud Platform experienced a disruption to multiple services in us-central1, us-east1, and us-east4, while a few services were affected globally. Impact lasted for 7 hours, 6 minutes. We will publish a complete analysis of this incident once we have completed our internal investigation. (preliminary) Root cause: We rolled out a release of a key/value store component to an internal Google system which triggered a shift in load causing an out of memory (OOM) crash loop on these instances. The control plane was unable to automatically coordinate the increased load by publisher and subscriber clients, which resulted in degraded state of the interconnected components. Mitigation: Risk of recurrence has been mitigated by a configuration change that enabled control plane clients to successfully complete their requests and reduce overall load on the system. A Note on Cloud Status Dashboard Communication: Incident communication was centralized on a single product - in this case Stackdriver - in order to provide a central location for customers to follow for updates. Neither Stackdriver Logging nor Stackdriver Monitoring were the root cause for this incident. We realize this may have caused some confusion about the root cause of the issue. We apologize for the miscommunication and will make changes to ensure that we communicate more clearly in the future.

Last Update: A few months ago

RESOLVED: Incident 19007 - We've received a report of issues with multiple Cloud products including Compute Engine Internal DNS, Cloud DNS, Stackdriver Logging and Monitoring, Compute Engine, App Engine, Cloud Run, Cloud Build,...

The issue with multiple Cloud products has been resolved for all affected projects as of Tuesday, 2019-09-24 21:31 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 19007 - We've received a report of issues with multiple Cloud products including Compute Engine Internal DNS, Cloud DNS, Stackdriver Logging and Monitoring, Compute Engine, App Engine, Cloud Run, Cloud Build,...

Description: We are investigating issues with multiple Cloud products including Compute Engine Internal DNS, Cloud DNS, Stackdriver Logging and Monitoring, Compute Engine, App Engine, Cloud Run, Cloud Build, Cloud Composer, Cloud Firestore, Cloud Dataflow and Cloud Dataproc starting around Tuesday, 2019-09-24 13:00 US/Pacific. We have applied some mitigations that have had a positive impact, and some services have reported full recovery. Full recovery of all services is ongoing as our systems are processing significant backlogs. Services that have recovered include Compute Engine Internal DNS, App Engine, Cloud Run, Cloud Functions, Cloud Firestore, Cloud Composer, Cloud Build, Cloud Dataflow, Google Kubernetes Engine resource metadata and Stackdriver Monitoring metrics for clusters. Cloud DNS has also recovered, but we are still monitoring to ensure the issue does not reoccur. Stackdriver Logging has also recovered for the majority of projects. We will provide another status update by Tuesday, 2019-09-24 22:00 US/Pacific with current details. Diagnosis: Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 19019 - We are investigating elevated errors with Cloud Interconnect in Google Cloud Networking in us-west2.

The Cloud Interconnect issue is believed to be affecting less than 1% of customers and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 19019 - We are investigating elevated errors with Cloud Interconnect in Google Cloud Networking in us-west2.

Description: We are investigating elevated errors with Cloud Interconnect in Google Cloud Networking in us-west2. Our Engineering Team is investigating possible causes. We will provide another status update by Friday, 2019-09-13 13:30 US/Pacific with current details. Diagnosis: Customers peering with us-west2 may experience packet loss. Workaround: None at this time.

Last Update: A few months ago

RESOLVED: Incident 19006 - We are investigating a global issue with log ingestion delays on Stackdriver Logging.

The issue with Stackdriver Logging ingestion has been resolved for all affected users as of Tuesday, 2019-09-10 00:07 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19006 - We are investigating a global issue with log ingestion delays on Stackdriver Logging.

Description: The issue is mitigated. We are currently processing the backlog; the expected time to completion is around 7 hours. We will provide more information by Tuesday, 2019-09-10 06:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 19006 - We are investigating a global issue with log ingestion delays on Stackdriver Logging.

Description: The issue is mitigated. We have begun processing the backlog; the oldest messages in the backlog are about 5 hours old. We will provide more information by Monday, 2019-09-09 23:00 US/Pacific.
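
For projects verifying that delayed entries eventually became visible, one approach is to query Stackdriver Logging for the affected time window. A minimal sketch using the google-cloud-logging Python client follows; the project, log name, and timestamps are placeholders, not values from this incident.

```python
# Sketch: list entries from the delayed window to confirm backlogged logs
# have become visible. Project, log name, and timestamps are placeholders.
from google.cloud import logging

client = logging.Client()
log_filter = (
    'logName="projects/my-project/logs/my-app" '
    'AND timestamp>="2019-09-09T15:00:00Z" '
    'AND timestamp<="2019-09-09T20:00:00Z"'
)
entries = list(client.list_entries(filter_=log_filter, page_size=100))
print(f"{len(entries)} entries visible in the delayed window")
```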

Last Update: A few months ago

UPDATE: Incident 19006 - We are investigating a global issue with log ingestion delays on Stackdriver Logging.

Description: The original mitigation did not have the expected effect. The Engineering Team is working on another mitigation. We will provide more information by Monday, 2019-09-09 22:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 19010 - We've received a report of an issue with Google App Engine.

The Google App Engine issue is believed to have been resolved as of 12:21 US/Pacific for all affected projects. If you have questions or are still impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 19018 - Internal load balancers in us-east1 not updating

The issue with Cloud Networking Internal load balancers in us-east1 not updating has been resolved for all affected projects as of Tuesday, 2019-08-27 15:56 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19018 - Internal load balancers in us-east1 not updating

Description: We are rolling back a potential fix to mitigate this issue. We will provide another status update by Tuesday, 2019-08-27 17:00 US/Pacific with current details. Diagnosis: Affected customers may experience configuration changes not propagating, including health checks for backends. This only affects Internal HTTP(S) Load Balancers. Internal TCP/UDP Load Balancing is not affected. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 19018 - Internal load balancers in us-east1 not updating

Description: Our Engineering Team believes it has identified the root cause and is working to mitigate the issue. We will provide another status update by Tuesday, 2019-08-27 15:54 US/Pacific with current details. Diagnosis: Affected customers may experience configuration changes not propagating, including health checks for backends. This only affects Internal HTTP(S) Load Balancers. Internal TCP/UDP Load Balancing is not affected. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 19018 - Internal load balancers in us-east1 not updating

Description: Our Engineering Team believes it has identified the root cause and is working to mitigate the issue. We will provide another status update by Tuesday, 2019-08-27 15:54 US/Pacific with current details. Diagnosis: Affected customers may experience configuration changes not propagating, including health checks for backends. Workaround: None at this time.

Last Update: A few months ago

UPDATE: Incident 19018 - Internal load balancers in us-east1 not updating

Description: We are investigating errors with Google Cloud Networking, regional internal load balancers in us-east1. Our Engineering Team is investigating possible causes. We will provide another status update by Tuesday, 2019-08-27 15:30 US/Pacific with current details. Diagnosis: Affected customers may experience configuration changes not propagating, including health checks for backends. Workaround: None at this time.
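
While configuration changes were not propagating, one way to see which health checks and backends an internal HTTP(S) load balancer's backend service is actually running with is to read the resource back via the Compute API. The sketch below uses the google-cloud-compute Python client; project, region, and resource names are placeholders, and this is offered as an illustration rather than a documented workaround for the incident.

```python
# Sketch: read back a regional backend service to see which health checks
# and backends are currently configured. All names are placeholders.
from google.cloud import compute_v1

client = compute_v1.RegionBackendServicesClient()
backend_service = client.get(
    project="my-project",
    region="us-east1",
    backend_service="my-internal-backend-service",
)
print("fingerprint:", backend_service.fingerprint)
print("health checks:", list(backend_service.health_checks))
for backend in backend_service.backends:
    print("backend group:", backend.group)
```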

Last Update: A few months ago

UPDATE: Incident 19008 - We are currently experiencing an issue with authentication to Google App Engine sites, the Google Cloud Console, Identity Aware Proxy, and Google OAuth 2.0 endpoints.

The issue with authentication to Google App Engine sites, the Google Cloud Console, Identity Aware Proxy, and Google OAuth 2.0 endpoints has been resolved for all affected customers as of Monday, 2019-08-19 12:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19017 - We are investigating an intermittent networking issue

The issue with Cloud Networking packet loss has been resolved for all affected projects. The issue lasted from Friday, 2019-08-16 00:10 to Friday, 2019-08-16 00:26 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19017 - We are investigating an intermittent networking issue

Description: Mitigation is currently underway by our Engineering Team. We will provide another status update by Friday, 2019-08-16 02:00 US/Pacific with current details. Diagnosis: Affected projects may experience intermittent packet loss.

Last Update: A few months ago

UPDATE: Incident 19017 - We are investigating a networking issue

Description: We are investigating a networking issue. We will provide more information by Friday, 2019-08-16 01:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 19008 - We've received a report of an issue with Google Kubernetes Engine.

After further investigation, we have determined that the impact was minimal and affected a subset of users. We have conducted an internal investigation of this issue and made appropriate improvements to our systems to help prevent or minimize a future recurrence.

Last Update: A few months ago

RESOLVED: Incident 19005 - We've received a report of an issue with Google Cloud Functions.

After further investigation, we have determined that the impact was minimal and affected a subset of users. We have conducted an internal investigation of this issue and made appropriate improvements to our systems to help prevent or minimize a future recurrence.

Last Update: A few months ago

RESOLVED: Incident 19014 - Cloud VPN endpoints in us-east1 and us-east4 experiencing elevated rates of packet loss.

The Cloud Networking issue is believed to be affecting a very small number of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 19014 - Cloud VPN endpoints in us-east1 and us-east4 experiencing elevated rates of packet loss.

We are investigating an issue with elevated packet loss while using Cloud VPN endpoints in us-east1 and us-east4. We will provide more information by Wednesday, 2019-06-26 16:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 19013 - We are investigating an issue with network connectivity in southamerica-east1

The Cloud Networking issue causing increased packet loss into and out of southamerica-east1 has been resolved for all affected users as of Thursday, 2019-06-13 10:46 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19013 - We are investigating an issue with network connectivity in southamerica-east1

We are investigating increased packet loss into and out of southamerica-east1. Impact to the region began at 09:38 US/Pacific. Our Engineering Team believes it has identified the root cause of the packet loss and is working on mitigation. We will provide another status update by Thursday, 2019-06-13 11:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19013 - We are investigating an issue with network connectivity in southamerica-east1

We are investigating increased packet loss into and out of southamerica-east1. We will provide another status update by Thursday, 2019-06-13 10:45 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19012 - We are investigating a Cloud Networking issue in europe-west6

The issue with Cloud Networking in europe-west6 has been resolved for all affected users as of Wednesday, 2019-06-12 08:11 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19012 - We are investigating a Cloud Networking issue in europe-west6

We are experiencing an issue with Cloud Networking in europe-west6 beginning at Wednesday, 2019-06-12 07:45 US/Pacific. Current data indicates traffic to other regions may see approximately 12% packet loss. For everyone who is affected, we apologize for the disruption. We will provide an update by Wednesday, 2019-06-12 09:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19011 - We've Received A Report Of An Issue With Cloud Networking

The Cloud Networking issue is believed to be affecting a very small number of customers and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 19011 - We've Received A Report Of An Issue With Cloud Networking

The Cloud Networking issue is currently affecting instances using the standard network tier in the following zones: europe-west1-b, us-central1-c, europe-west1-d, europe-west1-c, asia-northeast1-b, us-east4-a, us-east4-c, asia-northeast1-a, us-west1-a, us-east4-b, and asia-northeast1-c. We will provide another status update by Wednesday, 2019-06-12 06:01 US/Pacific with current details.
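
Since the impact was scoped to instances on the standard network tier, one quick way to check which tier a given instance's external access config uses is to read it from the Compute API. A sketch with the google-cloud-compute Python client follows; project, zone, and instance names are placeholders.

```python
# Sketch: print the network tier of each external access config on an
# instance. Project, zone, and instance names are placeholders.
from google.cloud import compute_v1

client = compute_v1.InstancesClient()
instance = client.get(project="my-project", zone="us-east4-a", instance="my-vm")
for interface in instance.network_interfaces:
    for access_config in interface.access_configs:
        print(interface.name, access_config.name, access_config.network_tier)
```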

Last Update: A few months ago

UPDATE: Incident 19011 - We've Received A Report Of An Issue With Cloud Networking

The Cloud Networking issue is currently affecting instances in the following zones: europe-west1-b, us-central1-c, europe-west1-d, europe-west1-c, asia-northeast1-b, us-east4-a, us-east4-c, asia-northeast1-a, us-west1-a, us-east4-b, and asia-northeast1-c. We will provide another status update by Wednesday, 2019-06-12 04:56 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19006 - Users trying to add API restrictions to an API key in Cloud Console may see an incomplete list of possible APIs to restrict their keys to.

Users trying to add API restrictions to an API key in Cloud Console may see an incomplete list of possible APIs to restrict their keys to. We are rolling back a configuration change to mitigate this issue; however, we estimate the rollback will not complete until later tonight. We will provide another status update by Thursday, 2019-06-06 20:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19009 - The network congestion issue in eastern USA, affecting Google Cloud, G Suite, and YouTube has been resolved for all affected users as of 4:00pm US/Pacific.

# ISSUE SUMMARY On Sunday 2 June, 2019, Google Cloud projects running services in multiple US regions experienced elevated packet loss as a result of network congestion for a duration of between 3 hours 19 minutes, and 4 hours 25 minutes. The duration and degree of packet loss varied considerably from region to region and is explained in detail below. Other Google Cloud services which depend on Google's US network were also impacted, as were several non-Cloud Google services which could not fully redirect users to unaffected regions. Customers may have experienced increased latency, intermittent errors, and connectivity loss to instances in us-central1, us-east1, us-east4, us-west2, northamerica-northeast1, and southamerica-east1. Google Cloud instances in us-west1, and all European regions and Asian regions, did not experience regional network congestion. Google Cloud Platform services were affected until mitigation completed for each region, including: Google Compute Engine, App Engine, Cloud Endpoints, Cloud Interconnect, Cloud VPN, Cloud Console, Stackdriver Metrics, Cloud Pub/Sub, Bigquery, regional Cloud Spanner instances, and Cloud Storage regional buckets. G Suite services in these regions were also affected. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability. A detailed assessment of impact is at the end of this report. # ROOT CAUSE AND REMEDIATION This was a major outage, both in its scope and duration. As is always the case in such instances, multiple failures combined to amplify the impact. Within any single physical datacenter location, Google's machines are segregated into multiple logical clusters which have their own dedicated cluster management software, providing resilience to failure of any individual cluster manager. Google's network control plane runs under the control of different instances of the same cluster management software; in any single location, again, multiple instances of that cluster management software are used, so that failure of any individual instance has no impact on network capacity. Google's cluster management software plays a significant role in automating datacenter maintenance events, like power infrastructure changes or network augmentation. Google's scale means that maintenance events are globally common, although rare in any single location. Jobs run by the cluster management software are labelled with an indication of how they should behave in the face of such an event: typically jobs are either moved to a machine which is not under maintenance, or stopped and rescheduled after the event. Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage: firstly, network control plane jobs and their supporting infrastructure in the impacted regions were configured to be stopped in the face of a maintenance event. Secondly, the multiple instances of cluster management software running the network control plane were marked as eligible for inclusion in a particular, relatively rare maintenance event type. Thirdly, the software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations. 
The outage progressed as follows: at 11:45 US/Pacific, the previously-mentioned maintenance event started in a single physical location; the automation software created a list of jobs to deschedule in that physical location, which included the logical clusters running network control jobs. Those logical clusters also included network control jobs in other physical locations. The automation then descheduled each in-scope logical cluster, including the network control jobs and their supporting infrastructure in multiple physical locations. Google's resilience strategy relies on the principle of defense in depth. Specifically, despite the network control infrastructure being designed to be highly resilient, the network is designed to 'fail static' and run for a period of time without the control plane being present as an additional line of defense against failure. The network ran normally for a short period - several minutes - after the control plane had been descheduled. After this period, BGP routing between specific impacted physical locations was withdrawn, resulting in the significant reduction in network capacity observed by our services and users, and the inaccessibility of some Google Cloud regions. End-user impact began to be seen in the period 11:47-11:49 US/Pacific. Google engineers were alerted to the failure two minutes after it began, and rapidly engaged the incident management protocols used for the most significant of production incidents. Debugging the problem was significantly hampered by failure of tools competing over use of the now-congested network. The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging. Furthermore, the scope and scale of the outage, and collateral damage to tooling as a result of network congestion, made it initially difficult to precisely identify impact and communicate accurately with customers. As of 13:01 US/Pacific, the incident had been root-caused, and engineers halted the automation software responsible for the maintenance event. We then set about re-enabling the network control plane and its supporting infrastructure. Additional problems once again extended the recovery time: with all instances of the network control plane descheduled in several locations, configuration data had been lost and needed to be rebuilt and redistributed. Doing this during such a significant network configuration event, for multiple locations, proved to be time-consuming. The new configuration began to roll out at 14:03. In parallel with these efforts, multiple teams within Google applied mitigations specific to their services, directing traffic away from the affected regions to allow continued serving from elsewhere. As the network control plane was rescheduled in each location, and the relevant configuration was recreated and distributed, network capacity began to come back online. Recovery of network capacity started at 15:19, and full service was resumed at 16:10 US/Pacific time. The multiple concurrent failures which contributed to the initiation of the outage, and the prolonged duration, are the focus of a significant post-mortem process at Google which is designed to eliminate not just these specific issues, but the entire class of similar problems. 
Full details follow in the Prevention and Follow-Up section. # PREVENTION AND FOLLOW-UP We have immediately halted the datacenter automation software which deschedules jobs in the face of maintenance events. We will re-enable this software only when we have ensured the appropriate safeguards are in place to avoid descheduling of jobs in multiple physical locations concurrently. Further, we will harden Google's cluster management software such that it rejects such requests regardless of origin, providing an additional layer of defense in depth and eliminating other similar classes of failure. Google's network control plane software and supporting infrastructure will be reconfigured such that it handles datacenter maintenance events correctly, by rejecting maintenance requests of the type implicated in this incident. Furthermore, the network control plane in any single location will be modified to persist its configuration so that the configuration does not need to be rebuilt and redistributed in the event of all jobs being descheduled. This will reduce recovery time by an order of magnitude. Finally, Google's network will be updated to continue in 'fail static' mode for a longer period in the event of loss of the control plane, to allow an adequate window for recovery with no user impact. Google's emergency response tooling and procedures will be reviewed, updated and tested to ensure that they are robust to network failures of this kind, including our tooling for communicating with the customer base. Furthermore, we will extend our continuous disaster recovery testing regime to include this and other similarly catastrophic failures. Our post-mortem process will be thorough and broad, and remains at a relatively early stage. Further action items may be identified as this process progresses. # DETAILED DESCRIPTION OF IMPACT ## Compute Engine Compute Engine instances in us-east4, us-west2, northamerica-northeast1 and southamerica-east1 were inaccessible for the duration of the incident, with recovery times as described above. Instance to instance packet loss for traffic on private IPs and internet traffic: * us-east1 up to 33% packet loss from 11:38 to 12:17, up to 8% packet loss from 12:17 to 14:50. * us-central1 spike of 9% packet loss immediately after 11:38 and subsiding by 12:05. * us-west1 initial spikes up to 20% and 8.6% packet loss to us-east1 and us-central1 respectively, falling below 0.1% by 12:55. us-west1 to European regions saw an initial packet loss of up to 1.9%, with packet loss subsiding by 12:05. us-west1 to Asian regions did not see elevated packet loss. Instances accessing Google services via Google Private Access were largely unaffected. Compute Engine admin operations returned an average of 1.2% errors. ## App Engine App Engine applications hosted in us-east4, us-west2, northamerica-northeast1 and southamerica-east1 were unavailable for the duration of the disruption. The us-central region saw a 23.2% drop in requests per second (RPS). Requests that reached App Engine executed normally, while requests that did not returned client timeout errors. ## Cloud Endpoints Requests to Endpoints services during the network incident experienced a spike in error rates up to 4.4% at the start of the incident, decreasing to 0.6% average error rate between 12:50 and 15:40, at 15:40 error rates decreased to less than 0.1%. A separate Endpoints incident was caused by this disruption and its impact extended beyond the resolution time above. 
From Sunday 2 June, 2019 12:00 until Tuesday 4 June, 2019 11:30, 50% of service configuration push workflows failed. For the duration of the Cloud Endpoints disruption, requests to existing Endpoints services continued to serve based on an existing configuration. Requests to new Endpoints services, created after the disruption start time, failed with 500 errors unless the ESP flag service_control_network_fail_open was enabled, which is disabled by default. Since Tuesday 4 June, 2019 11:30, service configuration pushes have been successful, but may take up to one hour to take effect. As a result, requests to new Endpoints services may return 500 errors for up to 1 hour after the configuration push. We expect to return to the expected sub-minute configuration propagation by Friday 7 June 2019. Customers who are running on platforms other than Google App Engine Flex can work around this by setting the ESP flag service_control_network_fail_open to true. For customers whose backend is running on Google App Engine Flex, there is no mitigation for the delayed config pushes available at this time. ## Cloud Interconnect Cloud Interconnect reported packet loss ranging from 10% to 100% in affected regions during this incident. Interconnect Attachments in us-east4, us-west2, northamerica-northeast1 and southamerica-east1 reported packet loss ranging from 50% to 100% from 11:45 to 16:10. As part of this packet loss, some BGP sessions also reported going down. During this time, monitoring statistics were inconsistent where the disruption impacted our monitoring as well as Stackdriver monitoring, noted below. As a result we currently estimate that us-east4, us-west2, northamerica-northeast1 and southamerica-east1 sustained heavy packet loss until recovery at approximately 16:10. Further, Interconnect Attachments located in us-west1, us-east1, and us-central1 but connecting from Interconnects located on the east coast (e.g. New York, Washington DC) saw 10-50% packet loss caused by congestion on Google’s backbone in those geographies during this same time frame. ## Cloud VPN Cloud VPN gateways in us-east4, us-west2, northamerica-northeast1 and southamerica-east1 were unreachable for the duration of the incident. us-central1 VPN endpoints reported 25% packet loss and us-east1 endpoints reported 10% packet loss. VPN gateways in us-east4 recovered at 15:40. VPN gateways in us-west2, northamerica-northeast1 and southamerica-east1 recovered at 16:30. Additional intervention was required in us-west2, northamerica-northeast1 and southamerica-east1 to move the VPN control plane in these regions out of a fail-safe state, designed to protect existing gateways from potentially incorrect changes, caused by the disruption. ## Cloud Console Cloud Console customers may have seen pages load more slowly, partially or not at all. Impact was more severe for customers who were in the eastern US as the congested links were concentrated between central US and eastern US regions for the duration of the disruption. ## Stackdriver Monitoring Stackdriver Monitoring experienced a 5-10% drop in requests per second (RPS) for the duration of the event. Login failures to the Stackdriver Monitoring Frontend averaged 8.4% over the duration of the incident. The frontend was also loading with increased latency and encountering a 3.5% error rate when loading data in UI components. 
## Cloud Pub/Sub Cloud Pub/Sub experienced Publish and Subscribe unavailability in the affected regions averaged over the duration of the incident: * us-east4 publish requests reported 0.3% error rate and subscribe requests reported a 25% error rate. * southamerica-east1 publish requests reported 11% error rate and subscribe requests reported a 36% error rate. * northamerica-northeast1 publish requests reported a 6% error rate and subscribe requests reported a 31% error rate. * us-west2 did not have a statistically significant change in usage. Additional Subscribe unavailability was experienced in other regions on requests for messages stored in the affected Cloud regions. Analysis shows a 27% global drop in successful publish and subscribe requests during the disruption. There were two periods of global unavailability for Cloud Pub/Sub Admin operations (create/delete topic/subscriptions) . First from 11:50 to 12:05 and finally from 16:05 to 16:25. ## BigQuery BigQuery saw an average error rate of 0.7% over the duration of the incident. Impact was greatest at the beginning of the incident, between 11:47 and 12:02 where jobs.insert API calls had an error rate of 27%. Streaming Inserts (tabledata.insertAll API calls) had an average error rate of less than 0.01% over the duration of the incident, peaking to 24% briefly between 11:47 and 12:02. ## Cloud Spanner Cloud Spanner in regions us-east4, us-west2, and northamerica-northeast1 were unavailable during the duration 11:48 to 15:44. We are continuing to investigate reports that multi-region nam3 was affected, as it involves impacted regions. Other regions' availability was not affected. Modest latency increases at the 50th percentile were observed in us-central1 and us-east1 regions for brief periods during the incident window; exact values were dependent on customer workload. Significant latency increases at the 99th percentile were observed: * nam-eur-asia1 had 120 ms of additional latency from 13:50 to 15:20. * nam3 had greater than 1 second of additional latency from 11:50 to 13:10, from 13:10 to 16:50 latency was increased by 100 ms. * nam6 had an additional 320 ms of latency between 11:50 to 13:10, from 13:10 to 16:50 latency was increased by 130 ms. * us-central1 had an additional 80 ms of latency between 11:50 to 13:10, from 13:10 to 16:50 latency was increased by 10 ms. * us-east1 had an additional 2 seconds of latency between 11:50 to 13:10, from 13:10 to 15:50 latency was increased by 250 ms. * us-west1 had an additional 20 ms of latency between 11:50 to 14:10. ## Cloud Storage Cloud Storage average error rates for bucket locations during the incident are as follows. This data is the best available approximation of the error rate available at the time of publishing: * us-west2 96.2% * southamerica-east1 79.3% * us-east4 62.4% * northamerica-northeast1 43.4% * us 3.5% * us-east1 1.7% * us-west1 1.2% * us-central1 0.7% ## G Suite The impact on G Suite users was different from and generally lower than the impact on Google Cloud users due to differences in architecture and provisioning of these services. Please see the G Suite Status Dashboard (https://www.google.com/appsstatus) for details on affected G Suite services. 
# SLA CREDITS If you believe your paid application experienced an SLA violation as a result of this incident, please populate the SLA credit request: https://support.google.com/cloud/contact/cloud_platform_sla A full list of all Google Cloud Platform Service Level Agreements can be found at https://cloud.google.com/terms/sla/. For G Suite, please request an SLA credit through one of the Support channels: https://support.google.com/a/answer/104721 G Suite Service Level Agreement can be found at https://gsuite.google.com/intl/en/terms/sla.html
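
For the elevated regional error rates listed above (for example on Cloud Storage), a generic client-side measure is to retry reads with exponential backoff. The report itself does not prescribe this; the sketch below simply illustrates the idea with placeholder bucket and object names, assuming the google-cloud-storage Python client.

```python
# Sketch: retry a Cloud Storage read with exponential backoff on transient
# errors. A generic resilience measure, not an incident-specific workaround;
# bucket and object names are placeholders.
from google.api_core.retry import Retry
from google.cloud import storage

client = storage.Client()

@Retry(initial=1.0, maximum=60.0, multiplier=2.0)
def fetch(bucket_name, object_name):
    # Retried with exponential backoff on transient API errors.
    return client.bucket(bucket_name).blob(object_name).download_as_bytes()

data = fetch("my-bucket", "path/to/object")
print(f"downloaded {len(data)} bytes")
```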

Last Update: A few months ago

RESOLVED: Incident 19010 - Cloud network programming delays in us-east1 and us-east4 and Load balancer health check failures in us-east4-a.

The issue with Cloud network programming delays and connectivity issues in us-east1 and us-east4, and load balancer health check failures in us-east4-a, has been resolved for all affected users as of Tuesday, 2019-06-04 15:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19010 - Cloud network programming delays in us-east1 and us-east4 and Load balancer health check failures in us-east4-a.

Our Engineering Team believes they have identified the root cause of the issue and is working to mitigate. Current data indicate(s) that GCE VM creation, live instance migration and any changes to network programming might be delayed. Google Cloud Load Balancer and Internal Load Balancer backends may fail health checking in us-east4-a. Newly created VMs, Cloud VPN and Router in the affected regions may experience connectivity issues. We will provide another status update by Tuesday, 2019-06-04 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19010 - Cloud network programming delays in us-east1 and us-east4 and Load balancer health check failures in us-east4-a.

Cloud network programming delays in us-east1 and us-east4 and Load balancer health check failures in us-east4-a.

Last Update: A few months ago

UPDATE: Incident 19010 - Cloud network programming delays in us-east1 and us-east4 and Load balancer health check failures in us-east4-a.

We are experiencing an issue with Cloud network programming delays in us-east1 and us-east4 and Load balancer health check failures in us-east4-a beginning at Tuesday, 2019-06-04 13:26 US/Pacific. Current data indicate(s) that GCE VM creation, live instance migration and any changes to network programming might be delayed. Google Cloud Load Balancer and Internal Load Balancer backends may fail health checking in us-east4-a. Newly created VMs in the affected regions may experience connectivity issues. For everyone who is affected, we apologize for the disruption. We will provide an update by Tuesday, 2019-06-04 15:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19009 - The network congestion issue in eastern USA, affecting Google Cloud, G Suite, and YouTube has been resolved for all affected users as of 4:00pm US/Pacific.

Additional information on this service disruption has been published in the Google Cloud Blog: https://cloud.google.com/blog/topics/inside-google-cloud/an-update-on-sundays-service-disruption. A formal incident report is still forthcoming.

Last Update: A few months ago

RESOLVED: Incident 19009 - The network congestion issue in eastern USA, affecting Google Cloud, G Suite, and YouTube has been resolved for all affected users as of 4:00pm US/Pacific.

The network congestion issue in eastern USA, affecting Google Cloud, G Suite, and YouTube has been resolved for all affected users as of 4:00pm US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a detailed report of this incident once we have completed our internal investigation. This detailed report will contain information regarding SLA credits.

Last Update: A few months ago

UPDATE: Incident 19009 - The network congestion issue affecting Google Cloud, G Suite, and YouTube is resolved for the vast majority of users, and we expect a full resolution in the near future.

The network congestion issue affecting Google Cloud, G Suite, and YouTube is resolved for the vast majority of users, and we expect a full resolution in the near future. We will provide another status update by Sunday, 2019-06-02 17:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19009 - We continue to experience high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, G Suite and YouTube. Users may see slow performance or intermittent errors....

We continue to experience high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, G Suite and YouTube. Users may see slow performance or intermittent errors. Our engineering teams have completed the first phase of their mitigation work and are currently implementing the second phase, after which we expect to return to normal service. We will provide an update by Sunday, 2019-06-02 16:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 19009 - We are experiencing high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, G Suite and YouTube. Users may see slow performance or intermittent errors. We be...

We continue to experience high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, G Suite and YouTube. Users may see slow performance or intermittent errors. Our engineering teams have completed the first phase of their mitigation work and are currently implementing the second phase, after which we expect to return to normal service. We will provide an update at 16:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 19008 - Cloud Console Mobile App users seeing elevated error rates.

We continue to experience high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, G Suite and YouTube. Users may see slow performance or intermittent errors. Our engineering teams have completed the first phase of their mitigation work and are currently implementing the second phase, after which we expect to return to normal service. We will provide an update at 16:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 19009 - We are experiencing high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, G Suite and YouTube. Users may see slow performance or intermittent errors. We be...

We are experiencing high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, G Suite and YouTube. Users may see slow performance or intermittent errors. We believe we have identified the root cause of the congestion and expect to return to normal service shortly. We will provide more information by Sunday, 2019-06-02 15:00 US/Pacific

Last Update: A few months ago

UPDATE: Incident 19009 - We are currently encountering a Cloud Networking issue throughout Google Cloud Platform; this is affecting all products.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Sunday, 2019-06-02 14:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19009 - We are currently encountering a Cloud Networking issue throughout Google Cloud Platform; this is affecting all products.

We are investigating an issue with Google Cloud Networking. We will provide more information by Sunday, 2019-06-02 13:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 19003 - Google BigQuery users experiencing elevated latency and error rates in US multi-region.

ISSUE SUMMARY On Friday, May 17 2019, 83% of Google BigQuery insert jobs in the US multi-region failed for a duration of 27 minutes. Query jobs experienced an average error rate of 16.7% for a duration of 2 hours. BigQuery users in the US multi-region also observed elevated latency for a duration of 4 hours and 40 minutes. To our BigQuery customers whose business analytics were impacted during this outage, we sincerely apologize – this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. DETAILED DESCRIPTION OF IMPACT On Friday May 17 2019, from 08:30 to 08:57 US/Pacific, 83% of Google BigQuery insert jobs failed for 27 minutes in the US multi-region. From 07:30 to 09:30 US/Pacific, query jobs in US multi-region returned an average error rate of 16.7%. Other jobs such as list, cancel, get, and getQueryResults in the US multi-region were also affected for 2 hours along with query jobs. Google BigQuery users observed elevated latencies for job completion from 07:30 to 12:10 US/Pacific. BigQuery jobs in regions outside of the US remained unaffected. ROOT CAUSE The incident was triggered by a sudden increase in queries in US multi-region leading to quota exhaustion in the storage system serving incoming requests. Detecting the sudden increase, BigQuery initiated its auto-defense mechanism and redirected user requests to a different location. The high load of requests triggered an issue in the scheduling system, causing delays in scheduling incoming queries. These delays resulted in errors for query, insert, list, cancel, get and getQueryResults BigQuery jobs and overall latency experienced by users. As a result of this high number of requests at 08:30 US/Pacific, the scheduling system’s overload protection mechanism began rejecting further incoming requests, causing insert job failures for 27 minutes. REMEDIATION AND PREVENTION BigQuery’s defense mechanism began redirection at 07:50 US/Pacific. Google Engineers were automatically alerted at 07:54 US/Pacific and began investigation. The issue with the scheduler system began at 08:00 and our engineers were alerted again at 08:10. At 08:43, they restarted the scheduling system, which mitigated the insert job failures by 08:57. Errors seen for insert, query, cancel, list, get and getQueryResults jobs were mitigated by 09:30 when queries were redirected to different locations. Google engineers then successfully blocked the source of the sudden incoming queries, which helped reduce overall latency. The issue was fully resolved at 12:10 US/Pacific when all active and pending queries completed running. We will resolve the issue that caused the scheduling system to delay scheduling of incoming queries. Although the overload protection mechanism prevented the incident from spreading globally, it did cause the failures for insert jobs. We will be improving this mechanism by lowering the deadline for synchronous queries, which will help prevent queries from piling up and overloading the scheduling system. To prevent future recurrence of the issue we will also implement changes to improve BigQuery’s quota exhaustion behaviour to prevent the storage system from taking on more load than it can handle. To reduce the duration of similar incidents, we will implement tools to quickly remediate backlogged queries.
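
Insert and query failures of the kind described above are often handled on the client side with retries and backoff. A minimal sketch using the google-cloud-bigquery Python client follows; the table reference is a placeholder, and this is a generic pattern rather than a remediation mentioned in the report.

```python
# Sketch: run a query with an explicit retry policy so transient scheduling
# errors are retried with backoff. The table reference is a placeholder.
from google.api_core.retry import Retry
from google.cloud import bigquery

client = bigquery.Client()
retry = Retry(initial=1.0, maximum=32.0, multiplier=2.0)

job = client.query(
    "SELECT COUNT(*) AS n FROM `my-project.my_dataset.my_table`",
    retry=retry,
)
for row in job.result():
    print(row.n)
```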

Last Update: A few months ago

RESOLVED: Incident 19003 - Google BigQuery users experiencing elevated latency and error rates in US multi-region.

The issue with Google BigQuery users experiencing latency and high error rates in US multi region has been resolved for all affected users as of Friday, 2019-05-17 13:18 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 19003 - Google BigQuery users experiencing elevated latency and error rates in US multi-region.

Mitigation is underway and the rate of errors is decreasing. We will provide another status update by Friday, 2019-05-17 13:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19003 - Errors showing up in various areas in the Cloud Console

# ISSUE SUMMARY On Thursday 2 May 2019, Google Cloud Console experienced a 40% error rate for all pageviews over a duration of 1 hour and 53 minutes. To all customers affected by this Cloud Console service degradation, we apologize. We are taking immediate steps to improve the platform’s performance and availability. # DETAILED DESCRIPTION OF IMPACT On Thursday 2 May 2019 from 07:10 to 09:03 US/Pacific the Google Cloud Console served 40% of all pageviews with a timeout error. Affected console sections include Compute Engine, Stackdriver, Kubernetes Engine, Cloud Storage, Firebase, App Engine, APIs, IAM, Cloud SQL, Dataflow, BigQuery and Billing. # ROOT CAUSE The Google Cloud Console relies on many internal services to properly render individual user interface pages. The internal billing service is one of them, and is required to retrieve accurate state data for projects and accounts. At 07:09 US/Pacific, a service unrelated to the Cloud Console began to send a large amount of traffic to the internal billing service. The additional load caused time-out and failure of individual requests including those from Google Cloud Console. This led to the Cloud Console serving timeout errors to customers when the underlying requests to the billing service failed. # REMEDIATION AND PREVENTION Cloud Billing engineers were automatically alerted to the issue at 07:15 US/Pacific and Cloud Console engineers were alerted at 07:21. Both teams worked together to investigate the issue and once the root cause was identified, pursued two mitigation strategies. First, we increased the resources for the internal billing service in an attempt to handle the additional load. In parallel, we worked to identify the source of the extraneous traffic and then stop it from reaching the service. Once the traffic source was identified, mitigation was put in place and traffic to the internal billing service began to decrease at 08:40. The service fully recovered at 09:03. In order to reduce the chance of recurrence we are taking the following actions. We will implement improved caching strategies in the Cloud Console to reduce unnecessary load and reliance on the internal billing service. The load shedding response of the billing service will be improved to better handle sudden spikes in load and to allow for quicker recovery should it be needed. Additionally, we will improve monitoring for the internal billing service to more precisely identify which part of the system is running into limits. Finally, we are reviewing dependencies in the serving path for all pages in the Cloud Console to ensure that necessary internal requests are handled gracefully in the event of failure.
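
The remediation above mentions improved caching so that a failing downstream dependency degrades a page rather than breaking it entirely. The sketch below illustrates that general idea with hypothetical names; it is not Cloud Console code.

```python
# Conceptual sketch: serve a recent cached value when the downstream service
# is failing, instead of failing the whole page view. Names are hypothetical.
import time


class CachedLookup:
    def __init__(self, fetch_fn, ttl_seconds=60.0):
        self.fetch_fn = fetch_fn        # call into the downstream service
        self.ttl = ttl_seconds
        self._entries = {}              # key -> (value, fetched_at)

    def get(self, key):
        now = time.time()
        cached = self._entries.get(key)
        if cached and now - cached[1] < self.ttl:
            return cached[0]            # fresh enough: skip the downstream call
        try:
            value = self.fetch_fn(key)
            self._entries[key] = (value, now)
            return value
        except Exception:
            if cached:
                return cached[0]        # degrade gracefully: serve stale data
            raise                       # nothing cached: surface the error
```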

Last Update: A few months ago

RESOLVED: Incident 19003 - We are experiencing delays in showing logs.

Previously, this incident was reported as a service outage of Stackdriver Logging, which was inaccurate. Only logs generated between 2019-04-18 16:45 and 2019-04-19 00:15 US/Pacific were delayed by the incident, and they became available once the incident was resolved. Upon evaluation of the incident's scope and severity, its severity has been adjusted from Outage to Disruption.

Last Update: A few months ago

RESOLVED: Incident 19008 - Cloud Console Mobile App users seeing elevated error rates.

The Google App Engine issue is believed to be affecting a very small number of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 19008 - Cloud Console Mobile App users seeing elevated error rates.

Cloud Console Mobile App users seeing elevated error rates.

Last Update: A few months ago

UPDATE: Incident 19008 - Cloud Console Mobile App users seeing elevated error rates.

We are experiencing an issue with Cloud Console Mobile App users experiencing elevated error rates beginning Wednesday, 2019-03-20 10:45 US/Pacific. Early investigation indicate(s) that affected users are failing to load the home dashboard. Some portions of the app's navigation are also impacted if users decide to navigate away from the home dashboard. For everyone who is affected, we apologize for the disruption. We will provide an update by Wednesday, 2019-03-20 12:40 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

ISSUE SUMMARY

On Tuesday 12 March 2019, Google's internal blob storage service experienced a service disruption for a duration of 4 hours and 10 minutes. We apologize to customers whose service or application was impacted by this incident. We know that our customers depend on Google Cloud Platform services and we are taking immediate steps to improve our availability and prevent outages of this type from recurring.

DETAILED DESCRIPTION OF IMPACT

On Tuesday 12 March 2019 from 18:40 to 22:50 PDT, Google's internal blob (large data object) storage service experienced elevated error rates, averaging 20% with a short peak of 31% during the incident. User-visible Google services including Gmail, Photos, and Google Drive, which make use of the blob storage service, also saw elevated error rates, although (as was the case with GCS) the user impact was greatly reduced by caching and redundancy built into those services. There will be a separate incident report for non-GCP services affected by this incident. The Google Cloud Platform services that experienced the most significant customer impact were the following:

- Google Cloud Storage experienced elevated long tail latency and an average error rate of 4.8%. All bucket locations and storage classes were impacted. GCP services that depend on Cloud Storage were also impacted.
- Stackdriver Monitoring experienced up to 5% errors retrieving historical time series data. Recent time series data was available. Alerting was not impacted.
- App Engine's Blobstore API experienced elevated latency and an error rate that peaked at 21% for fetching blob data. App Engine deployments experienced elevated errors that peaked at 90%. Serving of static files from App Engine also experienced elevated errors.

ROOT CAUSE

On Monday 11 March 2019, Google SREs were alerted to a significant increase in storage resources for metadata used by the internal blob service. On Tuesday 12 March, to reduce resource usage, SREs made a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data. The increased load eventually led to a cascading failure.

REMEDIATION AND PREVENTION

SREs were alerted to the service disruption at 18:56 PDT and immediately stopped the job that was making configuration changes. In order to recover from the cascading failure, SREs manually reduced traffic levels to the blob service to allow tasks to start up without crashing due to high load. In order to prevent service disruptions of this type, we will be improving the isolation between regions of the storage service so that failures are less likely to have global impact. We will be improving our ability to more quickly provision resources in order to recover from a cascading failure triggered by high load. We will implement software measures to prevent configuration changes that overload key parts of the system. We will improve the load shedding behavior of the metadata storage system so that it degrades gracefully under overload.

Last Update: A few months ago

RESOLVED: Incident 19002 - Elevated error rate with Google Cloud Storage.

ISSUE SUMMARY

On Tuesday 12 March 2019, Google's internal blob storage service experienced a service disruption for a duration of 4 hours and 10 minutes. We apologize to customers whose service or application was impacted by this incident. We know that our customers depend on Google Cloud Platform services and we are taking immediate steps to improve our availability and prevent outages of this type from recurring.

DETAILED DESCRIPTION OF IMPACT

On Tuesday 12 March 2019 from 18:40 to 22:50 PDT, Google's internal blob (large data object) storage service experienced elevated error rates, averaging 20% with a short peak of 31% during the incident. User-visible Google services including Gmail, Photos, and Google Drive, which make use of the blob storage service, also saw elevated error rates, although (as was the case with GCS) the user impact was greatly reduced by caching and redundancy built into those services. There will be a separate incident report for non-GCP services affected by this incident. The Google Cloud Platform services that experienced the most significant customer impact were the following:

- Google Cloud Storage experienced elevated long tail latency and an average error rate of 4.8%. All bucket locations and storage classes were impacted. GCP services that depend on Cloud Storage were also impacted.
- Stackdriver Monitoring experienced up to 5% errors retrieving historical time series data. Recent time series data was available. Alerting was not impacted.
- App Engine's Blobstore API experienced elevated latency and an error rate that peaked at 21% for fetching blob data. App Engine deployments experienced elevated errors that peaked at 90%. Serving of static files from App Engine also experienced elevated errors.

ROOT CAUSE

On Monday 11 March 2019, Google SREs were alerted to a significant increase in storage resources for metadata used by the internal blob service. On Tuesday 12 March, to reduce resource usage, SREs made a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data. The increased load eventually led to a cascading failure.

REMEDIATION AND PREVENTION

SREs were alerted to the service disruption at 18:56 PDT and immediately stopped the job that was making configuration changes. In order to recover from the cascading failure, SREs manually reduced traffic levels to the blob service to allow tasks to start up without crashing due to high load. In order to prevent service disruptions of this type, we will be improving the isolation between regions of the storage service so that failures are less likely to have global impact. We will be improving our ability to more quickly provision resources in order to recover from a cascading failure triggered by high load. We will implement software measures to prevent configuration changes that overload key parts of the system. We will improve the load shedding behavior of the metadata storage system so that it degrades gracefully under overload.

Last Update: A few months ago

RESOLVED: Incident 19002 - Elevated error rate with Google Cloud Storage.

We have performed a preliminary analysis of the impact of this issue and confirmed that the error rate for Cloud Storage remained below 6% during the incident.

Last Update: A few months ago

RESOLVED: Incident 19002 - Elevated error rate with Google Cloud Storage.

The issue with Google Cloud Storage has been resolved for all affected projects as of Tuesday, 2019-03-12 23:18 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

The issue with App Engine Blobstore API has been resolved for all affected projects as of Tuesday, 2019-03-12 23:27 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

The issue with App Engine Deployment should be resolved as of Tuesday, 2019-03-12 23:11 US/Pacific. The issue with the underlying storage of the Blobstore API should be resolved for the majority of projects, and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2019-03-12 23:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19002 - Elevated error rate with Google Cloud Storage.

The issue with Cloud Storage should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2019-03-12 23:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

The underlying storage infrastructure is gradually recovering. We will provide another status update by Tuesday, 2019-03-12 23:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19002 - Elevated error rate with Google Cloud Storage.

The underlying storage infrastructure of Cloud Storage is gradually recovering. We will provide another status update by Tuesday, 2019-03-12 23:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

We still have an issue with App Engine Blobstore API and Version Deployment. Our Engineering team understands the root cause and is working to implement the solution. We will provide another status update by Tuesday, 2019-03-12 22:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19002 - Elevated error rate with Google Cloud Storage.

We still have an issue with Google Cloud Storage. Our Engineering team understands the root cause and is working to implement the solution. We will provide another status update by Tuesday, 2019-03-12 22:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

The issue with App Engine Version Deployment should be resolved for some runtimes, including nodejs, python37, php72, go111, but we are still seeing the issue with other runtimes. We will provide another status update by Tuesday, 2019-03-12 22:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19002 - Elevated error rate with Google Cloud Storage

We are still working to address the root cause of the issue. We will provide another status update by Tuesday, 2019-03-12 22:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

We are still working to address the root cause of the issue. We will provide another status update by Tuesday, 2019-03-12 22:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

Mitigation work with the underlying storage infrastructure is still underway by our Engineering Team. We will provide another status update by Tuesday, 2019-03-12 21:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

We are still working on the issue with Google App Engine Blobstore API and App Engine Version Deployment. Our Engineering Team believes they have identified the potential root causes of the errors and is working to mitigate. We will provide another status update by Tuesday, 2019-03-12 21:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Tuesday, 2019-03-12 20:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API

We are still seeing the increased error rate with Google App Engine Blobstore API. Our Engineering Team is investigating possible causes. We will provide another status update by Tuesday, 2019-03-12 20:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19007 - Elevated error rate with Google App Engine Blobstore API

We are investigating an issue with Google App Engine. We will provide more information by Tuesday, 2019-03-12 20:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 19005 - Google Cloud Networking issue with Cloud Routers in us-east4

# ISSUE SUMMARY

On Wednesday 6 March 2019, Google Cloud Router and Cloud Interconnect experienced a service disruption in the us-east4 region for a duration of 8 hours and 34 minutes. Cloud VPN configurations with dynamic routes via Cloud Router were impacted during this time. We apologize to our customers who were impacted by this outage.

# DETAILED DESCRIPTION OF IMPACT

On Wednesday 6 March 2019 from 20:17 to Thursday 7 March 04:51 US/Pacific, Cloud Router and Cloud Interconnect experienced a service disruption in us-east4. Customers utilizing us-east4 were unable to advertise routes to their Google Compute Engine (GCE) instances or learn routes from GCE. Cloud VPN traffic with dynamic routes over Cloud Router and Cloud Interconnect in us-east4 was impacted by this service disruption. Cloud VPN traffic over pre-configured static routes was unaffected and continued to function without disruption during this time.

# ROOT CAUSE

The Cloud Router control plane service assigns Cloud Router tasks to individual customers and creates routes between those tasks and customer VPCs. Individual Cloud Router tasks establish external BGP sessions and propagate routes to and from the control plane service. A disruption occurred during the rollout of a new version of the control plane service in us-east4. This required the control plane to restart from a “cold” state, requiring it to validate all assignments of the Cloud Router tasks. The control plane service did not successfully initialize and was unable to assign individual Cloud Router tasks in order to propagate routes between those tasks and customer VPCs. Cloud Router tasks became temporarily disassociated from customers and BGP sessions were terminated. As a result, Cloud VPN and Cloud Interconnect configurations that were dependent on Cloud Router in us-east4 were unavailable during this time.

# REMEDIATION AND PREVENTION

Google engineers were automatically alerted at 20:30 PST on 6 March 2019 and immediately began an investigation. A fix for the control plane service was tested, integrated, and rolled out on 7 March 2019 at 04:33 US/Pacific. The control plane service fully recovered by 05:16 US/Pacific. We are taking immediate steps to prevent recurrence. The issue that prevented the control plane from restarting has been resolved. In order to ensure faster incident detection, we are improving control plane service testing, the instrumentation of Cloud Router tasks, and the control plane service instrumentation.

Last Update: A few months ago

RESOLVED: Incident 19001 - We are experiencing an issue with increased system lag in some Google Cloud Dataflow jobs.

The issue with Google Cloud Dataflow jobs experiencing system lag has been resolved for all affected projects as of Tuesday, 2019-03-12 04:49 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19001 - We are experiencing an issue with increased system lag in some Google Cloud Dataflow jobs.

The issue with Google Cloud Dataflow jobs experiencing system lag should be resolved for the majority of projects. We are still monitoring the system to confirm that the issue has been completely mitigated. We will provide another status update by Tuesday, 2019-03-12 06:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19001 - We are experiencing an issue with increased system lag in some Google Cloud Dataflow jobs.

The issue with Google Cloud Dataflow jobs experiencing system lag should be resolved for the vast majority of projects and we expect it to be fully resolved in the near future. We will provide another status update by Tuesday, 2019-03-12 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19001 - We are experiencing an issue with increased system lag in some Google Cloud Dataflow jobs.

The issue with Google Cloud Dataflow jobs experiencing system lag should be resolved for the majority of projects. Our mitigation efforts are effective and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2019-03-12 03:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19001 - We are experiencing an issue with increased system lag in some Google Cloud Dataflow jobs.

The issue with Google Cloud Dataflow jobs experiencing system lag should be resolved for the majority of projects. Our mitigation efforts are effective and we expect a full resolution in the near future. We will provide another status update by Monday, 2019-03-11 23:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19001 - We've received a report of an issue with Google Cloud Console.

The issue with Google Cloud Console has been resolved for all affected projects as of Monday, 2019-03-11 16:27 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 19001 - We've received a report of an issue with Google Cloud Console.

The issue with Google Cloud Console should be resolved for the majority of projects as of 15:41 US/Pacific and we expect a full resolution in the near future. We will provide another status update by Monday, 2019-03-11 17:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19005 - Google Cloud Networking issue with Cloud Routers in us-east4

The issue with Google Cloud Routers in us-east4 has been resolved for all affected projects as of Thursday, 2019-03-07 7:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 19005 - Google Cloud Networking issue with Cloud Routers in us-east4

Our Engineering Team believes they have identified the root cause of the errors and applied a mitigation. The issue with Cloud Routers in us-east4 should be resolved for the majority of our customers as of Thursday, 2019-03-07 06:08 US/Pacific, and we expect a full resolution in the near future. We will provide another status update by Thursday, 2019-03-07 09:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19005 - Google Cloud Networking issue with Cloud Routers in us-east4

Our Engineering Team believes they have identified the root cause of the errors and applied a mitigation. The issue with Cloud Routers in us-east4 should be resolved for the majority of our customers as of Thursday, 2019-03-07 05:00 US/Pacific, and we expect a full resolution in the near future. We will provide another status update by Thursday, 2019-03-07 06:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19005 - Google Cloud Networking issue with Cloud Routers in us-east4

We are experiencing an issue with Google Cloud Networking beginning at Wednesday, 2019-03-06 21:15 US/Pacific. Current investigation indicates that approximately 100% of Cloud Routers in the us-east4 region are affected by this issue. Users will experience BGP sessions down on all of their Cloud Router-enabled VPN tunnels and Cloud Interconnect VLAN attachments in the us-east4 region. Furthermore, us-east4 subnets might not be redistributed to other regions as part of VPC global routing mode, making this region unreachable over Interconnect. As a workaround, customers can set up a Cloud VPN without Cloud Router between us-east4 and their on-premises network. The Cloud Console might time out when retrieving Cloud Router status information; please use gcloud instead (see the example commands below). Other regions are not affected. The engineering team is investigating the issue and we will provide another status update by Thursday, 2019-03-07 05:30 US/Pacific with current details.
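
For reference, the workaround above can be applied from the command line. The following is a minimal, illustrative sketch only: the router, network, gateway, tunnel, and address names, the peer IP, the shared secret, and the CIDR ranges are hypothetical placeholders, not values tied to this incident.

    # Query Cloud Router status via gcloud instead of the Cloud Console
    # ("my-router" is a placeholder name).
    gcloud compute routers get-status my-router --region=us-east4

    # Sketch of a Cloud VPN tunnel that uses static routes instead of
    # Cloud Router (classic VPN). All names and addresses are examples.
    gcloud compute addresses create vpn-static-ip --region=us-east4
    VPN_IP=$(gcloud compute addresses describe vpn-static-ip \
        --region=us-east4 --format='value(address)')

    gcloud compute target-vpn-gateways create vpn-gw-static \
        --network=my-vpc --region=us-east4

    # Forwarding rules required by classic VPN (ESP, UDP 500, UDP 4500).
    gcloud compute forwarding-rules create vpn-gw-esp --region=us-east4 \
        --ip-protocol=ESP --address=$VPN_IP --target-vpn-gateway=vpn-gw-static
    gcloud compute forwarding-rules create vpn-gw-udp500 --region=us-east4 \
        --ip-protocol=UDP --ports=500 --address=$VPN_IP \
        --target-vpn-gateway=vpn-gw-static
    gcloud compute forwarding-rules create vpn-gw-udp4500 --region=us-east4 \
        --ip-protocol=UDP --ports=4500 --address=$VPN_IP \
        --target-vpn-gateway=vpn-gw-static

    # Tunnel with explicit traffic selectors (no BGP / Cloud Router).
    gcloud compute vpn-tunnels create tunnel-static --region=us-east4 \
        --target-vpn-gateway=vpn-gw-static \
        --peer-address=203.0.113.10 \
        --shared-secret=[SHARED_SECRET] \
        --ike-version=2 \
        --local-traffic-selector=0.0.0.0/0 \
        --remote-traffic-selector=0.0.0.0/0

    # Static route sending on-premises traffic through the tunnel.
    gcloud compute routes create route-to-onprem --network=my-vpc \
        --destination-range=192.168.0.0/16 \
        --next-hop-vpn-tunnel=tunnel-static \
        --next-hop-vpn-tunnel-region=us-east4

Because these routes are static, each on-premises range to be reached must be enumerated explicitly; dynamic routing resumes once Cloud Router recovers.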

Last Update: A few months ago

UPDATE: Incident 19005 - Google Cloud Networking issue with Cloud Routers in us-east4

We are experiencing an issue with Google Cloud Networking beginning at Wednesday, 2019-03-06 21:15 US/Pacific. Current investigation indicates that approximately 100% of Cloud Routers in the us-east4 region are affected by this issue. Users will experience BGP sessions down on all of their Cloud Router-enabled VPN tunnels and Cloud Interconnect links in the us-east4 region. Furthermore, us-east4 subnets might not be redistributed to other regions as part of VPC global routing mode, making this region unreachable over Interconnect. As a workaround, customers can set up a Cloud VPN without Cloud Router between us-east4 and their on-premises network. Other regions are not affected. The engineering team is investigating the issue and we will provide another status update by Thursday, 2019-03-07 04:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19005 - Google Cloud Networking issue with Cloud Routers in us-east4

We are experiencing an issue with Google Cloud Networking beginning at Wednesday, 2019-03-06 21:15 US/Pacific. Current investigation indicates that approximately 100% of Cloud Routers in the us-east4 region are affected by this issue. Users will experience BGP sessions down on all of their Cloud Router-enabled VPN tunnels and Cloud Interconnect links in the us-east4 region. Other regions are not affected. The engineering team is investigating the issue and we will provide another status update by Thursday, 2019-03-07 03:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19005 - We've received a report of an issue with Google Cloud Networking

We are experiencing an issue with Google Cloud Networking beginning at Wednesday, 2019-03-06 21:15 US/Pacific. Current investigation indicates that approximately 100% of Cloud Router users in the us-east4 region are affected by this issue. Users will experience BGP sessions down on all of their Cloud Router-enabled VPN tunnels and Cloud Interconnect links in the us-east4 region. Other regions are not affected. We will provide another status update by Thursday, 2019-03-07 02:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19005 - We've received a report of an issue with Google Cloud Networking

We are experiencing an issue with Google Cloud Networking beginning at Wednesday, 2019-03-06 21:15 US/Pacific. Current investigation indicates that approximately 100% of Cloud Router users in the us-east4 region are affected by this issue. Users will experience BGP sessions down on all of their Cloud Router-enabled VPN tunnels and Cloud Interconnect links in the us-east4 region. Other regions are not affected. We will provide another status update by Thursday, 2019-03-07 00:40 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19005 - We've received a report of an issue with Google Cloud Networking

We've received a report of an issue with Google Cloud Networking

Last Update: A few months ago

UPDATE: Incident 19005 - We've received a report of an issue with Google Cloud Networking

We are still seeing errors on the services responsible for the Cloud Router BGP issue in the us-east4 region. Our Engineering team is still working on the mitigation. We will provide another status update by Thursday, 2019-03-07 00:40 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 19004 - Instance connectivity issues in us-west1, us-central1, asia-east1 and europe-west1.

The Google Cloud Networking issue is believed to be affecting a very small number of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

RESOLVED: Incident 19004 - Issues with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1.

The issue with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1 has been resolved for all affected users as of 2019-02-26 20:50 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19004 - Issues with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1.

Our engineering team is continuing to work on the issues. We will provide another status update by Tuesday, 2019-02-26 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19004 - Issues with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Tuesday, 2019-02-26 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19004 - Issues with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1.

The issue persists despite the previously published mitigation actions, and we are pursuing alternative solutions. We will provide another status update by Tuesday, 2019-02-26 19:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19004 - Issues with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1.

Mitigation work is underway by our Engineering Team. We are currently rolling back a change to mitigate the issue in the affected regions. We will provide another status update by Tuesday, 2019-02-26 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19004 - Issues with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1.

Issues with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1.

Last Update: A few months ago

UPDATE: Incident 19004 - Issues with Google Kubernetes Engine GetClusters API endpoint in europe-west3 and us-east1.

We are experiencing an issue with the Google Kubernetes Engine GetClusters API endpoint beginning Tuesday, 2019-02-26 14:52 US/Pacific. Current data indicates that this issue affects europe-west3 and us-east1. The issue partially affected europe-west2-a and us-east1-d, but these zones have recovered as of Tuesday, 2019-02-26 15:10 US/Pacific. Users affected by this issue may receive HTTP 5XX errors when listing clusters with the GetClusters API endpoint. Users will also be unable to resize, upgrade or repair their clusters in the affected regions. For everyone who is affected, we apologize for the disruption. We will provide an update by Tuesday, 2019-02-26 16:30 US/Pacific with current details.
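
Since the failures described above are intermittent HTTP 5XX responses, listing clusters from gcloud with a simple retry can help confirm whether a given project is still affected. This is an illustrative sketch only; the retry count, sleep interval, and the cluster and zone names are arbitrary placeholders, not recommendations tied to this incident.

    # Retry listing clusters to distinguish transient 5XX errors from a
    # persistent failure (retry count and interval are arbitrary examples).
    for attempt in 1 2 3 4 5; do
        if gcloud container clusters list; then
            break
        fi
        echo "Cluster list attempt ${attempt} failed; retrying in 30s..." >&2
        sleep 30
    done

    # A single cluster in an affected region can also be checked directly
    # ("my-cluster" and the zone are placeholders).
    gcloud container clusters describe my-cluster --zone=us-east1-d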

Last Update: A few months ago

UPDATE: Incident 19006 - We've received a report of an issue with updating or deploying Google Cloud Functions

We've received a report of an issue with updating or deploying Google Cloud Functions

Last Update: A few months ago

UPDATE: Incident 19006 - We've received a report of an issue with updating or deploying Google Cloud Functions

We've received a report that Google Cloud Function deployments are seeing increased errors as of Tuesday, 2019-02-26 13:00 US/Pacific. We will provide more information by Tuesday, 2019-02-26 14:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 19003 - Google Kubernetes Engine may clear certain add-ons from the configuration UI after a cluster upgrade.

The Google Kubernetes Engine issue is believed to be affecting less than 1% of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

RESOLVED: Incident 19003 - The issue with Google Cloud Networking in us-central1 has been resolved.

The issue with Google Cloud Networking in us-central1-b, us-central1-c or us-central1-f has been resolved for all affected users as of 07:46 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19003 - The issue with Google Cloud Networking in us-central1 has been resolved.

The issue with Google Cloud Networking in us-central1 has been resolved.

Last Update: A few months ago

UPDATE: Incident 19002 - Google Cloud Networking is currently experiencing an issue with packet loss in us-central1-[b,c,f]

The issue with Google Cloud Networking experiencing packet loss in us-central1-[b,c,f] is believed to be affecting less than 1% of customers and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

RESOLVED: Incident 19002 - Google App Engine Flex deployments are failing in all regions.

The issue with Google App Engine Flex deployment failures in all regions has been resolved as of Thursday, 2019-01-17 20:25 US/Pacific. The affected customer deployments may need to be retried. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.
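
For customers whose deployments failed during the incident window, retrying is a matter of re-running the deployment once the issue is resolved. The sketch below assumes a standard app.yaml in the current directory and the default service name; adjust both to match your app.

    # Re-run the failed App Engine Flex deployment.
    gcloud app deploy app.yaml

    # Confirm the new version is present and serving traffic
    # ("default" is the usual service name; change if yours differs).
    gcloud app versions list --service=default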

Last Update: A few months ago

UPDATE: Incident 19002 - Google App Engine Flex deployments are failing in all regions.

Google App Engine Flex deployments are failing in all regions.

Last Update: A few months ago

UPDATE: Incident 19002 - Google App Engine Flex deployments are failing in all regions.

Google App Engine Flex deployments are failing in all regions as of Thursday, 2019-01-17 19:13 US/Pacific. We will provide more information by Thursday, 2019-01-17 20:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 19001 - We are currently investigating an issue with Google App Engine app creation, Cloud Function deployments, and Project Creation in Cloud Console.

ISSUE SUMMARY

On Wednesday 2 January, 2019, application creation in Google App Engine (App Engine), first-time deployment of Google Cloud Functions (Cloud Functions) per region, and project creation & API management in Cloud Console experienced elevated error rates ranging from 71% to 100% for a duration of 3 hours, 40 minutes starting at 14:40 PST. Workloads already running on App Engine and Cloud Functions, including deployment of new versions of applications and functions, as well as ongoing use of existing projects and activated APIs, were not impacted. We know that many customers depend on the ability to create new Cloud projects, applications & functions, and apologize for our failure to provide this to you during this time. The root cause of the incident is fully understood and engineering efforts are underway to ensure the issue is not at risk of recurrence.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 2 January, 2019 from 14:40 PST to 18:20 PST, application creation in App Engine, first-time deployments of Cloud Functions, and project creation & API auto-enablement in Cloud Console experienced elevated error rates in all regions due to a recently deployed configuration update to the underlying control plane for all impacted services. First-time deployments of new Cloud Functions failed. Redeployments of existing Cloud Functions were not impacted. Workloads on already deployed Cloud Functions were not impacted. App Engine app creation experienced an error rate of 98%. Workloads for deployed App Engine applications were not impacted. Cloud API enable requests experienced a 97% average error rate while disable requests had a 71% average error rate. Affected users observed these errors when attempting to enable an API via the Cloud Console and API Console.

ROOT CAUSE

The control plane responsible for managing new app creations in App Engine, new function deployments in Cloud Functions, and project creation & API management in Cloud Console utilizes a metadata store. This metadata store is responsible for persisting and processing new project creations, function deployments, App Engine applications, and API enablements. Google engineers began rolling out a new feature designed to improve the fault-tolerance of the metadata store. The rollout had been successful in test environments, but triggered an issue in production due to an unexpected difference in configuration, which triggered a bug. The bug caused writes to the metadata store to fail.

REMEDIATION AND PREVENTION

Google engineers were automatically alerted to the elevated error rate within 3 minutes of the incident start and immediately began their investigation. At 15:17, an issue with our metadata store was identified as the root cause, and mitigation work began. An initial mitigation was applied, but automation intentionally slowed the rollout of this mitigation to minimize risks to production. To reduce time to resolution, Google engineers developed and implemented a new mitigation. The metadata store became fully available at 18:20. To prevent a recurrence, we will implement additional validation to the metadata store’s schemas and ensure that test validation of metadata store configuration matches production. To improve time to resolution for such issues, we are increasing the robustness of our emergency rollback procedures for the metadata store, and creating engineering runbooks for such scenarios.

Last Update: A few months ago

RESOLVED: Incident 19001 - We are currently investigating an issue with Google App Engine app creation, Cloud Function deployments, and Project Creation in Cloud Console.

The issue regarding Google App Engine application creation, GCP Project Creation, Cloud Function deployments, and GenerateUploadUrl calls has been resolved for all affected users as of Wednesday, 2019-01-02 18:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 19001 - We are currently investigating an issue with Google App Engine app creation, Cloud Function deployments, and Project Creation in Cloud Console.

Mitigation work is currently underway by our Engineering Team. As the work progresses, the error rates observed from various affected systems continue to drop. We will provide another status update by Wednesday, 2019-01-02 19:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19001 - We are currently investigating an issue with Google App Engine app creation, Cloud Function deployments, and Project Creation in Cloud Console as of Wednesday, 2019-01-02 15:07 US/Pacific.

Error rates have started to subside. Mitigation work is continuing by our Engineering Team. We will provide another status update by Wednesday, 2019-01-02 18:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19001 - We are currently investigating an issue with Google App Engine app creation, Cloud Function deployments, and Project Creation in Cloud Console as of Wednesday, 2019-01-02 15:07 US/Pacific.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Wednesday, 2019-01-02 17:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19001 - We are currently investigating an issue with Google App Engine app creation, Cloud Function deployments, and Project Creation in Cloud Console as of Wednesday, 2019-01-02 15:07 US/Pacific.

We are currently investigating an issue with Google App Engine app creation, Cloud Function deployments, and Project Creation in Cloud Console as of Wednesday, 2019-01-02 15:07 US/Pacific. Affected customers may see elevated errors when creating Google App Engine applications in all regions. Affected customers may also experience issues with GCP Project Creation and Cloud Function deployments, and GenerateUploadUrl calls may also fail. Existing App Engine applications are unaffected. We have identified the root cause and are working towards a mitigation. We will provide another status update by Wednesday, 2019-01-02 16:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 19001 - Google App Engine application creation is experiencing elevated error rates in all regions.

Google App Engine application creation is experiencing elevated error rates in all regions.

Last Update: A few months ago

UPDATE: Incident 19001 - Google App Engine application creation is experiencing elevated error rates in all regions.

We are investigating elevated errors when creating Google App Engine applications in all regions as of Wednesday, 2019-01-02 15:07 US/Pacific. Existing App Engine applications are unaffected. We have identified the root cause and are working towards a mitigation. We will provide another status update by Wednesday, 2019-01-02 16:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18005 - We are currently investigating an issue with Google Cloud Storage and App Engine. Google Cloud Build and Cloud Functions services are restored

ISSUE SUMMARY

On Friday 21 December 2018, customers deploying App Engine apps, deploying in Cloud Functions, reading from Google Cloud Storage (GCS), or using Cloud Build experienced increased latency and elevated error rates ranging from 1.6% to 18% for a period of 3 hours, 41 minutes. We understand that these services are critical to our customers and sincerely apologize for the disruption caused by this incident; this is not the level of quality and reliability that we strive to offer you. We have several engineering efforts now under way to prevent a recurrence of this sort of problem; they are described in detail below.

DETAILED DESCRIPTION OF IMPACT

On Friday 21 December 2018, from 08:01 to 11:43 PST, Google Cloud Storage reads, App Engine deployments, Cloud Functions deployments, and Cloud Build experienced a disruption due to increased latency and 5xx errors while reading from Google Cloud Storage. The peak error rate for GCS reads was 1.6% in US multi-region. Writes were not impacted, as the impacted metadata store is not utilized on writes. Elevated deployment errors for App Engine Apps in all regions averaged 8% during the incident period. In Cloud Build, a 14% INTERNAL_ERROR rate and 18% TIMEOUT error rate occurred at peak. The aggregated average deployment failure rate of 4.6% for Cloud Functions occurred in us-central1, us-east1, europe-west1, and asia-northeast1.

ROOT CAUSE

Impact began when increased load on one of GCS's metadata stores resulted in request queuing, which in turn created an uneven distribution of service load. The additional load was created by a partially-deployed new feature. A routine maintenance operation in combination with this new feature resulted in an unexpected increase in the load on the metadata store. This increase in load affected read workloads due to increased request latency to the metadata store. In some cases, this latency exceeded the timeout threshold, causing an average of 0.6% of requests to fail in the US multi-region for the duration of the incident.

REMEDIATION AND PREVENTION

Google engineers were automatically alerted to the increased error rate at 08:22 PST. Since the issue involved multiple backend systems, multiple teams at Google were involved in the investigation and narrowed down the issue to the newly-deployed feature. The latency and error rate began to subside as Google Engineers initiated the rollback of the new feature. The issue was fully mitigated at 11:43 PST when the rollback finished, at which point the impacted GCP services recovered completely. In addition to updating the impacting feature to prevent this type of increased load, we will update the rollout workflow to stress feature limits before rollout. To improve time to resolution of issues in the metadata store, we are implementing additional instrumentation to the requests made of the subsystem.

Last Update: A few months ago

RESOLVED: Incident 18019 - We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.

ISSUE SUMMARY

On Wednesday 19 December 2018, multiple GCP services in europe-west1-b experienced a disruption for a duration of 34 minutes. Several GCP services were impacted: GCE, Monitoring, Cloud Console, GAE Admin API, Task Queues, Cloud Spanner, Cloud SQL, GKE, Cloud Bigtable, and Redis. GCP services in all other zones remained unaffected. This service disruption was caused by an erroneous trigger leading to a switch re-installation during upgrades to two control plane network (CPN) switches, impacting a portion of europe-west1-b. Most impacted GCP services in the zone recovered within a few minutes after the issue was mitigated. We understand that these services are critical to our customers and sincerely apologize for the disruption caused by this incident. To prevent the issue from recurring we are fixing our repair workflows to catch such errors before serving traffic.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 19 December 2018 from 05:53 to 06:27 US/Pacific, multiple GCP services in europe-west1-b experienced disruption due to a network outage in one of Google’s data centers. The following Google Cloud services in europe-west1-b were impacted: GCE instance creation, GCE networking, Cloud VPN, Cloud Interconnect, Stackdriver Monitoring API, Cloud Console, App Engine Admin API, App Engine Task Queues, Cloud Spanner, Cloud SQL, GKE, Cloud Bigtable, and Cloud Memorystore for Redis. Most of these services suffered a brief disruption during the incident and recovered when the issue was mitigated.

- Stackdriver: Around 1% of customers accessing the Stackdriver Monitoring API directly received 5xx errors.
- Cloud Console: Affected customers may not have been able to view graphs and API usage statistics. Impacted dashboards include: /apis/dashboard, /home/dashboard, /google/maps-api/api list.
- Redis: After the network outage ended, ~50 standard Redis instances in europe-west1 remained unavailable until 07:55 US/Pacific due to a failover bug triggered by the outage.

ROOT CAUSE

As part of a program to upgrade network switches in control plane networks across Google’s data centers, two control plane network (CPN) switches supporting a single CPN were scheduled to undergo upgrades. On December 17, the first switch was upgraded and was back online the same day. The issue was triggered on December 19, when the second switch was due to be upgraded. During the upgrade of the second switch, a reinstallation was erroneously triggered on the first switch, causing it to go offline for a short period of time. Having both switches down partitioned the network supporting a portion of europe-west1-b. Due to this isolation, the zone was left partially functional.

REMEDIATION AND PREVENTION

The issue was mitigated at 06:27 US/Pacific when reinstallation of the first switch in the CPN completed. To prevent the issue from recurring, we are changing the switch upgrade workflow to eliminate the erroneous trigger, which caused the switch to re-install before either CPN switch was deemed healthy to serve traffic. We are also adding additional checks to make sure upgraded devices are in a fully functional state before they are deemed healthy to start serving. We will also be improving our automation to catch offline peer devices sooner and help prevent related issues.

Last Update: A few months ago

RESOLVED: Incident 18005 - We are currently investigating an issue with Google Cloud Storage and App Engine. Google Cloud Build and Cloud Functions services are restored

The issue with Google Cloud Storage, App Engine, and Cloud Functions has been resolved for all affected projects as of Friday, 2018-12-21 11:46 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18005 - We are currently investigating an issue with Google Cloud Storage and App Engine. Google Cloud Build and Cloud Functions services are restored

The rollout of the potential fix is continuing to make progress. The Google Cloud Storage error rate has improved and is currently 0.1% for the US multi-region. Google App Engine app deployments and Google Cloud Function deployments remain affected. We will provide another status update by Friday, 2018-12-21 12:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18019 - We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.

The Google Cloud Networking issue in zone europe-west1-b has been resolved. No further updates will be provided here.

Last Update: A few months ago

UPDATE: Incident 18019 - We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.

We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18019 - We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.

We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18006 - Google Cloud Console project creation broken when enabling APIs for users without organization.

The issue where customers without an organization were unable to create projects through the Enablement API has been resolved for all affected users as of Monday, 2018-11-19 23:01 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18006 - Google Cloud Console project creation broken when enabling APIs for users without organization.

The issue where customers without an organization are unable to create projects through the Enablement API should be resolved for some users, and we expect a full resolution in the near future. We will provide another status update by Monday, 2018-11-19 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18006 - Google Cloud Console project creation broken when enabling APIs for users without organization.

Mitigation work is currently underway by our Engineering Team. No ETA is available at this point in time. We will provide another status update by Monday, 2018-11-19 20:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18018 - Connectivity issues connecting to Google services including Google APIs, Load balancers, instances and other external IP addresses have been resolved.

The issue with Google Cloud IP addresses being advertised by internet service providers has been resolved for all affected users as of 14:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18018 - Connectivity issues connecting to Google services including Google APIs, Load balancers, instances and other external IP addresses.

Connectivity issues connecting to Google services including Google APIs, Load balancers, instances and other external IP addresses.

Last Update: A few months ago

UPDATE: Incident 18018 - Connectivity issues connecting to Google services including Google APIs, Load balancers, instances and other external IP addresses.

We've received a report of an issue with Google Cloud Networking as of Monday, 2018-11-12 14:16 US/Pacific. We have reports of Google Cloud IP addresses being erroneously advertised by internet service providers other than Google. We will provide more information by Monday, 2018-11-12 15:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

The Google Cloud Networking issue is believed to be affecting a very small number of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: A few months ago

RESOLVED: Incident 18010 - Failed read requests to Google Stackdriver via the Cloud Console and API

The issue with failing read requests to Google Stackdriver via the Cloud Console and API has been resolved for all affected users as of 19:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

The issue with packet loss for Google Cloud VPN, Cloud Interconnect, and Cloud Router has been resolved for all affected users as of Monday, 2018-10-22 14:25 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

The issue with packet loss for Google Cloud VPN, Cloud Interconnect, and Cloud Router should be resolved for the majority of requests as of Monday, 2018-10-22 14:25 US/Pacific and we expect a full resolution in the near future. We will provide an update by Monday, 2018-10-22 15:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

Last Update: A few months ago

UPDATE: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

We are experiencing an issue with packet loss for Google Cloud VPN, Cloud Interconnect, and Cloud Router beginning at Monday, 2018-10-22 13:42 US/Pacific. Current data indicate that requests in the following regions are affected by this issue: asia-east1, asia-south1, asia-southeast1, europe-west1, europe-west4, northamerica-northeast1, us-east1. For everyone who is affected, we apologize for the disruption. We will provide an update by Monday, 2018-10-22 15:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products (Bigtable, Cloud SQL, Datastore, VMs)

ISSUE SUMMARY

On Thursday 11 October 2018, a section of Google's network that includes part of us-central1-c lost connectivity to the Google network backbone that connects to the public internet for a duration of 41 minutes. We apologize if your service or application was impacted by this incident. We are following our postmortem process to ensure we fully understand what caused this incident and to determine the exact steps we can take to prevent incidents of this type from recurring. Our engineering team is committed to prioritizing fixes for the most critical findings that result from our postmortem.

DETAILED DESCRIPTION OF IMPACT

On Thursday 11 October 2018 from 16:13 to 16:54 PDT, a section of Google's network that includes part of us-central1-c lost connectivity to the Google network backbone that connects to the public internet. The us-central1-c zone is composed of two separate physical clusters. 61% of the VMs in us-central1-c were in the cluster impacted by this incident. Projects that create VMs in this zone have all of their VMs assigned to a single cluster, so customers with VMs in the zone were either impacted for all of their VMs in a project or for none. Impacted VMs could not communicate with VMs outside us-central1 during the incident. VM-to-VM traffic using an internal IP address within us-central1 was not affected. Traffic through the network load balancer was not able to reach impacted VMs in us-central1-c, but customers with VMs spread between multiple zones experienced the network load balancer shifting traffic to unaffected zones. Traffic through the HTTP(S), SSL Proxy, and TCP Proxy load balancers was not significantly impacted by this incident. Other Google Cloud Platform services that experienced significant impact include the following:

- 30% of Cloud Bigtable clusters located in us-central1-c became unreachable.
- 10% of Cloud SQL instances in us-central lost external connectivity.

ROOT CAUSE

The incident occurred while Google's network operations team was replacing the routers that link us-central1-c to Google's backbone that connects to the public internet. Google engineers paused the router replacement process after determining that additional cabling would be required to complete the process, and decided to start a rollback operation. The rollout and rollback operations utilized a version of the workflow that was only compatible with the newer routers; specifically, rollback was not supported on the older routers. When a configuration change was pushed to the older routers during the rollback, it deleted the Border Gateway Protocol (BGP) control plane sessions connecting the datacenter routers to the backbone routers, resulting in a loss of external connectivity.

REMEDIATION AND PREVENTION

The BGP sessions were deleted in two tranches. The first deletion was at 15:43 and caused traffic to fail over to other routers. The second set of BGP sessions was deleted at 16:13. The first alert for Google engineers fired at 16:16. We identified that the BGP sessions had been deleted at 16:41 and rolled back the configuration change at 16:52, ending the incident shortly thereafter. The preventative action items identified so far include the following:

- Fix the automated workflows for router replacements to ensure the correct version of the workflow is used for both types of routers.
- Alert when BGP sessions are deleted and traffic fails over, so that we can detect and mitigate problems before they impact customers.
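
As the report notes, customers whose backends were spread across multiple zones saw the network load balancer shift traffic away from the impacted zone automatically. The commands below sketch one way to achieve that spread with a regional managed instance group behind a network load balancer; every resource name, the instance template, the ports, and the zone list are hypothetical examples and not part of the incident.

    # Regional managed instance group distributed across three zones
    # (template and names are placeholders).
    gcloud compute instance-groups managed create web-mig \
        --region=us-central1 \
        --zones=us-central1-b,us-central1-c,us-central1-f \
        --template=web-template \
        --size=3

    # Network load balancer: legacy HTTP health check, target pool, and
    # forwarding rule, with the managed instance group feeding the pool.
    gcloud compute http-health-checks create web-hc --port=80
    gcloud compute target-pools create web-pool --region=us-central1 \
        --http-health-check=web-hc
    gcloud compute instance-groups managed set-target-pools web-mig \
        --region=us-central1 --target-pools=web-pool
    gcloud compute forwarding-rules create web-fr \
        --region=us-central1 --ports=80 --target-pool=web-pool

With this layout, instances in a zone that loses external connectivity fail their health checks and stop receiving traffic from the target pool, which is the failover behavior described in the impact section above.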

Last Update: A few months ago

RESOLVED: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

The issue with Cloud Networking that impacted multiple GCP products in us-central1-c has been resolved for all affected projects as of Thursday, 2018-10-11 17:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

Cloud Networking is now working, and mitigation work to restore the impacted services is currently underway by our Engineering Team. We will provide another status update by Thursday, 2018-10-11 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

Our engineers have applied a mitigation and the error rate is now decreasing. We will provide another status update by Thursday, 2018-10-11 17:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

Networking issues in us-central1-c impacting multiple GCP products

Last Update: A few months ago

UPDATE: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

We are investigating networking issues in us-central1-c. This issue is impacting multiple Google Cloud Platform products. We will provide more information by Thursday, 2018-10-11 17:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18038 - We are investigating an issue with Google BigQuery having an increased error rate in the US.

The issue with Google BigQuery having an increased error rate in the US has been resolved for all affected users as of 11:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18038 - We are investigating an issue with Google BigQuery having an increased error rate in the US.

We are investigating an issue with Google BigQuery having an increased error rate in the US. Currently applied mitigation has not resolved the issue, we are continuing to investigate and apply additional mitigation. We will provide more information by Thursday, 2018-09-27 12:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18038 - We are investigating an issue with Google BigQuery having an increased error rate in the US.

We are investigating an issue with Google BigQuery having an increased error rate in the US. We have already applied a mitigation and are currently monitoring the situation. We will provide more information by Thursday, 2018-09-27 11:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18015 - We are investigating an issue with Google Cloud Networking in europe-north1-c.

After further investigation, we have determined that the impact was minimal and affected a small number of users. We have conducted an internal investigation of this issue and made appropriate improvements to our systems to help prevent or minimize a future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18002 - AutoML Natural Language failing to train models

The issue with Google Cloud AutoML Natural language failing to train models has been resolved for all affected users. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18006 - App Engine increased latency and 5XX errors.

The issue with Google App Engine increased latency and 5XX errors has been resolved for all affected projects as of Wednesday 2018-09-12 08:07 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18002 - Error message when updating Cloud Functions via gcloud.

The issue with Google Cloud Functions experiencing errors when updating functions via gcloud has been resolved for all affected users as of Tuesday, 2018-09-11 09:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18015 - We are investigating an issue with Google Cloud Networking in europe-north1-c.

The issue with Cloud Networking in europe-north1-c has been resolved for all affected projects as of Tuesday, 2018-09-11 02:27 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18015 - We are investigating an issue with Google Cloud Networking in europe-north1-c.

We are investigating a major issue with Google Cloud Networking in europe-north1-c. The issue started at Tuesday, 2018-09-11 01:18 US/Pacific. Our Engineering Team is investigating possible causes. We will provide more information by Tuesday, 2018-09-11 03:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18003 - Increased error rate for Google Cloud Storage

# ISSUE SUMMARY

On Tuesday 4 September 2018, Google Cloud Storage experienced 1.1% error rates and increased 99th percentile latency for US multiregion buckets for a duration of 5 hours 38 minutes. After that time some customers experienced 0.1% error rates which returned to normal progressively over the subsequent 4 hours. To our Google Cloud Storage customers whose businesses were impacted during this incident, we sincerely apologize; this is not the level of tail-end latency and reliability we strive to offer you. We are taking immediate steps to improve the platform’s performance and availability.

# DETAILED DESCRIPTION OF IMPACT

On Tuesday 4 September 2018 from 02:55 to 08:33 PDT, customers with buckets located in the US multiregion experienced a 1.066% error rate and 4.9x increased 99th percentile latency, with the peak effect occurring between 05:00 PDT and 07:30 PDT for write-heavy workloads. At 08:33 PDT 99th percentile latency decreased to 1.4x normal levels and error rates decreased, initially to 0.146% and then subsequently to nominal levels by 12:50 PDT.

# ROOT CAUSE

At the beginning of August, Google Cloud Storage deployed a new feature which among other things prefetched and cached the location of some internal metadata. On Monday 3rd September 18:45 PDT, a change in the underlying metadata storage system resulted in increased load to that subsystem, which eventually invalidated some cached metadata for US multiregion buckets. This meant that requests for that metadata experienced increased latency, or returned an error if the backend RPC timed out. This additional load on metadata lookups led to elevated error rates and latency as described above.

# REMEDIATION AND PREVENTION

Google Cloud Storage SREs were alerted automatically once error rates had risen materially above nominal levels. Additional SRE teams were involved as soon as the metadata storage system was identified as a likely root cause of the incident. In order to mitigate the incident, the keyspace that was suffering degraded performance needed to be identified and isolated so that it could be given additional resources. This work was completed by 08:33 PDT on 4 September. In parallel, Google Cloud Storage SREs pursued the source of additional load on the metadata storage system and traced it to cache invalidations. In order to prevent this type of incident from occurring again in the future, we will expand our load testing to ensure that performance degradations are detected prior to reaching production. We will improve our monitoring diagnostics to ensure that we more rapidly pinpoint the source of performance degradation. The metadata prefetching algorithm will be changed to introduce randomness and further reduce the chance of creating excessive load on the underlying storage system. Finally, we plan to enhance the storage system to reduce the time needed to identify, isolate, and mitigate load concentration such as that resulting from cache invalidations.
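
One of the prevention items above is to add randomness to the metadata prefetching algorithm so load does not concentrate on the storage subsystem. The same idea applies to clients retrying transient errors during an event like this one: back off exponentially and add jitter so retries do not arrive in synchronized waves. A minimal, generic sketch (not the Cloud Storage client library; the wrapped call is hypothetical) follows:

```python
# Minimal, generic sketch (not the Cloud Storage client library): retry a flaky
# operation with exponential backoff plus random jitter, so retries from many
# clients do not arrive in synchronized waves.
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=16.0):
    """Run `operation`, retrying on exception with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts; surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # "full jitter"

# Usage (hypothetical): wrap any call that can fail transiently, e.g.
# call_with_backoff(lambda: fetch_object("my-bucket", "path/to/object"))
```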

Last Update: A few months ago

RESOLVED: Incident 18003 - Increased error rate for Google Cloud Storage

The issue with Google Cloud Storage errors on requests to US multiregional buckets has been resolved for all affected users as of Tuesday, 2018-09-04 12:52 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate for Google Cloud Storage

The mitigation efforts have further decreased the error rates to less than 1% of requests. We are still seeing intermittent spikes but these are less frequent. We expect a full resolution in the near future. We will provide another status update by Tuesday, 2018-09-04 16:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate for Google Cloud Storage

We are rolling out a potential fix to mitigate this issue. Impact is intermittent but limited to US multiregional Cloud Storage buckets. We will provide another status update by Tuesday, 2018-09-04 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate or high latency for Google Container Registry API calls

The issue with Google Container Registry API latency should be mitigated for the majority of requests and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2018-09-04 14:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate or high latency for Google Container Registry API calls

Temporary mitigation efforts have significantly reduced the error rate but we are still seeing intermittent errors or latency on requests. Full resolution efforts are still ongoing. We will provide another status update by Tuesday, 2018-09-04 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate or high latency for Google Container Registry API calls

The rate of errors is decreasing. We will provide another status update by Tuesday, 2018-09-04 11:15 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

ISSUE SUMMARY

On Friday 27 July 2018, for a duration of 1 hour 4 minutes, Google Compute Engine (GCE) instances and Cloud VPN tunnels in europe-west4 experienced loss of connectivity to the Internet. The incident affected all new or recently live migrated GCE instances. VPN tunnels created during the incident were also impacted. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to avoid a recurrence.

DETAILED DESCRIPTION OF IMPACT

All Google Compute Engine (GCE) instances in europe-west4 created on Friday 27 July 2018 from 18:27 to 19:31 PDT lost connectivity to the Internet and other instances via their public IP addresses. Additionally any instances that live migrated during the outage period would have lost connectivity for approximately 30 minutes after the live migration completed. All Cloud VPN tunnels created during the impact period, and less than 1% of existing tunnels in europe-west4 also lost external connectivity. All other instances and VPN tunnels continued to serve traffic. Inter-instance traffic via private IP addresses remained unaffected.

ROOT CAUSE

Google's datacenters utilize software load balancers known as Maglevs [1] to efficiently load balance network traffic [2] across service backends. The issue was caused by an unintended side effect of a configuration change made to jobs that are critical in coordinating the availability of Maglevs. The change unintentionally lowered the priority of these jobs in europe-west4. The issue was subsequently triggered when a datacenter maintenance event required load shedding of low priority jobs. This resulted in failure of a portion of the Maglev load balancers. However, a safeguard in the network control plane ensured that some Maglev capacity remained available. This layer of our typical defense-in-depth allowed connectivity to extant cloud resources to remain up, and restricted the disruption to new or migrated GCE instances and Cloud VPN tunnels.

REMEDIATION AND PREVENTION

Automated monitoring alerted Google’s engineering team to the event within 5 minutes and they immediately began investigating at 18:36. At 19:25 the team discovered the root cause and started reverting the configuration change. The issue was mitigated at 19:31 when the fix was rolled out. At this point, connectivity was restored immediately. In addition to addressing the root cause, we will be implementing changes to both prevent and reduce the impact of this type of failure by improving our alerting when too many Maglevs become unavailable, and adding a check for configuration changes to detect priority reductions on critical dependencies. We would again like to apologize for the impact that this incident had on our customers and their businesses in the europe-west4 region. We are conducting a detailed post-mortem to ensure that all the root and contributing causes of this event are understood and addressed promptly.

[1] https://ai.google/research/pubs/pub44824
[2] https://cloudplatform.googleblog.com/2016/03/Google-shares-software-network-load-balancer-design-powering-GCP-networking.html
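
One of the prevention items above is a check that flags configuration changes which lower the priority of critical dependencies. A minimal sketch of that kind of pre-rollout check is shown below; the job name and configuration shape are hypothetical, not Google's actual tooling:

```python
# Minimal sketch (assumed, not Google's actual tooling) of the check described in
# the prevention items: flag any configuration change that lowers the scheduling
# priority of a job marked as a critical dependency. Job names are hypothetical.
CRITICAL_JOBS = {"maglev-coordinator"}

def priority_regressions(old_cfg, new_cfg):
    """Return the critical jobs whose priority would drop under the new config."""
    regressions = []
    for job in CRITICAL_JOBS:
        old_prio = old_cfg.get(job, {}).get("priority", 0)
        new_prio = new_cfg.get(job, {}).get("priority", 0)
        if new_prio < old_prio:
            regressions.append(job)
    return regressions

# Usage: block the rollout if the returned list is non-empty, e.g.
# priority_regressions({"maglev-coordinator": {"priority": 200}},
#                      {"maglev-coordinator": {"priority": 100}})
# -> ["maglev-coordinator"]
```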

Last Update: A few months ago

RESOLVED: Incident 18008 - We are investigating errors activating Windows and SUSE licenses on Google Compute Engine instances in all regions.

The issue with errors activating Windows and SUSE licenses on Google Compute Engine instances in all regions has been resolved for all affected users as of Friday, 2018-08-03 11:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18008 - We are investigating errors activating Windows and SUSE licenses on Google Compute Engine instances in all regions.

We are still seeing errors activating Windows and SUSE licenses on Google Compute Engine instances in all regions. Our Engineering Team is investigating possible causes. We will provide another status update by Friday, 2018-08-03 12:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18014 - Traffic loss in region europe-west2

The issue with traffic loss in europe-west2 has been resolved for all affected projects as of Tuesday, 2018-07-31 07:26 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18014 - traffic loss in region europe-west2

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Tuesday, 2018-07-31 10:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - traffic loss in region europe-west2

The issue with traffic loss in GCP region europe-west2 should be mitigated and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2018-07-31 08:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - traffic loss in region europe-west2

We are experiencing traffic loss in GCP region europe-west2 beginning at Tuesday, 2018-07-31 06:45 US/Pacific. Early investigation indicates that approximately 20% of requests in this region are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by Tuesday, 2018-07-31 07:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - traffic loss in region europe-west2

We are investigating an issue with traffic loss in region europe-west2. We will provide more information by Tuesday, 2018-07-31 07:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

The issue with Internet access for VMs in the europe-west4 region has been resolved for all affected projects as of Friday, 2018-07-27 19:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Friday, 2018-07-27 20:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

Our Engineering Team believes they have identified the root cause of the issue and is working to mitigate. We will provide another status update by Friday, 2018-07-27 20:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

Investigation is currently underway by our Engineering Team. We will provide another status update by Friday, 2018-07-27 20:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

We are investigating an issue with Google Cloud Networking for VM instances in the europe-west4 region. We will provide more information by Friday, 2018-07-27 19:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - Support Center inaccessible

A detailed analysis has been written for this incident and is available at cloud networking incident 18012: https://status.cloud.google.com/incident/cloud-networking/18012

Last Update: A few months ago

RESOLVED: Incident 18012 - The issue with Google Cloud Global Loadbalancers returning 502s has been fully resolved.

## ISSUE SUMMARY

On Tuesday, 17 July 2018, customers using Google Cloud App Engine, Google HTTP(S) Load Balancer, or TCP/SSL Proxy Load Balancers experienced elevated error rates ranging between 33% and 87% for a duration of 32 minutes. Customers observed errors consisting of either 502 return codes, or connection resets. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability. We will be providing customers with an SLA credit for the affected timeframe that impacted the Google Cloud HTTP(S) Load Balancer, TCP/SSL Proxy Load Balancer and Google App Engine products.

## DETAILED DESCRIPTION OF IMPACT

On Tuesday, 17 July 2018, from 12:17 to 12:49 PDT, Google Cloud HTTP(S) Load Balancers returned 502s for some requests they received. The proportion of 502 return codes varied from 33% to 87% during the period. Automated monitoring alerted Google’s engineering team to the event at 12:19, and at 12:44 the team had identified the probable root cause and deployed a fix. At 12:49 the fix became effective and the rate of 502s rapidly returned to a normal level. Services experienced degraded latency for several minutes longer as traffic returned and caches warmed. Serving fully recovered by 12:55. Connections to Cloud TCP/SSL Proxy Load Balancers would have been reset after connections to backends failed. Cloud services depending upon Cloud HTTP Load Balancing, such as Google App Engine application serving, Google Cloud Functions, Stackdriver's web UI, Dialogflow and the Cloud Support Portal/API, were affected for the duration of the incident. Cloud CDN cache hits dropped 70% due to decreased references to Cloud CDN URLs from services behind Cloud HTTP(S) Load balancers and an inability to validate stale cache entries or insert new content on cache misses. Services running on Google Kubernetes Engine and using the Ingress resource would have served 502 return codes as mentioned above. Google Cloud Storage traffic served via Cloud Load Balancers was also impacted. Other Google Cloud Platform services were not impacted. For example, applications and services that use direct VM access, or Network Load Balancing, were not affected.

## ROOT CAUSE

Google’s Global Load Balancers are based on a two-tiered architecture of Google Front Ends (GFE). The first tier of GFEs answer requests as close to the user as possible to maximize performance during connection setup. These GFEs route requests to a second layer of GFEs located close to the service which the request makes use of. This type of architecture allows clients to have low latency connections anywhere in the world, while taking advantage of Google’s global network to serve requests to backends, regardless of in which region they are located. The GFE development team was in the process of adding features to GFE to improve security and performance. These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either testing or initial rollout. At the beginning of the event, a configuration change in the production environment triggered the bug intermittently, which caused affected GFEs to repeatedly restart. Since restarts are not instantaneous, the available second layer GFE capacity was reduced. While some requests were correctly answered, other requests were interrupted (leading to connection resets) or denied due to a temporary lack of capacity while the GFEs were coming back online.

## REMEDIATION AND PREVENTION

Google engineers were alerted to the issue within 3 minutes and immediately began investigating. At 12:44 PDT, the team discovered the root cause, the configuration change was promptly reverted, and the affected GFEs ceased their restarts. As all GFEs returned to service, traffic resumed its normal levels and behavior. In addition to fixing the underlying cause, we will be implementing changes to both prevent and reduce the impact of this type of failure in several ways:

1. We are adding additional safeguards to disable features not yet in service.
2. We plan to increase hardening of the GFE testing stack to reduce the risk of having a latent bug in production binaries that may cause a task to restart.
3. We will also be pursuing additional isolation between different shards of GFE pools in order to reduce the scope of failures.
4. Finally, to speed diagnosis in the future, we plan to create a consolidated dashboard of all configuration changes for GFE pools, allowing engineers to more easily and quickly observe, correlate, and identify problematic changes to the system.

We would again like to apologize for the impact that this incident had on our customers and their businesses. We take any incident that affects the availability and reliability of our customers extremely seriously, particularly incidents which span regions. We are conducting a thorough investigation of the incident and will be making the changes which result from that investigation our very top priority in GCP engineering.
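
The first prevention item above, adding safeguards to disable features not yet in service, is essentially a feature gate: code that has shipped in the binary stays inert unless explicitly enabled. A minimal sketch of the pattern, with hypothetical function and flag names rather than anything from the GFE codebase, might look like this:

```python
# Minimal sketch (assumed, with hypothetical names; not the GFE codebase): gate
# code paths that are not yet in service behind an explicit flag, so a latent bug
# cannot be triggered by an unrelated configuration change.
import os

def feature_enabled(name):
    """A feature is off unless explicitly enabled, here via an environment variable."""
    return os.environ.get(f"FEATURE_{name.upper()}", "off") == "on"

def stable_routing_path(request):
    return f"routed {request!r} via the stable path"

def new_routing_path(request):
    return f"routed {request!r} via the new path"

def handle_request(request):
    # "new_routing_path" is a hypothetical feature name used for illustration.
    if feature_enabled("new_routing_path"):
        return new_routing_path(request)
    return stable_routing_path(request)

# Usage: print(handle_request("GET /"))
```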

Last Update: A few months ago

RESOLVED: Incident 18012 - The issue with Google Cloud Global Loadbalancers returning 502s has been fully resolved.

The issue with Google Cloud Global Load balancers returning 502s has been resolved for all affected users as of 13:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18012 - We are investigating a problem with Google Cloud Global Loadbalancers returning 502s

The issue with Google Cloud Load balancers returning 502s should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2018-07-17 13:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18012 - We are investigating a problem with Google Cloud Global Loadbalancers returning 502s

We are investigating a problem with Google Cloud Global Loadbalancers returning 502s

Last Update: A few months ago

UPDATE: Incident 18012 - We are investigating a problem with Google Cloud Global Loadbalancers returning 502s

We are investigating a problem with Google Cloud Global Loadbalancers returning 502s for many services including App Engine, Stackdriver, Dialogflow, as well as customer Global Load Balancers. We will provide another update by Tuesday, 2018-07-17 13:00 US/Pacific

Last Update: A few months ago

RESOLVED: Incident 18008 - Stackdriver disturbance

The issue with Cloud Stackdriver, where you could have experienced some latency on the logs, has been resolved for all affected users as of 02:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18008 - Stackdriver disturbance

The issue with Cloud Stackdriver, where you could have experienced some latency on the logs, has been resolved for all affected users as of 02:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 18007 - The issue with with Google Cloud networking in europe-west1-b and europe-west4-b has been resolved.

The issue with VM public IP address traffic in europe-west1-b and europe-west4-b has been resolved for all affected projects as of Wednesday, 2018-07-04 08:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18007 - We've received a report of an issue with Google Cloud networking in europe-west1-b and europe-west4-b.

We are allocating additional capacity to handle this load. Most of the changes are complete and access controls have been rolled back. The situation is still being closely monitored and further adjustments are still possible/likely. We will provide another status update by Wednesday, 2018-07-04 10:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18037 - We've received a report of an issue with Google BigQuery.

ISSUE SUMMARY

On Friday 22 June 2018, Google BigQuery experienced increased query failures for a duration of 1 hour 6 minutes. We apologize for the impact of this issue on our customers and are making changes to mitigate and prevent a recurrence.

DETAILED DESCRIPTION OF IMPACT

On Friday 22 June 2018 from 12:06 to 13:12 PDT, up to 50% of total requests to the BigQuery API failed with error code 503. Error rates varied during the incident, with some customers experiencing 100% failure rate for their BigQuery table jobs. bigquery.tabledata.insertAll jobs were unaffected.

ROOT CAUSE

A new release of the BigQuery API introduced a software defect that caused the API component to return larger-than-normal responses to the BigQuery router server. The router server is responsible for examining each request, routing it to a backend server, and returning the response to the client. To process these large responses, the router server allocated more memory which led to an increase in garbage collection. This resulted in an increase in CPU utilization, which caused our automated load balancing system to shrink the server capacity as a safeguard against abuse. With the reduced capacity and now comparatively large throughput of requests, the denial of service protection system used by BigQuery responded by rejecting user requests, causing a high rate of 503 errors.

REMEDIATION AND PREVENTION

Google Engineers initially mitigated the issue by increasing the capacity of the BigQuery router server which prevented overload and allowed API traffic to resume normally. The issue was fully resolved by identifying and reverting the change that caused large response sizes. To prevent future occurrences, BigQuery engineers will also be adjusting capacity alerts to improve monitoring of server overutilization. We apologize once again for the impact of this incident on your business.
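
The prevention item above about adjusting capacity alerts amounts to watching server utilization over a window rather than a single sample, so sustained overutilization pages someone before protective systems start rejecting requests. A minimal, hypothetical sketch of such an alert (not BigQuery's monitoring stack) is shown below:

```python
# Minimal sketch (assumed, not BigQuery's monitoring stack): page when CPU
# utilization stays above a threshold for a whole rolling window of samples,
# rather than reacting to a single spike.
from collections import deque

class UtilizationAlert:
    def __init__(self, threshold=0.85, window=12):
        self.threshold = threshold           # fraction of server capacity
        self.samples = deque(maxlen=window)  # e.g. one sample every 5 seconds

    def record(self, utilization):
        """Add a sample; return True when the whole window is above the threshold."""
        self.samples.append(utilization)
        return (len(self.samples) == self.samples.maxlen
                and min(self.samples) > self.threshold)

# Usage:
# alert = UtilizationAlert()
# if alert.record(current_cpu_fraction):
#     page_oncall("router servers sustained above 85% CPU")
```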

Last Update: A few months ago

UPDATE: Incident 18006 - We've received a report of an issue with Google Cloud Networking in us-east1.

The issue with Google Cloud Networking in us-east1 has been resolved for all affected projects as of Saturday, 2018-06-23 13:16 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18006 - We've received a report of an issue with Google Cloud Networking in us-east1.

We've received a report of an issue with Google Cloud Networking in us-east1.

Last Update: A few months ago

UPDATE: Incident 18006 - We've received a report of an issue with Google Cloud Networking in us-east1.

We are experiencing an issue with Google Cloud Networking beginning on Saturday, 2018-06-23 12:02 US/Pacific. Current data indicates that approximately 33% of projects in us-east1 are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by Saturday, 2018-06-23 14:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18037 - We've received a report of an issue with Google BigQuery.

The issue with Google BigQuery has been resolved for all affected projects as of Friday, 2018-06-22 13:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18037 - We've received a report of an issue with Google BigQuery.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Friday, 2018-06-22 14:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18037 - We've received a report of an issue with Google BigQuery.

We've received a report of an issue with Google BigQuery.

Last Update: A few months ago

UPDATE: Incident 18037 - We've received a report of an issue with Google BigQuery.

We are investigating an issue with Google BigQuery. Our Engineering Team is investigating possible causes. Affected customers may see their queries fail with 500 errors. We will provide another status update by Friday, 2018-06-22 14:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses, stopped instances networking are not coming up when started.

The issue with Google Compute Engine VM instances being allocated duplicate internal IP addresses has been resolved for all affected projects as of Saturday, 2018-06-16 12:59 US/Pacific. Customers with VMs having duplicate internal IP addresses should follow the workaround described earlier, which is to delete (without deleting the boot disk), and recreate the affected VM instances. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses, stopped instances networking are not coming up when started.

Mitigation work is currently underway by our Engineering Team. GCE VMs that have an internal IP that is not assigned to another VM within the same project, region and network should no longer see this issue occurring; however, instances where another VM is using their internal IP may fail to start with networking. We will provide another status update by Saturday, 2018-06-16 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses, stopped instances networking are not coming up when started.

Detailed Description: We are investigating an issue with Google Compute Engine VM instances failing to start with networking after being stopped, or new instances being allocated with the same IP address as a VM instance which was stopped within the past 24 hours. Our Engineering Team is evaluating a fix in a test environment. We will provide another status update by Saturday, 2018-06-16 03:30 US/Pacific with current details.

Diagnosis: Instances that were stopped at any time between 2018-06-14 08:42 and 2018-06-15 13:40 US/Pacific may fail to start with networking. A newly allocated VM instance has the same IP address as a VM instance which was stopped within the mentioned time period.

Workaround: As an immediate mitigation to fix instances for which networking is not working, instances should be recreated, that is, a delete (without deleting the boot disk) and a create. For example:

gcloud compute instances describe instance-1 --zone us-central1-a
gcloud compute instances delete instance-1 --zone us-central1-a --keep-disks=all
gcloud compute instances create instance-1 --zone us-central1-a --disk='boot=yes,name=instance-1'

To prevent new instances from coming up with duplicate IP addresses we suggest creating f1-micro instances until new IP addresses are allocated, and then stopping those instances to stop incurring charges. Alternatively, new instances can be brought up with a static internal IP address.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses

Our Engineering Team continues to evaluate the fix in a test environment. We believe that customers can work around the issue by launching then stopping f1 micro instances until no more duplicate IP addresses are obtained. We are awaiting confirmation that the provided workaround works for customers. We will provide another status update by Friday, 2018-06-15 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses

Our Engineering Team is evaluating a fix in a test environment. We will provide another status update by Friday, 2018-06-15 17:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses

Google Compute Engine VM instances allocated with duplicate internal IP addresses

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses

Investigation continues by our Engineering Team. We are investigating workarounds as well as a method to resolve the issue for all affected projects. We will provide another status update by Friday, 2018-06-15 17:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18011 - The issue with Google Networking in South America should be resolved.

The issue with Google Cloud Networking in South America has been resolved for all affected users as of Monday, 2018-06-04 22:22. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18011 - The rate of errors is decreasing for the Cloud Networking issue in South America.

The issue with Google Networking in South America should be resolved for some users and we expect a full resolution in the near future. We will provide another status update by Monday, 2018-06-04 23:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18011 - The rate of errors is decreasing for the Cloud Networking issue in South America.

The rate of errors is decreasing. We will provide another status update by Monday, 2018-06-04 23:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18011 - Mitigation work is currently underway by our Engineering Team to restore the loss of traffic with Google Networking in South America.

Mitigation work is currently underway by our Engineering Team for Google Cloud Networking issue affecting South America. We will provide more information by Monday, 2018-06-04 22:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18011 - We are investigating loss of traffic with Google Networking in South America.

We are investigating loss of traffic with Google Networking in South America.

Last Update: A few months ago

UPDATE: Incident 18011 - We are investigating loss of traffic with Google Networking in South America.

We are investigating loss of traffic to and from South America. Our Engineering Team is investigating possible causes. We will provide another status update by Monday, 2018-06-04 22:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18007 - The Stackdriver logging service is experiencing a 30-minute delay.

ISSUE SUMMARY

On Sunday, 20 May 2018 for 4 hours and 25 minutes, approximately 6% of Stackdriver Logging logs experienced a median ingest latency of 90 minutes. To our Stackdriver Logging customers whose operations monitoring was impacted during this outage, we apologize. We have conducted an internal investigation and are taking steps to ensure this doesn’t happen again.

DETAILED DESCRIPTION OF IMPACT

On Sunday, 20 May 2018 from 18:40 to 23:05 Pacific Time, 6% of logs ingested by Stackdriver Logging experienced log event ingest latency of up to 2 hours 30 minutes, with a median latency of 90 minutes. Customers requesting log events within the latency window would receive empty responses. Logging export sinks were not affected.

ROOT CAUSE

Stackdriver Logging uses a pool of workers to persist ingested log events. On Sunday, 20 May 2018 at 17:40, a load spike in the Stackdriver Logging storage subsystem caused 0.05% of persist calls made by the workers to time out. The workers would then retry persisting to the same address until reaching a retry timeout. While the workers were retrying, they were not persisting other log events. This resulted in multiple workers being removed from the pool of available workers. By 18:40, enough workers had been removed from the pool to reduce throughput below the level of incoming traffic, creating delays for 6% of logs.

REMEDIATION AND PREVENTION

After Google Engineering was paged, engineers isolated the issue to these timing out workers. At 20:35, engineers configured the workers to return timed out log events to the queue and move on to a different log event after timeout. This allowed workers to catch up with the ingest rate. At 23:02, the last delayed message was delivered. We are taking the following steps to prevent the issue from happening again: we are modifying the workers to retry persists using alternate addresses to reduce the impact of persist timeouts; we are increasing the persist capacity of the storage subsystem to manage load spikes; we are modifying Stackdriver Logging workers to reduce their unavailability when the storage subsystem experiences higher latency.
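
The remediation above changed the workers to return a timed-out log event to the queue and move on, instead of blocking on retries to the same address. A minimal sketch of that worker pattern, with a caller-supplied persist function and no claim to match Stackdriver's actual implementation, is shown below:

```python
# Minimal sketch (assumed, not Stackdriver's implementation) of the remediation
# described above: if persisting a log event times out, put the event back on the
# queue and move on rather than retrying the same backend and stalling the worker.
import queue

def worker_loop(events, persist, timeout_s=1.0):
    """Drain `events`; re-queue any event whose persist call times out."""
    while True:
        event = events.get()
        if event is None:       # sentinel value used to stop the worker
            return
        try:
            persist(event, timeout=timeout_s)
        except TimeoutError:
            events.put(event)   # hand the event to another worker or a later attempt
        finally:
            events.task_done()

# Usage:
# q = queue.Queue(); q.put({"msg": "hello"}); q.put(None)
# worker_loop(q, persist=lambda e, timeout: None)
```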

Last Update: A few months ago

RESOLVED: Incident 18003 - Issue with Google Cloud project creation

The issue with project creation has been resolved for all affected projects as of Tuesday, 2018-05-22 13:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18009 - GCE Networking issue in us-east4

ISSUE SUMMARY

On Wednesday 16 May 2018, Google Cloud Networking experienced loss of connectivity to external IP addresses located in us-east4 for a duration of 58 minutes.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 16 May 2018 from 18:43 to 19:41 PDT, Google Compute Engine, Google Cloud VPN, and Google Cloud Network Load Balancers hosted in the us-east4 region experienced 100% packet loss from the internet and other GCP regions. Google Dedicated Interconnect Attachments located in us-east4 also experienced loss of connectivity.

ROOT CAUSE

Every zone in Google Cloud Platform advertises several sets of IP addresses to the Internet via BGP. Some of these IP addresses are global and are advertised from every zone, others are regional and advertised only from zones in the region. The software that controls the advertisement of these IP addresses contained a race condition during application startup that would cause regional IP addresses to be filtered out and withdrawn from a zone. During a routine binary rollout of this software, the race condition was triggered in each of the three zones in the us-east4 region. Traffic continued to be routed until the last zone received the rollout and stopped advertising regional prefixes. Once the last zone stopped advertising the regional IP addresses, external regional traffic stopped entering us-east4.

REMEDIATION AND PREVENTION

Google engineers were alerted to the problem within one minute and as soon as investigation pointed to a problem with the BGP advertisements, a rollback of the binary in the us-east4 region was created to mitigate the issue. Once the rollback proved effective, the original rollout was paused globally to prevent any further issues. We are taking the following steps to prevent the issue from happening again. We are adding additional monitoring which will provide better context in future alerts to allow us to diagnose issues faster. We also plan on improving the debuggability of the software that controls the BGP advertisements. Additionally, we will be reviewing the rollout policy for these types of software changes so we can detect issues before they impact an entire region. We apologize for this incident and we recognize that regional outages like this should be rare and quickly rectified. We are taking immediate steps to prevent recurrence and improve reliability in the future.
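
One prevention item above is additional monitoring of BGP advertisements. A simple form of such a check is comparing the set of prefixes a zone is expected to advertise against what it is actually advertising, and alerting on anything missing. The sketch below is hypothetical (the prefixes are RFC 5737 documentation ranges), not Google's control plane:

```python
# Minimal sketch (assumed, not Google's BGP control plane): alert when prefixes a
# zone is expected to advertise are missing from its current advertisements.
# The prefixes below are RFC 5737 documentation ranges, used purely as examples.
EXPECTED_REGIONAL_PREFIXES = {"203.0.113.0/24", "198.51.100.0/24"}

def missing_prefixes(advertised):
    """Return the expected regional prefixes that are not currently advertised."""
    return EXPECTED_REGIONAL_PREFIXES - set(advertised)

# Usage:
# gone = missing_prefixes(current_advertisements)
# if gone:
#     page_oncall(f"regional prefixes withdrawn: {sorted(gone)}")
```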

Last Update: A few months ago

UPDATE: Incident 18003 - Issue with Google Cloud project creation

We are experiencing an issue with creating new projects as well as activating some APIs for projects, beginning at Tuesday, 2018-05-22 07:50 US/Pacific. For everyone who is affected, we apologize for the disruption. We will provide an update by Tuesday, 2018-05-22 13:40 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Issue with Google Cloud project creation

Issue with Google Cloud project creation

Last Update: A few months ago

UPDATE: Incident 18003 - Issue with Google Cloud project creation

The rate of errors is decreasing. We will provide another status update by Tuesday, 2018-05-22 14:10 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - Issue with Google Cloud project creation and API activation

We are experiencing an issue with creating new projects as well as activating some APIs for projects, beginning at Tuesday, 2018-05-22 07:50 US/Pacific. For everyone who is affected, we apologize for the disruption. We will provide an update by Tuesday, 2018-05-22 13:40 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18007 - The Stackdriver logging service is experiencing a 30-minute delay.

The issue with Stackdriver logging delay has been resolved for all affected projects as of Sunday, 2018-05-20 22:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18007 - The Stackdriver logging service is experiencing a 30-minute delay.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Sunday, 2018-05-20 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18007 - The Stackdriver logging service is experiencing a 30-minute delay.

The Stackdriver logging service is experiencing a 30-minute delay. We will provide another status update by Sunday, 2018-05-20 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18007 - We’ve received a report of an issue with StackDriver logging delay. We will provide more information by Sunday 20:45 US/Pacific.

We are investigating an issue with Google Stackdriver. We will provide more information by Sunday, 2018-05-20 20:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4.

The issue with external IP allocation in us-east4 has been resolved as of Saturday, 2018-05-19 11:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4.

Allocation of new external IP addresses for GCE instance creation continues to be unavailable in us-east4. For everyone who is affected, we apologize for the disruption. We will provide an update by Saturday, 2018-05-19 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4.

We are experiencing an issue with Google Cloud Networking and Google Compute Engine in us-east4 that prevents the creation of GCE instances that require allocation of new external IP addresses. For everyone who is affected, we apologize for the disruption. We will provide an update by Saturday, 2018-05-19 06:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4. We will provide more information by Friday, 2018-05-18 22:00 US/Pacific.

We are experiencing an issue with Google Cloud Networking and Google Compute Engine in us-east4 that prevents the creation of GCE instances with external IP addresses attached. Early investigation indicates that all instances in us-east4 are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by Friday, 2018-05-18 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4. We will provide more information by Friday, 2018-05-18 22:00 US/Pacific.

We are investigating an issue with Google Cloud Networking in us-east4. We will provide more information by Friday, 2018-05-18 22:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18036 - Multiple failing BigQuery job types

ISSUE SUMMARY

On Wednesday 16 May 2018, Google BigQuery experienced failures of import, export and query jobs for a duration of 88 minutes over two time periods (55 minutes initially, and 33 minutes in the second, which was isolated to the EU). We sincerely apologize to all of our affected customers; this is not the level of reliability we aim to provide in our products. We will be issuing SLA credits to customers who were affected by this incident and we are taking immediate steps to prevent a future recurrence of these failures.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 16 May 2018 from 16:00 to 16:55 and from 17:45 to 18:18 PDT, Google BigQuery experienced a failure of some import, export and query jobs. During the first period of impact, there was a 15.26% job failure rate; during the second, which was isolated to the EU, there was a 2.23% error rate. Affected jobs would have failed with INTERNAL_ERROR as the reason.

ROOT CAUSE

Configuration changes being rolled out on the evening of the incident were not applied in the intended order. This resulted in an incomplete configuration change becoming live in some zones, subsequently triggering the failure of customer jobs. During the process of rolling back the configuration, another incorrect configuration change was inadvertently applied, causing the second batch of job failures.

REMEDIATION AND PREVENTION

Automated monitoring alerted engineering teams 15 minutes after the error threshold was met, and they were able to correlate the errors with the configuration change 3 minutes later. We feel that the configured alert delay is too long and have lowered it to 5 minutes in order to aid in quicker detection. During the rollback attempt, another bad configuration change was enqueued for automatic rollout and when unblocked, proceeded to roll out, triggering the second round of job failures. To prevent this from happening in the future, we are working to ensure that rollouts are automatically switched to manual mode when engineers are responding to production incidents. In addition, we're switching to a different configuration system which will ensure the consistency of configuration at all stages of the rollout.
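
One prevention item above is to switch rollouts to manual mode automatically while engineers are responding to a production incident. The gate itself can be very small; the sketch below is a hypothetical illustration of that policy, not Google's release tooling:

```python
# Minimal sketch (assumed, not Google's release tooling): automated rollouts are
# paused whenever an incident is open and only proceed with explicit approval.
def may_auto_rollout(open_incidents, manually_approved=False):
    """Return True if an automated rollout is allowed to proceed right now."""
    if open_incidents:          # e.g. a list of active incident IDs
        return manually_approved
    return True

# Usage:
# if may_auto_rollout(open_incidents=["18036"], manually_approved=False):
#     apply_next_config_change()
```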

Last Update: A few months ago

RESOLVED: Incident 18001 - Issue affecting Google Cloud Function customers' ability to create and update functions.

The issue with Google Cloud Functions affecting the ability to create and update functions has been resolved for all affected users as of Thursday, 2018-05-17 13:01 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - Issue affecting Google Cloud Function customers' ability to create and update functions.

The rate of errors is decreasing. We will provide another status update by Thursday, 2018-05-17 13:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - Issue affecting the ability to update and create Cloud Functions.

Issue affecting the ability to update and create Cloud Functions.

Last Update: A few months ago

UPDATE: Incident 18001 - Issue affecting the ability to update and create Cloud Functions.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Thursday, 2018-05-17 13:10 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18004 - The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region has been resolved for all affected users as of Wednesday, 2018-05-16 19:40 US/Pacific.

The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region has been resolved for all affected users as of Wednesday, 2018-05-16 19:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 18009 - GCE Networking issue in us-east4

The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region has been resolved for all affected users as of Wednesday, 2018-05-16 19:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18009 - GCE Networking issue in us-east4

The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Wednesday, 2018-05-16 20:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18004 - The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region should be resolved for the majority of users and we expect a full resolution in the near...

The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Wednesday, 2018-05-16 20:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18009 - GCE Networking issue in us-east4

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Wednesday, 2018-05-16 20:10 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18004 - GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss. Mitigation is underway.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Wednesday, 2018-05-16 20:10 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18009 - GCE Networking issue in us-east4

We are investigating an issue with GCE Networking in us-east4 region affecting GCE VMs, GKE, Cloud VPN and Cloud Private Interconnect resulting in network packet loss. We will provide more information by Wednesday, 2018-05-16 19:43 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18004 - GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss.

GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss.

Last Update: A few months ago

UPDATE: Incident 18004 - GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss.

We are investigating an issue with GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss. We will provide more information by Wednesday, 2018-05-16 19:43 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18036 - Multiple failing BigQuery job types

The issue with Google BigQuery has been resolved for all affected users as of 2018-05-16 17:06 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18036 - Multiple failing BigQuery job types

We are rolling back a configuration change to mitigate this issue. We will provide another status update by Wednesday 2018-05-16 17:21 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18007 - We are investigating an issue with increased packet loss in us-central1 with Google Cloud Networking.

ISSUE SUMMARY

On Wednesday 2 May, 2018 Google Cloud Networking experienced increased packet loss to the internet as well as other Google regions from the us-central1 region for a duration of 21 minutes. We understand that the network is a critical component that binds all services together. We have conducted an internal investigation and are taking steps to improve our service.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 2 May, 2018 from 13:47 to 14:08 PDT, traffic between all zones in the us-central1 region and all destinations experienced 12% packet loss. Traffic between us-central1 zones experienced 22% packet loss. Customers may have seen requests succeed to services hosted in us-central1 as loss was not evenly distributed: some connections did not experience any loss while others experienced 100% packet loss.

ROOT CAUSE

A control plane is used to manage configuration changes to the network fabric connecting zones in us-central1 to each other as well as the Internet. On Wednesday 2 May, 2018 Google Cloud Network engineering began deploying a configuration change using the control plane as part of planned maintenance work. During the deployment, a bad configuration was generated that blackholed a portion of the traffic flowing over the fabric. The control plane had a bug in it, which caused it to produce an incorrect configuration. New configurations deployed to the network fabric are evaluated for correctness, and regenerated if an error is found. In this case, the configuration error appeared after the configuration was evaluated, which resulted in deploying the erroneous configuration to the network fabric.

REMEDIATION AND PREVENTION

Automated monitoring alerted engineering teams 2 minutes after the loss started. Google engineers correlated the alerts to the configuration push and routed traffic away from the affected part of the fabric. Mitigation completed 21 minutes after loss began, ending impact to customers. After isolating the root cause, engineers then audited all configuration changes that were generated by the control plane and replaced them with known-good configurations. To prevent this from recurring, we will correct the control plane defect that generated the incorrect configuration and are adding additional validation at the fabric layer in order to more robustly detect configuration errors. Additionally, we intend to add logic to the network control plane to be able to self-heal by automatically routing traffic away from the parts of the network fabric in an error state. Finally, we plan on evaluating further isolation of control plane configuration changes to reduce the size of the possible failure domain. Again, we would like to apologize for this issue. We are taking immediate steps to improve the platform’s performance and availability.
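
The prevention items above include validating configurations at the fabric layer so that an error introduced after the initial correctness evaluation is still caught before deployment. A minimal, hypothetical sketch of re-validating immediately before the push (the 'blackhole' marker is invented purely for illustration) follows:

```python
# Minimal sketch (assumed, not the actual network control plane): re-validate a
# generated configuration immediately before pushing it, so an error introduced
# after the initial evaluation step is still caught at deploy time.
def validate(config):
    """Return a list of problems; an empty list means the config looks deployable."""
    problems = []
    for rule, action in config.items():
        if action == "blackhole":  # hypothetical marker for a traffic-dropping rule
            problems.append(f"rule {rule} would drop traffic")
    return problems

def deploy(config, push):
    """Run the last-line validation, then hand the config to the supplied push function."""
    problems = validate(config)
    if problems:
        raise ValueError("; ".join(problems))
    push(config)

# Usage: deploy({"vlan-42": "forward"}, push=print)
```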

Last Update: A few months ago

RESOLVED: Incident 18008 - We've received a report of connectivity issues from GCE instances.

The network connectivity issues from GCE instances has been resolved for all affected users as of 2018-05-07 10:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18008 - We've received a report of connectivity issues from GCE instances. Our Engineering Team believes they have identified the root cause of the errors and is working to mitigate. We will provide another s...

We are investigating connectivity issues from GCE instances. We will provide more information by 2018-05-07 10:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18007 - We are investigating an issue with increased packet loss in us-central1 with Google Cloud Networking.

The issue with Google Cloud Networking having increased packet loss in us-central1 has been resolved for all affected users as of Wednesday, 2018-05-02 14:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18007 - We are investigating an issue with increased packet loss in us-central1 with Google Cloud Networking.

We are investigating an issue with Google Cloud Networking. We will provide more information by 14:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - The Cloud Shell availability issue has been resolved for all affected users as of 2018-05-02 08:56 US/Pacific.

The Cloud Shell availability issue has been resolved for all affected users as of 2018-05-02 08:56 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Cloud Shell. We will provide more information by 09:15 US/Pacific.

Our Engineering Team believes they have identified the potential root cause of the issue. We will provide another status update by 09:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Cloud Shell. We will provide more information by 08:45 US/Pacific.

We are investigating an issue with Cloud Shell. We will provide more information by 08:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18006 - Users may be experiencing increased error rates when accessing the Stackdriver web UI

The issue with Stackdriver Web UI Returning 500 and 502 error codes has been resolved for all affected users as of Monday, 2018-04-30 13:01 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18006 - Users may be experiencing increased error rates when accessing the Stackdriver web UI

Users may be experiencing increased error rates when accessing the Stackdriver web UI

Last Update: A few months ago

UPDATE: Incident 18006 - Users may be experiencing increased error rates when accessing the Stackdriver web UI

Mitigations are proving effective; error rates of 500s and 502s are decreasing, though they are still elevated. We will provide more information by Monday, 2018-04-30 14:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

The issue with Google Cloud Pub/Sub message delivery has been resolved for all affected users as of 09:04 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

The mitigation work is currently underway by our Engineering Team and appears to be working. The messages that were delayed during this incident are starting to be delivered. We expect a full resolution in the near future.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

Our Engineering Team believes they have identified the root cause of the Cloud Pub/Sub message delivery issues and is working to mitigate.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

We are still investigating message delivery issues with Google Cloud Pub/Sub. Our Engineering Team is investigating possible causes. We will provide another status update by 7:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18001 - We are investigating an issue with Google Cloud Dataflow, Dataproc, GCE and GCR.

The issue with Dataflow, Dataproc, Compute Engine and GCR has been resolved for all affected users as of 19:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Dataflow, Dataproc, GCE and GCR. We will provide more information by 20:00 US/Pacific.

We are investigating an issue with Google Cloud Dataflow, Dataproc, GCE and GCR. We will provide more information by 20:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Dataflow, Dataproc, GCE and GCR. We will provide more information by 20:00 US/Pacific.

The issue with Dataflow, Dataproc, Compute Engine and GCR has been resolved for all affected users as of 19:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 18005 - The issue with StackDriver Logging GCS Exports is resolved. We have finished processing the backlog of GCS Export jobs.

The issue with StackDriver Logging GCS Exports has been resolved for all affected users. We have completed processing the backlog of GCS export jobs. An internal investigation has been started to make the appropriate improvements to our systems and help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18005 - The issue with Google StackDriver Logging GCS Export is resolved. We are currently processing the backlog.

The issue with StackDriver Logging GCS Export is mitigated and backlog processing is ongoing. We will provide another update by 08:00 US/Pacific with current status of the backlog.

Last Update: A few months ago

UPDATE: Incident 18005 - The issue with Google StackDriver Logging GCS Export is mitigated.

The issue with StackDriver Logging GCS Export is mitigated and backlog processing is ongoing. We will provide another update by 01:00 US/Pacific with current status of the backlog.

Last Update: A few months ago

UPDATE: Incident 18005 - The issue with Google StackDriver Logging GCS Export is mitigated. We will provide an update by 22:00 US/Pacific.

Our Engineering Team believes they have identified and mitigated the root cause of the delay on StackDriver Logging GCS Export service. We are actively working to process the backlogs. We will provide an update by 22:00 US/Pacific with current progress.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google StackDriver Logging GCS Export. We will provide more information by 18:45 US/Pacific.

Mitigation work is still underway by our Engineering Team to address the delay issue with Google StackDriver Logging GCS Export. We will provide more information by 18:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google StackDriver Logging GCS Export. We will provide more information by 17:45 US/Pacific.

Mitigation work is currently underway by our Engineering Team to address the issue with Google StackDriver Logging GCS Export. We will provide more information by 17:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google StackDriver Logging GCS Export. We will provide more information by 17:15 US/Pacific.

We are investigating an issue with Google StackDriver Logging GCS Export. We will provide more information by 17:15 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We've received a report of an issue with Google App Engine as of 2018-03-29 04:52 US/Pacific. We will provide more information by 05:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - We've received a report of an issue with Google App Engine as of 2018-03-29 04:52 US/Pacific. We will provide more information by 05:30 US/Pacific.

The issue with Cloud Datastore in europe-west2 has been resolved for all affected users as of 05:03 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18003 - We've received a report of an issue with Google Compute Engine as of 2018-03-16 11:32 US/Pacific. We will provide more information by 12:15 US/Pacific.

The issue with slow network programming should be resolved for all zones in us-east1 as of 12:44 US/Pacific. The root cause has been identified and we are working to prevent a recurrence. We will provide more information by 14:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18003 - We've received a report of an issue with Google Compute Engine as of 2018-03-16 11:32 US/Pacific. We will provide more information by 12:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - We are investigating an issue with Google Cloud Shell as of 2018-03-13 17:44 US/Pacific.

The issue with Cloud Shell has been resolved for all affected users as of Tue 2018-03-13 19:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Shell as of 2018-03-13 17:44 US/Pacific. We will provide more information by 19:30 US/Pacific.

We are still investigating an issue with Google Cloud Shell starting 2018-03-13 17:44 US/Pacific. We will provide more information by 19:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Shell as of 2018-03-13 17:44 US/Pacific. We will provide more information by 18:30 US/Pacific.

We are investigating an issue with Google Cloud Shell as of 2018-03-13 17:44 US/Pacific. We will provide more information by 18:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We've received a report of an issue with Google Compute Engine as of 2018-03-10 21:06 US/Pacific. Mitigation work is currently underway by our Engineering Team. We will provide more information by 22:...

Last Update: A few months ago

RESOLVED: Incident 18005 - We are investigating an issue with Google Cloud Networking. We will provide more information by 22:30 US/Pacific.

The issue with Google Cloud Networking intermittent traffic disruption to and from us-central has been resolved for all affected users as of 2018-02-23 22:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18006 - The issue with Google Cloud Networking intermittent traffic disruption to and from us-central has been resolved for all affected users as of 2018-02-23 22:35 US/Pacific.

The issue with Google Cloud Networking intermittent traffic disruption to and from us-central has been resolved for all affected users as of 2018-02-23 22:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18006 - The issue with Google Cloud Networking intermittent traffic disruption to and from us-central should now show signs of recovery. We will provide another status update by Friday, 2018-02-23 22:40 US/Pa...

The issue with Google Cloud Networking causing intermittent traffic disruption to and from us-central should now show signs of recovery for the majority of users, and we expect a full resolution in the near future. We will provide another status update by Friday, 2018-02-23 22:40 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18006 - We are investigating an issue with Google Cloud Networking. The issue started around Friday, 2018-02-23 21:40 US/Pacific. This affects traffic to and from us-central. We are rolling out a potential fi...

Last Update: A few months ago

UPDATE: Incident 18006 - We are investigating an issue with Google Cloud Networking. The issue started around Friday, 2018-02-23 21:40 US/Pacific. This affects traffic to and from us-central. We are rolling out a potential fi...

We are investigating an issue with Google Cloud Networking. The issue started around Friday, 2018-02-23 21:40 US/Pacific. This affects traffic to and from us-central. We are rolling out a potential fix to mitigate this issue. We will provide more information by Friday, 2018-02-23 22:40 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google Cloud Networking. We will provide more information by 22:30 US/Pacific.

We are rolling out a potential fix to mitigate this issue. The affected region seems to be us-central.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google Cloud Networking. We will provide more information by 22:30 US/Pacific.

We are investigating an issue with Google Cloud Networking. We will provide more information by 22:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - App Engine Admin-API experiencing high error rates

On Thursday 15 February 2018, specific Google Cloud Platform services experienced elevated errors and latency for a period of 62 minutes from 11:42 to 12:44 PST. The following services were impacted: Cloud Datastore experienced a 4% error rate for get calls and an 88% error rate for put calls. App Engine's serving infrastructure, which is responsible for routing requests to instances, experienced a 45% error rate, most of which were timeouts. App Engine Task Queues would not accept new transactional tasks, and also would not accept new tasks in regions outside us-central1 and europe-west1. Tasks continued to be dispatched during the event but saw start delays of 0-30 minutes; additionally, a fraction of tasks executed with errors due to the aforementioned Cloud Datastore and App Engine performance issues. App Engine Memcache calls experienced a 5% error rate. App Engine Admin API write calls failed during the incident, causing unsuccessful application deployments. App Engine Admin API read calls experienced a 13% error rate. App Engine Search API index writes failed during the incident though search queries did not experience elevated errors. Stackdriver Logging experienced delays exporting logs to systems including Cloud Console Logs Viewer, BigQuery and Cloud Pub/Sub. Stackdriver Logging retries on failure so no logs were lost during the incident. Logs-based Metrics failed to post some points during the incident. We apologize for the impact of this outage on your application or service. For Google Cloud Platform customers who rely on the products which were part of this event, the impact was substantial and we recognize that it caused significant disruption for those customers. We are conducting a detailed post-mortem to ensure that all the root and contributing causes of this event are understood and addressed promptly.
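
The observation that Stackdriver Logging retries on failure, and therefore lost no logs, reflects a general client-side pattern for riding out elevated error rates such as those described above. The sketch below shows that pattern, retry with exponential backoff and jitter, using a hypothetical write_entry call as a stand-in for any transiently failing API; it is not the Stackdriver implementation.

import random
import time

class TransientError(Exception):
    pass

def write_entry(entry: str) -> None:
    # Hypothetical stand-in for an API call that can fail during an incident.
    if random.random() < 0.3:
        raise TransientError("backend returned 5xx")
    print(f"wrote: {entry}")

def write_with_retries(entry: str, max_attempts: int = 5) -> None:
    for attempt in range(max_attempts):
        try:
            write_entry(entry)
            return
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter so retrying clients do not all
            # hammer the backend at the same moment; delays kept short here.
            time.sleep((2 ** attempt) * 0.2 + random.uniform(0, 0.1))

if __name__ == "__main__":
    write_with_retries("request log line")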

Last Update: A few months ago

RESOLVED: Incident 18035 - Bigquery experiencing high latency rates

On Thursday 15 February 2018, specific Google Cloud Platform services experienced elevated errors and latency for a period of 62 minutes from 11:42 to 12:44 PST. The following services were impacted: Cloud Datastore experienced a 4% error rate for get calls and an 88% error rate for put calls. App Engine's serving infrastructure, which is responsible for routing requests to instances, experienced a 45% error rate, most of which were timeouts. App Engine Task Queues would not accept new transactional tasks, and also would not accept new tasks in regions outside us-central1 and europe-west1. Tasks continued to be dispatched during the event but saw start delays of 0-30 minutes; additionally, a fraction of tasks executed with errors due to the aforementioned Cloud Datastore and App Engine performance issues. App Engine Memcache calls experienced a 5% error rate. App Engine Admin API write calls failed during the incident, causing unsuccessful application deployments. App Engine Admin API read calls experienced a 13% error rate. App Engine Search API index writes failed during the incident though search queries did not experience elevated errors. Stackdriver Logging experienced delays exporting logs to systems including Cloud Console Logs Viewer, BigQuery and Cloud Pub/Sub. Stackdriver Logging retries on failure so no logs were lost during the incident. Logs-based Metrics failed to post some points during the incident. We apologize for the impact of this outage on your application or service. For Google Cloud Platform customers who rely on the products which were part of this event, the impact was substantial and we recognize that it caused significant disruption for those customers. We are conducting a detailed post-mortem to ensure that all the root and contributing causes of this event are understood and addressed promptly.

Last Update: A few months ago

RESOLVED: Incident 18003 - App Engine seeing elevated error rates

On Thursday 15 February 2018, specific Google Cloud Platform services experienced elevated errors and latency for a period of 62 minutes from 11:42 to 12:44 PST. The following services were impacted: Cloud Datastore experienced a 4% error rate for get calls and an 88% error rate for put calls. App Engine's serving infrastructure, which is responsible for routing requests to instances, experienced a 45% error rate, most of which were timeouts. App Engine Task Queues would not accept new transactional tasks, and also would not accept new tasks in regions outside us-central1 and europe-west1. Tasks continued to be dispatched during the event but saw start delays of 0-30 minutes; additionally, a fraction of tasks executed with errors due to the aforementioned Cloud Datastore and App Engine performance issues. App Engine Memcache calls experienced a 5% error rate. App Engine Admin API write calls failed during the incident, causing unsuccessful application deployments. App Engine Admin API read calls experienced a 13% error rate. App Engine Search API index writes failed during the incident though search queries did not experience elevated errors. Stackdriver Logging experienced delays exporting logs to systems including Cloud Console Logs Viewer, BigQuery and Cloud Pub/Sub. Stackdriver Logging retries on failure so no logs were lost during the incident. Logs-based Metrics failed to post some points during the incident. We apologize for the impact of this outage on your application or service. For Google Cloud Platform customers who rely on the products which were part of this event, the impact was substantial and we recognize that it caused significant disruption for those customers. We are conducting a detailed post-mortem to ensure that all the root and contributing causes of this event are understood and addressed promptly.

Last Update: A few months ago

RESOLVED: Incident 18001 - Cloud PubSub experiencing missing subscription metrics. Additionally, some Dataflow jobs with PubSub sources appear as they do not consume any messages.

The issue with Cloud PubSub causing watermark increase in Dataflow jobs has been resolved for all affected projects as of Tue, 2018-02-20 05:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - Cloud PubSub experiencing missing subscription metrics. Additionally, some Dataflow jobs with PubSub sources appear as they do not consume any messages.

The watermarks of the affected Dataflow jobs using PubSub are now returning to normal.

Last Update: A few months ago

UPDATE: Incident 18001 - Cloud PubSub experiencing missing subscription metrics. Additionally, some Dataflow jobs with PubSub sources appear as they do not consume any messages.

We are experiencing an issue with Cloud PubSub beginning at approximately 20:00 US/Pacific on 2018-02-19. Early investigation indicates that approximately 10-15% of Dataflow jobs are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by 05:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - We are still investigating an issue with Google Cloud Pub/Sub. We will provide more information by 04:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are still investigating an issue with Google Cloud Pub/Sub. We will provide more information by 04:00 US/Pacific.

We are investigating an issue with Cloud PubSub. We will provide more information by 04:00 AM US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18003 - We are investigating an issue with Google Cloud Networking affecting connectivity in us-central1 and europe-west3. We will provide more information by 12:15pm US/Pacific.

ISSUE SUMMARY

On Sunday 18 January 2018, Google Compute Engine networking experienced a network programming failure. The two impacts of this incident were the autoscaler not scaling instance groups, and migrated and newly-created VMs not communicating with VMs in other zones, for a duration of up to 93 minutes. We apologize for the impact this event had on your applications and projects, and we will carefully investigate the causes and implement measures to prevent recurrences.

DETAILED DESCRIPTION OF IMPACT

On Sunday 18 January 2018, Google Compute Engine network provisioning updates failed in the following zones:

- europe-west3-a for 34 minutes (09:52 AM to 10:21 AM PT)
- us-central1-b for 79 minutes (09:57 AM to 11:16 AM PT)
- asia-northeast1-a for 93 minutes (09:53 AM to 11:26 AM PT)

Propagation of Google Compute Engine networking configuration for newly created and migrated VMs is handled by two components. The first is responsible for providing a complete list of VMs, networks, firewall rules, and scaling decisions. The second provides a stream of updates for the components in a specific zone. During the affected period, the first component failed to return data. VMs in the affected zones were unable to communicate with newly-created or migrated VMs in another zone in the same private GCE network. VMs in the same zone were unaffected because they are updated by the streaming component. The autoscaler service also relies on data from the failed first component to scale instance groups; without updates from that component, it could not make scaling decisions for the affected zones.

ROOT CAUSE

A stuck process failed to provide updates to the Compute Engine control plane. Automatic failover was unable to force-stop the process, and manual failover was required to restore normal operation.

REMEDIATION AND PREVENTION

The engineering team was alerted when the propagation of network configuration information stalled. They manually failed over to the replacement task to restore normal operation of the data persistence layer. To prevent another occurrence of this incident, we are taking the following actions:

- Stop VM migrations if the configuration data is stale.
- Modify the data persistence layer to re-resolve its peers during long-running processes, to allow failover to replacement tasks.
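
One of the prevention items above, stopping VM migrations when configuration data is stale, can be illustrated with a small hypothetical Python sketch. The staleness threshold and the ConfigSnapshot/migrate_vm names are assumptions made for the example, not details from the incident.

import time

STALENESS_LIMIT_SECONDS = 60  # assumed threshold for this sketch only

class ConfigSnapshot:
    def __init__(self) -> None:
        self.last_refresh = time.time()

    def refresh(self) -> None:
        # In the incident the propagation path stalled; here it just timestamps.
        self.last_refresh = time.time()

    def is_stale(self) -> bool:
        return time.time() - self.last_refresh > STALENESS_LIMIT_SECONDS

def migrate_vm(vm_name: str, snapshot: ConfigSnapshot) -> None:
    if snapshot.is_stale():
        # Fail safe: keep the VM where it is until fresh configuration propagates.
        raise RuntimeError(f"refusing to migrate {vm_name}: network config is stale")
    print(f"migrating {vm_name} with fresh network configuration")

if __name__ == "__main__":
    snap = ConfigSnapshot()
    migrate_vm("instance-1", snap)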

Last Update: A few months ago

RESOLVED: Incident 18003 - App Engine seeing elevated error rates

The issue with App Engine has been resolved for all affected projects as of 12:44 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18001 - Datastore experiencing elevated error rates

We continue to see significant improvement across all Datastore services. We are continuing to monitor and will provide another update by 15:00 PST.

Last Update: A few months ago

RESOLVED: Incident 18001 - App Engine Admin-API experiencing high error rates

The issue with App Engine Admin API has been resolved for all affected users as of Thursday, 2018-02-15 13:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18003 - Stackdriver Logging Service Degraded

The issue with Google Stackdriver has been resolved for all affected projects as of 13:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18004 - We are investigating an issue with Google Stackdriver. We will provide more information by 13:15 US/Pacific.

The issue with Google Stackdriver has been resolved for all affected projects as of 13:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18004 - We are investigating an issue with Google Stackdriver. We will provide more information by 13:15 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - Datastore experiencing elevated error rates

We are seeing a near return to baseline; however, we aren't seeing a consistent view of our quota and are investigating. We will provide another update by roughly 13:45 PST.

Last Update: A few months ago

RESOLVED: Incident 18035 - Bigquery experiencing high latency rates

The issue with Bigquery has been resolved for all affected projects as of Thursday, 2018-02-15 13:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18003 - App Engine seeing elevated error rates

We're seeing widespread improvement in error rates in many / most regions since ~12:40 PST. We're continuing to investigate and will provide another update by 13:30 PST.

Last Update: A few months ago

UPDATE: Incident 18003 - Stackdriver Logging Service Degraded

We are investigating an issue with Stackdriver Logging Service. We will provide more information by 13:25 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18035 - Bigquery experiencing high latency rates

We are investigating an issue with Bigquery. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18003 - App Engine seeing elevated error rates

We are investigating an issue with App Engine. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - Datastore experiencing elevated error rates

We are investigating an issue with Datastore. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - The issue with reduced URLFetch availability has been resolved for all affected projects as of 2018-02-10 18:55 US/Pacific. We will provide a more detailed analysis of this incident once we have compl...

The issue with reduced URLFetch availability has been resolved for all affected projects as of 2018-02-10 18:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google App Engine reduced URLFetch availability starting at 17:15pm US/Pacific. We are currently rolling out a configuration change to mitigate this issue. We will ...

We are investigating an issue with Google App Engine reduced URLFetch availability starting at 17:15 US/Pacific. We are currently rolling out a configuration change to mitigate this issue. We will provide another status update by 2018-02-10 19:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google App Engine reduced URLFetch availability starting at 17:15pm PT. We will provide more information by 19:00 US/Pacific.

We are investigating an issue with Google App Engine reduced URLFetch availability starting at 17:15 PT. We will provide more information by 19:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - We are investigating an issue with Google Kubernetes Engine. We will provide more information by 20:15 US/Pacific.

On Wednesday 31 January 2018, some Google Cloud services experienced elevated errors and latency on operations that required inter-data center network traffic for a duration of 72 minutes. The impact was visible during three windows: between 18:20 and 19:08 PST, between 19:10 and 19:29, and again between 19:45 and 19:50. Network traffic between the public internet and Google's data centers was not affected by this incident. The root cause was an error in a configuration update to the system that allocates network capacity for traffic between Google data centers. To prevent a recurrence, we will improve the automated checks that we run on configuration changes to detect problems before release, and we will improve canary monitoring to detect problems before configuration changes are rolled out globally.
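
The canary-before-global-rollout idea mentioned in the prevention steps can be sketched as follows. The rollout function, site names, and health check are hypothetical; the point is only that a change is applied to a small slice first and the rollout halts if that slice looks unhealthy.

from typing import Callable, List

def rollout(change: str, sites: List[str], healthy: Callable[[str], bool],
            canary_count: int = 1) -> None:
    canaries, remainder = sites[:canary_count], sites[canary_count:]
    for site in canaries:
        print(f"applying {change} to canary {site}")
        if not healthy(site):
            # Abort before the change can touch the rest of the fleet.
            raise RuntimeError(f"canary {site} unhealthy; halting rollout of {change}")
    for site in remainder:
        print(f"applying {change} to {site}")

if __name__ == "__main__":
    rollout("capacity-config-v2", ["dc-a", "dc-b", "dc-c"], healthy=lambda site: True)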

Last Update: A few months ago

RESOLVED: Incident 18001 - The issue with Google Stackdriver Logging has been resolved for all affected projects as of 20:02 US/Pacific.

On Wednesday 31 January 2018, some Google Cloud services experienced elevated errors and latency on operations that required inter-data center network traffic for a duration of 72 minutes. The impact was visible during three windows: between 18:20 and 19:08 PST, between 19:10 and 19:29, and again between 19:45 and 19:50. Network traffic between the public internet and Google's data centers was not affected by this incident. The root cause was an error in a configuration update to the system that allocates network capacity for traffic between Google data centers. To prevent a recurrence, we will improve the automated checks that we run on configuration changes to detect problems before release, and we will improve canary monitoring to detect problems before configuration changes are rolled out globally.

Last Update: A few months ago

RESOLVED: Incident 18001 - The issue with Google App Engine services has been resolved for all affected projects as of 21:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements ...

On Wednesday 31 January 2018, some Google Cloud services experienced elevated errors and latency on operations that required inter-data center network traffic for a duration of 72 minutes. The impact was visible during three windows: between 18:20 and 19:08 PST, between 19:10 and 19:29, and again between 19:45 and 19:50. Network traffic between the public internet and Google's data centers was not affected by this incident. The root cause was an error in a configuration update to the system that allocates network capacity for traffic between Google data centers. To prevent a recurrence, we will improve the automated checks that we run on configuration changes to detect problems before release, and we will improve canary monitoring to detect problems before configuration changes are rolled out globally.

Last Update: A few months ago

UPDATE: Incident 18003 - We are investigating an issue with Google Container Engine that is affecting cluster creation and upgrade. We will provide more information by 11:45 US/Pacific.

The issue with cluster creation and upgrade has been resolved for all affected projects as of 11:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18003 - We are investigating an issue with Google Container Engine that is affecting cluster creation and upgrade. We will provide more information by 11:45 US/Pacific.

We are investigating an issue with Google Container Engine that is affecting cluster creation and upgrade. We will provide more information by 11:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Stackdriver Trace API.

The issue with elevated failure rates in the Stackdriver Trace API has been resolved for all affected projects as of Friday, 2018-02-02 09:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Stackdriver Trace API.

Stackdriver Trace API continues to exhibit an elevated rate of request failures. Our engineering teams have put a mitigation in place, and will proceed to address the cause of the issue. We will provide another update by 10:45 AM US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Stackdriver Trace API.

StackDriver Trace API is experiencing an elevated rate of request failures. There is no workaround at this time. Our engineering teams have identified the cause of the issue and are working on a mitigation. We will provide another update by 9:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Stackdriver Trace API. We will provide more information by 09:00 US/Pacific.

We are investigating an issue with Google Stackdriver Trace API. We will provide more information by 09:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

On Thursday 25 January 2018, while expanding the Google network serving the us-central1 region, a configuration change unexpectedly triggered packet loss and reduced network bandwidth for inter-region data transfer and replication traffic to and from the region. The network impact was observable during two windows, between 11:03 and 12:40 PST, and again between 14:27 and 15:27 PST. The principal user-visible impact was a degradation in the performance of some Google Cloud services that require cross data center traffic. There was no impact to network traffic between the us-central1 region and the internet, or to traffic between Compute Engine VM instances. We sincerely apologize for the impact of this incident on your application or service. We have performed a detailed analysis of root cause and taken careful steps to ensure that this type of incident will not recur.

Last Update: A few months ago

UPDATE: Incident 18001 - We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation for most products has been completely resolved but some products ...

The issue with Google App Engine has been resolved for all affected projects as of 21:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We experienced an issue with Google Cloud Storage, beginning at approximately 18:30 US/Pacific. The situation was completely resolved by 20:08 PST.

We experienced an issue with Google Cloud Storage, beginning at approximately 18:30 US/Pacific. The situation was completely resolved by 20:08 PST.

Last Update: A few months ago

UPDATE: Incident 18001 - The issue with Google Stackdriver Logging has been resolved for all affected projects as of 20:02 US/Pacific.

The issue with Google Stackdriver Logging has been resolved for all affected projects as of 20:02 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. For everyone who is affected, we apologize for the disruption.

Last Update: A few months ago

UPDATE: Incident 18001 - We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation is improving for most products, but some products ar...

Google App Engine has mostly recovered and our engineering team is working to completely mitigate the issue. We will provide an update by 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Kubernetes Engine. We will provide more information by 20:15 US/Pacific.

The issue with Google Kubernetes Engine has been resolved for all affected projects as of 20:33 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. For everyone who is affected, we apologize for the disruption.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Stackdriver Logging. We will provide more information by 20:00 US/Pacific.

The issue with Google Stackdriver Logging should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 20:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Container Engine. We will provide more information by 20:15 US/Pacific.

The issue with Google Kubernetes Engine should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 20:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation is improving for most products, but some products ar...

Google App Engine is recovering and our engineering team is working to completely mitigate the issue. We will provide an update by 20:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation is improving for most products, but some products ar...

We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation is improving for most products, but some products are still reporting errors. For everyone who is affected, we apologize for the disruption. We will provide an update by 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Compute Engine. We will provide more information by 20:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Compute Engine. We will provide more information by 20:30 US/Pacific.

We are investigating an issue with Google Compute Engine. We will provide more information by 20:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Container Engine. We will provide more information by 20:15 US/Pacific.

We are investigating an issue with Google Container Engine. We will provide more information by 20:15 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Stackdriver Logging. We will provide more information by 20:00 US/Pacific.

We are investigating an issue with Google Stackdriver Logging. We will provide more information by 20:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 13:00 US/Pacific.

We have updated our estimated impact to an average of 2% and a peak of 3.6% GCS global error rate, based on a more thorough review of monitoring data and server logs. The initial estimate of impact was based on an internal assessment hosted in a single region; subsequent investigation revealed that Google's redundancy and rerouting infrastructure worked as intended and dramatically reduced the user-visible impact of the event to GCS's global user base. The 2% average error rate is measured over the duration of the event, from its beginning at 11:05 PST to its conclusion at 12:24 PST.
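
As a small illustration of how the revised figures above can be derived, the snippet below computes an average error rate over an incident window and the peak per-minute rate from request and error counts. The sample numbers are made up and chosen only so the output matches the 2% average and 3.6% peak shape of the estimate.

samples = [
    # (requests, errors) per minute during the incident window; numbers are made up
    (10_000, 150),
    (10_000, 360),   # worst minute in this sample data: 3.6%
    (10_000, 90),
]

total_requests = sum(requests for requests, _ in samples)
total_errors = sum(errors for _, errors in samples)
average_rate = total_errors / total_requests          # averaged over the whole window
peak_rate = max(errors / requests for requests, errors in samples)

print(f"average error rate: {average_rate:.1%}")
print(f"peak error rate:    {peak_rate:.1%}")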

Last Update: A few months ago

RESOLVED: Incident 18004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

The issue with Google Cloud Networking has been resolved for all affected users as of 2018-01-25 13:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

The issue with Google Cloud Networking should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 2018-01-25 14:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18001 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 13:00 US/Pacific.

The issue with Google Cloud Storage error rates has been resolved for all affected users as of 12:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18034 - We are investigating an issue with BigQuery in the US region. We will provide more information by 13:00 US/Pacific.

The issue with BigQuery error rates has been resolved for all affected users as of 12:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 13:00 US/Pacific.

The issue with Google Cloud Storage in the US regions should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 2018-01-25 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 13:00 US/Pacific.

We are experiencing an issue with Google Cloud Storage beginning Thursday, 2018-01-25 11:23 US/Pacific. Current investigation indicates that approximately 100% of customers in the US region are affected and we expect that for affected users the service is mostly or entirely unavailable at this time. For everyone who is affected, we apologize for the disruption. We will provide an update by 2018-01-25 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18034 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 12:45 US/Pacific.

We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 12:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18003 - We are investigating an issue with Google Cloud Networking affecting connectivity in us-central1 and europe-west3. We will provide more information by 12:15pm US/Pacific.

The issue with Google Cloud Networking connectivity has been resolved for all affected zones in europe-west3, us-central1, and asia-northeast1 as of 11:26am US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18003 - We are investigating an issue with Google Cloud Networking affecting connectivity in us-central1 and europe-west3. We will provide more information by 12:15pm US/Pacific.

We are investigating an issue with Google Cloud Networking affecting connectivity in us-central1 and europe-west3. We will provide more information by 12:15pm US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We have resolved an issue with Google Cloud Networking.

The issue with packet loss from North and South America regions to Asia regions has been resolved for all affected users as of 1:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Networking. We will provide more information by 14:30 US/Pacific.

Our Engineering Team believes they have identified the root cause of the packet loss and is working to mitigate.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:15 US/Pacific.

We are experiencing an issue with packet loss from Google North and South America regions to Asia regions beginning at 12:30 US/Pacific. Current data indicates that approximately 40% of packets are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by 1:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:15 US/Pacific.

We are investigating an issue with Google Cloud Networking. We will provide more information by 13:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - Some GKE 'create/delete' cluster operations failing

The issue with GKE 'create/delete' cluster operations failing should be resolved for the majority of users. We will continue monitoring to confirm full resolution. We will provide more information by 11:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - Some GKE 'create/delete' cluster operations failing

The issue with GKE 'create/delete' cluster operations failing has been resolved for all affected users as of 8:27 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - Some GKE 'create/delete' cluster operations failing

The issue with GKE 'create/delete' cluster operations failing should be resolved for the majority of users. We will continue monitoring to confirm full resolution. We will provide more information by 09:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - Some GKE 'create/delete' cluster operations failing

We are still investigating an issue with some GKE 'create/delete' cluster operations failing. We will provide more information by 08:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - GKE 'create cluster' operations failing

We are investigating an issue with GKE 'create cluster' operations failing. We will provide more information by 08:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - The issue with Google Cloud Load Balancing (GCLB) creation has been resolved for all affected projects as of 20:22 US/Pacific.

The issue with Google Cloud Load Balancing (GCLB) creation has been resolved for all affected projects as of 20:22 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 20:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 20:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 20:00 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 20:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 19:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 19:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 18:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 19:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 18:00 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 18:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 18:00 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 18:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 17:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 17:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 16:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 17:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 16:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 16:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 16:00 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 16:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - Cloud Spanner issues from 12:45 to 13:26 Pacific time have been resolved.

Cloud Spanner issues from 12:45 to 13:26 Pacific time have been resolved.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Cloud Spanner. Customers should be back to normal service as of 13:26 Pacific. We will provide more information by 15:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Cloud Spanner. Customers should be back to normal service as of 13:26 Pacific. We will provide more information by 15:00 US/Pacific.

An incident with Cloud Spanner availability started at 12:45 Pacific time and has been addressed. The service is restored for all customers as of 13:26. Another update will be posted before 15:00 Pacific time to confirm the service health.

Last Update: A few months ago

UPDATE: Incident 17002 - The issue with Cloud Machine Learning Engine's Create Version has been resolved for all affected users as of 2017-12-15 10:55 US/Pacific.

The issue with Cloud Machine Learning Engine's Create Version has been resolved for all affected users as of 2017-12-15 10:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17009 - The issue with the Google App Engine Admin API has been resolved for all affected users as of Thursday, 2017-12-14 12:15 US/Pacific.

The issue with the Google App Engine Admin API has been resolved for all affected users as of Thursday, 2017-12-14 12:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - We are investigating an issue with Google Cloud Storage. We will provide more information by 18:00 US/Pacific.

The issue with Cloud Storage elevated error rate has been resolved for all affected projects as of Friday 2017-11-30 16:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - We are investigating an issue with Google Cloud Storage. We will provide more information by 16:30 US/Pacific.

The Cloud Storage service is experiencing an error rate of less than 10%. We will provide another status update by 2017-11-30 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - We are investigating an issue with Google Cloud Storage. We will provide more information by 15:00 US/Pacific.

The Cloud Storage service is experiencing an error rate of less than 10%. We will provide another status update by 2017-11-30 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - We are investigating an issue with Google Cloud Storage. We will provide more information by 15:00 US/Pacific.

The Cloud Storage service is experiencing an error rate of less than 10%. We will provide another status update by 2017-11-30 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - From 10:58 to 11:57 US/Pacific, GCE VM instances experienced packet loss from GCE instances to the Internet. The issue has been mitigated for all affected projects.

From 10:58 to 11:57 US/Pacific, GCE VM instances experienced packet loss from GCE instances to the Internet. The issue has been mitigated for all affected projects.

Last Update: A few months ago

UPDATE: Incident 17004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 07:00 US/Pacific.

The issue with Google Compute Engine VM instances losing connectivity has been resolved for all affected users as of Friday, 2017-11-17 07:17 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 17008 - App Engine increasingly showing 5xx

The issue with App Engine has been resolved for all affected projects as of 4:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17008 - App Engine increasingly showing 5xx

The issue also affected projects in other regions but should be resolved for the majority of projects. We will provide another status update by 05:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17007 - The Memcache service has recovered from a disruption between 12:30 US/Pacific and 15:30 US/Pacific.

ISSUE SUMMARY
On Monday 6 November 2017, the App Engine Memcache service experienced unavailability for applications in all regions for 1 hour and 50 minutes. We sincerely apologize for the impact of this incident on your application or service. We recognize the severity of this incident and will be undertaking a detailed review to fully understand the ways in which we must change our systems to prevent a recurrence.

DETAILED DESCRIPTION OF IMPACT
On Monday 6 November 2017 from 12:33 to 14:23 PST, the App Engine Memcache service experienced unavailability for applications in all regions. Some customers experienced elevated Datastore latency and errors while Memcache was unavailable. At this time, we believe that all the Datastore issues were caused by surges of Datastore activity due to Memcache being unavailable. When Memcache failed, if an application sent a surge of Datastore operations to specific entities or key ranges, then Datastore may have experienced contention or hotspotting, as described in https://cloud.google.com/datastore/docs/best-practices#designing_for_scale. Datastore experienced elevated load on its servers when the outage ended due to a surge in traffic. Some applications in the US experienced elevated latency on gets between 14:23 and 14:31, and elevated latency on puts between 14:23 and 15:04. Customers running Managed VMs experienced failures of all HTTP requests and App Engine API calls during this incident. Customers using App Engine Flexible Environment, which is the successor to Managed VMs, were not impacted.

ROOT CAUSE
The App Engine Memcache service requires a globally consistent view of the current serving datacenter for each application in order to guarantee strong consistency when traffic fails over to alternate datacenters. The configuration which maps applications to datacenters is stored in a global database. The incident occurred when the specific database entity that holds the configuration became unavailable for both reads and writes following a configuration update. App Engine Memcache is designed in such a way that the configuration is considered invalid if it cannot be refreshed within 20 seconds. When the configuration could not be fetched by clients, Memcache became unavailable.

REMEDIATION AND PREVENTION
Google received an automated alert at 12:34. Following normal practices, our engineers immediately looked for recent changes that may have triggered the incident. At 12:59, we attempted to revert the latest change to the configuration file. This configuration rollback required an update to the configuration in the global database, which also failed. At 14:21, engineers were able to update the configuration by sending an update request with a sufficiently long deadline. This caused all replicas of the database to synchronize and allowed clients to read the mapping configuration. As a temporary mitigation, we have reduced the number of readers of the global configuration, which avoids the write contention that led to the unavailability during this incident. Engineering projects are already under way to regionalize this configuration and thereby limit the blast radius of similar failure patterns in the future.
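The root cause above describes a cache outage turning into a surge of Datastore operations against hot entities or key ranges. As an illustration only, the following Python sketch shows a cache-aside read that coalesces concurrent misses per key and retries with jittered backoff, so a cache failure degrades into bounded load on the backing store rather than a stampede; cache_get, cache_set and datastore_read are hypothetical placeholders for your own client calls.

```python
# Illustrative sketch: cache-aside read that tolerates a cache outage.
# cache_get/cache_set/datastore_read are hypothetical callables supplied by
# the caller; nothing here is App Engine-specific.
import random
import threading
import time

_key_locks = {}
_key_locks_guard = threading.Lock()


def _lock_for(key):
    # One lock per key so only a single caller per process reloads that key.
    with _key_locks_guard:
        return _key_locks.setdefault(key, threading.Lock())


def read_through(key, cache_get, cache_set, datastore_read,
                 ttl_seconds=60, max_attempts=3):
    try:
        cached = cache_get(key)
        if cached is not None:
            return cached
    except Exception:
        # Cache unavailable: fall through to the datastore, relying on the
        # single-flight lock and backoff below to limit contention.
        pass

    with _lock_for(key):  # coalesce concurrent misses for the same key
        for attempt in range(max_attempts):
            try:
                value = datastore_read(key)
                try:
                    cache_set(key, value, ttl_seconds)
                except Exception:
                    pass  # best-effort repopulation; ignore cache errors
                return value
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Exponential backoff with jitter avoids synchronized retries.
                time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
```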

Last Update: A few months ago

RESOLVED: Incident 17007 - The Memcache service has recovered from a disruption between 12:30 US/Pacific and 15:30 US/Pacific.

The issue with Memcache availability has been resolved for all affected projects as of 15:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation. This is the final update for this incident.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service experienced a disruption and is still recovering. We will provide more information by 16:00 US/Pacific.

The Memcache service is still recovering from the outage. The rate of errors continues to decrease and we expect a full resolution of this incident in the near future. We will provide an update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service experienced a disruption and is recovering now. We will provide more information by 15:30 US/Pacific.

The issue with Memcache and MVM availability should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide an update by 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service experienced a disruption and is being normalized now. We will provide more information by 15:15 US/Pacific.

We are experiencing an issue with Memcache availability beginning on November 6, 2017 at 12:30 pm US/Pacific. At this time we are gradually ramping up traffic to Memcache and we see that the rate of errors is decreasing. Other services affected by the outage, such as MVM instances, should be normalizing in the near future. We will provide an update by 15:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service is currently experiencing a disruption. We will provide more information by 14:30 US/Pacific.

We are experiencing an issue with Memcache availability beginning on November 6, 2017 at 12:30 pm US/Pacific. Our Engineering Team believes they have identified the root cause of the errors and is working to mitigate. We will provide an update by 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service is currently experiencing a disruption. We will provide more information by 14:30 US/Pacific.

We are experiencing an issue with Memcache availability beginning on November 6, 2017 at 12:30 pm US/Pacific. Current data indicates that all projects using Memcache are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 14:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - Investigating incident with AppEngine and Memcache.

We are experiencing an issue with Memcache availability beginning on November 6, 2017 at 12:30 pm US/Pacific. Current data indicates that all projects using Memcache are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 14:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - Investigating incident with AppEngine and Memcache.

We are investigating an issue with Google App Engine and Memcache. We will provide more information by 13:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17003 - We are investigating an issue with GKE. We will provide more information by 16:00 US/Pacific.

We are investigating an issue involving the inability of pods to be rescheduled on Google Container Engine (GKE) nodes after Docker reboots or crashes. This affects GKE versions 1.6.11, 1.7.7, 1.7.8 and 1.8.1. Our engineering team will roll out a fix next week; no further updates will be provided here. If you experience this issue, it can be mitigated by manually restarting the affected nodes.

Last Update: A few months ago

RESOLVED: Incident 17018 - We are investigating an issue with Google Cloud SQL. We see failures for Cloud SQL connections from App Engine and connections using the Cloud SQL Proxy. We are also observing elevated failure rates f...

The issue with Cloud SQL connectivity affecting connections from App Engine and connections using the Cloud SQL Proxy as well as the issue with Cloud SQL admin activities have been resolved for all affected as of 20:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17019 - We are investigating an issue with Google Cloud SQL. We see failures for Cloud SQL connections from App Engine and connections using the Cloud SQL Proxy. We are also observing elevated failure rates f...

The issue with Cloud SQL connectivity affecting connections from App Engine and connections using the Cloud SQL Proxy as well as the issue with Cloud SQL admin activities have been resolved for all affected as of 2017-10-30 20:45 PDT. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17018 - We are investigating an issue with Google Cloud SQL. We see failures for Cloud SQL connections from App Engine and connections using the Cloud SQL Proxy. We are also observing elevated failure rates f...

We are continuing to experience an issue with Cloud SQL connectivity, affecting only connections from App Engine and connections using the Cloud SQL Proxy, beginning at 2017-10-30 17:00 US/Pacific. We are also observing elevated failure rates for Cloud SQL admin activities (using the Cloud SQL portion of the Cloud Console UI, using gcloud beta sql, directly using the Admin API, etc.). Our Engineering Team believes they have identified the root cause and mitigation effort is currently underway. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide another update by 2017-10-30 21:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17005 - Elevated GCS Errors from Canada

ISSUE SUMMARY
Starting Thursday 12 October 2017, Google Cloud Storage clients located in the Northeast of North America experienced up to a 10% error rate for a duration of 21 hours and 35 minutes when fetching objects stored in multi-regional buckets in the US. We apologize for the impact of this incident on your application or service. The reliability of our service is a top priority and we understand that we need to do better to ensure that incidents of this type do not recur.

DETAILED DESCRIPTION OF IMPACT
Between Thursday 12 October 2017 12:47 PDT and Friday 13 October 2017 10:12 PDT, Google Cloud Storage clients located in the Northeast of North America experienced up to a 10% rate of 503 errors and elevated latency. Some users experienced higher error rates for brief periods. This incident only impacted requests to fetch objects stored in multi-regional buckets in the US; clients were able to mitigate impact by retrying. The percentage of total global requests to Cloud Storage that experienced errors was 0.03%.

ROOT CAUSE
Google ensures balanced use of its internal networks by throttling outbound traffic at the source host in the event of congestion. This incident was caused by a bug in an earlier version of the job that reads Cloud Storage objects from disk and streams data to clients. Under high traffic conditions, the bug caused these jobs to incorrectly throttle outbound network traffic even though the network was not congested. Google had previously identified this bug and was in the process of rolling out a fix to all Google datacenters. At the time of the incident, Cloud Storage jobs in a datacenter in Northeast North America that serves requests to some Canadian and US clients had not yet received the fix. This datacenter is not a location for customer buckets (https://cloud.google.com/storage/docs/bucket-locations), but objects in multi-regional buckets can be served from instances running in this datacenter in order to optimize latency for clients.

REMEDIATION AND PREVENTION
The incident was first reported by a customer to Google on Thursday 12 October 14:59 PDT. Google engineers determined root cause on Friday 13 October 09:47 PDT. We redirected Cloud Storage traffic away from the impacted region at 10:08 and the incident was resolved at 10:12. We have now rolled out the bug fix to all regions. We will also add external monitoring probes for all regional points of presence so that we can more quickly detect issues of this type.
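The postmortem above notes that clients could mitigate the 503 errors by retrying. As a hedged sketch (not a description of Google's own client behavior), the code below retries an idempotent object read against the Cloud Storage JSON API media endpoint with exponential backoff and jitter. The bucket and object names are placeholders, and authentication is omitted, so as written this only works for publicly readable objects.

```python
# Sketch: retry a GCS object read on transient 5xx responses with backoff.
import random
import time
import urllib.error
import urllib.parse
import urllib.request


def fetch_object(bucket, name, max_attempts=5):
    url = ("https://storage.googleapis.com/storage/v1/b/"
           f"{bucket}/o/{urllib.parse.quote(name, safe='')}?alt=media")
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only retry server-side errors such as the 503s described above.
            if err.code < 500 or attempt == max_attempts - 1:
                raise
        time.sleep((2 ** attempt) * 0.5 + random.uniform(0, 0.5))


# Example (placeholder names):
# data = fetch_object("my-public-bucket", "path/to/object.txt")
```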

Last Update: A few months ago

UPDATE: Incident 17002 - Jobs not terminating

The issue with Cloud Dataflow in which batch jobs are stuck and cannot be terminated has been resolved for all affected projects as of Wednesday, 2017-10-18 02:58 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17002 - Jobs not terminating

A fix for the issue with Cloud Dataflow in which batch jobs are stuck and cannot be terminated is currently getting rolled out. We expect a full resolution in the near future. We will provide another status update by Wednesday, 2017-10-18 03:45 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17007 - Stackdriver Uptime Check Alerts Not Firing

The issue with Stackdriver Uptime Check Alerts not firing has been resolved for all affected projects as of Monday, 2017-10-16 13:08 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17007 - Stackdriver Uptime Check Alerts Not Firing

We are investigating an issue with Stackdriver Uptime Check Alerts. We will provide more information by 13:15 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17005 - Elevated GCS Errors from Canada

The issue with Google Cloud Storage request failures for users in Canada and Northeast North America has been resolved for all affected users as of Friday, 2017-10-13 10:08 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17005 - Elevated GCS Errors from Canada

We are investigating an issue with Google Cloud Storage users in Canada and Northeast North America experiencing HTTP 503 failures. We will provide more information by 10:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17004 - Elevated GCS errors in us-east1

The issue with GCS service has been resolved for all affected users as of 14:31 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our system to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17004 - Elevated GCS errors in us-east1

The issue with GCS service should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17004 - Elevated GCS errors in us-east1

We are investigating an issue that occurred with GCS starting at 13:19 PDT. We will provide more information by 14:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17007 - Project creation failure

The issue with Project Creation failing with "Unknown error" has been resolved for all affected users as of Tuesday, 2017-10-03 22:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - Stackdriver console unavailable

The issue with Google Stackdriver has been resolved for all affected users as of 2017-10-03 16:28 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - Stackdriver console unavailable

We are continuing to investigate the Google Stackdriver issue. Graphs are fully restored, but alerting policies and uptime checks are still degraded. We will provide another update at 17:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17006 - Stackdriver console unavailable

We are continuing to investigate the Google Stackdriver issue. In addition to graph and alerting policy unavailability, uptime checks are not completing successfully. We believe we have isolated the root cause and are working on a resolution, and we will provide another update at 16:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17006 - Stackdriver console unavailable

We are investigating an issue with Google Stackdriver that is causing charts and alerting policies to be unavailable. We will provide more information by 15:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17007 - Project creation failure

Project creation is experiencing a 100% error rate on requests. We will provide another status update by Tuesday, 2017-10-03 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - Project creation failure

We are investigating an issue with Project creation. We will provide more information by 12:40 PM US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17005 - Errors creating new Stackdriver accounts and adding new projects to existing Stackdriver accounts.

The issue with Google Stackdriver has been resolved for all affected projects as of Friday, 2017-09-29 15:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - Activity Stream not showing new Activity Logs

We are currently investigating an issue with the Cloud Console's Activity Stream not showing new Activity Logs.

Last Update: A few months ago

RESOLVED: Incident 17003 - Google Cloud Pub/Sub partially unavailable.

The issue with Pub/Sub subscription creation has been resolved for all affected projects as of 08:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17003 - Google Cloud Pub/Sub partially unavailable.

We are experiencing an issue with Pub/Sub subscription creation beginning at 2017-09-13 06:30 US/Pacific. Current data indicates that approximately 12% of requests are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 08:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - Google Cloud Pub/Sub partially unavailable.

We are investigating an issue with Google Pub/Sub. We will provide more information by 07:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

ISSUE SUMMARY
On Tuesday 29 August and Wednesday 30 August 2017, Google Cloud Network Load Balancing and Internal Load Balancing could not forward packets to backend instances if they were live-migrated. This incident initially affected instances in all regions for 30 hours and 22 minutes. We apologize for the impact this had on your services. We are particularly cognizant of the failure of a system used to increase reliability and the long duration of the incident. We have completed an extensive postmortem to learn from the issue and improve Google Cloud Platform.

DETAILED DESCRIPTION OF IMPACT
Starting at 13:56 PDT on Tuesday 29 August 2017 to 20:18 on Wednesday 30 August 2017, Cloud Network Load Balancer and Internal Load Balancer in all regions were unable to reach any instance that live-migrated during that period. Instances which did not experience live-migration during this period were not affected. Our internal investigation shows that approximately 2% of instances using Network Load Balancing or Internal Load Balancing were affected by the issue.

ROOT CAUSE
Live-migration transfers a running VM from one host machine to another host machine within the same zone. All VM properties and attributes remain unchanged, including internal and external IP addresses, instance metadata, block storage data and volumes, OS and application state, network settings, network connections, and so on. In this case, a change in the internal representation of networking information in VM instances caused inconsistency between two values, both of which were supposed to hold the external and internal virtual IP addresses of load balancers. When an affected instance was live-migrated, the instance was deprogrammed from the load balancer because of the inconsistency. This made it impossible for load balancers that used the instance as a backend to look up the destination IP address of the instance following its migration, which in turn caused all packets destined to that instance to be dropped at the load balancer level.

REMEDIATION AND PREVENTION
The issue was initially detected through customer reports of missing backend connectivity to the GCP support team at 23:30 on Tuesday. At 00:28 on Wednesday two Cloud Network engineering teams were paged to investigate the issue. Detailed investigations continued until 08:07 when the configuration change that caused the issue was confirmed as such. The rollback of the new configuration was completed by 08:32, at which point no new live-migration would cause the issue. Google engineers then started to run a program to fix all mismatched network information at 08:56, and all affected instances were restored to a healthy status by 20:18. To prevent a recurrence, Google engineers are working to enhance automated canary testing that simulates live-migration events, to improve detection of load balancer packet loss, and to enforce more restrictions on deploying configuration changes that alter internal representations.

Last Update: A few months ago

RESOLVED: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

The issue with Network Load Balancers has been resolved for all affected projects as of 20:18 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 18033 - We are investigating an issue with BigQuery queries failing starting at 10:15am PT

The issue with BigQuery queries failing has been resolved for all affected users as of 12:05pm US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18033 - We are investigating an issue with BigQuery queries failing starting at 10:15am PT

The BigQuery service is experiencing a 16% error rate on queries. We will provide another status update by 12:00pm US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We wanted to send another update with better formatting. We will provide another update on resolving affected instances by 12:00 PDT. Affected customers can also mitigate their affected instances with the following procedure (which causes the Network Load Balancer to be reprogrammed) using the gcloud tool or the Compute Engine API. NB: No modification to the existing load balancer configuration is necessary, but a temporary TargetPool needs to be created.
1. Create a new TargetPool.
2. Add the affected VMs in a region to the new TargetPool.
3. Wait for the VMs to start working in their existing load balancer configuration.
4. Delete the new TargetPool.
DO NOT delete the existing load balancer configuration, including the old target pool. It is not necessary to create a new ForwardingRule. Example:
1) gcloud compute target-pools create dummy-pool --project=<your_project> --region=<region>
2) gcloud compute target-pools add-instances dummy-pool --instances=<instance1,instance2,...> --project=<your_project> --region=<region> --instances-zone=<zone>
3) (Wait)
4) gcloud compute target-pools delete dummy-pool --project=<your_project> --region=<region>

Last Update: A few months ago

UPDATE: Incident 18033 - We are investigating an issue with BigQuery queries failing starting at 10:15am PT

We are investigating an issue with BigQuery queries failing starting at 10:15am PT

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

Our first mitigation has completed at this point and no new instances should be affected. We are slowly working through and fixing affected customers. Affected customers can also mitigate their affected instances with the following procedure (which causes the Network Load Balancer to be reprogrammed) using the gcloud tool or the Compute Engine API. NB: No modification to the existing load balancer configuration is necessary, but a temporary TargetPool needs to be created.
1. Create a new TargetPool.
2. Add the affected VMs in a region to the new TargetPool.
3. Wait for the VMs to start working in their existing load balancer configuration.
4. Delete the new TargetPool.
DO NOT delete the existing load balancer configuration, including the old target pool. It is not necessary to create a new ForwardingRule. Example:
gcloud compute target-pools create dummy-pool --project=<your_project> --region=<region>
gcloud compute target-pools add-instances dummy-pool --instances=<instance1,instance2,...> --project=<your_project> --region=<region> --instances-zone=<zone>
(Wait)
gcloud compute target-pools delete dummy-pool --project=<your_project> --region=<region>

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We are experiencing an issue with a subset of Network Load Balancers. The configuration change to mitigate this issue has been rolled out and we are working on further measures to completely resolve the issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 10:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 being unable to connect to backends. The configuration change to mitigate this issue has been rolled out and we are working on further measures to completely resolve the issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 09:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 being unable to connect to backends. We have identified the event that triggers this issue and are rolling back a configuration change to mitigate this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 09:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 being unable to connect to backends. Mitigation work is still in progress. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 08:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 being unable to connect to backends. Our previous actions did not resolve the issue. We are pursuing alternative solutions. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 07:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-west1 and asia-east1 being unable to connect to backends. Our Engineering Team has reduced the scope of possible root causes and is still investigating. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 06:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to backends. The investigation is still ongoing. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to backends. We have ruled out several possible failure scenarios. The investigation is still ongoing. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 04:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 04:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 03:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 03:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are investigating an issue with network load balancer connectivity. We will provide more information by 02:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are investigating an issue with network connectivity. We will provide more information by 01:50 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are investigating an issue with network connectivity. We will provide more information by 01:20 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

ISSUE SUMMARY
On Tuesday 15 August 2017, Google Cloud SQL experienced issues in the europe-west1 zones for a duration of 3 hours and 35 minutes. During this time, new connections from Google App Engine (GAE) or Cloud SQL Proxy would time out and return an error. In addition, Cloud SQL connections with ephemeral certs that had been open for more than one hour timed out and returned an error. We apologize to our customers whose projects were affected – we are taking immediate action to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT
On Tuesday 15 August 2017 from 17:20 to 20:55 PDT, 43.1% of Cloud SQL instances located in europe-west1 could not be managed with the Google Cloud SQL Admin API to create or make changes. Customers who connected from GAE or used the Cloud SQL Proxy (which includes most connections from Google Container Engine) were denied new connections to their database.

ROOT CAUSE
The issue surfaced through a combination of a spike in error rates internal to the Cloud SQL service and a lack of available resources in the Cloud SQL control plane for europe-west1. By way of background, the Cloud SQL system uses a database to store metadata for customer instances. This metadata is used for validating new connections. Validation will fail if the load on the database is heavy. In this case, Cloud SQL’s automatic retry logic overloaded the control plane and consumed all the available Cloud SQL control plane processing in europe-west1. This in turn caused the Cloud SQL Proxy and the frontend client-server pairing to reject connections when ACLs and certificate information stored in the Cloud SQL control plane could not be accessed.

REMEDIATION AND PREVENTION
Google engineers were paged at 17:20 when automated monitoring detected an increase in control plane errors. Initial troubleshooting steps did not sufficiently isolate the issue and reduce the database load. Engineers then disabled non-critical control plane services for Cloud SQL to shed load and allow the service to catch up. They then began a rollback to the previous configuration to bring back the system to a healthy state. This incident surfaced technical issues which hinder our intended level of service and reliability for the Cloud SQL service. We have begun a thorough investigation of similar potential failure patterns in order to avoid this type of service disruption in the future. We are adding additional monitoring to quickly detect metadata database timeouts which caused the control plane outage. We are also working to make the Cloud SQL control plane services more resilient to metadata database latency by making the service not directly call the database for connection validation. We realize this event may have impacted your organization and we apologize for this disruption. Thank you again for your business with Google Cloud SQL.
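Incidents like the one above surface to applications as timeouts when establishing new connections through the Cloud SQL Proxy. As an illustrative sketch only, the following code retries the initial connection with bounded exponential backoff; PyMySQL is an assumed driver choice, and the host, credentials and database name are placeholders.

```python
# Sketch: bounded retries when connecting through a local Cloud SQL Proxy.
import random
import time

import pymysql


def connect_with_retry(max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return pymysql.connect(
                host="127.0.0.1",      # local Cloud SQL Proxy listener
                port=3306,
                user="app_user",       # placeholder credentials
                password="change-me",
                database="app_db",
                connect_timeout=10,
            )
        except pymysql.err.OperationalError:
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.uniform(0, 1))


# conn = connect_with_retry()
```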

Last Update: A few months ago

UPDATE: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

The issue with Cloud SQL connectivity affecting connections from App Engine and connections using the Cloud SQL Proxy in europe-west1 has been resolved for all affected projects as of 20:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

We are continuing to experience an issue with Cloud SQL connectivity beginning at Tuesday, 2017-08-15 17:20 US/Pacific. Current investigation indicates that instances running in Europe-West1 are affected by this issue. Engineering is working on mitigating the situation. We will provide an update by 21:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

We are continuing to experience an issue with Cloud SQL connectivity beginning at Tuesday, 2017-08-15 17:20 US/Pacific. Current investigation indicates that instances running in Europe-West1 are affected by this issue. Engineering is currently working on mitigating the situation. We will provide an update by 20:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

We are experiencing an issue with Cloud SQL connectivity beginning at Tuesday, 2017-08-15 17:20 US/Pacific. Current investigation indicates that instances running in Europe-West1 are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - GCS triggers not fired when objects are updated

We are investigating an issue with Google Cloud Storage object overwrites. For buckets with Google Cloud Functions or Object Change Notification enabled, notifications were not being triggered when a new object overwrote an existing object. Other object operations are not affected. Buckets with Google Cloud Pub/Sub configured are also not affected. The root cause has been found and confirmed by partial rollback. Full rollback is expected to be completed within an hour. Between now and full rollback, affected buckets are expected to begin triggering on updates; it can be intermittent initially, and it is expected to stabilize when the rollback is complete. We will provide another update by 14:00 with any new details. ETA for resolution 14:00 US/Pacific time.
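For context on the notification path described above, here is a minimal, hypothetical sketch of a first-generation background Cloud Function subscribed to object finalize events on a bucket. Overwriting an existing object creates a new object generation and also emits a finalize event, which is the trigger that was not firing during this incident. The function name and log message are illustrative; deployment configuration is omitted.

```python
# Sketch: background Cloud Function for google.storage.object.finalize events.
def on_object_change(event, context):
    """Logs basic details for each newly created or overwritten object."""
    bucket = event.get("bucket")
    name = event.get("name")
    generation = event.get("generation")
    print(f"finalize: gs://{bucket}/{name} "
          f"(generation {generation}, event_id {context.event_id})")
```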

Last Update: A few months ago

UPDATE: Incident 17003 - GCS triggers not fired when objects are updated

We are investigating an issue with Google Cloud Storage function triggering on object update. Apiary notifications on object updates were also not sent during this issue. Other object operations are not reporting problems. The root cause has been found and confirmed by partial rollback. Full rollback is expected to be completed within an hour. Between now and full rollback, affected functions are expected to begin triggering on updates; it can be intermittent initially, and it is expected to stabilize when the rollback is complete. We will provide another update by 14:00 with any new details. ETA for resolution 13:30 US/Pacific time.

Last Update: A few months ago

UPDATE: Incident 17003 - GCS triggers not fired when objects are updated

We are investigating an issue with Google Cloud Storage function triggering on object update. Other object operations are not reporting problems. We will provide more information by 12:00 US/Pacific

Last Update: A few months ago

RESOLVED: Incident 18032 - BigQuery Disabled for Projects

ISSUE SUMMARY
On 2017-07-26, BigQuery delivered error messages for 7% of queries and 15% of exports for a duration of two hours and one minute. It also experienced elevated failures for streaming inserts for one hour and 40 minutes. If your service or application was affected, we apologize – this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve BigQuery’s performance and availability.

DETAILED DESCRIPTION OF IMPACT
On 2017-07-26 from 13:45 to 15:45 US/PDT, BigQuery jobs experienced elevated failures at a rate of 7% to 15%, depending on the operation attempted. Overall 7% of queries, 15% of exports, and 9% of streaming inserts failed during this event. These failures occurred in 12% of customer projects. The errors for affected projects varied from 2% to 69% of exports, over 50% for queries, and up to 28.5% for streaming inserts. Customers affected saw an error message stating that their project has “not enabled BigQuery”.

ROOT CAUSE
Prior to executing a BigQuery job, Google’s Service Manager validates that the project requesting the job has BigQuery enabled for the project. The Service Manager consists of several components, including a redundant data store for project configurations, and a permissions module which inspects configurations. The project configuration data is being migrated to a new format and new version of the data store, and as part of that migration, the permissions module is being updated to use the new format. As is normal production best practice, this migration is being performed in stages separated by time. The root cause of this event was that, during one stage of the rollout, configuration data for two GCP datacenters was migrated before the corresponding permissions module for BigQuery was updated. As a result, the permissions module in those datacenters began erroneously reporting that projects running there no longer had BigQuery enabled. Thus, while both BigQuery and the underlying data stores were unchanged, requests to BigQuery from affected projects received an error message indicating that they had not enabled BigQuery.

REMEDIATION AND PREVENTION
Google’s BigQuery on-call engineering team was alerted by automated monitoring within 15 minutes of the beginning of the event at 13:59. Subsequent investigation determined at 14:17 that multiple projects were experiencing BigQuery validation failures, and the cause of the errors was identified at 14:46 as being changed permissions. Once the root cause of the errors was understood, Google engineers focused on mitigating the user impact by configuring BigQuery in affected locations to skip the erroneous permissions check. This change was first tested in a portion of the affected projects beginning at 15:04, and confirmed to be effective at 15:29. That mitigation was then rolled out to all affected projects, and was complete by 15:44. Finally, with mitigations in place, the Google engineering team worked to safely roll back the data migration; this work completed at 23:33 and the permissions check mitigation was removed, closing the incident. Google engineering has created 26 high priority action items to prevent a recurrence of this condition and to better detect and more quickly mitigate similar classes of issues in the future. These action items include increasing the auditing of BigQuery’s use of Google’s Service Manager, improving the detection and alerting of the conditions that caused this event, and improving the response of Google engineers to similar events. In addition, the core issue that affected the BigQuery backend has already been fixed. Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization.
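As a sketch under assumptions (the google-cloud-bigquery Python client, and a placeholder project and query), the example below shows one way for application code to separate a hard configuration or permission failure, such as the "not enabled BigQuery" 403 described above, from transient server errors that are reasonable to retry.

```python
# Sketch: distinguish non-retryable 403s from transient 5xx errors.
from google.api_core import exceptions
from google.cloud import bigquery


def run_query(sql, project="my-project"):
    client = bigquery.Client(project=project)
    try:
        return list(client.query(sql).result())
    except exceptions.Forbidden as err:
        # Not transient: surface it (alert/page) instead of retrying blindly.
        raise RuntimeError(f"BigQuery rejected the project: {err}") from err
    except (exceptions.ServiceUnavailable, exceptions.InternalServerError):
        # Transient 5xx: a single retry here; real code would back off.
        return list(client.query(sql).result())


# rows = run_query("SELECT 1 AS ok")
```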

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

The issue with BigQuery access errors has been resolved for all affected projects as of 16:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

The issue with BigQuery errors should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

The BigQuery engineers have identified a possible workaround to the issue affecting the platform and are deploying it now. Next update at 16:00 PDT.

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

We are investigating an issue with BigQuery. We will provide more information by 15:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

At this time BigQuery is experiencing a partial outage, reporting that the service is not available for affected projects. Engineers are currently investigating the issue.

Last Update: A few months ago

UPDATE: Incident 17001 - Issues with Cloud VPN in us-west1

The issue with connectivity to Cloud VPN and External IPs in us-west1 has been resolved for all affected projects as of 14:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17005 - Issues with Google Cloud Console

The issue with listing projects in the Google Cloud Console has been resolved as of 2017-07-21 07:11 PDT.

Last Update: A few months ago

UPDATE: Incident 17005 - Issues with Google Cloud Console

The issue with Google Cloud Console errors should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 09:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - Issues with Google Cloud Console

We are experiencing an issue with Google Cloud Console returning errors beginning at Fri, 2017-07-21 02:50 US/Pacific. Early investigation indicates that users may see errors when listing projects in Google Cloud Console and via the API. Some other pages in Google Cloud Console may also display an error; refreshing the pages may help. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - Issues with Google Cloud Console

We are investigating an issue with Google Cloud Console. We will provide more information by 03:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18031 - BigQuery API server returning errors

The issue with BigQuery API returning errors has been resolved for all affected users as of 04:10 US/Pacific. We apologize for the impact that this incident had on your application.

Last Update: A few months ago

UPDATE: Incident 18031 - BigQuery API server returning errors

The issue with Google BigQuery should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18031 - BigQuery API server returning errors

We are investigating an issue with Google BigQuery. We will provide more information by 04:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18030 - Streaming API errors

ISSUE SUMMARY
On Wednesday 28 June 2017, streaming data into Google BigQuery experienced elevated error rates for a period of 57 minutes. We apologize to all users whose data ingestion pipelines were affected by this issue. We understand the importance of reliability for a process as crucial as data ingestion and are taking committed actions to prevent a similar recurrence in the future.

DETAILED DESCRIPTION OF IMPACT
On Wednesday 28 June 2017 from 18:00 to 18:20 and from 18:40 to 19:17 US/Pacific time, BigQuery's streaming insert service returned an increased error rate to clients for all projects. The proportion varied from time to time, but failures peaked at 43% of streaming requests returning HTTP response code 500 or 503. Data streamed into BigQuery from clients that experienced errors and lacked retry logic was not saved into target tables during this period.

ROOT CAUSE
Streaming requests are routed to different datacenters for processing based on the table ID of the destination table. A sudden increase in traffic to the BigQuery streaming service combined with diminished capacity in a datacenter resulted in that datacenter returning a significant amount of errors for tables whose IDs landed in that datacenter. Other datacenters processing streaming data into BigQuery were unaffected.

REMEDIATION AND PREVENTION
Google engineers were notified of the event at 18:20, and immediately started to investigate the issue. The first set of errors had subsided, but starting at 18:40 error rates increased again. At 19:17 Google engineers redirected traffic away from the affected datacenter. The table IDs in the affected datacenter were redistributed to remaining, healthy streaming servers and error rates began to subside. To prevent the issue from recurring, Google engineers are improving the load balancing configuration, so that spikes in streaming traffic can be more equitably distributed amongst the available streaming servers. Additionally, engineers are adding further monitoring as well as tuning existing monitoring to decrease the time it takes to alert engineers of issues with the streaming service. Finally, Google engineers are evaluating rate-limiting strategies for the backends to prevent them from becoming overloaded.
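The impact section above notes that streamed data from clients without retry logic was not saved. As an illustration only, the sketch below wraps the google-cloud-bigquery streaming insert call with bounded retries and per-row insert IDs for best-effort de-duplication if a batch is retried; the table ID and rows are placeholders, and insert_rows_json still returns per-row errors that the caller must handle.

```python
# Sketch: streaming inserts with bounded retries on transient 5xx errors.
import random
import time

from google.api_core import exceptions
from google.cloud import bigquery


def stream_rows(table_id, rows, max_attempts=5):
    client = bigquery.Client()
    # Stable IDs for this batch enable best-effort de-duplication on retry;
    # real code would use globally unique, per-record identifiers.
    row_ids = [str(i) for i in range(len(rows))]
    for attempt in range(max_attempts):
        try:
            errors = client.insert_rows_json(table_id, rows, row_ids=row_ids)
            if not errors:
                return
            # Row-level errors (e.g. schema mismatches) are not transient.
            raise RuntimeError(f"row-level insert errors: {errors}")
        except (exceptions.ServiceUnavailable, exceptions.InternalServerError):
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.uniform(0, 1))


# stream_rows("my-project.my_dataset.events", [{"event": "ping", "n": 1}])
```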

Last Update: A few months ago

UPDATE: Incident 17004 - Stackdriver Uptime Monitoring - alerting policies with uptime check health conditions will not fire or resolve

We are experiencing an issue with Stackdriver Uptime Monitoring: alerting policies with uptime check health conditions will not fire or resolve, and latency charts on the uptime dashboard will be missing, beginning at approximately Thursday, 2017-07-06 17:00 US/Pacific. Current data indicates that all projects are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - We are investigating an issue with Google Cloud Storage. We will provide more information by 18:30 US/Pacific.

We are experiencing an intermittent issue with Google Cloud Storage - JSON API requests are failing with 5XX errors (XML API is unaffected) beginning at Thursday, 2017-07-06 16:50:40 US/Pacific. Current data indicates that approximately 70% of requests globally are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - We are investigating an issue with Google Cloud Storage. We will provide more information by 18:30 US/Pacific.

We are investigating an issue with Google Cloud Storage. We will provide more information by 18:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17001 - Google Cloud Storage elevated error rates

The issue with degraded availability for some Google Cloud Storage objects has been resolved for all affected projects. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 17016 - Cloud SQL V2 instance failing to create

The issue with Cloud SQL V2 incorrect reports of 'Unable to Failover' state has been resolved for all affected instances as of 12:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17016 - Cloud SQL V2 instance failing to create

The issue with Cloud SQL V2 incorrect reports of 'Unable to Failover' state should be resolved for some instances and we expect a full resolution in the near future. We will provide another status update by 12:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17016 - Cloud SQL V2 instance failing to create

Our Engineering Team believes they have identified the root cause of the incorrect reports of 'Unable to Failover' state and is working to mitigate. We will provide another status update by 12:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17002 - Cloud Pub/Sub admin operations failing

The issue with Cloud Pub/Sub admin operations failing has been resolved for all affected users as of 10:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17016 - Cloud SQL V2 instance failing to create

The issue with some Cloud SQL V2 instances failing to create should be resolved for some projects and we expect a full resolution in the near future. At this time we do not have additional information on HA instances reporting an incorrect 'Unable to Failover' state. We will provide another status update by 11:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17016 - Cloud SQL V2 instance failing to create

We are experiencing an issue with Cloud SQL V2: some instances may fail to create, or HA instances may report an incorrect 'Unable to Failover' state, beginning at Thursday, 2017-06-29 08:45 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 11:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Cloud Pub/Sub admin operations failing

We are investigating an issue where Cloud Pub/Sub admin operations are failing. We will provide more information by 10:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Google Cloud Storage elevated error rates

Google engineers are continuing to restore the service. Error rates are continuing to decrease. We will provide another status update by 15:00 June 29 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

The issue with Cloud Logging exports to BigQuery failing has been resolved for all affected projects on Tuesday, 2017-06-13 10:12 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18030 - Streaming API errors

The issue with BigQuery Streaming insert has been resolved for all affected users as of 19:17 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18030 - Streaming API errors

Our Engineering Team believes it has identified the root cause of the errors and has mitigated the issue as of 19:17 US/Pacific. We will provide another status update by 20:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18030 - Streaming API errors

We are investigating an issue with BigQuery Streaming insert. We will provide more information by 19:35 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Google Cloud Storage elevated error rates

Starting at Tuesday 27 June 2017 07:30 PST, Google Cloud Storage started experiencing degraded availability for some objects in us-central1 buckets (Regional, Nearline, Coldline, Durable Reduced Availability) and US multi-region buckets. Between 08:00 and 18:00 PST the error rate was ~3.5%; error rates have since decreased to 0.5%. Errors are expected to be consistent for affected objects. Customers do not need to make any changes at this time. Google engineers have identified the root cause and are working to restore the service. If your service or application is affected, we apologize. We will provide another status update by 05:00 June 29 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18029 - BigQuery Increased Error Rate

ISSUE SUMMARY For 10 minutes on Wednesday 14 June 2017, Google BigQuery experienced increased error rates for both streaming inserts and most API methods due to their dependency on metadata read operations. To our BigQuery customers whose businesses were impacted by this event, we sincerely apologize. We are taking immediate steps to improve BigQuery’s performance and availability.

DETAILED DESCRIPTION OF IMPACT Starting at 10:43am US/Pacific, global error rates for BigQuery streaming inserts and API calls dependent upon metadata began to rapidly increase. The error rate for streaming inserts peaked at 100% by 10:49am. Within that same window, the error rate for metadata operations increased to a height of 80%. By 10:54am the error rates for both streaming inserts and metadata operations returned to normal operating levels. During the incident, affected BigQuery customers would have experienced a noticeable elevation in latency on all operations, as well as increased “Service Unavailable” and “Timeout” API call failures. While BigQuery streaming inserts and metadata operations were the most severely impacted, other APIs also exhibited elevated latencies and error rates, though to a much lesser degree. For API calls returning status code 2xx the operation completed with successful data ingestion and integrity.

ROOT CAUSE On Wednesday 14 June 2017, BigQuery engineers completed the migration of BigQuery's metadata storage to an improved backend infrastructure. This effort was the culmination of work to incrementally migrate BigQuery read traffic over the course of two weeks. As the new backend infrastructure came online, there was one particular type of read traffic that hadn’t yet migrated to the new metadata storage. This caused a sudden spike of that read traffic to the new backend. The spike came when the new storage backend had to process a large volume of incoming requests as well as allocate resources to handle the increased load. Initially the backend was able to process requests with elevated latency, but all available resources were eventually exhausted, which led to API failures. Once the backend was able to complete the load redistribution, it began to free up resources to process existing requests and work through its backlog. BigQuery operations continued to experience elevated latency and errors for another five minutes as the large backlog of requests from the first five minutes of the incident was processed.

REMEDIATION AND PREVENTION Our monitoring systems worked as expected and alerted us to the outage within 6 minutes of the error spike. By this time, the underlying root cause had already passed. Google engineers have created nine high-priority action items and three lower-priority action items as a result of this event to better prevent, detect and mitigate the recurrence of a similar event. The most significant of these priorities is to modify the BigQuery service to successfully handle a similar root cause event. This will include adjusting capacity parameters to better handle backend failures and improving caching and retry logic. Each of the 12 action items created from this event has already been assigned to an engineer and is underway.
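
For client applications, transient streaming failures of this kind are typically handled with client-side retries and backoff. Purely as an illustration (not part of Google's remediation), here is a minimal sketch assuming the google-cloud-bigquery Python client; the table ID and rows are hypothetical:

import time

from google.api_core import exceptions
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.events"  # hypothetical table


def stream_rows(rows, max_attempts=5):
    # Retry streaming inserts on transient "Service Unavailable" / "Timeout"
    # style errors, backing off exponentially between attempts.
    for attempt in range(max_attempts):
        try:
            errors = client.insert_rows_json(TABLE_ID, rows)
            if not errors:
                return  # all rows accepted
            raise RuntimeError("per-row insert errors: %s" % errors)
        except (exceptions.ServiceUnavailable, exceptions.DeadlineExceeded):
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)


stream_rows([{"event": "example", "ts": "2017-06-14T10:43:00Z"}])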

Last Update: A few months ago

RESOLVED: Incident 18029 - BigQuery Increased Error Rate

ISSUE SUMMARY For 10 minutes on Wednesday 14 June 2017, Google BigQuery experienced increased error rates for both streaming inserts and most API methods due to their dependency on metadata read operations. To our BigQuery customers whose businesses were impacted by this event, we sincerely apologize. We are taking immediate steps to improve BigQuery’s performance and availability.

DETAILED DESCRIPTION OF IMPACT Starting at 10:43am US/Pacific, global error rates for BigQuery streaming inserts and API calls dependent upon metadata began to rapidly increase. The error rate for streaming inserts peaked at 100% by 10:49am. Within that same window, the error rate for metadata operations increased to a height of 80%. By 10:54am the error rates for both streaming inserts and metadata operations returned to normal operating levels. During the incident, affected BigQuery customers would have experienced a noticeable elevation in latency on all operations, as well as increased “Service Unavailable” and “Timeout” API call failures. While BigQuery streaming inserts and metadata operations were the most severely impacted, other APIs also exhibited elevated latencies and error rates, though to a much lesser degree. For API calls returning status code 2xx the operation completed successfully with guaranteed data ingestion and integrity.

ROOT CAUSE On Wednesday 14 June 2017, BigQuery engineers completed the migration of BigQuery's metadata storage to an improved backend infrastructure. This effort was the culmination of work to incrementally migrate BigQuery read traffic over the course of two weeks. As the new backend infrastructure came online, there was one particular type of read traffic that hadn’t yet migrated to the new metadata storage. This caused a sudden spike of that read traffic to the new backend. The spike came when the new storage backend had to process a large volume of incoming requests as well as allocate resources to handle the increased load. Initially the backend was able to process requests with elevated latency, but all available resources were eventually exhausted, which led to API failures. Once the backend was able to complete the load redistribution, it began to free up resources to process existing requests and work through its backlog. BigQuery operations continued to experience elevated latency and errors for another five minutes as the large backlog of requests from the first five minutes of the incident was processed.

REMEDIATION AND PREVENTION Our monitoring systems worked as expected and alerted us to the outage within 6 minutes of the error spike. By this time, the underlying root cause had already passed. Google engineers have created nine high-priority action items and three lower-priority action items as a result of this event to better prevent, detect and mitigate the recurrence of a similar event. The most significant of these priorities is to modify the BigQuery service to successfully handle a similar root cause event. This will include adjusting capacity parameters to better handle backend failures and improving caching and retry logic. Each of the 12 action items created from this event has already been assigned to an engineer and is underway.

Last Update: A few months ago

RESOLVED: Incident 17006 - Network issue in asia-northeast1

ISSUE SUMMARY On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, datacenters in the asia-northeast1 region experienced a loss of network connectivity for a total of 62 minutes. We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve. We recognize the severity of this incident and have completed an extensive internal postmortem. We thoroughly understand the root causes and no datacenters are at risk of recurrence. We are working to add mechanisms to prevent and mitigate this class of problem in the future. We have prioritized this work and in the coming weeks, our engineering team will complete the action items we have generated from the postmortem.

DETAILED DESCRIPTION OF IMPACT On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, network connectivity to and from Google Cloud services running in the asia-northeast1 region was unavailable for 62 minutes. This issue affected all Google Cloud Platform services in that region, including Compute Engine, App Engine, Cloud SQL, Cloud Datastore, and Cloud Storage. All external connectivity to the region was affected during this time frame, while internal connectivity within the region was not affected. In addition, inbound requests from external customers originating near Google’s Tokyo point of presence intended for Compute or Container Engine HTTP Load Balancing were lost for the initial 12 minutes of the outage. Separately, Internal Load Balancing within asia-northeast1 remained degraded until 10:23.

ROOT CAUSE At the time of the incident, Google engineers were upgrading the network topology and capacity of the region; a configuration error caused the existing links to be decommissioned before the replacement links could provide connectivity, resulting in a loss of connectivity for the asia-northeast1 region. Although the replacement links were already commissioned and appeared to be ready to serve, a network-routing protocol misconfiguration meant that the routes through those links were not able to carry traffic. As Google's global network grows continuously, we make upgrades and updates reliably by using automation for each step and, where possible, applying changes to only one zone at any time. asia-northeast1 was the last region whose topology was not yet supported by automation; manual work was required to align its topology with the rest of our regional deployments (which would, in turn, allow automation to function properly in the future). This manual change mistakenly did not follow the same per-zone restrictions as required by standard policy or automation, which meant the entire region was affected simultaneously. In addition, some customers with deployments across multiple regions that included asia-northeast1 experienced problems with HTTP Load Balancing due to a failure to detect that the backends were unhealthy. When a network partition occurs, HTTP Load Balancing normally detects this automatically within a few seconds and routes traffic to backends in other regions. In this instance, due to a performance feature being tested in this region at the time, the mechanism that usually detects network partitions did not trigger, and continued to attempt to assign traffic until our on-call engineers responded. Lastly, the Internal Load Balancing outage was exacerbated by a software-defined networking component which was stuck in a state where it was not able to provide network resolution for instances in the load balancing group.

REMEDIATION AND PREVENTION Google engineers were paged by automated monitoring within one minute of the start of the outage, at 08:24 PDT. They began troubleshooting and declared an emergency incident 8 minutes later at 08:32. The issue was resolved when engineers reconnected the network path and reverted the configuration back to the last known working state at 09:22. Our monitoring systems worked as expected and alerted us to the outage promptly. The time-to-resolution for this incident was extended by the time taken to perform the rollback of the network change, as the rollback had to be performed manually. We are implementing a policy change that any manual work on live networks be constrained to a single zone. This policy will be enforced automatically by our change management software when changes are planned and scheduled. In addition, we are building automation to make these types of changes in the future, and to ensure the system can be safely rolled back to a previous known-good configuration at any time during the procedure. The fix for the HTTP Load Balancing performance feature that caused it to incorrectly believe zones within asia-northeast1 were healthy will be rolled out shortly.

SUPPORT COMMUNICATIONS During the incident, customers who had originally contacted Google Cloud Support in Japanese did not receive periodic updates from Google as the event unfolded. This was due to a software defect in the support tooling — unrelated to the incident described earlier. We have already fixed the software defect, so all customers who contact support will receive incident updates. We apologize for the communications gap to our Japanese-language customers.

RELIABILITY SUMMARY One of our biggest pushes in GCP reliability at Google is a focus on careful isolation of zones from each other. As we encourage users to build reliable services using multiple zones, we also treat zones separately in our production practices, and we enforce this isolation with software and policy. Since we missed this mark—and affecting all zones in a region is an especially serious outage—we apologize. We intend for this incident report to accurately summarize the detailed internal post-mortem that includes final assessment of impact, root cause, and steps we are taking to prevent an outage of this form occurring again. We hope that this incident report demonstrates the work we do to learn from our mistakes to deliver on this commitment. We will do better. Sincerely, Benjamin Lutch | VP Site Reliability Engineering | Google

Last Update: A few months ago

RESOLVED: Incident 17008 - Network issue in asia-northeast1

ISSUE SUMMARY On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, datacenters in the asia-northeast1 region experienced a loss of network connectivity for a total of 62 minutes. We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve. We recognize the severity of this incident and have completed an extensive internal postmortem. We thoroughly understand the root causes and no datacenters are at risk of recurrence. We are working to add mechanisms to prevent and mitigate this class of problem in the future. We have prioritized this work and in the coming weeks, our engineering team will complete the action items we have generated from the postmortem.

DETAILED DESCRIPTION OF IMPACT On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, network connectivity to and from Google Cloud services running in the asia-northeast1 region was unavailable for 62 minutes. This issue affected all Google Cloud Platform services in that region, including Compute Engine, App Engine, Cloud SQL, Cloud Datastore, and Cloud Storage. All external connectivity to the region was affected during this time frame, while internal connectivity within the region was not affected. In addition, inbound requests from external customers originating near Google’s Tokyo point of presence intended for Compute or Container Engine HTTP Load Balancing were lost for the initial 12 minutes of the outage. Separately, Internal Load Balancing within asia-northeast1 remained degraded until 10:23.

ROOT CAUSE At the time of the incident, Google engineers were upgrading the network topology and capacity of the region; a configuration error caused the existing links to be decommissioned before the replacement links could provide connectivity, resulting in a loss of connectivity for the asia-northeast1 region. Although the replacement links were already commissioned and appeared to be ready to serve, a network-routing protocol misconfiguration meant that the routes through those links were not able to carry traffic. As Google's global network grows continuously, we make upgrades and updates reliably by using automation for each step and, where possible, applying changes to only one zone at any time. asia-northeast1 was the last region whose topology was not yet supported by automation; manual work was required to align its topology with the rest of our regional deployments (which would, in turn, allow automation to function properly in the future). This manual change mistakenly did not follow the same per-zone restrictions as required by standard policy or automation, which meant the entire region was affected simultaneously. In addition, some customers with deployments across multiple regions that included asia-northeast1 experienced problems with HTTP Load Balancing due to a failure to detect that the backends were unhealthy. When a network partition occurs, HTTP Load Balancing normally detects this automatically within a few seconds and routes traffic to backends in other regions. In this instance, due to a performance feature being tested in this region at the time, the mechanism that usually detects network partitions did not trigger, and continued to attempt to assign traffic until our on-call engineers responded. Lastly, the Internal Load Balancing outage was exacerbated by a software-defined networking component which was stuck in a state where it was not able to provide network resolution for instances in the load balancing group.

REMEDIATION AND PREVENTION Google engineers were paged by automated monitoring within one minute of the start of the outage, at 08:24 PDT. They began troubleshooting and declared an emergency incident 8 minutes later at 08:32. The issue was resolved when engineers reconnected the network path and reverted the configuration back to the last known working state at 09:22. Our monitoring systems worked as expected and alerted us to the outage promptly. The time-to-resolution for this incident was extended by the time taken to perform the rollback of the network change, as the rollback had to be performed manually. We are implementing a policy change that any manual work on live networks be constrained to a single zone. This policy will be enforced automatically by our change management software when changes are planned and scheduled. In addition, we are building automation to make these types of changes in the future, and to ensure the system can be safely rolled back to a previous known-good configuration at any time during the procedure. The fix for the HTTP Load Balancing performance feature that caused it to incorrectly believe zones within asia-northeast1 were healthy will be rolled out shortly.

SUPPORT COMMUNICATIONS During the incident, customers who had originally contacted Google Cloud Support in Japanese did not receive periodic updates from Google as the event unfolded. This was due to a software defect in the support tooling — unrelated to the incident described earlier. We have already fixed the software defect, so all customers who contact support will receive incident updates. We apologize for the communications gap to our Japanese-language customers.

RELIABILITY SUMMARY One of our biggest pushes in GCP reliability at Google is a focus on careful isolation of zones from each other. As we encourage users to build reliable services using multiple zones, we also treat zones separately in our production practices, and we enforce this isolation with software and policy. Since we missed this mark—and affecting all zones in a region is an especially serious outage—we apologize. We intend for this incident report to accurately summarize the detailed internal post-mortem that includes final assessment of impact, root cause, and steps we are taking to prevent an outage of this form occurring again. We hope that this incident report demonstrates the work we do to learn from our mistakes to deliver on this commitment. We will do better. Sincerely, Benjamin Lutch | VP Site Reliability Engineering | Google

Last Update: A few months ago

UPDATE: Incident 18029 - BigQuery Increased Error Rate

The BigQuery service was experiencing a 78% error rate on streaming operations and up to 27% error rates on other operations from 10:43 to 11:03 US/Pacific time. This issue has been resolved for all affected projects as of 10:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17004 - Cloud console: changing language preferences

The Google Cloud Console issue that was preventing users from changing their language preferences has been resolved as of 06:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17004 - Cloud console: changing language preferences

The Google Cloud Console issue that is preventing users from changing their language preferences is ongoing. Our Engineering Team is working on it. We will provide another status update by 06:00 US/Pacific with current details. A known workaround is to change the browser language.

Last Update: A few months ago

UPDATE: Incident 17004 - Cloud console: changing language preferences

We are investigating an issue with the cloud console. Users are unable to change their language preferences. We will provide more information by 04:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17005 - High Latency in App Engine

ISSUE SUMMARY On Wednesday 7 June 2017, Google App Engine experienced highly elevated serving latency and timeouts for a duration of 138 minutes. If your service or application was affected by the increase in latency, we sincerely apologize – this is not the level of reliability and performance we expect of our platform, and we are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT On Wednesday 7 June 2017, from 13:34 PDT to 15:52 PDT, 7.7% of active applications on the Google App Engine service experienced severely elevated latency; requests that typically take under 500ms to serve were taking many minutes. This elevated latency would have either resulted in users seeing additional latency when waiting for responses from the affected applications or 500 errors if the application handlers timed out. The individual application logs would have shown this increased latency or increases in “Request was aborted after waiting too long to attempt to service your request” error messages.

ROOT CAUSE The incident was triggered by an increase in memory usage across all App Engine appservers in a datacenter in us-central. An App Engine appserver is responsible for creating instances to service requests for App Engine applications. When its memory usage increases to unsustainable levels, it will stop some of its current instances, so that they can be rescheduled on other appservers in order to balance out the memory requirements across the datacenter. This transfer of an App Engine instance between appservers consumes CPU resources, a signal used by the master scheduler of the datacenter to detect when it must further rebalance traffic across more appservers (such as when traffic to the datacenter increases and more App Engine instances are required). Normally, these memory management techniques are transparent to customers but in isolated cases, they can be exacerbated by large amounts of additional traffic being routed to the datacenter, which requires more instances to service user requests. The increased load and memory requirement from scheduling new instances combined with rescheduling instances from appservers with high memory usage resulted in most appservers being considered “busy” by the master scheduler. User requests needed to wait for an available instance to either be transferred or created before they were able to be serviced, which resulted in the increased latency seen at the app level.

REMEDIATION AND PREVENTION Latencies began to increase at 13:34 PDT. Google engineers were alerted to the increase in latency at 13:45 PDT and were able to identify a subset of traffic that was causing the increase in memory usage. At 14:08, they were able to limit this subset of traffic to an isolated partition of the datacenter to ease the memory pressure on the remaining appservers. Latency for new requests started to improve as soon as this traffic was isolated; however, tail latency was still elevated due to the large backlog of requests that had accumulated since the incident started. This backlog was eventually cleared by 15:52 PDT. To prevent further recurrence, traffic to the affected datacenter was rebalanced with another datacenter. To prevent future recurrence of this issue, Google engineers will be re-evaluating the resource distribution in the us-central datacenters where App Engine instances are hosted. Additionally, engineers will be developing stronger alerting thresholds based on memory pressure signals so that traffic can be redirected before latency increases. And finally, engineers will be evaluating changes to the scheduling strategy used by the master scheduler responsible for scheduling appserver work to prevent this situation in the future.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

The issue with Cloud Logging exports to BigQuery failing should be resolved for the majority of projects and we expect a full resolution in the next 12 hours. We will provide another status update by 14:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

The issue with Cloud Logging exports to BigQuery failing should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 2am PST with current details.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

The issue with Cloud Logging exports to BigQuery failing should be resolved for some projects and we expect a full resolution in the near future. We will provide another status update by 11pm PST with current details.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

We are still investigating the issue with Cloud Logging exports to BigQuery failing. We will provide more information by 9pm PST. Currently, we are also working on restoring Cloud Logging exports to BigQuery.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

We are still investigating the issue with Cloud Logging exports to BigQuery failing. We will provide more information by 7pm PST. Currently, we are also working on restoring Cloud Logging exports to BigQuery.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

We are working on restoring Cloud Logging exports to BigQuery. We will provide further updates at 6pm PT.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

Cloud Logging exports to BigQuery failed from 13:19 to approximately 14:30 with loss of logging data. We have stopped the exports while we work on fixing the issue, so BigQuery will not reflect the latest logs. This incident only affects robot accounts using HTTP requests. We are working hard to restore service, and we will provide another update in one hour (by 5pm PT).
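
Once exports are re-enabled, projects can verify that their export sinks to BigQuery are still configured as expected. A minimal sketch, assuming the google-cloud-logging Python library; the sink name, filter, and dataset below are hypothetical:

from google.cloud import logging

client = logging.Client()

# List existing export sinks and their destinations.
for sink in client.list_sinks():
    print(sink.name, "->", sink.destination)

# Re-create a hypothetical sink that exports audit logs to a BigQuery dataset.
sink = client.sink(
    "audit-to-bigquery",  # hypothetical sink name
    filter_='logName:"cloudaudit.googleapis.com"',
    destination="bigquery.googleapis.com/projects/my-project/datasets/my_dataset",
)
if not sink.exists():
    sink.create()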

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

We are investigating an issue with Cloud Logging exports to BigQuery failing. We will provide more information by 5pm PT

Last Update: A few months ago

RESOLVED: Incident 17006 - Network issue in asia-northeast1

Network connectivity in asia-northeast1 has been restored for all affected users as of 10:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 17008 - Network issue in asia-northeast1

Network connectivity in asia-northeast1 has been restored for all affected users as of 10:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17008 - Network issue in asia-northeast1

Network connectivity in asia-northeast1 should be restored for all affected users and we expect a full resolution in the near future. We will provide another status update by 09:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - Network issue in asia-northeast1

Network connectivity in asia-northeast1 should be restored for all affected users and we expect a full resolution in the near future. We will provide another status update by 09:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17008 - Network issue in asia-northeast1

Google Cloud Platform services in region asia-northeast1 are experiencing connectivity issues. We will provide another status update by 9:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - Network issue in asia-northeast1

Google Cloud Platform services in region asia-northeast1 are experiencing connectivity issues. We will provide another status update by 9:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17008 - Network issue in asia-northeast1

We are investigating an issue with network connectivity in asia-northeast1. We will provide more information by 09:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17006 - Network issue in asia-northeast1

We are investigating an issue with network connectivity in asia-northeast1. We will provide more information by 09:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17005 - High Latency in App Engine

The issue with Google App Engine serving elevated error rates has been resolved for all affected projects as of 15:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17005 - High Latency in App Engine

The issue with Google App Engine serving elevated error rates should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - High Latency in App Engine

We have identified an issue with App Engine that is causing increased latency to a portion of applications in the US Central region. Mitigation is under way. We will provide more information about the issue by 15:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

The fix has been fully deployed. We confirmed that the issue has been fixed.

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

The fix is still being deployed. We will provide another status update by 2017-05-26 19:00 US/Pacific

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

The fix is currently being deployed. We will provide another status update by 2017-05-26 16:00 US/Pacific

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

The root cause has been identified. The fix is currently being deployed. We will provide another status update by 2017-05-26 14:30 US/Pacific

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

We are currently investigating a problem that is causing app engine apps to experience an infinite redirect loop when users log in. We will provide another status update by 2017-05-26 13:30 US/Pacific

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

The issue with Cloud SQL First Generation automated backups should be resolved for all affected instances as of 12:52 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

We are still actively working on this issue. We are aiming to make the final fix available in production by the end of the day today.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

Daily backups continue to be taken and we expect the final fix to be available tomorrow. We'll provide another update on Wednesday, May 24 at 10:00 US/Pacific as originally planned.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

We are still actively working on this issue. We are aiming to make the fix available in production by the end of the day today. We will provide another update by end of day if anything changes.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

Daily backups are being taken for all Cloud SQL First Generation instances. For some instances, backups are being taken outside of defined backup windows. A fix is being tested and will be applied to First Generation instances pending positive test results. We will provide the next update on Wednesday, May 24 at 10:00 US/Pacific, or sooner if anything changes.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL v1 automated backups experiencing errors

The issue with automatic backups for Cloud SQL First Generation is mitigated by forcing the backups. We’ll continue with this mitigation until Monday US/Pacific, when a permanent fix will be rolled out. We will provide the next update on Monday US/Pacific, or sooner if anything changes.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL v1 automated backups experiencing errors

The Cloud SQL service is experiencing errors on automatic backups for 7% of Cloud SQL first generation instances. We’re forcing the backup for affected instances as short-term mitigation. We will provide another status update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL v1 automated backups experiencing errors

We are investigating an issue with Cloud SQL v1 automated backups. We will provide more information by 17:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Translate API elevated latency / errors

The issue with Translation Service and other API availability has been resolved for all affected users as of 19:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. This will be the final update on this incident.

Last Update: A few months ago

UPDATE: Incident 17001 - Translate API elevated latency / errors

Engineering is continuing to investigate the API service issues impacting Translation Service API availability, looking into potential causes and mitigation strategies. Certain other APIs, such as Speech and Prediction, may also be affected. Next update by 20:00 Pacific.

Last Update: A few months ago

UPDATE: Incident 17002 - GKE IP rotation procedure does not include FW rule change

Our Engineering Team has identified a fix for the issue with the GKE IP-rotation feature. We expect the rollout of the fix to begin next Tuesday, 2017-05-16 US/Pacific, completing on Friday, 2017-05-19.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

The issue with Google App Engine deployments and Memcache availability should have been resolved for all affected projects as of 18:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

The issue with Google App Engine deployments and Memcache availability is mitigated. We will provide an update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

We are continuing to investigate the issue with an underlying component that affects Google App Engine deployments and Memcache availability. The engineering team has tried several unsuccessful remediations and is continuing to investigate potential root causes and fixes. We will provide another update at 17:30 PDT.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

We are still investigating the issue with an underlying component that affects both Google App Engine deployments and Memcache availability. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

We are currently investigating an issue with an underlying component that affects both Google App Engine deployments and Memcache availability. Deployments are failing intermittently and memcache is returning intermittent errors for a small number of applications. For affected deployments, please try re-deploying while we continue to investigate this issue. For affected memcache users, retries in your application code should allow you to access your memcache intermittently while the underlying issue is being addressed. The underlying component's engineering team is working to address the issue. We will provide our next update at 15:30 PDT.
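
As a rough illustration of the retry approach suggested above (a sketch only, assuming a Python App Engine standard application; the cache key is hypothetical):

import time

from google.appengine.api import memcache


def get_with_retry(key, attempts=3, delay=0.1):
    # Retry cache reads briefly; treat errors the same as misses.
    for i in range(attempts):
        try:
            value = memcache.get(key)
            if value is not None:
                return value
        except Exception:
            pass
        time.sleep(delay * (i + 1))
    return None  # caller should fall back to the datastore or recompute


profile = get_with_retry("user-profile-123")  # hypothetical key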

Last Update: A few months ago

UPDATE: Incident 17002 - GKE IP rotation procedure does not include FW rule change

We are experiencing an issue with the GKE IP-rotation feature. Kubernetes features that rely on the proxy (including the kubectl exec and logs commands, as well as exporting cluster metrics into Stackdriver) are broken by the IP-rotation feature. This only affects users who have disabled default SSH access on their nodes. A manual fix is described here: https://cloud.google.com/container-engine/docs/ip-rotation#known_issues
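
The exact firewall rule to restore is described in the linked documentation. Purely as an illustration of how such a rule can be created programmatically (all values below are placeholders, not the specific rule GKE requires), a sketch using the Google API Python client:

from googleapiclient import discovery

compute = discovery.build("compute", "v1")
firewall_body = {
    "name": "example-allow-master-to-nodes",  # placeholder name
    "network": "global/networks/default",     # placeholder network
    "sourceRanges": ["203.0.113.0/28"],       # placeholder source CIDR
    "targetTags": ["example-node-tag"],       # placeholder node tag
    # Placeholder protocol/port; use the values from the known-issues page.
    "allowed": [{"IPProtocol": "tcp", "ports": ["10250"]}],
}
compute.firewalls().insert(project="my-project", body=firewall_body).execute()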

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

The issue with BigQuery jobs remaining in a pending state for too long has been resolved for all affected projects as of 03:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

The issue with BigQuery jobs remaining in a pending state for too long should be resolved for all new jobs. We will provide another status update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

BigQuery engineers are still working on a fix for jobs remaining in a pending state for too long. We will provide another update by 03:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

BigQuery engineers have identified the root cause of jobs remaining in a pending state for too long and are still working on a fix. We will provide another update by 02:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

BigQuery engineers have identified the root cause of jobs remaining in a pending state for too long and are applying a fix. We will provide another update by 01:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

BigQuery engineers continue to investigate jobs remaining in a pending state for too long. We will provide another update by 00:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

The BigQuery service is experiencing jobs staying in a pending state for longer than usual, and our engineering team is working on it. We will provide another status update by 23:45 US/Pacific with current details.
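
To check whether a specific job is affected, its state can be polled with the BigQuery client library. A minimal sketch, assuming the google-cloud-bigquery Python client and a placeholder job ID:

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("my_load_job_id", location="US")  # placeholder job ID
print(job.job_type, job.state)  # state is PENDING, RUNNING, or DONE
if job.state == "DONE" and job.error_result:
    print("job failed:", job.error_result)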

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

We are investigating an issue with BigQuery import jobs pending. We will provide more information by 23:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Node pool creation failing in multiple zones

We are investigating an issue with Google Container Engine (GKE) that is affecting node-pool creations in the following zones: us-east1-d, asia-northeast1-c, europe-west1-c, us-central1-b, us-west1-a, asia-east1-a, asia-northeast1-a, asia-southeast1-a, us-east4-b, us-central1-f, europe-west1-b, asia-east1-c, us-east1-c, us-west1-b, asia-northeast1-b, asia-southeast1-b, and us-east4-c. We will provide more information by 17:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17007 - 502 errors for HTTP(S) Load Balancers

ISSUE SUMMARY On Friday 5 April 2017, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. We apologize for this incident. We understand that the Load Balancer needs to be very reliable for you to offer a high-quality service to your customers. We have taken and will be taking various measures to prevent this type of incident from recurring.

DETAILED DESCRIPTION OF IMPACT On Friday 5 April 2017 from 01:13 to 01:35 PDT, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. Clients received 502 errors for failed requests. Some HTTP(S) Load Balancers that were recently modified experienced error rates of 100%. Google paused all configuration changes to the HTTP(S) Load Balancer for three hours and 41 minutes after the incident, until our engineers had understood the root cause. This caused deployments of App Engine Flexible apps to fail during that period.

ROOT CAUSE A bug in the HTTP(S) Load Balancer configuration update process caused it to revert to a configuration that was substantially out of date. The configuration update process is controlled by a master server. In this case, one of the replicas of the master servers lost access to Google's distributed file system and was unable to read recent configuration files. Mastership then passed to the server that could not access Google's distributed file system. When the mastership changes, it begins the next configuration push as normal by testing on a subset of HTTP(S) Load Balancers. If this test succeeds, the configuration is pushed globally to all HTTP(S) Load Balancers. If the test fails (as it did in this case), the new master will revert all HTTP(S) Load Balancers to the last "known good" configuration. The combination of a mastership change, lack of access to more recent updates, and the initial test failure for the latest config caused the HTTP(S) Load Balancers to revert to the latest configuration that the master could read, which was substantially out-of-date. In addition, the update with the out-of-date configuration triggered a garbage collection process on the Google Frontend servers to free up memory used by the deleted configurations. The high number of deleted configurations caused the Google Frontend servers to spend a large proportion of CPU cycles on garbage collection, which led to failed health checks and the eventual restart of the affected Google Frontend servers. Any client requests served by a restarting server received 502 errors.

REMEDIATION AND PREVENTION Google engineers were paged at 01:22 PDT. They switched the configuration update process to use a different master server at 01:34, which mitigated the issue for most services within one minute. Our engineers then paused the configuration updates to the HTTP(S) Load Balancer until 05:16 while the root cause was confirmed. To prevent incidents of this type in the future, we are taking the following actions:
* Master servers will be configured to never push HTTP(S) Load Balancer configurations that are more than a few hours old.
* Google Frontend servers will reject loading a configuration file that is more than a few hours old.
* Improve testing for new HTTP(S) Load Balancer configurations so that out-of-date configurations are flagged before being pushed to production.
* Fix the issue that caused the master server to fail when reading files from Google's distributed file system.
* Fix the issue that caused health check failures on Google Frontends during heavy garbage collection.
Once again, we apologize for the impact that this incident had on your service.
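
As a general client-side mitigation for short windows of 502s like this (a generic pattern, not something prescribed in this report), idempotent requests to a load balancer can be retried with backoff. A minimal sketch using the Python requests library; the URL is a placeholder:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=4,
    backoff_factor=0.5,                # 0.5s, 1s, 2s, 4s between attempts
    status_forcelist=[502, 503, 504],  # retry only transient gateway errors
    allowed_methods=["GET", "HEAD"],   # idempotent methods only (urllib3 >= 1.26)
)
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get("https://example.com/api/resource", timeout=10)
response.raise_for_status()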

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

The issue with Google App Engine Taskqueue has been resolved for all affected users as of 00:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

The issue with Google App Engine Taskqueue in the us-east1/asia-northeast1 regions has been resolved. The issue with deployments in us-east1 is mitigated. For everyone who is still affected, we apologize for any inconvenience you may be experiencing. We will continue to monitor and will provide another status update by 2017-04-07 02:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

The issue with Google App Engine Taskqueue in us-east1/asia-northeast1 regions has been partially resolved. The issue with deployments in us-east1 is partially mitigated. For everyone who is still affected, we apologize for any inconvenience you may be experiencing. We will provide another status update by 2017-04-07 00:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

The issue with Google App Engine Taskqueue in US-east1/Asia-northeast1 regions has been partially resolved. We are investigating related issues impacting deployments in US-east1. For everyone who is still affected, we apologize for any inconvenience you may be experiencing. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

We are still investigating reports of an issue with App Engine Taskqueue in US-east1/Asia-northeast1 regions. We will provide another status update by 2017-04-06 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

We are investigating an issue impacting Google App Engine Task Queue in US-east1/Asia-northeast1. We will provide more information by 09:00pm US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18027 - BigQuery Streaming API returned a 500 response from 00:04 to 00:38 US/Pacific.

The issue with BigQuery Streaming API returning 500 response code has been resolved for all affected users as of 00:38 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 17001 - Issues with Cloud Pub/Sub

ISSUE SUMMARY On Tuesday 21 March 2017, new connections to Cloud Pub/Sub experienced high latency leading to timeouts and elevated error rates for a duration of 95 minutes. Connections established before the start of this issue were not affected. If your service or application was affected, we apologize – this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT On Tuesday 21 March 2017 from 21:08 to 22:43 US/Pacific, Cloud Pub/Sub publish, pull and ack methods experienced elevated latency, leading to timeouts. The average error rate for the issue duration was 0.66%. The highest error rate occurred at 21:43, when the Pub/Sub publish error rate peaked at 4.1%, the ack error rate reached 5.7% and the pull error rate was 0.02%.

ROOT CAUSE The issue was caused by the rollout of a storage system used by the Pub/Sub service. As part of this rollout, some servers were taken out of service, and as planned, their load was redirected to remaining servers. However, an unexpected imbalance in key distribution led some of the remaining servers to become overloaded. The Pub/Sub service was then unable to retrieve the status required to route new connections for the affected methods. Additionally, some Pub/Sub servers didn’t recover promptly after the storage system had been stabilized and required individual restarts to fully recover.

REMEDIATION AND PREVENTION Google engineers were alerted by automated monitoring seven minutes after initial impact. At 21:24, they had correlated the issue with the storage system rollout and stopped it from proceeding further. At 21:41, engineers restarted some of the storage servers, which improved systemic availability. Observed latencies for Pub/Sub were still elevated, so at 21:54, engineers commenced restarting other Pub/Sub servers, restoring service to 90% of users. At 22:29 a final batch was restarted, restoring the Pub/Sub service to all users. To prevent a recurrence of the issue, Google engineers are creating safeguards to limit the number of keys managed by each server. They are also improving the availability of Pub/Sub servers to respond to requests even when in an unhealthy state. Finally, they are deploying enhancements to the Pub/Sub service so that it can operate when the storage system is unavailable.
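
Publishers that need to fail fast during a window of elevated latency like this can put an explicit timeout on the publish future. A minimal sketch, assuming the google-cloud-pubsub Python library; the project and topic names are placeholders:

from concurrent import futures

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

future = publisher.publish(topic_path, b"example payload")
try:
    message_id = future.result(timeout=30)  # surface slowness instead of hanging
    print("published", message_id)
except futures.TimeoutError:
    # Log and retry or buffer locally; the publish may still complete server-side.
    print("publish timed out after 30s")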

Last Update: A few months ago

UPDATE: Incident 17001 - Issues with Cloud Pub/Sub

The issue with Pub/Sub high latency has been resolved for all affected projects as of 22:02 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17001 - Issues with Cloud Pub/Sub

We are investigating an issue with Pub/Sub. We will provide more information by 22:40 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18026 - BigQuery streaming inserts issue

ISSUE SUMMARY On Monday 13 March 2017, the BigQuery streaming API experienced a 91% error rate in the US and a 63% error rate in the EU for a duration of 30 minutes. We apologize for the impact of this issue on our customers, and for the widespread nature of the issue in particular. We have completed a postmortem of the incident and are making changes to mitigate and prevent recurrences.

DETAILED DESCRIPTION OF IMPACT On Monday 13 March 2017 from 10:22 to 10:52 PDT, 91% of streaming API requests to US BigQuery datasets and 63% of streaming API requests to EU BigQuery datasets failed with error code 503 and an HTML message indicating "We're sorry... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now." All non-streaming API requests, including DDL requests and query, load, extract and copy jobs, were unaffected.

ROOT CAUSE The trigger for this incident was a sudden increase in log entries being streamed from Stackdriver Logging to BigQuery by logs export. The denial of service (DoS) protection used by BigQuery responded to this by rejecting excess streaming API traffic. However, the configuration of the DoS protection did not adequately segregate traffic streams, resulting in normal sources of BigQuery streaming API requests being rejected.

REMEDIATION AND PREVENTION Google engineers initially mitigated the issue by blocking the source of unexpected load. This prevented the overload and allowed all other traffic to resume normally. Engineers fully resolved the issue by identifying and reverting the change that triggered the increase in log entries and clearing the backlog of log entries that had grown. To prevent future occurrences, BigQuery engineers are updating the configuration to increase isolation between different traffic sources. Tests are also being added to verify behavior under several new load scenarios.

Last Update: A few months ago

UPDATE: Incident 17006 - GCE networking in us-central1 zones is experiencing disruption

GCP Services' internet connectivity has been restored as of 2:12 pm Pacific Time. We apologize for the impact that this issue had on your application. We are still investigating the root cause of the issue, and will take necessary actions to prevent a recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - GCE networking in us-central1 zones is experiencing disruption

We are experiencing a networking issue with Google Compute Engine instances in us-central1 zones beginning at Wednesday, 2017-03-15 01:00 PM US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18026 - BigQuery streaming inserts issue

The issue with BigQuery streaming inserts has been resolved for all affected projects as of 11:06 AM US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18026 - BigQuery streaming inserts issue

We are investigating an issue with BigQuery streaming inserts. We will provide more information by 11:45 AM US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16037 - Elevated Latency and Error Rates For GCS in Europe

During the period 12:05 - 13:57 PDT, GCS requests originating in Europe experienced a 17% error rate. GCS requests in other regions were unaffected.

Last Update: A few months ago

UPDATE: Incident 16037 - Elevated Latency and Error Rates For GCS in Europe

The issue with elevated latency and error rates for GCS in Europe should be resolved as of 13:56 PDT. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16037 - Elevated Latency and Error Rates For GCS in Europe

We are continuing to investigate elevated errors and latency for GCS in Europe. The reliability team has performed several mitigation steps and error rates and latency are returning to normal levels. We will continue to monitor for recurrence.

Last Update: A few months ago

UPDATE: Incident 16037 - Elevated Latency and Error Rates For GCS in Europe

We are currently investigating elevated latency and error rates for Google Cloud Storage traffic transiting through Europe.

Last Update: A few months ago

RESOLVED: Incident 17003 - GCP accounts with credits are being charged without credits being applied

We have mitigated the issue as of 2017-03-10 09:30 PST.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

Some Cloud Console users may notice that Dataflow job logs are not displayed correctly. This is a known issue with the user interface that affects up to 35% of jobs. Google engineers are preparing a fix. Pipeline executions are not impacted and Dataflow services are operating as normal.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

We are still investigating the issue with Dataflow Job Log in Cloud Console. Current data indicates that between 30% and 35% of jobs are affected by this issue. Pipeline execution is not impacted. The root cause of the issue is known and the Dataflow Team is preparing the fix for production.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

The root cause of the issue is known and the Dataflow Team is preparing the fix for production. We will provide an update by 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

We are experiencing a visibility issue with Dataflow Job Log in Cloud Console beginning at Thursday, 2017-03-09 11:34 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 13:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

We are investigating an issue with Dataflow Job Log visibility in Cloud Console. We will provide more information by 12:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17014 - Cloud SQL instance creation in zones us-central1-c, us-east1-c, asia-northeast1-a, asia-east1-b, us-central1-f may be failing.

The issue should have been mitigated for all zones except us-central1-c. Creating new Cloud SQL 2nd generation instances with SSD Persistent Disks in us-central1-c still fails. The workaround is to create your instances in a different zone or to use Standard Persistent Disks in us-central1-c.

Last Update: A few months ago

UPDATE: Incident 17014 - Cloud SQL instance creation in zones us-central1-c and asia-east1-c are failing.

Correction: Attempts to create new Cloud SQL instances in zones us-central1-c, us-east1-c, asia-northeast1-a, asia-east1-b, us-central1-f may be intermittently failing. New instances affected in these zones will show a status of "Failed to create". Users will not incur charges for instances that failed to create; these instances can be safely deleted.

Last Update: A few months ago

UPDATE: Incident 17014 - Cloud SQL instance creation in zones us-central1-c and asia-east1-c are failing.

Attempts to create new Cloud SQL instances in zones us-central1-c and asia-east1-c are failing. New instances created in these zones will show a status of "Failed to create". Users will not incur charges for instances that failed to create; these instances can be safely deleted.

Last Update: A few months ago

RESOLVED: Incident 17005 - Network packet loss to Compute Engine us-west1 region

We confirm that the issue with GCE network connectivity to us-west1 should have been resolved for all affected endpoints as of 03:27 US/Pacific and the situation is stable. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17005 - Network packet loss to Compute Engine us-west1 region

GCE network connectivity to us-west1 remains stable and we expect a final resolution in the near future. We will provide another status update by 05:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - Network packet loss to Compute Engine us-west1 region

Network connectivity to the Google Compute Engine us-west1 region has been restored but the issue remains under investigation. We will provide another status update by 05:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - Network packet loss to Compute Engine us-west1 region

We are experiencing an issue with GCE network connectivity to the us-west1 region beginning at Tuesday, 2017-02-28 02:57 US/Pacific. We will provide a further update by 04:45.

Last Update: A few months ago

UPDATE: Incident 17005 - Network packet loss to Compute Engine us-west1 region

We are investigating an issue with network connectivity to the us-west1 region. We will provide more information by 04:15 US/Pacific time.

Last Update: A few months ago

RESOLVED: Incident 17002 - Cloud Datastore Internal errors in the European region

ISSUE SUMMARY
On Tuesday 14 February 2017, some applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced 2%-4% error rates and elevated latency for three periods with an aggregate duration of three hours and 36 minutes. We apologize for the disruption this caused to your service. We have already taken several measures to prevent incidents of this type from recurring and to improve the reliability of these services.

DETAILED DESCRIPTION OF IMPACT
On Tuesday 14 February 2017 between 00:15 and 01:18 PST, 54% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates and latency. The average error rate for affected applications was 4%. Between 08:35 and 08:48 PST, 50% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates. The average error rate for affected applications was 4%. Between 12:20 and 14:40 PST, 32% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates and latency. The average error rate for affected applications was 2%. Errors received by affected applications for all three incidents were either "internal error" or "timeout".

ROOT CAUSE
The incident was caused by a latent bug in a service used by both Cloud Datastore and the App Engine Search API that was triggered by high load on the service. Starting at 00:15 PST, several applications changed their usage patterns in one zone in Western Europe and began running more complex queries, which caused higher load on the service.

REMEDIATION AND PREVENTION
Google's monitoring systems paged our engineers at 00:35 PST to alert us to elevated errors in a single zone. Our engineers followed normal practice by redirecting traffic to other zones to reduce the impact on customers while debugging the underlying issue. At 01:15, we redirected traffic to other zones in Western Europe, which resolved the incident three minutes later. At 08:35 we redirected traffic back to the zone that previously had errors. We found that the error rate in that zone was still high and so redirected traffic back to other zones at 08:48. At 12:45, our monitoring systems detected elevated errors in other zones in Western Europe. At 14:06 Google engineers added capacity to the service with elevated errors in the affected zones. This removed the trigger for the incident. We have now identified and fixed the latent bug that caused errors when the system was at high load. We expect to roll out this fix over the next few days. Our capacity planning team has generated forecasts for peak load generated by the Cloud Datastore and App Engine Search API and determined that we have sufficient capacity provisioned to handle peak loads. We will be making several changes to our monitoring systems to improve our ability to quickly detect and diagnose errors of this type. Once again, we apologize for the impact of this incident on your application.

Last Update: A few months ago

UPDATE: Incident 17004 - Persistent Disk Does Not Produce Differential Snapshots In Some Cases

Since January 23rd, a small number of Persistent Disk snapshots were created as full snapshots rather than differential snapshots. While this results in overbilling, these snapshots still correctly back up your data and are usable for restores. We are working to resolve this issue and also to correct any overbilling that occurred. No further action is required from your side.

Last Update: A few months ago

RESOLVED: Incident 17002 - Cloud Datastore Internal errors in the European region

The issue with Cloud Datastore serving elevated internal errors in the European region should have been resolved for all affected projects as of 14:34 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17002 - Cloud Datastore Internal errors in the European region

We are investigating an issue with Cloud Datastore in the European region. We will provide more information by 15:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17003 - New VMs are experiencing connectivity issues

ISSUE SUMMARY
On Monday 30 January 2017, newly created Google Compute Engine instances, Cloud VPNs and network load balancers were unavailable for a duration of 2 hours 8 minutes. We understand how important the flexibility to launch new resources and scale up GCE is for our users and apologize for this incident. In particular, we apologize for the wide scope of this issue and are taking steps to address the scope and duration of this incident as well as the root cause itself.

DETAILED DESCRIPTION OF IMPACT
Any GCE instances, Cloud VPN tunnels or GCE network load balancers created or live migrated on Monday 30 January 2017 between 10:36 and 12:42 PDT were unavailable via their public IP addresses until the end of that period. This also prevented outbound traffic from affected instances and load balancing health checks from succeeding. Previously created VPN tunnels, load balancers and instances that did not experience a live migration were unaffected.

ROOT CAUSE
All inbound networking for GCE instances, load balancers and VPN tunnels enters via shared layer 2 load balancers. These load balancers are configured with changes to IP addresses for these resources, then automatically tested in a canary deployment, before changes are globally propagated. The issue was triggered by a large set of updates which were applied to a rarely used load balancing configuration. The application of updates to this configuration exposed an inefficient code path which resulted in the canary timing out. From this point, all changes to public addressing were queued behind these changes, which could not proceed past the testing phase.

REMEDIATION AND PREVENTION
To resolve the issue, Google engineers restarted the jobs responsible for programming changes to the network load balancers. After restarting, the problematic changes were processed in a batch, which no longer reached the inefficient code path. From this point updates could be processed and normal traffic resumed. This fix was applied zone by zone between 11:36 and 12:42. To prevent this issue from recurring in the short term, Google engineers are increasing the canary timeout so that updates exercising the inefficient code path merely slow network changes rather than stop them completely. As a long term resolution, the inefficient code path is being improved, and new tests are being written to test behavior on a wider range of configurations. Google engineers had already begun work to replace global propagation of address configuration with decentralized routing. This work is being accelerated as it will prevent issues in this layer from having global impact. Google engineers are also creating additional metrics and alerting that will allow the nature of this issue to be identified sooner, which will lead to faster resolution.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

The issue with Google App Engine should have been resolved for all affected users as of 17:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

The issue with Google App Engine has been partially resolved. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide another status update by 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

The issue with Google App Engine has been partially resolved. For everyone who is still affected, we apologize for any inconvenience you may be experiencing. We will provide another status update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

The issue with Google App Engine should have been partially resolved. We will provide another status update by 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

We are investigating reports of intermittent errors (502s) in Google App Engine. We will provide more information by 15:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Investigating possible Google Cloud Datastore Application Monitoring Metrics problem

The issue with Google Cloud Datastore Application Monitoring Metrics has been fully resolved for all affected applications as of 1:30 PM US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Investigating possible Google Cloud Datastore Application Monitoring Metrics

We are investigating an issue with Google Cloud Datastore related to Application Monitoring Metrics. We will provide more information by 1:30 PM US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17013 - Issue with Cloud SQL 2nd Generation instances beginning at Tuesday, 2017-01-26 01:00 US/Pacific.

The issue with Cloud SQL 2nd Generation Instances should have been resolved for all affected instances as of 21:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17013 - Issue with Cloud SQL 2nd Generation instances beginning at Tuesday, 2017-01-26 01:00 US/Pacific.

We are currently experiencing an issue with Cloud SQL 2nd Generation instances beginning at Tuesday, 2017-01-26 01:00 US/Pacific. This may cause poor performance or query failures for large queries on impacted instances. Current data indicates that 3% of Cloud SQL 2nd Generation instances were affected by this issue. As of 2017-01-31 20:30 PT, a fix has been applied to the majority of impacted instances, and we expect a full resolution in the near future. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 21:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

We have fully mitigated the network connectivity issues for newly-created GCE instances as of 12:45 US/Pacific, with VPN connectivity issues being fully mitigated at 12:50 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

The issue with newly-created GCE instances experiencing network connectivity problems should have been mitigated for all GCE regions except europe-west1, which is currently clearing. Newly-created VPNs are affected as well; we are still working on a mitigation for this. We will provide another status update by 13:10 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

The issue with newly-created GCE instances experiencing network connectivity problems should have been mitigated for the majority of GCE regions and we expect a full resolution in the near future. We will provide another status update by 12:40 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

We are experiencing a connectivity issue affecting newly-created VMs, as well as those undergoing live migrations beginning at Monday, 2017-01-30 10:54 US/Pacific. Mitigation work is currently underway. All zones should be coming back online in the next 15-20 minutes. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide another update at 12:10 PST.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

We are investigating an issue with newly-created VMs not having network connectivity. This also affects VMs undergoing live migrations. We will provide more information by 11:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17002 - Incident in progress - Some projects not visible for customers

The issue with listing GCP projects and organizations should have been resolved for all affected users as of 15:21 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17002 - Incident in progress - Some projects not visible for customers

The issue is still occurring for some projects for some users. Mitigation is still underway. We will provide the next update by 16:30 US/Pacific time.

Last Update: A few months ago

UPDATE: Incident 17002 - Incident in progress - Some projects not visible for customers

Listing of Google Cloud projects and organizations is still failing to show some entries. As this only affects listing, users can access their projects by navigating directly to appropriate URLs. Google engineers have a mitigation plan that is underway. We will provide another status update by 14:30 US/Pacific time with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Incident in progress - Some projects not visible for customers

We are experiencing an intermittent issue with the Google Cloud Projects search index beginning at Monday, 2017-01-23 00:00 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 8:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18025 - Bigquery query job failures

The issue with BigQuery's Table Copy service and query jobs should have been resolved for all affected projects as of 07:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18025 - Bigquery query job failures

The issue with BigQuery's Table Copy service and query jobs should be resolved for the majority of users, and we expect a full resolution in the near future. We will provide another status update by 08:00 US/Pacific, 2017-01-21.

Last Update: A few months ago

UPDATE: Incident 18025 - Bigquery query job failures

We believe the issue with BigQuery's Table Copy service and query jobs should be resolved for the majority of users, and we expect a full resolution in the near future. We will provide another status update by 06:00 US/Pacific, 2017-01-21.

Last Update: A few months ago

UPDATE: Incident 18025 - Bigquery query job failures

We believe the issue with BigQuery's Table Copy service and query jobs should be resolved for the majority of users, and we expect a full resolution in the near future. We will provide another status update by 04:00 US/Pacific, 2017-01-21.

Last Update: A few months ago

UPDATE: Incident 18025 - Bigquery query job failures

We are experiencing an issue with BigQuery query jobs beginning at Friday, 2017-01-21 18:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17001 - Cloud Console and Stackdriver may display the number of App Engine instances as zero for Java and Go

The issue with Cloud Console and Stackdriver showing the number of App Engine instances as zero should have been resolved for all affected projects as of 01:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17001 - Cloud Console and Stackdriver may display the number of App Engine instances as zero for Java and Go

We are experiencing an issue with Cloud Console and Stackdriver which may show the number of App Engine instances as zero beginning at Wednesday, 2017-01-18 18:45 US/Pacific. This issue should have been resolved for the majority of projects and we expect a full resolution by 2017-01-20 00:00 PDT.

Last Update: A few months ago

RESOLVED: Incident 18024 - BigQuery Web UI currently unavailable for some Customers

The issue with BigQuery UI should have been resolved for all affected users as of 05:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18024 - BigQuery Web UI currently unavailable for some Customers

The issue with BigQuery Web UI should have been resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 06:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18024 - BigQuery Web UI currently unavailable for some Customers

We are investigating an issue with BigQuery Web UI. We will provide more information by 2017-01-10 06:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

The issues with Cloud Shell and Compute Engine serial output should have been resolved for all affected instances as of 22:25 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

The issue with Cloud Shell should be resolved for all customers at this time. The issue with Compute Engine serial port output should be resolved for all new instances created after 19:45 PT in all zones. Instances created before 14:10 PT remain unaffected. Some instances created between 14:10 and 19:45 PT in us-east1-c and us-west1 may still be unable to view the serial port output. We are currently in the process of applying the fix to these remaining zones. Access to the serial console output should be restored for instances created between 14:10 and 19:45 PT in all other zones. We expect a full resolution in the near future. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

The issue with Cloud Shell should be resolved for all customers at this time. The issue with Compute Engine serial port output should be resolved for all new instances created after 19:45 PT in all zones. Instances created before 14:10 PT remain unaffected. Access to the serial console output should be restored for all instances in the asia-east1 and asia-northeast1 regions, and the us-central1-a zone, created between 14:10 and 19:45 PT. Some instances created between 14:10 and 19:45 PT in other zones may still be unable to view the serial console output. We are currently in the process of applying the fix to the remaining zones. We expect a full resolution in the near future. We will provide another status update by 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

The issue with Cloud Shell should be resolved for all customers at this time. The issue with Compute Engine serial port output should be resolved for all new instances created after 19:45 PT, and for all instances in us-central1-a and asia-east1-b created between 14:10 and 19:45 PT. Instances created before 14:10 PT remain unaffected. We expect a full resolution in the near future. We will provide another status update by 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

We are still investigating the issues with Compute Engine and Cloud Shell, and do not have any news at this time. We will provide another status update by 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

We are experiencing issues with Compute Engine serial port output and Cloud Shell beginning at Sunday, 2017-01-08 14:10 US/Pacific. Current data indicates that newly-created instances are unable to use the "get-serial-port-output" command in "gcloud", or use the Cloud Console to retrieve serial port output. Customers can still use the interactive serial console on these newly-created instances. Instances created before the impact time do not appear to be affected. Additionally, the Cloud Shell is intermittently available at this time. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:00 US/Pacific with current details.
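As a hedged illustration only (not from the incident report), the sketch below reads an instance's serial port output through the Compute Engine API using the google-cloud-compute Python library, which exposes the same data as the gcloud get-serial-port-output command; the project, zone and instance names are placeholders, and exact method availability may depend on the client library version.

from google.cloud import compute_v1

def read_serial_output(project: str, zone: str, instance: str) -> str:
    """Fetch the serial port output for one instance via the Compute Engine API."""
    client = compute_v1.InstancesClient()
    output = client.get_serial_port_output(project=project, zone=zone, instance=instance)
    return output.contents  # the text captured from the instance's serial port

if __name__ == "__main__":
    # Placeholder names; replace with a real project, zone and instance.
    print(read_serial_output("my-project", "us-central1-a", "my-instance"))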

Last Update: A few months ago

RESOLVED: Incident 17001 - Cloud VPN issues in regions us-west1 and us-east1

The issue with Cloud VPN where some tunnels weren’t connecting in us-east1 and us-west1 should have been resolved for all affected tunnels as of 23:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17001 - Cloud VPN issues in regions us-west1 and us-east1

We are investigating reports of an issue with Cloud VPN in regions us-west1 and us-east1. We will provide more information by 23:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16036 - Cloud Storage is showing inconsistent result for object listing for multi-regional buckets in US

The issue with Cloud Storage showing inconsistent results for object listing for multi-regional buckets in US should have been resolved for all affected projects as of 23:50 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16036 - Cloud Storage is showing inconsistent result for object listing for multi-regional buckets in US

We are still investigating the issue with Cloud Storage showing inconsistent results for object listing for multi-regional buckets in the US. We will provide another status update by 2016-12-20 02:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16036 - Cloud Storage is showing inconsistent result for object listing for multi-regional buckets in US

We are experiencing an intermittent issue with Cloud Storage showing inconsistent results for object listing for multi-regional buckets in the US beginning at approximately Monday, 2016-12-16 09:00 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2016-12-17 00:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 16035 - Elevated Cloud Storage error rate and latency

The issue with Google Cloud Storage seeing increased errors and latency should have been resolved for all affected users as of 09:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16035 - Elevated Cloud Storage error rate and latency

We are continuing to experience an issue with Google Cloud Storage. Errors and latency have decreased, but are not yet at pre-incident levels. We are continuing to investigate mitigation strategies and to identify the root cause. Impact is limited to the US region. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16035 - Elevated Cloud Storage error rate and latency

We are investigating an issue with Google Cloud Storage serving increased errors and higher latency. We will provide more information by 10:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16011 - App Engine remote socket API errors in us-central region

The issue with App Engine applications having higher than expected socket API errors should have been resolved for all affected applications as of 22:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

The issue with App Engine remote socket API errors in us-central region should have been resolved for all affected projects as of 19:46 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

We are still investigating reports of an issue with App Engine applications having higher than expected socket API errors if they are located in the us-central region. We will provide another status update by 2016-12-02 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

We are still investigating reports of an issue with App Engine applications having higher than expected socket API errors if they are located in the us-central region. We will provide another status update by 2016-12-02 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

We are still investigating reports of an issue with App Engine applications having higher than expected socket API errors if they are located in the us-central region. We will provide another status update by 2016-12-02 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

We are investigating reports of an issue with App Engine applications having higher than expected socket API errors if they are located in the us-central region. We will provide another status update by 2016-12-02 19:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18023 - Increased 500 error rate for BigQuery API calls

The issue with increased 500 errors from the BigQuery API has been resolved. We apologize for the impact that this incident had on your application.

Last Update: A few months ago

RESOLVED: Incident 18022 - BigQuery Streaming API failing

Small correction to the incident report. The resolution time of the incident was 20:00 US/Pacific, not 20:11 US/Pacific. Similarly, total downtime was 4 hours.

Last Update: A few months ago

RESOLVED: Incident 18022 - BigQuery Streaming API failing

SUMMARY: On Tuesday 8 November 2016, Google BigQuery’s streaming service, which includes streaming inserts and queries against recently committed streaming buffers, was largely unavailable for a period of 4 hours and 11 minutes. To our BigQuery customers whose business analytics were impacted during this outage, we sincerely apologize. We will be providing an SLA credit for the affected timeframe. We have conducted an internal investigation and are taking steps to improve our service.

DETAILED DESCRIPTION OF IMPACT: On Tuesday 8 November 2016 from 16:00 to 20:11 US/Pacific, 73% of BigQuery streaming inserts failed with a 503 error code indicating an internal error had occurred during the insertion. At peak, 93% of BigQuery streaming inserts failed. During the incident, queries performed on tables with recently-streamed data returned a result code (400) indicating that the table was unavailable for querying. Queries against tables in which data were not streamed within the 24 hours preceding the incident were unaffected. There were no issues with non-streaming ingestion of data.

ROOT CAUSE: The BigQuery streaming service requires authorization checks to verify that it is streaming data from an authorized entity to a table that entity has permissions to access. The authorization service relies on a caching layer in order to reduce the number of calls to the authorization backend. At 16:00 US/Pacific, a combination of reduced backend authorization capacity and routine cache entry refreshes caused a surge in requests to the authorization backends, exceeding their current capacity. Because BigQuery does not cache failed authorization attempts, this overload meant that new streaming requests would require re-authorization, thereby further increasing load on the authorization backend. This continual increase of authorization requests on an already overloaded authorization backend resulted in continued and sustained authorization failures which propagated into streaming request and query failures.

REMEDIATION AND PREVENTION: Google engineers were alerted to issues with the streaming service at 16:21 US/Pacific. Their initial hypothesis was that the caching layer for authorization requests was failing. The engineers started redirecting requests to bypass the caching layer at 16:51. After testing the system without the caching layer, the engineers determined that the caching layer was working as designed, and requests were directed to the caching layer again at 18:12. At 18:13, the engineering team was able to pinpoint the failures to a set of overloaded authorization backends and begin remediation. The issue with authorization capacity was ultimately resolved by incrementally reducing load on the authorization system internally and increasing the cache TTL, allowing streaming authorization requests to succeed and populate the cache so that internal services could be restarted. Recovery of streaming errors began at 19:34 US/Pacific and the streaming service was fully restored at 20:11. To prevent short-term recurrence of the issue, the engineering team has greatly increased the request capacity of the authorization backend. In the longer term, the BigQuery engineering team will work on several mitigation strategies to address the currently cascading effect of failed authorization requests. These strategies include caching intermediary responses to the authorization flow for the streaming service, caching failure states for authorization requests, and adding rate limiting to the authorization service so that large increases in cache miss rate will not overwhelm the authorization backend. In addition, the BigQuery engineering team will be improving the monitoring of available capacity on the authorization backend and will add additional alerting so capacity issues can be mitigated before they become cascading failures. The BigQuery engineering team will also be investigating ways to reduce the spike in authorization traffic that occurs daily at 16:00 US/Pacific when the cache is rebuilt, to more evenly distribute requests to the authorization backend.

Finally, we have received feedback that our communications during the outage left a lot to be desired. We agree with this feedback. While our engineering teams launched an all-hands-on-deck effort to resolve this issue within minutes of its detection, we did not adequately communicate both the level of effort and the steady progress of diagnosis, triage and restoration happening during the incident. We clearly erred in not communicating promptly, crisply and transparently to affected customers during this incident. We will be addressing our communications — for all Google Cloud systems, not just BigQuery — as part of a separate effort, which has already been launched. We recognize the extended duration of this outage, and we sincerely apologize to our BigQuery customers for the impact to your business analytics.
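Two of the prevention items above, caching failure states and bounding how hard a cold or failing cache can hit the backend, are generic caching techniques. The sketch below is a hedged, simplified illustration of that idea and is not Google's implementation; the class name, TTL values and error handling are all illustrative.

import time

class AuthCache:
    """Toy authorization cache with a TTL on successes and short-lived negative caching."""

    def __init__(self, ok_ttl: float = 300.0, fail_ttl: float = 30.0):
        self.ok_ttl = ok_ttl      # how long to trust a successful authorization check
        self.fail_ttl = fail_ttl  # how long to remember a failed or errored check
        self._entries = {}        # key -> (allowed: bool, expires_at: float)

    def check(self, key, backend_check) -> bool:
        now = time.monotonic()
        cached = self._entries.get(key)
        if cached and cached[1] > now:
            return cached[0]
        try:
            allowed = bool(backend_check(key))
            self._entries[key] = (allowed, now + self.ok_ttl)
            return allowed
        except Exception:
            # Remember the failure briefly instead of re-querying immediately;
            # retrying every request against an overloaded backend is what
            # turned an overload into a cascade in this incident.
            self._entries[key] = (False, now + self.fail_ttl)
            return False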

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

The issue with the BigQuery Streaming API should have been resolved for all affected tables as of 20:07 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We're continuing to work to restore service to the BigQuery Streaming API. We will add an update at 20:30 US/Pacific with further information.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are continuing to investigate the issue with BigQuery Streaming API. We will add an update at 20:00 US/Pacific with further information.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We have taken steps to mitigate the issue, which has led to some improvements. The issue continues to impact the BigQuery Streaming API and tables with a streaming buffer. We will provide a further status update at 19:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are continuing to investigate the issue with BigQuery Streaming API. The issue may also impact tables with a streaming buffer, making them inaccessible. This will be clarified in the next update at 19:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are still investigating the issue with BigQuery Streaming API. There are no other details to share at this time but we are actively working to resolve this. We will provide another status update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are still investigating the issue with BigQuery Streaming API. Current data indicates that all projects are affected by this issue. We will provide another status update by 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are investigating an issue with the BigQuery Streaming API. We will provide more information by 17:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18021 - BigQuery Streaming API failing

We are investigating an issue with the BigQuery Streaming API. We will provide more information by 5:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16008 - Delete operations on Cloud Platform Console not being performed

The issue with Cloud Platform Console not being able to perform delete operations should have been resolved for all affected users as of 12:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 16003 - Issues with Cloud Pub/Sub

SUMMARY: On Monday, 31 October 2016, 73% of requests to create new subscriptions for Google Cloud Pub/Sub failed for a duration of 124 minutes. Creation of new Cloud SQL Second Generation instances also failed during this incident. If your service or application was affected, we apologize. We have conducted a detailed review of the causes of this incident and are ensuring that we apply the appropriate fixes so that it will not recur.

DETAILED DESCRIPTION OF IMPACT: On Monday, 31 October 2016 from 13:11 to 15:15 PDT, 73% of requests to create new subscriptions for Google Cloud Pub/Sub failed. 0.1% of pull requests experienced latencies of up to 4 minutes for end-to-end message delivery. Creation of all new Cloud SQL Second Generation instances also failed during this incident. Existing instances were not affected.

ROOT CAUSE: At 13:08, a system in the Cloud Pub/Sub control plane experienced a connectivity issue to its persistent storage layer for a duration of 83 seconds. This caused a queue of storage requests to build up. When the storage layer re-connected, the queued requests were executed, which exceeded the available processing quota for the storage system. The system entered a feedback loop in which storage requests continued to queue up, leading to further latency increases and more queued requests. The system was unable to exit from this state until additional capacity was added. Creation of a new Cloud SQL Second Generation instance requires a new Cloud Pub/Sub subscription.

REMEDIATION AND PREVENTION: Our monitoring systems detected the outage and paged oncall engineers at 13:19. We determined the root cause at 14:05 and acquired additional storage capacity for the Pub/Sub control plane at 14:42. The outage ended at 15:15 when this capacity became available. To prevent this issue from recurring, we have already increased the storage capacity for the Cloud Pub/Sub control plane. We will change the retry behavior of the control plane to prevent a feedback loop if storage quota is temporarily exceeded. We will also improve our monitoring to ensure we can determine the root cause for this type of failure more quickly in the future. We apologize for the inconvenience this issue caused our customers.
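The prevention item about changing retry behavior to avoid a feedback loop is a general pattern: when a backend signals that its quota is exhausted, stop sending it retries for a cool-down period instead of queueing more work. The sketch below is a hedged, generic illustration of that pattern, not the Pub/Sub control plane's actual code; the class and exception names are placeholders.

import time

class BackendOverQuotaError(Exception):
    """Placeholder for whatever error the storage backend raises when over quota."""

class OverloadGate:
    """Reject work locally for a cool-down period after the backend reports overload."""

    def __init__(self, cooldown_s: float = 10.0):
        self.cooldown_s = cooldown_s
        self._open_until = 0.0  # while this is in the future, calls are rejected

    def call(self, fn):
        now = time.monotonic()
        if now < self._open_until:
            raise RuntimeError("backend recently overloaded; rejecting locally")
        try:
            return fn()
        except BackendOverQuotaError:
            # Back off instead of letting queued retries pile onto the backend,
            # which is the feedback loop described in the root cause above.
            self._open_until = time.monotonic() + self.cooldown_s
            raise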

Last Update: A few months ago

RESOLVED: Incident 17012 - Issue With Second Generation Cloud SQL Instance Creation

SUMMARY: On Monday, 31 October 2016, 73% of requests to create new subscriptions for Google Cloud Pub/Sub failed for a duration of 124 minutes. Creation of new Cloud SQL Second Generation instances also failed during this incident. If your service or application was affected, we apologize. We have conducted a detailed review of the causes of this incident and are ensuring that we apply the appropriate fixes so that it will not recur.

DETAILED DESCRIPTION OF IMPACT: On Monday, 31 October 2016 from 13:11 to 15:15 PDT, 73% of requests to create new subscriptions for Google Cloud Pub/Sub failed. 0.1% of pull requests experienced latencies of up to 4 minutes for end-to-end message delivery. Creation of all new Cloud SQL Second Generation instances also failed during this incident. Existing instances were not affected.

ROOT CAUSE: At 13:08, a system in the Cloud Pub/Sub control plane experienced a connectivity issue to its persistent storage layer for a duration of 83 seconds. This caused a queue of storage requests to build up. When the storage layer re-connected, the queued requests were executed, which exceeded the available processing quota for the storage system. The system entered a feedback loop in which storage requests continued to queue up, leading to further latency increases and more queued requests. The system was unable to exit from this state until additional capacity was added. Creation of a new Cloud SQL Second Generation instance requires a new Cloud Pub/Sub subscription.

REMEDIATION AND PREVENTION: Our monitoring systems detected the outage and paged oncall engineers at 13:19. We determined the root cause at 14:05 and acquired additional storage capacity for the Pub/Sub control plane at 14:42. The outage ended at 15:15 when this capacity became available. To prevent this issue from recurring, we have already increased the storage capacity for the Cloud Pub/Sub control plane. We will change the retry behavior of the control plane to prevent a feedback loop if storage quota is temporarily exceeded. We will also improve our monitoring to ensure we can determine the root cause for this type of failure more quickly in the future. We apologize for the inconvenience this issue caused our customers.

Last Update: A few months ago

UPDATE: Incident 16008 - Delete operations on Cloud Platform Console not being performed

The issue with the Cloud Platform Console not being able to perform delete operations should have been resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Fri 2016-11-04 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - Delete operations on Cloud Platform Console not being performed

The root cause of the issue with the Cloud Platform Console not being able to perform delete operations has been identified, and we expect a full resolution in the near future. We will provide another status update by Fri 2016-11-04 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - Delete operation on Cloud Platform Console not being performed

We are experiencing an issue with some delete operations within the Cloud Platform Console, beginning at Tuesday, 2016-10-01 10:00 US/Pacific. Current data indicates that all users are affected by this issue. The gcloud command line tool may be used as a workaround for those who need to manage their resources immediately. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 22:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16003 - Issues with Cloud Pub/Sub

The issue with Cloud Pub/Sub should be resolved for all affected projects as of 15:15 PDT. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 17012 - Issue With Second Generation Cloud SQL Instance Creation

The issue with second generation Cloud SQL instance creation should be resolved for all affected projects as of 15:15 PDT. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17012 - Issue With Second Generation Cloud SQL Instance Creation

We are continuing to investigate an issue with second generation Cloud SQL instance creation. We will provide another update at 16:00 PDT.

Last Update: A few months ago

UPDATE: Incident 16003 - Issues with Cloud Pub/Sub

We are continuing to investigate an issue with Cloud Pub/Sub. We will provide an update at 16:00 PDT.

Last Update: A few months ago

UPDATE: Incident 16003 - Issues with Cloud Pub/Sub

We are currently investigating an issue with Cloud Pub/Sub. We will provide an update at 15:00 PDT with more information.

Last Update: A few months ago

UPDATE: Incident 17012 - Issue With Second Generation Cloud SQL Instance Creation

We are currently investigating an issue with second generation Cloud SQL instance creation. We will provide an update with more information at 15:00 PDT.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

The issue with Google Container Engine nodes connecting to the metadata server should have been resolved for all affected clusters as of 10:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

The issue with Google Container Engine nodes connecting to the metadata server has now been resolved for some of the existing clusters, too. We are continuing to repair the rest of the clusters. We will provide the next status update when this repair is complete.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

The issue with Google Container Engine nodes connecting to the metadata server has been fully resolved for all new clusters. The work to repair the existing clusters is still ongoing and is expected to last for a few more hours. It may also result in the restart of the containers and/or virtual machines in the repaired clusters, as per the previous update. If your cluster is affected by this issue and you wish to solve this problem more quickly, you can delete and recreate your GKE cluster. We will provide the next status update when the work to repair the existing clusters has completed.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We have now identified the cause of the issue and are in the process of rolling the fix out to production. This may result in the restart of the affected containers and/or virtual machines in the GKE cluster. We apologize for any inconvenience this might cause.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are continuing to work on resolving the issue with Google Container Engine nodes connecting to the metadata server. We will provide the next status update as soon as the proposed fix for this issue is finalized and validated.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are still working on resolving the issue with Google Container Engine nodes connecting to the metadata server. We are actively testing a fix for it, and once it is validated, we will push this fix into production. We will provide the next status update by 2016-10-24 01:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are still investigating the issue with Google Container Engine nodes connecting to the metadata server. Current data indicates that less than 10% of clusters are still affected by this issue. We are actively testing a fix. Once confirmed, we will push this fix into production. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are still investigating the issue with Google Container Engine nodes connecting to the metadata server. Further investigation reveals that less than 10% of clusters are affected by this issue. We will provide another status update by 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

Customers experiencing this error will see messages containing the following in their logs: "WARNING: exception thrown while executing request java.net.UnknownHostException: metadata" This is caused by a change that inadvertently prevents hosts from properly resolving the DNS address for the metadata server. We have identified the root cause and are preparing a fix. No action is required by customers at this time. The proposed fix should resolve the issue for all customers as soon as it is prepared, tested, and deployed. We will add another update at 21:00 US/Pacific with current details.
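The UnknownHostException above comes from the short hostname "metadata" failing to resolve. As a hedged illustration (not a workaround endorsed in the incident report), the sketch below queries the metadata server from Python using only the standard library and the fully qualified name metadata.google.internal, which is the documented endpoint; the metadata path shown is one real example.

import urllib.request

# Fully qualified metadata hostname; the short name "metadata" is what failed
# to resolve in this incident.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/hostname"

def instance_hostname() -> str:
    """Return this GCE instance's hostname from the metadata server."""
    # The Metadata-Flavor header is required; requests without it are rejected.
    req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    print(instance_hostname())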

Last Update: A few months ago

RESOLVED: Incident 16020 - 502s from HTTP(S) Load Balancer

SUMMARY: On Thursday 13 October 2016, approximately one-third of requests sent to the Google Compute Engine HTTP(S) Load Balancers between 15:07 and 17:25 PDT received an HTTP 502 error rather than the expected response. If your service or application was affected, we apologize. We took immediate action to restore service once the problem was detected, and are taking steps to improve the Google Compute Engine HTTP(S) Load Balancer’s performance and availability. DETAILED DESCRIPTION OF IMPACT: Starting at 15:07 PDT on Thursday 13 October 2016, Google Compute Engine HTTP(S) Load Balancers started to return elevated rates of HTTP 502 (Bad Gateway) responses. The error rate rose progressively from 2% to a peak of 45% of all requests at 16:09 and remained there until 17:03. From 17:03 to 17:15, the error rate declined rapidly from 45% to 2%. By 17:25 requests were routing as expected and the incident was over. During the incident, the error rate seen by applications using GCLB varied depending on the network routing of their requests to Google. ROOT CAUSE: The Google Compute Engine HTTP(S) Load Balancer system is a global, geographically-distributed multi-tiered software stack which receives incoming HTTP(S) requests via many points in Google's global network, and dispatches them to appropriate Google Compute Engine instances. On 13 October 2016, a configuration change was rolled out to one of these layers with widespread distribution beginning at 15:07. This change triggered a software bug which decoupled second-tier load balancers from a number of first-tier load balancers. The affected first-tier load balancers therefore had no forwarding path for incoming requests and returned the HTTP 502 code to indicate this. Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations, and to reduce the scope of the impact in the event that a problem is exposed in production. These safeguards were partially successful in this instance, limiting both the scope and the duration of the event, but not preventing it entirely. The first relevant safeguard is a canary deployment, where the configuration is deployed at a single site and that site is verified to be functioning within normal bounds. In this case, the canary step did generate a warning, but it was not sufficiently precise to cause the on-call engineer to immediately halt the rollout. The new configuration subsequently rolled out in stages, but was halted part way through as further alerts indicated that it was not functioning correctly. By design, this progressive rollout limited the error rate experienced by customers. REMEDIATION AND PREVENTION: Once the nature and scope of the issue became clear, Google engineers first quickly halted and reverted the rollout. This prevented a larger fraction of GCLB instances from being affected. Google engineers then set about restoring function to the GCLB instances which had been exposed to the configuration. They verified that restarting affected GCLB instances restored the pre-rollout configuration, and then rapidly restarted all affected GCLB instances, ending the event. Google understands that global load balancers are extremely useful, but also may be a single point of failure for your service. We are committed to applying the lessons from this outage in order to ensure that this type of incident does not recur. 
One of our guiding principles for avoiding large-scale incidents is to roll out global changes slowly and carefully monitor for errors. We typically have a period of soak time during a canary release before rolling out more widely. In this case, the change was pushed too quickly for accurate detection of the class of failure uncovered by the configuration being rolled out. We will change our processes to be more conservative when rolling out configuration changes to critical systems. As defense in depth, Google engineers are also changing the black box monitoring for GCLB so that it will test the first-tier load balancers impacted by this incident. We will also be improving the black box monitoring to ensure that our probers cover all use cases. In addition, we will add an alert for elevated error rates between first-tier and second-tier load balancers. We apologize again for the impact this issue caused our customers.
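As an aside on the canary pattern described above: the general shape of such a gate is to measure the canary site's error rate and refuse to continue the rollout past a threshold. The sketch below is a generic illustration; the threshold, sample data, and class names are invented and do not reflect Google's internal tooling.

    import java.util.Arrays;
    import java.util.List;

    // Sketch of a canary gate: halt a rollout when the canary site's error rate
    // exceeds a threshold. The 1% threshold and sampled data are assumptions
    // made for illustration; they are not taken from the incident report.
    public class CanaryGate {
        static final double MAX_ERROR_RATE = 0.01;

        static boolean canaryHealthy(List<Boolean> requestSucceeded) {
            long failures = requestSucceeded.stream().filter(ok -> !ok).count();
            double errorRate = (double) failures / requestSucceeded.size();
            System.out.printf("canary error rate: %.1f%%%n", errorRate * 100);
            return errorRate <= MAX_ERROR_RATE;
        }

        public static void main(String[] args) {
            // 2 failures in 8 sampled requests -> 25%, so the gate halts the rollout.
            List<Boolean> sample =
                Arrays.asList(true, true, false, true, true, false, true, true);
            if (canaryHealthy(sample)) {
                System.out.println("Canary healthy: continuing staged rollout.");
            } else {
                System.out.println("Canary unhealthy: halting rollout and reverting.");
            }
        }
    }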

Last Update: A few months ago

UPDATE: Incident 18020 - BigQuery query failures

The issue with BigQuery failing queries should have been resolved for all affected users as of 8:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18020 - BigQuery query failures

The issue with Google BigQuery API calls returning 500 Internal Errors should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 09:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18020 - BigQuery query failures

The issue with Google BigQuery API calls returning 500 Internal Errors should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 07:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18020 - BigQuery query failures

We are investigating an issue with Google BigQuery. We will provide more information by 6:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16020 - 502s from HTTP(s) Load Balancer

The issue with Google Cloud Platform HTTP(s) Load Balancer returning 502 response code should have been resolved for all affected customers as of 17:25 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16020 - 502s from HTTP(s) Load Balancer

We are still investigating the issue with Google Cloud Platform HTTP(S) Load Balancers returning 502 errors, and will provide an update by 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16020 - 502s from HTTP(s) Load Balancer

We are still investigating the issue with Google Cloud Platform HTTP(S) Load Balancers returning 502 errors, and will provide an update by 17:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16020 - 502s from HTTP(s) Load Balancer

We are experiencing an issue with Google Cloud Platform HTTP(s) Load Balancer returning 502 response codes, starting at 2016-10-13 15:30 US/Pacific. We are investigating the issue, and will provide an update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16019 - Hurricane Matthew may impact GCP services in us-east1

The Google Cloud Platform team is keeping a close watch on the path of Hurricane Matthew. The National Hurricane Center 3-day forecast indicates that the storm is tracking within 200 miles of the datacenters housing the GCP region us-east1. We do not anticipate any specific service interruptions. Our datacenter is designed to withstand a direct hit from a more powerful hurricane than Matthew without disruption, and we maintain triple-redundant diverse-path backbone networking precisely to be resilient to extreme events. We have staff on site and plan to run services normally. Despite all of the above, it is statistically true that there is an increased risk of a region-level utility grid or other infrastructure disruption which may result in a GCP service interruption. If we anticipate a service interruption (for example, if the regional grid loses power and our datacenter is operating on generator), our protocol is to share specific updates with our customers with 12 hours' notice.

Last Update: A few months ago

UPDATE: Incident 16034 - Elevated Cloud Storage error rate and latency

The issue with Cloud Storage causing elevated errors and latency for some projects should have been resolved for all affected projects as of 23:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16034 - Elevated Cloud Storage error rate and latency

We are investigating an issue with Cloud Storage. We will provide more information by 23:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We have restored most of the missing Google Cloud Pub/Sub subscriptions for affected projects. We expect to restore the remaining missing subscriptions within one hour. We have already identified and fixed the root cause of the issue. We will conduct an internal investigation and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are working on restoring the missing Cloud Pub/Sub subscriptions for affected customers, and will provide an ETA for complete restoration when available.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 10:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 08:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 06:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 02:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. In the meantime, affected users can re-create missing subscriptions manually in order to make them available. We will provide another status update by Wednesday, 2016-09-28 00:00 US/Pacific with current details.
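For users applying the manual workaround, re-creating a subscription looks roughly like the sketch below. It is shown with the current google-cloud-pubsub Java client; the project, topic, and subscription names are placeholders, and the client surface available in 2016 may have differed.

    import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
    import com.google.pubsub.v1.ProjectSubscriptionName;
    import com.google.pubsub.v1.PushConfig;
    import com.google.pubsub.v1.TopicName;

    // Minimal sketch: manually re-create a missing pull subscription.
    // "my-project", "my-topic", and "my-subscription" are placeholders.
    public class RecreateSubscription {
        public static void main(String[] args) throws Exception {
            ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "my-subscription");
            TopicName topic = TopicName.of("my-project", "my-topic");
            try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
                // PushConfig.getDefaultInstance() yields a pull subscription;
                // 10 is the ack deadline in seconds.
                client.createSubscription(subscription, topic,
                    PushConfig.getDefaultInstance(), 10);
                System.out.println("Re-created subscription: " + subscription);
            }
        }
    }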

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We experienced an issue with Cloud Pub/Sub in which some subscriptions were deleted unexpectedly between approximately 13:40 and 18:45 US/Pacific on Tuesday, 2016-09-27. We are going to recreate the deleted subscriptions. We will provide another status update by 22:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

The issue with slow Compute Engine operations in asia-east1-a has been resolved as of 13:37 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

We are still working on the issue with Compute Engine operations. After mitigation was applied, operations in asia-east1-a have continued to run normally. A final fix is still underway. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

We are still investigating the issue with Compute Engine operations. We have applied mitigation and currently operations in asia-east1-a are processing normally. We are applying some final fixes and monitoring the issue. We will provide another status update by 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

It is taking multiple minutes to create new VMs, restart existing VMs that terminate abnormally, or hot-attach disks.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

This incident only covers instances in the asia-east1-a zone.

Last Update: A few months ago

RESOLVED: Incident 16033 - Google Cloud Storage serving high error rates.

We have completed our internal investigation and results suggest this incident impacted a very small number of projects. We have reached out to affected users directly and if you have not heard from us, your project(s) were not impacted.

Last Update: A few months ago

RESOLVED: Incident 16033 - Google Cloud Storage serving high error rates.

The issue with Google Cloud Storage serving a high percentage of errors should have been resolved for all affected users as of 13:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 16009 - Networking issue with Google App Engine services

SUMMARY: On Monday 22 August 2016, the Google Cloud US-CENTRAL1-F zone lost network connectivity to services outside that zone for a duration of 25 minutes. All other zones in the US-CENTRAL1 region were unaffected. All network traffic within the zone was also unaffected. We apologize to customers whose service or application was affected by this incident. We understand that a network disruption has a negative impact on your application - particularly if it is homed in a single zone - and we apologize for the inconvenience this caused. What follows is our detailed analysis of the root cause and actions we will take in order to prevent this type of incident from recurring. DETAILED DESCRIPTION OF IMPACT: We have received feedback from customers asking us to specifically and separately enumerate the impact of incidents to any service that may have been touched. We agree that this will make it easier to reason about the impact of any particular event and we have done so in the following descriptions. On Monday 22 August 2016 from 07:05 to 07:30 PDT the Google Cloud US-CENTRAL1-F zone lost network connectivity to services outside that zone. App Engine 6% of App Engine Standard Environment applications in the US-CENTRAL region served elevated error rates for up to 8 minutes, until the App Engine serving infrastructure automatically redirected traffic to a failover zone. The aggregate error rate across all impacted applications during the incident period was 3%. The traffic redirection caused a Memcache flush for affected applications, and also loading requests as new instances of the applications started up in the failover zones. All App Engine Flexible Environment applications deployed to the US-CENTRAL1-F zone were unavailable for the duration of the incident. Additionally, 4.5% of these applications experienced various levels of unavailability for up to an additional 5 hours while the system recovered. Deployments for US-CENTRAL Flexible applications were delayed during the incident. Our engineers disabled the US-CENTRAL1-F zone for new deployments during the incident, so that any customers who elected to redeploy, immediately recovered. Cloud Console The Cloud Console was available during the incident, though some App Engine administrative pages did not load for applications in US-CENTRAL and 50% of project creation requests failed to complete and needed to be retried by customers before succeeding. Cloud Dataflow Some Dataflow running jobs in the US-CENTRAL1 region experienced delays in processing. Although most of the affected jobs recovered gracefully after the incident ended, up to 2.5% of affected jobs in this zone became stuck and required manual termination by customers. New jobs created during the incident were not impacted. Cloud SQL Cloud SQL First Generation instances were not impacted by this incident. 30% of Cloud SQL Second Generation instances in US-CENTRAL1 were unavailable for up to 5 minutes, after which they became available again. An additional 15% of Second Generation instances were unavailable for 22 minutes. Compute Engine All instances in the US-CENTRAL1-F zone were inaccessible from outside the zone for the duration of the incident. 9% of them remained inaccessible from outside the zone for an additional hour. Container Engine Container Engine clusters running in US-CENTRAL1-F were inaccessible from outside of the zone during the incident although they continued to serve. 
In addition, calls to the Container Engine API experienced a 4% error rate and elevated latency during the incident, though this was substantially mitigated if the client retried the request. Stackdriver Logging 20% of log API requests sent to Stackdriver Logging in the US-CENTRAL1 region failed during the incident, though App Engine logging was not impacted. Clients retrying requests recovered gracefully. Stackdriver Monitoring Requests to the StackDriver web interface and the Google Monitoring API v2beta2 and v3 experienced elevated latency and an error rate of up to 3.5% during the incident. In addition, some alerts were delayed. Impact for API calls was substantially mitigated if the client retried the request. ROOT CAUSE: On 18 July, Google carried out a planned maintenance event to inspect and test the UPS on a power feed in one zone in the US-CENTRAL1 region. That maintenance disrupted one of the two power feeds to network devices that control routes into and out of the US-CENTRAL1-F zone. Although this did not cause any disruption in service, these devices unexpectedly and silently disabled the affected power supply modules - a previously unseen behavior. Because our monitoring systems did not notify our network engineers of this problem the power supply modules were not re-enabled after the maintenance event. The service disruption was triggered on Monday 22 August, when our engineers carried out another planned maintenance event that removed power to the second power feed of these devices, causing them to disable the other power supply module as well, and thus completely shut down. Following our standard procedure when carrying out maintenance events, we made a detailed line walk of all critical equipment prior to, and after, making any changes. However, in this case we did not detect the disabled power supply modules. Loss of these network devices meant that machines in US-CENTRAL1-F did not have routes into and out of the zone but could still communicate to other machines within the same zone. REMEDIATION AND PREVENTION: Our network engineers received an alert at 07:14, nine minutes after the incident started. We restored power to the devices at 07:30. The network returned to service without further intervention after power was restored. As immediate followup to this incident, we have already carried out an audit of all other network devices of this type in our fleet to verify that there are none with disabled power supply modules. We have also written up a detailed post mortem of this incident and will take the following actions to prevent future outages of this type: Our monitoring will be enhanced to detect cases in which power supply modules are disabled. This will ensure that conditions that are missed by the manual line walk prior to maintenance events are picked up by automated monitoring. We will change the configuration of these network devices so that power disruptions do not cause them to disable their power supply modules. The interaction between the network control plane and the data plane should be such that the data plane should "fail open" and continue to route packets in the event of control plane failures. We will add support for networking protocols that have the capability to continue to route traffic for a short period in the event of failures in control plane components. 
We will also be taking various actions to improve the resilience of the affected services to single-zone outages, including the following: App Engine Although App Engine Flexible Environment is currently in Beta, we expect production services to be more resilient to single zone disruptions. We will make this extra resilience an exit criteria before we allow the service to reach General Availability. Cloud Dataflow We will improve resilience of Dataflow to single-zone outages by implementing better strategies for migrating the job controller to a new zone in the event of an outage. Work on this remediation is already underway. Stackdriver Logging We will make improvements to the Stackdriver Logging service (currently in Beta) in the areas of automatic failover and capacity management before this service goes to General Availability. This will ensure that it is resilient to single-zone outages. Stackdriver Monitoring The Google Monitoring API (currently in beta) is already hosted in more than one zone, but we will further improve its resilience by adding additional capacity to ensure a single-zone outage does not cause overload in any other zones. We will do this before this service exits to General Availability. Finally, we know that you depend on Google Cloud Platform for your production workloads and we apologize for the inconvenience this event caused.

Last Update: A few months ago

RESOLVED: Incident 16017 - Networking issue with Google Compute Engine services

SUMMARY: On Monday 22 August 2016, the Google Cloud US-CENTRAL1-F zone lost network connectivity to services outside that zone for a duration of 25 minutes. All other zones in the US-CENTRAL1 region were unaffected. All network traffic within the zone was also unaffected. We apologize to customers whose service or application was affected by this incident. We understand that a network disruption has a negative impact on your application - particularly if it is homed in a single zone - and we apologize for the inconvenience this caused. What follows is our detailed analysis of the root cause and actions we will take in order to prevent this type of incident from recurring. DETAILED DESCRIPTION OF IMPACT: We have received feedback from customers asking us to specifically and separately enumerate the impact of incidents to any service that may have been touched. We agree that this will make it easier to reason about the impact of any particular event and we have done so in the following descriptions. On Monday 22 August 2016 from 07:05 to 07:30 PDT the Google Cloud US-CENTRAL1-F zone lost network connectivity to services outside that zone. App Engine 6% of App Engine Standard Environment applications in the US-CENTRAL region served elevated error rates for up to 8 minutes, until the App Engine serving infrastructure automatically redirected traffic to a failover zone. The aggregate error rate across all impacted applications during the incident period was 3%. The traffic redirection caused a Memcache flush for affected applications, and also loading requests as new instances of the applications started up in the failover zones. All App Engine Flexible Environment applications deployed to the US-CENTRAL1-F zone were unavailable for the duration of the incident. Additionally, 4.5% of these applications experienced various levels of unavailability for up to an additional 5 hours while the system recovered. Deployments for US-CENTRAL Flexible applications were delayed during the incident. Our engineers disabled the US-CENTRAL1-F zone for new deployments during the incident, so that any customers who elected to redeploy, immediately recovered. Cloud Console The Cloud Console was available during the incident, though some App Engine administrative pages did not load for applications in US-CENTRAL and 50% of project creation requests failed to complete and needed to be retried by customers before succeeding. Cloud Dataflow Some Dataflow running jobs in the US-CENTRAL1 region experienced delays in processing. Although most of the affected jobs recovered gracefully after the incident ended, up to 2.5% of affected jobs in this zone became stuck and required manual termination by customers. New jobs created during the incident were not impacted. Cloud SQL Cloud SQL First Generation instances were not impacted by this incident. 30% of Cloud SQL Second Generation instances in US-CENTRAL1 were unavailable for up to 5 minutes, after which they became available again. An additional 15% of Second Generation instances were unavailable for 22 minutes. Compute Engine All instances in the US-CENTRAL1-F zone were inaccessible from outside the zone for the duration of the incident. 9% of them remained inaccessible from outside the zone for an additional hour. Container Engine Container Engine clusters running in US-CENTRAL1-F were inaccessible from outside of the zone during the incident although they continued to serve. 
In addition, calls to the Container Engine API experienced a 4% error rate and elevated latency during the incident, though this was substantially mitigated if the client retried the request. Stackdriver Logging 20% of log API requests sent to Stackdriver Logging in the US-CENTRAL1 region failed during the incident, though App Engine logging was not impacted. Clients retrying requests recovered gracefully. Stackdriver Monitoring Requests to the StackDriver web interface and the Google Monitoring API v2beta2 and v3 experienced elevated latency and an error rate of up to 3.5% during the incident. In addition, some alerts were delayed. Impact for API calls was substantially mitigated if the client retried the request. ROOT CAUSE: On 18 July, Google carried out a planned maintenance event to inspect and test the UPS on a power feed in one zone in the US-CENTRAL1 region. That maintenance disrupted one of the two power feeds to network devices that control routes into and out of the US-CENTRAL1-F zone. Although this did not cause any disruption in service, these devices unexpectedly and silently disabled the affected power supply modules - a previously unseen behavior. Because our monitoring systems did not notify our network engineers of this problem the power supply modules were not re-enabled after the maintenance event. The service disruption was triggered on Monday 22 August, when our engineers carried out another planned maintenance event that removed power to the second power feed of these devices, causing them to disable the other power supply module as well, and thus completely shut down. Following our standard procedure when carrying out maintenance events, we made a detailed line walk of all critical equipment prior to, and after, making any changes. However, in this case we did not detect the disabled power supply modules. Loss of these network devices meant that machines in US-CENTRAL1-F did not have routes into and out of the zone but could still communicate to other machines within the same zone. REMEDIATION AND PREVENTION: Our network engineers received an alert at 07:14, nine minutes after the incident started. We restored power to the devices at 07:30. The network returned to service without further intervention after power was restored. As immediate followup to this incident, we have already carried out an audit of all other network devices of this type in our fleet to verify that there are none with disabled power supply modules. We have also written up a detailed post mortem of this incident and will take the following actions to prevent future outages of this type: Our monitoring will be enhanced to detect cases in which power supply modules are disabled. This will ensure that conditions that are missed by the manual line walk prior to maintenance events are picked up by automated monitoring. We will change the configuration of these network devices so that power disruptions do not cause them to disable their power supply modules. The interaction between the network control plane and the data plane should be such that the data plane should "fail open" and continue to route packets in the event of control plane failures. We will add support for networking protocols that have the capability to continue to route traffic for a short period in the event of failures in control plane components. 
We will also be taking various actions to improve the resilience of the affected services to single-zone outages, including the following: App Engine Although App Engine Flexible Environment is currently in Beta, we expect production services to be more resilient to single zone disruptions. We will make this extra resilience an exit criteria before we allow the service to reach General Availability. Cloud Dataflow We will improve resilience of Dataflow to single-zone outages by implementing better strategies for migrating the job controller to a new zone in the event of an outage. Work on this remediation is already underway. Stackdriver Logging We will make improvements to the Stackdriver Logging service (currently in Beta) in the areas of automatic failover and capacity management before this service goes to General Availability. This will ensure that it is resilient to single-zone outages. Stackdriver Monitoring The Google Monitoring API (currently in beta) is already hosted in more than one zone, but we will further improve its resilience by adding additional capacity to ensure a single-zone outage does not cause overload in any other zones. We will do this before this service exits to General Availability. Finally, we know that you depend on Google Cloud Platform for your production workloads and we apologize for the inconvenience this event caused.

Last Update: A few months ago

RESOLVED: Incident 17011 - Cloud SQL 2nd generation failing to create new instances

The issue with creating new instances on Cloud SQL Second Generation should have been resolved for all affected projects as of 17:38 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17011 - Cloud SQL 2nd generation failing to create new instances

We are investigating an issue with creating new instances on Cloud SQL Second Generation. We will provide more information by 18:50 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16008 - App Engine Outage

SUMMARY: On Thursday 11 August 2016, 21% of Google App Engine applications hosted in the US-CENTRAL region experienced error rates in excess of 10% and elevated latency between 13:13 and 15:00 PDT. An additional 16% of applications hosted on the same GAE instance observed lower rates of errors and latency during the same period. We apologize for this incident. We know that you choose to run your applications on Google App Engine to obtain flexible, reliable, high-performance service, and in this incident we have not delivered the level of reliability for which we strive. Our engineers have been working hard to analyze what went wrong and ensure incidents of this type will not recur. DETAILED DESCRIPTION OF IMPACT: On Thursday 11 August 2016 from 13:13 to 15:00 PDT, 18% of applications hosted in the US-CENTRAL region experienced error rates between 10% and 50%, and 3% of applications experienced error rates in excess of 50%. Additionally, 14% experienced error rates between 1% and 10%, and 2% experienced error rate below 1% but above baseline levels. In addition, the 37% of applications which experienced elevated error rates also observed a median latency increase of just under 0.8 seconds per request. The remaining 63% of applications hosted on the same GAE instance, and applications hosted on other GAE instances, did not observe elevated error rates or increased latency. Both App Engine Standard and Flexible Environment applications in US-CENTRAL were affected by this incident. In addition, some Flexible Environment applications were unable to deploy new versions during this incident. App Engine applications in US-EAST1 and EUROPE-WEST were not impacted by this incident. ROOT CAUSE: The incident was triggered by a periodic maintenance procedure in which Google engineers move App Engine applications between datacenters in US-CENTRAL in order to balance traffic more evenly. As part of this procedure, we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources. The applications running on the drained servers are automatically rescheduled onto different servers. During this procedure, a software update on the traffic routers was also in progress, and this update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity. The server drain resulted in rescheduling of multiple instances of manually-scaled applications. App Engine creates new instances of manually-scaled applications by sending a startup request via the traffic routers to the server hosting the new instance. Some manually-scaled instances started up slowly, resulting in the App Engine system retrying the start requests multiple times which caused a spike in CPU load on the traffic routers. The overloaded traffic routers dropped some incoming requests. Although there was sufficient capacity in the system to handle the load, the traffic routers did not immediately recover due to retry behavior which amplified the volume of requests. REMEDIATION AND PREVENTION: Google engineers were monitoring the system during the datacenter changes and immediately noticed the problem. 
Although we rolled back the change that drained the servers within 11 minutes, this did not sufficiently mitigate the issue because retry requests had generated enough additional traffic to keep the system's total load at a substantially higher-than-normal level. As designed, App Engine automatically redirected requests away from the overload to other datacenters, which reduced the error rate. Additionally, our engineers manually redirected all traffic at 13:56 to other datacenters, which further mitigated the issue. We then identified a configuration error that caused an imbalance of traffic in the new datacenters; fixing this at 15:00 fully resolved the incident. In order to prevent a recurrence of this type of incident, we have added more traffic routing capacity in order to create more capacity buffer when draining servers in this region. We will also change how applications are rescheduled so that the traffic routers are not called, and modify the system's retry behavior so that it cannot trigger this type of failure. We know that you rely on our infrastructure to run your important workloads and that this incident does not meet our bar for reliability. For that we apologize. Your trust is important to us and we will continue to do all we can to earn and keep that trust.
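As an illustration of the retry-amplification point above: a common client-side discipline is to cap retries and back off exponentially with jitter, so that retried requests cannot multiply load without bound. The sketch below is generic; the attempt limit and base delay are invented values, and it is not a description of App Engine's internal retry logic.

    import java.util.Random;
    import java.util.concurrent.Callable;

    // Generic sketch of capped, jittered exponential backoff. The limits
    // (4 attempts, 500 ms base delay) are illustrative values only.
    public class BoundedRetry {
        static final int MAX_ATTEMPTS = 4;
        static final long BASE_DELAY_MS = 500;
        static final Random RANDOM = new Random();

        static <T> T callWithRetry(Callable<T> call) throws Exception {
            Exception last = null;
            for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
                try {
                    return call.call();
                } catch (Exception e) {
                    last = e;
                    // Full jitter: sleep a random time up to base * 2^attempt.
                    long cap = BASE_DELAY_MS * (1L << attempt);
                    long sleep = (long) (RANDOM.nextDouble() * cap);
                    System.out.println("attempt " + (attempt + 1) + " failed, sleeping " + sleep + " ms");
                    Thread.sleep(sleep);
                }
            }
            // Give up instead of retrying forever and amplifying load.
            throw last;
        }

        public static void main(String[] args) throws Exception {
            String result = callWithRetry(() -> "ok"); // replace with a real call
            System.out.println(result);
        }
    }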

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The issue with Compute Engine network connectivity should have been resolved for nearly all instances. For the few remaining instances, we are working directly with the affected customers. No further updates will be posted, but we will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will also provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The issue with Compute Engine network connectivity should have been resolved for affected instances in us-central1-a, -b, and -c as of 08:00 US/Pacific. Less than 4% of instances in us-central1-f are currently affected. We will provide another status update by 12:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The issue with Compute Engine network connectivity should have been resolved for affected instances in us-central1-a, -b, and -c as of 08:00 US/Pacific. Less than 4% of instances in us-central1-f are currently affected and we expect a full resolution soon. We will provide another status update by 11:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The work on the remaining instances with network connectivity issues, located in us-central1-f, is still ongoing. We will provide another status update by 11:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The work on the remaining instances with network connectivity issues, located in us-central1-f, is still ongoing. We will provide another status update by 10:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The work on the remaining instances with network connectivity issues is still ongoing. Affected instances are located in us-central1-f. We will provide another status update by 10:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The work on the remaining instances with network connectivity issues is still ongoing. Affected instances are located in us-central1-f. We will provide another status update by 09:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

We are still investigating network connectivity issues for a subset of instances that have not automatically recovered. We will provide another status update by 09:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The issue with network connectivity to Google Compute Engine services should have been resolved for the majority of instances and we expect a full resolution in the near future. We will provide another status update by 08:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16009 - Networking issue with Google App Engine services

The issue with network connectivity to Google App Engine applications should have been resolved for all affected users as of 07:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

We are investigating an issue with network connectivity. We will provide more information by 08:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16009 - Networking issue with Google App Engine services

We are investigating an issue with network connectivity. We will provide more information by 08:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16016 - Networking issue with Google Compute Engine services

We are investigating an issue with network connectivity. We will provide more information by 08:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

The issue with App Engine APIs being unavailable should have been resolved for nearly all affected projects as of 14:12 US/Pacific. We will follow up directly with the few remaining affected applications. We will also conduct a thorough internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. Finally, we will provide a more detailed analysis of this incident once we have completed this internal investigation.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

We are still investigating the issue with App Engine APIs being unavailable. Current data indicates that some projects are affected by this issue. We will provide another status update by 15:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

The issue with App Engine APIs being unavailable should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 15:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

We are experiencing an issue with App Engine APIs being unavailable beginning at Thursday, 2016-08-11 13:45 US/Pacific. Current data indicates that applications in us-central are affected by this issue.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

We are investigating reports of an issue with App Engine. We will provide more information by 02:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16015 - Networking issue with Google Compute Engine services

SUMMARY: On Friday 5 August 2016, some Google Cloud Platform customers experienced increased network latency and packet loss to Google Compute Engine (GCE), Cloud VPN, Cloud Router and Cloud SQL, for a duration of 99 minutes. If you were affected by this issue, we apologize. We intend to provide a higher level of reliability than this, and we are working to learn from this issue to make that a reality. DETAILED DESCRIPTION OF IMPACT: On Friday 5 August 2016 from 00:55 to 02:34 PDT, a number of services were disrupted: Some Google Compute Engine TCP and UDP traffic had elevated latency. Most ICMP, ESP, AH and SCTP traffic inbound from outside the Google network was silently dropped, resulting in existing connections being dropped and new connections timing out on connect. Most Google Cloud SQL first generation connections from sources external to Google failed with a connection timeout. Cloud SQL second generation connections may have seen higher latency but not failure. Google Cloud VPN tunnels remained connected; however, there was complete packet loss for data through the majority of tunnels. As Cloud Router BGP sessions traverse Cloud VPN, all sessions were dropped. All other traffic was unaffected, including internal connections between Google services and services provided via HTTP APIs. ROOT CAUSE: While removing a faulty router from service, a new procedure for diverting traffic from the router was used. This procedure applied a new configuration that resulted in announcing some Google Cloud Platform IP addresses from a single point of presence in the southwestern US. As these announcements were highly specific, they took precedence over the normal routes to Google's network and caused a substantial proportion of traffic for the affected network ranges to be directed to this one point of presence. This misrouting directly caused the additional latency some customers experienced. Additionally, this misconfiguration sent affected traffic to next-generation infrastructure that was undergoing testing. This new infrastructure was not yet configured to handle Cloud Platform traffic and applied an overly-restrictive packet filter. This blocked traffic on the affected IP addresses that was routed through the affected point of presence to Cloud VPN, Cloud Router, Cloud SQL first generation and GCE on protocols other than TCP and UDP. REMEDIATION AND PREVENTION: Mitigation began at 02:04 PDT when Google engineers reverted the network infrastructure change that caused this issue, and all traffic routing was back to normal by 02:34. The system involved was made safe against recurrences by fixing the erroneous configuration. This includes changes to BGP filtering to prevent this class of incorrect announcements. We are implementing additional integration tests for our routing policies to ensure configuration changes behave as expected before being deployed to production. Furthermore, we are improving our production telemetry external to the Google network to better detect peering issues that slip past our tests. We apologize again for the impact this issue has had on our customers.
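For readers unfamiliar with why "highly specific" announcements win: route selection prefers the longest matching prefix, so a /24 announcement overrides a covering /16. The sketch below illustrates only that selection rule, with made-up prefixes and next-hop labels; it is not a model of Google's routing configuration.

    // Illustration of longest-prefix-match route selection with invented prefixes:
    // a more specific announcement (/24) wins over a broader one (/16), which is
    // why the highly specific announcements in this incident attracted traffic.
    public class LongestPrefixMatch {
        record Route(String prefix, int prefixLength, String nextHop) {}

        static long toBits(String ip) {
            long bits = 0;
            for (String part : ip.split("\\.")) {
                bits = (bits << 8) | Integer.parseInt(part);
            }
            return bits;
        }

        static boolean contains(Route route, String ip) {
            long mask = route.prefixLength() == 0
                ? 0
                : ~((1L << (32 - route.prefixLength())) - 1) & 0xFFFFFFFFL;
            return (toBits(route.prefix()) & mask) == (toBits(ip) & mask);
        }

        public static void main(String[] args) {
            // Hypothetical prefixes: a broad /16 (the "normal" path) and a highly
            // specific /24 announced from a single point of presence.
            Route broad = new Route("203.0.0.0", 16, "normal path");
            Route specific = new Route("203.0.113.0", 24, "single point of presence");
            String destination = "203.0.113.42";

            Route chosen = null;
            for (Route route : new Route[] {broad, specific}) {
                if (contains(route, destination)
                    && (chosen == null || route.prefixLength() > chosen.prefixLength())) {
                    chosen = route;
                }
            }
            // Prints the /24's next hop: the more specific announcement wins.
            System.out.println("Traffic to " + destination + " follows: " + chosen.nextHop());
        }
    }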

Last Update: A few months ago

RESOLVED: Incident 16015 - Networking issue with Google Compute Engine services

The issue with Google Cloud networking should have been resolved for all affected users as of 02:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16015 - Networking issue with Google Compute Engine services

We are still investigating the issue with Google Compute Engine networking. Current data also indicates impact on other GCP products including Cloud SQL. We will provide another status update by 03:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16015 - Networking issue with Google Compute Engine services

We are investigating a networking issue with Google Compute Engine. We will provide more information by 02:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18019 - BigQuery connection issues

We are experiencing an intermittent issue with BigQuery connections beginning at Thursday, 2016-08-04 13:49 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 16:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18002 - Execution of BigQuery jobs is delayed, jobs are backing up in Pending state

We are experiencing an intermittent issue with BigQuery connections beginning at Thursday, 2016-08-04 13:49 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 16:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 16014 - HTTP(S) Load Balancing returning some 502 errors

We are still investigating the issue with HTTP(S) Load Balancing returning 502 errors. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16014 - HTTP(S) Load Balancing returning some 502 errors

The issue with HTTP(S) Load Balancing returning a small number of 502 errors should have been resolved for all affected instances as of 11:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16014 - HTTP(S) Load Balancing returning some 502 errors

We are experiencing an issue with HTTP(S) Load Balancing returning a small number of 502 errors, beginning at Friday, 2016-07-29 around 08:45 US/Pacific. The maximum error rate for affected users was below 2%. Remediation has been applied that should stop these errors; we are monitoring the situation. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 11:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16013 - HTTP(S) Load Balancing 502 Errors

We are investigating an issue with 502 errors from HTTP(S) Load Balancing. We will provide more information by 11:05 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18018 - Streaming API issues with BigQuery

SUMMARY: On Monday 25 July 2016, the Google BigQuery Streaming API experienced elevated error rates for a duration of 71 minutes. We apologize if your service or application was affected by this and we are taking immediate steps to improve the platform’s performance and availability. DETAILED DESCRIPTION OF IMPACT: On Monday 25 July 2016 between 17:03 and 18:14 PDT, the BigQuery Streaming API returned HTTP 500 or 503 errors for 35% of streaming insert requests, with a peak error rate of 49% at 17:40. Customers who retried on error were able to mitigate the impact. Calls to the BigQuery jobs API showed an error rate of 3% during the incident but could generally be executed reliably with normal retry behaviour. Other BigQuery API calls were not affected. ROOT CAUSE: An internal Google service sent an unexpectedly high amount of traffic to the BigQuery Streaming API service. The internal service used a different entry point that was not subject to quota limits. Google's internal load balancers drop requests that exceed the capacity limits of a service. In this case, the capacity limit for the Streaming API service had been configured higher than its true capacity. As a result, the internal Google service was able to send too many requests to the Streaming API, causing it to fail for a percentage of responses. The Streaming API service sends requests to BigQuery's Metadata service in order to handle incoming Streaming requests. This elevated volume of requests exceeded the capacity of the Metadata service which resulted in errors for BigQuery jobs API calls. REMEDIATION AND PREVENTION: The incident started at 17:03. Our monitoring detected the issue at 17:20 as error rates started to increase. Our engineers blocked traffic from the internal Google client causing the overload shortly thereafter which immediately started to mitigate the impact of the incident. Error rates dropped to normal by 18:14. In order to prevent a recurrence of this type of incident we will enforce quotas for internal Google clients on requests to the Streaming service in order to prevent a single client sending too much traffic. We will also set the correct capacity limits for the Streaming API service based on improved load tests in order to ensure that internal clients cannot exceed the service's capacity. We apologize again to customers impacted by this incident.
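As noted above, clients that retried streaming inserts on error were able to mitigate the impact. A minimal retry sketch with the google-cloud-bigquery Java client is shown below; the dataset and table names, the attempt limit, and the backoff are placeholders, and the client library available at the time may have exposed a different surface.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryException;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.Map;

    // Sketch: retry a streaming insert a few times when the service returns errors.
    // "my_dataset" / "my_table" and the 3-attempt limit are placeholders.
    public class StreamingInsertWithRetry {
        public static void main(String[] args) throws Exception {
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
            InsertAllRequest request =
                InsertAllRequest.newBuilder(TableId.of("my_dataset", "my_table"))
                    .addRow(Map.of("name", "example", "value", 42))
                    .build();

            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    InsertAllResponse response = bigquery.insertAll(request);
                    if (!response.hasErrors()) {
                        System.out.println("Insert succeeded on attempt " + attempt);
                        return;
                    }
                    // Row-level errors are reported per row index.
                    System.out.println("Row errors: " + response.getInsertErrors());
                } catch (BigQueryException e) {
                    // HTTP-level failures (e.g. 500/503) surface as exceptions.
                    System.out.println("Request failed: " + e.getMessage());
                }
                Thread.sleep(1000L * attempt); // simple linear backoff for the sketch
            }
            System.out.println("Insert still failing after retries; giving up.");
        }
    }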

Last Update: A few months ago

RESOLVED: Incident 18018 - Streaming API issues with BigQuery

The issue with the BigQuery streaming API returning 500/503 responses has been resolved for all affected customers as of 18:11 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 16007 - Intermittent Google App Engine URLFetch API deadline exceeded errors.

The issue with Google App Engine URLFetch API service should have been resolved for all affected applications as of 02:50 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16007 - Intermittent Google App Engine URLFetch API deadline exceeded errors.

We are still investigating an intermittent issue with Google App Engine URLFetch API calls to non-Google services failing with deadline exceeded errors. We will provide another status update by 03:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16007 - Intermittent Google App Engine URLFetch API deadline exceeded errors.

We are currently investigating an intermittent issue with the Google App Engine URLFetch API service. Fetch requests to non-Google services are failing with deadline exceeded errors. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 02:30 US/Pacific with current details.
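Since the symptom is deadline-exceeded errors, it may help to show where the URLFetch deadline is configured. The sketch below uses the App Engine Java URLFetch API and must run inside an App Engine application; the 30-second deadline and the URL are placeholders, and this is not presented as a workaround for the incident.

    import com.google.appengine.api.urlfetch.FetchOptions;
    import com.google.appengine.api.urlfetch.HTTPMethod;
    import com.google.appengine.api.urlfetch.HTTPRequest;
    import com.google.appengine.api.urlfetch.HTTPResponse;
    import com.google.appengine.api.urlfetch.URLFetchService;
    import com.google.appengine.api.urlfetch.URLFetchServiceFactory;
    import java.io.IOException;
    import java.net.URL;

    // Sketch (runs inside an App Engine app): issue a URLFetch call with an
    // explicit deadline. The 30-second value and the URL are placeholders.
    public class UrlFetchWithDeadline {
        static int fetchStatus(String url) throws IOException {
            URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
            HTTPRequest request = new HTTPRequest(
                new URL(url), HTTPMethod.GET,
                FetchOptions.Builder.withDeadline(30.0)); // deadline in seconds
            HTTPResponse response = fetcher.fetch(request);
            return response.getResponseCode();
        }
    }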

Last Update: A few months ago

RESOLVED: Incident 16011 - Compute Engine SSD Persistent disk latency in zone US-Central1-a

SUMMARY: On Tuesday, 28 June 2016 Google Compute Engine SSD Persistent Disks experienced elevated write latency and errors in one zone for a duration of 211 minutes. We would like to apologize for the length and severity of this incident. We are taking immediate steps to prevent a recurrence and improve reliability in the future. DETAILED DESCRIPTION OF IMPACT: On Tuesday, 28 June 2016 from 18:15 to 21:46 PDT SSD Persistent Disks (PD) in zone us-central1-a experienced elevated latency and errors for most writes. Instances using SSD as their root partition were likely unresponsive. For instances using SSD as a secondary disk, IO latency and errors were visible to applications. Standard (i.e. non-SSD) PD in us-central1-a suffered slightly elevated latency and errors. Latency and errors also occurred when taking and restoring from snapshots of Persistent Disks. Disk creation operations also had elevated error rates, both for standard and SSD PD. Persistent Disks outside of us-central1-a were unaffected. ROOT CAUSE: Two concurrent routine maintenance events triggered a rebalancing of data by the distributed storage system underlying Persistent Disk. This rebalancing is designed to make maintenance events invisible to the user, by redistributing data evenly around unavailable storage devices and machines. A previously unseen software bug, triggered by the two concurrent maintenance events, meant that disk blocks which became unused as a result of the rebalance were not freed up for subsequent reuse, depleting the available SSD space in the zone until writes were rejected. REMEDIATION AND PREVENTION: The issue was resolved when Google engineers reverted one of the maintenance events that triggered the issue. A fix for the underlying issue is already being tested in non-production zones. To reduce downtime related to similar issues in future, Google engineers are refining automated monitoring such that, if this issue were to recur, engineers would be alerted before users saw impact. We are also improving our automation to better coordinate different maintenance operations on the same zone to reduce the time it takes to revert such operations if necessary.

Last Update: A few months ago

RESOLVED: Incident 16005 - Issue with Developers Console

SUMMARY: On Thursday 9 June 2016, the Google Cloud Console was unavailable for a duration of 91 minutes, with significant performance degradation in the preceding half hour. Although this did not affect user resources running on the Google Cloud Platform, we appreciate that many of our customers rely on the Cloud Console to manage those resources, and we apologize to everyone who was affected by the incident. This report is to explain to our customers what went wrong, and what we are doing to make sure that it does not happen again. DETAILED DESCRIPTION OF IMPACT: On Thursday 9 June 2016 from 20:52 to 22:23 PDT, the Google Cloud Console was unavailable. Users who attempted to connect to the Cloud Console observed high latency and HTTP server errors. Many users also observed increasing latency and error rates during the half hour before the incident. Google Cloud Platform resources were unaffected by the incident and continued to run normally. All Cloud Platform resource management APIs remained available, allowing Cloud Platform resources to be managed via the Google Cloud SDK or other tools. ROOT CAUSE: The Google Cloud Console runs on Google App Engine, where it uses internal functionality that is not used by customer applications. Google App Engine version 1.9.39 introduced a bug in one internal function which affected Google Cloud Console instances, but not customer-owned applications, and thus escaped detection during testing and during initial rollout. Once enough instances of Google Cloud Console had been switched to 1.9.39, the console was unavailable and internal monitoring alerted the engineering team, who restored service by starting additional Google Cloud Console instances on 1.9.38. During the entire incident, customer-owned applications were not affected and continued to operate normally. To prevent a future recurrence, Google engineers are augmenting the testing and rollout monitoring to detect low error rates on internal functionality, complementing the existing monitoring for customer applications. REMEDIATION AND PREVENTION: When the issue was provisionally identified as a specific interaction between Google App Engine version 1.9.39 and the Cloud Console, App Engine engineers brought up capacity running the previous App Engine version and transferred the Cloud Console to it, restoring service at 22:23 PDT. The low-level bug that triggered the error has been identified and fixed. Google engineers are increasing the fidelity of the rollout monitoring framework to detect error signatures that suggest negative interactions of individual apps with a new App Engine release, even when the signatures are invisible in global App Engine performance statistics. We apologize again for the inconvenience this issue caused our customers.

Last Update: A few months ago

RESOLVED: Incident 16012 - Newly created instances may be experiencing packet loss.

SUMMARY: On Wednesday 29 June 2016, newly created Google Compute Engine instances and newly created network load balancers in all zones were partially unreachable for a duration of 106 minutes. We know that many customers depend on the ability to rapidly deploy and change configurations, and we apologize for our failure to provide this to you during this time. DETAILED DESCRIPTION OF IMPACT: On Wednesday 29 June 2016, from 11:58 until 13:44 US/Pacific, new Google Compute Engine instances and new network load balancers were partially unreachable via the network. In addition, changes to existing network load balancers were only partially applied. The level of unreachability depended on traffic path rather than instance or load balancer location. Overall, the average impact on new instances was 50% of traffic in the US and around 90% in Asia and Europe. Existing and unchanged instances and load balancers were unaffected. ROOT CAUSE: At 11:58 US/Pacific, a scheduled upgrade to Google's network control system started, introducing an additional access control check for network configuration changes. This inadvertently removed the access of GCE's management system to network load balancers in this environment. Only a fraction of Google's network locations require this access, as an older design has an intermediate component doing access updates. As a result, these locations did not receive updates for new and changed instances or load balancers. The change was only tested at network locations that did not require the direct access, which resulted in the issue not being detected during testing and canarying and being deployed globally. REMEDIATION AND PREVENTION: After identifying the root cause, the access control check was modified to allow access by GCE's management system. The issue was resolved when this modification was fully deployed. To prevent future incidents, the network team is making several changes to their deployment processes. This will improve the level of testing and canarying to catch issues earlier, especially where an issue only affects a subset of the environments at Google. The rollback procedure in the deployment process will also be enhanced to allow the quickest possible resolution of future incidents. The access control system that was at the root of this issue will also be modified to improve the operations that interact with it. For example, it will be integrated with a Google-wide change logging system to allow faster detection of issues caused by access control changes. It will also be outfitted with a dry run mode to allow the consequences of changes to be tested during development. Once again we would like to apologize for falling below the level of service you rely on.

Last Update: A few months ago

RESOLVED: Incident 16012 - Newly created instances may be experiencing packet loss.

The issue with new Google Compute Engine instances experiencing packet loss on startup should have been resolved for all affected instances as of 13:57 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16012 - Newly created instances may be experiencing packet loss.

The issue with new Google Compute Engine instances experiencing packet loss on startup should have been resolved for some instances and we expect a full resolution in the near future. We will provide another status update by 14:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16012 - Newly created instances may be experiencing packet loss.

We are experie