Google Cloud Status

RESOLVED: Incident 18018 - Connectivity issues affecting Google services, including Google APIs, load balancers, instances, and other external IP addresses, have been resolved.

The issue with Google Cloud IP addresses being advertised by internet service providers has been resolved for all affected users as of 14:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: About 4 days ago

UPDATE: Incident 18018 - Connectivity issues affecting Google services, including Google APIs, load balancers, instances, and other external IP addresses.

Connectivity issues affecting Google services, including Google APIs, load balancers, instances, and other external IP addresses.

Last Update: About 4 days ago

UPDATE: Incident 18018 - Connectivity issues affecting Google services, including Google APIs, load balancers, instances, and other external IP addresses.

We've received a report of an issue with Google Cloud Networking as of Monday, 2018-11-12 14:16 US/Pacific: Google Cloud IP addresses are being erroneously advertised by internet service providers other than Google. We will provide more information by Monday, 2018-11-12 15:00 US/Pacific.

Last Update: About 4 days ago

RESOLVED: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

The Google Cloud Networking issue is believed to be affecting a very small number of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.

Last Update: About 19 days ago

RESOLVED: Incident 18010 - Failed read requests to Google Stackdriver via the Cloud Console and API

The issue with failing read requests to Google Stackdriver via the Cloud Console and API has been resolved for all affected users as of 19:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: About 24 days ago

RESOLVED: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

The issue with packet loss for Google Cloud VPN, Cloud Interconnect, and Cloud Router has been resolved for all affected users as of Monday, 2018-10-22 14:25 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: About 25 days ago

UPDATE: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

The issue with packet loss for Google Cloud VPN, Cloud Interconnect, and Cloud Router should be resolved for the majority of requests as of Monday, 2018-10-22 14:25 US/Pacific and we expect a full resolution in the near future. We will provide an update by Monday, 2018-10-22 15:45 US/Pacific with current details.

Last Update: About 25 days ago

UPDATE: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

Last Update: About 25 days ago

UPDATE: Incident 18017 - Google Cloud VPN, Cloud Interconnect, and Cloud Router experiencing packet loss in multiple regions

We are experiencing an issue with packet loss for Google Cloud VPN, Cloud Interconnect, and Cloud Router beginning at Monday, 2018-10-22 13:42 US/Pacific. Current data indicate that requests in the following regions are affected by this issue: asia-east1, asia-south1, asia-southeast1, europe-west1, europe-west4, northamerica-northeast1, us-east1. For everyone who is affected, we apologize for the disruption. We will provide an update by Monday, 2018-10-22 15:30 US/Pacific with current details.

Last Update: About 25 days ago

RESOLVED: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products (Bigtable, Cloud SQL, Datastore, VMs)

ISSUE SUMMARY

On Thursday 11 October 2018, a section of Google's network that includes part of us-central1-c lost connectivity to the Google network backbone that connects to the public internet for a duration of 41 minutes. We apologize if your service or application was impacted by this incident. We are following our postmortem process to ensure we fully understand what caused this incident and to determine the exact steps we can take to prevent incidents of this type from recurring. Our engineering team is committed to prioritizing fixes for the most critical findings that result from our postmortem.

DETAILED DESCRIPTION OF IMPACT

On Thursday 11 October 2018 from 16:13 to 16:54 PDT, a section of Google's network that includes part of us-central1-c lost connectivity to the Google network backbone that connects to the public internet. The us-central1-c zone is composed of two separate physical clusters. 61% of the VMs in us-central1-c were in the cluster impacted by this incident. Projects that create VMs in this zone have all of their VMs assigned to a single cluster, so customers with VMs in the zone were impacted either for all of their VMs in a project or for none. Impacted VMs could not communicate with VMs outside us-central1 during the incident. VM-to-VM traffic using an internal IP address within us-central1 was not affected. Traffic through the network load balancer was not able to reach impacted VMs in us-central1-c, but customers with VMs spread between multiple zones experienced the network load balancer shifting traffic to unaffected zones. Traffic through the HTTP(S), SSL Proxy, and TCP Proxy load balancers was not significantly impacted by this incident. Other Google Cloud Platform services that experienced significant impact include the following: 30% of Cloud Bigtable clusters located in us-central1-c became unreachable, and 10% of Cloud SQL instances in us-central lost external connectivity.

ROOT CAUSE

The incident occurred while Google's network operations team was replacing the routers that link us-central1-c to Google's backbone that connects to the public internet. Google engineers paused the router replacement process after determining that additional cabling would be required to complete it, and decided to start a rollback operation. The rollout and rollback operations used a version of the workflow that was only compatible with the newer routers; specifically, rollback was not supported on the older routers. When a configuration change was pushed to the older routers during the rollback, it deleted the Border Gateway Protocol (BGP) control plane sessions connecting the datacenter routers to the backbone routers, resulting in a loss of external connectivity.

REMEDIATION AND PREVENTION

The BGP sessions were deleted in two tranches. The first deletion was at 15:43 and caused traffic to fail over to other routers. The second set of BGP sessions was deleted at 16:13. The first alert for Google engineers fired at 16:16. We identified that the BGP sessions had been deleted at 16:41 and rolled back the configuration change at 16:52, ending the incident shortly thereafter. The preventative action items identified so far include the following: fix the automated workflows for router replacements to ensure the correct version of the workflow is used for both types of routers, and alert when BGP sessions are deleted and traffic fails off, so that we can detect and mitigate problems before they impact customers.
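
The impact description above notes that customers whose VMs were spread across multiple zones saw the network load balancer shift traffic to unaffected zones. As a rough sketch of that multi-zone pattern (not part of the incident report; the template and group names, machine type, and size are placeholders, and the flags should be checked against your gcloud version), a regional managed instance group keeps instances distributed across the zones of us-central1:

# Instance template for the group (placeholder name and machine type)
gcloud compute instance-templates create web-template --machine-type n1-standard-1

# Regional managed instance group: instances are spread across us-central1 zones,
# so losing external connectivity in a single zone does not empty the backend pool
gcloud compute instance-groups managed create web-group \
    --region us-central1 --size 3 --template web-template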

Last Update: About 1 month ago

RESOLVED: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

The issue with Cloud Networking impacting multiple GCP products in us-central1-c has been resolved for all affected projects as of Thursday, 2018-10-11 17:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: About 1 month ago

UPDATE: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

Cloud Networking is now working, and mitigation work to restore the impacted services is currently underway by our Engineering Team. We will provide another status update by Thursday, 2018-10-11 18:00 US/Pacific with current details.

Last Update: About 1 month ago

UPDATE: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

Our engineers have applied a mitigation and the error rate is now decreasing. We will provide another status update by Thursday, 2018-10-11 17:20 US/Pacific with current details.

Last Update: About 1 month ago

UPDATE: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

Networking issues in us-central1-c impacting multiple GCP products

Last Update: About 1 month ago

UPDATE: Incident 18016 - Networking issues in us-central1-c impacting multiple GCP products

We are investigating networking issues in us-central1-c that are impacting multiple Google Cloud Platform products. We will provide more information by Thursday, 2018-10-11 17:15 US/Pacific.

Last Update: About 1 month ago

RESOLVED: Incident 18038 - We are investigating an issue with Google BigQuery having an increased error rate in the US.

The issue with Google BigQuery having an increased error rate in the US has been resolved for all affected users as of 11:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: About 1 month ago

UPDATE: Incident 18038 - We are investigating an issue with Google BigQuery having an increased error rate in the US.

We are investigating an issue with Google BigQuery having an increased error rate in the US. The mitigation applied so far has not resolved the issue; we are continuing to investigate and apply additional mitigations. We will provide more information by Thursday, 2018-09-27 12:00 US/Pacific.

Last Update: About 1 month ago

UPDATE: Incident 18038 - We are investigating an issue with Google BigQuery having an increased error rate in the US.

We are investigating an issue with Google BigQuery having an increased error rate in the US. We have already applied a mitigation and are currently monitoring the situation. We will provide more information by Thursday, 2018-09-27 11:15 US/Pacific.

Last Update: About 1 month ago

RESOLVED: Incident 18015 - We are investigating an issue with Google Cloud Networking in europe-north1-c.

After further investigation, we have determined that the impact was minimal and affected a small number of users. We have conducted an internal investigation of this issue and made appropriate improvements to our systems to help prevent or minimize a future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18002 - AutoML Natural Language failing to train models

The issue with Google Cloud AutoML Natural Language failing to train models has been resolved for all affected users. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18006 - App Engine increased latency and 5XX errors.

The issue with increased latency and 5XX errors in Google App Engine has been resolved for all affected projects as of Wednesday, 2018-09-12 08:07 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18002 - Error message when updating Cloud Functions via gcloud.

The issue with Google Cloud Functions experiencing errors when updating functions via gcloud has been resolved for all affected users as of Tuesday, 2018-09-11 09:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18015 - We are investigating an issue with Google Cloud Networking in europe-north1-c.

The issue with Cloud Networking in europe-north1-c has been resolved for all affected projects as of Tuesday, 2018-09-11 02:27 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18015 - We are investigating an issue with Google Cloud Networking in europe-north1-c.

We are investigating a major issue with Google Cloud Networking in europe-north1-c. The issue started at Tuesday, 2018-09-11 01:18 US/Pacific. Our Engineering Team is investigating possible causes. We will provide more information by Tuesday, 2018-09-11 03:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18003 - Increased error rate for Google Cloud Storage

# ISSUE SUMMARY

On Tuesday 4 September 2018, Google Cloud Storage experienced 1.1% error rates and increased 99th percentile latency for US multiregion buckets for a duration of 5 hours 38 minutes. After that time some customers experienced 0.1% error rates, which returned to normal progressively over the subsequent 4 hours. To our Google Cloud Storage customers whose businesses were impacted during this incident, we sincerely apologize; this is not the level of tail-end latency and reliability we strive to offer you. We are taking immediate steps to improve the platform’s performance and availability.

# DETAILED DESCRIPTION OF IMPACT

On Tuesday 4 September 2018 from 02:55 to 08:33 PDT, customers with buckets located in the US multiregion experienced a 1.066% error rate and 4.9x increased 99th percentile latency, with the peak effect occurring between 05:00 PDT and 07:30 PDT for write-heavy workloads. At 08:33 PDT, 99th percentile latency decreased to 1.4x normal levels and error rates decreased, initially to 0.146% and then to nominal levels by 12:50 PDT.

# ROOT CAUSE

At the beginning of August, Google Cloud Storage deployed a new feature which, among other things, prefetched and cached the location of some internal metadata. On Monday 3 September at 18:45 PDT, a change in the underlying metadata storage system resulted in increased load on that subsystem, which eventually invalidated some cached metadata for US multiregion buckets. This meant that requests for that metadata experienced increased latency, or returned an error if the backend RPC timed out. This additional load on metadata lookups led to the elevated error rates and latency described above.

# REMEDIATION AND PREVENTION

Google Cloud Storage SREs were alerted automatically once error rates had risen materially above nominal levels. Additional SRE teams were involved as soon as the metadata storage system was identified as a likely root cause of the incident. In order to mitigate the incident, the keyspace that was suffering degraded performance needed to be identified and isolated so that it could be given additional resources. This work was completed by 08:33 PDT on 4 September. In parallel, Google Cloud Storage SREs pursued the source of additional load on the metadata storage system and traced it to cache invalidations. In order to prevent this type of incident from occurring again in the future, we will expand our load testing to ensure that performance degradations are detected prior to reaching production. We will improve our monitoring diagnostics to ensure that we more rapidly pinpoint the source of performance degradation. The metadata prefetching algorithm will be changed to introduce randomness and further reduce the chance of creating excessive load on the underlying storage system. Finally, we plan to enhance the storage system to reduce the time needed to identify, isolate, and mitigate load concentration such as that resulting from cache invalidations.
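
For workloads that hit a transient error spike like the one described above, a simple client-side retry with exponential backoff usually rides it out. A minimal sketch, assuming gsutil is installed and gs://my-bucket/my-object is a placeholder path (this is an illustration, not a mitigation taken from the report):

# Retry a Cloud Storage metadata read up to 5 times with exponential backoff
for attempt in 1 2 3 4 5; do
  if gsutil stat gs://my-bucket/my-object; then
    break
  fi
  echo "attempt ${attempt} failed; backing off" >&2
  sleep $((2 ** attempt))
done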

Last Update: A few months ago

RESOLVED: Incident 18003 - Increased error rate for Google Cloud Storage

The issue with Google Cloud Storage errors on requests to US multiregional buckets has been resolved for all affected users as of Tuesday, 2018-09-04 12:52 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate for Google Cloud Storage

The mitigation efforts have further decreased the error rates to less than 1% of requests. We are still seeing intermittent spikes but these are less frequent. We expect a full resolution in the near future. We will provide another status update by Tuesday, 2018-09-04 16:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate for Google Cloud Storage

We are rolling out a potential fix to mitigate this issue. Impact is intermittent but limited to US multiregional Cloud Storage buckets. We will provide another status update by Tuesday, 2018-09-04 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate or high latency for Google Container Registry API calls

The issue with Google Container Registry API latency should be mitigated for the majority of requests and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2018-09-04 14:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate or high latency for Google Container Registry API calls

Temporary mitigation efforts have significantly reduced the error rate but we are still seeing intermittent errors or latency on requests. Full resolution efforts are still ongoing. We will provide another status update by Tuesday, 2018-09-04 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Increased error rate or high latency for Google Container Registry API calls

The rate of errors is decreasing. We will provide another status update by Tuesday, 2018-09-04 11:15 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

ISSUE SUMMARY

On Friday 27 July 2018, for a duration of 1 hour 4 minutes, Google Compute Engine (GCE) instances and Cloud VPN tunnels in europe-west4 experienced loss of connectivity to the Internet. The incident affected all new or recently live migrated GCE instances. VPN tunnels created during the incident were also impacted. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to avoid a recurrence.

DETAILED DESCRIPTION OF IMPACT

All Google Compute Engine (GCE) instances in europe-west4 created on Friday 27 July 2018 from 18:27 to 19:31 PDT lost connectivity to the Internet and other instances via their public IP addresses. Additionally any instances that live migrated during the outage period would have lost connectivity for approximately 30 minutes after the live migration completed. All Cloud VPN tunnels created during the impact period, and less than 1% of existing tunnels in europe-west4, also lost external connectivity. All other instances and VPN tunnels continued to serve traffic. Inter-instance traffic via private IP addresses remained unaffected.

ROOT CAUSE

Google's datacenters utilize software load balancers known as Maglevs [1] to efficiently load balance network traffic [2] across service backends. The issue was caused by an unintended side effect of a configuration change made to jobs that are critical in coordinating the availability of Maglevs. The change unintentionally lowered the priority of these jobs in europe-west4. The issue was subsequently triggered when a datacenter maintenance event required load shedding of low priority jobs. This resulted in failure of a portion of the Maglev load balancers. However, a safeguard in the network control plane ensured that some Maglev capacity remained available. This layer of our typical defense-in-depth allowed connectivity to extant cloud resources to remain up, and restricted the disruption to new or migrated GCE instances and Cloud VPN tunnels.

REMEDIATION AND PREVENTION

Automated monitoring alerted Google’s engineering team to the event within 5 minutes and they immediately began investigating at 18:36. At 19:25 the team discovered the root cause and started reverting the configuration change. The issue was mitigated at 19:31 when the fix was rolled out. At this point, connectivity was restored immediately. In addition to addressing the root cause, we will be implementing changes to both prevent and reduce the impact of this type of failure by improving our alerting when too many Maglevs become unavailable, and adding a check for configuration changes to detect priority reductions on critical dependencies. We would again like to apologize for the impact that this incident had on our customers and their businesses in the europe-west4 region. We are conducting a detailed post-mortem to ensure that all the root and contributing causes of this event are understood and addressed promptly.

[1] https://ai.google/research/pubs/pub44824
[2] https://cloudplatform.googleblog.com/2016/03/Google-shares-software-network-load-balancer-design-powering-GCP-networking.html
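
The impact here was loss of connectivity via public IP addresses while traffic over private IP addresses kept flowing. A rough way to check that distinction from an affected VM (the instance name, zone, and the peer's internal IP 10.164.0.3 are placeholders, not values from the report):

# External path: fetch an internet URL from the VM via its public egress
gcloud compute ssh test-vm --zone europe-west4-a \
    --command 'curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 https://www.gstatic.com/generate_204'

# Internal path: reach a peer VM over its private IP address
gcloud compute ssh test-vm --zone europe-west4-a \
    --command 'ping -c 3 10.164.0.3'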

Last Update: A few months ago

RESOLVED: Incident 18008 - We are investigating errors activating Windows and SUSE licenses on Google Compute Engine instances in all regions.

The issue with errors activating Windows and SUSE licenses on Google Compute Engine instances in all regions has been resolved for all affected users as of Friday, 2018-08-03 11:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18008 - We are investigating errors activating Windows and SUSE licenses on Google Compute Engine instances in all regions.

We are still seeing errors activating Windows and SUSE licenses on Google Compute Engine instances in all regions. Our Engineering Team is investigating possible causes. We will provide another status update by Friday, 2018-08-03 12:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18014 - Traffic loss in region europe-west2

The issue with traffic loss in europe-west2 has been resolved for all affected projects as of Tuesday, 2018-07-31 07:26 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18014 - traffic loss in region europe-west2

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Tuesday, 2018-07-31 10:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - traffic loss in region europe-west2

The issue with traffic loss in GCP region europe-west2 should be mitigated and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2018-07-31 08:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - traffic loss in region europe-west2

We are experiencing traffic loss in GCP region europe-west2 beginning at Tuesday, 2018-07-31 06:45 US/Pacific. Early investigation indicates that approximately 20% of requests in this region are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by Tuesday, 2018-07-31 07:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - traffic loss in region europe-west2

We are investigating an issue with traffic loss in region europe-west2. We will provide more information by Tuesday, 2018-07-31 07:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

The issue with Internet access for VMs in the europe-west4 region has been resolved for all affected projects as of Friday, 2018-07-27 19:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Friday, 2018-07-27 20:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

Our Engineering Team believes they have identified the root cause of the issue and is working to mitigate. We will provide another status update by Friday, 2018-07-27 20:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

Investigation is currently underway by our Engineering Team. We will provide another status update by Friday, 2018-07-27 20:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - We are investigating issues with Internet access for VMs in the europe-west4 region.

We are investigating an issue with Google Cloud Networking for VM instances in the europe-west4 region. We will provide more information by Friday, 2018-07-27 19:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - Support Center inaccessible

A detailed analysis has been written for this incident and is available at cloud networking incident 18012: https://status.cloud.google.com/incident/cloud-networking/18012

Last Update: A few months ago

RESOLVED: Incident 18012 - The issue with Google Cloud Global Loadbalancers returning 502s has been fully resolved.

## ISSUE SUMMARY

On Tuesday, 17 July 2018, customers using Google Cloud App Engine, Google HTTP(S) Load Balancer, or TCP/SSL Proxy Load Balancers experienced elevated error rates ranging between 33% and 87% for a duration of 32 minutes. Customers observed errors consisting of either 502 return codes or connection resets. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability. We will be providing customers with an SLA credit for the affected timeframe that impacted the Google Cloud HTTP(S) Load Balancer, TCP/SSL Proxy Load Balancer, and Google App Engine products.

## DETAILED DESCRIPTION OF IMPACT

On Tuesday, 17 July 2018, from 12:17 to 12:49 PDT, Google Cloud HTTP(S) Load Balancers returned 502s for some requests they received. The proportion of 502 return codes varied from 33% to 87% during the period. Automated monitoring alerted Google’s engineering team to the event at 12:19, and at 12:44 the team had identified the probable root cause and deployed a fix. At 12:49 the fix became effective and the rate of 502s rapidly returned to a normal level. Services experienced degraded latency for several minutes longer as traffic returned and caches warmed. Serving fully recovered by 12:55. Connections to Cloud TCP/SSL Proxy Load Balancers would have been reset after connections to backends failed. Cloud services depending upon Cloud HTTP Load Balancing, such as Google App Engine application serving, Google Cloud Functions, Stackdriver's web UI, Dialogflow, and the Cloud Support Portal/API, were affected for the duration of the incident. Cloud CDN cache hits dropped 70% due to decreased references to Cloud CDN URLs from services behind Cloud HTTP(S) Load Balancers and an inability to validate stale cache entries or insert new content on cache misses. Services running on Google Kubernetes Engine and using the Ingress resource would have served 502 return codes as mentioned above. Google Cloud Storage traffic served via Cloud Load Balancers was also impacted. Other Google Cloud Platform services were not impacted; for example, applications and services that use direct VM access or Network Load Balancing were not affected.

## ROOT CAUSE

Google’s Global Load Balancers are based on a two-tiered architecture of Google Front Ends (GFE). The first tier of GFEs answer requests as close to the user as possible to maximize performance during connection setup. These GFEs route requests to a second layer of GFEs located close to the service which the request makes use of. This type of architecture allows clients to have low latency connections anywhere in the world, while taking advantage of Google’s global network to serve requests to backends, regardless of which region they are located in. The GFE development team was in the process of adding features to GFE to improve security and performance. These features had been introduced into the second-layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either testing or the initial rollout. At the beginning of the event, a configuration change in the production environment triggered the bug intermittently, which caused affected GFEs to repeatedly restart. Since restarts are not instantaneous, the available second-layer GFE capacity was reduced. While some requests were correctly answered, other requests were interrupted (leading to connection resets) or denied due to a temporary lack of capacity while the GFEs were coming back online.

## REMEDIATION AND PREVENTION

Google engineers were alerted to the issue within 3 minutes and immediately began investigating. At 12:44 PDT, the team discovered the root cause, the configuration change was promptly reverted, and the affected GFEs ceased their restarts. As all GFEs returned to service, traffic resumed its normal levels and behavior. In addition to fixing the underlying cause, we will be implementing changes to both prevent and reduce the impact of this type of failure in several ways:

1. We are adding additional safeguards to disable features not yet in service.
2. We plan to increase hardening of the GFE testing stack to reduce the risk of having a latent bug in production binaries that may cause a task to restart.
3. We will also be pursuing additional isolation between different shards of GFE pools in order to reduce the scope of failures.
4. Finally, to speed diagnosis in the future, we plan to create a consolidated dashboard of all configuration changes for GFE pools, allowing engineers to more easily and quickly observe, correlate, and identify problematic changes to the system.

We would again like to apologize for the impact that this incident had on our customers and their businesses. We take any incident that affects the availability and reliability of our customers extremely seriously, particularly incidents which span regions. We are conducting a thorough investigation of the incident and will be making the changes which result from that investigation our very top priority in GCP engineering.
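
During an event like this, an external probe makes it easy to tell 502s returned by the load balancing layer apart from errors generated by your own backends. A hedged sketch (https://example.com/ stands in for your load-balanced endpoint):

# Sample the HTTP status code returned through the load balancer once per second
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 https://example.com/
  sleep 1
done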

Last Update: A few months ago

RESOLVED: Incident 18012 - The issue with Google Cloud Global Loadbalancers returning 502s has been fully resolved.

The issue with Google Cloud Global Load balancers returning 502s has been resolved for all affected users as of 13:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18012 - We are investigating a problem with Google Cloud Global Loadbalancers returning 502s

The issue with Google Cloud Load balancers returning 502s should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2018-07-17 13:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18012 - We are investigating a problem with Google Cloud Global Loadbalancers returning 502s

We are investigating a problem with Google Cloud Global Loadbalancers returning 502s

Last Update: A few months ago

UPDATE: Incident 18012 - We are investigating a problem with Google Cloud Global Loadbalancers returning 502s

We are investigating a problem with Google Cloud Global Loadbalancers returning 502s for many services, including App Engine, Stackdriver, Dialogflow, and customer Global Load Balancers. We will provide another update by Tuesday, 2018-07-17 13:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18008 - Stackdriver disturbance

The issue with Cloud Stackdriver, where you could have experienced increased latency in log delivery, has been resolved for all affected users as of 02:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18008 - Stackdriver disturbance

The issue with Cloud Stackdriver, where you could have experienced increased latency in log delivery, has been resolved for all affected users as of 02:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 18007 - The issue with Google Cloud networking in europe-west1-b and europe-west4-b has been resolved.

The issue with VM public IP address traffic in europe-west1-b and europe-west4-b has been resolved for all affected projects as of Wednesday, 2018-07-04 08:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18007 - We've received a report of an issue with Google Cloud networking in europe-west1-b and europe-west4-b.

We are allocating additional capacity to handle this load. Most of the changes are complete and access controls have been rolled back. The situation is still being closely monitored and further adjustments are still possible/likely. We will provide another status update by Wednesday, 2018-07-04 10:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18037 - We've received a report of an issue with Google BigQuery.

ISSUE SUMMARY

On Friday 22 June 2018, Google BigQuery experienced increased query failures for a duration of 1 hour 6 minutes. We apologize for the impact of this issue on our customers and are making changes to mitigate and prevent a recurrence.

DETAILED DESCRIPTION OF IMPACT

On Friday 22 June 2018 from 12:06 to 13:12 PDT, up to 50% of total requests to the BigQuery API failed with error code 503. Error rates varied during the incident, with some customers experiencing a 100% failure rate for their BigQuery table jobs. bigquery.tabledata.insertAll jobs were unaffected.

ROOT CAUSE

A new release of the BigQuery API introduced a software defect that caused the API component to return larger-than-normal responses to the BigQuery router server. The router server is responsible for examining each request, routing it to a backend server, and returning the response to the client. To process these large responses, the router server allocated more memory, which led to an increase in garbage collection. This resulted in an increase in CPU utilization, which caused our automated load balancing system to shrink the server capacity as a safeguard against abuse. With the reduced capacity and now comparatively large throughput of requests, the denial-of-service protection system used by BigQuery responded by rejecting user requests, causing a high rate of 503 errors.

REMEDIATION AND PREVENTION

Google engineers initially mitigated the issue by increasing the capacity of the BigQuery router server, which prevented overload and allowed API traffic to resume normally. The issue was fully resolved by identifying and reverting the change that caused the large response sizes. To prevent future occurrences, BigQuery engineers will also be adjusting capacity alerts to improve monitoring of server overutilization. We apologize once again for the impact of this incident on your business.
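
Because the failures surfaced as 503s from the BigQuery API while streaming inserts were unaffected, batch query clients could simply retry. A minimal sketch using the bq CLI (the query and retry policy are placeholders, not guidance from the report):

# Retry a query a few times if the API returns a transient error
for attempt in 1 2 3; do
  if bq query --nouse_legacy_sql 'SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`'; then
    break
  fi
  sleep $((10 * attempt))
done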

Last Update: A few months ago

UPDATE: Incident 18006 - We've received a report of an issue with Google Cloud Networking in us-east1.

The issue with Google Cloud Networking in us-east1 has been resolved for all affected projects as of Saturday, 2018-06-23 13:16 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18006 - We've received a report of an issue with Google Cloud Networking in us-east1.

We've received a report of an issue with Google Cloud Networking in us-east1.

Last Update: A few months ago

UPDATE: Incident 18006 - We've received a report of an issue with Google Cloud Networking in us-east1.

We are experiencing an issue with Google Cloud Networking beginning on Saturday, 2018-06-23 12:02 US/Pacific. Current data indicate that approximately 33% of projects in us-east1 are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by Saturday, 2018-06-23 14:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18037 - We've received a report of an issue with Google BigQuery.

The issue with Google BigQuery has been resolved for all affected projects as of Friday, 2018-06-22 13:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18037 - We've received a report of an issue with Google BigQuery.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Friday, 2018-06-22 14:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18037 - We've received a report of an issue with Google BigQuery.

We've received a report of an issue with Google BigQuery.

Last Update: A few months ago

UPDATE: Incident 18037 - We've received a report of an issue with Google BigQuery.

We are investigating an issue with Google BigQuery. Our Engineering Team is investigating possible causes. Affected customers may see their queries fail with 500 errors. We will provide another status update by Friday, 2018-06-22 14:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses; networking on stopped instances is not coming up when they are started.

The issue with Google Compute Engine VM instances being allocated duplicate internal IP addresses has been resolved for all affected projects as of Saturday, 2018-06-16 12:59 US/Pacific. Customers with VMs having duplicate internal IP addresses should follow the workaround described earlier, which is to delete the affected VM instances (without deleting the boot disk) and recreate them. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses; networking on stopped instances is not coming up when they are started.

Mitigation work is currently underway by our Engineering Team. GCE VMs with an internal IP that is not assigned to another VM within the same project, region, and network should no longer see this issue; however, instances whose internal IP is in use by another VM may still fail to start with networking. We will provide another status update by Saturday, 2018-06-16 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses; networking on stopped instances is not coming up when they are started.

Detailed Description

We are investigating an issue with Google Compute Engine VM instances failing to start with networking after being stopped, or new instances being allocated the same IP address as a VM instance that was stopped within the past 24 hours. Our Engineering Team is evaluating a fix in a test environment. We will provide another status update by Saturday, 2018-06-16 03:30 US/Pacific with current details.

Diagnosis

Instances that were stopped at any time between 2018-06-14 08:42 and 2018-06-15 13:40 US/Pacific may fail to start with networking. A newly allocated VM instance may have the same IP address as a VM instance which was stopped within the mentioned time period.

Workaround

As an immediate mitigation for instances whose networking is not working, recreate the instance: delete it (without deleting the boot disk), then create it again. For example:

gcloud compute instances describe instance-1 --zone us-central1-a
gcloud compute instances delete instance-1 --zone us-central1-a --keep-disks=all
gcloud compute instances create instance-1 --zone us-central1-a --disk='boot=yes,name=instance-1'

To prevent new instances from coming up with duplicate IP addresses, we suggest creating f1-micro instances until new IP addresses are allocated, and then stopping those instances to stop incurring charges. Alternatively, new instances can be brought up with a static internal IP address.
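
As a rough sketch of the static internal IP alternative mentioned above (the address name, 10.128.0.50, subnet, region, and zone are placeholders; verify the flags against your gcloud version):

# Reserve a static internal IP address in the subnet
gcloud compute addresses create instance-1-internal-ip \
    --region us-central1 --subnet default --addresses 10.128.0.50

# Create the VM using the reserved address
gcloud compute instances create instance-1 \
    --zone us-central1-a --subnet default --private-network-ip 10.128.0.50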

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses

Our Engineering Team continues to evaluate the fix in a test environment. We believe that customers can work around the issue by launching and then stopping f1-micro instances until no more duplicate IP addresses are obtained. We are awaiting confirmation that the provided workaround works for customers. We will provide another status update by Friday, 2018-06-15 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses

Our Engineering Team is evaluating a fix in a test environment. We will provide another status update by Friday, 2018-06-15 17:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses

Google Compute Engine VM instances allocated with duplicate internal IP addresses

Last Update: A few months ago

UPDATE: Incident 18005 - Google Compute Engine VM instances allocated with duplicate internal IP addresses

Investigation continues by our Engineering Team. We are investigating workarounds as well as a method to resolve the issue for all affected projects. We will provide another status update by Friday, 2018-06-15 17:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18011 - The issue with Google Networking in South America should be resolved.

The issue with Google Cloud Networking in South America has been resolved for all affected users as of Monday, 2018-06-04 22:22. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18011 - The rate of errors is decreasing for the Cloud Networking issue in South America.

The issue with Google Networking in South America should be resolved for some users, and we expect a full resolution in the near future. We will provide another status update by Monday, 2018-06-04 23:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18011 - The rate of errors is decreasing for the Cloud Networking issue in South America.

The rate of errors is decreasing. We will provide another status update by Monday, 2018-06-04 23:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18011 - Mitigation work is currently underway by our Engineering Team to restore the loss of traffic with Google Networking in South America.

Mitigation work is currently underway by our Engineering Team for Google Cloud Networking issue affecting South America. We will provide more information by Monday, 2018-06-04 22:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18011 - We are investigating loss of traffic with Google Networking in South America.

We are investigating loss of traffic with Google Networking in South America.

Last Update: A few months ago

UPDATE: Incident 18011 - We are investigating loss of traffic with Google Networking in South America.

We are investigating loss of traffic to and from South America. Our Engineering Team is investigating possible causes. We will provide another status update by Monday, 2018-06-04 22:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18007 - The Stackdriver logging service is experiencing a 30-minute delay.

ISSUE SUMMARY

On Sunday, 20 May 2018, for 4 hours and 25 minutes, approximately 6% of Stackdriver Logging logs experienced a median ingest latency of 90 minutes. To our Stackdriver Logging customers whose operations monitoring was impacted during this outage, we apologize. We have conducted an internal investigation and are taking steps to ensure this doesn’t happen again.

DETAILED DESCRIPTION OF IMPACT

On Sunday, 20 May 2018 from 18:40 to 23:05 Pacific Time, 6% of logs ingested by Stackdriver Logging experienced log event ingest latency of up to 2 hours 30 minutes, with a median latency of 90 minutes. Customers requesting log events within the latency window would receive empty responses. Logging export sinks were not affected.

ROOT CAUSE

Stackdriver Logging uses a pool of workers to persist ingested log events. On Sunday, 20 May 2018 at 17:40, a load spike in the Stackdriver Logging storage subsystem caused 0.05% of persist calls made by the workers to time out. The workers would then retry persisting to the same address until reaching a retry timeout. While the workers were retrying, they were not persisting other log events. This resulted in multiple workers being removed from the pool of available workers. By 18:40, enough workers had been removed from the pool to reduce throughput below the level of incoming traffic, creating delays for 6% of logs.

REMEDIATION AND PREVENTION

After Google engineering was paged, engineers isolated the issue to these timed-out workers. At 20:35, engineers configured the workers to return timed-out log events to the queue and move on to a different log event after a timeout. This allowed workers to catch up with the ingest rate. At 23:02, the last delayed message was delivered. We are taking the following steps to prevent the issue from happening again: we are modifying the workers to retry persists using alternate addresses to reduce the impact of persist timeouts; we are increasing the persist capacity of the storage subsystem to manage load spikes; and we are modifying Stackdriver Logging workers to reduce their unavailability when the storage subsystem experiences higher latency.
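
A simple way to spot ingest delay like the one described above is to compare the newest timestamp Stackdriver Logging returns for a project against the current time. A rough sketch (the timestamp filter is a generic example, not taken from the report):

# Print the timestamp of the most recently ingested log entry...
gcloud logging read 'timestamp>="2018-05-20T00:00:00Z"' \
    --limit 1 --order desc --format 'value(timestamp)'

# ...and compare it with the current UTC time
date --utc +%Y-%m-%dT%H:%M:%SZ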

Last Update: A few months ago

RESOLVED: Incident 18003 - Issue with Google Cloud project creation

The issue with project creation has been resolved for all affected projects as of Tuesday, 2018-05-22 13:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18009 - GCE Networking issue in us-east4

ISSUE SUMMARY

On Wednesday 16 May 2018, Google Cloud Networking experienced loss of connectivity to external IP addresses located in us-east4 for a duration of 58 minutes.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 16 May 2018 from 18:43 to 19:41 PDT, Google Compute Engine, Google Cloud VPN, and Google Cloud Network Load Balancers hosted in the us-east4 region experienced 100% packet loss from the internet and other GCP regions. Google Dedicated Interconnect Attachments located in us-east4 also experienced loss of connectivity.

ROOT CAUSE

Every zone in Google Cloud Platform advertises several sets of IP addresses to the Internet via BGP. Some of these IP addresses are global and are advertised from every zone, others are regional and advertised only from zones in the region. The software that controls the advertisement of these IP addresses contained a race condition during application startup that would cause regional IP addresses to be filtered out and withdrawn from a zone. During a routine binary rollout of this software, the race condition was triggered in each of the three zones in the us-east4 region. Traffic continued to be routed until the last zone received the rollout and stopped advertising regional prefixes. Once the last zone stopped advertising the regional IP addresses, external regional traffic stopped entering us-east4.

REMEDIATION AND PREVENTION

Google engineers were alerted to the problem within one minute and as soon as investigation pointed to a problem with the BGP advertisements, a rollback of the binary in the us-east4 region was created to mitigate the issue. Once the rollback proved effective, the original rollout was paused globally to prevent any further issues. We are taking the following steps to prevent the issue from happening again. We are adding additional monitoring which will provide better context in future alerts to allow us to diagnose issues faster. We also plan on improving the debuggability of the software that controls the BGP advertisements. Additionally, we will be reviewing the rollout policy for these types of software changes so we can detect issues before they impact an entire region. We apologize for this incident and we recognize that regional outages like this should be rare and quickly rectified. We are taking immediate steps to prevent recurrence and improve reliability in the future.
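
A basic external reachability check against the affected region's addresses could look like the sketch below (203.0.113.10 is a documentation placeholder; ping only succeeds if the address is attached to a VM and ICMP is allowed by its firewall rules):

# List reserved external addresses in us-east4
gcloud compute addresses list --filter="region:us-east4" --format="value(address)"

# Probe one of them from outside GCP
ping -c 3 203.0.113.10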

Last Update: A few months ago

UPDATE: Incident 18003 - Issue with Google Cloud project creation

We are experiencing an issue with creating new projects, as well as activating some APIs for projects, beginning at Tuesday, 2018-05-22 07:50 US/Pacific. For everyone who is affected, we apologize for the disruption. We will provide an update by Tuesday, 2018-05-22 13:40 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18003 - Issue with Google Cloud project creation

Issue with Google Cloud project creation

Last Update: A few months ago

UPDATE: Incident 18003 - Issue with Google Cloud project creation

The rate of errors is decreasing. We will provide another status update by Tuesday, 2018-05-22 14:10 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - Issue with Google Cloud project creation and API activation

We are experiencing an issue with creating new projects, as well as activating some APIs for projects, beginning at Tuesday, 2018-05-22 07:50 US/Pacific. For everyone who is affected, we apologize for the disruption. We will provide an update by Tuesday, 2018-05-22 13:40 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18007 - The Stackdriver logging service is experiencing a 30-minute delay.

The issue with Stackdriver Logging delays has been resolved for all affected projects as of Sunday, 2018-05-20 22:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18007 - The Stackdriver logging service is experiencing a 30-minute delay.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Sunday, 2018-05-20 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18007 - The Stackdriver logging service is experiencing a 30-minute delay.

The Stackdriver logging service is experiencing a 30-minute delay. We will provide another status update by Sunday, 2018-05-20 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18007 - We’ve received a report of an issue with Stackdriver Logging delays. We will provide more information by Sunday 20:45 US/Pacific.

We are investigating an issue with Google Stackdriver. We will provide more information by Sunday, 2018-05-20 20:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4.

The issue with external IP allocation in us-east4 has been resolved as of Saturday, 2018-05-19 11:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4.

Allocation of new external IP addresses for GCE instance creation continues to be unavailable in us-east4. For everyone who is affected, we apologize for the disruption. We will provide an update by Saturday, 2018-05-19 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4.

We are experiencing an issue with Google Cloud Networking and Google Compute Engine in us-east4 that prevents the creation of GCE instances that require allocation of new external IP addresses. For everyone who is affected, we apologize for the disruption. We will provide an update by Saturday, 2018-05-19 06:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4. We will provide more information by Friday, 2018-05-18 22:00 US/Pacific.

We are experiencing an issue with Google Cloud Networking and Google Compute Engine in us-east4 that prevents the creation of GCE instances with external IP addresses attached. Early investigation indicates that all instances in us-east4 are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by Friday, 2018-05-18 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18010 - We are investigating an issue with Google Cloud Networking in us-east4. We will provide more information by Friday, 2018-05-18 22:00 US/Pacific.

We are investigating an issue with Google Cloud Networking in us-east4. We will provide more information by Friday, 2018-05-18 22:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18036 - Multiple failing BigQuery job types

ISSUE SUMMARY

On Wednesday 16 May 2018, Google BigQuery experienced failures of import, export, and query jobs for a duration of 88 minutes over two time periods (55 minutes initially, and 33 minutes in the second period, which was isolated to the EU). We sincerely apologize to all of our affected customers; this is not the level of reliability we aim to provide in our products. We will be issuing SLA credits to customers who were affected by this incident and we are taking immediate steps to prevent a future recurrence of these failures.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 16 May 2018 from 16:00 to 16:55 and from 17:45 to 18:18 PDT, Google BigQuery experienced a failure of some import, export, and query jobs. During the first period of impact there was a 15.26% job failure rate; during the second, which was isolated to the EU, there was a 2.23% error rate. Affected jobs would have failed with INTERNAL_ERROR as the reason.

ROOT CAUSE

Configuration changes being rolled out on the evening of the incident were not applied in the intended order. This resulted in an incomplete configuration change becoming live in some zones, subsequently triggering the failure of customer jobs. During the process of rolling back the configuration, another incorrect configuration change was inadvertently applied, causing the second batch of job failures.

REMEDIATION AND PREVENTION

Automated monitoring alerted engineering teams 15 minutes after the error threshold was met, and they were able to correlate the errors with the configuration change 3 minutes later. We feel that the configured alert delay is too long and have lowered it to 5 minutes in order to aid quicker detection. During the rollback attempt, another bad configuration change was enqueued for automatic rollout and, when unblocked, proceeded to roll out, triggering the second round of job failures. To prevent this from happening in the future, we are working to ensure that rollouts are automatically switched to manual mode when engineers are responding to production incidents. In addition, we're switching to a different configuration system which will ensure the consistency of configuration at all stages of the rollout.
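
Affected jobs failed with INTERNAL_ERROR as the reason, which is visible in the job metadata. A hedged sketch for inspecting a failed job with the bq CLI (JOB_ID is a placeholder):

# Show the failed job's status, including the errorResult reason
bq show -j --format=prettyjson JOB_ID | grep -A 2 '"errorResult"'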

Last Update: A few months ago

RESOLVED: Incident 18001 - Issue affecting Google Cloud Function customers' ability to create and update functions.

The issue with Google Cloud Functions affecting the ability to create and update functions has been resolved for all affected users as of Thursday, 2018-05-17 13:01 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - Issue affecting Google Cloud Function customers' ability to create and update functions.

The rate of errors is decreasing. We will provide another status update by Thursday, 2018-05-17 13:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - Issue affecting the ability to update and create Cloud Functions.

Issue affecting the ability to update and create Cloud Functions.

Last Update: A few months ago

UPDATE: Incident 18001 - Issue affecting the ability to update and create Cloud Functions.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Thursday, 2018-05-17 13:10 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18004 - The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region has been resolved for all affected users as of Wednesday, 2018-05-16 19:40 US/Pacific.

The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region has been resolved for all affected users as of Wednesday, 2018-05-16 19:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 18009 - GCE Networking issue in us-east4

The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region has been resolved for all affected users as of Wednesday, 2018-05-16 19:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18009 - GCE Networking issue in us-east4

The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Wednesday, 2018-05-16 20:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18004 - The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region should be resolved for the majority of users and we expect a full resolution in the near...

The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Wednesday, 2018-05-16 20:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18009 - GCE Networking issue in us-east4

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Wednesday, 2018-05-16 20:10 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18004 - GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss. Mitigation is underway.

Mitigation work is currently underway by our Engineering Team. We will provide another status update by Wednesday, 2018-05-16 20:10 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18009 - GCE Networking issue in us-east4

We are investigating an issue with GCE Networking in us-east4 region affecting GCE VMs, GKE, Cloud VPN and Cloud Private Interconnect resulting in network packet loss. We will provide more information by Wednesday, 2018-05-16 19:43 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18004 - GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss.

GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss.

Last Update: A few months ago

UPDATE: Incident 18004 - GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss.

We are investigating an issue with GCE Networking in us-east4 region affecting GCE VMs, Cloud VPN and Cloud Private Interconnect resulting in network packet loss. We will provide more information by Wednesday, 2018-05-16 19:43 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18036 - Multiple failing BigQuery job types

The issue with Google BigQuery has been resolved for all affected users as of 2018-05-16 17:06 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18036 - Multiple failing BigQuery job types

We are rolling back a configuration change to mitigate this issue. We will provide another status update by Wednesday 2018-05-16 17:21 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18007 - We are investigating an issue with increased packet loss in us-central1 with Google Cloud Networking.

ISSUE SUMMARY

On Wednesday 2 May, 2018 Google Cloud Networking experienced increased packet loss to the internet as well as other Google regions from the us-central1 region for a duration of 21 minutes. We understand that the network is a critical component that binds all services together. We have conducted an internal investigation and are taking steps to improve our service.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 2 May, 2018 from 13:47 to 14:08 PDT, traffic between all zones in the us-central1 region and all destinations experienced 12% packet loss. Traffic between us-central1 zones experienced 22% packet loss. Customers may have seen requests succeed to services hosted in us-central1 because loss was not evenly distributed: some connections did not experience any loss, while others experienced 100% packet loss.

ROOT CAUSE

A control plane is used to manage configuration changes to the network fabric connecting zones in us-central1 to each other as well as the Internet. On Wednesday 2 May, 2018 Google Cloud Network engineering began deploying a configuration change using the control plane as part of planned maintenance work. During the deployment, a bad configuration was generated that blackholed a portion of the traffic flowing over the fabric. The control plane had a bug which caused it to produce an incorrect configuration. New configurations deployed to the network fabric are evaluated for correctness, and regenerated if an error is found. In this case, the configuration error appeared after the configuration was evaluated, which resulted in deploying the erroneous configuration to the network fabric.

REMEDIATION AND PREVENTION

Automated monitoring alerted engineering teams 2 minutes after the loss started. Google engineers correlated the alerts to the configuration push and routed traffic away from the affected part of the fabric. Mitigation completed 21 minutes after loss began, ending impact to customers. After isolating the root cause, engineers then audited all configuration changes that were generated by the control plane and replaced them with known-good configurations. To prevent this from recurring, we will correct the control plane defect that generated the incorrect configuration and are adding additional validation at the fabric layer in order to more robustly detect configuration errors. Additionally, we intend to add logic to the network control plane so that it can self-heal by automatically routing traffic away from the parts of the network fabric that are in an error state. Finally, we plan to evaluate further isolation of control plane configuration changes to reduce the size of the possible failure domain. Again, we would like to apologize for this issue. We are taking immediate steps to improve the platform’s performance and availability.
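
The prevention items above amount to validating a generated configuration both when it is produced and again at the boundary where it is applied, so an error introduced after the first evaluation is still caught. A minimal sketch of that double-check pattern, with a toy validate() rule standing in for the real correctness evaluation (all names here are illustrative, not Google's actual control plane):

    def validate(config):
        # Toy correctness rule: every destination prefix must have at least one next hop.
        return all(hops for hops in config.get("routes", {}).values())


    def generate_config(intent):
        # Stand-in for the control-plane step that turns intent into fabric config.
        return {"routes": dict(intent)}


    def deploy(intent, apply_to_fabric):
        config = generate_config(intent)
        if not validate(config):    # first check, at generation time
            return False
        # ...the config may be transformed or serialized between these two points...
        if not validate(config):    # second check, at the fabric boundary, before applying
            return False
        apply_to_fabric(config)
        return True


    if __name__ == "__main__":
        apply_to_fabric = lambda cfg: print("applied", cfg)
        print(deploy({"10.0.0.0/8": ["edge-1"]}, apply_to_fabric))  # True
        print(deploy({"10.0.0.0/8": []}, apply_to_fabric))          # False: blackholing config rejected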

Last Update: A few months ago

RESOLVED: Incident 18008 - We've received a report of connectivity issues from GCE instances.

The network connectivity issues from GCE instances have been resolved for all affected users as of 2018-05-07 10:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18008 - We've received a report of connectivity issues from GCE instances. Our Engineering Team believes they have identified the root cause of the errors and is working to mitigate. We will provide another s...

We are investigating connectivity issues from GCE instances. We will provide more information by 2018-05-07 10:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18007 - We are investigating an issue with increased packet loss in us-central1 with Google Cloud Networking.

The issue with Google Cloud Networking having increased packet loss in us-central1 has been resolved for all affected users as of Wednesday, 2018-05-02 14:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18007 - We are investigating an issue with increased packet loss in us-central1 with Google Cloud Networking.

We are investigating an issue with Google Cloud Networking. We will provide more information by 14:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - The Cloud Shell availability issue has been resolved for all affected users as of 2018-05-02 08:56 US/Pacific.

The Cloud Shell availability issue has been resolved for all affected users as of 2018-05-02 08:56 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Cloud Shell. We will provide more information by 09:15 US/Pacific.

Our Engineering Team believes they have identified the potential root cause of the issue. We will provide another status update by 09:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Cloud Shell. We will provide more information by 08:45 US/Pacific.

We are investigating an issue with Cloud Shell. We will provide more information by 08:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18006 - Users may be experiencing increased error rates when accessing the Stackdriver web UI

The issue with the Stackdriver web UI returning 500 and 502 error codes has been resolved for all affected users as of Monday, 2018-04-30 13:01 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18006 - Users may be experiencing increased error rates when accessing the Stackdriver web UI

Users may be experiencing increased error rates when accessing the Stackdriver web UI

Last Update: A few months ago

UPDATE: Incident 18006 - Users may be experiencing increased error rates when accessing the Stackdriver web UI

Mitigations are proving to be effective; error rates of 500s and 502s are decreasing, though they are still elevated. We will provide more information by Monday, 2018-04-30 14:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

The issue with Google Cloud Pub/Sub message delivery has been resolved for all affected users as of 09:04 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

The mitigation work is currently underway by our Engineering Team and appears to be working. The messages that were delayed during this incident are starting to be delivered. We expect a full resolution in the near future.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

Our Engineering Team believes they have identified the root cause of the Cloud Pub/Sub message delivery issues and is working to mitigate.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Pub/Sub that is affecting message delivery in some regions.

We are still investigating message delivery issues with Google Cloud Pub/Sub. Our Engineering Team is investigating possible causes. We will provide another status update by 7:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18001 - We are investigating an issue with Google Cloud Dataflow, Dataproc, GCE and GCR.

The issue with Dataflow, Dataproc, Compute Engine and GCR has been resolved for all affected users as of 19:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Dataflow, Dataproc, GCE and GCR. We will provide more information by 20:00 US/Pacific.

We are investigating an issue with Google Cloud Dataflow, Dataproc, GCE and GCR. We will provide more information by 20:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Dataflow, Dataproc, GCE and GCR. We will provide more information by 20:00 US/Pacific.

The issue with Dataflow, Dataproc, Compute Engine and GCR has been resolved for all affected users as of 19:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 18005 - The issue with StackDriver Logging GCS Exports is resolved. We have finished processing the backlog of GCS Export jobs.

The issue with StackDriver Logging GCS Exports has been resolved for all affected users. We have completed processing the backlog of GCS export jobs. An internal investigation has been started to make the appropriate improvements to our systems and help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18005 - The issue with Google StackDriver Logging GCS Export is resolved. We are currently processing the backlog.

The issue with StackDriver Logging GCS Export is mitigated and backlog processing is ongoing. We will provide another update by 08:00 US/Pacific with current status of the backlog.

Last Update: A few months ago

UPDATE: Incident 18005 - The issue with Google StackDriver Logging GCS Export is mitigated.

The issue with StackDriver Logging GCS Export is mitigated and backlog processing is ongoing. We will provide another update by 01:00 US/Pacific with current status of the backlog.

Last Update: A few months ago

UPDATE: Incident 18005 - The issue with Google StackDriver Logging GCS Export is mitigated. We will provide an update by 22:00 US/Pacific.

Our Engineering Team believes they have identified and mitigated the root cause of the delay on StackDriver Logging GCS Export service. We are actively working to process the backlogs. We will provide an update by 22:00 US/Pacific with current progress.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google StackDriver Logging GCS Export. We will provide more information by 18:45 US/Pacific.

Mitigation work is still underway by our Engineering Team to address the delay issue with Google StackDriver Logging GCS Export. We will provide more information by 18:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google StackDriver Logging GCS Export. We will provide more information by 17:45 US/Pacific.

Mitigation work is currently underway by our Engineering Team to address the issue with Google StackDriver Logging GCS Export. We will provide more information by 17:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google StackDriver Logging GCS Export. We will provide more information by 17:15 US/Pacific.

We are investigating an issue with Google StackDriver Logging GCS Export. We will provide more information by 17:15 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We've received a report of an issue with Google App Engine as of 2018-03-29 04:52 US/Pacific. We will provide more information by 05:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - We've received a report of an issue with Google App Engine as of 2018-03-29 04:52 US/Pacific. We will provide more information by 05:30 US/Pacific.

The issue with Cloud Datastore in europe-west2 has been resolved for all affected users as of 5:03 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18003 - We've received a report of an issue with Google Compute Engine as of 2018-03-16 11:32 US/Pacific. We will provide more information by 12:15 US/Pacific.

The issue with slow network programming should be resolved for all zones in us-east1 as of 12:44 US/Pacific. The root cause has been identified and we are working to prevent a recurrence. We will provide more information by 14:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18003 - We've received a report of an issue with Google Compute Engine as of 2018-03-16 11:32 US/Pacific. We will provide more information by 12:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - We are investigating an issue with Google Cloud Shell as of 2018-03-13 17:44 US/Pacific.

The issue with Cloud Shell has been resolved for all affected users as of Tue 2018-03-13 19:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Shell as of 2018-03-13 17:44 US/Pacific. We will provide more information by 19:30 US/Pacific.

We are still investigating an issue with Google Cloud Shell starting 2018-03-13 17:44 US/Pacific. We will provide more information by 19:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Shell as of 2018-03-13 17:44 US/Pacific. We will provide more information by 18:30 US/Pacific.

We are investigating an issue with Google Cloud Shell as of 2018-03-13 17:44 US/Pacific. We will provide more information by 18:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We've received a report of an issue with Google Compute Engine as of 2018-03-10 21:06 US/Pacific. Mitigation work is currently underway by our Engineering Team. We will provide more information by 22:...

Last Update: A few months ago

RESOLVED: Incident 18005 - We are investigating an issue with Google Cloud Networking. We will provide more information by 22:30 US/Pacific.

The issue with Google Cloud Networking intermittent traffic disruption to and from us-central has been resolved for all affected users as of 2018-02-23 22:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18006 - The issue with Google Cloud Networking intermittent traffic disruption to and from us-central has been resolved for all affected users as of 2018-02-23 22:35 US/Pacific.

The issue with Google Cloud Networking intermittent traffic disruption to and from us-central has been resolved for all affected users as of 2018-02-23 22:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18006 - The issue with Google Cloud Networking intermittent traffic disruption to and from us-central should now show signs of recovery. We will provide another status update by Friday, 2018-02-23 22:40 US/Pa...

The issue with Google Cloud Networking causing intermittent traffic disruption to and from us-central should now show signs of recovery for the majority of users, and we expect a full resolution in the near future. We will provide another status update by Friday, 2018-02-23 22:40 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18006 - We are investigating an issue with Google Cloud Networking. The issue started around Friday, 2018-02-23 21:40 US/Pacific. This affects traffic to and from us-central. We are rolling out a potential fi...

Last Update: A few months ago

UPDATE: Incident 18006 - We are investigating an issue with Google Cloud Networking. The issue started around Friday, 2018-02-23 21:40 US/Pacific. This affects traffic to and from us-central. We are rolling out a potential fi...

We are investigating an issue with Google Cloud Networking. The issue started around Friday, 2018-02-23 21:40 US/Pacific. This affects traffic to and from us-central. We are rolling out a potential fix to mitigate this issue. We will provide more information by Friday, 2018-02-23 22:40 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google Cloud Networking. We will provide more information by 22:30 US/Pacific.

We are rolling out a potential fix to mitigate this issue. The affected region seems to be us-central.

Last Update: A few months ago

UPDATE: Incident 18005 - We are investigating an issue with Google Cloud Networking. We will provide more information by 22:30 US/Pacific.

We are investigating an issue with Google Cloud Networking. We will provide more information by 22:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - App Engine Admin-API experiencing high error rates

On Thursday 15 February 2018, specific Google Cloud Platform services experienced elevated errors and latency for a period of 62 minutes from 11:42 to 12:44 PST. The following services were impacted: Cloud Datastore experienced a 4% error rate for get calls and an 88% error rate for put calls. App Engine's serving infrastructure, which is responsible for routing requests to instances, experienced a 45% error rate, most of which were timeouts. App Engine Task Queues would not accept new transactional tasks, and also would not accept new tasks in regions outside us-central1 and europe-west1. Tasks continued to be dispatched during the event but saw start delays of 0-30 minutes; additionally, a fraction of tasks executed with errors due to the aforementioned Cloud Datastore and App Engine performance issues. App Engine Memcache calls experienced a 5% error rate. App Engine Admin API write calls failed during the incident, causing unsuccessful application deployments. App Engine Admin API read calls experienced a 13% error rate. App Engine Search API index writes failed during the incident though search queries did not experience elevated errors. Stackdriver Logging experienced delays exporting logs to systems including Cloud Console Logs Viewer, BigQuery and Cloud Pub/Sub. Stackdriver Logging retries on failure so no logs were lost during the incident. Logs-based Metrics failed to post some points during the incident. We apologize for the impact of this outage on your application or service. For Google Cloud Platform customers who rely on the products which were part of this event, the impact was substantial and we recognize that it caused significant disruption for those customers. We are conducting a detailed post-mortem to ensure that all the root and contributing causes of this event are understood and addressed promptly.
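
The note above that Stackdriver Logging retries on failure, so no logs were lost, reflects a standard retry-with-backoff pattern for exporters: transient failures delay delivery rather than dropping data. A minimal sketch of that pattern, assuming a hypothetical export_batch() callable and made-up backoff parameters (this is not the Stackdriver implementation):

    import random
    import time


    def export_with_retry(export_batch, batch, max_delay=60.0):
        """Keep retrying a failed export with jittered exponential backoff.

        The batch is never discarded, so transient failures delay delivery
        rather than losing log entries.
        """
        delay = 0.5
        while True:
            try:
                export_batch(batch)
                return
            except Exception:
                time.sleep(delay + random.uniform(0, delay))  # wait, then try again
                delay = min(delay * 2, max_delay)


    if __name__ == "__main__":
        attempts = {"n": 0}

        def flaky_export(batch):
            attempts["n"] += 1
            if attempts["n"] < 3:
                raise RuntimeError("downstream unavailable")
            print("exported", len(batch), "entries after", attempts["n"], "attempts")

        export_with_retry(flaky_export, ["log line"] * 10)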

Last Update: A few months ago

RESOLVED: Incident 18035 - Bigquery experiencing high latency rates

On Thursday 15 February 2018, specific Google Cloud Platform services experienced elevated errors and latency for a period of 62 minutes from 11:42 to 12:44 PST. The following services were impacted: Cloud Datastore experienced a 4% error rate for get calls and an 88% error rate for put calls. App Engine's serving infrastructure, which is responsible for routing requests to instances, experienced a 45% error rate, most of which were timeouts. App Engine Task Queues would not accept new transactional tasks, and also would not accept new tasks in regions outside us-central1 and europe-west1. Tasks continued to be dispatched during the event but saw start delays of 0-30 minutes; additionally, a fraction of tasks executed with errors due to the aforementioned Cloud Datastore and App Engine performance issues. App Engine Memcache calls experienced a 5% error rate. App Engine Admin API write calls failed during the incident, causing unsuccessful application deployments. App Engine Admin API read calls experienced a 13% error rate. App Engine Search API index writes failed during the incident though search queries did not experience elevated errors. Stackdriver Logging experienced delays exporting logs to systems including Cloud Console Logs Viewer, BigQuery and Cloud Pub/Sub. Stackdriver Logging retries on failure so no logs were lost during the incident. Logs-based Metrics failed to post some points during the incident. We apologize for the impact of this outage on your application or service. For Google Cloud Platform customers who rely on the products which were part of this event, the impact was substantial and we recognize that it caused significant disruption for those customers. We are conducting a detailed post-mortem to ensure that all the root and contributing causes of this event are understood and addressed promptly.

Last Update: A few months ago

RESOLVED: Incident 18003 - App Engine seeing elevated error rates

On Thursday 15 February 2018, specific Google Cloud Platform services experienced elevated errors and latency for a period of 62 minutes from 11:42 to 12:44 PST. The following services were impacted: Cloud Datastore experienced a 4% error rate for get calls and an 88% error rate for put calls. App Engine's serving infrastructure, which is responsible for routing requests to instances, experienced a 45% error rate, most of which were timeouts. App Engine Task Queues would not accept new transactional tasks, and also would not accept new tasks in regions outside us-central1 and europe-west1. Tasks continued to be dispatched during the event but saw start delays of 0-30 minutes; additionally, a fraction of tasks executed with errors due to the aforementioned Cloud Datastore and App Engine performance issues. App Engine Memcache calls experienced a 5% error rate. App Engine Admin API write calls failed during the incident, causing unsuccessful application deployments. App Engine Admin API read calls experienced a 13% error rate. App Engine Search API index writes failed during the incident though search queries did not experience elevated errors. Stackdriver Logging experienced delays exporting logs to systems including Cloud Console Logs Viewer, BigQuery and Cloud Pub/Sub. Stackdriver Logging retries on failure so no logs were lost during the incident. Logs-based Metrics failed to post some points during the incident. We apologize for the impact of this outage on your application or service. For Google Cloud Platform customers who rely on the products which were part of this event, the impact was substantial and we recognize that it caused significant disruption for those customers. We are conducting a detailed post-mortem to ensure that all the root and contributing causes of this event are understood and addressed promptly.

Last Update: A few months ago

RESOLVED: Incident 18001 - Cloud PubSub experiencing missing subscription metrics. Additionally, some Dataflow jobs with PubSub sources appear not to consume any messages.

The issue with Cloud PubSub causing watermark increase in Dataflow jobs has been resolved for all affected projects as of Tue, 2018-02-20 05:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - Cloud PubSub experiencing missing subscription metrics. Additionally, some Dataflow jobs with PubSub sources appear not to consume any messages.

The watermarks of the affected Dataflow jobs using PubSub are now returning to normal.

Last Update: A few months ago

UPDATE: Incident 18001 - Cloud PubSub experiencing missing subscription metrics. Additionally, some Dataflow jobs with PubSub sources appear not to consume any messages.

We are experiencing an issue with Cloud PubSub beginning at approximately 20:00 US/Pacific on 2018-02-19. Early investigation indicates that approximately 10-15% of Dataflow jobs are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by 05:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - We are still investigating an issue with Google Cloud Pub/Sub. We will provide more information by 04:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are still investigating an issue with Google Cloud Pub/Sub. We will provide more information by 04:00 US/Pacific.

We are investigating an issue with Cloud PubSub. We will provide more information by 04:00 AM US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18003 - We are investigating an issue with Google Cloud Networking affecting connectivity in us-central1 and europe-west3. We will provide more information by 12:15pm US/Pacific.

ISSUE SUMMARY

On Sunday 18 January 2018, Google Compute Engine networking experienced a network programming failure. The two impacts of this incident included the autoscaler not scaling instance groups, as well as migrated and newly-created VMs not communicating with VMs in other zones, for a duration of up to 93 minutes. We apologize for the impact this event had on your applications and projects, and we will carefully investigate the causes and implement measures to prevent recurrences.

DETAILED DESCRIPTION OF IMPACT

On Sunday 18 January 2018, Google Compute Engine network provisioning updates failed in the following zones:

- europe-west3-a for 34 minutes (09:52 AM to 10:21 AM PT)
- us-central1-b for 79 minutes (09:57 AM to 11:16 AM PT)
- asia-northeast1-a for 93 minutes (09:53 AM to 11:26 AM PT)

Propagation of Google Compute Engine networking configuration for newly created and migrated VMs is handled by two components. The first is responsible for providing a complete list of VMs, networks, firewall rules, and scaling decisions. The second component provides a stream of updates for the components in a specific zone. During the affected period, the first component failed to return data. VMs in the affected zones were unable to communicate with newly-created or migrated VMs in another zone in the same private GCE network. VMs in the same zone were unaffected because they are updated by the streaming component. The autoscaler service also relies upon data from the failed first component to scale instance groups; without updates from that component, it could not make scaling decisions for the affected zones.

ROOT CAUSE

A stuck process failed to provide updates to the Compute Engine control plane. Automatic failover was unable to force-stop the process, and manual failover was required to restore normal operation.

REMEDIATION AND PREVENTION

The engineering team was alerted when the propagation of network configuration information stalled. They manually failed over to the replacement task to restore normal operation of the data persistence layer. To prevent another occurrence of this incident, we are taking the following actions:

- Stop VM migrations if the configuration data is stale (see the sketch below).
- Modify the data persistence layer to re-resolve its peers during long-running processes, to allow failover to replacement tasks.
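
One of the prevention items above is to stop VM migrations when the configuration data is stale. A minimal sketch of such a freshness gate, assuming a hypothetical 120-second budget and illustrative names (this is not the actual GCE control plane):

    import time

    MAX_CONFIG_AGE_SECONDS = 120  # assumed freshness budget for this sketch


    class NetworkConfigCache:
        def __init__(self):
            self.snapshot = None
            self.last_refresh = 0.0

        def refresh(self, snapshot):
            # Called whenever the control plane delivers a full configuration snapshot.
            self.snapshot = snapshot
            self.last_refresh = time.time()

        def is_fresh(self):
            return (time.time() - self.last_refresh) <= MAX_CONFIG_AGE_SECONDS


    def maybe_migrate(vm_id, cache):
        if not cache.is_fresh():
            # Stale data: a migrated VM might not be reachable from other zones, so hold it.
            print(f"holding migration of {vm_id}: network config is stale")
            return False
        print(f"migrating {vm_id}")
        return True


    if __name__ == "__main__":
        cache = NetworkConfigCache()
        maybe_migrate("vm-1", cache)            # held: no snapshot received yet
        cache.refresh({"vm-1": "10.128.0.2"})
        maybe_migrate("vm-1", cache)            # proceeds: snapshot is fresh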

Last Update: A few months ago

RESOLVED: Incident 18003 - App Engine seeing elevated error rates

The issue with App Engine has been resolved for all affected projects as of 12:44 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18001 - Datastore experiencing elevated error rates

We continue to see significant improvement across all Datastore services. We are continuing to monitor and will provide another update by 15:00 PST.

Last Update: A few months ago

RESOLVED: Incident 18001 - App Engine Admin-API experiencing high error rates

The issue with App Engine Admin API has been resolved for all affected users as of Thursday, 2018-02-15 13:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18003 - Stackdriver Logging Service Degraded

The issue with Google Stackdriver has been resolved for all affected projects as of 13:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18004 - We are investigating an issue with Google Stackdriver. We will provide more information by 13:15 US/Pacific.

The issue with Google Stackdriver has been resolved for all affected projects as of 13:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18004 - We are investigating an issue with Google Stackdriver. We will provide more information by 13:15 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - Datastore experiencing elevated error rates

We are seeing a near return to baseline; however, we aren't seeing a consistent view of our quota and are investigating. We will provide another update by roughly 13:45 PST.

Last Update: A few months ago

RESOLVED: Incident 18035 - Bigquery experiencing high latency rates

The issue with Bigquery has been resolved for all affected projects as of Thursday, 2018-02-15 13:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18003 - App Engine seeing elevated error rates

We're seeing widespread improvement in error rates in many / most regions since ~12:40 PST. We're continuing to investigate and will provide another update by 13:30 PST.

Last Update: A few months ago

UPDATE: Incident 18003 - Stackdriver Logging Service Degraded

We are investigating an issue with Stackdriver Logging Service. We will provide more information by 13:25 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18035 - Bigquery experiencing high latency rates

We are investigating an issue with Bigquery. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18003 - App Engine seeing elevated error rates

We are investigating an issue with App Engine. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - Datastore experiencing elevated error rates

We are investigating an issue with Datastore. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - The issue with reduced URLFetch availability has been resolved for all affected projects as of 2018-02-10 18:55 US/Pacific. We will provide a more detailed analysis of this incident once we have compl...

The issue with reduced URLFetch availability has been resolved for all affected projects as of 2018-02-10 18:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google App Engine reduced URLFetch availability starting at 17:15pm US/Pacific. We are currently rolling out a configuration change to mitigate this issue. We will ...

We are investigating an issue with Google App Engine reduced URLFetch availability starting at 17:15 US/Pacific. We are currently rolling out a configuration change to mitigate this issue. We will provide another status update by 2018-02-10 19:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google App Engine reduced URLFetch availability starting at 17:15pm PT. We will provide more information by 19:00 US/Pacific.

We are investigating an issue with Google App Engine reduced URLFetch availability starting at 17:15 PT. We will provide more information by 19:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18002 - We are investigating an issue with Google Kubernetes Engine. We will provide more information by 20:15 US/Pacific.

On Wednesday 31 January 2018, some Google Cloud services experienced elevated errors and latency on operations that required inter-data center network traffic for a duration of 72 minutes. The impact was visible during three windows: between 18:20 and 19:08 PST, between 19:10 and 19:29, and again between 19:45 and 19:50. Network traffic between the public internet and Google's data centers was not affected by this incident. The root cause of this incident was an error in a configuration update to the system that allocates network capacity for traffic between Google data centers. To prevent a recurrence, we will improve the automated checks that we run on configuration changes to detect problems before release. We will be improving the monitoring of the canary to detect problems before global rollout of changes to the configuration.
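
The prevention items above describe checking configuration changes automatically and monitoring a canary before a global rollout. A minimal sketch of that gate, with hypothetical apply() and error_rate() callables and an assumed health threshold (illustrative only, not Google's rollout system):

    def rollout(config, sites, apply, error_rate, max_canary_error_rate=0.01):
        """Apply a config to one canary site first; go global only if it stays healthy."""
        canary, rest = sites[0], sites[1:]
        apply(canary, config)
        if error_rate(canary) > max_canary_error_rate:
            # Canary looks unhealthy: stop before the change reaches any other site.
            return False
        for site in rest:
            apply(site, config)
        return True


    if __name__ == "__main__":
        observed = {"dc-a": 0.002, "dc-b": 0.001, "dc-c": 0.001}
        ok = rollout(
            config={"capacity": "v42"},
            sites=["dc-a", "dc-b", "dc-c"],
            apply=lambda site, cfg: print("applied", cfg, "to", site),
            error_rate=lambda site: observed[site],
        )
        print("rolled out globally:", ok)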

Last Update: A few months ago

RESOLVED: Incident 18001 - The issue with Google Stackdriver Logging has been resolved for all affected projects as of 20:02 US/Pacific.

On Wednesday 31 January 2018, some Google Cloud services experienced elevated errors and latency on operations that required inter-data center network traffic for a duration of 72 minutes. The impact was visible during three windows: between 18:20 and 19:08 PST, between 19:10 and 19:29, and again between 19:45 and 19:50. Network traffic between the public internet and Google's data centers was not affected by this incident. The root cause of this incident was an error in a configuration update to the system that allocates network capacity for traffic between Google data centers. To prevent a recurrence, we will improve the automated checks that we run on configuration changes to detect problems before release. We will be improving the monitoring of the canary to detect problems before global rollout of changes to the configuration.

Last Update: A few months ago

RESOLVED: Incident 18001 - The issue with Google App Engine services has been resolved for all affected projects as of 21:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements ...

On Wednesday 31 January 2018, some Google Cloud services experienced elevated errors and latency on operations that required inter-data center network traffic for a duration of 72 minutes. The impact was visible during three windows: between 18:20 and 19:08 PST, between 19:10 and 19:29, and again between 19:45 and 19:50. Network traffic between the public internet and Google's data centers was not affected by this incident. The root cause of this incident was an error in a configuration update to the system that allocates network capacity for traffic between Google data centers. To prevent a recurrence, we will improve the automated checks that we run on configuration changes to detect problems before release. We will be improving the monitoring of the canary to detect problems before global rollout of changes to the configuration.

Last Update: A few months ago

UPDATE: Incident 18003 - We are investigating an issue with Google Container Engine that is affecting cluster creation and upgrade. We will provide more information by 11:45 US/Pacific.

The issue with cluster creation and upgrade has been resolved for all affected projects as of 11:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18003 - We are investigating an issue with Google Container Engine that is affecting cluster creation and upgrade. We will provide more information by 11:45 US/Pacific.

We are investigating an issue with Google Container Engine that is affecting cluster creation and upgrade. We will provide more information by 11:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Stackdriver Trace API.

The issue with elevated failure rates in the Stackdriver Trace API has been resolved for all affected projects as of Friday, 2018-02-02 09:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Stackdriver Trace API.

Stackdriver Trace API continues to exhibit an elevated rate of request failures. Our engineering teams have put a mitigation in place, and will proceed to address the cause of the issue. We will provide another update by 10:45 AM US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Stackdriver Trace API.

StackDriver Trace API is experiencing an elevated rate of request failures. There is no workaround at this time. Our engineering teams have identified the cause of the issue and are working on a mitigation. We will provide another update by 9:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Stackdriver Trace API. We will provide more information by 09:00 US/Pacific.

We are investigating an issue with Google Stackdriver Trace API. We will provide more information by 09:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

On Thursday 25 January 2018, while expanding the Google network serving the us-central1 region, a configuration change unexpectedly triggered packet loss and reduced network bandwidth for inter-region data transfer and replication traffic to and from the region. The network impact was observable during two windows, between 11:03 and 12:40 PST, and again between 14:27 and 15:27 PST. The principal user-visible impact was a degradation in the performance of some Google Cloud services that require cross data center traffic. There was no impact to network traffic between the us-central1 region and the internet, or to traffic between Compute Engine VM instances. We sincerely apologize for the impact of this incident on your application or service. We have performed a detailed analysis of root cause and taken careful steps to ensure that this type of incident will not recur.

Last Update: A few months ago

UPDATE: Incident 18001 - We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation for most products has been completely resolved but some products ...

The issue with Google App Engine has been resolved for all affected projects as of 21:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We experienced an issue with Google Cloud Storage, beginning at approximately 18:30 US/Pacific. The situation has been completely resolved by 20:08 PST.

We experienced an issue with Google Cloud Storage, beginning at approximately 18:30 US/Pacific. The situation has been completely resolved by 20:08 PST.

Last Update: A few months ago

UPDATE: Incident 18001 - The issue with Google Stackdriver Logging has been resolved for all affected projects as of 20:02 US/Pacific.

The issue with Google Stackdriver Logging has been resolved for all affected projects as of 20:02 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. For everyone who is affected, we apologize for the disruption.

Last Update: A few months ago

UPDATE: Incident 18001 - We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation is improving for most products, but some products are...

Google App Engine has mostly recovered and our engineering team is working to completely mitigate the issue. We will provide an update by 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Kubernetes Engine. We will provide more information by 20:15 US/Pacific.

The issue with Google Kubernetes Engine has been resolved for all affected projects as of 20:33 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. For everyone who is affected, we apologize for the disruption.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Stackdriver Logging. We will provide more information by 20:00 US/Pacific.

The issue with Google Stackdriver Logging should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 20:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Container Engine. We will provide more information by 20:15 US/Pacific.

The issue with Google Kubernetes Engine should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 20:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation is improving for most products, but some products are...

Google App Engine is recovering and our engineering team is working to completely mitigate the issue. We will provide an update by 20:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation is improving for most products, but some products are...

We are experiencing an issue with multiple Google Cloud Platform Services, beginning at approximately 18:30 US/Pacific. The situation is improving for most products, but some products are still reporting errors. For everyone who is affected, we apologize for the disruption. We will provide an update by 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Compute Engine. We will provide more information by 20:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Compute Engine. We will provide more information by 20:30 US/Pacific.

We are investigating an issue with Google Compute Engine. We will provide more information by 20:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Container Engine. We will provide more information by 20:15 US/Pacific.

We are investigating an issue with Google Container Engine. We will provide more information by 20:15 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Stackdriver Logging. We will provide more information by 20:00 US/Pacific.

We are investigating an issue with Google Stackdriver Logging. We will provide more information by 20:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 13:00 US/Pacific.

We have updated our estimated impact to an average of 2% and a peak of 3.6% GCS global error rate, based on a more thorough review of monitoring data and server logs. The initial estimate of impact was based on an internal assessment hosted in a single region; subsequent investigation reveals that Google's redundancy and rerouting infrastructure worked as intended and dramatically reduced the user-visible impact of the event to GCS's global user base. The 2% average error rate is measured over the duration of the event, from its beginning at 11:05 PST to its conclusion at 12:24 PST.
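
For reference, an average error rate over an event window like the 2% figure above can be derived by weighting each interval by its request volume, while the peak is the worst single interval. A tiny sketch of that arithmetic; the per-minute counts below are invented for illustration and are not the actual incident data:

    # (errors, total_requests) per minute across the event window; values are invented.
    per_minute = [(150, 10_000), (420, 11_800), (90, 9_700), (60, 10_200)]

    errors = sum(e for e, _ in per_minute)
    total = sum(t for _, t in per_minute)
    print(f"average error rate over the window: {errors / total:.2%}")
    print(f"worst single minute: {max(e / t for e, t in per_minute):.2%}")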

Last Update: A few months ago

RESOLVED: Incident 18004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

The issue with Google Cloud Networking has been resolved for all affected users as of 2018-01-25 13:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

The issue with Google Cloud Networking should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 2018-01-25 14:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18001 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 13:00 US/Pacific.

The issue with Google Cloud Storage error rates has been resolved for all affected users as of 12:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18034 - We are investigating an issue with BigQuery in the US region. We will provide more information by 13:00 US/Pacific.

The issue with BigQuery error rates has been resolved for all affected users as of 12:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 13:00 US/Pacific.

The issue with Google Cloud Storage in the US regions should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 2018-01-25 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

We are investigating an issue with Google Cloud Networking. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 13:00 US/Pacific.

We are experiencing an issue with Google Cloud Storage beginning Thursday, 2018-01-25 11:23 US/Pacific. Current investigation indicates that approximately 100% of customers in the US region are affected and we expect that for affected users the service is mostly or entirely unavailable at this time. For everyone who is affected, we apologize for the disruption. We will provide an update by 2018-01-25 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18034 - We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 12:45 US/Pacific.

We are investigating an issue with Google Cloud Storage in the US region. We will provide more information by 12:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18003 - We are investigating an issue with Google Cloud Networking affecting connectivity in us-central1 and europe-west3. We will provide more information by 12:15pm US/Pacific.

The issue with Google Cloud Networking connectivity has been resolved for all affected zones in europe-west3, us-central1, and asia-northeast1 as of 11:26am US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18003 - We are investigating an issue with Google Cloud Networking affecting connectivity in us-central1 and europe-west3. We will provide more information by 12:15pm US/Pacific.

We are investigating an issue with Google Cloud Networking affecting connectivity in us-central1 and europe-west3. We will provide more information by 12:15pm US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18002 - We have resolved an issue with Google Cloud Networking.

The issue with packet loss from North and South America regions to Asia regions has been resolved for all affected users as of 1:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Networking. We will provide more information by 14:30 US/Pacific.

Our Engineering Team believes it has identified the root cause of the packet loss and is working to mitigate it.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:15 US/Pacific.

We are experiencing an issue with packet loss from Google North and South America regions to Asia regions beginning at 12:30 US/Pacific. Current data indicates that approximately 40% of packets are affected by this issue. For everyone who is affected, we apologize for the disruption. We will provide an update by 1:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18002 - We are investigating an issue with Google Cloud Networking. We will provide more information by 13:15 US/Pacific.

We are investigating an issue with Google Cloud Networking. We will provide more information by 13:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - Some GKE 'create/delete' cluster operations failing

The issue with GKE 'create/delete' cluster operations failing should be resolved for the majority of users. We will continue monitoring to confirm full resolution. We will provide more information by 11:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18001 - Some GKE 'create/delete' cluster operations failing

The issue with GKE 'create/delete' cluster operations failing has been resolved for all affected users as of 8:27 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - Some GKE 'create/delete' cluster operations failing

The issue with GKE 'create/delete' cluster operations failing should be resolved for the majority of users. We will continue monitoring to confirm full resolution. We will provide more information by 09:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - Some GKE 'create/delete' cluster operations failing

We are still investigating an issue with some GKE 'create/delete' cluster operations failing. We will provide more information by 08:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - GKE 'create cluster' operations failing

We are investigating an issue with GKE 'create cluster' operations failing. We will provide more information by 08:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - The issue with Google Cloud Load Balancing (GCLB) creation has been resolved for all affected projects as of 20:22 US/Pacific.

The issue with Google Cloud Load Balancing (GCLB) creation has been resolved for all affected projects as of 20:22 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 20:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 20:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 20:00 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 20:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 19:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 19:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 18:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 19:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 18:00 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 18:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 18:00 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 18:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 17:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 17:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 16:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 17:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 16:30 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 16:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Google Cloud Networking. We will provide more information by 16:00 US/Pacific.

We are investigating an issue with Google Cloud Networking that is preventing the creation of new GCLB load balancers and updating of existing ones. We will provide more information by 16:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - Cloud Spanner issues from 12:45 to 13:26 Pacific time have been resolved.

Cloud Spanner issues from 12:45 to 13:26 Pacific time have been resolved.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Cloud Spanner. Customers should be back to normal service as of 13:26 Pacific. We will provide more information by 15:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18001 - We are investigating an issue with Cloud Spanner. Customers should be back to normal service as of 13:26 Pacific. We will provide more information by 15:00 US/Pacific.

An incident with Cloud Spanner availability started at 12:45 Pacific time and has been addressed. The service is restored for all customers as of 13:26. Another update will be posted before 15:00 Pacific time to confirm the service health.

Last Update: A few months ago

UPDATE: Incident 17002 - The issue with Cloud Machine Learning Engine's Create Version has been resolved for all affected users as of 2017-12-15 10:55 US/Pacific.

The issue with Cloud Machine Learning Engine's Create Version has been resolved for all affected users as of 2017-12-15 10:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17009 - The issue with the Google App Engine Admin API has been resolved for all affected users as of Thursday, 2017-12-14 12:15 US/Pacific.

The issue with the Google App Engine Admin API has been resolved for all affected users as of Thursday, 2017-12-14 12:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - We are investigating an issue with Google Cloud Storage. We will provide more information by 18:00 US/Pacific.

The issue with Cloud Storage elevated error rate has been resolved for all affected projects as of Friday 2017-11-30 16:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - We are investigating an issue with Google Cloud Storage. We will provide more information by 16:30 US/Pacific.

The Cloud Storage service is experiencing an error rate of less than 10%. We will provide another status update by 2017-11-30 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - We are investigating an issue with Google Cloud Storage. We will provide more information by 15:00 US/Pacific.

The Cloud Storage service is experiencing an error rate of less than 10%. We will provide another status update by 2017-11-30 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - We are investigating an issue with Google Cloud Storage. We will provide more information by 15:00 US/Pacific.

The Cloud Storage service is experiencing an error rate of less than 10%. We will provide another status update by YYYY-mm-dd HH:MM US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - From 10:58 to 11:57 US/Pacific, GCE VM instances experienced packet loss from GCE instances to the Internet. The issue has been mitigated for all affected projects.

From 10:58 to 11:57 US/Pacific, GCE VM instances experienced packet loss to the Internet. The issue has been mitigated for all affected projects.

Last Update: A few months ago

UPDATE: Incident 17004 - We are investigating an issue with Google Cloud Networking. We will provide more information by 07:00 US/Pacific.

The issue with Google Compute Engine VM instances losing connectivity has been resolved for all affected users as of Friday, 2017-11-17 7:17am US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 17008 - App Engine increasingly showing 5xx

The issue with App Engine has been resolved for all affected projects as of 4:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17008 - App Engine increasingly showing 5xx

The issue also affected projects in other regions but should be resolved for the majority of projects. We will provide another status update by 05:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17007 - The Memcache service has recovered from a disruption between 12:30 US/Pacific and 15:30 US/Pacific.

ISSUE SUMMARY

On Wednesday 6 November 2017, the App Engine Memcache service experienced unavailability for applications in all regions for 1 hour and 50 minutes. We sincerely apologize for the impact of this incident on your application or service. We recognize the severity of this incident and will be undertaking a detailed review to fully understand the ways in which we must change our systems to prevent a recurrence.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 6 November 2017 from 12:33 to 14:23 PST, the App Engine Memcache service experienced unavailability for applications in all regions. Some customers experienced elevated Datastore latency and errors while Memcache was unavailable. At this time, we believe that all the Datastore issues were caused by surges of Datastore activity due to Memcache being unavailable. When Memcache failed, if an application sent a surge of Datastore operations to specific entities or key ranges, then Datastore may have experienced contention or hotspotting, as described in https://cloud.google.com/datastore/docs/best-practices#designing_for_scale. Datastore experienced elevated load on its servers when the outage ended due to a surge in traffic. Some applications in the US experienced elevated latency on gets between 14:23 and 14:31, and elevated latency on puts between 14:23 and 15:04. Customers running Managed VMs experienced failures of all HTTP requests and App Engine API calls during this incident. Customers using App Engine Flexible Environment, which is the successor to Managed VMs, were not impacted.

ROOT CAUSE

The App Engine Memcache service requires a globally consistent view of the current serving datacenter for each application in order to guarantee strong consistency when traffic fails over to alternate datacenters. The configuration which maps applications to datacenters is stored in a global database. The incident occurred when the specific database entity that holds the configuration became unavailable for both reads and writes following a configuration update. App Engine Memcache is designed in such a way that the configuration is considered invalid if it cannot be refreshed within 20 seconds. When the configuration could not be fetched by clients, Memcache became unavailable.

REMEDIATION AND PREVENTION

Google received an automated alert at 12:34. Following normal practices, our engineers immediately looked for recent changes that may have triggered the incident. At 12:59, we attempted to revert the latest change to the configuration file. This configuration rollback required an update to the configuration in the global database, which also failed. At 14:21, engineers were able to update the configuration by sending an update request with a sufficiently long deadline. This caused all replicas of the database to synchronize and allowed clients to read the mapping configuration. As a temporary mitigation, we have reduced the number of readers of the global configuration, which avoids the contention during writes that led to the unavailability during the incident. Engineering projects are already under way to regionalize this configuration and thereby limit the blast radius of similar failure patterns in the future.
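
As background for the Datastore contention described above: when a cache layer disappears, every read that would have hit the cache lands on the backing store at once, and identical keys can hotspot. The following is a minimal cache-aside sketch in Python showing graceful degradation with jittered backoff; the helpers it defines (cache_get, cache_set, fetch_from_datastore) are in-memory stand-ins for illustration, not the App Engine Memcache or Datastore APIs, and the backoff constants are arbitrary.

    import random
    import time

    # In-memory stand-ins for a cache and a backing store (illustrative only).
    _FAKE_CACHE = {}
    _FAKE_DATASTORE = {"user:42": {"name": "example"}}

    def cache_get(key):
        return _FAKE_CACHE.get(key)

    def cache_set(key, value):
        _FAKE_CACHE[key] = value

    def fetch_from_datastore(key):
        return _FAKE_DATASTORE.get(key)

    def get_entity(key, attempts=3):
        """Cache-aside read that degrades gracefully if the cache is down."""
        try:
            cached = cache_get(key)
            if cached is not None:
                return cached
        except Exception:
            pass  # Cache unavailable: fall through to the backing store.

        value = None
        for attempt in range(attempts):
            try:
                value = fetch_from_datastore(key)
                break
            except Exception:
                if attempt == attempts - 1:
                    raise
                # Back off with jitter so a cache outage does not become a
                # synchronized surge of reads against the same keys.
                time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))

        try:
            cache_set(key, value)
        except Exception:
            pass  # Best effort; returning the value matters more.
        return value

    if __name__ == "__main__":
        print(get_entity("user:42"))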

Last Update: A few months ago

RESOLVED: Incident 17007 - The Memcache service has recovered from a disruption between 12:30 US/Pacific and 15:30 US/Pacific.

The issue with Memcache availability has been resolved for all affected projects as of 15:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation. This is the final update for this incident.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service experienced a disruption and is still recovering. We will provide more information by 16:00 US/Pacific.

The Memcache service is still recovering from the outage. The rate of errors continues to decrease and we expect a full resolution of this incident in the near future. We will provide an update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service experienced a disruption and is recovering now. We will provide more information by 15:30 US/Pacific.

The issue with Memcache and MVM availability should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide an update by 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service experienced a disruption and is returning to normal now. We will provide more information by 15:15 US/Pacific.

We are experiencing an issue with Memcache availability beginning at November 6, 2017 at 12:30 pm US/Pacific. At this time we are gradually ramping up traffic to Memcache and we see that the rate of errors is decreasing. Other services affected by the outage, such as MVM instances, should be normalizing in the near future. We will provide an update by 15:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service is currently experiencing a disruption. We will provide more information by 14:30 US/Pacific.

We are experiencing an issue with Memcache availability beginning at November 6, 2017 at 12:30 pm US/Pacific. Our Engineering Team believes they have identified the root cause of the errors and is working to mitigate. We will provide an update by 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - The Memcache service is currently experiencing a disruption. We will provide more information by 14:30 US/Pacific.

We are experiencing an issue with Memcache availability beginning at November 6, 2017 at 12:30 pm US/Pacific. Current data indicates that all projects using Memcache are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 14:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - Investigating incident with AppEngine and Memcache.

We are experiencing an issue with Memcache availability beginning at November 6, 2017 at 12:30 pm US/Pacific. Current data indicates that all projects using Memcache are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 14:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - Investigating incident with AppEngine and Memcache.

We are investigating an issue with Google App Engine and Memcache. We will provide more information by 13:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17003 - We are investigating an issue with GKE. We will provide more information by 16:00 US/Pacific.

We are investigating an issue involving the inability of pods to be rescheduled on Google Container Engine (GKE) nodes after Docker reboots or crashes. This affects GKE versions 1.6.11, 1.7.7, 1.7.8 and 1.8.1. Our engineering team will roll out a fix next week; no further updates will be provided here. If you experience this issue, it can be mitigated by manually restarting the affected nodes.
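
For reference, one way to restart an affected node is to reset its underlying VM through the Compute Engine API. The sketch below is a rough illustration only, assuming the google-api-python-client library and Application Default Credentials; the project, zone, and instance names are hypothetical, and draining workloads off the node first is omitted for brevity.

    # Illustrative sketch: reset a node VM via the Compute Engine API.
    from googleapiclient import discovery

    def reset_node(project, zone, instance):
        """Issue a reset (hard restart) for a single node VM."""
        compute = discovery.build("compute", "v1")
        request = compute.instances().reset(
            project=project, zone=zone, instance=instance)
        return request.execute()  # Returns a zone operation for the reset.

    # Hypothetical usage:
    # reset_node("my-project", "us-central1-a", "gke-my-cluster-node-1")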

Last Update: A few months ago

RESOLVED: Incident 17018 - We are investigating an issue with Google Cloud SQL. We see failures for Cloud SQL connections from App Engine and connections using the Cloud SQL Proxy. We are also observing elevated failure rates f...

The issue with Cloud SQL connectivity affecting connections from App Engine and connections using the Cloud SQL Proxy as well as the issue with Cloud SQL admin activities have been resolved for all affected users as of 20:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17019 - We are investigating an issue with Google Cloud SQL. We see failures for Cloud SQL connections from App Engine and connections using the Cloud SQL Proxy. We are also observing elevated failure rates f...

The issue with Cloud SQL connectivity affecting connections from App Engine and connections using the Cloud SQL Proxy as well as the issue with Cloud SQL admin activities have been resolved for all affected users as of 2017-10-30 20:45 PDT. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17018 - We are investigating an issue with Google Cloud SQL. We see failures for Cloud SQL connections from App Engine and connections using the Cloud SQL Proxy. We are also observing elevated failure rates f...

We are continuing to experience an issue with Cloud SQL connectivity, affecting only connections from App Engine and connections using the Cloud SQL Proxy, beginning at 2017-10-30 17:00 US/Pacific. We are also observing elevated failure rates for Cloud SQL admin activities (using the Cloud SQL portion of the Cloud Console UI, using gcloud beta sql, directly using the Admin API, etc.). Our Engineering Team believes it has identified the root cause and a mitigation effort is currently underway. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide another update by 2017-10-30 21:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17005 - Elevated GCS Errors from Canada

ISSUE SUMMARY

Starting Thursday 12 October 2017, Google Cloud Storage clients located in the Northeast of North America experienced up to a 10% error rate for a duration of 21 hours and 35 minutes when fetching objects stored in multi-regional buckets in the US. We apologize for the impact of this incident on your application or service. The reliability of our service is a top priority and we understand that we need to do better to ensure that incidents of this type do not recur.

DETAILED DESCRIPTION OF IMPACT

Between Thursday 12 October 2017 12:47 PDT and Friday 13 October 2017 10:12 PDT, Google Cloud Storage clients located in the Northeast of North America experienced up to a 10% rate of 503 errors and elevated latency. Some users experienced higher error rates for brief periods. This incident only impacted requests to fetch objects stored in multi-regional buckets in the US; clients were able to mitigate impact by retrying. The percentage of total global requests to Cloud Storage that experienced errors was 0.03%.

ROOT CAUSE

Google ensures balanced use of its internal networks by throttling outbound traffic at the source host in the event of congestion. This incident was caused by a bug in an earlier version of the job that reads Cloud Storage objects from disk and streams data to clients. Under high traffic conditions, the bug caused these jobs to incorrectly throttle outbound network traffic even though the network was not congested. Google had previously identified this bug and was in the process of rolling out a fix to all Google datacenters. At the time of the incident, Cloud Storage jobs in a datacenter in Northeast North America that serves requests to some Canadian and US clients had not yet received the fix. This datacenter is not a location for customer buckets (https://cloud.google.com/storage/docs/bucket-locations), but objects in multi-regional buckets can be served from instances running in this datacenter in order to optimize latency for clients.

REMEDIATION AND PREVENTION

The incident was first reported by a customer to Google on Thursday 12 October 14:59 PDT. Google engineers determined root cause on Friday 13 October 09:47 PDT. We redirected Cloud Storage traffic away from the impacted region at 10:08 and the incident was resolved at 10:12. We have now rolled out the bug fix to all regions. We will also add external monitoring probes for all regional points of presence so that we can more quickly detect issues of this type.
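
As the impact description above notes, clients could mitigate the 503 errors by retrying. The sketch below is a minimal retry wrapper assuming the google-cloud-storage Python client; the bucket and object names are hypothetical and the backoff constants are arbitrary.

    import random
    import time

    from google.api_core import exceptions
    from google.cloud import storage

    def download_with_retries(bucket_name, blob_name, attempts=5):
        """Fetch a GCS object, retrying transient 5xx errors with backoff."""
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(blob_name)
        for attempt in range(attempts):
            try:
                return blob.download_as_bytes()
            except (exceptions.ServiceUnavailable,
                    exceptions.InternalServerError):
                if attempt == attempts - 1:
                    raise
                # Exponential backoff with jitter before the next attempt.
                time.sleep(min(2 ** attempt, 32) + random.uniform(0, 1))

    # Hypothetical usage:
    # data = download_with_retries("my-multiregional-bucket", "path/to/object")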

Last Update: A few months ago

UPDATE: Incident 17002 - Jobs not terminating

The issue with Cloud Dataflow in which batch jobs are stuck and cannot be terminated has been resolved for all affected projects as of Wednesday, 2017-10-18 02:58 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17002 - Jobs not terminating

A fix for the issue with Cloud Dataflow in which batch jobs are stuck and cannot be terminated is currently getting rolled out. We expect a full resolution in the near future. We will provide another status update by Wednesday, 2017-10-18 03:45 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17007 - Stackdriver Uptime Check Alerts Not Firing

The issue with Stackdriver Uptime Check Alerts not firing has been resolved for all affected projects as of Monday, 2017-10-16 13:08 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17007 - Stackdriver Uptime Check Alerts Not Firing

We are investigating an issue with Stackdriver Uptime Check Alerts. We will provide more information by 13:15 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17005 - Elevated GCS Errors from Canada

The issue with Google Cloud Storage request failures for users in Canada and Northeast North America has been resolved for all affected users as of Friday, 2017-10-13 10:08 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17005 - Elevated GCS Errors from Canada

We are investigating an issue with Google Cloud Storage users in Canada and Northeast North America experiencing HTTP 503 failures. We will provide more information by 10:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17004 - Elevated GCS errors in us-east1

The issue with GCS service has been resolved for all affected users as of 14:31 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17004 - Elevated GCS errors in us-east1

The issue with GCS service should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17004 - Elevated GCS errors in us-east1

We are investigating an issue that occurred with GCS starting at 13:19 PDT. We will provide more information by 14:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17007 - Project creation failure

The issue with Project Creation failing with "Unknown error" has been resolved for all affected users as of Tuesday, 2017-10-03 22:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - Stackdriver console unavailable

The issue with Google Stackdriver has been resolved for all affected users as of 2017-10-03 16:28 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - Stackdriver console unavailable

We are continuing to investigate the Google Stackdriver issue. Graphs are fully restored, but alerting policies and uptime checks are still degraded. We will provide another update at 17:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17006 - Stackdriver console unavailable

We are continuing to investigate the Google Stackdriver issue. In addition to graph and alerting policy unavailability, uptime checks are not completing successfully. We believe we have isolated the root cause and are working on a resolution, and we will provide another update at 16:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17006 - Stackdriver console unavailable

We are investigating an issue with Google Stackdriver that is causing charts and alerting policies to be unavailable. We will provide more information by 15:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17007 - Project creation failure

Project creation is experiencing a 100% error rate on requests. We will provide another status update by Tuesday, 2017-10-03 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - Project creation failure

We are investigating an issue with Project creation. We will provide more information by 12:40 PM US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17005 - Errors creating new Stackdriver accounts and adding new projects to existing Stackdriver accounts.

The issue with Google Stackdriver has been resolved for all affected projects as of Friday, 2017-09-29 15:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - Activity Stream not showing new Activity Logs

We are currently investigating an issue with the Cloud Console's Activity Stream not showing new Activity Logs.

Last Update: A few months ago

RESOLVED: Incident 17003 - Google Cloud Pub/Sub partially unavailable.

The issue with Pub/Sub subscription creation has been resolved for all affected projects as of 08:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17003 - Google Cloud Pub/Sub partially unavailable.

We are experiencing an issue with Pub/Sub subscription creation beginning at 2017-09-13 06:30 US/Pacific. Current data indicates that approximately 12% of requests are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 08:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - Google Cloud Pub/Sub partially unavailable.

We are investigating an issue with Google Pub/Sub. We will provide more information by 07:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

ISSUE SUMMARY

On Tuesday 29 August and Wednesday 30 August 2017, Google Cloud Network Load Balancing and Internal Load Balancing could not forward packets to backend instances if they were live-migrated. This incident initially affected instances in all regions for 30 hours and 22 minutes. We apologize for the impact this had on your services. We are particularly cognizant of the failure of a system used to increase reliability and the long duration of the incident. We have completed an extensive postmortem to learn from the issue and improve Google Cloud Platform.

DETAILED DESCRIPTION OF IMPACT

From 13:56 PDT on Tuesday 29 August 2017 to 20:18 on Wednesday 30 August 2017, Cloud Network Load Balancer and Internal Load Balancer in all regions were unable to reach any instance that live-migrated during that period. Instances which did not experience live-migration during this period were not affected. Our internal investigation shows that approximately 2% of instances using Network Load Balancing or Internal Load Balancing were affected by the issue.

ROOT CAUSE

Live-migration transfers a running VM from one host machine to another host machine within the same zone. All VM properties and attributes remain unchanged, including internal and external IP addresses, instance metadata, block storage data and volumes, OS and application state, network settings, network connections, and so on. In this case, a change in the internal representation of networking information in VM instances caused an inconsistency between two values, both of which were supposed to hold the external and internal virtual IP addresses of load balancers. When an affected instance was live-migrated, the instance was deprogrammed from the load balancer because of the inconsistency. This made it impossible for load balancers that used the instance as a backend to look up the destination IP address of the instance following its migration, which in turn caused all packets destined to that instance to be dropped at the load balancer level.

REMEDIATION AND PREVENTION

Initial detection came from reports of lost backend connectivity made to the GCP support team at 23:30 on Tuesday. At 00:28 on Wednesday two Cloud Network engineering teams were paged to investigate the issue. Detailed investigations continued until 08:07, when the configuration change that caused the issue was confirmed as such. The rollback of the new configuration was completed by 08:32, at which point no new live-migration would cause the issue. Google engineers then started to run a program to fix all mismatched network information at 08:56, and all affected instances were restored to a healthy status by 20:18. To prevent a recurrence, Google engineers are working to enhance automated canary testing that simulates live-migration events, to improve detection of load balancer packet loss, and to enforce more restrictions on the deployment of configuration changes that alter internal representations.

Last Update: A few months ago

RESOLVED: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

The issue with Network Load Balancers has been resolved for all affected projects as of 20:18 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 18033 - We are investigating an issue with BigQuery queries failing starting at 10:15am PT

The issue with BigQuery queries failing has been resolved for all affected users as of 12:05pm US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18033 - We are investigating an issue with BigQuery queries failing starting at 10:15am PT

The BigQuery service is experiencing a 16% error rate on queries. We will provide another status update by 12:00pm US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We wanted to send another update with better formatting. We will provide another update on resolving affected instances by 12:00 PDT.

Affected customers can also mitigate their affected instances with the following procedure (which causes the Network Load Balancer to be reprogrammed) using the gcloud tool or the Compute Engine API. NB: No modification to the existing load balancer configuration is necessary, but a temporary TargetPool needs to be created.

1) Create a new TargetPool.
2) Add the affected VMs in a region to the new TargetPool.
3) Wait for the VMs to start working in their existing load balancer configuration.
4) Delete the new TargetPool. DO NOT delete the existing load balancer config, including the old target pool.

It is not necessary to create a new ForwardingRule.

Example:
1) gcloud compute target-pools create dummy-pool --project=<your_project> --region=<region>
2) gcloud compute target-pools add-instances dummy-pool --instances=<instance1,instance2,...> --project=<your_project> --region=<region> --instances-zone=<zone>
3) (Wait)
4) gcloud compute target-pools delete dummy-pool --project=<your_project> --region=<region>

Last Update: A few months ago

UPDATE: Incident 18033 - We are investigating an issue with BigQuery queries failing starting at 10:15am PT

We are investigating an issue with BigQuery queries failing starting at 10:15am PT

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

Our first mitigation has completed at this point and no new instances should be affected. We are gradually working through and fixing affected customers.

Affected customers can also mitigate their affected instances with the following procedure (which causes the Network Load Balancer to be reprogrammed) using the gcloud tool or the Compute Engine API. NB: No modification to the existing load balancer configuration is necessary, but a temporary TargetPool needs to be created.

1) Create a new TargetPool.
2) Add the affected VMs in a region to the new TargetPool.
3) Wait for the VMs to start working in their existing load balancer configuration.
4) Delete the new TargetPool. DO NOT delete the existing load balancer config, including the old target pool.

It is not necessary to create a new ForwardingRule.

Example:
gcloud compute target-pools create dummy-pool --project=<your_project> --region=<region>
gcloud compute target-pools add-instances dummy-pool --instances=<instance1,instance2,...> --project=<your_project> --region=<region> --instances-zone=<zone>
(Wait)
gcloud compute target-pools delete dummy-pool --project=<your_project> --region=<region>

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We are experiencing an issue with a subset of Network Load Balancers. The configuration change to mitigate this issue has been rolled out and we are working on further measures to completely resolve the issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 10:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. The configuration change to mitigate this issue has been rolled out and we are working on further measures to completely resolve the issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 09:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. We have identified the event that triggers this issue and are rolling back a configuration change to mitigate this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 09:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with Cloud Network Load Balancers connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. Mitigation work is still in progress. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 08:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 not being able to connect to backends. Our previous actions did not resolve the issue. We are pursuing alternative solutions. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 07:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an issue with a subset of Network Load Balancers in regions us-east1, us-west1 and asia-east1 not being able to connect to backends. Our Engineering Team has reduced the scope of possible root causes and is still investigating. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 06:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. The investigation is still ongoing. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. We have ruled out several possible failure scenarios. The investigation is still ongoing. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 04:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 04:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 03:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are experiencing an intermittent issue with Network Load Balancer connectivity to their backends. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 03:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are investigating an issue with network load balancer connectivity. We will provide more information by 02:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are investigating an issue with network connectivity. We will provide more information by 01:50 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17002 - Issue with network connectivity

We are investigating an issue with network connectivity. We will provide more information by 01:20 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

ISSUE SUMMARY

On Tuesday 15 August 2017, Google Cloud SQL experienced issues in the europe-west1 zones for a duration of 3 hours and 35 minutes. During this time, new connections from Google App Engine (GAE) or Cloud SQL Proxy would time out and return an error. In addition, Cloud SQL connections with ephemeral certs that had been open for more than one hour timed out and returned an error. We apologize to our customers whose projects were affected – we are taking immediate action to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT

On Tuesday 15 August 2017 from 17:20 to 20:55 PDT, 43.1% of Cloud SQL instances located in europe-west1 could not be managed with the Google Cloud SQL Admin API to create or make changes. Customers who connected from GAE or used the Cloud SQL Proxy (which includes most connections from Google Container Engine) were denied new connections to their database.

ROOT CAUSE

The issue surfaced through a combination of a spike in error rates internal to the Cloud SQL service and a lack of available resources in the Cloud SQL control plane for europe-west1. By way of background, the Cloud SQL system uses a database to store metadata for customer instances. This metadata is used for validating new connections. Validation will fail if the load on the database is heavy. In this case, Cloud SQL’s automatic retry logic overloaded the control plane and consumed all the available Cloud SQL control plane processing in europe-west1. This in turn made the Cloud SQL Proxy and front end client server pairing reject connections when ACLs and certificate information stored in the Cloud SQL control plane could not be accessed.

REMEDIATION AND PREVENTION

Google engineers were paged at 17:20 when automated monitoring detected an increase in control plane errors. Initial troubleshooting steps did not sufficiently isolate the issue and reduce the database load. Engineers then disabled non-critical control plane services for Cloud SQL to shed load and allow the service to catch up. They then began a rollback to the previous configuration to bring the system back to a healthy state. This incident has surfaced technical issues which hinder our intended level of service and reliability for the Cloud SQL service. We have begun a thorough investigation of similar potential failure patterns in order to avoid this type of service disruption in the future. We are adding additional monitoring to quickly detect the metadata database timeouts which caused the control plane outage. We are also working to make the Cloud SQL control plane services more resilient to metadata database latency by making the service not directly call the database for connection validation. We realize this event may have impacted your organization and we apologize for this disruption. Thank you again for your business with Google Cloud SQL.
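
The root cause above is a reminder that synchronized, aggressive retries can themselves overload a control plane. A common client-side counterpart is capped exponential backoff with jitter when re-establishing database connections. The sketch below is a generic illustration only; the connect callable is a placeholder for whatever driver or Cloud SQL Proxy-based connection function an application actually uses.

    import random
    import time

    def connect_with_backoff(connect, attempts=5, base_delay=0.5, max_delay=30.0):
        """Call a zero-argument connection factory with capped, jittered backoff.

        Spreading retries out ("full jitter") avoids the kind of synchronized
        retry load described in the root-cause analysis above.
        """
        for attempt in range(attempts):
            try:
                return connect()
            except Exception:
                if attempt == attempts - 1:
                    raise
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))

    # Hypothetical usage:
    # conn = connect_with_backoff(lambda: my_driver.connect(host="127.0.0.1"))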

Last Update: A few months ago

UPDATE: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

The issue with Cloud SQL connectivity affecting connections from App Engine and connections using the Cloud SQL Proxy in europe-west1 has been resolved for all affected projects as of 20:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

We are continuing to experience an issue with Cloud SQL connectivity beginning at Tuesday, 2017-08-15 17:20 US/Pacific. Current investigation indicates that instances running in Europe-West1 are affected by this issue. Engineering is working on mitigating the situation. We will provide an update by 21:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

We are continuing to experience an issue with Cloud SQL connectivity beginning at Tuesday, 2017-08-15 17:20 US/Pacific. Current investigation indicates that instances running in Europe-West1 are affected by this issue. Engineering is currently working on mitigating the situation. We will provide an update by 20:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17017 - Cloud SQL connectivity issue in Europe-West1

We are experiencing an issue with Cloud SQL connectivity beginning at Tuesday, 2017-08-15 17:20 US/Pacific. Current investigation indicates that instances running in Europe-West1 are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - GCS triggers not fired when objects are updated

We are investigating an issue with Google Cloud Storage object overwrites. For buckets with Google Cloud Functions or Object Change Notification enabled, notifications were not being triggered when a new object overwrote an existing object. Other object operations are not affected. Buckets with Google Cloud Pub/Sub configured are also not affected. The root cause has been found and confirmed by partial rollback. Full rollback is expected to be completed within an hour. Between now and full rollback, affected buckets are expected to begin triggering on updates; it can be intermittent initially, and it is expected to stabilize when the rollback is complete. We will provide another update by 14:00 with any new details. ETA for resolution 14:00 US/Pacific time.
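
For context on the triggers involved: a bucket wired to Cloud Functions receives an event for each finalized object, including overwrites of existing objects, which is exactly the notification that was missing during this incident. The sketch below is a rough illustration of a storage-triggered handler, shown with the Python background-function signature purely for readability; the function itself is hypothetical and the fields come from the GCS object resource.

    def on_object_change(event, context):
        """Illustrative storage-triggered handler (hypothetical function).

        `event` carries the GCS object resource; an overwrite arrives as a
        new finalize event with a fresh `generation` for the same name.
        """
        print("bucket=%s name=%s generation=%s" % (
            event.get("bucket"), event.get("name"), event.get("generation")))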

Last Update: A few months ago

UPDATE: Incident 17003 - GCS triggers not fired when objects are updated

We are investigating an issue with Google Cloud Storage function triggering on object update. Apiary notifications on object updates were also not sent during this issue. Other object operations are not reporting problems. The root cause has been found and confirmed by partial rollback. Full rollback is expected to be completed within an hour. Between now and full rollback, affected functions are expected to begin triggering on updates; it can be intermittent initially, and it is expected to stabilize when the rollback is complete. We will provide another update by 14:00 with any new details. ETA for resolution 13:30 US/Pacific time.

Last Update: A few months ago

UPDATE: Incident 17003 - GCS triggers not fired when objects are updated

We are investigating an issue with Google Cloud Storage function triggering on object update. Other object operations are not reporting problems. We will provide more information by 12:00 US/Pacific

Last Update: A few months ago

RESOLVED: Incident 18032 - BigQuery Disabled for Projects

ISSUE SUMMARY

On 2017-07-26, BigQuery delivered error messages for 7% of queries and 15% of exports for a duration of two hours and one minute. It also experienced elevated failures for streaming inserts for one hour and 40 minutes. If your service or application was affected, we apologize – this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve BigQuery’s performance and availability.

DETAILED DESCRIPTION OF IMPACT

On 2017-07-26 from 13:45 to 15:45 US/PDT, BigQuery jobs experienced elevated failures at a rate of 7% to 15%, depending on the operation attempted. Overall 7% of queries, 15% of exports, and 9% of streaming inserts failed during this event. These failures occurred in 12% of customer projects. The errors for affected projects varied from 2% to 69% for exports, over 50% for queries, and up to 28.5% for streaming inserts. Affected customers saw an error message stating that their project had “not enabled BigQuery”.

ROOT CAUSE

Prior to executing a BigQuery job, Google’s Service Manager validates that the project requesting the job has BigQuery enabled for the project. The Service Manager consists of several components, including a redundant data store for project configurations, and a permissions module which inspects configurations. The project configuration data is being migrated to a new format and new version of the data store, and as part of that migration, the permissions module is being updated to use the new format. As is normal production best practice, this migration is being performed in stages separated by time. The root cause of this event was that, during one stage of the rollout, configuration data for two GCP datacenters was migrated before the corresponding permissions module for BigQuery was updated. As a result, the permissions module in those datacenters began erroneously reporting that projects running there no longer had BigQuery enabled. Thus, while both BigQuery and the underlying data stores were unchanged, requests to BigQuery from affected projects received an error message indicating that they had not enabled BigQuery.

REMEDIATION AND PREVENTION

Google’s BigQuery on-call engineering team was alerted by automated monitoring within 15 minutes of the beginning of the event, at 13:59. Subsequent investigation determined at 14:17 that multiple projects were experiencing BigQuery validation failures, and the cause of the errors was identified at 14:46 as being changed permissions. Once the root cause of the errors was understood, Google engineers focused on mitigating the user impact by configuring BigQuery in affected locations to skip the erroneous permissions check. This change was first tested in a portion of the affected projects beginning at 15:04, and confirmed to be effective at 15:29. That mitigation was then rolled out to all affected projects, and was complete by 15:44. Finally, with mitigations in place, the Google engineering team worked to safely roll back the data migration; this work completed at 23:33 and the permissions check mitigation was removed, closing the incident.

Google engineering has created 26 high priority action items to prevent a recurrence of this condition and to better detect and more quickly mitigate similar classes of issues in the future. These action items include increasing the auditing of BigQuery’s use of Google’s Service Manager, improving the detection and alerting of the conditions that caused this event, and improving the response of Google engineers to similar events. In addition, the core issue that affected the BigQuery backend has already been fixed. Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization.
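
During the incident, affected callers received an API-level error claiming the project had not enabled BigQuery even though their jobs were well formed, so surfacing the server-side reason is useful when triaging. The sketch below is a minimal illustration assuming the google-cloud-bigquery Python client; the query string is hypothetical.

    from google.api_core import exceptions
    from google.cloud import bigquery

    def run_query(sql):
        """Run a query and log API-level errors (such as enablement or
        permission messages) separately from ordinary query failures."""
        client = bigquery.Client()
        try:
            job = client.query(sql)
            return list(job.result())  # Waits for the job to complete.
        except exceptions.GoogleAPICallError as err:
            # The exception text carries the server-side reason, which during
            # this incident erroneously reported BigQuery as not enabled.
            print("BigQuery request failed: %s" % err)
            raise

    # Hypothetical usage:
    # rows = run_query("SELECT 1")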

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

The issue with BigQuery access errors has been resolved for all affected projects as of 16:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

The issue with BigQuery errors should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

The BigQuery engineers have identified a possible workaround to the issue affecting the platform and are deploying it now. Next update at 16:00 PDT.

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

We are investigating an issue with BigQuery. We will provide more information by 15:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18032 - BigQuery Disabled for Projects

At this time BigQuery is experiencing a partial outage, reporting that the service is not available for affected projects. Engineers are currently investigating the issue.

Last Update: A few months ago

UPDATE: Incident 17001 - Issues with Cloud VPN in us-west1

The issue with connectivity to Cloud VPN and External IPs in us-west1 has been resolved for all affected projects as of 14:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17005 - Issues with Google Cloud Console

The issue with listing projects in the Google Cloud Console has been resolved as of 2017-07-21 07:11 PDT.

Last Update: A few months ago

UPDATE: Incident 17005 - Issues with Google Cloud Console

The issue with Google Cloud Console errors should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 09:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - Issues with Google Cloud Console

We are experiencing an issue with Google Cloud Console returning errors beginning at Fri, 2017-07-21 02:50 US/Pacific. Early investigation indicates that users may see errors when listing projects in Google Cloud Console and via the API. Some other pages in Google Cloud Console may also display an error; refreshing the pages may help. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - Issues with Google Cloud Console

We are investigating an issue with Google Cloud Console. We will provide more information by 03:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18031 - BigQuery API server returning errors

The issue with BigQuery API returning errors has been resolved for all affected users as of 04:10 US/Pacific. We apologize for the impact that this incident had on your application.

Last Update: A few months ago

UPDATE: Incident 18031 - BigQuery API server returning errors

The issue with Google BigQuery should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18031 - BigQuery API server returning errors

We are investigating an issue with Google BigQuery. We will provide more information by 04:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18030 - Streaming API errors

ISSUE SUMMARY On Wednesday 28 June 2017, streaming data into Google BigQuery experienced elevated error rates for a period of 57 minutes. We apologize to all users whose data ingestion pipelines were affected by this issue. We understand the importance of reliability for a process as crucial as data ingestion and are taking committed actions to prevent a similar recurrence in the future. DETAILED DESCRIPTION OF IMPACT On Wednesday 28 June 2017 from 18:00 to 18:20 and from 18:40 to 19:17 US/Pacific time, BigQuery's streaming insert service returned an increased error rate to clients for all projects. The proportion varied over time, but failures peaked at 43% of streaming requests returning HTTP response code 500 or 503. Data streamed by clients that encountered these errors and lacked retry logic was not saved into the target tables during this period. ROOT CAUSE Streaming requests are routed to different datacenters for processing based on the table ID of the destination table. A sudden increase in traffic to the BigQuery streaming service, combined with diminished capacity in one datacenter, resulted in that datacenter returning a significant number of errors for tables whose IDs landed in that datacenter. Other datacenters processing streaming data into BigQuery were unaffected. REMEDIATION AND PREVENTION Google engineers were notified of the event at 18:20 and immediately started to investigate the issue. The first set of errors had subsided, but starting at 18:40 error rates increased again. At 19:17 Google engineers redirected traffic away from the affected datacenter. The table IDs in the affected datacenter were redistributed to the remaining, healthy streaming servers and error rates began to subside. To prevent the issue from recurring, Google engineers are improving the load balancing configuration so that spikes in streaming traffic can be more equitably distributed among the available streaming servers. Additionally, engineers are adding further monitoring, as well as tuning existing monitoring, to decrease the time it takes to alert engineers of issues with the streaming service. Finally, Google engineers are evaluating rate-limiting strategies for the streaming backends to prevent them from becoming overloaded.
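The impact description above notes that only clients without retry logic lost data. As an illustration of that generic client-side pattern (not Google's internal fix), the following minimal sketch assumes the google-cloud-bigquery Python library and a placeholder table name, and retries a streaming insert batch with exponential backoff when the service returns HTTP 500 or 503.

```python
# Minimal sketch: retry streaming inserts on transient 500/503 errors.
# Table and row contents are placeholders for illustration.
import time

from google.api_core.exceptions import InternalServerError, ServiceUnavailable
from google.cloud import bigquery


def stream_rows(table_id, rows, attempts=5):
    client = bigquery.Client()
    delay = 1.0
    for attempt in range(attempts):
        try:
            errors = client.insert_rows_json(table_id, rows)
            if errors:
                # Per-row errors (e.g. schema mismatches) are usually not
                # transient, so surface them instead of retrying blindly.
                raise RuntimeError(f"rows rejected: {errors}")
            return
        except (InternalServerError, ServiceUnavailable):
            # 500/503 from the streaming backend: back off and retry the batch.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2


stream_rows("my-project.my_dataset.my_table", [{"name": "a", "value": 1}])
```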

Last Update: A few months ago

UPDATE: Incident 17004 - Stackdriver Uptime Monitoring - alerting policies with uptime check health conditions will not fire or resolve

We are experiencing an issue with Stackdriver Uptime Monitoring beginning at approximately Thursday, 2017-07-06 17:00 US/Pacific: alerting policies with uptime check health conditions will not fire or resolve, and latency charts on the uptime dashboard will be missing. Current data indicates that all projects are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - We are investigating an issue with Google Cloud Storage. We will provide more information by 18:30 US/Pacific.

We are experiencing an intermittent issue with Google Cloud Storage - JSON API requests are failing with 5XX errors (XML API is unaffected) beginning at Thursday, 2017-07-06 16:50:40 US/Pacific. Current data indicates that approximately 70% of requests globally are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - We are investigating an issue with Google Cloud Storage. We will provide more information by 18:30 US/Pacific.

We are investigating an issue with Google Cloud Storage. We will provide more information by 18:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17001 - Google Cloud Storage elevated error rates

The issue with degraded availability for some Google Cloud Storage objects has been resolved for all affected projects. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 17016 - Cloud SQL V2 instance failing to create

The issue with Cloud SQL V2 incorrect reports of 'Unable to Failover' state has been resolved for all affected instances as of 12:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17016 - Cloud SQL V2 instance failing to create

The issue with Cloud SQL V2 incorrect reports of 'Unable to Failover' state should be resolved for some instances and we expect a full resolution in the near future. We will provide another status update by 12:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17016 - Cloud SQL V2 instance failing to create

Our Engineering Team believes they have identified the root cause of the incorrect reports of 'Unable to Failover' state and is working to mitigate. We will provide another status update by 12:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17002 - Cloud Pub/Sub admin operations failing

The issue with Cloud Pub/Sub admin operations failing has been resolved for all affected users as of 10:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17016 - Cloud SQL V2 instance failing to create

The issue with some Cloud SQL V2 instances failing to create should be resolved for some projects, and we expect a full resolution in the near future. At this time we do not have additional information on HA instances reporting an incorrect 'Unable to Failover' state. We will provide another status update by 11:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17016 - Cloud SQL V2 instance failing to create

We are experiencing an issue with Cloud SQL V2: some instances may be failing to create, or HA instances may report an incorrect 'Unable to Failover' state, beginning at Thursday, 2017-06-29 08:45 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 11:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Cloud Pub/Sub admin operations failing

We are investigating an issue where Cloud Pub/Sub admin operations are failing. We will provide more information by 10:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Google Cloud Storage elevated error rates

Google engineers are continuing to restore the service. Error rates are continuing to decrease. We will provide another status update by 15:00 June 29 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

The issue with Cloud Logging exports to BigQuery failing has been resolved for all affected projects on Tuesday, 2017-06-13 10:12 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 18030 - Streaming API errors

The issue with BigQuery Streaming insert has been resolved for all affected users as of 19:17 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18030 - Streaming API errors

Our Engineering Team believes they have identified the root cause of the errors and have mitigated the issue by 19:17 US/Pacific. We will provide another status update by 20:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18030 - Streaming API errors

We are investigating an issue with BigQuery Streaming insert. We will provide more information by 19:35 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Google Cloud Storage elevated error rates

Starting at Tuesday 27 June 2017 07:30 US/Pacific, Google Cloud Storage began experiencing degraded availability for some objects in us-central1 buckets (Regional, Nearline, Coldline, Durable Reduced Availability) and US multi-region buckets. Between 08:00 and 18:00 US/Pacific the error rate was ~3.5%; error rates have since decreased to 0.5%. Affected objects are expected to return errors consistently. Customers do not need to make any changes at this time. Google engineers have identified the root cause and are working to restore the service. If your service or application is affected, we apologize. We will provide another status update by 05:00 June 29 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18029 - BigQuery Increased Error Rate

ISSUE SUMMARY For 10 minutes on Wednesday 14 June 2017, Google BigQuery experienced increased error rates for both streaming inserts and most API methods due to their dependency on metadata read operations. To our BigQuery customers whose businesses were impacted by this event, we sincerely apologize. We are taking immediate steps to improve BigQuery’s performance and availability. DETAILED DESCRIPTION OF IMPACT Starting at 10:43am US/Pacific, global error rates for BigQuery streaming inserts and API calls dependent upon metadata began to rapidly increase. The error rate for streaming inserts peaked at 100% by 10:49am. Within that same window, the error rate for metadata operations increased to a peak of 80%. By 10:54am the error rates for both streaming inserts and metadata operations had returned to normal operating levels. During the incident, affected BigQuery customers would have experienced a noticeable elevation in latency on all operations, as well as increased “Service Unavailable” and “Timeout” API call failures. While BigQuery streaming inserts and metadata operations were the most severely impacted, other APIs also exhibited elevated latencies and error rates, though to a much lesser degree. For API calls that returned status code 2xx, the operation completed with successful data ingestion and integrity. ROOT CAUSE On Wednesday 14 June 2017, BigQuery engineers completed the migration of BigQuery's metadata storage to an improved backend infrastructure. This effort was the culmination of work to incrementally migrate BigQuery read traffic over the course of two weeks. As the new backend infrastructure came online, one particular type of read traffic had not yet been migrated to the new metadata storage, which caused a sudden spike of that read traffic to the new backend. The new storage backend then had to process a large volume of incoming requests while also allocating resources to handle the increased load. Initially the backend was able to process requests with elevated latency, but all available resources were eventually exhausted, which led to API failures. Once the backend was able to complete the load redistribution, it began to free up resources to process existing requests and work through its backlog. BigQuery operations continued to experience elevated latency and errors for another five minutes as the large backlog of requests from the first five minutes of the incident was processed. REMEDIATION AND PREVENTION Our monitoring systems worked as expected and alerted us to the outage within 6 minutes of the error spike. By this time, the underlying root cause had already passed. Google engineers have created nine high-priority action items and three lower-priority action items as a result of this event to better prevent, detect, and mitigate the recurrence of a similar event. The most significant of these priorities is to modify the BigQuery service to successfully handle a similar root-cause event. This will include adjusting capacity parameters to better handle backend failures and improving caching and retry logic. Each of the 12 action items created from this event has already been assigned to an engineer and is underway.

Last Update: A few months ago

RESOLVED: Incident 18029 - BigQuery Increased Error Rate

ISSUE SUMMARY For 10 minutes on Wednesday 14 June 2017, Google BigQuery experienced increased error rates for both streaming inserts and most API methods due to their dependency on metadata read operations. To our BigQuery customers whose businesses were impacted by this event, we sincerely apologize. We are taking immediate steps to improve BigQuery’s performance and availability. DETAILED DESCRIPTION OF IMPACT Starting at 10:43am US/Pacific, global error rates for BigQuery streaming inserts and API calls dependent upon metadata began to rapidly increase. The error rate for streaming inserts peaked at 100% by 10:49am. Within that same window, the error rate for metadata operations increased to a peak of 80%. By 10:54am the error rates for both streaming inserts and metadata operations had returned to normal operating levels. During the incident, affected BigQuery customers would have experienced a noticeable elevation in latency on all operations, as well as increased “Service Unavailable” and “Timeout” API call failures. While BigQuery streaming inserts and metadata operations were the most severely impacted, other APIs also exhibited elevated latencies and error rates, though to a much lesser degree. For API calls that returned status code 2xx, the operation completed successfully with guaranteed data ingestion and integrity. ROOT CAUSE On Wednesday 14 June 2017, BigQuery engineers completed the migration of BigQuery's metadata storage to an improved backend infrastructure. This effort was the culmination of work to incrementally migrate BigQuery read traffic over the course of two weeks. As the new backend infrastructure came online, one particular type of read traffic had not yet been migrated to the new metadata storage, which caused a sudden spike of that read traffic to the new backend. The new storage backend then had to process a large volume of incoming requests while also allocating resources to handle the increased load. Initially the backend was able to process requests with elevated latency, but all available resources were eventually exhausted, which led to API failures. Once the backend was able to complete the load redistribution, it began to free up resources to process existing requests and work through its backlog. BigQuery operations continued to experience elevated latency and errors for another five minutes as the large backlog of requests from the first five minutes of the incident was processed. REMEDIATION AND PREVENTION Our monitoring systems worked as expected and alerted us to the outage within 6 minutes of the error spike. By this time, the underlying root cause had already passed. Google engineers have created nine high-priority action items and three lower-priority action items as a result of this event to better prevent, detect, and mitigate the recurrence of a similar event. The most significant of these priorities is to modify the BigQuery service to successfully handle a similar root-cause event. This will include adjusting capacity parameters to better handle backend failures and improving caching and retry logic. Each of the 12 action items created from this event has already been assigned to an engineer and is underway.
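For client code, the practical takeaway from the "Service Unavailable" and "Timeout" failures described above is to bound and broaden retries on BigQuery calls. A minimal sketch, assuming the google-cloud-bigquery and google-api-core Python libraries; the specific backoff values are illustrative only, not recommendations.

```python
# Minimal sketch: widen the client's retry predicate so short bursts of
# 503 / deadline errors are absorbed by bounded retries.
from google.api_core.exceptions import DeadlineExceeded, ServiceUnavailable
from google.api_core.retry import Retry, if_exception_type
from google.cloud import bigquery

# Retry 503s and deadline expirations with exponential backoff.
transient_retry = Retry(
    predicate=if_exception_type(ServiceUnavailable, DeadlineExceeded),
    initial=1.0,      # first delay in seconds
    maximum=60.0,     # cap on any single delay
    multiplier=2.0,   # exponential growth factor
)

client = bigquery.Client()
job = client.query("SELECT 1", retry=transient_retry)
print(list(job.result()))
```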

Last Update: A few months ago

RESOLVED: Incident 17006 - Network issue in asia-northeast1

ISSUE SUMMARY On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, datacenters in the asia-northeast1 region experienced a loss of network connectivity for a total of 62 minutes. We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve. We recognize the severity of this incident and have completed an extensive internal postmortem. We thoroughly understand the root causes and no datacenters are at risk of recurrence. We are working to add mechanisms that prevent and mitigate this class of problem in the future. We have prioritized this work, and in the coming weeks our engineering team will complete the action items generated from the postmortem. DETAILED DESCRIPTION OF IMPACT On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, network connectivity to and from Google Cloud services running in the asia-northeast1 region was unavailable for 62 minutes. This issue affected all Google Cloud Platform services in that region, including Compute Engine, App Engine, Cloud SQL, Cloud Datastore, and Cloud Storage. All external connectivity to the region was affected during this time frame, while internal connectivity within the region was not affected. In addition, inbound requests from external customers originating near Google’s Tokyo point of presence intended for Compute or Container Engine HTTP Load Balancing were lost for the initial 12 minutes of the outage. Separately, Internal Load Balancing within asia-northeast1 remained degraded until 10:23. ROOT CAUSE At the time of the incident, Google engineers were upgrading the network topology and capacity of the region; a configuration error caused the existing links to be decommissioned before the replacement links could provide connectivity, resulting in a loss of connectivity for the asia-northeast1 region. Although the replacement links were already commissioned and appeared to be ready to serve, a network-routing protocol misconfiguration meant that the routes through those links were not able to carry traffic. As Google's global network grows continuously, we make upgrades and updates reliably by using automation for each step and, where possible, applying changes to only one zone at any time. asia-northeast1 was the last region whose topology was unsupported by this automation; manual work was required to align its topology with the rest of our regional deployments (which would, in turn, allow automation to function properly in the future). This manual change mistakenly did not follow the same per-zone restrictions required by standard policy or automation, which meant the entire region was affected simultaneously. In addition, some customers with deployments across multiple regions that included asia-northeast1 experienced problems with HTTP Load Balancing due to a failure to detect that the backends were unhealthy. When a network partition occurs, HTTP Load Balancing normally detects this automatically within a few seconds and routes traffic to backends in other regions. In this instance, due to a performance feature being tested in this region at the time, the mechanism that usually detects network partitions did not trigger, and the load balancer continued to attempt to assign traffic to the region until our on-call engineers responded.
Lastly, the Internal Load Balancing outage was exacerbated due to a software defined networking component which was stuck in a state where it was not able to provide network resolution for instances in the load balancing group. REMEDIATION AND PREVENTION Google engineers were paged by automated monitoring within one minute of the start of the outage, at 08:24 PDT. They began troubleshooting and declared an emergency incident 8 minutes later at 08:32. The issue was resolved when engineers reconnected the network path and reverted the configuration back to the last known working state at 09:22. Our monitoring systems worked as expected and alerted us to the outage promptly. The time-to-resolution for this incident was extended by the time taken to perform the rollback of the network change, as the rollback had to be performed manually. We are implementing a policy change that any manual work on live networks be constrained to a single zone. This policy will be enforced automatically by our change management software when changes are planned and scheduled. In addition, we are building automation to make these types of changes in future, and to ensure the system can be safely rolled back to a previous known-good configuration at any time during the procedure. The fix for the HTTP Load Balancing performance feature that caused it to incorrectly believe zones within asia-northeast1 were healthy will be rolled out shortly. SUPPORT COMMUNICATIONS During the incident, customers who had originally contacted Google Cloud Support in Japanese did not receive periodic updates from Google as the event unfolded. This was due to a software defect in the support tooling — unrelated to the incident described earlier. We have already fixed the software defect, so all customers who contact support will receive incident updates. We apologize for the communications gap to our Japanese-language customers. RELIABILITY SUMMARY One of our biggest pushes in GCP reliability at Google is a focus on careful isolation of zones from each other. As we encourage users to build reliable services using multiple zones, we also treat zones separately in our production practices, and we enforce this isolation with software and policy. Since we missed this mark—and affecting all zones in a region is an especially serious outage—we apologize. We intend for this incident report to accurately summarize the detailed internal post-mortem that includes final assessment of impact, root cause, and steps we are taking to prevent an outage of this form occurring again. We hope that this incident report demonstrates the work we do to learn from our mistakes to deliver on this commitment. We will do better. Sincerely, Benjamin Lutch | VP Site Reliability Engineering | Google

Last Update: A few months ago

RESOLVED: Incident 17008 - Network issue in asia-northeast1

ISSUE SUMMARY On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, datacenters in the asia-northeast1 region experienced a loss of network connectivity for a total of 62 minutes. We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve. We recognize the severity of this incident and have completed an extensive internal postmortem. We thoroughly understand the root causes and no datacenters are at risk of recurrence. We are working to add mechanisms that prevent and mitigate this class of problem in the future. We have prioritized this work, and in the coming weeks our engineering team will complete the action items generated from the postmortem. DETAILED DESCRIPTION OF IMPACT On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, network connectivity to and from Google Cloud services running in the asia-northeast1 region was unavailable for 62 minutes. This issue affected all Google Cloud Platform services in that region, including Compute Engine, App Engine, Cloud SQL, Cloud Datastore, and Cloud Storage. All external connectivity to the region was affected during this time frame, while internal connectivity within the region was not affected. In addition, inbound requests from external customers originating near Google’s Tokyo point of presence intended for Compute or Container Engine HTTP Load Balancing were lost for the initial 12 minutes of the outage. Separately, Internal Load Balancing within asia-northeast1 remained degraded until 10:23. ROOT CAUSE At the time of the incident, Google engineers were upgrading the network topology and capacity of the region; a configuration error caused the existing links to be decommissioned before the replacement links could provide connectivity, resulting in a loss of connectivity for the asia-northeast1 region. Although the replacement links were already commissioned and appeared to be ready to serve, a network-routing protocol misconfiguration meant that the routes through those links were not able to carry traffic. As Google's global network grows continuously, we make upgrades and updates reliably by using automation for each step and, where possible, applying changes to only one zone at any time. asia-northeast1 was the last region whose topology was unsupported by this automation; manual work was required to align its topology with the rest of our regional deployments (which would, in turn, allow automation to function properly in the future). This manual change mistakenly did not follow the same per-zone restrictions required by standard policy or automation, which meant the entire region was affected simultaneously. In addition, some customers with deployments across multiple regions that included asia-northeast1 experienced problems with HTTP Load Balancing due to a failure to detect that the backends were unhealthy. When a network partition occurs, HTTP Load Balancing normally detects this automatically within a few seconds and routes traffic to backends in other regions. In this instance, due to a performance feature being tested in this region at the time, the mechanism that usually detects network partitions did not trigger, and the load balancer continued to attempt to assign traffic to the region until our on-call engineers responded.
Lastly, the Internal Load Balancing outage was exacerbated due to a software defined networking component which was stuck in a state where it was not able to provide network resolution for instances in the load balancing group. REMEDIATION AND PREVENTION Google engineers were paged by automated monitoring within one minute of the start of the outage, at 08:24 PDT. They began troubleshooting and declared an emergency incident 8 minutes later at 08:32. The issue was resolved when engineers reconnected the network path and reverted the configuration back to the last known working state at 09:22. Our monitoring systems worked as expected and alerted us to the outage promptly. The time-to-resolution for this incident was extended by the time taken to perform the rollback of the network change, as the rollback had to be performed manually. We are implementing a policy change that any manual work on live networks be constrained to a single zone. This policy will be enforced automatically by our change management software when changes are planned and scheduled. In addition, we are building automation to make these types of changes in future, and to ensure the system can be safely rolled back to a previous known-good configuration at any time during the procedure. The fix for the HTTP Load Balancing performance feature that caused it to incorrectly believe zones within asia-northeast1 were healthy will be rolled out shortly. SUPPORT COMMUNICATIONS During the incident, customers who had originally contacted Google Cloud Support in Japanese did not receive periodic updates from Google as the event unfolded. This was due to a software defect in the support tooling — unrelated to the incident described earlier. We have already fixed the software defect, so all customers who contact support will receive incident updates. We apologize for the communications gap to our Japanese-language customers. RELIABILITY SUMMARY One of our biggest pushes in GCP reliability at Google is a focus on careful isolation of zones from each other. As we encourage users to build reliable services using multiple zones, we also treat zones separately in our production practices, and we enforce this isolation with software and policy. Since we missed this mark—and affecting all zones in a region is an especially serious outage—we apologize. We intend for this incident report to accurately summarize the detailed internal post-mortem that includes final assessment of impact, root cause, and steps we are taking to prevent an outage of this form occurring again. We hope that this incident report demonstrates the work we do to learn from our mistakes to deliver on this commitment. We will do better. Sincerely, Benjamin Lutch | VP Site Reliability Engineering | Google
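The policy described above, constraining manual work on live networks to a single zone and enforcing it automatically in change-management tooling, can be illustrated with a small validation step at change-planning time. The sketch below is hypothetical: the data model and function are invented for illustration and do not describe Google's actual tooling.

```python
# Hypothetical sketch: reject a planned manual change that touches more than
# one zone. Names and structure are invented for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class PlannedChange:
    description: str
    zones: List[str]  # zones whose live network the change will touch


def validate_single_zone(change: PlannedChange) -> None:
    # Enforce the "one zone at a time" policy before the change is scheduled.
    if len(set(change.zones)) > 1:
        raise ValueError(
            f"change '{change.description}' touches {sorted(set(change.zones))}; "
            "manual work on live networks must be scoped to one zone at a time"
        )


validate_single_zone(PlannedChange("re-home uplinks", ["asia-northeast1-a"]))
```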

Last Update: A few months ago

UPDATE: Incident 18029 - BigQuery Increased Error Rate

The BigQuery service was experiencing a 78% error rate on streaming operations and up to 27% error rates on other operations from 10:43 to 11:03 US/Pacific time. This issue has been resolved for all affected projects as of 10:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17004 - Cloud console: changing language preferences

The Google Cloud Console issue that was preventing users from changing their language preferences has been resolved as of 06:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17004 - Cloud console: changing language preferences

The Google Cloud Console issue that is preventing users from changing their language preferences is ongoing. Our Engineering Team is working on it. We will provide another status update by 06:00 US/Pacific with current details. A known workaround is to change the browser language.

Last Update: A few months ago

UPDATE: Incident 17004 - Cloud console: changing language preferences

We are investigating an issue with the cloud console. Users are unable to change their language preferences. We will provide more information by 04:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17005 - High Latency in App Engine

ISSUE SUMMARY On Wednesday 7 June 2017, Google App Engine experienced highly elevated serving latency and timeouts for a duration of 138 minutes. If your service or application was affected by the increase in latency, we sincerely apologize – this is not the level of reliability and performance we expect of our platform, and we are taking immediate steps to improve the platform’s performance and availability. DETAILED DESCRIPTION OF IMPACT On Wednesday 7 June 2017, from 13:34 PDT to 15:52 PDT, 7.7% of active applications on the Google App Engine service experienced severely elevated latency; requests that typically take under 500ms to serve were taking many minutes. This elevated latency would have resulted either in users seeing additional latency when waiting for responses from the affected applications, or in 500 errors if the application handlers timed out. The individual application logs would have shown this increased latency or increases in “Request was aborted after waiting too long to attempt to service your request” error messages. ROOT CAUSE The incident was triggered by an increase in memory usage across all App Engine appservers in a datacenter in us-central. An App Engine appserver is responsible for creating instances to service requests for App Engine applications. When its memory usage increases to unsustainable levels, it will stop some of its current instances so that they can be rescheduled on other appservers in order to balance out the memory requirements across the datacenter. This transfer of an App Engine instance between appservers consumes CPU resources, a signal used by the master scheduler of the datacenter to detect when it must further rebalance traffic across more appservers (such as when traffic to the datacenter increases and more App Engine instances are required). Normally, these memory management techniques are transparent to customers, but in isolated cases they can be exacerbated by large amounts of additional traffic being routed to the datacenter, which requires more instances to service user requests. The increased load and memory requirements from scheduling new instances, combined with rescheduling instances from appservers with high memory usage, resulted in most appservers being considered “busy” by the master scheduler. User requests needed to wait for an available instance to either be transferred or created before they could be serviced, which resulted in the increased latency seen at the application level. REMEDIATION AND PREVENTION Latencies began to increase at 13:34 PDT. Google engineers were alerted to the increase at 13:45 PDT and identified a subset of traffic that was causing the increase in memory usage. At 14:08, they were able to limit this subset of traffic to an isolated partition of the datacenter to ease the memory pressure on the remaining appservers. Latency for new requests started to improve as soon as this traffic was isolated; however, tail latency was still elevated due to the large backlog of requests that had accumulated since the incident started. This backlog was eventually cleared by 15:52 PDT. To prevent further recurrence, traffic to the affected datacenter was rebalanced with another datacenter. To prevent future recurrence of this issue, Google engineers will be re-evaluating the resource distribution in the us-central datacenters where App Engine instances are hosted.
Additionally, engineers will be developing stronger alerting thresholds based on memory pressure signals so that traffic can be redirected before latency increases. And finally, engineers will be evaluating changes to the scheduling strategy used by the master scheduler responsible for scheduling appserver work to prevent this situation in the future.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

The issue with Cloud Logging exports to BigQuery failing should be resolved for the majority of projects and we expect a full resolution in the next 12 hours. We will provide another status update by 14:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

The issue with Cloud Logging exports to BigQuery failing should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 2am PST with current details.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

The issue with Cloud Logging exports to BigQuery failing should be resolved for some projects and we expect a full resolution in the near future. We will provide another status update by 11pm PST with current details.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

We are still investigating the issue with Cloud Logging exports to BigQuery failing. We will provide more information by 9pm PST. Currently, we are also working on restoring Cloud Logging exports to BigQuery.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

We are still investigating the issue with Cloud Logging exports to BigQuery failing. We will provide more information by 7pm PST. Currently, we are also working on restoring Cloud Logging exports to BigQuery.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

We are working on restoring Cloud Logging exports to BigQuery. We will provide further updates at 6pm PT.

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

Cloud Logging exports to BigQuery failed from 13:19 to approximately 14:30 with loss of logging data. We have stopped the exports while we work on fixing the issue, so BigQuery will not reflect the latest logs. This incident only affects robot accounts using HTTP requests. We are working hard to restore service, and we will provide another update in one hour (by 5pm PT).

Last Update: A few months ago

RESOLVED: Incident 17003 - Cloud Logging export to BigQuery failing.

We are investigating an issue with Cloud Logging exports to BigQuery failing. We will provide more information by 5pm PT

Last Update: A few months ago

RESOLVED: Incident 17006 - Network issue in asia-northeast1

Network connectivity in asia-northeast1 has been restored for all affected users as of 10:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 17008 - Network issue in asia-northeast1

Network connectivity in asia-northeast1 has been restored for all affected users as of 10:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17008 - Network issue in asia-northeast1

Network connectivity in asia-northeast1 should be restored for all affected users and we expect a full resolution in the near future. We will provide another status update by 09:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - Network issue in asia-northeast1

Network connectivity in asia-northeast1 should be restored for all affected users and we expect a full resolution in the near future. We will provide another status update by 09:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17008 - Network issue in asia-northeast1

Google Cloud Platform services in region asia-northeast1 are experiencing connectivity issues. We will provide another status update by 9:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17006 - Network issue in asia-northeast1

Google Cloud Platform services in region asia-northeast1 are experiencing connectivity issues. We will provide another status update by 9:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17008 - Network issue in asia-northeast1

We are investigating an issue with network connectivity in asia-northeast1. We will provide more information by 09:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17006 - Network issue in asia-northeast1

We are investigating an issue with network connectivity in asia-northeast1. We will provide more information by 09:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17005 - High Latency in App Engine

The issue with Google App Engine displaying elevated error rate has been resolved for all affected projects as of 15:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17005 - High Latency in App Engine

The issue with Google App Engine displaying elevated error rate should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - High Latency in App Engine

We have identified an issue with App Engine that is causing increased latency to a portion of applications in the US Central region. Mitigation is under way. We will provide more information about the issue by 15:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

The fix has been fully deployed. We confirmed that the issue has been fixed.

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

The fix is still being deployed. We will provide another status update by 2017-05-26 19:00 US/Pacific

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

The fix is currently being deployed. We will provide another status update by 2017-05-26 16:00 US/Pacific

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

The root cause has been identified. The fix is currently being deployed. We will provide another status update by 2017-05-26 14:30 US/Pacific

Last Update: A few months ago

UPDATE: Incident 5204 - Google App Engine Incident #5204

We are currently investigating a problem that is causing App Engine apps to experience an infinite redirect loop when users log in. We will provide another status update by 2017-05-26 13:30 US/Pacific

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

The issue with Cloud SQL First Generation automated backups should be resolved for all affected instances as of 12:52 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

We are still actively working on this issue. We are aiming to make the final fix available in production by end of day today.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

Daily backups continue to be taken and we expect the final fix to be available tomorrow. We'll provide another update on Wednesday, May 24 at 10:00 US/Pacific, as originally planned.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

We are still actively working on this issue. We are aiming to make the fix available in production by end of day today. We will provide another update by end of day if anything changes.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL First Generation automated backups experiencing errors

Daily backups are being taken for all Cloud SQL First Generation instances. For some instances, backups are being taken outside of defined backup windows. A fix is being tested and will be applied to First Generation instances pending positive test results. We will provide next update Wednesday, May 24 10:00 US/Pacific or if anything changes in between.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL v1 automated backups experiencing errors

The issue with automatic backups for Cloud SQL First Generation instances is mitigated by forcing the backups. We’ll continue with this mitigation until Monday US/Pacific, when a permanent fix will be rolled out. We will provide the next update on Monday US/Pacific, or earlier if anything changes.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL v1 automated backups experiencing errors

The Cloud SQL service is experiencing errors on automatic backups for 7% of Cloud SQL First Generation instances. We’re forcing the backup for affected instances as a short-term mitigation. We will provide another status update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17015 - Cloud SQL v1 automated backups experiencing errors

We are investigating an issue with Cloud SQL v1 automated backups. We will provide more information by 17:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Translate API elevated latency / errors

The issue with Translation Service and other API availability has been resolved for all affected users as of 19:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. This will be the final update on this incident.

Last Update: A few months ago

UPDATE: Incident 17001 - Translate API elevated latency / errors

Engineering is continuing to investigate the API service issues impacting Translation Service API availability, looking into potential causes and mitigation strategies. Certain other APIs, such as Speech and Prediction, may also be affected. Next update by 20:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17002 - GKE IP rotation procedure does not include FW rule change

Our Engineering Team has identified a fix for the issue with the GKE IP-rotation feature. We expect the rollout of the fix to begin next Tuesday, 2017-05-16 US/Pacific, completing on Friday, 2017-05-19.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

The issue with Google App Engine deployments and Memcache availability should have been resolved for all affected projects as of 18:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

The issue with Google App Engine deployments and Memcache availability is mitigated. We will provide an update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

We are continuing to investigate the issue with an underlying component that affects Google App Engine deployments and Memcache availability. The engineering team has tried several unsuccessful remediations and is continuing to investigate potential root causes and fixes. We will provide another update at 17:30 PDT.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

We are still investigating the issue with an underlying component that affects both Google App Engine deployments and Memcache availability. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - Deployment Failures and Memcache Unavailability Due to Underlying Component

We are currently investigating an issue with an underlying component that affects both Google App Engine deployments and Memcache availability. Deployments will fail intermittently and memcache is returning intermittent errors for a small number of applications. For affected deployments, please try re-deploying while we continue to investigate this issue. For affected memcache users, retries in your application code should allow you to access your memcache intermittently while the underlying issue is being addressed. Work is ongoing to address the issue by the underlying component's engineering team. We will provide our next update at 15:30 PDT.

Last Update: A few months ago

UPDATE: Incident 17002 - GKE IP rotation procedure does not include FW rule change

We are experiencing an issue with the GKE IP rotation feature. Kubernetes features that rely on the proxy (including kubectl exec and logs commands, as well as exporting cluster metrics into Stackdriver) are broken by the GKE IP rotation feature. This only affects users who have disabled default ssh access on their nodes. There is a manual fix described [here](https://cloud.google.com/container-engine/docs/ip-rotation#known_issues)

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

The issue with BigQuery jobs being in a pending state for too long has been resolved for all affected projects as of 03:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

The issue with BigQuery jobs being in a pending state for too long should be resolved for all new jobs. We will provide another status update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

BigQuery engineers are still working on a fix for jobs remaining in a pending state for too long. We will provide another update by 03:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

BigQuery engineers have identified the root cause for jobs remaining in a pending state for too long and are still working on a fix. We will provide another update by 02:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

BigQuery engineers have identified the root cause for jobs remaining in a pending state for too long and are applying a fix. We will provide another update by 01:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

BigQuery engineers continue to investigate jobs remaining in a pending state for too long. We will provide another update by 00:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

The BigQuery service is experiencing jobs staying in a pending state for longer than usual, and our engineering team is working on it. We will provide another status update by 23:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18028 - BigQuery import jobs pending

We are investigating an issue with BigQuery import jobs pending. We will provide more information by 23:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Node pool creation failing in multiple zones

We are investigating an issue with Google Container Engine (GKE) that is affecting node-pool creations in the following zones: us-east1-d, asia-northeast1-c, europe-west1-c, us-central1-b, us-west1-a, asia-east1-a, asia-northeast1-a, asia-southeast1-a, us-east4-b, us-central1-f, europe-west1-b, asia-east1-c, us-east1-c, us-west1-b, asia-northeast1-b, asia-southeast1-b, and us-east4-c. We will provide more information by 17:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17007 - 502 errors for HTTP(S) Load Balancers

ISSUE SUMMARY On Friday 5 April 2017, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. We apologize for this incident. We understand that the Load Balancer needs to be very reliable for you to offer a high-quality service to your customers. We have taken and will be taking various measures to prevent this type of incident from recurring. DETAILED DESCRIPTION OF IMPACT On Friday 5 April 2017 from 01:13 to 01:35 PDT, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. Clients received 502 errors for failed requests. Some HTTP(S) Load Balancers that were recently modified experienced error rates of 100%. Google paused all configuration changes to the HTTP(S) Load Balancer for three hours and 41 minutes after the incident, until our engineers had understood the root cause. This caused deployments of App Engine Flexible apps to fail during that period. ROOT CAUSE A bug in the HTTP(S) Load Balancer configuration update process caused it to revert to a configuration that was substantially out of date. The configuration update process is controlled by a master server. In this case, one of the replicas of the master servers lost access to Google's distributed file system and was unable to read recent configuration files. Mastership then passed to the server that could not access Google's distributed file system. When mastership changes, the new master begins the next configuration push as normal by testing on a subset of HTTP(S) Load Balancers. If this test succeeds, the configuration is pushed globally to all HTTP(S) Load Balancers. If the test fails (as it did in this case), the new master will revert all HTTP(S) Load Balancers to the last "known good" configuration. The combination of a mastership change, lack of access to more recent updates, and the initial test failure for the latest config caused the HTTP(S) Load Balancers to revert to the latest configuration that the master could read, which was substantially out of date. In addition, the update with the out-of-date configuration triggered a garbage collection process on the Google Frontend servers to free up memory used by the deleted configurations. The high number of deleted configurations caused the Google Frontend servers to spend a large proportion of CPU cycles on garbage collection, which led to failed health checks and the eventual restart of the affected Google Frontend servers. Any client requests served by a restarting server received 502 errors. REMEDIATION AND PREVENTION Google engineers were paged at 01:22 PDT. They switched the configuration update process to use a different master server at 01:34, which mitigated the issue for most services within one minute. Our engineers then paused the configuration updates to the HTTP(S) Load Balancer until 05:16 while the root cause was confirmed. To prevent incidents of this type in the future, we are taking the following actions: * Master servers will be configured to never push HTTP(S) Load Balancer configurations that are more than a few hours old. * Google Frontend servers will reject loading a configuration file that is more than a few hours old. * Improve testing for new HTTP(S) Load Balancer configurations so that out-of-date configurations are flagged before being pushed to production. * Fix the issue that caused the master server to fail when reading files from Google's distributed file system.
* Fix the issue that caused health check failures on Google Frontends during heavy garbage collection. Once again, we apologize for the impact that this incident had on your service.
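
To make the first two prevention items concrete, here is a minimal, hypothetical sketch of a staleness guard in Python. The generated_at field, the four-hour limit and the names are assumptions for illustration, not Google's internal implementation.

    import time

    # Assumed interpretation of "more than a few hours old".
    MAX_CONFIG_AGE_SECONDS = 4 * 60 * 60


    class ConfigTooOldError(Exception):
        pass


    def load_config(config, now=None):
        """Accept a configuration only if its generation timestamp is recent.

        `config` is assumed to be a dict carrying a 'generated_at' Unix
        timestamp written when the configuration snapshot was produced.
        """
        now = time.time() if now is None else now
        age = now - config["generated_at"]
        if age > MAX_CONFIG_AGE_SECONDS:
            # Keep serving the currently loaded configuration rather than
            # reverting to a stale snapshot, and let monitoring flag the push.
            raise ConfigTooOldError(
                "refusing to load config generated %.0fs ago (limit %ds)"
                % (age, MAX_CONFIG_AGE_SECONDS)
            )
        return config

A frontend that rejects a stale file this way keeps its last accepted configuration instead of serving one that is substantially out of date.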

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

The issue with Google App Engine Taskqueue has been resolved for all affected users as of 00:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

The issue with Google App Engine Taskqueue in us-east1/asia-northeast1 regions has been resolved. The issue with deployments in us-east1 is mitigated. For everyone who is still affected, we apologize for any inconvenience you may be experiencing. We will continue to monitor and will provide another status update by 2017-04-07 02:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

The issue with Google App Engine Taskqueue in us-east1/asia-northeast1 regions has been partially resolved. The issue with deployments in us-east1 is partially mitigated. For everyone who is still affected, we apologize for any inconvenience you may be experiencing. We will provide another status update by 2017-04-07 00:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

The issue with Google App Engine Taskqueue in US-east1/Asia-northeast1 regions has been partially resolved. We are investigating related issues impacting deployments in US-east1. For everyone who is still affected, we apologize for any inconvenience you may be experiencing. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

We are still investigating reports of an issue with App Engine Taskqueue in US-east1/Asia-northeast1 regions. We will provide another status update by 2017-04-06 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - App Engine taskqueue error rate increase in US-east1/Asia-northeast1 region

We are investigating an issue impacting Google App Engine Task Queue in US-east1/Asia-northeast1. We will provide more information by 09:00pm US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18027 - BigQuery Streaming API returned a 500 response from 00:04 to 00:38 US/Pacific.

The issue with BigQuery Streaming API returning 500 response code has been resolved for all affected users as of 00:38 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 17001 - Issues with Cloud Pub/Sub

ISSUE SUMMARY

On Tuesday 21 March 2017, new connections to Cloud Pub/Sub experienced high latency leading to timeouts and elevated error rates for a duration of 95 minutes. Connections established before the start of this issue were not affected. If your service or application was affected, we apologize – this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT

On Tuesday 21 March 2017 from 21:08 to 22:43 US/Pacific, Cloud Pub/Sub publish, pull and ack methods experienced elevated latency, leading to timeouts. The average error rate for the issue duration was 0.66%. The highest error rate occurred at 21:43, when the Pub/Sub publish error rate peaked at 4.1%, the ack error rate reached 5.7% and the pull error rate was 0.02%.

ROOT CAUSE

The issue was caused by the rollout of a storage system used by the Pub/Sub service. As part of this rollout, some servers were taken out of service, and as planned, their load was redirected to remaining servers. However, an unexpected imbalance in key distribution led some of the remaining servers to become overloaded. The Pub/Sub service was then unable to retrieve the status required to route new connections for the affected methods. Additionally, some Pub/Sub servers didn’t recover promptly after the storage system had been stabilized and required individual restarts to fully recover.

REMEDIATION AND PREVENTION

Google engineers were alerted by automated monitoring seven minutes after initial impact. At 21:24, they had correlated the issue with the storage system rollout and stopped it from proceeding further. At 21:41, engineers restarted some of the storage servers, which improved systemic availability. Observed latencies for Pub/Sub were still elevated, so at 21:54, engineers commenced restarting other Pub/Sub servers, restoring service to 90% of users. At 22:29 a final batch was restarted, restoring the Pub/Sub service to all. To prevent a recurrence of the issue, Google engineers are creating safeguards to limit the number of keys managed by each server. They are also improving the availability of Pub/Sub servers to respond to requests even when in an unhealthy state. Finally they are deploying enhancements to the Pub/Sub service to operate when the storage system is unavailable.
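
As a rough illustration of the "limit the number of keys managed by each server" safeguard mentioned above, here is a hypothetical Python sketch; the cap, class names and redistribution strategy are assumptions, not Pub/Sub's actual internals.

    # Assumed per-server cap; a real value would come from capacity planning.
    MAX_KEYS_PER_SERVER = 10000


    class KeyLimitExceeded(Exception):
        pass


    class StorageServer:
        def __init__(self, name, max_keys=MAX_KEYS_PER_SERVER):
            self.name = name
            self.max_keys = max_keys
            self.keys = set()

        def assign_key(self, key):
            # Refuse new keys once the cap is reached, so load redistributed
            # from drained servers cannot overload a single remaining server.
            if len(self.keys) >= self.max_keys:
                raise KeyLimitExceeded(self.name)
            self.keys.add(key)


    def redistribute(keys, servers):
        """Assign each key to the least-loaded server that still has room."""
        for key in keys:
            for server in sorted(servers, key=lambda s: len(s.keys)):
                try:
                    server.assign_key(key)
                    break
                except KeyLimitExceeded:
                    continue
            else:
                raise KeyLimitExceeded("no server has remaining capacity")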

Last Update: A few months ago

UPDATE: Incident 17001 - Issues with Cloud Pub/Sub

The issue with Pub/Sub high latency has been resolved for all affected projects as of 22:02 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17001 - Issues with Cloud Pub/Sub

We are investigating an issue with Pub/Sub. We will provide more information by 22:40 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18026 - BigQuery streaming inserts issue

ISSUE SUMMARY

On Monday 13 March 2017, the BigQuery streaming API experienced a 91% error rate in the US and a 63% error rate in the EU for a duration of 30 minutes. We apologize for the impact of this issue on our customers, and the widespread nature of the issue in particular. We have completed a post mortem of the incident and are making changes to mitigate and prevent recurrences.

DETAILED DESCRIPTION OF IMPACT

On Monday 13 March 2017 from 10:22 to 10:52 PDT, 91% of streaming API requests to US BigQuery datasets and 63% of streaming API requests to EU BigQuery datasets failed with error code 503 and an HTML message indicating "We're sorry... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now." All non-streaming API requests, including DDL requests and query, load, extract and copy jobs, were unaffected.

ROOT CAUSE

The trigger for this incident was a sudden increase in log entries being streamed from Stackdriver Logging to BigQuery by logs export. The denial of service (DoS) protection used by BigQuery responded to this by rejecting excess streaming API traffic. However, the configuration of the DoS protection did not adequately segregate traffic streams, resulting in normal sources of BigQuery streaming API requests being rejected.

REMEDIATION AND PREVENTION

Google engineers initially mitigated the issue by blocking the source of unexpected load. This prevented the overload and allowed all other traffic to resume normally. Engineers fully resolved the issue by identifying and reverting the change that triggered the increase in log entries and clearing the backlog of log entries that had grown. To prevent future occurrences, BigQuery engineers are updating configuration to increase isolation between different traffic sources. Tests are also being added to verify behavior under several new load scenarios.
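
The "increase isolation between different traffic sources" change can be illustrated with a per-source token bucket, sketched below in Python. The rate value and names are assumptions rather than BigQuery's real protection layer; the point is that a surge from one source (for example a logs-export pipeline) only throttles that source.

    import time
    from collections import defaultdict


    class PerSourceRateLimiter:
        """Token bucket tracked independently for each traffic source."""

        def __init__(self, requests_per_second):
            self.rate = float(requests_per_second)
            self.tokens = defaultdict(lambda: self.rate)
            self.last_seen = {}

        def allow(self, source):
            now = time.monotonic()
            elapsed = now - self.last_seen.get(source, now)
            self.last_seen[source] = now
            # Refill this source's bucket, capped at one second's allowance.
            self.tokens[source] = min(self.rate,
                                      self.tokens[source] + elapsed * self.rate)
            if self.tokens[source] < 1.0:
                return False  # only the surging source is throttled
            self.tokens[source] -= 1.0
            return True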

Last Update: A few months ago

UPDATE: Incident 17006 - GCE networking in us-central1 zones is experiencing disruption

GCP Services' internet connectivity has been restored as of 2:12 pm Pacific Time. We apologize for the impact that this issue had on your application. We are still investigating the root cause of the issue, and will take necessary actions to prevent a recurrence.

Last Update: A few months ago

UPDATE: Incident 17006 - GCE networking in us-central1 zones is experiencing disruption

We are experiencing a networking issue with Google Compute Engine instances in us-central zones beginning at Wednesday, 2017-03-15 01:00 PM US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18026 - BigQuery streaming inserts issue

The issue with BigQuery streaming inserts has been resolved for all affected projects as of 11:06 AM US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18026 - BigQuery streaming inserts issue

We are investigating an issue with BigQuery streaming inserts. We will provide more information by 11:45 AM US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16037 - Elevated Latency and Error Rates For GCS in Europe

During the period 12:05 - 13:57 PDT, GCS requests originating in Europe experienced a 17% error rate. GCS requests in other regions were unaffected.

Last Update: A few months ago

UPDATE: Incident 16037 - Elevated Latency and Error Rates For GCS in Europe

The issue with elevated latency and error rates for GCS in Europe should be resolved as of 13:56 PDT. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16037 - Elevated Latency and Error Rates For GCS in Europe

We are continuing to investigate elevated errors and latency for GCS in Europe. The reliability team has performed several mitigation steps and error rates and latency are returning to normal levels. We will continue to monitor for recurrence.

Last Update: A few months ago

UPDATE: Incident 16037 - Elevated Latency and Error Rates For GCS in Europe

We are currently investigating elevated latency and error rates for Google Cloud Storage traffic transiting through Europe.

Last Update: A few months ago

RESOLVED: Incident 17003 - GCP accounts with credits are being charged without credits being applied

We have mitigated the issue as of 2017-03-10 09:30 PST.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

Some Cloud Console users may notice that Dataflow job logs are not displayed correctly. This is a known issue with the user interface that affects up to 35% of jobs. Google engineers are preparing a fix. Pipeline executions are not impacted and Dataflow services are operating as normal.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

We are still investigating the issue with Dataflow Job Log in Cloud Console. Current data indicates that between 30% and 35% of jobs are affected by this issue. Pipeline execution is not impacted. The root cause of the issue is known and the Dataflow Team is preparing the fix for production.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

The root cause of the issue is known and the Dataflow Team is preparing the fix for production. We will provide an update by 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

We are experiencing a visibility issue with Dataflow Job Log in Cloud Console beginning at Thursday, 2017-03-09 11:34 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 13:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - Dataflow Job Log visibility issue in Cloud Console

We are investigating an issue with Dataflow Job Log visibility in Cloud Console. We will provide more information by 12:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17014 - Cloud SQL instance creation in zones us-central1-c, us-east1-c, asia-northeast1-a, asia-east1-b, us-central1-f may be failing.

The issue should have been mitigated for all zones except us-central1-c. Creating new Cloud SQL 2nd generation instances with SSD Persistent Disks in us-central1-c may still fail. As a workaround, create your instances in a different zone or use Standard Persistent Disks in us-central1-c.

Last Update: A few months ago

UPDATE: Incident 17014 - Cloud SQL instance creation in zones us-central1-c and asia-east1-c are failing.

Correction: Attempts to create new Cloud SQL instances in zones us-central1-c, us-east1-c, asia-northeast1-a, asia-east1-b, us-central1-f may be intermittently failing. New instances affected in these zones will show a status of "Failed to create". Users will not incur charges for instances that failed to create; these instances can be safely deleted.

Last Update: A few months ago

UPDATE: Incident 17014 - Cloud SQL instance creation in zones us-central1-c and asia-east1-c are failing.

Attempts to create new Cloud SQL instances in zones us-central1-c and asia-east1-c are failing. New instances created in these zones will show a status of "Failed to create". Users will not incur charges for instances that failed to create; these instances can be safely deleted.

Last Update: A few months ago

RESOLVED: Incident 17005 - Network packet loss to Compute Engine us-west1 region

We confirm that the issue with GCE network connectivity to us-west1 should have been resolved for all affected endpoints as of 03:27 US/Pacific and the situation is stable. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17005 - Network packet loss to Compute Engine us-west1 region

GCE network connectivity to us-west1 remains stable and we expect a final resolution in the near future. We will provide another status update by 05:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - Network packet loss to Compute Engine us-west1 region

Network connectivity to the Google Compute Engine us-west1 region has been restored but the issue remains under investigation. We will provide another status update by 05:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17005 - Network packet loss to Compute Engine us-west1 region

We are experiencing an issue with GCE network connectivity to the us-west1 region beginning at Tuesday, 2017-02-28 02:57 US/Pacific. We will provide a further update by 04:45.

Last Update: A few months ago

UPDATE: Incident 17005 - Network packet loss to Compute Engine us-west1 region

We are investigating an issue with network connectivity to the us-west1 region. We will provide more information by 04:15 US/Pacific time.

Last Update: A few months ago

RESOLVED: Incident 17002 - Cloud Datastore Internal errors in the European region

ISSUE SUMMARY

On Tuesday 14 February 2017, some applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced 2%-4% error rates and elevated latency for three periods with an aggregate duration of three hours and 36 minutes. We apologize for the disruption this caused to your service. We have already taken several measures to prevent incidents of this type from recurring and to improve the reliability of these services.

DETAILED DESCRIPTION OF IMPACT

On Tuesday 14 February 2017 between 00:15 and 01:18 PST, 54% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates and latency. The average error rate for affected applications was 4%. Between 08:35 and 08:48 PST, 50% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates. The average error rate for affected applications was 4%. Between 12:20 and 14:40 PST, 32% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates and latency. The average error rate for affected applications was 2%. Errors received by affected applications for all three incidents were either "internal error" or "timeout".

ROOT CAUSE

The incident was caused by a latent bug in a service used by both Cloud Datastore and the App Engine Search API that was triggered by high load on the service. Starting at 00:15 PST, several applications changed their usage patterns in one zone in Western Europe and began running more complex queries, which caused higher load on the service.

REMEDIATION AND PREVENTION

Google's monitoring systems paged our engineers at 00:35 PST to alert us to elevated errors in a single zone. Our engineers followed normal practice, by redirecting traffic to other zones to reduce the impact on customers while debugging the underlying issue. At 01:15, we redirected traffic to other zones in Western Europe, which resolved the incident three minutes later. At 08:35 we redirected traffic back to the zone that previously had errors. We found that the error rate in that zone was still high and so redirected traffic back to other zones at 08:48. At 12:45, our monitoring systems detected elevated errors in other zones in Western Europe. At 14:06 Google engineers added capacity to the service with elevated errors in the affected zones. This removed the trigger for the incident. We have now identified and fixed the latent bug that caused errors when the system was at high load. We expect to roll out this fix over the next few days. Our capacity planning team have generated forecasts for peak load generated by the Cloud Datastore and App Engine Search API and determined that we now have sufficient capacity currently provisioned to handle peak loads. We will be making several changes to our monitoring systems to improve our ability to quickly detect and diagnose errors of this type. Once again, we apologize for the impact of this incident on your application.
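
The remediation above repeatedly redirected traffic away from a zone with elevated errors. A minimal, hypothetical sketch of that kind of routing decision is shown below; the 2% threshold and the function name are assumptions for illustration.

    ERROR_RATE_THRESHOLD = 0.02  # assumed trip point for draining a zone


    def routable_zones(zones, error_rates):
        """Return the zones currently considered healthy enough to serve.

        `error_rates` maps a zone name to its recent error fraction. If every
        zone is above the threshold, fall back to all zones rather than
        dropping traffic entirely.
        """
        healthy = [z for z in zones if error_rates.get(z, 0.0) < ERROR_RATE_THRESHOLD]
        return healthy or list(zones)

    # Example: a zone erroring at 4% is drained while its neighbor keeps serving.
    # routable_zones(["europe-west1-b", "europe-west1-c"],
    #                {"europe-west1-b": 0.04, "europe-west1-c": 0.001})
    # -> ["europe-west1-c"]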

Last Update: A few months ago

UPDATE: Incident 17004 - Persistent Disk Does Not Produce Differential Snapshots In Some Cases

Since January 23rd, a small number of Persistent Disk snapshots were created as full snapshots rather than differential. While this results in overbilling, these snapshots still correctly backup your data and are usable for restores. We are working to resolve this issue and also to correct any overbilling that occurred. No further action is required from your side.

Last Update: A few months ago

RESOLVED: Incident 17002 - Cloud Datastore Internal errors in the European region

The issue with Cloud Datastore serving elevated internal errors in the European region should have been resolved for all affected projects as of 14:34 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17002 - Cloud Datastore Internal errors in the European region

We are investigating an issue with Cloud Datastore in the European region. We will provide more information by 15:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17003 - New VMs are experiencing connectivity issues

ISSUE SUMMARY

On Monday 30 January 2017, newly created Google Compute Engine instances, Cloud VPNs and network load balancers were unavailable for a duration of 2 hours 8 minutes. We understand how important the flexibility to launch new resources and scale up GCE is for our users and apologize for this incident. In particular, we apologize for the wide scope of this issue and are taking steps to address the scope and duration of this incident as well as the root cause itself.

DETAILED DESCRIPTION OF IMPACT

Any GCE instances, Cloud VPN tunnels or GCE network load balancers created or live migrated on Monday 30 January 2017 between 10:36 and 12:42 PST were unavailable via their public IP addresses until the end of that period. This also prevented outbound traffic from affected instances and load balancing health checks from succeeding. Previously created VPN tunnels, load balancers and instances that did not experience a live migration were unaffected.

ROOT CAUSE

All inbound networking for GCE instances, load balancers and VPN tunnels enters via shared layer 2 load balancers. These load balancers are configured with changes to IP addresses for these resources, then automatically tested in a canary deployment, before changes are globally propagated. The issue was triggered by a large set of updates which were applied to a rarely used load balancing configuration. The application of updates to this configuration exposed an inefficient code path which resulted in the canary timing out. From this point all changes of public addressing were queued behind these changes that could not proceed past the testing phase.

REMEDIATION AND PREVENTION

To resolve the issue, Google engineers restarted the jobs responsible for programming changes to the network load balancers. After restarting, the problematic changes were processed in a batch, which no longer reached the inefficient code path. From this point updates could be processed and normal traffic resumed. This fix was applied zone by zone between 11:36 and 12:42. To prevent this issue from recurring in the short term, Google engineers are increasing the canary timeout so that updates exercising the inefficient code path merely slow network changes rather than completely stop them. As a long term resolution, the inefficient code path is being improved, and new tests are being written to test behaviour on a wider range of configurations. Google engineers had already begun work to replace global propagation of address configuration with decentralized routing. This work is being accelerated as it will prevent issues with this layer having global impact. Google engineers are also creating additional metrics and alerting that will allow the nature of this issue to be identified sooner, which will lead to faster resolution.
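
The short-term fix raises the canary timeout so that a slow update delays propagation instead of blocking the queue entirely. A hypothetical sketch of a canary-then-propagate step follows; the timeout value and function names are assumptions, not the actual propagation system.

    import concurrent.futures

    CANARY_TIMEOUT_SECONDS = 600  # assumed, generous enough for slow code paths


    def apply_update(update, canary_fn, propagate_fn):
        """Test the update on a canary; propagate globally only if it passes in time."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(canary_fn, update)
        try:
            ok = future.result(timeout=CANARY_TIMEOUT_SECONDS)
        except concurrent.futures.TimeoutError:
            ok = False  # treat a timed-out canary as a failed update
        finally:
            pool.shutdown(wait=False)
        if not ok:
            return False  # keep the last known good configuration
        propagate_fn(update)
        return True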

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

The issue with Google App Engine should have been resolved for all affected users as of 17:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

The issue with Google App Engine has been partially resolved. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide another status update by 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

The issue with Google App Engine has been partially resolved. For everyone who is still affected, we apologize for any inconvenience you may be experiencing. We will provide another status update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

The issue with Google App Engine should have been partially resolved. We will provide another status update by 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17001 - We are currently investigating reports of Intermittent Errors(502s) with Google App Engine

We are investigating reports of intermittent errors (502s) in Google App Engine. We will provide more information by 15:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Investigating possible Google Cloud Datastore Application Monitoring Metrics problem

The issue with Google Cloud Datastore Application Monitoring Metrics has been fully resolved for all affected Applications as of 1:30pm US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17001 - Investigating possible Google Cloud Datastore Application Monitoring Metrics

We are investigating an issue with Google Cloud Datastore related to Application Monitoring Metrics. We will provide more information by 1:30pm US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17013 - Issue with Cloud SQL 2nd Generation instances beginning at Tuesday, 2017-01-26 01:00 US/Pacific.

The issue with Cloud SQL 2nd Generation Instances should have been resolved for all affected instances as of 21:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17013 - Issue with Cloud SQL 2nd Generation instances beginning at Tuesday, 2017-01-26 01:00 US/Pacific.

We are currently experiencing an issue with Cloud SQL 2nd Generation instances beginning at Tuesday, 2017-01-26 01:00 US/Pacific. This may cause poor performance or query failures for large queries on impacted instances. Current data indicates that 3% of Cloud SQL 2nd Generation Instances were affected by this issue. As of 2017-01-31 20:30 PT, a fix has been applied to the majority of impacted instances, and we expect a full resolution in the near future. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 21:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

We have fully mitigated the network connectivity issues for newly-created GCE instances as of 12:45 US/Pacific, with VPNs connectivity issues being fully mitigated at 12:50 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

The issue with newly-created GCE instances experiencing network connectivity problems should have been mitigated for all GCE regions except europe-west1, which is currently clearing. Newly-created VPNs are affected as well; we are still working on a mitigation for this. We will provide another status update by 13:10 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

The issue with newly-created GCE instances experiencing network connectivity problems should have been mitigated for the majority of GCE regions and we expect a full resolution in the near future. We will provide another status update by 12:40 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

We are experiencing a connectivity issue affecting newly-created VMs, as well as those undergoing live migrations beginning at Monday, 2017-01-30 10:54 US/Pacific. Mitigation work is currently underway. All zones should be coming back online in the next 15-20 minutes. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide another update at 12:10 PST.

Last Update: A few months ago

UPDATE: Incident 17003 - New VMs are experiencing connectivity issues

We are investigating an issue with newly-created VMs not having network connectivity. This also affects VMs undergoing live migrations. We will provide more information by 11:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 17002 - Incident in progress - Some projects not visible for customers

The issue with listing GCP projects and organizations should have been resolved for all affected users as of 15:21 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17002 - Incident in progress - Some projects not visible for customers

The issue is still occurring for some projects for some users. Mitigation is still underway. We will provide the next update by 16:30 US/Pacific time.

Last Update: A few months ago

UPDATE: Incident 17002 - Incident in progress - Some projects not visible for customers

Listing of Google Cloud projects and organizations is still failing to show some entries. As this only affects listing, users can access their projects by navigating directly to appropriate URLs. Google engineers have a mitigation plan that is underway. We will provide another status update by 14:30 US/Pacific time with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Incident in progress - Some projects not visible for customers

We are experiencing an intermittent issue with the Google Cloud Projects search index beginning at Monday, 2017-01-23 00:00 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 8:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18025 - Bigquery query job failures

The issue with BigQuery's Table Copy service and query jobs should have been resolved for all affected projects as of 07:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18025 - Bigquery query job failures

The issue with BigQuery's Table Copy service and query jobs should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 08:00 US/Pacific, 2017-01-21.

Last Update: A few months ago

UPDATE: Incident 18025 - Bigquery query job failures

We believe the issue with BigQuery's Table Copy service and query jobs should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 06:00 US/Pacific, 2017-01-21.

Last Update: A few months ago

UPDATE: Incident 18025 - Bigquery query job failures

We believe the issue with BigQuery's Table Copy service and query jobs should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 04:00 US/Pacific, 2017-01-21.

Last Update: A few months ago

UPDATE: Incident 18025 - Bigquery query job failures

We are experiencing an issue with Bigquery query jobs beginning at Friday, 2017-01-21 18:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17001 - Cloud Console and Stackdriver may display the number of App Engine instances as zero for Java and Go

The issue with Cloud Console and Stackdriver showing the number of App Engine instances as zero should have been resolved for all affected projects as of 01:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17001 - Cloud Console and Stackdriver may display the number of App Engine instances as zero for Java and Go

We are experiencing an issue with Cloud Console and Stackdriver which may show the number of App Engine instances as zero beginning at Wednesday, 2017-01-18 18:45 US/Pacific. This issue should have been resolved for the majority of projects and we expect a full resolution by 2017-01-20 00:00 PST.

Last Update: A few months ago

RESOLVED: Incident 18024 - BigQuery Web UI currently unavailable for some Customers

The issue with BigQuery UI should have been resolved for all affected users as of 05:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18024 - BigQuery Web UI currently unavailable for some Customers

The issue with BigQuery Web UI should have been resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 06:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18024 - BigQuery Web UI currently unavailable for some Customers

We are investigating an issue with BigQuery Web UI. We will provide more information by 2017-01-10 06:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

The issues with Cloud Shell and Compute Engine serial output should have been resolved for all affected instances as of 22:25 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

The issue with Cloud Shell should be resolved for all customers at this time. The issue with Compute Engine serial port output should be resolved for all new instances created after 19:45 PT in all zones. Instances created before 14:10 PT remain unaffected. Some instances created between 14:10 and 19:45 PT in us-east1-c and us-west1 may still be unable to view the serial output. We are currently in the process of applying the fix to zones in this region. Access to the serial console output should be restored for instances created between 14:10 and 19:45 PT in all other zones. We expect a full resolution in the near future. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

The issue with Cloud Shell should be resolved for all customers at this time. The issue with Compute Engine serial port output should be resolved for all new instances created after 19:45 PT in all zones. Instances created before 14:10 PT remain unaffected. Access to the serial console output should be restored for all instances in the asia-east1 and asia-northeast1 regions, and the us-central1-a zone, created between 14:10 and 19:45 PT. Some instances created between 14:10 and 19:45 PT in other zones may still be unable to view the serial console output. We are currently in the process of applying the fix to the remaining zones. We expect a full resolution in the near future. We will provide another status update by 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

The issue with Cloud Shell should be resolved for all customers at this time. The issue with Compute Engine serial port output should be resolved for all new instances created after 19:45 PT, and all instances in us-central1-a, and asia-east1-b created between 14:10 and 19:45 PT. All other instances created before 14:10 PT remain unaffected. We expect a full resolution in the near future. We will provide another status update by 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

We are still investigating the issues with Compute Engine and Cloud Shell, and do not have any news at this time. We will provide another status update by 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17002 - Issues with Compute Engine serial port output and Cloud Shell

We are experiencing issues with Compute Engine serial port output and Cloud Shell beginning at Sunday, 2017-01-08 14:10 US/Pacific. Current data indicates that newly-created instances are unable to use the "get-serial-port-output" command in "gcloud", or use the Cloud Console to retrieve serial port output. Customers can still use the interactive serial console on these newly-created instances. Instances created before the impact time do not appear to be affected. Additionally, the Cloud Shell is intermittently available at this time. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:00 US/Pacific with current details.
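
For reference, the affected "get-serial-port-output" operation corresponds to the Compute Engine API method instances.getSerialPortOutput. A minimal sketch of calling it from Python is shown below, assuming the google-api-python-client library and Application Default Credentials; PROJECT, ZONE and INSTANCE are placeholders.

    import google.auth
    from googleapiclient import discovery

    credentials, _ = google.auth.default()
    compute = discovery.build("compute", "v1", credentials=credentials)

    # Fetch the serial port output for one instance; the response carries the
    # text in its "contents" field.
    result = (
        compute.instances()
        .getSerialPortOutput(project="PROJECT", zone="ZONE", instance="INSTANCE")
        .execute()
    )
    print(result.get("contents", ""))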

Last Update: A few months ago

RESOLVED: Incident 17001 - Cloud VPN issues in regions us-west1 and us-east1

The issue with Cloud VPN where some tunnels weren’t connecting in us-east1 and us-west1 should have been resolved for all affected tunnels as of 23:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17001 - Cloud VPN issues in regions us-west1 and us-east1

We are investigating reports of an issue with Cloud VPN in regions us-west1 and us-east1. We will provide more information by 23:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16036 - Cloud Storage is showing inconsistent result for object listing for multi-regional buckets in US

The issue with Cloud Storage showing inconsistent results for object listing for multi-regional buckets in US should have been resolved for all affected projects as of 23:50 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16036 - Cloud Storage is showing inconsistent result for object listing for multi-regional buckets in US

We are still investigating the issue with Cloud Storage showing inconsistent result for object listing for multi-regional buckets in US. We will provide another status update by 2016-12-20 02:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16036 - Cloud Storage is showing inconsistent result for object listing for multi-regional buckets in US

We are experiencing an intermittent issue with Cloud Storage showing inconsistent result for object listing for multi-regional buckets in US beginning approximately at Monday, 2016-12-16 09:00 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2016-12-17 00:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 16035 - Elevated Cloud Storage error rate and latency

The issue with Google Cloud Storage seeing increased errors and latency should have been resolved for all affected users as of 09:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16035 - Elevated Cloud Storage error rate and latency

We are continuing to experience an issue with Google Cloud Storage. Errors and latency have decreased, but are not yet at pre-incident levels. We are continuing to investigate mitigation strategies and to identify the root cause. Impact is limited to the US region. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16035 - Elevated Cloud Storage error rate and latency

We are investigating an issue with Google Cloud Storage serving increased errors and at a higher latency. We will provide more information by 10:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16011 - App Engine remote socket API errors in us-central region

The issue with App Engine applications having higher than expected socket API errors should have been resolved for all affected applications as of 22:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

The issue with App Engine remote socket API errors in us-central region should have been resolved for all affected projects as of 19:46 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

We are still investigating reports of an issue with App Engine applications having higher than expected socket API errors if they are located in the us-central region. We will provide another status update by 2016-12-02 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

We are still investigating reports of an issue with App Engine applications having higher than expected socket API errors if they are located in the us-central region. We will provide another status update by 2016-12-02 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

We are still investigating reports of an issue with App Engine applications having higher than expected socket API errors if they are located in the us-central region. We will provide another status update by 2016-12-02 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16011 - App Engine remote socket API errors in us-central region

We are investigating reports of an issue with App Engine applications having higher than expected socket API errors if they are located in the us-central region. We will provide another status update by 2016-12-02 19:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18023 - Increased 500 error rate for BigQuery API calls

The issue with increased 500 errors from the BigQuery API has been resolved. We apologize for the impact that this incident had on your application.

Last Update: A few months ago

RESOLVED: Incident 18022 - BigQuery Streaming API failing

Small correction to the incident report. The resolution time of the incident was 20:00 US/Pacific, not 20:11 US/Pacific. Similarly, total downtime was 4 hours.

Last Update: A few months ago

RESOLVED: Incident 18022 - BigQuery Streaming API failing

SUMMARY:

On Tuesday 8 November 2016, Google BigQuery’s streaming service, which includes streaming inserts and queries against recently committed streaming buffers, was largely unavailable for a period of 4 hours and 11 minutes. To our BigQuery customers whose business analytics were impacted during this outage, we sincerely apologize. We will be providing an SLA credit for the affected timeframe. We have conducted an internal investigation and are taking steps to improve our service.

DETAILED DESCRIPTION OF IMPACT:

On Tuesday 8 November 2016 from 16:00 to 20:11 US/Pacific, 73% of BigQuery streaming inserts failed with a 503 error code indicating an internal error had occurred during the insertion. At peak, 93% of BigQuery streaming inserts failed. During the incident, queries performed on tables with recently-streamed data returned a result code (400) indicating that the table was unavailable for querying. Queries against tables in which data were not streamed within the 24 hours preceding the incident were unaffected. There were no issues with non-streaming ingestion of data.

ROOT CAUSE:

The BigQuery streaming service requires authorization checks to verify that it is streaming data from an authorized entity to a table that entity has permissions to access. The authorization service relies on a caching layer in order to reduce the number of calls to the authorization backend. At 16:00 US/Pacific, a combination of reduced backend authorization capacity coupled with routine cache entry refreshes caused a surge in requests to the authorization backends, exceeding their current capacity. Because BigQuery does not cache failed authorization attempts, this overload meant that new streaming requests would require re-authorization, thereby further increasing load on the authorization backend. This continual increase of authorization requests on an already overloaded authorization backend resulted in continued and sustained authorization failures which propagated into streaming request and query failures.

REMEDIATION AND PREVENTION:

Google engineers were alerted to issues with the streaming service at 16:21 US/Pacific. Their initial hypothesis was that the caching layer for authorization requests was failing. The engineers started redirecting requests to bypass the caching layer at 16:51. After testing the system without the caching layer, the engineers determined that the caching layer was working as designed, and requests were directed to the caching layer again at 18:12. At 18:13, the engineering team was able to pinpoint the failures to a set of overloaded authorization backends and begin remediation. The issue with authorization capacity was ultimately resolved by incrementally reducing load on the authorization system internally and increasing the cache TTL, allowing streaming authorization requests to succeed and populate the cache so that internal services could be restarted. Recovery of streaming errors began at 19:34 US/Pacific and the streaming service was fully restored at 20:11.

To prevent short-term recurrence of the issue, the engineering team has greatly increased the request capacity of the authorization backend. In the longer term, the BigQuery engineering team will work on several mitigation strategies to address the currently cascading effect of failed authorization requests. These strategies include caching intermediary responses to the authorization flow for the streaming service, caching failure states for authorization requests and adding rate limiting to the authorization service so that large increases in cache miss rate will not overwhelm the authorization backend. In addition, the BigQuery engineering team will be improving the monitoring of available capacity on the authorization backend and will add additional alerting so capacity issues can be mitigated before they become cascading failures. The BigQuery engineering team will also be investigating ways to reduce the spike in authorization traffic that occurs daily at 16:00 US/Pacific when the cache is rebuilt to more evenly distribute requests to the authorization backend.

Finally, we have received feedback that our communications during the outage left a lot to be desired. We agree with this feedback. While our engineering teams launched an all-hands-on-deck effort to resolve this issue within minutes of its detection, we did not adequately communicate both the level-of-effort and the steady progress of diagnosis, triage and restoration happening during the incident. We clearly erred in not communicating promptly, crisply and transparently to affected customers during this incident. We will be addressing our communications — for all Google Cloud systems, not just BigQuery — as part of a separate effort, which has already been launched. We recognize the extended duration of this outage, and we sincerely apologize to our BigQuery customers for the impact to your business analytics.
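
To illustrate the caching changes discussed above (caching failure states alongside successes, with a TTL), here is a hypothetical Python sketch; the TTL values and the backend callable are assumptions, not BigQuery's actual authorization service.

    import time

    SUCCESS_TTL_SECONDS = 300  # assumed
    FAILURE_TTL_SECONDS = 30   # shorter, so cached failures are retried sooner


    class AuthCache:
        """Caches both successful and failed authorization decisions."""

        def __init__(self, check_backend):
            self._check_backend = check_backend  # callable(entity, table) -> bool
            self._cache = {}  # (entity, table) -> (allowed, expires_at)

        def is_authorized(self, entity, table):
            key = (entity, table)
            now = time.monotonic()
            entry = self._cache.get(key)
            if entry is not None and entry[1] > now:
                return entry[0]  # served from cache, success or failure alike
            allowed = self._check_backend(entity, table)
            ttl = SUCCESS_TTL_SECONDS if allowed else FAILURE_TTL_SECONDS
            self._cache[key] = (allowed, now + ttl)
            return allowed

Because failures are cached briefly as well, a burst of cache misses cannot translate directly into a matching burst of calls to the authorization backend.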

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

The issue with the BigQuery Streaming API should have been resolved for all affected tables as of 20:07 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We're continuing to work to restore service to the BigQuery Streaming API. We will add an update at 20:30 US/Pacific with further information.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are continuing to investigate the issue with BigQuery Streaming API. We will add an update at 20:00 US/Pacific with further information.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We have taken steps to mitigate the issue, which has led to some improvements. The issue continues to impact the BigQuery Streaming API and tables with a streaming buffer. We will provide a further status update at 19:30 US/Pacific with current details

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are continuing to investigate the issue with BigQuery Streaming API. The issue may also impact tables with a streaming buffer, making them inaccessible. This will be clarified in the next update at 19:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are still investigating the issue with BigQuery Streaming API. There are no other details to share at this time but we are actively working to resolve this. We will provide another status update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are still investigating the issue with BigQuery Streaming API. Current data indicates that all projects are affected by this issue. We will provide another status update by 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18022 - BigQuery Streaming API failing

We are investigating an issue with the BigQuery Streaming API. We will provide more information by 17:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18021 - BigQuery Streaming API failing

We are investigating an issue with the BigQuery Streaming API. We will provide more information by 5:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16008 - Delete operations on Cloud Platform Console not being performed

The issue with Cloud Platform Console not being able to perform delete operations should have been resolved for all affected users as of 12:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 16003 - Issues with Cloud Pub/Sub

SUMMARY:

On Monday, 31 October 2016, 73% of requests to create new subscriptions for Google Cloud Pub/Sub failed for a duration of 124 minutes. Creation of new Cloud SQL Second Generation instances also failed during this incident. If your service or application was affected, we apologize. We have conducted a detailed review of the causes of this incident and are ensuring that we apply the appropriate fixes so that it will not recur.

DETAILED DESCRIPTION OF IMPACT:

On Monday, 31 October 2016 from 13:11 to 15:15 PDT, 73% of requests to create new subscriptions for Google Cloud Pub/Sub failed. 0.1% of pull requests experienced latencies of up to 4 minutes for end-to-end message delivery. Creation of all new Cloud SQL Second Generation instances also failed during this incident. Existing instances were not affected.

ROOT CAUSE:

At 13:08, a system in the Cloud Pub/Sub control plane experienced a connectivity issue to its persistent storage layer for a duration of 83 seconds. This caused a queue of storage requests to build up. When the storage layer re-connected, the queued requests were executed, which exceeded the available processing quota for the storage system. The system entered a feedback loop in which storage requests continued to queue up leading to further latency increases and more queued requests. The system was unable to exit from this state until additional capacity was added. Creation of a new Cloud SQL Second Generation instance requires a new Cloud Pub/Sub subscription.

REMEDIATION AND PREVENTION:

Our monitoring systems detected the outage and paged oncall engineers at 13:19. We determined root cause at 14:05 and acquired additional storage capacity for the Pub/Sub control plane at 14:42. The outage ended at 15:15 when this capacity became available. To prevent this issue from recurring, we have already increased the storage capacity for the Cloud Pub/Sub control plane. We will change the retry behavior of the control plane to prevent a feedback loop if storage quota is temporarily exceeded. We will also improve our monitoring to ensure we can determine root cause for this type of failure more quickly in future. We apologize for the inconvenience this issue caused our customers.
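
The planned change to retry behavior can be illustrated with a standard exponential-backoff-with-jitter pattern, sketched below in Python; the exception type and limits are assumptions rather than the control plane's actual code.

    import random
    import time


    class QuotaExceededError(Exception):
        pass


    def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Retry request_fn with capped exponential backoff and full jitter,
        so a temporarily over-quota backend is not hammered by queued retries."""
        delay = base_delay
        for attempt in range(1, max_attempts + 1):
            try:
                return request_fn()
            except QuotaExceededError:
                if attempt == max_attempts:
                    raise  # give up instead of retrying indefinitely
                time.sleep(random.uniform(0, delay))
                delay = min(max_delay, delay * 2)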

Last Update: A few months ago

RESOLVED: Incident 17012 - Issue With Second Generation Cloud SQL Instance Creation

SUMMARY:

On Monday, 31 October 2016, 73% of requests to create new subscriptions for Google Cloud Pub/Sub failed for a duration of 124 minutes. Creation of new Cloud SQL Second Generation instances also failed during this incident. If your service or application was affected, we apologize. We have conducted a detailed review of the causes of this incident and are ensuring that we apply the appropriate fixes so that it will not recur.

DETAILED DESCRIPTION OF IMPACT:

On Monday, 31 October 2016 from 13:11 to 15:15 PDT, 73% of requests to create new subscriptions for Google Cloud Pub/Sub failed. 0.1% of pull requests experienced latencies of up to 4 minutes for end-to-end message delivery. Creation of all new Cloud SQL Second Generation instances also failed during this incident. Existing instances were not affected.

ROOT CAUSE:

At 13:08, a system in the Cloud Pub/Sub control plane experienced a connectivity issue to its persistent storage layer for a duration of 83 seconds. This caused a queue of storage requests to build up. When the storage layer re-connected, the queued requests were executed, which exceeded the available processing quota for the storage system. The system entered a feedback loop in which storage requests continued to queue up leading to further latency increases and more queued requests. The system was unable to exit from this state until additional capacity was added. Creation of a new Cloud SQL Second Generation instance requires a new Cloud Pub/Sub subscription.

REMEDIATION AND PREVENTION:

Our monitoring systems detected the outage and paged oncall engineers at 13:19. We determined root cause at 14:05 and acquired additional storage capacity for the Pub/Sub control plane at 14:42. The outage ended at 15:15 when this capacity became available. To prevent this issue from recurring, we have already increased the storage capacity for the Cloud Pub/Sub control plane. We will change the retry behavior of the control plane to prevent a feedback loop if storage quota is temporarily exceeded. We will also improve our monitoring to ensure we can determine root cause for this type of failure more quickly in future. We apologize for the inconvenience this issue caused our customers.

Last Update: A few months ago

UPDATE: Incident 16008 - Delete operations on Cloud Platform Console not being performed

The issue with the Cloud Platform Console not being able to perform delete operations should have been resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Fri 2016-11-04 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - Delete operations on Cloud Platform Console not being performed

The root cause of the issue with the Cloud Platform Console not being able to perform delete operations has been identified and we expect a full resolution in the near future. We will provide another status update by Fri 2016-11-04 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - Delete operation on Cloud Platform Console not being performed

We are experiencing an issue with some delete operations within the Cloud Platform Console, beginning at Tuesday, 2016-11-01 10:00 US/Pacific. Current data indicates that all users are affected by this issue. The gcloud command line tool may be used as a workaround for those who need to manage their resources immediately. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 22:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16003 - Issues with Cloud Pub/Sub

The issue with Cloud Pub/Sub should be resolved for all affected projects as of 15:15 PDT. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 17012 - Issue With Second Generation Cloud SQL Instance Creation

The issue with second generation Cloud SQL instance creation should be resolved for all affected projects as of 15:15 PDT. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17012 - Issue With Second Generation Cloud SQL Instance Creation

We are continuing to investigate an issue with second generation Cloud SQL instance creation. We will provide another update at 16:00 PDT.

Last Update: A few months ago

UPDATE: Incident 16003 - Issues with Cloud Pub/Sub

We are continuing to investigate an issue with Cloud Pub/Sub. We will provide an update at 16:00 PDT.

Last Update: A few months ago

UPDATE: Incident 16003 - Issues with Cloud Pub/Sub

We are currently investigating an issue with Cloud Pub/Sub. We will provide an update at 15:00 PDT with more information.

Last Update: A few months ago

UPDATE: Incident 17012 - Issue With Second Generation Cloud SQL Instance Creation

We are currently investigating an issue with second generation Cloud SQL instance creation. We will provide an update with more information at 15:00 PDT.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

The issue with Google Container Engine nodes connecting to the metadata server should have been resolved for all affected clusters as of 10:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

The issue with Google Container Engine nodes connecting to the metadata server has now been resolved for some of the existing clusters, too. We are continuing to repair the rest of the clusters. We will provide the next status update when this repair is complete.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

The issue with Google Container Engine nodes connecting to the metadata server has been fully resolved for all new clusters. The work to repair the existing clusters is still ongoing and is expected to last for a few more hours. It may also result in the restart of the containers and/or virtual machines in the repaired clusters as per the previous update. If your cluster is affected by this issue and you wish to solve this problem more quickly, you could choose to delete and recreate your GKE cluster. We will provide the next status update when the work to repair the existing clusters has completed.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We have now identified the cause of the issue and are in the process of rolling out the fix for it into production. This may result in the restart of the affected containers and/or virtual machines in the GKE cluster. We apologize for any inconvenience this might cause.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are continuing to work on resolving the issue with Google Container Engine nodes connecting to the metadata server. We will provide the next status update as soon as the proposed fix for this issue is finalized and validated.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are still working on resolving the issue with Google Container Engine nodes connecting to the metadata server. We are actively testing a fix for it, and once it is validated, we will push this fix into production. We will provide the next status update by 2016-10-24 01:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are still investigating the issue with Google Container Engine nodes connecting to the metadata server. Current data indicates that less than 10% of clusters are still affected by this issue. We are actively testing a fix. Once confirmed, we will push this fix into production. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are still investigating the issue with Google Container Engine nodes connecting to the metadata server. Further investigation reveals that less than 10% of clusters are affected by this issue. We will provide another status update by 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

Customers experiencing this error will see messages containing the following in their logs: "WARNING: exception thrown while executing request java.net.UnknownHostException: metadata" This is caused by a change that inadvertently prevents hosts from properly resolving the DNS address for the metadata server. We have identified the root cause and are preparing a fix. No action is required by customers at this time. The proposed fix should resolve the issue for all customers as soon as it is prepared, tested, and deployed. We will add another update at 21:00 US/Pacific with current details.
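
Nodes affected by this incident fail at the DNS step, before any request reaches the metadata service. As a rough diagnostic sketch (the script itself is not part of any Google tooling, though the metadata.google.internal endpoint and the Metadata-Flavor: Google header are the standard metadata-server conventions), the following probes both the short name and the fully-qualified name from a node or pod:

    # Diagnostic sketch: verify that a node/pod can resolve and reach the
    # GCE metadata server by its short name and its fully-qualified name.
    # The endpoints and the "Metadata-Flavor: Google" header are the documented
    # metadata-server conventions; everything else here is illustrative.
    import socket
    import urllib.request

    HOSTS = ["metadata", "metadata.google.internal"]
    PATH = "/computeMetadata/v1/instance/hostname"


    def probe(host: str) -> None:
        try:
            addr = socket.gethostbyname(host)  # DNS resolution, the failing step in this incident
        except socket.gaierror as exc:
            print(f"{host}: DNS resolution FAILED ({exc})")
            return
        req = urllib.request.Request(f"http://{host}{PATH}",
                                     headers={"Metadata-Flavor": "Google"})
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                body = resp.read().decode()
                print(f"{host}: resolved to {addr}, HTTP {resp.status}, hostname={body}")
        except OSError as exc:
            print(f"{host}: resolved to {addr} but the request failed ({exc})")


    if __name__ == "__main__":
        for h in HOSTS:
            probe(h)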

Last Update: A few months ago

RESOLVED: Incident 16020 - 502s from HTTP(S) Load Balancer

SUMMARY: On Thursday 13 October 2016, approximately one-third of requests sent to the Google Compute Engine HTTP(S) Load Balancers between 15:07 and 17:25 PDT received an HTTP 502 error rather than the expected response. If your service or application was affected, we apologize. We took immediate action to restore service once the problem was detected, and are taking steps to improve the Google Compute Engine HTTP(S) Load Balancer’s performance and availability. DETAILED DESCRIPTION OF IMPACT: Starting at 15:07 PDT on Thursday 13 October 2016, Google Compute Engine HTTP(S) Load Balancers started to return elevated rates of HTTP 502 (Bad Gateway) responses. The error rate rose progressively from 2% to a peak of 45% of all requests at 16:09 and remained there until 17:03. From 17:03 to 17:15, the error rate declined rapidly from 45% to 2%. By 17:25 requests were routing as expected and the incident was over. During the incident, the error rate seen by applications using GCLB varied depending on the network routing of their requests to Google. ROOT CAUSE: The Google Compute Engine HTTP(S) Load Balancer system is a global, geographically-distributed multi-tiered software stack which receives incoming HTTP(S) requests via many points in Google's global network, and dispatches them to appropriate Google Compute Engine instances. On 13 October 2016, a configuration change was rolled out to one of these layers with widespread distribution beginning at 15:07. This change triggered a software bug which decoupled second-tier load balancers from a number of first-tier load balancers. The affected first-tier load balancers therefore had no forwarding path for incoming requests and returned the HTTP 502 code to indicate this. Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations, and to reduce the scope of the impact in the event that a problem is exposed in production. These safeguards were partially successful in this instance, limiting both the scope and the duration of the event, but not preventing it entirely. The first relevant safeguard is a canary deployment, where the configuration is deployed at a single site and that site is verified to be functioning within normal bounds. In this case, the canary step did generate a warning, but it was not sufficiently precise to cause the on-call engineer to immediately halt the rollout. The new configuration subsequently rolled out in stages, but was halted part way through as further alerts indicated that it was not functioning correctly. By design, this progressive rollout limited the error rate experienced by customers. REMEDIATION AND PREVENTION: Once the nature and scope of the issue became clear, Google engineers first quickly halted and reverted the rollout. This prevented a larger fraction of GCLB instances from being affected. Google engineers then set about restoring function to the GCLB instances which had been exposed to the configuration. They verified that restarting affected GCLB instances restored the pre-rollout configuration, and then rapidly restarted all affected GCLB instances, ending the event. Google understands that global load balancers are extremely useful, but also may be a single point of failure for your service. We are committed to applying the lessons from this outage in order to ensure that this type of incident does not recur. 
One of our guiding principles for avoiding large-scale incidents is to roll out global changes slowly and carefully monitor for errors. We typically have a period of soak time during a canary release before rolling out more widely. In this case, the change was pushed too quickly for accurate detection of the class of failure uncovered by the configuration being rolled out. We will change our processes to be more conservative when rolling out configuration changes to critical systems. As defense in depth, Google engineers are also changing the black box monitoring for GCLB so that it will test the first-tier load balancers impacted by this incident. We will also be improving the black box monitoring to ensure that our probers cover all use cases. In addition, we will add an alert for elevated error rates between first-tier and second-tier load balancers. We apologize again for the impact this issue caused our customers.
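
The canary-and-soak discipline described above can be made mechanical. The sketch below is a deliberately simplified illustration (the stage fractions, error budget, soak time, and the error_rate probe are hypothetical placeholders, not Google's release tooling): each stage proceeds only if the observed error rate after a soak period stays within budget; otherwise the rollout halts and rolls back.

    # Simplified sketch of a gated, staged rollout. The probe function, stage
    # fractions, soak time and thresholds are all illustrative placeholders.
    import time
    from typing import Callable

    STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of serving capacity per stage
    ERROR_BUDGET = 0.01                 # halt if error rate exceeds 1%
    SOAK_SECONDS = 2                    # shortened for the example


    def staged_rollout(apply_config: Callable[[float], None],
                       rollback: Callable[[], None],
                       error_rate: Callable[[], float]) -> bool:
        for fraction in STAGES:
            apply_config(fraction)
            time.sleep(SOAK_SECONDS)    # soak: give monitoring time to catch up
            observed = error_rate()
            if observed > ERROR_BUDGET:
                print(f"halting at {fraction:.0%}: error rate {observed:.2%} over budget")
                rollback()
                return False
            print(f"stage {fraction:.0%} healthy (error rate {observed:.2%})")
        return True


    if __name__ == "__main__":
        # Toy environment: the "new config" starts misbehaving once it reaches 5%.
        state = {"fraction": 0.0}
        staged_rollout(
            apply_config=lambda f: state.update(fraction=f),
            rollback=lambda: print("reverted to previous configuration"),
            error_rate=lambda: 0.45 if state["fraction"] >= 0.05 else 0.001,
        )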

Last Update: A few months ago

UPDATE: Incident 18020 - BigQuery query failures

The issue with BigQuery queries failing should have been resolved for all affected users as of 08:53 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18020 - BigQuery query failures

The issue with Google BigQuery API calls returning 500 Internal Errors should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 09:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18020 - BigQuery query failures

The issue with Google BigQuery API calls returning 500 Internal Errors should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 07:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18020 - BigQuery query failures

We are investigating an issue with Google BigQuery. We will provide more information by 6:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16020 - 502s from HTTP(s) Load Balancer

The issue with Google Cloud Platform HTTP(S) Load Balancer returning 502 response codes should have been resolved for all affected customers as of 17:25 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16020 - 502s from HTTP(s) Load Balancer

We are still investigating the issue with Google Cloud Platform HTTP(S) Load Balancers returning 502 errors, and will provide an update by 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16020 - 502s from HTTP(s) Load Balancer

We are still investigating the issue with Google Cloud Platform HTTP(S) Load Balancers returning 502 errors, and will provide an update by 17:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16020 - 502s from HTTP(s) Load Balancer

We are experiencing an issue with Google Cloud Platform HTTP(s) Load Balancer returning 502 response codes, starting at 2016-10-13 15:30 US/Pacific. We are investigating the issue, and will provide an update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16019 - Hurricane Matthew may impact GCP services in us-east1

The Google Cloud Platform team is keeping a close watch on the path of Hurricane Matthew. The National Hurricane Center 3-day forecast indicates that the storm is tracking within 200 miles of the datacenters housing the GCP region us-east1. We do not anticipate any specific service interruptions. Our datacenter is designed to withstand a direct hit from a more powerful hurricane than Matthew without disruption, and we maintain triple-redundant diverse-path backbone networking precisely to be resilient to extreme events. We have staff on site and plan to run services normally. Despite all of the above, it is statistically true that there is an increased risk of a region-level utility grid or other infrastructure disruption which may result in a GCP service interruption. If we anticipate a service interruption – for example, if the regional grid loses power and our datacenter is operating on generator – our protocol is to share specific updates with our customers with 12 hours' notice.

Last Update: A few months ago

UPDATE: Incident 16034 - Elevated Cloud Storage error rate and latency

The issue with Cloud Storage in which some projects encountered elevated errors and latency should have been resolved for all affected projects as of 23:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16034 - Elevated Cloud Storage error rate and latency

We are investigating an issue with Cloud Storage. We will provide more information by 23:45 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We have restored most of the missing Google Cloud Pub/Sub subscriptions for affected projects. We expect to restore the remaining missing subscriptions within one hour. We have already identified and fixed the root cause of the issue. We will conduct an internal investigation and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are working on restoring the missing PubSub subscriptions for customers that are affected, and will provide an ETA for complete restoration when available.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 10:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 08:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 06:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. We will provide another status update by Wednesday, 2016-09-28 02:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We are still investigating the issue with Cloud Pub/Sub subscriptions. In the meantime, affected users can re-create missing subscriptions manually in order to make them available. We will provide another status update by Wednesday, 2016-09-28 00:00 US/Pacific with current details.
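
For anyone scripting the manual workaround above, a minimal sketch with the current google-cloud-pubsub Python client might look like the following (the client library has changed since 2016, and the project, topic, and subscription names are placeholders):

    # Sketch: re-create a Cloud Pub/Sub subscription if it has gone missing.
    # Requires the google-cloud-pubsub package and application default credentials.
    # PROJECT_ID, TOPIC_ID and SUBSCRIPTION_ID are placeholders for illustration.
    from google.api_core.exceptions import NotFound
    from google.cloud import pubsub_v1

    PROJECT_ID = "my-project"            # placeholder
    TOPIC_ID = "my-topic"                # placeholder
    SUBSCRIPTION_ID = "my-subscription"  # placeholder


    def ensure_subscription() -> None:
        subscriber = pubsub_v1.SubscriberClient()
        sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
        # Topic resource names follow the documented projects/{p}/topics/{t} format.
        topic_path = f"projects/{PROJECT_ID}/topics/{TOPIC_ID}"
        with subscriber:
            try:
                subscriber.get_subscription(request={"subscription": sub_path})
                print(f"{sub_path} already exists")
            except NotFound:
                subscriber.create_subscription(
                    request={"name": sub_path, "topic": topic_path}
                )
                print(f"re-created {sub_path}")


    if __name__ == "__main__":
        ensure_subscription()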

Last Update: A few months ago

UPDATE: Incident 16002 - Cloud Pub/Sub subscriptions deleted unexpectedly

We experienced an issue with Cloud Pub/Sub in which some subscriptions were deleted unexpectedly between approximately 13:40 and 18:45 US/Pacific on Tuesday, 2016-09-27. We are going to recreate the deleted subscriptions. We will provide another status update by 22:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

The issue with slow Compute Engine operations in asia-east1-a has been resolved as of 13:37 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

We are still working on the issue with Compute Engine operations. After mitigation was applied, operations in asia-east1-a have continued to run normally. A final fix is still underway. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

We are still investigating the issue with Compute Engine operations. We have applied mitigation and currently operations in asia-east1-a are processing normally. We are applying some final fixes and monitoring the issue. We will provide another status update by 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

It is taking multiple minutes to create new VMs, restart existing VMs that terminate abnormally, or hot-attach disks.

Last Update: A few months ago

UPDATE: Incident 16018 - Slow instance start times in asia-east1-a

This incident only covers instances in the asia-east1-a zone.

Last Update: A few months ago

RESOLVED: Incident 16033 - Google Cloud Storage serving high error rates.

We have completed our internal investigation and results suggest this incident impacted a very small number of projects. We have reached out to affected users directly and if you have not heard from us, your project(s) were not impacted.

Last Update: A few months ago

RESOLVED: Incident 16033 - Google Cloud Storage serving high error rates.

The issue with Google Cloud Storage serving a high percentage of errors should have been resolved for all affected users as of 13:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 16009 - Networking issue with Google App Engine services

SUMMARY: On Monday 22 August 2016, the Google Cloud US-CENTRAL1-F zone lost network connectivity to services outside that zone for a duration of 25 minutes. All other zones in the US-CENTRAL1 region were unaffected. All network traffic within the zone was also unaffected. We apologize to customers whose service or application was affected by this incident. We understand that a network disruption has a negative impact on your application - particularly if it is homed in a single zone - and we apologize for the inconvenience this caused. What follows is our detailed analysis of the root cause and actions we will take in order to prevent this type of incident from recurring. DETAILED DESCRIPTION OF IMPACT: We have received feedback from customers asking us to specifically and separately enumerate the impact of incidents to any service that may have been touched. We agree that this will make it easier to reason about the impact of any particular event and we have done so in the following descriptions. On Monday 22 August 2016 from 07:05 to 07:30 PDT the Google Cloud US-CENTRAL1-F zone lost network connectivity to services outside that zone. App Engine 6% of App Engine Standard Environment applications in the US-CENTRAL region served elevated error rates for up to 8 minutes, until the App Engine serving infrastructure automatically redirected traffic to a failover zone. The aggregate error rate across all impacted applications during the incident period was 3%. The traffic redirection caused a Memcache flush for affected applications, and also loading requests as new instances of the applications started up in the failover zones. All App Engine Flexible Environment applications deployed to the US-CENTRAL1-F zone were unavailable for the duration of the incident. Additionally, 4.5% of these applications experienced various levels of unavailability for up to an additional 5 hours while the system recovered. Deployments for US-CENTRAL Flexible applications were delayed during the incident. Our engineers disabled the US-CENTRAL1-F zone for new deployments during the incident, so that any customers who elected to redeploy, immediately recovered. Cloud Console The Cloud Console was available during the incident, though some App Engine administrative pages did not load for applications in US-CENTRAL and 50% of project creation requests failed to complete and needed to be retried by customers before succeeding. Cloud Dataflow Some Dataflow running jobs in the US-CENTRAL1 region experienced delays in processing. Although most of the affected jobs recovered gracefully after the incident ended, up to 2.5% of affected jobs in this zone became stuck and required manual termination by customers. New jobs created during the incident were not impacted. Cloud SQL Cloud SQL First Generation instances were not impacted by this incident. 30% of Cloud SQL Second Generation instances in US-CENTRAL1 were unavailable for up to 5 minutes, after which they became available again. An additional 15% of Second Generation instances were unavailable for 22 minutes. Compute Engine All instances in the US-CENTRAL1-F zone were inaccessible from outside the zone for the duration of the incident. 9% of them remained inaccessible from outside the zone for an additional hour. Container Engine Container Engine clusters running in US-CENTRAL1-F were inaccessible from outside of the zone during the incident although they continued to serve. 
In addition, calls to the Container Engine API experienced a 4% error rate and elevated latency during the incident, though this was substantially mitigated if the client retried the request. Stackdriver Logging 20% of log API requests sent to Stackdriver Logging in the US-CENTRAL1 region failed during the incident, though App Engine logging was not impacted. Clients retrying requests recovered gracefully. Stackdriver Monitoring Requests to the StackDriver web interface and the Google Monitoring API v2beta2 and v3 experienced elevated latency and an error rate of up to 3.5% during the incident. In addition, some alerts were delayed. Impact for API calls was substantially mitigated if the client retried the request. ROOT CAUSE: On 18 July, Google carried out a planned maintenance event to inspect and test the UPS on a power feed in one zone in the US-CENTRAL1 region. That maintenance disrupted one of the two power feeds to network devices that control routes into and out of the US-CENTRAL1-F zone. Although this did not cause any disruption in service, these devices unexpectedly and silently disabled the affected power supply modules - a previously unseen behavior. Because our monitoring systems did not notify our network engineers of this problem the power supply modules were not re-enabled after the maintenance event. The service disruption was triggered on Monday 22 August, when our engineers carried out another planned maintenance event that removed power to the second power feed of these devices, causing them to disable the other power supply module as well, and thus completely shut down. Following our standard procedure when carrying out maintenance events, we made a detailed line walk of all critical equipment prior to, and after, making any changes. However, in this case we did not detect the disabled power supply modules. Loss of these network devices meant that machines in US-CENTRAL1-F did not have routes into and out of the zone but could still communicate to other machines within the same zone. REMEDIATION AND PREVENTION: Our network engineers received an alert at 07:14, nine minutes after the incident started. We restored power to the devices at 07:30. The network returned to service without further intervention after power was restored. As immediate followup to this incident, we have already carried out an audit of all other network devices of this type in our fleet to verify that there are none with disabled power supply modules. We have also written up a detailed post mortem of this incident and will take the following actions to prevent future outages of this type: Our monitoring will be enhanced to detect cases in which power supply modules are disabled. This will ensure that conditions that are missed by the manual line walk prior to maintenance events are picked up by automated monitoring. We will change the configuration of these network devices so that power disruptions do not cause them to disable their power supply modules. The interaction between the network control plane and the data plane should be such that the data plane should "fail open" and continue to route packets in the event of control plane failures. We will add support for networking protocols that have the capability to continue to route traffic for a short period in the event of failures in control plane components. 
We will also be taking various actions to improve the resilience of the affected services to single-zone outages, including the following: App Engine Although App Engine Flexible Environment is currently in Beta, we expect production services to be more resilient to single zone disruptions. We will make this extra resilience an exit criteria before we allow the service to reach General Availability. Cloud Dataflow We will improve resilience of Dataflow to single-zone outages by implementing better strategies for migrating the job controller to a new zone in the event of an outage. Work on this remediation is already underway. Stackdriver Logging We will make improvements to the Stackdriver Logging service (currently in Beta) in the areas of automatic failover and capacity management before this service goes to General Availability. This will ensure that it is resilient to single-zone outages. Stackdriver Monitoring The Google Monitoring API (currently in beta) is already hosted in more than one zone, but we will further improve its resilience by adding additional capacity to ensure a single-zone outage does not cause overload in any other zones. We will do this before this service exits to General Availability. Finally, we know that you depend on Google Cloud Platform for your production workloads and we apologize for the inconvenience this event caused.
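
Several of the impact notes above observe that clients which retried failed API calls recovered gracefully. A generic client-side pattern for that is capped exponential backoff with jitter; the sketch below is illustrative only (the TransientError type and the wrapped call are placeholders, not a GCP client API):

    # Illustrative client-side retry helper: capped exponential backoff with
    # full jitter. The transient-error type and the wrapped call are placeholders.
    import random
    import time
    from typing import Callable, Tuple, Type, TypeVar

    T = TypeVar("T")


    class TransientError(Exception):
        """Stand-in for a retryable failure such as an HTTP 500/503 or a timeout."""


    def call_with_retries(call: Callable[[], T],
                          retryable: Tuple[Type[BaseException], ...] = (TransientError,),
                          max_attempts: int = 5,
                          base_delay: float = 0.5,
                          max_delay: float = 8.0) -> T:
        for attempt in range(1, max_attempts + 1):
            try:
                return call()
            except retryable:
                if attempt == max_attempts:
                    raise
                # Full jitter keeps many clients from retrying in lockstep.
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
                time.sleep(delay)
        raise RuntimeError("unreachable")


    if __name__ == "__main__":
        flaky = iter([TransientError, TransientError, "ok"])

        def sample_call() -> str:
            outcome = next(flaky)
            if outcome is TransientError:
                raise TransientError("simulated 503 during a zonal incident")
            return outcome

        print(call_with_retries(sample_call))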

Last Update: A few months ago

RESOLVED: Incident 16017 - Networking issue with Google Compute Engine services

SUMMARY: On Monday 22 August 2016, the Google Cloud US-CENTRAL1-F zone lost network connectivity to services outside that zone for a duration of 25 minutes. All other zones in the US-CENTRAL1 region were unaffected. All network traffic within the zone was also unaffected. We apologize to customers whose service or application was affected by this incident. We understand that a network disruption has a negative impact on your application - particularly if it is homed in a single zone - and we apologize for the inconvenience this caused. What follows is our detailed analysis of the root cause and actions we will take in order to prevent this type of incident from recurring. DETAILED DESCRIPTION OF IMPACT: We have received feedback from customers asking us to specifically and separately enumerate the impact of incidents to any service that may have been touched. We agree that this will make it easier to reason about the impact of any particular event and we have done so in the following descriptions. On Monday 22 August 2016 from 07:05 to 07:30 PDT the Google Cloud US-CENTRAL1-F zone lost network connectivity to services outside that zone. App Engine 6% of App Engine Standard Environment applications in the US-CENTRAL region served elevated error rates for up to 8 minutes, until the App Engine serving infrastructure automatically redirected traffic to a failover zone. The aggregate error rate across all impacted applications during the incident period was 3%. The traffic redirection caused a Memcache flush for affected applications, and also loading requests as new instances of the applications started up in the failover zones. All App Engine Flexible Environment applications deployed to the US-CENTRAL1-F zone were unavailable for the duration of the incident. Additionally, 4.5% of these applications experienced various levels of unavailability for up to an additional 5 hours while the system recovered. Deployments for US-CENTRAL Flexible applications were delayed during the incident. Our engineers disabled the US-CENTRAL1-F zone for new deployments during the incident, so that any customers who elected to redeploy, immediately recovered. Cloud Console The Cloud Console was available during the incident, though some App Engine administrative pages did not load for applications in US-CENTRAL and 50% of project creation requests failed to complete and needed to be retried by customers before succeeding. Cloud Dataflow Some Dataflow running jobs in the US-CENTRAL1 region experienced delays in processing. Although most of the affected jobs recovered gracefully after the incident ended, up to 2.5% of affected jobs in this zone became stuck and required manual termination by customers. New jobs created during the incident were not impacted. Cloud SQL Cloud SQL First Generation instances were not impacted by this incident. 30% of Cloud SQL Second Generation instances in US-CENTRAL1 were unavailable for up to 5 minutes, after which they became available again. An additional 15% of Second Generation instances were unavailable for 22 minutes. Compute Engine All instances in the US-CENTRAL1-F zone were inaccessible from outside the zone for the duration of the incident. 9% of them remained inaccessible from outside the zone for an additional hour. Container Engine Container Engine clusters running in US-CENTRAL1-F were inaccessible from outside of the zone during the incident although they continued to serve. 
In addition, calls to the Container Engine API experienced a 4% error rate and elevated latency during the incident, though this was substantially mitigated if the client retried the request. Stackdriver Logging 20% of log API requests sent to Stackdriver Logging in the US-CENTRAL1 region failed during the incident, though App Engine logging was not impacted. Clients retrying requests recovered gracefully. Stackdriver Monitoring Requests to the StackDriver web interface and the Google Monitoring API v2beta2 and v3 experienced elevated latency and an error rate of up to 3.5% during the incident. In addition, some alerts were delayed. Impact for API calls was substantially mitigated if the client retried the request. ROOT CAUSE: On 18 July, Google carried out a planned maintenance event to inspect and test the UPS on a power feed in one zone in the US-CENTRAL1 region. That maintenance disrupted one of the two power feeds to network devices that control routes into and out of the US-CENTRAL1-F zone. Although this did not cause any disruption in service, these devices unexpectedly and silently disabled the affected power supply modules - a previously unseen behavior. Because our monitoring systems did not notify our network engineers of this problem the power supply modules were not re-enabled after the maintenance event. The service disruption was triggered on Monday 22 August, when our engineers carried out another planned maintenance event that removed power to the second power feed of these devices, causing them to disable the other power supply module as well, and thus completely shut down. Following our standard procedure when carrying out maintenance events, we made a detailed line walk of all critical equipment prior to, and after, making any changes. However, in this case we did not detect the disabled power supply modules. Loss of these network devices meant that machines in US-CENTRAL1-F did not have routes into and out of the zone but could still communicate to other machines within the same zone. REMEDIATION AND PREVENTION: Our network engineers received an alert at 07:14, nine minutes after the incident started. We restored power to the devices at 07:30. The network returned to service without further intervention after power was restored. As immediate followup to this incident, we have already carried out an audit of all other network devices of this type in our fleet to verify that there are none with disabled power supply modules. We have also written up a detailed post mortem of this incident and will take the following actions to prevent future outages of this type: Our monitoring will be enhanced to detect cases in which power supply modules are disabled. This will ensure that conditions that are missed by the manual line walk prior to maintenance events are picked up by automated monitoring. We will change the configuration of these network devices so that power disruptions do not cause them to disable their power supply modules. The interaction between the network control plane and the data plane should be such that the data plane should "fail open" and continue to route packets in the event of control plane failures. We will add support for networking protocols that have the capability to continue to route traffic for a short period in the event of failures in control plane components. 
We will also be taking various actions to improve the resilience of the affected services to single-zone outages, including the following: App Engine Although App Engine Flexible Environment is currently in Beta, we expect production services to be more resilient to single zone disruptions. We will make this extra resilience an exit criteria before we allow the service to reach General Availability. Cloud Dataflow We will improve resilience of Dataflow to single-zone outages by implementing better strategies for migrating the job controller to a new zone in the event of an outage. Work on this remediation is already underway. Stackdriver Logging We will make improvements to the Stackdriver Logging service (currently in Beta) in the areas of automatic failover and capacity management before this service goes to General Availability. This will ensure that it is resilient to single-zone outages. Stackdriver Monitoring The Google Monitoring API (currently in beta) is already hosted in more than one zone, but we will further improve its resilience by adding additional capacity to ensure a single-zone outage does not cause overload in any other zones. We will do this before this service exits to General Availability. Finally, we know that you depend on Google Cloud Platform for your production workloads and we apologize for the inconvenience this event caused.

Last Update: A few months ago

RESOLVED: Incident 17011 - Cloud SQL 2nd generation failing to create new instances

The issue in creating instances on Cloud SQL second generation should have been resolved for all affected projects as of 17:38 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17011 - Cloud SQL 2nd generation failing to create new instances

We are investigating an issue for creating new instances on Cloud SQL second generation. We will provide more information by 18:50 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16008 - App Engine Outage

SUMMARY: On Thursday 11 August 2016, 21% of Google App Engine applications hosted in the US-CENTRAL region experienced error rates in excess of 10% and elevated latency between 13:13 and 15:00 PDT. An additional 16% of applications hosted on the same GAE instance observed lower rates of errors and latency during the same period. We apologize for this incident. We know that you choose to run your applications on Google App Engine to obtain flexible, reliable, high-performance service, and in this incident we have not delivered the level of reliability for which we strive. Our engineers have been working hard to analyze what went wrong and ensure incidents of this type will not recur. DETAILED DESCRIPTION OF IMPACT: On Thursday 11 August 2016 from 13:13 to 15:00 PDT, 18% of applications hosted in the US-CENTRAL region experienced error rates between 10% and 50%, and 3% of applications experienced error rates in excess of 50%. Additionally, 14% experienced error rates between 1% and 10%, and 2% experienced error rate below 1% but above baseline levels. In addition, the 37% of applications which experienced elevated error rates also observed a median latency increase of just under 0.8 seconds per request. The remaining 63% of applications hosted on the same GAE instance, and applications hosted on other GAE instances, did not observe elevated error rates or increased latency. Both App Engine Standard and Flexible Environment applications in US-CENTRAL were affected by this incident. In addition, some Flexible Environment applications were unable to deploy new versions during this incident. App Engine applications in US-EAST1 and EUROPE-WEST were not impacted by this incident. ROOT CAUSE: The incident was triggered by a periodic maintenance procedure in which Google engineers move App Engine applications between datacenters in US-CENTRAL in order to balance traffic more evenly. As part of this procedure, we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources. The applications running on the drained servers are automatically rescheduled onto different servers. During this procedure, a software update on the traffic routers was also in progress, and this update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity. The server drain resulted in rescheduling of multiple instances of manually-scaled applications. App Engine creates new instances of manually-scaled applications by sending a startup request via the traffic routers to the server hosting the new instance. Some manually-scaled instances started up slowly, resulting in the App Engine system retrying the start requests multiple times which caused a spike in CPU load on the traffic routers. The overloaded traffic routers dropped some incoming requests. Although there was sufficient capacity in the system to handle the load, the traffic routers did not immediately recover due to retry behavior which amplified the volume of requests. REMEDIATION AND PREVENTION: Google engineers were monitoring the system during the datacenter changes and immediately noticed the problem. 
Although we rolled back the change that drained the servers within 11 minutes, this did not sufficiently mitigate the issue because retry requests had generated enough additional traffic to keep the system’s total load at a substantially higher-than-normal level. As designed, App Engine automatically redirected requests to other datacenters away from the overload - which reduced the error rate. Additionally, our engineers manually redirected all traffic at 13:56 to other datacenters which further mitigated the issue. Finally, we identified a configuration error that caused an imbalance of traffic in the new datacenters. Fixing this at 15:00 fully resolved the incident. In order to prevent a recurrence of this type of incident, we have added more traffic routing capacity in order to create more capacity buffer when draining servers in this region. We will also change how applications are rescheduled so that the traffic routers are not called and also modify the system's retry behavior so that it cannot trigger this type of failure. We know that you rely on our infrastructure to run your important workloads and that this incident does not meet our bar for reliability. For that we apologize. Your trust is important to us and we will continue to do all we can to earn and keep that trust.
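
The failure mode described above, where retried start requests amplified load on already-busy routers, is what a retry budget is designed to prevent. The sketch below is purely illustrative (the 10% ratio and 60-second window are arbitrary example values, not App Engine internals): retries are allowed only while they remain a small fraction of recent first attempts.

    # Illustrative retry budget: allow retries only while they stay under a fixed
    # fraction of recent first attempts, so retries cannot amplify an overload.
    # The 10% ratio and 60-second window are arbitrary example values.
    import collections
    import time


    class RetryBudget:
        def __init__(self, ratio: float = 0.1, window_seconds: float = 60.0):
            self.ratio = ratio
            self.window = window_seconds
            self.events = collections.deque()  # (timestamp, is_retry)

        def _prune(self) -> None:
            cutoff = time.monotonic() - self.window
            while self.events and self.events[0][0] < cutoff:
                self.events.popleft()

        def record_attempt(self) -> None:
            self._prune()
            self.events.append((time.monotonic(), False))

        def allow_retry(self) -> bool:
            self._prune()
            firsts = sum(1 for _, is_retry in self.events if not is_retry)
            retries = sum(1 for _, is_retry in self.events if is_retry)
            if retries + 1 > self.ratio * max(firsts, 1):
                return False  # budget exhausted: fail fast instead of piling on
            self.events.append((time.monotonic(), True))
            return True


    if __name__ == "__main__":
        budget = RetryBudget(ratio=0.1, window_seconds=60.0)
        for _ in range(50):
            budget.record_attempt()
        granted = sum(budget.allow_retry() for _ in range(20))
        print(f"retries granted: {granted} of 20 requested")  # roughly 10% of 50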

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The issue with Compute Engine network connectivity should have been resolved for nearly all instances. For the few remaining instances, we are working directly with the affected customers. No further updates will be posted, but we will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will also provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The issue with Compute Engine network connectivity should have been resolved for affected instances in us-central1-a, -b, and -c as of 08:00 US/Pacific. Less than 4% of instances in us-central1-f are currently affected. We will provide another status update by 12:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The issue with Compute Engine network connectivity should have been resolved for affected instances in us-central1-a, -b, and -c as of 08:00 US/Pacific. Less than 4% of instances in us-central1-f are currently affected and we expect a full resolution soon. We will provide another status update by 11:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The work on the remaining instances with network connectivity issues, located in us-central1-f, is still ongoing. We will provide another status update by 11:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The work on the remaining instances with network connectivity issues, located in us-central1-f, is still ongoing. We will provide another status update by 10:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The work on the remaining instances with network connectivity issues is still ongoing. Affected instances are located in us-central1-f. We will provide another status update by 10:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The work on the remaining instances with network connectivity issues is still ongoing. Affected instances are located in us-central1-f. We will provide another status update by 09:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

We are still investigating network connectivity issues for a subset of instances that have not automatically recovered. We will provide another status update by 09:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

The issue with network connectivity to Google Compute Engine services should have been resolved for the majority of instances and we expect a full resolution in the near future. We will provide another status update by 08:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16009 - Networking issue with Google App Engine services

The issue with network connectivity to Google App Engine applications should have been resolved for all affected users as of 07:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16017 - Networking issue with Google Compute Engine services

We are investigating an issue with network connectivity. We will provide more information by 08:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16009 - Networking issue with Google App Engine services

We are investigating an issue with network connectivity. We will provide more information by 08:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16016 - Networking issue with Google Compute Engine services

We are investigating an issue with network connectivity. We will provide more information by 08:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

The issue with App Engine APIs being unavailable should have been resolved for nearly all affected projects as of 14:12 US/Pacific. We will follow up directly with the few remaining affected applications. We will also conduct a thorough internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. Finally, we will provide a more detailed analysis of this incident once we have completed this internal investigation.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

We are still investigating the issue with App Engine APIs being unavailable. Current data indicates that some projects are affected by this issue. We will provide another status update by 15:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

The issue with App Engine APIs being unavailable should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 15:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

We are experiencing an issue with App Engine APIs being unavailable beginning at Thursday, 2016-08-11 13:45 US/Pacific. Current data indicates that applications in us-central are affected by this issue.

Last Update: A few months ago

UPDATE: Incident 16008 - App Engine Outage

We are investigating reports of an issue with App Engine. We will provide more information by 02:15 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16015 - Networking issue with Google Compute Engine services

SUMMARY: On Friday 5 August 2016, some Google Cloud Platform customers experienced increased network latency and packet loss to Google Compute Engine (GCE), Cloud VPN, Cloud Router and Cloud SQL, for a duration of 99 minutes. If you were affected by this issue, we apologize. We intend to provide a higher level reliability than this, and we are working to learn from this issue to make that a reality. DETAILED DESCRIPTION OF IMPACT: On Friday 5th August 2016 from 00:55 to 02:34 PDT a number of services were disrupted: Some Google Compute Engine TCP and UDP traffic had elevated latency. Most ICMP, ESP, AH and SCTP traffic inbound from outside the Google network was silently dropped, resulting in existing connections being dropped and new connections timing out on connect. Most Google Cloud SQL first generation connections from sources external to Google failed with a connection timeout. Cloud SQL second generation connections may have seen higher latency but not failure. Google Cloud VPN tunnels remained connected, however there was complete packet loss for data through the majority of tunnels. As Cloud Router BGP sessions traverse Cloud VPN, all sessions were dropped. All other traffic was unaffected, including internal connections between Google services and services provided via HTTP APIs. ROOT CAUSE: While removing a faulty router from service, a new procedure for diverting traffic from the router was used. This procedure applied a new configuration that resulted in announcing some Google Cloud Platform IP addresses from a single point of presence in the southwestern US. As these announcements were highly specific they took precedence over the normal routes to Google's network and caused a substantial proportion of traffic for the affected network ranges to be directed to this one point of presence. This misrouting directly caused the additional latency some customers experienced. Additionally this misconfiguration sent affected traffic to next-generation infrastructure that was undergoing testing. This new infrastructure was not yet configured to handle Cloud Platform traffic and applied an overly-restrictive packet filter. This blocked traffic on the affected IP addresses that was routed through the affected point of presence to Cloud VPN, Cloud Router, Cloud SQL first generation and GCE on protocols other than TCP and UDP. REMEDIATION AND PREVENTION: Mitigation began at 02:04 PDT when Google engineers reverted the network infrastructure change that caused this issue, and all traffic routing was back to normal by 02:34. The system involved was made safe against recurrences by fixing the erroneous configuration. This includes changes to BGP filtering to prevent this class of incorrect announcements. We are implementing additional integration tests for our routing policies to ensure configuration changes behave as expected before being deployed to production. Furthermore, we are improving our production telemetry external to the Google network to better detect peering issues that slip past our tests. We apologize again for the impact this issue has had on our customers.

Last Update: A few months ago

RESOLVED: Incident 16015 - Networking issue with Google Compute Engine services

The issue with Google Cloud networking should have been resolved for all affected users as of 02:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16015 - Networking issue with Google Compute Engine services

We are still investigating the issue with Google Compute Engine networking. Current data also indicates impact on other GCP products including Cloud SQL. We will provide another status update by 03:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16015 - Networking issue with Google Compute Engine services

We are investigating a networking issue with Google Compute Engine. We will provide more information by 02:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18019 - BigQuery connection issues

We are experiencing an intermittent issue with BigQuery connections beginning at Thursday, 2016-08-04 13:49 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 16:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18002 - Execution of BigQuery jobs is delayed, jobs are backing up in Pending state

We are experiencing an intermittent issue with BigQuery connections beginning at Thursday, 2016-08-04 13:49 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 16:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 16014 - HTTP(S) Load Balancing returning some 502 errors

We are still investigating the issue with HTTP(S) Load Balancing returning 502 errors. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16014 - HTTP(S) Load Balancing returning some 502 errors

The issue with HTTP(S) Load Balancing returning a small number of 502 errors should have been resolved for all affected instances as of 11:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16014 - HTTP(S) Load Balancing returning some 502 errors

We are experiencing an issue with HTTP(S) Load Balancing returning a small number of 502 errors, beginning at Friday, 2016-07-29 around 08:45 US/Pacific. The maximum error rate for affected users was below 2%. Remediation has been applied that should stop these errors; we are monitoring the situation. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 11:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16013 - HTTP(S) Load Balancing 502 Errors

We are investigating an issue with 502 errors from HTTP(S) Load Balancing. We will provide more information by 11:05 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18018 - Streaming API issues with BigQuery

SUMMARY: On Monday 25 July 2016, the Google BigQuery Streaming API experienced elevated error rates for a duration of 71 minutes. We apologize if your service or application was affected by this and we are taking immediate steps to improve the platform’s performance and availability. DETAILED DESCRIPTION OF IMPACT: On Monday 25 July 2016 between 17:03 and 18:14 PDT, the BigQuery Streaming API returned HTTP 500 or 503 errors for 35% of streaming insert requests, with a peak error rate of 49% at 17:40. Customers who retried on error were able to mitigate the impact. Calls to the BigQuery jobs API showed an error rate of 3% during the incident but could generally be executed reliably with normal retry behaviour. Other BigQuery API calls were not affected. ROOT CAUSE: An internal Google service sent an unexpectedly high amount of traffic to the BigQuery Streaming API service. The internal service used a different entry point that was not subject to quota limits. Google's internal load balancers drop requests that exceed the capacity limits of a service. In this case, the capacity limit for the Streaming API service had been configured higher than its true capacity. As a result, the internal Google service was able to send too many requests to the Streaming API, causing it to fail for a percentage of responses. The Streaming API service sends requests to BigQuery's Metadata service in order to handle incoming Streaming requests. This elevated volume of requests exceeded the capacity of the Metadata service which resulted in errors for BigQuery jobs API calls. REMEDIATION AND PREVENTION: The incident started at 17:03. Our monitoring detected the issue at 17:20 as error rates started to increase. Our engineers blocked traffic from the internal Google client causing the overload shortly thereafter which immediately started to mitigate the impact of the incident. Error rates dropped to normal by 18:14. In order to prevent a recurrence of this type of incident we will enforce quotas for internal Google clients on requests to the Streaming service in order to prevent a single client sending too much traffic. We will also set the correct capacity limits for the Streaming API service based on improved load tests in order to ensure that internal clients cannot exceed the service's capacity. We apologize again to customers impacted by this incident.
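
As the impact description notes, customers who retried on error were able to mitigate the impact. The sketch below shows that pattern with the current google-cloud-bigquery Python client; the table ID, row payload, and backoff parameters are placeholders, and the client library differs from the 2016-era API:

    # Sketch: retry BigQuery streaming inserts on transient 500/503 failures.
    # TABLE_ID is a placeholder; requires google-cloud-bigquery and default credentials.
    import random
    import time

    from google.api_core.exceptions import InternalServerError, ServiceUnavailable
    from google.cloud import bigquery

    TABLE_ID = "my-project.my_dataset.my_table"  # placeholder


    def stream_with_retries(client: bigquery.Client, rows: list, max_attempts: int = 5) -> None:
        for attempt in range(1, max_attempts + 1):
            try:
                errors = client.insert_rows_json(TABLE_ID, rows)
            except (InternalServerError, ServiceUnavailable) as exc:
                if attempt == max_attempts:
                    raise
                # Capped exponential backoff with jitter between attempts.
                delay = min(8.0, 0.5 * 2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
                print(f"attempt {attempt} got HTTP {exc.code}; retrying in {delay:.1f}s")
                time.sleep(delay)
            else:
                if errors:
                    # Per-row errors (e.g. schema mismatches) are not transient; surface them.
                    raise RuntimeError(f"per-row insert errors: {errors}")
                return  # all rows accepted


    if __name__ == "__main__":
        stream_with_retries(bigquery.Client(), [{"name": "example", "value": 1}])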

Last Update: A few months ago

RESOLVED: Incident 18018 - Streaming API issues with BigQuery

We experienced an issue with BigQuery streaming API returning 500/503 responses that has been resolved for all affected customers as of 18:11 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 16007 - Intermittent Google App Engine URLFetch API deadline exceeded errors.

The issue with Google App Engine URLFetch API service should have been resolved for all affected applications as of 02:50 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16007 - Intermittent Google App Engine URLFetch API deadline exceeded errors.

We are still investigating an intermittent issue with Google App Engine URLFetch API calls to non-Google services failing with deadline exceeded errors. We will provide another status update by 03:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16007 - Intermittent Google App Engine URLFetch API deadline exceeded errors.

We are currently investigating an intermittent issue with Google App Engine URLFetch API service. Fetch requests to non-Google related services are failing with deadline exceeded errors. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 02:30 US/Pacific with current details.
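
For applications affected by these timeouts, the usual client-side mitigations are to raise the per-request deadline and to retry failed fetches. A minimal sketch for a first-generation App Engine Python application, assuming the standard urlfetch service; the 60-second deadline and the retry count are illustrative choices.

    from google.appengine.api import urlfetch, urlfetch_errors

    def fetch_with_retries(url, attempts=3):
        # Raise the per-call deadline well above the default and retry on timeouts.
        for attempt in range(attempts):
            try:
                return urlfetch.fetch(url, deadline=60)
            except urlfetch_errors.DeadlineExceededError:
                if attempt == attempts - 1:
                    raise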

Last Update: A few months ago

RESOLVED: Incident 16011 - Compute Engine SSD Persistent disk latency in zone US-Central1-a

SUMMARY: On Tuesday, 28 June 2016 Google Compute Engine SSD Persistent Disks experienced elevated write latency and errors in one zone for a duration of 211 minutes. We would like to apologize for the length and severity of this incident. We are taking immediate steps to prevent a recurrence and improve reliability in the future. DETAILED DESCRIPTION OF IMPACT: On Tuesday, 28 June 2016 from 18:15 to 21:46 PDT SSD Persistent Disks (PD) in zone us-central1-a experienced elevated latency and errors for most writes. Instances using SSD as their root partition were likely unresponsive. For instances using SSD as a secondary disk, IO latency and errors were visible to applications. Standard (i.e. non-SSD) PD in us-central1-a suffered slightly elevated latency and errors. Latency and errors also occurred when taking and restoring from snapshots of Persistent Disks. Disk creation operations also had elevated error rates, both for standard and SSD PD. Persistent Disks outside of us-central1-a were unaffected. ROOT CAUSE: Two concurrent routine maintenance events triggered a rebalancing of data by the distributed storage system underlying Persistent Disk. This rebalancing is designed to make maintenance events invisible to the user, by redistributing data evenly around unavailable storage devices and machines. A previously unseen software bug, triggered by the two concurrent maintenance events, meant that disk blocks which became unused as a result of the rebalance were not freed up for subsequent reuse, depleting the available SSD space in the zone until writes were rejected. REMEDIATION AND PREVENTION: The issue was resolved when Google engineers reverted one of the maintenance events that triggered the issue. A fix for the underlying issue is already being tested in non-production zones. To reduce downtime related to similar issues in future, Google engineers are refining automated monitoring such that, if this issue were to recur, engineers would be alerted before users saw impact. We are also improving our automation to better coordinate different maintenance operations on the same zone to reduce the time it takes to revert such operations if necessary.

Last Update: A few months ago

RESOLVED: Incident 16005 - Issue with Developers Console

SUMMARY: On Thursday 9 June 2016, the Google Cloud Console was unavailable for a duration of 91 minutes, with significant performance degradation in the preceding half hour. Although this did not affect user resources running on the Google Cloud Platform, we appreciate that many of our customers rely on the Cloud Console to manage those resources, and we apologize to everyone who was affected by the incident. This report is to explain to our customers what went wrong, and what we are doing to make sure that it does not happen again. DETAILED DESCRIPTION OF IMPACT: On Thursday 9 June 2016 from 20:52 to 22:23 PDT, the Google Cloud Console was unavailable. Users who attempted to connect to the Cloud Console observed high latency and HTTP server errors. Many users also observed increasing latency and error rates during the half hour before the incident. Google Cloud Platform resources were unaffected by the incident and continued to run normally. All Cloud Platform resource management APIs remained available, allowing Cloud Platform resources to be managed via the Google Cloud SDK or other tools. ROOT CAUSE: The Google Cloud Console runs on Google App Engine, where it uses internal functionality that is not used by customer applications. Google App Engine version 1.9.39 introduced a bug in one internal function which affected Google Cloud Console instances, but not customer-owned applications, and thus escaped detection during testing and during initial rollout. Once enough instances of Google Cloud Console had been switched to 1.9.39, the console was unavailable and internal monitoring alerted the engineering team, who restored service by starting additional Google Cloud Console instances on 1.9.38. During the entire incident, customer-owned applications were not affected and continued to operate normally. To prevent a future recurrence, Google engineers are augmenting the testing and rollout monitoring to detect low error rates on internal functionality, complementing the existing monitoring for customer applications. REMEDIATION AND PREVENTION: When the issue was provisionally identified as a specific interaction between Google App Engine version 1.9.39 and the Cloud Console, App Engine engineers brought up capacity running the previous App Engine version and transferred the Cloud Console to it, restoring service at 22:23 PDT. The low-level bug that triggered the error has been identified and fixed. Google engineers are increasing the fidelity of the rollout monitoring framework to detect error signatures that suggest negative interactions of individual apps with a new App Engine release, even if the signatures are invisible in global App Engine performance statistics. We apologize again for the inconvenience this issue caused our customers.

Last Update: A few months ago

RESOLVED: Incident 16012 - Newly created instances may be experiencing packet loss.

SUMMARY: On Wednesday 29 June 2016, newly created Google Compute Engine instances and newly created network load balancers in all zones were partially unreachable for a duration of 106 minutes. We know that many customers depend on the ability to rapidly deploy and change configurations, and apologise for our failure to provide this to you during this time. DETAILED DESCRIPTION OF IMPACT: On Wednesday 29 June 2016, from 11:58 until 13:44 US/Pacific, new Google Compute Engine instances and new network load balancers were partially unreachable via the network. In addition, changes to existing network load balancers were only partially applied. The level of unreachability depended on traffic path rather than instance or load balancer location. Overall, the average impact on new instances was 50% of traffic in the US and around 90% in Asia and Europe. Existing and unchanged instances and load balancers were unaffected. ROOT CAUSE: At 11:58 US/Pacific, a scheduled upgrade to Google’s network control system started, introducing an additional access control check for network configuration changes. This inadvertently removed the access of GCE’s management system to network load balancers in this environment. Only a fraction of Google's network locations require this access as an older design has an intermediate component doing access updates. As a result, these locations did not receive updates for new and changed instances or load balancers. The change was only tested at network locations that did not require the direct access, which resulted in the issue not being detected during testing and canarying and being deployed globally. REMEDIATION AND PREVENTION: After identifying the root cause, the access control check was modified to allow access by GCE’s management system. The issue was resolved when this modification was fully deployed. To prevent future incidents, the network team is making several changes to their deployment processes. This will improve the level of testing and canarying to catch issues earlier, especially where an issue only affects a subset of the environments at Google. The deployment process will have the rollback procedure enhanced to allow the quickest possible resolution for future incidents. The access control system that was at the root of this issue will also be modified to improve operations that interact with it. For example, it will be integrated with a Google-wide change logging system to allow faster detection of issues caused by access control changes. It will also be outfitted with a dry run mode to allow consequences of changes to be tested during development time. Once again we would like to apologise for falling below the level of service you rely on.

Last Update: A few months ago

RESOLVED: Incident 16012 - Newly created instances may be experiencing packet loss.

The issue with new Google Compute Engine instances experiencing packet loss on startup should have been resolved for all affected instances as of 13:57 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16012 - Newly created instances may be experiencing packet loss.

The issue with new Google Compute Engine instances experiencing packet loss on startup should have been resolved for some instances and we expect a full resolution in the near future. We will provide another status update by 14:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16012 - Newly created instances may be experiencing packet loss.

We are experiencing an issue with new Google Compute Engine instances experiencing packet loss on startup beginning at Wednesday, 2016-06-29 12:18 US/Pacific. Current data indicates that 100% of instances are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 13:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16012 - Newly created instances may be experiencing packet loss.

We are investigating reports of an issue with Compute Engine. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16011 - Compute Engine SSD Persistent disk latency in zone US-Central1-a

The issue with Compute Engine SSD persistent disk latency in zone US-Central1-a should have been resolved for all projects as of 21:57 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16011 - Compute Engine SSD Persistent disk latency in zone US-Central1-a

The issue with Compute Engine SSD Persistent disk latency in zone US-Central1-a should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16011 - Compute Engine SSD Persistent disk latency in zone US-Central1-a

We are still investigating the issue with Compute Engine SSD Persistent disk latency in zone US-Central1-a. We will provide another status update by 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16011 - Compute Engine SSD Persistent disk latency in zone US-Central1-a

We are investigating an issue with Compute Engine SSD Persistent disk latency in zone US-Central1-a. We will provide more information by 21:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16007 - Cloud Console is not displaying project lists

The issue with Cloud Console not displaying lists of projects should have been resolved for all affected projects as of 17:39 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16007 - Cloud Console is not displaying project lists

The issue with Cloud Console not displaying lists of projects should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 18:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16007 - Cloud Console is not displaying project lists

We are experiencing an ongoing issue with the Cloud Console not displaying lists of projects beginning at Tuesday, 2016-06-28 14:29 US/Pacific. Current data indicates that approximately 10% of projects are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 17:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16006 - Requests to Cloud Console failing

The issue with Cloud Console serving errors should have been resolved for all affected users as of 11:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16006 - Requests to Cloud Console failing

The issue with Cloud Console serving errors should be resolved for many users and we expect a full resolution shortly. We will provide another status update by 11:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16006 - Requests to Cloud Console failing

We are experiencing an issue with Cloud Console serving errors beginning at Tuesday, 2016-06-14 08:49 US/Pacific. Current data indicates that errors are intermittent but may affect all users. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 10:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16006 - Requests to Cloud Console failing

We are investigating reports of an issue with Google Cloud Console. We will provide more information by 09:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16006 - Requests to Pantheon failing

We are investigating reports of an issue with Pantheon. We will provide more information by 09:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16005 - Issue with Developers Console

The issue with Developers Console should have been resolved for all affected users as of 22:25 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16005 - Issue with Developers Console

We are experiencing an issue with Developers Console beginning at Thursday, 2016-06-09 21:09 US/Pacific. Current data indicates that all users are affected by this issue. The gcloud command line tool can be used as a workaround for those who need to manage their resources immediately. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 22:30 US/Pacific with current details.
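
The gcloud command line tool is the workaround named above; the same management APIs can also be called directly from the client libraries. A sketch using the Google API Python client, assuming google-api-python-client is installed and Application Default Credentials are configured; the project and zone values are placeholders.

    from googleapiclient import discovery

    def list_instances(project, zone):
        # The management API path is separate from the Console, which is why the
        # gcloud/API route works as a workaround while the Console is down.
        compute = discovery.build('compute', 'v1')
        result = compute.instances().list(project=project, zone=zone).execute()
        return result.get('items', [])

    for inst in list_instances('my-project', 'us-central1-a'):
        print('{} {}'.format(inst['name'], inst['status']))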

Last Update: A few months ago

UPDATE: Incident 16005 - Issue with Developers Console

We are investigating an issue with Developers Console. We will provide more information by 21:40 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16004 - We are investigating an issue with Developers Console

We are investigating an issue with Developers Console. We will provide more information by 21:40 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18016 - Intermittent connectivity issues with BigQuery

An incident report for this issue is available at https://status.cloud.google.com/incident/bigquery/18015 as these issues shared a common cause.

Last Update: A few months ago

RESOLVED: Incident 18015 - Google BigQuery issues

SUMMARY: On Wednesday 18 May 2016 the BigQuery API was unavailable for two periods totaling 31 minutes. We understand how important access to your data stored in BigQuery is and we apologize for the impact this had on you. We have investigated the incident to determine how we can mitigate future issues and provide better service for you in the future. DETAILED DESCRIPTION OF IMPACT: On Wednesday 18 May 2016 from 11:50 until 12:15 PDT all non-streaming BigQuery API calls failed, and additionally from 14:41 until 14:47, 70% of calls failed. An error rate of 1% occurred from 11:28 until 15:34. API calls affected by this issue experienced elevated latency and eventually returned an HTTP 500 status with an error message of "Backend Error". The BigQuery web console was also unavailable during these periods. The streaming API and BigQuery export of logs and usage data were unaffected. ROOT CAUSE: In 2015 BigQuery introduced datasets located in Europe. This required infrastructure to allow BigQuery API calls to be routed to an appropriate zone. This infrastructure was deployed uneventfully and has been operating in production for some time. The errors on 18 May were caused when a new configuration was deployed to improve routing of APIs, and then subsequently rolled back. The engineering team has made changes to the routing configuration for BigQuery API calls to prevent this issue from recurring in the future, and to more rapidly detect elevated error levels in BigQuery API calls. Finally, we would like to apologize for this issue - particularly its scope and duration. We know that BigQuery is a critical component of many GCP deployments, and we are committed to continually improving its availability.

Last Update: A few months ago

UPDATE: Incident 16011 - Snapshotting of some PDs in US-CENTRAL-1A are failing

We are investigating reports that snapshotting of some PDs in US-CENTRAL-1A is failing. We just wanted to let you know that we're aware of it. No action is required on your end.

Last Update: A few months ago

UPDATE: Incident 16031 - Google Cloud Storage POST errors

From Wednesday, 2016-05-26 14:57 until 15:17 US/Pacific, the Google Cloud Storage XML and JSON APIs were unavailable for 72% of POST requests in the US. Requests originating outside of the US or using a method other than POST were unaffected. Affected queries returned error 500. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16010 - Network disruption in us-central1-c

The issue with networking in us-central1-c should have been resolved for all affected instances as of 16:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16010 - Network disruption in us-central1-c

The issue with networking in us-central1-c should have been resolved for the majority of affected instances and we expect a full resolution in the near future. Customers still experiencing networking issues in us-central1-c can perform a Stop/Start cycle on their instances to regain connectivity. We will provide another status update by 17:00 US/Pacific with current details.
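
The suggested recovery above is a Stop/Start cycle on an affected instance, which can be done from gcloud or the Console. The sketch below shows the same cycle through the Compute Engine API using the Python client library, assuming google-api-python-client and Application Default Credentials; the identifiers are placeholders, and stopping an instance interrupts whatever is running on it.

    import time

    from googleapiclient import discovery

    def stop_start_cycle(project, zone, instance):
        compute = discovery.build('compute', 'v1')

        def wait_for(operation):
            # Poll the zone operation until Compute Engine reports it as DONE.
            while True:
                op = compute.zoneOperations().get(
                    project=project, zone=zone, operation=operation['name']).execute()
                if op['status'] == 'DONE':
                    return op
                time.sleep(5)

        wait_for(compute.instances().stop(
            project=project, zone=zone, instance=instance).execute())
        wait_for(compute.instances().start(
            project=project, zone=zone, instance=instance).execute())

    stop_start_cycle('my-project', 'us-central1-c', 'my-instance')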

Last Update: A few months ago

UPDATE: Incident 16010 - Network disruption in us-central1-c

We are still investigating the issue with networking in us-central1-c. Current data indicates that approximately 0.5% of instances in the zone are affected by this issue. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16010 - Network disruption in us-central1-c

We are experiencing an issue with networking in us-central1-c beginning at Friday, 2016-05-20 13:08. Some instances will be inaccessible via internal and external IP addresses. No other zones are affected by this incident. We will provide more information by 15:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16010 - Network disruption in us-central1-c

We are investigating an issue with networking in us-central1-c. We will provide more information by 14:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 17010 - Issues with Cloud SQL First Generation instances

SUMMARY: On Tuesday 17 May 2016, connections to Cloud SQL instances in the Central United States region experienced an elevated error rate for 130 minutes. We apologize to customers who were affected by this incident. We know that reliability is critical for you and we are committed to learning from incidents in order to improve the future reliability of our service. DETAILED DESCRIPTION OF IMPACT: On Tuesday 17 May 2016 from 04:15 to 06:12 and from 08:24 to 08:37 PDT, connections to Cloud SQL instances in the us-central1 region experienced an elevated error rate. The average rate of connection errors to instances in this region was 10.5% during the first part of the incident and 1.9% during the second part of the incident. 51% of in-use Cloud SQL instances in the affected region were impacted during the first part of the incident; 4.2% of in-use instances were impacted during the second part. Cloud SQL Second Generation instances were not impacted. ROOT CAUSE: Clients connect to a Cloud SQL frontend service that forwards the connection to the correct MySQL database server. The frontend calls a separate service to start up a new Cloud SQL instance if a connection arrives for an instance that is not running. This incident was triggered by a Cloud SQL instance that could not successfully start. The incoming connection requests for this instance resulted in a large number of calls to the start up service. This caused increased memory usage of the frontend service as start up requests backed up. The frontend service eventually failed under load and dropped some connection requests due to this memory pressure. REMEDIATION AND PREVENTION: Google received its first customer report at 04:39 PDT and we tried to remediate the problem by redirecting new connections to different datacenters. This effort proved unsuccessful as the start up capacity was used up there also. At 06:12 PDT, we fixed the issue by blocking all incoming connections to the misbehaving Cloud SQL instance. At 08:24 PDT, we moved this instance to a separate pool of servers and restarted it. However, the separate pool of servers did not provide sufficient isolation for the service that starts up instances, causing the incident to recur. We shut down the instance at 08:37 PDT, which resolved the incident. To prevent incidents of this type in the future, we will ensure that a single Cloud SQL instance cannot use up all the capacity of the start up service. In addition, we will improve our monitoring in order to detect this type of issue more quickly. We apologize for the inconvenience this issue caused our customers.

Last Update: A few months ago

UPDATE: Incident 18016 - Intermittent connectivity issues with BigQuery

The intermittent connectivity issues with BigQuery should have been resolved. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18016 - Intermittent connectivity issues with BigQuery

We are currently investigating intermittent connectivity issues with BigQuery that are affecting some of our customers. We'll provide another update at 5:30 PM PDT.

Last Update: A few months ago

UPDATE: Incident 18015 - Google BigQuery issues

The issue with BigQuery API should have been resolved for all affected projects as of 12:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18015 - Google BigQuery issues

We are currently investigating an issue with the BigQuery API. We'll provide an update at 12:30 PDT

Last Update: A few months ago

RESOLVED: Incident 17010 - Issues with Cloud SQL First Generation instances

The issue with Cloud SQL should have been resolved for all affected Cloud SQL instances as of 06:20 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 17010 - Issues with Cloud SQL First Generation instances

The issue is confirmed to be confined to a subset of Cloud SQL First Generation instances. We have started to apply mitigation measures. We will provide next update by 07:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16005 - Issues with App Engine applications connecting to Cloud SQL

The issue is confirmed to affect a subset of Cloud SQL First Generation instances. All further updates will be provided here: https://cloud-status.googleplex.com/incident/cloud-sql/17010

Last Update: A few months ago

UPDATE: Incident 17010 - Issues with Cloud SQL First Generation instances

We are currently experiencing an issue with Cloud SQL that affects Cloud SQL First Generation instances, and applications depending on them.

Last Update: A few months ago

UPDATE: Incident 16005 - Issues with App Engine applications connecting to Cloud SQL

We are currently investigating an issue with App Engine that affects applications using Cloud SQL. We will provide more information about the issue by 06:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16009 - Elevated latency for operations on us-central1-a

The issue with elevated latency on GCE management operations on us-central1-a has been resolved as of 10:24 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16009 - Elevated latency for operations on us-central1-a

The applied mitigation measures have been successful and, as a result, latency on GCE management operations in us-central1-a has returned to normal levels. The customer-visible impact should now be over. We are finalizing the investigation of the root cause of this issue and applying further measures to prevent it from happening again in the future.

Last Update: A few months ago

UPDATE: Incident 16009 - Elevated latency for operations on us-central1-a

We are continuing to apply mitigation measures and are seeing further latency improvements on GCE management operations in us-central1-a. We will provide next update by 04:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16009 - Elevated latency for operations on us-central1-a

We have started to apply mitigation measures and are seeing latency improvements on GCE management operations in us-central1-a. We are continuing to work on resolving this issue and will provide next status update by 02:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16009 - Elevated latency for operations on us-central1-a

We are still investigating the elevated latency on GCE management operations on us-central1-a. We will provide another status update by 01:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16009 - Elevated latency for operations on us-central1-a

We are investigating elevated latency on GCE management operations on us-central1-a. Running instances and networking continue to operate normally. If you need to create new resources we recommend using other zones within this region for the time being. We will provide more information by 23:50 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16003 - Authentication issues with Google Cloud Platform APIs

SUMMARY: On Tuesday 19th April 2016, 1.1% of all requests to obtain new Google OAuth 2.0 tokens failed for a period of 70 minutes. Users of affected applications experienced authentication errors. This incident affected all Google services that use OAuth. We apologize to any customer whose application was impacted by this incident. We take outages very seriously and are strongly focused on learning from these incidents to improve the future reliability of our services. DETAILED DESCRIPTION OF IMPACT: On Tuesday 19 April 2016 from 06:12 to 07:22 PDT, the Google OAuth 2.0 service returned HTTP 500 errors for 1.1% of all requests. OAuth tokens are granted to applications on behalf of users. The application requesting the token is identified by its client ID. Google's OAuth service looks up the application associated with a client ID before granting the new token. If the mapping from client ID to application is not cached by Google's OAuth service, then it is fetched from a separate client ID lookup service. The client ID lookup service dropped some requests during the incident, which caused those token requests to fail. The token request failures predominantly affected applications which had not populated the client ID cache because they were less frequently used. Such infrequently-used applications may have experienced high error rates on token requests for their users, though the overall average error rate was 1.1% measured across all applications. Once access tokens were obtained, they could be used without problems. Tokens issued before the incident continued to function until they expired. Any requests for tokens that did not use a client ID were not affected by this incident. ROOT CAUSE: Google's OAuth system depends on an internal service to lookup details of the client ID that is making the token request. During this incident, the client ID lookup service had insufficient capacity to respond to all requests to lookup client ID details. Before the incident started, the client ID lookup service had been running close to its rated capacity. In an attempt to prevent a future problem, Google SREs triggered an update to add capacity to the service at 05:30. Normally adding capacity does not cause a restart of the service. However, the update process had a misconfiguration which caused a rolling restart. While servers were restarting, the capacity of the service was reduced further. In addition, the restart triggered a bug in a specific client's code that caused its cache to be invalidated, leading to a spike in requests from that client. Google's systems are designed to throttle clients in these situations. However, the throttling was insufficient to prevent overloading of the client ID lookup service. Google's software load balancer was configured to drop a fraction of incoming requests to the client ID lookup service during overload in order to prevent cascading failure. In this case, the load balancer was configured too conservatively and dropped more traffic than needed. REMEDIATION AND PREVENTION: Google's internal monitoring systems detected the incident at 06:28 and our engineers isolated the root cause as an overload in the client ID lookup service at 06:47. We added additional capacity to work around the issue at 07:07 and the error rate dropped to normal levels by 07:22. In order to prevent future incidents of this type from occurring, we are taking several actions. 1. 
We will improve our monitoring to detect immediately when usage of the client ID lookup service gets close to its capacity. 2. We will ensure that the client ID lookup service always has more than 10% spare capacity at peak. 3. We will change the load balancer configuration so that it will not uniformly drop traffic when overloaded. Instead, the load balancer will throttle the clients that are causing traffic spikes. 4. We will change the update process to minimize the capacity that is temporarily lost during an update. 5. We will fix the client bug that caused its client ID cache to be invalidated.
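
One practical takeaway from this report is that tokens issued before the incident kept working until they expired, so clients that cache credentials and refresh only when needed were less exposed than clients that request a fresh token for every call. A sketch of that pattern with the google-auth Python library, assuming Application Default Credentials are configured; the retry and backoff numbers are illustrative.

    import time

    import google.auth
    from google.auth.exceptions import RefreshError, TransportError
    from google.auth.transport.requests import Request

    _credentials, _project = google.auth.default()

    def get_access_token(max_attempts=4):
        # Reuse the cached token while it is valid; refresh with backoff otherwise.
        if _credentials.valid:
            return _credentials.token
        for attempt in range(max_attempts):
            try:
                _credentials.refresh(Request())
                return _credentials.token
            except (RefreshError, TransportError):
                if attempt == max_attempts - 1:
                    raise
                time.sleep(2 ** attempt)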

Last Update: A few months ago

UPDATE: Incident 16008 - We are investigating a problem with GCE Instances that were created before mid-February in Zones us-central1-b, europe-west1-c and asia-east1-b

An issue is ongoing with a very small number of Google Compute Engine instances hanging during startup. The root cause has been established and mitigation is in progress. Affected instances can be recovered by manually stopping them and starting them again. We were able to identify affected projects and will notify appropriate project contacts within the next 60 minutes. Current data indicates that less than 0.001% of projects were affected by the issue. No further public communications will be made on this issue.

Last Update: A few months ago

UPDATE: Incident 16008 - We are investigating a problem with GCE Instances that were created before mid-February in Zones us-central1-b, europe-west1-c and asia-east1-b

We are still investigating the issue with Google Compute Engine. We will provide another status update by 03:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - We are investigating a problem with GCE Instances that were created before mid-February in Zones us-central1-b, europe-west1-c and asia-east1-b

We are still investigating the issue with Google Compute Engine. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16008 - We are investigating a problem with GCE Instances that were created before mid-February in Regions us-central1-b, europe-west1-c and asia-east1-a

We are still investigating the issue with Google Compute Engine. We will provide another status update by 21:00 US/Pacific with current details. Affected zones correction: Zone [asia-east1-b] is currently affected but zone [asia-east1-a] is not affected.

Last Update: A few months ago

UPDATE: Incident 16008 - We are investigating a problem with GCE Instances that were created before mid-February in Regions us-central1-b, europe-west1-c and asia-east1-a

We are still investigating this problem and we'll provide a new update at 7:00 pm PT

Last Update: A few months ago

UPDATE: Incident 16008 - We are investigating a problem with GCE Instances that were created before mid-February in Regions us-central1-b, europe-west1-c and asia-east1-a

We continue to investigate this problem and we'll provide a new update at 6:00 pm PT.

Last Update: A few months ago

UPDATE: Incident 16008 - We are investigating a problem with GCE Instances that were created before mid-February in Regions us-central1-b, europe-west1-c and asia-east1-a

We are currently investigating a problem with GCE instances created before mid-February in zones asia-east1-b and europe-west1-c where instance restarts will render them unavailable.

Last Update: A few months ago

UPDATE: Incident 16003 - Authentication issues with Google Cloud Platform APIs

The issue with Authentication Services should have been resolved for all affected projects as of 07:24 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16003 - Authentication issues with Google Cloud Platform APIs

We are still investigating the issue with Authentication services for Google Cloud Platform APIs. We will provide another status update by 08:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 16007 - Connectivity issues in all regions

SUMMARY: On Monday, 11 April, 2016, Google Compute Engine instances in all regions lost external connectivity for a total of 18 minutes, from 19:09 to 19:27 Pacific Time. We recognize the severity of this outage, and we apologize to all of our customers for allowing it to occur. As of this writing, the root cause of the outage is fully understood and GCE is not at risk of a recurrence. In this incident report, we are sharing the background, root cause and immediate steps we are taking to prevent a future occurrence. Additionally, our engineering teams will be working over the next several weeks on a broad array of prevention, detection and mitigation systems intended to add additional defense in depth to our existing production safeguards. Finally, to underscore how seriously we are taking this event, we are offering GCE and VPN service credits to all impacted GCP applications equal to (respectively) 10% and 25% of their monthly charges for GCE and VPN. These credits exceed what we promise in the Compute Engine Service Level Agreement (https://cloud.google.com/compute/sla) or Cloud VPN Service Level Agreement (https://cloud.google.com/vpn/sla), but are in keeping with the spirit of those SLAs and our ongoing intention to provide a highly-available Google Cloud product suite to all our customers. DETAILED DESCRIPTION OF IMPACT: On Monday, 11 April, 2016 from 19:09 to 19:27 Pacific Time, inbound internet traffic to Compute Engine instances was not routed correctly, resulting in dropped connections and an inability to reconnect. The loss of inbound traffic caused services depending on this network path to fail as well, including VPNs and L3 network load balancers. Additionally, the Cloud VPN offering in the asia-east1 region experienced the same traffic loss starting at an earlier time of 18:14 Pacific Time but the same end time of 19:27. This event did not affect Google App Engine, Google Cloud Storage, and other Google Cloud Platform products; it also did not affect internal connectivity between GCE services including VMs, HTTP and HTTPS (L7) load balancers, and outbound internet traffic. TIMELINE and ROOT CAUSE: Google uses contiguous groups of internet addresses -- known as IP blocks -- for Google Compute Engine VMs, network load balancers, Cloud VPNs, and other services which need to communicate with users and systems outside of Google. These IP blocks are announced to the rest of the internet via the industry-standard BGP protocol, and it is these announcements which allow systems outside of Google’s network to ‘find’ GCP services regardless of which network they are on. To maximize service performance, Google’s networking systems announce the same IP blocks from several different locations in our network, so that users can take the shortest available path through the internet to reach their Google service. This approach also enhances reliability; if a user is unable to reach one location announcing an IP block due to an internet failure between the user and Google, this approach will send the user to the next-closest point of announcement. This is part of the internet’s fabled ability to ‘route around’ problems, and it masks or avoids numerous localized outages every week as individual systems in the internet have temporary problems. At 14:50 Pacific Time on April 11th, our engineers removed an unused GCE IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network. 
By itself, this sort of change was harmless and had been performed previously without incident. However, on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration. The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management. In attempting to resolve this inconsistency the network management software is designed to ‘fail safe’ and revert to its current configuration rather than proceeding with the new configuration. However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network. One of our core principles at Google is ‘defense in depth’, and Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations in the event of an upstream failure or bug. These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially, however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout. As the rollout progressed, those sites which had been announcing GCE IP blocks ceased to do so when they received the new configuration. The fault tolerance built into our network design worked correctly and sent GCE traffic to the remaining sites which were still announcing GCE IP blocks. As more and more sites stopped announcing GCE IP blocks, our internal monitoring picked up two anomalies: first, the Cloud VPN in asia-east1 stopped functioning at 18:14 because it was announced from fewer sites than GCE overall, and second, user latency to GCE was anomalously rising as more and more users were sent to sites which were not close to them. Google’s Site Reliability Engineers started investigating the problem when the first alerts were received, but were still trying to determine the root cause 53 minutes later when the last site announcing GCE IP blocks received the configuration at 19:07. With no sites left announcing GCE IP blocks, inbound traffic from the internet to GCE dropped quickly, reaching >95% loss by 19:09. Internal monitors generated dozens of alerts in the seconds after the traffic loss became visible at 19:08, and the Google engineers who had been investigating a localized failure of the asia-east1 VPN now knew that they had a widespread and serious problem. They did precisely what we train for, and decided to revert the most recent configuration changes made to the network even before knowing for sure what the problem was. This was the correct action, and the time from detection to decision to revert to the end of the outage was thus just 18 minutes.
With the immediate outage over, the team froze all configuration changes to the network, and worked in shifts overnight to ensure first that the systems were stable and that there was no remaining customer impact, and then to determine the root cause of the problem. By 07:00 on April 12 the team was confident that they had established the root cause as a software bug in the network configuration management software. DETECTION, REMEDIATION AND PREVENTION: With both the incident and the immediate risk now over, the engineering team’s focus is on prevention and mitigation. There are a number of lessons to be learned from this event -- for example, that the safeguard of a progressive rollout can be undone by a system designed to mask partial failures -- which yield similarly-clear actions which we will take, such as monitoring directly for a decrease in capacity or redundancy even when the system is still functioning properly. It is our intent to enumerate all the lessons we can learn from this event, and then to implement all of the changes which appear useful. As of the time of this writing in the evening of 12 April, there are already 14 distinct engineering changes planned spanning prevention, detection and mitigation, and that number will increase as our engineering teams review the incident with other senior engineers across Google in the coming week. Concretely, the immediate steps we are taking include: * Monitoring targeted GCE network paths to detect if they change or cease to function; * Comparing the IP block announcements before and after a network configuration change to ensure that they are identical in size and coverage; * Semantic checks for network configurations to ensure they contain specific Cloud IP blocks. A FINAL WORD: We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages. This incident report is both longer and more detailed than usual precisely because we consider the April 11th event so important, and we want you to understand why it happened and what we are doing about it. It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services, and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform. Sincerely, Benjamin Treynor Sloss | VP 24x7 | Google
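
Among the preventions listed above is comparing the IP block announcements before and after a configuration change to confirm they are identical in size and coverage. Purely as an illustration of that kind of sanity check, and not Google's actual tooling, here is a sketch using Python's standard ipaddress module with made-up address blocks.

    import ipaddress

    def coverage(blocks):
        # Collapse the blocks and count the total number of announced addresses.
        nets = [ipaddress.ip_network(b) for b in blocks]
        collapsed = frozenset(ipaddress.collapse_addresses(nets))
        return sum(n.num_addresses for n in collapsed), collapsed

    def announcements_unchanged(before, after):
        # Reject a configuration push that would shrink or reshape what is announced.
        return coverage(before) == coverage(after)

    old = ['130.211.0.0/16', '104.154.0.0/15']   # illustrative blocks only
    new = ['130.211.0.0/16']                     # a block accidentally dropped
    assert not announcements_unchanged(old, new)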

Last Update: A few months ago

RESOLVED: Incident 16007 - Connectivity issues in all regions

The issue with networking should have been resolved for all affected services as of 19:27 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident on the Cloud Status Dashboard once we have completed our internal investigation. For everyone who is affected, we apologize for any inconvenience you experienced.

Last Update: A few months ago

UPDATE: Incident 16007 - Connectivity issues in all regions

The issue with networking should have been resolved for all affected services as of 19:27 US/Pacific. We're continuing to monitor the situation. We will provide another status update by 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16007 - Connectivity issues with Cloud VPN in asia-east1

Current data indicates that there are severe network connectivity issues in all regions. Google engineers are currently working to resolve this issue. We will post a further update by 20:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16007 - Connectivity issues with Cloud VPN in asia-east1

We are experiencing an issue with Cloud VPN in asia-east1 beginning at Monday, 2016-04-11 18:25 US/Pacific. Current data suggests that all Cloud VPN traffic in this region is affected. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16007 - Connectivity issues with Cloud VPN in asia-east1

We are investigating reports of an issue with Cloud VPN in asia-east1. We will provide more information by 19:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 18014 - BigQuery streaming inserts delayed in EU

SUMMARY On Wednesday 5 April and Thursday 6 April 2016, some streaming inserts to BigQuery datasets in the EU were delayed by up to 16 hours and 46 minutes. We sincerely apologise for these delays and we are addressing the root causes of the issue as part of our commitment to BigQuery's availability and responsiveness. DETAILED DESCRIPTION OF IMPACT From 15:16 PDT to 23:40 on Wednesday 05 April 2016, some BigQuery streaming inserts to datasets in the EU did not immediately become available to subsequent queries. From 23:40, new streaming inserts worked normally, but some previously delayed inserts remained unavailable to BigQuery queries. Virtually all delayed inserts were committed and available by 07:52 on Thursday 06 April. The event was accompanied by slightly elevated error rates (< 0.7% failure rate) and latency (< 50% latency increase) of API calls for streaming inserts. ROOT CAUSE BigQuery streaming inserts are buffered in one of Google's large-scale storage systems before being committed to the main BigQuery repository. At 15:16 PDT on Wednesday 05 April, this storage system began to experience issues in one of the datacenters that host BigQuery datasets in the EU, blocking BigQuery's I/O operations for streaming inserts. The impact reached monitoring threshold levels after a few hours, and at 18:29 automated monitoring systems sent alerts to the Google engineering team, but the monitoring systems displayed the alerts in a way that disguised the scale of the issue and made it seem to be a low priority. This error was identified at 23:01, and Google engineers began routing all European streaming insert traffic to another EU datacenter, restoring normal insert behaviour by 23:40. The delayed inserts in the system were committed when the underlying storage system was restored to service. REMEDIATION AND PREVENTION Google engineers are addressing the technical root cause of the incident by increasing the fault-tolerance of I/O between BigQuery and the storage system that buffers streaming inserts. The principal remediation efforts for this event, however, are focused on the systems monitoring, alert escalation, and data visualisation issues which were involved. Google engineers are updating the BigQuery monitoring systems to more clearly represent the scale of system behaviour, and modifying internal procedures and documentation accordingly.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery streaming inserts delayed in EU

The issue with BigQuery job execution has been fully resolved. Affected customers will be notified directly in order to assess any potential lingering impact. We will also provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery streaming inserts delayed in EU

Current data indicates that BigQuery streaming inserts are being applied normally. We are still working on restoring the visibility of some streaming inserts to EU datasets from 18:00 to 00:00 US/Pacific. We will provide another status update by 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery streaming inserts delayed in EU

Current data indicates that BigQuery streaming inserts are being applied normally. Some streaming inserts to EU datasets from 18:00 to 00:00 US/Pacific are not yet visible in BigQuery and we are working to propagate them. We will provide another status update by 10:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery streaming inserts delayed in EU

Current data indicates that BigQuery streaming inserts are being applied normally. Some streaming inserts to EU datasets from 18:00 to 00:00 US/Pacific are not yet visible in BigQuery and we are working to propagate them. We will provide another status update by 04:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery streaming inserts delayed in EU

We are still investigating the issue with BigQuery job execution. Current data indicates that the issue only affects projects which use streaming inserts to datasets located in the EU. We will provide another status update by 04:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery Job Execution Issue

We are still investigating the issue with BigQuery job execution. Current data indicates that the issue only affects projects which use streaming inserts to datasets located in the EU. We will provide another status update by 03:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery Job Execution Issue

We are still investigating the issue with BigQuery Job execution. We will provide another status update by 02:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery Job Execution Issue

We are still investigating the issue with BigQuery job execution. We will provide another status update by 01:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18014 - BigQuery Job Execution Issue

We are investigating an issue with BigQuery Job execution. We will provide more information by 2016-04-06 00:20 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16005 - Network Connectivity Issues in Europe-West1-C

SUMMARY: On Wednesday 24 February 2016, some Google Compute Engine instances in the europe-west1-c zone experienced network connectivity loss for a duration of 62 minutes. If your service or application was affected by these network issues, we sincerely apologize. We have taken immediate steps to remedy the issue and we are working through a detailed plan to prevent any recurrence. DETAILED DESCRIPTION OF IMPACT: On 24 February 2016 from 11:43 to 12:45 PST, up to 17% of Google Compute Engine instances in the europe-west1-c zone experienced a loss of network connectivity. Affected instances lost connectivity to both internal and external destinations. ROOT CAUSE: The root cause of this incident was complex, involving interactions between three components of the Google Compute Engine control plane: the main configuration repository, an integration layer for networking configuration, and the low-level network programming mechanism. Several hours before the incident on 24th February 2016, Google engineers modified the Google Compute Engine control plane in the europe-west1-c zone, migrating the management of network firewall rules from an older system to the modern integration layer. This was a well-understood change that had been carried out several times in other zones without incident. As on previous occasions, the migration was completed without issues. On this occasion, however, the migrated networking configuration included a small ratio (approximately 0.002%) of invalid rules. The GCP network programming layer is hardened against invalid or inconsistent configuration information, and continued to operate correctly in the presence of these invalid rules. Twenty minutes before the incident, however, a remastering event occurred in the network programming layer in the europe-west1-c zone. Events of this kind are routine but, in this case, the presence of the invalid rules in the configuration coupled with a race condition in the way the new master loads its configuration caused the new master to load its network configuration incorrectly. The consequence, at 11:43 PST, was a loss of network programming configuration for a subset of Compute Engine instances in the zone, effectively removing their network connectivity until the configuration could be re-propagated from the central repository. REMEDIATION AND PREVENTION: Google engineers restored service by forcing another remastering of the network programming layer, restoring a correct network configuration. To prevent recurrence, Google engineers are fixing both the race condition which led to an incorrect configuration during mastership change, and adding alerting for the presence of invalid rules in the network configuration so that they will be detected promptly upon introduction. The combination of these two changes provides defense in depth against future configuration inconsistency and, we believe, will preserve correct function of the network programming system in the face of invalid information.

Last Update: A few months ago

UPDATE: Incident 16003 - Log viewer delays

The backlog in the log processing pipeline has now cleared and the issue was resolved as of Wednesday, 2016-03-02 17:50 US/Pacific. We do apologize for any inconvenience this may have caused.

Last Update: A few months ago

UPDATE: Incident 16003 - Log viewer delays

We are experiencing an issue with Cloud Logging beginning at Wednesday, 2016-03-02 14:40 US/Pacific. The log processing pipeline is running behind demand, but no logs have been lost. New entries in the Cloud Logging viewer will be delayed. We apologize for any inconvenience you may be experiencing. We will provide an update by 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16003 - Log viewer delays

We are investigating reports of an issue with the Logs Viewer. We will provide more information by 17:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16004 - Network connectivity issue in us-central1-f

SUMMARY: On Tuesday 23 February 2016, Google Compute Engine instances in the us-central1-f zone experienced intermittent packet loss for 46 minutes. If your service or application was affected by these network issues, we sincerely apologize. A reliable network is one of our top priorities. We have taken immediate steps to remedy the issue and we are working through a detailed plan to prevent any recurrence.

DETAILED DESCRIPTION OF IMPACT: On 23 February 2016 from 19:56 to 20:42 PST, Google Compute Engine instances in the us-central1-f zone experienced partial loss of network traffic. The disruption had a 25% chance of affecting any given network flow (e.g. a TCP connection or a UDP exchange) which entered or exited the us-central1-f zone. Affected flows were blocked completely. All other flows experienced no disruption. Systems that experienced a blocked TCP connection were often able to establish connectivity by retrying. Connections between endpoints within the us-central1-f zone were unaffected.

ROOT CAUSE: Google follows a gradual rollout process for all new releases. As part of this process, Google network engineers modified a configuration setting on a group of network switches within the us-central1-f zone. The update was applied correctly to one group of switches, but, due to human error, it was also applied to some switches which were outside the target group and of a different type. The configuration was not correct for them and caused them to drop part of their traffic.

REMEDIATION AND PREVENTION: The traffic loss was detected by automated monitoring, which stopped the misconfiguration from propagating further, and alerted Google network engineers. Conflicting signals from our monitoring infrastructure caused some initial delay in correctly diagnosing the affected switches. This caused the incident to last longer than it should have. The network engineers restored normal service by isolating the misconfigured switches. To prevent recurrence of this issue, Google network engineers are refining configuration management policies to enforce isolated changes which are specific to the various switch types in the network. We are also reviewing and adjusting our monitoring signals in order to lower our response times.

Last Update: A few months ago

UPDATE: Incident 16006 - Internal DNS resolution in us-central1

The issue with Google Compute Engine internal DNS resolution should have been resolved for all affected instances as of 20:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16006 - Internal DNS resolution in us-central1

We are investigating reports of an issue with internal DNS resolution in us-central1. We will provide more information by 21:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16003 - Quotas were reset to default values for some Customers

SUMMARY: On Tuesday 23 February 2016, for a duration of 10 hours and 6 minutes, 7.8% of Google Compute Engine projects had reduced quotas. We know that the ability to scale is vital to our customers, and apologize for preventing you from using the resources you need.

DETAILED DESCRIPTION OF IMPACT: On Tuesday 23 February 2016 from 06:06 to 16:12 PST, 7.8% of Google Compute Engine projects had quotas reduced. This impacted all quotas, including number of cores, IP addresses and disk size. If reduced quota was applied to your project and your usage reached this reduced quota you would have been unable to create new resources during this incident. Any such attempt would have resulted in a QUOTA_EXCEEDED error code with message "Quota 'XX_XX' exceeded. Limit: N". Any resources that were already created were unaffected by this issue.

ROOT CAUSE: In order to maximize ease of use for Google Compute Engine customers, in some cases we automatically raise resource quotas. We then provide exclusions to ensure that no quotas previously raised are reduced. We occasionally tune the algorithm to determine which quotas can be safely raised. This incident occurred when one such change was made but a bug in the aforementioned exclusion process allowed some projects to have their quotas reduced.

REMEDIATION AND PREVENTION: As soon as Google engineers identified the cause of the issue the initiating change was rolled back and quota changes were reverted. To provide faster resolution to quota related issues in the future we are creating new automated alerting and operational documentation. To prevent a recurrence of this specific issue, we have fixed the bug in the exclusion process. To prevent similar future issues, we are also creating a dry-run testing phase to verify the impact quota system changes will have.
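
The impact description quotes the error surfaced to affected projects: "Quota 'XX_XX' exceeded. Limit: N". As a hedged sketch, a client could recognize that message and extract the quota name and limit, for example to alert an operator rather than retry, since retries cannot succeed until the quota changes. Only the message format is taken from the report; everything else here is illustrative:

    # Sketch: detect the quota error message quoted in the incident report and
    # pull out the quota name and limit. The regex follows the quoted format
    # "Quota 'XX_XX' exceeded. Limit: N"; the surrounding handling is illustrative.
    import re
    from typing import Optional, Tuple

    QUOTA_MSG = re.compile(
        r"Quota '(?P<name>[A-Z0-9_]+)' exceeded\. Limit: (?P<limit>\d+(\.\d+)?)")

    def parse_quota_error(message: str) -> Optional[Tuple[str, float]]:
        """Return (quota_name, limit) if the message matches, else None."""
        m = QUOTA_MSG.search(message)
        if not m:
            return None
        return m.group("name"), float(m.group("limit"))

    if __name__ == "__main__":
        sample = "Quota 'CPUS' exceeded. Limit: 24"
        parsed = parse_quota_error(sample)
        if parsed:
            name, limit = parsed
            # A real client might page an operator or file a quota request here
            # instead of retrying, since retries cannot succeed until the quota changes.
            print(f"quota {name} exhausted at limit {limit}")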

Last Update: A few months ago

RESOLVED: Incident 16005 - Network Connectivity Issues in Europe-West1-C

The issue with network connectivity to VMs in europe-west1-c should have been resolved for all affected instances as of 12:57 PST. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16005 - Network Connectivity Issues in Europe-West1-C

We are currently investigating network connectivity issues affecting the europe-west1-c zone. We will provide another update with more information by 13:00 PST.

Last Update: A few months ago

RESOLVED: Incident 16004 - Network connectivity issue in us-central1

The network connectivity issue in us-central1 should have been resolved for all affected projects as of 20:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16003 - Quotas were reset to default values for some Customers

The issue with quotas being reset to default values should have been resolved for all affected customers as of 16:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16003 - Quotas were reset to default values for some Customers

The issue with quotas being reset to default values should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16003 - Quotas were reset to default values for some Customers

We are continuing to investigate the issue with quotas being reset to default values for some customers. We'll provide a new update at 15:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16003 - Quotas were reset to default values for some Customers

We are continuing to investigate the issue with quotas being reset to default values for some customers. We'll provide a new update at 14:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16003 - Quotas were reset to default values for some Customers

We are continuing to investigate the issue with quotas being reset to default values for some customers. We'll provide a new update at 13:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16003 - Quotas were reset to default values for some Customers

We continue to investigate the problem with quotas being reset to default values for some customers. We'll provide a new update at 12:30 PT.

Last Update: A few months ago

UPDATE: Incident 16003 - Quotas were reset to default values for some Customers

We are still investigating the problem of some projects' quotas being reverted to default values. We'll provide a new update at 11:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16003 - Quotas were reset to default values for some Customers

We are investigating a problem with our Quota System where Quotas were reset to default values for some Customers.

Last Update: A few months ago

UPDATE: Incident 16001 - Pub Sub Performance Degradation

The issue with Pub/Sub performance should have been resolved for all affected users as of 13:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16001 - Pub Sub Performance Degradation

The performance of Google Cloud Pub/Sub in us-central1 is recovering and we expect a full resolution in the near future. We will provide another status update by 13:50 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16001 - Pub Sub Performance Degradation

We are investigating reports of degraded performance with Pub/Sub in us-central1. We will provide more information by 13:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

The issue with BigQuery returning Internal Error should have been resolved for all affected projects as of 06:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

The issue with BigQuery returning Internal Error should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 07:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

The issue with BigQuery returning Internal Error should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 05:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

The issue with BigQuery returning Internal Error should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 04:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

We are still investigating the issue with BigQuery returning Internal Error. Current data indicates that around 2% of projects are affected by this issue. We will provide another status update by 03:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

We are still investigating the issue with BigQuery returning Internal Error. Current data indicates that around 2% of projects are affected by this issue. We will provide another status update by 02:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

We are still investigating the issue with BigQuery returning Internal Error. Current data indicates that around 2% of projects are affected by this issue. We will provide another status update by 01:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

We are experiencing an intermittent issue with BigQuery returning Internal Error beginning at Wednesday, 2016-02-10 13:20 US/Pacific. Current data indicates that around 2% of projects are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 00:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 18013 - BigQuery returns Internal Error intermittently

We are investigating reports of an issue with BigQuery returning Internal Error intermittently. We will provide more information by 23:40 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16002 - Errors on the Developers Console and cloud.google.com sites

We experienced an issue with the Developers Console and cloud.google.com sites which returned errors beginning at Wednesday, 2016-02-10 17:05 until 17:20 US/Pacific. We have fixed this now and apologize for any inconvenience caused.

Last Update: A few months ago

RESOLVED: Incident 16002 - Issues with App Engine Java and Go runtimes

SUMMARY: On Wednesday 3 February 2016, some App Engine applications running on Java7, Go and Python runtimes served errors with HTTP response 500 for a duration of 18 minutes. We sincerely apologize to customers who were affected. We have taken and are taking immediate steps to improve the platform's performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Wednesday 3 February 2016, from 18:37 PST to 18:55 PST, 1.1% of Java7, 3.1% of Go and 0.2% of all Python applications served errors with HTTP response code 500. The impact varied across applications, with less than 0.8% of all applications serving more than 100 errors during this time period. The distribution of errors was heavily tail-weighted, with a few applications receiving a large fraction of errors for their traffic during the event.

ROOT CAUSE: An experiment meant to test a new feature on a small number of applications was inadvertently applied to Java7 and Go applications globally. Requests to these applications tripped over the incompatible experimental feature, causing the instances to shut down without serving any requests successfully, while the depletion of healthy instances caused these applications to serve HTTP requests with a 500 response. Additionally, the high rate of failure in Java and Go instances caused resource contention as the system tried to start new instances, which resulted in collateral damage to a small number of Python applications.

REMEDIATION AND PREVENTION: At 18:35, a configuration change was erroneously enabled globally instead of to the intended subset of applications. Within a few minutes, Google Engineers noticed a drop in global traffic to GAE applications and determined that the configuration change was the root cause. At 18:53 the configuration change was rolled back and normal operations were restored by 18:55. To prevent a recurrence of this problem, Google Engineers are modifying the fractional push framework to inhibit changes which would simultaneously apply to the majority of applications, and creating telemetry to accurately predict the fraction of instances affected by a given change. Google Engineers are also enhancing the alerts on traffic drop and error spikes to quickly identify and mitigate similar incidents.
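
One of the preventions described above is inhibiting fractional pushes that would simultaneously apply to the majority of applications. A minimal sketch of such a guard follows; the 5% threshold and the way the affected fraction is estimated are assumptions for illustration, not details from the report:

    # Sketch of a pre-push guard for a fractional rollout system: refuse any
    # configuration change whose estimated blast radius exceeds a threshold.
    # The threshold and the estimator are assumptions for illustration only.
    MAX_INITIAL_FRACTION = 0.05

    class RolloutBlocked(Exception):
        pass

    def estimated_affected_fraction(change: dict, population: int) -> float:
        """Placeholder estimator: a real system would derive this from targeting
        rules and telemetry; here the change carries an explicit target list."""
        return len(change.get("target_apps", [])) / max(population, 1)

    def guard_push(change: dict, population: int) -> None:
        fraction = estimated_affected_fraction(change, population)
        if fraction > MAX_INITIAL_FRACTION:
            raise RolloutBlocked(
                f"change would hit {fraction:.1%} of applications, "
                f"above the {MAX_INITIAL_FRACTION:.0%} initial-rollout limit")

    if __name__ == "__main__":
        try:
            guard_push({"target_apps": list(range(900))}, population=1000)
        except RolloutBlocked as e:
            print("blocked:", e)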

Last Update: A few months ago

RESOLVED: Incident 16002 - Connectivity issue in asia-east1

SUMMARY: On Wednesday 3 February 2016, one third of network connections from external sources to Google Compute Engine instances and network load balancers in the asia-east1 region experienced high rates of network packet loss for 89 minutes. We sincerely apologize to customers who were affected. We have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Wednesday 3 February 2016, from 00:40 PST to 02:09 PST, one third of network connections from external sources to Google Compute Engine instances and network load balancers in the asia-east1 region experienced high rates of network packet loss. Traffic between instances within the region was not affected.

ROOT CAUSE: Google Compute Engine maintains a pool of systems that encapsulate incoming packets and forward them to the appropriate instance. During a regular system update, a master failover triggered a latent configuration error in two internal packet processing servers. This configuration rendered the affected packet forwarders unable to properly encapsulate external packets destined to instances.

REMEDIATION AND PREVENTION: Google's monitoring system detected the problem within two minutes of the configuration change. Additional alerts issued by the monitoring system for the asia-east1 region negatively affected the total time required to root-cause and resolve the issue. At 02:09 PST, Google engineers applied a temporary configuration change to divert incoming network traffic away from the affected packet encapsulation systems and fully restore network connectivity. In parallel, the incorrect configuration was rectified and pushed to the affected systems. To prevent this issue from recurring, we will change the way packet processor configurations are propagated and audited, to ensure that incorrect configurations are detected while their servers are still on standby. In addition, we will make improvements to our monitoring to make it easier for engineers to quickly diagnose and pinpoint the impact of such problems.

Last Update: A few months ago

RESOLVED: Incident 16001 - Google App Engine admin permissions

Backlog processing has completed for all permission changes made during the affected time frame.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine admin permissions

Backlog processing is going more slowly than expected. At the current rate, it will take another 5 hours to reprocess all of the updates. We will provide another update on this by 2:00.

Last Update: A few months ago

RESOLVED: Incident 16002 - Issues with App Engine Java and Go runtimes

The issue with App Engine Java and Go runtimes serving errors should have been resolved for all affected applications as of 18:57 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16002 - Issues with App Engine Java and Go runtimes

We are investigating reports of an issue with App Engine Java and Go applications. We will provide more information by 19:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine admin permissions

We are backfilling changes to project permissions made between 2016-02-02 11:30 US/Pacific and 2016-02-03 15:10. Our current estimate is to complete this at around 21:00. We are going to provide further updates at that time or in between if the estimate changes significantly.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine admin permissions

The retroactive changes are underway. This means the current impact of the issue is that some changes to project permissions made between 2016-02-02 11:30 US/Pacific and 2016-02-03 15:10 have not applied to App Engine. Some permissions changes during that time and all permissions changes before and after that window are fully applied. We will provide another status update by 18:00 with current details.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine admin permissions

The issue with Google App Engine authorization has been resolved for all new permission changes. Our engineers are now retroactively applying changes made during the last day. We will provide another status update by 16:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine admin permissions

We are still working to resolve the issue with Google App Engine authorization. We will provide another status update by 15:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine authentication and authorization

We are still investigating the issue with Google App Engine authorization. In addition to the rollback that is underway we are preparing to retroactively apply permission changes that did not fully take effect. We will provide another status update by 14:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine authentication and authorization

We are experiencing an issue with Google App Engine authorization beginning at Tuesday, 2016-02-02. Affected customers will see that changes to project permissions are not taking effect on App Engine. Our engineers are in the process of a rollback that should restore service. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 14:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine authentication and authorization

We are still investigating the issue with App Engine Authentication and Authorization. We will provide another status update by 14:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16001 - Google App Engine authentication and authorization

We are investigating reports of an issue with Google App Engine authentication and authorization. We will provide more information by 13:35 US/Pacific

Last Update: A few months ago

UPDATE: Incident 16002 - Connectivity issue in asia-east1

We are still investigating the issue with Google Compute Engine instances experiencing packet loss in the asia-east1 region. Current data indicates that up to 33% of instances in the region are experiencing up to 10% packet loss when communicating with external resources. We will provide another status update by 03:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Connectivity issue in asia-east1

The issue with Google Compute Engine instances experiencing packet loss in the asia-east1 region should have been resolved for all affected instances as of 02:11 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16002 - Connectivity issue in asia-east1

We are experiencing an issue with Google Compute Engine seeing packet loss in asia-east1 beginning at Wednesday, 2016-02-03 01:40 US/Pacific. Instances of affected customers may experience packet loss. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 02:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16002 - Connectivity issue in asia-east1

We are investigating reports of an issue with Google Compute Engine. We will provide more information by 02:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16001 - Cannot open Cloud Shell in production Pantheon using @google.com account

OMG/1512 Internal only.

Last Update: A few months ago

RESOLVED: Incident 16001 -

Starting at approximately 14:47 and ending at 16:04 PDT, customers of Google Container Registry were unable to pull images that had been pushed through the V2 Docker protocol. The issue has now been resolved. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16001 - us-central1-c Persistent Disk latency

The issue with persistent disk latency should have been resolved as of 15:20 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16001 - us-central1-c Persistent Disk latency

We are seeing elevated latency for a small number of persistent disks in us-central1-c. We are investigating the cause and possible mitigations now.

Last Update: A few months ago

RESOLVED: Incident 15025 - Authentication issues with App Engine

SUMMARY: On Monday 7 December 2015, 1.29% of Google App Engine applications received errors when issuing authenticated calls to Google APIs over a period of 17 hours and 3 minutes. During a 45-minute period, authenticated calls to Google APIs from outside of App Engine also received errors, with the error rate peaking at 12%. We apologise for the impact of this issue on you and your service. We consider service degradation of this level and duration to be very serious and we are planning many changes to prevent this occurring again in the future.

DETAILED DESCRIPTION OF IMPACT: Between Monday 7 December 2015 20:09 PST and Tuesday 8 December 2015 13:12, 1.29% of Google App Engine applications using service accounts received error 401 "Access Denied" for all requests to Google APIs requiring authentication. Unauthenticated API calls were not affected. Different applications experienced impact at different times, with few applications being affected for the full duration of the incident. In addition, between 23:05 and 23:50, an average of 7% of all requests to Google Cloud APIs failed or timed out, peaking briefly at 12%. Outside of this time only API calls from App Engine were affected.

ROOT CAUSE: Google engineers have recently carried out a migration of the Google Accounts system to a new storage backend, which included copying API authentication service credentials data and redirecting API calls to the new backend. To complete this migration, credentials were scheduled to be deleted from the previous storage backend. This process started at 20:09 PST on Monday 7 December 2015. Due to a software bug, the API authentication service continued to look up some credentials, including those used by Google App Engine service accounts, in the old storage backend. As these credentials were progressively deleted, their corresponding service accounts could no longer be authenticated. The impact increased as more credentials were deleted and some Google App Engine applications started to issue a high volume of retry requests. At 23:05, the retry volume exceeded the regional capacity of the API authentication service, causing 1.3% of all authenticated API calls to fail or time out, including Google APIs called from outside Google App Engine. At 23:30 the API authentication service exceeded its global capacity, causing up to 12% of all authenticated API calls to fail until 23:50, when the overload issue was resolved.

REMEDIATION AND PREVENTION: At 23:50 PST on Monday 7 December, Google engineers blocked certain authentication credentials that were known to be failing, preventing retries on these credentials from overloading the API authentication service. On Tuesday 8 December 08:52 PST, the deletion process was halted, having removed 2.3% of credentials, preventing further applications from being affected. At 10:08, Google engineers identified the root cause for the misdirected credentials lookup. After thorough testing, a fix was rolled out globally, resolving the issue for all affected Google App Engine applications by 13:12. Google has conducted a far-reaching review of the issue's root causes and contributory factors, leading to numerous prevention and mitigation actions in the following areas:

- Google engineers have deployed monitoring for additional infrastructure signals to detect and analyse similar issues more quickly.
- Google engineers have improved internal tools to extend auditing and logging and automatically advise relevant teams on potentially risky data operations.
- Additional rate limiting and caching features will be added to the API authentication service, increasing its resilience to load spikes.
- Google's development guidelines are being reviewed and updated to improve the handling of service or backend migrations, including a grace period of disabling access to old data locations before fully decommissioning them.

Our customers rely on us to provide a superior service and we regret we did not live up to expectations in this case. We apologize again for the inconvenience this caused you and your users.
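
The root-cause narrative notes that high volumes of client retries pushed the authentication service past its regional and then global capacity. A generic sketch of capped exponential backoff with full jitter, the usual client-side way to keep retries from amplifying an outage, follows; which errors are safe to retry is application-specific, and the predicate here is an assumption:

    # Sketch: capped exponential backoff with full jitter, so that client retries
    # back off instead of piling onto an already-degraded service.
    # The retryable-error predicate is an assumption for illustration.
    import random
    import time

    def call_with_backoff(fn, is_retryable, max_attempts=6,
                          base_delay=0.5, max_delay=30.0):
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception as exc:
                if attempt == max_attempts - 1 or not is_retryable(exc):
                    raise
                # Full jitter: sleep a random amount up to the capped exponential.
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
                time.sleep(delay)

    if __name__ == "__main__":
        attempts = {"n": 0}

        def flaky():
            attempts["n"] += 1
            if attempts["n"] < 3:
                raise TimeoutError("transient")
            return "ok"

        print(call_with_backoff(flaky, lambda e: isinstance(e, TimeoutError)))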

Last Update: A few months ago

RESOLVED: Incident 15065 - 400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services

SUMMARY: On Monday 7 December 2015, Google Container Engine customers could not create external load balancers for their services for a duration of 21 hours and 38 minutes. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT: From Monday 7 December 2015 15:00 PST to Tuesday 8 December 2015 12:38 PST, Google Container Engine customers could not create external load balancers for their services. Affected customers saw HTTP 400 “invalid argument” errors when creating load balancers in their Container Engine clusters. 6.7% of clusters experienced API errors due to this issue. The issue also affected customers who deployed Kubernetes clusters in the Google Compute Engine environment. The issue was confined to Google Container Engine and Kubernetes, with no effect on users of any other resource based on Google Compute Engine.

ROOT CAUSE: Google Container Engine uses the Google Compute Engine API to manage computational resources. At about 15:00 PST on Monday 7 December, a minor update to the Compute Engine API inadvertently changed the case-sensitivity of the “sessionAffinity” enum variable in the target pool definition, and this variation was not covered by testing. Google Container Engine was not aware of this change and sent requests with incompatible case, causing the Compute Engine API to return an error status.

REMEDIATION AND PREVENTION: Google engineers re-enabled load balancer creation by rolling back the Google Compute Engine API to its previous version. This was complete by 8 December 2015 12:38 PST. At 8 December 2015 10:00 PST, Google engineers committed a fix to the Kubernetes public open source repository. Google engineers will increase the coverage of the Container Engine continuous integration system to detect compatibility issues of this kind. In addition, Google engineers will change the release process of the Compute Engine API to detect issues earlier to minimize potential negative impact.
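
The root cause was an inadvertent change to the case-sensitivity of the "sessionAffinity" enum in the target pool definition. A small sketch of case-insensitive normalization on the receiving side is shown below; the value set is shown for illustration only and should not be read as the authoritative API definition:

    # Sketch: normalize an enum-valued API field case-insensitively before
    # validating it, so a benign change in client casing does not turn into
    # HTTP 400 "invalid argument" errors. Value set shown for illustration.
    SESSION_AFFINITY_VALUES = {"NONE", "CLIENT_IP", "CLIENT_IP_PROTO"}

    class InvalidArgument(ValueError):
        pass

    def normalize_session_affinity(raw: str) -> str:
        canonical = raw.strip().upper()
        if canonical not in SESSION_AFFINITY_VALUES:
            raise InvalidArgument(f"unsupported sessionAffinity: {raw!r}")
        return canonical

    if __name__ == "__main__":
        for value in ("CLIENT_IP", "client_ip", "Client_Ip"):
            print(value, "->", normalize_session_affinity(value))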

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

The issue with App Engine applications accessing Google APIs should have been resolved for all affected customers as of 13:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We believe the issue is resolved for most customers. A new update will be provided by 2015-12-08 13:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15065 - 400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services

The problem has been fully addressed as of 2015-12-08 12:22pm US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 15065 - 400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services

We are currently testing a fix that we expect will address the underlying issue. In the meantime, please use the workaround provided previously of creating load balancers with client IP session affinity. We'll provide another status update by 2015-12-08 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We’re investigating elevated error rates for some Google Cloud Platform users. We believe these errors are affecting between 2 and 5 percent of Google App Engine (GAE) applications. We are working directly with the customers who are affected to restore full operation in their applications as quickly as possible, and apologize for any inconvenience. We will provide another status update by 2015-12-08 12:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15065 - 400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services

We've identified an issue in GCE/GKE where attempts to create external (L3) load balancers for Kubernetes clusters fail. A proper fix for this issue is being worked on. Meanwhile, a potential workaround is to create load balancers with client IP session affinity, as sketched below. See an example here: https://gist.github.com/cjcullen/2aad7d51b76b190e2193. We will provide another status update by 2015-12-08 12:00 US/Pacific with current details.
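
For reference, the workaround amounts to creating the Kubernetes Service with client IP session affinity. A minimal sketch that builds such a Service manifest as a plain dict and prints it as JSON follows; the service name, selector, and ports are placeholders, and the linked gist remains the canonical example:

    # Sketch: a Kubernetes Service of type LoadBalancer with client IP session
    # affinity, the workaround referenced above. Built as a plain dict and
    # printed as JSON so it can be piped to `kubectl create -f -`.
    # Name, selector and ports are placeholders.
    import json

    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "example-svc"},          # placeholder name
        "spec": {
            "type": "LoadBalancer",                   # external (L3) load balancer
            "sessionAffinity": "ClientIP",            # the workaround
            "selector": {"app": "example"},           # placeholder selector
            "ports": [{"port": 80, "targetPort": 8080}],
        },
    }

    print(json.dumps(service, indent=2))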

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are still investigating the issue with App Engine applications accessing Google APIs. We will provide another status update by 2015-12-08 11:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15065 - 400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services

We are still investigating reports of an issue where attempts to create an external (L3) load balancer for services on GCE/GKE fail. We will provide another status update by 2015-12-08 11:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15065 - 400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services

We are investigating reports of an issue where attempts to create an external (L3) load balancer for services on GCE/GKE fail.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are still investigating the issue with App Engine applications accessing Google APIs. We will provide another status update by 2015-12-08 10:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are still investigating the issue with App Engine applications accessing Google APIs. We will provide another status update by 2015-12-08 09:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are still investigating the issue with App Engine applications accessing Google APIs. We will provide another status update by 2015-12-08 08:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are still investigating the issue with App Engine applications accessing Google APIs. We will provide another status update by 2015-12-08 07:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are still investigating the issue with App Engine applications accessing Google APIs. We will provide another status update by 2015-12-08 06:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

Despite actions taken to mitigate the problem, a significant number of App Engine applications have continued to experience errors while accessing Google APIs. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2015-12-08 05:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

The issue with App Engine applications accessing Google APIs should have been resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by 08:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are experiencing an issue with App Engine applications accessing Google APIs beginning at Monday, 2015-12-07 22:00 US/Pacific. Affected APIs may return a "401 Invalid Credentials" error message. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2015-12-09 03:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are experiencing an issue with App Engine applications accessing Google APIs beginning at Monday, 2015-12-07 22:00 US/Pacific. Affected APIs may return a "401 Invalid Credentials" error message. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2015-12-09 02:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are experiencing an issue with App Engine applications accessing Google APIs beginning at Monday, 2015-12-07 22:00 US/Pacific. Affected APIs may return a "401 Invalid Credentials" error message. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2015-12-09 01:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are experiencing an issue with App Engine applications accessing Google APIs beginning at Monday, 2015-12-07 22:00 US/Pacific. Affected APIs may return a "401 Invalid Credentials" error message. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2015-12-08 00:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are experiencing an issue with App Engine applications accessing Google APIs beginning at Monday, 2015-12-07 22:00 US/Pacific. Affected APIs may return a "401 Invalid Credentials" error message. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2015-12-08 00:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are experiencing an issue with App Engine applications accessing Google APIs beginning at Monday, 2015-12-07 22:00 US/Pacific. Affected APIs may return a "401 Invalid Credentials" error message. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2015-12-08 00:20 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15025 - Authentication issues with App Engine

We are investigating reports of an issue with App Engine applications accessing Google APIs. Affected APIs may return a "401 Invalid Credentials" error message. We will provide more information by 23:50 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 15024 - App Engine Task Queue Slow Execution

The issue with App Engine task queue tasks should have been resolved for all affected projects as of 14:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

The issue with slow execution of Google App Engine task queue tasks is resolved for the majority of applications. Our engineering team is currently working on measures to ensure that the issue will not resurface. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

The issue with slow execution of Google App Engine task queue tasks should be resolved for the majority of applications and we expect a full resolution in the near future. Our engineering teams are continuing to perform system remediation and monitor system performance. We will provide another status update by tomorrow December 5, 2015 at 11:00 PST with more details.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

We are still investigating the issue with slow execution of Google App Engine task queue tasks. We will provide another status update by 2015-12-04 10:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

We are still investigating the issue with slow execution of Google App Engine task queue tasks. We will provide another status update by 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

We are still investigating the issue with slow execution of Google App Engine task queue tasks. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

We are still investigating the issue with slow execution of Google App Engine task queue tasks. We will provide another status update by Thursday, 2015-12-03 11:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

We are implementing multiple mitigation strategies to address the slow execution of task queue tasks. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

We are still investigating the issue with slow execution of Google App Engine task queue tasks. Current data indicates that between 1% and 10% of applications are affected. We will provide another status update by 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15024 - App Engine Task Queue Slow Execution

We are currently investigating an issue where task queue processing for Google App Engine applications is slower than expected. Current data indicates that between 1% and 10% of applications are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 16:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 18012 - Errors Accessing the Big Query UI and API

SUMMARY: On Sunday 29th of November 2015, for an aggregate of 33 minutes occurring between 7:31am and 8:24am PST, 11% of all requests to the BigQuery API experienced errors. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Sunday 29th of November 2015, between 7:31am and 7:41am, 7% of BigQuery API requests were redirected (HTTP 302) to a CAPTCHA service. The issue reoccurred between 8:01am and 8:24am PST, affecting 22% of requests. As the CAPTCHA service is intended to verify that the requester is human, any automated requests that were redirected failed.

ROOT CAUSE: The BigQuery API is designed to provide fair service to all users during intervals of unusually-high traffic. During this event, a surge in traffic to the API caused traffic verification and fairness systems to activate, causing a fraction of requests to be redirected to the CAPTCHA service.

REMEDIATION AND PREVENTION: While investigating the source of the increased traffic, Google engineers assessed that BigQuery’s service capacity was sufficient to handle the additional queries without putting existing queries at risk. The engineers instructed BigQuery to allow the additional queries without verification, ending the incident. To prevent future recurrences of this problem, Google engineers will change BigQuery's traffic threshold policy to an adaptive mechanism appropriate for automated requests, which provides intelligent traffic control and isolation for individual users.
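
Automated callers that were redirected (HTTP 302) to the CAPTCHA service failed because they cannot answer a CAPTCHA. A hedged sketch of how a client could detect such a redirect and surface it as a distinct error, instead of silently following it, is shown below; the endpoint URL and the handling around it are assumptions for illustration:

    # Sketch: disable automatic redirect-following for an API call and surface
    # an unexpected 302 (e.g. to a human-verification page) as a distinct error,
    # rather than letting an automated client follow it and fail opaquely.
    # The URL and the treatment of the Location header are assumptions.
    import requests

    class UnexpectedRedirect(RuntimeError):
        pass

    def call_api(url: str, **kwargs) -> requests.Response:
        resp = requests.get(url, allow_redirects=False, timeout=30, **kwargs)
        if resp.is_redirect:
            location = resp.headers.get("Location", "")
            # Automated jobs cannot complete a CAPTCHA; fail loudly so an
            # operator or backoff logic can take over.
            raise UnexpectedRedirect(f"redirected to {location!r}")
        resp.raise_for_status()
        return resp

    if __name__ == "__main__":
        try:
            call_api("https://www.googleapis.com/bigquery/v2/projects")  # illustrative endpoint
        except UnexpectedRedirect as e:
            print("redirect detected:", e)
        except requests.RequestException as e:
            print("request failed:", e)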

Last Update: A few months ago

RESOLVED: Incident 18012 - Errors Accessing the Big Query UI and API

We experienced an intermittent issue with BigQuery for requests to the UI or API beginning at Sunday, 2015-11-29 07:30 US/Pacific. Current data indicates that approximately 25% of requests are affected by this issue. This issue should have been resolved for all affected users as of 08:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 15064 - Network Connectivity Issues in europe-west1

SUMMARY: On Monday 23 November 2015, for a duration of 70 minutes, a subset of Internet destinations was unreachable from the Google Compute Engine europe-west1 region. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Monday 23 November 2015 from 11:55 to 13:05 PST, a number of Internet regions (Autonomous Systems) became unreachable from Google Compute Engine's europe-west1 region. The region's traffic volume decreased by 13% during the incident. The majority of affected destination addresses were located in eastern Europe and the Middle East. Traffic to other external destinations was not affected. There was no impact on Google Compute Engine instances in any other region, nor on traffic to any destination within Google.

ROOT CAUSE: At 11:51 on Monday 23 November, Google networking engineers activated an additional link in Europe to a network carrier with whom Google already shares many peering links globally. On this link, the peer's network signalled that it could route traffic to many more destinations than Google engineers had anticipated, and more than the link had capacity for. Google's network responded accordingly by routing a large volume of traffic to the link. At 11:55, the link saturated and began dropping the majority of its traffic. In normal operation, peering links are activated by automation whose safety checks would have detected and rectified this condition. In this case, the automation was not operational due to an unrelated failure, and the link was brought online manually, so the automation's safety checks did not occur. The automated checks were expected to protect the network for approximately one hour after link activation, and normal congestion monitoring began at the end of that period. As the post-activation checks were missing, this allowed a 61-minute delay before the normal monitoring started, detected the congestion, and alerted Google network engineers.

REMEDIATION AND PREVENTION: Automated alerts fired at 12:56. At 13:02, Google network engineers directed traffic away from the new link and traffic flows returned to normal by 13:05. To prevent recurrence of this issue, Google network engineers are changing procedure to disallow manual link activation. Links may only be brought up using automated mechanisms, including extensive safety checks both before and after link activation. Additionally, monitoring now begins immediately after link activation, providing redundant error detection.

Last Update: A few months ago

UPDATE: Incident 15064 - Network Connectivity Issues in europe-west1

The issue with network connectivity issues in europe-west1 should have been resolved for all affected users as of 13:04 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15064 - Network Connectivity Issues in europe-west1

We are investigating reports of an issue with network connectivity in europe-west1. We will provide more information by 14:22 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16030 -

The issue with Cloud Bigtable is resolved for all affected projects as of 14:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16030 -

We have identified a root cause and believe we have resolved the issue for all customers connecting from GCE. We currently estimate resolving the issue for all customers within a few hours. We will provide another update at 15:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16030 -

We are investigating reports of an issue with Cloud Bigtable (currently in Beta) that is affecting customers using Java 7 and Google's client libraries. We will provide more information by 13:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 15063 - Network Connectivity disruption between us-east1 and asia-east1

Between 10:04 and 10:27 am PST, instances in asia-east1 and us-east1 experienced connectivity issues. Instance creation in both regions was also impacted during the same time frame.

Last Update: A few months ago

RESOLVED: Incident 15023 - Network Connectivity and Latency Issues in Europe

SUMMARY: On Tuesday, 10 November 2015, outbound traffic going through one of our European routers from both Google Compute Engine and Google App Engine experienced high latency for a duration of 6 hours and 43 minutes. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Tuesday, 10 November 2015 from 06:30 - 13:13 PST, a subset of outbound traffic from Google Compute Engine VMs and Google App Engine instances experienced high latency. The disruption to service was limited to outbound traffic through one of our European routers, and caused at peak 40% of all traffic being routed through this device to be dropped. This accounted for 1% of all Google Compute Engine traffic being routed from EMEA and <0.05% of all traffic for Google App Engine.

ROOT CAUSE: A network component failure in one of our European routers temporarily reduced network capacity in the region, causing network congestion for traffic traversing this route. Although the issue was mitigated by changing the traffic priority, the problem was only fully resolved when the affected hardware was replaced.

REMEDIATION AND PREVENTION: As soon as significant traffic congestion in the network path was detected, at 09:10 PST, Google Engineers diverted a subset of traffic away from the affected path. As this only slightly decreased the congestion, Google Engineers made a change in traffic priority which fully mitigated the problem by 13:13 PST. The replacement of the faulty hardware resolved the problem. To address time to resolution, Google engineers have added appropriate alerts to the monitoring of this type of router, so that similar congestion events will be spotted significantly more quickly in future. Additionally, Google engineers will ensure that capacity plans properly account for all types of traffic in single device failures. Furthermore, Google engineers will audit and augment capacity in this region to ensure sufficient redundancy is available.

Last Update: A few months ago

RESOLVED: Incident 15062 - Network Connectivity and Latency Issues in Europe

SUMMARY: On Tuesday, 10 November 2015, outbound traffic going through one of our European routers from both Google Compute Engine and Google App Engine experienced high latency for a duration of 6 hours and 43 minutes. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Tuesday, 10 November 2015 from 06:30 - 13:13 PST, a subset of outbound traffic from Google Compute Engine VMs and Google App Engine instances experienced high latency. The disruption to service was limited to outbound traffic through one of our European routers, and caused at peak 40% of all traffic being routed through this device to be dropped. This accounted for 1% of all Google Compute Engine traffic being routed from EMEA and <0.05% of all traffic for Google App Engine.

ROOT CAUSE: A network component failure in one of our European routers temporarily reduced network capacity in the region, causing network congestion for traffic traversing this route. Although the issue was mitigated by changing the traffic priority, the problem was only fully resolved when the affected hardware was replaced.

REMEDIATION AND PREVENTION: As soon as significant traffic congestion in the network path was detected, at 09:10 PST, Google Engineers diverted a subset of traffic away from the affected path. As this only slightly decreased the congestion, Google Engineers made a change in traffic priority which fully mitigated the problem by 13:13 PST. The replacement of the faulty hardware resolved the problem. To address time to resolution, Google engineers have added appropriate alerts to the monitoring of this type of router, so that similar congestion events will be spotted significantly more quickly in future. Additionally, Google engineers will ensure that capacity plans properly account for all types of traffic in single device failures. Furthermore, Google engineers will audit and augment capacity in this region to ensure sufficient redundancy is available.

Last Update: A few months ago

UPDATE: Incident 18011 - BigQuery API Returns "Billing has not been enabled for this project" Error When Billing Is Enabled

The issue with the BigQuery API returning "Billing has not been enabled for this project" errors for a small number of billing-enabled projects has been resolved as of 14:10 PST. We apologize for any disruption to your service or application - this is not the level of service we strive to provide and we are taking immediate steps to ensure this issue does not recur.

Last Update: A few months ago

UPDATE: Incident 18011 - BigQuery API Returns "Billing has not been enabled for this project" Error When Billing Is Enabled

The BigQuery API is returning "Billing has not been enabled for this project" error for a small number of billing-enabled projects. Both newly created and existing projects could be affected. Our Engineering team is working on resolving this issue with high priority. We will provide you with an update as soon as more information becomes available.

Last Update: A few months ago

UPDATE: Incident 15062 - Network Connectivity and Latency Issues in Europe

We have resolved the issue with high latency and network connectivity to/from services hosted in Europe. This issue started at approximately 08:00 PST and was resolved as of 12:35 PST. We will be conducting an internal investigation and will share the results of our investigation soon. If you continue to see issues with connectivity to/from services in Europe, please create a case and let us know.

Last Update: A few months ago

UPDATE: Incident 15023 - Network Connectivity and Latency Issues in Europe

We have resolved the issue with high latency and network connectivity to/from services hosted in Europe. This issue started at approximately 08:00 PST and was resolved as of 13:15 PST. We will be conducting an internal investigation and will share the results of our investigation soon. If you continue to see issues with connectivity to/from services in Europe, please create a case and let us know.

Last Update: A few months ago

UPDATE: Incident 15023 - Network Connectivity and Latency Issues in Europe

We are investigating reports of issues with network connectivity and latency for Google App Engine and Google Compute Engine in Europe. We will provide more information by 13:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15062 - Network Connectivity and Latency Issues in Europe

We are investigating reports of issues with network connectivity and latency for Google App Engine and Google Compute Engine in Europe. We will provide more information by 13:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 15060 - Intermittent Network Connectivity Issues in us-central1-b.

For those who cannot access the link in the previous message, please try https://status.cloud.google.com/incident/compute/15058

Last Update: A few months ago

RESOLVED: Incident 15061 - Network connectivity issue in us-central1-b

We posted the incident report at https://status.cloud.google.com/incident/compute/15058, which covers this incident too.

Last Update: A few months ago

RESOLVED: Incident 15060 - Intermittent Network Connectivity Issues in us-central1-b.

We posted the incident report at https://status.cloud.google.com/incident/compute/15058, which covers this incident too.

Last Update: A few months ago

RESOLVED: Incident 15058 - Intermittent Connectivity Issues In us-central1b

SUMMARY: Between Saturday 31 October 2015 and Sunday 1 November 2015, Google Compute Engine networking in the us-central1-b zone was impaired on 3 occasions for an aggregate total of 4 hours 10 minutes. We apologize if your service was affected in one of these incidents, and we are working to improve the platform’s performance and availability to meet our customers’ expectations.

DETAILED DESCRIPTION OF IMPACT (all times in US/Pacific):
Outage timeframe for Saturday 31 October 2015: 05:52 to 07:05 (73 minutes).
Outage timeframes for Sunday 1 November 2015: 14:10 to 15:30 (80 minutes) and 19:03 to 22:40 (97 minutes).
During the affected timeframes, up to 14% of the VMs in us-central1-b experienced up to 100% packet loss communicating with other VMs in the same project. The issue impacted both intra-zone and inter-zone communications.

ROOT CAUSE: Google network control fabrics are designed to tolerate the simultaneous failure of one or more components. When such failures occur, redundant components on the network may assume new roles within the control fabric. A race condition in one of these role transitions resulted in the loss of flow information for a subset of the VMs controlled by the fabric.

REMEDIATION AND PREVENTION: Google engineers began rolling out a change to eliminate this race condition at 18:03 PST on Monday November 2 2015. The rollout completed at 11:13 PST on Wednesday November 4 2015. Additionally, monitoring is being improved to reduce the time required to detect, identify and resolve problematic changes to the network control fabric.

Last Update: A few months ago

RESOLVED: Incident 15059 - Google Compute Engine Instance operations failing

SUMMARY: On Saturday 31 October 2015, Google Compute Engine (GCE) management operations experienced high latency for a duration of 181 minutes. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Saturday 31 October 2015 from 18:04 to 21:05 PDT, all Google Compute Engine management operations were slow or timed out in the Google Developers Console, the gcloud tool or the Google Compute Engine API.

ROOT CAUSE: An issue in the handling of Google Compute Engine management operations caused requests to not complete in a timely manner, due to older operations retrying excessively and preventing newer operations from succeeding. Once discovered, remediation steps were taken by Google Engineers to reduce the number of retrying operations, enabling recovery from the operation backlog. The incident was resolved at 21:05 PDT when all backlogged operations were processed by the Google Compute Engine management backend and latency and error rates returned to typical values.

REMEDIATION AND PREVENTION: To detect similar situations in the future, the GCE Engineering team is implementing additional automated monitoring to detect high numbers of queued management operations and limiting the number of operation retries. Google Engineers are also enabling more robust operation handling and load splitting to better isolate system disruptions.
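
Clients that manage GCE resources typically poll long-running operations; bounding those polls with a capped backoff keeps client-side load predictable during a backlog like this. A minimal sketch, assuming the google-api-python-client library and Application Default Credentials; the project, zone, and operation names are placeholders:

    # Sketch of polling a Compute Engine zone operation with a bounded number
    # of attempts and capped backoff between polls.
    import time

    import google.auth
    from googleapiclient.discovery import build

    credentials, _ = google.auth.default()
    compute = build("compute", "v1", credentials=credentials)

    def wait_for_zone_operation(project, zone, operation, max_polls=30):
        for attempt in range(max_polls):
            op = compute.zoneOperations().get(
                project=project, zone=zone, operation=operation).execute()
            if op["status"] == "DONE":
                if "error" in op:
                    raise RuntimeError(op["error"])
                return op
            time.sleep(min(30, 2 ** attempt))  # capped backoff between polls
        raise TimeoutError("operation %s did not finish in time" % operation)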

Last Update: A few months ago

RESOLVED: Incident 15061 - Network connectivity issue in us-central1-b

The issue with Google Compute Engine network connectivity in us-central1-b should have been resolved for all affected projects as of Monday, 2015-11-02 00:00 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 15061 - Network connectivity issue in us-central1-b

We are experiencing a network connectivity issue with Google Compute Engine instances in us-central1-b zone beginning at Sunday, 2015-11-01 21:05 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by Monday, 2015-11-02 00:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15061 - Network connectivity issue in us-central1-b

We are investigating reports of an issue with Google Compute Engine Network in us-central1-b zone. We will provide more information by 23:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15060 - Intermittent Network Connectivity Issues in us-central1-b.

The issue with network connectivity in us-central1-b should have been resolved for all affected instances as of 15:43 PST. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15060 - Intermittent Network Connectivity Issues in us-central1-b.

We are continuing to investigate an issue with network connectivity in the us-central1-b zone. We will provide another update by 16:30 PST.

Last Update: A few months ago

UPDATE: Incident 15060 - Intermittent Network Connectivity Issues in us-central1-b.

We are currently investigating a transient issue with sending internal traffic to and from us-central1b. We will have more information for you by 15:30 PST.

Last Update: A few months ago

UPDATE: Incident 15059 - Google Compute Engine Instance operations failing

The issue with Google Compute Engine Instance operation high latency should have been resolved for all affected users as of 21:05 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15059 - Google Compute Engine Instance operations failing

We are still investigating the issue with Google Compute Engine Instance operation high latency. We will provide another status update by 22:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15059 - Google Compute Engine Instance operations failing

We are experiencing an issue with Google Compute Engine Instance operation high latency beginning at Saturday, 2015-10-31 18:04 US/Pacific. Current data indicates that only users who are attempting to run instance management operations are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15059 - Google Compute Engine Instance operations failing

We are investigating reports of an issue with Google Compute Engine Instance operations. We will provide more information by 2015-10-31 19:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 15058 - Intermittent Connectivity Issues In us-central1b

The issue with sending and receiving traffic between VMs in us-central1-b should have been resolved for all affected instances as of 07:08 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We sincerely apologize for any effect this disruption had on your applications and/or services.

Last Update: A few months ago

UPDATE: Incident 15058 - Intermittent Connectivity Issues In us-central1b

The issue with sending and receiving internal traffic in us-central1b should have been resolved for the majority of instances and we expect a full resolution in the near future. We will provide an update with the affected timeframe after our investigation is complete.

Last Update: A few months ago

UPDATE: Incident 15058 - Intermittent Connectivity Issues In us-central1b

We are continuing to investigate an intermittent issue with sending and receiving internal traffic in us-central1b and will provide another update by 09:30 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15058 - Intermittent Connectivity Issues In us-central1b

We are currently investigating a transient issue with sending internal traffic to and from us-central1b.

Last Update: A few months ago

RESOLVED: Incident 15057 - Google Compute Engine issue with newly created instances

SUMMARY: On Tuesday 27 October 2015, Google Compute Engine instances created within a 90 minute period in the us-central1 and asia-east1 regions took longer than usual to obtain external network connectivity. Existing instances in the specified regions were not affected and continued to be available. We know how important it is to be able to create instances both for new deployments and for scaling existing deployments, and we apologize for the impairment of these actions.

DETAILED DESCRIPTION OF IMPACT: On Tuesday 27 October 2015, GCE instances created between 21:44 and 23:13 PDT in the us-central1 and asia-east1 regions took over 5 minutes before they started to receive traffic via their external IP address or network load balancer. Existing instances continued to operate without any issue, and there was no effect on internal networking for any instance.

ROOT CAUSE: This issue was triggered by rapid changes in external traffic patterns. The networking infrastructure automatically reconfigured itself to adapt to the changes, but the reconfiguration involved processing a substantial queue of modifications. The network registration of new GCE instances was required to wait on events in this queue, leading to delays in registration.

REMEDIATION AND PREVENTION: This issue was resolved as the backlog of network configuration changes was automatically processed. Google engineers will decouple the GCE networking operations and management systems that were involved in the issue such that a backlog in one system does not affect the other. Although the issue was detected promptly, Google engineers have identified opportunities to further improve internal monitoring and alerting for related issues.
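
Deployment tooling can guard against delays like this by probing a new instance's external address before putting it into service. A minimal sketch, assuming the google-api-python-client library and Application Default Credentials; the project, zone, instance name, and port are placeholders:

    # Illustrative sketch: after creating an instance, poll its external (NAT)
    # IP until a TCP port responds before adding it to a serving pool.
    import socket
    import time

    import google.auth
    from googleapiclient.discovery import build

    credentials, _ = google.auth.default()
    compute = build("compute", "v1", credentials=credentials)

    def wait_for_external_ip(project, zone, instance, port=80, timeout_s=600):
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            info = compute.instances().get(
                project=project, zone=zone, instance=instance).execute()
            access = info["networkInterfaces"][0].get("accessConfigs", [])
            ip = access[0].get("natIP") if access else None
            if ip:
                try:
                    socket.create_connection((ip, port), timeout=5).close()
                    return ip  # external address is reachable
                except OSError:
                    pass  # not reachable yet; keep polling
            time.sleep(10)
        raise TimeoutError("instance did not become externally reachable in time")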

Last Update: A few months ago

UPDATE: Incident 15057 - Google Compute Engine issue with newly created instances

The issue with Google Compute Engine for newly created instances should have been resolved for all affected regions (us-central1 and asia-east1) as of 23:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15057 - Google Compute Engine issue with newly created instances

We are experiencing an issue with Google Compute Engine where newly created instances are delayed in becoming accessible, beginning at Tuesday, 2015-10-27 22:05 US/Pacific. Current data indicates that the us-central1 and asia-east1 regions are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 23:50 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15057 - Google Compute Engine issue with newly created instances

We are investigating reports of an issue with Google Compute Engine. We will provide more information by 23:20 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15022 - App Engine applications using custom domains unreachable from some parts of Central Europe

The issue with App Engine applications using custom domains being unreachable from some parts of Central Europe should have been resolved for all affected users as of 20:25 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 15022 - App Engine applications using custom domains unreachable from some parts of Central Europe

We are investigating reports of an issue with App Engine applications using custom domains being unreachable from some parts of Central Europe. We will provide more information by 21:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We have identified a small number of additional Google Container Engine clusters that were not fixed by the previous round of repair. We have now applied the fix to these clusters, and so this issue should be resolved for all known affected clusters as of 21:30 US/Pacific.
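
Workloads that depend on the metadata server can verify connectivity with a simple probe against the well-known endpoint. A minimal sketch in Python, assuming the requests library; the retry counts and path are illustrative:

    # Minimal metadata-server health probe. Requests to the GCE/GKE metadata
    # endpoint must carry the "Metadata-Flavor: Google" header.
    import time

    import requests

    METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/hostname"
    HEADERS = {"Metadata-Flavor": "Google"}

    def metadata_reachable(attempts=3):
        for i in range(attempts):
            try:
                resp = requests.get(METADATA_URL, headers=HEADERS, timeout=2)
                if resp.status_code == 200:
                    return True
            except requests.RequestException:
                pass  # connection error; back off and retry
            time.sleep(2 ** i)
        return False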

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are still working on resolving the issue with Google Container Engine nodes connecting to the metadata server. We are actively testing a fix for it, and once it is validated, we will push this fix into production. We will provide the next status update by 2015-10-24 01:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15001 - Google Container Engine nodes experiencing trouble connecting to http://metadata

We are experiencing an issue with Google Container Engine nodes connecting to the metadata server beginning at Friday, 2015-10-23 15:25 US/Pacific. Current data indicates that a majority of clusters may be affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:30 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 15021 - App Engine increased 500 errors

SUMMARY: On Thursday 17 September 2015, Google App Engine experienced increased latency and HTTP errors for 1 hour 28 minutes. We apologize to our customers who were affected by this issue. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to prevent similar issues from occurring in future.

DETAILED DESCRIPTION OF IMPACT: On Thursday 17 September 2015 from 12:40 to 14:08 PDT, <0.01% of applications using Google App Engine experienced elevated latencies, HTTP error rates, and failures for the memcache API. The Google Developers Console was also affected and experienced timeouts during this time.

ROOT CAUSE: An unhealthy Managed VMs application triggered an excessive number of retries in the App Engine infrastructure in a single datacenter. App Engine's serving stack automatically detected the overload and diverted the majority of traffic to an alternate datacenter. Memcache was unavailable for apps which were diverted in this manner; this increased latency for those apps. Latency was also increased by the need to create new instances to run those apps in the alternate datacenter. Traffic which was not diverted experienced errors due to the overload.

REMEDIATION AND PREVENTION: At 12:47, Google engineers were automatically alerted to increasing latency, followed by an elevated error rate, for App Engine, and started investigating the root cause of the issue. The incident was resolved at 14:08. Google engineers are rolling out a fix which curbs the excessive number of retries that caused this incident. Additionally, the team is implementing improved monitoring to reduce the time taken to detect and isolate problematic workloads.
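
Applications that treat memcache strictly as a cache degrade gracefully when it is unavailable, as it was for diverted apps during this incident. A minimal cache-aside sketch using the App Engine Python SDK of that era; the model, key names, and cache lifetime are hypothetical:

    # Sketch of cache-aside reads that tolerate memcache being unavailable:
    # a miss (or an outage) simply falls back to the authoritative datastore read.
    from google.appengine.api import memcache
    from google.appengine.ext import ndb

    class Profile(ndb.Model):
        display_name = ndb.StringProperty()

    def get_display_name(user_id):
        cache_key = "profile-name:%s" % user_id
        name = memcache.get(cache_key)                 # None on miss *or* outage
        if name is None:
            profile = ndb.Key(Profile, user_id).get()  # authoritative read
            name = profile.display_name if profile else ""
            memcache.add(cache_key, name, time=300)    # best effort; may silently fail
        return name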

Last Update: A few months ago

RESOLVED: Incident 15006 - Developers Console was loading slowly

The increased 500 errors with Google App Engine beginning at Thursday, 2015-09-17 12:40 US/Pacific also affected the Developers Console. The issue was resolved at 14:08 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 15021 - App Engine increased 500 errors

We investigated reports of increased 500 errors with Google App Engine beginning at Thursday, 2015-09-17 12:40 US/Pacific. The issue was resolved at 14:08 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 16029 - Problem with Google Cloud Storage XML API signed URLs

SUMMARY: On Wednesday 26 August 2015, requests to Signed URLs [1] on Google Cloud Storage (GCS) returned errors for an extended period of time, affecting approximately 18% of projects using the Signed URLs feature. We apologize to our customers who were affected by this issue. We have identified and fixed the root causes of the incident, and we are putting measures in place to keep similar issues from occurring in future.

[1] https://cloud.google.com/storage/docs/access-control#Signed-URLs

DETAILED DESCRIPTION OF IMPACT: On Wednesday 26 August 2015 at 07:26 PDT, approximately 18% of projects sending requests to GCS Signed URLs started receiving HTTP 500 and 503 responses. The majority of the errors occurred from 07:25 to 10:34 PDT. From 10:34 the number of affected projects decreased, and by 11:25 PDT fewer than 2% of Signed URL requests were receiving errors. A small number of projects encountered continued errors until remediation was completed at 20:38. There was no disruption to any GCS functionality that did not involve Signed URLs.

ROOT CAUSE: GCS Signed URLs are cryptographically signed by the owner of the stored data, using the private key of a Google Cloud Platform service account. The private key is known only to the owner, but the corresponding public key is retained by Google and used to verify the URL signatures. Within Google, similar service accounts are used for many internal authentication purposes. (For example, these accounts include the default service account which internally represents the customer's Cloud Platform project.) For these service accounts, Google retains both the public and the private key. These keypairs have a short lifetime and are frequently regenerated. All keys are managed in a central, strongly hardened security module. As part of an effort to simplify the key management system, prior to the incident, a configuration change was pushed which mistakenly caused the security module to consider customers' service accounts as candidates for automatic keypair management. This change was later rolled back, removing the service accounts from automatic management, but some customers' service accounts had already received new keypairs with finite lifetimes. Accounts with heavy Signed URL usage were more likely to be affected. At 07:25 PDT on Wednesday 26 August, the lifetimes of the temporary keypairs for affected accounts began to expire. Since the expired keypairs were not automatically removed, as the service accounts were no longer under automatic management, their presence was treated as an error by the Signed URL verification process, causing all Signed URL requests for those service accounts to fail. At no time during this incident were any keys at risk, and they remain safe and secure. No action is required by customers.

REMEDIATION AND PREVENTION: Automated monitoring signalled the issue at 07:33. Google engineers identified the need to remove the expired keys, which required manual access to the security module. This access is protected by multiple security procedures, by design, so several escalations were required to reach the correct people. Access was gained at 10:34 PDT, and thereafter service was progressively restored as each service account was reactivated by removing its expired keys. To eliminate the immediate cause of the issue, Google engineers are modifying the URL signature verifier to be more robust when it encounters expired keys. To avoid various related classes of errors, Google engineers are increasing the testing performed on configuration changes for the security system, both to strengthen consistency and to ensure that configuration changes do not induce unexpected side effects. In case of other future issues with the security module, Google engineers are streamlining internal escalation procedures to improve response times, and upgrading tools for more efficient key administration.
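
For reference, a signed URL is generated client-side from the service account's private key; no call to GCS is needed at signing time. A minimal sketch using the current google-cloud-storage Python client, with placeholder bucket and object names:

    # Illustrative sketch: generating a GCS signed URL. The URL is signed
    # locally with the service account's private key and expires after an hour.
    import datetime

    from google.cloud import storage

    client = storage.Client()  # assumes service-account credentials in the environment
    blob = client.bucket("example-bucket").blob("reports/2015-08.csv")

    url = blob.generate_signed_url(
        expiration=datetime.timedelta(hours=1),  # URL stops working after an hour
        method="GET",
    )
    print(url)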

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API signed URLs

The issue with GCS XML signed URLs should be resolved for the vast majority of projects and traffic as of 11:25 US/Pacific, and we are working to fix the issue for the remaining 0.02% of traffic. We will provide a written incident report within 24 hours.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API signed URLs

We are still actively working on fully resolving the issue with GCS XML signed URLs returning HTTP 500 errors. Current data indicates that less than 0.009% of requests using XML signed URLs are still affected by this issue. We will provide another status update by 17:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API signed URLs

We are still working on fully resolving the issue with GCS XML signed URLs returning HTTP 500 errors and expect a full resolution in the near future. Current data indicates that less than 0.1% of requests using XML signed URLs are affected by this issue. We will provide another status update by 15:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API signed URLs

We are still working on fully resolving the issue with GCS XML signed URLs returning HTTP 500 errors. Current data indicates that less than 0.2% of requests using XML signed URLs are affected by this issue. We will provide another status update by 14:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API signed URLs

The issue with GCS XML signed URLs returning HTTP 500 errors should be resolved for the majority of projects and we expect a full resolution in the near future. Current data indicates that less than 0.2% of requests are affected by this issue. We will provide another status update by 13:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API signed URLs

Error rates for GCS XML signed URL requests have fallen to 1%. We are working to resolve the issue for the remaining impacted customers. We will provide another status update by 12:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API signed urls

Our engineers have determined the cause of the GCS XML API signed URL issue and are applying a fix. Current data indicates that 5% of XML API requests remain affected by this issue. We will provide another status update by 11:30 US/Pacific with further details.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API

We have identified that the Google Cloud Storage XML API errors are limited to requests authorized by signed URLs [1]. We are continuing to investigate and will provide another status update by 10:45 US/Pacific with current details. [1] https://cloud.google.com/storage/docs/access-control?hl=en#Signed-URLs

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API

Errors on the Google Cloud Storage XML API are continuing; current data indicates that between 30% and 35% of XML API requests are affected by this issue. We are continuing to investigate and will provide another status update by 10:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API

The issue with errors on the Google Cloud Storage XML API should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by 09:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API

We are still investigating the issue with requests to the Google Cloud Storage XML API. We will provide another status update by 09:00 US/Pacific time.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API

There is an elevated error rate on the XML API to Google Cloud Storage. Affected users observe HTTP 500 or other 50x responses to XML API requests. We will provide a further update by 08:30 US/Pacific time.

Last Update: A few months ago

UPDATE: Incident 16029 - Problem with Google Cloud Storage XML API

We are experiencing an issue with the XML API to Google Cloud Storage. The issue started at 07:27 US/Pacific time. We will provide a further update by 08:00.

Last Update: A few months ago

RESOLVED: Incident 17009 - Issues connecting to Google Cloud SQL instances

SUMMARY: On Friday, 14 August 2015, Google Cloud SQL instances in the US Central region experienced intermittent connectivity issues over an interval of 6 hours 50 minutes. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Friday, 14 August 2015 from 03:24 to 10:16 PDT, some attempts to connect to Google Cloud SQL instances in the US Central region failed. Approximately 12% of all active Cloud SQL instances experienced a denied connection attempt.

ROOT CAUSE: On Wednesday, 12 August 2015, a standard rollout was performed for Google Cloud SQL which introduced a memory leak in the serving component. Before the rollout, an unrelated periodic maintenance activity necessitated disabling some automated alerts, and these were not enabled again once maintenance was complete. As a result, Google engineers were not alerted to high resource usage until Cloud SQL serving tasks began exceeding resource limits and rejecting more incoming connections.

REMEDIATION AND PREVENTION: At 07:47, Google engineers were alerted to high reported error rates and began allocating more resources for Cloud SQL serving tasks, which provided an initial reduction in error rates. Finally, a restart of running Cloud SQL serving tasks eliminated remaining connectivity issues by 10:16. To prevent the issue from recurring, we are implementing mitigation and monitoring changes as a result of this incident, which include rolling back the problematic update, making the Cloud SQL serving component more resilient to high resource usage, and improving monitoring procedures to reduce the time taken to detect and isolate similar problems.
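
Clients can ride out intermittent connection failures like these by retrying the connection attempt with backoff rather than failing the request immediately. A minimal sketch, assuming a MySQL Cloud SQL instance reached over its IP address with the PyMySQL driver; the host, credentials, and retry limits are placeholders:

    # Sketch of connection retries for intermittent Cloud SQL connection failures.
    import time

    import pymysql

    def connect_with_retry(max_attempts=5):
        for attempt in range(1, max_attempts + 1):
            try:
                return pymysql.connect(
                    host="203.0.113.10",      # Cloud SQL instance IP (placeholder)
                    user="app",
                    password="app-password",
                    database="appdb",
                    connect_timeout=10,
                )
            except pymysql.err.OperationalError:
                if attempt == max_attempts:
                    raise
                time.sleep(2 ** attempt)  # back off before the next attempt

    conn = connect_with_retry()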

Last Update: A few months ago

RESOLVED: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

SUMMARY: From Thursday 13 August 2015 to Monday 17 August 2015, errors occurred on a small proportion of Google Compute Engine persistent disks in the europe-west1-b zone. The affected disks sporadically returned I/O errors to their attached GCE instances, and also typically returned errors for management operations such as snapshot creation. In a very small fraction of cases (less than 0.000001% of PD space in europe-west1-b), there was permanent data loss. Google takes availability very seriously, and the durability of storage is our highest priority. We apologise to all our customers who were affected by this exceptional incident. We have conducted a thorough analysis of the issue, in which we identified several contributory factors across the full range of our hardware and software technology stack, and we are working to improve these to maximise the reliability of GCE's whole storage layer.

DETAILED DESCRIPTION OF IMPACT: From 09:19 PDT on Thursday 13 August 2015, to Monday 17 August 2015, some Standard Persistent Disks in the europe-west1-b zone began to return sporadic I/O errors to their connected GCE instances. In total, approximately 5% of the Standard Persistent Disks in the zone experienced at least one I/O read or write failure during the course of the incident. Some management operations on the affected disks also failed, such as disk snapshot creation. From the start of the incident, the number of affected disks progressively declined as Google engineers carried out data recovery operations. By Monday 17 August, only a very small number of disks remained affected, totalling less than 0.000001% of the space of allocated persistent disks in europe-west1-b. In these cases, full recovery is not possible. The issue only affected Standard Persistent Disks that existed when the incident began at 09:19 PDT. There was no effect on Standard Persistent Disks created after 09:19. SSD Persistent Disks, disk snapshots, and Local SSDs were not affected by the incident. In particular, it was possible at all times to recreate new Persistent Disks from existing snapshots.

ROOT CAUSE: At 09:19 PDT on Thursday 13 August 2015, four successive lightning strikes on the electrical systems of a European datacenter caused a brief loss of power to storage systems which host disk capacity for GCE instances in the europe-west1-b zone. Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain. In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk.

This outage is wholly Google's responsibility. However, we would like to take this opportunity to highlight an important reminder for our customers: GCE instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters. Customers who need maximum availability should be prepared to switch their operations to another GCE zone. For maximum durability we recommend GCE snapshots and Google Cloud Storage as resilient, geographically replicated repositories for your data.

REMEDIATION AND PREVENTION: Google has an ongoing program of upgrading to storage hardware that is less susceptible to the power failure mode that triggered this incident. Most Persistent Disk storage is already running on this hardware. Since the incident began, Google engineers have conducted a wide-ranging review across all layers of the datacenter technology stack, from electrical distribution systems through computing hardware to the software controlling the GCE persistent disk layer. Several opportunities have been identified to increase physical and procedural resilience, including:
* Continue to upgrade our hardware to improve cache data retention during transient power loss.
* Implement multiple orthogonal schemes to increase Persistent Disk data durability for greater resilience.
* Improve response procedures for system engineers during possible future incidents.
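
For reference, the snapshot-and-restore workflow recommended above can be driven through the Compute Engine v1 API. A minimal sketch, assuming the google-api-python-client library and Application Default Credentials; the project, zone, disk, and snapshot names are placeholders, and both calls return long-running operations that a real client would poll to completion:

    # Illustrative sketch of snapshotting a persistent disk and later
    # recreating a disk from that snapshot (possibly in another zone).
    import google.auth
    from googleapiclient.discovery import build

    credentials, _ = google.auth.default()
    compute = build("compute", "v1", credentials=credentials)

    PROJECT, ZONE = "example-project", "europe-west1-b"

    # 1. Snapshot an existing persistent disk.
    compute.disks().createSnapshot(
        project=PROJECT, zone=ZONE, disk="app-data",
        body={"name": "app-data-snap-20150813"},
    ).execute()

    # 2. Restore by creating a fresh disk from the snapshot in another zone.
    compute.disks().insert(
        project=PROJECT, zone="europe-west1-c",
        body={
            "name": "app-data-restored",
            "sourceSnapshot": "global/snapshots/app-data-snap-20150813",
        },
    ).execute()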

Last Update: A few months ago

RESOLVED: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

SUMMARY: From Thursday 13 August 2015 to Monday 17 August 2015, errors occurred on a small proportion of Google Compute Engine persistent disks in the europe-west1-b zone. The affected disks sporadically returned I/O errors to their attached GCE instances, and also typically returned errors for management operations such as snapshot creation. In a very small fraction of cases (less than 0.000001% of PD space in europe-west1-b), there was permanent data loss. Google takes availability very seriously, and the durability of storage is our highest priority. We apologise to all our customers who were affected by this exceptional incident. We have conducted a thorough analysis of the issue, in which we identified several contributory factors across the full range of our hardware and software technology stack, and we are working to improve these to maximise the reliability of GCE's whole storage layer.

DETAILED DESCRIPTION OF IMPACT: From 09:19 PDT on Thursday 13 August 2015, to Monday 17 August 2015, some Standard Persistent Disks in the europe-west1-b zone began to return sporadic I/O errors to their connected GCE instances. In total, approximately 5% of the Standard Persistent Disks in the zone experienced at least one I/O read or write failure during the course of the incident. Some management operations on the affected disks also failed, such as disk snapshot creation. From the start of the incident, the number of affected disks progressively declined as Google engineers carried out data recovery operations. By Monday 17 August, only a very small number of disks remained affected, totalling less than 0.000001% of the space of allocated persistent disks in europe-west1-b. In these cases, full recovery is not possible. The issue only affected Standard Persistent Disks that existed when the incident began at 09:19 PDT. There was no effect on Standard Persistent Disks created after 09:19. SSD Persistent Disks, disk snapshots, and Local SSDs were not affected by the incident. In particular, it was possible at all times to recreate new Persistent Disks from existing snapshots.

ROOT CAUSE: At 09:19 PDT on Thursday 13 August 2015, four successive lightning strikes on the local utilities grid that powers our European datacenter caused a brief loss of power to storage systems which host disk capacity for GCE instances in the europe-west1-b zone. Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain. In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk.

This outage is wholly Google's responsibility. However, we would like to take this opportunity to highlight an important reminder for our customers: GCE instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters. Customers who need maximum availability should be prepared to switch their operations to another GCE zone. For maximum durability we recommend GCE snapshots and Google Cloud Storage as resilient, geographically replicated repositories for your data.

REMEDIATION AND PREVENTION: Google has an ongoing program of upgrading to storage hardware that is less susceptible to the power failure mode that triggered this incident. Most Persistent Disk storage is already running on this hardware. Since the incident began, Google engineers have conducted a wide-ranging review across all layers of the datacenter technology stack, from electrical distribution systems through computing hardware to the software controlling the GCE persistent disk layer. Several opportunities have been identified to increase physical and procedural resilience, including:
* Continue to upgrade our hardware to improve cache data retention during transient power loss.
* Implement multiple orthogonal schemes to increase Persistent Disk data durability for greater resilience.
* Improve response procedures for system engineers during possible future incidents.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

At present, less than 0.05% of PDs are experiencing read failures in europe-west1-b. Neither restoring Persistent Disks from snapshots nor creating new Persistent Disks has been affected. If your PD is one of the 0.05% currently affected, you may restore it from a snapshot and continue using it at full availability. Given the low rate of read failures, this will be the final update for this incident. Instead, the Cloud Support team will be reaching out to affected customers within 3 business days. Please also feel free to proactively contact support for more information. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16028 - Increased latency and error rates

The issue with Google Cloud Storage JSON API increased latency and error rates should be resolved for all affected users as of 23:20 US/Pacific. Contrary to our initial analysis of a widespread issue, this incident affected less than 0.1% of our customers. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16028 - Increased latency and error rates

We are experiencing an issue with Google Cloud Storage JSON API increased latency and error rates beginning at Saturday, 2015-Aug-15 22:05 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 2015-Aug-16 00:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16028 - Increased latency and error rates

We are investigating reports of an issue with Google Cloud Storage. We will provide more information by 23:10 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

At present, less than 0.05% of PDs are experiencing read failures in europe-west1-b. Neither restoring Persistent Disks from snapshots nor creating new Persistent Disks has been affected. If your PD is one of the 0.05% currently affected, you may restore it from a snapshot and continue using it at full availability. Given the low rate of read failures, we will be decreasing the frequency of updates. We will provide another status update on 17 August by 16:00 US/Pacific with current details. In addition, the Cloud Support team will be reaching out to affected customers within 3 business days. Please also feel free to proactively reach out to Cloud Support for more information.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still working on restoring full service for every Google Compute Engine Persistent Disk in europe-west1-b. In terms of impact, less than 0.1% of Google Compute Engine Persistent Disks in that zone have been experiencing read failures on some of their blocks. Current data indicates that no more than 1% of PDs could be affected going forward. Neither restoring Persistent Disks from snapshots nor creating new Persistent Disks has been affected. If your PD is one of the 0.1% currently affected, you may restore it from a snapshot and continue using it at full availability. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 17009 - Issues connecting to Google Cloud SQL instances

We experienced intermittent connectivity issues with Google Cloud SQL beginning at Friday, 2015-08-14 03:28 US/Pacific. The issue should be resolved for all affected projects as of 08:35 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 11:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 08:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by Aug 14 03:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

RESOLVED: Incident 15020 - Google App Engine Search API errors and latency

SUMMARY: On Wednesday, 12 August 2015, the Search API for Google App Engine experienced increased latency and errors for 40 minutes. We apologize for this incident and the effect it had on applications using the Search API. We strive for excellent performance and uptime, so we will take appropriate actions right away to improve the Search API’s availability. If you believe your paid application experienced an SLA violation as a result of this incident, please contact us at: https://support.google.com/cloud/answer/3420056

DETAILED DESCRIPTION OF IMPACT: On Wednesday, 12 August 2015 from 11:05am to 11:45am PDT, the Search API service experienced an increase in latency and error rate. 8.7% of applications using the Search API received a 7.5% error rate with messages like: “Timeout: Failed to complete request in NNNNms”

ROOT CAUSE: A set of queries sent to a Google-owned service running on App Engine caused the Search API service to fail.

REMEDIATION AND PREVENTION: At 10:28, Google engineers were automatically alerted to increasing latency in the Search API backend, but at this point, customers were not impacted. At 11:05, the increasing latency started causing Search API timeouts. Once the cause of the latency increase was discovered, the relevant user was isolated from other customers, ending the incident at 11:45. The Search API team is implementing mitigation and monitoring changes as a result of this incident, which include changes to the API backend to isolate the impact of similar issues and improved monitoring to reduce the time taken to detect and isolate problematic workloads for the Search API.
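
Applications can shield themselves from brief Search API disruptions by catching the API's error type and retrying with a short backoff. A minimal sketch using the App Engine Python SDK of that era; the index name, query, and retry limits are hypothetical:

    # Sketch of guarding Search API calls against transient timeouts such as
    # "Timeout: Failed to complete request in NNNNms".
    import time

    from google.appengine.api import search

    def search_with_retry(query_string, attempts=3):
        index = search.Index(name="products")
        for attempt in range(attempts):
            try:
                return index.search(query_string)
            except search.Error:
                if attempt == attempts - 1:
                    raise
                time.sleep(0.5 * (2 ** attempt))  # brief backoff before retrying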

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 19:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. Current data indicates that less than 1% of PDs in europe-west1-b are susceptible to degraded performance because of this issue. The service has been partially recovered. Meanwhile, affected customers can restore from snapshots as a workaround solution. We will provide another status update by 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are still actively working on mitigating the issue with Google Compute Engine Persistent Disks in europe-west1-b beginning at Thursday, 2015-08-13 09:25 US/Pacific. For everyone who is affected, we apologize for the impact to your systems. We will provide another status update by 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15056 - Google Compute Engine Persistent Disk issue in europe-west1-b

We are experiencing an issue with Google Compute Engine Persistent Disks in europe-west1-b beginning at Thursday, 2015-08-13 09:25 US/Pacific. Customers who have machines running in this zone may see read errors. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 12:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15020 - Google App Engine Search API errors and latency

The issue with App Engine Search API should be resolved for all affected apps as of 11:46 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15020 - Google App Engine Search API errors and latency

The issue with App Engine Search API timeouts should be resolved as of 12:00 US/Pacific. Our internal investigation is in progress, and at this point we cannot be certain that the issue will not recur. We will post a further update by 13:00 as we work towards declaring the incident fully over.

Last Update: A few months ago

UPDATE: Incident 15020 - Google App Engine Datastore search errors and latency

We are experiencing an issue with App Engine Search API requests timing out beginning at Wednesday, 2015-08-12 11:05 US/Pacific. You may see requests timing out or returning successfully with increased latency. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 13:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15020 - Google App Engine Datastore search errors and latency

We are investigating reports of an issue with Google App Engine Datastore. We will provide more information by 12:00 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16027 - High error rate of requests to Google Cloud Storage

SUMMARY: On Saturday 8 August 2015, Google Cloud Storage served an elevated error rate for a duration of 139 minutes. If your service or application was affected, we apologize. We have taken an initial set of actions to prevent recurrence of the problem, and have a larger set of changes which will provide defense in depth currently under review by the engineering teams.

DETAILED DESCRIPTION OF IMPACT: On Saturday 8 August 2015 from 20:21 to 22:40 PDT, Google Cloud Storage returned a high rate of error responses to queries. The average error rate during this time was 28.4%, with an initial peak of 47% at 20:27. Error levels gradually decreased subsequently, with intermediate periods of normal operation from 21:46-21:54 and 22:04-22:10. Usage was equally affected across the Google Cloud Storage XML and JSON APIs.

ROOT CAUSE: The elevated GCS error rate was induced by a large increase in traffic compared to normal levels. The traffic surge was exacerbated by retries from software clients receiving errors. The GCS errors were principally served to the sources which were generating the unusual traffic levels, but a fraction of the errors were served to other users as well.

REMEDIATION AND PREVENTION: Google engineers were alerted to the elevated error rate by automated monitoring, and took steps to both reduce the impact and to increase capacity to mitigate the duration and severity of the incident for GCS users. In parallel, Google’s support team contacted the system owners which were generating the bulk of unexpected traffic, and helped them reduce their demand. The combination of these two actions resolved the incident. To prevent a potential future recurrence, Google’s engineering team have made or are making a number of changes to GCS, including:
- Adding traffic rate ‘collaring’ to prevent unexpected surges in demand from exceeding sustainable levels;
- Adding or improving caching at multiple levels in order to increase capacity, and increase resilience to repetitive queries;
- Changing RPC queuing behavior in GCS to provide more capacity and more graceful handling of overload.
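
Since client retries amplified this surge, client retry loops should cap their backoff and add jitter rather than retrying immediately. A minimal sketch for GCS reads, assuming the current google-cloud-storage Python client; the bucket, object name, and retry limits are placeholders:

    # Sketch of retrying GCS reads with capped, jittered exponential backoff so
    # client retries do not amplify a server-side overload.
    import random
    import time

    from google.api_core.exceptions import ServerError  # base class for 5xx responses
    from google.cloud import storage

    def download_with_backoff(bucket_name, blob_name, max_attempts=6):
        blob = storage.Client().bucket(bucket_name).blob(blob_name)
        for attempt in range(max_attempts):
            try:
                return blob.download_as_bytes()
            except ServerError:
                if attempt == max_attempts - 1:
                    raise
                delay = min(32, 2 ** attempt) + random.random()  # cap + jitter
                time.sleep(delay)

    data = download_with_backoff("example-bucket", "logs/2015-08-08.json")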

Last Update: A few months ago

UPDATE: Incident 16027 - High error rate of requests to Google Cloud Storage

The issue with Google Cloud Storage should be resolved for all affected projects as of 22:50 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16027 - High error rate of requests to Google Cloud Storage

We are still investigating the issue with Google Cloud Storage. Current data indicates that the error rate is improving for most projects affected by this issue. We will provide another status update by 23:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16027 - High error rate of requests to Google Cloud Storage

We are investigating reports of an issue with Google Cloud Storage. We will provide more information by 22:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 15055 - Google Cloud Compute difficulty reaching internet

SUMMARY: On Tuesday 4 August 2015, incoming network traffic to Google Compute Engine (GCE) was interrupted for 5 minutes. We are sorry for any impact this had on our customers' services. We have identified the cause of the incident and we are taking steps to avoid this class of problems in future.

DETAILED DESCRIPTION OF IMPACT: On Tuesday 4 August 2015 from 08:56 to 09:01 PDT, all incoming network packets from the Internet to GCE public IP addresses were dropped. The incident affected network load balancers and Google Cloud VPNs as well as the external IP addresses of GCE instances. Whilst packets from GCE to the Internet were not affected, the loss of return traffic prevented the correct operation of TCP connections. There was no effect on instance-to-instance traffic using GCE internal IP addresses.

ROOT CAUSE: GCE's external network connectivity is provided by a Google core networking component that supports many of Google's public services. A software deployment for this system introduced a bug which failed to handle a specific property of the configuration for GCE IP addresses. This led to the removal of all inward-bound routes for GCE.

REMEDIATION AND PREVENTION: The impact on GCE networking triggered immediate alerts, and Google engineers restored service by rolling back the software deployment. To avoid regression of the specific issue, Google engineers will extend testing frameworks to include the GCE configuration property that triggered the bug. To increase our defence in depth against related issues in future, Google engineers will also implement a number of technical and procedural measures, including: increased engineer review of configuration changes, automatic sanity-checks on route deployment changes, and protection of IP ranges associated with GCE.

Last Update: A few months ago

RESOLVED: Incident 15055 - Google Cloud Compute difficulty reaching internet

The issue with Google Compute Engine instances experiencing difficulty reaching the internet between 08:57 and 09:05 US/Pacific is now resolved. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrences. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 15005 - Developers console showing 404s for some users

SUMMARY: On Monday, 27 July 2015, the Google Developers Console was unavailable to all users for a duration of 41 minutes. We apologize for the inconvenience and any impact on your operations that this may have caused. We are urgently working to implement preventative measures to ensure similar incidents do not occur in the future.

DETAILED DESCRIPTION OF IMPACT: On Monday, 27 July 2015 from 13:21 to 14:02 PDT, the Google Developers Console was unavailable to users. All requests to https://console.developers.google.com returned a 404 "Not Found" response. Existing Cloud Platform resources such as Compute Engine instances or App Engine applications were not affected. All Google Cloud Platform APIs remained fully functional, allowing most Cloud Platform resources to be managed through the Google Cloud SDK and other API-based tools until Console access was restored.

ROOT CAUSE: At 13:21, while reviewing the production status of the Developers Console, a Google engineer inadvertently disabled the production instance of the console. The engineer immediately recognised the error and began remediating the problem, but the configuration change had also engaged a security mechanism which restricted the application to the Google corporate network. This mechanism was identified and disengaged at 14:01, which restored public access to the Console.

REMEDIATION AND PREVENTION: To prevent similar incidents, Google Engineers are currently adding safeguards to make it harder to change application settings by mistake, implementing external monitoring to detect errors outside of the Google network, and creating alerts based on serving errors from the Developers Console.
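
For reference, resources can be inspected directly through the Cloud Platform APIs when the Console is unreachable. A minimal sketch that lists Compute Engine instances across zones, assuming the google-api-python-client library and Application Default Credentials; the project ID is a placeholder:

    # Illustrative only: list instances via the Compute Engine API instead of
    # the Console. Each entry in "items" is a per-zone scope that may contain
    # an "instances" list or only a warning.
    import google.auth
    from googleapiclient.discovery import build

    credentials, _ = google.auth.default()
    compute = build("compute", "v1", credentials=credentials)

    result = compute.instances().aggregatedList(project="example-project").execute()
    for zone, scoped in result.get("items", {}).items():
        for instance in scoped.get("instances", []):
            print(zone, instance["name"], instance["status"])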

Last Update: A few months ago

UPDATE: Incident 15005 - Developers console showing 404s for some users

The issue with Developers Console should be resolved for all affected users as of 14:02 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15005 - Developers console showing 404s for some users

We are experiencing an issue with the Developers Console beginning at Monday, 2015-07-27 13:21 US/Pacific. Current data indicates that all users are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 16:15 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15005 - Developers console showing 404s for some users

We are currently investigating an issue where some users are getting a 404 error when accessing the Developers Console. We will update the status of this issue within 15 minutes.

Last Update: A few months ago

RESOLVED: Incident 17008 - Issues accessing Google Cloud SQL instances

SUMMARY: On Friday, June 26 2015 13:04 PDT, connections to Cloud SQL instances in one datacenter in the United States failed for a duration of 19 minutes. We apologise if you were affected by this issue. Our engineers have completed a postmortem analysis of this incident and are performing remediation tasks to prevent similar issues in the future. DETAILED DESCRIPTION OF IMPACT: On Friday, June 26 2015 from 13:04 to 13:23 PDT, traffic to affected Cloud SQL instances in the United States was impacted. Existing connections were dropped and new connections could not be established. Approximately 4% of Cloud SQL instances in the United States were also rebooted. ROOT CAUSE: A manual procedure to reprogram the power supply of a router as part of a maintenance routine did not have the desired effect and caused the router to restart in an undesired manner. The team performing the routine detected the issue immediately and was able to restore the system to a stable state within 20 minutes. REMEDIATION AND PREVENTION: We have identified areas of improvement needed to prevent similar issues in the future. We have taken measures to prevent issues with the same root cause from recurring. We are also adding additional monitoring metrics to alert for failures at this level earlier and improve cross-team communication for planned maintenance events.

Last Update: A few months ago

UPDATE: Incident 16026 - Google Cloud Storage Web Browser Is Inaccessible

The issue with the Google Cloud Storage Browser in the Developer's Console has been resolved. We apologize for any inconvenience this may have caused.

Last Update: A few months ago

UPDATE: Incident 16026 - Google Cloud Storage Web Browser Is Inaccessible

We are investigating an issue with the Google Cloud Storage Browser in the Developer's Console. If you need to access the contents of your buckets in the meantime, please use the Google Cloud Storage gsutil command line tool[1]. We will provide another update by 13:30 PDT. [1] https://cloud.google.com/storage/docs/gsutil
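
Alongside gsutil, the same check can be done through the Cloud Storage client library. A minimal sketch, assuming the google-cloud-storage Python package and configured credentials; the bucket name is a placeholder.

```python
# Sketch: list the contents of a bucket via the Cloud Storage client library
# as an API-based alternative to the web browser; bucket name is a placeholder.
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("my-bucket"):
    print(blob.name, blob.size)
```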

Last Update: A few months ago

UPDATE: Incident 17008 - Issues accessing Google Cloud SQL instances

The problem with Google Cloud SQL is resolved as of shortly after 13:28 US/Pacific. We are sorry for any issues this may have caused to you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.

Last Update: A few months ago

UPDATE: Incident 17008 - Issues accessing Google Cloud SQL instances

We're investigating an issue with connectivity to Google Cloud SQL beginning at 13:25 US/Pacific. We'll provide a new update in the next 30 minutes.

Last Update: A few months ago

RESOLVED: Incident 15019 - Task queues not able to execute

SUMMARY: On Tuesday, 16 June 2015, Google App Engine Task Queue service and App Engine application deployment experienced increased error rates for a duration of 3 hours and 25 minutes. If your service or application was affected, we apologize. We have taken actions to fix the issue and are in the process of making the system more reliable. DETAILED DESCRIPTION OF IMPACT: On Tuesday, 16 June 2015 from 20:10 to 23:35 PDT, some developers of Google App Engine applications in the US region were unable to deploy their applications. The overall error rate of deployments during this period was approximately 60%. Affected developers saw that attempted deployments would exit and report an internal server error message after HTTP requests to appengine.google.com timed out. The App Engine Admin Console was unable to load data for affected applications. Additionally, between 20:58 and 21:33, applications in the US region experienced an increase in error rate of up to 0.25% as well as slower execution of Task Queue tasks. ROOT CAUSE: Google engineers had performed maintenance on a storage system in one of the datacenters which App Engine uses. During this maintenance, components of App Engine that rely on this storage system had to rely on a replica in a different datacenter. For both deployments and Task Queues, this switch did not function properly. REMEDIATION AND PREVENTION: Google engineers took the necessary measures to prevent the Task Queue service from accessing the storage under maintenance at 21:33. In addition, all traffic for the affected applications was redirected to alternate datacenters at 23:26. This was completed by 23:35 and applications were again able to deploy successfully. To prevent the issue from recurring, we are working to make deployments and Task Queues more resilient to movements in the underlying storage system, in a similar fashion to other App Engine components.

Last Update: A few months ago

UPDATE: Incident 18010 - Query errors for rows inserted using BigQuery Streaming API

We experienced an issue with the BigQuery Streaming API service on Wednesday 2015-06-17 between 14:12 and 17:48 US/Pacific. During this interval, querying rows inserted by the service failed with the following error: """ Error: System error. The error has been logged and we will investigate. """ Current data indicates that less than 7% of projects that issued queries during the incident were affected by this issue. The issue with the BigQuery streaming service should be resolved for all affected projects as of Wednesday 2015-06-17 17:48 US/Pacific. We will conduct an internal investigation and make appropriate improvements to our systems to prevent or minimize future recurrence. As a result of this issue, the rows inserted during the incident might be unavailable for a longer duration. If your project is affected by this delay, please track the progress in the public issue tracker [1] or the Google for Work Support Center [2]. [1] https://code.google.com/p/google-bigquery/issues/detail?id=263 [2] https://enterprise.google.com/supportcenter/

Last Update: A few months ago

UPDATE: Incident 18010 - Delayed visibility of streamed data

In some BigQuery projects, streaming data inserted after 2015-06-17 14:12 US/Pacific is not being returned in queries that match the inserted data. The streamed data is not lost. However, it is unavailable to search queries for a while.
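
For context, the pattern affected here is streaming rows in and then querying them. A minimal sketch of that pattern with the google-cloud-bigquery client (the table and field names are placeholders): rows accepted by the insert call are durable, but they may not be visible to queries immediately, which is the window this incident lengthened.

```python
# Sketch of the streaming-insert-then-query pattern; table and fields are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"

# Streamed rows are durable once accepted, even if not yet visible to queries.
errors = client.insert_rows_json(table_id, [{"user": "alice", "action": "login"}])
if errors:
    raise RuntimeError(f"streaming insert failed: {errors}")

rows = client.query(f"SELECT COUNT(*) AS n FROM `{table_id}`").result()
print(list(rows)[0].n)
```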

Last Update: A few months ago

UPDATE: Incident 17007 - Unable to connect to Cloud SQL

The issue with connecting to Cloud SQL should be resolved for all affected instances as of 10:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 17007 - Unable to connect to Cloud SQL

The issue with connecting to Cloud SQL should be resolved for all affected instances as of 10:55 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 17007 - Unable to connect to Cloud SQL

We are experiencing an issue with Cloud SQL. Users may be unable to connect to their Cloud SQL instances beginning at Wednesday, 2015-06-17 09:50 US/Pacific. Current data indicates that only instances in us-central1 are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 11:45 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 17007 - Unable to connect to Cloud SQL

We are investigating reports of an issue with Cloud SQL. We will provide more information by 10:45 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 15019 - Task queues not able to execute

The issue with application deployment was resolved as of Wednesday, 2015-06-17 00:00. Again we do apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15019 - Task queues not able to execute

We are continuing to investigate the issue with application deployment and will provide a further update by Wednesday, 2015-06-17 00:20.

Last Update: A few months ago

UPDATE: Incident 15019 - Task queues not able to execute

The issue with application deployments is ongoing; symptoms of a deployment failure are posted below.  We are continuing to investigate this and a further update will be posted in 30 minutes. -- Error posting to URL: https://appengine.google.com/api/appversion/precompile?module=default&app_id=[APPID]&version=[VERSION_ID] 500 Internal Server Error <html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><title>500 Server Error</title></head><body text=#000000 bgcolor=#ffffff><h1>Error: Server Error</h1><h2>The server encountered an error and could not complete your request.<p>Please try again in 30 seconds.</h2><h2></h2></body></html> This is try #0 [TIMESTAMP] com.google.appengine.tools.admin.AbstractServerConnection send1 Error posting to URL: https://appengine.google.com/api/appversion/precompile?module=default&app_id=[APPID]&version=[VERSION_ID] 503 Service Unavailable Try Again (503)An unexpected failure has occurred. Please try again.

Last Update: A few months ago

UPDATE: Incident 15019 - Task queues not able to execute

We are continuing to investigate the issue with application deployment and will provide a further update by Tuesday, 2015-06-16 23:20.

Last Update: A few months ago

UPDATE: Incident 15019 - Task queues not able to execute

The problem with Google App Engine Task Queue was resolved as of Tuesday, 2015-06-16 21:35 (all times are in US/Pacific); however, some users may continue to experience difficulties with application deployment. We are continuing to investigate this and will provide a further update by Tuesday, 2015-06-16 22:50 with current details. Currently, this service disruption is affecting less than 8% of users. We apologize for the inconvenience and thank you for your patience and continued support.

Last Update: A few months ago

UPDATE: Incident 15019 - Task queues not able to execute

We're investigating an issue with Google App Engine task queues beginning at Tuesday, 2015-06-16 20:00 (all times are in US/Pacific). Users may also experience issues with application deployment. We will provide more information within 30 minutes.

Last Update: A few months ago

RESOLVED: Incident 16025 - Increased error rate in Google Cloud Storage

SUMMARY: On Tuesday 9 June 2015 Google Cloud Storage served an elevated error rate and latency for a duration of 1 hour 40 minutes. If your service or application was affected, we apologize. We understand that many services and applications rely on consistent performance of our service and we failed to uphold that level of performance. We are taking immediate actions to ensure this issue does not happen again. DETAILED DESCRIPTION OF IMPACT: On Tuesday 9 June 2015 from 15:29 to 17:09 PDT, an average of 52.7% of the global requests to Cloud Storage resulted in HTTP 500 or 503 responses. The impact was most pronounced for GCS requests in the US with 55% average error rate, while requests to GCS in Europe and Asia resulted in 25% and 1% failures, respectively. ROOT CAUSE: The incident resulted from a change in the default behavior of a dependency of Cloud Storage that was included in a new GCS server release. At scale, the change led to pathological retry behavior which eventually caused increased latency, led to request timeouts, threadpool saturation and an increased error rate for the service. Google follows a canary process for all new releases, by upgrading a small number of servers and looking for problems before releasing the change everywhere in a gradual fashion. In this case the traffic served from those canary servers was not sufficient to expose the issue. REMEDIATION AND PREVENTION: Automated monitoring detected increased latency for Cloud Storage requests in one datacenter at 15:44. In an attempt to mitigate the problem and allow the troubleshooting of the underlying issue to continue, engineers increased resources to several backend systems that were exhibiting hotspotting problems which resulted in a small improvement. When it was evident that this was related to a behavior change in one of the libraries for a dependent system, Google Engineers made the necessary production adjustments to stabilize the system by quickly disabling the pathological retry behavior. To prevent similar issues from happening in the future we're making a number of changes: Cloud Storage release tools will be upgraded to allow for quicker rollbacks, risky changes will be released through an experimental traffic framework allowing for more precise canarying, and our outage response procedures will default to rolling back new releases more quickly as a default mitigation for all incidents.
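
The root cause above attributes the cascade to pathological retry behavior in a server-side dependency, not to client code. Purely as an illustration of the general technique that avoids this failure mode, the sketch below shows capped, jittered exponential backoff with a bounded attempt count, so retries do not amplify load on an already degraded service; the TransientError type and the request callable are placeholders.

```python
# Illustrative only: capped, jittered exponential backoff with a bounded
# number of attempts, so retries back off instead of amplifying load.
import random
import time

class TransientError(Exception):
    """Placeholder for 500/503-style retryable failures."""

def call_with_backoff(request, max_attempts=5, base=0.5, cap=30.0):
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```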

Last Update: A few months ago

UPDATE: Incident 16025 - Increased error rate in Google Cloud Storage

The problem with Google Cloud Storage should be resolved as of 2015-06-09 17:15 (PT). We apologize for any issues this may have caused to you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 16025 - Increased error rate in Google Cloud Storage

We're still investigating an issue with code 500 Backend Error on Google Cloud Storage RPC calls, beginning at 2015-06-09 15:31 US/Pacific. We will provide another update by 18:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 16025 - Increased error rate in Google Cloud Storage

We are experiencing an issue with Google Cloud Storage beginning at Tuesday, 2015-06-09 15:31 US/Pacific. Impacted customers will receive Code 500 Backend Error when using the service. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 17:30 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 16025 - Increased error rate in Google Cloud Storage

We're investigating an issue with Google Cloud Storage beginning at 15:31 Pacific Time. We will provide more information shortly.

Last Update: A few months ago

UPDATE: Incident 15015 - Stale composite indexes on Google App Engine

The remediation of missing index entries has been completed. We apologize for the inconvenience and thank you for your patience and continued support.

Last Update: A few months ago

UPDATE: Incident 15015 - Stale composite indexes on Google App Engine

Some of the missing index entries have been fixed and the remaining ones will require additional time. We will provide another status update by 20:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15015 - Stale composite indexes on Google App Engine

Google Engineers have identified the issue with Google App Engine Datastore composite indexes missing entries. Remediation has been applied so that all future writes will correctly update indexes. Over the next several hours all remaining missing index entries will be fixed. We will provide another status update by 18:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15015 - Stale composite indexes on Google App Engine

The investigation into Google App Engine stale composite indexes is ongoing. Current data indicates the issue is isolated to approximately 0.01% of applications running in the US. We will provide another status update by 16:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15015 - Stale composite indexes on Google App Engine

We are still investigating the issue with a small number of Google App Engine apps experiencing stale composite indexes for recently updated Datastore entities. Current data indicates the issue only affects apps located in the US. Approximately 0.01% of apps are affected by this issue. We will provide another status update by 15:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15015 - Stale composite indexes on Google App Engine

Approximately 0.01% of Google App Engine apps are experiencing stale composite indexes for recently updated Datastore entities. Partial remediation has been applied but some Datastore queries may still get old results. Google Engineers are working to complete the remediation. To test if you are affected, use the App Engine Datastore Viewer to query for a suspected stale entity both by its key and its indexed value. If you cannot find the entity by index but can see the entity by key, you may be affected. Example queries to do this are "SELECT __key__ FROM YourKind WHERE __key__=Key('YourKind', 'id_string')" and "SELECT __key__ FROM YourKind WHERE indexed_col1='value1' and indexed_col_2='value2'". For those affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 14:00 US/Pacific time with current details.
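
The same check can be run outside the Datastore Viewer with the Cloud Datastore client library. A minimal sketch, assuming the google-cloud-datastore Python package; the kind, key ID, and property names simply mirror the hypothetical ones in the GQL above. If the lookup by key finds the entity but the indexed-property query does not, the application may be affected.

```python
# Sketch of the stale-index check using the Cloud Datastore client library;
# kind, key id and property names mirror the hypothetical GQL examples above.
from google.cloud import datastore

client = datastore.Client()

by_key = client.get(client.key("YourKind", "id_string"))

query = client.query(kind="YourKind")
query.add_filter("indexed_col1", "=", "value1")
query.add_filter("indexed_col_2", "=", "value2")
query.keys_only()
by_index = list(query.fetch(limit=1))

if by_key is not None and not by_index:
    print("Entity visible by key but not by index - possibly affected")
```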

Last Update: A few months ago

UPDATE: Incident 15015 - Stale composite indexes on Google App Engine

Approximately 0.01% of Google App Engine apps are experiencing stale composite indexes for recently updated Datastore entities. Partial remediation has been applied but some Datastore queries may still get old results. Google Engineers are working to complete the remediation. To test if you are affected, use the App Engine Datastore Viewer to query for a suspected stale entity both by its key and its indexed value. If you cannot find the entity by index but can see the entity by key, you may be affected. Example queries to do this are: ``` SELECT __key__ FROM YourKind WHERE __key__=Key('YourKind', 'id_string') ``` and ``` SELECT __key__ FROM YourKind WHERE indexed_col1='value1' and indexed_col_2='value2' ``` For those affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 14:00 US/Pacific time with current details.

Last Update: A few months ago

RESOLVED: Incident 16024 - Increased error rate and latency in GCS uploads

SUMMARY: On Friday 15 May 2015, uploads to Google Cloud Storage experienced increased latency and error rates for a duration of 1 hour 38 minutes. If your service or application was affected, we apologize. We understand that many services and applications rely on consistent performance when uploading objects to our service and we failed to uphold that level of performance. We are taking immediate actions to ensure this issue does not happen again. DETAILED DESCRIPTION OF IMPACT: On Friday 15 May 2015 from 03:35 to 05:13 PDT, uploads to Google Cloud Storage either failed or took longer than expected. During the incident, 6% of all POST requests globally returned error code 503 between 03:35 and 03:43, and the error rate remained at >0.5% until 05:13. Google Cloud Storage is a highly distributed system; one of Google's US datacenters was the focus of much of the impact, with an error rate peaking at over 40%. Median latency for successful requests increased 16%, compared to typical levels, while latency at the 90th and 99th percentiles increased 29% and 63% respectively. ROOT CAUSE: A periodic replication job, run automatically against Google Cloud Storage's underlying storage system, increased load and reduced available resources for processing new uploads. As a result, latency increased and uploads to GCS either continued to wait for completion or failed. REMEDIATION AND PREVENTION: Google engineers were alerted to increased latency in one of the datacenters responsible for processing uploads at 03:12 PDT and redirected upload traffic to several other datacenters to distribute the load. When it was evident this redirection had not alleviated the increase in latency, engineers began to provision additional capacity while continuing to investigate the underlying root cause. When the replication job was identified as the source of the increased load, Google engineers reduced the rate of replication and service was restored. In the short term, Google engineers will be adding additional monitoring to the underlying storage layer to better identify problematic system load conditions as well as the tasks responsible. In the longer term, Google engineers will be isolating the impact the replication job can have on the latency and performance of other services.

Last Update: A few months ago

RESOLVED: Incident 16024 - Increased error rate and latency in GCS uploads

Correction: the start of the incident occurred on Friday, 2015-05-15 03:30 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 16024 - Increased error rate and latency in GCS uploads

GCS API experienced increased error rate and latency for uploads starting Friday, 2015-05-15 05:30 ending Friday, 2015-05-15 05:15 US/Pacific. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 15053 - Elevated latency and error rate for Google Compute Engine API

SUMMARY: On Tuesday 12 May 2015 an infrastructure event caused the reboot of 6% of virtual machines in the Google Compute Engine zone us-central1-a. In addition, API operations targeting the us-central1-a zone resulted in errors for a duration of 1 hour and 38 minutes, while other Compute Engine API operations experienced elevated latency for the same duration. If you or your customers were affected by either the reboots or the API issues, we apologize. We failed to contain the issue to the affected power hardware and are working to improve the failure isolation of the systems involved. DETAILED DESCRIPTION OF IMPACT: On Tuesday 12 May 2015 at 02:58 PDT, 6% of virtual machines in the us-central1-a zone rebooted due to a power domain failure. The affected instances finished rebooting by 03:35 PDT. At 03:21 PDT, Compute Engine API operations began to fail for the us-central1-a zone, and other Compute Engine API operations experienced higher than usual latency. The API issue was resolved at 04:59 PDT and API latency recovered by 05:12 PDT. ROOT CAUSE: At 02:58 PDT power systems in the us-central1-a zone initiated a shutdown for safety reasons, and alerted Google engineers to the issue. In response to the power issue Google engineers initiated a change at 03:15 PDT intended to direct lower priority traffic away from us-central1-a during the event. However, a software bug in the GCE control plane interacted poorly with this change and caused API requests directed to us-central1-a to be rejected starting at 03:21 PDT. Retries and timeouts from the failed calls caused increased load on other API backends, resulting in higher latency for all GCE API calls. The API issues were resolved when Google engineers identified the control plane issue and corrected it at 04:59 PDT, with the backlog fully cleared by 05:12 PDT. REMEDIATION AND PREVENTION: Google engineers are fixing the bug in the control plane software so it will not unintentionally reject requests in similar situations in future. Google engineers have manually validated the configuration of the components of the API system to ensure that no similar errors will happen in the future. Google engineers will also improve the robustness of the API backends so that a single zone failure does not manifest increased latency outside of the affected zone.

Last Update: A few months ago

RESOLVED: Incident 15053 - Elevated latency and error rate for Google Compute Engine API

The issue with Google Compute Engine API latency should be resolved for all affected users as of 05:28 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15053 - Elevated latency and error rate for Google Compute Engine API

We are experiencing an issue with elevated GCE API latency globally, which also impacts the ability to successfully create new instances and perform other operations in the us-central1-a zone, beginning at Tuesday, 2015-05-12 03:04 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 05:50 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15053 - Elevated latency and error rate for Google Compute Engine API

We are investigating reports of an issue with Google Compute Engine. We will provide more information by 05:20 US/Pacific.

Last Update: A few months ago

RESOLVED: Incident 15004 - Users unable to create new projects

SUMMARY: On Friday 8 May 2015, users were unable to reliably create projects in the Google Cloud Developers Console for 6 hours 58 minutes in aggregate. We treat the reliability and availability of the Developers Console with the highest priority and we apologize to any users who were unable to create new projects during that time. DETAILED DESCRIPTION OF IMPACT: On Friday 8 May 2015, from 04:53 to 07:25 and from 09:35 to 12:31 PDT, users were unable to reliably create new projects using the Developers Console. Specifically, up to 90% of project creation requests failed and returned a "Backend Error" message to the user. Existing projects were not affected by this incident. ROOT CAUSE AND REMEDIATION: A configuration change in the access management system of the Developers Console triggered an existing software bug in a previously unused code path. As the change was rolled out, the access management system could no longer correctly verify users' permissions, resulting in an error response. Google engineers were automatically alerted to the issue at 05:04. At 05:22 a preliminary source of the issue was identified. At 09:32 the root cause was confirmed and a solution was developed. An expedited roll-out of the fix commenced at 11:41 and completed at 12:31.

Last Update: A few months ago

RESOLVED: Incident 18008 - Elevated error rates for API and Web UI

SUMMARY: On Thursday 7 May 2015, requests to the Google BigQuery Web UI and APIs experienced errors for a total duration of 2 hours and 9 minutes over two separate periods. We understand the high level of reliability that is demanded and expected of a service like BigQuery and apologize for the disruption. We are taking immediate actions to ensure we minimize the risk of this issue repeating itself. DETAILED DESCRIPTION OF IMPACT: On Thursday 7 May 2015 from 20:45 to 21:20 PDT and on Friday 8 May from 03:13 to 04:47, requests to the Web UI resulted in the page hanging with the message “Loading BigQuery…”. Additionally, when accessing BigQuery via the API, users would have seen responses with error code 400 or 500. ROOT CAUSE: A routine software upgrade to the authorization process in BigQuery had a side effect of reducing the cache hit rate of dataset permission validation. A particular query load triggered a cascade of live authorization checks that fanned out and amplified throughout the BigQuery service, eventually causing user visible errors as the authorization backends became overwhelmed. As a byproduct, error rates for the service increased as individual requests failed to authorize. REMEDIATION AND PREVENTION: Google engineers were able to identify and cancel problematic in-flight BigQuery queries that were causing a high number of retries to the permissions validation backend. To prevent a recurrence of this issue, engineers temporarily disabled the retry of these queries to prevent retries from amplifying the effect of unhealthy permission validation backends. Google engineers were also able to adjust the retry parameters of the authorization system to return cache hit rates to normal. As the system stabilized, BigQuery engineers were able to gradually allow query traffic to flow in and re-enabled permission validation, restoring service. To prevent future recurrences of this issue, Google engineers will change the structure of permissions validation so that continual retries will not destabilize the entire service. This restructuring includes reducing the number of backends that require permissions validation by changing the steps involved in the BigQuery request validation process. Engineers will also introduce safety limits governing communication between BigQuery and the permissions validation system. Google engineers are also adding additional monitoring to better detect and potentially preemptively mitigate issues of this nature.

Last Update: A few months ago

UPDATE: Incident 15004 - Users unable to create new projects

The Developer Console issue that prevented users from creating projects was resolved as of 2015-05-08 13:15 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15004 - Users unable to create new projects

We are continuing to work toward a resolution of the Developers Console issue that is preventing users from creating projects. We will provide another status update no later than 13:25 Pacific time.

Last Update: A few months ago

UPDATE: Incident 15004 - Users unable to create new projects

We are currently experiencing an issue with the Developers Console where users are unable to create new projects. We are currently working toward a resolution. For everyone who is affected, we apologize for any inconvenience you may be experiencing.

Last Update: A few months ago

RESOLVED: Incident 18009 - Elevated error rates for API and Web UI

We experienced an issue with the Google BigQuery Web UI and BigQuery APIs, beginning at 2015-05-08 03:13. The issue should be resolved as of 2015-05-08 04:47 (all times are in US/Pacific). We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation and post it at: https://status.cloud.google.com/incident/bigquery/18008

Last Update: A few months ago

UPDATE: Incident 18008 - Elevated error rates for API and Web UI

The issue with BigQuery elevated error rates for API and Web UI should be resolved for all affected users as of 2015-05-08 04:47 (all times are in US/Pacific). We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 18008 - Elevated error rates for API and Web UI

We are still investigating the issue with BigQuery elevated error rates for API and Web UI. We will provide another status update by 2015-05-08 05:30 (all times are in US/Pacific) with current details.

Last Update: A few months ago

UPDATE: Incident 18008 - Elevated error rates for API and Web UI

We are experiencing a recurrence of this problem starting at 03:00 US/Pacific time on Friday 2015-05-08. We will provide an update by 04:30 US/Pacific time.

Last Update: A few months ago

RESOLVED: Incident 16022 - Elevated Error Rate from Google Cloud Storage

As this issue later recurred, a public report has been posted on the Google Cloud Status Dashboard at https://status.cloud.google.com/incident/storage/16023.

Last Update: A few months ago

RESOLVED: Incident 16023 - Elevated Error Rate from Google Cloud Storage

SUMMARY: On Tuesday 5 May 2015 and Wednesday 6 May 2015, Google Cloud Storage (GCS) experienced elevated request latency and error rates for a total duration of 43 minutes during two separate periods. We understand how important uptime and latency are to you and we apologize for this disruption. We are using the lessons from this incident to achieve a higher level of service in the future. DETAILED DESCRIPTION OF IMPACT: On Tuesday 5 May 2015 from 12:26 to 12:48 PDT requests to GCS returned with elevated latency and error rates. Averaged over the incident, 33% of requests returned error code 500 or 503. At 12:34 the error rate peaked at 42% of requests. Median latency for successful requests increased by 29%. On Wednesday 6 May from 19:25 to 19:41 PDT, and for two minutes at 19:46 and three minutes at 19:52, the same symptoms were seen with a 64% average error rate and 55% increase in median latency. ROOT CAUSE: At 12:25 on 5 May 2015 GCS received an extremely high rate of requests to a small set of GCS objects, causing high load and queuing on a single metadata database shard. This load caused a fraction of unrelated GCS requests to be queued as well, resulting in latency and timeouts visible to other GCS users. At 19:25 on 6 May 2015 GCS received a second round of unusual load to a different set of objects, causing a recurrence of the same issue. REMEDIATION AND PREVENTION: In both incidents, Google engineers identified the set of GCS objects affected and increased localized caching to increase capacity. The Google support team also contacted the project generating the load and worked with them to reduce their demand. In addition to these tactical fixes, Google engineers will enable service-wide caching for the affected GCS components. Google engineers are also working on other steps to improve service isolation between unrelated GCS objects and projects.

Last Update: A few months ago

RESOLVED: Incident 18008 - Elevated error rates for API and Web UI

We experienced an issue with the Google BigQuery Web UI and BigQuery APIs, beginning at 2015-05-07 20:45 (all times are in US/Pacific). The issue should be resolved as of 2015-05-07 21:20. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 16023 - Elevated Error Rate from Google Cloud Storage

We were experiencing an issue with the elevated error rates on Google Cloud Storage, beginning at 2015-05-06 19:26 (all times are in US/Pacific). The issue should be resolved as of 2015-05-06 20:00. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

RESOLVED: Incident 15003 - Logs errors in the Developers Console

SUMMARY: On Saturday 2 May 2015, the Logs Viewer in the Google Cloud Developers Console was unavailable for a duration of 7 hours 10 minutes. We apologize to users who were affected by this issue — this is not the level of quality and reliability we strive to offer you, and we are working to improve the availability of the affected components. DETAILED DESCRIPTION OF IMPACT: On Saturday 2 May 2015 from 11:10 to 18:20 PDT, the Logs Viewer in the Google Cloud Developers Console was unavailable. Users that tried to access the Logs Viewer saw a loading animation which continued indefinitely. Log recording was not affected and messages logged during the incident were available when the Logs Viewer service was restored. Additionally, App Engine logs were still viewable in the App Engine console under appengine.google.com for the duration of the incident. ROOT CAUSE: At 11:10 Google engineers deployed a networking configuration change which prevented the Logs Viewer from contacting a backend component of the Cloud logging subsystem. This change affected requests that were issued from outside the Google network. The Cloud Developers Console has continuous monitoring to detect elevated error rates, but the monitoring submits requests from within the Google network and therefore reported that the Logs Viewer was working correctly, with the result that Google engineers did not become aware of the issue until it was reported by Cloud Platform customers. REMEDIATION AND PREVENTION: Google engineers identified the problematic change at 17:52 and immediately started a rollback to resolve the issue. This was complete at 18:20. To ensure prompt detection of any similar issues in future, Google engineers are extending the monitoring and testing of the Cloud Developer Console to include probes from outside Google's network.
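
The prevention item above is external (black-box) probing. As a rough sketch of the idea only, not Google's monitoring stack, the probe below fetches a URL from outside the serving network and flags anything other than a successful response; the URL, timeout, and alerting are placeholders.

```python
# Sketch of an external black-box probe; URL, timeout and alerting are placeholders.
import urllib.request

def probe(url="https://console.developers.google.com/", timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection errors, timeouts and HTTP errors all count as failures.
        return False

if __name__ == "__main__":
    if not probe():
        print("ALERT: external probe failed")
```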

Last Update: A few months ago

UPDATE: Incident 16022 - Elevated Error Rate from Google Cloud Storage

The issue with Google Cloud Storage's elevated error rates should be resolved for all affected projects as of 13:01 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Last Update: A few months ago

UPDATE: Incident 16022 - Elevated Error Rate from Google Cloud Storage

We're investigating an issue with Google Cloud Storage beginning at 12:28 Pacific Time. We will provide more information shortly.

Last Update: A few months ago

UPDATE: Incident 15003 - Logs errors in the Developers Console

The problem with the Developers Console should be resolved as of 18:22 PDT. We apologize for any issues this may have caused to you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.

Last Update: A few months ago

UPDATE: Incident 15003 - Logs errors in the Developers Console

We are still investigating the issue with the Developers Console where the Logs page is failing to load. Impacted users may still view logs in the Admin Console (https://appengine.google.com). We will provide another status update by 19:00 PDT.

Last Update: A few months ago

UPDATE: Incident 15003 - Logs errors in the Developers Console

We are currently experiencing an issue with the Developers Console and some users are unable to view the Logs page. Impacted users may still view logs in the Admin Console (https://appengine.google.com). For everyone who is affected, we apologize for any inconvenience you may be experiencing.

Last Update: A few months ago

RESOLVED: Incident 15014 - App Engine deployments failing for some customers

The problem with Google App Engine deployments was resolved as of Thursday, 2015-04-30 18:32 (all times are in US/Pacific). We apologize for the inconvenience for our affected customers and thank you for your continued support.

Last Update: A few months ago

UPDATE: Incident 15014 - App Engine deployments failing for some customers

We are still investigating the issue with Google App Engine deployments. The issue is isolated to one data center, which has at this point recovered to normal levels. We will provide another status update by Thursday 2015-04-30 21:00 (all times are in US/Pacific).

Last Update: A few months ago

UPDATE: Incident 15014 - App Engine deployments failing for some customers

We are currently experiencing an issue with application deployment on Google App Engine, starting at 17:00 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 19:00 US/Pacific time with current details. The issue manifests itself as follows: unexpected errors or high latency deploying AppEngine applications. Other AppEngine components are unaffected. We are currently investigating the issue.

Last Update: A few months ago

RESOLVED: Incident 15013 - AppEngine deployments failing for some customers

SUMMARY: For 4 hours and 37 minutes on Tuesday 28 April 2015, 45% of Google App Engine deployments failed. If you were trying to deploy new versions of your app during this time, we apologize for the extended outage and the delay in public announcement. DETAILED DESCRIPTION OF IMPACT: On Tuesday 28 April 2015 from 12:05 to 16:42 PDT, 45% of App Engine deployments experienced delays that resulted in the operation timing out. A small number of deployments were exceptionally slow but succeeded. There was no impact to already-deployed applications, nor to applications after a successful deployment. Affected deployments showed the messages "Checking if deployment succeeded" and "Will check again in 60 seconds" for long periods. Deployments that timed out showed "Version still not ready to serve, aborting" and "An unexpected error occurred. Aborting". ROOT CAUSE: Google engineers ran an internal software upgrade that had a side effect of slowing some deployments for a period of up to one hour. The upgrade failed in its first run and had to be restarted, extending the duration of its impact. Most affected deployments were delayed beyond their timeout period, causing the deployment to be aborted. REMEDIATION AND PREVENTION: Google engineers are improving the deployment infrastructure to minimize the impact of routine maintenance. Google will also make app deployment speed and reliability more transparent to customers. In case of future deployment delays, Google is updating procedures to ensure that public notifications are issued in a timely fashion.

Last Update: A few months ago

RESOLVED: Incident 15013 - AppEngine deployments failing for some customers

We have experienced a problem with Google AppEngine from 2015-04-28 12:05 US/Pacific to 2015-04-28 17:00 US/Pacific, which manifested itself as follows: unexpected errors or high latency deploying AppEngine applications. Other AppEngine components were unaffected. The issue was fully resolved as of Tuesday, 2015-04-28 17:00 US/Pacific. We apologize for any issues this may have caused to you or your users and thank you for your patience and continued support. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 15052 - Issue with booting up newly created instances and packet loss in asia-east1-a

SUMMARY: From Saturday 25 April 2015 23:15 PDT to Sunday 26 April 2015 02:14 PDT, 7.4% of Google Compute Engine (GCE) instances in the asia-east1-a zone experienced networking issues. Affected instances were those either created or live migrated during the incident, and were subsequently unable to communicate on the network. Google Engineers resolved the network issues on Sunday 26 April 2015 at 02:14. We apologize to the affected customers and are implementing preventative measures to ensure similar incidents do not occur in the future. DETAILED IMPACT: Beginning on Saturday 25 April 2015 at 23:15, the GCE network control plane stopped propagating network configuration changes to the instance provisioning component. As a result, any instances that were created or live migrated during this time were unable to obtain valid network configuration and thus could not communicate on the network. Overall, the outgoing network traffic in the asia-east1-a zone was 25% lower than expected. Instances in other zones were unaffected. During the incident, network and HTTP load balancing correctly marked the affected instances as unhealthy and routed requests to instances in other zones if they were available. ROOT CAUSE: During a scheduled maintenance event in the asia-east1 region, a latent configuration issue resulted in a network control plane component failing to restart correctly. Automated monitoring alerted Google Engineers to the issue, and after investigation they discovered and corrected the configuration problem allowing the network control plane to start and to begin propagating network configuration. REMEDIATION AND PREVENTION: To prevent similar incidents, Google Engineers are currently auditing the configuration of all critical GCE components to ensure they can migrate successfully. In addition, Google Engineers are implementing more aggressive monitoring to ensure more rapid remediation of issues.
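
The report notes that load balancing marked unreachable instances as unhealthy and routed around them. As an illustration of the instance-side half of that mechanism only (not the GCE load balancer configuration itself), the sketch below serves a /healthz endpoint a health check could probe; an instance with no working network configuration simply fails the probe by timing out. The port and check logic are placeholders.

```python
# Sketch of a minimal /healthz endpoint a load balancer health check can probe;
# the port and the backend_is_healthy() logic are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

def backend_is_healthy():
    # Placeholder: check local dependencies (disk, upstream services, etc.).
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz" and backend_is_healthy():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(503)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```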

Last Update: A few months ago

RESOLVED: Incident 15052 - Issue with booting up newly created instances and packet loss in asia-east1-a

We have experienced a problem with Google Compute Engine Networking in the asia-east1-a zone from 2015-04-25 23:25 US/Pacific to 2015-04-26 02:14 US/Pacific, which manifested itself as follows: newly created (or rebooted) VMs in asia-east1-a were not able to boot successfully. Additionally, existing VMs in that zone saw some packet loss between VMs and to the Internet. Other Compute Engine zones were unaffected. The issue was fully resolved as of Sunday, 2015-04-26 02:14 US/Pacific. We apologize for any issues this may have caused to you or your users and thank you for your patience and continued support. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

RESOLVED: Incident 15011 - High logging service error rate

SUMMARY: On Friday 17 April 2015, the Google App Engine Logs API experienced intermittent failures and reduced throughput for read requests for a duration of 54 minutes. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. DETAILED DESCRIPTION OF IMPACT: On Friday 17 April 2015 from 16:02 to 16:56 PDT, 3% of read requests to the Logs API failed and there was a 96% drop in throughput. The problem affected 16% of applications that rely on this API to export logs. In this time window, users experienced intermittent timeouts while attempting to view application logs on App Engine Admin Console or Google Cloud Developers console. ROOT CAUSE: Hotspotting in the App Engine Logs API's storage subsystem caused a number of storage nodes to fail. This eventually resulted in resource depletion and request failures. REMEDIATION AND PREVENTION: At 16:05 on Friday 17 April 2015, an automated alert on depletion of available resources for the Logs API was sent out to Google Engineers. To resolve the immediate problem they started redirecting traffic away from the affected storage layer. The service started recovering at 16:51 and normal operation was restored at 16:56. To prevent similar incidents in future, we are implementing changes to reallocate resources consumed by high use individual nodes of the storage layer backing the Logs API.

Last Update: A few months ago

RESOLVED: Incident 15012 - Problem with App Engine custom domains

SUMMARY: On Wednesday 22 April 2015, for a duration of 92 minutes, some requests from European regions to Google App Engine custom domains were redirected to the Google front page. We apologise to our customers and users who were affected by this issue, and we have taken and are taking immediate steps to improve the platform’s availability. DETAILED DESCRIPTION OF IMPACT: Starting at 06:37 PDT on Wednesday 22 April, some custom-domain URL requests from the Europe region were redirected to the www.google.com front page, or to equivalent national Google front pages, instead of being dispatched to their target Google App Engine applications. The incident had two phases. In the first phase, from 06:37 to 07:30, 7.9% of traffic to custom domains was affected. In the second phase, from 07:30 to 08:09, 13.7% of custom domain traffic was affected. In total, approximately 0.2% of requests to App Engine were incorrectly redirected during the incident. Requests originating outside Europe were not affected, except for a very small percentage which were routed to the Google network through European points of presence. Requests to applications via appspot.com domains were also not affected. The hosting region of the application was not a factor. ROOT CAUSE: App Engine custom domains are handled by a system which performs domain mapping for a number of Google services. In order to increase performance, capacity and supportability, Google engineers are in the process of migrating this system's traffic onto Google's general-purpose network infrastructure. The outage commenced when a rollout of this integration began in European datacenters, with a small fraction of custom domain requests being routed through the general infrastructure. Detailed monitoring was in place for this migration but, incorrectly, did not include App Engine custom domains. Due to a configuration error, the migrated App Engine custom domains were not recognized by the infrastructure, which therefore redirected them to its default target of the Google front page. REMEDIATION AND PREVENTION: At 08:04, the issue was identified and Google engineers immediately cancelled the rollout, restoring service by 08:09. To prevent similar issues from reaching production in future, Google engineers are implementing software release tests to identify the class of configuration error that triggered the incident. In case similar issues do reach production, Google engineers are extending rollout testing to include App Engine custom domains so that problematic rollouts will be detected and cancelled automatically and immediately. Finally, continuous monitoring will be added to ensure that all types of custom domain are being correctly recognized and dispatched by the infrastructure, so that Google engineers will be rapidly notified if similar issues recur, regardless of the cause.

Last Update: A few months ago

UPDATE: Incident 15051 - Intermittent network issue

The previous message should have stated "isolated to some customers on us-central1-b region". The internal investigation will clarify the amount of impacted customers in the region.

Last Update: A few months ago

UPDATE: Incident 15051 - Intermittent network issue

The problem with Google Compute Engine Networking is now resolved for majority of affected customers as of 2015-04-22 18:35 US/Pacific. The issue was isolated to some[a small number of?] customers on us-central1-b region. We will provide a more detailed analysis of this incident once we have completed our internal investigation. We apologize for any issues this may have caused to you or your users and thank you for your patience and continued support.

Last Update: A few months ago

UPDATE: Incident 15051 - Intermittent network issue

We have identified the issue and a fix is in the process of being rolled out. We will continue to monitor the situation closely. Next update in 1 hour.

Last Update: A few months ago

UPDATE: Incident 15051 - Intermittent network issue

We're investigating intermittent network packet drops affecting some customers. We will provide more information shortly.

Last Update: A few months ago

RESOLVED: Incident 15012 - Problem with App Engine custom domains

We confirm that the problem with custom domains on Google App Engine was resolved as of shortly after 08:09 PDT on Wednesday 22 April, 2015. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15012 - Problem with App Engine custom domains

The issue with custom domains is alleviated for most customers. Google engineers are working to ensure service is fully restored.

Last Update: A few months ago

UPDATE: Incident 15012 - Problem with App Engine custom domains

We are currently experiencing an issue with custom domains on Google App Engine. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 08:30 US/Pacific time with current details.

Last Update: A few months ago

RESOLVED: Incident 15011 - High logging service error rate

Apologies - the date in the previous post was incorrect. The resolution time was Friday, April 17th, 2015 at 17:00 PDT.

Last Update: A few months ago

RESOLVED: Incident 15011 - High logging service error rate

The problem with the Google App Engine Logging service was resolved as of Friday April 18th, 2015 at 17:00 PDT. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.

Last Update: A few months ago

UPDATE: Incident 15011 - High logging service error rate

We're investigating an issue with Google App Engine Logging service beginning at 2015-04-17 16:00 (all times are in US/Pacific). We will provide more information within one hour.

Last Update: A few months ago

RESOLVED: Incident 15010 - Datastore Indexing Issue

SUMMARY: On Friday 10th April 2015, attempts to create or update Datastore indexes failed for some Google App Engine applications for a duration of 148 minutes. In addition, a number of applications retrieved stale data using eventually consistent read operations for an unexpectedly long period. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform's performance and availability. DETAILED DESCRIPTION OF IMPACT: On Friday 10 April 2015 from 11:30 to 13:58 PDT, 331 requests to create or update the definition of Datastore composite indexes across 21 applications failed to complete. In addition, about 34% of applications retrieved stale data using eventually consistent QUERY or GET operations [1]. Unlike strongly consistent queries, eventually consistent read operations are expected to return stale data for a brief period. However, this behaviour persisted for a longer duration than is typically observed during normal operations. There was no impact on strongly consistent operations. During the recovery phase of this incident about 7% of Google App Engine applications experienced elevated latency on PUT operations for 15 minutes. ROOT CAUSE: During a planned maintenance activity, undertaken to create a new Datastore replica to accommodate organic growth, incorrectly configured automation created an unnecessarily large table in the new replica. This resulted in exhaustion of resources allocated to Datastore and write failures to this replica. Once the underlying problem was resolved, a high volume of writes was routed to the new replica, resulting in elevated latency for write operations. REMEDIATION AND PREVENTION: At 00:30 PDT on Friday 10th April 2015, an automated alert on resource depletion was sent out to Google Engineers. However, this alert was suppressed, as is normal practice when undertaking this type of maintenance activity. At 11:30 PDT, quota allocated to the replica was exhausted. Google Engineers were notified by internal teams at 12:53 PDT of problems with Datastore indexes. At 13:26 PDT, Google Engineers deleted the problematic large table and started the procedure to reserve additional quota for this storage replica. This took effect at 13:35 PDT and the replica started receiving write requests immediately, which caused a brief increase in latency. Normal operation was restored at 13:58 PDT. To prevent similar incidents in future, we are modifying our maintenance procedures to avoid suppression of the appropriate alerts, and to ensure that this large table is created under close monitoring. [1] Details on eventual and strong consistency on Google Cloud Datastore: https://cloud.google.com/developers/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/#h.tf76fya5nqk8
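
For readers unfamiliar with the distinction the impact section relies on, here is a minimal sketch with the Cloud Datastore Python client (the kind and key are placeholders): a lookup by key is strongly consistent, while a non-ancestor query is eventually consistent and may briefly return stale results, which is the window this incident stretched.

```python
# Sketch contrasting strongly and eventually consistent reads in Cloud Datastore;
# kind and key values are placeholders.
from google.cloud import datastore

client = datastore.Client()

# Strongly consistent: read the entity directly by key.
task = client.get(client.key("Task", 1234))

# Eventually consistent: a non-ancestor query over the same kind may
# briefly return stale results.
recent = list(client.query(kind="Task").fetch(limit=10))
```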

Last Update: A few months ago

RESOLVED: Incident 15050 - Degraded network performance

SUMMARY: On Sunday 12th April 2015, Google Compute Engine instances in us-central1 and asia-east1 experienced intermittent 0 to 4 percent packet loss on outgoing external traffic for an overall duration of 6 hours. This was followed by a peak of up to 20 percent packet loss for a duration of 13 minutes. We apologize to our customers who were affected by this issue, and are working to address the factors that allowed this to happen.

DETAILED DESCRIPTION OF IMPACT: On Sunday 12th April 2015 from 16:26 to 22:24 PDT, Compute Engine experienced packet loss totalling 0 to 4 percent of traffic to external addresses. Individual instances could have experienced up to 100 percent loss intermittently. The duration of packet loss was zone-specific. The issue began in us-central1-a at 16:26 and lasted until 22:24. The zones us-central1-b, us-central1-f, and asia-east1-a experienced loss between 19:20 and 20:20. Total loss ranged from 2 to 4 percent between 19:27 and 20:03, and 2 to 20 percent between 22:20 and 22:40. Outside these time periods, loss was under 2 percent of total egress traffic.

ROOT CAUSE: An abrupt increase in traffic from several large projects triggered Compute Engine's traffic shaping mechanisms. These mechanisms had an unintended spillover effect on other projects, causing some packet loss even for uninvolved projects.

REMEDIATION AND PREVENTION: Google engineers increased the traffic capacity dedicated to Compute Engine to address the usage increase and eliminate packet loss. Google engineers are also making code changes to improve the fidelity and speed of response of the traffic shaping mechanism, and eliminate packet loss for uninvolved projects in future traffic shaping events.
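Customers who want to quantify egress packet loss of the kind described above can estimate it directly from an affected instance. The following sketch is only an illustration: it assumes a Linux guest with the standard ping utility and Python 3.7+, and uses 8.8.8.8 purely as an example external target.

# Minimal sketch for estimating external packet loss from an instance.
# Assumes a Linux environment with the standard "ping" utility available.
import re
import subprocess


def estimate_packet_loss(target="8.8.8.8", count=100):
    """Send `count` ICMP echo requests and return the loss percentage."""
    output = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", target],
        capture_output=True, text=True, check=False,
    ).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", output)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    print("estimated loss: %s%%" % estimate_packet_loss())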

Last Update: A few months ago

RESOLVED: Incident 15049 - Issue with Network Connectivity on April 10th, 2015

SUMMARY: On Friday 10 April 2015, Google Compute Engine instances in us-central1 experienced elevated packet loss for a duration of 14 minutes. If your service or application was affected, we apologize; this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform's performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Friday 10 April 2015 from 02:10 to 02:24 PDT, instances hosted in Google Compute Engine zone us-central1-b experienced elevated packet loss for internal (VM <-> VM) traffic, and every zone in region us-central1 experienced elevated packet loss for external (Internet <-> VM) traffic. The impact varied by network path: for VM to VM and VM to Internet traffic, reported packet loss peaked at between 26% and 47%, while for Internet to VM traffic, 18% to 34% of total packets were lost.

ROOT CAUSE: During routine planned maintenance, a miscommunication resulted in traffic being sent to a datacenter router that was running a test configuration. This resulted in a saturated link, causing packet loss. The faulty configuration became effective at 02:10 and caused traffic congestion soon after.

REMEDIATION AND PREVENTION: Google Engineers were notified by our alerting systems at 02:12 and confirmed an unusually high rate of packet loss at 02:18. At 02:21 Google Engineers disabled the problematic router, distributing traffic to other, unsaturated links. Normal operation was restored at 02:24. To prevent similar incidents in future, we are changing our procedures to include additional validation checks when configuring routers during maintenance activities. We are also implementing a higher degree of automation to remove potential human and communication errors when changing router configurations.

Last Update: A few months ago

RESOLVED: Incident 15050 - Degraded network performance

The problem with Google Compute Engine Networking was resolved as of 2015-04-12 22:40 US/Pacific. We have verified that there have been no recurrences of this issue since that time.

Last Update: A few months ago

RESOLVED: Incident 15050 - Degraded network performance

The problem with Google Compute Engine Networking should be resolved as of 22:40 US/Pacific. Our engineers identified the root cause of the issue and deployed a fix that has prevented further packet loss since that time. We will continue to monitor the situation and post an all-clear around 2015-04-13 00:00 US/Pacific. We apologize for any issues this may have caused you or your users and thank you for your patience and continued support. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15050 - Degraded network performance

We are still investigating the issue with Google Compute Engine Networking performance. At this time, the issue is isolated to the us-central1-a zone. We will provide another status update by 2015-04-12 23:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15050 - Degraded network performance

We are still investigating the issue with Google Compute Engine Networking performance. We will provide another status update by 2015-04-12 22:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15050 - Degraded network performance

We are currently experiencing an issue with Google Compute Engine Networking performance, and some users are seeing higher packet loss rates. We apologize for any inconvenience caused and want to reassure you that our reliability engineers are treating this issue as the highest priority. The issue started in us-central1-a and us-central1-f, but has progressed to affect all zones. We will provide an update by 2015-04-12 21:00 US/Pacific with current details.

Last Update: A few months ago

UPDATE: Incident 15050 - Degraded network performance

Correct incident start time: 2015-04-12 17:00 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15050 - Degraded network performance

We're investigating an issue with Google Compute Engine Networking performance beginning at around 2015-03-12 17:00 US/Pacific. We will provide more information shortly.

Last Update: A few months ago

RESOLVED: Incident 15010 - Datastore Indexing Issue

The problem with Google App Engine Datastore was resolved as of Friday 2015-04-10 14:00 (US/Pacific). We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15010 - Datastore Indexing Issue

We are currently experiencing an issue with Google App Engine Datastore. Some applications' Datastore indexes are not updating. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by Friday 2015-04-10 14:45 (US/Pacific) with current details. Google Engineers have identified the cause and are currently working on multiple resolution strategies.

Last Update: A few months ago

UPDATE: Incident 15010 - Datastore Indexing Issue

We're investigating an issue with Google App Engine Datastore beginning at Friday 2015-04-10 12:30 (all times are in US/Pacific). We will provide more information within one hour.

Last Update: A few months ago

RESOLVED: Incident 15002 - Compute Engine errors in the Developers Console

SUMMARY: On Wednesday 8 April 2015, 15% of requests to the Beta Compute Engine Instance Groups and Instance Group Manager APIs failed for a duration of 5 hours. Affected projects experienced issues accessing Compute Engine pages in the Developers Console during the outage. Not all projects were impacted. If your user experience was affected, we apologize; this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform's performance and availability.

DETAILED DESCRIPTION OF IMPACT: On Wednesday 8 April 2015 from 11:37 to 16:35 PDT, the Beta Compute Engine Instance Groups API and Instance Group Manager API returned error code 500 responses for 15% of the calls made. The Developers Console (http://console.developers.google.com/) pages for VM Instances, HTTP Load Balancing, and Metadata also failed for those projects due to their reliance on those APIs. Users in affected projects trying to access these pages were presented with the message "Service internal error occurred." and error code "internalError". The gcloud compute instances list command was unaffected, but gcloud commands using the Instance Groups and Instance Group Manager APIs were also affected.

ROOT CAUSE: A routine software upgrade to the Instance Groups and Instance Group Manager APIs increased the resource consumption of the backend servicing these APIs. As projects used the API, the backend exhausted its internal quota and was unable to service requests from new projects. Projects whose requests had succeeded before the quota was exhausted continued to use the API successfully.

REMEDIATION AND PREVENTION: Google engineers were automatically alerted that the Instance Groups and Instance Group Manager backends were approaching an internal quota within an hour of the start of the incident. Google engineers began to revert the change that led to the increased resource consumption, and simultaneously added capacity to the service to allow it to serve new projects. The increase in quota allowed requests to the API and the Developers Console to function correctly until the rollback was complete. To prevent similar issues occurring in future, Google engineers are rolling out improvements to monitoring, alerting, and testing procedures. In addition, Google engineers are increasing the resilience of the Developers Console by isolating issues in backend services and APIs.
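For API callers hit by intermittent 500 responses of the kind described above, a common client-side mitigation is to retry with exponential backoff. The sketch below is a generic illustration, not Google's internal remediation: it assumes the google-api-python-client library with Application Default Credentials configured, and "my-project" and "us-central1-b" are placeholder values.

# Minimal sketch: retry a Compute Engine API list call on transient HTTP 500s.
# Assumes google-api-python-client is installed and default credentials exist;
# "my-project" and "us-central1-b" are placeholders.
import time

from googleapiclient import discovery
from googleapiclient.errors import HttpError


def list_instance_groups(project="my-project", zone="us-central1-b", retries=5):
    compute = discovery.build("compute", "v1")
    for attempt in range(retries):
        try:
            return compute.instanceGroups().list(project=project, zone=zone).execute()
        except HttpError as err:
            # Retry only server-side errors; re-raise anything else.
            if err.resp.status >= 500 and attempt < retries - 1:
                time.sleep(2 ** attempt)
            else:
                raise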

Last Update: A few months ago

RESOLVED: Incident 15049 - Issue with Network Connectivity on April 10th, 2015

The problem with Google Compute Engine network connectivity was resolved as of 02:24 US/Pacific on 10th April 2015. We apologize for any issue this may have caused you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google and we are constantly working to improve the reliability of our systems.

Last Update: A few months ago

UPDATE: Incident 15049 - Issue with Network Connectivity on April 10th, 2015

We are currently investigating an issue with Google Compute Engine network connectivity. We will provide an update by Friday 10th April 2015 03:30 PDT.

Last Update: A few months ago

RESOLVED: Incident 15002 - Compute Engine errors in the Developers Console

The problem with the Developers Console should be resolved as of 16:35 PDT. We apologize for any issues this may have caused to you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.

Last Update: A few months ago

UPDATE: Incident 15002 - Compute Engine errors in the Developers Console

We are still investigating the issue with the Developers Console where some users are seeing errors when browsing the Compute Engine related pages of the Developers Console. The gcloud command line continues to function as normal. We will provide another status update by 17:00 PDT.

Last Update: A few months ago

UPDATE: Incident 15002 - Compute Engine errors in the Developers Console

We are still investigating the issue with the Developers Console where some users are seeing errors when browsing the Compute Engine related pages of the Developers Console. The gcloud command line continues to function as normal. We will provide another status update by 16:00 PDT.

Last Update: A few months ago

UPDATE: Incident 15002 - Compute Engine errors in the Developers Console

We are still investigating the issue with the Developers Console where some users are seeing errors when browsing the Compute Engine related pages of the Developers Console. The gcloud command line continues to function as normal. We will provide another status update by 15:30 PDT.

Last Update: A few months ago

UPDATE: Incident 15002 - Compute Engine errors in the Developers Console

We are still investigating the issue with the Developers Console and the Compute Engine related pages. We will provide another status update by 2015-04-08 14:50 US/Pacific.

Last Update: A few months ago

UPDATE: Incident 15002 - We are still investigating the issue with the Developers Console.

We are currently experiencing an issue with the Developers Console and some users are seeing errors when accessing Compute Engine specific pages. For everyone who is affected, we apologize for any inconvenience you may be experiencing.

Last Update: A few months ago

RESOLVED: Incident 15009 - Some App Engine apps experiencing quota denial 503 errors

SUMMARY: On Tuesday 24th March 2015, Google App Engine served elevated 503 errors on <1% of applications for a typical duration of 50 minutes. We know how important high uptime and low error rates are to you and your users, and we apologize for these errors. We are learning from this incident and are implementing several improvements to make our service more reliable.

DETAILED DESCRIPTION OF IMPACT: On Tuesday 24th March 2015 from 13:03 to 13:53 PDT, approximately 1% of requests to App Engine erroneously received an error 503 with a message "Over Quota. This application is temporarily over its serving quota. Please try again later." This occurred despite applications being within their quotas. The distribution of these errors was not uniform; some applications received a disproportionately high fraction of the total errors.

ROOT CAUSE: A latent bug in the App Engine quota handling code was triggered during a routine software update of the quota system. This resulted in App Engine returning over-quota errors to some applications that were not over quota. As App Engine software updates are rolled out progressively, only some applications were affected by the update before the issue was detected and remediated.

REMEDIATION AND PREVENTION: Google engineers directed traffic away from the affected App Engine infrastructure once the nature of the problem was understood. This led to the return of global 503 rates to pre-incident levels at 13:53. Google engineers identified a small number of applications that escaped the initial change and fixed their quota behavior manually at 14:45. In order to prevent recurrence of this issue, Google engineers will add monitoring and alerting for the quota issue that resulted in spurious 503 errors, create a new quick response protocol for handling erroneous quota responses, and will modify application quota behavior to tolerate novel quota system behavior with lower application impact.
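Clients calling an App Engine application that is intermittently returning spurious 503s, as happened in this incident, generally benefit from retrying with exponential backoff and jitter. The following sketch is a generic client-side illustration using the Python requests library; the URL is a placeholder, not a real endpoint.

# Minimal sketch: retry HTTP requests that fail with 503 "Over Quota" style
# errors, using exponential backoff with jitter. The URL is a placeholder.
import random
import time

import requests


def get_with_backoff(url="https://example-app.appspot.com/api", retries=5):
    for attempt in range(retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 503:
            return response
        # Back off before retrying; spurious 503s (as in this incident)
        # typically clear once traffic is redirected or the rollout is fixed.
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    return response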

Last Update: A few months ago

RESOLVED: Incident 15009 - Some App Engine apps experiencing quota denial 503 errors

The issue with 503 "Over serving quota" errors was resolved as of 14:52 PDT on Tuesday 2015-03-24, as previously indicated in the 15:00 update. We received a report that led us to mistakenly conclude that the issue had resurfaced; however, our reliability engineering team uncovered a different root cause for that report. If your app is receiving 503 "Over serving quota" errors after 14:53 PDT, please file a support case.

Last Update: A few months ago

RESOLVED: Incident 15009 - Some App Engine apps experiencing quota denial 503 errors

The issue with 503 "Over serving quota" errors has resurfaced. We are currently investigating it and will provide more information within an hour.

Last Update: A few months ago

UPDATE: Incident 15009 - Some App Engine apps experiencing quota denial 503 errors

The issue with App Engine "503: Over Serving Quota" errors was resolved as of 14:52 on Tuesday 2015-03-24. We apologize for any issues you may have experienced. We will provide a detailed analysis of this incident at https://status.cloud.google.com/incident/appengine/15009 once we have completed our internal investigation.

Last Update: A few months ago

UPDATE: Incident 15009 - Some App Engine apps experiencing quota denial 503 errors

Several applications running on Google App Engine have reported elevated rates of "503: Over Serving Quota" errors. The Google engineering and support teams are working with affected applications to diagnose and correct the cause of these elevated error rates. The overall error rate in Google App Engine increased by 1% at the time of the reports, and has since returned to approximately baseline levels. We expect to have another update at 15:30.

Last Update: A few months ago

UPDATE: Incident 15009 - Some App Engine apps experiencing quota denial 503 errors

Some App Engine apps are serving 503 "Over serving quota" errors. For everyone who is affected, we apologize for any inconvenience you or your customers are experiencing. We will provide an update by 15:00 with current details.

Last Update: A few months ago

UPDATE: Incident 15009 - Some App Engine apps experiencing quota denial 503 errors

We're investigating an issue with Google App Engine beginning at Tuesday 2015-03-24 13:05 (all times are in US/Pacific). We will provide more information within 20 minutes.

Last Update: A few months ago

RESOLVED: Incident 15007 - Intermittent errors for Managed VM (Beta) deployments

SUMMARY: On Monday 9 March 2015, deployments of Google App Engine applications running on Beta Managed VMs experienced intermittent failures for a duration of 75 minutes. While Managed VMs is still in Beta, we apologize for any impact this incident may have had on your service or application.

DETAILED DESCRIPTION OF IMPACT: Between 18:20 and 19:40 PDT, users of Google App Engine Managed VMs (Beta) experienced intermittent deployment failures. Users impacted by this incident would see gcloud or appcfg deploy commands return with error messages. Over this period, 81% of deployments returned errors.

ROOT CAUSE: The Managed VMs (Beta) deplo