GCP BigQuery

Use Case

BigQuery on Google Cloud Platform (GCP) is used by developers for ad hoc analysis, building operational dashboards, and processing data with stored procedures. The cost of this analysis is determined by the amount of data processed by the queries issued against BigQuery. Under the current on-demand pricing model, this price ranges from $5/TB in the US (multi-region) to $6/TB in Belgium (europe-west1).

This essentially means that developers who fire large numbers of unoptimized queries processing large amounts of data can rack up thousands of dollars in costs. This scenario is quite common in large organizations, especially those on pay-as-you-go billing.

When it comes to cost monitoring at the user level, GCP does not currently expose granular BigQuery usage. The billing dashboard shows that the cost is coming from BigQuery, but it cannot be drilled down to identify the specific users and activities causing the cost spikes, so the issue cannot be remediated.

An Innovative Solution

One workaround is to use the built-in BigQuery INFORMATION_SCHEMA metadata tables, and more specifically the BigQuery jobs metadata table. The INFORMATION_SCHEMA.JOBS_BY_PROJECT table can be queried to see project-level query history for the past 180 days.

The metadata, which is described in more detail in the next section, includes the user email, the job ID, the amount of data queried, and the query execution time, among other things. While query cost is not explicitly a column, it can be calculated from the column showing bytes billed, since query cost is directly proportional to the number of bytes billed.
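
To make this proportionality concrete, the calculation can be sketched directly in SQL. The query below assumes the $5/TB US (multi-region) rate quoted above and treats 1 TB as 10^12 bytes; adjust both to match your region's billing terms:

SELECT
  user_email,
  job_id,
  total_bytes_billed,
  -- Assumed rate: $5/TB (US multi-region); adjust for your region.
  total_bytes_billed / POW(10, 12) * 5 AS approx_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
ORDER BY approx_cost_usd DESC;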

It is important to note that the metadata tables are independent of Stackdriver logs, which means that data such as IP addresses and device information is not available. They are a set of tables natively available in BigQuery and do not require Stackdriver logs to be exported to BigQuery.

Metadata in INFORMATION_SCHEMA.JOBS_BY_PROJECT

The following information is available in the BigQuery job metadata table, as described in Google’s documentation here:

Column name | Data type | Value
creation_time | TIMESTAMP | (Partitioning column) Creation time of this job. Partitioning is based on the UTC time of this timestamp.
project_id | STRING | (Clustering column) ID of the project.
project_number | INTEGER | Number of the project.
folder_numbers | RECORD | Google Accounts and ID Administration (GAIA) IDs of folders in a project’s ancestry, in order starting with the leaf folder closest to the project. This column is only populated in JOBS_BY_FOLDER.
user_email | STRING | (Clustering column) Email address or service account of the user who ran the job.
job_id | STRING | ID of the job. For example, bquxjob_1234.
job_type | STRING | The type of the job. Can be QUERY, LOAD, EXTRACT, COPY, or null. Job type null indicates an internal job, such as script job statement evaluation or materialized view refresh.
statement_type | STRING | The type of query statement, if valid. For example, SELECT, INSERT, UPDATE, or DELETE.
priority | STRING | The priority of this job.
start_time | TIMESTAMP | Start time of this job.
end_time | TIMESTAMP | End time of this job.
query | STRING | SQL query text. Note: The JOBS_BY_ORGANIZATION view does not have the query column.
state | STRING | Running state of the job. Valid states include PENDING, RUNNING, and DONE.
reservation_id | STRING | Name of the primary reservation assigned to this job, if applicable. If your job ran in a project that is assigned to a reservation, it would follow this format: reservation-admin-project:reservation-location.reservation-name.
total_bytes_processed | INTEGER | Total bytes processed by the job.
total_slot_ms | INTEGER | Slot-milliseconds for the job over its entire duration.
error_result | RECORD | Details of the error (if any) as an ErrorProto.
cache_hit | BOOLEAN | Whether the query results of this job were from a cache.
destination_table | RECORD | Destination table for results (if any).
referenced_tables | RECORD | Array of tables referenced by the job.
labels | RECORD | Array of labels applied to the job as key-value strings.
timeline | RECORD | Query timeline of the job. Contains snapshots of query execution.
job_stages | RECORD | Query stages of the job.
total_bytes_billed | INTEGER | Total bytes billed by the job.

Note that the folder_numbers column is not populated for the JOBS_BY_PROJECT table, as described in the relevant “Value” entry of the table above.
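
To illustrate how these columns can be combined, the sketch below (assuming the US multi-region) surfaces the ten completed queries that consumed the most slot time, along with whether each was served from cache:

SELECT
  user_email,
  job_id,
  statement_type,
  total_slot_ms,
  cache_hit
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_slot_ms DESC
LIMIT 10;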

IAM

The minimum roles and permissions required to query JOBS_BY_PROJECT are the following:

  • bigquery.jobs.listAll
  • roles/bigquery.jobUser (includes bigquery.jobs.create, resourcemanager.projects.get, resourcemanager.projects.list)

Details about these IAM permissions can be found in Google’s documentation here.

These permissions grant neither view nor write access to existing user datasets. The bigquery.jobs.listAll permission gives access to project-level job history, while roles/bigquery.jobUser allows queries to be fired only against the tables accessible to the user. In this case, bigquery.jobs.listAll exposes the JOBS_BY_PROJECT metadata table, so together these permissions allow queries to be run against that table.

Demo

IAM in demo project

Only the minimum permissions described in the “IAM” section were applied to a user in a project, as can be seen below:

[Screenshot: iam_demo_project — IAM roles assigned to the demo user]

[Screenshot: acl_tech_project — dataset access controls in the demo project]

Query in demo project

The following query fetches all the data from the job history metadata table and displays the top five results in descending order of bytes processed, which is directly proportional to cost. It is a slight modification of the query described in Google’s documentation here.

SELECT
  *
FROM `region-europe-west2`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
ORDER BY total_bytes_processed DESC
LIMIT 5;

This query can also be modified to include a calculated column in which total_bytes_processed is first converted to TB and then multiplied by the per-TB cost of the region.
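
A sketch of that modification, aggregated per user to surface exactly who is driving the cost, might look as follows. The $6/TB figure is a placeholder taken from the europe-west1 rate quoted earlier; substitute the actual on-demand rate for your region:

SELECT
  user_email,
  SUM(total_bytes_processed) / POW(10, 12) AS tb_processed,
  -- Placeholder rate of $6/TB; substitute your region's on-demand price.
  SUM(total_bytes_processed) / POW(10, 12) * 6 AS approx_cost_usd
FROM `region-europe-west2`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
GROUP BY user_email
ORDER BY approx_cost_usd DESC;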

As can be seen in the screenshot, other existing datasets in this project are not visible to the user firing the query:

[Screenshot: existing_dataset_project — existing datasets in the project are not visible to the querying user]

The INFORMATION_SCHEMA.JOBS_BY_PROJECT table is qualified with a region. The example above shows query history in europe-west2, but the qualifier can be changed to region-us to get the metadata from the US multi-region.
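
For instance, pointing the same query at the US multi-region only requires changing the qualifier:

SELECT
  *
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
ORDER BY total_bytes_processed DESC
LIMIT 5;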

This query can also be further optimized to fetch only the relevant columns, which would reduce query cost, and further operational dashboards or views can then be built on top of it, as sketched below.
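
One possible sketch is to materialize the pruned history into a regular table that a dashboard can read from. The monitoring dataset and table names below are hypothetical, and writing the table requires permissions beyond the minimal ones used in this demo:

-- Hypothetical destination; the dataset must live in the same region
-- as the INFORMATION_SCHEMA source (europe-west2 here).
CREATE OR REPLACE TABLE monitoring.bq_job_costs AS
SELECT
  creation_time,
  user_email,
  job_id,
  query,
  total_bytes_billed
FROM `region-europe-west2`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY';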

Conclusion

Given that GCP does not offer this level of granularity by default, it is in most cases nearly impossible to determine the source of BigQuery costs, let alone remediate them. In such cases, the metadata tables natively available in BigQuery can be extremely useful, as they show exactly which user triggered a query, when they triggered it, the exact query text, and the cost of running it.

Remarkably, the fine-grained permissions available in GCP ensure that this capability can be confined to operational and admin teams in such a way that they can only use it to monitor BigQuery costs, without being able to view or query other existing data.
