Create and use clustered tables  |  BigQuery  |  Google Cloud

This document describes how to create and use clustered tables in BigQuery. For an overview of clustered table support in BigQuery, see Introduction to clustered tables.

Create clustered tables

You can create a clustered table by using the following methods:

  • Create a table from a query result:

    • Run a DDL CREATE TABLE AS SELECT statement.
    • Run a query that creates a clustered destination table.
  • Use a DDL CREATE TABLE statement with a CLUSTER BY clause containing a clustering_column_list.

  • Run the bq command-line tool bq mk command.

  • Make calls to the tables.insert API method.

  • Load data into BigQuery.

  • Use the client libraries.

Table naming

When you create a table in BigQuery, the table name must be unique per dataset. The table name can:

  • Contain characters with a total of up to 1,024 UTF-8 bytes.
  • Contain Unicode characters in category L (letter), M (mark), N (number), Pc (connector, including underscore), Pd (dash), Zs (space). For more information, see General Category.

The following are all examples of valid table names: table 01, ग्राहक, 00_お客様, étudiant-01.

Caveats:

  • Table names are case-sensitive by default. mytable and MyTable can coexist in the same dataset, unless they are part of a dataset with case-sensitivity turned off.
  • Some table names and table name prefixes are reserved. If you receive an error saying that your table name or prefix is reserved, then select a different name and try again.
  • If you include multiple dot operators (.) in a sequence, the duplicate operators are implicitly stripped.

    For example, this: project_name....dataset_name..table_name

    Becomes this: project_name.dataset_name.table_name

Required permissions

To create a table, you need the following IAM permissions:

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.jobs.create

Additionally, you might require the bigquery.tables.getData permission to access the data that you write to the table.

Each of the following predefined IAM roles includes the permissions that you need in order to create a table:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • roles/bigquery.user (includes the bigquery.jobs.create permission)
  • roles/bigquery.jobUser (includes the bigquery.jobs.create permission)

Additionally, if you have the bigquery.datasets.create permission, you can create and update tables in the datasets that you create.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Create an empty clustered table with a schema definition

You specify clustering columns when you create a table in BigQuery. After the table is created, you can modify the clustering columns; see Modify clustering specification for details.

Clustering columns must be top-level, non-repeated columns, and they must be one of the following simple data types:

  • DATE
  • BOOLEAN
  • GEOGRAPHY
  • INTEGER
  • NUMERIC
  • BIGNUMERIC
  • STRING
  • TIMESTAMP
  • RANGE (preview)

You can specify up to four clustering columns. When you specify multiple columns, the order of the columns determines how the data is sorted. For example, if the table is clustered by columns a, b, and c, the data is sorted in the same order: first by column a, then by column b, and then by column c. As a best practice, place the most frequently filtered or aggregated column first.

The order of your clustering columns also affects query performance and pricing. For more information about query best practices for clustered tables, see Querying clustered tables.
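For example, given the myclusteredtable table that is created later in this document (the filter value here is illustrative), a query that filters on the leading clustering column customer_id scans only the blocks that contain matching values rather than the entire table:

SELECT
  SUM(transaction_amount) AS total_amount
FROM
  mydataset.myclusteredtable
WHERE
  customer_id = 'customer_123'  -- filter on the first clustering column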

To create an empty clustered table with a schema definition:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project, and then select a dataset.
  3. In the Dataset info section, click add_box Create table.
  4. In the Create table panel, specify the following details:
    1. In the Source section, select Empty table in the Create table from list.
    2. In the Destination section, specify the following details:
      1. For Dataset, select the dataset in which you want to create the table.
      2. In the Table field, enter the name of the table that you want to create.
      3. Verify that the Table type field is set to Native table.
    3. In the Schema section, enter the schema definition. You can enter schema information manually by using one of the following methods:
      • Option 1: Click Edit as text and paste the schema in the form of a JSON array. When you use a JSON array, you generate the schema using the same process as creating a JSON schema file. You can view the schema of an existing table in JSON format by entering the following command:
         bq show --format=prettyjson dataset.table 
      • Option 2: Click add_box Add field and enter the table schema. Specify each field's Name, Type, and Mode.
    4. For Clustering order, enter between one and four comma-separated column names.
    5. Optional: In the Advanced options section, if you want to use a customer-managed encryption key, then select the Use a customer-managed encryption key (CMEK) option. By default, BigQuery encrypts customer content stored at rest by using a Google-managed key.
    6. Click Create table.

SQL

Use the CREATE TABLE DDL statement with the CLUSTER BY option. The following example creates a clustered table named myclusteredtable in mydataset:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE TABLE mydataset.myclusteredtable
    (
      customer_id STRING,
      transaction_amount NUMERIC
    )
    CLUSTER BY customer_id
    OPTIONS (
      description = 'a table clustered by customer_id');
  3. Click play_circle Run.

For more information about how to run queries, see Run an interactive query.

bq

Use the bq mk command with the following flags:

  • --table (or the -t shortcut).
  • --schema. You can supply the table's schema definition inline or use a JSON schema file.
  • --clustering_fields. You can specify up to four clustering columns.

Optional parameters include --expiration, --description, --time_partitioning_type, --time_partitioning_field, --time_partitioning_expiration, --destination_kms_key, and --label.

If you are creating a table in a project other than your default project, add the project ID to the dataset in the following format: project_id:dataset.

--destination_kms_key is not demonstrated here. For information about using --destination_kms_key, see customer-managed encryption keys.

Enter the following command to create an empty clustered table with a schema definition:

bq mk \
    --table \
    --expiration INTEGER1 \
    --schema SCHEMA \
    --clustering_fields CLUSTER_COLUMNS \
    --description "DESCRIPTION" \
    --label KEY:VALUE,KEY:VALUE \
    PROJECT_ID:DATASET.TABLE

Replace the following:

  • INTEGER1: the default lifetime, in seconds, for the table. The minimum value is 3,600 seconds (one hour). The expiration time evaluates to the current UTC time plus the integer value. If you set the table's expiration time when you create a table, the dataset's default table expiration setting is ignored. Setting this value deletes the table after the specified time.
  • SCHEMA: an inline schema definition in the format COLUMN:DATA_TYPE,COLUMN:DATA_TYPE, or the path to the JSON schema file on your local machine.
  • CLUSTER_COLUMNS: a comma-separated list of up to four clustering columns. The list cannot contain any spaces.
  • DESCRIPTION: a description of the table, in quotes.
  • KEY:VALUE: the key-value pair that represents a label. You can enter multiple labels using a comma-separated list.
  • PROJECT_ID: your project ID.
  • DATASET: a dataset in your project.
  • TABLE: the name of the table you're creating.

When you specify the schema on the command line, you cannot include a RECORD (STRUCT) type, you cannot include a column description, and you cannot specify the column's mode. All modes default to NULLABLE. To include descriptions, modes, and RECORD types, supply a JSON schema file instead.
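For reference, the following is a minimal sketch of such a JSON schema file (the field names here are illustrative and are not part of the examples in this document). It sets a mode and a description, and nests fields under a RECORD column:

[
  {
    "name": "customer_id",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Unique customer identifier"
  },
  {
    "name": "address",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {"name": "city", "type": "STRING"},
      {"name": "zipcode", "type": "STRING"}
    ]
  }
]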

Examples:

Enter the following command to create a clustered table named myclusteredtable in mydataset in your default project. The table's expiration is set to 2,592,000 seconds (30 days), the description is set to This is my clustered table, and the label is set to organization:development. The command uses the -t shortcut instead of --table.

The schema is specified inline as timestamp:timestamp,customer_id:string,transaction_amount:float. The specified clustering field customer_id is used to cluster the table.

bq mk \
    -t \
    --expiration 2592000 \
    --schema 'timestamp:timestamp,customer_id:string,transaction_amount:float' \
    --clustering_fields customer_id \
    --description "This is my clustered table" \
    --label organization:development \
    mydataset.myclusteredtable

Enter the following command to create a clustered table named myclusteredtable in myotherproject, not your default project. The description is set to This is my clustered table, and the label is set to organization:development. The command uses the -t shortcut instead of --table. This command does not specify a table expiration. If the dataset has a default table expiration, it is applied. If the dataset has no default table expiration, the table never expires.

The schema is specified in a local JSON file: /tmp/myschema.json. The customer_id field is used to cluster the table.

bq mk \
    -t \
    --schema /tmp/myschema.json \
    --clustering_fields=customer_id \
    --description "This is my clustered table" \
    --label organization:development \
    myotherproject:mydataset.myclusteredtable

After the table is created, you can update the table's description and labels.
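For example, assuming the table created above, the following command updates the description and replaces a label with bq update (the new values here are illustrative):

bq update \
    --description "Updated table description" \
    --set_label organization:production \
    mydataset.myclusteredtable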

Terraform

Use the google_bigquery_table resource.

The following example creates a table named mytable that is clustered on the ID and Created columns:

resource "google_bigquery_dataset" "default" { dataset_id = "mydataset" default_partition_expiration_ms = 2592000000 # 30 days default_table_expiration_ms = 31536000000 # 365 days description = "dataset description" location = "US" max_time_travel_hours = 96 # 4 days labels = { billing_group = "accounting", pii = "sensitive" }}resource "google_bigquery_table" "default" { dataset_id = google_bigquery_dataset.default.dataset_id table_id = "mytable" deletion_protection = false # set to "true" in production clustering = ["ID", "Created"] schema = <<EOF[ { "name": "ID", "type": "INT64", "description": "Item ID" }, { "name": "Item", "type": "STRING", "mode": "NULLABLE" }, { "name": "Created", "type": "TIMESTAMP" }]EOF}

To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell

  1. Launch Cloud Shell.
  2. Set the default Google Cloud project where you want to apply your Terraform configurations.

    You only need to run this command once per project, and you can run it in any directory.

    export GOOGLE_CLOUD_PROJECT=PROJECT_ID

    Environment variables are overridden if you set explicit values in the Terraform configuration file.

Prepare the directory

Each Terraform configuration file must have its own directory (alsocalled a root module).

  1. In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension—for example main.tf. In this tutorial, the file is referred to as main.tf.
    mkdir DIRECTORY && cd DIRECTORY && touch main.tf
  2. If you are following a tutorial, you can copy the sample code in each section or step.

    Copy the sample code into the newly created main.tf.

    Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.

  3. Review and modify the sample parameters to apply to your environment.
  4. Save your changes.
  5. Initialize Terraform. You only need to do this once per directory.
    terraform init

    Optionally, to use the latest Google provider version, include the -upgrade option:

    terraform init -upgrade

Apply the changes

  1. Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:
    terraform plan

    Make corrections to the configuration as necessary.

  2. Apply the Terraform configuration by running the following command and entering yes at the prompt:
    terraform apply

    Wait until Terraform displays the "Apply complete!" message.

  3. Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.

API

Call the tables.insert method with a defined table resource that specifies the clustering.fields property and the schema property.
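For illustration, a minimal Tables resource body for the table used in the SQL example might look like the following sketch (not a complete request; the values are assumptions):

{
  "tableReference": {
    "projectId": "PROJECT_ID",
    "datasetId": "mydataset",
    "tableId": "myclusteredtable"
  },
  "schema": {
    "fields": [
      {"name": "customer_id", "type": "STRING"},
      {"name": "transaction_amount", "type": "NUMERIC"}
    ]
  },
  "clustering": {
    "fields": ["customer_id"]
  }
}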

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

schema = [
    bigquery.SchemaField("full_name", "STRING"),
    bigquery.SchemaField("city", "STRING"),
    bigquery.SchemaField("zipcode", "INTEGER"),
]

table = bigquery.Table(table_id, schema=schema)
table.clustering_fields = ["city", "zipcode"]
table = client.create_table(table)  # Make an API request.
print(
    "Created clustered table {}.{}.{}".format(
        table.project, table.dataset_id, table.table_id
    )
)

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/bigquery"
)

// createTableClustered demonstrates creating a BigQuery table with advanced properties like
// partitioning and clustering features.
func createTableClustered(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydatasetid"
	// tableID := "mytableid"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	sampleSchema := bigquery.Schema{
		{Name: "timestamp", Type: bigquery.TimestampFieldType},
		{Name: "origin", Type: bigquery.StringFieldType},
		{Name: "destination", Type: bigquery.StringFieldType},
		{Name: "amount", Type: bigquery.NumericFieldType},
	}
	metaData := &bigquery.TableMetadata{
		Schema: sampleSchema,
		TimePartitioning: &bigquery.TimePartitioning{
			Field:      "timestamp",
			Expiration: 90 * 24 * time.Hour,
		},
		Clustering: &bigquery.Clustering{
			Fields: []string{"origin", "destination"},
		},
	}
	tableRef := client.Dataset(datasetID).Table(tableID)
	if err := tableRef.Create(ctx, metaData); err != nil {
		return err
	}
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Clustering;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;
import com.google.common.collect.ImmutableList;

public class CreateClusteredTable {

  public static void runCreateClusteredTable() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    createClusteredTable(datasetName, tableName);
  }

  public static void createClusteredTable(String datasetName, String tableName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);

      TimePartitioning partitioning = TimePartitioning.of(TimePartitioning.Type.DAY);

      Schema schema =
          Schema.of(
              Field.of("name", StandardSQLTypeName.STRING),
              Field.of("post_abbr", StandardSQLTypeName.STRING),
              Field.of("date", StandardSQLTypeName.DATE));

      Clustering clustering =
          Clustering.newBuilder().setFields(ImmutableList.of("name", "post_abbr")).build();

      StandardTableDefinition tableDefinition =
          StandardTableDefinition.newBuilder()
              .setSchema(schema)
              .setTimePartitioning(partitioning)
              .setClustering(clustering)
              .build();

      TableInfo tableInfo = TableInfo.newBuilder(tableId, tableDefinition).build();

      bigquery.create(tableInfo);
      System.out.println("Clustered table created successfully");
    } catch (BigQueryException e) {
      System.out.println("Clustered table was not created. \n" + e.toString());
    }
  }
}

Create a clustered table from a query result

There are two ways to create a clustered table from a query result:

  • Write the results to a new destination table and specify the clustering columns.
  • Use a DDL CREATE TABLE AS SELECT statement. For more information about this method, see Creating a clustered table from the result of a query on the Using data definition language statements page.

You can create a clustered table by querying either a partitioned table or a non-partitioned table. You cannot change an existing table to a clustered table by using query results.

When you create a clustered table from a query result, you must use GoogleSQL. Currently, legacy SQL is not supported for querying clustered tables or for writing query results to clustered tables.

SQL

To create a clustered table from a query result, use the CREATE TABLE DDL statement with the CLUSTER BY option. The following example creates a new table clustered by customer_id by querying an existing unclustered table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE TABLE mydataset.clustered_table
    (
      customer_id STRING,
      transaction_amount NUMERIC
    )
    CLUSTER BY customer_id
    AS (
      SELECT * FROM mydataset.unclustered_table
    );
  3. Click play_circle Run.

For more information about how to run queries, see Run an interactive query.

bq

Enter the following command to create a new, clustered destination table from a query result:

bq --location=LOCATION query \
    --use_legacy_sql=false 'QUERY'

Replace the following:

  • LOCATION: the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • QUERY: a query in GoogleSQL syntax. Currently, you cannot use legacy SQL to query clustered tables or to write query results to clustered tables. The query can contain a CREATE TABLE DDL statement that specifies the options for creating your clustered table. You can use DDL rather than specifying the individual command-line flags.

Examples:

Enter the following command to write query results to a clustered destination table named myclusteredtable in mydataset. mydataset is in your default project. The query retrieves data from a non-partitioned table: mytable. The table's customer_id column is used to cluster the table. The table's timestamp column is used to create a partitioned table.

bq query --use_legacy_sql=false \
    'CREATE TABLE mydataset.myclusteredtable
       PARTITION BY DATE(timestamp)
       CLUSTER BY customer_id
     AS (
       SELECT *
       FROM `mydataset.mytable`
     );'

API

To save query results to a clustered table, call the jobs.insert method, configure a query job, and include a CREATE TABLE DDL statement that creates your clustered table.

Specify your location in the location property in the jobReference section of the job resource.
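As a rough sketch of the same flow using the Python client library (which calls jobs.insert under the hood; the table and dataset names follow the earlier examples):

from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE mydataset.myclusteredtable
(
  customer_id STRING,
  transaction_amount NUMERIC
)
CLUSTER BY customer_id
"""

# location corresponds to the jobReference.location property of the job resource.
query_job = client.query(ddl, location="US")
query_job.result()  # Waits for the DDL statement to finish.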

Create a clustered table when you load data

You can create a clustered table by specifying clustering columns when you load data into a new table. You do not need to create an empty table before loading data into it. You can create the clustered table and load your data at the same time.

For more information about loading data, see Introduction to loading data into BigQuery.

To define clustering as part of a load job:

SQL

Use the LOAD DATA statement. The following example loads AVRO data to create a table that is partitioned by the transaction_date field and clustered by the customer_id field. It also configures the partitions to expire after three days.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    LOAD DATA INTO mydataset.mytable
    PARTITION BY transaction_date
    CLUSTER BY customer_id
      OPTIONS (
        partition_expiration_days = 3)
    FROM FILES(
      format = 'AVRO',
      uris = ['gs://bucket/path/file.avro']);
  3. Click play_circle Run.

For more information about how to run queries, see Run an interactive query.

API

To define a clustering configuration when creating a table through a load job, you can populate the Clustering properties for the table.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importClusteredTable demonstrates creating a table from a load job and defining partitioning and clustering
// properties.
func importClusteredTable(projectID, destDatasetID, destTableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/sample-transactions/transactions.csv")
	gcsRef.SkipLeadingRows = 1
	gcsRef.Schema = bigquery.Schema{
		{Name: "timestamp", Type: bigquery.TimestampFieldType},
		{Name: "origin", Type: bigquery.StringFieldType},
		{Name: "destination", Type: bigquery.StringFieldType},
		{Name: "amount", Type: bigquery.NumericFieldType},
	}
	loader := client.Dataset(destDatasetID).Table(destTableID).LoaderFrom(gcsRef)
	loader.TimePartitioning = &bigquery.TimePartitioning{
		Field: "timestamp",
	}
	loader.Clustering = &bigquery.Clustering{
		Fields: []string{"origin", "destination"},
	}
	loader.WriteDisposition = bigquery.WriteEmpty

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Clustering;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TimePartitioning;
import com.google.common.collect.ImmutableList;

public class LoadTableClustered {

  public static void runLoadTableClustered() throws Exception {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "/path/to/file.csv";
    loadTableClustered(datasetName, tableName, sourceUri);
  }

  public static void loadTableClustered(String datasetName, String tableName, String sourceUri)
      throws Exception {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);

      Schema schema =
          Schema.of(
              Field.of("name", StandardSQLTypeName.STRING),
              Field.of("post_abbr", StandardSQLTypeName.STRING),
              Field.of("date", StandardSQLTypeName.DATE));

      TimePartitioning partitioning = TimePartitioning.of(TimePartitioning.Type.DAY);

      Clustering clustering =
          Clustering.newBuilder().setFields(ImmutableList.of("name", "post_abbr")).build();

      LoadJobConfiguration loadJobConfig =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.csv())
              .setSchema(schema)
              .setTimePartitioning(partitioning)
              .setClustering(clustering)
              .build();

      Job loadJob = bigquery.create(JobInfo.newBuilder(loadJobConfig).build());

      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = loadJob.waitFor();

      // Check for errors
      if (completedJob == null) {
        throw new Exception("Job not executed since it no longer exists.");
      } else if (completedJob.getStatus().getError() != null) {
        // You can also look at queryJob.getStatus().getExecutionErrors() for all
        // errors, not just the latest one.
        throw new Exception(
            "BigQuery was unable to load into the table due to an error: \n"
                + loadJob.getStatus().getError());
      }
      System.out.println("Data successfully loaded into clustered table during load job");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Data not loaded into clustered table during load job \n" + e.toString());
    }
  }
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("timestamp", bigquery.SqlTypeNames.TIMESTAMP),
        bigquery.SchemaField("origin", bigquery.SqlTypeNames.STRING),
        bigquery.SchemaField("destination", bigquery.SqlTypeNames.STRING),
        bigquery.SchemaField("amount", bigquery.SqlTypeNames.NUMERIC),
    ],
    time_partitioning=bigquery.TimePartitioning(field="timestamp"),
    clustering_fields=["origin", "destination"],
)

job = client.load_table_from_uri(
    ["gs://cloud-samples-data/bigquery/sample-transactions/transactions.csv"],
    table_id,
    job_config=job_config,
)
job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)

Control access to clustered tables

To configure access to tables and views, you can grant an IAM role to an entity at the following levels, listed in order of range of resources allowed (largest to smallest):

  • a high level in the Google Cloud resource hierarchy, such as the project, folder, or organization level
  • the dataset level
  • the table or view level

You can also restrict data access within tables by using the following methods:

  • column-level security
  • column data masking
  • row-level security

Access with any resource protected by IAM is additive. For example, if an entity does not have access at a high level, such as a project, you could grant the entity access at the dataset level, and then the entity has access to the tables and views in the dataset. Similarly, if the entity does not have access at the high level or the dataset level, you could grant the entity access at the table or view level.

Granting IAM roles at a higher level in the Google Cloud resource hierarchy, such as the project, folder, or organization level, gives the entity access to a broad set of resources. For example, granting a role to an entity at the project level gives that entity permissions that apply to all datasets throughout the project.

Granting a role at the dataset level specifies the operations an entity is allowed to perform on tables and views in that specific dataset, even if the entity does not have access at a higher level. For information on configuring dataset-level access controls, see Controlling access to datasets.

Granting a role at the table or view level specifies the operations an entity is allowed to perform on specific tables and views, even if the entity does not have access at a higher level. For information on configuring table-level access controls, see Controlling access to tables and views.
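For example, table-level access can also be granted with a GRANT DCL statement. The following sketch (the table and user are assumed) gives one user read access to a single clustered table:

GRANT `roles/bigquery.dataViewer`
ON TABLE mydataset.myclusteredtable
TO "user:analyst@example.com";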

You can also create IAM custom roles. If you create a custom role, the permissions you grant depend on the specific operations you want the entity to be able to perform.

You can't set a "deny" permission on any resource protected byIAM.

For more information about roles and permissions, see Understanding roles in the IAM documentation and the BigQuery IAM roles and permissions page.

Use clustered tables

Get information about clustered tables

You can get information about tables in the following ways:

  • Using the Google Cloud console.
  • Using the bq command-line tool's bq show command.
  • Calling the tables.get API method.
  • Querying INFORMATION_SCHEMA views.

Required permissions

At a minimum, to get information about tables, you must be granted bigquery.tables.get permissions. The following predefined IAM roles include bigquery.tables.get permissions:

  • bigquery.metadataViewer
  • bigquery.dataViewer
  • bigquery.dataOwner
  • bigquery.dataEditor
  • bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to get information about tables in a dataset.

For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Get clustered table information

To view information about a clustered table:

Console

  1. In the Google Cloud console, go to the Resources pane. Click your dataset name to expand it, and then click the table name that you want to view.

  2. Click Details. This page displays the table's details, including the clustering columns.


SQL

For clustered tables, you can query the CLUSTERING_ORDINAL_POSITION column in the INFORMATION_SCHEMA.COLUMNS view to find the 1-indexed offset of the column within the table's clustering columns:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE TABLE mydataset.data (column1 INT64, column2 INT64)
    CLUSTER BY column1, column2;

    SELECT
      column_name, clustering_ordinal_position
    FROM
      mydataset.INFORMATION_SCHEMA.COLUMNS;
  3. Click play_circle Run.

For more information about how to run queries, see Run an interactive query.

The clustering ordinal position is 1 for column1 and 2 for column2. More table metadata is available through the TABLES, TABLE_OPTIONS, COLUMNS, and COLUMN_FIELD_PATH views in INFORMATION_SCHEMA.

bq

Issue the bq show command to display all table information. Use the --schema flag to display only table schema information. The --format flag can be used to control the output.

If you are getting information about a table in a project other than your default project, add the project ID to the dataset in the following format: project_id:dataset.

bq show \
    --schema \
    --format=prettyjson \
    PROJECT_ID:DATASET.TABLE

Replace the following:

  • PROJECT_ID: your project ID
  • DATASET: the name of the dataset
  • TABLE: the name of the table

Examples:

Enter the following command to display all information about myclusteredtable in mydataset. mydataset is in your default project.

bq show --format=prettyjson mydataset.myclusteredtable

The output should look like the following:

{ "clustering": { "fields": [ "customer_id" ] },...}

API

Call the bigquery.tables.getmethod and provide any relevant parameters.

List clustered tables in a dataset

You can list clustered tables in datasets in the following ways:

  • Using the Google Cloud console.
  • Using the bq command-line tool's bq ls command.
  • Calling the tables.list API method.
  • Using the client libraries.
  • Querying the CLUSTERING_ORDINAL_POSITION column in the INFORMATION_SCHEMA.COLUMNS view.

The permissions required to list clustered tables and the steps to list them are the same as for standard tables. For more information about listing tables, see Listing tables in a dataset.
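For example, the following query (a sketch; the dataset name is assumed) lists each clustered table in mydataset together with its clustering columns, in clustering order:

SELECT
  table_name,
  column_name,
  clustering_ordinal_position
FROM
  mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE
  clustering_ordinal_position IS NOT NULL
ORDER BY
  table_name, clustering_ordinal_position;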

Modify clustering specification

You can change or remove a table's clustering specification, or change the set of clustered columns in a clustered table. This method of updating the clustering column set is useful for tables that use continuous streaming inserts, because those tables cannot be easily swapped by other methods.

Follow these steps to apply a new clustering specification to unpartitioned or partitioned tables.

  1. In the bq tool, update the clustering specification of your table to match the new clustering:

     bq update --clustering_fields=CLUSTER_COLUMN DATASET.ORIGINAL_TABLE 

    Replace the following:

    • CLUSTER_COLUMN: the column you are clustering on—for example, mycolumn
    • DATASET: the name of the dataset containing the table—for example, mydataset
    • ORIGINAL_TABLE: the name of your original table—for example, mytable

    You can also call the tables.update or tables.patch API method to modify the clustering specification.

  2. To cluster all rows according to the new clustering specification, run the following UPDATE statement:

    UPDATE DATASET.ORIGINAL_TABLE SET CLUSTER_COLUMN=CLUSTER_COLUMN WHERE true
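As a concrete sketch using names from the earlier examples (assumed, not prescriptive), the following switches myclusteredtable to cluster on transaction_amount and then rewrites all rows so that they follow the new specification:

bq update --clustering_fields=transaction_amount mydataset.myclusteredtable

UPDATE mydataset.myclusteredtable
SET transaction_amount = transaction_amount
WHERE true;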

Table security

To control access to tables in BigQuery, see Introduction to table access controls.

What's next

  • For information about querying clustered tables, see Querying clustered tables.
  • For an overview of partitioned table support in BigQuery, see Introduction to partitioned tables.
  • To learn how to create partitioned tables, see Creating partitioned tables.
  • To see an overview of INFORMATION_SCHEMA, go to Introduction to BigQuery INFORMATION_SCHEMA.

