Data Ingestion with COPY INTO in Snowflake (Local file, AWS, Azure, GCP)

Chayan Shrang Raj
11 min read · Dec 9, 2023


Enterprises are experiencing explosive growth in their data estates and are leveraging high-performance, cost-effective data platforms such as Snowflake to turn that data into business insights. Snowflake can ingest structured, semi-structured and unstructured data, arriving in batches or via streaming; in this article I focus on the COPY INTO command for batch-loading CSV files from a local file system and from AWS, Azure and GCP cloud storage.

Image credits: Author

Let’s connect — LinkedIn
Let’s build something — GitHub

This article is divided into 5 scenarios:

  1. Load a CSV file from the local file system.
  2. Load multiple CSV files from the local file system.
  3. Load a CSV file from an AWS S3 bucket.
  4. Load a CSV file from Azure Data Lake Storage Gen2.
  5. Load a CSV file from GCP Cloud Storage.

Feel free to navigate to the section you want to understand. All the SQL scripts and sample data files can be found here:

https://github.com/chayansraj/Snowflake-data-ingestion-hands-on-tutorial/tree/main

The first thing we will take on is how to load structured data, which comes in various shapes and sizes, such as comma-delimited or tab-delimited files, and is also referred to as flat files.

One of the most common formats is CSV and, according to Snowflake, it is the most performant, efficient and cost-effective format to load, provided some best practices are kept in mind.

There are many different ways to load data from different sources, so we will break it down into the scenarios below:

Scenario 1:
- File format: CSV
- Data Load Technique: COPY INTO
- Source: Local file system
- Loading type: Single file, Bulk load

Prerequisites: Snowflake enterprise account, SnowSQL CLI, sample data file

Like other data warehouses, Snowflake uses staging areas, called ‘stages’, which offer much more customization than older data warehouses. The simplest way to understand a stage is as a conveyor belt: it gives us a platform to move our data from one place to another.
There are three kinds of stages in Snowflake (more here):
- User stage
- Table stage
- Named stage

Each of them plays a different role depending on the use case. We will use an internal named stage, because files loaded from a local file system must pass through an internal stage, and named stages are recommended when you plan regular data loads that could involve multiple users and/or tables.

First, we need to create a non-system-defined role (best practice). Then we grant this custom role the privileges needed to create databases, schemas and tables. Finally, we create an XSMALL virtual warehouse to process the load, since the sample CSV file is not large.

First we use the ACCOUNTADMIN role to create a new DATA_LOADER role, then we build a role hierarchy by granting DATA_LOADER to the SYSADMIN role (best practice), and finally we grant the role the specific privileges it needs to create and query various database objects.

Snowflake follows Role-Based Access Control (RBAC) and Discretionary Access Control (DAC), which provide strong security and flexibility for each account and deserve a whole separate article of their own.

Image credits: Author
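Since the exact SQL is in the screenshot, here is a minimal sketch of what it might look like. The DATA_LOADER role name comes from this article; the user name is a placeholder you replace with your own.

USE ROLE ACCOUNTADMIN;

-- Custom role for data loading (role name taken from this article)
CREATE ROLE IF NOT EXISTS DATA_LOADER;

-- Attach the custom role to the system role hierarchy (best practice)
GRANT ROLE DATA_LOADER TO ROLE SYSADMIN;

-- Privilege needed later to create the database, schema and table
GRANT CREATE DATABASE ON ACCOUNT TO ROLE DATA_LOADER;

-- Placeholder: grant the role to your own user so you can activate it
GRANT ROLE DATA_LOADER TO USER <your_user>;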

COPY INTO uses a predefined, customer-managed virtual warehouse to read the data from the remote storage, optionally transform its structure, and write it to native Snowflake tables. Next, we create an XSMALL virtual warehouse.

Image credits: Author
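A minimal sketch of the warehouse creation; LOAD_WH is a placeholder name, only the XSMALL size is taken from the article:

USE ROLE SYSADMIN;

CREATE WAREHOUSE IF NOT EXISTS LOAD_WH   -- placeholder name
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60                      -- suspend after 60 seconds of inactivity to save credits
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Allow the custom role to use the warehouse
GRANT USAGE ON WAREHOUSE LOAD_WH TO ROLE DATA_LOADER;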

After this step, we have a role with the necessary privileges and a virtual warehouse to process the data load. Next, we create a database, schema and table to store the data landing from the named stage. The sample data file contains 10K customer records with 15 columns.

Image credits: Author
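A sketch of those objects, assuming placeholder names SALES_DB and LANDING for the database and schema. The customer_csv table name is used later in the article, but its 15 column definitions are not shown, so only a few illustrative columns appear here:

USE ROLE DATA_LOADER;
USE WAREHOUSE LOAD_WH;

CREATE DATABASE IF NOT EXISTS SALES_DB;        -- placeholder name
CREATE SCHEMA IF NOT EXISTS SALES_DB.LANDING;  -- placeholder name

CREATE TABLE IF NOT EXISTS SALES_DB.LANDING.customer_csv (
    customer_id INTEGER,        -- illustrative columns; the sample file
    first_name  VARCHAR,        -- has 15 columns in total, the rest are
    last_name   VARCHAR,        -- omitted here
    email       VARCHAR
);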

After creating the database, schema and table, we will create a FILE FORMAT and an internal named STAGE. Creating a file format is not strictly necessary for CSV, since it is the default file format, but a custom file format is a best practice: it gives granular control over how differently structured data is processed and exposes many user-defined parameters. Stages and file formats both accept many parameters for specific use cases, so it is worth exploring them to gain more control over your loads.

Image credits: Author
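A sketch of the file format and the internal named stage. The stage name csv_load appears later in the article; the file format name csv_ff is a placeholder:

USE SCHEMA SALES_DB.LANDING;

-- Custom CSV file format (placeholder name)
CREATE OR REPLACE FILE FORMAT csv_ff
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  NULL_IF = ('NULL', 'null', '');

-- Internal named stage used by the PUT and COPY INTO steps below
CREATE OR REPLACE STAGE csv_load
  FILE_FORMAT = csv_ff;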

File staging information is required during both steps in the data loading process:

  1. You must specify an internal stage in the PUT command when uploading files to Snowflake.
  2. You must specify the same stage in the COPY INTO <table> command when loading data into a table from the staged files.

Considering the first point, we need to use SnowSQL, since we are loading local files and it provides a command-line interface to stage the data. After setting up SnowSQL, log in to your Snowflake account using your organization name, account name and username as shown below:

snowsql -a <organization name>-<account name> -u username

This will prompt for your password, after which SnowSQL is connected to your Snowflake account. Next, we switch to the custom role we created above, activate the virtual warehouse and set the database.schema context.

Image credits: Author
Image credits: Author
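Inside the SnowSQL session, setting the context looks roughly like this (the warehouse, database and schema names are the placeholders used earlier):

USE ROLE DATA_LOADER;
USE WAREHOUSE LOAD_WH;
USE SCHEMA SALES_DB.LANDING;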

Next, we execute the PUT command, which stages the data into the named stage ‘csv_load’; its parameters can be adjusted as needed. Using AUTO_COMPRESS is a best practice, since Snowflake then compresses the file in gzip format, which allows for fast and efficient loading.
This step puts our data into the staging area and, as an added benefit, we could query this data in the stage before loading it into our tables and preprocess it to remove any inconsistencies.

Image credits: Author
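A sketch of the PUT command; the local file path is a placeholder for wherever the sample CSV lives on your machine:

-- Upload the local file to the internal named stage; AUTO_COMPRESS gzips it on upload
PUT file:///path/to/customer_data.csv @csv_load
  AUTO_COMPRESS = TRUE
  PARALLEL = 4;   -- number of parallel upload threads

-- Optional: verify that the file is staged
LIST @csv_load;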

We go back to Snowsight and run the COPY INTO command, which loads the data into the ‘customer_csv’ table. We can then query the data and see that we have 9.8K records out of 10K, since some of the records contained inconsistencies and Snowflake skipped them.

Image credits: Author
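A sketch of the COPY command run in Snowsight. ON_ERROR = 'CONTINUE' is my assumption based on the 9.8K-of-10K result described above, since that setting makes Snowflake skip bad rows instead of failing the whole load:

COPY INTO customer_csv
  FROM @csv_load
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff')
  ON_ERROR = 'CONTINUE';   -- skip inconsistent records rather than aborting

SELECT COUNT(*) FROM customer_csv;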

Other considerations: since the file size is small, the whole process looks fast and easy. In real-world scenarios this is often not the case, and we may have to deal with files that are gigabytes or terabytes in size. Snowflake offers different options depending on the workload, such as:
- Increase the size of the warehouse (scale up): MEDIUM, LARGE, XLARGE
- Use the multi-cluster warehouse option (scale out)
- Compress the file before loading, e.g. as .gz
- Use the PARALLEL option in the PUT command to use all available cores.

Next we would like to ingest multiple files at once into Snowflake tables.

Scenario 2:
- File format: CSV
- Data Load Technique: COPY INTO
- Source: Local file system
- Loading type: multiple files, Bulk load

The procedure for creating roles, granting privileges and creating virtual warehouses is the same as above, so we continue with the previous script and only modify it where we create a new table and add a pattern-matching parameter to the COPY INTO command.

Image credits: Author

Above, we define a pattern that matches all the files contained in the local directory. It can be as complex as you want and uses regular expressions to match file names; this also applies to any external cloud storage. The PURGE parameter removes the staged files after loading, which helps minimize storage cost; remember that there is a cost associated with every database object in Snowflake.
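A sketch of such a COPY command with pattern matching and purging; the regex assumes the staged files follow a customer*.csv naming convention, so adjust it to your own file names:

COPY INTO customer_csv
  FROM @csv_load
  PATTERN = '.*customer.*[.]csv[.]gz'     -- regex over staged file names (files are gzipped by AUTO_COMPRESS)
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff')
  ON_ERROR = 'CONTINUE'
  PURGE = TRUE;                           -- delete staged files after a successful load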

When we stage the data into the named stage as before, we can see that there are now three files.

Image credits: Author

After using the COPY INTO command, we can see the number of records in each file and the records loaded into our ‘customer_csv’ table.

Image credits: Author

When we query the data, we can see that the total number of records is the sum of the records in each file, because we loaded all the files into the same table. To load the data into different tables, we would have to run the COPY INTO command separately for each table.

Image credits: Author

Along with loading data, it is also crucial to monitor your loading processes so you can optimize them and reduce cost by leveraging various Snowflake parameters and best practices. The following command uses INFORMATION_SCHEMA, a read-only schema that Snowflake creates automatically in every database.

Image credits: Author
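A sketch of such a monitoring query, using the COPY_HISTORY table function from INFORMATION_SCHEMA and assuming a 24-hour window:

SELECT file_name,
       status,
       row_count,             -- rows actually loaded
       row_parsed,            -- rows parsed from the file
       first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'CUSTOMER_CSV',
       START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())));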

Scenario 3:
- File format: CSV
- Data Load Technique: COPY INTO
- Source: AWS S3 Bucket
- Loading type: Bulk load

The procedure for creating roles, granting privileges and creating virtual warehouses is the same as above, as is creating a separate database, schema and table.
Now, since we are dealing with cloud storage, there are always identity and access management steps to follow to establish a connection between the two services. Snowflake provides several options for configuring secure access to cloud storage, and we will configure a Snowflake storage integration to access Amazon S3 buckets. The advantage of this approach is that you provide credentials only once and then have access to the appropriate buckets and AWS accounts.

The data is already loaded into S3 buckets:

Image credits: Author

After creating the appropriate schemas and tables to land the data, we create a storage integration by following the steps here. You need the CREATE INTEGRATION privilege to execute the command below; only the ACCOUNTADMIN role has this privilege by default, but it can be granted to additional roles as needed. It looks something like this in Snowflake:

Image credits: Author
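Since the values in the screenshot are redacted, here is a minimal sketch of the integration with the IAM role ARN, bucket path and integration name as placeholders:

USE ROLE ACCOUNTADMIN;

CREATE OR REPLACE STORAGE INTEGRATION s3_int   -- placeholder name
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<aws_account_id>:role/<snowflake_access_role>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<your-bucket>/<path>/');

-- Returns STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID,
-- which go into the trust policy of the AWS IAM role
DESC INTEGRATION s3_int;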

By following the guide, we get secure read and write access to the S3 bucket. The blacked-out values are private and should not be shared publicly.

This integration is then passed to an external stage in Snowflake, which stages the data from S3. You can also look at the description of the external stage to better understand the data loading process.

Image credits: Author
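A sketch of the external stage, reusing the placeholder integration and file format names from above:

GRANT USAGE ON INTEGRATION s3_int TO ROLE DATA_LOADER;

USE ROLE DATA_LOADER;
USE SCHEMA SALES_DB.LANDING;    -- placeholder names from earlier

CREATE OR REPLACE STAGE csv_s3_stage   -- placeholder name
  STORAGE_INTEGRATION = s3_int
  URL = 's3://<your-bucket>/<path>/'
  FILE_FORMAT = csv_ff;

DESC STAGE csv_s3_stage;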

If you list the stage now, you can see the three staged files. We can also preview or query the files in an external stage to understand them before loading (or after unloading).

Image credits: Author
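Listing and previewing can be done directly against the stage, for example:

LIST @csv_s3_stage;

-- Query the raw columns of the staged files before loading
SELECT t.$1, t.$2, t.$3
FROM @csv_s3_stage (FILE_FORMAT => 'csv_ff') t
LIMIT 10;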

Loading the data is similar to the local file system case, but this time we use the external stage. We can see that the files loaded only partially, probably because of some formatting issue, but the transfer itself succeeded.

Image credits: Author
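The COPY itself is the same as for the internal stage, just pointed at the external stage. ON_ERROR = 'CONTINUE' is again my assumption, since it would explain the partial load described above:

COPY INTO customer_csv
  FROM @csv_s3_stage
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff')
  ON_ERROR = 'CONTINUE';   -- load what parses cleanly, skip the rest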

Scenario 4:
- File format: CSV
- Data Load Technique: COPY INTO
- Source: Azure ADLS Gen 2
- Loading type: Bulk load

Azure Blob Storage containers can be used for storing and managing your data files, and you can use your existing containers and folder paths for bulk loading into Snowflake. Snowflake currently supports loading from blob storage only. Snowflake supports the following types of storage accounts:
- Blob storage
- Data Lake Storage Gen2
- General-purpose v1
- General-purpose v2

Image credits: Snowflake

The steps are almost the same as for the AWS S3 bucket, but since this is a different cloud platform, we need to create a separate STORAGE INTEGRATION in Snowflake to allow secure access to Azure Blob Storage. The files are already loaded inside a container called ‘customer_csv’ in Azure Blob Storage.

Image credits: Author

Creating the storage integration requires slightly different credentials for each cloud provider.

Image credits: Author

Next, you load the data files using the same COPY INTO command, providing the storage integration and file format along with the blob storage URL.

Image credits: Author
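Putting the Azure-specific pieces together, a sketch might look like the following. The tenant ID, storage account, integration and stage names are placeholders; the customer_csv container name comes from this article:

USE ROLE ACCOUNTADMIN;

CREATE OR REPLACE STORAGE INTEGRATION azure_int   -- placeholder name
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<your_azure_tenant_id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://<storage_account>.blob.core.windows.net/customer_csv/');

-- Returns AZURE_CONSENT_URL and AZURE_MULTI_TENANT_APP_NAME,
-- used to grant Snowflake access in the Azure portal
DESC INTEGRATION azure_int;

GRANT USAGE ON INTEGRATION azure_int TO ROLE DATA_LOADER;

USE ROLE DATA_LOADER;

CREATE OR REPLACE STAGE csv_azure_stage   -- placeholder name
  STORAGE_INTEGRATION = azure_int
  URL = 'azure://<storage_account>.blob.core.windows.net/customer_csv/'
  FILE_FORMAT = csv_ff;

COPY INTO customer_csv
  FROM @csv_azure_stage
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff')
  ON_ERROR = 'CONTINUE';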

Querying the customer_csv table returns the same number of records as in the previous load scenarios.

Image credits: Author

Scenario 5:
- File format: CSV
- Data Load Technique: COPY INTO
- Source: GCP Cloud storage
- Loading type: Bulk load

Image credits: Snowflake

The steps are almost the same as for AWS S3 or Azure Blob Storage, but since this is a different cloud platform, we need to create a separate STORAGE INTEGRATION in Snowflake to allow secure access to Google Cloud Storage. The files are already loaded inside a bucket called ‘customer_csv’ in Google Cloud Storage.

Image credits: Author

Creating the storage integration requires slightly different credentials for each cloud provider.

Image credits: Author

Next, you load the data files using the same COPY INTO command, providing the storage integration and file format along with the cloud storage URL.

Image credits: Author
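The GCS version follows the same pattern. The main differences are STORAGE_PROVIDER = 'GCS', the gcs:// URL, and that DESC INTEGRATION returns a Google service account to which you grant read access on the bucket. The bucket path and object names below are placeholders:

USE ROLE ACCOUNTADMIN;

CREATE OR REPLACE STORAGE INTEGRATION gcs_int   -- placeholder name
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://<your-bucket>/<path>/');

-- Returns STORAGE_GCP_SERVICE_ACCOUNT; grant it read access on the bucket in GCP IAM
DESC INTEGRATION gcs_int;

GRANT USAGE ON INTEGRATION gcs_int TO ROLE DATA_LOADER;

USE ROLE DATA_LOADER;

CREATE OR REPLACE STAGE csv_gcs_stage   -- placeholder name
  STORAGE_INTEGRATION = gcs_int
  URL = 'gcs://<your-bucket>/<path>/'
  FILE_FORMAT = csv_ff;

COPY INTO customer_csv
  FROM @csv_gcs_stage
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff')
  ON_ERROR = 'CONTINUE';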

Some of the best practices for data loading using COPY INTO command:

  • Use file sizes above 10 MB and preferably in the range of 100 MB to 250 MB.
  • Check file loading success/failure for Snowpipe using COPY_HISTORY or other options.
  • Leverage your existing object path partitioning for COPY when possible. Using the most explicit path allows COPY to efficiently list and load your data as quickly as possible.
  • Ingestion of gzip’ed CSV files (.csv.gz) is not only the most widely used but usually the most performant configuration for ingestion.

Using COPY INTO is a manual task and is useful only in certain situations, such as ad-hoc analysis or checking the schema before loading. COPY fits nicely into an existing infrastructure where one or more warehouses are already managed for size and suspension/resumption to achieve the best price-to-performance across various workloads, such as SELECT queries or data transformations. In the absence of such an infrastructure for managing warehouses, Snowpipe should be considered for simplicity and convenience.

We will discuss Snowpipe, which is used for continuous data ingestion based on message queues, in the next article. It is a very interesting topic!

Thanks for reading!

Let’s connect — LinkedIn

Let’s build something — GitHub
