Migrate ClickHouse data to S3 with AWS Glue

Michael Cherepnev
2 min read · Jul 30, 2020


ClickHouse is a modern column-oriented database management system for analytical processing, developed at Yandex.

AWS Glue is a managed ETL service based on Apache Spark in the Amazon ecosystem. It has its own Data Catalog.

Usually when you work with Glue, you set up a connection to a DBMS and a crawler to collect metadata about its tables. You can then use this metadata in Glue ETL jobs. But Glue has no built-in support for ClickHouse, so you have to handle the connection to ClickHouse yourself.

Here is a step-by-step guide on how to migrate data from ClickHouse to S3 with AWS Glue.

Preparations

First of all, you need the JDBC driver for ClickHouse. You can download it from the official GitHub project and place it in an S3 bucket (e.g., s3://my-bucket/spark-jars/clickhouse-jdbc-0.2.4.jar).
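Once the jar is downloaded locally, uploading it can also be scripted; here is a minimal boto3 sketch that reuses the example bucket and key from above (adjust them to your own bucket):

import boto3

# Upload the locally downloaded driver jar to the bucket the Glue job will read it from
boto3.client('s3').upload_file(
    'clickhouse-jdbc-0.2.4.jar',              # local file
    'my-bucket',                               # target bucket
    'spark-jars/clickhouse-jdbc-0.2.4.jar',    # target key
)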

When you work with an unsupported DBMS in Glue, you can't use Glue Connections, so you have to manage secrets yourself. The best practice is to use AWS Secrets Manager. Here is a detailed guide on how to create a Secret and use it in a Glue ETL job: https://aws.amazon.com/blogs/big-data/use-aws-glue-to-run-etl-jobs-against-non-native-jdbc-data-sources/.
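If you prefer to script that part, here is a minimal sketch of creating such a secret with boto3. The secret name clickhouse/prod and the key names host, username and password are assumptions, so use whatever layout your ETL script expects:

import json
import boto3

# Store ClickHouse connection details in Secrets Manager.
# Secret name and key names are assumptions -- keep them in sync with the ETL script.
client = boto3.client('secretsmanager')
client.create_secret(
    Name='clickhouse/prod',
    SecretString=json.dumps({
        'host': 'clickhouse.internal.example.com',
        'username': 'glue_reader',
        'password': 'change-me',
    }),
)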

And don’t forget to create an S3 bucket for output data.

Creating a job

To iterate faster, don't use the Glue ETL script editor. Configure the job once to use an existing script that you provide, and write the code in your IDE.

Create clickhouse_to_s3.py in your favorite IDE:
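A minimal sketch of what such a script can look like is below. It assumes the secret layout from the previous section, a ClickHouse table named events reachable on the default HTTP port 8123, and an output bucket named my-output-bucket; adjust these to your environment:

import sys
import json

import boto3
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job bootstrap
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Fetch ClickHouse credentials from Secrets Manager
# (secret name and key names are assumptions -- match your setup)
secrets = boto3.client('secretsmanager')
secret = json.loads(secrets.get_secret_value(SecretId='clickhouse/prod')['SecretString'])

# Read a ClickHouse table over JDBC; the driver class comes from the jar
# attached to the job via the dependent jars path
df = (spark.read
      .format('jdbc')
      .option('driver', 'ru.yandex.clickhouse.ClickHouseDriver')
      .option('url', 'jdbc:clickhouse://{}:8123/default'.format(secret['host']))
      .option('user', secret['username'])
      .option('password', secret['password'])
      .option('dbtable', 'events')  # table to migrate
      .load())

# Write the data to S3 as Parquet
df.write.mode('overwrite').parquet('s3://my-output-bucket/events/')

job.commit()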

Deploy all of your *.py ETL job files to S3:

aws s3 cp . s3://my-aws-glue-scripts/jobs/ --recursive --exclude "*" --include "*.py" --exclude "*/*"

Create a new job in Glue ETL:

  • Name
  • IAM role — choose proper Role
  • Type — Spark
  • Glue version — Spark 2.4, Python 3 (Glue Version 1.0)
  • This job runs — An existing script that you provide
  • S3 path where the script is stored — s3://my-aws-glue-scripts/jobs/clickhouse_to_s3.py
  • Temporary directory — s3://my-aws-glue-temporary/username
  • Monitoring options — check Job metrics
  • Security configuration, script libraries, and job parameters (optional) -> Dependent jars path — s3://my-bucket/spark-jars/clickhouse-jdbc-0.2.4.jar
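The same job can also be created programmatically; here is a rough boto3 equivalent of the settings above (the job name and IAM role name are placeholders):

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='clickhouse_to_s3',
    Role='MyGlueServiceRole',                       # placeholder IAM role
    Command={
        'Name': 'glueetl',                          # Spark job type
        'ScriptLocation': 's3://my-aws-glue-scripts/jobs/clickhouse_to_s3.py',
        'PythonVersion': '3',
    },
    GlueVersion='1.0',
    DefaultArguments={
        '--extra-jars': 's3://my-bucket/spark-jars/clickhouse-jdbc-0.2.4.jar',
        '--TempDir': 's3://my-aws-glue-temporary/username',
        '--enable-metrics': '',                     # turn on job metrics
    },
)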

Then you can iterate: run the job, watch the logs and output files on S3, edit the code in your IDE, redeploy it to S3, and run again.
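If you want to drive that loop from your terminal instead of the console, a small boto3 helper can start a run and poll its state (the job name matches the sketch above):

import time
import boto3

glue = boto3.client('glue')

# Kick off a run of the job
run_id = glue.start_job_run(JobName='clickhouse_to_s3')['JobRunId']

# Poll the run state until it reaches a terminal status
while True:
    state = glue.get_job_run(JobName='clickhouse_to_s3', RunId=run_id)['JobRun']['JobRunState']
    print(state)
    if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        break
    time.sleep(30)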
