All about Cloud, mostly about Amazon Web Services (AWS)

Converting Apache Avro to Apache Parquet Using Amazon Athena

 2020-01-20 /  640 words /  4 minutes

Converting files between various formats is a fact of life for IT staff. This is often because different formats are optimized for specific use cases. For example, JPEG files are great for photographs and use lossy compression that can be almost impossible for the human eye to detect by PNG files are great when logos or graphics need to preserve every pixel in an image.

Apache Avro and Apache Parquet are similar examples. This post explains the benefits of each file format, and demonstrates how to convert between the two formats using Amazon Athena.

About Apache Avro

Apache Avro is a row based format which means that data within the file is arranged with all of the first row occurring before all the second row. This approach has been used for most transactional databases for many years and is great for streaming IO operations. For example, you can write a program to open an Avro file in an Amazon Simple Storage Service (S3) bucket, read a small part of it, process a few records, write some data out, and then read another small part of it. Using this approach large files can be processed by a resource constrained system, although it can take some time to process the files in a serial manner.

About Apache Parquet

Apache Parquet is a columnar format which means that the first column of all the rows are stored first, then the second column of all the rows are stored. This approach has been pioneered by analytical systems such as Amazon Elastic MapReduce (EMR) and Amazon Redshift because they save significant disk access when only a subset of a row is required. While Parquet is highly optimized for reading only part of a row, it is much more difficult to insert new rows into a columnar store, and more difficult to work with entire rows because data related to a single row is spread across the entire file. These challenges are often offset because Parquet files are usually processed by large clustered analytics systems such as EMR.

Conversion

The conversion between the two data types is fairly straightforward using Amazon Athena. The “CTAS” (“CREATE TABLE AS”) functionality which was added allows an Avro file to be quickly converted using a single Athena command. The exact command is:

CREATE TABLE test1_parquet1
WITH (
      format = 'Parquet',
      external_location = 's3://<bucket name>/parquet1/data1.parquet',
      parquet_compression = 'SNAPPY')
AS 
SELECT * 
FROM test1_avro1;

The command creates a new table called “test1_parquet1”. The “AS” section uses the output from the “SELECT * FROM test1_avro1” SQL command to define the fields and data. The “WITH” section of the file defines the format to create (“Parquet”), the Amazon Simple Storage Service (S3) bucket in which to store the new file, and specifically for the Parquet file format, the compression format to use on the file. The output from running the command can be seen below:

Athena Conversion Query

You can see from the figure that the query took 8.32 seconds to execute, and scanned 43.88 MB of data, which is the entire size of the data1.avro file. You can also see that the Parquet backed table has been listed on the right hand side of the Athena console, beneath “test1_avro1” and if we check the AWS Glue console, we can see that the Parquet file exists there too:

Glue Catalog Table Details

Interestingly though, if we check Amazon Simple Storage Service (S3), we can see the Parquet representation was broken across two files in the S3 Bucket:

Parquet backed table in Amazon S3

Again, we can verify that Athena can query the data in the Parquet backed table by again clicking on the three vertical dots by the side of the table name (“test1_parquet1”) but this time by selecting “Preview table”. The query executes and the output is shown below:

Athena Query Results

Conclusion

Using Amazon Athena to perform conversions between Apache Avro and Apache Parquet is fast and simple.


Tags:  AWS  Analytics  Amazon Athena  AWS Glue  Storage  Amazon Simple Storage Service (S3)  Apache Avro  Apache Parquet
Categories:  AWS  Analytics  Amazon Athena  AWS Glue  Storage  Amazon Simple Storage Service (S3)

See Also

 Top Ten Tags

AWS (43)   Kinesis (9)   Streams (8)   AWS Console (5)   Go (5)   Analytics (4)   Data (4)   database (4)   Amazon DynamoDB (3)   Amazon Elastic Compute Cloud (EC2) (3)  


All Tags (173)

Disclaimer

All data and information provided on this site is for informational purposes only. cloudninja.cloud makes no representations as to accuracy, completeness, currentness, suitability, or validity of any information on this site and will not be liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use. All information is provided on an as-is basis.

This is a personal weblog. The opinions expressed here represent my own and not those of my employer. My opinions may change over time.