All about Cloud, mostly about Amazon Web Services (AWS)

Working with Apache Avro in AWS Glue and Amazon Athena

 2020-01-19

When working with “big data” analytics systems, the de facto standard file format is Apache Parquet. Parquet is a columnar format: unlike more transactional systems, which arrange their data in row order, the data is arranged by column instead. This has the benefit of reducing the amount of data that must be read when only a few columns are required, but the disadvantage that inserts, updates, and record (row) based processing may be slower. An alternative to Apache Parquet is the Apache Avro file format. It is row-based, supports an embedded schema, and can be processed by Amazon Elastic MapReduce (EMR), Amazon Athena, Amazon Redshift, and AWS Glue.

Creating an Apache Avro File

We’ll start with some code to create an Avro file, which starts with an Avro schema. In this simple, but very contrived example, we’re storing a few integers, and a UUID. The Avro schema is:

{"namespace": "cloudninja",
 "type": "record",
 "name": "Data",
 "fields": [
     {"name": "idx", "type": "int"},
     {"name": "mod3", "type": "int"},
     {"name": "mod5", "type": "int"},
     {"name": "mod7", "type": "int"},
     {"name": "mod11", "type": "int"},
     {"name": "mod13", "type": "int"},
     {"name": "mod17", "type": "int"},
     {"name": "uuid", "type": "string"}
 ]
}

I wrote a quick Java program to generate an Avro file. There are a couple of ways to do this. One is code generation, which produces a Java class representing the Avro schema; I chose instead to use the generic (schema-driven) API in this example. The code is:

package ninja.cloudninja.cloud;

import java.io.File;
import java.io.IOException;
import java.util.UUID;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class GenAvro {

    public static void genAvro() throws IOException {
        // Parse the schema definition shown above.
        Schema schema = new Schema.Parser().parse( new File( "data.avsc" ) );

        File file = new File( "data.avro" );
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>( schema );
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>( datumWriter );
        dataFileWriter.create( schema, file );

        // Write one million records of modular-arithmetic values plus a random UUID.
        for ( int i = 0; i < 1000000; i++ ) {
            GenericRecord data = new GenericData.Record( schema );
            data.put( schema.getFields().get( 0 ).name(), i );      // idx
            data.put( schema.getFields().get( 1 ).name(), i % 3 );  // mod3
            data.put( schema.getFields().get( 2 ).name(), i % 5 );  // mod5
            data.put( schema.getFields().get( 3 ).name(), i % 7 );  // mod7
            data.put( schema.getFields().get( 4 ).name(), i % 11 ); // mod11
            data.put( schema.getFields().get( 5 ).name(), i % 13 ); // mod13
            data.put( schema.getFields().get( 6 ).name(), i % 17 ); // mod17
            data.put( schema.getFields().get( 7 ).name(), UUID.randomUUID().toString() );
            dataFileWriter.append( data );
        }

        dataFileWriter.close();
    }
}
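Before uploading the file, it can be useful to verify it locally. The following sketch (my addition, not part of the original program; class name `ReadAvro` is my own) reads `data.avro` back using Avro's generic reader, which pulls the schema from the file itself, then prints the embedded schema and record count. It assumes the same Avro 1.9.1 dependency as the writer.

```java
package ninja.cloudninja.cloud;

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class ReadAvro {

    public static void main( String[] args ) throws IOException {
        // No schema is supplied here; the reader uses the schema embedded in the file.
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try ( DataFileReader<GenericRecord> dataFileReader =
                  new DataFileReader<>( new File( "data.avro" ), datumReader ) ) {
            System.out.println( "Embedded schema: " + dataFileReader.getSchema() );

            GenericRecord record = null;
            long count = 0;
            while ( dataFileReader.hasNext() ) {
                record = dataFileReader.next( record ); // reuse the record object to avoid allocations
                count++;
            }
            System.out.println( "Records read: " + count );
        }
    }
}
```

If the writer above ran to completion, the record count printed should be 1000000.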

The code requires Apache Avro, so the Maven dependency used is:

  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.9.1</version>
  </dependency>

Using the AWS Glue Catalog

When executed, the program creates the file data.avro. The file is then uploaded to S3 and we use an AWS Glue crawler to discover the data. Athena is a bit picky about how it finds the data. For my first attempt I simply copied the file to the root of the S3 bucket; although AWS Glue was able to find the data, Athena didn’t seem to be able to query it. I then copied it into a subdirectory (called “year=2020”) of a subdirectory (called “avro1”) of the root folder, after which AWS Glue cataloged it and Amazon Athena was able to query it. The command I used was:

$ aws s3 cp data.avro s3:///avro1/year=2020/data.avro

I tried reusing an AWS Glue crawler from a previous session that had used a different bucket. The IAM role the crawler used had been granted access only to that other S3 bucket, so it failed to catalog the tables. This would have been immediately obvious had an error been generated, but I didn’t see one; it took some trial and error to find the problem, after which the tables were cataloged.

One benefit of Avro’s embedded schema is that the Glue crawlers don’t need to guess at the data types. The result is shown below:

Glue Catalog Table Details

Preparing Amazon Athena

Once we have defined the table, we can query the data using Amazon Athena. Amazon Athena is a fully-managed AWS service offering SQL queries across “big data” sets. It is built on Presto, and charges based on the amount of data scanned. This makes it a great choice for low-volume experimental usage.

Once the Amazon Athena console is opened and the AWS Glue database is selected (“test1” in this example), the tables are listed. First, we run a command to ensure that the table partitions are discovered. This can be done by clicking the three vertical dots beside the table name and selecting “Load partitions”. The query executes automatically and the output is shown below:
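For a partitioned table like this one, the console’s “Load partitions” action appears to run Hive’s MSCK REPAIR TABLE statement under the hood, so the same step can be performed as an ordinary query. The statements below are a sketch using the table name assumed in this walkthrough; the bucket name in the LOCATION clause is a placeholder.

```sql
-- Equivalent of the console's "Load partitions" action:
MSCK REPAIR TABLE avro1;

-- Alternatively, a single partition can be registered explicitly
-- (replace <your-bucket> with the actual bucket name):
ALTER TABLE avro1 ADD PARTITION (year = 2020)
  LOCATION 's3://<your-bucket>/avro1/year=2020/';
```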

Athena Table Preparation

Executing Queries using Amazon Athena

Next, we can verify that Athena can query the data by again clicking the three vertical dots beside the table name, this time selecting “Preview table”. The query executes and the output is shown below:
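Beyond the generated preview, the table can be queried like any other. A hypothetical example, using the database and table names from this walkthrough, that checks one of the modular-arithmetic columns:

```sql
SELECT mod3, COUNT(*) AS rows_per_value
FROM "test1"."avro1"
WHERE year = 2020          -- partition predicate; limits the data scanned
GROUP BY mod3
ORDER BY mod3;
```

With a million generated rows, each of the three mod3 values (0, 1, 2) should appear roughly 333,333 times.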

Athena Query Results

Conclusion

Apache Avro is an alternative to Apache Parquet: it uses a row-based storage format rather than a columnar one, and it works well with “big data” analytics tools such as Amazon Athena, Amazon Redshift, Amazon EMR, and AWS Glue.


Tags:  Avro  Glue  Athena  Java
Categories:  AWS  Analytics  AWS Glue  Amazon Athena
