Skip to main content

Command Palette

Search for a command to run...

Working with Avro file format in Python the right way

Updated
•2 min read
Working with Avro file format in Python the right way

Here are some quick helpful tips for using Avro file format correctly in python.

Note: I am asumming you familair with Apache Avro file format, its advantages, its shortcomings, etc.

Tip no 1: Use the correct package

Instead of using the official package from Apache Avro use Fast Avro for Python. Trust me the claims made by the author of fastavro mostly holds true.

Tip no 2: Use of schema

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. You can read the specification docs to understand more about it in detail. I too found it a bit confusing & keep ever forgetting. So here is the rule of thumb that I follow:

schema = {
    'doc': 'A dummy avro file', # a short description 
    'name': 'dummy', # your supposed to be file name with .avro extension 
    'type': 'record', # type of avro serilazation, there are more (see above docs) but as per me this will do most of the time
    'fields': [ # this defines actual keys & their types
        {'name': 'key1', 'type': 'string'},
        {'name': 'key2', 'type': 'int'},
        {'name': 'key2', 'type': 'boolean'},
    ],
}

Tip no 3: Write correctly

The fastavro default write method for some reason does not use any codec or compression algorithm, which defeats the purpose of using avro. See the below screenshot.

FormatSize
JSON13.6 mb
Avro (no compression/code)13.3 mb
Avro (deflate compression/code)2.13 mb
Avro (snappy compression/code)3.4 mb

from fastavro import writer, parse_schema, reader

# default codec is None 
with open('dummy.avro', 'wb') as out:
    writer(out, parse_schema(schema), more_rows)

# from the above screenshot its best to use deflate. It also have native supoprt
with open('dummy_deflate.avro', 'wb') as out:
    writer(out, parse_schema(schema), more_rows, codec="deflate")

Tip no 3: Read as a generator

Assuming the file size is huge (that will be the case why you had the need to move from JSON to something like Avro) & fastavro do support lazy loading why not use it. Also, it is straightforward

from fastavro import reader
with open("dummy.avro", "rb") as out:
    f = reader(out) # now f is generator
    # you loop it
    for i in f:
        do_something(i)

Express Ideas

Part 3 of 5

Small & self-contained tips that can help in many larger context. This is my attempt to help you out & shed some light on express ideas that helped me because it's just an idea that all you need 💡.

Up next

How to SSH login password free from Windows, Linux, Mac

It's very painful and annoying to always provide password for ssh login. But it can be fixed by auto password functionality of ssh