Here are some quick helpful tips for using Avro file format correctly in python.
Note: I am asumming you familair with Apache Avro file format, its advantages, its shortcomings, etc.
Tip no 1: Use the correct package
Instead of using the official package from Apache Avro use Fast Avro for Python. Trust me the claims made by the author of fastavro mostly holds true.
Tip no 2: Use of schema
Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. You can read the specification docs to understand more about it in detail. I too found it a bit confusing & keep ever forgetting. So here is the rule of thumb that I follow:
schema = {
'doc': 'A dummy avro file', # a short description
'name': 'dummy', # your supposed to be file name with .avro extension
'type': 'record', # type of avro serilazation, there are more (see above docs) but as per me this will do most of the time
'fields': [ # this defines actual keys & their types
{'name': 'key1', 'type': 'string'},
{'name': 'key2', 'type': 'int'},
{'name': 'key2', 'type': 'boolean'},
],
}
Tip no 3: Write correctly
The fastavro default write method for some reason does not use any codec or compression algorithm, which defeats the purpose of using avro. See the below screenshot.
Format | Size |
JSON | 13.6 mb |
Avro (no compression/code) | 13.3 mb |
Avro (deflate compression/code) | 2.13 mb |
Avro (snappy compression/code) | 3.4 mb |
from fastavro import writer, parse_schema, reader
# default codec is None
with open('dummy.avro', 'wb') as out:
writer(out, parse_schema(schema), more_rows)
# from the above screenshot its best to use deflate. It also have native supoprt
with open('dummy_deflate.avro', 'wb') as out:
writer(out, parse_schema(schema), more_rows, codec="deflate")
Tip no 3: Read as a generator
Assuming the file size is huge (that will be the case why you had the need to move from JSON to something like Avro) & fastavro
do support lazy loading why not use it. Also, it is straightforward
from fastavro import reader
with open("dummy.avro", "rb") as out:
f = reader(out) # now f is generator
# you loop it
for i in f:
do_something(i)