# Working with Avro file format in Python the right way

Here are some quick helpful tips for using Avro file format correctly in python.

> Note: I am asumming you familair with Apache Avro file format, its advantages, its shortcomings, etc.

**Tip no 1: Use the correct package**

Instead of using the official package from [Apache Avro](https://pypi.org/project/avro/) use [Fast Avro for Python](https://github.com/fastavro/fastavro). Trust me the claims made by the author of fastavro mostly holds true.

**Tip no 2: Use of schema**

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. You can read the [specification docs](https://avro.apache.org/docs/1.11.1/specification/) to understand more about it in detail. I too found it a bit confusing & keep ever forgetting. So here is the rule of thumb that I follow:

```python
schema = {
    'doc': 'A dummy avro file', # a short description 
    'name': 'dummy', # your supposed to be file name with .avro extension 
    'type': 'record', # type of avro serilazation, there are more (see above docs) but as per me this will do most of the time
    'fields': [ # this defines actual keys & their types
        {'name': 'key1', 'type': 'string'},
        {'name': 'key2', 'type': 'int'},
        {'name': 'key2', 'type': 'boolean'},
    ],
}
```

**Tip no 3: Write correctly**

The fastavro default write method for some reason does not use any codec or compression algorithm, which defeats the purpose of using avro. See the below screenshot.

| Format | Size |
| --- | --- |
| JSON | 13.6 mb |
| Avro (no compression/code) | 13.3 mb |
| Avro (deflate compression/code) | 2.13 mb |
| Avro (snappy compression/code) | 3.4 mb |

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1671430040095/WtxsRCxw_.png align="center")

```python
from fastavro import writer, parse_schema, reader

# default codec is None 
with open('dummy.avro', 'wb') as out:
    writer(out, parse_schema(schema), more_rows)

# from the above screenshot its best to use deflate. It also have native supoprt
with open('dummy_deflate.avro', 'wb') as out:
    writer(out, parse_schema(schema), more_rows, codec="deflate")
```

**Tip no 3: Read as a generator**

Assuming the file size is huge (that will be the case why you had the need to move from JSON to something like Avro) & `fastavro` do support lazy loading why not use it. Also, it is straightforward

```python
from fastavro import reader
with open("dummy.avro", "rb") as out:
    f = reader(out) # now f is generator
    # you loop it
    for i in f:
        do_something(i)
```
