import idea

Working with Avro file format in Python the right way

Akash Desarda — Mon, 19 Dec 2022 06:28:58 GMT

Here are some quick, helpful tips for using the Avro file format correctly in Python.

Note: I am assuming you are familiar with the Apache Avro file format, its advantages, its shortcomings, etc.

Tip no 1: Use the correct package

Instead of using the official package from Apache Avro, use fastavro for Python. Trust me, the claims made by the author of fastavro mostly hold true.

Tip no 2: Use of schema

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. You can read the specification docs to understand it in more detail. I too found it a bit confusing & keep forgetting it. So here is the rule of thumb that I follow:

schema = {
    'doc': 'A dummy avro file',  # a short description
    'name': 'dummy',  # the name you intend for the file (saved with a .avro extension)
    'type': 'record',  # type of Avro serialization; there are more (see the docs above), but in my experience this will do most of the time
    'fields': [  # this defines the actual keys & their types
        {'name': 'key1', 'type': 'string'},
        {'name': 'key2', 'type': 'int'},
        {'name': 'key3', 'type': 'boolean'},  # field names must be unique within a record
    ],
}

Tip no 3: Write correctly

The fastavro default write method does not use any codec or compression algorithm, which defeats much of the purpose of using Avro. See the comparison below.

Format	Size
JSON	13.6 MB
Avro (no compression/codec)	13.3 MB
Avro (deflate compression/codec)	2.13 MB
Avro (snappy compression/codec)	3.4 MB

from fastavro import writer, parse_schema

# `more_rows` is the list of records (dicts) that you want to write

# by default no codec (compression) is used
with open('dummy.avro', 'wb') as out:
    writer(out, parse_schema(schema), more_rows)

# from the comparison above, it's best to use deflate. It also has native support
with open('dummy_deflate.avro', 'wb') as out:
    writer(out, parse_schema(schema), more_rows, codec="deflate")
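To get a feel for why the codec matters so much, here is a rough sketch that does not use fastavro at all. It serializes some hypothetical rows (made up for this illustration) to JSON with the standard library and then deflates the bytes with zlib, which implements the same DEFLATE algorithm that Avro's deflate codec is built on. The exact numbers will differ from the table above, since this is not the Avro binary format.

```python
import json
import zlib

# Hypothetical rows, invented purely for this illustration.
rows = [{"key1": f"value-{i % 10}", "key2": i, "key3": i % 2 == 0} for i in range(10_000)]

# Serialize once, then compress the very same bytes with DEFLATE.
raw = json.dumps(rows).encode("utf-8")
deflated = zlib.compress(raw, level=6)

raw_size = len(raw)
deflated_size = len(deflated)
print(f"uncompressed: {raw_size} bytes, deflated: {deflated_size} bytes")
```

Repetitive, record-shaped data like this compresses very well, which is why leaving the codec unset leaves so much on the table.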

Tip no 4: Read as a generator

Assuming the file is huge (which is probably why you needed to move from JSON to something like Avro) & fastavro supports lazy loading, why not use it? It is also straightforward.

from fastavro import reader

with open("dummy.avro", "rb") as fo:
    records = reader(fo)  # `records` is a lazy iterator that yields one record at a time
    # loop over it
    for record in records:
        do_something(record)
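The practical benefit of reading lazily is that you only pay for the records you actually consume. Here is a minimal sketch of that idea using a stand-in generator (`fake_reader` and the `stats` counter are hypothetical names, not part of fastavro) together with `itertools.islice` from the standard library.

```python
from itertools import islice

stats = {"produced": 0}  # counts how many records the stand-in reader really yielded


def fake_reader(total):
    """Stand-in for a lazy record reader: yields one dict at a time."""
    for i in range(total):
        stats["produced"] += 1
        yield {"key1": f"row-{i}", "key2": i}


# Take only the first three records; the remaining ones are never built.
first_three = list(islice(fake_reader(1_000_000), 3))
print(first_three, stats["produced"])
```

The same `islice` trick should work on any lazy iterator, which makes it handy for peeking at the head of a very large file.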

How to supercharge your config to make it truly environment agnostic

Akash Desarda — Tue, 13 Dec 2022 14:51:27 GMT

Anyone who develops a project that involves multiple environments (e.g. DEV, QA, PROD) knows how painful it is to write code once that works everywhere, especially if it involves lots of env-specific tools (including cloud). I too have faced such problems & to solve them once and for all I came up with a system called Environment Agnostic Config.

1. The Old School way

Using the config to control environments & settings is nothing new. Developers have been using it for ages, so why is it necessary to reinvent the wheel? I am not proposing to reinvent the wheel, but modernize it.

The most popular (& probably the most naive way too 😕) is the use of file-based configs like JSON, YAML, TOML, etc. Apart from being naive, they have two significant issues:

they have to be hardcoded
they cannot be generated dynamically.

This is a big deal breaker when dealing with multiple environments. If you have multiple sources to populate it (e.g. some secret vault, environment variables, hardcoded values, etc) then I would strongly recommend just dropping the idea of using a file-based config to save hardships in your life 😉.

2. Environment Agnostic Config: Generating config programmatically

To achieve this, I use Pydantic's Settings Management. Anyone familiar with Pydantic will find it easy to use.

Usually, you must be using

Something like a .env file to store all secrets & later read them. Or store secrets/configuration directly as environment variables & then read them.
Other configurations/settings can be hardcoded and written into json/yaml/toml/ini files.

So with the help of Pydantic's BaseSettings we can combine both. Let's dive in to see it in action.

Pydantic's BaseSettings already has support for reading from environment variables, .env files, etc. (follow the original docs for more details: Settings management - pydantic). This way we can have some fields (or class attributes) hardcoded & some populated from environment variables, all under one common Pydantic model. Let's see an example.

# pydantic v1 import; in pydantic v2, BaseSettings lives in the separate pydantic-settings package
from pydantic import BaseSettings, Field

class Config(BaseSettings):
    env_variable1: str = Field(description="some description")
    env_variable: str = Field(description="some description")
    hard_coded1: str = "some hardcoded value"
    hard_coded2: int = 999

In the above example, the first two fields will be automatically populated from the matching environment variables; the next two are hardcoded.
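If you want to see the idea without installing anything, here is a heavily simplified stand-in built only on the standard library. It is not how Pydantic implements BaseSettings (there is no validation, type coercion or .env support); the `env` helper and the `ENV_VARIABLE1` variable name are assumptions made for this sketch.

```python
import os
from dataclasses import dataclass, field


def env(name):
    """Dataclass field whose default is read from os.environ when the object is created."""
    return field(default_factory=lambda: os.environ[name])


@dataclass(frozen=True)
class Config:
    env_variable1: str = env("ENV_VARIABLE1")  # populated from the environment
    hard_coded1: str = "some hardcoded value"  # plain hardcoded default
    hard_coded2: int = 999


os.environ["ENV_VARIABLE1"] = "value-from-environment"  # normally set outside the program
config = Config()
print(config)
```

A missing environment variable raises a KeyError at instantiation time, which roughly mirrors how a required settings field fails early instead of silently falling back.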

Note: Since this is based on Pydantic, you can add all sorts of regular Pydantic validators. See the original docs (above) to see all the possibilities.

3. One Config to rule them all

Now coming to the most important part: how to use one config for all possible environments (e.g. DEV, QA, PROD, etc). Actually, it is quite easy; just create a Pydantic BaseSettings model for each environment.

from pydantic import BaseSettings, Field

class LocalSettings(BaseSettings):
    env_variable1: str = Field(description="some description")
    env_variable: str = Field(description="some description")
    hard_coded1: str = "some hardcoded value"
    hard_coded2: int = 999

class DEVSettings(BaseSettings):
    env_variable1: str = Field(description="some description")
    env_variable: str = Field(description="some description")
    hard_coded1: str = "some hardcoded value"
    hard_coded2: int = 999

class PRODSettings(BaseSettings):
    env_variable1: str = Field(description="some description")
    env_variable: str = Field(description="some description")
    hard_coded1: str = "some hardcoded value"
    hard_coded2: int = 999

Since the underlying environment variables will be different for all environments, so will be the populated fields.

4. How to actually consume the config in code

Now that we have a single source of config, the next part is how to actually consume it in code. Again, there is nothing novel here. I create a function that takes the underlying environment as a parameter & returns the respective config. I usually store all of this in config.py.

# logic in config.py
from pydantic import BaseSettings, Field

class LocalSettings(BaseSettings):
    ...  # same as above

class DEVSettings(BaseSettings):
    ...  # same as above

class PRODSettings(BaseSettings):
    ...  # same as above

def get_config(environment: str):
    # match/case needs Python 3.10+
    match environment:
        case "local":
            config = LocalSettings()
        case "dev":
            config = DEVSettings()
        case "prod":
            config = PRODSettings()
        case _:
            # without this branch an unknown name would fail later with an unbound `config`
            raise ValueError(f"unknown environment: {environment}")
    return config

# in some other part of your code/library
import os

from config import get_config

config = get_config("dev")
some_variable = config.env_variable

# NOTE - you don't even need to hardcode the environment parameter. What I do is simply create an
# environment variable for the environment itself & use it in the function.
config = get_config(os.environ["environment"])
some_variable = config.env_variable
# Now this makes it truly environment agnostic & the exact same code will work everywhere.
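If you are on a Python version without match/case, or simply prefer data over branching, the same lookup can be sketched with a dictionary. Everything here is a stand-in: plain dataclasses replace the Pydantic models, and the `APP_ENVIRONMENT` variable name and the URLs are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalSettings:
    status_url: str = "https://some-url-local.example"


@dataclass(frozen=True)
class DevSettings:
    status_url: str = "https://some-url-dev.example"


@dataclass(frozen=True)
class ProdSettings:
    status_url: str = "https://some-url-prod.example"


# One place that maps an environment name to its settings class.
SETTINGS_BY_ENVIRONMENT = {"local": LocalSettings, "dev": DevSettings, "prod": ProdSettings}


def get_config(environment=None):
    """Return the settings for `environment`, falling back to APP_ENVIRONMENT, then 'local'."""
    name = (environment or os.environ.get("APP_ENVIRONMENT", "local")).lower()
    try:
        settings_cls = SETTINGS_BY_ENVIRONMENT[name]
    except KeyError:
        known = ", ".join(sorted(SETTINGS_BY_ENVIRONMENT))
        raise ValueError(f"unknown environment {name!r}; expected one of: {known}") from None
    return settings_cls()


print(get_config("dev").status_url)
```

Adding a new environment then means adding one class and one dictionary entry, and a mistyped name fails loudly instead of returning nothing.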

What if you have multiple scopes or multiple config requirements? The pattern remains the same. Add as many configs as required as Pydantic BaseSettings models & return them.

Let me explain with a simple use case of mine. My application needs to support multiple languages. About 80% of the code is generic, but there are a few pieces of logic which are language dependent & change based on the underlying language. So I just define them in their respective language Pydantic model. Let's see an example.

from pydantic import BaseSettings

class EnglishConfig(BaseSettings):
    variable1: str = "some thing"
    variable: list = [1, 2, 3]
    hard_coded1: dict = {}
    hard_coded2: int = 999

class FrenchConfig(BaseSettings):
    variable1: str = "some thing"
    variable: list = [1, 2, 3]
    hard_coded1: dict = {}
    hard_coded2: int = 999

class HindiConfig(BaseSettings):
    variable1: str = "some thing"
    variable: list = [1, 2, 3]
    hard_coded1: dict = {}
    hard_coded2: int = 999

Now, exactly like the environment settings logic above, we can make language-dependent logic language agnostic.

5. Bringing all things together

Finally, let me show you what my final config.py looks like.

from typing import Optional

from pydantic import BaseSettings

class EnglishConfig(BaseSettings):
    regex_pattern_alphanumeric: Optional[str] = "[^0-9a-z/s]"
    list_of_missing_must_include_words = ["Missing", "Must include"]
    list_of_name_prefixes = ["dr", "mr", "mrs", "jr", "sr"]

class SpanishConfig(BaseSettings):
    regex_pattern_alphanumeric: Optional[str] = "[^0-9a-z/s]"
    list_of_name_prefixes = ["sres", "señora"]
    list_of_missing_must_include_words = ["Falta", "Debe incluir lo siguiente"]

class FrenchConfig(BaseSettings):
    regex_pattern_alphanumeric: Optional[str] = "[^0-9a-z\u00C0-\u017F/s]"
    list_of_missing_must_include_words = ["Termes manquants", "Doit inclure"]
    list_of_name_prefixes = ["m", "madame"]

class LocalEnvironmentSettings(BaseSettings):
    common_root_folder: Optional[str] = "/tmp"
    logging_level: Optional[int] | Optional[tuple] = (10, 10, 10,)
    status_url: str = "https://some-url-dev.com"
    SOME_SECRET: str
    CONNECTION_STRING: str

class DevEnvironmentSettings(BaseSettings):
    common_root_folder: Optional[str] = "/tmp"
    logging_level: Optional[int] | Optional[tuple] = (10, 10, 10,)
    status_url: str = "https://some-url-dev.com"
    SOME_SECRET: str
    CONNECTION_STRING: str

class QAEnvironmentSettings(BaseSettings):
    common_root_folder: Optional[str] = "/tmp"
    logging_level: Optional[int] | Optional[tuple] = (10, 10, 20,)
    status_url: str = "https://some-url-qa.com"
    SOME_SECRET: str
    CONNECTION_STRING: str

class PRODEnvironmentSettings(BaseSettings):
    common_root_folder: Optional[str] = "/tmp"
    logging_level: Optional[int] | Optional[tuple] = (10, 10, 20,)
    status_url: str = "https://some-url-prod.com"
    SOME_SECRET: str
    CONNECTION_STRING: str

# NOTE - See how I have changed the `status_url` & `logging_level` for all environments & `regex_pattern_alphanumeric` for all languages.

def get_config(language: str, environment: str):
    # setting language based config
    match language:
        case "en":
            language_config = EnglishConfig()
        case "es":
            language_config = SpanishConfig()
        case "fr":
            language_config = FrenchConfig()
        case _:
            raise ValueError(f"given language: {language} must be one of en, es, fr")

    # setting environment based config
    match environment:
        case "local":
            environment_settings = LocalEnvironmentSettings()
        case "dev":
            environment_settings = DevEnvironmentSettings()
        case "qa":
            environment_settings = QAEnvironmentSettings()
        case "prod":
            environment_settings = PRODEnvironmentSettings()
        case _:
            raise ValueError(f"given environment: {environment} must be one of local, dev, qa, prod")

    class GlobalConfig(BaseSettings):
        global_language_config = language_config
        global_environment_settings = environment_settings

    return GlobalConfig()

Note: As with my other blogs, this idea is not limited to Python. I have used Python to explain it, but with a few modifications the same approach can be applied anywhere.

The practical guide to write useful comments

Akash Desarda — Sun, 19 Jun 2022 09:02:16 GMT

1. The Need

I don't think anyone would argue that writing comments in your code is a waste of time and effort. Then why don't most people write good & useful comments? Why do they, willingly or unwillingly, make their own life & experience difficult?

You all must have seen this meme, laughed & then moved on. But this is a very serious problem. Here is one more,

So why do we fall into such a pitfall even knowing pretty well that there will be a pitfall ahead? Here are some of the reasons that I think are primary contributors:

The comments don't go hand in hand with the code & look out of place.
In an already massive code base, it becomes even more difficult to navigate them.
The rush to commit the code.
Some unspoken realities of human behaviour, like "I wrote the code so beautifully that it is self-explanatory" (yeah, maybe to yourself, but not so obvious to others) or "If I am just going to write comments, then when will I write the actual code?" (writing comments is not an afterthought; you should write them along with the code).
Finally, some people are just lazy. Nothing can be done about them, as they are already on a path that will surely make their life difficult in future.

So not writing comments is more of a philosophical or even behavioural problem than a technical or knowledge problem.

2. How to write good comments

There are many good blogs out there (this one is very good) which explain what makes good content for comments. My focus is on practicality & ease of access. The following are the steps that you must follow to write good comments.

2.1 Philosophical change

This can be either very easy or very difficult to adopt. It's up to you. The ideal would be to have more lines of comments than lines of actual code. One more thing that will help: while writing code, think that you are writing it not for yourself but for others.

2.2 Tagging comments

Writing just the comments (even if their content is good) is usually not that helpful, because navigating them becomes difficult. So for this, I have come up with a system of 'Tagging Comments'. It is nothing fancy; you just start a comment with its type. Here are the types that I use:

ANCHOR - Used to indicate a section in your file
TODO - An item that is awaiting completion, address something, etc in future
FIXME - An item that requires an immediate bugfix
NOTE - An important note for a specific code section to fetch the attention of a fellow developer
REVIEW - An item that requires additional review, very useful during pull requests or merges
DEPRECATED - An item which is no longer being used & will be removed in future
WARNING - A warning about the following item; if it is not heeded, something bad can happen
SECTION - Used to define a region
LINK - Used to link to a file that can be opened within the editor (See 'Link Anchors')
EG - An example of what we should expect in the following item

Let me show some practical examples of how they should be used

# 1. Without comments
with open(file_path) as data_file:
    yield from reader(data_file)

# 2. With comments
# TODO - which encoding to use?
with open(file_path) as data_file:
    # spitting out only unit data lazily
    yield from reader(data_file)

Any linter will give a warning for the above example, as no encoding was provided. But I didn't know the answer straight away, so I used a TODO tag to resolve it once I know the answer.

# saving the current iteration's payload
# FIXME - Input request is also sending a set in the payload, which it should not, as a set cannot be
# saved as JSON. Temporary fix is to save it as a text file
DataWriter.str_to_txt(
    str(req_body),
    f"/research/sessions/{uuid}/payload.txt",
)

The above code snippet will still work, but its behaviour is incorrect & it must be changed. That's why I used a FIXME tag. How is it different from the TODO tag? FIXME should be used when you know for certain that the following item is, or will turn into, a bug. TODO should be used for a wider variety of items which are not necessarily bugs or code-breaking.

# to let joblib release all workers gracefully from memory. NOTE - It's only needed here
# because the same joblib workers from cleaning ops are used by entity extraction ops & RQ does not need
# joblib's multiprocessing
time.sleep(5)

A NOTE tag is used to bring the attention of fellow developers to the following item. This type of comment has more importance than a regular comment.

The SECTION tag should be used to group business logic that can be put under a common bucket. This becomes very helpful in the case of a pipeline.

# NOTE - Following ETL pipeline will only work with prod configuration

# SECTION - Part A: Extract
collect_data(source)
clean_data(raw_data)
dump_clean_data(clean_data)

# SECTION - Part B: Transform
read_clean_data(clean_data)
transform_data(data)

# SECTION - Part C: Load
load_data_to_db(transformed_data)

# SECTION - Part D: Post-processing
clean_environment()

Everyone has the habit of commenting out some business logic or code snippet. This is not wrong; there can be a valid reason to keep the commented-out snippet. But it becomes extremely confusing for others, as they might not be aware of the reason. Ultimately, this makes the code difficult to maintain. So it's better to add a DEPRECATED tag along with the reason.

# DEPRECATED - blob storage as a source is not required for now & may be removed completely in future.
# blob_data_source = [unit_source for unit_source in data if unit_source.is_present()]

For the EG tag I follow a couple of rules of thumb:

If you think certain logic is not clear, then add an EG comment showing what the resulting value can look like.
For every nested loop I write an example regarding what to expect in the next level.

# EG - "Random123 hello @#" will become "andomhello" (the pattern also strips spaces)
regex_pattern = "[^a-z]"
result = re.sub(regex_pattern, "", text)

for key, val in random_dict.items():
    # EG - another_random_dict[val] = 'some str'
    for val in another_random_dict:
        if isinstance(another_random_dict[val], str):
            some_list.append(another_random_dict[val])

2.3 Navigate & automate comments

One of the reasons I mentioned above for why people don't want to write comments is the difficulty in navigating them. If I didn't have a solution for this problem, there would be no point to this blog 😅. So let me introduce you to an amazing VS Code extension, Comment Anchors, which I use very heavily for this use case.

It searches for the tags & creates a bookmark for each, so that you can quickly jump to them using its tree view. It can also act as a one-stop place to track all important tags like FIXME, TODO, REVIEW, etc. Go through its docs for more info.
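To make the "search for the tag & create a bookmark" idea concrete, here is a rough sketch of such a scan in plain Python. It is not how the extension is implemented; `find_tagged_comments` and the regular expression are my own assumptions, and the scan is deliberately simplistic (a `#` inside a string literal would fool it).

```python
import re

TAGS = ("ANCHOR", "TODO", "FIXME", "NOTE", "REVIEW", "DEPRECATED", "WARNING", "SECTION", "LINK", "EG")

# A comment marker, one of the tags as a whole word, an optional dash, then the message.
TAG_PATTERN = re.compile(r"#\s*(" + "|".join(TAGS) + r")\b\s*-?\s*(.*)")


def find_tagged_comments(source):
    """Return (line_number, tag, text) for every tagged comment found in `source`."""
    found = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = TAG_PATTERN.search(line)
        if match:
            found.append((line_number, match.group(1), match.group(2).strip()))
    return found


sample = '''\
# TODO - which encoding to use?
with open(file_path) as data_file:
    # spitting out only unit data lazily
    yield from reader(data_file)
time.sleep(5)  # NOTE - lets joblib release its workers
'''

bookmarks = find_tagged_comments(sample)
for line_number, tag, text in bookmarks:
    print(f"{line_number:>3} {tag:<10} {text}")
```

Something like this could also run in CI, for example to fail a build while any FIXME is still present.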

The extension comes with some default tags/strings to bookmark & lets you customize them. The following is the list that I am using (you can refer to their docs on how to customize it according to your specific needs).

"commentAnchors.tags.list": [
    {
        "tag": "ANCHOR",
        "iconColor": "default",
        "highlightColor": "#A8C023",
        "scope": "file"
    },
    {
        "tag": "TODO",
        "iconColor": "blue",
        "highlightColor": "#3ea8ff",
        "scope": "workspace"
    },
    {
        "tag": "FIXME",
        "iconColor": "red",
        "highlightColor": "#F44336",
        "scope": "workspace",
        "isBold": true
    },
    {
        "tag": "NOTE",
        "iconColor": "orange",
        "highlightColor": "#FFB300",
        "scope": "file",
        "styleComment": true
    },
    {
        "tag": "REVIEW",
        "iconColor": "green",
        "highlightColor": "#64DD17",
        "scope": "workspace"
    },
    {
        "tag": "SECTION",
        "iconColor": "blurple",
        "highlightColor": "#896afc",
        "scope": "workspace",
        "behavior": "region"
    },
    {
        "tag": "LINK",
        "iconColor": "#2ecc71",
        "highlightColor": "#2ecc71",
        "scope": "workspace",
        "behavior": "link"
    },
    {
        "tag": "DEPRECATED",
        "iconColor": "#B22222",
        "highlightColor": "#B22222",
        "scope": "workspace",
        "behavior": "anchor",
        "isBold": true
    },
    {
        "tag": "WARNING",
        "iconColor": "#B22222",
        "highlightColor": "#B22222",
        "scope": "workspace",
        "behavior": "anchor",
        "isBold": true
    },
    {
        "tag": "EG",
        "iconColor": "#00FFFF",
        "highlightColor": "#eb667d",
        "backgroundColor": "rgba(49, 184, 79, 0.2)",
        "borderStyle": "1px solid #23b2ea",
        "borderRadius": 6,
        "scope": "workspace",
        "styleComment": true
    }
],

You can add this to your VS Code settings.json file.

3. How to write a good commit message

Writing a good commit message is also extremely important. The same pitfalls as above apply here too.
Just writing something like updated xyz.py or deleted abc.txt or moved some_file.js or bug fix etc. is very bad practice. It does not provide any context & makes it difficult to track changes using git blame.
How do you follow common standards across your team? What should these standards be? No need to reinvent the wheel; just use Conventional Commits.
It is based on the excellent Conventional Commits 1.0.0 spec. You can go through the spec (it is definitely a good read).
It's very easy, intuitive & fun (as it also supports gitmoji 🤩) to use, trust me. Follow its docs to understand more.
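To show what such a standard looks like in practice, here is a small sketch that checks whether a commit header follows the `type(scope)!: description` shape described in the Conventional Commits 1.0.0 spec. The regular expression is my own simplified reading of the spec (header line only, no body or footers), and `parse_commit_header` is a hypothetical helper, not part of any tool mentioned here.

```python
import re

# type, optional (scope), optional "!" for a breaking change, then ": " and a description
HEADER_PATTERN = re.compile(
    r"^(?P<type>[a-zA-Z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_commit_header(header):
    """Return the parts of a conventional commit header, or None if it does not conform."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        return None
    parts = match.groupdict()
    parts["breaking"] = parts["breaking"] is not None
    return parts


for header in ("feat(parser): add ability to parse arrays", "fix!: drop legacy payload", "updated xyz.py"):
    print(header, "->", parse_commit_header(header))
```

The first two headers parse into their parts, while a message like updated xyz.py is rejected because it carries no type at all.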

How to SSH login password free from Windows, Linux, Mac

Akash Desarda — Tue, 01 Feb 2022 17:02:34 GMT

1. Linux OS & MAC OS

Run the following command in bash (or any alternative like zsh, fish, etc) to set up automatic SSH login.

ssh-copy-id -i "" ""

2. Windows

Run the following commands in a local PowerShell window, replacing the user and host name as appropriate, to copy your local public key to the SSH host.

$USER_AT_HOST="your-user-name-on-host@hostname"
$PUBKEYPATH="$HOME\.ssh\id_rsa.pub"
$pubKey=(Get-Content "$PUBKEYPATH" | Out-String); ssh "$USER_AT_HOST" "mkdir -p ~/.ssh && chmod 700 ~/.ssh && echo '${pubKey}' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
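The remote half of that one-liner does four things: create ~/.ssh, restrict it to mode 700, append the public key to authorized_keys, and restrict that file to mode 600. The sketch below replays those steps in Python against a throwaway directory so that you can see the end state safely. `authorize_key` and the placeholder key are hypothetical, and the duplicate check is my own addition (the one-liner above appends unconditionally).

```python
import os
import tempfile
from pathlib import Path


def authorize_key(home, public_key):
    """Append `public_key` to <home>/.ssh/authorized_keys unless it is already listed."""
    ssh_dir = Path(home) / ".ssh"
    ssh_dir.mkdir(mode=0o700, exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # enforce the mode even if the directory already existed

    keys_file = ssh_dir / "authorized_keys"
    existing = keys_file.read_text().splitlines() if keys_file.exists() else []
    key = public_key.strip()
    if key not in existing:
        with keys_file.open("a") as handle:
            handle.write(key + "\n")
    os.chmod(keys_file, 0o600)
    return keys_file


# Demo against a temporary "home" directory, never your real ~/.ssh.
with tempfile.TemporaryDirectory() as fake_home:
    fake_key = "ssh-ed25519 AAAAC3...placeholder user@laptop"
    path = authorize_key(fake_home, fake_key)
    authorize_key(fake_home, fake_key)  # second call changes nothing
    authorized = path.read_text().splitlines()

print(authorized)
```

The strict permissions matter because, with its default settings, the SSH server may ignore an authorized_keys file that other users can write to.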

How to merge a specific directory or file in Git

Akash Desarda — Mon, 27 Sep 2021 02:43:28 GMT

Think of the following scenarios:

There might be two branches with active development and one of the branches needs some specific (updated) file or directory.
You don't want to merge the complete branch but just need some specific file(s).

Similarly, there might be any number of scenarios where the common theme is that, instead of merging the complete branch, all you need is to merge specific file(s)/directory(s).

A smart trick which I like to call 'Selective checkout' can do the intended job:

git checkout destination
git checkout source sub-directory/
git commit -am "Message."
git pull --rebase
git push

Using this git terminal trick, you can now merge a specific file/directory without merging the complete branch.
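If you do this often, the sequence is easy to script. Here is a minimal sketch that only builds the commands as argument lists and prints them, so nothing touches your repository until you decide to run them. `selective_checkout_commands` and the branch and path names are hypothetical; the extra `--` is my addition and simply tells git that what follows are paths rather than branch names.

```python
import shlex


def selective_checkout_commands(destination, source, paths, message):
    """Return the git commands (as argv lists) that copy `paths` from `source` into `destination`."""
    if not paths:
        raise ValueError("at least one file or directory is required")
    return [
        ["git", "checkout", destination],
        ["git", "checkout", source, "--", *paths],
        ["git", "commit", "-am", message],
        ["git", "pull", "--rebase"],
        ["git", "push"],
    ]


commands = selective_checkout_commands(
    "main", "feature", ["docs/", "setup.py"], "Bring docs and setup.py over from feature"
)
for argv in commands:
    print(shlex.join(argv))
```

To actually execute them, each list can be passed to `subprocess.run(argv, check=True)`, which stops at the first command that fails.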

The Ultimate VS Code setup guide 🐱‍💻

Akash Desarda — Sun, 26 Sep 2021 03:47:03 GMT

The key factors for becoming a productive powerhouse and a good developer are:

Good technical & fundamental language knowledge
An understanding of software design
Knowing how to make the most of an IDE.

The third point may seem a little odd to many; in fact, it is mostly undervalued. But to become a productive powerhouse 💪, it is by far the most important point to master.

I have spent a lot of time customizing my VS Code setup and now I feel I have reached a point where I can confidently present it to the world 🌍 as the most complete setup. 🧰
Note: This will be a long blog, but I have divided it into parts so that readers can easily jump to the part that interests them.

https://twitter.com/Love2Code/status/1427992702856142853

Everyone knows how powerful VS Code is; one of its greatest features is its support for a lot of languages out of the box. But tweaking the default setup will boost your productivity 🚀 to the next level.

Part 1: Python Development
Part 2: Git
Part 3: Productivity boosters
Part 4: Customization

Part 1: Python development

Python (extension) by Microsoft:

A compulsory requirement for Python development.
Provides language, debug, test, etc. support.
Latest updates have even brought support for Jupyter notebooks.

Severity: Must

Pylance (extension) by Microsoft:
- Works alongside the Python extension to provide performant language support.
- It adds tons of features to the bare-bones Python extension, such as:
  - Docstrings
  - Signature help, with type information
  - Parameter suggestions
  - Code completion
  - Auto-imports (as well as add and remove import code actions)
  - As-you-type reporting of code errors and warnings (diagnostics)
  - Code outline and navigation
  - Type checking mode
  - IntelliCode compatibility

Severity: Must

Visual Studio IntelliCode (extension) by Microsoft:

It provides AI-assisted development features for Python, TypeScript/JavaScript and Java developers with insights based on understanding your code context combined with machine learning.
Its AI predictions work fairly well and do not intrude on your development.

Severity: Must

There are many other extensions that may not fall under the 'Must' severity but are still very helpful. Let's continue.

SonarLint (extension) by SonarSource:

It is a linter or static analysis tool (free and developed by a very respected company in this domain) that lets you fix coding issues before they exist by analyzing the code.
It can track Bugs 🐛 and Security Vulnerabilities as you write code. But the best part is the documentation it provides for each code smell, without even needing to commit the code.
The installation can be a little tricky, as it needs Java to run. By default it handles the whole installation if sufficient write permission is available; otherwise you have to do it manually. But trust me, once the installation is done this will help you & your team members stick to the best practices of Python coding and maintain consistent code. 🫂

Severity: Essential

Pylint (native support):

Pylint is not an extension but a dedicated linter that can be used independently of VS Code, although VS Code does support it natively (more here).
It is a Python static code analysis tool that looks for programming errors, helps to enforce a coding standard, sniffs for code smells and offers simple refactoring suggestions.
Installing it is as simple as pip install pylint. Follow the official docs above for Pylint integration with VS Code.
- But if you ask me, using its CLI is the preferred way. Run pylint path/to/dir to analyze a complete directory & pylint path/to/some_file.py to analyze a specific file.

Severity: Essential

Note: The most optimal Python linter setup in VS Code is a combination of Pylance, SonarLint & Pylint. All three work independently and do not intrude on each other. Their combination provides the perfect linting experience.

Auto code formatting using Black (native support):

One of the more important points while working in a team and collaborating on a project is writing clean & consistent code. But dividing our focus between logic & writing clean code always harms productivity. To deal with such situations, a formatter should be used.
VS Code supports a wide variety of formatter tools (more here) & automates code formatting. The steps are as follows:
- Go to Settings --> Extensions --> Python --> Python Formatting: Provider and select black from the drop-down menu. Or if you prefer settings.json, simply add "python.formatting.provider": "black"
- While you are in the file/document you want to format, right-click --> Format Document. There is even an option to automatically format the current file on save by adding "editor.formatOnSave": true in settings.json.

I personally use Black for the following reasons:
- Its philosophy is kind of authoritarian & it will format the code strictly according to its own set of rules.
- I believe we already have to make a lot of critical decisions and formatting should not be one of them.
- Also, when the whole team uses the Black formatter, the complete codebase will look & feel consistent and clean.

Severity: Essential

Python Docstring Generator (extension) by Nils Werner:

I am assuming that you are well aware of docstrings.

Writing a good, informative docstring is very important in terms of documentation. But if you or your team decide to follow a format (which you totally should) like NumPy, Google, etc., it can be difficult to write docstrings that consistently comply with the format/style guide.
Python Docstring Generator can generate a docstring template that adheres to the selected format, based on type hints/type annotations, so that you don't have to worry about formatting & can just focus on writing the required information.
You can even jump to the next & previous element in the template using the Tab & Shift + Tab keys respectively.

To change the format, go to Settings --> Extensions --> Python Docstring Generator configuration --> select your desired format from the Auto Docstring: Docstring Format drop-down menu.

Severity: Essential

Python Indent (extension) by Kevin Rose:

If you have ever felt that you keep messing up the indentation, then this extension has got you covered. Every time you press the Enter key, it automatically adds the correct indent.

Severity: Helpful

Python Type Hint (extension) by njqdev:

Provides type hint auto-completion for Python, with completion items for built-in types, classes and the typing module.

Severity: Helpful

Python Test Explorer for Visual Studio Code (extension) by Little Fox Team:

This extension allows you to run your Python Unittest, Pytest or Testplan tests with the Test Explorer UI.
It provides much richer information with ample control.

Part 2: Git

GitLens (extension) by Eric Amodio:

It supercharges the Git capabilities built into Visual Studio Code. It helps you to visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more.
The best part of this extension is that all the default settings that come out of the box work well. It will improve your Git experience 10x. GitLens could even be a standalone tool; yeah, it's that powerful 💪.
There are so many awesome features that I would suggest you watch the video:

https://www.youtube.com/watch?v=rxKGgSLwOnU

Severity: Must

Git Graph (extension) by mhutchie:

It generates beautiful, colourful & informative graphs to visualize your complete commit history across all branches. This makes it extremely easy to track the project.
It even makes every single commit a clickable link that can be used to view the git diff, among other things.
It also offers a lot more Git functionality, which I highly suggest checking out.

Severity: Must

Git History (extension) by Don Jayamanne:

View and search git log, history along with the graph and details, previous copy of the file.
Compare branches, commits, files across commits.

Severity: Essential

Info: You must be thinking 🤔 that VS Code's built-in Git, GitLens, Git Graph & Git History seem to have overlapping functionality, which isn't incorrect. But the important thing is that all these extensions work well without harming each other's functionality, so it's totally fine to use them together. In fact, when they are all used together, they make VS Code the best Git management tool out there.

Conventional Commits (extension) by vivaxy:

Before using the extension, I would suggest you understand the philosophy behind Conventional Commits.
It brings support for Conventional Commits to VS Code.
Note: An explanation of Conventional Commits is out of the scope of this article, but you must go through it before using this extension.

Severity: Essential

Part 3: Productivity boosters

VS Code Workspace (system functionality):

A workspace can be used to create a sandbox setup specific to a project or environment. Every single setting, extension, config, etc. can be customized to cater to the specific needs of a project.
Creating a workspace is very easy: just click on File & select Save Workspace as

Initially, a freshly created workspace inherits everything that is present in User settings. You can then start changing anything you want, and while doing so you will have the option of saving just to the workspace or globally to the user settings.
I rate this feature of VS code right at the top 🔝. This is what I generally do:
- For a python project I set a default python path to be used every time
- I disable any extensions that are not required in the project
- Change terminal-specific settings
- Hell yeah you can even set workspace specific themes, fonts, etc. which I do all the time 😎. I believe having a visual difference helps to differentiate projects.

Severity: Must

Useful keyboard shortcut:

Follow this excellent blog by Shubham Khatri. I too visit it from time to time to freshen up on the commands.

Severity: Essential

Setting sync (system functionality):

Settings Sync lets you share your Visual Studio Code configurations such as settings, keybindings, and installed extensions across your machines so you are always working with your favourite setup.
This way you can maintain a similar & familiar setup across your personal, work, or other machines. I personally use this extensively.
But mind you, as awesome as it looks, it can quickly prove to be a double-edged sword. Why?
- There might be a few extensions that you want only on specific devices.
- There might be a few settings that use a machine-specific path.
- Any other settings or keybindings that you want to use only on a specific device.
To solve this (potential) conflict, vs code gives us extremely granular control over what to sync and what not to. Following are a few of the options (probably the most common ones):
- Do not want to sync a specific extension: open the extension page --> Click on Do not sync this extension
- Do not want to sync some specific setting: Go to that specific setting (using settings UI) --> Click on gear icon --> Sync this setting

Severity: Essential

Thunder Client (extension) by Ranga Vadhineni:

Thunder Client is a lightweight Rest API Client Extension. It is basically like a Postman inside vs code so that you don't have to leave vs code at all.

Severity: Essential

Path Autocomplete (extension) by Mihai Vilcu:

Provides path completion. It supports relative, absolute, workspace path auto-completion.

Severity: Helpful

Comment Anchors (extension) by Exodius Studios:

Writing comments (even more than the code itself 🧑💻) is extremely important in the long term, and so is an efficient way to track & navigate them. Comment Anchors is the best extension to deal with this task. It supports all languages.
You can place anchors within comments or strings to place bookmarks within the context of your code. Anchors can be used to track TODOs, write notes, create foldable sections, or build a simple navigation that makes it easier to move around your files. Anchors can be viewed for the current file, or throughout the entire workspace, using an easy-to-use sidebar.

It even supports adding custom anchors. Following is the list of anchors that I use

// My custom anchor related code in `settings.json`
    "commentAnchors.tags.list": [
        {
            "tag": "ANCHOR",
            "iconColor": "default",
            "highlightColor": "#A8C023",
            "scope": "file"
        },
        {
            "tag": "TODO",
            "iconColor": "blue",
            "highlightColor": "#3ea8ff",
            "scope": "workspace"
        },
        {
            "tag": "FIXME",
            "iconColor": "red",
            "highlightColor": "#F44336",
            "scope": "workspace",
            "isBold": true
        },
        {
            "tag": "STUB",
            "iconColor": "purple",
            "highlightColor": "#BA68C8",
            "scope": "file"
        },
        {
            "tag": "NOTE",
            "iconColor": "orange",
            "highlightColor": "#FFB300",
            "scope": "file",
            "styleComment": true
        },
        {
            "tag": "REVIEW",
            "iconColor": "green",
            "highlightColor": "#64DD17",
            "scope": "workspace"
        },
        {
            "tag": "SECTION",
            "iconColor": "blurple",
            "highlightColor": "#896afc",
            "scope": "workspace",
            "behavior": "region"
        },
        {
            "tag": "LINK",
            "iconColor": "#2ecc71",
            "highlightColor": "#2ecc71",
            "scope": "workspace",
            "behavior": "link"
        },
        {
            "tag": "DEPRECATED",
            "iconColor": "#B22222",
            "highlightColor": "#B22222",
            "scope": "workspace",
            "behavior": "anchor",
            "isBold": true
        },
        {
            "tag": "WARNING",
            "iconColor": "#B22222",
            "highlightColor": "#B22222",
            "scope": "workspace",
            "behavior": "anchor",
            "isBold": true
        },
        {
            "tag": "EG",
            "iconColor": "#00FFFF",
            "highlightColor": "#31e0ec",
            "backgroundColor": "rgba(49, 184, 79, 0.2)",
            "borderStyle": "1px solid #23b2ea",
            "borderRadius": 6,
            "scope": "workspace",
            "styleComment": true
        },
        {
            "tag": "@:",
            "iconColor": "yellow",
            "highlightColor": "yellow",
            "scope": "workspace",
            "behavior": "anchor"
        }
    ],

Note: It offers many more features, so I highly suggest reading their description. 📖

Bracket Pair Colorizer 2 (extension) by CoenraadS:

This extension allows matching brackets to be identified with colours.

Severity: Helpful

Code Spell Checker (extension) by Street Side Software:

This is a very handy extension for someone like me who makes a lot of typos. It not only checks for typos but also suggests corrections.
It works with 20+ file types (which obviously covers all the popular ones). It supports camelCase, PascalCase, snake_case.
You can even add words to a dictionary at the global level or at the workspace level.

Severity: Essential

Error Lens (extension) by Alexander:

ErrorLens turbo-charges language diagnostic features by making diagnostics stand out more prominently, highlighting the entire line wherever a diagnostic is generated by the language and also prints the message inline.
This makes debugging & catching errors comparatively easy.

Severity: Essential

footsteps (extension) by Wattenberger:

Keep your place when jumping between different parts of your code. This is a VSCode extension that will highlight lines as you edit them, fading as you move away. Jump between lines using ctrl+alt+left and ctrl+alt+right.

Severity: Helpful

Zoom Bar (extension) by wraith13:

Can zoom via GUI in the status bar.

Severity: Helpful

Resource Monitor (extension) by mutantdino:

Display CPU frequency, usage, memory consumption, and battery percentage remaining within the VSCode status bar.

Draw.io Integration (extension) by Henning Dieterichs:

Draw.io inside vs code.

Severity: Helpful

Part 4: Customization

This is the one place where vs code really shines.

Rainglow theme (extension) by Dayle Rees:

Rainglow is a collection of colour themes & consists of 320+ syntax and UI themes.
Colour combinations are excellent. All the themes follow similar categories and hierarchies which makes it super easy to pick a new theme.
Trust me after installing it you won't need any other theme.

Material Icon Theme (extension) by Philipp Kief:

There are many good options for icon themes in vs code, but Material Icon Theme covers the most ground and at the same time all the icons are precise and beautifully designed.

Window Colors (extension) by Stuart Robinson:

This extension is a bit unique and fun. It automatically adds a unique colour to each window's activityBar and titleBar.
It works without harming any existing theme extension & works along with it.
Why this is useful you ask? If you have multiple windows open (like me all the time 😅) then this adds a new colour & makes it super easy to differentiate.

Practical OOP in Python: Methods

Akash Desarda — Tue, 22 Jun 2021 03:58:00 GMT

The class is the backbone of OOP in Python and methods are the body parts of the class. Understanding the practical application of methods is the key to taking full advantage of the class and eventually of OOP in Python.

Let's touch on some theory in brief, as this blog focuses on practicality.

1. Brief theory

class: A class is a user-defined blueprint or prototype from which objects are created. It wraps all the similar methods (ideally by conventions).

methods: A glorified function that is a class member and will always remain bound to a class.

If you wish to understand the theory, then I would suggest you go through

https://realpython.com/instance-class-and-static-methods-demystified/#lets-see-them-in-action

Types of methods available in Python:

True method:
1. instance method
2. class method
3. static method
4. property (not everyone will agree)
Method by convention (this is a grouping I have derived myself):
1. private method
2. strict private method

2. Practical use case

The beauty of OOPs in any programming language (that has its support obviously 😎) is the flexibility. There will always be more than one way to do any task at hand, but not all of them are best practice, practical or pythonic.

Note: I will only touch the syntax/theory part briefly, assuming that the reader already knows the theory and is here to see the real-world practical use-cases.

Let's begin by writing a sample class that will be used across the complete blog

from dataclasses import dataclass


@dataclass
class SampleClass:
    """This is a sample class to explain practical use-case of all methods
    """

    xyz: int
    abc: int = 4

    def ins_method(self, no: int):
        """Sample instance method

        Args:
            no (int): any int number
        """
        print(self.xyz * no)

    @classmethod
    def cls_method(cls, no: int):
        """Sample class method

        Args:
            no (int): any int number
        """
        print(cls.abc * no)

    @staticmethod
    def stc_method(no: int):
        """Sample static method

        Args:
            no (int): any int number
        """
        # A static method receives neither self nor cls,
        # so it works only with its own arguments
        print(no * no)

Now we'll see their practical use case one by one.

2.1 instance method

The plain, simple, regular method without any frills.
Practically speaking, this type of method is used 90% of the time. In fact, the instance method alone is what you are going to need almost all the time.
It has free access to all attributes & even to other methods at the same object level. Due to this flexibility it is the most widely used.
In the above SampleClass example, ins_method() is the instance method.

@dataclass
class SampleClass:
    xyz: int
    abc: int = 4

    def ins_method(self, no: int):
        print(self.xyz * no)

s = SampleClass(xyz=10)
s.ins_method(2)

# Output
20

Some key points to note here:
- As instance methods are bound to an object of the class, an object must be created first.
- The same object can then be used anywhere.

Factory function: Before moving to the next part, it is important to understand the factory function, as this is one use-case where classmethod and staticmethod have wide practical application. Again, here I will be focussing on the practical side.

2.2 classmethod

First, get this: classmethod is not a must have/use in Python OOP. Almost always a plain instance method will do the job. But there is one use-case where it fits perfectly, i.e. if you want to use a factory function for accessing a class attribute.
Unlike the instance method, where we need to create an object of the class first & then use the '.' notation to call it, a classmethod can be used without creating an object, as it is bound to the class itself and not to the object. Sounds too technical? Let's see an example.

@dataclass
class SampleClass:
    xyz: int
    abc: int = 4

    @classmethod
    def cls_method(cls, no: int):
        print(cls.abc * no)

# Called on the class itself, no object is created
SampleClass.cls_method(2)

# Output
8

As you can see above, we have not created any object of SampleClass but directly used '.' notation with 'SampleClass' itself, as a classmethod is bound to the class directly. Compare this with the instance method: there we created an object s, and ins_method() was accessed through it because it is bound to s and not to SampleClass.
So what is the benefit of all this? Not a lot, but there are a few:
- The obvious one: use it as a factory function where you don't want to specifically create an object of the class. Why? Maybe you fear the object size will be too large & you only want to use a specific method.
- It is a way to tell your fellow team members or other people that this method doesn't depend on an instance variable (a value provided by the user) but on a class variable, to which the user doesn't have access.

@dataclass
class SampleClass:
    xyz: int  # This is instance variable
    abc: int = 4  # This is class variable

    @classmethod
    def cls_method(cls, no: int):
        print(cls.abc * no)

# Traditional way
class SampleClass:
    abc: int = 4  # This is class variable

    def __init__(self, xyz):
        self.xyz = xyz  # This is instance variable

Here cls_method() only has access to abc, which is a class variable.

An instance method like ins_method(), on the other hand, has access to everything.
- Another use-case is a method that needs to use a class variable as well as user input.

@dataclass
class SampleClass:
    xyz: int
    abc: int = 4

    @classmethod
    def cls_method(cls, no: int):
        print(cls.abc * no)

SampleClass.cls_method(2)  # Here cls_method is using class variable - 'abc' as well as user input - 'no'

# Output
8
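To make the factory idea a bit more concrete, here is a minimal sketch of a classmethod used as an alternative constructor, which is one common way the pattern shows up in practice. The `Temperature` class and its `from_fahrenheit` method are hypothetical names made up for this illustration; they are not part of the SampleClass examples.

```python
from dataclasses import dataclass


@dataclass
class Temperature:
    celsius: float

    @classmethod
    def from_fahrenheit(cls, fahrenheit: float) -> "Temperature":
        # cls is the class itself, so cls(...) builds a new object
        # without the caller having to know the conversion formula
        return cls((fahrenheit - 32) * 5 / 9)


# Called on the class, not on an existing object
t = Temperature.from_fahrenheit(212)
print(t.celsius)
```

Because the method receives `cls` rather than a hard-coded class name, a subclass calling `from_fahrenheit` would get an object of the subclass back.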

2.3 staticmethod

Similar to 'classmethod', even 'staticmethod' is not a must have/use in Python OOP. In fact, it is used even less than classmethod. Some devs argue that it is totally useless, see here.
It shares almost all properties of classmethod, except that it has access to neither class variables nor instance variables, as it uses neither self nor cls.
Like the classmethod it can be used as a factory function, but it should not be, because
- We already have classmethod for it.
- classmethod will work perfectly even if it is not using a class variable.
So where should it be used? There is no true practical use-case for it and it can be skipped entirely. But I do use it in some rare occurrences or edge cases:
- When you have some function outside the scope of a class but feel it is tightly related to the class, it can be included as a staticmethod as it does not need any class/instance variable.
- You want to mimic a private method which doesn't need a class/instance variable. But I would advise using a private method instead (more on this later).
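As a rough illustration of the first edge case, the sketch below keeps a small helper next to the class it belongs with. The `Invoice` class and the `is_valid_amount` helper are hypothetical, invented only for this example.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    amount: float

    @staticmethod
    def is_valid_amount(value: float) -> bool:
        # Uses neither self nor cls: it only looks at its argument
        return value >= 0

    def pay(self) -> str:
        # A static method can still be called through the instance
        if not self.is_valid_amount(self.amount):
            raise ValueError("amount must not be negative")
        return f"paid {self.amount}"


print(Invoice.is_valid_amount(-5))   # callable without creating an object
print(Invoice(10.0).pay())
```

The helper could just as well live as a plain module-level function; keeping it on the class is only a way of signalling that it is related to it.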

2.4 property

This is a special type of method; formally, it falls under descriptors. Follow this excellent blog if you want to understand the theory part
https://www.machinelearningplus.com/python-property/
@property should be used when you want to expose a value produced by some function/method as if it were an attribute.
As the name suggests, a property should always hold some kind of characteristic of the object. Python's builtin pathlib library is one of the best examples.
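A small sketch of both points follows. The `Rectangle` class is a hypothetical example written for this post, while `name`, `suffix` and `parent` are real properties of pathlib's path objects.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Rectangle:
    width: float
    height: float

    @property
    def area(self) -> float:
        # Computed on every access, but read like a plain attribute
        return self.width * self.height


r = Rectangle(3, 4)
print(r.area)  # note: no parentheses

# pathlib exposes characteristics of a path as properties too
p = PurePosixPath("/data/report.csv")
print(p.name, p.suffix, p.parent)
```

Since `area` has no setter, trying to assign to `r.area` raises an AttributeError, which fits the idea that it is a derived characteristic and not stored data.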

2.5 private method

It is a method to which external users do not have access & which is used only internally.
Python doesn't have a true private method (like in Java). But we can implement it using generally agreed conventions.
The convention is to add '_' as a prefix to the instance method name. By seeing this naming convention, the user will understand that this particular method should not be touched.

@dataclass
class SampleClass:
    xyz: int
    abc: int = 4

    def ins_method(self, no: int):
        print(self._pvt_method(no) * no)

    def _pvt_method(self, no: int):
        return (self.abc * no)

But as mentioned earlier, Python doesn't have a true private method, so the user can still use it. Someone who doesn't know the naming convention might accidentally use it.
```
sc = SampleClass(4)
sc._pvt_method(4)

# Output
16
```

2.6 strict private method

You might not find this term anywhere formally; it is something I have coined. Given the above-mentioned limitation of the private method in Python, if you want to enforce privacy you can hack Python's name mangling feature by adding '__' (double underscore) as a prefix to the instance method name.

@dataclass
class SampleClass:
    xyz: int
    abc: int = 4

    def ins_method(self, no: int):
        return (self.__strict_pvt_method(no) * no)

    def _pvt_method(self, no: int):
        return (self.abc * no)

    def __strict_pvt_method(self, no: int):
        return (self.abc * no)

sc = SampleClass(4)

# Output
sc.ins_method(4)
64
sc._pvt_method(4)
16
sc.__strict_pvt_method(4)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-102204654d54> in <module>
----> 1 sc.__strict_pvt_method(4)

AttributeError: 'SampleClass' object has no attribute '__strict_pvt_method'

Though it is strict, it is not a 100% enforced rule, as the method can still be used via the name mangling syntax

sc._SampleClass__strict_pvt_method(4)

# Output
16

Then why use it? Because it makes the method really hard to use outside the class, as the way to call it is not obvious.

Note: I suggest further reading this thread
https://stackoverflow.com/questions/70528/why-are-pythons-private-methods-not-actually-private

Testing in a CI/CD Pipeline Part 3: Deployment testing

Akash Desarda — Sat, 29 May 2021 05:04:26 GMT

This is part 3 of the Testing in a CI/CD Pipeline series. It is advised to first go through part 1 and part 2 🤓.

https://importidea.hashnode.dev/testing-in-cicd-part-1-pr-testing

https://importidea.dev/testing-in-cicd-part-2-integration-testing

1. Deployment testing in brief 💼

Deployment testing is different from system or unit testing. Let's see it in brief (as the original intent of this guide is how to integrate it into a CI/CD pipeline).

In unit testing, tests are performed to measure the correctness of the system's individual smaller or unit components.
In contrast, deployment testing is a testing stage where two or more software units are joined and tested as one entity, but after release or deployment.
Deployment testing in a CI/CD pipeline works best and integrates easily if your system is built using a microservice approach.

2. General mechanics 🧰

Why

To ensure every individual microservice is working/behaving correctly.
Buggy code can be caught after release (& ideally before it is exposed to the public) and trigger an automated (or manual, depending on your use case) rollback.

How

Running a practical (mocking a real-world use case) example against the microservice. The result must be known beforehand and will be used to compare the output from the test.

When

Deployment testing must be done as the last component of the CD pipeline.
It should be triggered after the microservice is successfully deployed.

Where

It is always performed at your infra/deployment layer (eg. Kubernetes)
It should be the last step of any CI/CD pipeline.

3. Implementation in Azure DevOps pipeline 🚀

If your microservice exposes an endpoint (which will be the case most of the time), then all you need is to post a request using a REST API (or whatever your microservice supports).
In my case, the complete microservice was packaged and containerized using Docker and deployed to the Kubernetes cluster. I used a simple python script to extract the URL, post a request using the REST API, and compare the output.
If you have gone through part 2 of this series regarding integration testing, you will notice that a lot of things are common to both. That's true: the core of both kinds of testing is exactly the same, only the way they are executed differs.

Note: I believe the execution part is more important than the logic (as core logic is the same as integration testing). So I will be focussing more on the execution part.

3.1 Test script

As I mentioned earlier, the core of integration testing and deployment testing is the same, so even the testing logic is the same.
The key difference here is that the deployment testing script should be able to test all the microservices. To put it in context with the integration test, the deployment test is a compilation of all the individual integration tests of the respective microservices.
As the concept of the test script is already explained in detail here, I am skipping it; to keep this blog less cluttered, I will only show the overall code at the end.

3.2 Perform deployment test in the Kubernetes

You may not be using Kubernetes at all, but the idea behind this is platform agnostic and understanding the flow is key. Having said that, let's move on.

As all the microservices are Deployment apps (in the kubernetes world, more here), we will treat the deployment test as a microservice too. Doing so has tons of benefits like:
- All the networking requirements will be handled by kubernetes. If you maintain multiple staging envs (like dev, qa, prod), you can use the kubernetes namespace to perform env-specific tests.
- As both sides (here the microservice & the deployment test service) are internal or at the same level in kubernetes, test latency will be much smaller.
The deployment test pod, which will be generated from its k8s deployment app, will only include the test script.
The CD pipeline of every microservice will need a kubectl exec task at the end, because all we need to do is run the python script sitting inside the deployment test pod.
```
kubectl -n <namespace> exec po/<test pod name> -- python3 deployment_test.py
```
Note: In a later section we will see how all this can be automated using references/variables.

3.3 Extracting URL/endpoint

Here comes the magic of kubernetes, as all the networking is handled by kubernetes itself. There are a lot of options, but we will be using DNS for Services and Pods as both the services are internal.

All you need to know:
1. Name of your k8s deployment app.
2. Namespace where the pod is currently sitting.
The URL will be http://<deployment name>.<namespace>.svc.cluster.local/. Let's see an example: if the k8s deployment name is my-deployment & the namespace is dev, then it will be http://my-deployment.dev.svc.cluster.local/
Directly using the IP address is bad practice, as it will keep changing with every new or restarted pod. With the above method, kubernetes handles this for us.
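The naming rule above can be sketched as a tiny helper. `service_url` is a hypothetical function written for this post (it is not part of the deployment test script); it only formats the DNS name and does not talk to the cluster.

```python
def service_url(name: str, namespace: str, path: str = "") -> str:
    """Build the in-cluster DNS URL from the app name and its namespace."""
    return f"http://{name}.{namespace}.svc.cluster.local/{path.lstrip('/')}"


print(service_url("my-deployment", "dev"))
print(service_url("my-deployment", "dev", "/api/IDA"))
```

Keeping the namespace as a parameter is what lets the same test run against dev, qa or prod without any change to the script.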

3.4 Integration with CD Job

I will be using the Release pipeline of the Azure DevOps pipeline for CD Job and focus just on the deployment testing task.

All you need are multiple kubectl tasks.

Step 1: We need to extract the deployment test pod name to perform the kubectl exec command.
1. We will use the kubectl get command to extract the current pod name of the given deployment/app. See the Arguments section carefully; this is where we extract the name. The argument will be,
```
pods -l app=crs-ai-deployment-test -o jsonpath={.items[*].metadata.name}
```

Save the output/name in some reference/variable which can be used in a later stage. This can be done using Output variable > Reference name
Output format should always be none. This can be done using Advanced > Output format > none

As you can see from the above screenshot, I am using test as the reference, which makes the variable name test.KubectlOutput.

Step 2: kubectl exec deployment testing
1. We will use the kubectl exec command to run the python script. See the argument section carefully. The test.KubectlOutput which was produced in the previous stage is used now.

Step 3: (extra) print logs if the test fails
1. Similar to integration testing, where logs were printed to the console after a failed test for investigation, the same can be done in deployment testing.
2. We will need a kubectl logs command with the --since=10m flag

See the argument section carefully. Here the reference $(pod.KubectlOutput) is the pod name of the microservice. Its current pod name can be extracted similarly to how we did for the deployment test pod.
We also need to change the Control option to Only when a previous task failed, as this should only run when the previous `kubectl exec` task performing the deployment test has failed.
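For readers who are not on Azure DevOps, the three steps above boil down to three kubectl invocations. The sketch below only builds the argument lists; the function names, the `crs-ai-deployment-test` label reuse and the `my-test-pod`/`my-api-pod` pod names are assumptions for illustration, and nothing here is executed against a cluster.

```python
from typing import List


def get_pod_name_cmd(namespace: str, app_label: str) -> List[str]:
    # Step 1: ask kubectl for the current pod name of the given app
    return ["kubectl", "-n", namespace, "get", "pods",
            "-l", f"app={app_label}",
            "-o", "jsonpath={.items[*].metadata.name}"]


def exec_test_cmd(namespace: str, test_pod: str) -> List[str]:
    # Step 2: run the test script inside the deployment test pod
    return ["kubectl", "-n", namespace, "exec", f"po/{test_pod}",
            "--", "python3", "deployment_test.py"]


def logs_cmd(namespace: str, pod: str, since: str = "10m") -> List[str]:
    # Step 3: fetch recent logs of the microservice pod after a failure
    return ["kubectl", "-n", namespace, "logs", pod, f"--since={since}"]


for cmd in (get_pod_name_cmd("dev", "crs-ai-deployment-test"),
            exec_test_cmd("dev", "my-test-pod"),
            logs_cmd("dev", "my-api-pod")):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` as-is; in the release pipeline the same arguments are simply typed into the kubectl tasks.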

So this is how deployment testing can be automated and integrated into the CD job. Let's see some action.

4. Demo

First, the pod name of the deployment test pod was extracted. Then the testing was performed. As the test failed, next the logs were printed.

5. Deployment test code

This may change based on your requirement. But you can still refer to this for the idea, as always an idea is platform agnostic.

import sys
import json
import logging
import argparse
import requests
from typing import Dict, Tuple

sys.tracebacklimit = 0
logging.basicConfig(level=logging.INFO, format="[%(levelname)s]: %(message)s")
logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
parser.add_argument("--namespace", type=str)  # Targeted namespace
parser.add_argument("--deployment_name", type=str)  # Targeted test as this is compilation of all individual test
args = parser.parse_args()  # needed because args is used below


def payload_data(deployment_name: str) -> Tuple[str, str, Dict]:
    """Prepare payload data for Deployment specifics

    Parameters
    ----------
    deployment_name : str
        Name of deployment to perform testing.

    Returns
    -------
    str
        Api name
    str
        In-cluster DNS name of the service
    Dict
        Sample data to check for deployment testing

    Raises
    ------
    ValueError
        Must be from supported deployment testing:  research-clarity-id-applicability,research-clarity-id-adv-nonadv
    """
    # TODO: Add new deployment name to the list
    supported_deployment = [
        "research-clarity-id-applicability",
        "research-clarity-id-adv-nonadv",
    ]
    if deployment_name not in supported_deployment:
        raise ValueError(
            f"Given deployment is either wrong or not supported.\nIt must be from {', '.join(supported_deployment)} "
        )
    # TODO: Add all new sample data and API here
    elif deployment_name == "research-clarity-id-applicability":
        api_name = "IDA"
        svc = f"crs-id-applicability-api.{args.namespace}.svc.cluster.local"
        path = "./data/sample_data_research-clarity-id-applicability.json"
    elif deployment_name == "research-clarity-id-adv-nonadv":
        api_name = "adverse_nonadverse"
        svc = f"crs-id-adverse.{args.namespace}.svc.cluster.local"
        path = "./data/sample_data_research-clarity-id-adv-nonadv.json"
    with open(path, "r") as file:
        payload = json.load(file)
    return (api_name, svc, payload)


api_name, svc, payload = payload_data(args.deployment_name)
url = f"http://{svc}/api/{api_name}"
logger.info(f"Testing api @ {url}")
header = {"Content-Type": "application/json"}
logger.info(f"Send API request for {args.deployment_name} deployment testing")
response = requests.request("POST", url, headers=header, json=payload)

# TODO: Add all new assert condition here.
if args.deployment_name == "research-clarity-id-applicability":
    response_data = response.json()
    assert (
        response_data["ida_output_path"].split("/")[-1] == "IDA.ndjson"
    ), "Not Received expected output, test is failed"
elif args.deployment_name == "research-clarity-id-adv-nonadv":
    response_data = str(response.content).replace("'", "")
    assert (
        response_data.split("/")[-1] == "classifiation_output.ndjson"
    ), "Not Received expected output, test is failed"
logger.info("Deployment test passed successfully !!!")

Testing in a CI/CD Pipeline Part 2: Integration testing

Akash Desarda — Thu, 27 May 2021 06:35:11 GMT

This is part 2 of the Testing in a CI/CD Pipeline series. It is advised to first go through part 1 🤓.

https://importidea.hashnode.dev/testing-in-cicd-part-1-pr-testing

1. Integration testing in brief 💼

Integration testing is different from system or unit testing. Let's see it in brief (as the original intent of this guide is how to integrate it into a CI/CD pipeline).

In unit testing, tests are performed to measure the correctness of the individual smaller or unit components of the system.
In contrast, integration testing is a testing stage where two or more software units are joined and tested as one entity.
Integration testing in a CI/CD pipeline works best and integrates easily if your system is built using a microservice approach.

2. General mechanics

Why

To ensure every individual microservice is working/behaving correctly.
Buggy code is stopped in the CI pipeline itself & never gets deployed.

How

Running a practical (mocking a real-world use case) example against the microservice. The result must be known beforehand and will be used to compare the output from the test.

When

Integration testing must be done during the CI pipeline.
It should be triggered after the microservice is successfully built.

Where

It is always performed at Remote Repository (eg Github, Gitlab).
It should be the second step of any CI/CD pipeline.

3. Implementation in Azure DevOps pipeline 🚀

If your microservice exposes an endpoint (which will be the case most of the time), then all you need is to post a request using a REST API (or whatever your microservice supports).
In my case, the complete microservice was packaged and containerized using Docker. I used a simple python script to extract the URL, post a request using the REST API, and compare the output.

Let's first go through the testing script and then through the CI pipeline.

3.1 Test script

The core of the script is to read some sample data, send a post request and finally compare the result. The bare minimum would be

with open('./tests/sample_data.json', 'r') as f:
    payload = json.load(f)
header = {"Content-Type": "application/json"}
response = requests.request('POST', url, headers=header, json=payload)
response_data = response.json()
assert response_data['ida_output_path'].split('/')[-1] == 'IDA.ndjson', 'Not Received expected output, test is failed.'  # You may use some other method to compare 😁

This will only work in an ideal scenario, which will not be the case 99% of the time, & in fact it completely defeats our original purpose.
We need to make it more suitable & versatile for this use case. It can be done by adding two more components,
1. A try-except block.
2. Console logging at the time of failure, to investigate it.

try:
    response = requests.request('POST', url, headers=header, json=payload)
    response_data = response.json()
except Exception:
    logging.error('An error has occurred. Refer logs to locate error.')
    os.system("docker logs test_api > output.log")
    time.sleep(3)
    with open('./output.log', 'r') as log:
        print(log.read())

try:
    # Compare the response with the expected output
    assert response_data['ida_output_path'].split('/')[-1] == 'IDA.ndjson', 'Not Received expected output, test is failed.'
except (AssertionError, KeyError) as e:
    os.system("docker logs test_api > output.log")
    time.sleep(3)
    with open('./output.log', 'r') as log:
        print(log.read())

Some pointers from the above code block
1. I have wrapped the post request and the result comparison in try-except blocks so that an error does not abruptly stop the script.
2. As I have mentioned earlier, I am performing the test against the docker container, so I am extracting all the logs from it (as shown on line no 6, 15)
3. Printing logs at the time of failure is very important for investigating the issue. The aim of integration testing is not just to restrict a buggy release but also to help investigate it.

3.2 Extracting URL/endpoint

This may differ quite a lot based on your use case. This is how I do it.
1. I first start/run the freshly built docker container on the worker node of CI pipeline
2. Then extract the IP address where the container is running.
3. Use this IP as the endpoint to post request over REST API.

subprocess.call(['docker', 'run', '-d', '-p', '80:80', '--name', 'test_api', args.image_name])
ip = subprocess.getoutput("docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' test_api")
url = f'http://{ip}/api/IDA'
logger.info(f"Testing api @ {url}")

3.3 Integration with CI Job

If you follow a similar flow then all you need is a task to run the python script. It can be as simple as,

- task: PythonScript@0
  displayName: Integration testing
  inputs:
    scriptSource: 'filePath'
    scriptPath: '$(System.DefaultWorkingDirectory)/tests/integration_testing.py'
    arguments: '--image_name your.repo.io/ida:$(Build.BuildNumber)'

There are two more important points when implementing integration testing:
1. Placement: it should be performed just after Docker build and before Docker push.
2. If the test fails, the script must stop the pipeline from proceeding further. This can be achieved by exiting with a non-zero status, for example os.sys.exit('Task terminated'). Note that calling it with no argument exits with status 0, which does not signal a failure.
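
As a small sketch of the second point: passing a string to sys.exit prints it to stderr and ends the process with exit status 1, which CI runners generally treat as a failed task. The helper name fail_pipeline is hypothetical.

```python
import sys


def fail_pipeline(message):
    """Abort the script so that the CI task is marked as failed."""
    # A string argument is printed to stderr and the exit status becomes 1.
    sys.exit(message)
```

Under the hood sys.exit raises SystemExit, so any finally blocks still run before the process ends.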

4. Working/Demo

Img: Case- Passing of Integration test

Img: Case- Failing of Integration test.

As you can see, no further tasks were executed after the integration test failed, and consequently none of the subsequently connected CD tasks ran either.

5. Complete code

integration_testing.py

import os
import json
import time
import logging
import argparse
import requests
import subprocess

logging.basicConfig(level=logging.INFO, format='[%(levelname)s]: %(message)s')
logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
parser.add_argument('--image_name', type=str)
args = parser.parse_args()

# Run docker container at port 80
subprocess.call(['docker', 'run', '-d', '-p', '80:80', '--name', 'test_api', args.image_name])
ip = subprocess.getoutput("docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' test_api")
url = f'http://{ip}/api/IDA'
logger.info(f"Testing api @ {url}")

with open('./tests/sample_data.json', 'r') as f:
    payload = json.load(f)
header = {"Content-Type": "application/json"}

logger.info('Waiting for 30 sec to let docker container start')
time.sleep(30)

logger.info("Send API request for integration testing")
try:
    response = requests.request('POST', url, headers=header, json=payload)
    response_data = response.json()
except Exception:
    logging.error('An error has occurred. Refer logs to locate error.')
    os.system("docker logs test_api > output.log")
    time.sleep(3)
    with open('./output.log', 'r') as log:
        print(log.read())
    os.sys.exit('Task terminated')

try:
    assert response_data['ida_output_path'].split('/')[-1] == 'IDA.ndjson', 'Not Received expected output, test is failed.'
except (AssertionError, KeyError) as e:
    os.system("docker logs test_api > output.log")
    time.sleep(3)
    with open('./output.log', 'r') as log:
        print(log.read())
    logging.error(f'Problem with {e} Refer logs to locate error')
    logging.error(f"Response from API: {response_data}")
    os.sys.exit('Task terminated')

logger.info("Integration test passed successfully moving to next task")

CI pipeline

trigger:
  branches:
    include:
    - develop
  paths:
    exclude:
    - Dockerfile_base
    - requirements.txt

pool:
  vmImage: 'ubuntu-latest'

steps:
- checkout: self
  clean: true
  fetchDepth: 1

- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.x'
    addToPath: true
    architecture: 'x64'

- task: CmdLine@2
  displayName: Install python package
  inputs:
    script: 'python3 -m pip install requests azure-devops'

- task: PythonScript@0
  displayName: Check build pipeline status
  inputs:
    scriptSource: 'filePath'
    scriptPath: '$(System.DefaultWorkingDirectory)/tests/base_pipeline_status.py'
    arguments: '--personal_access_token $(PERSONALACCESSTOKEN) --repo_id b978e55f-bf80-466c-86c8-fc0dfe909b2c --pipeline_def_id 179'

- task: Docker@0
  displayName: 'Docker Build Image'
  inputs:
    azureSubscription: 'Your Subscription'
    azureContainerRegistry: 'Your container registry'
    dockerFile: Dockerfile
    buildArguments: |
      ARG_STORAGEACCOUNTNAME=$(STORAGEACCOUNTNAME)
      ARG_CONTAINERNAME=$(CONTAINERNAME)
      ARG_STORAGEACCOUNTKEY=$(STORAGEACCOUNTKEY)
      ARG_MAXWORKERS=$(MAXWORKERS)
    imageName: 'ida:$(Build.BuildNumber)'

- task: PythonScript@0
  displayName: Integration testing
  inputs:
    scriptSource: 'filePath'
    scriptPath: '$(System.DefaultWorkingDirectory)/tests/integration_testing.py'
    arguments: '--image_name your.repo.io/ida:$(Build.BuildNumber)'

- task: Docker@0
  displayName: 'Push image to ACR'
  inputs:
    azureSubscription: 'Your Subscription'
    azureContainerRegistry: 'Your container registry'
    action: 'Push an image'
    imageName: 'ida:$(Build.BuildNumber)'

- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.SourcesDirectory)/kube'
    ArtifactName: 'drop'
    publishLocation: 'Container'

Testing in a CI/CD Pipeline Part 1: PR testing

Akash Desarda — Wed, 26 May 2021 03:20:23 GMT

There is no doubt about the benefits of TDD 🧪, or test-driven development. Follow this awesome Twitter thread and I bet that if you are not a fan of TDD, you will surely become one.

https://twitter.com/gareth_leake_/status/1308989905197043713

TDD is a lot more than just unit testing. Let us see how we can design our CI/CD pipeline to adapt to the TDD approach.

Info: 💡💡💡
I am using the Azure DevOps pipeline as the CI/CD tool, but I believe you will be able to implement it in other CI/CD tools too. Understanding the mechanism is key.

Let's break down the pipeline into three components and see how to perform a test in each.

1. Pull request testing in brief 💼

PR testing is different from system or unit testing. Let's look at it in brief (as the original intent of this guide is how to integrate it into a CI/CD pipeline).

In unit testing, tests are performed to measure the correctness of the system's individual, smaller unit components.
In contrast, PR testing is a testing stage where functional tests are performed on the complete scope of the branch that is to be merged.
PR testing in a CI/CD pipeline works best and integrates easily if your system is built using a microservice approach.

2. General mechanics 🧰

Why

To ensure the code is functionally working as expected.
It helps to restrict the merging of any unintended commits.
Ensures nothing breaks over collaboration.

How

The core of running a PR test is the same as running a unit test.
All the tests used here are identical to unit tests, with a few key differences:
1. Unit tests are (or should be) performed locally, whereas PR testing is performed on the remote.
2. The code coverage of PR testing must always be higher than that of unit testing, because the intention of a unit test is just to check individual changes locally, whereas the intention of PR testing is to check the complete branch, which may include more than one collaborative effort.
3. You may not need 100% of the unit tests to pass (though this is not ideal), but PR testing must pass 100%.

When

PR testing should be performed at the time of a PR completion or merge activity.

Where

It is always performed at Remote Repository (eg Github, Gitlab).
It should be the first step of any CI/CD pipeline.

3. Implementing it in the Azure DevOps pipeline 🚀

I am using python and pytest here. I am assuming that you will already have all the unit tests written and follow all conventions for pytest automatic test discovery.
All you need to run is a terminal command

- task: CmdLine@2
  inputs:
    script: |
      # If you have setup.py
      python3 -m pip install .
      pytest -vv
    workingDirectory: '$(System.DefaultWorkingDirectory)'
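
For readers less familiar with the discovery conventions mentioned above, here is a minimal sketch of a test that pytest would pick up with its default settings. The file name and the helper output_filename are hypothetical and only serve as an illustration.

```python
# tests/test_paths.py  (hypothetical file; pytest collects files named test_*.py)


def output_filename(path):
    """Return the last component of a '/'-separated path."""
    return path.split("/")[-1]


def test_output_filename_is_ndjson():
    # pytest collects functions whose names start with `test_`.
    assert output_filename("some/dir/IDA.ndjson") == "IDA.ndjson"
```

Running pytest -vv from the repository root, as the task above does, would then report this test by name.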

But the critical part here is how to automate it. After all, this is part of CI/CD, and the whole point of CI/CD is automation.
This is a two-step process in Azure DevOps:
1. Create a new pipeline dedicated to running the unit tests. The trigger should be the same as that of its parent build or CI pipeline.

trigger:
- develop

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.x'
    addToPath: true
    architecture: 'x64'

- task: CmdLine@2
  inputs:
    script: |
      # If you have setup.py
      python3 -m pip install .
      pytest -vv
    workingDirectory: '$(System.DefaultWorkingDirectory)'

Tip:🤝🤝🤝
As you can see above, I am using the develop branch as the trigger, since this PR testing is intended for the develop branch.

Branch Policies:
- To enable PR triggers we need to set branch policies in Azure DevOps
- Go to Repo > Branch > Branch Policies > Build Validation

Img: Branch Policies

Img: Build Validation

We have to add a build validation policy for the PR trigger. Of all the settings, two are key:
1. Trigger: Must be automatic
2. Policy requirement: Required

Img: Build Validation Policy

See in the Build pipeline section, I am pointing to the PR testing pipeline that I showed earlier.

Img: How the PR testing works (see the marked item)

Understanding current AI industry expectation

Akash Desarda — Sun, 23 May 2021 16:47:53 GMT

1. The Past Expectation

It is vital to understand past responsibilities in order to adapt to the industry's current expectations.

The AI space was still relatively new (though not in academia), and many companies and startups were still analyzing its applications and valid use-cases.
Research was the primary focus. The caveat was that this research was often not directly in line with the core business of the organization, so initially not much credibility was expected of it.
Companies generally blended the role of a Data Scientist with that of a Data Analyst or Data Engineer, again due to the vagueness of AI's enterprise applications.
Individuals faced a similar dilemma: a lot of their research or work was not directly in line with the business and was not practically viable to be served as a product.

2. The current outlook

The democratization of AI has seen remarkable developments from businesses and startups. Let us try to understand it:

The industry now distinguishes between the roles of Data Scientist, Machine Learning Engineer, Data Analyst, Data Engineer, and even MLOps Engineer.
Businesses no longer allow research in the wild, as they know exactly which use-case they are tapping into. A clear mindset and a similarly focused approach is also required from the individual.
Every research effort or POC must have a tangible and servable product.

3. The thorough dissection of all the Roles

If we have to pick one area where the Businesses have excelled in AI space, it is undoubtedly the clear expectation from all varieties of the Roles, which are in a nutshell:

Data Scientist: A Data Scientist is a person who (generally from a stats/maths background) uses a variety of means including AI to extract valuable information from data.
- A fundamental difference between a Data Analyst and a Data Scientist is that the former generally relies on domain knowledge and manual, old-school methods to make sense of data on a small to medium scale, whereas the latter is responsible for collecting, analyzing and interpreting data on a larger scale using a wider set of tools like AI, SQL, old-school manual ways, etc.
- Domain knowledge is not a must, but having it is helpful.
- The primary job is to maintain and extract business contributing insights from data & not to develop the software or product.
- A Statistician or a Mathematician can become a good Data Scientist.
Machine Learning Engineer: A niche software engineer who develops a product or service based on AI.
- An ML engineer needs to have all the expertise of traditional software engineering along with knowledge of AI because he/she is eventually going to build software with AI at its core.
- The primary job is not to extract data but to develop an AI tool that can perform the same job.
- A developer with good knowledge of machine learning/deep learning as well as software engineering can become a good Machine learning engineer.
Machine Learning Operations (MLOps) Engineer: A niche software engineer who maintains and automates the pipeline which is used by the ML system.
- A relatively new field inspired by DevOps, though different from traditional DevOps roles.
- Unlike traditional software engineering, development of any product/software/service based on AI doesn't stop once the software is built. It has to be updated regularly with new data, based on the Data-Drift.
- The primary job includes all traditional DevOps work as well as maintaining/automating the pipeline and handling Data-Drift.
- A developer with good knowledge of machine learning/deep learning, software engineering and cloud technologies can become a good MLOps engineer.
Data Engineer: A niche software engineer who develops a pipeline to serve all data needs using a variety of tools (generally cloud-based)
- The data engineer needs to have expertise with major cloud-based data platforms, batch or stream processing, big data platforms (depending on business requirements) and databases (though not at the level of a Database Administrator).
- Has to work in line with Data Governance policies.
- The primary job is to design, implement and maintain the data pipeline.
- A developer with good knowledge of cloud technologies, data platforms and processing needs can become a good Data Engineer.

For a newcomer or someone who is aiming to advance in his or her career, all these roles and expectations must be well understood. Given that companies are clearly distinguishing these roles, individuals are expected to do the same. A vague mindset is totally useless.

Retrieve Azure DevOps Pipeline current/older metadata

Akash Desarda — Sat, 17 Apr 2021 02:32:49 GMT

1. The Problem

I believe that to better understand any blog, understanding the original requirement behind it is essential.

We use several microservices in our application, which is finally deployed using Kubernetes. Each microservice has two CI pipelines:

Base pipeline: Used to install all dependencies.
Build pipeline: Used to build service on top of the base pipeline.

Now, the base pipeline runs far less often than the build pipeline, as dependencies change very rarely. Both the base pipeline and the build pipeline produce docker images as the end result, which are finally pushed to our container registry. If a pipeline fails then its respective docker image is not pushed.

Initially, I used to manually check them every time. It's not as daunting as it sounds, as there were a few checks in place:

The docker image in the build pipeline is built on top of the docker image from the base pipeline. So even if the latest base pipeline fails, the build pipeline will simply pull an earlier successfully pushed image.
In the rare scenario where both pipelines are triggered simultaneously, I have to manually pause the build pipeline until the base pipeline has succeeded. Now this defeats the purpose of automation via a CI pipeline 😭

As I mentioned earlier, the build pipeline can pull the base docker image from the repository no matter what the current status of the base pipeline is. This is done intentionally so that the CI pipeline will not break. But this also gave rise to a harmful design flaw. Ideally, the build pipeline must have all the latest packages and utilities installed. Smart people can definitely smell a trade-off here.

So let's summarize this into a checklist that can solve all these problems. Ideally, the build pipeline should go through the following checklist before starting its execution:

Is the base pipeline running?
- If running, then wait and check again after some interval using a polling mechanism
- Resume as soon as the base pipeline succeeds
What is the status of the latest completed base pipeline?
- If it succeeded, then continue
- If it failed, then stop the build pipeline
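
The checklist above can be sketched as a small decision function. This is only an illustration: the function name next_action is hypothetical, and I am assuming the status and result strings "completed" and "succeeded" as reported by Azure DevOps, treating any other result of a completed run as a reason to stop.

```python
def next_action(status, result):
    """Map the latest base-pipeline build to 'wait', 'continue' or 'stop'."""
    if status != "completed":
        # The base pipeline is still queued or running: keep polling.
        return "wait"
    if result == "succeeded":
        return "continue"
    # Failed (or otherwise unsuccessful) base pipeline: stop the build pipeline.
    return "stop"
```

For example, next_action("inProgress", None) gives "wait", while next_action("completed", "failed") gives "stop".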

2. The Solution

After going through the problem now it's time to go through the solution.

2.1 Personal Access Token (PAT)

We'll need a PAT to establish a connection with Azure DevOps and create its client. Follow this official guide from Microsoft to create one. Make sure the PAT has Read access for Build and Release. You can enable it from Edit > Scopes.

Warning:
Make sure not to lose the PAT, as it is not stored anywhere in Azure DevOps and can only be copied once, at the time of creation.

2.2 Creating the Azure DevOps client

Azure DevOps provides extensive support through its REST API 🚀. We could call the REST API directly over HTTPS, but building custom logic around raw calls is cumbersome. So instead I am using the Azure DevOps Python API.

Let's see how to create the client

from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication

# Setting up client to connect Azure devops
credentials = BasicAuthentication("", 'YOURPAT')
connection = Connection(base_url="", creds=credentials)
core_client = connection.clients.get_core_client()
build_client = connection.clients.get_build_client()

2.3 Retrieve Builds

Here I will retrieve specific builds of the base pipeline in focus with the help of filters. The best part of this REST API is that the results are updated instantly 😍.

builds = build_client.get_builds(
    project="",
    repository_id=,
    repository_type="TfsGit",  # For git based repo
    branch_name="refs/heads/develop",  # If you want to retrieve build runs for a specific branch. In my case it is develop. Do not remove 'refs/heads/'
    definitions=[],
)

Tip:🤝🤝🤝
I struggled a bit to find the 'definitionId'. But there's a neat trick: just go to your pipeline's build page (where it shows the full history of runs) and you will find it in the URL.
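
If you prefer not to read the ID off the address bar by eye, the trick above can be sketched with the standard library. The helper name and the example URL (organisation and project names) are made up for illustration; I am only assuming that the ID appears as a definitionId query parameter, as the tip describes.

```python
from urllib.parse import parse_qs, urlparse


def definition_id_from_url(url):
    """Return the definitionId query parameter as an int, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("definitionId")
    return int(values[0]) if values else None


# Hypothetical pipeline page URL copied from the browser.
example_url = "https://dev.azure.com/my-org/my-project/_build?definitionId=183"
pipeline_def_id = definition_id_from_url(example_url)
```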

2.4 Polling Mechanism

This is the most critical part, as it ensures every item on the above checklist is ticked. The script first checks the result of the last build, and if it is not in a failed state, it goes on to check whether the status is completed. This step is repeated every 10 seconds with a timeout of 30 minutes.

for step_time in list(range(10, 1810, 10)):
    builds = build_client.get_builds(
        project="",
        repository_id=,
        repository_type="TfsGit",
        branch_name="refs/heads/develop",
        definitions=[],
    )
    # Time out with 30m
    if step_time == 1800:
        logger.error(
            f"Time out as build pipeline is taking more than 30m.\nCheck the pipeline to investigate here {builds.value[0].url.replace('_apis/build/Builds/', '_build/results?buildId=')}"
        )
        os.sys.exit("Task terminated")
    # Checking last result
    if builds.value[0].result == "failed":
        logger.error(
            f"Latest/last base pipeline is in failed state.\nCheck here {builds.value[0].url.replace('_apis/build/Builds/', '_build/results?buildId=')}"
        )
        os.sys.exit("Task terminated")
    # Checking for current status
    if builds.value[0].status == "completed":
        logger.info("Since base pipeline is ready moving ahead")
        break
    else:
        logger.info(
            f"Respective pipeline's current status: {builds.value[0].status} and current period of waiting: {step_time}s"
        )
        time.sleep(10)

I have even added a little extra logging information that points to the failed run in case of failure.

logger.error(
    f"Latest/last base pipeline is in failed state.\nCheck here {builds.value[0].url.replace('_apis/build/Builds/', '_build/results?buildId=')}"
)

and

logger.info(
    f"Respective pipeline's current status: {builds.value[0].status} and current period of waiting: {step_time}s"
)
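
The loop above mixes the Azure DevOps calls with the waiting logic. As a sketch under my own assumptions (the helper name wait_until and its parameters are hypothetical, not part of the original script), the poll-with-timeout pattern on its own can be written as:

```python
import time


def wait_until(check, interval=10, timeout=1800, sleep=time.sleep):
    """Call `check()` every `interval` seconds until it returns a truthy value.

    Returns True as soon as `check()` succeeds, or False once `timeout`
    seconds of waiting have accumulated. `sleep` is injectable for testing.
    """
    waited = 0
    while True:
        if check():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

Here check would wrap the get_builds call and return True once the latest build's status is "completed". Keeping it as a plain callable makes the waiting logic easy to test without a real pipeline.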

3. Integration with Pipeline

This part can vary with your use case or preference. This is how I do it.

I put the script at a fixed location in every repo.
I set the Python version to be used to 3.x.
I use pipeline variables to pass some parameters to the script, like the PAT, which I store as a secret and then pass to the script.
Finally, I use the Run Python Script task of the Azure DevOps pipeline to run it.

- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.x'
    addToPath: true
    architecture: 'x64'

- task: CmdLine@2
  displayName: Install python package
  inputs:
    script: 'python3 -m pip install requests azure-devops'

- task: PythonScript@0
  displayName: Check build pipeline status
  inputs:
    scriptSource: 'filePath'
    scriptPath: '$(System.DefaultWorkingDirectory)/tests/base_pipeline_status.py'
    # You can see I am passing PAT token to PERSONALACCESSTOKEN from pipeline variable
    arguments: '--personal_access_token $(PERSONALACCESSTOKEN) --repo_id e589ca4a-16ff-453b-b101-2a9f21542d76 --pipeline_def_id 183'

Tip:🤝🤝🤝
I always run this task at the beginning to avoid running other tasks unnecessarily if this fails eventually.

Here is a screenshot of the working of the logic

A complete guide to building a Docker Image serving a Machine learning system in Production

Akash Desarda — Fri, 12 Feb 2021 05:06:45 GMT

Building a Docker image is generally considered trivial compared to developing other components of a ML system like data pipeline, model training, serving infra, etc. But an inefficient, bulky docker image can greatly reduce performance and can even bring down the serving infra.

0. Disclaimer

This blog focuses on building an ideal Docker image and not on Docker's concepts or benefits. I am assuming you have basic knowledge of a few Docker topics:

General working of Docker
Basics of Docker build, run
Writing and syntax of a Dockerfile

1. General Docker build best practice

There are quite a few very good sources for general best practices, like the official Docker guide, but I would like to keep this short and relevant to ML-system-based projects.

Requirements.txt must always pin a version for every Python package. Never write just the package name, as pip will then always install the latest release, which completely defeats the purpose of using Docker.
Always group similar RUN commands so that they result in a single Docker layer. (I will avoid the temptation to explain this as it is a little out of scope.)

eg:

RUN apt update && \
    apt install --no-install-recommends -y build-essential gcc curl ca-certificates python3 && \
    apt clean && rm -rf /var/lib/apt/lists/*

Use the --no-cache-dir flag of pip, as the target environment is production, e.g. RUN pip install --no-cache-dir --user -r /req.txt
Use .dockerignore to avoid unnecessary build context. This works exactly like .gitignore.
Whenever possible use the slim version of the base image, like python:slim-buster, debian:buster-slim, etc.
Avoid the use of Alpine based base Docker image. This might be a little controversial but trust me they don't work well with Python. Refer to this excellent blog by Itamar Turner-Trauring.
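
The first practice above (always pin versions) is easy to check automatically. Below is a deliberately simple sketch, not a full requirements parser: the function name is hypothetical, and it only flags lines that lack an exact == pin, ignoring comments, blank lines and pip options such as -r.

```python
def unpinned_requirements(text):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


sample = "numpy==1.19.5\npandas\n# a comment\nrequests>=2.0  # too loose\n"
loose = unpinned_requirements(sample)
```

A CI step could read req.txt, call this function, and fail the job if the returned list is not empty.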

2. Building a Docker image for any Python Project (CPU):

Most of the time an ML system will be based on Python, so it is critical to build any Python-based Docker image efficiently. Let us go through it.

2.1 Single Stage

A single-stage build performs all the tasks in one and the same docker build.
The flow is: select a base image, install OS packages, copy the source, install Python packages, then set the entry point (if required) or other commands.

FROM python:3.8-slim
RUN apt update && \
    apt install --no-install-recommends -y build-essential gcc && \
    apt clean && rm -rf /var/lib/apt/lists/*
COPY ./req.txt /req.txt
COPY ./src /src
RUN pip3 install --no-cache-dir -r /req.txt
# Exec-form CMD is parsed as a JSON array, so it needs double quotes
CMD ["python3", "/src/app.py"]
EXPOSE 8080

For demo purpose, I am using the following packages:

After running the docker build command, the size of the docker image was 1.64 GB.

Single-stage is very simple and can work in many use cases. It is not a bad practice but does have some fundamental cons, especially for a Python-based project.
Here the use of --no-install-recommends in apt and --no-cache-dir in pip is key. As I said earlier, we don't want to store the cache, as the image is intended for production and not for a development environment. In fact, if you are using any CI/CD platform (like GitHub Actions) with limited storage space, it will only work using this method.
Many Python libraries do not work out of the box; their C extensions must first be compiled. We just need the compiled part of a library, not all the other leftovers. As you can see in the single-stage example above, while performing pip install all the libraries are first downloaded and then compiled.
We should remove (and we can, using bash commands) all the intermediate and leftover components created while installing libraries. This is full of hassles and can even break a library if done incorrectly. This is a real deal-breaker, so many of us will just avoid it and carry the bulkier image into production. But Docker multi-stage builds come to our rescue.

2.2 Multi-Stage

A multi-stage Docker build is by far one of the most effective optimization techniques, while keeping the Dockerfile easy to read and maintain. To write a really efficient Dockerfile, you have traditionally needed to employ shell tricks and other logic to keep the layers as small as possible and to ensure that each layer has the artifacts it needs from the previous layer and nothing else.
With multi-stage builds, you use multiple FROM statements in your Dockerfile. Each FROM instruction can use a different base, and each of them begins a new stage of the build. You can selectively copy artifacts from one stage to another, leaving behind everything you don't want in the final image. To show how this works, see the example.

# Stage 1: Builder/Compiler
FROM python:3.7-slim as builder
RUN apt update && \
    apt install --no-install-recommends -y build-essential gcc
COPY req.txt /req.txt
RUN pip install --no-cache-dir --user -r /req.txt

# Stage 2: Runtime
FROM debian:buster-slim
RUN apt update && \
    apt install --no-install-recommends -y build-essential python3 && \
    apt clean && rm -rf /var/lib/apt/lists/*
COPY --from=builder /root/.local/lib/python3.7/site-packages /usr/local/lib/python3.7/dist-packages
COPY ./src /src
# Exec-form CMD is parsed as a JSON array, so it needs double quotes
CMD ["python3", "/src/app.py"]
EXPOSE 8080

Comparing them, the multi-stage docker image size is 1.61 GB and the single-stage is 1.64 GB. It's an improvement (even though it seems small). A lot of things are going on here, so let us try to understand them in a nutshell.

The first FROM block is the 1st stage, or compiler stage, where we install the Python libraries (which are first downloaded and then compiled in C, which is why we have also installed gcc). Then we just copy the compiled libraries from stage 1 to stage 2, the runtime stage, using the syntax COPY --from=<stage name> <source in stage 1> <destination in stage 2>.
But as we can see from the screenshot, there is not much of an improvement. We would certainly see a huge improvement in other languages, but Python has some tricks up its sleeve:
- Many libraries now come from PyPI as pre-compiled wheels (.whl), which do not need any kind of compilation.
- So does this mean there is no scope for a multi-stage build in a Python project? Absolutely not! Not every package on PyPI is a pre-compiled wheel; many are still legacy tar.gz files (compressed tarballs), which need to be compiled first, and this is where the multi-stage build works its charm.
- Multi-stage is also applicable if you are building a Python package from source or using a local package via setup.py, as again, they need to be compiled first.
- I highly recommend reading this article from Real Python explaining what wheels are in Python.
- From the req.txt that I am using for the demo, only the above packages are not in wheel format, and they are already very small in size. But packages that are not pre-compiled wheels and are large in size will end up wasting a lot of disk space.
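
To make the wheel versus tarball distinction concrete, here is a tiny sketch that classifies a distribution file by its name. The helper name and the example file names are hypothetical. Note that it only tells wheels from source distributions; a pure-Python source distribution still has to be built but needs no C compiler.

```python
def needs_build(filename):
    """Return True for source distributions, False for pre-built wheels."""
    if filename.endswith(".whl"):
        return False  # wheels are installed as-is, no build step
    if filename.endswith((".tar.gz", ".zip")):
        return True  # source distributions are built at install time
    raise ValueError(f"Unrecognised distribution file: {filename}")


# Hypothetical file names as they might appear in a download directory.
files = ["numpy-1.19.5-cp38-cp38-manylinux1_x86_64.whl", "somepkg-0.1.tar.gz"]
to_build = [name for name in files if needs_build(name)]
```

The packages that end up in to_build are the ones for which the builder stage actually earns its keep.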

3. Building a Docker image for any Python Project (GPU):

Building a CPU-based Docker image is not complex, but the same is not true for a GPU-based Docker image. If not built appropriately, it can end up humongous in size. I will focus on the practical, implementation part and not cover the theory (as I think it is out of scope for this article).

3.1 Understanding Pre-requisite

Both TensorFlow and PyTorch use Nvidia CUDA GPU drivers. So the latest Nvidia drivers, CUDA drivers and the respective cuDNN must first be installed on the host machine (I can't include the process here as it is beyond the scope, perhaps for some other blog).
After getting the host device ready, nvidia-docker2 must be installed, which enables the Docker engine to access the underlying Nvidia GPU drivers.
The most critical part is to select the correct version/tag of CUDA and cuDNN for the Nvidia docker image, and the tensorflow/pytorch version that matches it, so that the ML system can utilize the underlying GPU hardware. Trust me, this can be a really frustrating task, so I have some rules of thumb:
1. Always use the same CUDA and cuDNN version in the Docker image as is present on the underlying host machine.
2. Don't blindly install the latest tensorflow/pytorch library from PyPI. It is absolutely incorrect that any version of both packages will work with any version of CUDA and cuDNN. In fact, the combination of the latest versions of tensorflow/pytorch and CUDA/cuDNN may not be compatible. Always test the combination in a development environment first.
3. Docker hub of Nvidia has a lot of images, so understanding their tags and selecting the correct image is the most important building block. The description from the official Nvidia docker hub is,

We are only interested in base and runtime, not in devel (as we are targeting a prod environment). How do we select an exact, specific tag? I'll answer that in the following sub-part.

3.2 Single Stage

Selecting tag: The rule of thumb which I follow is:

Step 1: Check Version of CUDA and cuDNN of the underlying host machine
Step 2: Select the Docker image based on step 1. So in my case, I have selected, nvidia/cuda:10.1-cudnn7-runtime. Why runtime? Because this is the one that includes both CUDA and cuDNN.
Step 3: Select the correct version of tensorflow/pytorch which is compatible with this version of CUDA and cuDNN. In my case, it was tensorflow==2.2.0.
Cautionary step: The docker image from Nvidia might be older Ubuntu (18.04 or even 16.04) which will install python 3.6. So attention must be given here to check the compatibility of your project as well as external packages with the python version. Anyways specific version can be installed from the source.
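
Since the whole selection hinges on reading the image tag correctly, here is a small sketch that splits a tag such as the one chosen in step 2 into its parts. The function name is hypothetical, and the layout it assumes (CUDA version first, then optional cudnn, flavor and OS parts separated by hyphens) is inferred from the tag used in this post rather than from an official specification.

```python
def parse_cuda_tag(tag):
    """Split a tag such as '10.1-cudnn7-runtime' into its parts."""
    parts = tag.split("-")
    info = {"cuda": parts[0], "cudnn": None, "flavor": None, "os": None}
    for part in parts[1:]:
        if part.startswith("cudnn"):
            info["cudnn"] = part[len("cudnn"):]
        elif part in ("base", "runtime", "devel"):
            info["flavor"] = part
        else:
            info["os"] = part  # e.g. an OS suffix, if the tag carries one
    return info


chosen = parse_cuda_tag("10.1-cudnn7-runtime")
```

Comparing chosen["cuda"] and chosen["cudnn"] against the versions installed on the host is a quick sanity check for step 1 before building anything.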

FROM nvidia/cuda:10.1-cudnn7-runtime
RUN apt update && \
    apt install --no-install-recommends -y build-essential software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt install --no-install-recommends -y python3.8 python3-pip python3-setuptools python3-distutils && \
    apt clean && rm -rf /var/lib/apt/lists/*
COPY req.txt /req.txt
COPY ./src /src
RUN python3.8 -m pip install --upgrade pip && \
    python3.8 -m pip install --no-cache-dir -r /req.txt
# Packages were installed for python3.8, so run the app with the same interpreter.
# Exec-form CMD is parsed as a JSON array, so it needs double quotes.
CMD ["python3.8", "/src/app.py"]
EXPOSE 8080

Note: As you can see, the Docker image from Nvidia is based on Ubuntu 18.04, so I had to make a little additional adjustment to install tensorflow==2.2.0.

3.3 Multi Stage

We can use the same mechanism which I showed in 2.2.
The first stage will be used to download and compile python packages and then they will be copied to the second stage or runtime stage
All thumb rules from 3.2 must be also used here

# Stage 1: Builder/Compiler
FROM python:3.8-slim as builder
RUN apt update && \
    apt install --no-install-recommends -y build-essential gcc
COPY req.txt /req.txt
RUN pip install --no-cache-dir --user -r /req.txt

# Stage 2: Runtime
FROM nvidia/cuda:10.1-cudnn7-runtime
RUN apt update && \
    apt install --no-install-recommends -y build-essential software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt install --no-install-recommends -y python3.8 python3-distutils && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1 && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2 && \
    apt clean && rm -rf /var/lib/apt/lists/*
COPY --from=builder /root/.local/lib/python3.8/site-packages /usr/local/lib/python3.8/dist-packages
COPY ./src /src
# Exec-form CMD is parsed as a JSON array, so it needs double quotes
CMD ["python3", "/src/app.py"]
EXPOSE 8080

Note: To make Python 3.8 the default I have added some additional commands. If you don't need this, you can avoid the hassle.

Again, this is not a significant improvement, but the explanation from 2.2 applies here too.
I highly recommend always using a multi-stage build in any use case, as it also improves readability.

4. Inspecting Docker Image using Dive

Even after building a docker image by following all possible best practices, we should still look for further improvements.
Dive is an excellent command-line tool designed for exploring a docker image and its layer contents, and for discovering ways to shrink the size of your Docker/OCI image. It has 24k+ GitHub stars. Also, it is very easy to use and navigate.
It has two very useful metrics:
1. Potential wasted space
2. Image efficiency score
But its best feature is its integration with any CI tool. We can set a condition on either of the two or both metrics and if the condition fails, the CI job will also fail. This way we can always establish confidence in the Docker image created from every CI job.

5. Conclusion

The primary goal must always be a minimal Docker image size, since any Docker image built for an ML system will always be heavy. We should always follow all the best practices, especially Multi-stage builds and versioning of packages. Last, but most important for GPU-based images, is to test the configuration on the dev environment.

Note: I had originally published this blog at towardsdatascience.com found here

Working with Hugging Face Transformers and TF 2.0

Akash Desarda — Fri, 24 Apr 2020 04:06:09 GMT

Models based on Transformers are the current sensation of the NLP world. Hugging Face's Transformers library provides all the SOTA models (like BERT, GPT2, RoBERTa, etc.) for use with TF 2.0, and this blog aims to show its interface and APIs.

0. Disclaimer

I am assuming that you are aware of Transformers and their attention mechanism. The primary aim of this blog is to show how to use Hugging Face's transformer library with TF 2.0, i.e. it will be a more code-focused blog.

1. Introduction

Hugging Face initially supported only PyTorch, but now TF 2.0 is also well supported. You can find a good number of quality tutorials for using the transformer library with PyTorch, but the same is not true for TF 2.0 (the primary motivation for this blog).

Using BERT or even AlBERT is quite easy and a standard process in TF 2.0, courtesy of tensorflow_hub, but the same is not the case with GPT2, RoBERTa, DistilBERT, etc. Here Hugging Face's transformer library comes to the rescue. It provides intuitive APIs to build a custom model from scratch or fine-tune a pre-trained model for a wide list of transformer-based models.

It supports a wide range of NLP applications like Text classification, Question-Answer systems, Text summarization, Token classification, etc. Head over to their Docs for more detail.

This tutorial will be based on the Multi-Label Text classification task of Kaggle's Toxic Comment Classification Challenge.

Following is a general pipeline for any transformer model:

Tokenizer definition → Tokenization of Documents → Model Definition → Model Training → Inference

Let us now go over them one by one. I will also try to cover multiple possible use cases.

2. HuggingFace transformer General Pipeline

2.1 Tokenizer Definition

Every transformer-based model has its own tokenization technique and its own use of special tokens. The transformer library takes care of this for us. It supports tokenization for every model it ships.

from transformers import DistilBertTokenizer, RobertaTokenizer

distil_bert = 'distilbert-base-uncased'  # Pick any desired pre-trained model
roberta = 'roberta-base'  # RoBERTa checkpoints are cased; there is no "uncased" variant

# Defining DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
                                                max_length=128, pad_to_max_length=True)

# Defining RoBERTa tokenizer
tokenizer = RobertaTokenizer.from_pretrained(roberta, do_lower_case=True, add_special_tokens=True,
                                             max_length=128, pad_to_max_length=True)

Every transformer model has a similar tokenizer definition API.
Here I am using a tokenizer from a pretrained model.
Here,
- add_special_tokens: adds the special tokens that the pretrained model in use expects, such as [CLS] and [SEP] for BERT-style models. It should always be kept True.
- max_length: max length of any sentence to tokenize; it's a hyperparameter (originally BERT has a max length of 512).
- pad_to_max_length: pads shorter sentences up to max_length.

2.2 Tokenization of Documents

The next step is now to perform tokenization on documents. It can be performed either by encode() or encode_plus() method.

import numpy as np
from tqdm import tqdm

def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [], [], []
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=128, pad_to_max_length=True,
                                       return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])

    return (np.asarray(input_ids, dtype='int32'),
            np.asarray(input_masks, dtype='int32'),
            np.asarray(input_segments, dtype='int32'))

Any transformer model generally needs three inputs:
- input ids: the id of each token in the model's vocabulary
- attention mask: which ids must be paid attention to; 1 = pay attention. In simple terms, it tells the model which positions hold real tokens and which hold padding
- token type ids: these matter for models that consume multiple sentences, like a Question-Answer model. They tell the model which sentence each token belongs to.
It is not compulsory to provide all three, and input ids alone will do, but the attention mask helps the model focus only on valid tokens. So at least for the classification task, both of these should be provided.
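To make the relation between input ids and the attention mask concrete, here is a minimal sketch that pads a list of ids by hand and builds the matching mask. It does not call the transformers library at all: the helper pad_and_mask, the padding id 0 and the example ids are assumptions made purely for illustration, and the real tokenizer does this work for you.

```python
import numpy as np

PAD_ID = 0  # assumed padding id, for illustration only


def pad_and_mask(token_ids, max_length=8, pad_id=PAD_ID):
    """Pad (or truncate) a list of token ids and build the matching attention mask."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)            # 1 = real token, pay attention
    padding = max_length - len(ids)
    ids += [pad_id] * padding        # fill the rest with the padding id
    mask += [0] * padding            # 0 = padding, ignore
    return np.asarray(ids, dtype='int32'), np.asarray(mask, dtype='int32')


# 101 and 102 stand in for the special start/end tokens; the other ids are made up.
ids, mask = pad_and_mask([101, 7592, 2088, 102])
print(ids.tolist())   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask.tolist())  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Note that the two special tokens get a 1 in the mask; only the padded positions are masked out.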

2.3 Training and Fine-tuning

Now comes the most crucial part, the Training. The method that I will discuss is by no means the only possible way to train. Though after a lot of experimenting I found this method to be most workable. I will discuss three possible ways to train the model:

1. Use a Pretrained model directly as a classifier
2. Use a Transformer model to extract embeddings and feed them as input to another classifier
3. Fine-tune a Pretrained transformer model on a custom config and dataset

2.3.1 Use Pretrained model directly as a classifier

This is the simplest approach, but also the one with the fewest applications. Hugging Face's transformers library provides some models with sequence classification ability. These models have two parts: a pre-trained model architecture as the base and a classifier as the top head.

Tokenizer definition → Tokenization of Documents → Model Definition

from transformers import TFDistilBertForSequenceClassification, DistilBertConfig
import tensorflow as tf

distil_bert = 'distilbert-base-uncased'

config = DistilBertConfig(num_labels=6)
config.output_hidden_states = False
transformer_model = TFDistilBertForSequenceClassification.from_pretrained(distil_bert, config = config)

input_ids = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
input_masks_ids = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')
# [0] selects the classification output of the model call
X = transformer_model(input_ids, attention_mask=input_masks_ids)[0]
model = tf.keras.Model(inputs=[input_ids, input_masks_ids], outputs = X)

Summary of Pretrained model directly as a classifier

Note: Only SequenceClassification models are applicable here.
Defining the proper config is crucial here. As you can see in the DistilBertConfig(num_labels=6) call, I am defining the config. num_labels is the number of classes to use when the model is a classification model. It also supports a variety of other settings, so go ahead & see their docs.
Some key things to note here are:
- Here only the weights of the pre-trained model can be updated, but updating them is not a good idea as it defeats the purpose of transfer learning. So there is actually nothing here to update. This is the reason I prefer this approach the least.
- It is also the least customizable.
- A hack you can try is setting num_labels to a much higher number and then adding a dense layer at the end, which can be trained.

# Hack
config = DistilBertConfig(num_labels=64)
config.output_hidden_states = False
transformer_model = TFDistilBertForSequenceClassification.from_pretrained(distil_bert, config = config)

input_ids = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
input_masks_ids = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')
X = transformer_model(input_ids, attention_mask=input_masks_ids)[0]
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(6, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids, input_masks_ids], outputs = X)

# Freeze the two input layers and the transformer so that only the new Dense layer is trained
for layer in model.layers[:3]:
    layer.trainable = False

2.3.2 Transformer model to extract embedding and use it as input to another classifier

This approach needs two-level or two separate models. We use any transformer model to extract word embedding & then use this word embedding as input to any classifier (eg Logistic classifier, Random forest, Neural nets, etc).

I would suggest you read this article by Jay Alammar which discusses this approach with great detail and clarity.

As this blog is all about neural nets, let me explain this approach with NN.

import tensorflow as tf
from transformers import TFDistilBertModel, DistilBertConfig

distil_bert = 'distilbert-base-uncased'

config = DistilBertConfig(dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)

input_ids_in = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')

# 3D output: (batch, sequence length, hidden size)
embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
# Keep only the first token of every sequence -> 2D: (batch, hidden size)
cls_token = embedding_layer[:,0,:]
X = tf.keras.layers.BatchNormalization()(cls_token)
X = tf.keras.layers.Dense(192, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(6, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)

# Freeze the inputs and the transformer; only the classifier head is trained
for layer in model.layers[:3]:
    layer.trainable = False

Model Summary

The line cls_token = embedding_layer[:,0,:] is key here. We are only interested in the [CLS] (classification) token of the model, which can be extracted using the slice operation. Now we have 2D data and can build the network as desired.
This approach generally works better than the 2.3.1 approach. But it also has some drawbacks:
- It is not so suitable for production: because the transformer model is used only as a feature extractor, you now have to maintain two models, as your classifier head is a different model (like XGBoost or CatBoost).
- While converting 3D data to 2D we may miss out on valuable information.

The transformers library provides a great utility if you just want to extract word embeddings.

import numpy as np
from transformers import AutoTokenizer, pipeline, TFDistilBertModel

model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

pipe = pipeline('feature-extraction', model=model,
                tokenizer=tokenizer)
# Pass a list of texts so that the result is 3D: (texts, tokens, hidden size)
features = pipe(['any text data', 'or a list of text data'],
                pad_to_max_length=True)
features = np.squeeze(features)
# Keep only the first token of every text
features = features[:,0,:]

2.3.3 Fine-tuning a Pretrained transformer model

This is my favourite approach, as here we make use of the full potential of any transformer model. We'll be using the weights of a pre-trained transformer model and then fine-tuning on our data, i.e. transfer learning.

import tensorflow as tf
from transformers import TFDistilBertModel, DistilBertConfig

distil_bert = 'distilbert-base-uncased'

config = DistilBertConfig(dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)

input_ids_in = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')

# 3D output: (batch, sequence length, hidden size)
embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
# Pool over the sequence axis to go from 3D to 2D
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(6, activation='sigmoid')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)

# Freeze the inputs and the transformer weights
for layer in model.layers[:3]:
    layer.trainable = False

Look at the Bidirectional LSTM layer: since the embedding layer before it produces 3D data, we can use an LSTM to extract finer details.
The next step is to transform the 3D data into 2D so that we can use an FC layer. You can use any pooling layer for this.
Also, note the last two lines (the for loop). We should always freeze the pre-trained weights of the transformer model and update only the remaining weights.

Some extras

Every approach has two things in common:
1. config.output_hidden_states=False; as we are training & not interested in the hidden states.
2. X = transformer_model()[0]; this is in line with config.output_hidden_states, as we want only the top head.
config behaves like a dictionary of settings, so to see all available configuration options, simply print it.
Choose the base model carefully: TF 2.0 support is new, so there might be bugs.

2.4 Inference

As the model is based on the tf.keras model API, we can use the familiar Keras method model.predict().

We can even use the transformer library's pipeline utility (please refer to the example shown in 2.3.2). This utility is quite effective as it unifies tokenization and prediction under one simple, common API.
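For the multi-label setup of 2.3.3 (a sigmoid output with six units), the array returned by model.predict() still has to be turned into labels. Below is a minimal sketch of that last step. The probs array is a hand-written stand-in for the output of model.predict(), the 0.5 threshold is an assumption you would tune, and probs_to_labels is a hypothetical helper; the label names are the six classes of the Toxic Comment Classification Challenge.

```python
import numpy as np

# The six classes of the Toxic Comment Classification Challenge
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def probs_to_labels(probs, threshold=0.5):
    """Turn per-class probabilities into a list of label names per comment."""
    probs = np.asarray(probs)
    return [[label for label, p in zip(LABELS, row) if p >= threshold]
            for row in probs]


# Stand-in for: probs = model.predict([input_ids, input_masks])
probs = np.array([[0.91, 0.10, 0.75, 0.02, 0.64, 0.05],
                  [0.03, 0.01, 0.02, 0.00, 0.04, 0.01]])

print(probs_to_labels(probs))
# [['toxic', 'obscene', 'insult'], []]
```

Because each class is thresholded independently, a comment can get several labels or none at all, which is exactly what a multi-label task needs.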

3. End Notes

Hugging Face has really made it quite easy to use any of their models with tf.keras. It has opened up wide possibilities.

They have also made it quite easy to use their models across libraries (from PyTorch to TF or vice versa).

I would suggest visiting their docs, as they are very intuitive & to the point.

Build a Custom ResNetV2 with the desired depth

Akash Desarda — Sun, 05 Apr 2020 17:03:01 GMT

This tutorial will help you build a ResNet model with any desired depth (number of layers) from scratch.

ResNet has always been one of my favourite architectures, and I have used its core idea of skip connections many times. It is now fairly old, as both ResNetV1 (Deep Residual Learning for Image Recognition) and ResNetV2 (Identity Mappings in Deep Residual Networks) came out in 2015 & 2016, but its core concept is still widely used today. So you must be wondering why I am writing this tutorial in 2020.

There are two primary reasons for this tutorial:

1. Build ResNetV2 with any desired depth, not just ResNet50, ResNet101 or ResNet152 (as included in keras application)
2. Use of Tensorflow 2.xx

This tutorial is divided into two parts. Part 1 will briefly discuss ResNet and Part 2 will focus on the coding part. Keep one thing in mind: the primary goal of this tutorial is to showcase the coding part of building a ResNet model with any desired depth from scratch.

Part 1: ResNet in Brief

One of the biggest problems of any deep learning network is the vanishing or exploding gradient. This restricts us from going much deeper into the network. Refer to video if you are not aware of it or want a refresher.

So the question is: how does ResNet solve the problem?

I will try to put it in simple words. The core idea of ResNet is the skip connection, i.e. it allows us to take the activation from one layer and feed it to a later layer (which can be much deeper). Consider the problem of the vanishing gradient: when we go deeper, some neurons will not contribute anything because their weights have shrunk significantly. But if we bring the activation from an earlier layer and add it to the current layer before its activation, it will certainly contribute.
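Here is a toy NumPy sketch of that idea, written only for illustration (it is not the Keras code used later in this tutorial). Both functions apply two dense layers, and the names plain_block and toy_residual_block are made up for this example. To mimic layers that have stopped contributing, the weights are set to zero.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def plain_block(x, w1, w2):
    """Two stacked layers without a skip connection."""
    return relu(relu(x @ w1) @ w2)


def toy_residual_block(x, w1, w2):
    """Same two layers, but the input x is added back before the last activation."""
    return relu(relu(x @ w1) @ w2 + x)


x = np.array([[1.0, 2.0, 3.0]])
dead_weights = np.zeros((3, 3))  # stand-in for weights that have shrunk to nothing

plain_out = plain_block(x, dead_weights, dead_weights)
res_out = toy_residual_block(x, dead_weights, dead_weights)

print(plain_out.tolist())  # [[0.0, 0.0, 0.0]]
print(res_out.tolist())    # [[1.0, 2.0, 3.0]]
```

Without the skip connection the signal is lost completely, while with it the input passes through untouched, so the layers after this block still receive something useful.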

Refer to this video to understand it more mathematically.

Now, without wasting time, let us move on to the coding part.

Part 2: Coding

Note: I have published the repo containing all the code; it can be found here.

First, we have to create a residual block, which is the building block of ResNet. It is used as a skip connector. ResNetV2 brought one slight modification: as opposed to V1, where the convolution layer is used first and then batch normalization, V2 applies batch normalization and activation before the convolution.

import tensorflow as tf
from tensorflow.keras.layers import (Activation, AveragePooling2D, BatchNormalization, Conv2D,
                                     Dense, Dropout, Flatten, Input)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2


# Identity Block or Residual Block or simply Skip Connector
def residual_block(X, num_filters: int, stride: int = 1, kernel_size: int = 3,
                   activation: str = 'relu', bn: bool = True, conv_first: bool = True):
    """
    Parameters
    ----------
    X : Tensor layer
        Input tensor from previous layer
    num_filters : int
        Conv2D number of filters
    stride : int by default 1
        Stride square dimension
    kernel_size : int by default 3
        Conv2D square kernel dimensions
    activation: str by default 'relu'
        Activation function to use
    bn: bool by default True
        To use BatchNormalization
    conv_first : bool by default True
        conv-bn-activation (True) or bn-activation-conv (False)
    """
    conv_layer = Conv2D(num_filters,
                        kernel_size=kernel_size,
                        strides=stride,
                        padding='same',
                        kernel_regularizer=l2(1e-4))
    # X = input
    if conv_first:
        X = conv_layer(X)
        if bn:
            X = BatchNormalization()(X)
        if activation is not None:
            X = Activation(activation)(X)
            X = Dropout(0.2)(X)
    else:
        if bn:
            X = BatchNormalization()(X)
        if activation is not None:
            X = Activation(activation)(X)
        X = conv_layer(X)
    return X

Inside the network the convolution layer is sometimes not used first, so we have to use an if-else branch.

Next, we add the layers. We will use the Keras functional API for this.

# depth should be 9n+2 (eg 56 or 110)

# Model definition
num_filters_in = 32
num_res_block = int((depth - 2) / 9)

inputs = Input(shape=input_shape)

# ResNet V2 performs Conv2D on X before splitting into two paths
X = residual_block(X=inputs, num_filters=num_filters_in, conv_first=True)

# Building stack of residual units
for stage in range(3):
    for unit_res_block in range(num_res_block):
        activation = 'relu'
        bn = True
        stride = 1
        # First layer and first stage
        if stage == 0:
            num_filters_out = num_filters_in * 4
            if unit_res_block == 0:
                activation = None
                bn = False
            # First layer but not first stage
        else:
            num_filters_out = num_filters_in * 2
            if unit_res_block == 0:
                stride = 2

        # bottleneck residual unit
        y = residual_block(X,
                           num_filters=num_filters_in,
                           kernel_size=1,
                           stride=stride,
                           activation=activation,
                           bn=bn,
                           conv_first=False)
        y = residual_block(y,
                           num_filters=num_filters_in,
                           conv_first=False)
        y = residual_block(y,
                           num_filters=num_filters_out,
                           kernel_size=1,
                           conv_first=False)
        if unit_res_block == 0:
            # linear projection residual shortcut connection to match
            # changed dims
            X = residual_block(X=X,
                               num_filters=num_filters_out,
                               kernel_size=1,
                               stride=stride,
                               activation=None,
                               bn=False)
        X = tf.keras.layers.add([X, y])
    num_filters_in = num_filters_out

Let us go through it line by line.

On line #7 we take the input.

Line #10 creates a convolution layer before splitting into two paths. Here we are using the same residual block which we created earlier.

Lines #13 to #55 create a stack of residual units. Let's discuss this in more detail. After line #10 the network is split into two parts, which are later added together. In part one, three operations are performed:

batch norm → activation → Conv

This process is repeated thrice and the result is finally added. To put it in ResNet perspective: after splitting, in one sub-network we perform some operations (as mentioned above) and add its activation to the state where we split. So you see, we skipped or jumped over three layers.

Why three layers? It depends on your intuition. It can be two or four or anything. I found the best results with three, and the authors also suggest two or three layers.

Time to go deeper into the code, literally :)

Line #13 starts appending stacks of residual blocks in a for loop. (A simple trick to skip over the desired number of layers is to start a for loop over that range.)

Lines #30 to #44 append the convolution layers in one of the sub-networks after performing the split.

Here you can see that I have used three residual blocks. Why three? Because I am skipping over three activations.

Lines #45 to #53: after performing the split, the second sub-network must be a convolution layer, which skips ahead and is added later.

Line #54 is where all the magic takes place :). Here we add both sub-networks.
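To see how the number of filters and the stride change from stage to stage, here is a small trace that repeats only the bookkeeping of the loop above, without building any Keras layers. The function trace_stages is a hypothetical helper written for this explanation; the starting value of 32 filters and the three stages come from the code above.

```python
def trace_stages(num_filters_in=32, num_stages=3):
    """Return (stage, filters in, filters out, stride of the first unit) per stage."""
    plan = []
    for stage in range(num_stages):
        if stage == 0:
            # First stage: widen by 4, no downsampling
            num_filters_out = num_filters_in * 4
            first_stride = 1
        else:
            # Later stages: widen by 2 and downsample in the first unit
            num_filters_out = num_filters_in * 2
            first_stride = 2
        plan.append((stage, num_filters_in, num_filters_out, first_stride))
        num_filters_in = num_filters_out
    return plan


for stage, n_in, n_out, stride in trace_stages():
    print(f"stage {stage}: in={n_in}, out={n_out}, first stride={stride}")
# stage 0: in=32, out=128, first stride=1
# stage 1: in=128, out=256, first stride=2
# stage 2: in=256, out=512, first stride=2
```

This is also why the shortcut path needs its own 1x1 convolution in the first unit of every stage: the number of filters (and, from the second stage on, the spatial size) changes there, so the two paths would otherwise not have matching shapes when they are added.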

The last part is to connect a fully connected layer on top of the network.

# Add classifier on top.
# v2 has BN-ReLU before Pooling
X = BatchNormalization()(X)
X = Activation('relu')(X)
X = AveragePooling2D(pool_size=8)(X)
y = Flatten()(X)
y = Dense(512, activation='relu')(y)
y = BatchNormalization()(y)
y = Dropout(0.5)(y)
outputs = Dense(num_classes,
                activation='softmax')(y)

# Instantiate model.
model = Model(inputs=inputs, outputs=outputs)

You can customize this part at your will.

Sample Model architecture with depth 11

Some Key Things to Note

I have written the comment "depth should be 9n+2" because here I am skipping over three activations; to match the tensor shape before & after the split, the depth must be chosen in accordance with this formula.
Lots of adjustments are made to the strides and kernel sizes to match the tensor shapes. I would suggest you write down one iteration of the loop on paper to see all the adjustments and understand them more thoroughly.
Do not get confused by skip connectors and residual blocks. The connection between them is as follows: after performing the split, two paths are created. Path one has three layers or three residual blocks, and path two has one layer or one residual block (this block is the same as before the split). This residual block from path two is brought over to path one and added before the activation. Thus this block has jumped or skipped over three blocks & thus became a skip connector.
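As a small convenience, the depth rule can be checked before building the model. The helper below is a hypothetical sketch and not part of the code above; it only encodes the 9n+2 formula and the num_res_block computation from the model definition, where every one of the three stages stacks n units of three residual blocks each.

```python
def res_blocks_per_stage(depth: int) -> int:
    """Return n for a depth of the form 9n+2, or raise if the depth is not valid."""
    if depth < 11 or (depth - 2) % 9 != 0:
        raise ValueError(f"depth should be 9n+2 (eg 56 or 110), got {depth}")
    return (depth - 2) // 9


print(res_blocks_per_stage(11))   # 1
print(res_blocks_per_stage(56))   # 6
print(res_blocks_per_stage(110))  # 12
```

Failing early with a clear message is friendlier than silently rounding down, which is what int((depth - 2) / 9) does for a depth such as 50.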

Part 3: End Notes and some extras

I would suggest you go through my GitHub repo ResNet-builder, as it includes a lot more APIs which I thought might be out of scope for this blog. This blog was restricted to building a ResNet network. But to build a complete ResNet system we need much more functionality, like a Data loader, Inference generator, Visualization of model performance, etc. I have included all these APIs.

The repo also supports a variety of configurations to build a model.

I have tried to keep the API as intuitive as possible, but if anything is confusing you can connect with me via Linkedin.

MLOps: The Upcoming Shining Star

Akash Desarda — Sat, 11 Jan 2020 14:19:33 GMT

Why in the world shouldn't I prioritize building The Perfect Model?

Before reading further, I would suggest you read this paper. It discusses all of the related problems in detail.

The major hurdles that generally arise while developing ML apps are:

Current Challenges of Productionizing Machine learning models

Complex Models Abstraction Boundaries

Traditional software engineering practice has shown that strong abstraction boundaries using encapsulation and modular design help create maintainable code in which it is easy to make isolated changes and improvements. But in the case of ML, enforcing strict abstraction boundaries becomes difficult because of its dependency on external data.

Data Dependency

Data Dependencies Cost More than Code Dependencies

The way data is fed to training and the steps done at the evaluation stage in the data scientist's sandbox can vary dramatically from real-world scenarios. Depending on the use case, data changes with time, and this lack of regularity causes poor performance of ML models.

Simple to complex pipelines

Training a simple model, putting it into inference and generating predictions is a simple way of getting business insights, but this is not sufficient. In real-world cases, models need to be retrained regularly on new data fetched from the data lake. So there are going to be many models, with human approval needed to decide which one goes to production. In a federated pipeline, this becomes even more challenging to maintain.

Configuration Debt

Any large system has a wide range of configurable options, including which features are used, how data is selected, a wide variety of algorithm-specific learning settings, potential pre or post-processing, verification methods, etc. In a mature system that is being actively developed, the number of lines of configuration can far exceed the number of lines of the traditional code. Each configuration line has the potential for mistakes.

Reproducibility Debt

It is important that we can re-run experiments and get similar results, but designing real-world systems to allow for strict reproducibility is a task made difficult by randomized algorithms, non-determinism inherent in parallel learning, reliance on initial conditions, and interactions with the external world.
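One small piece of this debt can be paid cheaply: fixing the seeds of the random number generators. The sketch below is only an illustration using the standard library and NumPy. seed_everything and noisy_experiment are hypothetical names, and a real project would also seed its ML framework (for example with tf.random.set_seed in TensorFlow). It does nothing about non-determinism from parallel execution or from the external world.

```python
import random

import numpy as np


def seed_everything(seed=42):
    """Seed the random number generators used in this sketch."""
    random.seed(seed)
    np.random.seed(seed)


def noisy_experiment():
    """Stand-in for a training run that depends on random numbers."""
    return [random.random(), float(np.random.rand())]


seed_everything(7)
first_run = noisy_experiment()

seed_everything(7)
second_run = noisy_experiment()

print(first_run == second_run)  # True
```

Storing the seed together with the rest of the configuration makes it possible to re-run an experiment later and compare the results.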

Production ML Risk

There is always the risk of ML models not performing as expected, so they need continuous monitoring and evaluation to check that they stay within the expected bounds. On live data, metrics like Accuracy, Precision, Recall, etc. cannot be used, as live data does not have labels.

Process and Collaboration

Handling production-grade ML systems requires multiple roles, like data scientists, data engineers, business analysts, and operations. Different teams focus on different outcomes: the data scientist focuses on improving accuracy and detecting data deviations, the business analyst wants to enhance KPIs, and the operations team wishes to see uptime and resources. Unlike the data scientist's sandbox, the production environment has many objects, like models, algorithms, pipelines, etc., that are difficult to handle, and versioning them is yet another issue. Object storage is needed to store the ML models, and a source control repository is not the best option for that.

What is MLOps?

MLOps establishes a culture and environment where ML technologies can generate business benefits by optimizing the ML lifecycle, automating and scaling ML initiatives, and optimizing the business return of ML in production. MLOps combines the capabilities of data scientists and services.

MLOps enables collaboration across diverse users (such as Data Scientists, Data Engineers, Business Analysts and ITOps) on ML operations and enables a data-driven continuous optimization of ML operations impact or ROI (Return on Investment) to business applications.

Why MLOps?

The content above makes the need for MLOps pretty clear, along with what led to the rise of this hybrid approach in the modern era of Artificial Intelligence. Now let us move from the What to the Why and shed some light on the reasons which led to the use of MLOps in the first place.

Orchestration of multiple pipelines

The development of machine learning models is not a single-code-file task. Instead, it involves a combination of different pipelines, each with its own role to perform.
Pipelines for primary processes such as pre-processing, feature engineering, model training, model inference, etc. are all involved in the big picture of developing a machine learning model.
MLOps plays an essential role in the simple orchestration of these multiple pipelines to ensure that the model is updated automatically.

Manage Full Life Cycle of MLOps

The life cycle of a machine learning model consists of different sub-parts, each of which should be considered a software entity in its own right.
These sub-parts have their own needs for management and maintenance, which are often handled by DevOps, but it is challenging to manage them using traditional DevOps methods.
MLOps is a newly emerged technique that combines people, process, and technology to give an edge in swiftly and safely optimizing and deploying machine learning models.

Scale ML Applications

As said earlier, the development of models is not the issue to worry about; the real problem lies in the management of models at scale.
Managing thousands of models at once is a very cumbersome and challenging task, which tests the performance of the models at scale.
MLOps naturally scales to manage thousands of model pipelines in production.

Maintain ML Health

Maintaining ML health after the deployment of ML models is the most critical part of the post-process. It is vital so that ML models can be operated and managed flawlessly.
MLOps provides the latest ML health methods by enabling the detection of different drifts (model drift, data drift) in an automated way.
It makes it possible to use the latest cutting-edge algorithms in the system to detect these drifts, so that they can be dealt with well before they start to affect ML health.
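As an illustration of automated data drift detection, here is a minimal sketch that compares one feature's distribution at training time with its distribution on live data. It uses the Population Stability Index (PSI), which is one common choice and my own pick for this example rather than something prescribed by MLOps. The function name, the number of bins and the synthetic data are all assumptions. Note that it needs no labels, which matters because live data does not have them.

```python
import numpy as np


def population_stability_index(reference, live, bins=10):
    """PSI between a reference sample and a live sample of one numeric feature."""
    reference = np.asarray(reference, dtype=float)
    live = np.asarray(live, dtype=float)
    # Bin edges come from the reference (training-time) data
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    # Clip live values into the reference range so nothing falls outside the bins
    live_counts, _ = np.histogram(np.clip(live, edges[0], edges[-1]), bins=edges)
    eps = 1e-6  # avoids division by zero and log(0) for empty bins
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    live_frac = np.clip(live_counts / live_counts.sum(), eps, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))


rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)   # what the model was trained on
same_live = rng.normal(0.0, 1.0, 5000)       # live data from the same distribution
shifted_live = rng.normal(1.0, 1.0, 5000)    # live data whose mean has drifted

psi_same = population_stability_index(train_feature, same_live)
psi_shifted = population_stability_index(train_feature, shifted_live)
print(f"no drift: {psi_same:.3f}, drift: {psi_shifted:.3f}")
```

A scheduled job could compute this for every feature and raise an alert, or trigger retraining, when the value crosses a threshold chosen by the team (a rule of thumb often quoted for PSI is that values above roughly 0.25 indicate a significant shift).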

Continuous Integration and Deployment

Continuous Integration and Deployment is one of the main purposes that led to the use of DevOps in software product development.
But due to the scale at which ML models operate, it is difficult to use the same methods of continuous integration and deployment that are used for other software products.
MLOps provides dedicated tools and techniques which are specialized for continuous integration and deployment of ML models.

Model Governance

Under Model Governance, MLOps can provide rich model performance data by monitoring attributes at a massive scale.
It can also provide the ability to take snapshots of the pipelines for analyzing critical moments.
Also, the logging facilities and audit trails under MLOps can be used for reporting and continuity of compliance.

How is MLOps different from DevOps?

Data/model versioning != code versioning
Model reuse is an entirely different case from software reuse, as models need tuning based on scenarios and data.
Fine-tuning is needed to reuse a model: applying transfer learning to it leads to a training pipeline.
On-demand retraining is required, as models decay over time.

Let's talk about Full Stack Machine Learning Development, shall we?

You may already have got the gist, but let's talk about it in some detail.

Developing an ML system is not just developing a model; it is much more: Configuration, Data collection, Deployment, Serving, etc. (as shown in the diagram above).
So to be rapid but safe, high-confidence but generic, development-friendly but also production-friendly, we need to replicate all the best practices of Software Development in ML development.
Just as DevOps is helping Software Development grow into Full Stack development, MLOps will help ML development grow into Full Stack ML development.

Embracing MLOps in ML Development will always ensure confidence in the ML System.

I will be publishing a new series on building the right Full-stack ML system. I will keep you posted :)

import idea

Working with Avro file format in Python the right way

How to supercharge your config to make it truly environment agnostic

1. The Old School way

2. Environment Agnostic Config: Generating config programmatically

3. One Config to rule them all

4. How to actually consume the config in code

5. Bringing all things together

The practical guide to write useful comments

1. The Need

2. How to write good comments

2.1 Philosophical change

2.2 Tagging comments

2.3 Navigate & automate comments

3. How to write a good message commit message

How to SSH login password free from Windows, Linux, Mac

1. Linux OS & MAC OS

2. Windows

How to merge a specific directory or file in Git

The Ultimate VS Code setup guide 🐱‍💻

Table of contents

Part 1: Python development

Python (extension) by Microsoft:

Pylance (extension) by Microsoft:

Visual Studio IntelliCode (extension) by Microsoft:

SonarLint (extension) by SonarSource:

Pylint (native support):

Auto code formatting using Black (native support):

Python Docstring Generator (extension) by Nils Werner:

Python Indent (extension) by Kevin Rose:

Python Type Hint (extension) by njqdev:

Python Test Explorer for Visual Studio Code (extension) by Little Fox Team:

Part 2: Git

GitLens (extension) by Eric Amodio:

Git Graph (extension) by mhutchie:

Git History (extension) by Don Jayamanne:

Conventional Commits (extension) by vivaxy:

Part 3: Productivity boosters

VS Code Workspace (system functionality):

Useful keyboard shortcut:

Setting sync (system functionality):

Thunder Client (extension) by Ranga Vadhineni:

Path Autocomplete (extension) by Mihai Vilcu:

Comment Anchors (extension) by Exodius Studios:

Bracket Pair Colorizer 2 (extension) by CoenraadS:

Code Spell Checker (extension) by Street Side Software:

Error Lens (extension) by Alexander:

footsteps (extension) by Wattenberger:

Zoom Bar (extension) by wraith13

Resource Monitor (extension) by mutantdino:

Draw.io Integration (extension) by Henning Dieterichs

Part 4: Customization

Rainglow theme (extension) by Dayle Rees:

Material Icon Theme (extension) by Philipp Kief

Window Colors (extension) by Stuart Robinson:

Practical OOP in Python: Methods

1. Brief theory

2. Practical use case

2.1 instance method

2.2 classmethod

2.3 staticmethod

2.4 property

2.5 private method

2.6 strict private method

Testing in a CI/CD Pipeline Part 3: Deployment testing

1. Deployment testing in brief 💼

2. General mechanics 🧰

3. Implementation in Azure DevOps pipeline 🚀

3.1 Test script

3.2 Perform deployment test in the Kubernetes

3.3 Extracting URL/endpoint

3.4 Integration with CD Job

4. Demo

5. Deployment test code

Testing in a CI/CD Pipeline Part 2: Integration testing

1. Integration testing in brief 💼

2. General mechanics

3. Implementation in Azure DevOps pipeline 🚀

3.1 Test script

3.2 Extracting URL/endpoint