<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[import idea - Because it's just an idea that all you need 💡]]></title><description><![CDATA[On a mission to advocate the idea which can potentially solve your problem as ideas are platform/language/tools agnostics.]]></description><link>https://importidea.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1622643942349/REQfMRmjW.png</url><title>import idea - Because it&apos;s just an idea that all you need 💡</title><link>https://importidea.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 06 Jun 2026 10:44:03 GMT</lastBuildDate><atom:link href="https://importidea.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How to Supercharge Your Streaming Data Pipeline in Python]]></title><description><![CDATA[Streaming data processing has come a long way, so why stick to old methods and not use modern practices. Let me share my fresh perspective that can help you solve your problem.
Inspiration from Batch Processing
Batch Processing shines with below (tho...]]></description><link>https://importidea.dev/how-to-supercharge-your-streaming-data-pipeline-in-python</link><guid isPermaLink="true">https://importidea.dev/how-to-supercharge-your-streaming-data-pipeline-in-python</guid><category><![CDATA[pthon-kafka]]></category><category><![CDATA[quixstreaming]]></category><category><![CDATA[kafka]]></category><category><![CDATA[Python]]></category><category><![CDATA[streaming]]></category><category><![CDATA[streaming data]]></category><category><![CDATA[Polars]]></category><category><![CDATA[deltalake]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Sun, 08 Jun 2025 13:01:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749387299932/78805fbb-2907-4826-afaa-239713754168.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Streaming data processing has come a long way, so why stick to old methods and not use modern practices. Let me share my fresh perspective that can help you solve your problem.</p>
<h2 id="heading-inspiration-from-batch-processing">Inspiration from Batch Processing</h2>
<p>Batch processing shines in the following general use cases (the list is not exhaustive):</p>
<ul>
<li><p>Transforming data in ETL/ELT data pipelines.</p>
</li>
<li><p>Performing aggregation, grouping, filtering, joins, analytics, and more; the list is never-ending.</p>
</li>
<li><p>Doing all kinds of operations on table data like append, merge, delete, update, etc.</p>
</li>
</ul>
<p>This side of the world is pretty mature now. We have a very good set of tools and frameworks that allow us to develop such pipelines. So why not apply the same functionality to a streaming pipeline, without losing its real-time character?</p>
<h2 id="heading-problem-statement">Problem Statement</h2>
<p>Track customers' journeys on an e-commerce website. The events include which products a customer viewed, which products were added to the cart, and finally, which products were purchased. Calculate metrics that can be used for analysis.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749389855794/2a345922-d7ac-4b60-9f01-5ba78ffd70c2.png" alt class="image--center mx-auto" /></p>
<p>I am using <code>Kafka</code> as my streaming platform. In Python, two popular client libraries can be used to interact with Kafka brokers:</p>
<ol>
<li><p><a target="_blank" href="https://github.com/confluentinc/confluent-kafka-python">confluent kafka</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/dpkp/kafka-python">kafka python</a></p>
</li>
</ol>
<p>Here is my alternative take - neither of them should be used directly.</p>
<ul>
<li><p>Neither of them has good type-hint (<code>Typed</code>) support, so IDEs cannot help much with autocompletion or static checks.</p>
</li>
<li><p>Both need a fair amount of boilerplate code.</p>
</li>
<li><p>Their scope of work is limited to just <code>Producer</code>, <code>Consumer</code> &amp; <code>Admin</code>.</p>
</li>
<li><p>To apply principles of Batch processing, we have to write a lot of custom code.</p>
</li>
</ul>
<p>You need not worry: there are open-source libraries like <a target="_blank" href="https://github.com/pathwaycom/pathway">Pathway</a> and <a target="_blank" href="https://github.com/quixio/quix-streams">Quix Streams</a> that can be used for stream processing. Let’s quickly compare them:</p>
<ul>
<li><p>Pathway has many more GitHub stars compared to Quix Streams.</p>
</li>
<li><p>Pathway also supports a lot more connectors and sinks than the other.</p>
</li>
<li><p>Both allow us to do batch processing tasks like filters, aggregation, joins, analytics, and Python UDF on streaming data.</p>
</li>
<li><p>Pathway's strength in supporting many connectors and sinks is also its weakness, as it makes the library quite large. Quix Streams is designed for one main purpose: Streaming.</p>
</li>
<li><p>The biggest advantage of Quix Streams is that it lets us use low-level Kafka client tools with full <code>Typed</code> support, allowing us to use IDEs efficiently. <strong>This was the main reason I chose Quix Streams.</strong></p>
</li>
</ul>
<h3 id="heading-the-producer">The Producer:</h3>
<p>Below, I generate fake customer-journey data and send the events to their respective topics.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> logging

<span class="hljs-keyword">from</span> quixstreams <span class="hljs-keyword">import</span> Application

<span class="hljs-keyword">from</span> src.ops.generator <span class="hljs-keyword">import</span> generate_dummy_e_commerce_data

logger = logging.getLogger(<span class="hljs-string">"data-pipeline"</span>)
app = Application(
    broker_address=<span class="hljs-string">"localhost:9092"</span>,
    loglevel=<span class="hljs-string">"DEBUG"</span>,
    producer_extra_config={
        <span class="hljs-string">"linger.ms"</span>: <span class="hljs-string">"300"</span>,  <span class="hljs-comment"># Wait up to 300ms for more messages before sending</span>
        <span class="hljs-string">"compression.type"</span>: <span class="hljs-string">"gzip"</span>,  <span class="hljs-comment"># Use gzip compression for messages</span>
    },
)
product_view_topic = app.topic(
    <span class="hljs-string">"ecomm_product_view"</span>, value_serializer=<span class="hljs-string">"json"</span>, key_serializer=<span class="hljs-string">"string"</span>
)
<span class="hljs-comment"># Creating the kafka Topic client</span>
cart_topic = app.topic(<span class="hljs-string">"ecomm_cart"</span>, value_serializer=<span class="hljs-string">"json"</span>, key_serializer=<span class="hljs-string">"string"</span>)
buy_topic = app.topic(<span class="hljs-string">"ecomm_buy"</span>, value_serializer=<span class="hljs-string">"json"</span>, key_serializer=<span class="hljs-string">"string"</span>)

<span class="hljs-comment"># The producer returned by app.get_producer() works as a context manager, so we</span>
<span class="hljs-comment"># don't need to worry about flushing and closing the connection gracefully</span>
<span class="hljs-keyword">with</span> app.get_producer() <span class="hljs-keyword">as</span> producer:
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">200</span>):
        logger.info(<span class="hljs-string">f"current iteration: <span class="hljs-subst">{i}</span>"</span>)
        <span class="hljs-comment"># generate dummy data</span>
        event = generate_dummy_e_commerce_data()

        <span class="hljs-comment"># sending the event to respective topics</span>
        <span class="hljs-comment"># Product view event</span>
        product_view_msg = product_view_topic.serialize(
            key=event[<span class="hljs-string">"product_view"</span>].user_id,
            value=event[<span class="hljs-string">"product_view"</span>].model_dump(mode=<span class="hljs-string">"json"</span>),
        )
        producer.produce(
            product_view_topic.name,
            value=product_view_msg.value,
            key=product_view_msg.key,
        )
        logger.debug(<span class="hljs-string">f"producing product view event: <span class="hljs-subst">{event[<span class="hljs-string">'product_view'</span>].event_id}</span>"</span>)

        <span class="hljs-comment"># Add to cart event</span>
        <span class="hljs-comment"># NOTE - It is possible that the add_to_cart and purchase events are None</span>
        <span class="hljs-keyword">if</span> event[<span class="hljs-string">"add_to_cart"</span>]:
            cart_msg = cart_topic.serialize(
                key=event[<span class="hljs-string">"add_to_cart"</span>].user_id,
                value=event[<span class="hljs-string">"add_to_cart"</span>].model_dump(mode=<span class="hljs-string">"json"</span>),
            )
            producer.produce(cart_topic.name, value=cart_msg.value, key=cart_msg.key)
            logger.debug(
                <span class="hljs-string">f"producing product add to cart event: <span class="hljs-subst">{event[<span class="hljs-string">'add_to_cart'</span>].event_id}</span>"</span>
            )

        <span class="hljs-keyword">if</span> event[<span class="hljs-string">"purchase"</span>]:
            buy_msg = buy_topic.serialize(
                key=event[<span class="hljs-string">"purchase"</span>].user_id,
                value=event[<span class="hljs-string">"purchase"</span>].model_dump(mode=<span class="hljs-string">"json"</span>),
            )
            producer.produce(buy_topic.name, value=buy_msg.value, key=buy_msg.key)
            logger.debug(<span class="hljs-string">f"producing buy event: <span class="hljs-subst">{event[<span class="hljs-string">'purchase'</span>].event_id}</span>"</span>)
</code></pre>
<p><strong>Important Point:</strong></p>
<ul>
<li><p>The default settings for the Producer are good, but to extract more performance, you’ll need to fine-tune them. I usually experiment with the following producer config settings:</p>
<ul>
<li><p><code>compression.type</code>: To compress messages. I usually use <code>gzip</code>. It has wide support.</p>
</li>
<li><p><code>linger.ms</code>: How long to wait for more messages to accumulate into a batch before sending. The right value depends on the frequency of incoming messages.</p>
</li>
</ul>
</li>
<li><p>It’s recommended to use the producer as a context manager (<code>with app.get_producer() as producer</code>), so that connections are closed gracefully at the end.</p>
</li>
</ul>
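<p>To see why <code>linger.ms</code> and <code>compression.type</code> work well together, here is a minimal sketch using only the standard library. It is not how Kafka encodes batches internally; the event shape is made up for illustration. It only shows that many similar JSON messages compress far better as one batch than one by one, which is the reason waiting a little for a fuller batch can pay off.</p>

```python
import gzip
import json

# Hypothetical events, similar in shape to the ones produced above
events = [
    {"event_id": f"evt-{i}", "user_id": f"user-{i % 10}", "event_type": "product_view"}
    for i in range(100)
]
payloads = [json.dumps(event).encode("utf-8") for event in events]

# Compressing every message on its own pays the gzip overhead 100 times
individual_size = sum(len(gzip.compress(payload)) for payload in payloads)

# Compressing one batch lets gzip exploit the repetition across messages
batched_size = len(gzip.compress(b"".join(payloads)))

print(f"individually compressed: {individual_size} bytes")
print(f"compressed as one batch: {batched_size} bytes")
```

<p>The trade-off is latency: a larger <code>linger.ms</code> gives fuller batches, but each message may wait longer before it is sent.</p>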
<h3 id="heading-the-consumer-part-a">The Consumer: Part A</h3>
<p>Consume messages from three topics → process the data → join the streams to build the customer journey → publish the results to a fourth topic.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> quixstreams <span class="hljs-keyword">import</span> Application

<span class="hljs-keyword">from</span> src.ops.transform <span class="hljs-keyword">import</span> convert_utc_to_ist

app = Application(
    broker_address=<span class="hljs-string">"localhost:9092"</span>,
    consumer_group=<span class="hljs-string">"ecomm_sync_group"</span>,
    auto_offset_reset=<span class="hljs-string">"earliest"</span>,
    consumer_extra_config={
        <span class="hljs-string">"auto.offset.reset"</span>: <span class="hljs-string">"earliest"</span>,  <span class="hljs-comment"># Start reading from the earliest message</span>
        <span class="hljs-string">"enable.auto.commit"</span>: <span class="hljs-string">"true"</span>,  <span class="hljs-comment"># Automatically commit offsets</span>
    },
    loglevel=<span class="hljs-string">"DEBUG"</span>,
)

<span class="hljs-comment"># add topics to consume</span>
product_view_topic = app.topic(
    <span class="hljs-string">"ecomm_product_view"</span>, value_deserializer=<span class="hljs-string">"json"</span>, key_serializer=<span class="hljs-string">"string"</span>
)
cart_topic = app.topic(<span class="hljs-string">"ecomm_cart"</span>, value_deserializer=<span class="hljs-string">"json"</span>, key_serializer=<span class="hljs-string">"string"</span>)
buy_topic = app.topic(<span class="hljs-string">"ecomm_buy"</span>, value_deserializer=<span class="hljs-string">"json"</span>, key_serializer=<span class="hljs-string">"string"</span>)
<span class="hljs-comment"># Output topic for customer journey</span>
customer_journey_topic = app.topic(
    <span class="hljs-string">"ecomm_customer_journey"</span>,
    value_serializer=<span class="hljs-string">"json"</span>,
)

<span class="hljs-comment"># Streaming dataframe consumers</span>
product_view_sdf = app.dataframe(product_view_topic)
cart_sdf = app.dataframe(cart_topic)
buy_sdf = app.dataframe(buy_topic)

<span class="hljs-comment"># Stream processing</span>
<span class="hljs-comment"># Product view topic</span>
product_view_sdf[<span class="hljs-string">"timestamp"</span>] = product_view_sdf[<span class="hljs-string">"timestamp"</span>].apply(convert_utc_to_ist)

<span class="hljs-comment"># Add to cart topic</span>
cart_sdf[<span class="hljs-string">"price"</span>] = cart_sdf[<span class="hljs-string">"price"</span>].apply(<span class="hljs-keyword">lambda</span> x: round(x, <span class="hljs-number">2</span>))
cart_sdf[<span class="hljs-string">"timestamp"</span>] = cart_sdf[<span class="hljs-string">"timestamp"</span>].apply(convert_utc_to_ist)
<span class="hljs-comment"># Join product view and cart dataframes on user_id</span>
joined_view_cart = cart_sdf.join_asof(
    right=product_view_sdf, how=<span class="hljs-string">"left"</span>, on_merge=<span class="hljs-string">"keep-left"</span>
)

<span class="hljs-comment"># Buy product topic</span>
buy_sdf[<span class="hljs-string">"price"</span>] = buy_sdf[<span class="hljs-string">"price"</span>].apply(<span class="hljs-keyword">lambda</span> x: round(x, <span class="hljs-number">2</span>))
buy_sdf[<span class="hljs-string">"timestamp"</span>] = buy_sdf[<span class="hljs-string">"timestamp"</span>].apply(convert_utc_to_ist)
joined_buy = buy_sdf.join_asof(right=joined_view_cart, how=<span class="hljs-string">"left"</span>, on_merge=<span class="hljs-string">"keep-left"</span>)
joined_buy.to_topic(customer_journey_topic)

<span class="hljs-comment"># Starting the app to process streams real-time</span>
app.run()
</code></pre>
<p><strong>Important Points:</strong></p>
<ul>
<li><p>As with the <code>Producer</code>, the default settings for the <code>Consumer</code> are good, but to extract more performance, you’ll need to fine-tune them through <code>consumer_extra_config</code>.</p>
</li>
<li><p>Quix Streams provides the <code>StreamingDataFrame</code> interface, which lets us apply all kinds of transformation &amp; analytical logic, including Python <code>UDF</code>s.</p>
</li>
<li><p>I am performing a <code>Stateful Join</code> on two <code>StreamingDataFrame</code>s. It uses <code>RocksDB</code> to persist the join state. I highly recommend reading more about it <a target="_blank" href="https://quix.io/docs/quix-streams/joins.html#how-it-works_1">here</a>.</p>
</li>
<li><p>After a couple of <code>Join</code> operations, I send the result as an event to yet another topic. Because both joins use <code>how="left"</code>, every buy event is emitted, enriched with the latest matching cart and product-view data for the same key when such data exists. In this way, I can track the entire journey of the customer from product view to add to cart to the final purchase.</p>
</li>
</ul>
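<p>If the as-of join is new to you, the idea can be sketched in plain Python. This is only an illustration of the semantics as I understand them (match on the same key, take the latest right-side record that is not newer than the left one, let left-side fields win on conflicts, like <code>on_merge="keep-left"</code>). The function name and the event shape are hypothetical, and the real library does this incrementally with state instead of on complete lists.</p>

```python
from bisect import bisect_right


def join_asof_left(left_events, right_events):
    """Left as-of join of two event lists that share a key.

    Each event is a dict with "key" and "timestamp" plus any other fields.
    Every left event is emitted, merged with the latest right event of the
    same key whose timestamp is not later than the left one (if any).
    """
    # Index right-side events per key, sorted by timestamp
    by_key = {}
    for event in sorted(right_events, key=lambda e: e["timestamp"]):
        by_key.setdefault(event["key"], []).append(event)

    joined = []
    for left in left_events:
        candidates = by_key.get(left["key"], [])
        timestamps = [e["timestamp"] for e in candidates]
        # Position just after the last right event at or before the left timestamp
        pos = bisect_right(timestamps, left["timestamp"])
        match = candidates[pos - 1] if pos else {}
        # Left fields override right fields on conflict ("keep-left")
        joined.append({**match, **left})
    return joined


views = [
    {"key": "u1", "timestamp": 1, "viewed": "shoes"},
    {"key": "u1", "timestamp": 5, "viewed": "hat"},
]
carts = [
    {"key": "u1", "timestamp": 3, "cart": "shoes"},
    {"key": "u2", "timestamp": 4, "cart": "bag"},
]
result = join_asof_left(carts, views)
print(result)
```

<p>Here the cart event of <code>u1</code> picks up the view at timestamp 1 (the view at timestamp 5 is in its future), while the cart event of <code>u2</code> has no matching view and is still emitted on its own.</p>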
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">I would highly recommend that you read &amp; go through Quix Streams’ official <a target="_self" href="https://quix.io/docs/get-started/welcome.html">docs</a>. They nicely explained a lot more use cases with examples.</div>
</div>

<h3 id="heading-the-consumer-part-b">The Consumer: Part B</h3>
<p>Consuming messages using a low-level client library API to perform custom sink logic.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path

<span class="hljs-keyword">import</span> orjson
<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">from</span> quixstreams <span class="hljs-keyword">import</span> Application

<span class="hljs-keyword">from</span> src.connector <span class="hljs-keyword">import</span> DataWriter

logger = logging.getLogger(<span class="hljs-string">"data-pipeline"</span>)
merge_option = DataWriter.generate_delta_table_merge_method_options(
    when_not_matched_insert_all=<span class="hljs-literal">True</span>, when_matched_update_all=<span class="hljs-literal">True</span>
)
app = Application(
    broker_address=<span class="hljs-string">"localhost:9092"</span>,
    consumer_group=<span class="hljs-string">"ecomm_customer_report_group"</span>,
    auto_offset_reset=<span class="hljs-string">"earliest"</span>,
    consumer_extra_config={
        <span class="hljs-string">"enable.auto.commit"</span>: <span class="hljs-literal">True</span>,  <span class="hljs-comment"># Automatically commit offsets</span>
    },
    loglevel=<span class="hljs-string">"DEBUG"</span>,
)

<span class="hljs-keyword">with</span> app.get_consumer() <span class="hljs-keyword">as</span> consumer:
    consumer.subscribe(topics=[<span class="hljs-string">"ecomm_customer_journey"</span>])

    <span class="hljs-comment"># Starting the 'Forever consuming consumer'</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        message = consumer.poll(<span class="hljs-number">0.5</span>)
        <span class="hljs-keyword">if</span> message <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
            <span class="hljs-keyword">continue</span>
        <span class="hljs-keyword">elif</span> message.error():
            logger.error(<span class="hljs-string">"Kafka error: %s"</span>, message.error())
            <span class="hljs-keyword">continue</span>

        value = message.value()

        <span class="hljs-comment"># Merge data into Delta lake table</span>
        df = pl.from_dict(orjson.loads(value))
        merge_stats = DataWriter.delta_table_merge_disk(
            df=df,
            path=Path(__file__).parent.parent.parent
            / <span class="hljs-string">"data/gold/ecomm_customer_journey"</span>,
            delta_merge_options={
                <span class="hljs-string">"predicate"</span>: <span class="hljs-string">"source.event_id = target.event_id"</span>,  <span class="hljs-comment"># condition to determine upsert req</span>
                <span class="hljs-string">"source_alias"</span>: <span class="hljs-string">"source"</span>,
                <span class="hljs-string">"target_alias"</span>: <span class="hljs-string">"target"</span>,
            },
            delta_merge_method_options=merge_option,
        )
        logger.info(<span class="hljs-string">f"merge successfully with stats: <span class="hljs-subst">{merge_stats}</span>"</span>)

        consumer.store_offsets(message=message)
</code></pre>
<p><strong>Important Points:</strong></p>
<ul>
<li><p>This is the second way to get messages from Kafka Topics. It requires more code but gives us flexibility in processing. However, it's not the main recommendation; using <code>StreamingDataFrame</code> is preferred.</p>
</li>
<li><p>I needed to do a <code>Delta Merge</code> on the Delta Lake table, which the <code>quixstreams</code> library doesn't support directly. So, I used the low-level client API to achieve it.</p>
</li>
<li><p>I like the <code>quixstreams</code> library because it offers a high-level API, <code>StreamingDataFrame</code>, for batch-like processing on streaming data. When that is not enough, it also provides a low-level Kafka client API, allowing us to do almost anything.</p>
</li>
</ul>
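<p>For readers unfamiliar with a <code>Delta Merge</code>, the upsert that the two merge options above ask for can be sketched in plain Python. This is not the Delta Lake implementation, and <code>merge_upsert</code> is a hypothetical helper; it only mirrors the logic of matching on <code>event_id</code>, updating matched rows, and inserting unmatched ones.</p>

```python
def merge_upsert(target_rows, source_rows, key="event_id"):
    """Upsert source rows into target rows, matching on `key`.

    Matched rows are updated with all source columns
    (like when_matched_update_all), unmatched rows are inserted
    (like when_not_matched_insert_all).
    Returns the merged rows and simple statistics.
    """
    merged = {row[key]: dict(row) for row in target_rows}
    stats = {"updated": 0, "inserted": 0}
    for row in source_rows:
        if row[key] in merged:
            merged[row[key]].update(row)
            stats["updated"] += 1
        else:
            merged[row[key]] = dict(row)
            stats["inserted"] += 1
    return list(merged.values()), stats


target = [{"event_id": "e1", "price": 10.0}]
source = [{"event_id": "e1", "price": 12.5}, {"event_id": "e2", "price": 7.0}]
rows, stats = merge_upsert(target, source)
print(rows, stats)
```

<p>A plain <code>append</code> would instead leave two rows for <code>e1</code>, which is why the merge matters when the same event can arrive more than once.</p>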
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The Pathway library has native support for writing to Delta Lake tables, but it only performs <code>append</code>, not <code>delta merge</code>. Also, it does not expose any direct client API.</div>
</div>

<p><strong>Note</strong>: I have open-sourced my project &amp; you can find all the source code here - <a target="_blank" href="https://github.com/Akashdesarda/data-pipeline-app-demo">Akashdesarda/data-pipeline-app-demo</a></p>
<h2 id="heading-conclusion">Conclusion:</h2>
<ul>
<li><p>Supercharging your streaming data pipeline in Python involves using modern tools and practices to boost efficiency and performance.</p>
</li>
<li><p>Draw inspiration from batch processing to apply similar principles to streaming data without losing real-time capabilities.</p>
</li>
<li><p>Use advanced libraries like Quix Streams for efficient data processing.</p>
</li>
<li><p>Features like <code>StreamingDataFrame</code> allow for high-level operations.</p>
</li>
<li><p>Low-level client APIs are available for custom tasks.</p>
</li>
<li><p>This approach simplifies development and provides flexibility and scalability.</p>
</li>
<li><p>It enables effective tracking and analysis of customer journeys in real-time.</p>
</li>
<li><p>Embracing these modern techniques ensures your data pipeline is robust, efficient, and meets the demands of today's data-driven environments.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Why Using SQL for Data Application Development Might Be a Mistake]]></title><description><![CDATA[Data is a crucial component of any modern application or service, and how you process it can determine the success or failure of your app. While data is often stored in databases, relying on SQL queries for processing can be problematic. Let me expla...]]></description><link>https://importidea.dev/why-using-sql-for-data-application-development-might-be-a-mistake</link><guid isPermaLink="true">https://importidea.dev/why-using-sql-for-data-application-development-might-be-a-mistake</guid><category><![CDATA[SQL]]></category><category><![CDATA[ETL]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[orm]]></category><category><![CDATA[software development]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Sat, 17 May 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747628367976/688c6d5c-6f2f-43a4-9ed5-473141c013c7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data is a crucial component of any modern application or service, and how you process it can determine the success or failure of your app. While data is often stored in databases, relying on SQL queries for processing can be problematic. Let me explain why this approach might not be ideal and what alternatives you can consider.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">This article offers opinionated advice based on my experience. I am not presenting the idea as a universal truth that you have to follow no matter what. Based on your experience, you may have a completely different take.</div>
</div>

<p>Generally, the use of data in any application can be categorised as:</p>
<ol>
<li><p>Analytical</p>
</li>
<li><p>ETL/ELT</p>
</li>
</ol>
<p>Using SQL for analytical purposes is great, and you should keep using it. SQL truly excels in this area. There's no need for setup or worrying about the environment—just run the query and get your results.</p>
<p>But for ETL pipelines with complex transformations (which are common in the real world), you should consider moving away from SQL, because it often leads to very long, nested, and seemingly endless CTEs. This makes the code really hard to read &amp; a nightmare to debug.</p>
<h3 id="heading-challenges-of-sql-heavy-etl-pipelines-why-it-can-be-a-mistake"><strong>Challenges of SQL-heavy ETL Pipelines (Why it can be a mistake):</strong></h3>
<ul>
<li><p><strong>Increased Complexity and Reduced Readability:</strong> ETL pipelines with many transformations often result in complex, deeply nested SQL queries with numerous CTEs, making the code hard to read, understand, and maintain. This complexity can be challenging for new team members, or even for your future self.</p>
</li>
<li><p><strong>Debugging Nightmares:</strong> Tracking down errors in a massive SQL script with multiple levels of nesting can be a nightmare. The error messages might not always be as informative as those you'd get in your primary programming language's debugger.</p>
</li>
<li><p><strong>Limited Expressiveness for Complex Logic:</strong> While SQL is powerful for set-based operations, implementing intricate business logic or conditional transformations can become cumbersome and less intuitive compared to using the control flow structures available in languages like Python or Java.</p>
</li>
<li><p><strong>Version Control and Collaboration Challenges:</strong> While SQL scripts can be version-controlled, the changes and their impact might be harder to visualize and collaborate on compared to code within a larger application codebase managed by standard development tools and practices.</p>
</li>
<li><p><strong>Fragility and Maintainability:</strong> Complex SQL ETL can become fragile and hard to maintain over time, which might lead to constantly fixing problems as they arise.</p>
</li>
<li><p><strong>Testing Difficulties:</strong> Unit testing individual components and transformations within a large SQL script can be challenging. Testing often involves running the entire pipeline or significant portions of it, making it slower and less granular than testing functions or methods in your application code.</p>
</li>
</ul>
<p>Below is a sample of a typical query structure for an ETL pipeline. As you can see, even with relatively few lines of code, it is already hard to follow.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span>
    SourceData <span class="hljs-keyword">AS</span> (
        <span class="hljs-comment">-- Step 1: Extract data from the raw source</span>
        <span class="hljs-keyword">SELECT</span>
            column1,
            column2,
            column3,
            <span class="hljs-comment">-- ... more columns ...</span>
            <span class="hljs-keyword">CAST</span>(timestamp_column <span class="hljs-keyword">AS</span> <span class="hljs-built_in">TIMESTAMP</span>) <span class="hljs-keyword">AS</span> processed_timestamp
        <span class="hljs-keyword">FROM</span>
            raw_data_table
        <span class="hljs-keyword">WHERE</span>
            date_column &gt;= <span class="hljs-string">'some_start_date'</span> <span class="hljs-keyword">AND</span> date_column &lt; <span class="hljs-string">'some_end_date'</span>
    ),

    CleanedData <span class="hljs-keyword">AS</span> (
        <span class="hljs-comment">-- Step 2: Clean and standardize the data</span>
        <span class="hljs-keyword">SELECT</span>
            <span class="hljs-keyword">UPPER</span>(<span class="hljs-keyword">TRIM</span>(column1)) <span class="hljs-keyword">AS</span> standardized_column1,
            <span class="hljs-keyword">CASE</span>
                <span class="hljs-keyword">WHEN</span> column2 = <span class="hljs-string">'value_a'</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'Category A'</span>
                <span class="hljs-keyword">WHEN</span> column2 = <span class="hljs-string">'value_b'</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'Category B'</span>
                <span class="hljs-keyword">ELSE</span> <span class="hljs-string">'Other'</span>
            <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> categorized_column2,
            <span class="hljs-keyword">CAST</span>(column3 <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span class="hljs-keyword">AS</span> numeric_column3,
            processed_timestamp
        <span class="hljs-keyword">FROM</span>
            SourceData
        <span class="hljs-keyword">WHERE</span>
            column1 <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">AND</span> column3 <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>
    ),

    TransformedData1 <span class="hljs-keyword">AS</span> (
        <span class="hljs-comment">-- Step 3: Apply the first set of transformations</span>
        <span class="hljs-keyword">SELECT</span>
            standardized_column1,
            categorized_column2,
            numeric_column3 * <span class="hljs-number">10</span> <span class="hljs-keyword">AS</span> transformed_numeric_column3,
            <span class="hljs-keyword">EXTRACT</span>(<span class="hljs-keyword">YEAR</span> <span class="hljs-keyword">FROM</span> processed_timestamp) <span class="hljs-keyword">AS</span> processing_year,
            <span class="hljs-keyword">EXTRACT</span>(<span class="hljs-keyword">MONTH</span> <span class="hljs-keyword">FROM</span> processed_timestamp) <span class="hljs-keyword">AS</span> processing_month
        <span class="hljs-keyword">FROM</span>
            CleanedData
    ),

    LookupTable1 <span class="hljs-keyword">AS</span> (
        <span class="hljs-comment">-- Step 4: Join with a lookup table</span>
        <span class="hljs-keyword">SELECT</span>
            td1.*,
            lt1.lookup_value
        <span class="hljs-keyword">FROM</span>
            TransformedData1 td1
        <span class="hljs-keyword">LEFT</span> <span class="hljs-keyword">JOIN</span>
            lookup_table_one lt1 <span class="hljs-keyword">ON</span> td1.standardized_column1 = lt1.lookup_key
    ),

    FilteredData <span class="hljs-keyword">AS</span> (
        <span class="hljs-comment">-- Step 5: Apply specific filtering criteria</span>
        <span class="hljs-keyword">SELECT</span>
            *
        <span class="hljs-keyword">FROM</span>
            LookupTable1
        <span class="hljs-keyword">WHERE</span>
            processing_year = <span class="hljs-number">2024</span> <span class="hljs-keyword">AND</span> categorized_column2 <span class="hljs-keyword">IN</span> (<span class="hljs-string">'Category A'</span>, <span class="hljs-string">'Category B'</span>)
    ),

    AggregatedData <span class="hljs-keyword">AS</span> (
        <span class="hljs-comment">-- Step 6: Perform aggregations</span>
        <span class="hljs-keyword">SELECT</span>
            processing_year,
            processing_month,
            categorized_column2,
            <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> record_count,
            <span class="hljs-keyword">SUM</span>(transformed_numeric_column3) <span class="hljs-keyword">AS</span> total_transformed_value
        <span class="hljs-keyword">FROM</span>
            FilteredData
        <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span>
            processing_year, processing_month, categorized_column2
    ),

    FinalStage <span class="hljs-keyword">AS</span> (
        <span class="hljs-comment">-- Step 7: Final transformations and calculations</span>
        <span class="hljs-keyword">SELECT</span>
            ad.*,
            (total_transformed_value / record_count) <span class="hljs-keyword">AS</span> average_transformed_value,
            <span class="hljs-string">'Processed'</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">status</span>
        <span class="hljs-keyword">FROM</span>
            AggregatedData ad
        <span class="hljs-comment">-- Potentially another join with another lookup table here</span>
        <span class="hljs-keyword">LEFT</span> <span class="hljs-keyword">JOIN</span>
            final_lookup_table flt <span class="hljs-keyword">ON</span> ad.categorized_column2 = flt.category
    )

<span class="hljs-comment">-- Step 8: Load or select the final processed data</span>
<span class="hljs-keyword">SELECT</span>
    *
<span class="hljs-keyword">FROM</span>
    FinalStage
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span>
    processing_year, processing_month, categorized_column2;

<span class="hljs-comment">-- You might even have subqueries within these CTEs for more complex logic:</span>
<span class="hljs-comment">--</span>
<span class="hljs-comment">-- TransformedData2 AS (</span>
<span class="hljs-comment">--     SELECT</span>
<span class="hljs-comment">--         fd.*,</span>
<span class="hljs-comment">--         (SELECT MAX(another_column) FROM another_table WHERE fk_column = fd.some_id) AS max_value_from_other_table</span>
<span class="hljs-comment">--     FROM</span>
<span class="hljs-comment">--         FilteredData fd</span>
<span class="hljs-comment">-- ),</span>
<span class="hljs-comment">--</span>
<span class="hljs-comment">-- And so on...</span>
</code></pre>
<h3 id="heading-the-solution">The Solution</h3>
<p>Follow the generic steps below to solve the problem described above &amp; improve your development experience.</p>
<ul>
<li><p><strong>Step 1: Initial Starting Point</strong></p>
<ul>
<li><p>Use an ORM to pull decently filtered initial source data.</p>
</li>
<li><p>If you wish, you can even use Raw SQL with the programming language-specific database drivers.</p>
</li>
<li><p>If you are using Python, I have already written a blog post that explains how to work effectively with databases in Python. It describes a solution that offers the best of both worlds. You can read it here: <a target="_blank" href="https://importidea.dev/how-to-effectively-work-with-databases-in-python">How to effectively work with Databases in Python</a>.</p>
</li>
</ul>
</li>
<li><p><strong>Step 2: Design your codebase using a proper Library Structure</strong></p>
<ul>
<li><p>I'm confident that in the real world, you'll have many ETL pipelines running in a project to meet different business needs. I'm also certain that these pipelines will share a lot of common business logic.</p>
</li>
<li><p>All the code, business logic, utilities, connections, visualizations, and more should be developed and organized as a Library. This approach provides a wide range of flexible options and enables us to leverage the best practices of Software Engineering. You can apply principles like DRY, SRP, design patterns, OOP, and more. The possibilities are extensive.</p>
</li>
<li><p>Below is a sample library folder structure. It lets us write each piece of code in the appropriate location.</p>
</li>
<li><pre><code class="lang-plaintext">  some_library/
  ├── config
  ├── utils/
  │   ├── module1
  │   └── module2
  ├── ops/
  │   ├── preprocessing/
  │   │   ├── module1
  │   │   └── module2
  │   └── processing/
  │       ├── module1
  │       └── module2
  └── pipeline/
      ├── pipeline1
      ├── pipeline2
      └── pipeline3
  cicd.yaml
  pyproject.toml
  poetry.lock
  test/
  ├── test_utils
  ├── test_ops
  └── test_pipeline
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Step 3: Use your library as a package in the pipeline</strong></p>
<ul>
<li><p>Now that the entire codebase is organized as a library, the next step is to publish it as a package.</p>
</li>
<li><p>Install and use this package in downstream pipelines just like any other open-source package.</p>
</li>
<li><p>This approach allows you to reuse the same functions for common functionalities.</p>
</li>
</ul>
</li>
<li><p><strong>Step 4: Use the programming language ecosystem</strong></p>
<ul>
<li><p>Now that we're using programming languages like Python, Java, or JavaScript, we can take advantage of open-source tools and packages to enhance the pipeline even more.</p>
</li>
<li><h3 id="heading-this-saves-us-from-the-pain-of-re-inventing-the-wheel">This saves us from the pain of re-inventing the wheel.</h3>
</li>
</ul>
</li>
</ul>
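<p>As a rough illustration of Steps 2 and 3, the filter, aggregate, and final-calculation stages from the SQL example could live in the library as small, individually testable functions. This is a minimal sketch in plain Python, not a definitive implementation: the function names (<code>filter_rows</code>, <code>aggregate</code>, <code>add_average</code>, <code>run_pipeline</code>) and the sample rows are hypothetical; only the column names come from the query above.</p>

```python
from collections import defaultdict


def filter_rows(rows, year, categories):
    """Step 5 of the SQL example: keep rows for one year and selected categories."""
    return [
        row for row in rows
        if row["processing_year"] == year and row["categorized_column2"] in categories
    ]


def aggregate(rows):
    """Step 6: count rows and sum values per (year, month, category)."""
    groups = defaultdict(lambda: {"record_count": 0, "total_transformed_value": 0.0})
    for row in rows:
        key = (row["processing_year"], row["processing_month"], row["categorized_column2"])
        groups[key]["record_count"] += 1
        groups[key]["total_transformed_value"] += row["transformed_numeric_column3"]
    return [
        {
            "processing_year": year,
            "processing_month": month,
            "categorized_column2": category,
            **totals,
        }
        for (year, month, category), totals in sorted(groups.items())
    ]


def add_average(rows):
    """Step 7: derive the average and tag each row as processed."""
    return [
        {
            **row,
            "average_transformed_value": row["total_transformed_value"] / row["record_count"],
            "status": "Processed",
        }
        for row in rows
    ]


def run_pipeline(rows):
    """A pipeline is just a composition of the reusable steps."""
    filtered = filter_rows(rows, year=2024, categories={"Category A", "Category B"})
    return add_average(aggregate(filtered))


# Hypothetical sample data
sample_rows = [
    {"processing_year": 2024, "processing_month": 1, "categorized_column2": "Category A", "transformed_numeric_column3": 10.0},
    {"processing_year": 2024, "processing_month": 1, "categorized_column2": "Category A", "transformed_numeric_column3": 30.0},
    {"processing_year": 2024, "processing_month": 2, "categorized_column2": "Category C", "transformed_numeric_column3": 5.0},
    {"processing_year": 2023, "processing_month": 1, "categorized_column2": "Category B", "transformed_numeric_column3": 7.0},
]
result = run_pipeline(sample_rows)
```

<p>Because every step is an ordinary function, each one can be unit tested on its own and reused by other pipelines in the same package.</p>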
<h3 id="heading-possible-counterarguments-or-nuances-to-consider"><strong>Possible Counterarguments or Nuances to Consider:</strong></h3>
<ul>
<li><p><strong>Performance:</strong> For very large datasets and simple, set-based transformations, well-optimized SQL can sometimes outperform code-based ETL due to the database's ability to leverage indexing and other performance optimizations. However, for complex logic, the overhead of repeated SQL calls and data transfer might negate this benefit.</p>
</li>
<li><p><strong>Database-Specific Features:</strong> Sometimes, specific database features might be most efficiently utilized directly through SQL. However, if the core logic becomes overly convoluted, the trade-off for readability and maintainability might still favor the above approach for the bulk of the pipeline.</p>
</li>
<li><p><strong>Out of Memory Issues:</strong> Large datasets may not fit into memory, which prevents the pipeline from running. This can be solved using batch processing. To further improve performance, the batches can be executed in parallel using a distributed system.</p>
</li>
</ul>
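<p>The batch-processing idea from the last point can be sketched with a small generator. This is only an illustration under simple assumptions: the helper names <code>iter_batches</code> and <code>total_in_batches</code> are hypothetical, and a running sum stands in for a real transformation.</p>

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of at most `batch_size` records without materializing the whole input."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def total_in_batches(records, batch_size):
    """Aggregate one batch at a time so only a single batch lives in memory."""
    total = 0
    batch_count = 0
    for batch in iter_batches(records, batch_size):
        total += sum(batch)
        batch_count += 1
    return total, batch_count


# range() is lazy, so the one million values are never held in memory together
total, batch_count = total_in_batches(range(1_000_000), batch_size=100_000)
```

<p>Since each batch is processed independently here, the loop body is also the unit of work you would hand to parallel workers.</p>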
<h3 id="heading-conclusion">Conclusion:</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>SQL-heavy ETL Pipelines</strong></td><td><strong>ETL Pipelines running in the programming language</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Readability</strong></td><td>Can become complex and hard to read with many CTEs and nesting</td><td>Generally more readable and maintainable, especially for complex logic</td></tr>
<tr>
<td><strong>Maintainability</strong></td><td>Challenging to maintain and understand over time</td><td>Easier to maintain and refactor</td></tr>
<tr>
<td><strong>Debugging</strong></td><td>Difficult to debug complex nested queries</td><td>Easier to debug using programming language tools</td></tr>
<tr>
<td><strong>Expressiveness</strong></td><td>Limited for intricate business logic</td><td>More flexible and expressive with full programming language features</td></tr>
<tr>
<td><strong>SDLC Integration</strong></td><td>Might fall outside standard development lifecycle</td><td>Better integration with application codebase and SDLC practices</td></tr>
<tr>
<td><strong>Testing</strong></td><td>Harder to unit test granular components</td><td>Easier to unit test individual transformation steps</td></tr>
<tr>
<td><strong>Development Speed</strong></td><td>Can be quick for simple transformations</td><td>Might be slower initially for complex mappings, but faster for complex logic</td></tr>
<tr>
<td><strong>Performance</strong></td><td>Can be highly optimized for set-based operations</td><td>Potential overhead of object mapping, but manageable with careful design</td></tr>
<tr>
<td><strong>Team Skillset</strong></td><td>Requires strong SQL expertise</td><td>Leverages existing programming language skills</td></tr>
<tr>
<td><strong>Complexity Management</strong></td><td>Complexity increases significantly with more transformations</td><td>Complexity can be better managed through code organization and modularity</td></tr>
<tr>
<td><strong>Flexibility</strong></td><td>Less flexible for complex, multi-step transformations</td><td>More flexible in handling diverse data manipulations and integrations</td></tr>
</tbody>
</table>
</div>]]></content:encoded></item><item><title><![CDATA[From Failure to Flow: How I Used Polars to Conquer Memory Issues in Our Data Pipelines]]></title><description><![CDATA[Ever been bogged down by data pipelines crashing due to memory issues? It's a frustratingly common problem in data engineering projects. This post chronicles my experience of identifying and resolving memory bottlenecks in our data processing using t...]]></description><link>https://importidea.dev/from-failure-to-flow-how-i-used-polars-to-conquer-memory-issues-in-our-data-pipelines</link><guid isPermaLink="true">https://importidea.dev/from-failure-to-flow-how-i-used-polars-to-conquer-memory-issues-in-our-data-pipelines</guid><category><![CDATA[Lazyframe]]></category><category><![CDATA[Python]]></category><category><![CDATA[Polars]]></category><category><![CDATA[numpy]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[memory-management]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Sat, 26 Apr 2025 18:34:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745692097833/90a2cc26-2b37-441f-af5d-8721d17e9c95.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever been bogged down by data pipelines crashing due to memory issues? It's a frustratingly common problem in data engineering projects. This post chronicles my experience of identifying and resolving memory bottlenecks in our data processing using the powerful Polars library adhering to data engineering best practices.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">I've organised my explanation using the STAR method for clarity and ease of understanding. Let's dive into how I moved from pipeline failures to a smooth, efficient flow.</div>
</div>

<h1 id="heading-the-situation">The Situation</h1>
<p>The project involved generating <strong>Monte Carlo Projections</strong> using <strong>Geometric Brownian Motion (GBM)</strong> to create thousands of potential future price paths for assets. This helps analysts estimate the probabilities of various outcomes.</p>
<p>First, I would like to provide a concise explanation of the components involved in this pipeline.</p>
<h2 id="heading-pipeline-components">Pipeline Components</h2>
<p><strong>Step 1:</strong> Generate a random normal distribution and then perform a cumulative sum over it.</p>
<p><strong>Step 2:</strong> Using the Covariance matrix and two sets of random normal distributions, compute the Einstein summation.</p>
<p><strong>Step 3:</strong> Incorporate both deterministic trends (drift) and random fluctuations (volatility) using GBM.</p>
<p><strong>Step 4:</strong> Perform mean aggregation &amp; select based on filter criteria.</p>
<h2 id="heading-issue">Issue</h2>
<ul>
<li><p>The initial version used the <code>Numpy</code> &amp; <code>Pandas</code> libraries for all computations.</p>
</li>
<li><p>We have to run the Projection for at least 100 years, but many times it would be even more than that. Let’s break down how the computation looks based on the above steps,</p>
</li>
<li><pre><code class="lang-python">    <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

    rng = np.random.default_rng()

    DAYS = <span class="hljs-number">365</span>
    YEARS = <span class="hljs-number">100</span>
    PATHS = <span class="hljs-number">20_000</span>

    <span class="hljs-comment"># Generate the cumulative sum of a random normal distribution</span>
    <span class="hljs-comment"># NOTE - The business logic requires two such sets</span>
    nor_dis_random1 = np.cumsum(
        rng.normal(<span class="hljs-number">0</span>, np.sqrt(<span class="hljs-number">1</span> / DAYS), (DAYS * YEARS, PATHS)),
        axis=<span class="hljs-number">0</span>
    )
    nor_dis_random2 = np.cumsum(
        rng.normal(<span class="hljs-number">0</span>, np.sqrt(<span class="hljs-number">1</span> / DAYS), (DAYS * YEARS, PATHS)),
        axis=<span class="hljs-number">0</span>
    )

    <span class="hljs-comment"># Computing Einstein Summation</span>
    <span class="hljs-comment"># covariance_matrix_data is loaded elsewhere in the pipeline</span>
    ein_corr = np.einsum(
                    <span class="hljs-string">"ij,jkl-&gt;ikl"</span>,
                    covariance_matrix_data,
                    np.array([nor_dis_random1, nor_dis_random2]),
                    casting=<span class="hljs-string">"same_kind"</span>,
                    optimize=<span class="hljs-literal">True</span>,
                )
</code></pre>
</li>
<li><p>As you can see above, we had to deal with extremely large arrays. <code>Numpy</code> has to allocate memory for each random normal distribution, and the business logic requires two such arrays to compute the Einstein Summation. For projections of 100 years &amp; above, the pipeline kept failing with an <code>Out of Memory</code> error even after scaling the machine up to <code>64 GB</code>.</p>
</li>
<li><p>Following the Einstein Summation task, numerous transformations, including the application of GBM, were performed. This was initially done using <code>Pandas</code>, which made the already memory-hungry code even heavier.</p>
</li>
<li><pre><code class="lang-python">    <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

    <span class="hljs-comment"># Creating two pandas dataframes from the Einstein Summation array</span>
    <span class="hljs-comment"># (np and ein_corr come from the previous snippet)</span>
    projection_a = pd.DataFrame(ein_corr[<span class="hljs-number">0</span>].T).cumsum(axis=<span class="hljs-number">0</span>)
    projection_b = pd.DataFrame(ein_corr[<span class="hljs-number">1</span>].T).cumsum(axis=<span class="hljs-number">0</span>)

    <span class="hljs-comment"># Further transformation</span>
    sigma_s = <span class="hljs-number">0.26</span>
    mu_s = <span class="hljs-number">0.05</span>
    s0 = <span class="hljs-number">391555</span>

    projection_a = s0 * np.exp((mu_s - <span class="hljs-number">0.5</span> * sigma_s**<span class="hljs-number">2</span>) + sigma_s * projection_a)
    projection_b = s0 * np.exp((mu_s - <span class="hljs-number">0.5</span> * sigma_s**<span class="hljs-number">2</span>) + sigma_s * projection_b)
</code></pre>
</li>
<li><p>It became simply unsustainable to keep increasing memory.</p>
</li>
</ul>
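<p>To put rough numbers on the problem, here is a back-of-the-envelope estimate of a single array of shape <code>(DAYS * YEARS, PATHS)</code>. It is a sketch only: the helper <code>array_size_gb</code> is a hypothetical name, and it counts raw element storage, ignoring any library overhead.</p>

```python
DAYS = 365
YEARS = 100
PATHS = 20_000

BYTES_PER_FLOAT64 = 8
BYTES_PER_FLOAT32 = 4


def array_size_gb(rows, cols, bytes_per_value):
    """Size of a dense rows x cols array in gigabytes (10**9 bytes)."""
    return rows * cols * bytes_per_value / 10**9


rows = DAYS * YEARS  # 36,500 daily time steps
one_array_f64 = array_size_gb(rows, PATHS, BYTES_PER_FLOAT64)
one_array_f32 = array_size_gb(rows, PATHS, BYTES_PER_FLOAT32)

print(f"one float64 array: {one_array_f64:.2f} GB")  # 5.84 GB
print(f"one float32 array: {one_array_f32:.2f} GB")  # 2.92 GB
```

<p>The eager version keeps several arrays of this size alive at once (the two random sets, the stacked copy passed to <code>einsum</code>, its output, and the Pandas frames built on top), so the peak usage is a multiple of the single-array figure.</p>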
<h1 id="heading-the-tasks">The Tasks</h1>
<p>Let me show you how I divided the problem into smaller tasks and solved them, aiming for a cohesive and efficient process.</p>
<h2 id="heading-assessment">Assessment</h2>
<ul>
<li><p>The moment I stepped into this project, I realized that <code>Pandas</code> wasn't the right tool for the job. Both Numpy and Pandas store everything in memory from start to finish, and every transformation adds to memory use.</p>
</li>
<li><p>While exploring possible solutions, I wondered about using <code>PySpark</code> because of its ability to handle distributed workloads. But then, I stumbled upon two significant issues:</p>
<ul>
<li><p>It doesn't have first-class support for running NumPy functions across a cluster.</p>
</li>
<li><p>The environment is costly and bulky since Spark is cluster-based.</p>
</li>
</ul>
</li>
<li><p>Switching to Spark would have required a lot more work. That's where <a target="_blank" href="https://docs.pola.rs/api/python/stable/reference/">Polars</a> comes in to save the day. Here are the key reasons why I chose it:</p>
<ul>
<li><p>Like PySpark’s <code>lazy evaluation</code>, Polars also supports it with <a target="_blank" href="https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html"><code>LazyFrame</code></a>.</p>
</li>
<li><p>First-class support for Numpy functions, even in LazyFrame.</p>
</li>
<li><p>Runs on a single machine and generally offers excellent performance, since it's built in Rust &amp; uses parallel processing internally.</p>
</li>
</ul>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">If you're not using <a target="_self" href="https://docs.pola.rs/">Polars</a> in your work yet, let me tell you, you're missing out on the ultimate Swiss army knife of the Data Engineering world! It's incredibly feature-rich, lightning-fast, supports tons of I/O operations, and boasts an amazingly intuitive API!</div>
</div>

<h2 id="heading-redesigned-pipeline-components">Redesigned Pipeline components</h2>
<ul>
<li><p><strong>Step 1:</strong> Replace all <code>Pandas Dataframe</code> code with <code>Polars LazyFrame</code>.</p>
</li>
<li><p><strong>Step 2:</strong> Since Polars fully supports Numpy functions, keep using specific Numpy functions that Polars doesn't have built-in.</p>
</li>
<li><p><strong>Step 3:</strong> Switch to <code>Batch Processing</code> for transformations. Ensure each batch creates a <code>LazyFrame</code>, so nothing is stored in memory until the final execution.</p>
</li>
<li><p><strong>Step 4:</strong> Keep adding transformation steps to the LazyFrame and execute them at the end. This approach lets Polars excel by making the most of its <code>Lazy API</code>.</p>
</li>
</ul>
<pre><code class="lang-plaintext">With the lazy API, Polars doesn't run each query line-by-line but instead processes the full
query end-to-end. To get the most out of Polars it is important that you use the lazy API because:
 - The lazy API allows Polars to apply automatic query optimization with the query optimizer.
 - The lazy API allows you to work with larger than memory datasets using streaming.
 - The lazy API can catch schema errors before processing the data.
</code></pre>
<h1 id="heading-the-actions">The Actions</h1>
<p>Enough theory. Now let's dive into the actual code that solved all of the above issues.</p>
<h2 id="heading-numpy-function-polars-lazyframe">Numpy function + Polars Lazyframe</h2>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> alive_progress <span class="hljs-keyword">import</span> alive_it

rng = np.random.default_rng()

DAYS = <span class="hljs-number">365</span>
YEARS = <span class="hljs-number">100</span>
PATHS = <span class="hljs-number">20_000</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">unit_batch</span>(<span class="hljs-params">fixed_years=<span class="hljs-number">10</span></span>):</span>
    <span class="hljs-keyword">return</span> (
        pl.LazyFrame()
        .with_columns(
            <span class="hljs-comment"># Here I am directly using NumPy's normal() &amp; cumsum() in a Polars LazyFrame since</span>
            <span class="hljs-comment"># it has first-class support</span>
            nor_dis_random=np.cumsum(
                rng.normal(<span class="hljs-number">0</span>, np.sqrt(<span class="hljs-number">1</span> / DAYS), (DAYS * fixed_years, PATHS)).astype(np.float32),
                axis=<span class="hljs-number">0</span>,
                dtype=np.float32,
            )  <span class="hljs-comment"># I am using Float32 as the data type instead of the default Float64</span>
        )
        <span class="hljs-comment"># I need to create a LazyFrame of shape (x,y) from normal distribution data of shape (x,y)</span>
        <span class="hljs-comment"># Initially `nor_dis_random` is an ArrayType column, which I explode into y columns</span>
        <span class="hljs-comment"># The beauty of the Polars Lazy API is that we can keep adding steps to the query plan.</span>
        .with_columns(pl.col(<span class="hljs-string">"nor_dis_random"</span>).arr.to_struct().alias(<span class="hljs-string">"array_struct"</span>))
        .unnest(<span class="hljs-string">"array_struct"</span>)
        .drop(<span class="hljs-string">"nor_dis_random"</span>)
    )

<span class="hljs-comment"># Assumption: `chunks` holds the number of years per batch, e.g. ten batches of 10 years</span>
chunks = [<span class="hljs-number">10</span>] * (YEARS // <span class="hljs-number">10</span>)

<span class="hljs-comment"># Creating a stack of all the LazyFrame batches, which will eventually be concatenated</span>
normal_dist_random_cum_sum_frames = [
    unit_batch(i)
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> alive_it(
        chunks, title=<span class="hljs-string">"Normal Distribution Cumulative sum"</span>, force_tty=<span class="hljs-literal">True</span>, total=len(chunks)
    )
]
<span class="hljs-comment"># This step is very important: here I am vertically stacking all batches, but still as a LazyFrame</span>
normal_dist = pl.concat(normal_dist_random_cum_sum_frames, how=<span class="hljs-string">"vertical"</span>)
</code></pre>
<h2 id="heading-batch-processing-on-polars-lazyframe">Batch Processing on Polars Lazyframe</h2>
<pre><code class="lang-python">BATCH_SIZE = <span class="hljs-number">1</span>_000

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calculate_no_of_batches</span>(<span class="hljs-params">row_count: int, batch_size: int = <span class="hljs-number">10</span></span>) -&gt; int:</span>
    <span class="hljs-keyword">return</span> row_count // batch_size + (row_count % batch_size &gt; <span class="hljs-number">0</span>)


ein_corr_a, ein_corr_b = [], []

no_of_batches = calculate_no_of_batches(
    normal_dist1.select(pl.len()).collect().item(), batch_size=BATCH_SIZE
)

<span class="hljs-comment"># Batch processing loop</span>
<span class="hljs-keyword">for</span> batch <span class="hljs-keyword">in</span> alive_it(range(no_of_batches), title=<span class="hljs-string">"Einstein summation"</span>, force_tty=<span class="hljs-literal">True</span>):
    <span class="hljs-comment"># The Polars Lazy API provides slice() to iterate over a LazyFrame without loading</span>
    <span class="hljs-comment"># all of its data into memory. It is similar to SQL's LIMIT &amp; OFFSET.</span>
    brownian_path_array1 = normal_dist1.slice(batch * BATCH_SIZE, BATCH_SIZE)
    brownian_path_array2 = normal_dist2.slice(batch * BATCH_SIZE, BATCH_SIZE)

    <span class="hljs-comment"># Append as array to calculate Einstein summation convention</span>
    einstein_summation_operands = np.array(
        [
            brownian_path_array1.collect().to_numpy().T,
            brownian_path_array2.collect().to_numpy().T,
        ],
        dtype=np.float32,
    )
    ein_corr = np.einsum(
        <span class="hljs-string">"ij,jkl-&gt;ikl"</span>,
        covariance_matrix_data,
        einstein_summation_operands,
        dtype=np.float32,
        casting=<span class="hljs-string">"same_kind"</span>,
        optimize=<span class="hljs-literal">True</span>,
    )

    <span class="hljs-comment"># Create LazyFrames from the result &amp; perform a cumulative sum over each column</span>
    <span class="hljs-comment"># Once again, the expressive Polars Lazy API keeps the code clean &amp; readable.</span>
    ein_corr_a.append(pl.LazyFrame(ein_corr[<span class="hljs-number">0</span>].T).with_columns(pl.all().cum_sum()))
    ein_corr_b.append(pl.LazyFrame(ein_corr[<span class="hljs-number">1</span>].T).with_columns(pl.all().cum_sum()))

<span class="hljs-comment"># Just like above, all the smaller LazyFrame batches are concatenated into a unified</span>
<span class="hljs-comment"># LazyFrame at the end. The queued transformations still do not run until collect().</span>
ein_corr_frame1 = pl.concat(ein_corr_a, how=<span class="hljs-string">"vertical"</span>)
ein_corr_frame2 = pl.concat(ein_corr_b, how=<span class="hljs-string">"vertical"</span>)
</code></pre>
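<p>Batching the Einstein summation is safe because <code>"ij,jkl-&gt;ikl"</code> only contracts over <code>j</code>; every slice along the last axis is computed independently. The sketch below checks this on small, made-up shapes (the covariance values and array sizes are hypothetical, not the project's data).</p>

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical small shapes: 2 correlated series, 4 paths, 10 time steps
covariance = np.array([[1.0, 0.3], [0.3, 1.0]])
operands = rng.normal(size=(2, 4, 10))

# Full computation in one go
full = np.einsum("ij,jkl->ikl", covariance, operands)

# Same computation in batches of 3 time steps along the last axis
batch_size = 3
batched = np.concatenate(
    [
        np.einsum("ij,jkl->ikl", covariance, operands[:, :, start:start + batch_size])
        for start in range(0, operands.shape[2], batch_size)
    ],
    axis=2,
)
```

<p>The two results should match up to floating-point rounding, while the batched version only ever needs one slice of the operands in memory at a time.</p>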
<h2 id="heading-building-data-transformation-query-graph">Building data transformation Query graph</h2>
<ol>
<li>Computing GBM paths</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># Get column names from standard_brownian for explicit transformations</span>
col_names = ein_corr_frame1.collect_schema().names()

ein_corr_frame1 = (
    ein_corr_frame1
    <span class="hljs-comment"># Step 1: Calculate drift term + volatility term for each column</span>
    .with_columns(
        [
            (
                sigma_s * pl.col(col_name) + (mu_s - <span class="hljs-number">0.5</span> * sigma_s**<span class="hljs-number">2</span>) * pl.col(<span class="hljs-string">"real_number_time"</span>)
            ).alias(<span class="hljs-string">f"<span class="hljs-subst">{col_name}</span>_gbm"</span>)
            <span class="hljs-keyword">for</span> col_name <span class="hljs-keyword">in</span> col_names
        ]
    )
    <span class="hljs-comment"># Step 2: Apply exponential function and scale by s0</span>
    .with_columns(
        [
            (pl.col(<span class="hljs-string">f"<span class="hljs-subst">{col_name}</span>_gbm"</span>).exp() * s0).alias(<span class="hljs-string">f"Path_<span class="hljs-subst">{i + <span class="hljs-number">1</span>}</span>"</span>)
            <span class="hljs-keyword">for</span> i, col_name <span class="hljs-keyword">in</span> enumerate(col_names)
        ]
    )
    <span class="hljs-comment"># Step 3: Keep only the final sales path columns</span>
    .select([<span class="hljs-string">f"Path_<span class="hljs-subst">{i + <span class="hljs-number">1</span>}</span>"</span> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(col_names))])
)
</code></pre>
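<p>For reference, the expression built above follows the standard closed-form GBM solution, <code>s0 * exp((mu - sigma^2 / 2) * t + sigma * W_t)</code>. Here is a scalar sketch of the same formula in plain Python; the function name <code>gbm_value</code> is hypothetical, and the parameter values are the ones from the earlier Pandas snippet.</p>

```python
import math


def gbm_value(s0, mu, sigma, t, w_t):
    """Closed-form GBM value: s0 * exp((mu - sigma^2 / 2) * t + sigma * w_t)."""
    drift = (mu - 0.5 * sigma**2) * t
    diffusion = sigma * w_t
    return s0 * math.exp(drift + diffusion)


# Parameter values taken from the earlier Pandas snippet
sigma_s = 0.26
mu_s = 0.05
s0 = 391555

# At t = 0 with no accumulated noise, the path starts at s0
start_value = gbm_value(s0, mu_s, sigma_s, t=0.0, w_t=0.0)

# With zero volatility, only the deterministic drift remains
no_noise_after_one_year = gbm_value(s0, mu_s, sigma=0.0, t=1.0, w_t=0.0)
```

<p>In the LazyFrame version, <code>pl.col(col_name)</code> plays the role of <code>w_t</code> and <code>real_number_time</code> plays the role of <code>t</code>.</p>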
<ol start="2">
<li>Compute the mean of the GBM paths &amp; use Dynamic Scaling to fix the infinite-number issue</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># Get column names from standard_brownian for explicit transformations</span>
col_names = ein_corr_frame1.collect_schema().names()

ein_corr_frame1 = (
    ein_corr_frame1.with_row_index(<span class="hljs-string">"_idx"</span>, <span class="hljs-number">1</span>)
    <span class="hljs-comment"># select row no present in year_end_location_time</span>
    .filter(
        pl.col(<span class="hljs-string">"_idx"</span>).is_in(
            time_step.select(<span class="hljs-string">"year_end_location_time"</span>).drop_nulls().collect().to_series()
        )
    )
    .drop(<span class="hljs-string">"_idx"</span>, strict=<span class="hljs-literal">False</span>)
    <span class="hljs-comment"># Handle infinities by replacing them with None</span>
    .with_columns(
        [
            pl.when(pl.col(c).is_infinite()).then(<span class="hljs-literal">None</span>).otherwise(pl.col(c)).alias(c)
            <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> col_names
        ]
    )  <span class="hljs-comment"># Calculate max value per row for Dynamic Scaling ops</span>
    .with_columns(max_value=pl.max_horizontal(pl.all()))
    <span class="hljs-comment"># Performing dynamic scale mean by scaling down --&gt; mean --&gt; scale up</span>
    .with_columns(
        Sales=pl.mean_horizontal(  <span class="hljs-comment"># taking mean of entire row</span>
            <span class="hljs-comment"># dividing all cols by max value (scaled down)</span>
            pl.all().exclude(<span class="hljs-string">"max_value"</span>) / pl.col(<span class="hljs-string">"max_value"</span>)
        )  <span class="hljs-comment"># multiplying mean scaled down with max value (scale up)</span>
        * pl.col(<span class="hljs-string">"max_value"</span>)
    )  <span class="hljs-comment"># We only need 'Sales' column</span>
    .select(<span class="hljs-string">"Sales"</span>)
)
</code></pre>
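<p>The scale down, mean, scale up trick can be seen in isolation with plain floats. This is a sketch under the assumption that the infinities come from the row-wise sum overflowing; the function names and sample values are hypothetical, and it uses Python's float64 rather than the pipeline's Float32 (whose limit is far lower, about 3.4e38).</p>

```python
import math


def naive_mean(values):
    """Plain mean; the intermediate sum can overflow to infinity."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


def scaled_mean(values):
    """Scale down by the maximum, average, then scale back up."""
    max_value = max(values)
    scaled = [value / max_value for value in values]
    return (sum(scaled) / len(scaled)) * max_value


# Hypothetical values close to the float64 limit (about 1.8e308)
row = [1.5e308, 1.2e308, 0.9e308]

print(naive_mean(row))   # inf, because the sum exceeds the float64 range
print(scaled_mean(row))  # roughly 1.2e308
```

<p>After scaling, every value lies at or below 1, so the sum stays small and only the final multiplication brings the result back to its original magnitude.</p>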
<ol start="3">
<li>Materializing the entire query plan into final output</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># Till now we have performed many heavy maths based calculations, but everything is defred</span>
<span class="hljs-comment"># running collect() will materialize all the queries into final dataframe</span>
ein_corr_frame1.collect()
</code></pre>
<p><strong>Key Points</strong>:</p>
<ul>
<li><p>Polars' expressive Lazy API is very powerful, clean &amp; highly readable.</p>
</li>
<li><p>From the above, you can see that I simply keep adding steps to the query plan without actually running them yet.</p>
</li>
<li><p>This gives Polars’ engine many opportunities to optimize the query.</p>
</li>
<li><p>Until we run <code>LazyFrame.collect()</code>, all the queries are deferred &amp; nothing is stored in memory.</p>
</li>
</ul>
<h1 id="heading-the-results">The Results</h1>
<ul>
<li><p>The entire Monte Carlo simulation for 100 years now completes within 20-25 GB of memory, whereas the Pandas version kept failing even with 64 GB.</p>
</li>
<li><p>An added benefit was a shorter total run time, because Polars is built in Rust &amp; internally uses parallel processing to run queries.</p>
</li>
<li><p>The Expressive API of Polars library is very powerful &amp; intuitive, especially for Data engineers coming from SQL world.</p>
</li>
<li><p>This experience underscores the importance of choosing the right tools and approaches in data engineering to achieve optimal performance and efficiency.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[How to effectively work with Databases in Python]]></title><description><![CDATA[Introduction
The age-old debate on the use of Raw SQL v/s ORM is still very much alive in today’s world. Let’s see some of the comparing points




AspectRaw SQLORM (Object-Relational Mapping)



Ease of UseRequires knowledge of SQL syntax and databa...]]></description><link>https://importidea.dev/how-to-effectively-work-with-databases-in-python</link><guid isPermaLink="true">https://importidea.dev/how-to-effectively-work-with-databases-in-python</guid><category><![CDATA[Python]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[orm]]></category><category><![CDATA[Benchmark]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Thu, 26 Dec 2024 11:09:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735211279674/f5e7f911-e0c6-4698-b231-da8d35d899e3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>The age-old debate of <code>Raw SQL</code> vs. <code>ORM</code> is still very much alive in today’s world. Let’s compare them on a few points.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Aspect</td><td>Raw SQL</td><td>ORM (Object-Relational Mapping)</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Ease of Use</strong></td><td>Requires knowledge of SQL syntax and database-specific features.</td><td>Abstracts SQL into Python objects, making it easier for those familiar with Python.</td></tr>
<tr>
<td><strong>Flexibility</strong></td><td>Offers complete control over the SQL queries and database interactions.</td><td>Limited by the ORM's capabilities, but can be extended with raw SQL if needed.</td></tr>
<tr>
<td><strong>Performance</strong></td><td>Can be optimized for performance by writing efficient SQL queries.</td><td>May introduce overhead due to abstraction, but often optimized for common use cases.</td></tr>
<tr>
<td><strong>Portability</strong></td><td>Tied to specific SQL dialects, making it less portable across databases.</td><td>Generally more portable as it abstracts database-specific details.</td></tr>
<tr>
<td><strong>Learning Curve</strong></td><td>Steeper learning curve for those unfamiliar with SQL.</td><td>Easier for Python developers, but it requires learning the ORM's API.</td></tr>
<tr>
<td><strong>Maintenance</strong></td><td>Can be harder to maintain due to verbose SQL code.</td><td>Easier to maintain as changes in the database schema can be managed through models.</td></tr>
<tr>
<td><strong>Security</strong></td><td>Prone to SQL injection if not handled properly.</td><td>Provides built-in protection against SQL injection through parameterized queries.</td></tr>
</tbody>
</table>
</div><p>A seasoned database expert might claim that an ORM or a programming language isn't necessary for working with a database. However, in practice, this approach has significant downsides, such as:</p>
<ul>
<li><p>For any complex problem, you might end up with multiple nested <code>SELECT</code> queries, which can be really tricky for others to debug and understand.</p>
</li>
<li><p><code>SQL</code> has a fixed set of <code>keywords</code>, so you have to work within those limits. This means missing out on all the amazing possibilities that a full-stack programming language like <code>Python</code> offers.</p>
</li>
<li><p>You can't build a <code>data pipeline</code> using just <code>SQL</code>.</p>
</li>
</ul>
<p>Therefore, the use of <code>SQL</code> should generally be limited to performing data analytics rather than developing data applications.</p>
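<p>The Security row in the comparison above is worth a quick illustration. The following is a minimal sketch, assuming the standard library's <code>sqlite3</code> as a stand-in for Postgres and a made-up two-row <code>ticker_history</code> table. It contrasts SQL built by string formatting with a parameterized query.</p>

```python
import sqlite3

# In-memory SQLite database used purely for illustration (assumption: the
# article itself targets Postgres; the table and rows here are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticker_history (ticker TEXT, close REAL)")
conn.executemany(
    "INSERT INTO ticker_history VALUES (?, ?)",
    [("INFY", 1500.0), ("TCS", 3500.0)],
)

user_input = "INFY' OR '1'='1"  # a classic injection attempt

# Unsafe: the input is pasted into the SQL text, so the OR clause is executed.
unsafe_sql = f"SELECT ticker FROM ticker_history WHERE ticker = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the driver binds the value, so it is compared as a plain string.
safe_rows = conn.execute(
    "SELECT ticker FROM ticker_history WHERE ticker = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

<p>The string-built query leaks every row, while the parameterized one matches nothing. ORMs and query builders typically take the second route for you.</p>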
<h1 id="heading-the-world-of-orms">The World of ORMs</h1>
<p>Python, being a popular programming language, offers many ORM options. Let’s see them in action.</p>
<h2 id="heading-tech-stack">Tech stack</h2>
<ul>
<li><p>For this article, I am using a <code>Postgres</code> database &amp; <code>Python 3.13</code>.</p>
</li>
<li><p>Typically, an ORM transforms SQL data into Python <code>data structures</code>, which are then organized into either a <code>dict</code>, a <code>tuple</code> (a <code>list</code> of <code>tuples</code>), or even a <code>namedtuple</code> (a <code>list</code> of <code>namedtuples</code>).</p>
</li>
<li><p>These basic data structures may not simplify handling real-world problems.</p>
</li>
<li><p>I prefer using a <code>Dataframe</code> as the data structure to store the data. A <code>Dataframe</code> closely resembles a <code>Database Table</code>.</p>
</li>
<li><p>You can choose between <code>Pandas</code> or <a target="_blank" href="https://docs.pola.rs/api/python/stable/reference/api/polars.read_database.html"><code>Polars</code></a>. Personally, I use <code>Polars</code>. I won't dive into why I prefer <code>Polars</code> over <code>Pandas</code> here, as that's a topic for another time. But trust me, switching to <code>Polars</code> can really make your life easier!</p>
</li>
</ul>
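<p>To make the point about basic data structures concrete, here is a minimal sketch of the three row shapes mentioned above. It assumes the standard library's <code>sqlite3</code> in place of Postgres so that it runs anywhere, and the table and values are hypothetical.</p>

```python
import sqlite3
from collections import namedtuple

# In-memory SQLite stands in for Postgres here (an assumption for portability);
# the table name and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticker_history (ticker TEXT, close REAL)")
conn.execute("INSERT INTO ticker_history VALUES ('INFY', 1500.0)")

cursor = conn.execute("SELECT ticker, close FROM ticker_history")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# 1. A list of tuples: compact, but fields are addressed by position only.
as_tuples = rows

# 2. A list of dicts: fields are addressed by column name.
as_dicts = [dict(zip(columns, row)) for row in rows]

# 3. A list of namedtuples: attribute access with tuple semantics.
Row = namedtuple("Row", columns)
as_namedtuples = [Row(*row) for row in rows]

print(as_tuples[0][0], as_dicts[0]["ticker"], as_namedtuples[0].ticker)
```

<p>All three work row by row. Column-wise operations such as filtering or aggregating still have to be written by hand, which is where a <code>Dataframe</code> helps.</p>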
<h2 id="heading-benchmark">Benchmark</h2>
<p>The environment that I use includes:</p>
<ul>
<li><p><code>Postgres 17</code></p>
</li>
<li><p><code>Python 3.13</code></p>
</li>
<li><p>A table with <code>1,977,823</code> rows</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735194861431/4617483a-8710-4811-be51-e2107dabd446.png" alt /></p>
<ul>
<li>This is the query</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    *
<span class="hljs-keyword">FROM</span>
    FACTOR_INVESTING.TICKER_HISTORY
<span class="hljs-keyword">WHERE</span>
    TICKER <span class="hljs-keyword">IN</span> (<span class="hljs-string">'INFY'</span>, <span class="hljs-string">'TCS'</span>)
    <span class="hljs-keyword">AND</span> <span class="hljs-built_in">DATE</span> <span class="hljs-keyword">BETWEEN</span> <span class="hljs-keyword">CAST</span>(<span class="hljs-string">'2010-01-01'</span> <span class="hljs-keyword">AS</span> <span class="hljs-built_in">DATE</span>) <span class="hljs-keyword">AND</span> <span class="hljs-keyword">CAST</span>(<span class="hljs-string">'2024-01-01'</span> <span class="hljs-keyword">AS</span> <span class="hljs-built_in">DATE</span>)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span>
    <span class="hljs-built_in">DATE</span> <span class="hljs-keyword">DESC</span> <span class="hljs-keyword">NULLS</span> <span class="hljs-keyword">LAST</span>
</code></pre>
<ul>
<li>The benchmark query returns <code>6910</code> rows.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735210271542/d6be6e5b-40d7-4de7-a7a7-2639b72a23c4.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-1-sqlalchemy-polars">1. SQLAlchemy + Polars</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> date
<span class="hljs-keyword">from</span> about_time <span class="hljs-keyword">import</span> about_time
<span class="hljs-keyword">from</span> alive_progress <span class="hljs-keyword">import</span> alive_it
<span class="hljs-keyword">from</span> sqlalchemy.orm <span class="hljs-keyword">import</span> declarative_base, mapped_column, Mapped, Session
<span class="hljs-keyword">from</span> sqlalchemy <span class="hljs-keyword">import</span> String, Date, DOUBLE_PRECISION, select, desc, create_engine
<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl


engine = create_engine(<span class="hljs-string">'postgresql+psycopg://akash:0330@localhost/playground'</span>)
Base = declarative_base()

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TickerHistory</span>(<span class="hljs-params">Base</span>):</span>
    __tablename__ = <span class="hljs-string">"ticker_history"</span>
    __table_args__ = {<span class="hljs-string">"schema"</span>: <span class="hljs-string">"factor_investing"</span>, <span class="hljs-string">"extend_existing"</span>: <span class="hljs-literal">True</span>}

    date: Mapped[date] = mapped_column(Date)
    ticker: Mapped[str] = mapped_column(String)
    key: Mapped[str] = mapped_column(String, primary_key=<span class="hljs-literal">True</span>, nullable=<span class="hljs-literal">False</span>)
    open: Mapped[float] = mapped_column(DOUBLE_PRECISION)
    high: Mapped[float] = mapped_column(DOUBLE_PRECISION)
    low: Mapped[float] = mapped_column(DOUBLE_PRECISION)
    close: Mapped[float] = mapped_column(DOUBLE_PRECISION)

query = (
    select(TickerHistory)
    .where(TickerHistory.ticker.in_([<span class="hljs-string">"INFY"</span>, <span class="hljs-string">"TCS"</span>]))
    .where(TickerHistory.date.between(date(<span class="hljs-number">2010</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>), date(<span class="hljs-number">2024</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>)))
    .order_by(desc(TickerHistory.date))
)

<span class="hljs-keyword">with</span> about_time() <span class="hljs-keyword">as</span> t:
    <span class="hljs-keyword">with</span> Session(engine) <span class="hljs-keyword">as</span> session:
        <span class="hljs-comment"># Running the same query 100 times</span>
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> alive_it(range(<span class="hljs-number">100</span>)):
            <span class="hljs-comment"># Directly reading query in polars dataframe</span>
            df = pl.read_database(query, session)

print(<span class="hljs-string">f"Total time taken: <span class="hljs-subst">{t.duration_human}</span>"</span>)
</code></pre>
<p><strong>The result</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735195793554/a94cfbe8-4321-4abc-b6f7-5783cf398e5a.png" alt class="image--center mx-auto" /></p>
<p>So that’s <code>44.41</code> seconds for 100 queries &amp; a throughput of <code>2.25</code> queries per second.</p>
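<p>The benchmark scripts in this article rely on <code>about_time</code> for timing. For reference, the same measurement can be sketched with only the standard library. The names <code>benchmark</code> and <code>fake_query</code> are hypothetical, and the dummy task merely stands in for "run the query and load it into a Dataframe".</p>

```python
import time


def benchmark(task, runs=100):
    """Run `task` repeatedly; return (total_seconds, runs_per_second)."""
    start = time.perf_counter()
    for _ in range(runs):
        task()
    total = time.perf_counter() - start
    return total, runs / total


# Hypothetical stand-in for "run the query and load it into a DataFrame".
def fake_query():
    sum(range(10_000))


total, throughput = benchmark(fake_query, runs=100)
print(f"Total time taken: {total:.2f}s, throughput: {throughput:.2f} queries/s")
```

<p>Throughput is simply the number of runs divided by the total time, so 100 queries in 44.41 seconds works out to roughly 2.25 queries per second.</p>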
<h3 id="heading-2-pony-orm-peewee-orm-sqlmodel-orm">2. Pony ORM, Peewee ORM, SQLModel ORM</h3>
<p>Pony ORM and Peewee ORM currently do not support <code>Psycopg3</code> and still require <code>Psycopg2</code>. The Psycopg authors recommend using <code>Psycopg3</code> going forward, as mentioned <a target="_blank" href="https://www.psycopg.org/features/">here</a>. Because of this, I decided to skip both of these ORMs.</p>
<p><a target="_blank" href="https://sqlmodel.tiangolo.com/"><code>SQLModel</code></a> is itself built on SQLAlchemy, so we won't see any significant difference in results.</p>
<p><strong>But this is not the end of the tunnel. There are other potentially good options too. Let’s check them out</strong></p>
<blockquote>
<p>You may have already got the feeling that I prefer an ORM-based solution over Raw SQL. But theoretically, Raw SQL query execution should have an edge over ORM execution.</p>
<p>To get the best of both worlds, I use the <a target="_blank" href="https://sqlglot.com/sqlglot.html"><code>Sqlglot Library</code></a>. It allows you to build the query programmatically as well as iteratively, which helps avoid issues like typos/spelling errors, SQL Injection, etc. It supports 24 dialects.</p>
</blockquote>
<h3 id="heading-3-psycopg3-sqlglot-polars">3. Psycopg3 + Sqlglot + Polars</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> date
<span class="hljs-keyword">from</span> about_time <span class="hljs-keyword">import</span> about_time
<span class="hljs-keyword">from</span> alive_progress <span class="hljs-keyword">import</span> alive_it
<span class="hljs-keyword">from</span> psycopg <span class="hljs-keyword">import</span> connect
<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">from</span> sqlglot <span class="hljs-keyword">import</span> select, condition, Dialects

query = (
    select(<span class="hljs-string">"*"</span>)
    .from_(<span class="hljs-string">"factor_investing.ticker_history"</span>)
    .where(condition(<span class="hljs-string">"ticker"</span>).isin(<span class="hljs-string">"INFY"</span>, <span class="hljs-string">"TCS"</span>))
    .where(condition(<span class="hljs-string">"date"</span>).between(date(<span class="hljs-number">2010</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>), date(<span class="hljs-number">2024</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>)))
    .order_by(<span class="hljs-string">"date DESC"</span>)
    .sql(Dialects.POSTGRES)
)

conn = connect(
    host=<span class="hljs-string">"localhost"</span>, port=<span class="hljs-number">5432</span>, dbname=<span class="hljs-string">"playground"</span>, user=<span class="hljs-string">"akash"</span>, password=<span class="hljs-string">"0330"</span>
)

<span class="hljs-keyword">with</span> about_time() <span class="hljs-keyword">as</span> t:
    print(<span class="hljs-string">"Starting benchmark..."</span>)
    <span class="hljs-keyword">with</span> conn.cursor() <span class="hljs-keyword">as</span> cursor:
        <span class="hljs-comment"># Running the same query 100 times</span>
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> alive_it(range(<span class="hljs-number">100</span>)):
            <span class="hljs-comment"># Directly reading query in polars dataframe</span>
            df = pl.read_database(query, cursor)

print(<span class="hljs-string">f"Total time taken: <span class="hljs-subst">{t.duration_human}</span>"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735205781463/42940da0-313e-4838-8974-1fcb1a1fe459.png" alt class="image--center mx-auto" /></p>
<p>So that’s <code>20.04</code> seconds for 100 queries &amp; a throughput of <code>5</code> queries per second. This is an improvement over SQLAlchemy.</p>
<h3 id="heading-4-adbc-sqlglot-polars">4. ADBC + Sqlglot + Polars</h3>
<p>A <code>Polars dataframe</code> is backed by <code>Arrow</code> memory &amp; interoperates with <code>PyArrow</code>. Using <a target="_blank" href="https://arrow.apache.org/adbc/current/index.html"><code>ADBC</code></a> (Arrow Database Connectivity) lets us benefit from the <code>zero copy</code> concept, since Polars doesn't need to convert the Arrow table data returned by ADBC drivers.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> adbc_driver_postgresql.dbapi <span class="hljs-keyword">import</span> connect
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> date
<span class="hljs-keyword">from</span> about_time <span class="hljs-keyword">import</span> about_time
<span class="hljs-keyword">from</span> alive_progress <span class="hljs-keyword">import</span> alive_it
<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">from</span> sqlglot <span class="hljs-keyword">import</span> select, condition, Dialects

query = (
    select(<span class="hljs-string">"*"</span>)
    .from_(<span class="hljs-string">"factor_investing.ticker_history"</span>)
    .where(condition(<span class="hljs-string">"ticker"</span>).isin(<span class="hljs-string">"INFY"</span>, <span class="hljs-string">"TCS"</span>))
    .where(condition(<span class="hljs-string">"date"</span>).between(date(<span class="hljs-number">2010</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>), date(<span class="hljs-number">2024</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>)))
    .order_by(<span class="hljs-string">"date DESC"</span>)
    .sql(Dialects.POSTGRES)
)

uri = <span class="hljs-string">"postgresql://akash:0330@localhost/playground"</span>

<span class="hljs-keyword">with</span> about_time() <span class="hljs-keyword">as</span> t:
    <span class="hljs-keyword">with</span> connect(uri) <span class="hljs-keyword">as</span> conn:
        <span class="hljs-comment"># Running the same query 100 times</span>
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> alive_it(range(<span class="hljs-number">100</span>)):
            <span class="hljs-comment"># Directly reading query in polars dataframe</span>
            df = pl.read_database(query, conn)

print(<span class="hljs-string">f"Total time taken: <span class="hljs-subst">{t.duration_human}</span>"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735207255748/9b01ea20-4fb0-4c40-ae2a-1eaecab33359.png" alt class="image--center mx-auto" /></p>
<p>So that’s <code>17.94</code> seconds for 100 queries &amp; a throughput of <code>5.68</code> queries per second. This is an improvement over both <code>SQLAlchemy</code> &amp; <code>Psycopg</code>.</p>
<h3 id="heading-5-connectorx-sqlglot-polars">5. ConnectorX + Sqlglot + Polars</h3>
<p><a target="_blank" href="https://sfu-db.github.io/connector-x/intro.html#"><code>ConnectorX</code></a> is yet another SQL driver/library that is doing the rounds. It is written in <code>Rust</code>. Let’s see it in action. BTW, it uses a slightly different approach: there is no connection or cursor concept here.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> date
<span class="hljs-keyword">from</span> about_time <span class="hljs-keyword">import</span> about_time
<span class="hljs-keyword">from</span> alive_progress <span class="hljs-keyword">import</span> alive_it
<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">from</span> sqlglot <span class="hljs-keyword">import</span> select, condition, Dialects

query = (
    select(<span class="hljs-string">"*"</span>)
    .from_(<span class="hljs-string">"factor_investing.ticker_history"</span>)
    .where(condition(<span class="hljs-string">"ticker"</span>).isin(<span class="hljs-string">"INFY"</span>, <span class="hljs-string">"TCS"</span>))
    .where(condition(<span class="hljs-string">"date"</span>).between(date(<span class="hljs-number">2010</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>), date(<span class="hljs-number">2024</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>)))
    .order_by(<span class="hljs-string">"date DESC"</span>)
    .sql(Dialects.POSTGRES)
)

uri = <span class="hljs-string">"postgresql://akash:0330@localhost/playground"</span>
<span class="hljs-keyword">with</span> about_time() <span class="hljs-keyword">as</span> t:
    <span class="hljs-comment"># Running the same query 100 times</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> alive_it(range(<span class="hljs-number">100</span>)):
        <span class="hljs-comment"># Directly reading query in polars dataframe</span>
        df = pl.read_database_uri(query, uri, engine=<span class="hljs-string">"connectorx"</span>)

print(<span class="hljs-string">f"Total time taken: <span class="hljs-subst">{t.duration_human}</span>"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735209004847/6766a3ab-aa61-4729-a969-5241d1d4bdf7.png" alt class="image--center mx-auto" /></p>
<p>So that’s <code>1:22.4</code> (about 82 seconds) for 100 queries &amp; a throughput of <code>1.21</code> queries per second. This is the worst of the lot.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<ul>
<li><p>The thought that Raw SQL can give you better results holds true. But at the same time, using it directly in Python is never a good idea.</p>
</li>
<li><p>Sqlglot fits this scenario nicely. It allows us to use a Raw SQL query &amp; combine it with a Dataframe.</p>
</li>
<li><p>In terms of benchmarks, the combination of ADBC + Sqlglot + Polars is the winner because of the tight Arrow Integration. So this should be your first pick.</p>
</li>
<li><p>But at the same time sticking with Psycopg is also not a bad idea.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Streamlining Your Databricks Environment Setup]]></title><description><![CDATA[I'm pretty sure that if you're using Databricks to run your PySpark job, these might be your typical steps:

Design and develop business logic.

A notebook that performs all the business logic.

Running that notebook using Databricks Workflow.


This...]]></description><link>https://importidea.dev/streamlining-your-databricks-environment-setup</link><guid isPermaLink="true">https://importidea.dev/streamlining-your-databricks-environment-setup</guid><category><![CDATA[Databricks]]></category><category><![CDATA[express-idea]]></category><category><![CDATA[setup]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Thu, 28 Nov 2024 07:38:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732779497950/c44b1596-b530-46a8-b62e-2149f10d8cb9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I'm pretty sure that if you're using <code>Databricks</code> to run your <code>PySpark job</code>, these might be your typical steps:</p>
<ul>
<li><p>Design and develop business logic.</p>
</li>
<li><p>A notebook that performs all the business logic.</p>
</li>
<li><p>Running that notebook using <a target="_blank" href="https://docs.databricks.com/en/jobs/index.html">Databricks Workflow</a>.</p>
</li>
</ul>
<p>This is easier said than done. The logistics of running these notebooks are one of the biggest headaches. I am sure you won’t be running just one or two notebooks, and setting up the environment correctly for every one of them can be cumbersome. It becomes even tougher if you are using proprietary/private libraries.</p>
<p>The following is the system I came up with, which I find to be the most practical solution:</p>
<ol>
<li><p><strong>Step 1:</strong> Universal environment setup notebook.</p>
<ul>
<li><p>Create a universal environment setup notebook in your repo.</p>
</li>
<li><p>This notebook can be placed in the root of your repo so that all other notebooks can easily access it.</p>
</li>
</ul>
</li>
<li><p><strong>Step 2:</strong> Secrets, Environment variables, Constants, etc</p>
<ul>
<li><p>Secrets, variables, and constants change depending on the environment in which the notebook is running, so they must be set accordingly.</p>
</li>
<li><p>We use <code>dbutils.secrets</code> to fetch all the secrets. Let’s see it in action.</p>
<pre><code class="lang-python">  <span class="hljs-comment"># Getting all secrets</span>
  <span class="hljs-comment"># Note - It is important to make sure that the scope is appropriately mapped to</span>
  <span class="hljs-comment"># the secret store assigned to environment-specific workspace</span>
  PAT = dbutils.secrets.get(<span class="hljs-string">"&lt;your_secret_scope&gt;"</span>, <span class="hljs-string">"&lt;pat_secret_name&gt;"</span>)
  SECRET_1 = dbutils.secrets.get(<span class="hljs-string">"&lt;your_secret_scope&gt;"</span>, <span class="hljs-string">"&lt;secret_1_name&gt;"</span>)
  SECRET_2 = dbutils.secrets.get(<span class="hljs-string">"&lt;your_secret_scope&gt;"</span>, <span class="hljs-string">"&lt;secret_2_name&gt;"</span>)
  SECRET_3 = dbutils.secrets.get(<span class="hljs-string">"&lt;your_secret_scope&gt;"</span>, <span class="hljs-string">"&lt;secret_3_name&gt;"</span>)
</code></pre>
<pre><code class="lang-python">  <span class="hljs-comment"># Environment specific constants</span>
  databricks_host = spark.conf.get(<span class="hljs-string">"spark.databricks.workspaceUrl"</span>)

  <span class="hljs-keyword">if</span> databricks_host == <span class="hljs-string">"&lt;your dev workspace host url&gt;"</span>:
      environment = <span class="hljs-string">"dev"</span>
      CONSTANT_1 = <span class="hljs-string">"dev constant 1"</span>
      CONSTANT_2 = <span class="hljs-string">"dev constant 2"</span>
      volume_path = <span class="hljs-string">"/Volumes/dev/path"</span>
      catalog_name = <span class="hljs-string">"dev_catalog"</span>
  <span class="hljs-keyword">elif</span> databricks_host == <span class="hljs-string">"&lt;your uat workspace host url&gt;"</span>:
      environment = <span class="hljs-string">"uat"</span>
      CONSTANT_1 = <span class="hljs-string">"uat constant 1"</span>
      CONSTANT_2 = <span class="hljs-string">"uat constant 2"</span>
      volume_path = <span class="hljs-string">"/Volumes/uat/path"</span>
      catalog_name = <span class="hljs-string">"uat_catalog"</span>
  <span class="hljs-keyword">elif</span> databricks_host == <span class="hljs-string">"&lt;your prod workspace host url&gt;"</span>:
      environment = <span class="hljs-string">"prod"</span>
      CONSTANT_1 = <span class="hljs-string">"prod constant 1"</span>
      CONSTANT_2 = <span class="hljs-string">"prod constant 2"</span>
      volume_path = <span class="hljs-string">"/Volumes/prod/path"</span>
      catalog_name = <span class="hljs-string">"prod_catalog"</span>
  <span class="hljs-keyword">else</span>:
      <span class="hljs-keyword">raise</span> NameError(<span class="hljs-string">"Incorrect databricks workspace"</span>)
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Step 3:</strong> Passing all variables to the downstream notebook. There are a couple of ways to pass these variables to the downstream notebook so that it can access environment-specific values. Here are two methods I use:</p>
<ul>
<li><p>DBFS JSON file: Save all the values in a JSON file at the <code>dbfs/tmp/</code> location. This way, the file is deleted once the notebook job is done. I used this method for a while, but Databricks recently changed the permission rules, and now only admins can access it.</p>
</li>
<li><p>Temp View: Using a <code>TEMP VIEW</code> is even better than a file. It doesn't require admin permission. So, create a <code>TEMP VIEW</code> that includes all these variables.</p>
</li>
</ul>
</li>
</ol>
<ul>
<li><pre><code class="lang-python">  <span class="hljs-comment"># Creating dict of all variables</span>

  env_vars = {
      <span class="hljs-string">"secret_1"</span>: SECRET_1,
      <span class="hljs-string">"secret_2"</span>: SECRET_2,
      <span class="hljs-string">"secret_3"</span>: SECRET_3,
      <span class="hljs-string">"constant_1"</span>: CONSTANT_1,
      <span class="hljs-string">"constant_2"</span>: CONSTANT_2,
      <span class="hljs-string">"catalog_name"</span>: catalog_name
  }

  <span class="hljs-comment"># Write environment variables to a TEMP VIEW</span>
  spark.createDataFrame([env_vars]).createOrReplaceTempView(<span class="hljs-string">"env_vars"</span>)
</code></pre>
</li>
</ul>
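<p>As the number of environments grows, the host-to-environment branching in Step 2 can also be expressed as a lookup table. This is only a sketch of that alternative: the host names, the <code>ENVIRONMENTS</code> dict, and the <code>resolve_environment</code> helper are all hypothetical.</p>

```python
# Hypothetical workspace hosts and values; replace them with your own.
ENVIRONMENTS = {
    "dev.example.cloud.databricks.com": {
        "environment": "dev",
        "volume_path": "/Volumes/dev/path",
        "catalog_name": "dev_catalog",
    },
    "uat.example.cloud.databricks.com": {
        "environment": "uat",
        "volume_path": "/Volumes/uat/path",
        "catalog_name": "uat_catalog",
    },
    "prod.example.cloud.databricks.com": {
        "environment": "prod",
        "volume_path": "/Volumes/prod/path",
        "catalog_name": "prod_catalog",
    },
}


def resolve_environment(host):
    """Return the settings for a workspace host, or fail loudly."""
    try:
        return ENVIRONMENTS[host]
    except KeyError:
        raise ValueError(f"Incorrect databricks workspace: {host}") from None


# In a notebook, the host would come from
# spark.conf.get("spark.databricks.workspaceUrl").
settings = resolve_environment("dev.example.cloud.databricks.com")
print(settings["environment"], settings["catalog_name"])
```

<p>Adding a new environment then means adding one dictionary entry, and an unknown workspace still fails loudly instead of silently picking a default.</p>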
<ol start="4">
<li><p><strong>Step 4: Installing libraries, especially private</strong></p>
<ul>
<li><p>You should definitely package your codebase as a <code>Python Library</code>. Then install &amp; use it just like any other <code>open source</code> library. This way you won’t have to worry about <code>Path</code> or <code>Relative/Absolute Import Error</code>, etc.</p>
</li>
<li><p>You need separate packages for <code>dev/qa</code> (which can be experimental) &amp; <code>uat/prod</code> (which must be stable).</p>
</li>
<li><p>Follow <a target="_blank" href="https://peps.python.org/pep-0440/#developmental-releases">PEP 440</a> guidelines to version your code.</p>
<ul>
<li><p>Use an <code>X.Y.devN</code> version for the package published from the develop branch.</p>
</li>
<li><p>Use an <code>X.Y.N</code> version for the package published from the master branch.</p>
</li>
</ul>
</li>
<li><p>Use <code>%pip</code> to install environment-specific private library.</p>
<pre><code class="lang-python">  <span class="hljs-keyword">if</span> environment == <span class="hljs-string">"dev"</span> <span class="hljs-keyword">or</span> environment == <span class="hljs-string">"qa"</span>:
      <span class="hljs-comment"># --pre flag will install package having 'dev' label</span>
      <span class="hljs-comment"># NOTE - use your repository URL appropriately</span>
      %pip install --extra-index-url https://databricks-user:{PAT}@pkgs.dev.azure.com/dp-us-bus/_packaging/dp-pip/pypi/simple/ --pre --upgrade <span class="hljs-string">"&lt;your package name&gt;"</span>

  <span class="hljs-keyword">if</span> environment == <span class="hljs-string">"uat"</span> <span class="hljs-keyword">or</span> environment == <span class="hljs-string">"prod"</span>:
      %pip install --extra-index-url https://databricks-user:{PAT}@pkgs.dev.azure.com/dp-us-bus/_packaging/dp-pip/pypi/simple/ --upgrade <span class="hljs-string">"&lt;your package name&gt;"</span>

  <span class="hljs-comment"># Note - Databricks recommends restarting Python to make sure we'll </span>
  <span class="hljs-comment"># be using libraries that were just installed</span>
  dbutils.library.restartPython()
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Step 5:</strong> Running the setup notebook in the downstream job notebook</p>
<ul>
<li><p>Use Databricks’ magic <code>%run</code> command.</p>
</li>
<li><p>Read the <code>TEMP VIEW</code> &amp; update values in the os environment</p>
<pre><code class="lang-python">  <span class="hljs-comment"># Running the notebook. Make sure to use the correct relative path</span>
  <span class="hljs-comment"># NOTE - %run must be in a cell by itself</span>
  %run ../../databricks_environment_setup

  <span class="hljs-comment"># Read environment variables from the TEMP VIEW &amp; set the environment variables</span>
  <span class="hljs-comment"># for use in this notebook</span>
  <span class="hljs-keyword">import</span> os

  os.environ.update(spark.table(<span class="hljs-string">"env_vars"</span>).first().asDict())
</code></pre>
</li>
</ul>
</li>
</ol>
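<p>One detail in Step 5 is easy to trip over: <code>os.environ</code> only accepts string values. The sketch below uses a plain dict as a hypothetical stand-in for <code>spark.table("env_vars").first().asDict()</code> and converts every value before updating the environment. The <code>retries</code> key is made up to show a non-string value.</p>

```python
import os

# Hypothetical stand-in for spark.table("env_vars").first().asDict().
env_vars = {
    "catalog_name": "dev_catalog",
    "constant_1": "dev constant 1",
    "retries": 3,  # a non-string value, added here only for illustration
}

# os.environ only accepts strings, so convert every value before updating.
os.environ.update({key: str(value) for key, value in env_vars.items()})

print(os.environ["catalog_name"], os.environ["retries"])
```

<p>If every value in the <code>TEMP VIEW</code> is already a string, as in the setup notebook above, the conversion is a harmless no-op.</p>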
]]></content:encoded></item><item><title><![CDATA[How to Integrate Auth0 SSO with FastAPI]]></title><description><![CDATA[If you want to integrate Auth0 SSO (or any social login) for all your authentication and authorization needs, you're in the right place! Let's get started together!
Here are the steps

Step 1: Use FastAPI-Auth0 python library.

Step 2: Create Auth0 F...]]></description><link>https://importidea.dev/how-to-integrate-auth0-sso-with-fastapi</link><guid isPermaLink="true">https://importidea.dev/how-to-integrate-auth0-sso-with-fastapi</guid><category><![CDATA[express-idea]]></category><category><![CDATA[FastAPI]]></category><category><![CDATA[Auth0]]></category><category><![CDATA[SSO]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Wed, 27 Nov 2024 07:01:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732690875828/9290a72f-c0f0-469a-b3da-ee7152aad3dd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you want to integrate <code>Auth0 SSO</code> (or any social login) for all your authentication and authorization needs, you're in the right place! Let's get started together!</p>
<p><strong>Here are the steps</strong></p>
<ol>
<li><p><strong>Step 1:</strong> Use the <a target="_blank" href="https://github.com/dorinclisu/fastapi-auth0">FastAPI-Auth0</a> Python library.</p>
</li>
<li><p><strong>Step 2:</strong> Create Auth0 FastAPI Security dependency</p>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Any, Iterator

<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> Depends, FastAPI, Security
<span class="hljs-keyword">from</span> fastapi_auth0.auth <span class="hljs-keyword">import</span> Auth0
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> RootModel


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RootUser</span>(<span class="hljs-params">RootModel</span>):</span>
    root: dict[str, Any]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__iter__</span>(<span class="hljs-params">self</span>) -&gt; Iterator[str]:</span>
        <span class="hljs-keyword">return</span> iter(self.root)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__getitem__</span>(<span class="hljs-params">self, item</span>) -&gt; Any:</span>
        <span class="hljs-keyword">return</span> self.root[item]

auth = Auth0(
    domain=<span class="hljs-string">"&lt;Your auth0 domain&gt;"</span>,
    api_audience=<span class="hljs-string">"&lt;Your auth0 api audience&gt;"</span>,
    auth0user_model=RootUser
) <span class="hljs-comment"># Note - You can use customized pydantic user model too</span>
</code></pre>
<ol start="3">
<li><strong>Step 3:</strong> Secure the endpoint using a FastAPI dependency</li>
</ol>
<pre><code class="lang-python">app = FastAPI()

<span class="hljs-comment"># NOTE - I am using `auth.implicit_scheme` to implement Auth0 SSO</span>
<span class="hljs-meta">@app.get("/protected", dependencies=[Depends(auth.implicit_scheme)])</span>
<span class="hljs-comment"># NOTE -`auth.get_user` performs the login &amp; returns user identity </span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_secure</span>(<span class="hljs-params">user: RootUser = Security(<span class="hljs-params">auth.get_user</span>)</span>):</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"message"</span>: <span class="hljs-string">f"<span class="hljs-subst">{user}</span>"</span>}
</code></pre>
<p>    <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732690017380/631ff9d4-730d-4b59-8ac1-f5ff2d8a7855.png" alt class="image--center mx-auto" /></p>
<p>    <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732690094712/42841d13-f09b-4f67-9cfc-f59a32de1d5b.png" alt class="image--center mx-auto" /></p>
<ol start="4">
<li><strong>Step 4:</strong> (Optional) If you want the client ID to be auto-filled, modify the code accordingly</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Any, Iterator

<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> Depends, FastAPI, Security
<span class="hljs-keyword">from</span> fastapi_auth0.auth <span class="hljs-keyword">import</span> Auth0
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> RootModel


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RootUser</span>(<span class="hljs-params">RootModel</span>):</span>
    root: dict[str, Any]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__iter__</span>(<span class="hljs-params">self</span>) -&gt; Iterator[str]:</span>
        <span class="hljs-keyword">return</span> iter(self.root)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__getitem__</span>(<span class="hljs-params">self, item</span>) -&gt; Any:</span>
        <span class="hljs-keyword">return</span> self.root[item]


auth = Auth0(
    domain=<span class="hljs-string">"&lt;Your auth0 domain&gt;"</span>,
    api_audience=<span class="hljs-string">"&lt;Your auth0 api audience&gt;"</span>,
    auth0user_model=RootUser
) <span class="hljs-comment"># Note - You can use a customized pydantic user model too</span>
app = FastAPI(swagger_ui_init_oauth={<span class="hljs-string">"clientId"</span>: <span class="hljs-string">"&lt;your client id&gt;"</span>})

<span class="hljs-meta">@app.get("/protected", dependencies=[Depends(auth.implicit_scheme)])</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_secure</span>(<span class="hljs-params">user: RootUser = Security(<span class="hljs-params">auth.get_user</span>)</span>):</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"message"</span>: <span class="hljs-string">f"<span class="hljs-subst">{user}</span>"</span>}
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732690590226/5bdbe27f-3d45-45a8-b65d-4c38ab7dc79a.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[How I leveraged FastAPI's dependency injection to reduce code by more than 30%]]></title><description><![CDATA[This article expects that you know what is Dependency Injection in the FastAPI world & basic syntax. If you haven't yet explored or used it, I recommend following this guide by the original author of FastAPI.
They are incredibly useful when you need ...]]></description><link>https://importidea.dev/how-i-leveraged-fastapis-dependency-injection-to-reduce-code-by-more-than-30</link><guid isPermaLink="true">https://importidea.dev/how-i-leveraged-fastapis-dependency-injection-to-reduce-code-by-more-than-30</guid><category><![CDATA[FastAPI]]></category><category><![CDATA[optimization]]></category><category><![CDATA[dependency injection]]></category><category><![CDATA[REST API]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Tue, 19 Nov 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732470209898/8873b76e-1255-451d-9de7-e1dd3e2e219e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This article assumes you know what <code>Dependency Injection</code> is in the FastAPI world and are familiar with its basic syntax. If you haven't explored or used it yet, I recommend following this <a target="_blank" href="https://fastapi.tiangolo.com/tutorial/dependencies/">guide</a> by the original author of FastAPI.</p>
<p>Dependencies are incredibly useful when you need to:</p>
<ul>
<li><p>Share logic effortlessly (use the same code logic over and over again).</p>
</li>
<li><p>Seamlessly share database connections.</p>
</li>
<li><p>Enforce security, authentication, role requirements, and more with ease.</p>
</li>
<li><p>And so many other amazing things...</p>
</li>
</ul>
<p>All this, while keeping code repetition to a minimum!</p>
<p>I started out using <code>Dependency Injection</code> only for authentication, like this:</p>
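<pre><code class="lang-python"><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Annotated

<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> Security
<span class="hljs-keyword">from</span> fastapi.security <span class="hljs-keyword">import</span> APIKeyHeader
<span class="hljs-keyword">from</span> fastapi_auth0.auth <span class="hljs-keyword">import</span> Auth0

<span class="hljs-comment"># env_config and auth0_api_response_to_user are project-local helpers (definitions not shown)</span>
<span class="hljs-keyword">from</span> .models <span class="hljs-keyword">import</span> (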
    Auth0APIUser,
    RootAuth0User,
)

<span class="hljs-comment"># Security flow to authenticate user using Password flow</span>
header_username = APIKeyHeader(
    name=<span class="hljs-string">"username"</span>, scheme_name=<span class="hljs-string">"username"</span>, auto_error=<span class="hljs-literal">False</span>
)
header_password = APIKeyHeader(
    name=<span class="hljs-string">"password"</span>, scheme_name=<span class="hljs-string">"password"</span>, auto_error=<span class="hljs-literal">False</span>
)

<span class="hljs-comment"># Security flow to authenticate user using Auth0</span>
auth0_authentication = Auth0(
    domain=env_config.auth0_domain,
    api_audience=env_config.auth0_api_audience,
    auth0user_model=RootAuth0User,  <span class="hljs-comment"># type: ignore</span>
    auto_error=<span class="hljs-literal">False</span>,
)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">user_authentication_security</span>(<span class="hljs-params">
    user: Annotated[RootAuth0User, Security(<span class="hljs-params">auth0_authentication.get_user</span>)],
</span>) -&gt; Auth0APIUser:</span>
    <span class="hljs-string">"""FastAPI security dependency that authenticates the user via Auth0"""</span>
    <span class="hljs-keyword">return</span> auth0_api_response_to_user(user.model_dump())
</code></pre>
<p>Here is how it is applied in an API endpoint function:</p>
<pre><code class="lang-python"><span class="hljs-meta">@app.get("/schema/{schema}/tables/{table}")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_get_table_data</span>(<span class="hljs-params">
    schema: str,
    table: str,
    user: Annotated[Auth0APIUser, Depends(<span class="hljs-params">user_authentication_security</span>)]
</span>) -&gt; dict:</span>
    <span class="hljs-comment"># check whether the user has access to the schema</span>
    schema_access = <span class="hljs-keyword">await</span> check_schema_access(schema, user)
    <span class="hljs-keyword">if</span> schema_access <span class="hljs-keyword">is</span> <span class="hljs-literal">False</span>:
        logger.warning(
            <span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> is not authorized to access <span class="hljs-subst">{schema}</span>"</span>,
            extra=logger_properties,
        )
        <span class="hljs-keyword">raise</span> HTTPException(
            status.HTTP_403_FORBIDDEN,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> is not authorized to access <span class="hljs-subst">{schema}</span>"</span>,
        )

    <span class="hljs-comment"># check whether the user has access to the table</span>
    table_access = <span class="hljs-keyword">await</span> check_table_access(schema, table, user)
    <span class="hljs-keyword">if</span> table_access <span class="hljs-keyword">is</span> <span class="hljs-literal">False</span>:
        logger.warning(
            <span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> is not authorized to access <span class="hljs-subst">{table}</span> under <span class="hljs-subst">{schema}</span>"</span>,
            extra=logger_properties,
        )
        <span class="hljs-keyword">raise</span> HTTPException(
            status.HTTP_403_FORBIDDEN,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> is not authorized to access <span class="hljs-subst">{table}</span> under <span class="hljs-subst">{schema}</span>"</span>,
        )

    <span class="hljs-comment"># check whether the requested dataset (the schema path parameter) is present</span>
    current_dataset = <span class="hljs-keyword">await</span> datasetDetailsV2.find_one(
        datasetDetailsV2.dataset == schema
    )
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> current_dataset:
        <span class="hljs-keyword">raise</span> HTTPException(
            status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"dataset: <span class="hljs-subst">{schema}</span> not found"</span>,
        )
    <span class="hljs-comment"># check whether the requested table is present in the dataset</span>
    complete_table_name = <span class="hljs-string">f"<span class="hljs-subst">{current_dataset.destination_metadata.unity_catalog}</span>.<span class="hljs-subst">{current_dataset.dataset}</span>.<span class="hljs-subst">{table}</span>"</span>
    <span class="hljs-keyword">if</span> complete_table_name <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> current_dataset.delta_share_compatible_table:
        <span class="hljs-keyword">raise</span> HTTPException(
            status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"table: <span class="hljs-subst">{table}</span> not found in dataset: <span class="hljs-subst">{schema}</span>"</span>,
        )

    <span class="hljs-comment"># The endpoint's core logic starts here</span>
    ...
</code></pre>
<p>The code above is practically shouting, "<strong>I need some help!</strong> 😫". There are several issues:</p>
<ul>
<li><p>The function contains a lot of code besides its core logic. This violates the <strong>Single Responsibility Principle (SRP)</strong> of the <strong>SOLID principles</strong>.</p>
</li>
<li><p>If I need to check if a <code>Table</code> is present or if a <code>User</code> has the right access in another endpoint function, I will end up having to copy and paste the same logic there too.</p>
</li>
<li><p>What if I need to change the <strong>authentication</strong> and <strong>authorization</strong> logic in the future? I would have to edit every endpoint that embeds it, and testing becomes a lot more complicated too.</p>
</li>
<li><p>Also, long functions just look too ugly 🤮</p>
</li>
</ul>
<p>We can improve this by moving authentication, authorization, verification, and all other non-core logic out of the API endpoint function and into separate dependency functions. We can even nest dependency functions to comply with Don’t Repeat Yourself (DRY).</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> .models <span class="hljs-keyword">import</span> (
    Auth0APIUser,
    RootAuth0User,
)

<span class="hljs-comment"># Security flow to authenticate user using Password flow</span>
header_username = APIKeyHeader(
    name=<span class="hljs-string">"username"</span>, scheme_name=<span class="hljs-string">"username"</span>, auto_error=<span class="hljs-literal">False</span>
)
header_password = APIKeyHeader(
    name=<span class="hljs-string">"password"</span>, scheme_name=<span class="hljs-string">"password"</span>, auto_error=<span class="hljs-literal">False</span>
)

<span class="hljs-comment"># Security flow to authenticate user using Auth0</span>
auth0_authentication = Auth0(
    domain=env_config.auth0_domain,
    api_audience=env_config.auth0_api_audience,
    auth0user_model=RootAuth0User,  <span class="hljs-comment"># type: ignore</span>
    auto_error=<span class="hljs-literal">False</span>,
)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">user_authentication_security</span>(<span class="hljs-params">
    user: Annotated[RootAuth0User, Security(<span class="hljs-params">auth0_authentication.get_user</span>)],
</span>) -&gt; Auth0APIUser:</span>
    <span class="hljs-string">"""FastAPI security dependency that authenticates the user via Auth0"""</span>
    <span class="hljs-keyword">return</span> auth0_api_response_to_user(user.model_dump())

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dataset_table_availability_verification</span>(<span class="hljs-params">dataset: str, table: str</span>):</span>
    <span class="hljs-string">"""FastAPI dependency to check that the requested dataset and table are present"""</span>
    <span class="hljs-comment"># checking if requested dataset is present</span>
    current_dataset = <span class="hljs-keyword">await</span> datasetDetailsV2.find_one(
        datasetDetailsV2.dataset == dataset
    )
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> current_dataset:
        <span class="hljs-keyword">raise</span> HTTPException(
            status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"dataset: <span class="hljs-subst">{dataset}</span> not found"</span>,
        )
    <span class="hljs-comment"># checking if requested table is present in dataset</span>
    complete_table_name = <span class="hljs-string">f"<span class="hljs-subst">{current_dataset.destination_metadata.unity_catalog}</span>.<span class="hljs-subst">{current_dataset.dataset}</span>.<span class="hljs-subst">{table}</span>"</span>
    <span class="hljs-keyword">if</span> complete_table_name <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> current_dataset.delta_share_compatible_table:
        <span class="hljs-keyword">raise</span> HTTPException(
            status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"table: <span class="hljs-subst">{table}</span> not found in dataset: <span class="hljs-subst">{dataset}</span>"</span>,
        )
    <span class="hljs-keyword">return</span> current_dataset

<span class="hljs-comment"># 1st Level dependency</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">role_verification</span>(<span class="hljs-params">
    user: Annotated[Auth0APIUser, Depends(<span class="hljs-params">get_user</span>)], claim: str
</span>) -&gt; userInfoV2:</span>
    <span class="hljs-comment"># getting current user by matching auth0 api user id</span>
    current_user = <span class="hljs-keyword">await</span> userInfoV2.find_one(userInfoV2.user_id == user.user_id)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> current_user:
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_404_NOT_FOUND, detail=<span class="hljs-string">"user not found"</span>
        )
    <span class="hljs-comment"># checking if requested role is present</span>
    user_rbac = RBACPolicyClient(user_policy=current_user.model_dump())
    <span class="hljs-keyword">try</span>:
        user_rbac.enforce_claim(policy_level=PolicyLevel.role, claim=claim)
    <span class="hljs-keyword">except</span> RBACPolicyViolation <span class="hljs-keyword">as</span> e:
        logger.warning(<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> not approved for <span class="hljs-subst">{claim}</span>"</span>)
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> don't have role: <span class="hljs-subst">{claim}</span>"</span>,
        ) <span class="hljs-keyword">from</span> e
    <span class="hljs-keyword">return</span> current_user

<span class="hljs-comment"># 1st Level dependency</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scope_verification</span>(<span class="hljs-params">
    user: Annotated[Auth0APIUser, Depends(<span class="hljs-params">get_user</span>)], claim: str
</span>) -&gt; userInfoV2:</span>
    <span class="hljs-comment"># getting current user by matching auth0 api user id</span>
    current_user = <span class="hljs-keyword">await</span> userInfoV2.find_one(userInfoV2.user_id == user.user_id)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> current_user:
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_404_NOT_FOUND, detail=<span class="hljs-string">"user not found"</span>
        )
    user_rbac = RBACPolicyClient(user_policy=current_user.model_dump())
    <span class="hljs-keyword">try</span>:
        user_rbac.enforce_claim(policy_level=PolicyLevel.scope, claim=claim)
    <span class="hljs-keyword">except</span> RBACPolicyViolation <span class="hljs-keyword">as</span> e:
        logger.warning(<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> not approved for <span class="hljs-subst">{claim}</span>"</span>)
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> don't have scope: <span class="hljs-subst">{claim}</span>"</span>,
        ) <span class="hljs-keyword">from</span> e
    <span class="hljs-keyword">return</span> current_user

<span class="hljs-comment"># 2nd Level dependency</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_table_scope_dependency</span>(<span class="hljs-params">
    user: Annotated[Auth0APIUser, Depends(<span class="hljs-params">get_user</span>)],
    dataset: str,
    table: str,
</span>) -&gt; userInfoV2:</span>
    scope_condition = <span class="hljs-string">f"<span class="hljs-subst">{dataset}</span>:<span class="hljs-subst">{table}</span>::write"</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> scope_verification(user=user, claim=scope_condition)

<span class="hljs-comment"># 2nd Level dependency</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_table_scope_dependency</span>(<span class="hljs-params">
    user: Annotated[Auth0APIUser, Depends(<span class="hljs-params">get_user</span>)],
    dataset: str,
    table: str,
</span>) -&gt; userInfoV2:</span>
    scope_condition = <span class="hljs-string">f"<span class="hljs-subst">{dataset}</span>:<span class="hljs-subst">{table}</span>::read"</span>  <span class="hljs-comment"># the read scope is present by default</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> scope_verification(user=user, claim=scope_condition)

<span class="hljs-comment"># 2nd Level dependency</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">engineering_admin_role_dependency</span>(<span class="hljs-params">
    user: Annotated[Auth0APIUser, Depends(<span class="hljs-params">get_user</span>)],
</span>) -&gt; userInfoV2:</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> role_verification(user=user, claim=<span class="hljs-string">"management:engineering::admin"</span>)
</code></pre>
<p>Note that I've reused the <code>1st level dependency</code> in several <code>2nd level dependencies</code>, following the <strong>DRY</strong> principle. Up next, we'll see how it's applied in an API Endpoint function.</p>
<pre><code class="lang-python"><span class="hljs-comment"># EG 1. Authorization is applied as a dependency rather than inside the function body</span>
<span class="hljs-meta">@app.get("/user", dependencies=[Depends(engineering_admin_role_dependency)])</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_list_users</span>() -&gt; list[userInfoV2]:</span>
    <span class="hljs-string">"""Get the list of all users along with their access"""</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> userInfoV2.find_all().to_list()

<span class="hljs-comment"># EG 2. Similar to the above, authorization is applied as a dependency</span>
<span class="hljs-meta">@app.patch(</span>
    <span class="hljs-string">"/dataset"</span>,
    status_code=status.HTTP_201_CREATED,
    dependencies=[Depends(engineering_admin_role_dependency)],
)
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_update_dataset</span>(<span class="hljs-params">details: DatasetInput</span>):</span>
    <span class="hljs-string">"""Update existing dataset into the system"""</span>

    <span class="hljs-comment"># checking if dataset is present</span>
    current_dataset = <span class="hljs-keyword">await</span> datasetDetailsV2.find_one(
        datasetDetailsV2.dataset == details.dataset
    )
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> current_dataset:
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">"dataset not present"</span>,
        )
    <span class="hljs-comment"># Core logic will come here</span>

<span class="hljs-comment"># EG 3. Using multiple dependencies to make the code cleaner</span>
<span class="hljs-meta">@router_v2.get(</span>
    <span class="hljs-string">"/dataset/{dataset}/table/{table}"</span>,
    dependencies=[Depends(read_table_scope_dependency)],
    response_class=UJSONResponse,
)
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_get_table_data</span>(<span class="hljs-params">
    dataset: str,
    table: str,
    current_dataset: Annotated[
        datasetDetailsV2, Depends(<span class="hljs-params">dataset_table_availability_verification</span>)
    ],
</span>) -&gt; list[dict]:</span>
    <span class="hljs-string">"""get delta table from desired dataset &amp; table"""</span>
    <span class="hljs-comment"># Core logic to read table data will come here</span>
</code></pre>
<p><strong>The key takeaways we can gather from the above are:</strong></p>
<ul>
<li><p>Keep things tidy by separating dependencies like auth and verification from the endpoint function. This makes the code look cleaner.</p>
</li>
<li><p>Use multilevel dependency functions to reuse code that repeats.</p>
</li>
<li><p>Testing becomes easier since we can create dedicated tests focusing on the endpoint function and other tests focusing on the dependency function.</p>
</li>
<li><p>Other endpoints with the same needs can easily reuse the same dependency injections.</p>
</li>
</ul>
<p>All of this helped cut down the lines of code, streamline the logic, and generally make the code easier to read.</p>
<p>But there's still room for improvement. Some logic doesn't really belong to the core logic. By building on the <code>Multilevel Dependency Injection</code>, we can further optimize it as follows:</p>
<pre><code class="lang-python"><span class="hljs-comment"># 1st Level dependency</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">user_presence_dependency</span>(<span class="hljs-params">
    request: Request,
    user: Annotated[Auth0APIUser, Depends(<span class="hljs-params">get_user</span>)],
</span>) -&gt; UserInfoV2:</span>
    current_user_df = pl.read_database(
        current_user_filter(user.user_id), request.app.state.cursor
    )
    <span class="hljs-keyword">if</span> current_user_df.is_empty():
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> not found"</span>,
        )
    <span class="hljs-keyword">return</span> UserInfoV2.from_polars(current_user_df)

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dataset_presence_dependency</span>(<span class="hljs-params">
    request: Request, dataset: str,
</span>) -&gt; DatasetInfoV2:</span>
    current_dataset_df = pl.read_database(
        current_dataset_filter(dataset), request.app.state.cursor
    )
    <span class="hljs-keyword">if</span> current_dataset_df.is_empty():
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{dataset}</span> not found"</span>,
        )
    <span class="hljs-keyword">return</span> DatasetInfoV2.from_polars(current_dataset_df)

<span class="hljs-comment"># 2nd Level dependency</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">table_presence_dependency</span>(<span class="hljs-params">
    dataset: Annotated[DatasetInfoV2, Depends(<span class="hljs-params">dataset_presence_dependency</span>)],
    table: str,
</span>) -&gt; str:</span>
    <span class="hljs-keyword">if</span> dataset.table <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">or</span> table <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> dataset.table:
        <span class="hljs-keyword">raise</span> HTTPException(
            status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"table: <span class="hljs-subst">{table}</span> not found in dataset: <span class="hljs-subst">{dataset.dataset}</span>"</span>,
        )
    <span class="hljs-keyword">return</span> table

<span class="hljs-comment"># 3rd Level dependency</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_table_scope_dependency</span>(<span class="hljs-params">
    user: Annotated[UserInfoV2, Depends(<span class="hljs-params">user_presence_dependency</span>)],
    dataset: str,
    table: str,
</span>) -&gt; UserInfoV2:</span>
    scope_condition = <span class="hljs-string">f"<span class="hljs-subst">{dataset}</span>:<span class="hljs-subst">{table}</span>::read"</span>  <span class="hljs-comment"># the read scope is present by default</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> scope_verification(user=user, claim=scope_condition)


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">engineering_admin_role_dependency</span>(<span class="hljs-params">
    user: Annotated[UserInfoV2, Depends(<span class="hljs-params">user_presence_dependency</span>)],
</span>) -&gt; UserInfoV2:</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> role_verification(user=user, claim=FixedString.ENGINEERING_ADMIN.value)
</code></pre>
<p>Here is how it is applied in the API endpoint functions:</p>
<pre><code class="lang-python"><span class="hljs-meta">@router_v2.patch(</span>
    <span class="hljs-string">"/dataset/{dataset}"</span>,
    status_code=status.HTTP_201_CREATED,
    dependencies=[Depends(engineering_admin_role_dependency)],
)
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_update_dataset</span>(<span class="hljs-params">
    dataset: Annotated[DatasetInfoV2, Depends(<span class="hljs-params">dataset_presence_dependency</span>)],
</span>) -&gt; dict[str, str]:</span>
    <span class="hljs-string">"""Update the existing dataset details"""</span>
    <span class="hljs-comment"># Core logic will come here</span>

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_read_data</span>(<span class="hljs-params">
    dataset: Annotated[DatasetInfoV2, Depends(<span class="hljs-params">dataset_presence_dependency</span>)],
    table: Annotated[str, Depends(<span class="hljs-params">table_presence_dependency</span>)],
    filter_query: Annotated[FilterQueryParams, Query(<span class="hljs-params"></span>)],
    request: Request,
</span>) -&gt; ORJSONResponse:</span>
    <span class="hljs-string">"""Get data from desired dataset &amp; table"""</span>
    <span class="hljs-comment"># Core logic will come here</span>
</code></pre>
<p><strong>The key takeaways we can gather from the above are:</strong></p>
<ul>
<li><p>The code now looks even cleaner. The API endpoint function now includes only the code related to itself and nothing else.</p>
</li>
<li><p>You might think that adding more dependency functions means adding more lines of code, but these are reusable dependencies. When you have many endpoints with the same requirements, the benefits really add up!</p>
</li>
</ul>
<p><strong>In this article I used only 2-3 endpoints as examples, but my actual project has upwards of 30. Here is the impact:</strong></p>
<ul>
<li><p>The endpoints are grouped by category. Ideally, all the endpoints in a category share common requirements such as the authentication method. I declared the security dependency once at the API Router definition, so the endpoint functions attached to that router don’t need to declare it again.</p>
</li>
<li><p>Even for the endpoints which have additional requirements, I can just use the respective dependency.</p>
</li>
<li><p>That's why, overall, I was able to reduce the code in my project by more than 30%.</p>
</li>
<li><p>Apart from reducing code, dependencies also make endpoints with path or query parameters a lot more readable: anyone can now see what the requirements are for a specific parameter.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[How to Create an Effective Enterprise Data Strategy: Part 2]]></title><description><![CDATA[TLDRThis article discusses the creation of an effective enterprise data strategy, focusing on building a data sharing application using Databricks Lakehouse and FastAPI. It covers the system backend architecture, API endpoints, authentication, and au...]]></description><link>https://importidea.dev/how-to-create-an-effective-enterprise-data-strategy-part-2</link><guid isPermaLink="true">https://importidea.dev/how-to-create-an-effective-enterprise-data-strategy-part-2</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[FastAPI]]></category><category><![CDATA[Databricks]]></category><category><![CDATA[Python]]></category><category><![CDATA[spark]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Thu, 24 Oct 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732432430541/da7e5146-ca2d-42c3-88e9-d12b5d06cf33.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<details><summary>TLDR</summary><div data-type="detailsContent">This article discusses the creation of an effective enterprise data strategy, focusing on building a data sharing application using Databricks Lakehouse and FastAPI. It covers the system backend architecture, API endpoints, authentication, and authorization. The article also explores different phases of interacting with Databricks Lakehouse, including using Databricks Delta Sharing, open-source tools like Polars, and Databricks Connect, highlighting the pros and cons of each approach. The conclusion emphasizes the importance of flexibility, modularity, and open-source solutions in developing a scalable and efficient data strategy.</div></details>

<h2 id="heading-introduction">Introduction</h2>
<p><strong>Recap of Part 1</strong></p>
<p>In <a target="_blank" href="https://hashnode.com/post/cm3lib3r8000009l7giey5b7l">Part 1</a>, we discussed why having a strong enterprise data strategy is important and the architectural choices we made to fit our needs.</p>
<p>A big part of an enterprise data strategy is <strong>data sharing</strong>. In this article, we'll dive into how we built our <strong>data sharing</strong> application, the tools and technology we used, and the valuable lessons we picked up along the way.</p>
<p><strong>The need</strong></p>
<p>Before we jump in, here are the most important requirements we were aiming to meet:</p>
<ul>
<li><p>We chose the <code>Databricks Lakehouse</code> platform for data storage. It's based on the <code>Spark</code> ecosystem, using <code>Spark SQL</code> or <code>PySpark</code> for data interaction. However, most users and tools don't primarily use these methods.</p>
</li>
<li><p>Most of our users are financial analysts who have long used <code>MS Excel</code> for their workflows. We aimed to create a solution that securely provides them with authorized data access.</p>
</li>
</ul>
<h2 id="heading-system-backend-architecture">System Backend Architecture</h2>
<p>To select the backend architecture, we took inspiration from how two services communicate with each other: <code>APIs</code>. We explored several <code>API-driven architectures</code> such as <code>REST API</code>, <code>GraphQL</code>, and <code>gRPC</code>, but ultimately decided to go ahead with <code>REST API</code> due to its <strong>widespread compatibility</strong> with tools (including Excel 😍).</p>
<p>We used <code>FastAPI</code> as the <code>REST API</code> framework.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">I am a very firm believer in principles like <code>DRY (Don't Repeat Yourself)</code> and <code>DIRFT (Do It Right the First Time)</code>. Establishing the project's structure early is crucial, as it keeps the API design simple while leaving room to add new features later.</div>
</div>

<h3 id="heading-project-structure"><strong>Project Structure</strong></h3>
<pre><code class="lang-plaintext">├── data_sharing/
│   ├── dependency/
│   │   ├── auth.py
│   │   └── verification.py
│   ├── routers/
│   │   ├── admin.py
│   │   ├── users.py
│   │   └── table.py
│   ├── models.py
│   └── config.py
└── main.py
</code></pre>
<ul>
<li><p>We designed the structure of the project keeping <strong>modularity</strong> as a priority. Not everything is crammed into <code>main.py</code>. Follow this <a target="_blank" href="https://fastapi.tiangolo.com/tutorial/bigger-applications/#path-operations-with-apirouter">doc</a> for more info.</p>
</li>
<li><p>Every router has its own module, and so does every dependency.</p>
</li>
<li><p>We also have a dedicated <code>models</code> module to declare all sorts of models: <code>Pydantic</code>, <code>Enum</code>, <code>Dataclass</code>, etc. Similarly, a <code>config</code> module holds the environment configuration, which makes the codebase truly environment-agnostic: moving between environments doesn't require changing a single line of code.</p>
</li>
</ul>
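<p>To make the environment-agnostic <code>config</code> idea concrete, here is a minimal sketch of what such a module could look like. It is not our actual implementation: the class name <code>EnvConfig</code>, the function <code>load_config</code>, and the environment variable names are all assumptions made for illustration. Only the standard library is used.</p>
<pre><code class="lang-python"># data_sharing/config.py (illustrative sketch; every name below is an assumption)
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Settings resolved from environment variables, with local defaults."""

    environment: str
    page_size: int
    auth0_domain: str


def load_config(environ=None):
    """Build the config from a mapping (defaults to os.environ)."""
    env = os.environ if environ is None else environ
    return EnvConfig(
        environment=env.get("APP_ENV", "local"),
        page_size=int(env.get("PAGE_SIZE", "1000")),
        auth0_domain=env.get("AUTH0_DOMAIN", ""),
    )


# Imported by the rest of the codebase; values change per environment, code does not
env_config = load_config()
</code></pre>
<p>Because every value comes from the environment, the same code can run locally, in staging, and in production by changing only the deployed variables.</p>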
<h3 id="heading-api-endpoints"><strong>API Endpoints</strong></h3>
<p>These are some of the important endpoints:</p>
<ol>
<li><p><code>GET /api/v1/users</code> : To get a list of all user details</p>
</li>
<li><p><code>GET /api/v1/users/{user_id}</code>: To get details of specific <code>user_id</code>.</p>
</li>
<li><p><code>PUT /api/v1/users/{user_id}/rbac</code>: Override specific <code>user_id</code> <code>RBAC</code> with provided values</p>
</li>
<li><p><code>PATCH /api/v1/users/{user_id}/rbac?policy-level</code>: Change the specific <code>RBAC</code> policy level of the given <code>user_id</code>.</p>
</li>
<li><p><code>DELETE /api/v1/users/{user_id}</code>: Delete the given <code>user_id</code> from the system.</p>
</li>
<li><p><code>GET /api/v1/dataset</code>: To get a list of all dataset details.</p>
</li>
<li><p><code>GET /api/v1/dataset/{dataset}</code>: To get the details of the given <code>dataset</code></p>
</li>
<li><p><code>GET /api/v1/dataset/{dataset}/table</code>: To get the list of all tables in the given <code>dataset</code></p>
</li>
<li><p><code>GET /api/v1/dataset/{dataset}/table/{table}</code>: To get the <code>table</code> data residing in the given <code>dataset</code></p>
</li>
<li><p><code>POST /api/v1/dataset/{dataset}/table/{table}</code>: To write data as <code>payload</code> to the given <code>table</code> of the given <code>dataset</code></p>
</li>
<li><p><code>POST /api/v1/dataset/{dataset}/table/{table}/file</code>: To write data as <code>file upload</code> using <code>Form data</code> to the given <code>table</code> of the given <code>dataset</code></p>
</li>
</ol>
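<p>As a rough illustration of how a client could call the table read endpoint, the sketch below builds (but does not send) an authenticated request using only the standard library. The host name, the token value, and the helper <code>build_table_request</code> are hypothetical; the <code>page</code> query parameter is assumed from the pagination logic shown later in this article.</p>
<pre><code class="lang-python">import urllib.parse
import urllib.request

BASE_URL = "https://data-sharing.example.com"  # hypothetical host


def build_table_request(dataset, table, token, params=None):
    """Prepare (but do not send) a GET request for a table's data."""
    path = "/api/v1/dataset/{}/table/{}".format(
        urllib.parse.quote(dataset, safe=""),
        urllib.parse.quote(table, safe=""),
    )
    query = "?" + urllib.parse.urlencode(params) if params else ""
    return urllib.request.Request(
        BASE_URL + path + query,
        headers={"Authorization": "Bearer " + token},
        method="GET",
    )


request = build_table_request("finance", "ledger", "my-token", {"page": 2})
print(request.full_url)
# To actually send it: urllib.request.urlopen(request)
</code></pre>
<p>Any tool that can issue an HTTP request with a bearer token, Excel included, can consume the API in the same way.</p>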
<h3 id="heading-authentication"><strong>Authentication</strong></h3>
<p>The most important part of any authentication logic is getting the identity of the user. We use <code>Auth0</code> and the <code>SSO</code> it provides. Let’s see how to implement it in <code>FastAPI</code>.</p>
<ul>
<li><p>The biggest challenge that we had was to integrate <code>SSO login</code> with <code>Swagger Docs</code>. We started with the <a target="_blank" href="https://developer.auth0.com/resources/code-samples/api/fastapi/basic-role-based-access-control">official docs</a> provided by Auth0. This is more on the manual side for <code>U2M</code> OAuth flow.</p>
</li>
<li><p>As I said earlier, our goal was to use <code>SSO login</code> instead. So we switched to the <a target="_blank" href="https://github.com/dorinclisu/fastapi-auth0"><code>fastapi-auth0</code></a> Python library, which enabled us to use SSO login right in the Swagger docs.</p>
</li>
<li><p>To abstract all the complexity from the end user, we combined <code>Security</code> with the <code>Dependency Injection</code> of FastAPI.</p>
  <div data-node-type="callout">
  <div data-node-type="callout-emoji">💡</div>
  <div data-node-type="callout-text">FastAPI doesn’t really restrict us with Dependency Injection. We can get crazy creative with it. It’s one of the best features that FastAPI provides.</div>
  </div>


</li>
</ul>
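<pre><code class="lang-python"><span class="hljs-comment"># For auth we need a couple of models. This is written in data_sharing/models.py</span>

<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Any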

<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> (
    BaseModel,
    EmailStr,
    Field,
    RootModel
)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RootAuth0User</span>(<span class="hljs-params">RootModel</span>):</span>
    root: dict[str, Any]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__iter__</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-keyword">return</span> iter(self.root)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__getitem__</span>(<span class="hljs-params">self, item</span>):</span>
        <span class="hljs-keyword">return</span> self.root[item]

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Auth0APIUser</span>(<span class="hljs-params">BaseModel</span>):</span>
    user_id: str
    primary_email: EmailStr
    sub: str
    aud: str
    iat: datetime
    exp: datetime
    azp: str
</code></pre>
<pre><code class="lang-python"><span class="hljs-comment"># This SSO Auth code is written in data_sharing/dependency/auth.py</span>


<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Annotated

<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> Depends, HTTPException, Security, status
<span class="hljs-keyword">from</span> fastapi.security <span class="hljs-keyword">import</span> APIKeyHeader, HTTPBasic
<span class="hljs-keyword">from</span> fastapi_auth0.auth <span class="hljs-keyword">import</span> Auth0, JwksDict

<span class="hljs-keyword">from</span> data_sharing.models <span class="hljs-keyword">import</span> (
    Auth0APIUser,
    RootAuth0User,
)

<span class="hljs-comment"># Security flow to authenticate user using Auth0</span>
auth0_authentication = Auth0(
    domain=<span class="hljs-string">"&lt;your auth0 domain&gt;"</span>,
    api_audience=<span class="hljs-string">"&lt;your auth0 api audience&gt;"</span>,
    auth0user_model=RootAuth0User,  <span class="hljs-comment"># type: ignore</span>
    auto_error=<span class="hljs-literal">False</span>,
)

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_get_user_using_auth0_sso_flow</span>(<span class="hljs-params">
    user: Annotated[RootAuth0User, Security(<span class="hljs-params">auth0_authentication.implicit_scheme</span>)],
</span>) -&gt; RootAuth0User:</span>
    <span class="hljs-string">"""FastAPI Security injection to perform user auth using auth0 sso flow."""</span>
    <span class="hljs-keyword">return</span> user


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_get_user_using_auth0_implicit_flow</span>(<span class="hljs-params">
    user: Annotated[RootAuth0User, Security(<span class="hljs-params">auth0_authentication.get_user</span>)],
</span>) -&gt; RootAuth0User:</span>
    <span class="hljs-string">"""FastAPI Security injection to perform user auth using auth0 implicit (Bearer token) flow."""</span>
    <span class="hljs-keyword">return</span> user

<span class="hljs-comment"># After a successful SSO login &amp; code exchange with Auth0, we have to convert the result to the actual user</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_user</span>(<span class="hljs-params">
    sso_user: Annotated[RootAuth0User, Depends(<span class="hljs-params">_get_user_using_auth0_sso_flow</span>)],
    implicit_user: Annotated[
        RootAuth0User, Depends(<span class="hljs-params">_get_user_using_auth0_implicit_flow</span>)
    ],
</span>) -&gt; Auth0APIUser:</span>
    <span class="hljs-string">"""FastAPI Security injection to get the user based on the various auth flow"""</span>
    <span class="hljs-comment"># NOTE - keep adding all the auth mechanism here to get the user</span>
    <span class="hljs-keyword">if</span> sso_user:
        <span class="hljs-comment"># If the response doesn't follow the Auth0APIUser model, you may need</span>
        <span class="hljs-comment"># to write a small function to convert it, like below</span>
        <span class="hljs-comment"># return auth0_api_response_to_user(sso_user.model_dump())</span>
        <span class="hljs-keyword">return</span> Auth0APIUser(**sso_user.model_dump())
    <span class="hljs-keyword">elif</span> implicit_user:
        <span class="hljs-comment"># return auth0_api_response_to_user(implicit_user.model_dump())</span>
        <span class="hljs-keyword">return</span> Auth0APIUser(**implicit_user.model_dump())
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail=<span class="hljs-string">"No authorization token provided"</span>,
        )
</code></pre>
<h3 id="heading-authorization">Authorization</h3>
<ul>
<li><p>We cannot treat authorization as some afterthought. It’s equally important. As stated earlier we use <code>RBAC Strategy</code> for authorization.</p>
</li>
<li><p>Its implementation is again powered by <code>FastAPI Dependency</code>. (See, that’s why I say I love it ❤️)</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># All this authorization &amp; verification code is written in data_sharing/dependency/verification.py</span>

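<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Annotated

<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> Depends, HTTPException, Request, status

<span class="hljs-comment"># Project-specific names (UserInfo, RBACPolicyClient, get_user, logger, etc.) come from the project's own modules</span>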

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">role_verification</span>(<span class="hljs-params">user: UserInfo, claim: str</span>) -&gt; UserInfo:</span>
    <span class="hljs-comment"># checking if requested role is present</span>
    user_rbac = RBACPolicyClient(user_policy=user.model_dump())
    <span class="hljs-keyword">try</span>:
        user_rbac.enforce_claim(policy_level=PolicyLevel.role, claim=claim)
    <span class="hljs-keyword">except</span> RBACPolicyViolation <span class="hljs-keyword">as</span> e:
        logger.warning(
            <span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> not approved for <span class="hljs-subst">{claim}</span>"</span>
        )
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{decrypt_data(ENCRYPTION_KEY,user.primary_email)}</span> don't have role: <span class="hljs-subst">{claim}</span>"</span>,
        ) <span class="hljs-keyword">from</span> e
    <span class="hljs-keyword">return</span> user


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scope_verification</span>(<span class="hljs-params">user: UserInfo, claim: str</span>) -&gt; UserInfo:</span>
    user_rbac = RBACPolicyClient(user_policy=user.model_dump())
    <span class="hljs-keyword">try</span>:
        user_rbac.enforce_claim(policy_level=PolicyLevel.scope, claim=claim)
    <span class="hljs-keyword">except</span> RBACPolicyViolation <span class="hljs-keyword">as</span> e:
        logger.warning(
            <span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> not approved for <span class="hljs-subst">{claim}</span>"</span>
        )
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> don't have scope: <span class="hljs-subst">{claim}</span>"</span>,
        ) <span class="hljs-keyword">from</span> e
    <span class="hljs-keyword">return</span> user


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">permission_verification</span>(<span class="hljs-params">user: UserInfo, permission: str</span>) -&gt; UserInfo:</span>
    <span class="hljs-comment"># checking if requested permission is present</span>
    <span class="hljs-keyword">if</span> user.permission <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">and</span> permission <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> user.permission:
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail=<span class="hljs-string">f"user don't have permission: <span class="hljs-subst">{permission}</span>"</span>,
        )
    <span class="hljs-keyword">return</span> user


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">user_presence_dependency</span>(<span class="hljs-params">
    request: Request, user: Annotated[Auth0APIUser, Depends(<span class="hljs-params">get_user</span>)],
</span>) -&gt; UserInfo:</span>
    current_user_df = pl.read_database(
        current_user_filter(user.user_id), request.app.state.cursor
    )
    <span class="hljs-keyword">if</span> current_user_df.is_empty():
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{user.primary_email}</span> not found"</span>,
        )
    <span class="hljs-keyword">return</span> UserInfo.from_polars(current_user_df)


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dataset_presence_dependency</span>(<span class="hljs-params">request: Request, dataset: str
</span>) -&gt; DatasetInfo:</span>
    current_dataset_df = pl.read_database(
        current_dataset_filter(dataset), request.app.state.cursor
    )
    <span class="hljs-keyword">if</span> current_dataset_df.is_empty():
        <span class="hljs-keyword">raise</span> HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"<span class="hljs-subst">{dataset}</span> not found"</span>,
        )
    <span class="hljs-keyword">return</span> DatasetInfoV2.from_polars(current_dataset_df)


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">table_presence_dependency</span>(<span class="hljs-params">
    dataset: Annotated[DatasetInfo, Depends(<span class="hljs-params">dataset_presence_dependency</span>)], table: str
</span>) -&gt; str:</span>
    <span class="hljs-keyword">if</span> dataset.table <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">or</span> table <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> dataset.table:
        <span class="hljs-keyword">raise</span> HTTPException(
            status.HTTP_404_NOT_FOUND,
            detail=<span class="hljs-string">f"table: <span class="hljs-subst">{table}</span> not found in dataset: <span class="hljs-subst">{dataset.dataset}</span>"</span>,
        )
    <span class="hljs-keyword">return</span> table


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_table_scope_dependency</span>(<span class="hljs-params">
    user: Annotated[UserInfo, Depends(<span class="hljs-params">user_presence_dependency</span>)],
    dataset: str,
    table: str,
</span>) -&gt; UserInfo:</span>
    scope_condition = <span class="hljs-string">f"<span class="hljs-subst">{dataset}</span>:<span class="hljs-subst">{table}</span>::write"</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> scope_verification(user=user, claim=scope_condition)


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_table_scope_dependency</span>(<span class="hljs-params">
    user: Annotated[UserInfo, Depends(<span class="hljs-params">user_presence_dependency</span>)],
    dataset: str,
    table: str,
</span>) -&gt; UserInfo:</span>
    scope_condition = <span class="hljs-string">f"<span class="hljs-subst">{dataset}</span>:<span class="hljs-subst">{table}</span>::read"</span>  <span class="hljs-comment"># read scope will present by default</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> scope_verification(user=user, claim=scope_condition)


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">engineering_admin_role_dependency</span>(<span class="hljs-params">
    user: Annotated[UserInfo, Depends(<span class="hljs-params">user_presence_dependency</span>)],
</span>) -&gt; UserInfo:</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> role_verification(user=user, claim=FixedString.ENGINEERING_ADMIN.value)


<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dataset_owner_role_dependency</span>(<span class="hljs-params">
    user: Annotated[UserInfo, Depends(<span class="hljs-params">user_presence_dependency</span>)], dataset: str
</span>) -&gt; UserInfo:</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> role_verification(user=user, claim=<span class="hljs-string">f"<span class="hljs-subst">{dataset}</span>::dataset_owner"</span>)
</code></pre>
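<p>The dependencies above lean on <code>RBACPolicyClient.enforce_claim</code>, whose implementation is not shown in this article. The sketch below is only a guess at its general shape, written to illustrate how a claim string such as <code>finance:ledger::read</code> could be checked against a user's policy. The policy layout (a dict of lists keyed by policy level) and the sample values are assumptions, not our production code.</p>
<pre><code class="lang-python">from enum import Enum


class PolicyLevel(str, Enum):
    role = "role"
    scope = "scope"


class RBACPolicyViolation(Exception):
    """Raised when a user's policy does not contain the requested claim."""


class RBACPolicyClient:
    def __init__(self, user_policy):
        # Assumed layout: {"role": [...], "scope": [...]}
        self._policy = user_policy

    def enforce_claim(self, policy_level, claim):
        """Raise RBACPolicyViolation unless the claim is granted at this level."""
        granted = self._policy.get(policy_level.value) or []
        if claim not in granted:
            raise RBACPolicyViolation(
                "{} claim not granted: {}".format(policy_level.value, claim)
            )


# Hypothetical policy for one user
policy = {"role": ["finance::dataset_owner"], "scope": ["finance:ledger::read"]}
client = RBACPolicyClient(user_policy=policy)

# Passes silently because the scope is present
client.enforce_claim(policy_level=PolicyLevel.scope, claim="finance:ledger::read")
</code></pre>
<p>Raising a dedicated exception keeps the policy check independent of the web framework; the dependency functions are then the only place that translates a violation into an HTTP 403.</p>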
<h3 id="heading-routers-in-action">Routers in action</h3>
<p>Let's see a few examples of how we can combine all of the above to create <code>API Routers</code>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># All user router code is written in data_sharing/routers/users.py</span>

<span class="hljs-comment"># SECTION - User router</span>
router = APIRouter(prefix=<span class="hljs-string">"/api/v1"</span>, tags=[APITags.user])

<span class="hljs-meta">@router.get("/user/{user_id}")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_get_user</span>(<span class="hljs-params">
    user: Annotated[UserInfo, Depends(<span class="hljs-params">user_presence_dependency</span>)],
    user_id: str,
    request: Request,
</span>) -&gt; UserInfoV2:</span>
    <span class="hljs-string">"""Get the details of specified `user` along with their access"""</span>
    <span class="hljs-comment"># Your respective logic to get user details will come here</span>

<span class="hljs-meta">@router.put(</span>
    <span class="hljs-string">"/user/{user_id}/rbac"</span>,
    status_code=status.HTTP_201_CREATED,
    dependencies=[Depends(engineering_admin_role_dependency)],
)
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_replace_user_rbac_v2</span>(<span class="hljs-params">
    user_id: str, user_rbac: UserRBACInfo
</span>) -&gt; dict[str, str]:</span>
    <span class="hljs-string">"""Replace user's complete RBAC policy.

    &gt;**NOTE:** This will completely overwrite the existing RBAC policy of the user. Use with caution.
    """</span>
    <span class="hljs-comment"># Your logic to override user RBAC policies will come here</span>
</code></pre>
<p>You can see how we used the right dependency injection to handle both authentication and authorization. It looks pretty neat, doesn't it? Let's check out some more examples.</p>
<pre><code class="lang-python"><span class="hljs-meta">@router.get(</span>
    <span class="hljs-string">"/dataset/{dataset}/table/{table}"</span>,
    dependencies=[Depends(read_table_scope_dependency)],
    response_class=ORJSONResponse,
)
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_data</span>(<span class="hljs-params">
    dataset: Annotated[DatasetInfoV2, Depends(<span class="hljs-params">dataset_presence_dependency</span>)],
    table: Annotated[str, Depends(<span class="hljs-params">table_presence_dependency</span>)],
    filter_query: Annotated[FilterQueryParams, Query(<span class="hljs-params"></span>)],
    request: Request,
</span>) -&gt; ORJSONResponse:</span>
    <span class="hljs-string">"""Get data from desired dataset &amp; table"""</span>
    <span class="hljs-comment"># Code to read table data will come here</span>
    <span class="hljs-keyword">return</span> ORJSONResponse(query_data.to_dicts())
</code></pre>
<p>Here you can see we used three dependency injections in two different ways:</p>
<ol>
<li><p><code>read_table_scope_dependency</code>: It is declared in the router decorator because its return value is not needed inside the handler.</p>
</li>
<li><p><code>dataset_presence_dependency</code> &amp; <code>table_presence_dependency</code>: These are declared as parameters of the path operation function. That’s because they return values that are used downstream.</p>
</li>
</ol>
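<p>You will also notice that no user authentication dependency is declared here directly. That’s because the dependencies are nested on multiple levels: <code>read_table_scope_dependency</code> depends on <code>user_presence_dependency</code>, which in turn depends on <code>get_user</code>, so user authentication &amp; authorization are performed further up the chain.</p>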
<h2 id="heading-interact-with-databricks-lakehouse">Interact with Databricks Lakehouse</h2>
<p>This is the heart of the <strong>Data Sharing</strong> system: the ability to read &amp; write data in the Lakehouse from outside the Databricks <code>Spark ecosystem</code>. I have already discussed the <code>API Endpoint</code> part of it, so now let’s dive into the core logic.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Like any software development journey, this one was full of iterations. I believe that going through each step will benefit you, as it will help you understand the reasoning behind the architectural decisions.</div>
</div>

<h3 id="heading-phase-1-databricks-delta-sharing">Phase 1: Databricks Delta Sharing</h3>
<p>Databricks has a decent native sharing product - <a target="_blank" href="https://learn.microsoft.com/en-us/azure/databricks/delta-sharing/"><code>Delta Sharing</code></a>. Follow the docs on how to create a <code>Share</code> &amp; <code>Recipient</code> (this is outside the scope of this article). Once that was done, we had to do the following:</p>
<ul>
<li><p>To consume the data in Python we had two options - <code>Spark</code> &amp; <code>Pandas</code>. We obviously chose <code>Pandas</code> because that was the whole point.</p>
</li>
<li><p>On the surface, the syntax to read the data looks pretty straightforward:</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> delta_sharing

<span class="hljs-comment"># &lt;profile-path&gt;: the location of the credential file.</span>
<span class="hljs-comment"># &lt;share-name&gt;: the value of share= for the table.</span>
<span class="hljs-comment"># &lt;schema-name&gt;: the value of schema= for the table.</span>
<span class="hljs-comment"># &lt;table-name&gt;: the value of name= for the table.</span>

data = delta_sharing.load_as_pandas(
<span class="hljs-string">f"&lt;profile-path&gt;#&lt;share-name&gt;.&lt;schema-name&gt;.&lt;table-name&gt;"</span>
)
</code></pre>
<ul>
<li>But in practice, it is never this straightforward. As a governance policy, we maintained a dedicated share for every schema and a dedicated recipient for every share, which kept our access rules easy to reason about. The goal of the API, however, was to make all data accessible regardless of which schema it belongs to. So we had to store multiple credential files and load them dynamically based on the requested schema.</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># We stored all the credential (encrypted) in the database. So based on</span>
<span class="hljs-comment"># given schema, it was read from database</span>
current_cred_config = <span class="hljs-keyword">await</span> shareCredential.find_one(
    shareCredential.schema_name == schema
)
<span class="hljs-comment"># getting delta share cred</span>
cred = current_cred_config.share_credential.model_dump()
<span class="hljs-comment"># NOTE - the token is encrypted, so need to decrypt it first</span>
cred[<span class="hljs-string">"bearerToken"</span>] = decrypt_data(ENCRYPTION_KEY, cred[<span class="hljs-string">"bearerToken"</span>])
<span class="hljs-comment"># WARNING - to create delta sharing client it needs cred file physically present. So deleting it</span>
<span class="hljs-comment"># as soon as client is created</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"./config.share"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
    json.dump(cred, f)

<span class="hljs-comment"># NOTE - below is the specific `URL` format that delta sharing client needs</span>
table_url = <span class="hljs-string">f"./config.share#<span class="hljs-subst">{current_cred_config.share_name}</span>.<span class="hljs-subst">{schema}</span>.<span class="hljs-subst">{table}</span>"</span>
data = delta_sharing.load_as_pandas(table_url)
logger.info(<span class="hljs-string">f"successfully retrieve <span class="hljs-subst">{schema}</span>.<span class="hljs-subst">{table}</span> for <span class="hljs-subst">{user_info.email_id}</span>"</span>)
<span class="hljs-comment"># deleting config</span>
cred_config_clean_up()
</code></pre>
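<p>One caveat with the snippet above is that it writes to a fixed path (<code>./config.share</code>), so two concurrent requests for different schemas could overwrite each other's credential file. A possible variation, sketched below with only the standard library, writes each credential to a unique temporary file and removes it even if loading fails. The helper name <code>temporary_profile</code> and the sample credential values are hypothetical, and the <code>delta_sharing</code> call is left as a comment.</p>
<pre><code class="lang-python">import contextlib
import json
import os
import tempfile


@contextlib.contextmanager
def temporary_profile(cred):
    """Write the credential dict to a unique temp file and always remove it."""
    fd, path = tempfile.mkstemp(suffix=".share")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(cred, f)
        yield path
    finally:
        os.remove(path)


# Hypothetical credential values, for illustration only
cred = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://example.com/delta-sharing",
    "bearerToken": "decrypted-token",
}

with temporary_profile(cred) as profile_path:
    table_url = "{}#{}.{}.{}".format(profile_path, "my_share", "my_schema", "my_table")
    # data = delta_sharing.load_as_pandas(table_url)
    print(table_url)
</code></pre>
<p>The file only exists for the duration of the <code>with</code> block, which keeps the decrypted token on disk for as short a time as possible.</p>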
<p><strong>Pros</strong></p>
<ul>
<li><p>Readily available sharing capability provided by Databricks.</p>
</li>
<li><p>No limit on how many catalogs can be shared. No limit on Share &amp; Recipient creation.</p>
</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li><p>Databricks often updates the <code>Reader</code> and <code>Writer</code> protocols with new features. This means Delta Sharing needs to catch up, and we must ensure we aren't using unsupported table properties.</p>
</li>
<li><p>The <code>credential</code> created in Databricks for a <code>Recipient</code> expires, so we need to rotate credentials frequently. This can be cumbersome.</p>
</li>
<li><p>But by far the biggest problem with its <code>pandas</code> driver (and eventually a deal breaker for us) is the lack of support for <code>offset</code>. It does support <code>limit</code>, but without <code>offset</code>, <code>limit</code> has very limited benefits.</p>
<ul>
<li><p>We could not use pagination.</p>
</li>
<li><p>We had to load all the data into memory, which is not practical.</p>
</li>
</ul>
</li>
<li><p>It may not be a <strong>Con</strong> but <code>Delta Sharing</code> is designed for just reading data. So we could not add <code>Write</code> data functionality.</p>
</li>
</ul>
<p>Due to all these issues, we had to make architectural changes.</p>
<h3 id="heading-phase-2-open-source-tools">Phase 2: Open source tools</h3>
<p>As I stated in Part 1, one of the primary reasons we selected <code>Databricks Lakehouse</code> is that it uses the open-source <code>Delta Lake</code> format to store table data. So we started looking for other open-source tools that we could use here. We settled on the wonderful <a target="_blank" href="https://docs.pola.rs/"><code>Polars</code></a> library (if you're not already using it, you should definitely start right away! I can't recommend it enough).</p>
<p>It has built-in support to read Delta Lake tables, thanks to the <a target="_blank" href="https://delta-io.github.io/delta-rs/"><code>Deltalake</code></a> library. It also supports <a target="_blank" href="https://docs.pola.rs/api/python/stable/reference/api/polars.scan_delta.html"><code>Lazy read</code></a>, which works like Python generators, along with <code>Offset</code> and <code>Limit</code>. Plus, since we were already using <code>Polars</code>, we didn't need to make any big changes to our workflows, which was a huge plus for us!</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reading Delta table Data with few filter options</span>

<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl

<span class="hljs-comment"># setting up options for reading delta table using polars</span>
storage_options = {
    <span class="hljs-string">"account_name"</span>: <span class="hljs-string">"&lt;storage_account_name&gt;"</span>,
    <span class="hljs-string">"account_key"</span>: <span class="hljs-string">"&lt;storage_account_key&gt;"</span>,
}
pyarrow_options = {<span class="hljs-string">"parquet_read_options"</span>: {<span class="hljs-string">"coerce_int96_timestamp_unit"</span>: <span class="hljs-string">"ms"</span>}} <span class="hljs-comment"># make sure to set the correct timestamp unit</span>
data = pl.scan_delta(
    source=<span class="hljs-string">"&lt;path_to_delta_table&gt;"</span>, <span class="hljs-comment"># EG - "abfss://&lt;container&gt;@&lt;storage_account_name&gt;.dfs.core.windows.net/path_to/delta_table"</span>
    pyarrow_options=pyarrow_options,
    storage_options=storage_options,
)

<span class="hljs-comment"># Creating query plan to get the data based on the query params.</span>
<span class="hljs-keyword">if</span> columns: <span class="hljs-comment"># NOTE - We used this as query param in fastapi to fetch only required columns</span>
    data = data.select(columns)
<span class="hljs-keyword">if</span> row_count: <span class="hljs-comment"># NOTE - We used this as query param in fastapi to fetch only required absolute no of rows</span>
    data = data.limit(row_count)
<span class="hljs-keyword">elif</span> page: <span class="hljs-comment"># NOTE - We used this as query param in fastapi to fetch data based on pagination</span>
    data = data.slice(
        offset=page * env_config.page_size, length=env_config.page_size
    )

<span class="hljs-comment"># executing query plan</span>
query_data = <span class="hljs-keyword">await</span> data.collect_async(streaming=<span class="hljs-literal">True</span>)
<span class="hljs-comment"># NOTE - If not using async then use below code</span>
<span class="hljs-comment"># query_data = data.collect(streaming=True)</span>

<span class="hljs-comment"># WARNING - fastapi cannot handle polars dataframe while sending response back, so converting</span>
<span class="hljs-comment"># it to python dict</span>
json_data =  query_data.to_dicts()
</code></pre>
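<p>To show why <code>offset</code> plus <code>limit</code> is what makes pagination possible, here is a small standard-library sketch of the same idea. A generator stands in for the lazily scanned table, and <code>islice</code> plays the role of <code>slice</code>. The helper names and the zero-based page numbering are assumptions, chosen to match the <code>offset=page * page_size</code> arithmetic used above.</p>
<pre><code class="lang-python">from itertools import islice


def page_window(page, page_size):
    """Translate a zero-based page number into an (offset, length) pair."""
    return page * page_size, page_size


def read_page(rows, page, page_size):
    """Lazily skip `offset` rows, then take `length` rows from an iterator."""
    offset, length = page_window(page, page_size)
    return list(islice(rows, offset, offset + length))


# A generator stands in for a lazily scanned table: no row exists until requested
rows = ({"id": i} for i in range(10))
print(read_page(rows, page=1, page_size=3))
# prints [{'id': 3}, {'id': 4}, {'id': 5}]
</code></pre>
<p>With <code>limit</code> alone, the only way to reach the second page is to fetch the first two pages and discard the first one on the client side, which is exactly the limitation we hit in Phase 1.</p>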
<pre><code class="lang-python"><span class="hljs-comment"># Writing data using Delta Merge</span>

storage_options = {
    <span class="hljs-string">"account_name"</span>: <span class="hljs-string">"&lt;storage_account&gt;"</span>,
    <span class="hljs-string">"account_key"</span>: <span class="hljs-string">"&lt;storage_key&gt;"</span>,
}

merge_ops_output = (
    data.write_delta( <span class="hljs-comment"># NOTE - data is a Polars dataframe</span>
        target=<span class="hljs-string">"&lt;storage_location&gt;"</span>,
        mode=<span class="hljs-string">"merge"</span>,
        storage_options=storage_options,
        delta_merge_options={
            <span class="hljs-string">"predicate"</span>: <span class="hljs-string">"&lt;predicate&gt;"</span>,  <span class="hljs-comment"># condition to determine upsert requirement. e.g. "source.id = target.id"</span>
            <span class="hljs-string">"source_alias"</span>: <span class="hljs-string">"source"</span>,
            <span class="hljs-string">"target_alias"</span>: <span class="hljs-string">"target"</span>,
        },
    )
    .when_matched_update_all()  <span class="hljs-comment"># updating all columns for a row already present</span>
    .when_not_matched_insert_all()  <span class="hljs-comment"># inserting all columns for a new row</span>
    <span class="hljs-comment"># Delta table supports a lot more merge operations, check them out based </span>
    <span class="hljs-comment"># on your requirements</span>
    .execute()
)
</code></pre>
<p><strong>Pros</strong></p>
<ul>
<li><p>This uses open-source tools.</p>
</li>
<li><p>Supports proper <code>Offset</code> &amp; <code>Limit</code> which allowed us to develop <code>Pagination</code>.</p>
</li>
<li><p>It supports all <code>Delta Merge</code> operations out of the box which allowed us to develop <code>Write data</code> functionality.</p>
</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li><p>To read or write data, we need to know the storage location of the table in advance. To simplify this for users, who only need to provide the table name, we keep a database with the storage locations of all tables. Additionally, there is a process that regularly updates this database.</p>
</li>
<li><p>The biggest issue with this approach is the same one we had with the previous approach:</p>
<ul>
<li><p><code>Reader</code> &amp; <code>Writer</code> protocol version. Even here the version was not updated frequently &amp; we could not use the latest features developed by Databricks.</p>
</li>
<li><p>As a temporary fix, we used to maintain two tables - the first one was the <strong>original</strong> table with all the latest table properties &amp; the second was a <strong>copy</strong> table stripped of all table properties.</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-phase-3-databricks-connect">Phase 3: Databricks Connect</h3>
<p>They say the <strong>third time's the charm</strong>, and that held true for us.</p>
<p>As discussed in the previous two phases, we struggled with multiple pain points. We managed them for some time, but none of the workarounds were permanent, foolproof solutions. So while maintaining the current architecture, we kept our research engine going. We tried multiple approaches &amp; finally settled on Databricks Connect.</p>
<p>It’s a tool provided by Databricks to run Spark (both SQL &amp; PySpark) jobs on Databricks infrastructure remotely. It includes <a target="_blank" href="https://docs.databricks.com/en/dev-tools/python-sql-connector.html"><code>Databricks SQL Connector</code></a> to work with <strong>Spark SQL</strong> &amp; <a target="_blank" href="https://docs.databricks.com/en/dev-tools/databricks-connect/python/index.html"><code>Databricks Connect for Python</code></a> (based on Spark Connect) to work with <strong>PySpark.</strong></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Databricks Connect has been around for a while, but we initially avoided it due to the lack of Databricks Serverless Compute, which only recently became generally available. Previously, we had to rely on Spark Clusters, which were slow to start, resource-intensive, and, practically speaking, overkill for our use case.</div>
</div>

<p>Let's explore how to create the <code>Cursor</code> to interact with the Delta table in Lakehouse. Databricks offers a bunch of connection and authentication options, which you can check out <a target="_blank" href="https://docs.databricks.com/en/dev-tools/python-sql-connector.html#authentication">here</a>. Since we're using it in the application, we went with <a target="_blank" href="https://docs.databricks.com/en/dev-tools/python-sql-connector.html#auth-m2m">OAuth machine-to-machine (M2M) authentication</a>.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> contextlib <span class="hljs-keyword">import</span> asynccontextmanager

<span class="hljs-keyword">from</span> databricks.sdk.core <span class="hljs-keyword">import</span> Config, oauth_service_principal
<span class="hljs-keyword">from</span> databricks.sql <span class="hljs-keyword">import</span> connect
<span class="hljs-keyword">from</span> databricks.sql.exc <span class="hljs-keyword">import</span> RequestError
<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI


<span class="hljs-meta">@asynccontextmanager</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">setup_databricks_cursor</span>(<span class="hljs-params">app: FastAPI</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_oauth_cred</span>():</span>
        <span class="hljs-keyword">return</span> oauth_service_principal(db_cfg)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_get_cursor</span>():</span>
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">return</span> connect(
                server_hostname=<span class="hljs-string">"&lt;databricks warehouse/cluster host&gt;"</span>, <span class="hljs-comment"># make sure to grab serverless warehouse</span>
                http_path=<span class="hljs-string">"&lt;databricks http path&gt;"</span>, <span class="hljs-comment"># make sure to grab serverless warehouse</span>
                credentials_provider=_oauth_cred,
            ).cursor()
        <span class="hljs-keyword">except</span> RequestError:
            logger.debug(<span class="hljs-string">"either token expired or connection with databricks lost, getting new"</span>)
            <span class="hljs-keyword">return</span> _get_cursor()

    <span class="hljs-comment"># Adding cursor to FastAPI app state so that it becomes available to all routers</span>
    app.state.cursor = _get_cursor()
    logger.debug(<span class="hljs-string">"successfully connected to databricks"</span>)
    <span class="hljs-keyword">yield</span>
    app.state.cursor.close()
    logger.debug(<span class="hljs-string">"successfully closed databricks connection"</span>)

app = FastAPI(
    debug=<span class="hljs-literal">True</span>,
    title=<span class="hljs-string">"Your title"</span>,
    lifespan=setup_databricks_cursor, <span class="hljs-comment"># NOTE - here we are passing the async context manager created above as a lifespan</span>
)
</code></pre>
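<p>In the snippet above, <code>_get_cursor</code> calls itself whenever the connection attempt fails, which can recurse without limit if Databricks stays unreachable. A bounded retry is one way to guard against that. The following is only a sketch under stated assumptions: <code>connect_with_retry</code>, its parameters, and the linear backoff are hypothetical and are not part of the Databricks SQL Connector, and the demo uses a stand-in factory instead of a real connection.</p>

```python
import logging
import time
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def connect_with_retry(
    factory: Callable[[], T],
    retry_on: tuple = (ConnectionError,),
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> T:
    """Call factory() until it succeeds or the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return factory()
        except retry_on as exc:
            logger.debug("connection attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay_seconds * attempt)  # simple linear backoff
    raise ValueError("attempts must be at least 1")


# Demo with a stand-in factory that fails twice, then succeeds.
calls = {"count": 0}


def flaky_factory() -> str:
    calls["count"] += 1
    if calls["count"] != 3:
        raise ConnectionError("simulated dropped connection")
    return "cursor"


print(connect_with_retry(flaky_factory, delay_seconds=0.0))  # cursor
```

<p>In the lifespan function, the factory would be a callable that wraps <code>connect(...).cursor()</code>, with <code>retry_on</code> set to a tuple containing <code>RequestError</code>.</p>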
<pre><code class="lang-python"><span class="hljs-comment"># For Reading we used Databricks SQL Connector </span>
<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">from</span> fastapi.responses <span class="hljs-keyword">import</span> ORJSONResponse
<span class="hljs-keyword">from</span> sqlglot <span class="hljs-keyword">import</span> Dialects, select
<span class="hljs-keyword">from</span> sqlglot.errors <span class="hljs-keyword">import</span> ParseError

<span class="hljs-keyword">try</span>:
    columns = select(*filter_query.column) <span class="hljs-keyword">if</span> filter_query.column <span class="hljs-keyword">else</span> select(<span class="hljs-string">"*"</span>)  <span class="hljs-comment"># NOTE - We used this as query param in fastapi to fetch only required columns</span>
    query = columns.from_(<span class="hljs-string">f"<span class="hljs-subst">{dataset.catalog}</span>.<span class="hljs-subst">{dataset.dataset}</span>.<span class="hljs-subst">{table}</span>"</span>) <span class="hljs-comment"># Full table name. EG - "catalog_name.dataset_name.table_name"</span>
    <span class="hljs-keyword">if</span> filter_query.filter:  <span class="hljs-comment"># NOTE - We used this as query param in fastapi to filter the data based on SQL Where clause</span>
        query = query.where(filter_query.filter)
    <span class="hljs-keyword">if</span> filter_query.order_by: <span class="hljs-comment"># NOTE - We used this as query param in fastapi to order the data based on SQL Order By clause</span>
        query = query.order_by(filter_query.order_by)
    <span class="hljs-keyword">if</span> filter_query.row_count: <span class="hljs-comment"># NOTE - We used this as query param in fastapi to limit the number of rows fetched</span>
        query = query.limit(filter_query.row_count)
    <span class="hljs-keyword">elif</span> filter_query.page: <span class="hljs-comment"># NOTE - We used this as query param in fastapi to enable pagination</span>
        query = query.offset((filter_query.page - <span class="hljs-number">1</span>) * <span class="hljs-number">1000</span>).limit(<span class="hljs-number">1000</span>)

    <span class="hljs-comment"># executing query plan</span>
    query_data = pl.read_database(
        query.sql(dialect=Dialects.DATABRICKS), request.app.state.cursor <span class="hljs-comment"># SQLGlot converts above query to Databricks SQL dialect</span>
    )
    data = ORJSONResponse(query_data.to_dicts()) <span class="hljs-comment"># FastAPI has native support for ORJSONResponse, which is faster than JSONResponse</span>
<span class="hljs-comment"># Error coming from SQLGlot due to incorrect sql query</span>
<span class="hljs-keyword">except</span> ParseError <span class="hljs-keyword">as</span> e:
    <span class="hljs-comment"># You can add your custom exception handling here; until then, re-raise the error</span>
    <span class="hljs-keyword">raise</span>
</code></pre>
<p>One important point to note when using <strong>Databricks Connect</strong> is that the connection is destroyed after 10 minutes of inactivity, after which a new one has to be created. To avoid doing this by hand, we wrapped session creation in a <strong>Python</strong> <strong>context manager</strong>.</p>
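<pre><code class="lang-python"><span class="hljs-comment"># Setting up Databricks connect to execute PySpark code</span>
<span class="hljs-keyword">from</span> contextlib <span class="hljs-keyword">import</span> contextmanager

<span class="hljs-keyword">from</span> databricks.connect <span class="hljs-keyword">import</span> DatabricksSession
<span class="hljs-keyword">from</span> databricks.sdk.core <span class="hljs-keyword">import</span> Config
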
<span class="hljs-meta">@contextmanager</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">spark_manager</span>(<span class="hljs-params">profile: str | Config = <span class="hljs-string">"DEFAULT"</span></span>):</span>
    <span class="hljs-keyword">if</span> isinstance(profile, str):
        cfg = _get_databricks_config(profile)
        <span class="hljs-comment"># NOTE - It's not necessary to use config. But I am using it to make it more flexible &amp;</span>
        <span class="hljs-comment"># able to use any profile/environment as needed.</span>
    <span class="hljs-keyword">if</span> isinstance(profile, Config):
        cfg = profile
    spark = DatabricksSession.builder.sdkConfig(cfg).getOrCreate()
    <span class="hljs-keyword">yield</span> spark
    <span class="hljs-comment"># NOTE - We don't need to close the session as it will be automatically closed in case of 10</span>
    <span class="hljs-comment"># minutes of inactivity by databricks. Also one more added benefit is that some other workload</span>
    <span class="hljs-comment"># can use the same session if it's not closed.</span>
    <span class="hljs-comment"># WARNING - If you still want to close the session, you can uncomment the below line.</span>
    <span class="hljs-comment"># spark.stop()</span>
</code></pre>
<pre><code class="lang-python">db_cfg = Config(
    host=env_config.databricks_host,
    client_id=<span class="hljs-string">"&lt;databricks service principal client id&gt;"</span>,
    client_secret=<span class="hljs-string">"&lt;databricks service principal client secret&gt;"</span>,
    serverless_compute_id=<span class="hljs-string">"auto"</span>, <span class="hljs-comment"># To use serverless cluster</span>
)

<span class="hljs-comment"># This makes sure we always get a connection. If one already exists, </span>
<span class="hljs-comment"># it is reused. If not, a new one is created &amp; used. </span>
<span class="hljs-keyword">with</span> spark_manager(db_cfg) <span class="hljs-keyword">as</span> spark:
    <span class="hljs-comment"># your PySpark code will come here</span>
    ...
</code></pre>
<p><strong>Pros</strong></p>
<ul>
<li><p>No limitations on Delta tables with any table properties. In fact, we can even query <code>Views</code>, which was not possible in the first two phases.</p>
</li>
<li><p>Since we are using Databricks SQL, it brings all sorts of SQL capabilities like <strong>Select</strong>, <strong>Offset</strong>, <strong>Limit</strong>, <strong>Where</strong>, <strong>Order By,</strong> etc.</p>
</li>
<li><p>We can use Databricks Serverless Compute, which is shared with other workloads, so it requires no additional infrastructure cost.</p>
</li>
<li><p>No need to maintain any temporary solution.</p>
</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li><strong>Vendor lock-in</strong> as we have to use tools provided by Databricks.</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<ul>
<li><p>Wrapping up, creating a solid enterprise data strategy is all about planning and executing well across different stages.</p>
</li>
<li><p>Our journey from <strong>Databricks Delta Sharing</strong> to <strong>open-source</strong> tools like <strong>Polars</strong>, and finally to <strong>Databricks Connect</strong>, shows how important it is to stay flexible and keep improving to tackle challenges like data access and performance.</p>
</li>
<li><p>By focusing on things like <strong>modularity, authentication, and authorization, and embracing open-source solutions</strong>, businesses can create a scalable and efficient data strategy that keeps up with what users need.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[How to Create an Effective Enterprise Data Strategy: Part 1]]></title><description><![CDATA[TLDRData management is crucial for enterprises to ensure data accuracy, accessibility, and security, which supports informed decision-making, operational efficiency, and compliance. An effective data strategy involves a robust data platform architect...]]></description><link>https://importidea.dev/how-to-create-an-effective-enterprise-data-strategy-part-1</link><guid isPermaLink="true">https://importidea.dev/how-to-create-an-effective-enterprise-data-strategy-part-1</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Databricks]]></category><category><![CDATA[architecture]]></category><category><![CDATA[best practices]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Mon, 14 Oct 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731842025728/1a27bf85-c25c-4a26-b893-d800b26978d3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<details><summary>TLDR</summary><div data-type="detailsContent">Data management is crucial for enterprises to ensure data accuracy, accessibility, and security, which supports informed decision-making, operational efficiency, and compliance. An effective data strategy involves a robust data platform architecture, like the Databricks Lakehouse with Medallion Architecture, for efficient data storage and processing. It also includes secure data sharing and consumption through universally compatible APIs and a comprehensive security architecture with data security and access control measures. These strategies drive innovation, competitive advantage, and reliable data insights.</div></details>

<h2 id="heading-introduction">Introduction</h2>
<h3 id="heading-overview-of-the-importance-of-data-management-in-enterprises">Overview of the importance of Data management in enterprises</h3>
<p>Data management is vital for enterprises as it</p>
<ul>
<li><p>Ensures that data is accurate, accessible, and secure, which is essential for <strong>informed decision-making</strong>.</p>
</li>
<li><p>Helps organizations streamline operations, improve efficiency, and enhance customer experiences by providing <strong>reliable data insights</strong>.</p>
</li>
<li><p>Supports compliance with regulatory requirements and reduces the risk of data breaches by implementing <strong>robust security measures</strong>.</p>
</li>
<li><p>Drives innovation and competitive advantage by enabling advanced analytics and data-driven strategies.</p>
</li>
</ul>
<h3 id="heading-key-features-of-an-effective-data-sharing-solution-in-enterprises">Key Features of an Effective Data Sharing Solution in Enterprises</h3>
<p>Every organization needs a powerful data sharing solution that is</p>
<ul>
<li><p>Powerful enough to <strong>share data of any size.</strong></p>
</li>
<li><p>Agnostic enough to be <strong>used by any tool or programming language</strong>.</p>
</li>
<li><p>Scalable enough to <strong>support all users</strong> all the time.</p>
</li>
<li><p>Secure enough to ensure the right people have access to the right data.</p>
</li>
</ul>
<p>Achieving this goal will truly make the solution great. Users will have enough confidence, eventually leading to a good adoption rate.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">I prefer to break down the solution into <code>What, Where, How</code> problem statements, solving each unit individually. By combining them, we eventually construct the overall solution.</div>
</div>

<h2 id="heading-robust-data-platform-architecture">Robust Data Platform Architecture</h2>
<p>Here, we'll cover the key architecture choices we made to build a strong data platform that meets all business needs.</p>
<h3 id="heading-data-storage-strategy">Data storage strategy</h3>
<p><strong>The What</strong></p>
<p>This is an extremely important step. You have to make the <strong>right choice at the beginning</strong>, because if you make the wrong choice then <strong>data migration will be a difficult task</strong>. Usually, the options to choose from are either <code>Data Warehouse</code> or <code>Data Lakehouse</code>.</p>
<p>We chose to go with <code>Data Lakehouse</code> for the following reasons:</p>
<ul>
<li><p>A Data Lakehouse provides a <strong>centralized location</strong> to store all types of data.</p>
</li>
<li><p>It supports not only structured data like <strong>tables</strong> but also unstructured data such as <strong>images, videos, and binaries</strong>, as it’s built on top of a <code>Data lake</code>.</p>
</li>
<li><p>Its <strong>modular &amp; open design</strong> is highly advantageous. By separating storage from data processing, it allows for the use of various tools to read, write, and process data efficiently.</p>
</li>
</ul>
<p><strong>The Where</strong></p>
<p>Out of all the <code>Data Lakehouse</code> platforms out there, we chose <a target="_blank" href="https://docs.databricks.com/en/lakehouse/index.html">Databricks Lakehouse</a> for the following reasons:</p>
<p><img src="https://docs.databricks.com/en/_images/lakehouse-diagram.png" alt="A diagram of the lakehouse architecture using Unity Catalog and delta tables." class="image--center mx-auto" /></p>
<ul>
<li><p>Databricks Lakehouse supports <code>cloud storage objects</code> for data storage. We already use <code>Azure ADLS gen2</code>, which is <strong>natively supported</strong> by Databricks Lakehouse.</p>
</li>
<li><p>It utilizes <code>Apache Spark</code> (<code>PySpark</code> + <code>Spark SQL</code>) for <strong>processing and transforming data</strong>. While many tools can handle <code>big data</code>, in our experience none match the capabilities of <code>Apache Spark</code>.</p>
</li>
<li><p>It employs the <code>Delta Lake</code> format to store data in tables. Delta Lake introduces <strong>ACID properties</strong>, a key feature whose absence previously deterred users from moving away from <code>Data Warehouses</code> to <code>Data lakes</code>, and which paved the way for <code>Data Lakehouse</code>.</p>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">There are many more features that Databricks offers, which we’ll discuss over the course of this article series.</div>
</div>

<p><strong>The How</strong></p>
<p>There are many well-defined strategies for storing data. Since we are using the <code>Databricks Lakehouse</code> platform, we went ahead with Databricks’ <a target="_blank" href="https://www.databricks.com/glossary/medallion-architecture">Medallion Architecture</a>.</p>
<p><img src="https://www.databricks.com/sites/default/files/inline-images/building-data-pipelines-with-delta-lake-120823.png?v=1702318922" alt="Building Reliable, Performant Data Pipelines with Delta Lake" /></p>
<ul>
<li><p>Adopting the Medallion architecture has allowed us to <strong>enhance data quality</strong> by organizing data into three distinct namespaces, each serving a specific purpose, ensuring they do not interfere with one another.</p>
</li>
<li><p>The Medallion architecture also <strong>improves</strong> our ability to meet business needs, as the gold layer is <strong>specifically designed</strong> to cater to those requirements.</p>
</li>
<li><p>It also provides a <strong>unified data management</strong> ability to ease engineering efforts.</p>
</li>
</ul>
<h3 id="heading-facilitate-data-sharing-amp-consumption">Facilitate Data Sharing &amp; Consumption</h3>
<p>A robust Enterprise data strategy must have equally robust <strong>data sharing &amp; consumption</strong> capabilities.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731839438082/2ac20fff-0d1e-4a5f-8eb2-5cb380de83cf.png" alt class="image--center mx-auto" /></p>
<p><strong>The What</strong></p>
<ul>
<li><p><code>Data Sharing</code> - The goal is to make <strong>data accessible</strong> throughout the organization.</p>
</li>
<li><p><code>Data Consumption</code> - The goal is to make reading and writing data <strong>simple and convenient</strong> for <strong>authorized users and applications</strong> by abstracting all the complexity.</p>
</li>
</ul>
<p><strong>The Where</strong></p>
<ul>
<li><p>Both layers - Data Sharing &amp; Consumption - should act as middleware between the user/tool and the <code>Lakehouse</code>.</p>
</li>
<li><p>We tried multiple options for hosting it but eventually settled on <code>Kubernetes</code>. We were already using it for other products too. However, hosting the application should be purely based on your requirements (and it's a little out of scope for this article too).</p>
</li>
</ul>
<p><strong>The How</strong></p>
<ul>
<li><p>We took inspiration from how two independent services usually communicate, and in most cases the answer is <code>APIs</code>.</p>
</li>
<li><p>In 2024, we had two options to choose from: <code>REST API</code> or <code>GraphQL</code>.</p>
</li>
<li><p>We selected <code>REST API</code> because <code>GraphQL</code> still has limited support across many tools. Almost all tools and programming languages support REST API, making it <strong>universally compatible.</strong></p>
</li>
</ul>
<h2 id="heading-security-architecture">Security Architecture</h2>
<p>I cannot stress enough that Security Architecture is as important as other architectures like application, scaling, etc. <em>Treating security as a second-class citizen or an afterthought will come back to haunt you later.</em></p>
<p>To satisfy all our security needs, we broke down the architecture into two parts - <code>Data Security</code> &amp; <code>Access control</code>.</p>
<h3 id="heading-data-security">Data Security</h3>
<p>Effective data security measures are crucial for any organization.</p>
<p><img src="https://hiflylabswebstorage.blob.core.windows.net/hiflylabs-web-strapi-container/assets/image2_v2_e232c86fcd.png" alt="image2 v2.png" /></p>
<p><strong>The What</strong></p>
<ul>
<li><p>The goal is to have the <strong>ability to store</strong> data securely.</p>
</li>
<li><p>The goal is to follow any region-specific <strong>data localization or compliance policies</strong>.</p>
</li>
<li><p>The goal is to <strong>restrict unauthorized use</strong> of sensitive data.</p>
</li>
</ul>
<p><strong>The Where</strong></p>
<ul>
<li><p>Databricks Lakehouse lets us save table data <strong>externally</strong> to cloud storage. We use this feature to store all our tables in <code>Azure ADLS Gen2</code>. This ensures our data is securely stored.</p>
</li>
<li><p>Azure allows us to select the location of our storage account. So we use a <strong>multi-region storage account strategy</strong> to meet data localization policies.</p>
</li>
</ul>
<p><strong>The How</strong></p>
<ul>
<li><p>Since we are using <code>Azure ADLS Gen2</code>, we get multi-layered security like <strong>authentication, access control, network isolation, data protection, advanced threat protection,</strong> and <strong>auditing.</strong></p>
</li>
<li><p>The most crucial role is played by <code>Databricks Unity Catalog</code>. It goes hand in hand with the multi-region storage account strategy. Here is how we did it:</p>
<pre><code class="lang-plaintext">  Step 1: Configure external locations and storage credentials
  - External locations are defined as a path to cloud storage, 
    combined with a storage credential that can be used to access 
    that location.
  - A storage credential encapsulates a long-term cloud credential 
    that provides access to cloud storage.

  Step 2: Configure region specific Unity Catalog
  - Building on the step above, we use each region specific storage 
    account's storage credentials to create its dedicated unity catalog. 
  - Once we have the unity catalog created, we can use Spark (SQL or PySpark)
    to process the data. 

  Step 3: Managing all catalogs
  - This is where Databricks really shines. We can have all the different 
    region based unity catalogs under the same workspace. 
  - This way we can run our Spark jobs or ETL pipelines, process the data &amp;
    save it back to the appropriate location. 
  - This way we comply with all policies.
</code></pre>
</li>
</ul>
<h3 id="heading-access-control"><strong>Access Control</strong></h3>
<p>The Security architecture isn't just about external protection or following policies; we also need to make sure it's secure on the inside too. This can be done using <code>Access Control</code> principles.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731837362326/f0b64749-1dcc-4ea0-bba1-e527fb9b45ef.png" alt class="image--center mx-auto" /></p>
<p><strong>The What</strong></p>
<ul>
<li><p>The goal is to ensure that only <strong>authorized users can access</strong> specific data, <strong>preventing unauthorized</strong> access and potential data breaches.</p>
</li>
<li><p>The goal is to <strong>control</strong> who can <strong>view or modify data</strong>. Access control helps <strong>maintain the integrity</strong> of the data, ensuring that it remains accurate and reliable.</p>
</li>
<li><p>The goal is to <strong>provide a clear audit trail</strong> of who accessed or modified data, which is essential for <strong>accountability and transparency</strong>.</p>
</li>
</ul>
<p><strong>The Where</strong></p>
<ul>
<li><p>Since we are using <code>Databricks Lakehouse</code> with <strong>External location,</strong> we have to implement two <strong>Access Control</strong> strategies.</p>
</li>
<li><p><strong>Governance with</strong> <code>Databricks Unity Catalog</code>:</p>
<ul>
<li><p>It allows us to <code>GRANT</code> fine-grained privileges such as select, read, and write on various objects like tables, views, volumes, models, etc.</p>
</li>
<li><p>We grant access to <code>User Group principals</code> rather than directly to <code>user principals</code>.</p>
</li>
</ul>
</li>
<li><p>Access Control in <code>Azure ADLS Gen2</code>:</p>
<ul>
<li><p>All the files associated with tables, views, and models are stored in cloud storage. We made sure that no group has higher permissions on the storage than it has in Databricks Unity Catalog.</p>
</li>
<li><p>To keep it simple we usually don’t provide <code>write</code> access to storage accounts associated with higher environments to users. Everything is controlled by the <strong>Access Control</strong> policy in Databricks.</p>
</li>
</ul>
</li>
</ul>
<p><strong>The How</strong></p>
<ul>
<li><p>Databricks offers a highly adaptable governance policy, which we leverage extensively.</p>
</li>
<li><p>We implement a <code>Zero Trust Policy</code> in combination with <code>Role-Based Access Control</code>.</p>
</li>
<li><p>For each <code>Schema</code>, we have established roles such as <code>reader</code>, <code>writer</code>, and <code>owner</code>. Depending on the specific use case, <code>Groups</code> are assigned these roles. Subsequently, the <code>User</code> is added to the appropriate <code>Group</code>.</p>
</li>
<li><p>When a <code>User</code> no longer requires access, they are removed from the group, eliminating the need for frequent modifications to the grants.</p>
</li>
<li><p>The <code>Service Principal</code> is crucial, especially in the production environment. We only use it to execute jobs.</p>
</li>
</ul>
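<p>The group-based model described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Unity Catalog implements it: the <code>AccessModel</code> class, the privilege names, and the group and schema names are all hypothetical. It shows the two properties we rely on: access is denied unless a group grants it, and removing a user from a group revokes access without touching any grant.</p>

```python
from dataclasses import dataclass, field

# Privileges per role are illustrative, not Unity Catalog's exact privilege names.
ROLE_PRIVILEGES = {
    "reader": {"select"},
    "writer": {"select", "modify"},
    "owner": {"select", "modify", "manage"},
}


@dataclass
class AccessModel:
    grants: dict = field(default_factory=dict)   # (group, schema) -> role
    members: dict = field(default_factory=dict)  # group -> set of users

    def grant(self, group: str, schema: str, role: str) -> None:
        if role not in ROLE_PRIVILEGES:
            raise ValueError(f"unknown role: {role}")
        self.grants[(group, schema)] = role

    def add_user(self, user: str, group: str) -> None:
        self.members.setdefault(group, set()).add(user)

    def remove_user(self, user: str, group: str) -> None:
        self.members.get(group, set()).discard(user)

    def can(self, user: str, schema: str, privilege: str) -> bool:
        # Zero trust: deny unless some group the user belongs to grants it.
        return any(
            user in self.members.get(group, set())
            and privilege in ROLE_PRIVILEGES[role]
            for (group, granted_schema), role in self.grants.items()
            if granted_schema == schema
        )


model = AccessModel()
model.grant("sales_analysts", "gold_sales", "reader")
model.add_user("alice", "sales_analysts")
print(model.can("alice", "gold_sales", "select"))  # True
print(model.can("alice", "gold_sales", "modify"))  # False

# Removing the user from the group revokes access; the grant itself is untouched.
model.remove_user("alice", "sales_analysts")
print(model.can("alice", "gold_sales", "select"))  # False
```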
<h2 id="heading-conclusion">Conclusion</h2>
<ul>
<li><p>Creating an effective enterprise data strategy is essential for harnessing the full potential of data. Focus on robust data management, secure data sharing, and comprehensive security architecture.</p>
</li>
<li><p>Implement a well-thought-out data platform architecture, like the Databricks Lakehouse with Medallion Architecture, for efficient data storage and processing.</p>
</li>
<li><p>Facilitate seamless data sharing and consumption through universally compatible APIs for easy access by authorized users.</p>
</li>
<li><p>Adopting these strategies helps drive innovation, gain a competitive edge, and make informed decisions based on reliable data insights.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Working with Avro file format in Python the right way]]></title><description><![CDATA[Here are some quick helpful tips for using Avro file format correctly in python.

Note: I am assuming you are familiar with Apache Avro file format, its advantages, its shortcomings, etc.

Tip no 1: Use the correct package
Instead of using the official p...]]></description><link>https://importidea.dev/working-with-avro-file-format-in-python-the-right-way</link><guid isPermaLink="true">https://importidea.dev/working-with-avro-file-format-in-python-the-right-way</guid><category><![CDATA[#fastavro]]></category><category><![CDATA[avro]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Mon, 19 Dec 2022 06:28:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1671431248231/LscCboKfv.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here are some quick helpful tips for using Avro file format correctly in python.</p>
<blockquote>
<p>Note: I am assuming you are familiar with Apache Avro file format, its advantages, its shortcomings, etc.</p>
</blockquote>
<p><strong>Tip no 1: Use the correct package</strong></p>
<p>Instead of using the official package from <a target="_blank" href="https://pypi.org/project/avro/">Apache Avro</a>, use <a target="_blank" href="https://github.com/fastavro/fastavro">Fast Avro for Python</a>. Trust me, the claims made by the author of fastavro mostly hold true.</p>
<p><strong>Tip no 2: Use of schema</strong></p>
<p>Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. You can read the <a target="_blank" href="https://avro.apache.org/docs/1.11.1/specification/">specification docs</a> to understand more about it in detail. I too found it a bit confusing &amp; keep forgetting it. So here is the rule of thumb that I follow:</p>
<pre><code class="lang-python">schema = {
    <span class="hljs-string">'doc'</span>: <span class="hljs-string">'A dummy avro file'</span>, <span class="hljs-comment"># a short description </span>
    <span class="hljs-string">'name'</span>: <span class="hljs-string">'dummy'</span>, <span class="hljs-comment"># record name; I use the name of the intended .avro file </span>
    <span class="hljs-string">'type'</span>: <span class="hljs-string">'record'</span>, <span class="hljs-comment"># type of Avro schema; there are more (see the docs above) but in my experience this will do most of the time</span>
    <span class="hljs-string">'fields'</span>: [ <span class="hljs-comment"># this defines the actual keys &amp; their types; field names must be unique</span>
        {<span class="hljs-string">'name'</span>: <span class="hljs-string">'key1'</span>, <span class="hljs-string">'type'</span>: <span class="hljs-string">'string'</span>},
        {<span class="hljs-string">'name'</span>: <span class="hljs-string">'key2'</span>, <span class="hljs-string">'type'</span>: <span class="hljs-string">'int'</span>},
        {<span class="hljs-string">'name'</span>: <span class="hljs-string">'key3'</span>, <span class="hljs-string">'type'</span>: <span class="hljs-string">'boolean'</span>},
    ],
}
</code></pre>
<p><strong>Tip no 3: Write correctly</strong></p>
<p>For some reason, the default fastavro write method does not use any codec (compression algorithm), which defeats the purpose of using Avro. See the table and screenshot below.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Format</td><td>Size</td></tr>
</thead>
<tbody>
<tr>
<td>JSON</td><td>13.6 MB</td></tr>
<tr>
<td>Avro (no compression codec)</td><td>13.3 MB</td></tr>
<tr>
<td>Avro (deflate compression codec)</td><td>2.13 MB</td></tr>
<tr>
<td>Avro (snappy compression codec)</td><td>3.4 MB</td></tr>
</tbody>
</table>
</div><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1671430040095/WtxsRCxw_.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastavro <span class="hljs-keyword">import</span> writer, parse_schema

<span class="hljs-comment"># more_rows is assumed to be an iterable of dicts matching the schema above</span>
<span class="hljs-comment"># by default no compression codec is applied</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'dummy.avro'</span>, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> out:
    writer(out, parse_schema(schema), more_rows)

<span class="hljs-comment"># going by the above table, it's best to use deflate. It also has native support</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'dummy_deflate.avro'</span>, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> out:
    writer(out, parse_schema(schema), more_rows, codec=<span class="hljs-string">"deflate"</span>)
</code></pre>
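<p>To get a feel for why the codec matters, here is a rough, stdlib-only sketch. It does not use fastavro at all: it simply compresses made-up JSON rows with <code>zlib</code> (which implements the DEFLATE algorithm), so the numbers will differ from the table above. The sample rows are hypothetical.</p>

```python
import json
import zlib

# Hypothetical sample data: many rows sharing the same keys, like typical records.
rows = [{"key1": f"value-{i}", "key2": i, "key3": i % 2 == 0} for i in range(10_000)]

# Serialise the rows as JSON lines, then compress them.
raw = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
compressed = zlib.compress(raw)  # zlib implements the DEFLATE algorithm

print(f"raw: {len(raw)} bytes, deflate: {len(compressed)} bytes")
```

<p>Repetitive keys &amp; values are exactly what a compression codec is good at, which is why writing without one leaves most of the savings on the table.</p>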
<p><strong>Tip no 4: Read as a generator</strong></p>
<p>Assuming the file is huge (which is probably why you needed to move from JSON to something like Avro) &amp; since <code>fastavro</code> supports lazy loading, why not use it? It is also straightforward.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastavro <span class="hljs-keyword">import</span> reader
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"dummy.avro"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> fo:
    f = reader(fo) <span class="hljs-comment"># f is an iterator that yields one record at a time</span>
    <span class="hljs-comment"># loop over it; do_something is a placeholder for your own processing</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> f:
        do_something(i)
</code></pre>
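<p>If handling one record at a time is too fine-grained, the same lazy iterator can be consumed in batches. Below is a minimal sketch using only the standard library; <code>in_batches</code> is a helper name I made up, and a plain generator stands in for the fastavro reader so the snippet runs on its own.</p>

```python
from itertools import islice


def in_batches(records, batch_size):
    """Yield lists of up to batch_size records without loading everything at once."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for the fastavro reader: any iterator of dict records works here.
fake_reader = ({"key1": str(i), "key2": i} for i in range(7))

batch_sizes = [len(batch) for batch in in_batches(fake_reader, 3)]
print(batch_sizes)  # [3, 3, 1]
```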
]]></content:encoded></item><item><title><![CDATA[How to supercharge your config to make it truly environment agnostic]]></title><description><![CDATA[Anyone who develops a project which involves multiple environments (e.g. DEV, QA, PROD) knows how painful it is to write code once which works everywhere, especially if it involves lots of env-specific tools (including cloud). I too have faced such p...]]></description><link>https://importidea.dev/how-to-supercharge-your-config-to-make-it-truly-environment-agnostic</link><guid isPermaLink="true">https://importidea.dev/how-to-supercharge-your-config-to-make-it-truly-environment-agnostic</guid><category><![CDATA[environment-agnostic]]></category><category><![CDATA[Python]]></category><category><![CDATA[configuration]]></category><category><![CDATA[pydantic]]></category><category><![CDATA[environmental management system]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Tue, 13 Dec 2022 14:51:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1670942968280/wjY6WRA6b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Anyone who develops a project which involves multiple environments (e.g. DEV, QA, PROD) knows how painful it is to write code once which works everywhere, especially if it involves lots of env-specific tools (including cloud). I too have faced such problems &amp; to solve them once and for all I came up with a system called <strong>Environment Agnostic Config</strong>.</p>
<h2 id="heading-1-the-old-school-way">1. The Old School way</h2>
<p>Using a config to control environments &amp; settings is nothing new. Developers have been doing it for ages, so why is it necessary to reinvent the wheel? I am not proposing to reinvent the wheel, but to modernize it.</p>
<p>The most popular (&amp; probably the most naïve way too 😕) is the use of file-based configs like <strong>JSON, YAML, TOML, etc</strong>. Apart from being naïve, they have two significant issues:</p>
<ul>
<li><p>they have to be hardcoded</p>
</li>
<li><p>they cannot be generated dynamically.</p>
</li>
</ul>
<p>This is a big deal breaker when dealing with multiple environments. If you have multiple sources to populate it (e.g. some secret vault, environment variables, hardcoded values, etc) then I would strongly recommend just dropping the idea of using a file-based config to save hardships in your life 😉.</p>
<h2 id="heading-2-environment-agnostic-config-generating-config-programmatically">2. Environment Agnostic Config: Generating config programmatically</h2>
<p>To achieve this, I use <a target="_blank" href="https://docs.pydantic.dev/usage/settings/">Pydantic's Settings Management</a>. Anyone who knows Pydantic will find it familiar &amp; easy to use.</p>
<p>Usually, you would be using:</p>
<ul>
<li><p>Something like a .env file to store all secrets &amp; later read them. Or store secrets/configuration directly as environment variables &amp; then read them.</p>
</li>
<li><p>Other configurations/settings can be hardcoded and written into json/yaml/toml/ini files.</p>
</li>
</ul>
<p>So with the help of Pydantic's <code>BaseSettings</code> we can combine both. Let's dive in to see it in action.</p>
<p>Pydantic's <code>BaseSettings</code> already has support for reading from environment variables, .env files, etc. (follow the original docs for more details - <a target="_blank" href="https://docs.pydantic.dev/usage/settings/">Settings management - pydantic</a>). This way we can have some fields (or class attributes) hardcoded &amp; others populated from environment variables, all under one common Pydantic model. Let's see an example.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Pydantic v1 import; in Pydantic v2, BaseSettings moved to the pydantic-settings package</span>
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseSettings, Field

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Config</span>(<span class="hljs-params">BaseSettings</span>):</span>
    env_variable1: str = Field(description=<span class="hljs-string">"some description"</span>)
    env_variable: str = Field(description=<span class="hljs-string">"some description"</span>)
    hard_coded1: str = <span class="hljs-string">"some hardcoded value"</span>
    hard_coded2: int = <span class="hljs-number">999</span>
</code></pre>
<p>From the above example, the first two fields will be automatically populated from matching environment variables, the next two are the hard coded variables.</p>
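<p>To make the idea concrete without installing anything, here is a rough standard-library mimic of that behaviour. It is not how Pydantic is implemented; the <code>from_env</code> helper &amp; the upper-case variable names are my own assumptions for this sketch.</p>

```python
import os
from dataclasses import dataclass, field


def from_env(name):
    """Build a dataclass field whose default is read from the environment variable `name`."""
    return field(default_factory=lambda: os.environ[name])


@dataclass
class Config:
    # populated from the environment when the object is created
    env_variable1: str = from_env("ENV_VARIABLE1")
    env_variable: str = from_env("ENV_VARIABLE")
    # hardcoded values
    hard_coded1: str = "some hardcoded value"
    hard_coded2: int = 999


# Simulate the environment for this demo only.
os.environ["ENV_VARIABLE1"] = "from-env-1"
os.environ["ENV_VARIABLE"] = "from-env-2"

config = Config()
print(config.env_variable1, config.hard_coded2)  # from-env-1 999
```

<p>In this sketch a missing variable raises a <code>KeyError</code> as soon as the config is created, so a misconfigured environment fails early instead of somewhere deep in the application.</p>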
<blockquote>
<p><strong>Note:</strong> Since this is based on Pydantic, you can add all sorts of regular Pydantic validators. See the original docs (above) to see all the possibilities.</p>
</blockquote>
<h2 id="heading-3-one-config-to-rule-them-all">3. One Config to rule them all</h2>
<p>Now coming to the most important part - <em>How to use one config for all possible environments (e.g. DEV, QA, PROD, etc).</em> Actually, it is quite easy: just create a Pydantic <code>BaseSettings</code> model for each environment.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseSettings, Field

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LocalSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    env_variable1: str = Field(description=<span class="hljs-string">"some description"</span>)
    env_variable: str = Field(description=<span class="hljs-string">"some description"</span>)
    hard_coded1: str = <span class="hljs-string">"some hardcoded value"</span>
    hard_coded2: int = <span class="hljs-number">999</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DEVSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    env_variable1: str = Field(description=<span class="hljs-string">"some description"</span>)
    env_variable: str = Field(description=<span class="hljs-string">"some description"</span>)
    hard_coded1: str = <span class="hljs-string">"some hardcoded value"</span>
    hard_coded2: int = <span class="hljs-number">999</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PRODSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    env_variable1: str = Field(description=<span class="hljs-string">"some description"</span>)
    env_variable: str = Field(description=<span class="hljs-string">"some description"</span>)
    hard_coded1: str = <span class="hljs-string">"some hardcoded value"</span>
    hard_coded2: int = <span class="hljs-number">999</span>
</code></pre>
<p>Since the underlying environment variables will be different for all environments, so will be the populated fields.</p>
<h2 id="heading-4-how-to-actually-consume-the-config-in-code">4. How to actually consume the config in code</h2>
<p>Now that we have a single source of config, the next part is how to actually consume it in code. Again, there is nothing novel here. What I do is create a function which takes the underlying environment as a parameter &amp; returns the respective config. Also, I usually store all this in config.py.</p>
<pre><code class="lang-python"><span class="hljs-comment"># logic in config.py</span>
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseSettings, Field

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LocalSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    ...  <span class="hljs-comment"># same as above</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DEVSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    ...  <span class="hljs-comment"># same as above</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PRODSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    ...  <span class="hljs-comment"># same as above</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_config</span>(<span class="hljs-params">environment: str</span>):</span>
    match environment:
        case <span class="hljs-string">"local"</span>:
            config = LocalSettings()
        case <span class="hljs-string">"dev"</span>:
            config = DEVSettings()
        case <span class="hljs-string">"prod"</span>:
            config = PRODSettings()
        case _:
            <span class="hljs-comment"># fail loudly instead of returning an unbound variable</span>
            <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"unknown environment: <span class="hljs-subst">{environment}</span>"</span>)
    <span class="hljs-keyword">return</span> config


<span class="hljs-comment"># in some other part of your code/library</span>
<span class="hljs-keyword">import</span> os

<span class="hljs-keyword">from</span> config <span class="hljs-keyword">import</span> get_config
config = get_config(<span class="hljs-string">"dev"</span>)
some_variable = config.env_variable

<span class="hljs-comment"># NOTE - you don't even need to hardcode the environment parameter. What I do is simply create an environment variable for the environment itself &amp; use it in the function.</span>
config = get_config(os.environ[<span class="hljs-string">"environment"</span>])
some_variable = config.env_variable
<span class="hljs-comment"># Now this makes it truly environment agnostic &amp; the exact same code will work everywhere.</span>
</code></pre>
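<p>The <code>match</code> statement needs Python 3.10 or newer. On older versions, a plain dictionary lookup does the same job. The following is only a sketch: simple dataclasses stand in for the Pydantic models so it runs on its own, &amp; the <code>SETTINGS_BY_ENVIRONMENT</code> name is my own.</p>

```python
from dataclasses import dataclass


# Stand-ins for the Pydantic models above, so this sketch runs on its own.
@dataclass
class LocalSettings:
    name: str = "local"


@dataclass
class DEVSettings:
    name: str = "dev"


@dataclass
class PRODSettings:
    name: str = "prod"


# Map each environment name to the settings class that describes it.
SETTINGS_BY_ENVIRONMENT = {
    "local": LocalSettings,
    "dev": DEVSettings,
    "prod": PRODSettings,
}


def get_config(environment: str):
    """Return a settings object for the given environment, or fail loudly."""
    try:
        settings_class = SETTINGS_BY_ENVIRONMENT[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    return settings_class()


print(get_config("dev").name)  # dev
```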
<p><strong>What if you have multiple scopes with multiple config requirements?</strong> The pattern remains the same. Add as many configs as required as Pydantic <code>BaseSettings</code> models &amp; return them.</p>
<p>Let me explain with a simple use case of mine. My application needs to support multiple languages. About 80% of the code is generic, but there are a few pieces of logic which are language dependent &amp; change based on the underlying language. So I just define them in their respective language's Pydantic model. Let's see an example.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseSettings

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">EnglishConfig</span>(<span class="hljs-params">BaseSettings</span>):</span>
    variable1: str = <span class="hljs-string">"some thing"</span>
    variable: list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
    hard_coded1: dict = {}
    hard_coded2: int = <span class="hljs-number">999</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">FrenchConfig</span>(<span class="hljs-params">BaseSettings</span>):</span>
    variable1: str = <span class="hljs-string">"some thing"</span>
    variable: list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
    hard_coded1: dict = {}
    hard_coded2: int = <span class="hljs-number">999</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">HindiConfig</span>(<span class="hljs-params">BaseSettings</span>):</span>
    variable1: str = <span class="hljs-string">"some thing"</span>
    variable: list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
    hard_coded1: dict = {}
    hard_coded2: int = <span class="hljs-number">999</span>
</code></pre>
<p>Now, exactly as with the environment settings above, we can make language-dependent logic language agnostic.</p>
<h2 id="heading-5-bringing-all-things-together">5. Bringing all things together</h2>
<p>Finally, let me show you what my final config.py looks like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Any, Optional

<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseSettings

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">EnglishConfig</span>(<span class="hljs-params">BaseSettings</span>):</span>
    regex_pattern_alphanumeric: Optional[str] = <span class="hljs-string">"[^0-9a-z/s]"</span>
    list_of_missing_must_include_words = [<span class="hljs-string">"Missing"</span>, <span class="hljs-string">"Must include"</span>]
    list_of_name_prefixes = [<span class="hljs-string">"dr"</span>, <span class="hljs-string">"mr"</span>, <span class="hljs-string">"mrs"</span>, <span class="hljs-string">"jr"</span>, <span class="hljs-string">"sr"</span>]

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SpanishConfig</span>(<span class="hljs-params">BaseSettings</span>):</span>
    regex_pattern_alphanumeric: Optional[str] = <span class="hljs-string">"[^0-9a-záéíñóúü/s]"</span>
    list_of_name_prefixes = [<span class="hljs-string">"sres"</span>, <span class="hljs-string">"señora"</span>]
    list_of_missing_must_include_words = [<span class="hljs-string">"Falta"</span>, <span class="hljs-string">"Debe incluir lo siguiente"</span>]

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">FrenchConfig</span>(<span class="hljs-params">BaseSettings</span>):</span>
    regex_pattern_alphanumeric: Optional[str] = <span class="hljs-string">"[^0-9a-z\u00C0-\u017F/s]"</span>
    list_of_missing_must_include_words = [<span class="hljs-string">"Termes manquants"</span>, <span class="hljs-string">"Doit inclure"</span>]
    list_of_name_prefixes = [<span class="hljs-string">"m"</span>, <span class="hljs-string">"madame"</span>]

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LocalEnvironmentSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    common_root_folder: Optional[str] = <span class="hljs-string">"/tmp"</span>
    logging_level: Optional[int] | Optional[tuple] = (<span class="hljs-number">10</span>,<span class="hljs-number">10</span>,<span class="hljs-number">10</span>,)
    status_url: str = <span class="hljs-string">"https://some-url-dev.com"</span>
    SOME_SECRET: str 
    CONNECTION_STRING: str

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DevEnvironmentSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    common_root_folder: Optional[str] = <span class="hljs-string">"/tmp"</span>
    logging_level: Optional[int] | Optional[tuple] = (<span class="hljs-number">10</span>,<span class="hljs-number">10</span>,<span class="hljs-number">10</span>,)
    status_url: str = <span class="hljs-string">"https://some-url-dev.com"</span>
    SOME_SECRET: str 
    CONNECTION_STRING: str

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">QAEnvironmentSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    common_root_folder: Optional[str] = <span class="hljs-string">"/tmp"</span>
    logging_level: Optional[int] | Optional[tuple] = (<span class="hljs-number">10</span>,<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,)
    status_url: str = <span class="hljs-string">"https://some-url-qa.com"</span>
    SOME_SECRET: str 
    CONNECTION_STRING: str

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PRODEnvironmentSettings</span>(<span class="hljs-params">BaseSettings</span>):</span>
    common_root_folder: Optional[str] = <span class="hljs-string">"/tmp"</span>
    logging_level: Optional[int] | Optional[tuple] = (<span class="hljs-number">10</span>,<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,)
    status_url: str = <span class="hljs-string">"https://some-url-prod.com"</span>
    SOME_SECRET: str 
    CONNECTION_STRING: str
<span class="hljs-comment"># NOTE - See how I have changed the `status_url` &amp; `logging_level` across environments &amp; `regex_pattern_alphanumeric` across languages.</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_config</span>(<span class="hljs-params">language: str, environment: str</span>):</span>
    <span class="hljs-comment"># setting language based config</span>
    match language:
        case <span class="hljs-string">"en"</span>:
            language_config = EnglishConfig()
        case <span class="hljs-string">"es"</span>:
            language_config = SpanishConfig()
        case <span class="hljs-string">"fr"</span>:
            language_config = FrenchConfig()
        case _:
            <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"given language: <span class="hljs-subst">{language}</span> must be one of en, es, fr"</span>)
    <span class="hljs-comment"># setting environment based config</span>
    match environment:
        case <span class="hljs-string">"local"</span>:
            environment_settings = LocalEnvironmentSettings()
        case <span class="hljs-string">"dev"</span>:
            environment_settings = DevEnvironmentSettings()
        case <span class="hljs-string">"qa"</span>:
            environment_settings = QAEnvironmentSettings()
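        case <span class="hljs-string">"prod"</span>:
            environment_settings = PRODEnvironmentSettings()
        case _:
            <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"given environment: <span class="hljs-subst">{environment}</span> must be one of local, dev, qa, prod"</span>)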

    <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">GlobalConfig</span>(<span class="hljs-params">BaseSettings</span>):</span>
        global_language_config = language_config
        global_environment_settings = environment_settings

    <span class="hljs-keyword">return</span> GlobalConfig()
</code></pre>
<blockquote>
<p>Note: As with my other blogs, this idea is not limited to Python but can be used anywhere. I have only used Python to explain it. With a few modifications, the same approach can be applied anywhere.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[The practical guide to write useful comments]]></title><description><![CDATA[1. The Need
I don't think anyone will agree that writing comments in your code are a waste of time and effort. Then Why do most people don't really write good & useful comments? Why do they willingly or unwillingly make their own life & experience di...]]></description><link>https://importidea.dev/the-practical-guide-to-write-useful-comments</link><guid isPermaLink="true">https://importidea.dev/the-practical-guide-to-write-useful-comments</guid><category><![CDATA[Programming Blogs]]></category><category><![CDATA[best practices]]></category><category><![CDATA[Programming Tips]]></category><category><![CDATA[vscode extensions]]></category><category><![CDATA[comments]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Sun, 19 Jun 2022 09:02:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1655626833098/VjDxTwoOB.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-1-the-need">1. The Need</h1>
<p>I don't think anyone will agree that writing comments in your code is a waste of time and effort. Then why don't most people write good &amp; useful comments? Why do they willingly or unwillingly make their own life &amp; experience difficult?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655611562819/uYMAs8Koz.png" alt="image.png" />
You must all have seen this meme, laughed &amp; then moved on. But this is a very serious problem. Here is one more:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655611956914/nLVRJ_MkA.png" alt="image.png" />
So why do we fall into such a pitfall despite knowing pretty well that it lies ahead? Here are some of the reasons that I think are the primary contributors:</p>
<ul>
<li>The comments don't go hand in hand with the code &amp; look out of place.</li>
<li>In an already massive code base, it becomes even more difficult to navigate them. </li>
<li>The rush to commit the code.</li>
<li>Some unspoken realities of human behaviour, like "I wrote the code so beautifully that it is self-explanatory" (yeah, maybe to yourself, but not so obvious to others) or "If I am just going to write comments, then when will I write the actual code?" (writing comments is not an afterthought; you should write them along with the code).</li>
<li>Finally, some people are just lazy. Nothing can be done about them, as they are already on a path which will surely make their life difficult in future.</li>
</ul>
<p>So not writing comments is more of a philosophical or even behavioural problem than a technical or knowledge problem.</p>
<h1 id="heading-2-how-to-write-good-comments">2. How to write good comments</h1>
<p>There are many good blogs out there (<a target="_blank" href="https://stackoverflow.blog/2021/12/23/best-practices-for-writing-code-comments/">this</a> one is very good) which explain what makes good content for comments. My focus is on practicality &amp; ease of access. The following are the steps that you must follow for writing good comments.</p>
<h2 id="heading-21-philosophical-change">2.1 Philosophical change</h2>
<p>This can be either very easy or very difficult to adopt. It's up to you. The ideal would be to have more lines of comments than of actual code. One more thing that will help: while writing code, think that you are writing it not for yourself but for others.</p>
<h2 id="heading-22-tagging-comments">2.2 Tagging comments</h2>
<p>Just writing comments (even if their content is good) is usually not that helpful, because navigating them becomes difficult. So for this, I have come up with a system of 'Tagging Comments'. It is nothing fancy: you just have to start a comment with its type. Here are the types that I use:</p>
<ul>
<li>ANCHOR - Used to indicate a section in your file</li>
<li>TODO - An item that is awaiting completion, needs to be addressed, etc. in future</li>
<li>FIXME - An item that requires an immediate bugfix</li>
<li>NOTE - An important note for a specific code section to fetch the attention of a fellow developer </li>
<li>REVIEW - An item that requires additional review, very useful during pull requests or merges</li>
<li>DEPRECATED - An item which is no longer being used &amp; will be removed in future</li>
<li>WARNING - A warning about the following item; if it is not heeded, something bad can happen</li>
<li>SECTION - Used to define a region</li>
<li>LINK - Used to link to a file that can be opened within the editor (See 'Link Anchors')</li>
<li>EG - An example of what we should expect in the following item</li>
</ul>
<p>Let me show some practical examples of how they should be used</p>
<pre><code class="lang-python"><span class="hljs-comment"># 1. Without comments</span>
<span class="hljs-keyword">with</span> open(file_path) <span class="hljs-keyword">as</span> data_file:
    <span class="hljs-keyword">yield</span> <span class="hljs-keyword">from</span> reader(data_file)

<span class="hljs-comment"># 2. With comments</span>
<span class="hljs-comment"># TODO - which encoding to use?</span>
<span class="hljs-keyword">with</span> open(file_path) <span class="hljs-keyword">as</span> data_file:
    <span class="hljs-comment"># yielding only one unit of data at a time, lazily</span>
    <span class="hljs-keyword">yield</span> <span class="hljs-keyword">from</span> reader(data_file)
</code></pre>
<p>Any linter will give a warning in the above example as no encoding was provided. But I didn't know the answer straight away, so I used a <code>TODO</code> tag to resolve it once I know the answer.</p>
<pre><code class="lang-python"><span class="hljs-comment"># saving the current iteration's payload</span>
<span class="hljs-comment"># FIXME - Input request is also sending a set in the payload, which it should not, as a set cannot be</span>
<span class="hljs-comment"># saved as JSON. Temporary fix is to save it as a text file</span>
DataWriter.str_to_txt(
    str(req_body),
    <span class="hljs-string">f"/research/sessions/<span class="hljs-subst">{uuid}</span>/payload.txt"</span>,
)
</code></pre>
<p>The above code snippet will still work, but its behaviour is incorrect &amp; it must be changed. That's why I used a <code>FIXME</code> tag. Now, how is it different from the <code>TODO</code> tag? <code>FIXME</code> should be used when you know for certain that the following item will turn into a bug, while <code>TODO</code> covers a wider variety of items which may not necessarily be bugs or code-breaking.</p>
<pre><code class="lang-python"><span class="hljs-comment"># to let joblib release all workers gracefully from memory.</span>
<span class="hljs-comment"># NOTE - It's only needed here because the same joblib workers from cleaning ops are used by</span>
<span class="hljs-comment"># entity extraction ops &amp; RQ does not need joblib's multiprocessing</span>
time.sleep(<span class="hljs-number">5</span>)
</code></pre>
<p>A <code>NOTE</code> tag is used to bring the attention of fellow developers to the following item. This type of comment has more importance than a regular comment.</p>
<p>The <code>SECTION</code> tag should be used to group pieces of business logic that belong in a common bucket. This becomes very helpful in the case of a pipeline.</p>
<pre><code class="lang-python"><span class="hljs-comment"># NOTE - Following ETL pipeline will only work with prod configuration</span>
<span class="hljs-comment"># SECTION - Part A: Extract </span>
collect_data(source)
clean_data(raw_data)
dump_clean_data(clean_data)

<span class="hljs-comment"># SECTION - Part B: Transform</span>
read_clean_data(clean_data)
transform_data(data)

<span class="hljs-comment"># SECTION - Part C: Load</span>
load_data_to_db(transformed_data)

<span class="hljs-comment"># SECTION - Part D: Post-processing</span>
clean_environment()
</code></pre>
<p>Everyone has the habit of commenting out certain business logic or some code snippet. This is not wrong: there can be a valid reason to keep the commented-out snippet. But it becomes extremely confusing for others, as they might not be aware of the reason. Ultimately, this makes the code difficult to maintain. So it's better to add a <code>DEPRECATED</code> tag along with the reason.</p>
<pre><code class="lang-python"><span class="hljs-comment"># DEPRECATED - blob storage as a source is not required for now &amp; may be removed completely in future.</span>
<span class="hljs-comment">#blob_data_source = [unit_source for unit_source in data if unit_source.is_present()]</span>
</code></pre>
<p>For <code>EG</code> tag I follow a couple of rules of thumb,</p>
<ol>
<li>If you think certain logic is not clear, then add an <code>EG</code> comment showing what a typical value will be.</li>
<li>For every nested loop I write an example regarding what to expect in the next level. </li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># EG - "Random123 hello @#" will become "andomhello" (spaces are stripped too)</span>
regex_pattern = <span class="hljs-string">"[^a-z]"</span>
result = re.sub(regex_pattern, <span class="hljs-string">""</span>, text)

<span class="hljs-keyword">for</span> key, val <span class="hljs-keyword">in</span> random_dict.items():
    <span class="hljs-comment"># EG - another_random_dict[inner_key] = 'some str'</span>
    <span class="hljs-keyword">for</span> inner_key <span class="hljs-keyword">in</span> another_random_dict:
        <span class="hljs-keyword">if</span> isinstance(another_random_dict[inner_key], str):
            some_list.append(another_random_dict[inner_key])
</code></pre>
<h2 id="heading-23-navigate-andamp-automate-comments">2.3 Navigate &amp; automate comments</h2>
<p>One of the reasons I mentioned above for why people don't want to write comments is the difficulty of navigating them. If I didn't have a solution for this problem, there would be no point to this blog 😅. So let me introduce you to an amazing VS Code extension - <a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=ExodiusStudios.comment-anchors">Comment Anchors</a>, which I use heavily for exactly this use case.</p>
<ul>
<li>It searches for the tags &amp; creates bookmarks, so that you can quickly jump to them using its tree structure. Also, it can act as a one-stop place to track all important tags like FIXME, TODO, REVIEW, etc. Go through its <a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=ExodiusStudios.comment-anchors">docs</a> for more info.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655623632691/WBJDRwbi6.png" alt="image.png" /></p>
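<p>If you are not on VS Code, the core idea is easy to approximate. The following is a minimal sketch of a tag scanner in plain Python; it is not how the extension works internally, &amp; the regex, the function name &amp; the sample source are my own assumptions.</p>

```python
import re

# Tags from the list in section 2.2; the pattern itself is my own choice.
TAGS = ("ANCHOR", "TODO", "FIXME", "NOTE", "REVIEW", "DEPRECATED",
        "WARNING", "SECTION", "LINK", "EG")
TAG_PATTERN = re.compile(r"#\s*(" + "|".join(TAGS) + r")\b\s*-?\s*(.*)")


def find_tagged_comments(source):
    """Return (line_number, tag, text) for every tagged comment in source."""
    found = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = TAG_PATTERN.search(line)
        if match:
            found.append((line_number, match.group(1), match.group(2).strip()))
    return found


# Hypothetical source text to scan; it is never executed, only searched.
sample = '''# TODO - which encoding to use?
with open(file_path) as data_file:
    # yielding one record at a time
    yield from reader(data_file)
# FIXME - payload contains a set
'''

for hit in find_tagged_comments(sample):
    print(hit)
```

<p>Untagged comments are ignored, so the result is a short list of only the items that need attention.</p>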
<ul>
<li>The extension comes with some default tags to bookmark, and you can customize the list. The following is the one I use (refer to the docs on how to customize it for your specific needs):</li>
</ul>
<pre><code class="lang-json"><span class="hljs-string">"commentAnchors.tags.list"</span>: [
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"ANCHOR"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"default"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#A8C023"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"file"</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"TODO"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"blue"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#3ea8ff"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"FIXME"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"red"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#F44336"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
        <span class="hljs-attr">"isBold"</span>: <span class="hljs-literal">true</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"NOTE"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"orange"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#FFB300"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"file"</span>,
        <span class="hljs-attr">"styleComment"</span>: <span class="hljs-literal">true</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"REVIEW"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"green"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#64DD17"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"SECTION"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"blurple"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#896afc"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
        <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"region"</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"LINK"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"#2ecc71"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#2ecc71"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
        <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"link"</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"DEPRECATED"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"#B22222"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#B22222"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
        <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"anchor"</span>,
        <span class="hljs-attr">"isBold"</span>: <span class="hljs-literal">true</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"WARNING"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"#B22222"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#B22222"</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
        <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"anchor"</span>,
        <span class="hljs-attr">"isBold"</span>: <span class="hljs-literal">true</span>
    },
    {
        <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"EG"</span>,
        <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"#00FFFF"</span>,
        <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#eb667d"</span>,
        <span class="hljs-attr">"backgroundColor"</span>: <span class="hljs-string">"rgba(49, 184, 79, 0.2)"</span>,
        <span class="hljs-attr">"borderStyle"</span>: <span class="hljs-string">"1px solid #23b2ea"</span>,
        <span class="hljs-attr">"borderRadius"</span>: <span class="hljs-number">6</span>,
        <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
        <span class="hljs-attr">"styleComment"</span>: <span class="hljs-literal">true</span>
    }
],
</code></pre>
<p>You can add this to your VS Code <code>settings.json</code> file.</p>
<h1 id="heading-3-how-to-write-a-good-message-commit-message">3. How to write a good commit message</h1>
<ul>
<li>Writing a good commit message is also extremely important. The same pitfalls as above apply here too. </li>
<li>Writing something like <code>updated xyz.py</code>, <code>deleted abc.txt</code>, <code>moved some_file.js</code> or <code>bug fix</code> is very bad practice. It provides no context &amp; makes it difficult to track changes using <code>git blame</code>. </li>
<li>How do you follow common standards across your team? What should those standards be? There is no need to reinvent the wheel: just use the <a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=vivaxy.vscode-conventional-commits">Conventional Commits</a> extension.</li>
<li>It is based on the excellent <a target="_blank" href="https://www.conventionalcommits.org/en/v1.0.0/">Conventional Commits 1.0.0</a> spec. You can go through the spec (it is definitely a good read).</li>
<li>Trust me, it's very easy, intuitive &amp; fun to use (it also supports gitmoji 🤩). Follow its docs to learn more.  </li>
</ul>
<p><img src="https://github.com/vivaxy/vscode-conventional-commits/raw/master/assets/docs/demo.gif" alt="image.png" /></p>
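<p>If you want a feel for what the spec asks of a commit header, here is a small Python sketch that checks the <code>type(scope)!: description</code> shape. It is my own simplification and not part of the extension or the spec: the <code>parse_commit_header</code> helper is hypothetical, it only looks at the first line, and it accepts lower-case types only.</p>
<pre><code class="lang-python">import re

# Simplified header shape: type, optional (scope), optional "!", then ": description".
HEADER_PATTERN = re.compile(r"^([a-z]+)(\(([^()\s]+)\))?(!)?: (\S.*)$")


def parse_commit_header(header):
    """Split a commit header into its parts, or return None if it does not fit."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        return None
    commit_type, _, scope, breaking, description = match.groups()
    return {
        "type": commit_type,
        "scope": scope,
        "breaking": breaking is not None,
        "description": description,
    }


print(parse_commit_header("fix(parser): handle empty input"))
print(parse_commit_header("feat!: drop Python 3.7 support"))
print(parse_commit_header("updated xyz.py"))
</code></pre>
<p>The first two headers parse into a type, an optional scope and a description, while <code>updated xyz.py</code> comes back as <code>None</code> because it says nothing about what kind of change it is.</p>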
]]></content:encoded></item><item><title><![CDATA[How to SSH login password free from Windows, Linux, Mac]]></title><description><![CDATA[1. Linux OS & MAC OS
Run the following command on your bash (or any alternative like zsh, fish, etc) to set up auto ssh login.
ssh-copy-id -i "<Public key path>" "<user.name@email.com@host_ip_adress>"

Example: 

Output:

2. Windows
Run the following...]]></description><link>https://importidea.dev/ssh-login-password-free-from-windows-linux-mac</link><guid isPermaLink="true">https://importidea.dev/ssh-login-password-free-from-windows-linux-mac</guid><category><![CDATA[software development]]></category><category><![CDATA[Linux]]></category><category><![CDATA[tips]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Tue, 01 Feb 2022 17:02:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1643734927269/cygFz1NEBh.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-1-linux-os-andamp-mac-os">1. Linux OS &amp; MAC OS</h1>
<p>Run the following command in bash (or any alternative like zsh, fish, etc.) to set up password-free SSH login.</p>
<pre><code class="lang-bash">ssh-copy-id -i <span class="hljs-string">"&lt;Public key path&gt;"</span> <span class="hljs-string">"&lt;user.name@email.com@host_ip_address&gt;"</span>
</code></pre>
<p><strong>Example: </strong>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643733668913/hPPU_frxO.png" alt="Linux example" /></p>
<p><strong>Output:</strong>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643734030037/MPMD6Lp9r.png" alt="Screenshot 2021-10-29 132602.png" /></p>
<h1 id="heading-2-windows">2. Windows</h1>
<p>Run the following commands in a local PowerShell window, replacing the user and host name as appropriate, to copy your local public key to the SSH host.</p>
<pre><code class="lang-powershell">$USER_AT_HOST="your-user-name-on-host@hostname"
$PUBKEYPATH="$HOME\.ssh\id_rsa.pub"

$pubKey=(Get-Content "$PUBKEYPATH" | Out-String); ssh "$USER_AT_HOST" "mkdir -p ~/.ssh &amp;&amp; chmod 700 ~/.ssh &amp;&amp; echo '${pubKey}' &gt;&gt; ~/.ssh/authorized_keys &amp;&amp; chmod 600 ~/.ssh/authorized_keys"
</code></pre>
<p><strong>Example: </strong>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643734429263/2WxRmpmJT.png" alt="Screenshot 2021-10-29 132457.png" /></p>
<p><strong>Output:</strong>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1643734641245/n45GsJTje.png" alt="image.png" /></p>
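<p>Both approaches end up doing the same thing on the host: make sure <code>~/.ssh</code> exists with tight permissions and append your public key to <code>authorized_keys</code>. Here is a rough Python sketch of that remote-side step, only to show the logic. The <code>authorize_key</code> function is hypothetical and unlike the one-liner above it skips a key that is already present; on Windows hosts the permission bits behave differently.</p>
<pre><code class="lang-python">from pathlib import Path


def authorize_key(public_key, ssh_dir):
    """Append public_key to ssh_dir/authorized_keys unless it is already there."""
    ssh_dir = Path(ssh_dir)
    ssh_dir.mkdir(parents=True, exist_ok=True)  # mkdir -p ~/.ssh
    ssh_dir.chmod(0o700)                        # chmod 700 ~/.ssh

    auth_file = ssh_dir / "authorized_keys"
    text = auth_file.read_text() if auth_file.exists() else ""
    key = public_key.strip()
    if key in text.splitlines():
        return False  # nothing to do, the key is already authorized

    # Start on a fresh line if the file does not end with a newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with auth_file.open("a") as handle:
        handle.write(separator + key + "\n")
    auth_file.chmod(0o600)                      # chmod 600 ~/.ssh/authorized_keys
    return True
</code></pre>
<p>Calling it twice with the same key returns <code>True</code> the first time and <code>False</code> the second, so the file does not collect duplicate entries.</p>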
]]></content:encoded></item><item><title><![CDATA[How to merge a specific directory or file in Git]]></title><description><![CDATA[Think of the following scenarios:

There might be two branches with active development and one of the branch needs some specific (updated) file or directory.
You don't want to merge the complete branch but just need some specific file(s).


Similarly...]]></description><link>https://importidea.dev/how-to-merge-a-specific-directory-or-file-in-git</link><guid isPermaLink="true">https://importidea.dev/how-to-merge-a-specific-directory-or-file-in-git</guid><category><![CDATA[Git]]></category><category><![CDATA[tricks]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Mon, 27 Sep 2021 02:43:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1632710580967/OJS5cVfzz.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Think of the following scenarios:</p>
<ul>
<li>There might be two branches under active development, and one of the branches needs some specific (updated) file or directory.</li>
<li>You don't want to merge the complete branch but just need some specific file(s).</li>
</ul>
<blockquote>
<p>Similarly, there might be 'n' number of scenarios where the common theme is that, instead of merging the complete branch, all you need to do is merge specific file(s)/directory(s). </p>
</blockquote>
<p>A smart trick, which I like to call '<strong>Selective checkout</strong>', does the intended job:</p>
<pre><code class="lang-bash">git checkout destination
git checkout <span class="hljs-built_in">source</span> sub-directory/
git commit -am <span class="hljs-string">"Message."</span>
git pull --rebase
git push
</code></pre>
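<p>With this trick, you can bring specific files or directories over from another branch without merging the complete branch. Note that the checked-out paths take the source branch's version as-is rather than going through a three-way merge.</p>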
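<p>If you do this often, the same sequence can be scripted. Below is a minimal Python sketch under a few assumptions of mine: the <code>selective_checkout_commands</code> and <code>run_commands</code> helpers are hypothetical, the branch and path names are placeholders, I pass <code>--</code> before the paths so Git cannot confuse a path with a branch name, and I use plain <code>-m</code> because the checkout already stages the copied files.</p>
<pre><code class="lang-python">import subprocess


def selective_checkout_commands(destination, source, paths, message):
    """Build the git commands that copy paths from source onto destination."""
    return [
        ["git", "checkout", destination],
        # "--" marks everything after it as a path, never a branch name.
        ["git", "checkout", source, "--", *paths],
        ["git", "commit", "-m", message],
        ["git", "pull", "--rebase"],
        ["git", "push"],
    ]


def run_commands(commands, repo_dir="."):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


commands = selective_checkout_commands(
    "destination", "source", ["sub-directory/"], "chore: sync sub-directory from source"
)
for command in commands:
    print(" ".join(command))
</code></pre>
<p>Building the commands separately from running them lets you print and review the sequence first; call <code>run_commands(commands)</code> inside the repository only once you are happy with it.</p>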
]]></content:encoded></item><item><title><![CDATA[The Ultimate VS Code setup guide 🐱‍💻]]></title><description><![CDATA[The key factor for becoming a productive powerhouse and a good developer are:

Good technical & fundamental language knowledge
Understanding software designing 
How to make the most use of an IDE. 

The third point may seem a little odd to many, in f...]]></description><link>https://importidea.dev/the-ultimate-vs-code-setup-guide</link><guid isPermaLink="true">https://importidea.dev/the-ultimate-vs-code-setup-guide</guid><category><![CDATA[Visual Studio Code]]></category><category><![CDATA[Python]]></category><category><![CDATA[Git]]></category><category><![CDATA[vscode extensions]]></category><category><![CDATA[Productivity]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Sun, 26 Sep 2021 03:47:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1630063047958/4K9R0yA4e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>The key factor for becoming a productive powerhouse and a good developer are:</strong></p>
<ol>
<li>Good technical &amp; fundamental language knowledge</li>
<li>An understanding of software design </li>
<li>Knowing how to make the most of an IDE. </li>
</ol>
<p>The third point may seem a little odd to many; in fact, it is mostly undervalued. But to become a productive powerhouse 💪, it is by far the most important point to master.</p>
<blockquote>
<p>I have spent a lot of time customizing my VS Code setup and now I feel I have reached a point where I can confidently present it to the world 🌍 as the most complete setup. 🧰</p>
<p>Note: This will be a long blog, but I have divided it into parts so that readers can easily jump to whatever interests them.</p>
</blockquote>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://twitter.com/Love2Code/status/1427992702856142853">https://twitter.com/Love2Code/status/1427992702856142853</a></div>
<p>Everyone knows how powerful VS Code is, and one of its greatest features is its out-of-the-box support for a lot of languages. But tweaking the default setup will boost your productivity 🚀 to the next level. </p>
<h1 id="table-of-contents">Table of contents</h1>
<ol>
<li><a class="post-section-overview" href="#part-1-python-development">Part 1: Python Development </a></li>
<li><a class="post-section-overview" href="#part-2-git">Part 2: Git</a></li>
<li><a class="post-section-overview" href="#part-3-productivity-boosters">Part 3: Productivity boosters </a></li>
<li><a class="post-section-overview" href="#part-4-customization">Part 4: Customization</a></li>
</ol>
<hr />
<h2 id="part-1-python-development">Part 1: Python development</h2>
<h4 id="python-extension-by-microsofthttpsmarketplacevisualstudiocomitemsitemnamems-pythonpython"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=ms-python.python"><strong>Python (extension)</strong> by Microsoft</a>:</h4>
<ul>
<li>A compulsory requirement for Python development.</li>
<li>Provides language, debugging, testing, etc. support.</li>
<li>Latest updates have even brought support for the Jupyter notebooks. </li>
</ul>
<p><strong>Severity: Must</strong></p>
<ul>
<li><h4 id="pylance-extension-by-microsofthttpsmarketplacevisualstudiocomitemsitemnamems-pythonvscode-pylance"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance"><strong>Pylance (extension)</strong> by Microsoft</a>:</h4>
<ul>
<li>Works alongside the Python extension to provide performant language support.</li>
<li>It adds tons of features to the bare-bones Python extension, like - <ul>
<li>Docstrings</li>
<li>Signature help, with type information</li>
<li>Parameter suggestions</li>
<li>Code completion</li>
<li>Auto-imports (as well as add and remove import code actions)</li>
<li>As-you-type reporting of code errors and warnings (diagnostics)</li>
<li>Code outline and navigation</li>
<li>Type checking mode</li>
<li>IntelliCode compatibility</li>
</ul>
</li>
</ul>
</li>
</ul>
<p><strong>Severity: Must</strong></p>
<h4 id="visual-studio-intellicode-extension-by-microsofthttpsmarketplacevisualstudiocomitemsitemnamevisualstudioexptteamvscodeintellicode"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=VisualStudioExptTeam.vscodeintellicode"><strong>Visual Studio IntelliCode (extension)</strong> by Microsoft</a>:</h4>
<ul>
<li>It provides <strong>AI-assisted</strong> development features for Python, TypeScript/JavaScript and Java developers with insights based on understanding your code context combined with machine learning.</li>
<li>Its AI predictions work fairly well and do not intrude on your development. </li>
</ul>
<p><strong>Severity: Must</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629805771145/jNY3as5nm.gif" alt="python-intellicode.gif" /></p>
<p>There are many other extensions that may not fall under the 'must' severity but are still very helpful, so let's continue.</p>
<h4 id="sonarlint-extension-by-sonarsourcehttpsmarketplacevisualstudiocomitemsitemnamesonarsourcesonarlint-vscodeandssrfalseoverview"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=SonarSource.sonarlint-vscode&amp;ssr=false#overview"><strong>SonarLint (extension)</strong> by SonarSource</a>:</h4>
<ul>
<li>It is a linter, or static analysis tool (free and developed by a very respected company in this domain), that lets you fix coding issues before they exist by analyzing the code. </li>
<li>It can track Bugs 🐛 and Security Vulnerabilities as you write code. But the best part is the documentation it shows for every code smell, without you even needing to commit the code. </li>
<li>The installation can be a little tricky, as it needs Java to run. By default it handles the whole installation itself if sufficient write permission is available; otherwise you have to do it manually. But trust me, once the installation is done, it will help you &amp; your team members stick to the best practices of Python coding and maintain consistent code. 🫂</li>
</ul>
<p><strong>Severity: Essential</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629808707489/c0r77Y71r.gif" alt="sonarlint-rule-description.gif" /></p>
<h4 id="pylint-native-supporthttpsgithubcompycqapylint"><a target="_blank" href="https://github.com/PyCQA/pylint"><strong>Pylint</strong> (native support)</a>:</h4>
<ul>
<li>Pylint is not an extension but a dedicated linter that can be used independently, without VS Code. VS Code does support it natively, though; more <a target="_blank" href="https://code.visualstudio.com/docs/python/linting#_pylint">here</a>. </li>
<li>It is a Python static code analysis tool that looks for programming errors, helps to enforce a coding standard, sniffs for code smells and offers simple refactoring suggestions.</li>
<li>Installing it is as simple as <code>pip install pylint</code>. Follow the official docs above to integrate Pylint with VS Code. <ul>
<li>But if you ask me, using its CLI tool is the preferred way. Run <code>pylint path/to/dir</code> to analyze a complete directory &amp; <code>pylint path/to/some_file.py</code> to analyze a specific file. </li>
</ul>
</li>
</ul>
<p><strong>Severity: Essential</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629814032147/WiVryUiRT.png" alt="image.png" /></p>
<blockquote>
<p>Note: The most optimal Python linter setup in VS Code is a combination of Pylance, SonarLint &amp; Pylint. All three work independently and do not intrude on each other. Their combination provides the perfect linting experience. </p>
</blockquote>
<h4 id="auto-code-formatting-using-black-native-supporthttpsgithubcompsfblack"><a target="_blank" href="https://github.com/psf/black">Auto code formatting using <strong>Black</strong> (native support)</a>:</h4>
<ul>
<li>One of the more important points when working in a team and collaborating on a project is writing clean &amp; consistent code. But dividing our focus between logic &amp; writing clean code always harms productivity. To deal with such situations, a formatter should be used. </li>
<li>VS Code supports a wide variety of formatter tools (more <a target="_blank" href="https://code.visualstudio.com/docs/python/editing#_formatting">here</a>) &amp; automates the code formatting. The steps are as follows: <ul>
<li>Go to Settings --&gt; Extensions --&gt; Python --&gt; Python › Formatting: Provider and select black from the drop-down menu. Or, if you prefer <code>settings.json</code>, simply add <code>"python.formatting.provider": "black"</code></li>
<li>While you are in the file you want to format, <code>right click --&gt; Format Document</code>. There is even an option to automatically format the current file on save by adding <code>"editor.formatOnSave": true</code> in <code>settings.json</code>.</li>
</ul>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629827087973/eDwBwxpgXM.png" alt="image.png" /></p>
<ul>
<li>I personally use Black for the following reasons: <ul>
<li>Its philosophy is strongly opinionated &amp; it will format the code strictly according to its own set of rules. </li>
<li>I believe we already have to make a lot of critical decisions, and formatting should not be one of them. </li>
<li>Also, when the whole team uses the Black formatter, the complete codebase will look &amp; feel consistent and clean. </li>
</ul>
</li>
</ul>
<p><strong>Severity: Essential</strong></p>
<h4 id="python-docstring-generator-extension-by-nils-wernerhttpsmarketplacevisualstudiocomitemsitemnamenjpwernerautodocstring"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring"><strong>Python Docstring Generator</strong> (extension) by Nils Werner</a>:</h4>
<blockquote>
<p>I am assuming that you are well aware of the <a target="_blank" href="https://www.python.org/dev/peps/pep-0257/#id15">Docstring</a>. </p>
</blockquote>
<ul>
<li>Writing a good, informative docstring is very important for documentation. But if you or your team decide to follow a format (which you should totally do) like <a target="_blank" href="https://numpydoc.readthedocs.io/en/latest/format.html">Numpy</a>, <a target="_blank" href="https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings">Google</a>, etc., it can be difficult to write docstrings that consistently comply with the format/style guide.</li>
<li>Python Docstring Generator can generate a docstring template that adheres to the selected format, based on <code>type hints/type annotation</code>, so that you don't have to worry about formatting &amp; can just focus on writing the required information.</li>
<li>You can even jump to the next &amp; previous element in the template using the <code>tab</code> &amp; <code>shift + tab</code> keys respectively.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629871466261/deKN1LcgE.gif" alt="autodoc-demo.gif" /></p>
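<p>For reference, this is roughly what a function looks like once a NumPy-style template has been generated from its type hints and filled in. The <code>celsius_to_fahrenheit</code> function is a made-up example of mine, and the exact template text the extension produces may differ between versions.</p>
<pre><code class="lang-python">def celsius_to_fahrenheit(celsius: float, ndigits: int = 1) -> float:
    """Convert a temperature from Celsius to Fahrenheit.

    Parameters
    ----------
    celsius : float
        Temperature in degrees Celsius.
    ndigits : int, optional
        Number of decimal places to round to, by default 1

    Returns
    -------
    float
        Temperature in degrees Fahrenheit.
    """
    return round(celsius * 9 / 5 + 32, ndigits)


print(celsius_to_fahrenheit(100.0))
</code></pre>
<p>The parameter names, types and default value come straight from the signature, so the only part left to write by hand is the descriptions.</p>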
<ul>
<li>To change the format, go to Settings --&gt; Extensions --&gt; Python Docstring Generator configuration --&gt; select your desired format from the <code>Auto Docstring: Docstring Format</code> drop-down menu. </li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629871738339/3dpaEn_kq.png" alt="image.png" /></p>
<p><strong>Severity: Essential</strong></p>
<h4 id="python-indent-extension-by-kevin-rosehttpsmarketplacevisualstudiocomitemsitemnamekevinrosevsc-python-indent"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent"><strong>Python Indent</strong> (extension) by Kevin Rose</a>:</h4>
<ul>
<li>If you have ever felt that you keep messing up the indentation, then this extension has got you covered. Every time you press the <code>Enter</code> key, it automatically adds the correct indent.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629874515045/U3wlEk9VE.gif" alt="indent-demo.gif" /></p>
<p><strong>Severity: Helpful</strong></p>
<h4 id="python-type-hint-extension-by-njqdevhttpsmarketplacevisualstudiocomitemsitemnamenjqdevvscode-python-typehintandssrfalseoverview"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=njqdev.vscode-python-typehint&amp;ssr=false#overview"><strong>Python Type Hint</strong> (extension) by njqdev</a>:</h4>
<ul>
<li>Provides type hint auto-completion for Python, with completion items for built-in types, classes and the typing module.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629874753286/qd44sOXuC.gif" alt="type-hint-demo.gif" /></p>
<p><strong>Severity: Helpful</strong></p>
<h4 id="python-test-explorer-for-visual-studio-code-extension-by-little-fox-teamhttpsmarketplacevisualstudiocomitemsitemnamelittlefoxteamvscode-python-test-adapter"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=LittleFoxTeam.vscode-python-test-adapter"><strong>Python Test Explorer for Visual Studio Code</strong> (extension) by Little Fox Team</a>:</h4>
<ul>
<li>This extension allows you to run your Python Unittest, Pytest or Testplan tests with the Test Explorer UI.</li>
<li>It provides much richer information with ample control.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629875276267/XoAE4debRi.gif" alt="test-explore.gif" /></p>
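<p>If you have no tests yet, a tiny <code>unittest</code> module like the sketch below is enough to try the Test Explorer UI. Everything in it is a made-up example: <code>slugify</code> and <code>SlugifyTests</code> are hypothetical names, and whether the file is picked up depends on your test discovery settings (a file name such as <code>test_slugify.py</code> is the usual convention).</p>
<pre><code class="lang-python">import unittest


def slugify(title):
    """Lower-case a title and join its words with hyphens."""
    return "-".join(title.lower().split())


class SlugifyTests(unittest.TestCase):
    def test_joins_words_with_hyphens(self):
        self.assertEqual(slugify("Hello VS Code"), "hello-vs-code")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  Git   tricks "), "git-tricks")
</code></pre>
<p>Each <code>test_</code> method should show up as its own entry that you can run or debug individually; from a terminal, <code>python -m unittest</code> runs the same tests.</p>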
<hr />
<h2 id="part-2-git">Part 2: Git</h2>
<h4 id="gitlens-extension-by-eric-amodiohttpsmarketplacevisualstudiocomitemsitemnameeamodiogitlens"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=eamodio.gitlens"><strong>GitLens</strong> (extension) by Eric Amodio</a>:</h4>
<ul>
<li>It supercharges the Git capabilities built into Visual Studio Code. It helps you to visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more.</li>
<li>The best part of this extension is that all the default settings that come out of the box work well. It will improve your Git experience 10x. GitLens could even be a standalone tool, yeah it's that powerful 💪.</li>
<li>There are so many awesome features that I would suggest you watch the video:</li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=rxKGgSLwOnU">https://www.youtube.com/watch?v=rxKGgSLwOnU</a></div>
<p><strong>Severity: Must</strong></p>
<h4 id="git-graph-extension-by-mhutchiehttpsmarketplacevisualstudiocomitemsitemnamemhutchiegit-graph"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=mhutchie.git-graph"><strong>Git Graph</strong> (extension) by mhutchie</a>:</h4>
<ul>
<li>It generates beautiful, colourful &amp; informative graphs to visualize your commit history across all branches. This makes it extremely easy to track the project. </li>
<li>It even makes every single commit a clickable link that can be used to view the <code>git diff</code>, among other things. </li>
<li>It also offers a lot more Git functionality, which I highly suggest you check out. </li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630055860840/iccdUlOPQ.gif" alt="git-graph-demo.gif" /></p>
<p><strong>Severity: Must</strong></p>
<h4 id="git-history-extension-by-don-jayamannehttpsmarketplacevisualstudiocomitemsitemnamedonjayamannegithistory"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=donjayamanne.githistory"><strong>Git History</strong> (extension) by Don Jayamanne</a>:</h4>
<ul>
<li>View and search git log, history along with the graph and details, previous copy of the file.</li>
<li>Compare branches, commits, files across commits.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630056300601/i9MX2QWUv.gif" alt="gitLogv3.gif" /></p>
<p><strong>Severity: Essential</strong></p>
<blockquote>
<p>Info: You must be thinking 🤔 that VS Code's built-in Git support, GitLens, Git Graph and Git History seem to overlap in functionality, and you wouldn't be wrong. But the important thing is that all these extensions work well without harming each other's functionality, so it's totally fine to use them together. In fact, when they are all used together, they make VS Code the best Git management tool out there. </p>
</blockquote>
<h4 id="conventional-commits-extension-by-vivaxyhttpsmarketplacevisualstudiocomitemsitemnamevivaxyvscode-conventional-commits"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=vivaxy.vscode-conventional-commits"><strong>Conventional Commits</strong> (extension) by vivaxy</a>:</h4>
<ul>
<li>Before using the extension, I would suggest you understand the philosophy behind <a target="_blank" href="https://www.conventionalcommits.org/en/v1.0.0/">Conventional Commits</a>.</li>
<li>It brings support for Conventional Commits to VS Code. <blockquote>
<p>Note: An explanation of Conventional Commits is out of the scope of this article, but you should go through the spec before using this extension.</p>
</blockquote>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630056967071/mzxH-PaAh.gif" alt="cc-demo.gif" /></p>
<p><strong>Severity: Essential</strong></p>
<hr />
<h2 id="part-3-productivity-boosters">Part 3: Productivity boosters</h2>
<h4 id="vs-code-workspace-system-functionalityhttpscodevisualstudiocomdocseditorworkspaces"><a target="_blank" href="https://code.visualstudio.com/docs/editor/workspaces"><strong>VS Code Workspace</strong> (system functionality)</a>:</h4>
<ul>
<li>A workspace can be used to create a sandbox setup specific to a project or environment. Every single setting, extension, config, etc. can be customized to cater to the specific needs of a project. </li>
<li>Creating a workspace is very easy: just click on <code>File</code> &amp; select <code>Save Workspace As</code>.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629890314022/w0u6KzswM.png" alt="image.png" /></p>
<ul>
<li>Initially, a newly created workspace inherits everything present in <code>User settings</code>. You can then start changing anything you want, &amp; while doing so you have the option of saving the change just to the workspace or globally to the user settings. </li>
<li>I rate this feature of VS Code right at the top 🔝. This is what I generally do:<ul>
<li>For a Python project, I set a default Python path to be used every time</li>
<li>I disable any extensions that are not required in the project</li>
<li>Change terminal-specific settings</li>
<li>Hell yeah, you can even set workspace-specific themes, fonts, etc., which I do all the time 😎. I believe having a visual difference helps to differentiate projects. </li>
</ul>
</li>
</ul>
<p><strong>Severity: Must</strong></p>
<h4 id="useful-keyboard-shortcut"><strong>Useful keyboard shortcut</strong>:</h4>
<ul>
<li>Follow this excellent <a target="_blank" href="https://dev.to/shubhamreacts/15-useful-vscode-shortcuts-to-improve-productivity-4akc">blog</a> by Shubham Khatri. I too revisit it from time to time to refresh my memory of the shortcuts.</li>
</ul>
<p><strong>Severity: Essential</strong></p>
<h4 id="setting-sync-system-functionalityhttpscodevisualstudiocomdocseditorsettings-sync"><a target="_blank" href="https://code.visualstudio.com/docs/editor/settings-sync"><strong>Setting sync</strong> (system functionality)</a>:</h4>
<ul>
<li>Settings Sync lets you share your Visual Studio Code configurations such as settings, keybindings, and installed extensions across your machines so you are always working with your favourite setup.</li>
<li>This way you can maintain a similar &amp; familiar setup across your personal, work, or other machines. I personally use this extensively.</li>
<li>But mind you, as awesome as it looks, it can quickly prove to be a double-edged sword. Why?<ul>
<li>There might be a few extensions that you want only on specific devices.</li>
<li>There might be a few settings that use a machine-specific <code>path</code>.</li>
<li>There might be other settings or keybindings that you want to use only on a specific device. </li>
</ul>
</li>
<li>To solve this (potential) conflict, VS Code gives us extremely granular control over what to sync and what not to. The following are a few of the options (probably the most common ones):<ul>
<li>To stop syncing a specific extension: open the extension page --&gt; click on Do not sync this extension </li>
</ul>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1632713995938/rHIA2lawv.png" alt="image.png" /></p>
<ul>
<li>Do not want to sync some specific setting: Go to that specific setting (using settings UI) --&gt; Click on gear <code>⛮</code> icon --&gt; Uncheck Sync This Setting</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1632714450053/WAxYJmzz_.png" alt="image.png" /></p>
<p><strong>Severity: Essential</strong></p>
<h4 id="thunder-client-extension-by-ranga-vadhineni"><a target="_blank" href><strong>Thunder Client</strong> (extension) by Ranga Vadhineni</a>:</h4>
<ul>
<li>Thunder Client is a lightweight REST API client extension. It is basically like a Postman inside vs code so that you don't have to leave vs code at all. </li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629910984952/_hMYVrMkP.gif" alt="thunder-client.gif" /></p>
<p><strong>Severity: Essential</strong></p>
<h4 id="path-autocomplete-extension-by-mihai-vilcuhttpsmarketplacevisualstudiocomitemsitemnameionutvmipath-autocomplete"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=ionutvmi.path-autocomplete"><strong>Path Autocomplete</strong> (extension) by Mihai Vilcu</a>:</h4>
<ul>
<li>Provides path completion. It supports relative, absolute, workspace path auto-completion.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629872376656/lxEj7-5tv.gif" alt="path-autocomplete.gif" />
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629874006429/X7SNw-U0Ow.gif" alt="path.gif" /></p>
<p><strong>Severity: Helpful</strong></p>
<h4 id="comment-anchors-extension-by-exodius-studioshttpsmarketplacevisualstudiocomitemsitemnameexodiusstudioscomment-anchors"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=ExodiusStudios.comment-anchors"><strong>Comment Anchors</strong> (extension) by Exodius Studios</a>:</h4>
<ul>
<li>Writing comments (even more than the code itself 🧑‍💻) is extremely important in the long term, and so is an efficient way to track &amp; navigate them. Comment Anchors is the best extension to deal with this task. It supports all languages.</li>
<li>You can place anchors within comments or strings to place bookmarks within the context of your code. Anchors can be used to track TODOs, write notes, create foldable sections, or to build a simple navigation making it easier to navigate your files. Anchors can be viewed for the current file, or throughout the entire workspace, using an easy-to-use sidebar.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629900919989/-Wx3-2qai.png" alt="image.png" /></p>
<ul>
<li>It even supports adding custom anchors. Following is the list of anchors that I use</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629901032521/0i2pLLm78.png" alt="image.png" /></p>
<pre><code class="lang-json"><span class="hljs-comment">// My custom anchor related code in `settings.json`</span>

    <span class="hljs-string">"commentAnchors.tags.list"</span>: [
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"ANCHOR"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"default"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#A8C023"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"file"</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"TODO"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"blue"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#3ea8ff"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"FIXME"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"red"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#F44336"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
            <span class="hljs-attr">"isBold"</span>: <span class="hljs-literal">true</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"STUB"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"purple"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#BA68C8"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"file"</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"NOTE"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"orange"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#FFB300"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"file"</span>,
            <span class="hljs-attr">"styleComment"</span>: <span class="hljs-literal">true</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"REVIEW"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"green"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#64DD17"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"SECTION"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"blurple"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#896afc"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
            <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"region"</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"LINK"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"#2ecc71"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#2ecc71"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
            <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"link"</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"DEPRECATED"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"#B22222"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#B22222"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
            <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"anchor"</span>,
            <span class="hljs-attr">"isBold"</span>: <span class="hljs-literal">true</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"WARNING"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"#B22222"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#B22222"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
            <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"anchor"</span>,
            <span class="hljs-attr">"isBold"</span>: <span class="hljs-literal">true</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"EG"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"#00FFFF"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"#31e0ec"</span>,
            <span class="hljs-attr">"backgroundColor"</span>: <span class="hljs-string">"rgba(49, 184, 79, 0.2)"</span>,
            <span class="hljs-attr">"borderStyle"</span>: <span class="hljs-string">"1px solid #23b2ea"</span>,
            <span class="hljs-attr">"borderRadius"</span>: <span class="hljs-number">6</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
            <span class="hljs-attr">"styleComment"</span>: <span class="hljs-literal">true</span>
        },
        {
            <span class="hljs-attr">"tag"</span>: <span class="hljs-string">"@:"</span>,
            <span class="hljs-attr">"iconColor"</span>: <span class="hljs-string">"yellow"</span>,
            <span class="hljs-attr">"highlightColor"</span>: <span class="hljs-string">"yellow"</span>,
            <span class="hljs-attr">"scope"</span>: <span class="hljs-string">"workspace"</span>,
            <span class="hljs-attr">"behavior"</span>: <span class="hljs-string">"anchor"</span>
        }
    ],
</code></pre>
<blockquote>
<p>Note: It offers much more features, so I highly suggest reading their description. 📖</p>
</blockquote>
<h4 id="bracket-pair-colorizer-2-extension-by-coenraadshttpsmarketplacevisualstudiocomitemsitemnamecoenraadsbracket-pair-colorizer-2"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=CoenraadS.bracket-pair-colorizer-2"><strong>Bracket Pair Colorizer 2</strong> (extension) by CoenraadS</a>:</h4>
<ul>
<li>This extension allows matching brackets to be identified with colours.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629892884976/m79oFIVOl.png" alt="image.png" /></p>
<p><strong>Severity: Helpful</strong></p>
<h4 id="code-spell-checker-extension-by-street-side-softwarehttpsmarketplacevisualstudiocomitemsitemnamestreetsidesoftwarecode-spell-checker"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=streetsidesoftware.code-spell-checker"><strong>Code Spell Checker</strong> (extension) by Street Side Software</a>:</h4>
<ul>
<li>This is a very handy extension for someone like me who makes a lot of typos. It not only checks for typos but also provides correct suggestions too. </li>
<li>It works with 20+ file types (which obviously covers all the popular ones). It supports <code>camelCase</code>, <code>PascalCase</code>, <code>snake_case</code>. </li>
<li>You can even add words to the global level or even workspace level.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629895789723/KQBne8vh2.gif" alt="spell-checker.gif" /></p>
<p><strong>Severity: Essential</strong></p>
<h4 id="error-lens-extension-by-alexanderhttpsmarketplacevisualstudiocomitemsitemnameusernamehwerrorlens"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=usernamehw.errorlens"><strong>Error Lens</strong> (extension) by Alexander</a>:</h4>
<ul>
<li>ErrorLens turbo-charges language diagnostic features by making diagnostics stand out more prominently, highlighting the entire line wherever a diagnostic is generated by the language and also prints the message inline.</li>
<li>This makes debugging &amp; catching errors comparatively easy.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629905531804/8vc85MdgQ.png" alt="image.png" /></p>
<p><strong>Severity: Essential</strong></p>
<h4 id="footsteps-extension-by-wattenbergerhttpsmarketplacevisualstudiocomitemsitemnamewattenbergerfootsteps"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=Wattenberger.footsteps"><strong>footsteps</strong> (extension) by Wattenberger</a>:</h4>
<ul>
<li>Keep your place when jumping between different parts of your code. This is a VSCode extension that will highlight lines as you edit them, fading as you move away. Jump between lines using <code>ctrl+alt+left</code> and <code>ctrl+alt+right</code>.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629908038847/cwsgXyswT.gif" alt="footsteps.gif" /></p>
<p><strong>Severity: Helpful</strong></p>
<h4 id="zoom-bar-extension-by-wraith13httpsmarketplacevisualstudiocomitemsitemnamewraith13zoombar-vscode"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=wraith13.zoombar-vscode"><strong>Zoom Bar</strong> (extension) by wraith13</a></h4>
<ul>
<li>Can zoom via GUI in the status bar.
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629908882532/CczJeuild.png" alt="image.png" /></li>
</ul>
<p><strong>Severity: Helpful</strong></p>
<h4 id="resource-monitor-extension-by-mutantdinohttpsmarketplacevisualstudiocomitemsitemnamemutantdinoresourcemonitor"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=mutantdino.resourcemonitor"><strong>Resource Monitor</strong> (extension) by mutantdino</a>:</h4>
<ul>
<li>Display CPU frequency, usage, memory consumption, and battery percentage remaining within the VSCode status bar.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629909286957/laftCXXIA.png" alt="image.png" /></p>
<h4 id="drawio-integration-extension-by-henning-dieterichshttpsmarketplacevisualstudiocomitemsitemnamehedietvscode-drawio"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=hediet.vscode-drawio"><strong>Draw.io Integration</strong> (extension) by Henning Dieterichs</a></h4>
<ul>
<li><a target="_blank" href="https://app.diagrams.net">Draw.io</a> inside vs code.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1629911232080/ZpNW_fiQr.gif" alt="draw-io-demo.gif" /></p>
<p><strong>Severity: Helpful</strong></p>
<hr />
<h2 id="part-4-customization">Part 4: Customization</h2>
<p>This is the one place where vs code really shines.</p>
<h4 id="rainglow-theme-extension-by-dayle-reeshttpsmarketplacevisualstudiocomitemsitemnamedaylereesrainglow"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=daylerees.rainglow"><strong>Rainglow theme</strong> (extension) by Dayle Rees</a>:</h4>
<ul>
<li>Rainglow is a collection of colour themes &amp; consists of 320+ syntax and UI themes.</li>
<li>Colour combinations are excellent. All the themes follow similar categories and hierarchies which makes it super easy to pick a new theme.</li>
<li>Trust me after installing it you won't need any other theme. </li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630059925795/ZQrjeC2uj.png" alt="image.png" /></p>
<h4 id="material-icon-theme-extension-by-philipp-kiefhttpsmarketplacevisualstudiocomitemsitemnamepkiefmaterial-icon-theme"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=PKief.material-icon-theme"><strong>Material Icon Theme</strong> (extension) by Philipp Kief</a></h4>
<ul>
<li>There are many good options for icon themes in vs code, but Material Icon Theme covers most grounds and at the same time all the icons are precise and beautifully designed.</li>
</ul>
<h4 id="window-colors-extension-by-stuart-robinsonhttpsmarketplacevisualstudiocomitemsitemnamestuartunique-window-colors"><a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=stuart.unique-window-colors"><strong>Window Colors</strong> (extension) by Stuart Robinson</a>:</h4>
<ul>
<li>This extension is a bit unique and fun. Automatically adds a unique colour to each window's <code>activityBar</code> and <code>titleBar</code>.</li>
<li>It works without harming any existing theme extension &amp; works along with it. </li>
<li>Why is this useful, you ask? If you have multiple windows open (like me all the time 😅) then this gives each one its own colour &amp; makes it super easy to tell them apart. </li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1630060928043/uaDbd2WGZ.png" alt="image.png" /></p>
]]></content:encoded></item><item><title><![CDATA[Practical OOP in Python: Methods]]></title><description><![CDATA[The class is the backbone of OOP in Python and methods are body parts of the class. Understanding the practical application of the methods is the key to take the most advantage of the class and eventually OOP in Python.
Let's touch on some theory in ...]]></description><link>https://importidea.dev/practical-oop-in-python-methods</link><guid isPermaLink="true">https://importidea.dev/practical-oop-in-python-methods</guid><category><![CDATA[Python]]></category><category><![CDATA[oop]]></category><category><![CDATA[Programming Blogs]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Tue, 22 Jun 2021 03:58:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1624334263316/s0g3Cj9c-.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The class is the backbone of OOP in Python and methods are body parts of the class. Understanding the practical application of the methods is the key to take the most advantage of the class and eventually OOP in Python.</p>
<p>Let's touch on the theory only briefly, as this blog focuses on practicality.</p>
<h2 id="1-brief-theory">1. Brief theory</h2>
<p><strong>class</strong>: A class is a user-defined blueprint or prototype from which objects are created. It wraps all the similar methods (ideally by conventions). </p>
<p><strong>methods</strong>: A glorified function that is a class member and will always remain bound to a class. </p>
<p>If you wish to understand the theory, then I would suggest you go through</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://realpython.com/instance-class-and-static-methods-demystified/#lets-see-them-in-action">https://realpython.com/instance-class-and-static-methods-demystified/#lets-see-them-in-action</a></div>
<p><strong>Types of methods available in Python:</strong></p>
<ol>
<li>True method:<ol>
<li>instance method</li>
<li>class method</li>
<li>static method</li>
<li>property (not everyone will agree)</li>
</ol>
</li>
<li>Method by conventions (this is something I have derived)<ol>
<li>private method</li>
<li>strict private method</li>
</ol>
</li>
</ol>
<h2 id="2-practical-use-case">2. Practical use case</h2>
<p>The beauty of OOP in any programming language (that has its support obviously 😎) is the flexibility. There will always be more than one way to do any task at hand, but not all are <em>best practice</em> or <em>practical</em> or <em>pythonic</em>. </p>
<blockquote>
<p>Note: I will only touch the syntax/theory part briefly, on the assumption that the reader already knows the theory and is here to see the real-world practical use-cases. </p>
</blockquote>
<p>Let's begin by writing a sample class that will be used across the complete blog</p>
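<pre><code class="lang-python"><span class="hljs-keyword">from</span> dataclasses <span class="hljs-keyword">import</span> dataclass

<span class="hljs-meta">@dataclass</span>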
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SampleClass</span>:</span>
    <span class="hljs-string">"""This is a sample class to explain practical use-case of all methods
    """</span>    
    xyz: int
    abc: int = <span class="hljs-number">4</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ins_method</span>(<span class="hljs-params">self, no: int</span>):</span>
        <span class="hljs-string">"""Sample instance method
        Args:
            no (int): any int number 
        """</span>        
        print(self.xyz * no)

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">cls_method</span>(<span class="hljs-params">cls, no: int</span>):</span>
        <span class="hljs-string">"""Sample class method
        Args:
            no (int): any int number 
        """</span>
        print(cls.abc * no)

<span class="hljs-meta">    @staticmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">stc_method</span>(<span class="hljs-params">no: int</span>):</span>
        <span class="hljs-string">"""Sample static method
        Args:
            no (int): any int number 
        """</span>
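        print(no * <span class="hljs-number">2</span>)  <span class="hljs-comment"># no self or cls here, so only the arguments are used</span>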
</code></pre>
<p>Now we'll see their practical use case one by one. </p>
<h3 id="21-instance-method">2.1 instance method</h3>
<ul>
<li>The plain, simple, regular method without any frills. </li>
<li>Practically speaking, this type of method is used 90% of the time. In fact, the instance method alone will cover almost everything you need. </li>
<li>It has free access to all attributes &amp; even other methods at the same object level. This flexibility is why it is the most widely used. </li>
<li>From the above <code>SampleClass</code> example <code>ins_method()</code> is the instance method. </li>
</ul>
<pre><code class="lang-python"><span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SampleClass</span>:</span>
    xyz: int
    abc: int = <span class="hljs-number">4</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ins_method</span>(<span class="hljs-params">self, no: int</span>):</span>      
        print(self.xyz * no)

s = SampleClass(xyz=<span class="hljs-number">10</span>)
s.ins_method(<span class="hljs-number">2</span>)

<span class="hljs-comment"># Output</span>
<span class="hljs-number">20</span>
</code></pre>
<ul>
<li>Some key points to note here:<ul>
<li>As instance methods are bound to an object of the class, an object must be created first. </li>
<li>The same object then can be used anywhere. </li>
</ul>
</li>
</ul>
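<p>To make the "free access to attributes &amp; other methods" point concrete, here is a minimal sketch. The <code>Invoice</code> class and its fields are hypothetical names made up for this illustration; they are not part of <code>SampleClass</code>.</p>
<pre><code class="lang-python">from dataclasses import dataclass


@dataclass
class Invoice:
    """Hypothetical class, used only to illustrate instance methods."""
    amount: float             # instance variable, supplied by the user
    tax_rate: float = 0.25    # default value, also readable through self

    def tax(self):
        # An instance method can read every attribute through self
        return self.amount * self.tax_rate

    def total(self):
        # ...and it can call other methods on the same object
        return self.amount + self.tax()


inv = Invoice(amount=100.0)
print(inv.total())  # 125.0
</code></pre>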
<blockquote>
<p><strong>Factory function</strong>: Before moving to the next part, it is important to understand the <a target="_blank" href="https://en.wikipedia.org/wiki/Factory_(object-oriented_programming)">factory function</a>, as this is one use-case where <code>classmethod</code> and <code>staticmethod</code> have wide practical application. Again, I will focus on the practical side here. </p>
</blockquote>
<h3 id="22-classmethod">2.2 classmethod</h3>
<ul>
<li>First, get this: classmethod is not a must have/use in Python OOP. Almost always a plain instance method will do the job. But there is one use-case where it fits perfectly, i.e. if you want a factory function that works with class attributes. </li>
<li>Unlike the instance method, where we need to create an object of the class first &amp; then use the '.' notation to call it, a classmethod can be used without creating an object, as it is bound to the class itself and not to the object. Sounds too technical? Let's see an example. </li>
</ul>
<pre><code class="lang-python"><span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SampleClass</span>:</span>
    xyz: int
    abc: int = <span class="hljs-number">4</span>

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">cls_method</span>(<span class="hljs-params">cls, no: int</span>):</span>
        print(cls.abc * no)

SampleClass.cls_method(<span class="hljs-number">2</span>)

<span class="hljs-comment"># Output</span>
<span class="hljs-number">8</span>
</code></pre>
<ul>
<li>As you can see above, we have not created any object of <code>SampleClass</code> but directly used the '.' notation on <code>SampleClass</code> itself, as a classmethod is bound to the class directly. Compare it with the instance method: there we created an object <code>s</code> and then accessed <code>ins_method()</code> through it, because that method is bound to <code>s</code> and not to <code>SampleClass</code></li>
<li>So what is the benefit of all this? Not a lot, but there are a few:<ul>
<li>The obvious one: use it as a factory function where you don't want to specifically create an object of the class. Why? Maybe you fear the object size will be too large &amp; you only want to use a specific method. </li>
<li>It is a way to tell your fellow team members or other people that this method doesn't depend on the instance variable (value provided by the user) but on the class variable, to which the user doesn't have access.</li>
</ul>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SampleClass</span>:</span>
    xyz: int <span class="hljs-comment">#This is instance variable</span>
    abc: int = <span class="hljs-number">4</span> <span class="hljs-comment"># This is class variable</span>

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">cls_method</span>(<span class="hljs-params">cls, no: int</span>):</span>
        print(cls.abc * no)

<span class="hljs-comment"># Traditional way</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SampleClass</span>:</span>
    abc: int = <span class="hljs-number">4</span> <span class="hljs-comment">#This is class variable</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, xyz</span>):</span>
        self.xyz = xyz <span class="hljs-comment"># This is instance variable</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623594595554/p_q_1YS8Z.png" alt="image.png" /></p>
<ul>
<li>Here <code>cls_method()</code> only has access to <code>abc</code>, which is a class variable</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1623594750648/vflHpTdMq.png" alt="image.png" /></p>
<ul>
<li>Here <code>ins_method()</code> has access to everything. </li>
<li>One more use: a method that needs to use a class variable as well as user input. </li>
</ul>
<pre><code class="lang-python"><span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SampleClass</span>:</span>
    xyz: int
    abc: int = <span class="hljs-number">4</span>

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">cls_method</span>(<span class="hljs-params">cls, no: int</span>):</span>
        print(cls.abc * no)

SampleClass.cls_method(<span class="hljs-number">2</span>) <span class="hljs-comment"># Here cls_method is using class variable - 'abc' as well as user input - 'no'</span>

<span class="hljs-comment"># Output</span>
<span class="hljs-number">8</span>
</code></pre>
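<p>The factory use-case mentioned above is often written as an "alternative constructor". Below is a minimal sketch of that idea. The <code>Point</code> class and the <code>from_string</code> name are hypothetical, picked only for this illustration.</p>
<pre><code class="lang-python">from dataclasses import dataclass


@dataclass
class Point:
    """Hypothetical class, used only to illustrate a factory classmethod."""
    x: int
    y: int

    @classmethod
    def from_string(cls, text: str):
        # Alternative constructor: parse "x,y" and build the object via cls,
        # so a subclass calling this factory gets an instance of itself.
        x, y = text.split(",")
        return cls(int(x), int(y))


p = Point.from_string("3,4")
print(p)  # Point(x=3, y=4)
</code></pre>
<p>Because the factory receives <code>cls</code>, it is called on the class itself, and the caller never has to know how the string is parsed.</p>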
<h3 id="23-staticmethod">2.3 staticmethod</h3>
<ul>
<li>Similar to the 'classmethod', even 'staticmethod' is not a must have/use in Python OOP. In fact, it is used even less than <code>classmethod</code>. Some developers argue that it is totally useless, <a target="_blank" href="https://www.webucator.com/blog/2016/05/when-to-use-static-methods-in-python-never/">see here</a>.</li>
<li>It shares almost all properties of <code>classmethod</code>, except that it has access to neither class variables nor instance variables, as it takes neither self nor cls. </li>
<li>Like the <code>classmethod</code> it can be used as a factory function, but it should not be, because:<ul>
<li>We already have <code>classmethod</code> for it. </li>
<li><code>classmethod</code> will work perfectly even if it is not using a class variable. </li>
</ul>
</li>
<li>So where should it be used? There is no true practical use-case for it and it can be skipped entirely. But I do use it in some rare edge cases:<ul>
<li>When you have some function outside the scope of a class but feel it is tightly related to the class, it can be included as <code>staticmethod</code> as it does not need any class/instance variable. </li>
<li>You want to mimic the private method which doesn't need a class/instance variable. But I would advise using a private method instead (more on this later).</li>
</ul>
</li>
</ul>
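<p>A rough sketch of the "tightly related helper" case described above. The <code>Temperature</code> class and its method names are hypothetical and exist only for this example.</p>
<pre><code class="lang-python">from dataclasses import dataclass


@dataclass
class Temperature:
    """Hypothetical class, used only to illustrate a staticmethod."""
    celsius: float

    @staticmethod
    def c_to_f(celsius: float):
        # No self or cls: the result depends only on the argument
        return celsius * 9 / 5 + 32

    def fahrenheit(self):
        # Instance method reusing the static helper
        return self.c_to_f(self.celsius)


print(Temperature.c_to_f(100.0))        # 212.0
print(Temperature(25.0).fahrenheit())   # 77.0
</code></pre>
<p>The helper could just as well live as a plain module-level function. Keeping it inside the class is purely a way of grouping related code.</p>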
<h3 id="24-property">2.4 property</h3>
<ul>
<li><p>This is a special type of method; formally, it is a descriptor. Follow this excellent blog if you want to understand the theory part:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.machinelearningplus.com/python-property/">https://www.machinelearningplus.com/python-property/</a></div>
</li>
<li><p><code>@property</code> should be used when you want to return an attribute produced by some function/method. </p>
</li>
<li>As the name suggests the property object should always hold some kind of characteristic of the class. Python's builtin pathlib library is one of the best examples.</li>
</ul>
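<p>As a small sketch of both points: a property that returns a computed characteristic of the object, followed by two of the properties that <code>pathlib</code> exposes. The <code>Rectangle</code> class and the file name are hypothetical and used only for illustration.</p>
<pre><code class="lang-python">from dataclasses import dataclass
from pathlib import Path


@dataclass
class Rectangle:
    """Hypothetical class, used only to illustrate a property."""
    width: int
    height: int

    @property
    def area(self):
        # Computed on access, but read like a plain attribute (no parentheses)
        return self.width * self.height


r = Rectangle(3, 4)
print(r.area)  # 12

# pathlib exposes characteristics of a path in the same way
p = Path("report.final.csv")
print(p.suffix)  # .csv
print(p.stem)    # report.final
</code></pre>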
<h3 id="25-private-method">2.5 private method</h3>
<ul>
<li>It is a method to which external users do not have access &amp; it is just used internally. </li>
<li>Python doesn't have a true private method (like in Java). But we can implement it using generally agreed conventions. </li>
<li>The convention is to add '_' as a prefix to the <code>instance method</code> name. On seeing this naming convention, the user will understand that this particular method should not be touched. </li>
</ul>
<pre><code class="lang-python"><span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SampleClass</span>:</span>
    xyz: int
    abc: int = <span class="hljs-number">4</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ins_method</span>(<span class="hljs-params">self, no: int</span>):</span>
        print(self._pvt_method(no) * no)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_pvt_method</span>(<span class="hljs-params">self, no: int</span>):</span>
        <span class="hljs-keyword">return</span> (self.abc * no)
</code></pre>
<ul>
<li><p>But as mentioned earlier, Python doesn't have a true private method, so the user can still call it. Someone who doesn't know the naming convention might accidentally use it.</p>
<pre><code class="lang-python">sc = SampleClass(<span class="hljs-number">4</span>)
sc._pvt_method(<span class="hljs-number">4</span>)

<span class="hljs-comment"># Output</span>
<span class="hljs-number">16</span>
</code></pre>
</li>
</ul>
<h3 id="26-strict-private-method">2.6 strict private method</h3>
<ul>
<li>You might not find this term anywhere formally; it is something I have coined. Given the above-mentioned limitation of the private method in Python, if you want to enforce privacy more strongly you can lean on Python's name mangling feature by adding '__' (double underscore) as a prefix to the <code>instance method</code> name. </li>
</ul>
<pre><code class="lang-python"><span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SampleClass</span>:</span>
    xyz: int
    abc: int = <span class="hljs-number">4</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ins_method</span>(<span class="hljs-params">self, no: int</span>):</span>
        <span class="hljs-keyword">return</span> (self.__strict_pvt_method(no) * no)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_pvt_method</span>(<span class="hljs-params">self, no: int</span>):</span>
        <span class="hljs-keyword">return</span> (self.abc * no)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__strict_pvt_method</span>(<span class="hljs-params">self, no: int</span>):</span>
        <span class="hljs-keyword">return</span> (self.abc * no)

<span class="hljs-comment"># Output</span>
sc.ins_method(<span class="hljs-number">4</span>)
<span class="hljs-number">64</span>
sc._pvt_method(<span class="hljs-number">4</span>)
<span class="hljs-number">16</span>
sc.__strict_pvt_method(<span class="hljs-number">4</span>)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
&lt;ipython-input<span class="hljs-number">-31</span><span class="hljs-number">-102204654</span>d54&gt; <span class="hljs-keyword">in</span> &lt;module&gt;
----&gt; <span class="hljs-number">1</span> sc.__strict_pvt_method(<span class="hljs-number">4</span>)

AttributeError: <span class="hljs-string">'SampleClass'</span> object has no attribute <span class="hljs-string">'__strict_pvt_method'</span>
</code></pre>
<ul>
<li>Though it is stricter, it is still not a 100% enforced rule, as the method can be called using the name-mangled syntax.</li>
</ul>
<pre><code class="lang-python">sc._SampleClass__strict_pvt_method(<span class="hljs-number">4</span>)

<span class="hljs-comment"># Output</span>
<span class="hljs-number">16</span>
</code></pre>
<ul>
<li>Then why use it? Because it makes the method much harder to call from outside the class, as the mangled name is not obvious. </li>
</ul>
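<p>A minimal sketch of what name mangling does under the hood, reusing the <code>SampleClass</code> from above. Python stores the double-underscore method under a class-prefixed name, which is why the plain name raises <code>AttributeError</code> outside the class. The variables <code>mangled_names</code> and <code>has_plain_name</code> are only for illustration.</p>

```python
from dataclasses import dataclass


@dataclass
class SampleClass:
    xyz: int
    abc: int = 4

    def ins_method(self, no: int):
        # Inside the class body the name is rewritten to
        # _SampleClass__strict_pvt_method, so this call resolves fine.
        return self.__strict_pvt_method(no) * no

    def __strict_pvt_method(self, no: int):
        return self.abc * no


sc = SampleClass(4)

# Names that actually exist on the class after mangling
mangled_names = [name for name in dir(SampleClass) if name.endswith("strict_pvt_method")]
print(mangled_names)

# The plain, unmangled name is not an attribute of the instance
has_plain_name = hasattr(sc, "__strict_pvt_method")
print(has_plain_name)
```

<p>Listing the attributes this way shows that nothing is hidden; the method is only renamed, so the protection is against accidental use rather than deliberate access.</p>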
<blockquote>
<p>Note: I suggest reading this thread for more detail</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://stackoverflow.com/questions/70528/why-are-pythons-private-methods-not-actually-private">https://stackoverflow.com/questions/70528/why-are-pythons-private-methods-not-actually-private</a></div>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Testing in a CI/CD Pipeline Part 3: Deployment testing]]></title><description><![CDATA[This is part 3 of the Testing in a CI/CD Pipeline series. It is advised first to go through part 1, part 2 🤓.
https://importidea.hashnode.dev/testing-in-cicd-part-1-pr-testing
https://importidea.dev/testing-in-cicd-part-2-integration-testing
1. Depl...]]></description><link>https://importidea.dev/testing-in-cicd-part-3-deployment-testing</link><guid isPermaLink="true">https://importidea.dev/testing-in-cicd-part-3-deployment-testing</guid><category><![CDATA[Testing]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Sat, 29 May 2021 05:04:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1622956045733/aWIOuLoOO.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is part 3 of the Testing in a CI/CD Pipeline series. It is advised first to go through part 1, part 2 🤓.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://importidea.hashnode.dev/testing-in-cicd-part-1-pr-testing">https://importidea.hashnode.dev/testing-in-cicd-part-1-pr-testing</a></div>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://importidea.dev/testing-in-cicd-part-2-integration-testing">https://importidea.dev/testing-in-cicd-part-2-integration-testing</a></div>
<h2 id="1-deployment-testing-in-brief">1. Deployment testing in brief 💼</h2>
<p>Deployment testing is different from system or unit testing. Let's look at it in brief (the main intent of this guide is how to integrate it into a CI/CD pipeline).</p>
<ul>
<li>In unit testing, tests measure the correctness of the system's individual, smallest components (units).</li>
<li>In contrast, deployment testing is a testing stage where two or more software units are joined and tested as one entity, but after release or deployment.</li>
<li>Deployment testing in a CI/CD pipeline works best and integrates most easily if your system is built using a microservice approach.</li>
</ul>
<h2 id="2-general-mechanics">2. General mechanics 🧰</h2>
<p><strong>Why</strong></p>
<ul>
<li>To ensure every individual microservice is working/behaving correctly.</li>
<li>Buggy code can be caught after release (&amp; ideally before it is exposed to the public) and can trigger an automated (or manual, depending on your use case) rollback. </li>
</ul>
<p><strong>How</strong></p>
<ul>
<li>Run a practical example (mocking a real-world use case) against the microservice.
The expected result must be known beforehand and is compared with the output of the test.</li>
</ul>
<p><strong>When</strong></p>
<ul>
<li>Deployment testing must be done as the last component of the CD pipeline.</li>
<li>It should be triggered after the microservice is successfully deployed.</li>
</ul>
<p><strong>Where</strong></p>
<ul>
<li>It is always performed at your infra/deployment layer (e.g. Kubernetes).</li>
<li>It should be the last step of any CI/CD pipeline.</li>
</ul>
<h2 id="3-implementation-in-azure-devops-pipeline">3. Implementation in Azure DevOps pipeline 🚀</h2>
<ul>
<li>If your microservice exposes an endpoint (which will be in most of the cases) then all you need is to post a request using REST API (or whatever your microservice supports).</li>
<li>In my case, the complete microservice was packaged and containerized using Docker and deployed to a Kubernetes cluster. I used a simple Python script to extract the URL, post a request using the REST API, and compare the output.</li>
<li>If you have gone through <a target="_blank" href="https://importidea.dev/testing-in-cicd-part-2-integration-testing">part 2</a> of this series regarding integration testing, you will notice that a lot of things are common to both. That's true: the core of both kinds of testing is exactly the same, only the way they are executed differs. </li>
</ul>
<blockquote>
<p>Note: I believe the execution part is more important than the logic (as core logic is the same as integration testing). So I will be focussing more on the execution part.</p>
</blockquote>
<h3 id="31-test-script">3.1 Test script</h3>
<ul>
<li>As I mentioned earlier, the core of integration testing and deployment testing is the same, so the testing logic is the same too. </li>
<li>The key difference is that the deployment testing script should be able to test all the microservices. In terms of the integration tests, the deployment test is a compilation of the individual integration tests of each microservice. </li>
<li>As the concept of the test script is already explained in detail <a target="_blank" href="https://importidea.dev/testing-in-cicd-part-2-integration-testing">here</a>, I am skipping it. To keep this blog less cluttered, I will only show the overall code at the end.</li>
</ul>
<h3 id="32-perform-deployment-test-in-the-kubernetes">3.2 Perform the deployment test in Kubernetes</h3>
<p>You may not be using Kubernetes at all, but the idea behind this is platform agnostic and understanding the flow is the key. Having said that, let's move on.</p>
<ol>
<li>As all the microservices are Deployment apps (in the Kubernetes world, more <a target="_blank" href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">here</a>), we will treat the deployment test as a microservice too. Doing so has tons of benefits, like:<ul>
<li>All the networking requirements are handled by Kubernetes. 
If you maintain multiple environments (like dev, qa, prod), you can use Kubernetes namespaces to run environment-specific tests. </li>
<li>As both sides (here, the microservice &amp; the deployment test service) are internal, i.e. at the same level in Kubernetes, test latency will be much smaller. </li>
</ul>
</li>
<li>The deployment test pod, which is generated from its k8s deployment app, will only include the test script. </li>
<li>The CD pipeline of every microservice will need a <code>kubectl exec</code> task at the end, because all we need to do is run the Python script sitting inside the deployment test pod.<pre><code class="lang-bash">kubectl -n &lt;your namespace&gt; <span class="hljs-built_in">exec</span> po/&lt;deployment <span class="hljs-built_in">test</span> pod name&gt; -- python3 deployment_test.py
</code></pre>
<blockquote>
<p>Note: In a later section we will see how all this can be automated using references/variables.</p>
</blockquote>
</li>
</ol>
<h3 id="33-extracting-urlendpoint">3.3 Extracting URL/endpoint</h3>
<p>Here comes the magic of Kubernetes, as all the networking is handled by the cluster itself. There are a lot of options, but we will be using <a target="_blank" href="https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/"><em>DNS for Services and Pods</em></a> as both services are internal. </p>
<ul>
<li><p>All you need to know:</p>
<ol>
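<li>Name of your k8s deployment app (more precisely, the name of the Service that exposes it, since the DNS record is created for the Service).</li>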
<li>Namespace where the pod is currently sitting.</li>
</ol>
</li>
<li><p>The URL will be <code>http://&lt;k8s deployment name&gt;.&lt;namespace&gt;.svc.cluster.local/&lt;your remaining endpoint or sub-page&gt;</code>. Let's see an example. If the k8s deployment name is <em>my-deployment</em> &amp; the namespace is <em>dev</em>, then it will be <code>http://my-deployment.dev.svc.cluster.local/&lt;some-api&gt;</code> </p>
</li>
<li><p>Using the pod IP address directly is bad practice, as it changes every time a pod is created or restarted. With the DNS name above, Kubernetes handles this for us.</p>
</li>
</ul>
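<p>The URL pattern above can be sketched as a small helper. This is only an illustration: the function name <code>build_service_url</code> is hypothetical, and it assumes the cluster uses the default <code>cluster.local</code> domain and that the service listens on port 80.</p>

```python
def build_service_url(service_name: str, namespace: str, path: str = "") -> str:
    """Build the in-cluster URL of a Kubernetes Service from its DNS name."""
    # Assumption: default cluster domain "cluster.local" and plain HTTP on port 80
    host = f"{service_name}.{namespace}.svc.cluster.local"
    return f"http://{host}/{path.lstrip('/')}"


url = build_service_url("my-deployment", "dev", "some-api")
print(url)
```

<p>Because the namespace is just a parameter, the same test script can target dev, qa, or prod without any other change.</p>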
<h3 id="34-integration-with-cd-job">3.4 Integration with CD Job</h3>
<p>I will be using the <a target="_blank" href="https://docs.microsoft.com/en-us/azure/devops/pipelines/release">Release pipeline</a> of the Azure DevOps pipeline for CD Job and focus just on the deployment testing task.</p>
<ul>
<li>All you need are multiple <code>kubectl</code> tasks</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1622949443734/dpg9le-5Z.png" alt="image.png" /></p>
<ul>
<li>Step 1: We need to extract the deployment test pod name to perform the <code>kubectl exec</code> command. <ol>
<li>We will use the <code>kubectl get</code> command to extract the current pod name of the given deployment/app. See the Arguments section carefully. This is where we are extracting the name. The argument will be,<pre><code class="lang-bash">pods -l app=crs-ai-deployment-test -o jsonpath={.items[*].metadata.name}
</code></pre>
</li>
</ol>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1622949869313/_FTnfw-Yi.png" alt="image.png" /></p>
<ol start="2">
<li>Save the output/name in a reference/variable that can be used in a later step. This can be done using <strong>Output variable &gt; Reference name</strong></li>
<li>The output format should always be none. This can be set using <strong>Advanced &gt; Output format &gt; none</strong></li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1622950428840/KS3e3lj6o.png" alt="image.png" /></p>
<p>As you can see from the above screenshot, I am using <em>test</em> as the reference name, which makes the variable name <strong>test.KubectlOutput</strong>. </p>
<ul>
<li>Step 2: kubectl exec deployment testing<ol>
<li>We will use the <code>kubectl exec</code> command to run the Python script. See the Arguments section carefully. The <strong>test.KubectlOutput</strong> variable produced in the previous step is used here. </li>
</ol>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1622950884501/wvVgk_CSm.png" alt="image.png" /></p>
<ul>
<li>Step 3: (extra) print logs if the test fails<ol>
<li>Just as in integration testing, where logs were printed to the console after a failed test to help the investigation, the same can be done in deployment testing. </li>
<li>We will need a <code>kubectl logs</code> command with the <code>--since=10m</code> flag</li>
</ol>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1622951140079/PczIwyHx3.png" alt="image.png" /></p>
<ul>
<li>See the Arguments section carefully. Here the reference <code>$(pod.KubectlOutput)</code> holds the pod name of the microservice. Its current pod name can be extracted in the same way as we did for the deployment test pod. </li>
<li>We also need to change the Control option to <code>Only when a previous task failed</code>, as this task should only run when the previous <code>kubectl exec</code> task performing the deployment test has failed. </li>
</ul>
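<p>For readers who are not on Azure DevOps, the three steps above can be sketched as plain <code>kubectl</code> command lines built in Python. This is a minimal sketch, not the pipeline definition itself: the helper names are hypothetical, and the namespace, the <code>crs-id-applicability-api</code> label, and the pod names in the example are placeholders. Passing <code>--namespace</code> and <code>--deployment_name</code> to the script is an assumption based on the arguments the test script at the end of this post defines.</p>

```python
import shlex
from typing import List


def get_pod_name_cmd(namespace: str, app_label: str) -> List[str]:
    # Step 1: resolve the current pod name of a deployment by its app label
    return ["kubectl", "-n", namespace, "get", "pods", "-l", f"app={app_label}",
            "-o", "jsonpath={.items[*].metadata.name}"]


def exec_test_cmd(namespace: str, test_pod: str, deployment_name: str) -> List[str]:
    # Step 2: run the test script that lives inside the deployment test pod
    return ["kubectl", "-n", namespace, "exec", test_pod, "--",
            "python3", "deployment_test.py",
            "--namespace", namespace, "--deployment_name", deployment_name]


def logs_cmd(namespace: str, service_pod: str, since: str = "10m") -> List[str]:
    # Step 3: only needed on failure, prints the recent logs of the microservice pod
    return ["kubectl", "-n", namespace, "logs", service_pod, f"--since={since}"]


commands = [
    get_pod_name_cmd("dev", "crs-ai-deployment-test"),
    exec_test_cmd("dev", "test-pod-name", "research-clarity-id-applicability"),
    logs_cmd("dev", "service-pod-name"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

<p>Each list can be handed to <code>subprocess.run(cmd, check=True)</code> on a machine with cluster access; a non-zero exit status from the second command is the signal to run the third.</p>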
<p>So this is how deployment testing can be automated and integrated into the CD job. Let's see some action. </p>
<h2 id="4-demo">4. Demo</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1622955051123/q0Sg1X8wo.png" alt="image.png" /></p>
<ul>
<li>First, the pod name of the deployment test pod was extracted. Then the testing was performed. As the test failed, next the logs were printed. </li>
</ul>
<h2 id="5-deployment-test-code">5. Deployment test code</h2>
<p>This may change based on your requirement. But you can still refer to this for the idea, as always an idea is platform agnostic. </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Dict, Tuple

sys.tracebacklimit = <span class="hljs-number">0</span>
logging.basicConfig(level=logging.INFO, format=<span class="hljs-string">"[%(levelname)s]: %(message)s"</span>)
logger = logging.getLogger(__name__)
parser = argparse.ArgumentParser()
parser.add_argument(<span class="hljs-string">"--namespace"</span>, type=str) <span class="hljs-comment"># Targeted namespace</span>
parser.add_argument(<span class="hljs-string">"--deployment_name"</span>, type=str) <span class="hljs-comment"># Targeted test as this is compilation of all individual test</span>
args = parser.parse_args()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">payload_data</span>(<span class="hljs-params">deployment_name: str</span>) -&gt; Tuple[str, str, Dict]:</span>
    <span class="hljs-string">"""Prepare payload data for Deployment specifics

    Parameters
    ----------
    deployment_name : str
        Name of deployment to perform testing.

    Returns
    -------
    str
        Api name
    str
        In-cluster DNS name of the service to call
    Dict
        Sample data to check for deployment testing

    Raises
    ------
    ValueError
        Must be from supported deployment testing: &lt;here you can add all your test name&gt;&lt;eg&gt; research-clarity-id-applicability,research-clarity-id-adv-nonadv
    """</span>
 <span class="hljs-comment"># <span class="hljs-doctag">TODO:</span> Add new deployment name to the list</span>
    supported_deployment = [
        <span class="hljs-string">"research-clarity-id-applicability"</span>,
        <span class="hljs-string">"research-clarity-id-adv-nonadv"</span>,
    ]
    <span class="hljs-keyword">if</span> deployment_name <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> supported_deployment:
        <span class="hljs-keyword">raise</span> ValueError(
            <span class="hljs-string">f"Given deployment is either wrong or not supported.\nIt must be from <span class="hljs-subst">{<span class="hljs-string">', '</span>.join(supported_deployment)}</span> "</span>
        )

    <span class="hljs-comment"># <span class="hljs-doctag">TODO:</span> Add all new sample data and API here</span>
    <span class="hljs-keyword">elif</span> deployment_name == <span class="hljs-string">"research-clarity-id-applicability"</span>:
        api_name = <span class="hljs-string">"IDA"</span>
        svc = <span class="hljs-string">f"crs-id-applicability-api.<span class="hljs-subst">{args.namespace}</span>.svc.cluster.local"</span>
        path = <span class="hljs-string">"./data/sample_data_research-clarity-id-applicability.json"</span>
    <span class="hljs-keyword">elif</span> deployment_name == <span class="hljs-string">"research-clarity-id-adv-nonadv"</span>:
        api_name = <span class="hljs-string">"adverse_nonadverse"</span>
        svc = <span class="hljs-string">f"crs-id-adverse.<span class="hljs-subst">{args.namespace}</span>.svc.cluster.local"</span>
        path = <span class="hljs-string">"./data/sample_data_research-clarity-id-adv-nonadv.json"</span>

    <span class="hljs-keyword">with</span> open(path, <span class="hljs-string">"r"</span>) <span class="hljs-keyword">as</span> file:
        payload = json.load(file)

    <span class="hljs-keyword">return</span> (api_name, svc, payload)

api_name, svc, payload = payload_data(args.deployment_name)
url = <span class="hljs-string">f"http://<span class="hljs-subst">{svc}</span>/api/<span class="hljs-subst">{api_name}</span>"</span>
logger.info(<span class="hljs-string">f"Testing api @ <span class="hljs-subst">{url}</span>"</span>)

header = {<span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span>}
logger.info(<span class="hljs-string">f"Send API request for <span class="hljs-subst">{args.deployment_name}</span> deployment testing"</span>)
response = requests.request(<span class="hljs-string">"POST"</span>, url, headers=header, json=payload)

<span class="hljs-comment"># <span class="hljs-doctag">TODO:</span> Add all new assert condition here.</span>
<span class="hljs-keyword">if</span> args.deployment_name == <span class="hljs-string">"research-clarity-id-applicability"</span>:
    response_data = response.json()
    <span class="hljs-keyword">assert</span> (
        response_data[<span class="hljs-string">"ida_output_path"</span>].split(<span class="hljs-string">"/"</span>)[<span class="hljs-number">-1</span>] == <span class="hljs-string">"IDA.ndjson"</span>
    ), <span class="hljs-string">"Not Received expected output, test is failed"</span>
<span class="hljs-keyword">elif</span> args.deployment_name == <span class="hljs-string">"research-clarity-id-adv-nonadv"</span>:
    response_data = str(response.content).replace(<span class="hljs-string">"'"</span>, <span class="hljs-string">""</span>)
    <span class="hljs-keyword">assert</span> (
        response_data.split(<span class="hljs-string">"/"</span>)[<span class="hljs-number">-1</span>] == <span class="hljs-string">"classifiation_output.ndjson"</span>
    ), <span class="hljs-string">"Not Received expected output, test is failed"</span>

logger.info(<span class="hljs-string">"Deployment test passed successfully !!!"</span>)
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Testing in a CI/CD Pipeline Part 2: Integration testing]]></title><description><![CDATA[This is part 2 of the Testing in a CI/CD Pipeline series. It is advised to first go through it 🤓.
https://importidea.hashnode.dev/testing-in-cicd-part-1-pr-testing
1. Integration testing in brief 💼
Integration testing is different from system or un...]]></description><link>https://importidea.dev/testing-in-cicd-part-2-integration-testing</link><guid isPermaLink="true">https://importidea.dev/testing-in-cicd-part-2-integration-testing</guid><category><![CDATA[ci-cd]]></category><category><![CDATA[Testing]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Thu, 27 May 2021 06:35:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1622001310159/TsxFP-qZH.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is part 2 of the <strong>Testing in a CI/CD Pipeline</strong> series. It is advised to first go through it 🤓.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://importidea.hashnode.dev/testing-in-cicd-part-1-pr-testing">https://importidea.hashnode.dev/testing-in-cicd-part-1-pr-testing</a></div>
<h2 id="1-integration-testing-in-brief">1. Integration testing in brief 💼</h2>
<p>Integration testing is different from system or unit testing. Let's look at it in brief (the main intent of this guide is how to integrate it into a CI/CD pipeline).</p>
<ul>
<li>In unit testing, tests measure the correctness of the system's individual, smallest components (units). </li>
<li>In contrast, integration testing is a testing stage where two or more software units are joined and tested as one entity.</li>
<li>Integration testing in a CI/CD pipeline works best and integrates most easily if your system is built using a microservice approach. </li>
</ul>
<h2 id="2-general-mechanics">2. General mechanics</h2>
<p><strong>Why</strong></p>
<ul>
<li>To ensure every individual microservice is working/behaving correctly. </li>
<li>Buggy code is stopped in the CI pipeline itself &amp; never gets deployed. </li>
</ul>
<p><strong>How</strong></p>
<ul>
<li>Running a practical (mocking a real-world use case) example against the microservice.
The result must be known beforehand and will be used to compare the output from the test.</li>
</ul>
<p><strong>When</strong></p>
<ul>
<li>Integration testing must be done during the CI pipeline. </li>
<li>It should be triggered after the microservice is successfully built.</li>
</ul>
<p><strong>Where</strong></p>
<ul>
<li>It is always performed at the remote repository (e.g. GitHub, GitLab).</li>
<li>It should be the second step of any CI/CD pipeline.</li>
</ul>
<h2 id="3-implementation-in-azure-devops-pipeline">3. Implementation in Azure DevOps pipeline 🚀</h2>
<ul>
<li>If your microservice exposes an endpoint (which will be in most of the cases) then all you need is to post a request using REST API (or whatever your microservice supports).</li>
<li>In my case, the complete microservice was packaged and containerized using Docker. I used a simple Python script to extract the URL, post a request using the REST API, and compare the output. </li>
</ul>
<p>Let's go first through the testing script and then through the CI pipeline</p>
<h3 id="31-test-script">3.1 Test script</h3>
<ol>
<li>The core of the script is to read some sample data, send a POST request, and finally compare the result. The bare minimum would be</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> open(<span class="hljs-string">'./tests/sample_data.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
    payload = json.load(f)

header = {<span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span>}

response = requests.request(<span class="hljs-string">'POST'</span>, url, headers=header, json=payload)
response_data = response.json()

<span class="hljs-keyword">assert</span> response_data[<span class="hljs-string">'ida_output_path'</span>].split(<span class="hljs-string">'/'</span>)[<span class="hljs-number">-1</span>] ==<span class="hljs-string">'IDA.ndjson'</span>, <span class="hljs-string">'Not Received expected output, test is failed.'</span> <span class="hljs-comment"># You may use some other method to compare 😁</span>
</code></pre>
<ul>
<li>This only works in an ideal scenario, which will not be the case most of the time, &amp; on its own it defeats our original purpose. </li>
<li>We need to make it more suitable &amp; versatile for this use case. It can be done by adding two more components,<ol>
<li>A try-except block. </li>
<li>Console logging at the time of failure to investigate it. </li>
</ol>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">try</span>:
    response = requests.request(<span class="hljs-string">'POST'</span>, url, headers=header, json=payload)
    response_data = response.json()
<span class="hljs-keyword">except</span> Exception:
    logging.error(<span class="hljs-string">'An error has occurred. Refer logs to locate error.'</span>)
    os.system(<span class="hljs-string">"docker logs test_api &gt; output.log"</span>)
    time.sleep(<span class="hljs-number">3</span>)
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'./output.log'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> log:
        print(log.read())

<span class="hljs-keyword">try</span>:    
    <span class="hljs-keyword">assert</span> response_data[<span class="hljs-string">'ida_output_path'</span>].split(<span class="hljs-string">'/'</span>)[<span class="hljs-number">-1</span>] ==<span class="hljs-string">'IDA.ndjson'</span>, <span class="hljs-string">'Not Received expected output, test is failed.'</span>
<span class="hljs-keyword">except</span> (AssertionError,KeyError) <span class="hljs-keyword">as</span> e:
    os.system(<span class="hljs-string">"docker logs test_api &gt; output.log"</span>)
    time.sleep(<span class="hljs-number">3</span>)
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'./output.log'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> log:
        print(log.read())
</code></pre>
<ul>
<li>Some pointers from the above code block<ol>
<li>I have put the <em>post request and result comparison</em> in <code>try-except</code> blocks so that an error is caught and handled instead of crashing the script. </li>
<li>As mentioned earlier, I am running the test against the Docker container, so I extract all the logs from it (as shown on lines 6 and 14).</li>
<li>Printing logs at the time of failure is very important for investigating the issue. The aim of integration testing is not just to block a buggy release but also to help investigate it. </li>
</ol>
</li>
</ul>
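<p>The log extraction is repeated in both <code>except</code> branches, so it could be pulled into one helper. The sketch below is one possible way to do that, not what the original script does: <code>read_container_logs</code> and the injectable <code>runner</code> parameter are hypothetical names. It reads the output directly instead of going through <code>output.log</code>, and it also collects stderr, which a plain <code>&gt;</code> shell redirect does not capture. A stub runner stands in for Docker here so the example runs anywhere.</p>

```python
import subprocess
from types import SimpleNamespace
from typing import Callable


def read_container_logs(container_name: str, runner: Callable = subprocess.run) -> str:
    """Return the logs of a container as one string."""
    result = runner(["docker", "logs", container_name], capture_output=True, text=True)
    # docker logs replays the container's stdout and stderr on the matching streams
    return result.stdout + result.stderr


def fake_runner(cmd, capture_output=False, text=False):
    # Stand-in for subprocess.run so the sketch does not need a Docker daemon
    return SimpleNamespace(stdout="server started\n", stderr="KeyError: ida_output_path\n")


logs = read_container_logs("test_api", runner=fake_runner)
print(logs)
```

<p>In the real script the default <code>runner</code> would be used, and the call would sit in each <code>except</code> branch right before the process exits.</p>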
<h3 id="32-extracting-urlendpoint">3.2 Extracting URL/endpoint</h3>
<ul>
<li>This may differ quite a lot based on your use case. This is how I do it.<ol>
<li>I first start/run the freshly built Docker container on the worker node of the CI pipeline.</li>
<li>Then I extract the IP address where the container is running.</li>
<li>Finally, I use this IP as the endpoint to post the request over the REST API. </li>
</ol>
</li>
</ul>
<pre><code class="lang-python">subprocess.call([<span class="hljs-string">'docker'</span>, <span class="hljs-string">'run'</span>, <span class="hljs-string">'-d'</span>, <span class="hljs-string">'-p'</span> ,<span class="hljs-string">'80:80'</span>,<span class="hljs-string">'--name'</span>, <span class="hljs-string">'test_api'</span>, args.image_name])
ip = subprocess.getoutput(<span class="hljs-string">"docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' test_api"</span>)
url = <span class="hljs-string">f'http://<span class="hljs-subst">{ip}</span>/api/IDA'</span>
logger.info(<span class="hljs-string">f"Testing api @ <span class="hljs-subst">{url}</span>"</span>)
</code></pre>
<h3 id="32-integration-with-ci-job">3.3 Integration with CI Job</h3>
<ul>
<li>If you follow a similar flow, then all you need is a task to run the Python script. It can be as simple as,</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">PythonScript@0</span>
  <span class="hljs-attr">displayName:</span> <span class="hljs-string">Integration</span> <span class="hljs-string">testing</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">scriptSource:</span> <span class="hljs-string">'filePath'</span>
    <span class="hljs-attr">scriptPath:</span> <span class="hljs-string">'$(System.DefaultWorkingDirectory)/tests/integration_testing.py'</span>
    <span class="hljs-attr">arguments:</span> <span class="hljs-string">'--image_name your.repo.io/ida:$(Build.BuildNumber)'</span>
</code></pre>
<ul>
<li>There are two more important points for implementing Integration testing<ol>
<li>Placement: It should be performed just after Docker build and before Docker push.</li>
<li>If the test fails, the script must stop the pipeline from proceeding further. This can be achieved by using something like <code>os.sys.exit()</code></li>
</ol>
</li>
</ul>
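<p>Why does exiting stop the pipeline? A CI task is marked as failed when its process ends with a non-zero exit status, and <code>sys.exit()</code> called with a message prints that message to stderr and exits with status 1. The sketch below illustrates this with a hypothetical helper, <code>check_output_path</code>; the sample path is made up for the example.</p>

```python
import sys


def check_output_path(response_data: dict, expected_file: str = "IDA.ndjson") -> None:
    """Exit with a non-zero status when the response does not match the expected file."""
    actual_file = response_data.get("ida_output_path", "").split("/")[-1]
    if actual_file != expected_file:
        # A string argument is printed to stderr and the exit status becomes 1
        sys.exit(f"Expected {expected_file}, got {actual_file!r}. Task terminated")


# Matching output: the function returns and the script carries on
check_output_path({"ida_output_path": "/tmp/out/IDA.ndjson"})
print("Integration test passed")
```

<p>Using <code>.get()</code> with a default also covers the case where the key is missing from the response, so one check replaces both the <code>AssertionError</code> and the <code>KeyError</code> branch.</p>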
<h2 id="4-workingdemo">4. Working/Demo</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1622614739603/ciL8uPnFx.png" alt="image.png" />
Img: Case- Passing of Integration test</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1622614847706/2NwfE7jXI.png" alt="image.png" />
Img: Case- Failing of Integration test. </p>
<ul>
<li>As you can see, no further tasks were executed after the failure of the integration test, and consequently none of the subsequently connected CD tasks ran either. </li>
</ul>
<h2 id="5-complete-code">5. Complete code</h2>
<ol>
<li>integration_testing.py</li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> subprocess


logging.basicConfig(level=logging.INFO, format=<span class="hljs-string">'[%(levelname)s]: %(message)s'</span>)
logger = logging.getLogger(__name__)
parser = argparse.ArgumentParser()
parser.add_argument(<span class="hljs-string">'--image_name'</span>, type=str)

args = parser.parse_args()

<span class="hljs-comment"># Run docker container at port 80</span>
subprocess.call([<span class="hljs-string">'docker'</span>, <span class="hljs-string">'run'</span>, <span class="hljs-string">'-d'</span>, <span class="hljs-string">'-p'</span> ,<span class="hljs-string">'80:80'</span>,<span class="hljs-string">'--name'</span>, <span class="hljs-string">'test_api'</span>, args.image_name])
ip = subprocess.getoutput(<span class="hljs-string">"docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' test_api"</span>)
url = <span class="hljs-string">f'http://<span class="hljs-subst">{ip}</span>/api/IDA'</span>
logger.info(<span class="hljs-string">f"Testing api @ <span class="hljs-subst">{url}</span>"</span>)
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'./tests/sample_data.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
    payload = json.load(f)

header = {<span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span>}
logger.info(<span class="hljs-string">'Waiting for 30 sec to let docker container start'</span>)
time.sleep(<span class="hljs-number">30</span>)
logger.info(<span class="hljs-string">"Send API request for integration testing"</span>)
<span class="hljs-keyword">try</span>:
    response = requests.request(<span class="hljs-string">'POST'</span>, url, headers=header, json=payload)
    response_data = response.json()
<span class="hljs-keyword">except</span> Exception:
    logging.error(<span class="hljs-string">'An error has occurred. Refer logs to locate error.'</span>)
    os.system(<span class="hljs-string">"docker logs test_api &gt; output.log"</span>)
    time.sleep(<span class="hljs-number">3</span>)
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'./output.log'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> log:
        print(log.read())
    os.sys.exit(<span class="hljs-string">'Task terminated'</span>)

<span class="hljs-keyword">try</span>:    
    <span class="hljs-keyword">assert</span> response_data[<span class="hljs-string">'ida_output_path'</span>].split(<span class="hljs-string">'/'</span>)[<span class="hljs-number">-1</span>] ==<span class="hljs-string">'IDA.ndjson'</span>, <span class="hljs-string">'Not Received expected output, test is failed.'</span>
<span class="hljs-keyword">except</span> (AssertionError,KeyError) <span class="hljs-keyword">as</span> e:
    os.system(<span class="hljs-string">"docker logs test_api &gt; output.log"</span>)
    time.sleep(<span class="hljs-number">3</span>)
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'./output.log'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> log:
        print(log.read())
    logging.error(<span class="hljs-string">f'Problem with <span class="hljs-subst">{e}</span>. Refer to the logs to locate the error.'</span>)
    logging.error(<span class="hljs-string">f"Response from API: <span class="hljs-subst">{response_data}</span>"</span>)
    os.sys.exit(<span class="hljs-string">'Task terminated'</span>)

logger.info(<span class="hljs-string">"Integration test passed successfully, moving to the next task"</span>)
</code></pre>
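<p>The script above waits a fixed 30 seconds before sending the request. As a minimal sketch of an alternative (not part of the original script), the fixed wait could be replaced by polling until the container responds. The helper name <code>wait_until</code> and its parameters are assumptions made for illustration; <code>check</code> can be any callable that returns <code>True</code> once the API is reachable.</p>
<pre><code class="lang-python">import time


def wait_until(check, attempts=30, interval=1.0):
    """Call check() up to `attempts` times, pausing `interval` seconds between tries.

    Returns True as soon as check() returns a truthy value, otherwise False.
    Hypothetical helper, not part of the original script.
    """
    for attempt in range(attempts):
        try:
            if check():
                return True
        except Exception:
            # The service may refuse connections while it is still starting.
            pass
        if attempt != attempts - 1:
            time.sleep(interval)
    return False


# Example with a stand-in check that succeeds on the third call.
calls = []


def fake_check():
    calls.append(1)
    return len(calls) == 3


print(wait_until(fake_check, attempts=5, interval=0))
</code></pre>
<p>With this approach the test starts as soon as the service is ready and still gives up after a bounded number of tries.</p>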
<ol>
<li>CI pipeline</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">trigger:</span>
  <span class="hljs-attr">branches:</span>
    <span class="hljs-attr">include:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">develop</span>
  <span class="hljs-attr">paths:</span>
    <span class="hljs-attr">exclude:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">Dockerfile_base</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">requirements.txt</span>

<span class="hljs-attr">pool:</span>
  <span class="hljs-attr">vmImage:</span> <span class="hljs-string">'ubuntu-latest'</span>

<span class="hljs-attr">steps:</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">checkout:</span> <span class="hljs-string">self</span>
  <span class="hljs-attr">clean:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">fetchDepth:</span> <span class="hljs-number">1</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">UsePythonVersion@0</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">versionSpec:</span> <span class="hljs-string">'3.x'</span>
    <span class="hljs-attr">addToPath:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">architecture:</span> <span class="hljs-string">'x64'</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">CmdLine@2</span>
  <span class="hljs-attr">displayName:</span> <span class="hljs-string">Install</span> <span class="hljs-string">python</span> <span class="hljs-string">package</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">script:</span> <span class="hljs-string">'python3 -m pip install requests azure-devops'</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">PythonScript@0</span>
  <span class="hljs-attr">displayName:</span> <span class="hljs-string">Check</span> <span class="hljs-string">build</span> <span class="hljs-string">pipeline</span> <span class="hljs-string">status</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">scriptSource:</span> <span class="hljs-string">'filePath'</span>
    <span class="hljs-attr">scriptPath:</span> <span class="hljs-string">'$(System.DefaultWorkingDirectory)/tests/base_pipeline_status.py'</span>
    <span class="hljs-attr">arguments:</span> <span class="hljs-string">'--personal_access_token $(PERSONALACCESSTOKEN) --repo_id b978e55f-bf80-466c-86c8-fc0dfe909b2c --pipeline_def_id 179'</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">Docker@0</span>
  <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Docker Build Image'</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">azureSubscription:</span> <span class="hljs-string">'Your Subscription'</span>
    <span class="hljs-attr">azureContainerRegistry:</span> <span class="hljs-string">'Your container registry'</span>
    <span class="hljs-attr">dockerFile:</span> <span class="hljs-string">Dockerfile</span>
    <span class="hljs-attr">buildArguments:</span> <span class="hljs-string">|
      ARG_STORAGEACCOUNTNAME=$(STORAGEACCOUNTNAME)
      ARG_CONTAINERNAME=$(CONTAINERNAME)
      ARG_STORAGEACCOUNTKEY=$(STORAGEACCOUNTKEY)
      ARG_MAXWORKERS=$(MAXWORKERS)
</span>    <span class="hljs-attr">imageName:</span> <span class="hljs-string">'ida:$(Build.BuildNumber)'</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">PythonScript@0</span>
  <span class="hljs-attr">displayName:</span> <span class="hljs-string">Integration</span> <span class="hljs-string">testing</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">scriptSource:</span> <span class="hljs-string">'filePath'</span>
    <span class="hljs-attr">scriptPath:</span> <span class="hljs-string">'$(System.DefaultWorkingDirectory)/tests/integration_testing.py'</span>
    <span class="hljs-attr">arguments:</span> <span class="hljs-string">'--image_name your.repo.io/ida:$(Build.BuildNumber)'</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">Docker@0</span>
  <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Push image to ACR'</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">azureSubscription:</span> <span class="hljs-string">'Your Subscription'</span>
    <span class="hljs-attr">azureContainerRegistry:</span> <span class="hljs-string">'Your container registry'</span>
    <span class="hljs-attr">action:</span> <span class="hljs-string">'Push an image'</span>
    <span class="hljs-attr">imageName:</span> <span class="hljs-string">'ida:$(Build.BuildNumber)'</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">PublishBuildArtifacts@1</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">PathtoPublish:</span> <span class="hljs-string">'$(Build.SourcesDirectory)/kube'</span>
    <span class="hljs-attr">ArtifactName:</span> <span class="hljs-string">'drop'</span>
    <span class="hljs-attr">publishLocation:</span> <span class="hljs-string">'Container'</span>
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Testing in a CI/CD Pipeline Part 1: PR testing]]></title><description><![CDATA[There is no doubt the benefits of TDD 🧪 or test driven development. Follow this awesome Twitter thread and I bet if you not a fan of TDD you will surely become one.
https://twitter.com/gareth_leake_/status/1308989905197043713
TDD is a lot more than ...]]></description><link>https://importidea.dev/testing-in-cicd-part-1-pr-testing</link><guid isPermaLink="true">https://importidea.dev/testing-in-cicd-part-1-pr-testing</guid><category><![CDATA[ci-cd]]></category><category><![CDATA[Testing]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Wed, 26 May 2021 03:20:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1621998509554/0y81CCWGd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is no doubt about the benefits of TDD 🧪, or test-driven development. Follow this awesome Twitter thread, and I bet that if you are not a fan of TDD, you will surely become one.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://twitter.com/gareth_leake_/status/1308989905197043713">https://twitter.com/gareth_leake_/status/1308989905197043713</a></div>
<p>TDD is a lot more than just unit testing. Let us see how we can design our CI/CD pipeline to adapt to the TDD approach.</p>
<blockquote>
<p>Info: 
💡💡💡</p>
<p>I am using the Azure DevOps pipeline as the CI/CD tool, but I believe you will be able to implement this in other CI/CD tools too. Understanding the mechanism is key.</p>
</blockquote>
<hr />
<p>Let's break down the pipeline into three components and see how to test each one.</p>
<h2 id="1-pull-request-testing-in-brief">1. Pull request testing in brief 💼</h2>
<p>PR testing is different from system or unit testing. Let's look at it briefly (the main intent of this guide is to show how to integrate it into a CI/CD pipeline).</p>
<ul>
<li>In unit testing, tests measure the correctness of the system's individual, smaller components (units).</li>
<li>In contrast, PR testing is a testing stage where functional tests are performed on the complete scope of the branch that is to be merged.</li>
<li>PR testing in a CI/CD pipeline works best and integrates easily if your system is built using a microservice approach.</li>
</ul>
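<p>To make the distinction above concrete, here is a small sketch in pytest style. All function names (<code>parse_amount</code>, <code>apply_discount</code>, <code>checkout_total</code>) are hypothetical and exist only for this illustration: the first test checks one unit in isolation, while the second checks the complete behaviour that a branch would deliver.</p>
<pre><code class="lang-python">def parse_amount(text):
    """Unit component (hypothetical): convert a string such as '12.50' to a float."""
    return float(text.strip())


def apply_discount(amount, percent):
    """Unit component (hypothetical): reduce amount by the given percentage."""
    return round(amount * (100 - percent) / 100, 2)


def checkout_total(raw_amounts, percent):
    """Feature under PR test (hypothetical): combines both components."""
    total = sum(parse_amount(item) for item in raw_amounts)
    return apply_discount(total, percent)


def test_parse_amount():
    # Unit test: one small component checked in isolation.
    assert parse_amount(" 12.50 ") == 12.5


def test_checkout_total():
    # Functional test: the complete behaviour the branch is supposed to deliver.
    assert checkout_total(["30", "20"], 20) == 40.0
</code></pre>
<p>Because both functions follow the <code>test_</code> naming convention, pytest can collect them automatically when the file name also starts with <code>test_</code>.</p>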
<h2 id="2-general-mechanics">2. General mechanics 🧰</h2>
<p><strong>Why</strong></p>
<ul>
<li>To ensure the code's functionality works as expected.</li>
<li>It helps prevent unintended commits from being merged.</li>
<li>It ensures nothing breaks during collaboration.</li>
</ul>
<p><strong>How</strong></p>
<ul>
<li>The core of running a PR test is the same as running a unit test.</li>
<li>The tests used here are identical to unit tests, with these key differences:<ol>
<li>Unit tests are (or should be) run locally, whereas PR tests are run on the remote.</li>
<li>The code coverage of PR testing must always be greater than that of unit testing, because a unit test is only intended to check individual changes locally, whereas PR testing is intended to check the complete branch, which may contain the work of more than one collaborator.</li>
<li>You may get by without a 100% pass rate for unit tests (though this is not ideal), but PR tests must pass 100%.</li>
</ol>
</li>
</ul>
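<p>The requirement that PR tests pass 100% can be sketched as a simple gate on pytest's exit code, which is 0 only when every collected test passed. The function names <code>run_pr_tests</code> and <code>gate_passed</code> are assumptions made for this example, not part of any pipeline tooling.</p>
<pre><code class="lang-python">import subprocess
import sys


def run_pr_tests(test_args=("-vv",)):
    """Run pytest in a child process and return its exit code (hypothetical helper)."""
    completed = subprocess.run([sys.executable, "-m", "pytest", *test_args])
    return completed.returncode


def gate_passed(exit_code):
    """pytest exits with 0 only when every collected test passed."""
    return exit_code == 0


# In a pipeline script one would typically end with:
#     sys.exit(run_pr_tests())
# so that any failing test makes the step, and therefore the PR check, fail.
</code></pre>
<p>Any other exit code, including the one pytest uses when no tests were collected, is treated as a failed gate in this sketch.</p>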
<p><strong>When</strong></p>
<ul>
<li>PR testing should be performed at the time of PR completion or merge activity.</li>
</ul>
<p><strong>Where</strong></p>
<ul>
<li>It is always performed on the remote repository (e.g., GitHub, GitLab).</li>
<li>It should be the first step of any CI/CD pipeline.</li>
</ul>
<h2 id="3-implementation-it-in-the-azure-devops-pipeline">3. Implementing it in the Azure DevOps pipeline 🚀</h2>
<ul>
<li>I am using Python and pytest here. I am assuming that you already have all the unit tests written and follow the conventions for <a target="_blank" href="https://docs.pytest.org/en/6.2.x/goodpractices.html">pytest automatic test discovery</a>. </li>
<li>All you need to run is a terminal command:</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">CmdLine@2</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">script:</span> <span class="hljs-string">|
      # If you have setup.py
      python3 -m pip install .
      pytest -vv
</span>    <span class="hljs-attr">workingDirectory:</span> <span class="hljs-string">'$(System.DefaultWorkingDirectory)'</span>
</code></pre>
<ul>
<li>But the critical part here is <em>How to make it automated?</em> 
After all, this is part of CI/CD and the whole point of CI/CD is automation. </li>
<li>This is a two-step process in Azure DevOps:<ol>
<li>Create a new pipeline dedicated to running the unit tests. The trigger should be the same as its parent build or CI pipeline. </li>
</ol>
</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-attr">trigger:</span>
<span class="hljs-bullet">-</span> <span class="hljs-string">develop</span>

<span class="hljs-attr">pool:</span>
  <span class="hljs-attr">vmImage:</span> <span class="hljs-string">'ubuntu-latest'</span>

<span class="hljs-attr">steps:</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">UsePythonVersion@0</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">versionSpec:</span> <span class="hljs-string">'3.x'</span>
    <span class="hljs-attr">addToPath:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">architecture:</span> <span class="hljs-string">'x64'</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">CmdLine@2</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">script:</span> <span class="hljs-string">|
      # If you have setup.py
      python3 -m pip install .
      pytest -vv
</span>    <span class="hljs-attr">workingDirectory:</span> <span class="hljs-string">'$(System.DefaultWorkingDirectory)'</span>
</code></pre>
<blockquote>
<p>Tip:
🤝🤝🤝</p>
<p>As you can see above, I use the develop branch as the trigger because this PR testing is intended for the develop branch.</p>
</blockquote>
<ol>
<li>Branch Policies:<ul>
<li>To enable PR triggers, we need to set up branch policies in Azure DevOps.</li>
<li>Go to Repo &gt; Branch &gt; Branch Policies &gt; Build Validation.</li>
</ul>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1621827665103/RllfoYCTK.png" alt="branch policies" />
Img: Branch Policies
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1621827692385/v-cAnwgxJ.png" alt="build validation " />
Img: Build Validation</p>
<ul>
<li>We have to add a build validation policy for the PR trigger. Of all the settings, two are key:<ol>
<li>Trigger: Must be automatic</li>
<li>Policy requirement: Required </li>
</ol>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1621828372390/JGtMq_9L7.png" alt="build-validation-policy.png" />
Img: Build Validation Policy</p>
<ul>
<li>Note that in the Build pipeline section, I am pointing to the PR testing pipeline that I showed earlier.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1621828704914/5S9k1b1rs.png" alt="demo-pr-test.png" /></p>
<p>Img: How the PR testing works (see the marked item)</p>
]]></content:encoded></item><item><title><![CDATA[Understanding current AI industry expectation]]></title><description><![CDATA[1. The Past Expectation
It is vital to understand the past responsibility to adapt to the current expectation of the industry. 

The AI space was still relatively new (though not in academics) and many companies, startups were analyzing its applicati...]]></description><link>https://importidea.dev/understanding-current-ai-industry-expectation</link><guid isPermaLink="true">https://importidea.dev/understanding-current-ai-industry-expectation</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[guide]]></category><category><![CDATA[Career]]></category><dc:creator><![CDATA[Akash Desarda]]></dc:creator><pubDate>Sun, 23 May 2021 16:47:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1623645172685/pbJVYWAOG.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="1-the-past-expectation">1. The Past Expectation</h2>
<p>It is vital to understand past responsibilities in order to adapt to the industry's current expectations.</p>
<ul>
<li>The AI space was still relatively new (though not in academia), and many companies and startups were analyzing its applications and valid use cases.</li>
<li>Research was the primary focus. The caveat was that this research was often not directly in line with the core of the organization, so initially not much credibility was expected.</li>
<li>Generally, companies used to blend the role of a Data Scientist with that of a Data Analyst or Data Engineer, again due to the vagueness of enterprise AI applications.</li>
<li>Individuals had a similar dilemma. A lot of their research or work was not directly in line with the organization's core and was not practically viable to be served as a product.</li>
</ul>
<hr />
<h2 id="2-the-current-outlook">2. The current outlook</h2>
<p>The democratization of AI has seen remarkable developments from businesses and startups. Let us try to understand it:</p>
<ul>
<li>The industry now distinguishes between the roles of Data Scientist, Machine Learning Engineer, Data Analyst, Data Engineer, and even MLOps Engineer.</li>
<li>Businesses no longer allow research in the wild, as they know exactly which use case they are tapping into. A clear mindset &amp; a similarly discrete approach is also required from individuals.</li>
<li>Every research effort or POC must lead to a tangible and servable product.</li>
</ul>
<hr />
<h2 id="3-the-thorough-dissection-of-all-the-roles">3. The thorough dissection of all the Roles</h2>
<p>If we have to pick one area where businesses have excelled in the AI space, it is undoubtedly in setting clear expectations for each of the roles, which are, in a nutshell:</p>
<ol>
<li><p>Data Scientist: A person (generally from a stats/maths background) who uses a variety of means, including AI, to extract valuable information from data.</p>
<ul>
<li>A fundamental difference between a Data Analyst &amp; a Data Scientist is that the former generally relies on domain knowledge and manual, old-school methods to make sense of data on a small to medium scale, whereas the latter is responsible for collecting, analyzing, and interpreting data on a larger scale using a wider range of tools such as AI, SQL, and old-school manual methods.</li>
<li>Domain knowledge is not a must, but having it is helpful.</li>
<li>The primary job is to maintain and extract business-contributing insights from data &amp; not to develop software or a product.</li>
<li>A Statistician or a Mathematician can become a good Data Scientist. </li>
</ul>
</li>
<li><p>Machine Learning Engineer: A niche software engineer who develops a product or service based on AI.</p>
<ul>
<li>An ML engineer needs all the expertise of traditional software engineering along with knowledge of AI, because they will eventually build software with AI at its core.</li>
<li>The primary job is not to extract data but to develop an AI tool that can perform the same job.</li>
<li>A developer with good knowledge of machine learning/deep learning as well as software engineering can become a good Machine Learning Engineer.</li>
</ul>
</li>
<li><p>Machine Learning Operations (MLOps) Engineer: A niche software engineer who maintains and automates the pipeline used by the ML system.</p>
<ul>
<li>A relatively new field inspired by DevOps, though different from traditional DevOps roles.</li>
<li>Unlike traditional software engineering, development of any AI-based product/software/service doesn't stop once the software is built. It has to be updated regularly with new data, based on the <em>Data-Drift</em>.</li>
<li>The primary job includes all traditional DevOps work as well as maintaining/automating the pipeline and handling Data-Drift.</li>
<li>A developer with good knowledge of machine learning/deep learning, software engineering &amp; cloud technologies can become a good MLOps engineer.</li>
</ul>
</li>
<li><p>Data Engineer: A niche software engineer who develops a pipeline to serve all data needs using a variety of tools (generally cloud-based).</p>
<ul>
<li>The data engineer needs expertise with major cloud-based data platforms, batch or stream processing, big data platforms (depending on business requirements), and databases (though not at the level of a database administrator).</li>
<li>Has to work in line with Data Governance policies.</li>
<li>The primary job is to design, implement, and maintain the data pipeline.</li>
<li>A developer with good knowledge of cloud technologies, data platforms, and processing needs can become a good Data Engineer.</li>
</ul>
</li>
</ol>
<hr />
<p>For a new job seeker or someone aiming to advance in their career, all these roles and expectations must be well understood. Given that companies clearly distinguish between these roles, individuals are expected to do the same. A vague mindset is totally useless.</p>
]]></content:encoded></item></channel></rss>