Airflow 2.0 & how it works : Scheduling and beyond

Over the last few years, Airflow has become a major player for scheduling a wide variety of actions. From running basic Python scripts to making API calls, up to acting as an ETL tool, it can address nearly all your needs !

As a result, it has more and more become a central tool for everything, even for actions it is not meant for. And between what the tool is made for and how it is often used, people don’t always get how it really works.

We’re going to see what comes in handy, and what to avoid, when working with Airflow. We will dive deeper into some of this content in other articles.

What is Airflow made for ?

Airflow, regardless of its version and implementation, is at its core a scheduling tool, whatever usage you’ll encounter.

Its role is to orchestrate your runs, not code them ! This is very important to understand, because nearly all constructs around Airflow are built for that purpose.

These are the elements that form Airflow’s cement : DAGs, Operators, Hooks and Events.

Let’s describe them !

1. DAG : Directed Acyclic Graph

The base component of Airflow is a DAG. It’s a group of Tasks to execute in a given order.

As its name describes, it’s an acyclic graph of Tasks. For each Task you can declare its parent or child Task(s) to define a link between Tasks and the direction it goes. You thus get a graph of dependencies representing the actions you have to orchestrate.

To give Airflow some insightful info, you also provide some orchestration parameters :

  • a cron to define when it runs
  • whether or not the scheduler should go back in time and catch up on past runs not yet executed
  • actions to take on some events (we’ll see this later)

A DAG is the entry point to run something on Airflow. Airflow runs DAGs, not Operators, nor Hooks, nor anything else. If a Task isn’t in a DAG, it’s never scheduled.

(Figure : a basic group of Tasks forming a DAG)
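
To make this concrete, here is a minimal sketch of a DAG definition with Airflow 2.0’s classic API ; the DAG name, cron expression and commands are just placeholders :

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical DAG : runs every day at 06:00 and does not catch up on past runs.
with DAG(
    dag_id="daily_ingestion",
    schedule_interval="0 6 * * *",   # the cron defining when it runs
    start_date=datetime(2021, 1, 1),
    catchup=False,                   # don't go back in time for missed runs
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract'")
    load = BashOperator(task_id="load", bash_command="echo 'load'")

    # Parent/child link : extract must finish before load starts.
    extract >> load
```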

2. Operators (synchronous Task)

Airflow relies on two different kinds of components : Operators and Hooks. Both are referred to as « Tasks », which is what a DAG schedules.

Operators are made to do something, for example :

  • Execute a script
  • Insert data in a PostgreSQL Database
  • Upload a file to AWS
  • Run any custom action through a custom Operator (e.g. launch an ETL/ELT job)
  • etc.

An Operator is executed, and nothing is to be gotten from it except Success or Failure : it is considered synchronous. This is how you should think of an Operator.
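
As an illustration, here is a minimal sketch of a synchronous Operator, assuming the postgres provider package is installed ; the Connection, table and SQL are hypothetical :

```python
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Declared inside a "with DAG(...)" block, like any other Task.
# It runs the statement and only reports Success or Failure back to the DAG.
insert_rows = PostgresOperator(
    task_id="insert_rows",
    postgres_conn_id="my_postgres",   # a Connection configured in Airflow
    sql="INSERT INTO daily_stats SELECT * FROM staging_stats;",
)
```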

Under an Operator often lies a Hook, because many interactions with a remote system are asynchronous. You thus need an underlying piece of code to emulate this synchronous behaviour. This is where Hooks come into the picture !

3. Hooks (asynchronous Task)

Hooks are a big part of Airflow, because they implement the basic feature of checking whether or not an action on a remote system is done.

A Hook needs something called a Connection, which works hand in hand with it : you need to know the endpoint you’re going to ask for an answer.

Then, the process is fairly simple :

  1. Connect and send your request (HTTP, SQL, etc.)
  2. Wait
  3. Ask if it’s OK
  4. If not, go back to n°2
  5. Else, get the result !
  6. Close your connection

And here lies the other big difference between Operators and Hooks : Hooks are asynchronous, and you need to fetch the result of your query.
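
As a sketch of that loop, here is roughly what the code around a Hook could look like, assuming the http provider package, a hypothetical job API and a Connection called « my_api » configured in Airflow :

```python
import json
import time

from airflow.providers.http.hooks.http import HttpHook


def wait_for_remote_job(payload: dict) -> dict:
    # 1. Connect and send the request (the Connection "my_api" holds host and credentials)
    submit = HttpHook(method="POST", http_conn_id="my_api")
    response = submit.run(
        endpoint="/jobs",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    job_id = response.json()["id"]

    poll = HttpHook(method="GET", http_conn_id="my_api")
    while True:
        # 2. Wait
        time.sleep(30)
        # 3. Ask if it's OK
        status = poll.run(endpoint=f"/jobs/{job_id}").json()["status"]
        # 4. If not done yet, go back to waiting
        if status == "done":
            break

    # 5. Get the result (6. the hook manages the underlying HTTP session for us)
    return poll.run(endpoint=f"/jobs/{job_id}/result").json()
```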

4. Events

Events are the last core feature of Airflow : they give us info on the status of a Task, such as None, Queued, Scheduled, Running, Failed, Up for retry, etc. And the list is quite long.

On these statuses, you can automate some actions like « on_success_callback », which will make your Task automatically launch a piece of code when the status « Success » is reached.

You can also hook into events like « before execute » to run something, such as some logging, before any execution of a Task.
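
For instance, here is a minimal sketch of an « on_success_callback » ; the notification function and the Task are hypothetical :

```python
from airflow.operators.bash import BashOperator


def notify_success(context):
    # "context" carries the task instance, the run date, etc.
    task_id = context["task_instance"].task_id
    print(f"Task {task_id} succeeded for run {context['ds']}")


# Declared inside a DAG : the function runs every time the Task reaches Success.
report = BashOperator(
    task_id="report",
    bash_command="echo 'report'",
    on_success_callback=notify_success,
)
```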

Important note : the Apache and Google Cloud Composer versions differ in how they implement this feature. Apache triggers it any time the status changes. Meanwhile, Google implemented a simplistic version where you can’t force the activation of an event and you can’t reset the event trigger. It’s a one-shot.

What can you do with Airflow ?

Here the answer is as confusing as it is real : anything !

With enough resources and time, you can write any piece of code in Python and run it, whatever it does. And the community has taken the bad habit of coding and running everything inside Airflow.

You can find Airflow setups where people have nearly coded a full ETL tool. Sure, it doesn’t perform well and is painful to maintain, but nevertheless, it works.

So, yes, it can do nearly anything, but it will cost you a lot.

What you should not do !

What to avoid, and why ?

You shouldn’t use a DAG to compute stuff or, even worse, to create connections to remote systems.

Why ? DAG files are re-evaluated every 5 seconds or so, to see if anything changed. If your code contains loops or connections, Airflow has to wait for that computation before it can even try to run your Task, leading to poor performance and heavy requests, as the sketch at the end of this section illustrates.

You shouldn’t create a monolithic DAG that contains every custom Task.

Why ? You’ll often reuse the same code over and over. Make it count and create reusable Operators when needed !

No business code should reside inside your Operators/Hooks/functions !

Why ? You should never see anything other than parameters inside those elements.

Airflow is a scheduler, not an ETL tool, nor a Docker platform, nor an ORM. You should not code inside Airflow except for scheduling purposes and logic.

You have to concatenate some files and zip them ? You need to retrieve some data from an API ? That’s not where it belongs.

You should have a separate system to handle this kind of thing.
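
To illustrate the first rule, here is a minimal sketch of the difference between code that runs at every DAG-file parse and code that runs only when the Task executes ; the API URL is hypothetical :

```python
import requests

from airflow.operators.python import PythonOperator

# Bad : this line would be executed every time the scheduler parses the DAG file.
# customers = requests.get("https://api.example.com/customers").json()


# Better : the call only happens when the Task actually runs.
def fetch_customers():
    return requests.get("https://api.example.com/customers", timeout=30).json()


# Declared inside a "with DAG(...)" block, like any other Task.
fetch = PythonOperator(task_id="fetch_customers", python_callable=fetch_customers)
```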


Airflow is here to orchestrate your Tasks, and you should not use it to do any transformation. Going that way is possible, but you’ll very quickly get stuck in heavy maintenance and workarounds to cover for what it is : a raw tool to code dependencies.

Don’t overestimate Airflow’s capabilities, and if you want to code all your stuff in Python, do it on a separate system.

What you should do !

What features are powerful, and why ?

DAGs and their scheduling

Those give you :

  • a cron-based schedule
  • the ability to rerun any Task
  • the ability to rerun over any period of time

A DAG is pretty simple to manage : it’s a file with code defining which Task precedes which.

And what is simpler than a cron to define its frequency ?

Another feature is that you can rerun any Task (this one is pretty standard), but when you rerun it, it reruns with the context of the given date.

Basically, you can tell Airflow : « Clear everything for the last 7 days and rerun it ». Not many schedulers can do that.
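
This works because each run carries its own logical date, which your Tasks can use instead of « now ». A small sketch with a templated parameter ; the export script is hypothetical :

```python
from airflow.operators.bash import BashOperator

# "{{ ds }}" is the run's logical date : if you clear and rerun the last 7 days,
# each rerun still processes its own day, not today's data.
daily_export = BashOperator(
    task_id="daily_export",
    bash_command="python export.py --day {{ ds }}",
)
```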

Use events to handle success, retries and failures automatically

Something very useful in Airflow is that you can create functions to handle any kind of trigger.

Thus, you can send an email, do some logging, dump things, etc. You can give one « on_success_callback » action to a DAG and a different one to another DAG.

This kind of automation reduces maintenance by a huge factor.
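
Here is a sketch of how such handlers can be attached once for a whole DAG through default_args ; the alerting functions are hypothetical :

```python
from datetime import datetime

from airflow import DAG


def alert_failure(context):
    print(f"Task {context['task_instance'].task_id} failed")


def log_retry(context):
    print(f"Task {context['task_instance'].task_id} is up for retry")


# Every Task created in this DAG inherits these callbacks.
with DAG(
    dag_id="monitored_dag",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
    default_args={
        "on_failure_callback": alert_failure,
        "on_retry_callback": log_retry,
    },
) as dag:
    ...
```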

Use custom Operators

Here lies the main strength of Airflow : you can create your own custom Operators. We will cover this later, but it’s a huge feature, since you can code any action and thus create your own handlers.

You have a proprietary API that you want to handle without exposing all your code in a Task configuration ? Create your own Operator to handle it !
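
A minimal sketch of a custom Operator ; the class name, parameter and the proprietary system it would call are hypothetical :

```python
from airflow.models.baseoperator import BaseOperator


# A hypothetical wrapper around a proprietary API : the DAG only sees parameters.
class TriggerInternalJobOperator(BaseOperator):
    def __init__(self, job_name: str, **kwargs):
        super().__init__(**kwargs)
        self.job_name = job_name

    def execute(self, context):
        # The real call to the proprietary system would live here,
        # typically through a dedicated Hook or an internal client library.
        self.log.info("Triggering internal job %s", self.job_name)
```

In a DAG, it then looks like any other Task : TriggerInternalJobOperator(task_id="refresh_reports", job_name="nightly_refresh").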

Use Sensors to create dependencies between DAGs

We did not cover Sensors here, but basically, a Sensor is a cross-DAG dependency.

They allow you to test whether or not a Task in a different DAG is finished, even though it’s not in your DAG. This is a fairly common feature, but here it comes with all the previous ones !

Time-wise scheduling, automatic event handling, etc.
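
Here is a sketch of such a cross-DAG dependency with the built-in ExternalTaskSensor ; the DAG and Task names are placeholders :

```python
from airflow.sensors.external_task import ExternalTaskSensor

# Waits until the "load" Task of the "daily_ingestion" DAG is done
# for the same logical date, polling every minute for up to one hour.
wait_for_ingestion = ExternalTaskSensor(
    task_id="wait_for_ingestion",
    external_dag_id="daily_ingestion",
    external_task_id="load",
    poke_interval=60,
    timeout=60 * 60,
)
```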


Airflow is a powerful tool when it comes to customizing code and working with a team. You can basically communicate with any system, and you can create Operators to wrap any usage you have, for simpler maintenance.

The second important thing about Airflow is that every scheduled Task is aware of when it should have run in the first place, so you can rerun it without side effects.

It’s a pretty raw tool, but once you’re done with the basics (or don’t need them), you’ll have a powerful tool in your hands, with great scalability potential.

Conclusion

In this article, we covered what is considered the core feature of Airflow : orchestration.

Reviewing the core aspects of this scheduler helps understand that it shouldn’t be used for computation, even though you can do whatever you want with it. Just because you can doesn’t mean you should.

One of the core aspects of Airflow we covered here is the notion of Operators and Hooks, which are respectively synchronous and asynchronous, the former using the latter in many cases. And the difference with many other schedulers ?

You can create your own custom Operators and Hooks for your own needs, without a big development effort.

 

Airflow is a very raw piece of software that you need to dive into to make good use of, compared to other turnkey schedulers, but once you’re settled, you can start producing very fast and without any large development.