Flexible, Easy Data Pipelines on Google Cloud with Cloud Composer (Cloud Next '18)

Good afternoon everyone, and welcome. I'm James Malone, a product manager with Google Cloud, and I'm joined today... "Hi, good afternoon. I'm Lu, a software engineer with Google, and I've been working on Composer since its inception." Today we're going to cover Google Cloud Composer.

Before I begin: Cloud Composer recently went GA, and there was a lot of work from Googlers to make that happen, so to all of the Googlers, thank you. But far beyond that, Composer is a work of love of the open-source community, and we wanted to take a second before we begin to say thank you to everyone who has participated in the development of Apache Airflow, who has given input, who has written code, who has used it. Truly, thank you. We're excited to be a part of the Airflow community now with Composer.

The first thing we want to do today is give an overview of Composer. Composer is a new service built on Apache Airflow, so we'll also talk about Airflow; we want to level-set and cover what Composer is, why we built it, and what it does. After that we'll go in depth on Apache Airflow, look at a DAG, run through a demo of Cloud Composer, and then cover next steps for how you can get involved with both Airflow and Cloud Composer.

Before Composer, there were a few ways to create, schedule, and manage workflows on Google Cloud Platform, and to be totally honest, they were not the best. On one end you had very low-cost, very easy, but pretty inflexible and not very powerful solutions: mainly putting something in a crontab, scheduling it, and letting it run. On the more complicated end, you had customers developing really complicated frameworks to schedule, orchestrate, and manage things on Google Cloud Platform: really powerful, but it took a team of engineers to do. In our opinion, neither of these is an ideal solution, because people end up focusing on things that are not really what they're trying to do. They're developing orchestration engines, they're developing description languages, and they're not focusing on what they set out to do, which was just having a workflow run and monitoring that workflow from time to time.

So we thought it should be easy to create, manage, schedule, and monitor workflows across Google Cloud Platform. I call out all of those steps individually because they're all really important parts of the lifecycle of a workflow. It's not just about scheduling something and letting it run, and it's not just about monitoring something. We wanted to think about the whole process holistically and find something that would allow somebody to create a workflow, schedule it to run, look at how it's running, and then manage that workflow based on what's happening.

So we came up with Cloud Composer. In case you didn't catch it last week (we didn't make a lot of noise, because Next was this week), Cloud Composer recently went GA, so it's generally available in a couple of regions on Google Cloud Platform, and it's based on Apache Airflow. The best summary of Cloud Composer is that it's managed Apache Airflow.

There are a few things we really wanted to tackle with Composer. First, we wanted an end-to-end, GCP-wide solution to orchestration and workflows. Second, it was really important to us that Composer works both inside of Cloud Platform and outside. To be totally blunt, if we developed something that only worked inside of Google Cloud, it would really be missing the mark, because hybrid cloud, multi-cloud,
and just not locking people in are all facts of life, so if we developed something proprietary, we felt from the outset that we would be failing. We also wanted it to be really easy. We don't want a really complicated workflow system, because then people are just wrangling with the infrastructure and the description language, and that's not a great use of time. It was also really important to us that it's open source, for a few reasons: again, we don't want you to be locked in; we want people to be able to look under the hood and see what's happening; and we wanted people to be able to contribute and be part of a larger community.

A common question we've gotten since we launched Composer into alpha, then beta, and now GA is: what's the difference between Composer, the managed Apache Airflow service, and just running Airflow on my own on a VM or a set of VMs? There are a few things we tried to tackle. First, we wanted a seamless, integrated experience with Cloud Platform. That means Composer is available inside our command-line tooling and there's a Google API; it's not a standalone thing that feels like a separate product. Second, we wanted security: with Cloud Composer you have IAM and you have audit logging, so from a security and auditability perspective it acts and feels the same as any other Cloud product. We also wanted it to be easy to use when things aren't going quite the way you expect. When you're developing or running workflows, some pieces may not always run exactly the way you want, so Stackdriver integration was very important to us, and it's a first-class citizen with Composer. Finally, we wanted to make it really easy to manage your Airflow deployment. That doesn't just mean creating and deleting the deployment; it means things like setting environment variables or installing and maintaining Python packages, things you could conceivably do on your own but that aren't a value-add when you're doing them yourself.

Because we've built on Apache Airflow, there's a core set of support for several products inside Google Cloud Platform. For example, there are Dataflow operators, BigQuery operators, and Cloud Storage operators, and the support for Google Cloud Platform products is expanding with each Airflow release. That's a core part of what the Airflow team at Google is doing, either by themselves or with other teams inside Google: expanding the breadth and depth of support for Cloud Platform products inside Airflow. Really importantly, Airflow supports a whole host of things outside of GCP. It supports services, so you can go talk to things like Slack or Jira, and it supports technologies, so you can call REST APIs or talk to FTP servers. All of the support for non-Google things is absolutely usable within Airflow and within Cloud Composer. We didn't want to break or limit what you can do outside of GCP with Composer, so all of the cool things you can do with Airflow outside of GCP remain very usable.

Since our GA happened late last week, not a lot of people may have noticed, so we wanted to call out a few things that just launched with it. We launched support for a couple of new regions inside GCP, there's expanded Stackdriver logging (which you'll see in the demo today), there are expanded Cloud IAM roles, and we took a bunch of fixes and additions that will appear in future versions of Airflow and backported them to the Cloud Composer
release. We try not to modify the Cloud Composer Airflow version too much from the mainline version of Airflow itself, but once in a while there are fixes, tweaks, and additions that we backport. Our general philosophy is that unless there's a JIRA issue associated with a change, we won't inject it into Composer, because we don't want to make Composer a black box: it's not a value-add, and it's just an alternative form of lock-in.

Since not a lot of people may be familiar with Airflow, or may have only just heard of it, we want to quickly cover some of its core concepts to establish a baseline. We're really excited about Airflow, we love Airflow, and we want other people to love it and get excited about it too. If you're totally unfamiliar with it, Airflow is an open-source project incubating at the Apache Software Foundation. It's been around for a few years, and I think it's fair to say it has become one of the leading open-source packages for creating and scheduling workflows. Airflow is really interesting because all of your workflows are code, specifically Python code, so they're highly approachable and you can do a lot of different things with them, as you'll see in the demo today. You can do things like programmatically generate workflows, which is really cool.

One of the questions we faced, especially when we made a bet on Airflow about a year ago, was: why Airflow? What we really wanted to do with Composer is tie the strengths of Airflow as an open-source package and community to the strengths of Google Cloud Platform. Google Cloud Platform is very good at running infrastructure, creating services, adding layers of security and auditability, and maintaining cost, but as I mentioned, we didn't have a workflow and orchestration solution. Airflow did have a really strong workflow orchestration solution: it had a whole bunch of connectors for services inside and outside of Google Cloud Platform, and it already had a description language and a defined set of APIs. We wanted to join the two inside Cloud Composer. As we've developed Cloud Composer we've also started contributing back to the Airflow community; the Kubernetes executor is a good example, and I'll talk about it a little later in the deck.

There are a few key concepts in Airflow you may want to be aware of if you've never used it in depth before. First, all of your workflows are graphs: a workflow is a series of tasks, and those tasks fit into a directed acyclic graph. That can be a very simple graph, just one node running a bash script, or a really complicated graph; our build system for Airflow, which we'll show you, is a good example of a complicated one. The graph is made up of tasks, which are essentially steps: something that happens, like running a SQL query or running a BigQuery query. For those tasks, Airflow has operators and sensors. An operator is essentially something that tells a system to do something; a good example is the BigQuery operator, which tells BigQuery to go run a query. There's also the concept of a sensor in Airflow, which is essentially a binary wait: it waits until some condition is true, and when it's true, the workflow proceeds.
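To make those concepts concrete, here is a minimal sketch of a DAG with one sensor and one operator. The bucket and object names are hypothetical, and the GCS sensor import assumes an Airflow 1.10-era contrib module, so treat the details as illustrative rather than exact.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2018, 7, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    # The DAG (directed acyclic graph) is the workflow; each sensor or operator
    # defined inside it becomes a task, i.e. a node in the graph.
    with DAG('concepts_demo', default_args=default_args,
             schedule_interval='@daily') as dag:

        # Sensor: a binary wait. Poke until the object exists, then proceed.
        wait_for_file = GoogleCloudStorageObjectSensor(
            task_id='wait_for_file',
            bucket='my-example-bucket',      # hypothetical bucket
            object='incoming/data.csv',      # hypothetical object
        )

        # Operator: tells something to do something; here, just run a shell command.
        say_hello = BashOperator(
            task_id='say_hello',
            bash_command='echo "the file has landed"',
        )

        # Dependency: run the operator only after the sensor succeeds.
        wait_for_file >> say_hello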
Airflow itself also has a lot of really interesting, deep functionality that we encourage people to use inside Composer, and it's another reason why we thought Airflow was a great bet. You can define connections and have your workflows use those connections, you can set SLAs on your workflows and see what is and isn't meeting its SLA, and you can do things like pass information between tasks. There's a lot of really interesting advanced functionality inside Airflow. We get a lot of questions along the lines of "can Airflow do X, Y, or Z?", and often the answer is yes, Airflow actually can do X, Y, and Z. With that, I'm going to turn it over for a look at Airflow in depth.

Thank you, James. With that said, I'll review some of the product details next: how we constructed the product, some of the design decisions we made, and the capabilities of Composer. In Cloud Composer we introduced a new concept called an environment. It's very similar to a Kubernetes cluster or a Dataproc cluster: essentially, it's a collection of managed GCP resources that gives you the functionality needed to run Apache Airflow. Inside a single GCP project you can create multiple Composer environments, and all of the environments are integrated with Google Cloud Storage, Stackdriver Logging, and Cloud IAM.

There are a few ways to interact with the product: you can use the Cloud SDK, you can use the Cloud Console, or you can use our REST API. Functionality-wise, all three are equivalent. However, I do want to point out one difference in the Cloud SDK. To make things convenient for Composer users, so that you don't have to manage two sets of command-line tools (the Composer commands and the Airflow commands), we routed the Airflow command-line commands through the gcloud composer command, so that in a single place you can manage your Composer environment and, at the same time, interact with the Airflow environment.

I mentioned that a Composer environment is really a collection of GCP resources, so here I'll zoom in one level and explain how and why we decided to use the following GCP resources to construct the Composer service. At a very high level, you'll notice there are two projects: one is called the customer project and the other is called the tenant project. The customer project is probably the one you're familiar with; it's the GCP project you create and interact with. The tenant project may be a new concept, but it's really the same as an ordinary GCP project; it's just that the tenant project is managed and owned by Google. As we walk through the detailed architecture, we'll explain why we made the design decision to have a portion of the resources live inside the tenant project.

If you look at the way Airflow is constructed, it has a number of microservice-like components: an Airflow web server, an Airflow scheduler, and an Airflow metadata database. Those components naturally map to the wide range of GCP services we offer. For example, starting with the Airflow scheduler and the Airflow workers, we decided to host both the workers and the scheduler inside a Kubernetes Engine cluster. The reason is that it allows you to conveniently package your workflow application dependencies, because all of your tasks can run inside containers. The workers and the scheduler communicate through the Celery executor.
Moving on, we have the Cloud SQL proxy, and we decided to host the Airflow metadata database inside the tenant project so that only the service account you used to create the Composer environment has access to the metadata database. That's really about enhancing security: the Cloud SQL instance backing the Airflow database stores all of the valuable metadata about your workflows. Think about connection credentials stored in the database; you obviously don't want just anyone in your project to be able to access that credential information.

On the right side we have the Airflow web server, which interacts with the database and surfaces all of the workflow information. We decided to host the web server on App Engine so that it's publicly reachable and you don't need a clumsy proxy setup to access it. At the same time, you don't want your web server open to anyone on the internet, so we collaborated with another Google Cloud service, the Identity-Aware Proxy, so that only authorized users are able to access the web server. Later in the session we'll give you a sense of how that works. We also made it extremely easy to configure access to the web server; it works exactly the same way as configuring an IAM policy.

Moving left, we have GCS. We use GCS as a convenient store for you to stage your DAGs: deploying a new workflow to Composer is as simple as dropping your file into the GCS bucket. We understand that sometimes you also need to stage workflow artifacts, and we make that part of the managed service as well, so your artifacts are staged in the GCS bucket and you can retrieve them back out later.

Finally, all of the interactions and logs are streamed to Stackdriver. The Airflow UI itself comes with task logs, but you can't really find out what's happening if, for example, an Airflow worker crashes, or there's an Airflow exception outside of your workflow; the workflow is scheduled, runs into an exception, and at that point you lose visibility into what's happening. That's why we decided it makes a lot of sense to offer this additional Stackdriver logging.

That covers the architecture, and now I'm going to explain a little bit about workflows, because not everyone is familiar with workflows or with the way Airflow expresses them. At a very high level, a workflow consists of a collection of tasks and their interdependency relationships. In this example, you have some files in HDFS, and whenever a new file lands in HDFS you may want to kick off your workflow so that it copies the file from HDFS to GCS. Once it's in GCS, Google Cloud Storage, maybe you want to trigger a BigQuery operator that loads the data in, and subsequently run some query and make the result available, maybe with a Slack notification. A few things you've probably noticed in this example workflow description: there are tasks associated with the workflow, for example running a BigQuery job; there are dependencies, because you want to wait until the data is available in GCS before you start the BigQuery job; and there's also a component of triggering, where a file appearing in HDFS is what starts the workflow. Those are some of the elements, the building blocks, of Airflow workflows.
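As a rough sketch of what a workflow along those lines can look like in Airflow code: the bucket, dataset, table names, and schema below are hypothetical, and the contrib operator imports assume an Airflow 1.10-era install, so treat this as illustrative rather than exact. A notification task (for example, Slack) could be chained on at the end.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor
    from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator

    default_args = {'owner': 'airflow', 'start_date': datetime(2018, 7, 1)}

    with DAG('gcs_to_bq_example', default_args=default_args,
             schedule_interval='@daily') as dag:

        # Dependency: wait until the file has been copied into GCS.
        wait_for_data = GoogleCloudStorageObjectSensor(
            task_id='wait_for_data',
            bucket='my-landing-bucket',
            object='exports/events.csv',
        )

        # Load the file into a BigQuery staging table.
        load_to_bq = GoogleCloudStorageToBigQueryOperator(
            task_id='load_to_bq',
            bucket='my-landing-bucket',
            source_objects=['exports/events.csv'],
            destination_project_dataset_table='my_project.my_dataset.events_staging',
            source_format='CSV',
            schema_fields=[{'name': 'event', 'type': 'STRING', 'mode': 'NULLABLE'}],
            write_disposition='WRITE_TRUNCATE',
        )

        # Run a query over the loaded data and materialize the result.
        summarize = BigQueryOperator(
            task_id='summarize',
            sql='SELECT COUNT(*) AS row_count FROM `my_project.my_dataset.events_staging`',
            destination_dataset_table='my_dataset.daily_summary',
            write_disposition='WRITE_TRUNCATE',
            use_legacy_sql=False,
        )

        wait_for_data >> load_to_bq >> summarize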
Now I'm going to give everyone an example of how you can build a very simple workflow that consists of two or three tasks. Airflow, unlike some other orchestration and workflow solutions, expresses workflows as code: instead of having a giant configuration file describing your workflows, you write your workflows as Python code. Roughly speaking, there are about five steps to defining a workflow in Airflow.

The first is the imports. From the Airflow package you import the DAG, the acronym for directed acyclic graph, which is really just another way of saying workflow, and then you import all of the operators you're going to use in the workflow, in this case a BigQuery operator, along with things like trigger rules, which let you specify inter-task relationships.

Once you have all the import statements ready, the next step is to define the arguments for your tasks and your workflow. You'll notice there's a default_args dictionary; this is a very convenient thing provided by Airflow so that if you have arguments common to a number of your tasks, instead of specifying them on each and every task, you can specify them once at the beginning of your workflow and the Airflow DAG model passes them in automatically. Imagine doing that with a configuration file: you'd probably have to copy and paste a lot of duplicated lines.

Once you have the workflow metadata specified, you define your workflow: dag = DAG(...), you give it a name, you give it a schedule interval, and you pass in the default args. Then, within the DAG, you start to define your tasks. In this case we have two tasks: a BigQuery operator task, and a BigQuery-to-Cloud-Storage operator task. As James mentioned earlier, Airflow has a wide range of supported GCP operators, and both the BigQuery operator and the BigQuery-to-Cloud-Storage operator are available in Airflow. Once all the tasks and their arguments are defined, the last step is to chain them up by giving them dependencies. In this case, the final line simply says: run the BigQuery query first, and after that, run the export-to-GCS task. As simple as that.

The reason we decided to go with Airflow in Composer is that there are a lot of nice things you can do once a workflow can be specified as a program. It gives you version control, it allows you to dynamically generate DAGs, and it gives you support for templates, so you can have a skeleton workflow and dynamically instantiate it. First, because of the language choice, Airflow DAGs support Jinja templating, so you can specify a template and reconfigure that task on every single run. Likewise, think about needing to generate a thousand, or ten thousand, tasks of the same type. That would be fairly tedious in a configuration language, but with DAGs as code, a simple couple of lines can give you a thousand tasks, as in the quick sketch below.
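For example, here is a minimal sketch of generating many similar tasks in a loop; the task count and the command are made up for illustration.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG('dynamic_tasks_example',
             default_args={'owner': 'airflow', 'start_date': datetime(2018, 7, 1)},
             schedule_interval='@daily') as dag:

        # Because the DAG is just Python, a plain loop can stamp out as many
        # near-identical tasks as you need.
        for i in range(1000):
            BashOperator(
                task_id='process_shard_{}'.format(i),
                bash_command='echo "processing shard {}"'.format(i),
            )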
The third thing is that Airflow naturally matches the software development model of modules and sub-modules: in Airflow you have workflows and sub-workflows, called DAGs and sub-DAGs. In the example on the left side you could have many tasks, and if you realize you want a reusable unit, you can package those tasks into a sub-DAG and then include that sub-DAG in another DAG.

With all that, I'm going to give a Composer demo. In the demo we'll show you how to interact with the product: for example, what deployment looks like, how you query workflow status, and how you trigger workflow execution from an external system. All right, we're going to switch to the demo now; we'll turn on screen mirroring and switch over. This is also known as the game of "how quickly do we know Chrome OS." All right, screen mirroring is on. Chrome to the rescue; as you can tell, I'm not a regular Chrome user. Sorry about that.

Cool. As I mentioned, we have a very simple console interface for interacting with environments. To create a Composer environment, this is all you need to do: specify a name, say "google-next"; specify the number of nodes you want, three nodes, ten nodes, however many; and pick the location you want to deploy to, in this case us-central1. You have the option to define the machine type: if you feel you're going to have CPU-intensive workflows, you can configure your Composer environment with more powerful machines. Likewise, you can also specify the network and subnetwork, which is for the case where you have shared VPC or want to reuse a network you've already defined in your project. One thing I want to call out in the configuration is that we do allow you to provide your own service account. You don't have to rely on the default Compute Engine service account; instead, you can supply any service account you like, which lets you restrict the set of services your Composer environment can interact with. We also allow Airflow configuration overrides; for example, if you want to increase your DAG load timeout from 100 seconds to 200 seconds, this is where you can specify that override. Once you have all the parameters entered, just click Create. That's all you need to do to bring up a Composer environment.

Now, while we wait for the environment we just created (it takes a while because, as I mentioned, we host the Airflow web server on App Engine, and it simply takes ten-plus minutes to get the application deployed), we have a pre-created Composer demo environment we can look at. Here are the service account, the name, and a view of all the details pertaining to the environment. As I mentioned, we use GCS to deploy workflows, so just click into the DAGs folder. There's one simple DAG here, the BigQuery demo, which is the example I just walked everyone through. Deploying a new DAG is as simple as copying a file into this GCS bucket.
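If you'd rather script that step than click through the console or use gsutil, one way is with the Cloud Storage client library for Python; the bucket name below is a placeholder, since each environment gets its own DAGs bucket, which you can look up in the environment details.

    from google.cloud import storage

    def deploy_dag(bucket_name, local_path, dag_filename):
        """Copy a DAG file into the environment's dags/ folder on GCS."""
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        blob = bucket.blob('dags/{}'.format(dag_filename))
        blob.upload_from_filename(local_path)

    # Example (hypothetical bucket name):
    # deploy_dag('us-central1-google-next-bucket', 'bq_demo.py', 'bq_demo.py')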
Meanwhile, if you want to interact with the service, open the Airflow UI, and understand what workflows exist, it's just as simple. You don't need to run through a clumsy proxy setup; a single click gives you the Airflow web UI. This is the simple demo DAG I just mentioned, really just two or three tasks, which is not to say you can't create something more complicated. As James mentioned earlier, at Google we have a complicated DAG that helps us run CI, to make sure that changes submitted to Airflow upstream don't break the GCP operators.

The other part I want to show everyone: earlier I mentioned that the web server is protected by IAP and only authorized users can access it, so I'm going to open an incognito window. Just give me one second. James, I need your magical powers to restore my demo page... no worries, cool, thank you. We also have somebody here who knows the Chrome keyboard commands well. What I'm trying to do is log in with my personal Gmail account, which uses two-factor authentication. Sorry, I forgot my phone, but trust me, it works. I notice a few of you are taking photos of the URL; please do try it offline, and I guarantee that you won't be able to access this web server.

Cool. The other thing I mentioned earlier is that because we have this IAP protection, which gives you the guarantee that only authorized users can access your DAGs, it opens up a lot of possibilities. Here we have a demo where we trigger a DAG execution from Google Cloud Functions. You write your own function, and it really doesn't matter whether it's Cloud Functions or a different computer: as long as you have the necessary IAM credentials and you've been added to the IAM policy for the Composer web server, you'll be able to remotely trigger the execution of a DAG. I'm going to test this function. What it does behind the scenes is call the URL of the Airflow API server, which is hosted in the same application as the web server, so behind the scenes this really just sends a RESTful request to the web server I just showed you. After a page refresh you can see a DAG run being triggered, and this is the run ID generated by Airflow. This is what I mean when I say that once you have the DAG, you can trigger it from anywhere. It gives you a lot of flexibility: you don't necessarily need access to gcloud, or the proxy, or even the Google Cloud Console. And while we've been waiting, I noticed that this particular task execution already completed a minute ago.
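To give a feel for what a function like that does behind the scenes, here is a hedged sketch of posting to the Airflow experimental REST API on the web server. The web server URL and DAG ID are placeholders, and the way you obtain an identity token that the Identity-Aware Proxy will accept is omitted and simply assumed to be available, so this is illustrative rather than a drop-in implementation.

    import requests

    # Placeholders: your environment's Airflow web server URL and a DAG ID.
    WEBSERVER_URL = 'https://example-tenant-project.appspot.com'
    DAG_ID = 'bq_demo'

    def trigger_dag(id_token, conf=None):
        """Ask the Airflow experimental API to start a new DAG run.

        id_token is assumed to be an OpenID Connect identity token that the
        Identity-Aware Proxy in front of the web server accepts.
        """
        resp = requests.post(
            '{}/api/experimental/dags/{}/dag_runs'.format(WEBSERVER_URL, DAG_ID),
            headers={'Authorization': 'Bearer {}'.format(id_token)},
            json={'conf': conf or {}},
        )
        resp.raise_for_status()
        return resp.json()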
The other thing I want to show everyone is that in this GA release we also support the KubernetesPodOperator. In the past it has been a pain to manage dependencies. As James mentioned, in Composer we make it convenient to install Python packages, but sometimes your dependencies live outside of Python, and then what do you do? That's why we backported the KubernetesPodOperator from Airflow 1.10, which was just released a couple of weeks back, and we make it just work out of the box for you, so you don't have to worry about which service account, which credentials, all of that. So I'm going to show you another demo; we'll just wait a minute or two for the DAG to show up. This is what I mentioned earlier: Airflow has a DAG directory scan interval that lets you specify how often to scan for new changes, new files, or new workflows in your DAGs folder. There's a default value, but with Composer we give you the option to override that configuration value.

While we wait for the workflow to show up, we can look a bit more at what the Airflow UI offers. You have different views of your DAG: there's a graph view and a tree view, and you can also see the code at any time, in case you need to switch back and forth between the graph representation of the workflow and the code representation. You can also conveniently add connections for your workflows; as I mentioned, Airflow stores the connection credentials in the metadata database, and it offers this web UI for you to modify your connections at any time. You can also see all of the configuration that comes with this particular installation, and as I mentioned, those configuration values can later be changed through the managed service.

And it's back: we can see that the Kubernetes pod example we added has shown up, so let's take a look at the code. To recap, you have all the import statements at the top, then the definition of your workflow metadata, then you define your DAG, and after that you define your tasks. In this case we have two tasks: the first is a BashOperator that prints the date, and the second is what I mentioned, a KubernetesPodOperator, which in this case pulls a perl image and computes the value of pi. At any time in the Airflow UI you can inspect the log output, but sometimes there might be a delay before those logs appear, and this is where Stackdriver Logging is helpful. Let's look at the logs emitted at the same time by the worker. You'll notice we have this pod launcher entry, which relates to launching the pod. The nice thing we added to the Stackdriver logs is that we organize them and provide labels so you can filter down to a specific task, so here's what I'm going to do: filter all logs pertaining to this specific task. There you go, you see all of the logs generated for it, and at the same time it may take Airflow a while before the logs appear under the task's own logs. And now the task is done, so we can take a look at the output. You'll notice it started the pod and then waited for the pod to be done. At some point, let me scroll over, there we go: job pending, job running, then it prints the value of pi, and finally the task is marked successful. That concludes the demo session. Feel free to try out the service offline, and like I said, I promise that you can't access my web server.
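For reference, the pod-operator DAG from that part of the demo looks roughly like this. It's a sketch that assumes the Composer backport of the KubernetesPodOperator; the pod name, namespace, and the perl image computing pi follow the common example, so adjust them for your own images.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    with DAG('pod_example',
             default_args={'owner': 'airflow', 'start_date': datetime(2018, 7, 1)},
             schedule_interval='@daily') as dag:

        # First task: a BashOperator that just prints the date.
        print_date = BashOperator(
            task_id='print_date',
            bash_command='date',
        )

        # Second task: run a container on the environment's cluster and
        # compute the value of pi with perl.
        compute_pi = KubernetesPodOperator(
            task_id='compute_pi',
            name='compute-pi',
            namespace='default',
            image='perl',
            cmds=['perl'],
            arguments=['-Mbignum=bpi', '-wle', 'print bpi(2000)'],
        )

        print_date >> compute_pi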
All right, a quick recap on the demo: we showed you how to create a Composer environment, the various ways to interact with the environment, how to deploy and monitor workflows, how to trigger a workflow from an external source, and how to inspect your workflow status with Airflow logging as well as Stackdriver Logging. Back to you, James.

Excellent. Just to add on to the security point: the Identity-Aware Proxy uses the same load-balancer-level security that's used for things like the Google Cloud Console, so we are paranoid about security, and the two-factor authentication is a good example of that. It's very core security, so you would have seen a rejection, and that rejection is handled very low down in the stack.

I want to talk about next steps, in terms of our involvement with Airflow, in terms of where Composer is going, and in terms of how you can get started. There are a lot of Composer questions we get, and we want to answer some of the most common ones. I'll have details at the end on how you can bug us, and please do bug us; we're not shy of questions, we love input, and we're here to collaborate. These just happen to be the questions we get 99% of the time.

There are questions about whether people can install their own Python packages, Airflow operators, and Python-specific things. The answer is yes: inside the Cloud Storage bucket that contains your DAGs you can add custom Python modules (that's the interesting thing about everything being code), and there are also specific folders for adding plugins to Airflow itself.

There are questions about whether we touch the environment after it's been created. The answer is no. One of the soft spots in Airflow right now is how changes are handled version to version, so for now we think it's best that we don't change your environment's version or the versioning of its components once you deploy it. That may change over time as the Airflow community and Composer mature.

Will Cloud Composer be offered in more regions? Yes, we are actively working on it, and the GA launch is a good example of that. Is there a graphical way to create DAGs? A very, very common question; the answer is no, but this is of extreme interest to us and to the Airflow community, so while the answer is no today, I would expect it to happen at some point in the future. Which version of Python can be used with Composer? This is probably the most Composer-specific question we get: right now it's 2.7, and we are actively working on Python 3.5 support, which you can probably expect in a future Composer release.

On future directions for Cloud Composer: we showed off some of the work we've done with Kubernetes, and the intersection of Kubernetes and Airflow is exceedingly interesting to us. Google, in terms of the Composer team and other teams, has been involved in work on Kubernetes and Airflow; the Kubernetes executor is a good example of that, and it's not the last example or the end of the line for Airflow and Kubernetes. We're also working on additional operators: we want to expand the API surface-area coverage of the products that are already inside Airflow, things like Dataproc, BigQuery, and Dataflow, and we also want to expand support in Airflow for new GCP products. Third, resource usage: right now you create your Composer environment with a fixed size, and you can resize that environment, but we're also thinking about ways to increase the elasticity of the environment based on the workflows that are executing on it. Much like a lot of our managed services, we want to tightly constrain the resources you're using for an environment to what's actually going on in that environment.
There are a ton of different things you can do to get involved with either Airflow (and if you don't like Composer, that's totally okay) or with Composer itself. For Airflow, the Apache website for the project is a really good place to get started; there are links to the mailing lists, which are pretty active, and a lot of information on how to get involved in that community. For Composer, we have our documentation for the product, and there's also a Google Group mailing list that you can join; please ask questions on that mailing list. There's also a Stack Overflow tag that we look at as a team, so if you have Composer-specific questions please use that tag, because the more questions asked with that tag, the less likely it is that other people will need to hunt and peck for that information over time, and we just can't anticipate all of the questions that might come up. As a plug, there is a meetup group for Airflow if you happen to be local to the Bay Area, and there's going to be an event in September hosted by the Cloud Composer team at the Google Sunnyvale office, so if you're local, please sign up; we just put up the meetup event for it, so it's something to check out if you're curious.

With that, thank you all very much for being here; we sincerely appreciate it, and again, thank you to everyone who has used Composer and given us feedback. We're open to questions; we have five minutes, so if you have questions, please come up to the mic, and we're happy to answer anything you have. [Applause]
