Extracting time series on given geographical coordinates from satellite or Numerical Weather Prediction data can be challenging because of the volume of data and its multidimensional nature (time, latitude, longitude, height, multiple parameters). This type of processing can be found in weather and climate research, but also in applications like photovoltaic and wind power. For instance, time series describing the quantity of solar energy reaching specific geographical points can help in designing photovoltaic power plants, monitoring their operation, and detecting yield loss.
A generalization of the problem can be stated as follows: how can we extract data along a dimension that is not the partition key from a large volume of multidimensional data? For tabular data, this problem can be easily solved with AWS Glue, which you can use to create a job to filter and repartition the data, as shown at the end of this post. But what if the data is multidimensional and provided in a domain-specific format, like in the use case that we want to tackle?
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. With AWS Step Functions, you can launch parallel runs of Lambda functions. This post shows how you can use these services to run parallel tasks, with the example of time series extraction from a large volume of satellite weather data stored on Amazon Simple Storage Service (Amazon S3). You also use AWS Glue to consolidate the files produced by the parallel tasks.
Note that Lambda is a general purpose serverless engine. It has not been specifically designed for heavy data transformation tasks. We are using it here after having confirmed the following:
- Task duration is predictable and is less than 15 minutes, which is the maximum timeout for Lambda functions
- The use case is simple, with low compute requirements and no external dependencies that could slow down the process
We work on a dataset provided by EUMETSAT: the MSG Total and Diffuse Downward Surface Shortwave Flux (MDSSFTD). This dataset contains satellite data at 15-minute intervals, in netcdf format, which represents approximately 100 GB for 1 year.
We process the year 2018 to extract time series on 100 geographical points.

Solution overview
To achieve our goal, we use parallel Lambda functions. Each Lambda function processes 1 day of data: 96 files representing a volume of approximately 240 MB. We then have 365 files containing the extracted data for each day, and we use AWS Glue to concatenate them for the full year and split them across the 100 geographical points. This workflow is shown in the following architecture diagram.

Deployment of this solution: In this post, we provide step-by-step instructions to deploy each part of the architecture manually. If you prefer an automatic deployment, we have prepared a GitHub repository containing the required infrastructure as code templates.
The dataset is partitioned by day, with YYYY/MM/DD/ prefixes. Each partition contains 96 files that will be processed by one Lambda function.
We use Step Functions to launch the parallel processing of the 365 days of the year 2018. Step Functions helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.
But before starting, we need to download the dataset and upload it to an S3 bucket.
Prerequisites
Create an S3 bucket to store the input dataset, the intermediate outputs, and the final outputs of the data extraction.
Download the dataset and upload it to Amazon S3
A free registration on the data provider website is required to download the dataset. To download it, you can use the following command from a Linux terminal. Provide the credentials that you obtained at registration. Your Linux terminal can be on your local machine, but you can also use an AWS Cloud9 instance. Make sure that you have at least 100 GB of free storage to handle the entire dataset.
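The following is a sketch of the download command, assuming the provider serves the archive over HTTPS with basic authentication; the placeholders stand for the credentials and the product URL you receive at registration:

```bash
# Placeholders: <USER>, <PASSWORD>, and the product URL from your registration
wget -c -r -np -nH \
  --user=<USER> --password=<PASSWORD> \
  "https://<data-provider-host>/<path-to-MDSSFTD>/2018/"
```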
Because the dataset is quite large, this download could take a long time. In the meantime, you can prepare the next steps.
When the download is complete, you can upload the dataset to an S3 bucket with the following command:
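For example (the bucket name and prefix are placeholders; use the bucket you created in the prerequisites):

```bash
aws s3 cp --recursive <local-download-directory>/ s3://<your-bucket>/MDSSFTD/2018/
```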
If you use temporary credentials, they might expire before the copy is complete. In this case, you can resume by using the aws s3 sync command.
Now that the data is on Amazon S3, you can delete the directory that you downloaded to your Linux machine.
Create the Lambda functions
For step-by-step instructions on how to create a Lambda function, refer to Getting started with Lambda.
The first Lambda function in the workflow generates the list of days that we want to process:
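A minimal sketch of this function follows; the exact output shape is an assumption, with each element being a small dictionary carrying the day's S3 prefix, which the Map state then passes to the processing function:

```python
from datetime import date, timedelta

def lambda_handler(event, context):
    # Build one element per day of 2018; each element carries the S3 prefix
    # of that day's partition (YYYY/MM/DD), matching the dataset layout.
    days = []
    current = date(2018, 1, 1)
    while current <= date(2018, 12, 31):
        days.append({"day": current.strftime("%Y/%m/%d")})
        current += timedelta(days=1)
    return days
```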
We then use the Map state of Step Functions to process each day. The Map state launches one Lambda function for each element returned by the previous function and passes this element as an input. These Lambda functions are launched simultaneously for all the elements in the list. The processing time for the full year is therefore identical to the time needed to process a single day, which allows scalability for long time series and large volumes of input data.
The following is an example of code for the Lambda function that processes each day:
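The sketch below is illustrative rather than the exact code of the post: the variable name (DSSF_TOT), the event shape, the hard-coded grid indices of the points, and the output layout are assumptions, and error handling is omitted. It downloads the 96 files of the day, extracts the value at each point, and writes a single Parquet file for that day (writing Parquet assumes pyarrow is available, which is the case with the AWS managed pandas layer):

```python
import os

import boto3
import numpy as np
import pandas as pd
from netCDF4 import Dataset

s3 = boto3.client("s3")

INPUT_BUCKET = os.environ["INPUT_BUCKET"]    # bucket holding the netCDF files
OUTPUT_BUCKET = os.environ["OUTPUT_BUCKET"]  # bucket for the daily Parquet files
VARIABLE = "DSSF_TOT"                        # assumed name of the flux variable

# The 100 points, given here as precomputed (point_id, row, column) grid
# indices; in a real setup, load them from S3 or an environment variable.
POINTS = [("point_0", 1000, 1800), ("point_1", 1020, 1810)]  # truncated

def lambda_handler(event, context):
    day_prefix = event["day"]  # for example "2018/03/21"
    records = []

    # Iterate over the ~96 files of the day, one at a time, to keep /tmp small.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=INPUT_BUCKET, Prefix=day_prefix):
        for obj in page.get("Contents", []):
            local_path = "/tmp/" + os.path.basename(obj["Key"])
            s3.download_file(INPUT_BUCKET, obj["Key"], local_path)

            with Dataset(local_path) as ds:
                # Assumed global attribute; fall back to the object key.
                timestamp = getattr(ds, "time_coverage_start", obj["Key"])
                var = ds.variables[VARIABLE]
                values = var[0] if var.ndim == 3 else var[:]
                for point_id, row, col in POINTS:
                    value = values[row, col]
                    records.append({
                        "point_id": point_id,
                        "time": timestamp,
                        "value": None if np.ma.is_masked(value) else float(value),
                    })
            os.remove(local_path)

    # One output file per day, containing the values for all the points.
    df = pd.DataFrame(records)
    out_path = "/tmp/output.parquet"
    df.to_parquet(out_path, index=False)
    s3.upload_file(out_path, OUTPUT_BUCKET,
                   f"daily/{day_prefix.replace('/', '-')}.parquet")
    return {"day": day_prefix, "rows": len(records)}
```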
You need to associate a role with the Lambda function to authorize it to access the S3 buckets. Because the runtime is around one minute, you also have to configure the timeout of the Lambda function accordingly. Let's set it to 5 minutes. We also increase the memory allocated to the Lambda function to 2048 MB, which the netcdf4 library needs to extract several points at a time from the satellite data.
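If you prefer to set these values from the command line, the equivalent configuration looks like the following (the function name is a placeholder):

```bash
aws lambda update-function-configuration \
  --function-name extract-points-one-day \
  --timeout 300 \
  --memory-size 2048
```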
This Lambda function depends on the pandas and netcdf4 libraries, which can be installed as Lambda layers. The pandas library is available as an AWS managed layer. The netcdf4 library has to be packaged in a custom layer.
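One possible way to build the custom layer is sketched below, assuming you run the commands on Amazon Linux (or in a matching container image) so that the compiled dependencies of netcdf4 match the Lambda runtime; the layer name and runtime version are illustrative:

```bash
# Install the library into the folder structure expected by Lambda layers.
mkdir -p layer/python
pip install netCDF4 -t layer/python/

# Package and publish the layer.
(cd layer && zip -r ../netcdf4-layer.zip python)
aws lambda publish-layer-version \
  --layer-name netcdf4 \
  --zip-file fileb://netcdf4-layer.zip \
  --compatible-runtimes python3.12
```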
Configure the Step Functions workflow
After you create the two Lambda functions, you can design the Step Functions workflow in the visual editor by using the Lambda Invoke and Map blocks, as shown in the following diagram.

In the Map state block, choose Distributed processing mode and increase the concurrency limit to 365 in Runtime settings. This enables parallel processing of all the days.


The number of Lambda functions that can run concurrently is limited for each account. Your account may have an insufficient quota. You can request a quota increase.
Launch the state machine
You can now launch the state machine. On the Step Functions console, navigate to your state machine and choose Start execution to run your workflow.

This opens a popup in which you can enter optional input for your state machine. For this post, you can leave the defaults and choose Start execution.

The state machine should take 1–2 minutes to run, during which you can monitor the progress of your workflow. You can select one of the blocks in the diagram and inspect its input, output, and other information in real time, as shown in the following screenshot. This can be very useful for debugging purposes.

When all the blocks turn green, the state machine is complete. At this point, we have extracted the data for 100 geographical points for a whole year of satellite data.
In the S3 bucket configured as the output of the processing Lambda function, we can check that we have one file per day, containing the data for all 100 points.

Transform data per day into data per geographical point with AWS Glue
For now, we have one file per day. However, our goal is to get a time series for each geographical point. This transformation involves changing the way the data is partitioned: from a partition per day, we have to go to a partition per geographical point.
Fortunately, this operation can be done very simply with AWS Glue.
- On the AWS Glue Studio console, create a new job and choose Visual with a blank canvas.
For this example, we create a simple job with a source block and a target block.

- Add a data source block.
- On the Data source properties tab, select S3 location for S3 source type.
- For S3 URL, enter the location where you created your files in the previous step.
- For Data format, keep the default of Parquet.
- Choose Infer schema and check the Output schema tab to verify that the schema has been correctly detected.

- Add a data target block.
- On the Data target properties tab, for Format, choose Parquet.
- For Compression type, choose Snappy.
- For S3 Target Location, enter the S3 target location for your output files.
We now need to configure the magic!
- Add a partition key, and choose point_id.
This tells AWS Glue how you want your output data to be partitioned. AWS Glue will automatically partition the output data according to the point_id column, so we get one folder for each geographical point, containing the whole time series for that point, as requested.
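For illustration only, the same repartitioning could be expressed in a few lines of Python with the awswrangler library (bucket names and prefixes are placeholders), which makes the effect of the partition key explicit:

```python
import awswrangler as wr

# Read all the daily files, then rewrite them partitioned by point_id:
# the output prefix will contain one point_id=<value>/ folder per point.
df = wr.s3.read_parquet("s3://<your-bucket>/daily/")
wr.s3.to_parquet(
    df=df,
    path="s3://<your-bucket>/by-point/",
    dataset=True,
    partition_cols=["point_id"],
    compression="snappy",
)
```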

To finish the configuration, we need to assign an AWS Identity and Access Management (IAM) role to the AWS Glue job.
- Choose Job details, and for IAM Role, choose a role that has permissions to read from the input S3 bucket and to write to the output S3 bucket.
You may have to create the role on the IAM console if you don't already have an appropriate one.
- Enter a name for your AWS Glue job, save it, and run it.
We can monitor the run by choosing Run details. It should take 1–2 minutes to complete.
Final results
After the AWS Glue job succeeds, we can check in the output S3 bucket that we have one folder for each geographical point, containing Parquet files with the whole year of data, as expected.

To load the time series of a specific point into a pandas DataFrame, you can use the awswrangler library from your Python code:
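For example (the bucket name, prefix, and point_id value are placeholders corresponding to the output location and partition key configured in the AWS Glue job):

```python
import awswrangler as wr

# Each point_id=<value>/ prefix holds the full year of data for that point.
df = wr.s3.read_parquet("s3://<your-bucket>/by-point/point_id=point_0/")
print(df.head())
```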
If you want to test this code now, you can create a notebook instance in Amazon SageMaker and then open a Jupyter notebook. The following screenshot illustrates running the preceding code in a Jupyter notebook.

As we can see, we have successfully extracted the time series for specific geographical points!
Clean up
To avoid incurring future costs, delete the resources that you have created:
- The S3 bucket
- The AWS Glue job
- The Step Functions state machine
- The two Lambda functions
- The SageMaker notebook instance
Conclusion
In this post, we showed how to use Lambda, Step Functions, and AWS Glue for serverless ETL (extract, transform, and load) on a large volume of weather data. The proposed architecture enables extraction and repartitioning of the data in just a few minutes. It's scalable and cost-effective, and can be adapted to other ETL and data processing use cases.
Interested in learning more about the services presented in this post? You can find hands-on labs to build your knowledge with AWS Workshops. Additionally, check out the official documentation of AWS Glue, Lambda, and Step Functions. You can also discover more architectural patterns and best practices in AWS Whitepapers & Guides.
About the Author

Lior Perez is a Principal Solutions Architect on the Enterprise team based in Toulouse, France. He enjoys supporting customers in their digital transformation journey, using big data and machine learning to help solve their business challenges. He is also personally passionate about robotics and IoT, and constantly looks for new ways to leverage technology for innovation.
