Author data integration jobs with an interactive data preparation experience with AWS Glue visual ETL

We are excited to announce a new capability of the AWS Glue Studio visual editor: an interactive data preparation experience. You can now author data preparation transformations and edit them directly in the AWS Glue Studio visual editor, a graphical interface that enables you to create, run, and monitor data integration jobs in AWS Glue.

The new data preparation interface in AWS Glue Studio provides an intuitive, spreadsheet-style view for interactively working with tabular data. Within this interface, you can visually inspect samples of your tabular data, validate recipe steps through real-time runs, and author data preparation recipes without writing code, choosing from hundreds of prebuilt transformations. This allows data analysts and data scientists to rapidly construct the data preparation steps they need to meet their business requirements. After you finish authoring a recipe, AWS Glue Studio automatically generates the Python script to run the recipe's data transformations as part of AWS Glue extract, transform, and load (ETL) jobs.

In this post, we show how to use this new feature to build a visual ETL job that preprocesses data to meet the business needs for an example use case, entirely within the AWS Glue Studio console, without the overhead of manual script coding.

Example use case

A fictional e-commerce company sells apparel and allows customers to leave text reviews and star ratings for each product, helping other customers make informed purchase decisions. To simulate this, we use a sample synthetic review dataset that includes different products and customer reviews.

In this scenario, you’re a data analyst in this company. Your role involves preprocessing raw customer review data to prepare it for downstream analytics. This requires transforming the data by normalizing columns through actions such as casting columns to appropriate data types, splitting a single column into multiple new columns, and adding computed columns based on other columns. To quickly create an ETL job for these business requirements, you use AWS Glue Studio to inspect the data and author data preparation recipes.

The AWS Glue job will be configured to write its output to Amazon Simple Storage Service (Amazon S3) in a preferred format and automatically create a table in the AWS Glue Data Catalog. This Data Catalog table will be shared with your analyst team, allowing them to query the table using Amazon Athena.

Prerequisites

For this tutorial, you need an S3 bucket to store output from the AWS Glue ETL job and Athena queries, and a Data Catalog database to create new tables. You also need to create AWS Identity and Access Management (IAM) roles for the AWS Glue job and AWS Management Console user.

Create an S3 bucket to store output from the AWS Glue ETL jobs and Athena query results

You can either create a new S3 bucket or use an existing bucket to store output from the AWS Glue ETL job and Athena queries. Throughout this post, replace <your-s3-bucket> with the name of your S3 bucket.
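If you prefer to create the bucket programmatically instead of through the console, a minimal boto3 sketch might look like the following (the bucket name and Region are placeholders; choose a globally unique name):

    import boto3

    # Create the bucket that will hold the AWS Glue job output and Athena query results.
    # Replace the bucket name and Region with your own values.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="your-s3-bucket")

    # Outside us-east-1, the bucket Region must be specified explicitly, for example:
    # s3.create_bucket(
    #     Bucket="your-s3-bucket",
    #     CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
    # )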

Create a Data Catalog database

You can either create a new Data Catalog database or use an existing database to create tables. Throughout this post, replace <your-database> with the name of your database.
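Similarly, if you want to create the database from code rather than the console, a minimal boto3 sketch might look like this (the database name is a placeholder):

    import boto3

    # Create the Data Catalog database that the ETL job's output table will belong to.
    # Replace the database name with your own value.
    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": "your_database"})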

Create an IAM role for the AWS Glue job

Complete the following steps to create an IAM role for the AWS Glue job (a boto3 equivalent is sketched after the steps):

  1. On the IAM console, in the navigation pane, choose Roles.
  2. Choose Create role.
  3. For Trusted entity type, choose AWS service.
  4. For Service or use case, choose Glue.
  5. Choose Next.
  6. For Add permissions, choose AWSGlueServiceRole, then choose Next.
  7. For Role name, enter a role name (for this post, GlueJobRole-recipe-demo).
  8. Choose Create role.
  9. Choose the created IAM role.
  10. Under Permissions policies, choose Add permissions, then choose Create inline policy.
  11. For Policy editor, choose JSON, and enter the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-bigdata-blog/generated_synthetic_reviews/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:List*",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::/*",
                    "arn:aws:s3:::"
                ]
            }
        ]
    }

  12. Choose Next.
  13. For Policy name, enter a name for your policy.
  14. Choose Create policy.
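If you would rather script the role than click through the console, the following boto3 sketch mirrors the preceding steps. The inline policy name is arbitrary, and your-s3-bucket is the placeholder introduced earlier; this is a sketch of the console flow, not an official setup script.

    import json
    import boto3

    iam = boto3.client("iam")
    role_name = "GlueJobRole-recipe-demo"

    # Trust policy equivalent to choosing "AWS service" / "Glue" as the trusted entity.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

    # Attach the AWS managed policy chosen in step 6.
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    )

    # Add the inline policy from step 11 (same JSON as above, with the bucket placeholder).
    inline_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": ["arn:aws:s3:::aws-bigdata-blog/generated_synthetic_reviews/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:List*", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [
                    "arn:aws:s3:::your-s3-bucket/*",
                    "arn:aws:s3:::your-s3-bucket",
                ],
            },
        ],
    }
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="recipe-demo-s3-access",
        PolicyDocument=json.dumps(inline_policy),
    )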

Create an IAM role for the console user

Complete the following steps to create the IAM role to interact with the console:

  1. On the IAM console, in the navigation pane, choose Roles.
  2. Choose Create role.
  3. For Trusted entity type, choose the entity of your choice.
  4. For Add permissions, add the following AWS managed policies:
    1. AmazonAthenaFullAccess
    2. AWSGlueConsoleFullAccess
  5. Choose Next.
  6. For Role name, enter a role name of your choice.
  7. Choose Create role.
  8. Choose the created IAM role.
  9. Under Permissions policies, choose Add permissions, then choose Create inline policy.
  10. For Policy editor, choose JSON, and enter the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Action": [
                    "iam:PassRole"
                ],
                "Resource": [
                    "arn:aws:iam:::role/GlueJobRole-recipe-demo"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-bigdata-blog/generated_synthetic_reviews/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:List*",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::/*",
                    "arn:aws:s3:::/*"
                ]
            }
        ]
    }

  11. Choose Next.
  12. For Policy name, enter a name for your policy.
  13. Choose Create policy.

The S3 bucket and IAM roles required for this tutorial have been created and configured. Switch to the console user role that you set up and proceed with the steps in the following sections.

Author and run a data integration job using the interactive data preparation experience

Let’s create an AWS Glue ETL job in AWS Glue Studio. In this ETL job, we load S3 Parquet files as the source, process the data using recipe steps, and write the output to Amazon S3 as Parquet. You can configure all these steps in the visual editor in AWS Glue Studio. We use the new data preparation authoring capabilities to create recipes that meet our specific business needs for data transformations. This exercise will demonstrate how you can develop data preparation recipes in AWS Glue Studio that are tailored to your use case and can be readily incorporated into scalable ETL jobs. Complete the following steps:

  1. On the AWS Glue Studio console, choose Visual ETL in the navigation pane.
  2. Under Create job, choose Visual ETL.
  3. At the top of the job, replace “Untitled job” with a name of your choice.
  4. On the Job Details tab, under Basic properties, specify the IAM role that the job will use (GlueJobRole-recipe-demo).
  5. Choose Save.
  6. On the Visual tab, choose the plus sign to open the Add nodes menu. Search for s3 and add Amazon S3 as a Source.
    1. For S3 source type, choose S3 location.
    2. For S3 URL, specify s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/.
    3. For Data format, select Parquet.
  7. As a child of this source, search in the Add nodes menu for recipe and add the Data Preparation Recipe transform.
  8. In the Data preview window, choose Start session if it has not been started.
    1. If it hasn’t been started, Start a data preview session will be displayed on the Data Preparation Recipe transform.
    2. Choose your IAM role for the AWS Glue job.
    3. Choose Start session.
  9. After your data preview session has been started, on the Data Preparation Recipe transform, choose Author Recipe to open the data preparation recipe editor.

This will initialize a session using a subset of the data. After session initialization, the AWS Glue Studio console provides an interactive interface that enables intuitive construction of recipe steps for AWS Glue ETL jobs.

As described in our example use case, you’re authoring recipes to preprocess customer review data for analysis. Upon reviewing the spreadsheet-style data preview, you notice that the product_title column contains values like “business formal pants, plain” and “business formal jeans, patterned”, with the product name and a sub-attribute separated by a comma. To better structure this data for downstream analysis, you decide to split the product_title column on the comma delimiter to create separate columns for the product name and sub-attribute. This will allow for easier filtering and aggregation by product type or attribute during analysis.

On the spreadsheet-style UI, you can check statistics for each column, such as the minimum, median, maximum, cardinality, and value distribution, computed over a subset of the data. This provides useful insights about the data to inform transformation decisions. When reviewing the statistics for the review_year column, you notice it contains a wide range of values spanning more than 15 years. To enable easier analysis of seasonal and weekly trends, you decide to derive new columns showing the week number and day of the week, computed from the review_date column.

Moreover, for the convenience of downstream analysis, you decide to change the data type of the customer_id and product_id columns from string to integer. Converting data types is a common task in ETL workflows for analytics. The data preparation recipes in AWS Glue Studio provide a wide variety of common ETL transformations, such as renaming, deleting, sorting, and reordering columns. Feel free to browse the data preparation UI to discover other available recipes that can help transform your data.

Let’s see how to implement the recipe step in the Data Preparation Recipe transform to meet these requirements.

  1. Select the customer_id column and choose the Change type recipe step.
    1. For Change type to, choose integer.
    2. Choose Apply to add the recipe step.
  2. Select the product_id column and choose the Change type recipe step.
    1. For Change type to, choose integer.
    2. Choose Apply.
  3. Select the product_title column and choose On a single delimiter under SPLIT.
    1. For Delimiter, select Enter custom value and enter ,.
    2. Choose Apply.
  4. Select the review_date column and choose Week number under EXTRACT.
    1. For Destination column, enter review_date_week_number.
    2. Choose Apply.
  5. Select the review_date column and choose Day of week under EXTRACT.
    1. For Destination column, enter review_date_week_day.
    2. Choose Apply.

After these recipe steps are applied, you can see that the customer_id and product_id columns have been converted to integer, the product_title column has been split into product_title_1 and product_title_2, and the review_date_week_number and review_date_week_day columns have been added. While authoring data preparation recipe steps, you can view the tabular data and inspect whether the recipe steps are working as expected, so you can interactively validate each step against the previewed data subset as you author the recipe.

  1. Choose Done authoring recipe to close the interface.

Now, on the Script tab in the AWS Glue Studio console, you can see the script generated from the recipe steps. AWS Glue Studio automatically converts the recipe steps configured through the UI into Python code. This allows you to build ETL jobs that use the wide range of transformations available in data preparation recipes, without having to manually code the logic yourself.
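The generated script isn’t reproduced here because it depends on the exact recipe you authored, but the five recipe steps above correspond to ordinary Spark DataFrame operations. The following standalone PySpark sketch shows one way to express the same transformations outside of AWS Glue; it is not the code AWS Glue Studio generates, and the recipe’s Day of week step may produce day names rather than the numeric values dayofweek returns.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, split, weekofyear, dayofweek

    spark = SparkSession.builder.getOrCreate()

    # Load the same source data used by the job's Amazon S3 source node.
    df = spark.read.parquet(
        "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/"
    )

    df = (
        df
        # Steps 1-2: cast customer_id and product_id from string to integer.
        .withColumn("customer_id", col("customer_id").cast("int"))
        .withColumn("product_id", col("product_id").cast("int"))
        # Step 3: split product_title on the comma into name and sub-attribute columns.
        .withColumn("product_title_1", split(col("product_title"), ",").getItem(0))
        .withColumn("product_title_2", split(col("product_title"), ",").getItem(1))
        # Steps 4-5: derive week number and day of week from review_date.
        .withColumn("review_date_week_number", weekofyear(col("review_date")))
        .withColumn("review_date_week_day", dayofweek(col("review_date")))
    )

    df.show(5)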

  1. Choose Save to save the job.
  2. On the Visual tab, search in the Add nodes menu for s3 and add Amazon S3 as a Target (a rough script-level equivalent of this sink is sketched after these steps).
    1. For Format, choose Parquet.
    2. For Compression Type, choose Snappy.
    3. For S3 Target Location, select your output S3 location s3://<your-s3-bucket>/output/.
    4. For Data Catalog update options, choose Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
    5. For Database, choose the database of your choice.
    6. For Table name, enter data_preparation_recipe_demo_tbl.
    7. Under Partition keys, choose Add a partition key, and select review_year.
  3. Choose Save, then choose Run to run the job.
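In the generated script, a target node like this typically corresponds to an S3 sink that also creates or updates the Data Catalog table. The following fragment is a rough sketch of that pattern using the AWS Glue DynamicFrame API (database, bucket, and job names are placeholders, and the recipe transformations are omitted); the actual code AWS Glue Studio generates may differ.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the source data (the recipe steps would run between this read and the write).
    source_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            "paths": [
                "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/"
            ]
        },
        format="parquet",
    )

    # Write Snappy-compressed Parquet partitioned by review_year, and create or update
    # the table in the Data Catalog on each run.
    sink = glueContext.getSink(
        connection_type="s3",
        path="s3://your-s3-bucket/output/",
        partitionKeys=["review_year"],
        compression="snappy",
        enableUpdateCatalog=True,
        updateBehavior="UPDATE_IN_DATABASE",
    )
    sink.setCatalogInfo(
        catalogDatabase="your_database",
        catalogTableName="data_preparation_recipe_demo_tbl",
    )
    sink.setFormat("glueparquet")
    sink.writeFrame(source_dyf)

    job.commit()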

Up to this point, we have created and run the ETL job. When the job has finished running, a table named data_preparation_recipe_demo_tbl has been created in the Data Catalog. The table has the partition column review_year with partitions for the years 2000–2016. Let’s move on to the next step and query the table.
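If you want to verify the result from code rather than the console, a quick boto3 check of the table’s partitions might look like this (the database name is the placeholder from earlier; pagination is ignored for brevity):

    import boto3

    glue = boto3.client("glue")

    # List the review_year partition values registered for the output table.
    response = glue.get_partitions(
        DatabaseName="your_database",
        TableName="data_preparation_recipe_demo_tbl",
    )
    years = sorted({p["Values"][0] for p in response["Partitions"]})
    print(years)  # for this dataset, expect values roughly between 2000 and 2016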

Run queries on the output data with Athena

Now that the AWS Glue ETL job is complete, let’s query the transformed output data. As a sample analysis, let’s find the top three most reviewed items in 2008 across all marketplaces and calculate the average star rating for each of them. Then, for the most reviewed item in 2008, we find its top five sub-attributes. This demonstrates querying the newly processed dataset to derive insights.

  1. On the Athena console, run the following query against the table:
    SELECT count(*) AS count, product_title_1, avg(star_rating) AS ave 
    FROM <your-database>.data_preparation_recipe_demo_tbl
    WHERE review_year = 2008
    GROUP BY product_title_1
    ORDER BY count DESC
    LIMIT 3;

This query counts the number of reviews in 2008 for each product_title_1 and returns the top three most reviewed items. It also calculates the average star_rating for each of the top three items. The query will return results as shown in the following screenshot.

The item “made with natural materials heels” is the most reviewed item. Now let’s query the top five most reviewed sub-attributes for it.

  1. Run the following query against the table:
    SELECT count(*) AS count, product_title_2, avg(star_rating) AS ave 
    FROM <your-database>.data_preparation_recipe_demo_tbl
    WHERE review_year = 2008 
    AND product_title_1 = 'made with natural materials heels'
    GROUP BY product_title_2
    ORDER BY count DESC
    LIMIT 5;

The query will return results as shown in the following screenshot.

The query results show that for the top reviewed item, “made with natural materials heels”, the top five most reviewed sub-attributes in 2008 were draped, asymmetric, muted, polka-dotted, and oversized. Of these top five sub-attributes, draped had the highest average star rating.

Through this walkthrough, we were able to quickly build an ETL job and generate datasets that fulfill analytics needs, without the overhead of manual script coding.

Clean up

If you no longer need this solution, you can delete the following resources created in this tutorial (a boto3 sketch for removing the Data Catalog objects and the job follows the list):

  • S3 bucket for the AWS Glue job output and Athena query results (s3://<your-s3-bucket>)
  • IAM roles for the AWS Glue job (GlueJobRole-recipe-demo) and the console user
  • AWS Glue ETL job
  • Data Catalog database (<your-database>) and table (data_preparation_recipe_demo_tbl)
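The following boto3 sketch covers the Data Catalog objects and the ETL job (job and database names are placeholders); empty the S3 bucket and delete the IAM roles from the console, and only delete the database if you created it solely for this tutorial.

    import boto3

    glue = boto3.client("glue")

    # Remove the output table and, if it was created only for this tutorial, the database.
    glue.delete_table(DatabaseName="your_database", Name="data_preparation_recipe_demo_tbl")
    glue.delete_database(Name="your_database")

    # Remove the visual ETL job (use the job name you chose in AWS Glue Studio).
    glue.delete_job(JobName="your-glue-job-name")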

Conclusion

In this post, we introduced the new AWS Glue data preparation authoring experience, which lets you create low-code, no-code data integration recipe transformations directly in the AWS Glue Studio console. We demonstrated how you can use this feature to quickly build ETL jobs and generate datasets that meet your business needs without time-consuming manual coding.

The AWS Glue data preparation authoring experience is now publicly available. Try out this new capability and discover recipes that can facilitate your data transformations.

To learn more about using the interactive data preparation authoring experience in AWS Glue Studio, check out the following video and read the AWS News Blog.


About the Authors

Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Fabrizio Napolitano is a Principal Specialist Solutions Architect for Data Analytics at AWS. He has worked in the analytics domain for the last 20 years, now focusing on helping Canadian public sector organizations innovate with data. Quite by surprise, he became a Hockey Dad after moving to Canada.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Gal Heyne is a Technical Product Manager for AWS Data Processing services with a strong focus on AI/ML, data engineering, and BI. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data services products.
