Use Apache Spark on Amazon EMR Serverless directly from Amazon SageMaker Studio

You can now run petabyte-scale data analytics and machine learning on Amazon EMR Serverless directly from Amazon SageMaker Studio notebooks. EMR Serverless automatically provisions and scales the required resources, allowing you to focus on your data and models without having to configure, optimize, tune, or manage clusters. EMR Serverless automatically installs and configures open source frameworks and provides a performance-optimized runtime that is compatible with and faster than standard open source.

With this release, you can visually create and browse EMR Serverless applications directly from SageMaker Studio and connect to them in a few clicks. Once connected to an EMR Serverless application, you can use Spark SQL, Scala, and Python to interactively query, explore, and visualize data, and run Apache Spark jobs to process data directly from Studio notebooks. Jobs run fast because they use EMR’s performance-optimized versions of Spark. For example, Spark on EMR 7.1 is 4.5x faster than its open source equivalent. EMR Serverless also offers fine-grained automatic scaling, which provisions and quickly scales compute and memory resources to match the requirements of your application, so you pay only for what you use.
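To make the pay-for-what-you-use idea concrete, here is a minimal sketch of how the cost of a run might be estimated from the resources its workers consumed. It assumes billing is proportional to vCPU-hours and memory GB-hours. The `WorkerUsage` type, the `estimate_cost` function, and both rates are hypothetical placeholders for illustration, not actual AWS prices or APIs.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical placeholder rates for illustration only, NOT actual AWS prices.
VCPU_HOUR_RATE = 0.05   # assumed cost per vCPU-hour
GB_HOUR_RATE = 0.005    # assumed cost per GB-hour of memory


@dataclass
class WorkerUsage:
    """Resources one worker held, and for how long (hypothetical record)."""
    vcpus: int
    memory_gb: int
    seconds: float


def estimate_cost(
    usages: Iterable[WorkerUsage],
    vcpu_rate: float = VCPU_HOUR_RATE,
    gb_rate: float = GB_HOUR_RATE,
) -> float:
    """Sum the cost of each worker for only the time it actually ran."""
    total = 0.0
    for u in usages:
        hours = u.seconds / 3600
        total += hours * (u.vcpus * vcpu_rate + u.memory_gb * gb_rate)
    return total


if __name__ == "__main__":
    # Two workers that were scaled up for different durations.
    run = [
        WorkerUsage(vcpus=4, memory_gb=16, seconds=1800),
        WorkerUsage(vcpus=8, memory_gb=32, seconds=600),
    ]
    print(f"Estimated cost: {estimate_cost(run):.4f}")
```

Because each worker contributes only for the seconds it ran, a worker that is scaled down early adds nothing to the estimate after that point, which is the effect the automatic scaling described above is meant to deliver.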

These features are supported on SageMaker Distribution 1.10 and above, and are generally available in all AWS Regions where SageMaker Studio is available. To learn more, read the blog post "Use LangChain with PySpark for Processing documents at massive scale with Amazon SageMaker Studio and EMR Serverless" or see the SageMaker Studio documentation.

Source: Amazon AWS