How to Speed Up Data Flow Between Databricks and SAS


This is a collaborative post between Databricks and T1A. We thank Oleg Mikhov, Solutions Architect at T1A, for his contributions.

 
This is the first post in a series of blogs on best practices for bringing together the Databricks Lakehouse Platform and SAS. A previous Databricks blog post introduced Databricks and PySpark to SAS developers. In this post, we discuss ways to exchange data between SAS and the Databricks Lakehouse Platform and how to speed up the data flow. In future posts, we will explore building efficient data and analytics pipelines involving both technologies.

Data-driven organizations are rapidly adopting the Lakehouse platform to keep up with constantly growing business demands. The Lakehouse platform has become a new norm for organizations looking to build data platforms and architecture. Modernization involves moving data, applications, or other business elements to the cloud. However, the transition to the cloud is a gradual process, and it is business-critical to continue leveraging legacy investments for as long as possible. With that in mind, many companies tend to have multiple data and analytics platforms, where the platforms coexist and complement each other.

One of the combinations we see is the use of SAS with the Databricks Lakehouse. There are many benefits of enabling the two platforms to work together efficiently, such as:

  • Better and more scalable data storage capabilities of cloud platforms
  • Greater computing capacity using technologies such as Apache Spark™, natively built with parallel processing capabilities
  • Higher compliance with data governance and management using Delta Lake
  • Lower cost of data analytics infrastructure with simplified architectures

Some common data science and data analysis use cases and reasons we have observed are:

  1. SAS practitioners leverage SAS for its core statistical packages to develop advanced analytics output that meets regulatory requirements, while they use the Databricks Lakehouse for data management, ELT types of processing, and data governance
  2. Machine learning models developed in SAS are scored on huge amounts of data using the parallel processing architecture of the Apache Spark engine in the Lakehouse platform
  3. SAS data analysts gain faster access to large amounts of data in the Lakehouse Platform for ad-hoc analysis and reporting using Databricks SQL endpoints and high-bandwidth connectors
  4. Ease the cloud modernization and migration journey by establishing a hybrid workstream involving both the cloud architecture and the on-prem SAS platform

However, a key challenge of this coexistence is how data is shared performantly between the two platforms. In this blog, we share best practices implemented by T1A for their customers and benchmark results comparing different methods of moving data between Databricks and SAS.

Scenarios

The most popular use case is a SAS developer trying to access data in the lakehouse. Analytics pipelines involving both technologies require data flow in both directions: data moved from Databricks to SAS and data moved from SAS to Databricks.

  1. Access Delta Lake from SAS: A SAS user wants to access big data in Delta Lake using the SAS programming language.
  2. Access SAS datasets from Databricks: A Databricks user wants to access SAS datasets, typically sas7bdat datasets, as a DataFrame to process in Databricks pipelines or store in Delta Lake for enterprise-wide access.

In our benchmark tests, we used the following environment setup:

  1. Microsoft Azure as the cloud platform
  2. SAS 9.4M7 on Azure (single-node Standard D8s v3 VM)
  3. Databricks Runtime 9.0, Apache Spark 3.1.2 (2-node Standard DS4v2 cluster)

Figure 1 shows the conceptual architecture diagram with the components discussed. The Databricks Lakehouse sits on Azure Data Lake Storage with a Delta Lake medallion architecture. SAS 9.4 installed on an Azure VM connects to the Databricks Lakehouse to read/write data using the connection options discussed in the following sections.

Figure 1: SAS and Databricks conceptual architecture diagram on Azure

The diagram above shows a conceptual architecture of Databricks deployed on Azure. The architecture will be similar on other cloud platforms. In this blog, we only discuss the integration with the SAS 9.4 platform. In a later blog post, we will extend this discussion to accessing lakehouse data from SAS Viya.

Access Delta Lake from SAS

Imagine that we have a Delta Lake table that needs to be processed in a SAS program. We want the best performance when accessing this table, while also avoiding any possible issues with data integrity or data type compatibility. There are different ways to achieve this. Below we discuss a few methods and compare them on ease of use and performance.

In our testing, we used the eCommerce behavior dataset (5.67 GB, 9 columns, ~42 million records) from Kaggle.
Data source credit: eCommerce behavior data from multi category store and REES46 Marketing Platform.

Tested methods

1. Using SAS/ACCESS Interface connectors
Traditionally, SAS users leverage SAS/ACCESS software to connect to external data sources. You can either use a SAS LIBNAME statement pointing to the Databricks cluster or use the SQL pass-through facility. At present, for SAS 9.4, there are three connection options available:

  1. SAS/ACCESS Interface to ODBC
  2. SAS/ACCESS Interface to JDBC
  3. SAS/ACCESS Interface to Spark

SAS/ACCESS Interface to Spark has recently gained new capabilities, with exclusive support for Databricks clusters. See this video for a short demonstration. The video mentions SAS Viya, but the same applies to SAS 9.4.

Code samples showing how to use these connectors can be found in this git repository: T1A Git – SAS Libraries Examples.

2. Using the saspy package
The open-source saspy library from SAS Institute allows Databricks Notebook users to run SAS statements from a Python cell in the notebook to execute code on the SAS server, as well as to import and export data between SAS datasets and pandas DataFrames.
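
A minimal sketch of this route from a Databricks notebook follows; the saspy configuration name `ssh_sas` and the table `ecommerce.events` are placeholders, not part of the benchmark setup.

```python
# A minimal sketch, assuming a working SSH-based saspy configuration
# named "ssh_sas" (see the configuration sketch below).
import saspy

sas = saspy.SASsession(cfgname="ssh_sas")

# Pull the Delta table into pandas on the driver, then push it to the SAS
# server as a WORK dataset; the whole table must fit in driver memory.
pdf = spark.table("ecommerce.events").toPandas()
sas.df2sd(pdf, table="events", libref="work")

# SAS code can now be submitted against the transferred dataset.
result = sas.submit("proc means data=work.events; run;")
print(result["LST"])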

Since the focus of this section is a SAS programmer accessing lakehouse data using SAS programming, this method was wrapped in a SAS macro program, similar to the purpose-built integration method discussed next.

To achieve better performance with this package, we tested the configuration with a defined char_length option (details available here). With this option, we can define lengths for character fields in the dataset. In our tests, using this option brought an additional 15% increase in performance. For the transport layer between the environments, we used the saspy configuration with an SSH connection to the SAS server; a sketch of such a configuration is shown below.
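
For reference, an SSH-based saspy configuration of this kind looks roughly like the following; the host name, SAS installation path, and configuration name are placeholders for your environment.

```python
# sascfg_personal.py -- a sketch of an STDIO-over-SSH saspy configuration;
# adjust paths and host for your own SAS 9.4 deployment.
SAS_config_names = ["ssh_sas"]

ssh_sas = {
    "saspath":  "/opt/sasinside/SASHome/SASFoundation/9.4/bin/sas_u8",
    "ssh":      "/usr/bin/ssh",
    "host":     "sas94.example.com",
    "encoding": "utf-8",
    "options":  ["-fullstimer"],
}
```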

3. Using a purpose-built integration
Although the two methods mentioned above have their upsides, performance can be improved further by addressing some of their shortcomings, discussed in the next section (Test Results). With that in mind, we developed a SAS macro-based integration utility with a primary focus on performance and usability for SAS users. The SAS macro can easily be integrated into existing SAS code without any knowledge of the Databricks platform, Apache Spark, or Python.

The macro orchestrates a multistep process using the Databricks API:

  1. Instruct the Databricks cluster to query and extract data per the provided SQL query and cache the results in DBFS, relying on its Spark SQL distributed processing capabilities.
  2. Compress and securely transfer the dataset to the SAS server (CSV in GZIP) over SSH.
  3. Unpack and import the data into SAS to make it available to the user in a SAS library. At this step, leverage column metadata from the Databricks data catalog (column types, lengths, and formats) for a consistent, correct, and efficient data representation in SAS.

Note that for variable-length data types, the integration supports different configuration options, depending on what best fits the user requirements, such as:

  • using a configurable default value
  • profiling the first 10,000 rows (plus some headroom) to identify the largest value (a sketch of this kind of profiling follows this list)
  • profiling the entire column in the dataset to identify the largest value
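
For illustration, the profiling options could be expressed with a small PySpark helper along these lines; the function name and defaults are our own illustration, not part of the macro itself.

```python
# Profile string columns to estimate the character lengths to declare on the
# SAS side; sample_rows=None profiles the entire column instead of a sample.
from pyspark.sql import functions as F

def estimate_char_lengths(df, sample_rows=10000, headroom=1.2):
    sample = df.limit(sample_rows) if sample_rows else df
    string_cols = [f.name for f in df.schema.fields
                   if f.dataType.simpleString() == "string"]
    maxima = sample.select(
        [F.max(F.length(F.col(c))).alias(c) for c in string_cols]).first()
    return {c: int((maxima[c] or 1) * headroom) for c in string_cols}
```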

A simplified version of the code is available here: T1A Git – SAS DBR Custom Integration.

The end-user invocation of this SAS macro takes three inputs:

  1. The SQL query based on which data will be extracted from Databricks
  2. The SAS libref where the data should land
  3. The name to be given to the SAS dataset

Test results

Figure 2: Databricks to SAS data access methods performance

As shown in the plot above, for the test dataset, SAS/ACCESS Interface to JDBC and SAS/ACCESS Interface to Apache Spark showed similar performance and performed worse than the other methods. The main reason is that these methods do not profile character columns in the dataset in order to set proper column lengths in the SAS dataset. Instead, they define the default length for all character column types (String and Varchar) as 765 symbols. That causes performance issues not only during the initial data retrieval but for all further processing, and it consumes significantly more storage: in our tests, a 5.6 GB source dataset ended up as a 216 GB file in the WORK library. With the SAS/ACCESS Interface to ODBC, however, the default length was 255 symbols, which resulted in a significant performance increase.

Using the SAS/ACCESS Interface methods is the most convenient option for existing SAS users. There are some important considerations when you use these methods:

  1. Both solutions support implicit query pass-through, but with some limitations:
    • SAS/ACCESS Interface to JDBC/ODBC supports pass-through only for PROC SQL statements
    • In addition to PROC SQL pass-through, SAS/ACCESS Interface to Apache Spark supports pass-through for most SQL functions. This method also allows pushing common SAS procedures down to Databricks clusters.
  2. The issue with setting the length of character columns, described earlier. As a workaround, we suggest using the DBSASTYPE option to explicitly set column lengths for SAS tables. This helps with further processing of the dataset but won't affect the initial retrieval of the data from Databricks.
  3. SAS/ACCESS Interface to Spark/JDBC/ODBC does not allow combining (joining) tables from different Databricks databases (schemas), assigned as different libnames, in the same query with the pass-through facility. Instead, it exports the entire tables into SAS and processes them there. As a workaround, we suggest creating a dedicated schema in Databricks that contains views based on tables from the different databases (schemas); a sketch of this workaround follows this list.
  4. The saspy method showed slightly better performance compared to the SAS/ACCESS Interface to JDBC/Spark methods; however, its main drawback is that the saspy library only works with pandas DataFrames, which puts a significant load on the Apache Spark driver program and requires the entire DataFrame to be pulled into memory.
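
A minimal sketch of the dedicated-schema workaround from a Databricks notebook; the schema, table, and column names are illustrative only.

```python
# Create a dedicated schema whose views pre-join tables from different
# databases, so a single libname on the SAS side covers all of them and the
# join stays inside Databricks.
spark.sql("CREATE SCHEMA IF NOT EXISTS sas_views")
spark.sql("""
    CREATE OR REPLACE VIEW sas_views.orders_enriched AS
    SELECT o.*, c.segment
    FROM sales.orders o
    JOIN crm.customers c ON o.customer_id = c.customer_id
""")
```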

The purpose-built integration method showed the best performance compared to the other tested methods. Figure 3 shows a flow chart with high-level guidance on choosing among the methods discussed.

Figure 3: Databricks to SAS data access – method selection

Access SAS datasets from Databricks

This section addresses the need for Databricks developers to ingest a SAS dataset into Delta Lake and make it available in Databricks for business intelligence, visual analytics, and other advanced analytics use cases. While some of the previously described methods are applicable here, some additional methods are discussed as well.

In the test, we start with a SAS dataset (in sas7bdat format) on the SAS server, and we end with this dataset available as a Spark DataFrame in Databricks (where lazy evaluation applies, we force the data to load into the DataFrame and measure the overall time).

We used the same environment and the same dataset for this scenario as in the previous one. The tests do not consider the use case where a SAS user writes a dataset into Delta Lake using SAS programming; that involves cloud provider tools and capabilities, which will be discussed in a later blog post.

Tested methods

1. Using the saspy package from SAS
The sd2df method in the saspy library converts a SAS dataset to a pandas DataFrame, using SSH for data transfer. It offers several options for staging storage (memory, CSV, disk) during the transfer. In our test, the CSV option, which uses PROC EXPORT to a CSV file and the pandas read_csv() method and is the recommended option for large data sets, showed the best performance.
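
A sketch of this route from a Databricks notebook; the configuration name, libref, and table name are placeholders.

```python
# SAS dataset -> pandas (staged as CSV over SSH) -> Spark DataFrame.
import saspy

sas = saspy.SASsession(cfgname="ssh_sas")
pdf = sas.sd2df_CSV(table="transactions", libref="analytics")
sdf = spark.createDataFrame(pdf)   # extra hop from pandas to Spark
```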

2. Using the pandas method
Since its early releases, pandas has allowed users to read sas7bdat files through the pandas.read_sas API. The SAS file has to be accessible to the Python program. Commonly used approaches are FTP, HTTP, or moving the file to cloud object storage such as S3. We instead used a simpler approach and moved the SAS file from the remote SAS server to the Databricks cluster using SCP.
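
A sketch, assuming the sas7bdat file has already been copied to the driver node; the path and encoding are placeholders.

```python
# pandas reads the file on the driver only; the result is then converted to a
# Spark DataFrame in a separate step.
import pandas as pd

pdf = pd.read_sas("/tmp/transactions.sas7bdat",
                  format="sas7bdat", encoding="latin-1")
sdf = spark.createDataFrame(pdf)
```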

3. Using spark-sas7bdat
Spark-sas7bdat is an open-source package developed specifically for Apache Spark. Similar to the pandas.read_sas() method, the SAS file must be available on the filesystem. We downloaded the sas7bdat file from the remote SAS server using SCP.
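
A sketch, assuming the spark-sas7bdat package is installed on the cluster and the file has been copied to DBFS; the path is a placeholder.

```python
# The data source reads sas7bdat blocks in parallel across worker nodes.
df = (spark.read
      .format("com.github.saurfang.sas.spark")
      .load("dbfs:/tmp/transactions.sas7bdat"))
```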

4. Using a purpose-built integration
Another method we explored uses conventional techniques with a focus on balancing convenience and performance. It abstracts away the core integration steps and is made available to the user as a Python library that is executed from a Databricks Notebook. It works as follows (a minimal sketch is shown after the list):

  1. Use the saspy package to execute a SAS macro (on the SAS server) which does the following:
    • Exports the sas7bdat to a CSV file using SAS code
    • Compresses the CSV file with GZIP
  2. Move the compressed file to the Databricks cluster driver node using SCP
  3. Decompress the CSV file
  4. Read the CSV file into an Apache Spark DataFrame
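
A minimal sketch of this flow from a Databricks notebook, under assumptions we state explicitly: a working saspy SSH configuration, XCMD enabled on the SAS server, passwordless SCP from the driver to the SAS server, and placeholder names and paths throughout. This illustrates the steps above, not the T1A library itself.

```python
import subprocess
import saspy

sas = saspy.SASsession(cfgname="ssh_sas")

# Steps 1-2: run SAS code on the SAS server to export the dataset to CSV and
# compress it (the x statement requires XCMD to be enabled on the server).
sas.submit("""
    proc export data=analytics.transactions
        outfile="/tmp/transactions.csv" dbms=csv replace;
    run;
    x "gzip -f /tmp/transactions.csv";
""")

# Step 3: pull the compressed file to the driver node over SCP, then copy it
# to DBFS so that the worker nodes can read it as well.
subprocess.run(
    ["scp", "sasuser@sas94.example.com:/tmp/transactions.csv.gz", "/tmp/"],
    check=True)
dbutils.fs.cp("file:/tmp/transactions.csv.gz", "dbfs:/tmp/transactions.csv.gz")

# Steps 4-5: Spark decompresses gzipped CSV transparently while reading.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("dbfs:/tmp/transactions.csv.gz"))
```
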
Test results

Figure 4: SAS to Databricks data access methods performance

The spark-sas7bdat package showed the best performance among all the methods. It takes full advantage of parallel processing in Apache Spark by distributing blocks of the sas7bdat file across worker nodes. The major drawback of this method is that sas7bdat is a proprietary binary format; the library was built by reverse engineering this binary format, so it does not support all types of sas7bdat files, and it is not officially (commercially) vendor-supported.

The saspy and pandas methods are similar in that they are both built for a single-node environment and both read data into a pandas DataFrame, requiring an additional step before the data is available as a Spark DataFrame.

The purpose-built integration showed better performance compared to saspy and pandas because it reads the data from CSV through the Apache Spark APIs. However, it does not beat the performance of the spark-sas7bdat package. The purpose-built method can still be convenient in some cases, since it allows adding intermediate data transformations on the SAS server.

Conclusion

More and more enterprises are gravitating towards building a Databricks Lakehouse, and there are multiple ways of accessing its data from other technologies. This blog discussed how SAS developers, data scientists, and other business users can leverage the data in the Lakehouse and write results back to the cloud. In our experiments, we tested several methods of reading and writing data between Databricks and SAS. The methods vary not only in performance but also in convenience and the additional capabilities they provide.

For this test, we used the SAS 9.4M7 platform. SAS Viya supports most of the discussed approaches and also provides additional options. If you would like to learn more about these methods or about other specialized integration approaches not covered here, feel free to reach out to us at Databricks or databricks@t1a.com.

In the upcoming posts in this blog series, we will discuss best practices for implementing integrated data pipelines and end-to-end workflows using SAS and Databricks, and how to leverage SAS In-Database technologies for scoring SAS models on Databricks clusters.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Get started

Try the Databricks for SAS Users course on Databricks Academy to get basic hands-on experience with PySpark programming for SAS programming language constructs, and contact us to learn more about how we can assist your SAS team in onboarding their ETL workloads to Databricks and enabling best practices.


