
Generating and Viewing Lineage through Apache Ozone


Follow your data in object storage on-premises

As businesses look to scale out storage, they need a storage layer that is performant, reliable, and scalable. With Apache Ozone on the Cloudera Data Platform (CDP), they can implement a scale-out model and build out their next-generation storage architecture without sacrificing security, governance, or lineage. CDP integrates its existing Shared Data Experience (SDX) with Ozone for an easy transition, so you can begin using object storage on-prem. In this article, we'll focus on generating and viewing lineage that includes Ozone assets from Apache Atlas.

About Ozone integration with Atlas

With CDP 7.1.4 and later, Ozone is integrated with Atlas out of the box, and entities such as Hive tables, Spark processes, and NiFi flows will result in Atlas creating Ozone path entities. Examples include writing a Spark dataset to Ozone or launching a DDL query in Hive that points to a location in Ozone. Prior to that, entities created with an Ozone path resulted in HDFS path entities.

This integration mechanism does not provide a direct Atlas Hook or Atlas Bridge option for listening to entity events in Ozone. As such, Atlas does not have a direct hook, and only the path information is supplied. As we'll see, Atlas populates a few other attributes for Ozone entities.

Before we begin

This article assumes that you have a CDP Private Cloud Base cluster 7.1.5 or higher with Kerberos enabled and admin access to both Ranger and Atlas. In addition, you will need a user that can create databases and tables in Hive, and create volumes and buckets in Ozone.

Also, to provide some context for those new to Ozone, it offers three main abstractions. If we think about them from the perspective of traditional storage, we can draw the following analogies:

  • Volumes are similar to mount points. Volumes are used to store buckets. Only administrators can create or delete volumes. Once a volume is created, users can create as many buckets as needed.
  • Buckets are similar to subdirectories of that mount point. A bucket can contain any number of objects, but buckets cannot contain other buckets. Ozone stores data as objects that live inside these buckets.
  • Keys are similar to the fully qualified paths of files.

This allows for security policies at the volume or bucket level so you can isolate users as it makes sense for your requirements. For example, my data volume might contain multiple buckets for each stage of the data, and I can control who accesses each stage. Another scenario could be that each line of business gets its own volume, and they can create buckets and policies as it makes sense for their users.
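To make that line-of-business scenario concrete, here is a minimal Ozone shell sketch. The volume, bucket, and key names (finance, raw, 2022/report.csv) are hypothetical placeholders for whatever fits your environment:

ozone sh volume create /finance                             # volume: similar to a mount point (admin only)
ozone sh bucket create /finance/raw                         # bucket: similar to a subdirectory of that mount point
ozone sh key put /finance/raw/2022/report.csv report.csv    # key: similar to a fully qualified file path (report.csv is a hypothetical local file)
ozone sh key list /finance/raw                              # confirm the key was written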

Loading data into Ozone

Let's begin by loading data into Ozone. In order to do this, we first need to create a volume and a bucket using the Ozone shell. The commands to do so are as follows:

ozone sh volume create /data
ozone sh bucket create /data/tpc

I've chosen these names because I'll be using an easy method for generating and writing TPC-DS datasets, including creating their corresponding Hive tables. You can find the tool I'm using in the repo here. However, feel free to pick your own names for the volume and bucket and bring your own data. Before that, let's verify the bucket exists in two ways:

  1. Using the Ozone shell
    ozone sh bucket list /data

    You should see something similar to

    {
      "metadata" : { },
      "volumeName" : "data",
      "name" : "tpc",
      "storageType" : "DISK",
      "versioning" : false,
      "usedBytes" : 0,
      "usedNamespace" : 0,
      "creationTime" : "2021-07-21T17:21:18.158Z",
      "modificationTime" : "2021-07-21T17:21:18.158Z",
      "encryptionKeyName" : null,
      "sourceVolume" : null,
      "sourceBucket" : null,
      "quotaInBytes" : -1,
      "quotaInNamespace" : -1
    }
  2. Using the Hadoop CLI
    hdfs dfs -ls ofs://ozone1/data/tpc

    The Ozone URI takes the form ofs://<ozone_service_id>/<volume>/<bucket>. If you don't know the first part, you can easily find it in Cloudera Manager by going to Clusters > Ozone and choosing the Configuration tab. From there you can search for service.id to see the configured value.
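    If you prefer the command line, one way to check it from a gateway host is to grep the client configuration. This is a sketch only; it assumes Cloudera Manager has deployed the Ozone client configuration under /etc/ozone/conf, so adjust the path if your deployment differs.
    # The config path below is an assumption for a typical CM-managed gateway host
    grep -A1 "service.id" /etc/ozone/conf/ozone-site.xml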

 

With our bucket created, let's finalize security for Hive. I mentioned at the beginning that you'd need a user with fairly open access in Hive and Ozone. However, despite that access, there is still one more policy required for us to create tables that point to a location in Ozone. Now that we've created that location, we can create a policy like the one described below.

The policy uses the URI in Ozone where we loaded data. By creating this policy, we allow SQL actions access to the volume and the bucket within it, so we are able to create tables in Hive.
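Since the policy itself is normally configured in the Ranger web UI, here is a hedged sketch of an equivalent call against Ranger's public REST API for reference. The Ranger host and port, admin credentials, Hive service name (cm_hive), and user (etl_user) are all assumptions for illustration; the only part that matters for this walkthrough is the url resource pointing at the Ozone location.

# Host, credentials, service name, and user below are placeholders for your environment
curl -k -u admin:admin -X POST -H "Content-Type: application/json" \
  "https://<ranger-host>:6182/service/public/v2/api/policy" \
  -d '{
        "service": "cm_hive",
        "name": "tpc_ozone_url",
        "resources": { "url": { "values": [ "ofs://ozone1/data/tpc/*" ], "isRecursive": true } },
        "policyItems": [ {
          "accesses": [ { "type": "read", "isAllowed": true }, { "type": "write", "isAllowed": true } ],
          "users": [ "etl_user" ]
        } ]
      }'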

Now we can begin loading data.

  • If you're bringing your own, it's as simple as creating a directory in Ozone using the Hadoop CLI and putting the files you want there:
    hdfs dfs -mkdir ofs://ozone1/data/tpc/test
    hdfs dfs -put <filename>.csv ofs://ozone1/data/tpc/test
  • If you want to use the data generator, clone the repo locally and change into the directory.
    git clone https://github.com/dstreev/hive-testbench.git
  • From there, you can run the following:
    hdfs dfs -mkdir ofs://ozone1/data/tpc/tpcds
    ./tpcds-build.sh
    ./tpcds-gen.sh --scale 10 --dir ofs://ozone1/data/tpc/tpcds
  • Upon success, you’ll see these messages:
    TPC-DS text data generation complete.
    Loading text data into external tables.

If you need to create a table for your data, it's as simple as launching a DDL query with the location pointing to the Ozone address where the data was loaded.
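For example, a minimal external table DDL over the CSV file loaded earlier might look like the sketch below. The table name, columns, and HiveServer2 JDBC URL are illustrative assumptions; the important piece is the LOCATION clause pointing at the Ozone path.

# The JDBC URL, Kerberos principal, and table schema below are placeholders
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "
  CREATE EXTERNAL TABLE IF NOT EXISTS test_csv (
    id   INT,
    name STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'ofs://ozone1/data/tpc/test';
"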

Viewing lineage in Atlas

Once tables are in place, we can begin to see lineage in Atlas. To view the tables you've created, pull down the Search By Type bar and enter hive_table. Once that's been selected, click the search button to view all tables.
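If you prefer scripting, the same lookup can be run as a basic search against the Atlas REST API. This is a sketch only; the host, credentials, and port (31443 is the usual TLS port for Atlas on CDP) are assumptions for your cluster.

# Host and credentials are placeholders; -k skips TLS verification for the example
curl -k -u admin:admin "https://<atlas-host>:31443/api/atlas/v2/search/basic?typeName=hive_table&limit=25"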

Click on the table you want to view and open the lineage tab on the right. You'll notice that Hive tables and processes are present, but so are Ozone keys.

 

Now that we have data in Hive, let's propagate the lineage with Spark.

Writing to Ozone in Spark

Since the goal of this blog is to show lineage for data that exists in Ozone, I'm going to do a simple transformation in the Spark shell and write the data out to Ozone. Feel free to bring your own code or run whatever queries you'd like against the data you have there. If you want to follow along, here are the steps:

  1. Launch spark-shell in client mode
    spark-shell --master yarn --deploy-mode client --conf spark.yarn.access.hadoopFileSystems=ofs://ozone1/data/tpc/
  2. Create a dataset from the customer table
    val customerDs = spark.sql("SELECT * FROM tpcds_text_10.customer")
  3. Fill in any nulls for the columns c_birth_year, c_birth_month, and c_birth_day. In this example, we're filling null years with 1970 and null months and days with 1.
    val noDobNulls = customerDs.na.fill(1970, Seq("c_birth_year")).na.fill(1, Seq("c_birth_month", "c_birth_day"))
  4. Left-pad the month and day columns so they are all the same length, and create the birthDate column to concatenate year, month, and day separated by a dash. (The col, concat_ws, and lpad functions need to be imported in the shell.)
    import org.apache.spark.sql.functions.{col, concat_ws, lpad}
    val paddedMonth = lpad(col("c_birth_month"), 2, "0")
    val paddedDay = lpad(col("c_birth_day"), 2, "0")
    val birthDate = concat_ws("-", col("c_birth_year"), paddedMonth, paddedDay)
  5. Add the c_birth_date column to the dataset where we filled in the null values
    val customerWithDob = noDobNulls.withColumn("c_birth_date", birthDate)
  6. Write the transformed data to Ozone in ORC format (a quick check of the output is shown after these steps)
    import org.apache.spark.sql.SaveMode
    customerWithDob.write.mode(SaveMode.Overwrite).orc("ofs://ozone1/data/tpc/tpcds/10/customer_orc")
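As a quick check (referenced in step 6), you can list the output directory to confirm the ORC files landed in Ozone:

hdfs dfs -ls ofs://ozone1/data/tpc/tpcds/10/customer_orc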

Going back to Atlas, you can see the lineage has propagated from our Spark process.

Summary

Now you have seen how CDP provides an easier transition to an on-prem, private cloud architecture without sacrificing important aspects of security, governance, and lineage. With Ozone in place, you can begin to scale out compute and storage separately and overcome the limitations of HDFS. If you want to continue experimenting with Ozone and Atlas, you can try writing to Ozone via Kafka using our documented configuration examples. Then you can import Kafka lineage using the Atlas Kafka import tool provided with CDP. After that, check out our Knowledge Hub to see everything that is available to you with CDP.

