Prajwal S R on LinkedIn: Academy Accreditation - Azure Databricks Platform Architect • Prajwal S R…

Prajwal S R

Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


I am very glad to share that I have completed and received the Azure Databricks Platform Architect accreditation badge from the Databricks Academy. Databricks #azuredatabricks #platformarchitect

Academy Accreditation - Azure Databricks Platform Architect • Prajwal S R • Databricks Badges credentials.databricks.com


Tejaswini Paturi

Senior Manager, Agile Leadership, Product Vision, Strategic Planning, Operational Excellence and Customer Success

3d


Congratulations Prajwal S R


Smrithy C

Azure Developer@DATABEAT || Top DataEngineering Voice || Ex-Mindtree || Ex-Picktail || AZ-900/DP 900 /DP 600 Microsoft Certified

3d


Congrats! Prajwal S R


Padmaja Kuruba

Dr. Padmaja Kuruba

2d


Congrats!


More Relevant Posts

  • Prajwal S R

    Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


    In my previous post, I explained the different types of private endpoint sub-resources that we can create and their uses. In this post, we will discuss the different types of private endpoints that we can create for Azure Databricks workspaces. Based on how the private endpoint is created, there are two types: the frontend endpoint and the backend endpoint. There is no separate page or config for these endpoint types; it depends on which VNet we use when creating the endpoint, and based on that we can identify which type of endpoint has been created.

    Frontend endpoint: This is the endpoint created for connections from users to the control plane. It ensures that connection requests from users (the Azure portal / workspace UI, REST APIs, etc.) are made securely. It is typically created in a separate transit VNet, and that VNet can be peered with the on-prem network or with other networks that should be allowed to reach the workspace privately.

    Backend endpoint: This is the endpoint created for connections between the data plane and the control plane, i.e. from the workspace compute to the control plane. Cluster startup requests, job run requests, etc. go through this private endpoint to connect securely. When we create a VNet-injected Databricks workspace, two subnets (the public/host and private/container subnets) already exist; along with these, we can create an additional subnet in the same VNet and use it for this private endpoint.

    Please feel free to add any points I may have missed. #azuredatabricks #privateendpoint #dataengineer #networking
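    Not from the post itself, but as a rough sketch of how to check this programmatically (assuming the azure-identity and azure-mgmt-network Python packages, plus a hypothetical resource group name rg-databricks): list the private endpoints in a resource group and print the subnet each one sits in and the resource it targets, which is enough to tell a workspace-VNet (backend) endpoint from a transit-VNet (frontend) endpoint.

        # Sketch only: the subscription ID and resource group name are assumptions, not from the post.
        from azure.identity import DefaultAzureCredential
        from azure.mgmt.network import NetworkManagementClient

        subscription_id = "<subscription-id>"      # placeholder
        resource_group = "rg-databricks"           # hypothetical resource group

        network_client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

        # A private endpoint records the subnet it lives in and the resource it points at,
        # so the subnet ID tells you whether it sits in the workspace VNet or a transit VNet.
        for pe in network_client.private_endpoints.list(resource_group):
            for conn in pe.private_link_service_connections or []:
                print(pe.name)
                print("  subnet   :", pe.subnet.id)
                print("  target   :", conn.private_link_service_id)
                print("  group ids:", conn.group_ids)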


  • Prajwal S R

    Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


    When we get data in raw format, there is often a need to clean it and get it into the desired shape. It is important to find the null values and remove duplicate data, which also reduces the number of records fetched while querying the table. Below are sample queries in SQL and PySpark to find null records and to remove duplicate records.

    Finding null values:

    SQL:

        SELECT count_if(email IS NULL) FROM users;
        SELECT count(*) FROM users WHERE email IS NULL;

    PySpark:

        from pyspark.sql.functions import col
        usersDF = spark.read.table("users")
        usersDF.selectExpr("count_if(email IS NULL)")
        usersDF.where(col("email").isNull()).count()

    Removing duplicate records:

    SQL:

        CREATE OR REPLACE TEMP VIEW sample AS
        SELECT user_id, timestamp, max(email) AS email_id, max(updated) AS max_updated
        FROM users
        WHERE user_id IS NOT NULL
        GROUP BY user_id, timestamp;

        SELECT count(*) FROM sample;

    PySpark:

        from pyspark.sql.functions import max
        sampleDF = (usersDF
            .where(col("user_id").isNotNull())
            .groupBy("user_id", "timestamp")
            .agg(max("email").alias("email_id"), max("updated").alias("max_updated")))
        sampleDF.count()

    Let me know in the comments about the methods you have used to find null and duplicate records. #azuredatabricks #sql #pyspark #dataengineer
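    As an aside (not from the post), PySpark also has built-in helpers that cover both checks; a minimal sketch assuming the same usersDF DataFrame as above:

        # Assumes the usersDF DataFrame from the post above.
        from pyspark.sql.functions import col

        # Rows where email is null.
        null_emails = usersDF.where(col("email").isNull())
        print(null_emails.count())

        # Keep one (arbitrary) row per user_id, dropping keyed duplicates.
        deduped = usersDF.dropDuplicates(["user_id"])
        print(deduped.count())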


  • Prajwal S R

    Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


    Once we start using Unity Catalog in our Azure Databricks account, we come across two types of tables that we can create: managed tables and external tables. It is important to know the difference between them.

    1. Managed tables: These are saved in the managed storage location, which is the location we provided while creating the metastore.

    2. External tables: These are saved in an external location that we have registered, either the exact external location path or a nested folder under it.

    In both cases, the metadata is stored in Unity Catalog, and there is no difference in how access permissions work. When we drop a managed table, both the data and the metadata are deleted. With external tables, only the metadata is deleted and the data is still available in the external location.

    Which type of table have you created, and which do you find better? Let me know your thoughts in the comments. #azuredatabricks #databricks #tables #dataengineering
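    To make the difference concrete, here is a small sketch (not from the post; the main.demo catalog/schema and the abfss path are placeholders): the only change in the CREATE statement is the LOCATION clause, and DESCRIBE TABLE EXTENDED reports whether the result is MANAGED or EXTERNAL.

        # Placeholder catalog/schema/path; assumes a Unity Catalog-enabled cluster.
        spark.sql("""
            CREATE TABLE IF NOT EXISTS main.demo.users_managed (user_id STRING, email STRING)
        """)

        spark.sql("""
            CREATE TABLE IF NOT EXISTS main.demo.users_external (user_id STRING, email STRING)
            LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/demo/users_external'
        """)

        # The 'Type' row shows MANAGED for the first table and EXTERNAL for the second.
        spark.sql("DESCRIBE TABLE EXTENDED main.demo.users_external").show(truncate=False)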


  • Prajwal S R

    Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


    One of the recent features added to the Azure Databricks service is the ability to create a Private Link connection for workspaces, so that traffic stays secure and flows only through the approved network. We can also connect securely from our on-prem networks using a transit VNet.

    Private Link is a feature where we create a private endpoint for a resource; the endpoint is assigned a private IP that is used for all connections to that resource. We can create private endpoints for a wide range of resources, and the sub-resource types available on the endpoint vary depending on the service we are creating it for. For example, a private endpoint for ADLS can use sub-resource types such as dfs, blob, and file. Similarly, there are two types we can select for Azure Databricks workspaces: databricks_ui_api and browser_authentication.

    1. browser_authentication: This type can be selected when we have multiple workspaces in the same region, and one such endpoint per region is enough. Once it is created and connected to the network, the authentication (SSO) requests for all workspaces in that region go through this endpoint.

    2. databricks_ui_api: This is the endpoint used to connect to the Databricks control plane (workspace UI and REST API). Each workspace must have a separate endpoint of this type. The network traffic for a Private Link connection between a transit VNet and the workspace control plane always traverses the Microsoft backbone network.

    Please feel free to add any points I may have missed. I will post later about the different types of private endpoints that we can create for Azure Databricks workspaces. #azuredatabricks #networking #privatelink
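    As an illustration only (none of the names below come from the post; the subscription, resource group, region, subnet and workspace IDs are placeholders), a minimal sketch with the azure-mgmt-network Python SDK that creates a workspace private endpoint and picks the sub-resource type via group_ids:

        # Sketch only: subscription, resource group, subnet and workspace IDs are placeholders.
        from azure.identity import DefaultAzureCredential
        from azure.mgmt.network import NetworkManagementClient
        from azure.mgmt.network.models import PrivateEndpoint, PrivateLinkServiceConnection, Subnet

        subscription_id = "<subscription-id>"
        workspace_id = ("/subscriptions/<subscription-id>/resourceGroups/rg-databricks"
                        "/providers/Microsoft.Databricks/workspaces/adb-demo")
        subnet_id = ("/subscriptions/<subscription-id>/resourceGroups/rg-databricks"
                     "/providers/Microsoft.Network/virtualNetworks/vnet-transit/subnets/snet-pe")

        client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

        poller = client.private_endpoints.begin_create_or_update(
            resource_group_name="rg-databricks",
            private_endpoint_name="pe-adb-ui-api",
            parameters=PrivateEndpoint(
                location="eastus2",
                subnet=Subnet(id=subnet_id),
                private_link_service_connections=[
                    PrivateLinkServiceConnection(
                        name="adb-ui-api",
                        private_link_service_id=workspace_id,
                        # "databricks_ui_api" for workspace/REST API traffic,
                        # "browser_authentication" for the regional SSO endpoint.
                        group_ids=["databricks_ui_api"],
                    )
                ],
            ),
        )
        print(poller.result().provisioning_state)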


  • Prajwal S R

    Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


    How can we access an ADLS resource from Azure Databricks workspaces, and which method is better to use?

    If we are not using Unity Catalog in our environment, we can still connect to ADLS from the workspace using different methods. First of all, there are three main types of authentication available:

    1. Service principal authentication (also called the OAuth method).
    2. SAS token method.
    3. Account key method.

    Each of these methods has its own advantages and disadvantages. Bearing in mind key management and key rotation, many teams go for service principal authentication. Even this method involves creating a client secret with an expiry date; when it expires, we have to generate a new secret and update it in the Spark config commands or in the secrets stored in Key Vault.

    Once we have decided which authentication method to use, there are two access methods we can use with any of the three authentication types:

    1. Mounting method.
    2. Direct access method.

    Even though the mounting method is used by most users, it is not the recommended approach, as it has been deprecated by Databricks. There are several reasons for the deprecation, such as:

    A. A mount point created from one cluster can be accessed from any other cluster by any user who knows the mount point name.
    B. A mount point can also be deleted by any user who knows the mount point name and has access to any cluster.

    So it is recommended to use the direct access method and avoid creating mount points. Since we will be using Spark config commands to access ADLS, we can use the points below to make sure credentials are not visible to everyone:

    1. Store the credentials in Key Vault and read them with dbutils.secrets.
    2. Use notebook ACLs to limit who can access the notebook.
    3. Pass the Spark configs through the cluster's advanced options and enable cluster ACLs.

    Feel free to add more points on this. #azuredatabricks #dataengineer #adls #spark
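    To make the direct access method concrete, here is a minimal sketch (not from the post; the storage account, container, secret scope and secret keys are placeholders) of service principal OAuth Spark configs read from a Key Vault-backed secret scope, followed by a direct abfss:// read:

        # Placeholders: storage account, container, secret scope/keys, tenant ID.
        storage_account = "mystorageaccount"
        container = "raw"

        client_id = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
        client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
        tenant_id = dbutils.secrets.get(scope="kv-scope", key="tenant-id")

        # Standard ABFS OAuth settings for service principal authentication.
        spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
        spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
                       "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
        spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
        spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
        spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
                       f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

        # Direct access: no mount point, just the abfss path.
        df = spark.read.format("parquet").load(
            f"abfss://{container}@{storage_account}.dfs.core.windows.net/landing/")
        df.show(5)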


  • Prajwal S R

    Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


    To calculate Azure Databricks usage cost, here is the formula:

    Total cost for the Databricks service = VM cost + DBU cost
    VM cost = [total hours] x [no. of instances] x [Linux VM price]
    DBU cost = [total hours] x [no. of instances] x [DBU per node] x [DBU price/hour for the Standard/Premium tier]

    Here is an example of how Azure Databricks billing works. Depending on the type of workload your cluster runs, you will be charged for either the Jobs Compute or the All-Purpose Compute workload. For example, if the cluster runs workloads triggered by the Databricks jobs scheduler, you will be charged for the Jobs Compute workload; if your cluster runs interactive features such as ad-hoc commands, you will be billed for the All-Purpose Compute workload.

    If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing for the All-Purpose Compute workload would be:
    VM cost: 100 hours x 10 instances x $0.598/hour = $598
    DBU cost: 100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100
    Total: $598 + $1,100 = $1,698

    For the same cluster running a Jobs Compute workload:
    VM cost: 100 hours x 10 instances x $0.598/hour = $598
    DBU cost: 100 hours x 10 instances x 2 DBU per node x $0.30/DBU = $600
    Total: $598 + $600 = $1,198

    For the same cluster running a Jobs Light Compute workload:
    VM cost: 100 hours x 10 instances x $0.598/hour = $598
    DBU cost: 100 hours x 10 instances x 2 DBU per node x $0.22/DBU = $440
    Total: $598 + $440 = $1,038

    In addition to VM and DBU charges, you may also be charged for bandwidth, managed disks, and storage. #databricks #azuredatabricks #dataengineer #cost
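    The arithmetic above is easy to wrap in a small helper; a sketch using the example figures from the post (not current pricing):

        def databricks_cost(hours, instances, vm_price_per_hour, dbu_per_node, dbu_price):
            """Return (vm_cost, dbu_cost, total) using the formula from the post."""
            vm_cost = hours * instances * vm_price_per_hour
            dbu_cost = hours * instances * dbu_per_node * dbu_price
            return vm_cost, dbu_cost, vm_cost + dbu_cost

        # All-Purpose Compute example: 100 h, 10 x DS13v2, $0.598/h, 2 DBU/node, $0.55/DBU
        vm, dbu, total = databricks_cost(100, 10, 0.598, 2, 0.55)
        print(f"VM ${vm:,.0f}  DBU ${dbu:,.0f}  total ${total:,.0f}")  # VM $598  DBU $1,100  total $1,698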


  • Prajwal S R

    Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


    I’m happy to share that I’m starting a new position as Consultant at Capgemini!


  • Prajwal S R

    Consultant at Capgemini | Azure Databricks | Azure Data Factory | Pyspark | SQL | Cloud Academy certified Azure Databricks Specialist | Microsoft Certified Azure Fundamentals | Ex-LTIMindtree


    How would you approach a use case where you need to merge multiple files in a folder of an ADLS Gen2 account and copy them to another location as a single file instead of multiple files?

    If it is just a matter of loading the files in a Databricks notebook, we can use the .load option to read a complete folder and all its files. But if the use case is to merge the files and copy them to another location, we can use the ADF Copy Data activity. Below are the steps we can perform:

    1. Create a linked service. I will take the example of copying from and saving to a storage account, so we create a linked service pointing to the ADLS Gen2 storage account.
    2. Create datasets: one for the source and another for the sink/destination. In the source dataset, pass only the folder path instead of a complete file path. In the sink dataset, select the path where the merged file has to be written.
    3. Once the datasets are created, create the Copy Data activity on the Pipelines page. In the activity settings, set the Source to Dataset1 (the folder path) and the Sink to Dataset2 (the destination path).
    4. Make sure to select "Merge files" as the copy behavior. This merges all the files under the folder and writes them as a single file in the destination path.

    Sample files under the folder:

    file1.json:
    { "name": "Prajwal" }

    file2.json:
    { "name": "Prajwal S R" }

    After the Copy Data activity completes, the destination path contains a single file with both records:

    { "name": "Prajwal" }
    { "name": "Prajwal S R" }

    #dataengineer #azuredatabricks #azuredatafactory
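    For the notebook-only alternative mentioned at the start of the post, a minimal sketch (the abfss paths are placeholders, and coalesce(1) only makes sense for small data since it funnels everything through a single task):

        # Placeholder paths; assumes the cluster can already authenticate to the storage account.
        source_folder = "abfss://raw@mystorageaccount.dfs.core.windows.net/input/"
        target_folder = "abfss://raw@mystorageaccount.dfs.core.windows.net/merged/"

        # .load on the folder reads file1.json, file2.json, ... into one DataFrame.
        df = spark.read.format("json").load(source_folder)

        # coalesce(1) forces a single output part file in the target folder.
        df.coalesce(1).write.mode("overwrite").format("json").save(target_folder)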

