
Archive for January, 2024

When I set @query_result_separator to either a space or a comma, the data is output in a single column, separated by the chosen character. To get the output back into its original columns, use a TAB character (press the Tab key) as the value for @query_result_separator.

EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'profile_integration_errors',
    @recipients = 'x@YXZ.com',
    @query = 'SELECT * FROM ExceptionReport',
    @subject = 'Exception Report Attached',
    @body = 'Please find the report attached',
    @body_format = 'HTML',
    @query_result_header = 1,
    @query_attachment_filename = 'ExceptionReport.csv',
    @attach_query_result_as_file = 1,
    @query_result_separator = '	', -- a literal TAB character between the quotes
    @query_result_width = 32767,
    @query_result_no_padding = 1;

Read Full Post »

Instead of hard coding the credentials in the Notebook, we can reference them in Azure Key Vault (AKV).

I authenticate my API using a header and a key, so I created two secrets in AKV to store the API header and the API key.

I access my ADLS gen 2 using the ADLS access key, so I created another secret in AKV to store the ADLS access key.

To establish a connection from Azure Databricks to ADLS, I use an Azure Key Vault-backed scope. A scope is a collection of secrets identified by a name.

Steps to configure the Azure Key Vault-backed scope for Azure Databricks:

Go to https://<databricks-instance>#secrets/createScope. This URL is case sensitive; scope in createScope must be uppercase. To get the databricks-instance, launch the Azure Databricks workspace and copy the workspace URL; this is the databricks-instance.

Replace <databricks-instance> with the URL you have just copied.

Enter the Scope Name and set the Manage Principal to All Users.

Go to Azure Key Vault to get the DNS Name and Resource ID, then enter them on the scope creation page.

Hit the Create button to create the scope.

To test, open a Notebook in Databricks and then enter the python code below:
display(dbutils.secrets.listScopes())

To get the secrets from Azure Key Vault, use the scope to retrieve them. See the python code below in the Notebook:
adls_account_key = dbutils.secrets.get(scope = "akv_databricks_secretscope", key = "adlsAccountKey")
apiKey = dbutils.secrets.get(scope = "akv_databricks_secretscope", key = "APIUserKey")
apiHeader = dbutils.secrets.get(scope = "akv_databricks_secretscope", key = "APIHeader")

headers= {apiHeader: apiKey}

To access ADLS gen 2, use the python code below. Replace <access key> with adls_account_key and <storage-account> with your storage account name.

spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net","<access key>")
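As an illustration, here is a minimal sketch of the same call with the secret retrieved from the scope; the storage account name is a placeholder:

# "mystorageaccount" is a placeholder; adls_account_key comes from the secret scope above
spark.conf.set("fs.azure.account.key.mystorageaccount.dfs.core.windows.net", adls_account_key)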

To access a container in the ADLS gen 2:

abfss://container_name@storage_account.dfs.core.windows.net/folder_path/file_name
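For example, a sketch of the path to a config file in the bronze container (the storage account name and file path are assumptions for illustration):

# Placeholder storage account name and file path
config_json = "abfss://bronze@mystorageaccount.dfs.core.windows.net/config/api_config.json"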

My API config file in JSON is stored in the bronze container of the ADLS. To read this config file and cache the parsed results, use the python code below:

config_json_data = spark.read.json(config_json, multiLine=True, mode="FAILFAST").cache()

My config file looks like below:

{
    "sites": ["corporate", "olympicdam", "porphyry", "thunderbox", "wilkiecreek", "paradigm"],
    "base_url": "https://data.fleet4d.net"
}

To retrieve the base_url from the config file, use the python code below:

var_base_url = config_json_data.select("base_url").collect()[0]["base_url"]

To loop through each site, use the python code below:

# Collect the parsed config into a row so its sites can be looped over
item = config_json_data.collect()[0]

for i in range(len(item["sites"])):
    var_site = item["sites"][i]
    var_url = f'{var_base_url}/{var_site}/masterdata'

To call the API and get its response back:

import requests

response = requests.request("GET", var_url, headers=headers)

Convert the response to string:

response_text=response.text

The API response is in XML form. To get each item in the tree and write to ADLS:

import xml.etree.ElementTree as ET

# Parse the XML response
root = ET.fromstring(response_text)

# Column names for the Spark dataframe, matching the attributes extracted below
schema = ["id", "itemid", "name", "description", "type", "subtype",
          "ipaddress", "vendor", "enabled", "modified"]
data = []

# Iterate through each device element of the root
for child in root.findall('device'):
    id = child.attrib.get('id')
    itemid = child.attrib.get('itemid')
    name = child.attrib.get('name')
    description = child.attrib.get('description')
    typeX = child.attrib.get('type')
    subtype = child.attrib.get('subtype')
    ipaddress = child.attrib.get('ipaddress')
    vendor = child.attrib.get('vendor')
    enabled = child.attrib.get('enabled')
    modified = child.attrib.get('modified')

    # Create a row of values and add it to the list
    values = [id, itemid, name, description, typeX, subtype, ipaddress, vendor, enabled, modified]
    data.append(values)

# Create Spark dataframe from the list of rows
df = spark.createDataFrame(data, schema)

import os

# Write to ADLS gen 2 as csv
'''
You get multiple files in a folder because Spark writes each shuffle partition in-place out to a "part..." file
to avoid network I/O. You can use coalesce to bring all the shuffles into a single partition and write it out
to a single file, but be mindful of the performance implications.
'''
df.coalesce(1).write.mode('overwrite').format('csv').option("header", True).save(data_file_path)

# Stop Spark from writing a _SUCCESS file on future writes
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false")

# Set the _SUCCESS file path
dir_path = os.path.dirname(data_file_path)
success_file_path = f'{dir_path}/_SUCCESS'

# Remove the _SUCCESS file if it exists
if os.path.exists(success_file_path):
    os.remove(success_file_path)

The Spark write splits the csv into parts and creates _committed, _started and _SUCCESS files, placing them all in a folder. To prevent the csv from being split into parts, use coalesce in your code, as shown above.

Read Full Post »

Using Access Keys

  • When a storage account is created, you get two 512-bit access keys

Access Method

abfss://container_name@storage_account.dfs.core.windows.net/folder_path/file_name

Access via python notebook

spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net", "<access key>")
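As a rough sketch, once the access key is set you can read a file directly from a container; all names below are placeholders:

# Placeholder container, storage account and file names
df = spark.read.csv("abfss://container_name@mystorageaccount.dfs.core.windows.net/folder_path/file_name.csv", header=True)
display(df)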

References:

ABFS-Driver 

Acronyms

  • ABFS (Azure Blob File System)

Databricks Scope

Instead of directly putting your secret value in the code, place the key inside a scope and use the scope name to get the value of the key.
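For example, a minimal sketch using the scope and key names from earlier in this post:

# Hard-coding the secret (avoid):
# adls_account_key = "my-plain-text-key"

# Retrieving it from the Key Vault-backed scope instead:
adls_account_key = dbutils.secrets.get(scope = "akv_databricks_secretscope", key = "adlsAccountKey")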

Read Full Post »

If you see [REDACTED] in the output of your code, it means that Databricks has replaced the secret value with [REDACTED] so it is not displayed in plain text.

https://docs.databricks.com/en/security/secrets/redaction.html
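For example, printing a secret retrieved with dbutils (using the scope and key names from the earlier post) shows the placeholder rather than the actual value:

adls_account_key = dbutils.secrets.get(scope = "akv_databricks_secretscope", key = "adlsAccountKey")
print(adls_account_key)   # output: [REDACTED]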

Read Full Post »

I set up a build pipeline on my SQL Server database repository and I am using Visual Studio 2022.

My build pipeline failed to run with the error below:

##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VisualStudio\v16.0\SSDT\Microsoft.Data.Tools.Schema.SqlTasks.targets(573,5): Error MSB4181: The "SqlBuildTask" task returned false but did not log an error.

The pipeline picked up MSBuild.exe from the Visual Studio 2019 path instead of the Visual Studio 2022 path.

To fix this, set the Agent Specification to windows-2022, instead of windows-latest.

If you are using the Microsoft-hosted agent windows-latest or windows-2022 with your task, it will use VS2022 by default. If you’d like to use VS2019, change the agent to windows-2019.

Read Full Post »

Use the Power Query below and then select the user table.

Syntax:

ActiveDirectory.Domains(optional forestRootDomainName as nullable text) as table

Example:

Source = ActiveDirectory.Domains("company.internal")

Expand the organizationalPerson column to get the user attributes.

https://learn.microsoft.com/en-us/powerquery-m/activedirectory-domains

Read Full Post »

I’ve got a Power BI report that uses DirectQuery against a SQL Server database.

According to this article, https://learn.microsoft.com/en-us/power-bi/connect-data/refresh-data,

“Power BI doesn’t import data over connections that operate in DirectQuery mode. Instead, the semantic model returns results from the underlying data source whenever a report or dashboard queries the semantic model. Power BI transforms and forwards the queries to the data source.

Because Power BI doesn’t import the data, you don’t need to run a data refresh. However, Power BI still performs tile refreshes and possibly report refreshes, as the next section on refresh types explains. A tile is a report visual pinned to a dashboard, and dashboard tile refreshes happen about every hour so that the tiles show recent results. You can change the schedule in the semantic model settings.”

Image source : https://learn.microsoft.com/en-us/power-bi/connect-data/refresh-data

So, it’s important to set up an automatic page refresh on the report itself. To set it, open the pbix file and then go to the page’s Format properties.

https://learn.microsoft.com/en-us/power-bi/create-reports/desktop-automatic-page-refresh

Read Full Post »