Load Nested XML Files to Oracle Autonomous Database and Convert it to Structured Table

Published in

Oracle Developers

6 min readMar 18, 2024

Oracle Autonomous Database has long had support for Oracle XML DB features. Using the XMLTABLE feature you can load an XML file as Rows and Columns to a Structured Oracle Table.

For this demo we will use a sample XML called PatientRecord.xml as below:

<?xml version="1.0" encoding="UTF-8"?><PatientRecord><Patient><ID>123456</ID><Name>सुरेश कुमार</Name><Gender>M</Gender><BirthDate>1980-09-10</BirthDate><Address><Street>महात्मा गांधी रोड</Street><City>मुंबई</City><State>महाराष्ट्र</State><Country>भारत</Country><PostalCode>400001</PostalCode></Address></Patient><MedicalHistory><Allergies><Allergy><Name>पेनिसिलिन</Name><Reaction>गंभीर</Reaction></Allergy><Allergy><Name>धूल</Name><Reaction>मामूली</Reaction></Allergy></Allergies><ChronicConditions><Condition><Name>अस्थमा</Name><DiagnosedDate>2005-03-15</DiagnosedDate></Condition></ChronicConditions></MedicalHistory><TreatmentHistory><Treatment><StartDate>2022-04-10</StartDate><EndDate>2022-04-24</EndDate><Description>एंटीबायोटिक उपचार</Description><Physician>डॉ. अनिता देशपांडे</Physician></Treatment></TreatmentHistory></PatientRecord>

1. First thing we need to do is; create a staging table and then load the XML file to the table

XML file is stored as XMLType column in table xml_patient_data

CREATE TABLE xml_patient_data (
    id NUMBER GENERATED BY DEFAULT AS IDENTITY MINVALUE 1 START WITH 1,
    xml_content XMLType
);

2. The best way to load bulk data to any Oracle database is to use SQL Loader

a. Create a SQL Loader control file

$ vim xml_patient_load.ctl

LOAD DATA
INFILE 'PatientRecord.xml'
BADFILE 'PatientRecord.bad'
DISCARDFILE 'PatientRecord.dsc'
INTO TABLE xml_patient_data
APPEND
FIELDS TERMINATED BY '\n'
(
    xml_content LOBFILE(CONSTANT 'PatientRecord.xml') TERMINATED BY EOF
)

b. Set NLS_LANG on the host. This ensures you can load the UTF-8 correctly

$ export NLS_LANG=.AL32UTF8

$ echo $NLS_LANG
.AL32UTF8

c. Run the SQL Loader utility sqlldr

Entry for ADB in $ORACLE_HOME/network/admin/tnsnames.ora

ADB=(description= (retry_count=20)(retry_delay=3)(address=(protocol=tcps)(port=1521)(host=adb.uk-london-1.oraclecloud.com))(connect_data=(service_name=m11111111111n_demoadb_high.adb.oraclecloud.com))(security=(ssl_server_dn_match=yes)))

$ sqlldr userid=admin/YOURP@ssword1234@adb control=xml_patient_load.ctl

SQL*Loader: Release 23.0.0.0.0 - Production on Mon Mar 18 00:49:21 2024
Version 23.3.0.23.09

Copyright (c) 1982, 2023, Oracle and/or its affiliates.  All rights reserved.

Path used:      Conventional
Commit point reached - logical record count 1

Table XML_PATIENT_DATA:
  1 Row successfully loaded.

Check the log file:
  xml_patient_load.log
for more information about the load.

d. Let’s check the table and see if the contents are loaded correctly

--truncate table XML_PATIENT_DATA;

select count(*) from XML_PATIENT_DATA;

SELECT id, XMLSerialize(DOCUMENT xml_content AS CLOB) FROM XML_PATIENT_DATA;

-- Read Using getClobVal 

SET SERVEROUTPUT ON SIZE UNLIMITED


DECLARE
  l_xml_content XMLType;
BEGIN
  SELECT xml_content INTO l_xml_content FROM XML_PATIENT_DATA WHERE ROWNUM = 1;
  DBMS_OUTPUT.PUT_LINE(l_xml_content.getClobVal());
END;
/

e. If you want to load multiple XML files where each XML file is a record of a new patient then you can use the below simple shell script to loop through the files

$ vim Patient_Load.sh

$ chmod +x Patient_Load.sh

#!/bin/bash

export NLS_LANG=.AL32UTF8

DIRECTORY='/home/oracle/patientdata'
CONTROL_FILE='xml_patient_load.ctl'

USERID='admin/YourPassw0rd1234#_@adb'

for FILE in ${DIRECTORY}/*.xml; do
  sed "s|CONSTANT '.*.xml'|CONSTANT '${FILE}'|g" ${DIRECTORY}/${CONTROL_FILE} > ${DIRECTORY}/temp_control.ctl

  sqlldr userid=${USERID} control=${DIRECTORY}/temp_control.ctl log=${FILE}.log

  echo "Loaded file: ${FILE}"
done

rm ${DIRECTORY}/temp_control.ctl

$ ./Patient_Load.sh

SQL*Loader: Release 23.0.0.0.0 - Production on Mon Mar 18 00:55:57 2024
Version 23.3.0.23.09

Copyright (c) 1982, 2023, Oracle and/or its affiliates.  All rights reserved.

Path used:      Conventional
Commit point reached - logical record count 1

Table XML_PATIENT_DATA:
  1 Row successfully loaded.

Check the log file:
  /home/oracle/patientdata/PatientRecord.xml.log
for more information about the load.
Loaded file: /home/oracle/patientdata/PatientRecord.xml

3. Create the Final Structured Table with Rows and Columns from XMLType column in Staging Table

CREATE TABLE STRUCTURED_PATIENT_DATA AS
SELECT patient_info.*, allergies_info.*, conditions_info.*, treatment_info.*
FROM xml_patient_data,
XMLTABLE('/PatientRecord/Patient'
  PASSING xml_patient_data.xml_content
  COLUMNS
    ID VARCHAR2(10) PATH 'ID',
    Name VARCHAR2(50) PATH 'Name',
    Gender CHAR(1) PATH 'Gender',
    BirthDate DATE PATH 'BirthDate',
    Street VARCHAR2(100) PATH 'Address/Street',
    City VARCHAR2(50) PATH 'Address/City',
    State VARCHAR2(50) PATH 'Address/State',
    Country VARCHAR2(50) PATH 'Address/Country',
    PostalCode VARCHAR2(10) PATH 'Address/PostalCode'
) patient_info,
XMLTABLE('/PatientRecord/MedicalHistory/Allergies/Allergy'
  PASSING xml_patient_data.xml_content
  COLUMNS
    Allergy_Name VARCHAR2(100) PATH 'Name',
    Allergy_Reaction VARCHAR2(100) PATH 'Reaction'
) allergies_info,
XMLTABLE('/PatientRecord/MedicalHistory/ChronicConditions/Condition'
  PASSING xml_patient_data.xml_content
  COLUMNS
    Condition_Name VARCHAR2(100) PATH 'Name',
    DiagnosedDate DATE PATH 'DiagnosedDate'
) conditions_info,
XMLTABLE('/PatientRecord/TreatmentHistory/Treatment'
  PASSING xml_patient_data.xml_content
  COLUMNS
    Treatment_StartDate DATE PATH 'StartDate',
    Treatment_EndDate DATE PATH 'EndDate',
    Treatment_Description VARCHAR2(100) PATH 'Description',
    Physician VARCHAR2(100) PATH 'Physician'
) treatment_info;

In this SQL statement, patient_data is your table and xml_content is the column where the XML data is stored. The statement makes use of the XMLTABLE function to extract data from different levels of the XML hierarchy.

patient_info extracts the patient's personal information.
allergies_info extracts information about allergies from the MedicalHistory section.
conditions_info extracts information about chronic conditions.
treatment_info extracts information about the treatment history.

Please note that the data types (like VARCHAR2 and DATE) and their sizes may need to be adjusted according to the actual content of your XML and the schema of your database. If your dates are not in the default Oracle format (DD-MON-RR), you may need to use TO_DATE with a format mask to correctly parse the dates.

Query the final structured table to ensure the data is correctly inserted including the non-english characters:

-- truncate table STRUCTURED_PATIENT_DATA; 

-- drop table STRUCTURED_PATIENT_DATA;

select count(*) from STRUCTURED_PATIENT_DATA;

select * from STRUCTURED_PATIENT_DATA;

Appendix A : Normalized Data

When you use XMLTABLE multiple times in a query with one-to-many relationships (like a patient having multiple allergies, conditions, or treatments), it can produce multiple rows for each entry because each XMLTABLE generates a row for every match it finds. If you have multiple allergies, conditions, or treatments, you will get a Cartesian product of all possible combinations, which is likely why you're seeing more rows than expected.

To avoid this, you can use subqueries and aggregate functions to concatenate multiple entries into a single row.

Use Subqueries with aggregation functions to concatenate the multiple entries related to allergies, conditions, and treatments into delimited lists.

Here’s how you can adjust the SQL statement:

CREATE TABLE STRUCTURED_PATIENT_DATA AS
SELECT 
  patient_info.ID,
  patient_info.Name,
  patient_info.Gender,
  patient_info.BirthDate,
  patient_info.Street,
  patient_info.City,
  patient_info.State,
  patient_info.Country,
  patient_info.PostalCode,
  RTRIM(
    XMLCAST(
      XMLAGG(
        XMLELEMENT("e", allergies_info.Allergy_Name || ', ')
      ) AS CLOB
    ),
    ', '
  ) AS Allergies,
  RTRIM(
    XMLCAST(
      XMLAGG(
        XMLELEMENT("e", conditions_info.Condition_Name || ', ')
      ) AS CLOB
    ),
    ', '
  ) AS ChronicConditions,
  RTRIM(
    XMLCAST(
      XMLAGG(
        XMLELEMENT("e", treatment_info.Treatment_Description || ', ')
      ) AS CLOB
    ),
    ', '
  ) AS Treatments
FROM 
  xml_patient_data,
  XMLTABLE('/PatientRecord/Patient'
    PASSING xml_patient_data.xml_content
    COLUMNS
      ID VARCHAR2(10) PATH 'ID',
      Name VARCHAR2(50) PATH 'Name',
      Gender CHAR(1) PATH 'Gender',
      BirthDate DATE PATH 'BirthDate',
      Street VARCHAR2(100) PATH 'Address/Street',
      City VARCHAR2(50) PATH 'Address/City',
      State VARCHAR2(50) PATH 'Address/State',
      Country VARCHAR2(50) PATH 'Address/Country',
      PostalCode VARCHAR2(10) PATH 'Address/PostalCode'
  ) patient_info,
  XMLTABLE('/PatientRecord/MedicalHistory/Allergies/Allergy'
    PASSING xml_patient_data.xml_content
    COLUMNS
      Allergy_Name VARCHAR2(100) PATH 'Name'
  ) allergies_info,
  XMLTABLE('/PatientRecord/MedicalHistory/ChronicConditions/Condition'
    PASSING xml_patient_data.xml_content
    COLUMNS
      Condition_Name VARCHAR2(100) PATH 'Name'
  ) conditions_info,
  XMLTABLE('/PatientRecord/TreatmentHistory/Treatment'
    PASSING xml_patient_data.xml_content
    COLUMNS
      Treatment_Description VARCHAR2(100) PATH 'Description'
  ) treatment_info
GROUP BY 
  patient_info.ID,
  patient_info.Name,
  patient_info.Gender,
  patient_info.BirthDate,
  patient_info.Street,
  patient_info.City,
  patient_info.State,
  patient_info.Country,
  patient_info.PostalCode;

In this adjusted statement:

XMLAGG is used to aggregate the elements.
XMLELEMENT("e", value) is used to create an XML structure that can be aggregated.
RTRIM is used to trim the trailing comma and space.
XMLCAST is used to convert the aggregated XML into a CLOB.
The GROUP BY clause ensures that each patient's information is grouped together, thereby preventing the creation of multiple rows per patient.

This should create a single row for each patient record with comma-separated lists of allergies, conditions, and treatments. Please note that the column names like Allergies, ChronicConditions, and Treatments will now contain delimited string lists instead of XML fragments.

Appendix B : Data Quality

If you’re not able to see the non-english characters or the columns appear as ???? And if you’re unsure whether the issue lies with the database or the display environment, you can use Oracle’s DUMP function to return the internal representation of the data. This can help determine if the correct data is stored in the database:

Use the below SQL to check the field’s data using DUMP

SELECT DUMP(name,1016) FROM STRUCTURED_PATIENT_DATA;

This will show the hexadecimal representation of the stored data, allowing you to verify if the Unicode characters are correctly stored.

Incorrect Data :

--------------------------------------------------------------------------------
Typ=1 Len=44 CharacterSet=AL16UTF16: ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,f
f,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,
fd,0,20

Typ=1 Len=80 CharacterSet=AL16UTF16: ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,f
f,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,
fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd
,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,0,20

Typ=1 Len=30 CharacterSet=AL16UTF16: ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,f
f,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd,ff,fd

The column consists of repeated sequences of FFFD. In Unicode, FFFD represents the replacement character (�), which is used to replace an incoming character whose value is unknown or unrepresentable in Unicode.

The presence of only FFFD characters suggests that the data was not correctly stored as intended Unicode text. Instead, it seems like it was stored as the replacement character. This usually happens when:

Data Conversion Issue: The database attempted to store characters that it couldn’t represent in its character set, leading to the replacement of those characters with FFFD.
CharacterSet Mismatch: The client or the method used to insert the data into the database did not correctly handle or specify the character set, leading to improper encoding and storage of the data.

Given that the character set is specified as AL16UTF16 (which supports the full range of Unicode characters), the issue likely occurred before or during the data insertion process.

Set NLS_LANG at the client from where you are running SQL*Loader this issue should not occur.

Correct Data :

Typ=1 Len=31 CharacterSet=AL32UTF8: e0,a4,b8,e0,a5,81,e0,a4,b0,e0,a5,87,e0,a4,b6,20,e0,a4,95,e0,a5,81,e0,a4,ae,e0,a4,be,e0,a4,b0

Now after setting NLS_LANG to UTF8 in source we cannot see FFFD sequence repeating again.