Drive.Xml

Prev Next

Type

Read/write

Author

Availability

Direct provider

Read

Finbourne

Provided with LUSID

The Drive.Xml provider enables you to write a Luminesce query that extracts data from one or more XML files stored in Drive.

The query returns a table of data assembled from the contents of the file or files in the order they are read.

See also: Drive.Excel, Drive.Sqlite, Drive.Csv, Drive.RawText, Drive.File

Basic usage

@x = use Drive.Xml
<options>
enduse;
select * from @x

Options

Drive.Xml has options that enable you to filter or refine a query.

Note: The --file option is mandatory. The --columns and --nodePath options are recommended; see below.

An option takes the form --<option>=<value>, for example --file=book.xml. Note no spaces are allowed either side of the = operator. If an option:

  • Takes a boolean value, then specifying that option (for example --addFileName) sets it to True; omitting the option specifies False.

  • Takes multiple string values, then specify a comma-separated list, for example --types=Boolean,Date,Decimal.

While not mandatory, note that:

  • The --nodePath option accepts an XPath query that selects the node group(s) to query. If omitted, the entire root node is queried (//*).

  • The --columns option is required to actually retrieve data from an XML document (as opposed to just counting nodes). Arguments must be key/value pairs specified on separate lines underneath the option itself, where the key is the column name you want in the results table and the value is an XPath query that selects the node and/or attribute values to display. See the examples below.

Current options at article update time are listed in the table below. For the very latest information, run the following query using a suitable tool and examine the online help:

@x = use Drive.Xml
enduse;
select * from @x

Current options

Explanation

file (-f)

Mandatory. The file to read. Should be text (any extension) or compressed (.zip, .gz or .tar.gz). It may also be a folder, in which case --folderFilter is also required to specify which files in the folder to process. [String]

folderFilter

Denotes this is searching an entire folder structure and provides a Regular Expression of path/file names within it that should be processed. All matches should be of the same format. [String]

zipFilter (-z)

Denotes this is a Zip file and provides a Regular Expression of path/file names within it that should be processed. All matches should be of the same format. [String]

allowMissing

Should a file/folder simply not exist, don't throw an error but return an empty table with column names and types created as best possible given other options. [Boolean]

addFileName

Adds a column (the first column) to the result set which contains the file the row came from. [Boolean]

types

Column types (comma delimited list of: 'Boolean, Date, DateTime, Decimal, Double, Int, BigInt, Text, Table', some columns may be left blank while others are specified) [String]

trim

Trim both leading and trailing whitespace when reading values [Boolean]

inferTypeRowCount

If non-zero and 'types' is not specified (or not specified for some columns) this will look through N rows to attempt to work out the column types for columns not pre-specified [Int32]

valuesToMakeNull

Regex of values to map to 'null' in the returned data. [String, Default: (^[nN][uU][lL]{1,2}$)|((\r\n|\n|\r)$)|(^(\r\n|\n|\r))|(^\s*$)]

nodePath

XPath query that selects the nodes to map to rows [String, Default: //*]

namespaces

Selected prefix(es) and namespace(s):

prefix1=namespace1-uri1,prefix2=namespace2-uri2,...prefixN=namespaceN-uriN [String]

columns

Column names and XPath to create them, of the form: Name=XPath (one pair per line) If omitted this will be made up of all the direct children of the first selected node in the --nodePath. If any column doesn't supply XPath (so of the form: abc or abc=) then the XPath will be inferred in the same way as omitting this entirely, but only requested columns will be returned. [String]

Examples

In the following examples, the select * from @x syntax at the end prints the table of data assembled by the query.

Note: For more examples, try the Luminesce Github repo.

Example 1: Extract data from a XML file with either one or no namespace

Consider an XML document like this stored as a file in Drive:

<?xml version="1.0" encoding="UTF-8"?>
<books>
    <book>
        <title lang="en" colour="blue">This is the title of book 1</title>
        <price>19.99</price>
    </book>
    <book>
        <title lang="en" colour="white">This is the title of book 2</title>
        <price>29.99</price>
    </book>
</books>

The --nodePath option queries each <book> node in the root <books> node. The --columns option displays the <title> and <price> node values, and the lang and colour attribute values, in four separate columns, one row per book.

@x = use Drive.Xml
--file=/sales/book.xml
--nodePath=books/book
--columns
BookTitle=title
BookPrice=price
Language=title/@lang
Color=title/@colour
enduse;
select * from @x

The table of data returned by the query looks like this:

Example 2: Extract data from a XML file with multiple namespaces

Consider an XML document like this stored as a file in Drive:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:h="http://www.w3.org/schema1"
      xmlns:f="https://www.w3.org/schema2">
<h:books>
    <h:book xmlns:h="http://www.w3.org/schema1">
        <h:title lang="en" colour="blue">This is the title of book 1</h:title>
        <f:title lang="hu" colour="yellow">Book 1: The title</f:title>
        <h:price>19.99</h:price>
    </h:book>
    <h:book xmlns:h="http://www.w3.org/schema1">
        <h:title lang="en" colour="white">This is the title of book 2</h:title>
        <f:title lang="hu" colour="orange">Book 2: The title</f:title>
        <h:price>29.99</h:price>
    </h:book>
</h:books>
<f:books>
    <f:book>
        <f:title lang="hu" colour="green">Book 3: The title</f:title>
        <f:price>39.95</f:price>
    </f:book>
</f:books>
</root>

The --nodePath option queries each <book> node in the <books> node in the h namespace. The --columns option displays the <title> node values from both the f and h namespaces.

@x = use Drive.Xml
--file=/sales/book.xml
--namespaces=h=http://www.w3.org/schema1,f=https://www.w3.org/schema2
--nodePath=root/h:books/h:book
--columns
BookTitle=h:title
AltBookTitle=f:title
enduse;
select * from @x

The table of data returned by the query looks like this:

Example 3: Extract data from multiple XML files matched using a regular expression

If --file is a folder in Drive, you must specify the --folderFilter option with a regular expression to nominate one or more files to read. 

In this example, data is extracted from three files book-1.xml, book-2.xml and book-3.xml, but not from book-4.xml. The --addFileName option adds an extra column to the table of results showing the source of each record.

@x = use Drive.Xml
--file=/sales
--folderFilter=book-[1-3].xml
--addFileName
enduse;
select * from @x

Example 4: Extract data from multiple XML files stored in a ZIP archive

In this example, sales.zip is stored in the root Drive folder; data is extracted from any XML file within it that just has alphabetic characters in its file name.

@x = use Drive.Xml
--file=sales.zip
--zipFilter=[a-zA-Z].xml
--addFileName
enduse;
select * from @x

Example 5: Infer data types of columns from the first 10 rows

@x = use Drive.Xml
--file=/sales/book.xml
--inferTypeRowCount=10
enduse;
select * from @x