Drive.Parquet

Type	Read/write	Author	Availability
Direct provider	Read	FINBOURNE	Provided with LUSID

The Drive.Parquet provider enables you to write a Luminesce query that extracts data from one or more Apache Parquet files stored in Drive.

The query returns a table of data assembled from the contents of the file or files in the order they are read.

See also: Drive.Excel, Drive.Csv, Drive.Xml, Drive.RawText

Basic usage

@x = use Drive.Parquet
<options>
enduse;
select * from @x

Options

Drive.Parquet has options that enable you to filter or refine a query.

Note: The --file option is mandatory.

An option takes the form --<option>=<value>, for example --file=trade-file.parquet. Note that no spaces are allowed on either side of the = operator. If an option:

Takes a boolean value, then specifying that option (for example --addFileName) sets it to True; omitting the option specifies False.
Takes multiple string values, then specify a comma-separated list, for example --select=My,Column,Names.
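
As a minimal sketch of these rules, the following query combines a boolean option with a multi-value option. The file path and column names are hypothetical placeholders, not values taken from a real Drive folder.

```sql
-- Hypothetical file path and column names; substitute your own.
@x = use Drive.Parquet
--file=/trade-files/eod.parquet
--addFileName
--select=TradeId,Quantity
enduse;
select * from @x
```

Here --addFileName is set to True simply by being present, and --select receives two values separated by a comma with no surrounding spaces.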

Current options at article update time are listed in the table below. For the very latest information, run the following query using a suitable tool and examine the online help:

@x = use Drive.Parquet
enduse;
select * from @x

Current options	Explanation
`file` (`-f`)	Mandatory. The file to read. It may also be a folder, in which case --folderFilter is also required to specify which files in the folder to process. [String]
`folderFilter`	Indicates that an entire folder structure is to be searched, and provides a Regular Expression matching the path/file names within it that should be processed. All matches should be of the same format. [String]
`zipFilter` (`-z`)	Indicates that the file is a Zip file, and provides a Regular Expression matching the path/file names within it that should be processed. All matches should be of the same format. [String]
`allowMissing`	If the file or folder does not exist, returns an empty table (with column names and types created as best possible given the other options) instead of throwing an error. [Boolean]
`addFileName`	Adds a column (the first column) to the result set which contains the file the row came from. [Boolean]
`select`	Columns (by name) that should be returned, as a comma-delimited list. [String]
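
To illustrate how --file and --folderFilter work together, here is a sketch that reads every matching Parquet file in a folder. The folder name and the regular expression are assumptions made for illustration; check how the pattern is matched against path/file names in your environment before relying on it.

```sql
-- Hypothetical folder and pattern; the regular expression is illustrative only.
@x = use Drive.Parquet
--file=/trade-files
--folderFilter=eod.*parquet
--allowMissing
enduse;
select * from @x
```

Because --allowMissing is specified, the query should return an empty table rather than an error if the folder does not exist.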

Examples

In the following examples, the select * from @x syntax at the end prints the table of data assembled by the query.

Note: For more examples, try the Luminesce GitHub repo.

Example 1: Extract data from a particular Parquet file

@x = use Drive.Parquet
--file=/trade-files/eod.parquet
enduse;
select * from @x

Example 2: Extract specific columns from a Parquet file

In this example, just column3 and column7 are extracted.

@x = use Drive.Parquet
--file=/trade-files/eod.parquet
--select=column3,column7
enduse;
select * from @x

Example 3: Extract data from a particular Parquet file stored in a ZIP archive

In this example, daily.zip is a ZIP archive stored in the root Drive folder that contains one or more Parquet files. Data is extracted from the archived Parquet file whose name matches the --zipFilter option.

@x = use Drive.Parquet
--file=daily.zip
--zipFilter=eod.parquet
enduse;
select * from @x
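
Since --zipFilter takes a regular expression, the same approach might be extended to read several archived files at once. The following is a sketch only: the pattern is an assumption, and all matched files would need to share the same format.

```sql
-- Hypothetical pattern intended to match more than one archived Parquet file.
@x = use Drive.Parquet
--file=daily.zip
--zipFilter=.*parquet
--addFileName
enduse;
select * from @x
```

Adding --addFileName makes it possible to tell which archived file each row came from.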