 Tabular Files

One of EdgeSet’s distinctive features is its ability to query data in tabular (CSV, spreadsheet, etc.) files stored in various object and file storage systems. The data does not need to be mapped manually in EdgeSet; EdgeSet can infer appropriate tables from the files automatically.

Whenever a data source is scanned, EdgeSet retrieves a file listing from it. It then samples data from each file to determine if it’s a supported file. If so, it infers the names and types of each column. Finally, it collects matching (partitioned) files and creates tables from them.

Supported storage

Storage Format Partitioned
AWS S3 CSV yes
OSS CSV yes
Google Drive CSV no
Google Drive Google Sheets no
Google Drive Microsoft Excel no
Google Drive OpenDocument Spreadsheet no
Google Drive Apple Numbers no
SMB Microsoft Excel no
SMB OpenDocument Spreadsheet no
SMB Apple Numbers no
SMB Microsoft Access no

Column inference

Each column of a tabular file can be of a certain type, such as integers or dates. EdgeSet will automatically infer the type of a column if all sampled cells1 in that column match a given type.

1 We use the term “cell” to refer to an individual row and column of any tabular file, whether it’s a spreadsheet or not.

Supported column data types and example values
Type Examples Notes
boolean

True false

y NO

Both true and false values must be present or the column will be inferred as varchar.
integer

16

-200000

(47,693)

0xff

Grouping characters supported: ,, ., ', _, space, and thin space.

The same grouping character must be used for all cells.

Integers are always inferred as bigints (64-bit).

decimal

1.25

0,001

100,000.00

1e10

The same grouping characters are supported, and either . or , must separate the integer and fractional parts of the number.
double 1e300 The double-precision floating point type is only inferred when at least one number in a column is outside the range representable by decimal.
date

2021-11-27

27/11/2021

11/27/2021

November 27th, 2021

Day-month-year or month-day-year order are inferred automatically when possible. If all dates are ambiguous, the column will be inferred as varchar.
time

14:49:00

3:15 PM

00:00:00.000

timestamp

2021-11-27 15:15:27.032

11/27/2021 3:15PM

Any supported date format separated by whitespace or a T is supported.
timestamp with time zone

2021-11-27T15:15:27.032Z

11/27/2021 15:15 +08

varchar anything Any column that fails inference for any reason is treated as varchar.

Whitespace2 is stripped from the beginning and end of each cell when inferring its type and converting it to a value.

2 Whitespace is any Unicode space character and the following control characters: horizontal tab (\t), vertical tab (\v), line feed (\n), carriage return (\r), and form feed (\f).

Null values

EdgeSet considers empty cells to be null values, not varchar.

Headers

EdgeSet determines that a header is present in a table if and only if:

  • the table has at least two rows AND
  • at least one column is inferred to be some type other than varchar
  • all cells of the first row are not null
  • at least one of the cells of the first row is not a valid value of the inferred type of its column

Table inference

S3-compatible data sources support merging multiple files in a folder into a single table. This is to support large-volumn, typically append-only data storage like that used in data lakes.

When all tabular files in a folder are compatible, they are merged to form a single table.

Files are compatible if:

  • they have the same number of columns AND
  • they have the same headers (or no headers)

Types will be merged between columns of every compatible file.

If at least one file is unambiguous, all ambiguous files will be resolved by merging them into the unambiguous one.

Partition inference

Partitioning is the most important optimization for querying a large amount of data spread across separate files (such as in data lakes). By separating the data into partitions, the query engine can scan only the data it needs to answer a particular query.

EdgeSet supports the following partition formats, listed in order of priority:

Format Example
Key=Value from=USD/to=EUR/data.csv
Year/Month/Day 2021/11/17/data.csv
Year/Month 2021/11/data.csv

The partition “keys” become columns on the virtual table that EdgeSet creates. Then, if the key columns are used in the WHERE clause of a query, EdgeSet will scan just the relevant tables.3

3 You can think of the partition key as similar to an index in a relational database.

Table layout inference

For spreadsheets, EdgeSet supports additional inference of the table location. For example, given the following spreadsheet:

title
col1 col2 note
first 1 3 marginalia
second 2 4

EdgeSet would infer the table to be:

title
col1 col2
first 1 3
second 2 4

Table inference follows deterministic rules:

  • empty leading columns and rows are skipped
  • rows are considered to extend to the last non-empty cell
  • the table extends to the widest row