 Tabular Files

One of EdgeSet’s distinctive features is its ability to query data in tabular (CSV, Google Sheets) files stored in various object storage systems. The data does not have to be prepared especially for EdgeSet; EdgeSet can infer appropriate tables from the files automatically.

Whenever the query engine is restarted, EdgeSet retrieves a file listing from each connected cloud storage system. It then samples data from each file to determine if it’s a supported file. If so, it infers the names and types of each column. Finally, it collects matching (partitioned) files and creates tables from them.

Column inference

Each column of a tabular file can be of a certain type, such as integers or dates. EdgeSet will automatically infer the type of a column if all sampled cells in that column match a given type. (Note: We will use the term “cell” to refer to an individual row and column of any tabular file, whether it’s a spreadsheet or not).

Supported column data types and example values
Type Examples
boolean True
f
y
NO
integer 16
-200
+47,693
3338.
0xff
decimal 99.99
-0.0000000001
765,686.81
0.14734E2
date 2021-11-27
27/11/2021
11/27/2021
time 14:49:00
3:15 PM
00:00:00.000
timestamp 2021-11-27 15:15:27.032
11/27/2021 3:15PM
timestamp w/ zone 2021-11-27T15:15:27.032Z
11/27/2021 15:15 +08
varchar anything else

Whitespace1 is stripped from the beginning and end of each cell when inferring its type and converting it to a value.

Null values

EdgeSet considers empty values to be null values, not varchar.

Headers

EdgeSet determines that a header is not present if:

  • there are at least two rows AND
  • at least one column is not varchar

When just one row is present, the inferred schema will be an empty table with column names while the raw schema will be a 1-row table with no column names.

What about the ambiguous case when all columns are varchars? EdgeSet will use the first row as a header for the inferred schema because the raw schema will already have the same schema but without a header row.

Table inference

If all tabular files in a folder are compatible, they are merged to form a single table.

Files are compatible if:

  • they have the same number of columns AND
  • they have the same headers (or no headers)

Types will be merged between columns of every compatible file.

If at least one file is unambiguous, all ambiguous files will be resolved by merging them into the unambiguous one.

Partition inference

Partitioning is the most important optimization for querying within files. By separating the data into partitions, the query engine can scan only the data it needs to answer a particular query.

EdgeSet supports the following partition formats, listed in order of priority:

Format Example
Key=Value from=USD/to=EUR/data.csv
Year/Month/Day 2021/11/17/data.csv
Year/Month 2021/11/data.csv

If EdgeSet detects enough files following the partitioning format and that have compatible schemas, it will group them into a single table and optimize queries when they contain partition values in the WHERE clause. EdgeSet will create individual tables for any files that does not fit the pattern or that have incompatible schemas.


  1. Whitespace is any Unicode space character and the following control characters: horizontal tab (\t), vertical tab (\v), line feed (\n), carriage return (\r), and form feed (\f).↩︎