 Tabular Files

One of EdgeSet’s distinctive features is its ability to query data in tabular (CSV, Google Sheets) files stored in various object and file storage systems. The data does not have to be prepared especially for EdgeSet; EdgeSet can infer appropriate tables from the files automatically.

Whenever a data source is scanned, EdgeSet retrieves a file listing from it. It then samples data from each file to determine if it’s a supported file. If so, it infers the names and types of each column. Finally, it collects matching (partitioned) files and creates tables from them.

Column inference

Each column of a tabular file can be of a certain type, such as integers or dates. EdgeSet will automatically infer the type of a column if all sampled cells in that column match a given type. (Note: We will use the term “cell” to refer to an individual row and column of any tabular file, whether it’s a spreadsheet or not).

Supported column data types and example values
Type	Examples
boolean	`True`
	`f`
	`y`
	`NO`
integer	`16`
	`-200`
	`+47,693`
	`3338.`
	`0xff`
double	`99.99`
	`-0.0000000001`
	`765,686.81`
date	`2021-11-27`
	`27/11/2021`
	`11/27/2021`
time	`14:49:00`
	`3:15 PM`
	`00:00:00.000`
timestamp	`2021-11-27 15:15:27.032`
	`11/27/2021 3:15PM`
timestamp w/ zone	`2021-11-27T15:15:27.032Z`
	`11/27/2021 15:15 +08`
varchar	anything else

Whitespace¹ is stripped from the beginning and end of each cell when inferring its type and converting it to a value.

¹ Whitespace is any Unicode space character and the following control characters: horizontal tab (\t), vertical tab (\v), line feed (\n), carriage return (\r), and form feed (\f).

Null values

EdgeSet considers empty values to be null values, not varchar.

Headers

EdgeSet determines that a header is not present if:

there are at least two rows AND
at least one column is not varchar

When just one row is present, the inferred folder will be an empty table with column names while the raw folder will be a 1-row table with no column names.

What about the ambiguous case when all columns are varchars? EdgeSet will use the first row as a header for the inferred folder because the raw folder will already have the same schema but without a header row.

Table inference

If all tabular files in a folder are compatible, they are merged to form a single table.

Files are compatible if:

they have the same number of columns AND
they have the same headers (or no headers)

Types will be merged between columns of every compatible file.

If at least one file is unambiguous, all ambiguous files will be resolved by merging them into the unambiguous one.

Partition inference

Partitioning is the most important optimization for querying within files. By separating the data into partitions, the query engine can scan only the data it needs to answer a particular query.

EdgeSet supports the following partition formats, listed in order of priority:

Format	Example
Key=Value	from=USD/to=EUR/data.csv
Year/Month/Day	2021/11/17/data.csv
Year/Month	2021/11/data.csv

If EdgeSet detects enough files following the partitioning format and that have compatible schemas, it will group them into a single table and optimize queries when they contain partition values in the WHERE clause. EdgeSet will create individual tables for any files that does not fit the pattern or that have incompatible schemas.

Supported storage

Storage	Format	Partitioned
AWS S3	CSV	yes
Google Drive	CSV	no
Google Drive	Google Sheets	no