14 Tabular Files
One of EdgeSet’s distinctive features is its ability to query data in tabular (CSV, Google Sheets) files stored in various object storage systems. The data does not have to be prepared especially for EdgeSet; EdgeSet can infer appropriate tables from the files automatically.
Whenever the query engine is restarted, EdgeSet retrieves a file listing from each connected cloud storage system. It then samples data from each file to determine if it’s a supported file. If so, it infers the names and types of each column. Finally, it collects matching (partitioned) files and creates tables from them.
14.1 Column inference
Each column of a tabular file can be of a certain type, such as integers or dates. EdgeSet will automatically infer the type of a column if all sampled cells in that column match a given type. (Note: We will use the term “cell” to refer to an individual row and column of any tabular file, whether it’s a spreadsheet or not).
Type | Examples |
---|---|
boolean | True |
f |
|
y |
|
NO |
|
integer | 16 |
-200 |
|
+47,693 |
|
3338. |
|
0xff |
|
decimal | 99.99 |
-0.0000000001 |
|
765,686.81 |
|
0.14734E2 |
|
date | 2021-11-27 |
27/11/2021 |
|
11/27/2021 |
|
time | 14:49:00 |
3:15 PM |
|
00:00:00.000 |
|
timestamp | 2021-11-27 15:15:27.032 |
11/27/2021 3:15PM |
|
timestamp w/ zone | 2021-11-27T15:15:27.032Z |
11/27/2021 15:15 +08 |
|
varchar | anything else |
Whitespace1 is stripped from the beginning and end of each cell when inferring its type and converting it to a value.
14.1.1 Null values
EdgeSet considers empty values to be null
values, not varchar
.
14.1.2 Headers
EdgeSet determines that a header is not present if:
- there are at least two rows AND
- at least one column is not
varchar
When just one row is present, the inferred schema will be an empty table with column names while the raw schema will be a 1-row table with no column names.
What about the ambiguous case when all columns are varchar
s? EdgeSet will use the first row as a header for the inferred schema because the raw schema will already have the same schema but without a header row.
14.2 Table inference
If all tabular files in a folder are compatible, they are merged to form a single table.
Files are compatible if:
- they have the same number of columns AND
- they have the same headers (or no headers)
Types will be merged between columns of every compatible file.
If at least one file is unambiguous, all ambiguous files will be resolved by merging them into the unambiguous one.
14.2.1 Partition inference
Partitioning is the most important optimization for querying within files. By separating the data into partitions, the query engine can scan only the data it needs to answer a particular query.
EdgeSet supports the following partition formats, listed in order of priority:
Format | Example |
---|---|
Key=Value | from=USD/to=EUR/data.csv |
Year/Month/Day | 2021/11/17/data.csv |
Year/Month | 2021/11/data.csv |
If EdgeSet detects enough files following the partitioning format and that have compatible schemas, it will group them into a single table and optimize queries when they contain partition values in the WHERE
clause. EdgeSet will create individual tables for any files that does not fit the pattern or that have incompatible schemas.
Whitespace is any Unicode space character and the following control characters: horizontal tab (
\t
), vertical tab (\v
), line feed (\n
), carriage return (\r
), and form feed (\f
).↩︎