Tabular Files
One of EdgeSet’s distinctive features is its ability to query data in tabular (CSV, spreadsheet, etc.) files stored in various object and file storage systems. The data does not need to be mapped manually in EdgeSet; EdgeSet can infer appropriate tables from the files automatically.
Whenever a data source is scanned, EdgeSet retrieves a file listing from it. It then samples data from each file to determine if it’s a supported file. If so, it infers the names and types of each column. Finally, it collects matching (partitioned) files and creates tables from them.
Supported storage
| Storage | Format | Partitioned |
|---|---|---|
| AWS S3 | CSV | yes |
| OSS | CSV | yes |
| Google Drive | CSV | no |
| Google Drive | Google Sheets | no |
| Google Drive | Microsoft Excel | no |
| Google Drive | OpenDocument Spreadsheet | no |
| Google Drive | Apple Numbers | no |
| SMB | Microsoft Excel | no |
| SMB | OpenDocument Spreadsheet | no |
| SMB | Apple Numbers | no |
| SMB | Microsoft Access | no |
Column inference
Each column of a tabular file can be of a certain type, such as integers or dates. EdgeSet will automatically infer the type of a column if all sampled cells1 in that column match a given type.
1 We use the term “cell” to refer to an individual row and column of any tabular file, whether it’s a spreadsheet or not.
| Type | Examples | Notes |
|---|---|---|
| boolean |
|
Both true and false values must be present or the column will be inferred as varchar. |
| integer |
|
Grouping characters supported: The same grouping character must be used for all cells. Integers are always inferred as bigints (64-bit). |
| decimal |
|
The same grouping characters are supported, and either . or , must separate the integer and fractional parts of the number. |
| double | 1e300 |
The double-precision floating point type is only inferred when at least one number in a column is outside the range representable by decimal. |
| date |
|
Day-month-year or month-day-year order are inferred automatically when possible. If all dates are ambiguous, the column will be inferred as varchar. |
| time |
|
|
| timestamp |
|
Any supported date format separated by whitespace or a T is supported. |
| timestamp with time zone |
|
|
| varchar | anything | Any column that fails inference for any reason is treated as varchar. |
Whitespace2 is stripped from the beginning and end of each cell when inferring its type and converting it to a value.
2 Whitespace is any Unicode space character and the following control characters: horizontal tab (\t), vertical tab (\v), line feed (\n), carriage return (\r), and form feed (\f).
Null values
EdgeSet considers empty cells to be null values, not varchar.
Headers
EdgeSet determines that a header is present in a table if and only if:
- the table has at least two rows AND
- at least one column is inferred to be some type other than
varchar - all cells of the first row are not null
- at least one of the cells of the first row is not a valid value of the inferred type of its column
Table inference
S3-compatible data sources support merging multiple files in a folder into a single table. This is to support large-volumn, typically append-only data storage like that used in data lakes.
When all tabular files in a folder are compatible, they are merged to form a single table.
Files are compatible if:
- they have the same number of columns AND
- they have the same headers (or no headers)
Types will be merged between columns of every compatible file.
If at least one file is unambiguous, all ambiguous files will be resolved by merging them into the unambiguous one.
Partition inference
Partitioning is the most important optimization for querying a large amount of data spread across separate files (such as in data lakes). By separating the data into partitions, the query engine can scan only the data it needs to answer a particular query.
EdgeSet supports the following partition formats, listed in order of priority:
| Format | Example |
|---|---|
| Key=Value | from=USD/to=EUR/data.csv |
| Year/Month/Day | 2021/11/17/data.csv |
| Year/Month | 2021/11/data.csv |
The partition “keys” become columns on the virtual table that EdgeSet creates. Then, if the key columns are used in the WHERE clause of a query, EdgeSet will scan just the relevant tables.3
3 You can think of the partition key as similar to an index in a relational database.
Table layout inference
For spreadsheets, EdgeSet supports additional inference of the table location. For example, given the following spreadsheet:
| ⠀ | |||||
| title | |||||
| col1 | col2 | note | |||
| first | 1 | 3 | marginalia | ||
| second | 2 | 4 |
EdgeSet would infer the table to be:
| title | ||
| col1 | col2 | |
| first | 1 | 3 |
| second | 2 | 4 |
Table inference follows deterministic rules:
- empty leading columns and rows are skipped
- rows are considered to extend to the last non-empty cell
- the table extends to the widest row