JSON Files

Option 1. Direct reading

Reading JSON file directly from the location without it copying it. It automatically detects the schema (column names), and flatten it

SELECT  country_name, region, `SP.URB.TOTL`, `SP.URB.TOTL.IN.ZS`
FROM json.`s3a://iomete-lakehouse-shared/superset_examples/countries.json` 
LIMIT 3;

+---------------+-------------+--------------+--------------------+
| country_name  |   region    | SP.URB.TOTL  | SP.URB.TOTL.IN.ZS  |
+---------------+-------------+--------------+--------------------+
| Afghanistan   | South Asia  | 739462.0     | 8.221              |
| Afghanistan   | South Asia  | 779754.0     | 8.508              |
| Afghanistan   | South Asia  | 822719.0     | 8.805              |
+---------------+-------------+--------------+--------------------+

Option 2. Reference table with options

To get more control reading raw JSON files you can use the following syntax and provide different options. It doesn't copy the data, it just references it.

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON https://jsonlines.org/. For a regular multi-line JSON file, set the multiLine option to true

CREATE TABLE countries
USING org.apache.spark.sql.json
OPTIONS (
  path "s3a://iomete-lakehouse-shared/superset_examples/countries.json"
);

  
DESC countries;

+-----------------------+------------+----------+
|       col_name        | data_type  | comment  |
+-----------------------+------------+----------+
| NY.GNP.PCAP.CD        | double     | NULL     |
| SE.ADT.1524.LT.FM.ZS  | double     | NULL     |
| SE.ADT.1524.LT.MA.ZS  | double     | NULL     |
| ...                   | ...        | ...      |
| SP.URB.TOTL.IN.ZS     | double     | NULL     |
| SP.UWT.TFRT           | double     | NULL     |
| country_code          | string     | NULL     |
| country_name          | string     | NULL     |
| region                | string     | NULL     |
| year                  | string     | NULL     |
+-----------------------+------------+----------+
328 rows selected

Options

JSON Lines (newline-delimited JSON) are supported by default. For JSON (one record per file), set the multiLine option to true.

This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.

You can set the following JSON-specific options to deal with non-standard JSON files:

  • primitivesAsString (default false): infers all primitive values as a string type
  • prefersDecimal (default false): infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.
  • allowComments (default false): ignores Java/C++ style comment in JSON records
  • allowUnquotedFieldNames (default false): allows unquoted JSON field names
  • allowSingleQuotes (default true): allows single quotes in addition to double quotes
  • allowNumericLeadingZeros (default false): allows leading zeros in numbers (e.g. 00012)
  • allowBackslashEscapingAnyCharacter (default false): allows accepting quoting of all character using backslash quoting mechanism
  • allowUnquotedControlChars (default false): allows JSON Strings to contain unquoted control characters (ASCII characters with a value less than 32, including tab and line feed characters) or not.
  • mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
    • PERMISSIVE : when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema.
    • DROPMALFORMED : ignores the whole corrupted records.
    • FAILFAST : throws an exception when it meets corrupted records.
  • columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
  • dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at Datetime Patterns. This applies to date type.
  • timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]): sets the string that indicates a timestamp format. Custom date formats follow the formats at Datetime Patterns. This applies to the timestamp type.
  • multiLine (default false): parse one record, which may span multiple lines, per file
  • encoding (by default it is not set): allows to forcibly set one of standard basic or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. If the encoding is not specified and multiLine is set to true, it will be detected automatically.
  • lineSep(default covers all \r, \r\n and \n): defines the line separator that should be used for parsing.
  • samplingRatio (default is 1.0): defines the fraction of input JSON objects used for schema inferring.
  • dropFieldIfAllNull (default false): whether to ignore column of all null values or empty array/struct during schema inference.
  • locale (default is en-US): sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.
  • pathGlobFilter: an optional glob pattern to only include files with paths matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.
  • modifiedBefore (batch only): an optional timestamp to only include files with modification times occurring before the specified Time. The provided timestamp must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
  • modifiedAfter (batch only): an optional timestamp to only include files with modification times occurring after the specified Time. The provided timestamp must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
  • recursiveFileLookup: recursively scan a directory for files. Using this option disables partition discovery
  • allowNonNumericNumbers (default true): allows JSON parser to recognize set of "Not-a-Number" (NaN) tokens as legal floating number values:
    • +INF for positive infinity, as well as alias of +Infinity and Infinity.
    • -INF for negative infinity), alias -Infinity.
    • NaN for other not-a-numbers, like result of division by zero

Did this page help you?