File Streaming

Stream files into Apache Iceberg tables continuously.

File formats

The following file formats have been tested:

  • CSV

Job creation

  • Go to Spark Jobs.
  • Click on Create New.

Specify the following parameters (these are examples, you can change them based on your preference):

  • Name: file-streaming-job
  • Docker Image: iomete/iomete_file_streaming_job:0.2.0
  • Main application file: local:///app/
  • Environment Variables: LOG_LEVEL: INFO or ERROR
  • Config file:
  file: {
    format: csv,
    path: "files/",
    max_files_per_trigger: 1,
    latest_first: false,
    max_file_age: "7d"
  }
  database: {
    schema: default,
    table: awesome_csv_addresses
  }
  processing_time: {
    interval: 5,
    unit: seconds # or minutes
  }

Configuration properties

Required properties for connecting to and configuring the file source:

  • format: The format of the source files (for example, csv).
  • path: The path to the source file directory.
  • max_files_per_trigger: The maximum number of new files to process per trigger.
  • latest_first: Whether to process the latest new files first; useful when there is a large backlog of files.
  • max_file_age: The maximum age of files to be processed.
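These property names line up closely with Spark Structured Streaming's built-in file-source options. As an illustration (the mapping below is an assumption about how the job wires the config into Spark, not taken from its source), the config block could be translated like this:

```python
# Illustrative mapping from the job's `file` config block to Spark's
# standard file-source option names. maxFilesPerTrigger, latestFirst,
# and maxFileAge are documented Spark Structured Streaming options.
file_config = {
    "format": "csv",
    "path": "files/",
    "max_files_per_trigger": 1,
    "latest_first": False,
    "max_file_age": "7d",
}

spark_options = {
    "maxFilesPerTrigger": file_config["max_files_per_trigger"],
    "latestFirst": file_config["latest_first"],
    "maxFileAge": file_config["max_file_age"],
}
print(spark_options)
```

With PySpark, options shaped like this would typically be passed as spark.readStream.format(file_config["format"]).options(**spark_options).load(file_config["path"]).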

Destination database properties.

  • schema: The schema (database) to store the data in.
  • table: The destination table name.

Processing trigger settings that control how often incoming data is persisted to Iceberg:

  • interval: The processing trigger interval.
  • unit: The processing trigger unit: seconds or minutes.
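An interval/unit pair like the one above maps naturally onto Spark's processing-time trigger string. A minimal sketch (the helper name is illustrative, not from the job's code):

```python
def to_trigger_string(processing_time: dict) -> str:
    """Build a Spark processing-time trigger string, e.g. "5 seconds"."""
    return f"{processing_time['interval']} {processing_time['unit']}"

trigger = to_trigger_string({"interval": 5, "unit": "seconds"})
print(trigger)  # 5 seconds
```

In PySpark, such a value would typically be passed as writeStream.trigger(processingTime=trigger).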

Create Spark Job - Deployment


Deployment preferences.

Create Spark Job - Instance

You can use Environment Variables to store sensitive data such as passwords and secrets. You can then reference these variables in your config file using the ${ENV_NAME} syntax.
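A ${ENV_NAME} substitution can be implemented as a simple regex pass over the config text. The sketch below is an assumption about how such substitution might work, not the job's actual implementation; the function name is hypothetical:

```python
import os
import re

def substitute_env(text: str) -> str:
    """Replace ${NAME} placeholders with values from the environment.
    Placeholders for unset variables are left untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        text,
    )

os.environ["DB_PASSWORD"] = "s3cret"
print(substitute_env('password: "${DB_PASSWORD}"'))  # password: "s3cret"
```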


Instance and environment variable parameters.

Create Spark Job - Application Config


Job config.


Prepare the dev environment

virtualenv .env #or python3 -m venv .env
source .env/bin/activate

pip install -e ."[dev]"

Run test

python3 -m pytest # or just pytest
