Uploading Data¶
Introduction¶
Blind Insight supports a number of different ways to upload data. The most common, covered in the Using the Blind Proxy section, is the CLI command blind record create
for ingesting individual records. We'll go over the rest of them in detail here.
Supported File Types¶
The bulk upload features of the Blind Proxy support the following file types:
- JSON using the same documented format for creating records
- CSV that includes a header row that maps to field names on the Schema to which you're uploading
JSON Support¶
When using the CLI or API, each JSON record must have a schema
field that provides the URL (e.g. /api/schemas/{schema_id}/
) of the schema to which the record must conform.
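For example, a single record might look like the following; the schema ID here is illustrative, and multiple records are simply uploaded as a JSON array of such objects:
{
    "data": {"name": "Bob", "age": 42},
    "schema": "/api/schemas/T9ebhvLhBGvHwGQQ6C4oZZ/"
}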
Tip
When using the web UI to upload, the schema ID value is provided automatically and may be omitted in the record objects.
CSV Support¶
Warning
CSV files must include a header row so that Blind Insight can map each column to a field name on the Schema for the records; omitting the header row is an error.
Please make sure that the data type for each column matches that of your schema. For example, if you have a schema with a field named age
that is of type integer
, then the data in the CSV for that column should be an integer and not a string. The CSV parser will raise a validation error if the data type does not match.
Tip
When using the web UI to upload, the schema ID value is provided automatically and may be omitted from the CSV header row.
When using the CLI or API, the schema cannot be automatically detected from the CSV file. Therefore, the CSV header row must include a "magic" header as its last entry, formatted as schema_id:{schema_id}
, for example:
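name,age,schema_id:T9ebhvLhBGvHwGQQ6C4oZZ
Bob,42
Alice,29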
During parsing this extra header is stripped out of the CSV data before processing the records.
Sample Datasets¶
We have a number of sample datasets that you can use to test the upload features of the Blind Proxy. You can find them in the Demo Datasets repository.
To utilize these sample datasets, start by cloning the repository.
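For example, using Git (substitute the actual repository URL from the Demo Datasets link above):
git clone {demo_datasets_repository_url}
cd demo-datasets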
Once you have cloned the repository, you can navigate to the datasets
directory and choose one of the sample datasets to upload. The repository is organized as follows:
- datasets/ is the top-level directory containing all datasets.
- Each subdirectory therein represents a dataset name.
- Each individual dataset contains data and schemas directories.
datasets
├── medical
│ ├── data
│ │ ├── condition_data.json
│ │ ├── medication_data.json
│ │ ├── patient_data.json
│ │ ├── procedure_data.json
│ │ └── wearable_data.json
│ └── schemas
│ ├── condition.json
│ ├── medication.json
│ ├── patient.json
│ ├── procedure.json
│ └── wearable.json
Each dataset has a corresponding schema that you can use to upload the data; you can find it in the schemas
directory. Each schema has a corresponding JSON data file in the data
directory.
Please see the README.md
file in the demo-datasets
repository for more information on the sample datasets.
Batch Uploads¶
The direct method using blind record create
has a limit of 100 records, since it uses the raw underlying API endpoint for encrypted record creation and does not support bulk or batch uploads. For anything larger, use a batch upload job; internally these are referred to as "Jobs".
To create a batch upload job, you can use:
- The Blind Proxy REST API endpoint /api/jobs/upload/
- The Blind Proxy CLI command blind jobs upload
- The Web UI (behind the Blind Proxy)
We will cover each of these in detail below.
API¶
Important
All API calls are authenticated using Basic authentication at this time. JWT tokens are supported but not yet documented. You must always provide a valid Authorization
header with your API calls.
The Blind Proxy REST API endpoint /api/jobs/upload/
is used to create a batch upload job:
POST /api/jobs/upload/ HTTP/2
Content-Type: application/json
[
{
"data": {"name": "Bob", "age": 42},
"schema": "/api/schemas/T9ebhvLhBGvHwGQQ6C4oZZ/"
},
{
"data": {"name": "Alice", "age": 29},
"schema": "/api/schemas/T9ebhvLhBGvHwGQQ6C4oZZ/"
}
]
CSV content is supported by providing a Content-Type
header of text/csv
, with the CSV as the request body:
POST /api/jobs/upload/ HTTP/2
Content-Type: text/csv
name,age,schema_id:T9ebhvLhBGvHwGQQ6C4oZZ
Bob,42
Alice,29
Warning
Do not forget the "magic" header in the CSV file when using the API directly. Please see the CSV Support section above for more information.
Upon successful creation of a job, a random UUID is assigned as the job ID and a 201 Created
response is returned with the following headers and body:
HTTP/2 201 Created
X-Job-ID: 7b8e9f2d-4a3c-4e5f-8d1a-6b9c0d3e2f1a
X-Job-Status-URL: /api/jobs/7b8e9f2d-4a3c-4e5f-8d1a-6b9c0d3e2f1a/
X-Job-Websocket-URL: /api/ws/jobs/7b8e9f2d-4a3c-4e5f-8d1a-6b9c0d3e2f1a/
{
"job_id": "7b8e9f2d-4a3c-4e5f-8d1a-6b9c0d3e2f1a"
}
- X-Job-ID will be the ID of the job that was created.
- X-Job-Status-URL will be the URL to the job status endpoint.
- X-Job-Websocket-URL will be the URL to the job websocket endpoint.
The job status endpoint can be polled to check the status of the job:
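For example, using the job ID returned above:
GET /api/jobs/7b8e9f2d-4a3c-4e5f-8d1a-6b9c0d3e2f1a/ HTTP/2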
The job websocket endpoint can be used to receive real-time updates on the status of the job:
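For example, with a generic WebSocket client such as wscat (the host is a placeholder; any WebSocket client will work):
wscat -c wss://{proxy_host}/api/ws/jobs/7b8e9f2d-4a3c-4e5f-8d1a-6b9c0d3e2f1a/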
The job status and websocket endpoints will return messages to the client with the following payloads:
{
"status": "processing",
"processed": 42,
"total": 100,
"current_batch": 5,
"total_batches": 10,
"error": ""
}
Note
When there is no error, the error
field will not be present in the response body.
Valid status values are:
- uploading - The job is currently uploading.
- processing - The job is currently processing.
- completed - The job has completed successfully.
- failed - The job has failed.
CLI¶
Important
The CLI requires that you use blind login
to authenticate before using the blind jobs upload
command. Please see the Using the Blind Proxy section on Logging into the API for more information.
The CLI command blind jobs upload
is used to create a batch upload job:
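A minimal invocation might look like the following; the file name is illustrative, and the exact flags may vary by version:
blind jobs upload --data patient_data.json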
CSV files are also supported using the --data
flag:
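For example, assuming a CSV file that includes the schema_id:{schema_id} magic header described above:
blind jobs upload --data records.csv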
The batch size can be controlled by the --X-Batch-Size
flag, which defaults to 10 and is the maximum number of records that will be uploaded in a single batch.
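For example, to upload in batches of 100 records (the file name is illustrative):
blind jobs upload --data patient_data.json --X-Batch-Size 100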
Warning
The CLI will display the job ID but it will not display upload progress. You will have to poll the job status or use the websocket endpoint to check the status of the job.
Web UI¶
Important
This assumes that you're already logged into the web UI and have a valid session cookie.
Warning
There is a file limit of approximately 2GB for uploads via the web UI. This limitation varies by browser and device. For uploads of larger files, please use the CLI, API, or see the Very Large Uploads section below.
The web UI will automatically create a batch upload job when a CSV or JSON file is uploaded. To get started, navigate to the list of records for a given schema.
When you have a newly-created schema, you will see the following empty state with an upload area and a disabled Upload button; dragging and dropping a CSV or JSON file onto the upload area enables the button:
Let's upload the same JSON file containing six (6) records from the Using the Blind Proxy section to our new schema. Drag and drop the file onto the upload area and observe that the filename is automatically detected and the Upload button activates:
Depending on how many records you uploaded, you may have seen a progress bar. Now that you've got some records, you can click the Upload Records button to start another upload:
Let's upload another batch with a few more records (1,000 to be exact) to our new schema by dragging and dropping a file onto the upload area and clicking the Upload button, and check out that progress bar:
When the upload is complete, you will see the records in the list of records for the schema.
Uploading CSV files is just as easy as uploading JSON files. Try it out for yourself!
From here you will be able to search uploaded records as illustrated in the Using the Blind Proxy section on Encrypted Search. Are we having fun yet? And you thought you were just going to upload some records!
Success
You did it!
Very Large Uploads¶
For very large uploads (>2GB), we support the TUS protocol, a resumable file upload protocol.
We'll cover this in more detail later. For now, know that you can point any TUS-compatible client at /api/files/
on a running Blind Proxy instance to upload files.
The same file formats supported by batch uploads (CSV, JSON) are supported here.
A sample request flow would go something like this:
Creating an Upload¶
Initial Request
POST /api/files/ HTTP/2
Tus-Resumable: 1.0.0
Upload-Length: {bytes}
Upload-Metadata: filename {base64_filename}, filetype {base64_filetype}
Initial Response
HTTP/2 201 Created
Location: {server_url}/api/files/{file_id}
X-Job-Id: {job_id}
X-Job-Status-Url: {server_url}/api/jobs/{job_id}/
X-Job-Websocket-Url: {server_url}/api/ws/jobs/{job_id}/
Sending File Data¶
Extracting the Location
header from the Initial Response, you would then send a PATCH
request to that URL.
Request
PATCH /api/files/:file_id/ HTTP/2
Tus-Resumable: 1.0.0
Content-Type: application/offset+octet-stream
Upload-Offset: {current_offset}  # 0 for a new file, or wherever the previous chunk left off
{file binary contents}
Response
HTTP/2 204 No Content
Tus-Resumable: 1.0.0
Upload-Offset: {new_offset} # Server response with bytes received in first PATCH
In keeping with TUS, you would then send a new PATCH
request for each remaining chunk of the file, using the Upload-Offset
value returned in each prior response as the Upload-Offset header of the subsequent request.
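As a concrete sketch of this flow using curl (the host, credentials, and file name are placeholders, and the Upload-Metadata values below are the base64 encodings of "patient_data.json" and "application/json"):
# Create the upload
curl -i -X POST https://{proxy_host}/api/files/ \
  -u "{username}:{password}" \
  -H "Tus-Resumable: 1.0.0" \
  -H "Upload-Length: $(wc -c < patient_data.json)" \
  -H "Upload-Metadata: filename cGF0aWVudF9kYXRhLmpzb24=, filetype YXBwbGljYXRpb24vanNvbg=="

# Send the file data to the Location URL returned above
# (a single chunk is shown; for multiple chunks, repeat with the updated Upload-Offset)
curl -i -X PATCH https://{proxy_host}/api/files/{file_id}/ \
  -u "{username}:{password}" \
  -H "Tus-Resumable: 1.0.0" \
  -H "Content-Type: application/offset+octet-stream" \
  -H "Upload-Offset: 0" \
  --data-binary @patient_data.json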