Document Uploads & Ingestion

Learn how to upload documents, understand the pipeline, and manage files in collections.

Once you have a Collection, you populate it with documents. This guide covers how to upload files, walks through the async pipeline that prepares them for search, and explains how to manage existing documents.

🔄 Ingestion Pipeline Overview

Once a document is uploaded via any method, it enters an asynchronous pipeline that transforms raw bytes into searchable semantic blocks.

                  Document Upload (any method)
                            │
                            ▼
 ┌──────────────────────────────────────────────┐
 │           1. Validation & Storage            │
 │  • MIME detection & file validation          │
 │  • File hash (SHA-256) computation           │
 │  • Upload artifact to S3/MinIO bucket        │
 │  • Create SQL record (status: INDEXING)      │
 └──────────────────┬───────────────────────────┘
                    │
                    ▼
 ┌──────────────────────────────────────────────┐
 │          2. Parsing (Worker)                 │
 │  • OCR / Layout extraction                   │
 │  • Convert to Markdown                       │
 │  • Generate thumbnail (first page)           │
 └──────────────────┬───────────────────────────┘
                    │
                    ▼
 ┌──────────────────────────────────────────────┐
 │          3. Chunking (Worker)                │
 │  • Split into semantic blocks                │
 │  • Preserve hierarchy (headings, tables)     │
 └──────────────────┬───────────────────────────┘
                    │
                    ▼
 ┌──────────────────────────────────────────────┐
 │         4. Vector Embedding (Worker)         │
 │  • Generate dense vector per chunk           │
 │  • Generate sparse BM25 vector per chunk     │
 │  • Write to Qdrant index                     │
 │  • Update status → READY                     │
 └──────────────────────────────────────────────┘
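To make stage 3 more concrete, here is a minimal sketch of heading-aware chunking over Markdown, the format stage 2 produces. It is only an illustration: this guide does not describe the platform's actual chunker, and `Block` and `chunk_markdown` are hypothetical names. The sketch also ignores fenced code and tables, which a real chunker would need to handle.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Block:
    path: tuple  # heading trail, e.g. ("Guide", "Setup")
    text: str    # body text found under that heading


def chunk_markdown(markdown: str) -> list:
    """Split Markdown into one block per heading section, keeping the heading trail."""
    blocks = []
    trail = []  # (level, title) pairs for the headings currently open
    body = []   # lines collected for the current section

    def flush():
        text = "\n".join(body).strip()
        if text:
            blocks.append(Block(tuple(title for _, title in trail), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return blocks
```

Each block carries its full heading trail, so a chunk from a deeply nested section can still be traced back to its parent headings.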

Document Statuses

Status	Meaning
`indexing`	Document uploaded and initial SQL record created; the ingestion pipeline is still running
`ready`	Fully ingested — parsed, chunked, embedded, and available for search
`failed`	An error occurred during parsing, chunking, or embedding
`deleted`	Document was soft-deleted
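Because ingestion is asynchronous, a client usually polls until the document leaves `indexing`. The sketch below shows one way to do that. The `fetch_status` callable is an assumption: it stands for whatever code you write to read a document's status (for example, a wrapper around the Read Document endpoint, whose path and response shape are not covered in this guide).

```python
import time

# Statuses from the table above that mean the pipeline is no longer running.
TERMINAL = {"ready", "failed", "deleted"}


def wait_until_done(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_status() until the document leaves `indexing` or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document still {status!r} after {timeout}s")
        sleep(interval)
```

A caller would typically treat `ready` as success and surface `failed` to the user. The interval and timeout defaults here are arbitrary and should be tuned to your document sizes.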

📂 Supported File Types

The platform parses and ingests 30 file extensions, grouped by type in the table below. During upload, the system automatically validates each file using MIME type detection and structural integrity checks.

Type	Supported Formats
Documents	`.pdf`, `.docx`, `.doc`, `.xlsx`, `.xls`, `.pptx`, `.ppt`, `.odt`, `.ods`, `.odp`
Images	`.jpg`, `.jpeg`, `.png`, `.webp`, `.tiff`, `.tif`, `.bmp`, `.svg`, `.avif`, `.apng`, `.gif`, `.ico`
Text	`.txt`, `.md`, `.markdown`, `.html`, `.htm`, `.csv`
Structured	`.json`
Archives	`.zip`
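A client can avoid a rejected upload by checking the extension locally first. The following sketch encodes the table above; the `SUPPORTED` mapping and `file_category` helper are hypothetical names, and the server's own MIME and integrity checks still apply regardless of this pre-check.

```python
from pathlib import Path

# Extensions copied from the supported-formats table.
SUPPORTED = {
    "documents": {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt", ".odt", ".ods", ".odp"},
    "images": {".jpg", ".jpeg", ".png", ".webp", ".tiff", ".tif", ".bmp", ".svg", ".avif", ".apng", ".gif", ".ico"},
    "text": {".txt", ".md", ".markdown", ".html", ".htm", ".csv"},
    "structured": {".json"},
    "archives": {".zip"},
}


def file_category(filename: str):
    """Return the table category for a filename, or None if its extension is not listed."""
    extension = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED.items():
        if extension in extensions:
            return category
    return None
```

For example, `file_category("Report.PDF")` yields `"documents"`, while an unlisted extension such as `.gz` yields `None`.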

📤 Upload Methods

All uploads require the target workspace_id and collection_id in the URL path.

Method A: Multipart Form Data

Endpoint: POST /v1/w/{workspace_id}/col/{collection_id}/docs/file-multipart

The most common approach — send a file as a standard multipart/form-data request. Best for PDFs, Office docs, images, and Markdown files.

curl -X POST "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/file-multipart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf"
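The same request can be built in Python with only the standard library. This is a sketch under stated assumptions: `build_multipart_upload` is a hypothetical helper, the form field name `file` is taken from the curl example, and the response format is not covered in this guide.

```python
import mimetypes
import urllib.request
import uuid

API_BASE = "https://api.axelered.com"


def build_multipart_upload(workspace_id, collection_id, api_key, filename, data):
    """Build (but do not send) a multipart/form-data upload request for one file."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        data,  # raw file bytes
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    url = f"{API_BASE}/v1/w/{workspace_id}/col/{collection_id}/docs/file-multipart"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Pass the returned request to `urllib.request.urlopen` to send it. The helper does not escape quotes in `filename`, so sanitize untrusted names before use.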

Method B: Binary Stream

Endpoint: POST /v1/w/{workspace_id}/col/{collection_id}/docs/file-binary

Ideal for system-to-system transfers where you stream raw bytes directly. You must provide the Content-Disposition header for filename and extension detection.

curl -X POST "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/file-binary" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/pdf" \
  -H "Content-Disposition: attachment; filename=\"report.pdf\"" \
  --data-binary @/path/to/report.pdf
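For a system-to-system transfer in Python, the file can be streamed from disk rather than loaded into memory. The sketch below assumes the headers shown in the curl example; `binary_headers` and `upload_binary` are hypothetical helpers, and the explicit `Content-Length` is a choice made here so that `urllib` does not fall back to chunked transfer encoding.

```python
import os
import urllib.request

API_BASE = "https://api.axelered.com"


def binary_headers(api_key, filename, content_type, size):
    """Headers for a raw-byte upload; Content-Disposition carries the filename."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": content_type,
        "Content-Disposition": f'attachment; filename="{filename}"',
        "Content-Length": str(size),
    }


def upload_binary(workspace_id, collection_id, api_key, path, content_type):
    """Stream the file at `path` to the file-binary endpoint."""
    url = f"{API_BASE}/v1/w/{workspace_id}/col/{collection_id}/docs/file-binary"
    headers = binary_headers(api_key, os.path.basename(path), content_type, os.path.getsize(path))
    with open(path, "rb") as stream:
        request = urllib.request.Request(url, data=stream, method="POST", headers=headers)
        with urllib.request.urlopen(request) as response:
            return response.status, response.read()
```

Keeping the header construction in its own function makes the part the endpoint depends on, the filename in `Content-Disposition`, easy to check without a network call.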

Method C: Raw Text Content

Endpoint: POST /v1/w/{workspace_id}/col/{collection_id}/docs/raw

Use this method for programmatically generated content (e.g., notes, logs, or Markdown strings) that you want to index directly without a physical file.

curl -X POST "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/raw" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "memo.txt",
    "content": "This is raw text to be indexed.",
    "contentType": "text/plain"
  }'
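When the text comes from a program, building the JSON body with a serializer avoids hand-escaping quotes and newlines. This sketch reuses the three fields from the curl example (`name`, `content`, `contentType`); `build_raw_upload` is a hypothetical helper, and no other body fields are assumed.

```python
import json
import urllib.request

API_BASE = "https://api.axelered.com"


def build_raw_upload(workspace_id, collection_id, api_key, name, content, content_type="text/plain"):
    """Build (but do not send) a raw-text upload request with a JSON body."""
    payload = {"name": name, "content": content, "contentType": content_type}
    url = f"{API_BASE}/v1/w/{workspace_id}/col/{collection_id}/docs/raw"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # json.dumps escapes quotes and newlines
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Pass the returned request to `urllib.request.urlopen` to send it.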

🔗 Connectors (Crawlers)

Documents can also be imported automatically from external sources (S3, Web, Google Drive, FTP) using Connectors.

See the Connectors guide for setup instructions and source type reference.

🛠️ Document Management

Documents are the primary data units within a collection. For a detailed technical reference of every field and parameter, see the API Document Reference.

Upload a Document

The most common way to add data is a multipart upload. This triggers the asynchronous ingestion pipeline (parse → chunk → embed).

curl -X POST "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/file-multipart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf"

Read, List & Update

To manage the metadata and searchable content of existing documents, use the following endpoints:

List Documents: Retrieve all documents within a collection with optional status filtering.
Read Document: Fetch the complete block tree, metadata, and optional raw vector embeddings.
Update Document: Replace a document's content with a new file, triggering a re-ingestion cycle.

Delete a Document

Deleting a document performs a soft delete: its status becomes `deleted`, and the SQL record and vector blocks are hidden from search. The original artifact remains in storage for audit purposes.

curl -X DELETE "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/{doc_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
