Document Uploads & Ingestion
Learn how to upload documents, understand the pipeline, and manage files in collections.
Once you have a Collection, you populate it with documents. This guide covers how to upload files, walks through the async pipeline that prepares them for search, and explains how to manage existing documents.
🔄 Ingestion Pipeline Overview
Once a document is uploaded via any method, it enters an asynchronous pipeline that transforms raw bytes into searchable semantic blocks.
Document Upload (any method)
                   │
                   ▼
┌──────────────────────────────────────────────┐
│  1. Validation & Storage                     │
│     • MIME detection & file validation       │
│     • File hash (SHA-256) computation        │
│     • Upload artifact to S3/MinIO bucket     │
│     • Create SQL record (status: INDEXING)   │
└──────────────────┬───────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────┐
│  2. Parsing (Worker)                         │
│     • OCR / Layout extraction                │
│     • Convert to Markdown                    │
│     • Generate thumbnail (first page)        │
└──────────────────┬───────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────┐
│  3. Chunking (Worker)                        │
│     • Split into semantic blocks             │
│     • Preserve hierarchy (headings, tables)  │
└──────────────────┬───────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────┐
│  4. Vector Embedding (Worker)                │
│     • Generate dense vector per chunk        │
│     • Generate sparse BM25 vector per chunk  │
│     • Write to Qdrant index                  │
│     • Update status → READY                  │
└──────────────────────────────────────────────┘
Document Statuses
| Status | Meaning |
|---|---|
| indexing | Document uploaded, initial SQL record created |
| ready | Fully ingested: parsed, chunked, embedded, and available for search |
| failed | An error occurred during parsing, chunking, or embedding |
| deleted | Document was soft-deleted |
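Because ingestion is asynchronous, a client usually has to wait for a document to reach ready (or failed) before searching it. The Python sketch below shows one way to poll. It assumes the status strings from the table above; `fetch_status` is a hypothetical callable you supply (for example, a wrapper around the Read Document endpoint, whose path this guide does not give), and the timeout and interval defaults are illustrative, not platform recommendations.

```python
import time
from typing import Callable

# Statuses after which polling should stop (taken from the table above).
TERMINAL_STATUSES = {"ready", "failed", "deleted"}


def wait_for_ingestion(
    fetch_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the document reaches a terminal status or the timeout expires.

    fetch_status is a hypothetical callable returning the current status string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Lowercase the value so INDEXING and indexing are treated alike.
        status = fetch_status().lower()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document still '{status}' after {timeout_s}s")
        sleep(interval_s)
```

A caller would typically treat a returned failed as a signal to inspect the document and re-upload, and only issue searches once the function returns ready.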
📂 Supported File Types
The platform parses and ingests over 30 file formats. During upload, it automatically validates each file using MIME type detection and structural integrity checks.
| Type | Supported Formats |
|---|---|
| Documents | .pdf, .docx, .doc, .xlsx, .xls, .pptx, .ppt, .odt, .ods, .odp |
| Images | .jpg, .jpeg, .png, .webp, .tiff, .tif, .bmp, .svg, .avif, .apng, .gif, .ico |
| Text | .txt, .md, .markdown, .html, .htm, .csv |
| Structured | .json |
| Archives | .zip |
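A client can avoid a wasted round trip by checking the file extension before uploading. The Python sketch below copies the extensions from the table above; the function name `category_for` is hypothetical, and the check is only a local pre-filter, since the server's MIME detection remains the authority on what is accepted.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the table above, grouped by category.
SUPPORTED_EXTENSIONS = {
    "Documents": {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt",
                  ".odt", ".ods", ".odp"},
    "Images": {".jpg", ".jpeg", ".png", ".webp", ".tiff", ".tif", ".bmp",
               ".svg", ".avif", ".apng", ".gif", ".ico"},
    "Text": {".txt", ".md", ".markdown", ".html", ".htm", ".csv"},
    "Structured": {".json"},
    "Archives": {".zip"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if unsupported."""
    # Only the final suffix is considered, compared case-insensitively.
    suffix = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

Files for which this returns None can be skipped or reported locally instead of being sent to the API.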
📤 Upload Methods
All uploads require the target workspace_id and collection_id in the URL path.
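As a small illustration of the path structure, the following Python sketch builds the upload URL for each method. The host is taken from the curl examples in this guide; the helper name `upload_url` and the method keys are hypothetical conveniences, not part of the API.

```python
from urllib.parse import quote

BASE_URL = "https://api.axelered.com"  # host used by the curl examples in this guide

# Final path segment for each upload method described in this guide.
UPLOAD_SUFFIXES = {
    "multipart": "file-multipart",
    "binary": "file-binary",
    "raw": "raw",
}


def upload_url(workspace_id: str, collection_id: str, method: str) -> str:
    """Build the upload endpoint URL for one of the three upload methods."""
    if method not in UPLOAD_SUFFIXES:
        raise ValueError(f"unknown upload method: {method!r}")
    # Percent-encode the IDs so unexpected characters cannot alter the path.
    ws = quote(workspace_id, safe="")
    col = quote(collection_id, safe="")
    return f"{BASE_URL}/v1/w/{ws}/col/{col}/docs/{UPLOAD_SUFFIXES[method]}"
```

For example, with the made-up IDs `ws_123` and `col_456`, `upload_url("ws_123", "col_456", "raw")` yields `https://api.axelered.com/v1/w/ws_123/col/col_456/docs/raw`.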
Method A: Multipart Form Data
Endpoint: POST /v1/w/{workspace_id}/col/{collection_id}/docs/file-multipart
The most common approach: send the file in a standard multipart/form-data request. Best for PDFs, Office docs, images, and Markdown files.
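For clients without a multipart library, the request body can be assembled by hand. This Python sketch uses only the standard library; the form field name `file` comes from the curl example in this guide, while the function name and the fallback content type are assumptions made for illustration. It builds the request without sending it.

```python
import mimetypes
import urllib.request
import uuid


def build_multipart_request(
    url: str, api_key: str, filename: str, data: bytes
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    # Guess the part's content type from the extension; fall back to a
    # generic binary type (an assumption, the server detects MIME itself).
    part_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    # Note: filenames containing double quotes would need escaping.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {part_type}\r\n\r\n".encode(),
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the upload. The same request with curl: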
curl -X POST "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/file-multipart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf"
Method B: Binary Stream
Endpoint: POST /v1/w/{workspace_id}/col/{collection_id}/docs/file-binary
Ideal for system-to-system transfers where you stream raw bytes directly. You must provide the Content-Disposition header for filename and extension detection.
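In Python, the headers for a binary upload might be set as in the sketch below, which uses only the standard library and builds the request without sending it. The function name is hypothetical, and the payload is passed as in-memory bytes for simplicity rather than streamed from disk.

```python
import urllib.request


def build_binary_request(
    url: str, api_key: str, filename: str, data: bytes, content_type: str
) -> urllib.request.Request:
    """Build (but do not send) a raw-bytes upload request."""
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
            # Required by this endpoint: carries the filename and extension.
            "Content-Disposition": f'attachment; filename="{filename}"',
        },
    )
```

The same request with curl: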
curl -X POST "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/file-binary" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/pdf" \
  -H "Content-Disposition: attachment; filename=\"report.pdf\"" \
  --data-binary @/path/to/report.pdf
Method C: Raw Text Content
Endpoint: POST /v1/w/{workspace_id}/col/{collection_id}/docs/raw
Use this method for programmatically generated content (e.g., notes, logs, or Markdown strings) that you want to index directly, without a physical file.
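When the text comes from a program, serializing the payload with a JSON library avoids manual escaping of quotes and newlines. The Python sketch below builds the request without sending it; the field names `name`, `content`, and `contentType` are taken from the curl example in this guide, while the function name and the `text/plain` default are illustrative assumptions.

```python
import json
import urllib.request


def build_raw_request(
    url: str, api_key: str, name: str, content: str,
    content_type: str = "text/plain",
) -> urllib.request.Request:
    """Build (but do not send) a raw-text upload request."""
    payload = {"name": name, "content": content, "contentType": content_type}
    return urllib.request.Request(
        url,
        # json.dumps escapes quotes, newlines, and non-ASCII characters.
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

The same request with curl: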
curl -X POST "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/raw" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "memo.txt",
    "content": "This is raw text to be indexed.",
    "contentType": "text/plain"
  }'
🔗 Connectors (Crawlers)
Documents can also be imported automatically from external sources (S3, Web, Google Drive, FTP) using Connectors.
See the Connectors guide for setup instructions and source type reference.
🛠️ Document Management
Documents are the primary data units within a collection. For a detailed technical reference of every field and parameter, see the API Document Reference.
Upload a Document
The most common way to add data is a multipart upload. This triggers the asynchronous ingestion pipeline (parse → chunk → embed).
curl -X POST "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/file-multipart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf"
Read, List & Update
To manage the metadata and searchable content of existing documents, use the following endpoints:
- List Documents: Retrieve all documents within a collection with optional status filtering.
- Read Document: Fetch the complete block tree, metadata, and optional raw vector embeddings.
- Update Document: Replace a document's content with a new file, triggering a re-ingestion cycle.
Delete a Document
Deleting a document performs a soft delete. The SQL record and vector blocks are hidden from search, but the original artifact remains in storage for audit trails.
curl -X DELETE "https://api.axelered.com/v1/w/{workspace_id}/col/{collection_id}/docs/{doc_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
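The same call can be prepared from Python with the standard library. This is a minimal sketch that builds the request without sending it; the function name is hypothetical, and the path mirrors the curl command above.

```python
import urllib.request
from urllib.parse import quote


def build_delete_request(
    workspace_id: str, collection_id: str, doc_id: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a soft-delete request for one document."""
    # Percent-encode each ID so it stays a single path segment.
    ws, col, doc = (quote(part, safe="") for part in (workspace_id, collection_id, doc_id))
    url = f"https://api.axelered.com/v1/w/{ws}/col/{col}/docs/{doc}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Passing the result to `urllib.request.urlopen` would issue the delete.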