
Document Ingestion

Document processing pipeline — PDF, DOCX, HTML, Markdown parsing with intelligent chunking, table/image extraction, and OCR.

The ingest package turns raw documents into clean, chunked, RAG-ready content — multi-format parsing, configurable chunking strategies, and flexible processing pipelines.

Installation

bash
pnpm add @lov3kaizen/agentsea-ingest

Quick Start

typescript
import { createIngester, pipelines } from '@lov3kaizen/agentsea-ingest';

// Simple ingestion
const ingester = createIngester();
const doc = await ingester.ingestFile('./document.pdf');
console.log(`Extracted ${doc.chunks.length} chunks`);

// RAG-optimized pipeline
const pipeline = pipelines.rag().build();
const result = await pipeline.process({ path: './document.md' });

Parsing Documents

The ingest package ships parsers for eight document formats: PDF, DOCX, HTML, Markdown, plain text, CSV, Excel, and JSON. Each parser produces extracted text, structured elements, and tables.

Supported Formats

| Format | Parser | Factory |
| --- | --- | --- |
| PDF | PDFParser | createPDFParser |
| DOCX | DOCXParser | createDOCXParser |
| HTML | HTMLParser | createHTMLParser |
| Markdown | MarkdownParser | createMarkdownParser |
| Text | TextParser | createTextParser |
| CSV | CSVParser | createCSVParser |
| Excel | ExcelParser | createExcelParser |
| JSON | JSONParser | createJSONParser |
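
When a folder contains mixed files, it can help to resolve which factory from the table above applies before parsing. The following is a minimal sketch under stated assumptions: the extension list and the `factoryNameFor` helper are illustrative and not part of the package; only the factory names come from the table.

```typescript
// Hypothetical lookup: file extension -> factory name from the table above.
// The extensions chosen here are an assumption, not something the package defines.
const factoryByExtension = new Map<string, string>([
  ['pdf', 'createPDFParser'],
  ['docx', 'createDOCXParser'],
  ['html', 'createHTMLParser'],
  ['htm', 'createHTMLParser'],
  ['md', 'createMarkdownParser'],
  ['txt', 'createTextParser'],
  ['csv', 'createCSVParser'],
  ['xlsx', 'createExcelParser'],
  ['json', 'createJSONParser'],
]);

function factoryNameFor(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf('.');
  if (dot < 0) return undefined; // no extension to go on
  const extension = fileName.slice(dot + 1).toLowerCase();
  return factoryByExtension.get(extension);
}

console.log(factoryNameFor('Report.PDF')); // createPDFParser
console.log(factoryNameFor('archive.zip')); // undefined
```

Returning `undefined` for unknown extensions leaves the caller free to skip the file or fall back to a plain-text parser.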

Direct Parser Usage

Use a parser factory directly when you want fine-grained control over a single document:

typescript
import { createPDFParser } from '@lov3kaizen/agentsea-ingest';
import { readFileSync } from 'fs';

const pdfParser = createPDFParser();
const buffer = readFileSync('./document.pdf');
const result = await pdfParser.parse(buffer);

console.log(result.text);
console.log(result.elements);
console.log(result.tables);

Chunking Strategies

Split parsed text into retrieval-friendly chunks using the strategy that best fits your content.

Fixed Size

typescript
import { createFixedChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createFixedChunker();
const chunks = chunker.chunk(text, {
  maxTokens: 512,
  overlap: 50,
  splitOnSentences: true,
});
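
To make `maxTokens` and `overlap` concrete, here is a minimal sketch of fixed-size windowing. It is not the package's implementation: whitespace-separated words stand in for tokens, sentence-aware splitting is not modeled, and `fixedWindows` is a hypothetical helper.

```typescript
// Illustrative only: whitespace-separated words stand in for tokens.
function fixedWindows(text: string, maxTokens: number, overlap: number): string[] {
  if (maxTokens <= 0 || overlap < 0 || overlap >= maxTokens) {
    throw new Error('require maxTokens > 0 and 0 <= overlap < maxTokens');
  }
  const tokens = text.split(/\s+/).filter((token) => token.length > 0);
  const step = maxTokens - overlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens).join(' '));
    if (start + maxTokens >= tokens.length) break; // this window reached the end
  }
  return chunks;
}

const overlappingWindows = fixedWindows('a b c d e f g h i j', 4, 1);
console.log(overlappingWindows); // [ 'a b c d', 'd e f g', 'g h i j' ]
```

Each window repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when it is short enough.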

Recursive

typescript
import { createRecursiveChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createRecursiveChunker();
const chunks = chunker.chunk(text, {
  maxTokens: 512,
  separators: ['\n\n', '\n', '. ', ' '],
  keepSeparator: true,
});
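
The idea behind an ordered `separators` list can be sketched as follows: split on the coarsest separator first, and fall back to the next one only for pieces that are still too large. This is an illustration under simplifying assumptions, not the package's code: character counts stand in for tokens, separators are dropped at chunk boundaries (so `keepSeparator` is not modeled), and `recursiveSplit` is a hypothetical name.

```typescript
// Illustrative only: sizes are measured in characters rather than tokens.
function recursiveSplit(text: string, maxChars: number, separators: string[]): string[] {
  if (text.length <= maxChars) return text.length > 0 ? [text] : [];
  if (separators.length === 0) {
    // No separators left: fall back to hard cuts.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) pieces.push(text.slice(i, i + maxChars));
    return pieces;
  }
  const separator = separators[0] ?? '';
  const finer = separators.slice(1);
  const parts = text.split(separator).filter((part) => part.length > 0);
  const chunks: string[] = [];
  let current = '';
  for (const part of parts) {
    const candidate = current.length > 0 ? current + separator + part : part;
    if (candidate.length <= maxChars) {
      current = candidate; // keep merging small neighbours
      continue;
    }
    if (current.length > 0) chunks.push(current);
    if (part.length <= maxChars) {
      current = part;
    } else {
      // Still too large: retry this part with the finer separators.
      chunks.push(...recursiveSplit(part, maxChars, finer));
      current = '';
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

const recursiveChunks = recursiveSplit(
  'Alpha beta.\n\nGamma delta epsilon zeta.',
  15,
  ['\n\n', ' '],
);
console.log(recursiveChunks); // [ 'Alpha beta.', 'Gamma delta', 'epsilon zeta.' ]
```

The first paragraph fits and stays whole; the second exceeds the limit, so only that paragraph is split on spaces.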

Semantic

Groups sentences by embedding similarity — pass your embedding model via embedFunction:

typescript
import { createSemanticChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createSemanticChunker();
const chunks = await chunker.chunk(text, {
  maxTokens: 512,
  similarityThreshold: 0.5,
  embedFunction: async (text) => myEmbeddingModel(text),
});
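
How the semantic chunker groups sentences internally is not described here. One common approach, sketched below as an assumption, compares each sentence embedding with the previous one and starts a new group when cosine similarity drops below the threshold. The helpers `cosineSimilarity` and `groupBySimilarity` and the toy two-dimensional vectors are illustrative only.

```typescript
// Cosine similarity between two vectors; returns 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const length = Math.min(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Start a new group whenever a sentence is dissimilar to the one before it.
function groupBySimilarity(
  sentences: string[],
  vectors: number[][],
  threshold: number,
): string[][] {
  const groups: string[][] = [];
  for (let i = 0; i < sentences.length; i++) {
    const sentence = sentences[i] ?? '';
    const previous = i > 0 ? vectors[i - 1] : undefined;
    const current = vectors[i];
    const lastGroup = groups[groups.length - 1];
    if (lastGroup && previous && current && cosineSimilarity(previous, current) >= threshold) {
      lastGroup.push(sentence);
    } else {
      groups.push([sentence]);
    }
  }
  return groups;
}

// Toy embeddings: the first two sentences point in nearly the same direction.
const topicGroups = groupBySimilarity(
  ['Cats purr.', 'Cats nap.', 'Stocks fell.'],
  [[1, 0], [0.9, 0.1], [0, 1]],
  0.5,
);
console.log(topicGroups); // [ [ 'Cats purr.', 'Cats nap.' ], [ 'Stocks fell.' ] ]
```

In this sketch a higher threshold produces more, smaller groups, because neighbouring sentences must be closer in meaning to stay together.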

Hierarchical

Splits on Markdown headings while preserving parent context:

typescript
import { createHierarchicalChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createHierarchicalChunker();
const chunks = chunker.chunk(markdownText, {
  maxTokens: 512,
  headingLevels: [1, 2, 3],
  includeParentContext: true,
});
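
"Preserving parent context" can be pictured as attaching a heading path to every section. The sketch below is one way to do that and is an assumption, not the package's algorithm: `splitByHeadings` and `HeadingSection` are hypothetical names, only ATX headings (`#` style) are recognised, and headings inside fenced code are not treated specially.

```typescript
interface HeadingSection {
  path: string[]; // heading titles from the outermost parent down to this section
  body: string;
}

function splitByHeadings(markdown: string, headingLevels: number[]): HeadingSection[] {
  const sections: HeadingSection[] = [];
  const openHeadings: { level: number; title: string }[] = [];
  let bodyLines: string[] = [];

  // Emit the text collected so far under the currently open headings.
  const flush = (): void => {
    const body = bodyLines.join('\n').trim();
    if (body.length > 0) {
      sections.push({ path: openHeadings.map((heading) => heading.title), body });
    }
    bodyLines = [];
  };

  for (const line of markdown.split('\n')) {
    const match = /^(#{1,6})\s+(.+)$/.exec(line);
    const level = match ? (match[1] ?? '').length : 0;
    if (match && headingLevels.indexOf(level) >= 0) {
      flush();
      // Close headings at the same or a deeper level before opening this one.
      while (
        openHeadings.length > 0 &&
        (openHeadings[openHeadings.length - 1]?.level ?? 0) >= level
      ) {
        openHeadings.pop();
      }
      openHeadings.push({ level, title: (match[2] ?? '').trim() });
    } else {
      bodyLines.push(line);
    }
  }
  flush();
  return sections;
}

const guideSections = splitByHeadings(
  '# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.',
  [1, 2, 3],
);
for (const section of guideSections) {
  console.log(`${section.path.join(' > ')}: ${section.body}`);
}
// Guide: Intro text.
// Guide > Setup: Install it.
// Guide > Usage: Run it.
```

Prefixing each chunk with its heading path keeps a retrieved chunk interpretable even when it is read far from its original section.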

Sentence and paragraph chunkers are also available via createSentenceChunker and createParagraphChunker.

Table & Image Extraction

Parsers automatically extract tables and images alongside text. The results are exposed as TableData and ImageData on the parse result, and as Element entries on the processed document.

typescript
import { createPDFParser } from '@lov3kaizen/agentsea-ingest';
import { readFileSync } from 'fs';

const parser = createPDFParser();
const result = await parser.parse(readFileSync('./report.pdf'));

// Extracted tables (TableData[])
for (const table of result.tables) {
  console.log(table);
}

// Document elements include paragraphs, headings, lists, tables, images
console.log(result.elements);
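
For retrieval, extracted tables are often serialised back into text so they can be chunked and embedded with the surrounding prose. The sketch below shows one way to render a table as Markdown. The `SimpleTable` shape (headers plus rows of strings) is hypothetical; the package's `TableData` may be structured differently, so treat this as a pattern to adapt.

```typescript
// Hypothetical table shape for illustration; the package's TableData may differ.
interface SimpleTable {
  headers: string[];
  rows: string[][];
}

// Escape pipes and flatten line breaks so a cell cannot break the table layout.
function escapeCell(cell: string): string {
  return cell.replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');
}

function tableToMarkdown(table: SimpleTable): string {
  const header = `| ${table.headers.map(escapeCell).join(' | ')} |`;
  const divider = `| ${table.headers.map(() => '---').join(' | ')} |`;
  const body = table.rows.map((row) => {
    // Pad or truncate each row to the header width so columns line up.
    const cells = table.headers.map((_, i) => escapeCell(row[i] ?? ''));
    return `| ${cells.join(' | ')} |`;
  });
  return [header, divider].concat(body).join('\n');
}

const revenueTable = tableToMarkdown({
  headers: ['Region', 'Revenue'],
  rows: [['EMEA', '1.2M'], ['APAC']],
});
console.log(revenueTable);
```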

Pipeline Builder

Compose loading, parsing, cleaning, chunking, and embedding into a single configurable pipeline:

typescript
import { createPipelineBuilder } from '@lov3kaizen/agentsea-ingest';

const pipeline = createPipelineBuilder()
  .withName('my-pipeline')
  .withStages(['load', 'parse', 'clean', 'chunk', 'embed'])
  .withChunking({
    strategy: 'semantic',
    maxTokens: 512,
    overlap: 50,
  })
  .withCleaning({
    operations: ['normalize_whitespace', 'remove_urls', 'trim'],
  })
  .withCallbacks({
    onDocumentComplete: (doc) => console.log(`Processed: ${doc.id}`),
  })
  .build();

const result = await pipeline.process({ path: './document.pdf' });
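
The exact behaviour of each cleaning operation is not specified here. As a rough mental model, each name can be read as a small text transform applied in sequence. The sketch below is an assumption about what such transforms might look like, not the package's implementation; `cleaners` and `applyCleaning` are hypothetical names.

```typescript
type CleaningOperation = 'normalize_whitespace' | 'remove_urls' | 'trim';

// Assumed behaviour for each operation name; the real operations may differ.
const cleaners: Record<CleaningOperation, (text: string) => string> = {
  normalize_whitespace: (text) => text.replace(/\s+/g, ' '),
  remove_urls: (text) => text.replace(/https?:\/\/\S+/g, ''),
  trim: (text) => text.trim(),
};

// Apply the operations left to right, feeding each result into the next.
function applyCleaning(text: string, operations: CleaningOperation[]): string {
  return operations.reduce((current, operation) => cleaners[operation](current), text);
}

const rawText = '  See https://example.com   for\n\n details  ';
const cleanedText = applyCleaning(rawText, ['remove_urls', 'normalize_whitespace', 'trim']);
console.log(JSON.stringify(cleanedText)); // "See for details"
```

In this sketch the order matters: removing URLs before normalising whitespace avoids the double space a deleted URL would otherwise leave behind.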

Pre-built Pipelines

The pipelines helper provides ready-made configurations, including an OCR pipeline for scanned documents:

typescript
import { pipelines } from '@lov3kaizen/agentsea-ingest';

// Simple text extraction
const simple = pipelines.simple().build();

// Full processing with all stages
const full = pipelines.full().build();

// RAG-optimized pipeline
const rag = pipelines.rag().build();

// Document analysis (no chunking)
const analysis = pipelines.analysis().build();

// OCR pipeline for scanned documents
const ocr = pipelines.ocr().build();

Ingester

The Ingester class is a high-level API for ingesting files, URLs, buffers, and whole directories:

typescript
import { createIngester } from '@lov3kaizen/agentsea-ingest';
import { readFileSync } from 'fs';

const ingester = createIngester({
  chunking: {
    strategy: 'recursive',
    maxTokens: 512,
  },
  concurrency: 4,
  fileSizeLimit: 10 * 1024 * 1024, // 10MB
});

// Ingest single file
const doc = await ingester.ingestFile('./document.pdf');

// Ingest from URL
const webDoc = await ingester.ingestUrl('https://example.com/page.html');

// Ingest from buffer (read from disk here for illustration)
const buffer = readFileSync('./document.pdf');
const bufferDoc = await ingester.ingestBuffer(buffer, 'document.pdf');

// Ingest directory
const results = await ingester.ingestDirectory('./documents', {
  recursive: true,
  include: ['*.pdf', '*.docx'],
  exclude: ['draft-*'],
});
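
The `include` and `exclude` patterns can be read as a two-step filter: a file must match at least one include pattern and no exclude pattern. The package's glob semantics are not documented here, so the sketch below assumes the simplest case, where `*` matches any run of characters and patterns are tested against the base file name. `globToRegExp` and `selectFiles` are hypothetical helpers.

```typescript
// Convert a simple glob (only `*` is special) into an anchored regular expression.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+^${}()|[\]\\?]/g, '\\$&') // escape regex metacharacters
    .replace(/\*/g, '.*'); // `*` matches any run of characters
  return new RegExp(`^${escaped}$`);
}

// Keep names that match some include pattern and no exclude pattern.
function selectFiles(names: string[], include: string[], exclude: string[]): string[] {
  const included = include.map(globToRegExp);
  const excluded = exclude.map(globToRegExp);
  return names.filter(
    (fileName) =>
      included.some((pattern) => pattern.test(fileName)) &&
      !excluded.some((pattern) => pattern.test(fileName)),
  );
}

const selectedFiles = selectFiles(
  ['report.pdf', 'draft-report.pdf', 'notes.docx', 'image.png'],
  ['*.pdf', '*.docx'],
  ['draft-*'],
);
console.log(selectedFiles); // [ 'report.pdf', 'notes.docx' ]
```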

Watch Mode

Automatically process files as they are added or modified:

typescript
import { createIngester } from '@lov3kaizen/agentsea-ingest';

const ingester = createIngester({
  watchMode: {
    enabled: true,
    paths: ['./documents'],
    include: ['*.pdf', '*.md'],
    debounceDelay: 1000,
    processExisting: true,
  },
});

ingester.startWatching();
// Files added/modified in ./documents will be automatically processed
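
Editors and sync tools often write a file several times in quick succession, which is why a debounce delay is useful. The watcher's internals are not described here; the sketch below models the usual debounce semantics as a pure function over event timestamps, so the effect of `debounceDelay` can be reasoned about without timers. `FileEvent`, `ScheduledRun`, and `debounceEvents` are hypothetical names.

```typescript
interface FileEvent {
  path: string;
  at: number; // event time in milliseconds
}

interface ScheduledRun {
  path: string;
  runAt: number; // when processing would fire
}

// A file is processed once it has been quiet for `delayMs`; every new event
// for the same path inside that window pushes the run back.
function debounceEvents(events: FileEvent[], delayMs: number): ScheduledRun[] {
  const sorted = events.slice().sort((a, b) => a.at - b.at);
  const lastSeen = new Map<string, number>(); // path -> time of latest event
  const runs: ScheduledRun[] = [];
  for (const fileEvent of sorted) {
    const previous = lastSeen.get(fileEvent.path);
    if (previous !== undefined && fileEvent.at - previous >= delayMs) {
      // The earlier burst went quiet long enough to fire before this event.
      runs.push({ path: fileEvent.path, runAt: previous + delayMs });
    }
    lastSeen.set(fileEvent.path, fileEvent.at);
  }
  // Every path still pending fires once after its final event.
  lastSeen.forEach((at, path) => runs.push({ path, runAt: at + delayMs }));
  return runs.sort((a, b) => a.runAt - b.runAt);
}

const scheduledRuns = debounceEvents(
  [
    { path: 'a.pdf', at: 0 },
    { path: 'b.md', at: 100 },
    { path: 'a.pdf', at: 300 },
    { path: 'a.pdf', at: 600 },
    { path: 'a.pdf', at: 5000 },
  ],
  1000,
);
console.log(scheduledRuns);
// b.md runs at 1100, a.pdf at 1600 (one run for the burst) and again at 6000
```

Four writes to `a.pdf` result in two processing runs rather than four, because the first three fall inside one debounce window.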

Events

Subscribe to pipeline lifecycle events via the pipeline's event emitter:

typescript
import { createPipeline } from '@lov3kaizen/agentsea-ingest';

// `config` is your pipeline configuration object (its definition is not shown here)
const pipeline = createPipeline(config);
const emitter = pipeline.getEventEmitter();

emitter.on('document:loaded', (event) => {
  console.log(`Loaded: ${event.documentId}`);
});

emitter.on('document:chunked', (event) => {
  console.log(`Created ${event.chunkCount} chunks`);
});

emitter.on('document:completed', (event) => {
  console.log(`Completed: ${event.document.id}`);
});
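
Handlers like the ones above are a natural place to accumulate progress for a batch run. The following is a small sketch of a tracker that such handlers could feed. `IngestProgress` is a hypothetical class, not part of the package, and the wiring shown in the comments assumes the event payload fields used in the example above.

```typescript
// Hypothetical progress tracker fed by pipeline event handlers.
class IngestProgress {
  private loaded = 0;
  private completed = 0;
  private chunks = 0;

  onLoaded(): void {
    this.loaded += 1;
  }

  onChunked(chunkCount: number): void {
    this.chunks += chunkCount;
  }

  onCompleted(): void {
    this.completed += 1;
  }

  summary(): string {
    return `${this.completed}/${this.loaded} documents completed, ${this.chunks} chunks`;
  }
}

// Possible wiring (assumes the emitter and payload fields shown above):
//   emitter.on('document:loaded', () => tracker.onLoaded());
//   emitter.on('document:chunked', (event) => tracker.onChunked(event.chunkCount));
//   emitter.on('document:completed', () => tracker.onCompleted());

// Simulated run: two documents loaded, both chunked, one finished so far.
const tracker = new IngestProgress();
tracker.onLoaded();
tracker.onLoaded();
tracker.onChunked(4);
tracker.onChunked(3);
tracker.onCompleted();
console.log(tracker.summary()); // 1/2 documents completed, 7 chunks
```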

Embeddings Handoff

Each ProcessedDocument exposes a list of Chunk objects with text, metadata, and an optional embedding field — ready to be embedded and stored in your vector store or memory layer:

typescript
import { createIngester } from '@lov3kaizen/agentsea-ingest';
import type { ProcessedDocument, Chunk } from '@lov3kaizen/agentsea-ingest';

const ingester = createIngester();
const doc: ProcessedDocument = await ingester.ingestFile('./document.pdf');

// Hand each chunk to your embeddings model + vector store
for (const chunk of doc.chunks) {
  const text: string = chunk.text;
  // const embedding = await embeddings.embed(text);
  // await vectorStore.upsert({ id: chunk.id, embedding, metadata: chunk.metadata });
}

Alternatively, add an embed stage to your pipeline so chunks arrive with embeddings already attached.
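
When you embed chunks yourself, sending them in batches usually reduces the number of requests to the embedding service. The sketch below shows the bookkeeping only: splitting chunks into batches and pairing each chunk with its vector. `ChunkLike` mirrors the `id`, `text`, and `metadata` fields used above; `VectorRecord`, `toBatches`, and `toRecords` are hypothetical, and the vectors are placeholder numbers rather than real embeddings.

```typescript
// Minimal stand-ins for illustration; the package's Chunk type has more to it.
interface ChunkLike {
  id: string;
  text: string;
  metadata: Record<string, unknown>;
}

interface VectorRecord {
  id: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}

// Split items into consecutive batches of at most `batchSize`.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Pair each chunk with the embedding at the same index.
function toRecords(chunks: ChunkLike[], embeddings: number[][]): VectorRecord[] {
  if (chunks.length !== embeddings.length) {
    throw new Error('expected one embedding per chunk');
  }
  return chunks.map((chunk, i) => ({
    id: chunk.id,
    embedding: embeddings[i] ?? [],
    metadata: chunk.metadata,
  }));
}

const sampleChunks: ChunkLike[] = [
  { id: 'c1', text: 'First chunk', metadata: { page: 1 } },
  { id: 'c2', text: 'Second chunk', metadata: { page: 1 } },
  { id: 'c3', text: 'Third chunk', metadata: { page: 2 } },
];

const chunkBatches = toBatches(sampleChunks, 2);
console.log(chunkBatches.map((batch) => batch.map((chunk) => chunk.id)));
// [ [ 'c1', 'c2' ], [ 'c3' ] ]

// Placeholder vectors standing in for a real embedding call.
const vectorRecords = toRecords(sampleChunks, [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]);
console.log(vectorRecords.length); // 3
```

Checking that chunk and embedding counts match before building records catches partial responses early, before mismatched vectors reach the store.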

Next Steps