AgentSea - Unite and Orchestrate AI Agents

Installation

bash

pnpm add @lov3kaizen/agentsea-ingest

Quick Start

typescript

import { createIngester, pipelines } from '@lov3kaizen/agentsea-ingest';

// Simple ingestion
const ingester = createIngester();
const doc = await ingester.ingestFile('./document.pdf');
console.log(`Extracted ${doc.chunks.length} chunks`);

// RAG-optimized pipeline
const pipeline = pipelines.rag().build();
const result = await pipeline.process({ path: './document.md' });

Parsing Documents

The ingest package ships parsers for the eight formats listed under Supported Formats. Each parser returns the extracted text, a list of structured elements, and any tables it finds.

Supported Formats

Format	Parser	Factory
PDF	PDFParser	createPDFParser
DOCX	DOCXParser	createDOCXParser
HTML	HTMLParser	createHTMLParser
Markdown	MarkdownParser	createMarkdownParser
Text	TextParser	createTextParser
CSV	CSVParser	createCSVParser
Excel	ExcelParser	createExcelParser
JSON	JSONParser	createJSONParser
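When the format is not known ahead of time, the table can be turned into a lookup keyed by file extension. The sketch below is illustrative only: the extension list and the factoryNameFor helper are assumptions, not part of the package, and the lookup returns the factory name as a string rather than calling the library.

```typescript
// Hypothetical lookup: maps a file extension to the factory name from the table.
// The extensions chosen here are assumptions; adjust them to your own files.
const factoryByExtension: Record<string, string> = {
  '.pdf': 'createPDFParser',
  '.docx': 'createDOCXParser',
  '.html': 'createHTMLParser',
  '.htm': 'createHTMLParser',
  '.md': 'createMarkdownParser',
  '.txt': 'createTextParser',
  '.csv': 'createCSVParser',
  '.xlsx': 'createExcelParser',
  '.json': 'createJSONParser',
};

function factoryNameFor(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".env"
  return factoryByExtension[fileName.slice(dot).toLowerCase()];
}

console.log(factoryNameFor('Report.PDF'));
console.log(factoryNameFor('README'));
```

Returning undefined for unknown extensions leaves the fallback decision (skip the file, or treat it as plain text) to the caller.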

Direct Parser Usage

Use a parser factory directly when you want fine-grained control over a single document:

typescript

import { createPDFParser } from '@lov3kaizen/agentsea-ingest';
import { readFileSync } from 'fs';

const pdfParser = createPDFParser();
const buffer = readFileSync('./document.pdf');
const result = await pdfParser.parse(buffer);

console.log(result.text);
console.log(result.elements);
console.log(result.tables);

Chunking Strategies

Split parsed text into retrieval-friendly chunks using the strategy that best fits your content.

Fixed Size

typescript

import { createFixedChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createFixedChunker();
const chunks = chunker.chunk(text, {
  maxTokens: 512,
  overlap: 50,
  splitOnSentences: true,
});
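To make the maxTokens and overlap options concrete, here is a minimal sliding-window sketch. It is not the library's implementation: fixedSizeChunks is a hypothetical helper, tokens are approximated as whitespace-separated words, and sentence-aware splitting is left out.

```typescript
// Hypothetical helper, not part of @lov3kaizen/agentsea-ingest.
// Assumption: one token is one whitespace-separated word.
function fixedSizeChunks(text: string, maxTokens: number, overlap: number): string[] {
  if (maxTokens <= 0 || overlap < 0 || overlap >= maxTokens) {
    throw new Error('require maxTokens > 0 and 0 <= overlap < maxTokens');
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = maxTokens - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(' '));
    if (start + maxTokens >= words.length) break; // this window reached the end
  }
  return chunks;
}

// Windows of 4 words that share 1 word with their neighbour.
console.log(fixedSizeChunks('a b c d e f g h i j', 4, 1));
// prints [ 'a b c d', 'd e f g', 'g h i j' ]
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.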

Recursive

typescript

import { createRecursiveChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createRecursiveChunker();
const chunks = chunker.chunk(text, {
  maxTokens: 512,
  separators: ['\n\n', '\n', '. ', ' '],
  keepSeparator: true,
});
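The separators list is tried from coarsest to finest. A rough sketch of that idea follows, under stated assumptions: recursiveSplit is a hypothetical helper, length is measured in characters rather than tokens, and separators are dropped from the output instead of kept.

```typescript
// Hypothetical helper, not the library's algorithm.
// Assumption: chunk size is measured in characters to keep the sketch short.
function recursiveSplit(text: string, maxLen: number, separators: string[]): string[] {
  if (maxLen <= 0) throw new Error('maxLen must be positive');
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: fall back to hard cuts.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) pieces.push(text.slice(i, i + maxLen));
    return pieces;
  }
  const out: string[] = [];
  let current = '';
  for (const part of text.split(sep)) {
    const candidate = current.length > 0 ? current + sep + part : part;
    if (candidate.length <= maxLen) {
      current = candidate; // keep merging small neighbours
      continue;
    }
    if (current.length > 0) out.push(current);
    if (part.length <= maxLen) {
      current = part;
    } else {
      out.push(...recursiveSplit(part, maxLen, rest)); // too big: try finer separators
      current = '';
    }
  }
  if (current.length > 0) out.push(current);
  return out;
}

console.log(recursiveSplit('alpha beta\n\ngamma delta epsilon', 12, ['\n\n', ' ']));
// prints [ 'alpha beta', 'gamma delta', 'epsilon' ]
```

Paragraph breaks are honoured first, and only the paragraph that is still too long gets split on spaces.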

Semantic

Groups sentences by embedding similarity. Pass your embedding model via embedFunction:

typescript

import { createSemanticChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createSemanticChunker();
const chunks = await chunker.chunk(text, {
  maxTokens: 512,
  similarityThreshold: 0.5,
  embedFunction: async (text) => myEmbeddingModel(text),
});
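One simple way to read similarityThreshold is as a cut-off on the cosine similarity between neighbouring sentences. The sketch below illustrates that reading only; the library may group differently. groupBySimilarity and cosine are hypothetical helpers, and the vectors are passed in precomputed so the example runs without an embedding model.

```typescript
// Cosine similarity between two vectors; returns 0 if either has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Hypothetical helper: start a new group whenever a sentence is less similar
// to the previous sentence than the threshold.
function groupBySimilarity(sentences: string[], vectors: number[][], threshold: number): string[][] {
  if (sentences.length !== vectors.length) throw new Error('one vector per sentence required');
  const groups: string[][] = [];
  sentences.forEach((sentence, i) => {
    const previous = vectors[i - 1];
    const current = vectors[i];
    const lastGroup = groups[groups.length - 1];
    if (previous && current && lastGroup && cosine(previous, current) >= threshold) {
      lastGroup.push(sentence);
    } else {
      groups.push([sentence]);
    }
  });
  return groups;
}

// Toy 2-dimensional "embeddings" chosen by hand for illustration.
console.log(
  groupBySimilarity(
    ['cats purr', 'cats meow', 'stocks fell'],
    [[1, 0], [0.9, 0.1], [0, 1]],
    0.5,
  ),
);
// prints [ [ 'cats purr', 'cats meow' ], [ 'stocks fell' ] ]
```

A higher threshold produces more, smaller groups; a lower one merges loosely related sentences.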

Hierarchical

Splits on Markdown headings while preserving parent context:

typescript

import { createHierarchicalChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createHierarchicalChunker();
const chunks = chunker.chunk(markdownText, {
  maxTokens: 512,
  headingLevels: [1, 2, 3],
  includeParentContext: true,
});
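"Preserving parent context" can be pictured as attaching the chain of enclosing headings to each chunk. The following is a minimal sketch of that idea, not the library's code: chunkByHeadings and HeadingChunk are hypothetical names, only ATX headings (lines starting with #) are recognised, and headings inside code fences are not treated specially.

```typescript
interface HeadingChunk {
  path: string[]; // titles of the enclosing headings, outermost first
  text: string;
}

// Hypothetical helper: split on the given heading levels and record each
// section's heading path.
function chunkByHeadings(markdown: string, levels: number[] = [1, 2, 3]): HeadingChunk[] {
  const chunks: HeadingChunk[] = [];
  const stack: { level: number; title: string }[] = [];
  let body: string[] = [];

  const flush = (): void => {
    const text = body.join('\n').trim();
    if (text.length > 0) chunks.push({ path: stack.map((h) => h.title), text });
    body = [];
  };

  for (const line of markdown.split('\n')) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    const level = (match?.[1] ?? '').length;
    if (level > 0 && levels.includes(level)) {
      flush();
      // Drop headings at the same or a deeper level before pushing the new one.
      while (true) {
        const top = stack[stack.length - 1];
        if (!top || top.level < level) break;
        stack.pop();
      }
      stack.push({ level, title: (match?.[2] ?? '').trim() });
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}

const sections = chunkByHeadings('# Guide\nIntro text\n## Setup\nInstall it\n# FAQ\nAnswers');
for (const s of sections) console.log(`${s.path.join(' > ')}: ${s.text}`);
```

Prefixing each chunk's text with its heading path before embedding is one way to keep a short section meaningful when it is retrieved on its own.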

Sentence and paragraph chunkers are also available via createSentenceChunker and createParagraphChunker.

Table & Image Extraction

Parsers automatically extract tables and images alongside text. The results are exposed as TableData and ImageData on the parse result, and as Element entries on the processed document.

typescript

import { createPDFParser } from '@lov3kaizen/agentsea-ingest';
import { readFileSync } from 'fs';

const parser = createPDFParser();
const result = await parser.parse(readFileSync('./report.pdf'));

// Extracted tables (TableData[])
for (const table of result.tables) {
  console.log(table);
}

// Document elements include paragraphs, headings, lists, tables, images
console.log(result.elements);
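Extracted tables usually need to be serialised before they can be chunked or embedded. The sketch below renders a table as Markdown. The shape of TableData is not documented here, so SimpleTable (headers plus rows of strings) is an assumed stand-in that you would map the real type onto, and tableToMarkdown is a hypothetical helper.

```typescript
// Assumed stand-in for the library's TableData; the real type may differ.
interface SimpleTable {
  headers: string[];
  rows: string[][];
}

// Hypothetical helper: render a table as a Markdown pipe table.
function tableToMarkdown(table: SimpleTable): string {
  const escapeCell = (cell: string): string => cell.replace(/\|/g, '\\|').replace(/\n/g, ' ');
  const line = (cells: string[]): string => `| ${cells.map(escapeCell).join(' | ')} |`;
  const divider = `| ${table.headers.map(() => '---').join(' | ')} |`;
  // Pad or truncate each row to the header width so the columns stay aligned.
  const rows = table.rows.map((row) => line(table.headers.map((_, i) => row[i] ?? '')));
  return [line(table.headers), divider, ...rows].join('\n');
}

console.log(
  tableToMarkdown({
    headers: ['Format', 'Parser'],
    rows: [['PDF', 'PDFParser'], ['CSV', 'CSVParser']],
  }),
);
```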

Pipeline Builder

Compose loading, parsing, cleaning, chunking, and embedding into a single configurable pipeline:

typescript

import { createPipelineBuilder } from '@lov3kaizen/agentsea-ingest';

const pipeline = createPipelineBuilder()
  .withName('my-pipeline')
  .withStages(['load', 'parse', 'clean', 'chunk', 'embed'])
  .withChunking({
    strategy: 'semantic',
    maxTokens: 512,
    overlap: 50,
  })
  .withCleaning({
    operations: ['normalize_whitespace', 'remove_urls', 'trim'],
  })
  .withCallbacks({
    onDocumentComplete: (doc) => console.log(`Processed: ${doc.id}`),
  })
  .build();

const result = await pipeline.process({ path: './document.pdf' });
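The cleaning stage can be thought of as a list of string transforms applied in order. The sketch below shows plausible behaviour for the three operation names used above; the regular expressions are assumptions about what each operation does, and cleaners and clean are hypothetical names rather than library exports.

```typescript
type CleanOp = 'normalize_whitespace' | 'remove_urls' | 'trim';

// Assumed behaviour for each operation name; the library's rules may differ.
const cleaners: Record<CleanOp, (s: string) => string> = {
  normalize_whitespace: (s) => s.replace(/[ \t]+/g, ' ').replace(/\n{3,}/g, '\n\n'),
  remove_urls: (s) => s.replace(/https?:\/\/\S+/g, ''),
  trim: (s) => s.trim(),
};

// Apply the operations left to right.
function clean(text: string, operations: CleanOp[]): string {
  return operations.reduce((acc, op) => cleaners[op](acc), text);
}

const raw = '  See   https://example.com/docs  for\tdetails.  ';
console.log(JSON.stringify(clean(raw, ['remove_urls', 'normalize_whitespace', 'trim'])));
// prints "See for details."
console.log(JSON.stringify(clean(raw, ['normalize_whitespace', 'remove_urls', 'trim'])));
// prints "See  for details." (two spaces: the URL was removed after whitespace was normalized)
```

In this sketch the order of operations matters, so it is worth checking how the real cleaner behaves if you remove content after normalizing whitespace.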

Pre-built Pipelines

The pipelines helper provides ready-made configurations, including an OCR pipeline for scanned documents:

typescript

import { pipelines } from '@lov3kaizen/agentsea-ingest';

// Simple text extraction
const simple = pipelines.simple().build();

// Full processing with all stages
const full = pipelines.full().build();

// RAG-optimized pipeline
const rag = pipelines.rag().build();

// Document analysis (no chunking)
const analysis = pipelines.analysis().build();

// OCR pipeline for scanned documents
const ocr = pipelines.ocr().build();

Ingester

The Ingester class is a high-level API for ingesting files, URLs, buffers, and whole directories:

typescript

import { createIngester } from '@lov3kaizen/agentsea-ingest';

const ingester = createIngester({
  chunking: {
    strategy: 'recursive',
    maxTokens: 512,
  },
  concurrency: 4,
  fileSizeLimit: 10 * 1024 * 1024, // 10MB
});

// Ingest single file
const doc = await ingester.ingestFile('./document.pdf');

// Ingest from URL
const webDoc = await ingester.ingestUrl('https://example.com/page.html');

// Ingest from buffer (`buffer` is a Buffer you already hold, for example from readFileSync or an upload)
const bufferDoc = await ingester.ingestBuffer(buffer, 'document.pdf');

// Ingest directory
const results = await ingester.ingestDirectory('./documents', {
  recursive: true,
  include: ['*.pdf', '*.docx'],
  exclude: ['draft-*'],
});
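The include and exclude options act as a filter over the files found in the directory. The sketch below shows one plausible reading: a file is ingested if it matches any include pattern and no exclude pattern. Matching on the file name only, and supporting only * and ?, are assumptions; globToRegExp and shouldIngest are hypothetical helpers and the library may use a full glob implementation.

```typescript
// Hypothetical helper: translate a simple glob (* and ? only) into a RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&'); // escape regex metacharacters
  const pattern = escaped.replace(/\*/g, '[^/]*').replace(/\?/g, '[^/]');
  return new RegExp(`^${pattern}$`);
}

// Assumption: patterns are tested against the file name, not the full path.
function shouldIngest(fileName: string, include: string[], exclude: string[]): boolean {
  const matchesAny = (globs: string[]): boolean => globs.some((g) => globToRegExp(g).test(fileName));
  const included = include.length === 0 || matchesAny(include);
  return included && !matchesAny(exclude);
}

const files = ['report.pdf', 'draft-report.pdf', 'notes.txt', 'spec.docx'];
console.log(files.filter((f) => shouldIngest(f, ['*.pdf', '*.docx'], ['draft-*'])));
// prints [ 'report.pdf', 'spec.docx' ]
```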

Watch Mode

Automatically process files as they are added or modified:

typescript

import { createIngester } from '@lov3kaizen/agentsea-ingest';

const ingester = createIngester({
  watchMode: {
    enabled: true,
    paths: ['./documents'],
    include: ['*.pdf', '*.md'],
    debounceDelay: 1000,
    processExisting: true,
  },
});

ingester.startWatching();
// Files added/modified in ./documents will be automatically processed
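debounceDelay exists because editors and copy tools often fire several change events for one save. The effect can be simulated on a timeline without real timers: an event is processed only if no later event for the same path arrives within the delay. This is a sketch of the general debouncing technique, assuming per-path debouncing; debounceEvents and FileEvent are hypothetical names and the library's watcher may behave differently.

```typescript
interface FileEvent {
  path: string;
  timeMs: number;
}

// Hypothetical simulation: returns one entry per processing run, with timeMs
// set to the moment the debounce timer would fire.
function debounceEvents(events: FileEvent[], delayMs: number): FileEvent[] {
  const sorted = [...events].sort((a, b) => a.timeMs - b.timeMs);
  const fired: FileEvent[] = [];
  sorted.forEach((event, i) => {
    // A later event for the same path inside the delay window resets the timer.
    const superseded = sorted
      .slice(i + 1)
      .some((later) => later.path === event.path && later.timeMs - event.timeMs < delayMs);
    if (!superseded) fired.push({ path: event.path, timeMs: event.timeMs + delayMs });
  });
  return fired;
}

const runs = debounceEvents(
  [
    { path: 'a.md', timeMs: 0 },
    { path: 'a.md', timeMs: 400 },
    { path: 'b.pdf', timeMs: 500 },
    { path: 'a.md', timeMs: 900 },
  ],
  1000,
);
console.log(runs);
// prints [ { path: 'b.pdf', timeMs: 1500 }, { path: 'a.md', timeMs: 1900 } ]
```

Three rapid writes to a.md collapse into a single processing run one second after the last write.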

Events

Subscribe to pipeline lifecycle events via the pipeline's event emitter:

typescript

import { createPipeline } from '@lov3kaizen/agentsea-ingest';

// `config` is your pipeline configuration object (not shown here)
const pipeline = createPipeline(config);
const emitter = pipeline.getEventEmitter();

emitter.on('document:loaded', (event) => {
  console.log(`Loaded: ${event.documentId}`);
});

emitter.on('document:chunked', (event) => {
  console.log(`Created ${event.chunkCount} chunks`);
});

emitter.on('document:completed', (event) => {
  console.log(`Completed: ${event.document.id}`);
});
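If you want the compiler to check event names and payload fields in your own wrappers, an event map type is one option. The sketch below is a standalone illustration, not the emitter returned by getEventEmitter: TypedEmitter and IngestEvents are hypothetical, and the payload shapes are inferred from the fields used in the example above.

```typescript
// Assumed payload shapes, based only on the fields used in the snippet above.
type IngestEvents = {
  'document:loaded': { documentId: string };
  'document:chunked': { documentId: string; chunkCount: number };
};

type Listener<T> = (payload: T) => void;

// Minimal typed emitter: event names and payloads are checked at compile time.
class TypedEmitter<Events> {
  private listeners: { [K in keyof Events]?: Listener<Events[K]>[] } = {};

  on<K extends keyof Events>(name: K, listener: Listener<Events[K]>): void {
    const list = this.listeners[name] ?? [];
    list.push(listener);
    this.listeners[name] = list;
  }

  emit<K extends keyof Events>(name: K, payload: Events[K]): void {
    for (const listener of this.listeners[name] ?? []) listener(payload);
  }
}

const events = new TypedEmitter<IngestEvents>();
events.on('document:chunked', (e) => console.log(`Created ${e.chunkCount} chunks for ${e.documentId}`));
events.emit('document:chunked', { documentId: 'doc-1', chunkCount: 3 });
// prints "Created 3 chunks for doc-1"
```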

Embeddings Handoff

Each ProcessedDocument exposes a list of Chunk objects with text, metadata, and an optional embedding field. The chunks are ready to be embedded and stored in your vector store or memory layer:

typescript

import { createIngester } from '@lov3kaizen/agentsea-ingest';
import type { ProcessedDocument, Chunk } from '@lov3kaizen/agentsea-ingest';

const ingester = createIngester();
const doc: ProcessedDocument = await ingester.ingestFile('./document.pdf');

// Hand each chunk to your embeddings model + vector store
for (const chunk of doc.chunks) {
  const text: string = chunk.text;
  // const embedding = await embeddings.embed(text);
  // await vectorStore.upsert({ id: chunk.id, embedding, metadata: chunk.metadata });
}

Alternatively, add an embed stage to your pipeline so chunks arrive with embeddings already attached.
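If you embed chunks yourself, batching the calls usually reduces round trips to the embedding service. A minimal sketch follows, assuming an embedding function that accepts several texts at once and a store with a bulk upsert. ChunkLike, VectorRecord, toBatches, embedAndStore, embedBatch, and upsert are all hypothetical names, and the batch size of 32 is an arbitrary default.

```typescript
// Assumed minimal shapes; the library's Chunk type has at least these fields
// according to the example above.
interface ChunkLike {
  id: string;
  text: string;
  metadata: Record<string, unknown>;
}

interface VectorRecord {
  id: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}

// Split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error('size must be positive');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}

// Hypothetical handoff: embed each batch, then upsert it. Returns the number
// of records stored.
async function embedAndStore(
  chunks: ChunkLike[],
  embedBatch: (texts: string[]) => Promise<number[][]>,
  upsert: (records: VectorRecord[]) => Promise<void>,
  batchSize = 32,
): Promise<number> {
  let stored = 0;
  for (const batch of toBatches(chunks, batchSize)) {
    const vectors = await embedBatch(batch.map((chunk) => chunk.text));
    if (vectors.length !== batch.length) throw new Error('embedding count mismatch');
    const records = batch.map((chunk, i) => ({
      id: chunk.id,
      embedding: vectors[i] ?? [],
      metadata: chunk.metadata,
    }));
    await upsert(records);
    stored += records.length;
  }
  return stored;
}

console.log(toBatches([1, 2, 3, 4, 5], 2));
// prints [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```

Batches are processed sequentially here to keep memory use and rate limits predictable; parallel batches are possible but need their own concurrency cap.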
