Document Ingestion
A document processing pipeline: PDF, DOCX, HTML, and Markdown parsing with intelligent chunking, table/image extraction, and OCR.
Installation
pnpm add @lov3kaizen/agentsea-ingest
Quick Start
import { createIngester, pipelines } from '@lov3kaizen/agentsea-ingest';
// Simple ingestion
const ingester = createIngester();
const doc = await ingester.ingestFile('./document.pdf');
console.log(`Extracted ${doc.chunks.length} chunks`);
// RAG-optimized pipeline
const pipeline = pipelines.rag().build();
const result = await pipeline.process({ path: './document.md' });
Parsing Documents
The ingest package ships parsers for the document formats listed below. Each parser produces extracted text, structured elements, and tables.
Supported Formats
| Format | Parser | Factory |
|---|---|---|
| PDF | PDFParser | createPDFParser |
| DOCX | DOCXParser | createDOCXParser |
| HTML | HTMLParser | createHTMLParser |
| Markdown | MarkdownParser | createMarkdownParser |
| Text | TextParser | createTextParser |
| CSV | CSVParser | createCSVParser |
| Excel | ExcelParser | createExcelParser |
| JSON | JSONParser | createJSONParser |
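As a rough illustration of how the table above could drive parser selection, the sketch below maps a file extension to the name of a parser factory. The factory names are taken from the table; the extension list and the `factoryNameFor` helper are assumptions made for this example and are not part of the package's API.

```typescript
// Factory names come from the Supported Formats table. The extension keys are
// an assumption for illustration, not the package's own detection logic.
const FACTORY_BY_EXTENSION = new Map<string, string>([
  ['pdf', 'createPDFParser'],
  ['docx', 'createDOCXParser'],
  ['html', 'createHTMLParser'],
  ['htm', 'createHTMLParser'],
  ['md', 'createMarkdownParser'],
  ['txt', 'createTextParser'],
  ['csv', 'createCSVParser'],
  ['xlsx', 'createExcelParser'],
  ['json', 'createJSONParser'],
]);

// Hypothetical helper: returns the factory name for a file path, or undefined
// when the path has no extension or the extension is not in the map.
function factoryNameFor(path: string): string | undefined {
  const dot = path.lastIndexOf('.');
  if (dot < 0 || dot === path.length - 1) return undefined;
  const ext = path.slice(dot + 1).toLowerCase();
  return FACTORY_BY_EXTENSION.get(ext);
}

console.log(factoryNameFor('./report.PDF'));
console.log(factoryNameFor('./notes.md'));
```

A lookup like this is only useful when you call parser factories directly; the Ingester and the pipelines accept a path and take care of loading and parsing themselves.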
Direct Parser Usage
Use a parser factory directly when you want fine-grained control over a single document:
import {
  createPDFParser,
  createMarkdownParser,
} from '@lov3kaizen/agentsea-ingest';
import { readFileSync } from 'fs';
const pdfParser = createPDFParser();
const buffer = readFileSync('./document.pdf');
const result = await pdfParser.parse(buffer);
console.log(result.text);
console.log(result.elements);
console.log(result.tables);
Chunking Strategies
Split parsed text into retrieval-friendly chunks using the strategy that best fits your content.
Fixed Size
import { createFixedChunker } from '@lov3kaizen/agentsea-ingest';
const chunker = createFixedChunker();
const chunks = chunker.chunk(text, {
  maxTokens: 512,
  overlap: 50,
  splitOnSentences: true,
});
Recursive
import { createRecursiveChunker } from '@lov3kaizen/agentsea-ingest';
const chunker = createRecursiveChunker();
const chunks = chunker.chunk(text, {
  maxTokens: 512,
  separators: ['\n\n', '\n', '. ', ' '],
  keepSeparator: true,
});
Semantic
Groups sentences by embedding similarity. Pass your embedding model via embedFunction:
import { createSemanticChunker } from '@lov3kaizen/agentsea-ingest';
const chunker = createSemanticChunker();
const chunks = await chunker.chunk(text, {
  maxTokens: 512,
  similarityThreshold: 0.5,
  embedFunction: async (text) => myEmbeddingModel(text),
});
Hierarchical
Splits on Markdown headings while preserving parent context:
import { createHierarchicalChunker } from '@lov3kaizen/agentsea-ingest';
const chunker = createHierarchicalChunker();
const chunks = chunker.chunk(markdownText, {
  maxTokens: 512,
  headingLevels: [1, 2, 3],
  includeParentContext: true,
});
Sentence and paragraph chunkers are also available via createSentenceChunker and createParagraphChunker.
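To make the maxTokens and overlap options from the fixed-size strategy concrete, here is a minimal, library-independent sketch of a sliding window with overlap. It is not the package's implementation: `chunkWithOverlap` is a hypothetical helper, and it approximates tokens as whitespace-separated words, which a real tokenizer will count differently.

```typescript
// Sliding-window chunking sketch. Tokens are approximated as whitespace-
// separated words (an assumption; real tokenizers count differently).
function chunkWithOverlap(
  text: string,
  maxTokens: number,
  overlap: number,
): string[] {
  if (maxTokens <= 0) throw new Error('maxTokens must be positive');
  if (overlap < 0 || overlap >= maxTokens) {
    throw new Error('overlap must satisfy 0 <= overlap < maxTokens');
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = maxTokens - overlap; // how far each window advances
  const windows: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    windows.push(words.slice(start, start + maxTokens).join(' '));
    // Stop once a window has reached the end, so no chunk is pure overlap.
    if (start + maxTokens >= words.length) break;
  }
  return windows;
}

// Each chunk repeats the last word of the previous one (overlap = 1).
const demoChunks = chunkWithOverlap('a b c d e f g h i j', 4, 1);
console.log(demoChunks); // [ 'a b c d', 'd e f g', 'g h i j' ]
```

The overlap trades storage for recall: a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.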
Table & Image Extraction
Parsers automatically extract tables and images alongside text. The results are exposed as TableData and ImageData on the parse result, and as Element entries on the processed document.
import { createPDFParser } from '@lov3kaizen/agentsea-ingest';
import { readFileSync } from 'fs';
const parser = createPDFParser();
const result = await parser.parse(readFileSync('./report.pdf'));
// Extracted tables (TableData[])
for (const table of result.tables) {
  console.log(table);
}
// Document elements include paragraphs, headings, lists, tables, images
console.log(result.elements);
Pipeline Builder
Compose loading, parsing, cleaning, chunking, and embedding into a single configurable pipeline:
import { createPipelineBuilder } from '@lov3kaizen/agentsea-ingest';
const pipeline = createPipelineBuilder()
  .withName('my-pipeline')
  .withStages(['load', 'parse', 'clean', 'chunk', 'embed'])
  .withChunking({
    strategy: 'semantic',
    maxTokens: 512,
    overlap: 50,
  })
  .withCleaning({
    operations: ['normalize_whitespace', 'remove_urls', 'trim'],
  })
  .withCallbacks({
    onDocumentComplete: (doc) => console.log(`Processed: ${doc.id}`),
  })
  .build();
const result = await pipeline.process({ path: './document.pdf' });
Pre-built Pipelines
The pipelines helper provides ready-made configurations, including an OCR pipeline for scanned documents:
import { pipelines } from '@lov3kaizen/agentsea-ingest';
// Simple text extraction
const simple = pipelines.simple().build();
// Full processing with all stages
const full = pipelines.full().build();
// RAG-optimized pipeline
const rag = pipelines.rag().build();
// Document analysis (no chunking)
const analysis = pipelines.analysis().build();
// OCR pipeline for scanned documents
const ocr = pipelines.ocr().build();
Ingester
The Ingester class is a high-level API for ingesting files, URLs, buffers, and whole directories:
import { createIngester } from '@lov3kaizen/agentsea-ingest';
const ingester = createIngester({
  chunking: {
    strategy: 'recursive',
    maxTokens: 512,
  },
  concurrency: 4,
  fileSizeLimit: 10 * 1024 * 1024, // 10MB
});
// Ingest single file
const doc = await ingester.ingestFile('./document.pdf');
// Ingest from URL
const webDoc = await ingester.ingestUrl('https://example.com/page.html');
// Ingest from buffer (`buffer` holds file bytes you have already loaded)
const bufferDoc = await ingester.ingestBuffer(buffer, 'document.pdf');
// Ingest directory
const results = await ingester.ingestDirectory('./documents', {
  recursive: true,
  include: ['*.pdf', '*.docx'],
  exclude: ['draft-*'],
});
Watch Mode
Automatically process files as they are added or modified:
import { createIngester } from '@lov3kaizen/agentsea-ingest';
const ingester = createIngester({
  watchMode: {
    enabled: true,
    paths: ['./documents'],
    include: ['*.pdf', '*.md'],
    debounceDelay: 1000,
    processExisting: true,
  },
});
ingester.startWatching();
// Files added/modified in ./documents will be automatically processed
Events
Subscribe to pipeline lifecycle events via the pipeline's event emitter:
import { createPipeline } from '@lov3kaizen/agentsea-ingest';
const pipeline = createPipeline(config);
const emitter = pipeline.getEventEmitter();
emitter.on('document:loaded', (event) => {
  console.log(`Loaded: ${event.documentId}`);
});
emitter.on('document:chunked', (event) => {
  console.log(`Created ${event.chunkCount} chunks`);
});
emitter.on('document:completed', (event) => {
  console.log(`Completed: ${event.document.id}`);
});
Embeddings Handoff
Each ProcessedDocument exposes a list of Chunk objects with text, metadata, and an optional embedding field, ready to be embedded and stored in your vector store or memory layer:
import { createIngester } from '@lov3kaizen/agentsea-ingest';
import type { ProcessedDocument, Chunk } from '@lov3kaizen/agentsea-ingest';
const ingester = createIngester();
const doc: ProcessedDocument = await ingester.ingestFile('./document.pdf');
// Hand each chunk to your embeddings model + vector store
for (const chunk of doc.chunks) {
  const text: string = chunk.text;
  // const embedding = await embeddings.embed(text);
  // await vectorStore.upsert({ id: chunk.id, embedding, metadata: chunk.metadata });
}
Alternatively, add an embed stage to your pipeline so chunks arrive with embeddings already attached.
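For orientation, the sketch below shows one way the receiving side of that handoff might look once chunks carry embeddings: a tiny in-memory store with cosine-similarity lookup. It is a stand-in under stated assumptions, not part of the package. `ChunkLike` mirrors the fields described above with guessed field types, and `InMemoryVectorStore` and `cosine` are hypothetical names introduced for this example.

```typescript
// Local stand-in for the package's Chunk type; the field names follow the
// text above, the field types are assumptions.
interface ChunkLike {
  id: string;
  text: string;
  metadata: Record<string, unknown>;
  embedding?: number[];
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Hypothetical in-memory store: keeps one row per chunk id and ranks rows by
// cosine similarity to a query vector.
class InMemoryVectorStore {
  private readonly rows = new Map<string, { id: string; embedding: number[] }>();

  upsert(chunk: ChunkLike): void {
    if (chunk.embedding === undefined) {
      throw new Error(`chunk ${chunk.id} has no embedding`);
    }
    this.rows.set(chunk.id, { id: chunk.id, embedding: chunk.embedding });
  }

  query(vector: number[], topK = 3): Array<{ id: string; score: number }> {
    return Array.from(this.rows.values())
      .map((row) => ({ id: row.id, score: cosine(vector, row.embedding) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK);
  }

  get size(): number {
    return this.rows.size;
  }
}

// Usage with hand-written two-dimensional embeddings.
const demoStore = new InMemoryVectorStore();
demoStore.upsert({ id: 'intro', text: 'Overview', metadata: {}, embedding: [1, 0] });
demoStore.upsert({ id: 'pricing', text: 'Costs', metadata: {}, embedding: [0, 1] });
console.log(demoStore.query([0.9, 0.1], 1));
```

A production vector store adds persistence and approximate nearest-neighbor indexing, but the contract is the same: upsert by chunk id, query by vector.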