Data Source
File storage and parsing layer · Import methods · Relationship with the knowledge base
A data source sits upstream of the knowledge base. It owns file storage, parsing, and metadata. One data source can feed multiple knowledge bases, and Workflow nodes can also query it directly.
Import Methods
| Method | Use |
|---|---|
| Bulk upload | Drag in a batch of documents |
| Reference an existing data source | Link to a data source from another workspace or from the org level |
| Web crawl | Provide a URL; auto-fetch and parse |
| App sync (SaaS) | Sync from Notion / Feishu / DingTalk / Confluence, etc. |
| Manual text | Paste text directly |
Supported Formats
Focused on unstructured documents:
| Format | Parse behavior |
|---|---|
| PDF / Word | Text extraction + paragraph detection + image OCR (optional) |
| TXT / Markdown | Read directly |
| PPT | Per-page text + image extraction |
| CSV / Excel | Per-row / per-sheet parsing |
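
The format table above can be read as a lookup from file type to parse behavior. The following is a minimal sketch of that idea, not the product's parser: the extension list, `PARSE_BEHAVIOR`, and `parse_behavior` are hypothetical names chosen for illustration, since the table names formats rather than extensions.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the parse behavior in the table.
# The specific extensions are an assumption; the table lists formats only.
PARSE_BEHAVIOR = {
    ".pdf": "text extraction + paragraph detection + optional image OCR",
    ".docx": "text extraction + paragraph detection + optional image OCR",
    ".txt": "read directly",
    ".md": "read directly",
    ".pptx": "per-page text + image extraction",
    ".csv": "per-row / per-sheet parsing",
    ".xlsx": "per-row / per-sheet parsing",
}


def parse_behavior(filename: str) -> str:
    """Return the parse behavior for a file, or raise if the format is not listed."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the extension
    if suffix not in PARSE_BEHAVIOR:
        raise ValueError(f"unsupported format: {filename!r}")
    return PARSE_BEHAVIOR[suffix]
```

A check like this could run before upload so that unsupported files are rejected early instead of failing during parsing.
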
Document Management
Each document carries full metadata:
| Field | Description |
|---|---|
| Name / type / size | The file itself |
| Parse status | Pending / parsing / ready / failed |
| Source | Upload / crawl / reference / sync |
| Upload time / creator | Audit |
| Word count / token count | Capacity estimate |
Operations: Re-parse · Preview parsed result · Soft delete (recoverable).
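
The metadata fields and operations above can be pictured as a small record type. This is a sketch under assumptions: the class and field names (`Document`, `deleted_at`, and so on) are hypothetical, and modeling soft delete as a timestamp is one common choice rather than a statement about how the platform stores it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ParseStatus(Enum):
    PENDING = "pending"
    PARSING = "parsing"
    READY = "ready"
    FAILED = "failed"


class Source(Enum):
    UPLOAD = "upload"
    CRAWL = "crawl"
    REFERENCE = "reference"
    SYNC = "sync"


@dataclass
class Document:
    # Field names are illustrative; they mirror the rows of the metadata table.
    name: str
    file_type: str
    size_bytes: int
    source: Source
    creator: str
    uploaded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: ParseStatus = ParseStatus.PENDING
    word_count: int = 0
    token_count: int = 0
    deleted_at: Optional[datetime] = None  # set on soft delete, cleared on restore

    def reparse(self) -> None:
        """Queue the document for parsing again."""
        self.status = ParseStatus.PENDING

    def soft_delete(self) -> None:
        """Mark the document as deleted without discarding it."""
        self.deleted_at = datetime.now(timezone.utc)

    def restore(self) -> None:
        """Undo a soft delete."""
        self.deleted_at = None

    @property
    def is_deleted(self) -> bool:
        return self.deleted_at is not None
```

Keeping the deletion marker separate from the parse status means a restored document returns with whatever parse state it had before it was deleted.
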
File Groups and ACL
A data source can contain file groups, each with its own ACL (view / use / edit / manage).
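
As a rough illustration of per-group permissions, the sketch below models the four ACL levels as an ordered scale. Note the assumption: it treats the levels as cumulative (manage implies edit, edit implies use, use implies view), which the text above does not state. `Permission`, `FileGroup`, and the principal names are hypothetical.

```python
from enum import IntEnum


class Permission(IntEnum):
    # Assumption: each level includes all lower ones. The source only lists the names.
    VIEW = 1
    USE = 2
    EDIT = 3
    MANAGE = 4


class FileGroup:
    """A file group holding its own ACL, keyed by user or team name."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.acl: dict[str, Permission] = {}

    def grant(self, principal: str, level: Permission) -> None:
        self.acl[principal] = level

    def allows(self, principal: str, needed: Permission) -> bool:
        granted = self.acl.get(principal)
        # Principals without an entry have no access at all.
        return granted is not None and granted >= needed
```

For example, granting a hypothetical `legal-team` the `EDIT` level on a `contracts` group would let it view and edit files there, but not manage the group.
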
Import Settings
When uploading, you can specify:
| Setting | Description |
|---|---|
| Target group | Pick existing / create new |
| Chunk mode | Smart / general (only affects later vectorization) |
| File-type restrictions | Allowlist / denylist |
| Size limits | Per-file max / total quota |
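
The import settings above amount to a set of checks applied to each upload. A minimal sketch follows, assuming hypothetical names (`ImportSettings`, `check_upload`) and placeholder default limits that are not taken from the product.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ImportSettings:
    target_group: str
    chunk_mode: str = "smart"  # "smart" or "general"
    allowed_types: set[str] = field(default_factory=set)  # empty set: no allowlist
    denied_types: set[str] = field(default_factory=set)
    max_file_bytes: int = 100 * 1024 * 1024  # placeholder per-file limit
    total_quota_bytes: int = 1024**3  # placeholder total quota


def check_upload(settings: ImportSettings, filename: str, size_bytes: int,
                 used_bytes: int = 0) -> list[str]:
    """Return a list of problems; an empty list means the upload passes."""
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if settings.chunk_mode not in {"smart", "general"}:
        problems.append(f"unknown chunk mode: {settings.chunk_mode}")
    if ext in settings.denied_types:
        problems.append(f"file type denied: {ext}")
    elif settings.allowed_types and ext not in settings.allowed_types:
        problems.append(f"file type not in allowlist: {ext}")
    if size_bytes > settings.max_file_bytes:
        problems.append("file exceeds per-file size limit")
    if used_bytes + size_bytes > settings.total_quota_bytes:
        problems.append("upload would exceed total quota")
    return problems
```

Returning every problem at once, rather than stopping at the first, lets an upload dialog show the user all the issues in a single pass.
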
Data Source vs Knowledge Base
Key boundaries:
- A data source does not have to enter a knowledge base (a Workflow can consume it directly)
- The same data source can feed multiple knowledge bases (with different chunking strategies)
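
To make the second boundary concrete, the sketch below chunks the same parsed text in two different ways, as two knowledge bases fed by one data source might. The two chunkers are simple stand-ins chosen for illustration; they are not the platform's "smart" and "general" modes, and all names here are hypothetical.

```python
def chunk_paragraphs(text: str) -> list[str]:
    """Split on blank lines, dropping empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_fixed(text: str, size: int) -> list[str]:
    """Split into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# One parsed document from the data source...
parsed_text = "Intro paragraph.\n\nSecond paragraph with details."

# ...chunked differently for two knowledge bases.
kb_a_chunks = chunk_paragraphs(parsed_text)
kb_b_chunks = chunk_fixed(parsed_text, 20)
```

Because parsing happens once in the data source, only the chunking step differs between the two knowledge bases.
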
Anti-Patterns
- Binding a data source directly to an Agent (should bind a knowledge base instead)
- Uploading the same document twice (there is no automatic deduplication, so duplicates waste vector space)
- Not splitting huge files (PDFs over 100 MB parse slowly and are prone to parse failures)
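
Since duplicates are not removed automatically, one option is to deduplicate on the client side before uploading. The sketch below compares SHA-256 content hashes. It is an assumption about how a caller might guard against duplicates, not a platform feature, and `filter_new` is a hypothetical helper.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash file content so renamed copies are still detected."""
    return hashlib.sha256(data).hexdigest()


def filter_new(files: dict[str, bytes], seen_hashes: set[str]) -> list[str]:
    """Return names of files whose content has not been seen yet.

    `seen_hashes` is updated in place so later batches skip earlier content.
    """
    fresh = []
    for name, data in files.items():
        digest = content_hash(data)
        if digest in seen_hashes:
            continue  # identical content already uploaded or queued
        seen_hashes.add(digest)
        fresh.append(name)
    return fresh
```

Hashing the content rather than comparing file names catches renamed copies, though it will not catch near-duplicates such as a re-exported PDF with different bytes.
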
Next Steps
- Connect a data source to a knowledge base → Knowledge base
- Read directly in a Workflow → Workflow · Data nodes