Architecture

Overview

GraphArch is an extensible documentation generation tool. It processes several kinds of input sources and generates documentation in multiple output formats, so a single tool can cover a range of documentation scenarios.

Core Concepts

1. Input Sources

GraphArch distinguishes two categories of input sources: file-based sources, which are implemented, and database-based sources, which are planned.

File-Based Sources

File System: Process local files and directories
Git Repositories: Scan and process files from Git repositories
S3 Buckets: Access and process files from S3 storage

Each file-based source implements the FileSource trait, which provides consistent file access patterns and directory traversal. The implementations are exposed through the FileSourceImplementor enum, which currently supports:

LocalDirectorySource: For local file system access
GitRepositorySource: For Git repository access
S3BucketSource: For S3 bucket access
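A minimal sketch of how a trait plus an enum of implementors could fit together. Only the type names FileSource, FileSourceImplementor, LocalDirectorySource, GitRepositorySource and S3BucketSource come from this document; the `list_files` method, the struct fields and the error handling are assumptions made for illustration, and the remote variants are deliberately left unimplemented.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical shape of the `FileSource` trait; the method name is an assumption.
pub trait FileSource {
    /// Returns every non-directory entry below the source root, sorted.
    fn list_files(&self) -> io::Result<Vec<PathBuf>>;
}

pub struct LocalDirectorySource {
    pub root: PathBuf,
}

pub struct GitRepositorySource {
    pub url: String,
}

pub struct S3BucketSource {
    pub bucket: String,
}

/// Recursively collects all non-directory entries below `dir`.
fn walk(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            walk(&path, out)?;
        } else {
            out.push(path);
        }
    }
    Ok(())
}

impl FileSource for LocalDirectorySource {
    fn list_files(&self) -> io::Result<Vec<PathBuf>> {
        let mut files = Vec::new();
        walk(&self.root, &mut files)?;
        files.sort();
        Ok(files)
    }
}

/// Enum dispatch: one variant per supported source type.
pub enum FileSourceImplementor {
    LocalDirectory(LocalDirectorySource),
    GitRepository(GitRepositorySource),
    S3Bucket(S3BucketSource),
}

impl FileSource for FileSourceImplementor {
    fn list_files(&self) -> io::Result<Vec<PathBuf>> {
        match self {
            Self::LocalDirectory(source) => source.list_files(),
            // Remote sources are out of scope for this sketch.
            Self::GitRepository(_) | Self::S3Bucket(_) => Err(io::Error::new(
                io::ErrorKind::Unsupported,
                "remote sources are not part of this sketch",
            )),
        }
    }
}
```

With an enum, callers can hold any source by value and still call the trait methods, without needing trait objects.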

Database-Based Sources

SPARQL Endpoints: Query and process data from RDF databases (planned)
Database Connections: Support for various database types (planned)

Database sources will implement a separate trait (e.g., DatabaseSource) that handles connection management, query execution, and result processing. None of this is implemented yet.

2. Data Loading and Storage

The tool keeps data in two separate stores, one for raw input and one for processed documentation:

Loader Store

An in-memory triplestore (currently using Oxigraph) that stores the raw data from input sources
Kept in memory for speed; persistence is planned
Includes a file registry system that:
- Tracks loaded files using SHA-256 hashes
- Stores file metadata (size, creation time, modification time)
- Maintains relationships between file contents and file paths
Provides methods for:
- Registering new files
- Inserting RDF quads
- Querying the store using SPARQL
Stores data in RDF format, enabling semantic querying and processing
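The file registry described above can be sketched as a map from content hash to file record. This is an illustration under stated assumptions, not the actual registry: all type and method names are hypothetical, `DefaultHasher` from the standard library stands in for SHA-256 (it is not cryptographic), and creation and modification times are omitted to keep the example short.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeSet, HashMap};
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Content fingerprint. Stand-in for a SHA-256 digest: `DefaultHasher`
/// is NOT cryptographic and is used only to keep this sketch std-only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ContentHash(u64);

impl ContentHash {
    pub fn of(bytes: &[u8]) -> Self {
        let mut hasher = DefaultHasher::new();
        bytes.hash(&mut hasher);
        ContentHash(hasher.finish())
    }
}

/// Metadata kept per distinct file content (timestamps omitted here).
#[derive(Debug)]
pub struct FileRecord {
    pub size: u64,
    /// Every path under which this exact content was seen.
    pub paths: BTreeSet<PathBuf>,
}

#[derive(Default)]
pub struct FileRegistry {
    by_hash: HashMap<ContentHash, FileRecord>,
}

impl FileRegistry {
    /// Registers a file and returns `true` when its content was not seen before.
    pub fn register(&mut self, path: &Path, content: &[u8]) -> bool {
        let hash = ContentHash::of(content);
        let is_new = !self.by_hash.contains_key(&hash);
        let record = self.by_hash.entry(hash).or_insert_with(|| FileRecord {
            size: content.len() as u64,
            paths: BTreeSet::new(),
        });
        record.paths.insert(path.to_path_buf());
        is_new
    }

    /// Returns all known paths that hold exactly this content.
    pub fn paths_for(&self, content: &[u8]) -> Vec<PathBuf> {
        self.by_hash
            .get(&ContentHash::of(content))
            .map(|record| record.paths.iter().cloned().collect())
            .unwrap_or_default()
    }

    /// Number of distinct file contents in the registry.
    pub fn len(&self) -> usize {
        self.by_hash.len()
    }
}
```

Keying on the content hash means that a file loaded twice under different paths is stored once, while both paths stay associated with it.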

Documentation Model

A separate triplestore that holds processed, documentable content
Strictly follows the GraphArch ontology for all stored data

Provides a Rust API for adding and manipulating documentation elements:

// Example of the Rust API
struct Book {
    title: String,
    authors: Vec<String>,
    // ... other fields defined in GraphArch ontology
}

struct Section {
    title: String,
    description: Option<String>,
    // ... other fields defined in GraphArch ontology
}

struct Chapter {
    title: String,
    content: String,
    // ... other fields defined in GraphArch ontology
}

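// `Result` is assumed to be a crate-wide alias (for example `anyhow::Result`).
// Bodies are stubbed with `todo!()`; only the signatures matter here.
impl DocumentationModel {
    fn add_book(&self, book: Book) -> Result<()> { todo!() }
    fn add_section(&self, section: Section) -> Result<()> { todo!() }
    fn add_chapter(&self, chapter: Chapter) -> Result<()> { todo!() }
    // ... other methods for ontology-defined elements
}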

Transforms raw data into documentation structure:
- OWL Ontologies → Books
- OWL Classes → Chapters within Sections
- OWL Properties → Additional chapters
- Labels and comments → Content and descriptions
All SPARQL queries and RDF manipulation are encapsulated within the model module
External code interacts only with Rust structs and methods, never directly with RDF
The model module may be extracted into a separate crate in the future
Uses Dublin Core terms for metadata as defined in the GraphArch ontology:
- dc:title for document titles
- dc:creator for authors
Independent of output format, allowing for flexible documentation generation
Enables cross-referencing and linking between different parts of the documentation
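The encapsulation rule can be illustrated with a small sketch in which the store is a private field and callers only ever see Rust structs. Several things here are assumptions: a plain vector of string triples stands in for the real triplestore, the `books` accessor and the `urn:example:book:` subject scheme are hypothetical, and the `dc:` prefix is assumed to be bound to the Dublin Core Elements 1.1 namespace.

```rust
use std::cell::RefCell;

// Assumption: `dc:` is bound to the Dublin Core Elements 1.1 namespace.
const DC_TITLE: &str = "http://purl.org/dc/elements/1.1/title";
const DC_CREATOR: &str = "http://purl.org/dc/elements/1.1/creator";

#[derive(Debug, Clone, PartialEq)]
pub struct Book {
    pub title: String,
    pub authors: Vec<String>,
}

/// (subject, predicate, object) as plain strings; a stand-in for a real triplestore.
type Triple = (String, String, String);

#[derive(Default)]
pub struct DocumentationModel {
    // Private: code outside this module cannot reach the triples.
    // `RefCell` allows mutation behind `&self`, matching the `&self` API above.
    triples: RefCell<Vec<Triple>>,
}

impl DocumentationModel {
    /// Translates a `Book` into triples using Dublin Core predicates.
    pub fn add_book(&self, book: Book) -> Result<(), String> {
        if book.title.trim().is_empty() {
            return Err("a book needs a non-empty title".to_string());
        }
        let mut triples = self.triples.borrow_mut();
        // Hypothetical subject scheme: one running number per book.
        let index = triples.iter().filter(|t| t.1 == DC_TITLE).count();
        let subject = format!("urn:example:book:{}", index);
        triples.push((subject.clone(), DC_TITLE.to_string(), book.title));
        for author in book.authors {
            triples.push((subject.clone(), DC_CREATOR.to_string(), author));
        }
        Ok(())
    }

    /// Translates the stored triples back into `Book` values.
    pub fn books(&self) -> Vec<Book> {
        let triples = self.triples.borrow();
        let mut books = Vec::new();
        for (subject, predicate, object) in triples.iter() {
            if predicate.as_str() != DC_TITLE {
                continue;
            }
            let authors = triples
                .iter()
                .filter(|t| &t.0 == subject && t.1 == DC_CREATOR)
                .map(|t| t.2.clone())
                .collect();
            books.push(Book { title: object.clone(), authors });
        }
        books
    }
}
```

Because the triples never leave the struct, the storage backend could be swapped without touching any caller.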

3. Documentors

Documentors are specialized processors that:

Scan the Loader Store for specific types of documentable content
Transform raw data into structured documentation elements
Add processed content to the Documentation Model

Currently implemented documentors:

OWL Ontology Documentor:
- Processes OWL ontologies and their classes
- Creates a Book for each ontology
- Creates a Section for the ontology’s classes
- Creates Chapters for each OWL class
- Transforms labels and comments into documentation content
- Supports iteration over OWL classes for detailed processing
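The mapping performed by the OWL Ontology Documentor can be sketched as a pure function. This is a simplified illustration: the `OwlOntology` and `OwlClass` input types, the nested shape of `Book`, `Section` and `Chapter`, and the section title "Classes" are all assumptions, and the real documentor reads its input from the Loader Store rather than from ready-made structs.

```rust
/// Hypothetical, already-extracted view of an OWL class.
pub struct OwlClass {
    pub label: String,
    pub comment: Option<String>,
}

/// Hypothetical, already-extracted view of an OWL ontology.
pub struct OwlOntology {
    pub label: String,
    pub classes: Vec<OwlClass>,
}

#[derive(Debug, PartialEq)]
pub struct Chapter {
    pub title: String,
    pub content: String,
}

#[derive(Debug, PartialEq)]
pub struct Section {
    pub title: String,
    pub chapters: Vec<Chapter>,
}

#[derive(Debug, PartialEq)]
pub struct Book {
    pub title: String,
    pub sections: Vec<Section>,
}

/// One Book per ontology, one Section for its classes, one Chapter per class.
pub fn document_ontology(ontology: &OwlOntology) -> Book {
    let chapters = ontology
        .classes
        .iter()
        .map(|class| Chapter {
            // The class label becomes the chapter title.
            title: class.label.clone(),
            // The class comment becomes the chapter content (empty if absent).
            content: class.comment.clone().unwrap_or_default(),
        })
        .collect();
    Book {
        title: ontology.label.clone(),
        sections: vec![Section {
            title: "Classes".to_string(),
            chapters,
        }],
    }
}
```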

Planned documentors:

SHACL Documentor: For processing SHACL shapes
SKOS Documentor: For processing SKOS taxonomies
Markdown Documentor: For processing markdown content
Additional support for ODRL, DCAT, DPROD, Croissant

Each documentor implements the Documentor trait with:

file_types(): Returns supported file types
generate(): Processes content and updates the Documentation Model
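A minimal sketch of what the Documentor trait might look like. Only the trait name and the two method names come from this document; the parameter and return types, the `LoaderStore` and `DocumentationModel` stand-ins, and the `run_documentors` helper are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the Loader Store: (file name, file type) pairs.
pub struct LoaderStore {
    pub files: Vec<(String, String)>,
}

/// Hypothetical stand-in for the Documentation Model: just a list of book titles.
#[derive(Default)]
pub struct DocumentationModel {
    pub book_titles: Vec<String>,
}

pub trait Documentor {
    /// File types this documentor knows how to process.
    fn file_types(&self) -> Vec<&'static str>;

    /// Reads from the Loader Store and adds content to the Documentation Model.
    fn generate(&self, store: &LoaderStore, model: &mut DocumentationModel) -> Result<(), String>;
}

pub struct OwlDocumentor;

impl Documentor for OwlDocumentor {
    fn file_types(&self) -> Vec<&'static str> {
        vec!["ttl", "owl"]
    }

    fn generate(&self, store: &LoaderStore, model: &mut DocumentationModel) -> Result<(), String> {
        let supported = self.file_types();
        for (name, file_type) in &store.files {
            // Skip files of types this documentor does not handle.
            if supported.iter().any(|t| *t == file_type.as_str()) {
                model.book_titles.push(format!("Ontology from {}", name));
            }
        }
        Ok(())
    }
}

/// Runs every documentor against the same store and the same model.
pub fn run_documentors(
    documentors: &[Box<dyn Documentor>],
    store: &LoaderStore,
    model: &mut DocumentationModel,
) -> Result<(), String> {
    for documentor in documentors {
        documentor.generate(store, model)?;
    }
    Ok(())
}
```

Adding a planned documentor (SHACL, SKOS, Markdown) would then mean adding one more trait implementation to the list passed to `run_documentors`.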

4. Output Generators

Output generators transform the Documentation Model into specific output formats:

Typst Generator: Produces PDF documentation using Typst
- Typst is a typesetting system that is at least as capable as LaTeX for this purpose
Markdown Generator: Creates Markdown files
HTML Generator: Generates static websites
(Future) Support for additional output formats

Each generator:

Reads from the Documentation Model
Transforms the abstract documentation structure into format-specific content
Handles cross-referencing and linking appropriately for its output format
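To make the generator role concrete, here is a small sketch of a Markdown generator. The `Generator` trait, its signature, the flattened `Book` shape and the anchor scheme are assumptions for illustration; the real generators read from the DocumentationModel API rather than from a struct passed in directly.

```rust
pub struct Chapter {
    pub title: String,
    pub content: String,
}

pub struct Book {
    pub title: String,
    pub chapters: Vec<Chapter>,
}

/// Hypothetical generator interface: one method that renders a book to text.
pub trait Generator {
    fn generate(&self, book: &Book) -> String;
}

pub struct MarkdownGenerator;

/// Turns a title into a lowercase, hyphen-separated anchor.
fn slug(title: &str) -> String {
    title
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join("-")
}

impl Generator for MarkdownGenerator {
    fn generate(&self, book: &Book) -> String {
        let mut out = format!("# {}\n\n", book.title);
        // Table of contents: format-specific cross-references to each chapter.
        for chapter in &book.chapters {
            out.push_str(&format!("- [{}](#{})\n", chapter.title, slug(&chapter.title)));
        }
        // Chapter bodies, each under a second-level heading.
        for chapter in &book.chapters {
            out.push_str(&format!("\n## {}\n\n{}\n", chapter.title, chapter.content));
        }
        out
    }
}
```

A Typst or HTML generator would implement the same trait and differ only in how headings and links are written.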

Data Flow

The tool follows a three-stage data processing pipeline:

graph LR
    subgraph Input Sources
        FS[File System]
        Git[Git Repo]
        S3[S3 Bucket]
        SPARQL[SPARQL Endpoint]
    end

    subgraph Stage 1: Loading
        LS[Loader Store]
        FR[File Registry]
    end

    subgraph Stage 2: Processing
        D1[OWL Documentor]
        D2[SHACL Documentor]
        D3[SKOS Documentor]
        D4[Markdown Documentor]
    end

    subgraph Stage 3: Generation
        DM[Documentation Model]
        TG[Typst Generator]
        MG[Markdown Generator]
        HG[HTML Generator]
    end

    FS --> LS
    Git --> LS
    S3 --> LS
    SPARQL --> LS

    LS --> FR
    FR --> LS

    LS --> D1
    LS --> D2
    LS --> D3
    LS --> D4

    D1 --> DM
    D2 --> DM
    D3 --> DM
    D4 --> DM

    DM --> TG
    DM --> MG
    DM --> HG

Each stage has a specific responsibility:

Input Processing: Various sources feed their data into the Loader Store
Content Processing: Documentors analyze the Loader Store and build the Documentation Model
Output Generation: Generators transform the Documentation Model into final documentation

The diagram shows how:

Multiple input sources can feed into the Loader Store
The File Registry tracks and manages loaded files
Different documentors can process the same data in the Loader Store
All documentors contribute to the same Documentation Model
Multiple generators can create different output formats from the same Documentation Model

Future Enhancements

Persistent Loader Store with caching
Additional input source types (SPARQL endpoints, databases)
More documentable content types (SHACL, SKOS, etc.)
New output format generators
Enhanced cross-referencing capabilities
Performance optimizations

Technical Implementation

The tool is implemented in Rust, leveraging:

Async/await for efficient I/O operations
Trait-based abstractions for extensibility
RDF/SPARQL for semantic data processing (encapsulated in the model module)
Strong type system for safety and maintainability
Oxigraph for RDF storage and querying
Tokio for async runtime
Clap for command-line interface

Module Structure and Rules

Model Module (`src/model`)

Core data structures and business logic
Exclusive owner of the SPARQL queries regarding the Documentation Model
Provides type-safe Rust API for other modules to interact with documentation data
All communication with other modules must be through Rust structs and methods
No direct SPARQL exposure to external modules

Generator Module (`src/generator`)

Handles transformation of Documentation Model into various output formats
Must use the API of the DocumentationModel, never direct SPARQL queries
Implementations:
- Console Generator: ANSI-colored terminal output
- Typst Generator: PDF documentation
- Markdown Generator: Markdown files
- HTML Generator: Static websites

Source Module (`src/source`)

Handles input source management
Abstracts file system, Git repositories, S3 access, database access
Primary traits are Source, FileSource and DatabaseSource
Provides unified interface for reading source content
Responsible for source traversal and content extraction

Loader Module (`src/loader`)

Manages loading of source content into the Loader Store
Primary trait is Loader
Every Loader implementation handles a different file format (e.g. RDF or Markdown)
Maintains file registry and content tracking
Coordinates with Source module for content access
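The idea of one Loader per file format can be sketched as follows. Only the trait name Loader comes from this document; the method names, the `LoaderStore` stand-in, the `load_file` helper and the toy line format (IRIs only, `<s> <p> <o> .`, no literals or blank nodes) are assumptions, and a real loader would use a proper RDF parser.

```rust
use std::path::Path;

/// (subject, predicate, object) kept as raw strings, angle brackets included.
pub type Triple = (String, String, String);

/// Hypothetical stand-in for the Loader Store.
#[derive(Default)]
pub struct LoaderStore {
    pub triples: Vec<Triple>,
}

pub trait Loader {
    /// Whether this loader handles the file at `path` (decided by extension here).
    fn supports(&self, path: &Path) -> bool;

    /// Parses `content` into the store and returns the number of triples added.
    fn load(&self, content: &str, store: &mut LoaderStore) -> Result<usize, String>;
}

/// Toy loader for lines of the form `<s> <p> <o> .`; not a full N-Triples parser.
pub struct NTriplesLoader;

impl Loader for NTriplesLoader {
    fn supports(&self, path: &Path) -> bool {
        path.extension().and_then(|ext| ext.to_str()) == Some("nt")
    }

    fn load(&self, content: &str, store: &mut LoaderStore) -> Result<usize, String> {
        let mut loaded = 0;
        for (index, line) in content.lines().enumerate() {
            let line = line.trim();
            // Blank lines and comments carry no data.
            if line.is_empty() || line.starts_with('#') {
                continue;
            }
            let parts: Vec<&str> = line.split_whitespace().collect();
            if parts.len() != 4 || parts[3] != "." {
                return Err(format!("line {}: expected `<s> <p> <o> .`", index + 1));
            }
            store.triples.push((
                parts[0].to_string(),
                parts[1].to_string(),
                parts[2].to_string(),
            ));
            loaded += 1;
        }
        Ok(loaded)
    }
}

/// Picks the first loader that supports the path and delegates to it.
pub fn load_file(
    loaders: &[Box<dyn Loader>],
    path: &Path,
    content: &str,
    store: &mut LoaderStore,
) -> Result<usize, String> {
    match loaders.iter().find(|loader| loader.supports(path)) {
        Some(loader) => loader.load(content, store),
        None => Err(format!("no loader for {}", path.display())),
    }
}
```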

Documentor Module (`src/documentor`)

Processes the raw content that the Loaders place in the Loader Store into documentation structures, such as Books and Chapters, that are managed by the DocumentationModel
Transforms source-specific content (OWL, SHACL, SKOS, etc.) into generic documentation elements
Works with Model module to create proper documentation structure
Must use Model’s Rust API for all documentation operations

Store Module (`src/store`)

Manages the raw data store (i.e. the Loader Store)
Provides low-level store operations
Coordinates with Loader module for content storage
Maintains file registry system
Used by the Loaders as their output store

Util Module (`src/util`)

Common utilities and helper functions
Shared types and constants
Cross-cutting concerns like logging and error handling

RDF Const Module (`src/rdf_const`)

Constants for common RDF predicates of various ontologies including the GraphArch ontology
Defining each RDF predicate once as a constant lets the Rust compiler check every use, instead of scattering predicate strings across the codebase
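One way to get compiler-checked predicates is a newtype with a constant per predicate. This is a sketch under assumptions: the `Predicate` type, the constant names and the `triple_line` helper are hypothetical, and only well-known RDF, RDFS and Dublin Core IRIs are shown because the GraphArch ontology IRIs are not given in this document.

```rust
/// Newtype so that a predicate cannot be confused with an arbitrary string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Predicate(&'static str);

impl Predicate {
    /// The full IRI of the predicate.
    pub const fn iri(self) -> &'static str {
        self.0
    }
}

pub const RDF_TYPE: Predicate = Predicate("http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
pub const RDFS_LABEL: Predicate = Predicate("http://www.w3.org/2000/01/rdf-schema#label");
pub const RDFS_COMMENT: Predicate = Predicate("http://www.w3.org/2000/01/rdf-schema#comment");
// Assumption: `dc:` refers to the Dublin Core Elements 1.1 namespace.
pub const DC_TITLE: Predicate = Predicate("http://purl.org/dc/elements/1.1/title");

/// Formats one triple with a literal object. Accepting `Predicate` instead of
/// `&str` means a misspelled predicate name fails to compile.
/// Note: this toy helper does not escape quotes inside the literal.
pub fn triple_line(subject_iri: &str, predicate: Predicate, literal: &str) -> String {
    format!("<{}> <{}> \"{}\" .", subject_iri, predicate.iri(), literal)
}
```

A typo such as `RDFS_LABLE` is then a compile error, whereas a mistyped IRI string would only show up as a query that silently returns nothing.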

Principles

The codebase follows these architectural principles:

Clear separation between raw data (Loader Store) and processed documentation (Documentation Model)
All RDF/SPARQL operations on the Documentation Model are isolated to the model module and must never leak outside
External code interacts with the Documentation Model through a type-safe Rust API
The GraphArch ontology defines the structure of all documentation data
Future extraction of the model module into a separate crate is supported
Each module has a single responsibility and clear boundaries
Inter-module communication happens through well-defined Rust types and traits

Model Module Guidelines

The model module (`src/model`) has special responsibilities and restrictions:

All SPARQL queries against the DocumentationModel's Store must be contained within the model module
The Store field in DocumentationModel is private and must never be accessed directly
External code must use the type-safe Rust API methods provided by DocumentationModel
No SPARQL queries or RDF operations should be visible outside the model module
The model module is responsible for translating between RDF data and Rust types