Download Confluence pages and their index in multiple ways using Atlassian API and Python

SomeSunlight/confluenceDumpWithPython

 
 

Confluence Dump with Python

This toolbox exports content from a Confluence instance (Cloud or Data Center) into a static, navigable HTML archive and converts it into professional, hierarchical PDF documents.

Key Features:

  • Visual Fidelity: Fetches rendered HTML (export_view) to preserve macros, layouts, and formatting.
  • Navigation: Injects a fully functional, static navigation sidebar into every HTML page.
  • Offline Browsing: Localizes images and links, and downloads all attachments (PDFs, Office docs, etc.) for complete offline access.
  • Sort Order: Recursively scans the tree to ensure the manual sort order from Confluence is preserved.
  • Metadata Injection: Automatically adds Page Title, Author, and Modification Date to the top of every page.
  • Versioning: Creates timestamped output folders (e.g., 2025-11-21 1400 Space IT) for clean history management.
  • Professional PDF: Merges the content into a single PDF with TOC, Bookmarks, and mixed Portrait/Landscape orientation.
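
The timestamped folder naming from the Versioning item can be sketched as follows. This is a minimal illustration, not the toolbox's own code: the helper name `versioned_folder_name` is hypothetical, and the format string is inferred from the example folder name `2025-11-21 1400 Space IT`.

```python
from datetime import datetime


def versioned_folder_name(title: str, when: datetime) -> str:
    """Return '<YYYY-MM-DD> <HHMM> <title>' (hypothetical helper).

    The date-first layout means folder names sort chronologically
    when sorted as plain strings.
    """
    return f"{when.strftime('%Y-%m-%d %H%M')} {title}"


print(versioned_folder_name("Space IT", datetime(2025, 11, 21, 14, 0)))
# prints: 2025-11-21 1400 Space IT
```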

Toolbox Overview (Key Files)

  • confluenceDumpToHTML.py: The main downloader. Connects to Confluence, scrapes content, and creates the folder structure.
  • htmlToDoc.py: The publisher. Converts the downloaded HTML folder into a single PDF or a single-file Master HTML for LLMs.
  • confluence_products.ini: Configuration file for API URLs (Cloud vs. Data Center).
  • styles/: Contains CSS files. site.css (if present) is applied automatically. pdf_settings.css configures the PDF layout (A4/Letter, Margins).

Quick Start Guide

Follow these steps to create your first PDF export of a single page tree.

1. Setup

Install requirements and set your credentials.

pip install -r requirements.txt
# Windows Users: Install GTK3 Runtime for PDF generation!

Linux/Mac:

export CONFLUENCE_TOKEN="YourPersonalAccessToken"

Windows (Powershell):

$env:CONFLUENCE_TOKEN="YourPersonalAccessToken"

2. The Dump (Download)

Run the dumper for a specific page tree. This will create a new folder in output/.

# Example for Data Center
python3 confluenceDumpToHTML.py --base-url "https://confluence.corp.com" --profile dc --context-path "/wiki" -o "./output" tree -p "123456"

3. The Publication (PDF)

Look in the output folder. You will see a new folder such as 2025-01-27 0900 My Page Title. Pass this path to the PDF generator.

python3 htmlToDoc.py --site-dir "./output/2025-01-27 0900 My Page Title" --pdf

Result: You now have a ... .pdf inside that folder.
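
If you script both steps, the newest export folder can be picked up automatically. The sketch below assumes every subfolder of the output directory follows the timestamp-first naming shown above, so the name that sorts last is the most recent export. The helper `latest_export_dir` is hypothetical and not part of the toolbox.

```python
from pathlib import Path
from typing import Optional


def latest_export_dir(outdir: str) -> Optional[Path]:
    """Return the subfolder of outdir whose name sorts last, or None.

    Assumption: folder names start with 'YYYY-MM-DD HHMM', so string
    order matches chronological order.
    """
    folders = [p for p in Path(outdir).iterdir() if p.is_dir()]
    return max(folders, key=lambda p: p.name, default=None)
```

The returned path can then be passed as the --site-dir argument of htmlToDoc.py.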

Platform Support & Authentication

This script supports both Confluence Cloud and Confluence Data Center.

⚠️ Note on Cloud Verification: Cloud support has been ported to the new architecture, but the toolbox was primarily developed and tested against a Confluence Data Center environment.

Configuration

Define API paths in confluence_products.ini. Authentication is handled via Environment Variables:

  • Cloud: CONFLUENCE_USER (Email) and CONFLUENCE_TOKEN (API Token).
  • Data Center: CONFLUENCE_TOKEN (Personal Access Token).

⚠️ Troubleshooting Note for Data Center: If authentication fails, ensure you are connected to the VPN and that your admin allows Personal Access Tokens (PAT).
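
To make the difference between the two credential sets concrete, here is a minimal sketch of how they typically map to HTTP headers: Atlassian Cloud accepts Basic auth built from email and API token, while Data Center accepts a Personal Access Token as a Bearer token. The function `build_auth_headers` is hypothetical, and the profile name `cloud` is an assumption (only `dc` appears in this README); the toolbox may build its requests differently.

```python
import base64
from typing import Dict, Mapping


def build_auth_headers(profile: str, env: Mapping[str, str]) -> Dict[str, str]:
    """Build an Authorization header from environment variables.

    profile: "dc" for Data Center; "cloud" is an assumed name for Cloud.
    env: usually os.environ.
    """
    token = env.get("CONFLUENCE_TOKEN")
    if not token:
        raise RuntimeError("CONFLUENCE_TOKEN is not set")
    if profile == "cloud":
        user = env.get("CONFLUENCE_USER")
        if not user:
            raise RuntimeError("CONFLUENCE_USER is not set (required for Cloud)")
        # Basic auth: base64 of "email:api_token"
        basic = base64.b64encode(f"{user}:{token}".encode("utf-8")).decode("ascii")
        return {"Authorization": f"Basic {basic}"}
    # Data Center: Personal Access Token sent as a Bearer token
    return {"Authorization": f"Bearer {token}"}
```

Calling `build_auth_headers("dc", os.environ)` before a long export is a quick way to fail early when a variable is missing.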

Detailed Usage: Stage 1 (HTML Export)

Downloads pages, builds the index, and creates a clean HTML base.

python3 confluenceDumpToHTML.py [OPTIONS] <COMMAND> [ARGS]

Commands

  • space: Dumps an entire space. (-sp SPACEKEY)
  • tree: Dumps a specific page and its descendants. (-p PAGEID)
  • single: Dumps a single page. (-p PAGEID)
  • label: "Forest Mode". Dumps all pages with a specific label as root trees. (-l LABEL)
    • Use --exclude-label to prune specific subtrees (e.g. 'archived').
  • all-spaces: Dumps all visible spaces.

Common Options

  • -o, --outdir: Base output directory.
  • -t, --threads: Number of download threads (e.g., -t 8).
  • --css-file: Path to custom CSS (applied after standard styles).
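
For exporting several page trees in one go, the command line can be assembled from Python. This is a sketch under assumptions: the helpers `tree_dump_command` and `dump_trees` are hypothetical, they use only flags documented in this README, and they place the options before the command as in the synopsis above.

```python
import subprocess
import sys
from typing import Iterable, List, Optional


def tree_dump_command(base_url: str, page_id: str, outdir: str = "./output",
                      profile: str = "dc", context_path: Optional[str] = None,
                      threads: Optional[int] = None) -> List[str]:
    """Build the argv list for one 'tree' export (hypothetical helper)."""
    cmd = [sys.executable, "confluenceDumpToHTML.py",
           "--base-url", base_url, "--profile", profile]
    if context_path is not None:
        cmd += ["--context-path", context_path]
    cmd += ["-o", outdir]
    if threads is not None:
        cmd += ["-t", str(threads)]
    # The command and its own arguments come last
    cmd += ["tree", "-p", str(page_id)]
    return cmd


def dump_trees(base_url: str, page_ids: Iterable[str]) -> None:
    """Run one export per page ID; stops at the first failing run."""
    for page_id in page_ids:
        subprocess.run(tree_dump_command(base_url, page_id), check=True)
```

Passing the arguments as a list avoids shell quoting problems with folder names that contain spaces.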

Handling Complex Macros (Manual Overrides)

The Problem: Some Confluence pages (e.g. complex Table Filters) fail to render via API due to server-side timeouts or heavy client-side JavaScript.

The Solution:

  1. Open the page in Chrome/Edge.
  2. Choose Save as with the type "Webpage, Single File (*.mhtml)".
  3. Name the file manual_overrides/[PageID].mhtml.
  4. Run the dumper with --manual-overrides-dir "./manual_overrides". The script will extract the rendered state from the MHTML, clean it, and inject it into the pipeline.
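
An MHTML file is a MIME multipart message, so the rendered HTML can be pulled out with Python's standard `email` package. The following is only a sketch of that extraction step, assuming the first `text/html` part is the saved page; the function name `extract_html_from_mhtml` is hypothetical, and the script's actual extraction and cleaning logic is not shown in this README.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the decoded body of the first text/html part in an MHTML file."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (e.g. quoted-printable)
            # and decodes the bytes using the part's declared charset
            return part.get_content()
    raise ValueError("no text/html part found")
```

For a saved override this could be called as `extract_html_from_mhtml(Path("manual_overrides/123456.mhtml").read_bytes())`, where 123456 stands in for a real page ID.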

Detailed Usage: Stage 2 (Architecture Sandbox)

Allows re-organizing the structure (Index) locally without touching Confluence.

  1. Generate Editor:
    python3 create_editor.py --site-dir "./output/2025-01-01 Space IT"
    
  2. Edit: Open editor_sidebar.html. Use Drag & Drop to move pages/folders.
  3. Save: Click "Copy Markdown", paste into sidebar_edit.md.
  4. Apply:
    python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT"
    

Detailed Usage: Stage 3 (Document Generation)

Converts the dumped pages into a single document.

python3 htmlToDoc.py --site-dir "./output/2025-01-01 Space IT" --pdf

Options

  • --pdf: Generate PDF (via WeasyPrint).
  • --html: Generate single-file Master HTML (for LLM context windows).
  • --preview: Generate debug HTML (linked to local CSS).

Customizing the PDF

The layout is controlled by CSS files in the styles/ folder of your export:

  • pdf_settings.css: Configure Page Size (A4/Letter), Orientation, and Margins.
  • site.css: General styles (detected automatically).
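
Page size, orientation, and margins are expressed in CSS paged media through an `@page` rule, which WeasyPrint supports. The sketch below generates such a rule from Python. The helper `page_rule` is hypothetical, and the actual contents of pdf_settings.css in your export may be organized differently, so treat the output as a starting point to compare against that file.

```python
def page_rule(size: str = "A4", orientation: str = "portrait",
              margin: str = "2cm") -> str:
    """Return a CSS @page rule for the given page setup (hypothetical helper)."""
    if size not in ("A4", "Letter"):
        raise ValueError(f"unsupported page size: {size}")
    if orientation not in ("portrait", "landscape"):
        raise ValueError(f"unsupported orientation: {orientation}")
    return f"@page {{ size: {size} {orientation}; margin: {margin}; }}"


print(page_rule("Letter", "landscape", "1.5cm"))
# prints: @page { size: Letter landscape; margin: 1.5cm; }
```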
