ConstCorrectness/horrible-man-pages

Horrible Man Pages (HMP MCP)

Horrible Man Pages is a Python utility that processes, cleans, and indexes compressed Linux/Unix man pages into a ChromaDB vector database. This turns your system's manual pages into a semantically searchable knowledge base, ready for downstream NLP tasks, RAG (Retrieval-Augmented Generation), and natural-language querying.


🚀 Features

  • Automated Discovery: Automatically locates man pages on your system via the MANPATH environment variable.
  • Robust Parsing Pipeline: Handles .gz compressed files and accurately converts raw troff typesetting into clean, readable ASCII text.
  • Artifact Removal: Strips out backspaces, overstrikes (used for bolding/underlining in terminals), and modern ANSI color codes.
  • Parallel Processing: Leverages multi-core architectures to process thousands of man pages concurrently for rapid indexing.
  • Vector Storage: Integrates seamlessly with ChromaDB for persistent, semantic search capabilities.
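As an illustration of the discovery step, here is a minimal sketch of how gzip-compressed man pages could be located from MANPATH. The function name `discover_man_pages`, the `fallback` parameter, and the `/usr/share/man` default are assumptions made for this example, not the project's actual API.

```python
import os
from pathlib import Path


def discover_man_pages(manpath=None, fallback=("/usr/share/man",)):
    """Return sorted paths of gzip-compressed man pages found under MANPATH.

    Hypothetical helper: `fallback` is only used when MANPATH is empty or unset.
    """
    if manpath is None:
        manpath = os.environ.get("MANPATH", "")
    # Empty components (e.g. a leading ':') mean "system default"; skip them here.
    roots = [p for p in manpath.split(os.pathsep) if p] or list(fallback)
    pages = set()
    for root in roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue
        # Sections live in directories such as man1, man3, man8.
        for section_dir in root_path.glob("man*"):
            if section_dir.is_dir():
                pages.update(section_dir.glob("*.gz"))
    return sorted(pages)
```

Calling `discover_man_pages()` with no arguments reads MANPATH from the environment. On systems where MANPATH is unset, the output of the `manpath` command can be passed in explicitly instead.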

🛠️ The Processing Pipeline

The core challenge in extracting clean text from man pages is that they are written in the troff typesetting language and usually distributed as gzip-compressed files. Rendering them for a terminal often leaves artifacts, such as backspace overstrikes for bold/underlined text and ANSI escape codes.

Without proper cleaning, embedding models (like OpenAI's text-embedding-3-small or HuggingFace models) would receive malformed tokens (e.g., seeing N\bNA\bAM\bME\bE instead of NAME).
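To make the overstrike problem concrete, the sketch below removes `character + backspace` pairs in pure Python. It only approximates what `col -b` does for the simple bold and underline cases and is not a replacement for it; the name `strip_overstrikes` is hypothetical.

```python
import re

# Terminal bold is encoded as "X\bX" and underline as "_\bX".
# Dropping every "character + backspace" pair keeps only the final glyph.
_OVERSTRIKE = re.compile(r".\x08")


def strip_overstrikes(text):
    """Remove simple backspace overstrikes (does not handle runs of backspaces)."""
    return _OVERSTRIKE.sub("", text)


print(strip_overstrikes("N\bNA\bAM\bME\bE"))   # NAME
print(strip_overstrikes("_\bf_\bi_\bl_\be"))   # file
```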

Our robust pipeline ensures embedding models receive clean text:

flowchart TD
    A["[COMPRESSED TROFF .gz]<br>.SH NAME \n ls \- list..."] -->|gzip.decompress| B
    B["[RAW TROFF bytes]<br>.SH NAME \n ls \- list..."] -->|groff -Tascii -man| C
    C["[FORMATTED WITH OVERSTRIKES]<br>N\bNA\bAM\bME\bE \n ls - list..."] -->|col -b| D
    D["[CLEAN TEXT]<br>NAME \n ls - list..."] -->|strip ANSI codes| E
    E["[FINAL ASCII TEXT]<br>NAME \n ls - list..."]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px

💻 Technical Implementation

Here is an abstracted look at the core logic used to parse the man pages:

import gzip
import re
import subprocess

def strip_ansi_codes(text):
    r"""Remove ANSI escape sequences (e.g., \x1b[31m for red text)."""
    ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
    return ansi_escape.sub('', text)

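# 1. Read and decompress the man page
# (example path, an assumption for illustration; adjust for your system)
with open('/usr/share/man/man1/ls.1.gz', 'rb') as f:
    file_data = f.read()
raw_bytes = gzip.decompress(file_data)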

# 2. Convert raw troff to ASCII man page format
groff_output = subprocess.run(
    ['groff', '-Tascii', '-man'], 
    input=raw_bytes, 
    capture_output=True
).stdout

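# 3. Strip backspace/overstrike characters
# (groff_output is bytes, so col runs in binary mode and the result is decoded afterwards)
clean_text = subprocess.run(
    ['col', '-b'],
    input=groff_output,
    capture_output=True
).stdout.decode('ascii', errors='replace')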

# 4. Remove any lingering ANSI escape codes
final_safe_text = strip_ansi_codes(clean_text)

📦 Project Structure

  • horrible/ManPageIndexer.py: Contains the primary ManPageIndexer class responsible for discovery, parallel parsing, and database insertion.
  • create_chromadb.py: The entry point script to initialize the ingestion process.
  • man_pages_db/: The default local directory where ChromaDB stores its persistent vector indices (ignored in version control).
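The parallel parsing mentioned above could look roughly like the following sketch. It is not the `ManPageIndexer` implementation: the helper name `parse_in_parallel` and its parameters are assumptions, and it uses a thread pool rather than a process pool for simplicity. Threads can still keep several cores busy when each task mostly waits on external `groff` and `col` processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_in_parallel(paths, parse_one, max_workers=8):
    """Apply parse_one to every path concurrently.

    Returns (results, errors): two dicts keyed by path, holding the parsed
    text and the raised exception respectively.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was created for.
        futures = {pool.submit(parse_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One malformed page should not abort the whole indexing run.
                errors[path] = exc
    return results, errors
```

Here `parse_one` stands for any callable that turns one man page path into clean text, for example a function wrapping the decompress, `groff`, `col`, and ANSI-stripping steps. Collecting failures separately makes it possible to log or retry individual pages after the run.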

🚀 Getting Started

First, ensure you have initialized the project with uv and populated the ChromaDB database with your system's man pages:

uv sync
uv run main.py

🔌 Install in Cursor

  1. Open Cursor and navigate to Settings > Features > MCP.
  2. Click + Add New MCP Server.
  3. Set Type to command.
  4. Set Name to Horrible Man Pages.
  5. Set Command to uv run /absolute/path/to/horrible-man-pages/man_mcp.py (replace with your absolute path).

🖥️ Install in Claude Desktop

Add the following to your claude_desktop_config.json (usually located at ~/Library/Application Support/Claude/claude_desktop_config.json on macOS or %APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "horrible-man-pages": {
      "command": "uv",
      "args": [
        "run",
        "man_mcp.py"
      ],
      "cwd": "/absolute/path/to/horrible-man-pages"
    }
  }
}

💻 Install in Claude Code

Run the following command in your terminal, replacing the path with your absolute path to the directory:

claude mcp add horrible-man-pages uv -- run /absolute/path/to/horrible-man-pages/man_mcp.py
