Horrible Man Pages is a powerful Python utility designed to process, clean, and index compressed Linux/Unix man pages into a ChromaDB vector database. This effectively transforms your system's manual pages into a semantic, searchable knowledge base, ready for downstream NLP tasks, RAG (Retrieval-Augmented Generation), and advanced querying.
- Automated Discovery: Automatically locates man pages on your system via the `MANPATH` environment variable.
- Robust Parsing Pipeline: Handles `.gz`-compressed files and accurately converts raw `troff` typesetting into clean, readable ASCII text.
- Artifact Removal: Strips out backspaces, overstrikes (used for bolding/underlining in terminals), and modern ANSI color codes.
- Parallel Processing: Leverages multi-core architectures to process thousands of man pages concurrently for rapid indexing.
- Vector Storage: Integrates seamlessly with ChromaDB for persistent, semantic search capabilities.
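The `MANPATH`-based discovery step can be pictured with a short sketch. This is a hypothetical helper, not the project's actual code; it assumes the conventional `man1/`, `man2/`, … section layout under each `MANPATH` directory:

```python
import os
from pathlib import Path

def discover_man_pages(manpath=None):
    """Yield compressed man pages from every directory on MANPATH."""
    # MANPATH is a colon-separated list, e.g. "/usr/share/man:/usr/local/man"
    manpath = manpath or os.environ.get("MANPATH", "/usr/share/man")
    for root in manpath.split(":"):
        root = Path(root)
        if root.is_dir():
            # Section subdirectories follow the man1/, man2/, ... convention
            yield from sorted(root.glob("man*/*.gz"))
```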
The core challenge in extracting clean text from man pages is that they are written in the troff typesetting language and usually distributed as gzip-compressed archives. Processing them often leaves artifacts such as backspaces for bold/underlined text and ANSI escape codes.
Without proper cleaning, embedding models (like OpenAI's `text-embedding-3-small` or HuggingFace models) would receive malformed tokens (e.g., seeing `N\bNA\bAM\bME\bE` instead of `NAME`).
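To see what the cleanup buys you, here is a toy illustration of collapsing overstrike sequences in pure Python (the actual pipeline delegates this to `col -b`; the helper name here is made up):

```python
import re

def strip_overstrikes(text):
    r"""Collapse troff-style overstrikes: bold is X\bX, underline is _\bX."""
    # Drop every "character + backspace" pair, keeping the final character.
    return re.sub('.\x08', '', text)

print(strip_overstrikes("N\bNA\bAM\bME\bE"))  # -> NAME
```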
Our robust pipeline ensures embedding models receive clean text:
```mermaid
flowchart TD
    A["[COMPRESSED TROFF .gz]<br>.SH NAME \n ls \- list..."] -->|gzip.decompress| B
    B["[RAW TROFF bytes]<br>.SH NAME \n ls \- list..."] -->|groff -Tascii -man| C
    C["[FORMATTED WITH OVERSTRIKES]<br>NAME\bNAME \n ls - list..."] -->|col -b| D
    D["[CLEAN TEXT]<br>NAME \n ls - list..."] -->|strip ANSI codes| E
    E["[FINAL ASCII TEXT]<br>NAME \n ls - list..."]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px
```
Here is an abstracted look at the core logic used to parse the man pages:
```python
import gzip
import re
import subprocess

def strip_ansi_codes(text):
    r"""Removes ANSI escape sequences (e.g., \x1b[31m for red text)."""
    ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
    return ansi_escape.sub('', text)

# 1. Decompress the man page
raw_bytes = gzip.decompress(file_data)

# 2. Convert raw troff to ASCII man page format
groff_output = subprocess.run(
    ['groff', '-Tascii', '-man'],
    input=raw_bytes,
    capture_output=True
).stdout

# 3. Strip backspace/overstrike characters (col receives the bytes from
#    groff, so decode afterwards rather than mixing bytes with text=True)
clean_text = subprocess.run(
    ['col', '-b'],
    input=groff_output,
    capture_output=True
).stdout.decode('utf-8', errors='replace')

# 4. Remove any lingering ANSI escape codes
final_safe_text = strip_ansi_codes(clean_text)
```

- `horrible/ManPageIndexer.py`: Contains the primary `ManPageIndexer` class responsible for discovery, parallel parsing, and database insertion.
- `create_chromadb.py`: The entry point script to initialize the ingestion process.
- `man_pages_db/`: The default local directory where ChromaDB stores its persistent vector indices (ignored in version control).
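The parallel step might be wired up along these lines. This is a minimal sketch: `parse_man_page` is a stand-in for the real cleanup pipeline, and the worker count is arbitrary. Threads are used here because the expensive work happens in external `groff`/`col` subprocesses, which release the GIL while Python waits; swapping in a `ProcessPoolExecutor` for true multi-core fan-out is a one-line change.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_man_page(path):
    """Stand-in for the gzip -> groff -> col cleanup pipeline."""
    return f"parsed:{path}"

def parse_all(paths, workers=8):
    # pool.map preserves input order, so results line up with the
    # discovered paths when batching inserts into the vector database.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_man_page, paths))
```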
First, ensure you have initialized the project with uv and populated the ChromaDB database with your system's man pages:
```shell
uv sync
uv run main.py
```

- Open Cursor and navigate to Settings > Features > MCP.
- Click + Add New MCP Server.
- Set Type to `command`.
- Set Name to `Horrible Man Pages`.
- Set Command to `uv run /absolute/path/to/horrible-man-pages/man_mcp.py` (replace with your absolute path).
Add the following to your claude_desktop_config.json (usually located at ~/Library/Application Support/Claude/claude_desktop_config.json on macOS or %APPDATA%\Claude\claude_desktop_config.json on Windows):
```json
{
  "mcpServers": {
    "horrible-man-pages": {
      "command": "uv",
      "args": [
        "run",
        "man_mcp.py"
      ],
      "cwd": "/absolute/path/to/horrible-man-pages"
    }
  }
}
```

Run the following command in your terminal, replacing the path with your absolute path to the directory:

```shell
claude mcp add horrible-man-pages uv -- run /absolute/path/to/horrible-man-pages/man_mcp.py
```