Horrible Man Pages is a powerful Python utility designed to process, clean, and index compressed Linux/Unix man pages into a ChromaDB vector database. This effectively transforms your system's manual pages into a semantic, searchable knowledge base, ready for downstream NLP tasks, RAG (Retrieval-Augmented Generation), and advanced querying.
- Automated Discovery: Automatically locates man pages on your system via the `MANPATH` environment variable.
- Robust Parsing Pipeline: Handles `.gz`-compressed files and accurately converts raw `troff` typesetting into clean, readable ASCII text.
- Artifact Removal: Strips out backspaces, overstrikes (used for bolding/underlining in terminals), and modern ANSI color codes.
- Parallel Processing: Leverages multi-core architectures to process thousands of man pages concurrently for rapid indexing.
- Vector Storage: Integrates seamlessly with ChromaDB for persistent, semantic search capabilities.
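The `MANPATH`-based discovery step can be pictured with a short sketch. This is a hypothetical helper, not the project's actual code; it assumes the conventional `man1/`, `man2/`, … section layout under each `MANPATH` directory:

```python
import os
from pathlib import Path

def discover_man_pages(manpath=None):
    """Yield compressed man pages from every directory on MANPATH."""
    # MANPATH is a colon-separated list, e.g. "/usr/share/man:/usr/local/man"
    manpath = manpath or os.environ.get("MANPATH", "/usr/share/man")
    for root in manpath.split(":"):
        root = Path(root)
        if root.is_dir():
            # Section subdirectories follow the man1/, man2/, ... convention
            yield from sorted(root.glob("man*/*.gz"))
```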
The core challenge in extracting clean text from man pages is that they are written in the troff typesetting language and usually distributed as gzip-compressed archives. Processing them often leaves artifacts such as backspaces for bold/underlined text and ANSI escape codes.
Without proper cleaning, embedding models (like OpenAI's `text-embedding-3-small` or HuggingFace models) would receive malformed tokens (e.g., seeing `N\bNA\bAM\bME\bE` instead of `NAME`).
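To see what the cleanup buys you, here is a toy illustration of collapsing overstrike sequences in pure Python (the actual pipeline delegates this to `col -b`; the helper name here is made up):

```python
import re

def strip_overstrikes(text):
    r"""Collapse troff-style overstrikes: bold is X\bX, underline is _\bX."""
    # Drop every "character + backspace" pair, keeping the final character.
    return re.sub('.\x08', '', text)

print(strip_overstrikes("N\bNA\bAM\bME\bE"))  # -> NAME
```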
Our robust pipeline ensures embedding models receive clean text:
```mermaid
flowchart TD
    A["[COMPRESSED TROFF .gz]<br>.SH NAME \n ls \- list..."] -->|gzip.decompress| B
    B["[RAW TROFF bytes]<br>.SH NAME \n ls \- list..."] -->|groff -Tascii -man| C
    C["[FORMATTED WITH OVERSTRIKES]<br>NAME\bNAME \n ls - list..."] -->|col -b| D
    D["[CLEAN TEXT]<br>NAME \n ls - list..."] -->|strip ANSI codes| E
    E["[FINAL ASCII TEXT]<br>NAME \n ls - list..."]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px
```
Here is an abstracted look at the core logic used to parse the man pages:
```python
import gzip
import re
import subprocess

def strip_ansi_codes(text):
    r"""Removes ANSI escape sequences (e.g., \x1b[31m for red text)."""
    ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
    return ansi_escape.sub('', text)

# 1. Decompress the man page
raw_bytes = gzip.decompress(file_data)

# 2. Convert raw troff to ASCII man page format
groff_output = subprocess.run(
    ['groff', '-Tascii', '-man'],
    input=raw_bytes,
    capture_output=True
).stdout

# 3. Strip backspace/overstrike characters (col receives the bytes from
#    groff, so decode afterwards rather than mixing bytes with text=True)
clean_text = subprocess.run(
    ['col', '-b'],
    input=groff_output,
    capture_output=True
).stdout.decode('utf-8', errors='replace')

# 4. Remove any lingering ANSI escape codes
final_safe_text = strip_ansi_codes(clean_text)
```

- `horrible/ManPageIndexer.py`: Contains the primary `ManPageIndexer` class responsible for discovery, parallel parsing, and database insertion.
- `create_chromadb.py`: The entry point script to initialize the ingestion process.
- `man_pages_db/`: The default local directory where ChromaDB stores its persistent vector indices (ignored in version control).
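The parallel step might be wired up along these lines. This is a minimal sketch: `parse_man_page` is a stand-in for the real cleanup pipeline, and the worker count is arbitrary. Threads are used here because the expensive work happens in external `groff`/`col` subprocesses, which release the GIL while Python waits; swapping in a `ProcessPoolExecutor` for true multi-core fan-out is a one-line change.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_man_page(path):
    """Stand-in for the gzip -> groff -> col cleanup pipeline."""
    return f"parsed:{path}"

def parse_all(paths, workers=8):
    # pool.map preserves input order, so results line up with the
    # discovered paths when batching inserts into the vector database.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_man_page, paths))
```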
First, ensure you have initialized the project with uv and populated the ChromaDB database with your system's man pages:
```shell
uv sync
uv run main.py
```

- Open Cursor and navigate to Settings > Features > MCP.
- Click + Add New MCP Server.
- Set Type to `command`.
- Set Name to `Horrible Man Pages`.
- Set Command to `uv run /absolute/path/to/horrible-man-pages/man_mcp.py` (replace with your absolute path).
Add the following to your claude_desktop_config.json (usually located at ~/Library/Application Support/Claude/claude_desktop_config.json on macOS or %APPDATA%\Claude\claude_desktop_config.json on Windows):
```json
{
  "mcpServers": {
    "horrible-man-pages": {
      "command": "uv",
      "args": [
        "run",
        "man_mcp.py"
      ],
      "cwd": "/absolute/path/to/horrible-man-pages"
    }
  }
}
```

Run the following command in your terminal, replacing the path with your absolute path to the directory:

```shell
claude mcp add horrible-man-pages uv -- run /absolute/path/to/horrible-man-pages/man_mcp.py
```