GML-2086 add Excel and CSV extraction support with UI warning#35

Merged
chengbiao-jin merged 5 commits into release_1.4.0 from GML-2086-XLS-CSV
May 6, 2026

@prinskumar-tigergraph prinskumar-tigergraph commented Apr 30, 2026

Added a UI warning message that is shown for CSV/XLSX files.
Added XLSX text extraction logic to the text-extractor file.

tg-pr-agent Bot commented Apr 30, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Dropped formats

TextExtractor.supported_extensions no longer includes .png and .gif, while other helper logic still treats image formats as supported. This mismatch can cause uploads that were previously accepted to be skipped or rejected in one path but still appear supported in another.

self.supported_extensions = {
    '.txt': 'text/plain',
    '.md': 'text/markdown',
    '.pdf': 'application/pdf',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.doc': 'application/msword',
    '.html': 'text/html',
    '.htm': 'text/html',
    '.json': 'application/json',
    '.csv': 'text/csv',
    '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    '.xls': 'application/vnd.ms-excel',
    '.xml': 'application/xml',
    '.jpeg': 'image/jpeg',
    '.jpg': 'image/jpeg',
    '.jsonl': 'application/x-jsonlines'
}
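One way to catch this kind of mismatch early is a small check that compares the extension map against whatever set the other helpers consult. The sketch below is illustrative only: `SUPPORTED_EXTENSIONS` copies the entries from the excerpt above, while `HELPER_IMAGE_EXTENSIONS`, `find_extension_mismatches`, and `is_supported` are hypothetical names and not code from this PR.

```python
from pathlib import Path

# Mirrors the mapping shown in the excerpt above.
SUPPORTED_EXTENSIONS = {
    ".txt": "text/plain",
    ".md": "text/markdown",
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
    ".html": "text/html",
    ".htm": "text/html",
    ".json": "application/json",
    ".csv": "text/csv",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xls": "application/vnd.ms-excel",
    ".xml": "application/xml",
    ".jpeg": "image/jpeg",
    ".jpg": "image/jpeg",
    ".jsonl": "application/x-jsonlines",
}

# Hypothetical stand-in for the image formats another helper still accepts.
HELPER_IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png", ".gif"}


def find_extension_mismatches(primary, other):
    """Return the extensions present in `other` but missing from `primary`."""
    return sorted(set(other) - set(primary))


def is_supported(filename, supported=SUPPORTED_EXTENSIONS):
    """Case-insensitive check of a filename's suffix against the map."""
    return Path(filename).suffix.lower() in supported


if __name__ == "__main__":
    missing = find_extension_mismatches(SUPPORTED_EXTENSIONS, HELPER_IMAGE_EXTENSIONS)
    print("Extensions accepted elsewhere but not extractable:", missing)
```

Running a check like this in a unit test would flag the divergence as soon as one list changes without the other. Deriving every helper's list from a single mapping would avoid the problem altogether.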

Excel memory

The Excel extraction path loads every sheet into pandas dataframes and converts them to markdown in memory. Large or multi-sheet spreadsheets may cause high memory usage, long processing times, or oversized extracted documents, so limits and behavior for large files should be validated.

elif extension in ['.xlsx', '.xls']:
    import pandas as pd
    engine = 'openpyxl' if extension == '.xlsx' else 'xlrd'
    try:
        xl = pd.ExcelFile(file_path, engine=engine)
    except Exception:
        xl = pd.ExcelFile(file_path)
    sheet_texts = []
    for sheet_name in xl.sheet_names:
        df = xl.parse(sheet_name)
        if df.empty:
            continue
        df = df.fillna('')
        sheet_md = df.to_markdown(index=False)
        sheet_texts.append(f"## Sheet: {sheet_name}\n\n{sheet_md}")
    return "\n\n".join(sheet_texts) if sheet_texts else "[Excel file is empty or contains no data]"

CSV encoding

CSV files are still read strictly as UTF-8 text. Real-world spreadsheets exported as CSV often use other encodings, so this path may fail for valid uploads even though spreadsheet support was expanded.

elif extension in ['.html', '.htm', '.csv']:
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read().strip()
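A common mitigation is to try a short list of encodings before giving up. The sketch below is one possible approach under stated assumptions, not the PR's code: `read_text_with_fallback` is a hypothetical helper, and the candidate list (`utf-8-sig`, then `cp1252`, with `latin-1` as a last resort) is a guess at what Excel-exported CSV files tend to use.

```python
def read_text_with_fallback(file_path, encodings=("utf-8-sig", "cp1252")):
    """Decode a text file by trying several encodings in order.

    Returns a (text, encoding_used) tuple. "utf-8-sig" also accepts plain
    UTF-8 and strips a leading byte-order mark if one is present.
    """
    with open(file_path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding).strip(), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this decode cannot fail,
    # although the result may be wrong for files in other encodings.
    return raw.decode("latin-1").strip(), "latin-1"
```

Returning the encoding that succeeded makes it possible to log or surface cases where the fallback was used, since a wrong guess produces garbled characters rather than an error.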

Comment thread common/utils/text_extractors.py
Comment thread common/utils/text_extractors.py Outdated
@prinskumar-tigergraph prinskumar-tigergraph changed the base branch from main to release_1.4.0 May 4, 2026 13:31
@chengbiao-jin chengbiao-jin merged commit e3f95c3 into release_1.4.0 May 6, 2026
@chengbiao-jin chengbiao-jin deleted the GML-2086-XLS-CSV branch May 6, 2026 18:34

3 participants