LlamaIndex readers for Built-Simple research APIs, providing semantic search over scientific literature.
- PubMed Reader - 4.5M+ biomedical articles with hybrid semantic/keyword search
- ArXiv Reader - 2.7M+ preprints in physics, math, CS, and ML
- Wikipedia Reader - Semantic search over Wikipedia articles
- No API key required - Free tier available for all endpoints
- Rich metadata - Full citation info for all documents
Each document contains:
- Text: Title + abstract (default) OR full article text (with `include_full_text=True`)
- Metadata:
  - `pmid` - PubMed ID (e.g., "31041627")
  - `title` - Full article title
  - `journal` - Publication journal name
  - `pub_year` - Publication year
  - `doi` - DOI identifier
  - `doi_url` - Direct DOI link
  - `url` - Link to PubMed page
  - `has_full_text` - Boolean indicating if full text was fetched
  - `full_text_length` - Character count of full text (when available)
🔥 FULL TEXT AVAILABLE! Unlike most research APIs that only provide abstracts, Built-Simple has full article text for millions of papers:
```python
# Get full article text (15K-70K chars per article)
reader = BuiltSimplePubMedReader(include_full_text=True)
docs = reader.load_data("cancer immunotherapy", limit=5)
for doc in docs:
    print(f"Full text length: {len(doc.text)} chars")  # ~15,000-70,000 chars!
```

Each document contains:
- Text: Title + authors + full abstract
- Metadata:
  - `arxiv_id` - ArXiv identifier (e.g., "2301.12345" or "cs/0308031")
  - `title` - Paper title
  - `authors` - Author names
  - `year` - Publication year
  - `url` - Link to ArXiv abstract page
  - `pdf_url` - Direct PDF download link
  - `similarity_score` - Semantic relevance score (0-1)
Note: Full paper PDFs are NOT downloaded—only abstracts. Use pdf_url to fetch the full PDF if needed.
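When you do need the full paper, the `pdf_url` metadata can be fetched directly. A minimal sketch using only the Python standard library (the `download_pdf` helper is an illustration, not part of this package; it assumes the URL is reachable):

```python
import urllib.request


def download_pdf(metadata: dict, dest_path: str) -> None:
    """Fetch the full paper PDF pointed to by a document's pdf_url metadata."""
    pdf_url = metadata.get("pdf_url")
    if not pdf_url:
        raise ValueError("document has no pdf_url metadata")
    with urllib.request.urlopen(pdf_url) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())
```

In practice you would call it per document, e.g. `download_pdf(doc.metadata, f"{doc.metadata['arxiv_id']}.pdf")`.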
Each document contains:
- Text: Article title + summary/intro section
- Metadata:
  - `title` - Article title
  - `url` - Link to Wikipedia page
Note: Only article summaries are returned, not full articles.
```bash
pip install llama-index-readers-builtsimple
```

```python
from llama_index.readers.builtsimple import (
    BuiltSimplePubMedReader,
    BuiltSimpleArxivReader,
)

# Search PubMed for medical literature
pubmed_reader = BuiltSimplePubMedReader()
pubmed_docs = pubmed_reader.load_data("CRISPR gene therapy", limit=10)
for doc in pubmed_docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Journal: {doc.metadata['journal']}")
    print(f"Year: {doc.metadata['pub_year']}")
    print(f"URL: {doc.metadata['url']}\n")

# Search ArXiv for ML papers
arxiv_reader = BuiltSimpleArxivReader()
arxiv_docs = arxiv_reader.load_data("transformer architecture attention", limit=10)
for doc in arxiv_docs:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {doc.metadata['authors']}")
    print(f"ArXiv ID: {doc.metadata['arxiv_id']}\n")
```

Build a query engine over the loaded documents:

```python
from llama_index.core import VectorStoreIndex
from llama_index.readers.builtsimple import BuiltSimplePubMedReader

# Load documents
reader = BuiltSimplePubMedReader()
documents = reader.load_data("immunotherapy cancer treatment", limit=20)

# Build index
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the side effects of CAR-T therapy?")
print(response)
```

Combine multiple sources in a single index:

```python
from llama_index.core import VectorStoreIndex
from llama_index.readers.builtsimple import (
    BuiltSimplePubMedReader,
    BuiltSimpleArxivReader,
)

# Load from multiple sources
pubmed = BuiltSimplePubMedReader()
arxiv = BuiltSimpleArxivReader()

# Combine documents
documents = []
documents.extend(pubmed.load_data("drug discovery machine learning", limit=10))
documents.extend(arxiv.load_data("drug discovery deep learning", limit=10))

# Build unified index
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query(
    "How is machine learning being used for drug discovery?"
)
print(response)
```

BuiltSimplePubMedReader:

```python
BuiltSimplePubMedReader(
    api_key: Optional[str] = None,  # Optional for higher rate limits
    timeout: int = 30,
)

def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]
```

Document Metadata:
- `source`: "builtsimple-pubmed"
- `pmid`: PubMed ID
- `title`: Paper title
- `journal`: Journal name
- `pub_year`: Publication year
- `doi`: DOI identifier
- `url`: Link to PubMed

BuiltSimpleArxivReader:

```python
BuiltSimpleArxivReader(
    api_key: Optional[str] = None,
    timeout: int = 30,
)

def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]
```

Document Metadata:
- `source`: "builtsimple-arxiv"
- `arxiv_id`: ArXiv identifier (e.g., "2301.12345")
- `title`: Paper title
- `authors`: Author list
- `year`: Publication year
- `url`: Link to ArXiv

BuiltSimpleWikipediaReader:

```python
BuiltSimpleWikipediaReader(
    api_key: Optional[str] = None,
    timeout: int = 30,
)

def load_data(
    query: str,
    limit: int = 10,
) -> List[Document]
```

Document Metadata:
- `source`: "builtsimple-wikipedia"
- `title`: Article title
- `url`: Link to Wikipedia
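Because every document carries a `source` key, results pulled from a combined index can be split back out by origin. A small sketch over plain metadata dicts (in practice these would come from `doc.metadata`; `group_by_source` is an illustrative helper, not part of this package):

```python
from collections import defaultdict


def group_by_source(metadatas: list[dict]) -> dict[str, list[dict]]:
    """Bucket document metadata dicts by the reader that produced them."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for meta in metadatas:
        groups[meta.get("source", "unknown")].append(meta)
    return dict(groups)
```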
| Tier | Rate Limit | Notes |
|---|---|---|
| Free | 10 req/min | No API key needed |
| Pro | 100 req/min | Requires API key |
Get an API key at pubmed.built-simple.ai or arxiv.built-simple.ai.
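On the free tier, a simple client-side throttle keeps batch loads under the 10 req/min cap. A generic sketch (the `throttled` decorator is an illustration; the inner `search` stub stands in for a real `reader.load_data` call):

```python
import time


def throttled(calls_per_minute: int):
    """Decorator that spaces out calls to respect a requests-per-minute cap."""
    interval = 60.0 / calls_per_minute

    def wrap(fn):
        last_call = [0.0]

        def inner(*args, **kwargs):
            wait = interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return fn(*args, **kwargs)

        return inner

    return wrap


@throttled(calls_per_minute=10)
def search(query: str):
    # Stand-in for reader.load_data(query); replace with a real call
    return query
```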
Compared with scraping or calling the official upstream APIs, Built-Simple offers:
- Pre-indexed vectors - No embedding costs, instant semantic search
- Hybrid search - Combines BM25 + vector similarity
- Always available - No rate limit hell from upstream providers
- Structured data - Clean JSON responses with full metadata
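Built-Simple's server-side scoring is not documented here, but reciprocal rank fusion is one common way BM25 and vector rankings get combined in hybrid search. A generic sketch of that technique (not the service's actual algorithm):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists, rewarding items ranked high in any list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of either the keyword or the vector ranking float to the top of the fused list.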
This package is part of the LlamaIndex ecosystem. To contribute:
- Fork the repo
- Create a feature branch
- Submit a PR to run-llama/llama_index
MIT License - see LICENSE for details.