MCP Server - PDF

Overview

PDF Reader MCP server is an MCP (Model Context Protocol) server that lets AI assistants read, parse, and manipulate PDF files.
With this server, LLMs such as Claude can extract text from PDF documents, retrieve metadata, and perform page-level operations.

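PDF Reader MCP server provides the following features.

Text extraction from PDF files
Retrieval of PDF metadata (title, author, creation date, and so on)
Retrieval of the page count and document structure
Simultaneous processing of multiple PDF files
Text extraction from a specified page range
Extraction of table data from PDFs (partially implemented)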

PDF Reader MCP server runs locally over the Standard I/O (STDIO) transport.
This allows it to integrate with MCP clients such as Claude Desktop, Claude Code, and Cursor.
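To illustrate what the STDIO transport carries, the following sketch builds the kind of JSON-RPC 2.0 request an MCP client writes to the server's standard input when it calls a tool. The tool name read_pdf and its arguments come from the tool list later in this article; the helper name build_tool_call and the request id are assumptions made for illustration. Over STDIO, each message is a single line of JSON terminated by a newline.

```python
import json

def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 "tools/call" request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps escapes newlines inside string values, so the frame stays on one line.
    return json.dumps(message, ensure_ascii=False) + "\n"

# Hypothetical request: read pages 5 to 10 of a document.
line = build_tool_call(
    1,
    "read_pdf",
    {"file_path": "/path/to/document.pdf", "start_page": 5, "end_page": 10},
)
print(line, end="")
```

In practice the MCP client builds and sends these messages itself; the sketch only shows the shape of the traffic, which can help when reading server logs.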

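Features of PDF Reader MCP

Text extraction

Text extraction from an entire PDF document
Text extraction from a specified page range
Option to preserve formatting
Multi-column support
Layout analysis

Metadata processing

Document title
Author information
Creation date and time
Modification date and time
PDF version
Page count
File size

Page operations

Access to individual pages
Page range specification
Retrieval of the page count
Page size information

Advanced features

Extraction of table data
Retrieval of image information
Analysis of links and annotations
Retrieval of the bookmark structure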

Requirements

System requirements

Node.js 18 or later
Python 3.8 or later (when using Python)

Required libraries

When using Node.js

pdf-parse >= 1.1.1
pdf-lib >= 1.17.1
pdfjs-dist >= 3.0.0

When using Python

PyPDF2 >= 3.0.0
pdfplumber >= 0.9.0
PyMuPDF (fitz) >= 1.22.0
fastmcp >= 0.1.0

Installation

Linux

Installing dependencies

First, install the required packages.

# RHEL
sudo dnf install curl wget git gcc-c++ make python3 python3-pip nodejs npm unzip

# SUSE
sudo zypper install curl wget git gcc-c++ make python3 python3-pip nodejs npm unzip

# Debian
sudo apt install curl wget git build-essential python3 python3-pip nodejs npm unzip

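Installing Node.js

To install a recent Node.js release, use the NodeSource repository. The setup script differs by distribution family: deb.nodesource.com serves Debian-based systems and rpm.nodesource.com serves RHEL-based systems. On SUSE, install the nodejs package from the distribution repositories.

# RHEL
curl -fsSL https://rpm.nodesource.com/setup_22.x | sudo bash -
sudo dnf install nodejs

# SUSE
sudo zypper install nodejs

# Debian
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install nodejs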

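Installing Bun

Install Bun.

curl -fsSL https://bun.com/install | bash

Set the BUN_INSTALL and PATH environment variables in a file such as ~/.profile. The installer places Bun under ~/.bun by default.

 export BUN_INSTALL="$HOME/.bun"
 export PATH="$BUN_INSTALL/bin:$PATH"

Alternatively, download the bundled release from the URL below.

https://github.com/oven-sh/bun/releases/latest/download/bun-linux-x64.zip

Extract the downloaded file.
If necessary, move the extracted directory to a location of your choice.

unzip bun-linux-x64.zip
mv bun-linux-x64 <any directory>

Set the PATH environment variable in a file such as ~/.profile.

 export PATH="/<Bun installation directory>/bin:$PATH"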

Installing PDF Reader MCP

Download PDF Reader MCP.

git clone https://github.com/SylphxAI/pdf-reader-mcp.git
cd pdf-reader-mcp

When using Node.js, install the dependencies.

npm install

# If an error is displayed
rm -rf node_modules package-lock.json
npm install --ignore-scripts
npm run build

# Alternatively, install the package globally
npm install -g @sylphx/pdf-reader-mcp

When using Python, create a virtual environment and install the dependencies.

python3 -m venv venv

# For Bash / Zsh
source venv/bin/activate

# For Fish
source venv/bin/activate.fish

# Install the dependencies
pip install -r requirements.txt

Windows

Download Node.js from the official Node.js website and install it.

Download the installer from the official Git for Windows website and install it.

Open PowerShell or Command Prompt and download PDF Reader MCP.

git clone https://github.com/SylphxAI/pdf-reader-mcp.git
cd pdf-reader-mcp

Install the dependencies of PDF Reader MCP.

npm install

Project structure

The project structure of PDF Reader MCP server is shown below.

pdf-reader-mcp/
├── README.md                       # Project documentation
├── package.json                    # Node.js dependencies (Node.js implementation)
├── requirements.txt                # Python dependencies (Python implementation)
├── tsconfig.json                   # TypeScript configuration (Node.js implementation)
├── .env.example                    # Example environment settings
├── src/                            # Source code (Node.js implementation)
│   ├── index.ts                    # Main entry point
│   ├── tools/                      # Tool implementations
│   │   ├── read_pdf.ts
│   │   ├── get_metadata.ts
│   │   ├── get_page_count.ts
│   │   ├── search_pdf.ts
│   │   └── extract_table.ts
│   └── utils/                      # Utility functions
│       ├── file_validator.ts
│       ├── path_resolver.ts
│       └── error_handler.ts
├── server.py                       # Main server (Python implementation)
├── tools/                          # Tool implementations (Python implementation)
│   ├── __init__.py
│   ├── read_pdf.py
│   ├── metadata.py
│   ├── search.py
│   └── tables.py
├── utils/                          # Utilities (Python implementation)
│   ├── __init__.py
│   ├── validators.py
│   └── security.py
├── tests/                          # Tests
│   ├── test_read_pdf.py
│   ├── test_metadata.py
│   └── test_security.py
└── docs/                           # Documentation
    ├── API.md
    ├── USAGE.md
    └── TROUBLESHOOTING.md

Client connection settings

Connecting from Claude Desktop

Edit the Claude Desktop configuration file.

The configuration file is located at the following paths.

Linux
~/.config/Claude/claude_desktop_config.json
Windows
%APPDATA%\Claude\claude_desktop_config.json

The configuration for the Node.js implementation is shown below.

 {
   "mcpServers": {
     "pdf-reader": {
       "command": "node",
       "args": [
         "/<any directory>/pdf-reader-mcp/dist/index.js"
       ],
       "env": {
         "NODE_ENV": "production"
       }
     }
   }
 }

The configuration for the Python implementation is shown below.

 {
   "mcpServers": {
     "pdf-reader": {
       "command": "/<any directory>/pdf-reader-mcp/venv/bin/python",
       "args": [
         "/<any directory>/pdf-reader-mcp/server.py"
       ]
     }
   }
 }

On Windows, adjust the paths accordingly. Backslashes must be escaped as \\ inside JSON strings.

 {
   "mcpServers": {
     "pdf-reader": {
       "command": "node",
       "args": [
         "C:\\<any directory>\\pdf-reader-mcp\\dist\\index.js"
       ]
     }
   }
 }

Restart Claude Desktop and confirm that PDF Reader MCP server is available.
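Editing the configuration file by hand risks overwriting entries for other servers. The following sketch merges a pdf-reader entry into an existing file and leaves the other entries untouched. The function name add_pdf_reader_entry is hypothetical, and the paths passed to it are placeholders; only the mcpServers layout is taken from the examples above.

```python
import json
from pathlib import Path

def add_pdf_reader_entry(config_path: Path, entry_script: str) -> dict:
    """Merge a "pdf-reader" server entry into an MCP client configuration file."""
    # Treat a missing or empty file as an empty configuration.
    text = config_path.read_text(encoding="utf-8") if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}

    # Add or replace only the pdf-reader entry; other servers are preserved.
    servers = config.setdefault("mcpServers", {})
    servers["pdf-reader"] = {"command": "node", "args": [entry_script]}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

For example, calling add_pdf_reader_entry(Path.home() / ".config/Claude/claude_desktop_config.json", "/opt/pdf-reader-mcp/dist/index.js") would register the Node.js build on Linux, assuming the server was cloned under /opt.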

Connecting from Cline

Edit the Cline configuration file.

# Linux
~/.config/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json

# Windows
%APPDATA%\Code\User\globalStorage\saoudrizwan.claude-dev\settings\cline_mcp_settings.json

The configuration uses the same format as Claude Desktop.

Connecting from Cursor

Edit the global Cursor configuration file.

# Linux
~/.cursor/mcp.json

# Windows
%USERPROFILE%\.cursor\mcp.json

# macOS
~/.cursor/mcp.json

The configuration is shown below.

 {
   "mcpServers": {
     "pdf-reader": {
       "command": "node",
       "args": ["/path/to/pdf-reader-mcp/dist/index.js"],
       "autoApprove": [],
       "disabled": false,
       "timeout": 60,
       "transportType": "stdio"
     }
   }
 }

Alternatively, place a .cursor/mcp.json file in the project directory as a project-specific configuration.

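Connecting from Claude Code

Claude Code reads project-scoped MCP server definitions from a .mcp.json file in the project root, so a server defined there needs no further per-user configuration. A server can also be registered with the claude mcp add command.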

Available tools

read_pdf

Extracts text from an entire PDF file or from a specified page range.

Parameters
Parameter	Type	Required	Description
file_path	string	Required	Absolute or relative path to the PDF file
start_page	integer	Optional	First page number (1-based)
end_page	integer	Optional	Last page number (1-based)

Usage example

 # Read an entire PDF file
 Read the PDF file "/path/to/document.pdf".
 
 # Read a page range
 Read pages 5 through 10 of "/path/to/document.pdf".

Example return value

 {
   "text": "Extracted text content...",
   "pages": 6,
   "file": "/path/to/document.pdf"
 }
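The start_page and end_page parameters are 1-based and, judging by the example above (pages 5 through 10 yield "pages": 6), inclusive, whereas Python PDF libraries index pages from 0. The following sketch shows one way to validate the bounds and convert them. The function name resolve_page_range is hypothetical and not part of the server.

```python
from typing import Optional

def resolve_page_range(
    total_pages: int,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> range:
    """Convert 1-based inclusive page bounds into a range of 0-based page indices."""
    # Omitted bounds default to the first and last page of the document.
    start = 1 if start_page is None else start_page
    end = total_pages if end_page is None else end_page
    if start < 1 or end > total_pages or start > end:
        raise ValueError(
            f"invalid page range {start}-{end} for a {total_pages}-page document"
        )
    # range() excludes its stop value, so the 1-based inclusive end is used as is.
    return range(start - 1, end)

indices = resolve_page_range(42, start_page=5, end_page=10)
print(list(indices))  # [4, 5, 6, 7, 8, 9]
```

Rejecting an out-of-range request, rather than silently clamping it, makes it easier for the calling model to notice that it asked for pages that do not exist.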

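get_pdf_metadata

Retrieves the metadata of a PDF file.

Parameters
Parameter	Type	Required	Description
file_path	string	Required	Absolute or relative path to the PDF file

Usage example

Get the metadata of "/path/to/document.pdf".

Example return value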

 {
   "title": "Document Title",
   "author": "Author Name",
   "subject": "Document Subject",
   "creator": "PDF Creator",
   "producer": "PDF Producer",
   "creation_date": "2024-01-15T10:30:00Z",
   "modification_date": "2024-01-20T15:45:00Z",
   "page_count": 42,
   "pdf_version": "1.7",
   "file_size": 1024576
 }
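PDF files store dates in their own string format, for example D:20240115103000Z, so a server that reports ISO 8601 timestamps such as the ones above has to convert them. The following sketch shows one possible conversion using only the standard library. The function name pdf_date_to_iso is hypothetical, and whether the server converts dates this way is an assumption. Note that Python renders UTC as +00:00 rather than Z.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by Z or an offset such as +09'00'.
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)

def pdf_date_to_iso(raw: str) -> str:
    """Convert a PDF date string to an ISO 8601 timestamp."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    year, month, day, hour, minute, second, utc, sign, tz_h, tz_m = match.groups()

    # Attach a timezone only when the string carries one.
    tzinfo = None
    if utc or sign:
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tzinfo = timezone(-offset if sign == "-" else offset)

    moment = datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tzinfo,
    )
    return moment.isoformat()

print(pdf_date_to_iso("D:20240115103000Z"))  # 2024-01-15T10:30:00+00:00
```

Dates without timezone information are returned as naive timestamps instead of being assumed to be UTC.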

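get_pdf_page_count

Retrieves the total number of pages in a PDF file.

Parameters
Parameter	Type	Required	Description
file_path	string	Required	Absolute or relative path to the PDF file

Usage example

Tell me the number of pages in "/path/to/document.pdf".

Example return value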

 {
   "page_count": 42,
   "file": "/path/to/document.pdf"
 }

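extract_pdf_table

Extracts table data from a PDF file. (Availability depends on the implementation.)

Parameters
Parameter	Type	Required	Description
file_path	string	Required	Absolute or relative path to the PDF file
page_number	integer	Optional	Page number from which to extract tables
table_settings	object	Optional	Detailed settings for table extraction

Usage example

Extract the tables from page 3 of "/path/to/document.pdf".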

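search_pdf

Searches for a keyword in a PDF file.

Parameters
Parameter	Type	Required	Description
file_path	string	Required	Absolute or relative path to the PDF file
keyword	string	Required	Keyword to search for
case_sensitive	boolean	Optional	Whether the search is case-sensitive (default: false)

Usage example

Search "/path/to/document.pdf" for the keyword "machine learning".

Example return value

 {
   "matches": [
     {
       "page": 5,
       "context": "...using machine learning algorithms to...",
       "position": 234
     },
     {
       "page": 12,
       "context": "...when training machine learning models...",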
       "position": 567
     }
   ],
   "total_matches": 2
 }
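The following sketch shows how a result of this shape could be assembled from text that has already been extracted page by page. The function name search_pages and the context_chars parameter are hypothetical, and treating position as a character offset within the page text is an assumption; the article does not define the field.

```python
import re

def search_pages(
    page_texts: list,
    keyword: str,
    case_sensitive: bool = False,
    context_chars: int = 20,
) -> dict:
    """Find every occurrence of keyword in a list of per-page text strings."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    # re.escape makes the keyword match literally, not as a regular expression.
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(re.escape(keyword), flags)

    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        for found in pattern.finditer(text or ""):
            # Keep a few characters on either side of the hit as context.
            lo = max(0, found.start() - context_chars)
            hi = min(len(text), found.end() + context_chars)
            matches.append({
                "page": page_number,
                "context": "..." + text[lo:hi] + "...",
                "position": found.start(),
            })
    return {"matches": matches, "total_matches": len(matches)}

pages = ["Introduction.", "Machine learning is fun. We use machine learning daily."]
print(search_pages(pages, "machine learning")["total_matches"])  # 2
```

Using a compiled pattern keeps the reported offsets aligned with the original text, which lowercasing both strings does not guarantee for every character.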

Usage examples

Basic PDF reading

Read "~/Documents/report.pdf".
Read pages 1 through 5 of "/home/user/papers/research.pdf".

Metadata retrieval

Get the metadata of "~/Documents/manual.pdf".
Tell me the number of pages in "~/Downloads/invoice.pdf".

Keyword search

Search "~/Documents/thesis.pdf" for the keyword "machine learning".
Search "~/papers/article.pdf" for "neural network" with case sensitivity enabled.

Table extraction

Extract the tables from page 3 of "~/Documents/financial_report.pdf".
Extract all tables from "~/data/statistics.pdf".

Processing multiple files

Read all of the following PDF files and summarize their contents.
- ~/Documents/chapter1.pdf
- ~/Documents/chapter2.pdf
- ~/Documents/chapter3.pdf

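Troubleshooting

PDF file fails to load

File not found error
Read permission error
Format error

The solutions are as follows.

Check that the file path is correct
```
ls -la /path/to/file.pdf
```
Make sure the file has read permission, granting it if it is missing
```
chmod 644 /path/to/file.pdf
```
Check that the PDF file is not corrupted
```
pdfinfo /path/to/file.pdf
```

If the PDF version is old, try converting the file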

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.7 -o output.pdf input.pdf

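Text extraction does not work correctly

Empty text is returned
Characters are garbled
The layout is broken

The solutions are as follows.

Check whether the PDF is a scanned, image-based document (such PDFs require OCR)

Try a different library (pdfplumber → PyMuPDF)

 import fitz  # PyMuPDF
 
 def read_pdf_with_pymupdf(file_path: str) -> str:
     # Open the document as a context manager so it is closed after reading
     with fitz.open(file_path) as doc:
         # Concatenate the plain text of every page
         return "".join(page.get_text() for page in doc)

Also check for encoding problems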
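Whether a PDF is scanned can be estimated from the extraction result itself: an image-based document yields little or no text on most pages. The following sketch applies that heuristic to a list of per-page strings. The function name looks_scanned and both thresholds are assumptions chosen for illustration and would need tuning against real documents.

```python
def looks_scanned(
    page_texts: list,
    min_chars_per_page: int = 20,
    threshold: float = 0.8,
) -> bool:
    """Guess whether a PDF is image-based from the text extracted per page."""
    if not page_texts:
        return False
    # Count pages with almost no extractable text (None is treated as empty).
    sparse = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    # Flag the document when most of its pages are sparse.
    return sparse / len(page_texts) >= threshold

print(looks_scanned(["", None, "   "]))  # True
```

A document flagged this way is a candidate for OCR; a document with only a few sparse pages more likely contains full-page figures.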

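Out-of-memory errors

Processing stalls on large PDF files
A memory error occurs

The solutions are as follows.

Implement page-by-page processing
Set a file size limit
Use streaming processing
Raise the memory limit of the Python environment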
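A file size limit is the cheapest of these measures, because it rejects an oversized file before any PDF library opens it. The following sketch shows such a guard. The function name check_pdf_size and the 50 MiB default are assumptions, not values taken from the server.

```python
from pathlib import Path

MAX_PDF_BYTES = 50 * 1024 * 1024  # hypothetical limit: 50 MiB

def check_pdf_size(file_path: str, max_bytes: int = MAX_PDF_BYTES) -> int:
    """Return the file size in bytes, or raise if the file is missing or too large."""
    path = Path(file_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {path}")
    # stat() reads only file metadata, so the check costs no extra memory.
    size = path.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{path.name} is {size} bytes; the limit is {max_bytes}")
    return size
```

A tool function could call check_pdf_size(file_path) as its first step and return the exception message to the client as an error result.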

Performance optimization

Page-by-page processing

When processing a large PDF file, handling it in batches of pages reduces memory usage.

 from pathlib import Path
 
 import pdfplumber
 
 def read_pdf_streaming(file_path: str, page_size: int = 10):
     """Read a PDF in batches of page_size pages, yielding one batch at a time."""
     path = Path(file_path).resolve()
 
     with pdfplumber.open(str(path)) as pdf:
         total_pages = len(pdf.pages)
 
         for start in range(0, total_pages, page_size):
             end = min(start + page_size, total_pages)
 
             batch_text = []
             for page_num in range(start, end):
                 page = pdf.pages[page_num]
                 text = page.extract_text()
                 # Skip pages that contain no extractable text
                 if text:
                     batch_text.append(text)
 
             yield {
                 "text": "\n\n".join(batch_text),
                 "pages": f"{start + 1}-{end}",
                 "total_pages": total_pages
             }

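Caching

Cache the metadata of frequently accessed PDF files.

 from functools import lru_cache
 import hashlib
 
 # mcp (the server instance) and get_pdf_metadata are assumed to be defined elsewhere in the server
 
 def get_file_hash(file_path: str) -> str:
     """Compute a hash of the file contents."""
     # MD5 serves only as a change-detection key here, not for security
     hasher = hashlib.md5()
     with open(file_path, 'rb') as f:
         # Read in 64 KiB chunks so large files are not loaded into memory at once
         buf = f.read(65536)
         while len(buf) > 0:
             hasher.update(buf)
             buf = f.read(65536)
     return hasher.hexdigest()
 
 @lru_cache(maxsize=50)
 def get_cached_metadata(file_path: str, file_hash: str) -> dict:
     """Return metadata, cached per (path, content hash) pair."""
     # file_hash is part of the cache key, so a modified file misses the cache
     return get_pdf_metadata(file_path)
 
 @mcp.tool()
 def get_pdf_metadata_cached(file_path: str) -> dict:
     """Retrieve metadata through the cache."""
     try:
         file_hash = get_file_hash(file_path)
         return get_cached_metadata(file_path, file_hash)
     except Exception as e:
         return {"error": str(e)}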

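Parallel processing

When processing several PDF files at once, use parallel processing.

 from concurrent.futures import ThreadPoolExecutor, as_completed
 
 # read_pdf is assumed to be the tool function defined elsewhere in the server
 def process_multiple_pdfs(file_paths: list[str]) -> list[dict]:
    """Process multiple PDF files in parallel."""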
    results = []
 
    with ThreadPoolExecutor(max_workers=4) as executor:
       future_to_path = {
          executor.submit(read_pdf, path): path 
          for path in file_paths
       }
 
       for future in as_completed(future_to_path):
          path = future_to_path[future]
          try:
             result = future.result()
             results.append({
                "file": path,
                "result": result
             })
          except Exception as e:
             results.append({
                "file": path,
                "error": str(e)
             })
 
    return results

Advanced features

OCR integration

To extract text from scanned PDF files, integrate an OCR function.

 import pytesseract
 import pdf2image
 
 def extract_text_with_ocr(file_path: str) -> str:
    """Extract text from a scanned PDF using OCR."""
    # Convert each PDF page to an image
    images = pdf2image.convert_from_path(file_path)
 
    text_content = []
    for image in images:
       # Run OCR with the Japanese and English language data
       text = pytesseract.image_to_string(image, lang='jpn+eng')
       text_content.append(text)
 
    return "\n\n".join(text_content)

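Splitting PDFs

Add a feature that splits a large PDF file.

 from pathlib import Path
 
 from PyPDF2 import PdfWriter, PdfReader
 
 @mcp.tool()
 def split_pdf(
    file_path: str,
    output_dir: str,
    pages_per_file: int = 10
 ) -> dict:
    """Split a PDF file into multiple files."""
    try:
       path = Path(file_path).resolve()
       output_path = Path(output_dir).resolve()
       # Create the output directory, including any missing parent directories
       output_path.mkdir(parents=True, exist_ok=True)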
 
       reader = PdfReader(str(path))
       total_pages = len(reader.pages)
 
       file_count = 0
       for start in range(0, total_pages, pages_per_file):
          writer = PdfWriter()
          end = min(start + pages_per_file, total_pages)
  
          for page_num in range(start, end):
             writer.add_page(reader.pages[page_num])
  
          output_file = output_path / f"split_{file_count + 1}.pdf"
          with open(output_file, 'wb') as output:
             writer.write(output)
  
          file_count += 1
  
       return {
          "files_created": file_count,
          "output_directory": str(output_path)
       }
  
    except Exception as e:
       return {"error": f"Split error: {str(e)}"}

Merging PDFs

Add a feature that merges multiple PDF files.

 @mcp.tool()
 def merge_pdfs(
    file_paths: list[str],
    output_path: str
 ) -> dict:
    """Merge multiple PDF files into one."""
    try:
       writer = PdfWriter()
 
       for file_path in file_paths:
          reader = PdfReader(file_path)
          for page in reader.pages:
             writer.add_page(page)
 
       with open(output_path, 'wb') as output:
          writer.write(output)
 
       return {
          "output_file": output_path,
          "total_pages": len(writer.pages)
       }
 
    except Exception as e:
       return {"error": f"Merge error: {str(e)}"}

Extracting PDF bookmarks

Retrieve the bookmark structure of a PDF file.

 @mcp.tool()
 def get_pdf_bookmarks(file_path: str) -> dict:
    """Retrieve the bookmark structure of a PDF file."""
    try:
       path = Path(file_path).resolve()
       reader = PdfReader(str(path))
 
       def extract_bookmarks(bookmarks, level=0):
          result = []
          for item in bookmarks:
             if isinstance(item, list):
                result.extend(extract_bookmarks(item, level + 1))
             else:
                result.append({
                   "title": item.title,
                   "page": reader.get_destination_page_number(item) + 1,
                   "level": level
                })
          return result
 
       bookmarks = reader.outline
       if bookmarks:
          return {
             "bookmarks": extract_bookmarks(bookmarks)
          }
       else:
          return {"bookmarks": [], "message": "No bookmarks found"}
 
    except Exception as e:
       return {"error": f"Bookmark retrieval error: {str(e)}"}

External links