At Block, we have developed more than 60 MCP servers, and this playbook reflects some patterns and learnings we've observed across that ecosystem.
Best practices
Design top-down from workflows, not bottom-up from API endpoints
Unlike traditional API design, tools for LLMs should be designed with usability, context constraints, and language model strengths in mind. It's usually better to start top-down from the workflow that needs to be automated, and work backwards (in as few steps as possible) to define tools that support that flow effectively.
Do: Think in terms of user workflows. Combine multiple internal API calls into a single high-level tool. If you must chain tool calls, clearly outline the steps and dependencies in your tool's instructions, and keep the outputs of previous steps concise.
Don't: Expose raw, granular API endpoints like `GET /user` or `GET /file`. These usually produce verbose, poorly scoped outputs.
Example:
- ⚠️ 2 Tools: `get_user(user_id)` & `upload_file(path, owner)` → `get_user(...)` returns a giant JSON blob when only the username is needed
- ✅ 1 Tool: `upload_file(path, owner)` → gets the user, uploads the file with the owner, and returns a success/error message
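To make this concrete, here is a minimal sketch of the consolidated tool. The internal helpers `get_user_record` and `put_file` are hypothetical stand-ins for whatever API client you already have; the point is that the agent only ever sees `upload_file`:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("files")

# Hypothetical internal helpers standing in for your existing API client.
def get_user_record(username: str) -> dict:
    return {"id": "u_123", "username": username}

def put_file(path: str, owner_id: str) -> None:
    ...

@mcp.tool()
def upload_file(path: str, owner: str) -> str:
    """Upload the file at `path` on behalf of `owner` and report success or failure."""
    try:
        user = get_user_record(owner)        # internal call; its full JSON never reaches the LLM
        put_file(path, owner_id=user["id"])  # second internal call, chained in code rather than by the agent
        return f"Uploaded '{path}' for {user['username']}."
    except Exception as exc:
        # Short, actionable message instead of a raw stack trace or JSON blob.
        return f"Upload failed for '{path}': {exc}"
```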
Experiment with tool names and parameters
Tool names, descriptions, and parameters are treated as prompts for the LLM, so it's really important to have clear instructions. See the post below from one of the creators of MCP.
For complex tool parameters, it's highly recommended to use Pydantic (or equivalent) models, which support model and field descriptions, along with JSONSchema serialization.
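For instance, a minimal sketch with Pydantic v2 (the parameter model and field names here are made up for illustration):

```python
from pydantic import BaseModel, Field

class EventSearchParams(BaseModel):
    """Parameters for searching calendar events."""
    calendar_id: str = Field("primary", description="Calendar to search; defaults to the user's primary calendar.")
    query: str = Field(..., description="Free-text search over event titles and descriptions.")
    max_results: int = Field(25, ge=1, le=100, description="Upper bound on returned events to keep output small.")

# The JSON Schema (including the field descriptions above) is what the LLM ultimately sees.
schema = EventSearchParams.model_json_schema()
```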
Think about token budget
LLMs have finite context windows. For example, Claude 3.7 can process at most 200K input tokens. Some LLMs, such as Gemini and GPT-4.1, have larger windows, but even for those models performance drops on long contexts because training data on the internet is skewed toward shorter lengths. As tool developers, you are in the best position to know which tool calls produce large outputs, and to implement checks inside the tool to guard against overflows.
Tactics:
- Check byte size, character count, or estimate token count using libraries like tiktoken before returning text (see the sketch after this list).
- Check the size of images and consider resizing or lowering the quality to make them smaller.
- Based on the tool's semantics, choose one of the following:
- Throw an error: Explicitly raise an MCP tool execution error and let the agent recover.
- Reduce output: Shorten content to a safe limit with a clear note that it was truncated, or you can run a summarizer if available (e.g., for long file contents or logs)
- Paginate the output: In some cases, all the output may be necessary, so you might want the model to get the different pages via tool calls. Note that there is a trade off here between trying to reduce the number of tool calls and maintaining the token budget.
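A minimal sketch of the first two ideas combined, using tiktoken for the estimate; the 20K-token budget and the truncation note are arbitrary choices for illustration:

```python
import tiktoken

MAX_OUTPUT_TOKENS = 20_000  # arbitrary budget chosen for illustration

def clamp_tool_output(text: str) -> str:
    """Estimate the token count and truncate with an explicit note if the output is too large."""
    enc = tiktoken.get_encoding("o200k_base")  # pick an encoding close to your target model's
    tokens = enc.encode(text)
    if len(tokens) <= MAX_OUTPUT_TOKENS:
        return text
    truncated = enc.decode(tokens[:MAX_OUTPUT_TOKENS])
    return truncated + "\n\n[Output truncated: exceeded the tool's token budget. Narrow your request to see more.]"
```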
Practical Example:
In Goose's file reading tool, files over 400KB raise a tool execution error, because we find LLMs are smart enough to work around it by using shell commands like `sed -n '500,1000p' file.txt`.
```rust
const MAX_FILE_SIZE: u64 = 400 * 1024; // 400KB in bytes
if file_size > MAX_FILE_SIZE {
    return Err(ToolError::ExecutionError(format!(
        "File '{}' is too large ({:.2}KB). Maximum size is 400KB to prevent memory issues. If needed, use shell commands like 'head', 'tail', or 'sed -n' to read a subset of the file.",
        path.display(),
        file_size as f64 / 1024.0
    )));
}
```
Prefer actionable error messages that enable recovery by the agent, instead of vague prompt instructions like: "Don't return large files!".
For tools that might produce large outputs (e.g., file reads, shell command execution, doc scraping), include fallback logic early. Even checking the file size before reading can save a failed tool call.
Take advantage of prompt prefix caching
LLM providers offer significant price discounts and latency improvements when the prompt prefix is cached. If possible, avoid injecting dynamic live data, such as the current timestamp, into instructions, since that invalidates the cache. Instead, opt for data that changes less frequently, such as the session-created timestamp. Dynamically picking tools or examples to inject into a prompt will also invalidate the cache. When developing, look at the cached token metrics such as `cache_read_input_tokens` and `cache_creation_input_tokens`.
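As a small illustration (the prompt text is made up), the difference is simply where the timestamp comes from:

```python
from datetime import datetime, timezone

# Cache-busting: a fresh timestamp changes the prompt prefix on every request.
def build_instructions_every_turn() -> str:
    return f"Current time: {datetime.now(timezone.utc).isoformat()}\nYou are a helpful calendar assistant."

# Cache-friendly: capture the timestamp once per session so the prefix stays byte-identical.
SESSION_STARTED_AT = datetime.now(timezone.utc).isoformat()

def build_instructions_once_per_session() -> str:
    return f"Session started: {SESSION_STARTED_AT}\nYou are a helpful calendar assistant."
```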
Lean into the strengths of LLMs
Design tools that make the most of what LLMs do well:
✅ Good At:
- SQL queries
- Use DuckDB to expose structured data for querying (see the sketch after this list)
- Clean your schema: use tidy data principles. Think of creating a “gold dataset” that is easy for the end user to query. This often means denormalizing the data so fewer joins are required.
- Name your tables and columns so they're easy for LLMs to understand and generate
- Markdown / Mermaid diagrams
- Diagrams as text is a superpower
- Validate the diagram code and pass any errors back to the model so it can fix them
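Here's a hedged sketch of the DuckDB approach referenced above; the `events` table, its denormalized columns, and the tool name are illustrative assumptions rather than an actual implementation:

```python
import duckdb
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("calendar-analytics")
con = duckdb.connect("calendar.duckdb")

# A denormalized "gold" table: one row per (event, attendee), so most questions need zero joins.
con.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id VARCHAR,
        title VARCHAR,
        organizer_email VARCHAR,
        attendee_email VARCHAR,
        start_time TIMESTAMP,
        end_time TIMESTAMP
    )
""")

@mcp.tool()
def query_events(sql: str) -> str:
    """Run a SQL query against the denormalized `events` table and return the rows as text."""
    rows = con.execute(sql).fetchall()
    return "\n".join(str(row) for row in rows)
```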
⚠️ Weak At:
- Planning over many steps
- LLMs are improving at planning, but chaining together 20 tool calls is still hard for them today, so design your tools to require less chaining.
- Outputting formats with a strict grammar like JSON (e.g., missed quotes, commas)
- Prefer Markdown or XML over raw JSON since they're typically more token efficient. Here's an article that goes more in depth on why LLMs are bad at returning JSON.
- If JSON is necessary, use structured outputs or keep a simple schema and avoid long lists.
- Generating long names
- Use tools like tiktokenizer.vercel.app to check how your prompts get tokenized and imagine yourself in the LLM's position. For example, assume a long database table name takes up 32 tokens. The LLM needs to regenerate those tokens every time it references the table in a query.
Best practices for Auth
Secure authentication is essential for extensions, especially when accessing user or organizational resources. Design your authentication flows to minimize user friction and maximize security by taking advantage of modern best practices.
- Use OAuth whenever possible
- Prefer OAuth flows (e.g., Authorization Code Grant) to direct credential entry or legacy API keys.
- Trigger the OAuth flow only upon an extension's first use, rather than at activation.
- Ensure the minimum necessary scopes are requested for your workflow to minimize risk.
- Store tokens securely using the keyring
- Sensitive credentials—such as access tokens, refresh tokens, and client secrets—should be stored in a platform-provided keyring or credential manager.
- Use a library like keyring for Python, the macOS Keychain, Windows Credential Locker, etc., for secure storage and automatic system integration (see the sketch after this list).
- Token lifecycle management
- Promptly handle token refresh (using a refresh token or re-authenticating the user) and invalidate stored tokens if an extension is removed or access is revoked.
❌ Never save tokens or secrets in plaintext files.
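For example, with the Python keyring library (the service name below is a hypothetical placeholder):

```python
import keyring

SERVICE_NAME = "example-mcp-extension"  # hypothetical identifier for your extension

def save_token(account: str, token: str) -> None:
    # Delegates storage to the OS keyring (macOS Keychain, Windows Credential Locker, Secret Service on Linux).
    keyring.set_password(SERVICE_NAME, account, token)

def load_token(account: str) -> str | None:
    return keyring.get_password(SERVICE_NAME, account)

def delete_token(account: str) -> None:
    # Call this when the extension is removed or access is revoked.
    keyring.delete_password(SERVICE_NAME, account)
```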
Provide instructions and tool annotations
Server instructions (provided during initialization) and tool annotations (e.g., `readOnlyHint`) are currently optional in the MCP spec. However, Goose uses server instructions to construct the system prompt, and tool annotations for smart approval of tool calls. Other coding agents like Claude Code and Codex also have a notion of tool approval requests, so these hints are applicable across agents.
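A sketch of what this can look like with the MCP Python SDK, assuming a recent SDK version where `FastMCP` accepts `instructions` and the tool decorator accepts `annotations`; the server and tool here are made up:

```python
from mcp.server.fastmcp import FastMCP
from mcp.types import ToolAnnotations

mcp = FastMCP(
    "calendar",
    instructions="Use query_events for analytics questions; prefer read-only tools before any mutation.",
)

@mcp.tool(annotations=ToolAnnotations(readOnlyHint=True))
def list_upcoming_events(days: int = 7) -> str:
    """Read-only: summarize events over the next `days` days."""
    # A real implementation would query the calendar store; this is a stub.
    return f"(stub) events for the next {days} days"
```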
Make it easier to manage permissions
Goose tools support three permission levels: Always Allow, Allow Once, and Denied. To help users make safe choices, each tool should stick to a single risk level: read-only (low risk) or non-read (higher risk). Tools that mix both make it harder for users to judge the risk accurately. If your workflow truly needs both read and write in one tool, that's okay; just document it clearly and validate inputs. There's always a trade-off between the number of tools and how simple permission management stays.
Do: Build tools with one risk level only: either read-only (safe) or non-read (write/delete). Bundle related read-only actions into a single tool.
Don't: Mix read and write operations in the same tool if you can avoid it. It confuses users and makes permission settings harder.
Example: Bundling related read-only tools
```python
@mcp.tool()
def get_issue_info(issue_id: str, info_category: str) -> dict:
    """Get specific information about an issue.

    The 'info_category' parameter determines the type of information to fetch.
    Available categories:
    'details' - General issue details,
    'comments' - Comments on the issue,
    'labels' - Labels assigned to the issue,
    'subscribers' - Users subscribed to the issue,
    'parent' - Parent issue of a sub-issue,
    'branch_name' - Git branch name for the issue.
    'children' - Get all child issues of a specific issue.

    Args:
        issue_id (str): Issue ID
        info_category (str): Category of information to retrieve.
            Options: "details", "comments", "labels",
            "subscribers", "parent", "branch_name",
            "children"

    Returns:
        dict: Issue information or error details
    """
```
Examples
Let's look at a few examples of MCP servers we've built, the lessons learned, and how we improved the servers:
Google Calendar MCP
Our first version (v1) of Google Calendar MCP served as a thin wrapper around the external API:
v1
```python
@mcp.tool()
def list_calendars():
    """Lists all user calendars."""


@mcp.tool()
def list_calendar_events(
    calendar_id: str = "primary",
    timeMax: datetime | None = None,
    timeMin: datetime | None = None,
    verbose: bool = False,
) -> dict:
    """Get all events for a specified time period for a given calendar."""


@mcp.tool()
def retrieve_timezone(calendar_id: str = "primary") -> dict:
    """Retrieves timezone for a given calendar."""


@mcp.tool()
def retrieve_calendar_free_busy_slots(
    time_min: str,
    time_max: str,
    timezone: str = "UTC",
    calendar_ids: list = ["primary"]
) -> dict:
    """Retrieves free and busy slots from the calendars of the 'calendar_ids' list."""
```
Problems
- Bottom-up design: Tools simply mirrored API endpoints.
- No analytics capabilities: Answering "How many meetings did I have last month?" was impossible.
- Painful for LLM workflows: To serve common queries like “what does my day look like?” or “find a meeting time between Alice, Bob & Carol” an LLM has to chain many tool calls and then do analysis.
v2
In our reworked version, we manage and sync DuckDB tables of calendar events, which lets us run SQL queries against them. This provides richer context to LLMs by giving them access to raw events and allows them to perform analytics on the fly. We also no longer require chaining tool calls, and instead embed more logic into a single `query_database` tool. For common complex queries, such as finding free meeting slots, we provide DuckDB macros to make this easier and instruct LLMs on the schema and macro usage.
Here's how v2 consolidates multiple steps into one LLM-efficient tool to find common meeting times:
```python
@mcp.tool()
def query_database(sql: str, time_min: str | None, time_max: str | None) -> str:
    """Run arbitrary queries against calendars/events in our DuckDB tables."""
```
A workflow to “find 1-hour slots where Alice, Bob, and Carol are all available this week” can now be answered simply with:
```sql
SELECT * FROM free_slots(['alice@example.com','bob@example.com','carol@example.com'],
                         '2025-05-13T09:00:00Z', '2025-05-17T18:00:00Z');
```
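To make the macro idea concrete, here is a simplified, hypothetical sketch of how a table macro could be registered from Python; the real `free_slots` macro additionally computes the gaps between busy intervals, and the `events` schema is assumed from the earlier sketch:

```python
import duckdb

con = duckdb.connect("calendar.duckdb")

# Hypothetical helper macro: all busy intervals for a set of attendees within a time window.
# A real free_slots macro would go further and compute the gaps between these intervals.
con.execute("""
    CREATE OR REPLACE MACRO busy_intervals(emails, window_start, window_end) AS TABLE
        SELECT attendee_email, start_time, end_time
        FROM events
        WHERE list_contains(emails, attendee_email)
          AND start_time < window_end
          AND end_time > window_start
""")
```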
Linear MCP
The first version of our Linear MCP offered many tools that were essentially variations of GraphQL queries under the hood. Some operations also required multiple tool calls chained together to get the full end result.
```python
@mcp.tool()
def get_issue(issue_id: str) -> dict:
    """Get detailed information about a specific issue."""

@mcp.tool()
def get_issue_labels(issue_id: str) -> dict:
    """Get all labels assigned to an issue."""

# ... all 30+ tools
```
Problems
- Bottom-up design: Tools simply mirrored API endpoints.
- Painful for LLM workflows: Asking questions like “what issues is bob@example.com working on” would require anywhere from 4-6 tool calls depending on how many teams the user was a part of.
- Too many tools: the server already exposed 30+ tools for the functionality covered, and more complex workflows would have required exposing even more.
In a subsequent version, to simplify tool selection for the model, we consolidated many read-only tools. We grouped related functions, so instead of separate tools like `get_team_members`, `get_team_projects`, etc., there's now a comprehensive `get_team_info` tool (and a similar `get_issue_info`, etc.). This reduction in tools means the model can now find the necessary parameters by checking the description of a single, relevant tool instead of choosing from numerous specific ones.
```python
@mcp.tool()
def get_issue_info(issue_id: str, info_category: str) -> dict:
    """Get specific information about an issue.

    The 'info_category' parameter determines the type of information to fetch.
    Available categories:
    'details' - General issue details,
    'comments' - Comments on the issue,
    'labels' - Labels assigned to the issue,
    'subscribers' - Users subscribed to the issue,
    'parent' - Parent issue of a sub-issue,
    'branch_name' - Git branch name for the issue.
    'children' - Get all child issues of a specific issue.

    Args:
        issue_id (str): Issue ID
        info_category (str): Category of information to retrieve.
            Options: "details", "comments", "labels",
            "subscribers", "parent", "branch_name",
            "children"

    Returns:
        dict: Issue information or error details
    """
```
Problems
- Bottom-up design: Tools simply mirrored API endpoints, even though they were consolidated.
- Painful for LLM workflows: Asking questions like “what issues is bob@example.com working on” still had the same problem as the initial version: the number of tools was reduced, but each still needed to be called individually.
The most recent version of the Linear MCP consolidated even further, exposing two tools, `execute_readonly_query` and `execute_mutation_query`, each of which takes a GraphQL query directly. To teach the model how to use these tools, we added the GraphQL schema and example queries to the server instructions.
To reduce initial token usage, especially with extensive schemas, a `get_linear_graphql_schema` tool could be introduced. This tool would only need to be called once when initiating a Linear MCP session, allowing the GraphQL schema to be omitted from the main instructions.
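A sketch of what that hypothetical schema tool might look like (the schema file path is a made-up placeholder; in practice the SDL could be bundled with the server or fetched and cached at startup):

```python
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("linear")

@mcp.tool()
def get_linear_graphql_schema() -> str:
    """Return the Linear GraphQL schema (types, queries, mutations) as SDL text.

    Call this once per session before writing queries for
    execute_readonly_query or execute_mutation_query.
    """
    # Hypothetical bundled schema file; could also be fetched from Linear and cached.
    return Path("linear_schema.graphql").read_text()
```

The `execute_readonly_query` tool itself looks like this: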
```python
@mcp.tool()
def execute_readonly_query(
    query: str, variables: Optional[dict[str, Any]] = None
) -> dict:
    """Execute a read-only Linear GraphQL query and return the result.

    Args:
        query (str): The read-only Linear GraphQL query
        variables (dict): Optional variables to pass into the query
    """
```
What originally took 4+ tool calls is now a single GraphQL query that the model can generate:
```graphql
query ($email: String!) {
  issues(filter: { assignee: { email: { eq: $email } } }) {
    nodes {
      title
      project {
        id
        name
      }
      assignee {
        email
      }
    }
  }
}
```
MCP is no longer a proof of concept at Block; it's how we work. We've learned a lot while developing MCP servers for our employees, and we'll continue to share as we gain more insights.