
Running Large Models Locally with Ollama: UI, APIs, Model Choices, and Real-World Tuning

Local deployment has become one of the most practical ways to use large language models. For developers, it removes recurring API costs and makes testing faster. For companies, it keeps sensitive data inside their own environment instead of sending it through third-party services. What makes Ollama stand out is how much it simplifies that process. It takes something that used to involve messy environment setup, dependency conflicts, and awkward model management, and turns it into something much closer to installing a normal application.

It is often described as the Docker of local LLMs, and that comparison makes sense. Ollama packages model deployment, runtime management, and access into a much cleaner workflow. With the addition of an official UI, it is no longer limited to command-line users either. People who do not want to live in a terminal can now install models, launch them, and chat with them from a graphical interface.

Why Ollama matters

Ollama is an open-source tool for deploying and managing large models on local machines. Its core promise is simple: running a model locally should not feel harder than installing software.

That idea addresses three pain points that have made local model usage frustrating for a lot of people:

  • environment setup is often complicated
  • model lifecycle management is tedious
  • hardware resources can be consumed inefficiently if the stack is not tuned well

What makes it useful

Easy setup
You do not need to manually assemble a Python environment, CUDA dependencies, and a pile of supporting tools just to get started. Installation is designed to be straightforward, and mainstream models can be run with minimal preparation.

Lightweight and efficient
The core runtime is relatively lean. It can run on ordinary PCs, including machines with around 8 GB of memory, while also scaling up to servers with stronger hardware.

Broad model support
Ollama includes access to many widely used models, including Llama 3, DeepSeek, Qwen, and Mistral. These cover common use cases such as text generation, translation, question answering, and code generation.

Official graphical interface
The built-in UI removes the need to rely exclusively on terminal commands. Installing a model and chatting with it can now be done visually.

Open integration path
It exposes a REST API and supports SDK-based access, making it practical to connect with business systems in Java, Python, Go, and other languages.

Privacy by design
Everything runs locally, so data does not need to pass through external servers. For teams working with internal, regulated, or sensitive information, that is a major advantage.

Where it fits best

Ollama is especially well suited to a few recurring scenarios:

  • internal enterprise knowledge assistants, such as document search or staff training support
  • local development and testing without depending on external connectivity
  • privacy-sensitive workflows, including medical or financial data processing
  • edge and on-device deployments, such as industrial or embedded AI use cases

Getting started: installation and the built-in UI

Hardware and system requirements

Before installing Ollama, it helps to match the model size to the machine you actually have.

Minimum
8 GB of memory is enough to run small models, such as a quantized 7B-class model or a 1.5B DeepSeek variant.

Recommended
16 GB of memory is a more comfortable baseline for 13B to 14B models, such as Llama 2 13B or Qwen 14B. (Llama 3 itself ships in 8B and 70B sizes, with no 13B variant.)

Higher-performance setup
32 GB or more of memory plus an NVIDIA GPU is better suited for much larger models. 70B variants like Llama 3 70B typically need around 40 GB or more of combined RAM and VRAM, even with 4-bit quantization.
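
A rough way to sanity-check these tiers is to estimate the weight footprint from the parameter count and the quantization width. The sketch below is a back-of-the-envelope heuristic, not something Ollama itself exposes: both function names are hypothetical, and the 20% overhead factor for the KV cache and runtime buffers is an assumption that will vary with context length.

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: int = 4,
                             overhead: float = 1.2) -> float:
    """Rough memory estimate in GB for a quantized model.

    overhead is an assumed fudge factor for KV cache and runtime buffers.
    """
    # 1 billion parameters at 8 bits per weight is roughly 1 GB of weights
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead


def fits_in_memory(params_billion: float, ram_gb: float, **kwargs) -> bool:
    """True if the estimate stays within the given memory budget."""
    return estimate_model_memory_gb(params_billion, **kwargs) <= ram_gb
```

Under these assumptions a 4-bit 7B model comes out at about 4.2 GB and a 4-bit 70B model at about 42 GB, so treat the result as a first filter rather than a guarantee.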

Supported operating systems
  • Windows 10+
  • macOS 11+
  • Linux, including Ubuntu 20.04+ and CentOS 8+

Installing Ollama

  1. Download the installer from the official Ollama site at ollama.com.
  2. Run the installer and follow the default steps. On Windows and macOS, the installer normally puts the ollama command on the PATH for you. On Linux, installation is usually done with the official script instead: curl -fsSL https://ollama.com/install.sh | sh
  3. Verify that the installation worked by opening a terminal and running ollama --version. If a version number appears, Ollama is installed correctly.
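
The same verification can be scripted, which is handy in setup scripts or CI. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the code assumes only that the CLI prints a version number of the form x.y.z somewhere in its output, not any particular wording.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_version(output: str) -> Optional[str]:
    """Extract the first x.y.z version number from CLI output, if any."""
    match = re.search(r"\d+\.\d+\.\d+", output)
    return match.group(0) if match else None


def installed_ollama_version() -> Optional[str]:
    """Return the installed Ollama version, or None if the CLI is not on PATH."""
    if shutil.which("ollama") is None:
        return None
    result = subprocess.run(["ollama", "--version"],
                            capture_output=True, text=True, check=False)
    # Some tools print version info to stderr, so search both streams
    return parse_version(result.stdout + result.stderr)
```

A return value of None means the command was not found or printed no recognizable version, which is usually a PATH problem rather than a failed install.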

Using the official UI instead of the command line

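Recent Ollama releases for Windows and macOS include an official desktop app, so there is no extra interface layer to install. Which release introduced it depends on the platform, so check the release notes if the app is missing after an upgrade. On headless Linux servers, the CLI and the API remain the main interfaces.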

How to access it

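  1. Start the Ollama service. In many cases it runs in the background automatically after installation, using port 11434 by default. If it does not, start it manually with ollama serve.
  2. Open a browser and go to http://localhost:11434 to confirm the service is up. The root URL answers with the plain-text message "Ollama is running".
  3. Launch the Ollama app. The UI will load with the main areas used for model management and chat. The exact layout can differ between versions.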
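
Before opening the UI, a script can confirm that the service from step 1 is actually listening. A minimal sketch with the standard library only: the helper name is_ollama_up is hypothetical, and the URL is simply the default address mentioned above.

```python
import urllib.error
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434",
                 timeout: float = 2.0) -> bool:
    """Return True if the service root answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status
        return False
```

If this returns False, start the service with ollama serve and try again before debugging anything in the UI.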

Main UI sections

Model Library
This is where supported models are listed. You can search the catalog and install models directly, including options like Llama 3 and DeepSeek.

Chat
Once a model is installed and running, this is where you interact with it. Context-aware conversation is supported, and chat history can be saved.

Models
This section shows which models are already installed. From here you can start, stop, or remove them.

Settings
Used for runtime configuration, such as memory limits, GPU usage, and API-related settings.

Example: install and use DeepSeek through the UI

A simple hands-on flow looks like this:

  1. Search for deepseek in the Model Library.
  2. Click Pull to install it.
  3. After installation, go to Models and click Run.
  4. Open the Chat tab and ask something like “Explain what a microservices architecture is.”
  5. The model will return a response in real time, and the conversation history will be saved automatically.

For people who want local models without learning terminal commands first, this UI is one of Ollama’s biggest practical improvements.

Command-line workflow for power users

The visual interface is convenient, but the terminal is still the fastest way to automate routine tasks or manage multiple models.

Common commands include:

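# Download a model without starting a chat (DeepSeek-R1 as the example)
ollama pull deepseek-r1

# List installed models
ollama list

# Start an interactive session (downloads the model first if needed)
ollama run deepseek-r1

# List models currently loaded in memory
ollama ps

# Stop (unload) a running model
ollama stop deepseek-r1

# Remove a model
ollama rm deepseek-r1

# Show model details (parameters, size, template)
ollama show deepseek-r1

# Back up models by copying the model store (default location: ~/.ollama/models)
cp -r ~/.ollama/models ./ollama-models-backup

# Import a local GGUF file through a Modelfile whose first line is: FROM ./model.gguf
ollama create my-model -f Modelfile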
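
To illustrate the automation point, the sketch below downloads several models in one go by shelling out to ollama pull, the CLI's download subcommand. The helper names are hypothetical and the model names in the usage note are placeholders, so substitute the tags you actually use.

```python
import subprocess
from typing import Iterable, List


def pull_commands(models: Iterable[str]) -> List[List[str]]:
    """Build one `ollama pull` argv list per model name, skipping duplicates."""
    seen = set()
    commands = []
    for name in models:
        if name not in seen:
            seen.add(name)
            commands.append(["ollama", "pull", name])
    return commands


def pull_all(models: Iterable[str]) -> None:
    """Run the pulls one after another; raises CalledProcessError on failure."""
    for argv in pull_commands(models):
        subprocess.run(argv, check=True)
```

Calling pull_all(["llama3", "mistral"]) would fetch both models in sequence and stop at the first failure.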

Choosing the right model

Ollama supports dozens of mainstream large models, but picking one is always a trade-off between capability, speed, and resource consumption.

Common options and where they shine

<table>
<thead>
<tr> <th>Model</th> <th>Parameters</th> <th>Main strengths</th> <th>Approx. memory needed</th> <th>Typical use cases</th> </tr>
</thead>
<tbody>
<tr> <td>Llama 3</td> <td>8B/70B</td> <td>Strong all-around performance, multilingual support</td> <td>8 GB+ / 40 GB+</td> <td>General text generation, Q&amp;A, translation</td> </tr>
<tr> <td>DeepSeek</td> <td>1.5B/7B/14B</td> <td>Strong Chinese support, good code generation</td> <td>4 GB+ / 8 GB+ / 16 GB+</td> <td>Chinese dialogue, coding, document analysis</td> </tr>
<tr> <td>Qwen</td> <td>7B/14B</td> <td>Excellent Chinese understanding, strong long-context handling</td> <td>8 GB+ / 16 GB+</td> <td>Chinese Q&amp;A, long document processing, enterprise knowledge bases</td> </tr>
<tr> <td>Mistral / Mixtral</td> <td>7B/8x7B</td> <td>Fast, memory-efficient, supports function calling</td> <td>8 GB+ / 32 GB+</td> <td>Low-latency services, API integrations, edge deployment</td> </tr>
<tr> <td>CodeLlama</td> <td>7B/13B</td> <td>Built for code generation and debugging, supports multiple languages</td> <td>8 GB+ / 16 GB+</td> <td>Software development, automated testing, technical documentation</td> </tr>
</tbody>
</table>

Practical selection advice

If you are just getting started
Use a model in the 1.5B to 7B range, such as DeepSeek-1.5B or Mistral-7B. These launch faster and place less pressure on memory.

For enterprise use
Models in the 13B to 70B range, such as Qwen-14B or Llama 3 70B, generally provide stronger overall performance.

For Chinese-language tasks
DeepSeek and Qwen are usually better fits because they are much stronger in Chinese understanding and generation than many general-purpose alternatives.

For coding
CodeLlama and DeepSeek-Coder are more suitable when the job involves code completion, debugging, or refactoring.

Turning Ollama into a local model service

One of Ollama’s biggest advantages is that it can be treated like infrastructure rather than just a desktop tool. Once the service is running, it exposes an HTTP API on port 11434 by default, which means local models can be integrated into applications the same way teams already integrate other internal services.

Core API endpoints

<table>
<thead>
<tr> <th>Path</th> <th>Method</th> <th>Purpose</th> <th>Example request body</th> </tr>
</thead>
<tbody>
<tr> <td>/api/chat</td> <td>POST</td> <td>Chat with context support</td> <td>{"model":"deepseek-r1","messages":[{"role":"user","content":"Explain microservices"}]}</td> </tr>
<tr> <td>/api/generate</td> <td>POST</td> <td>One-off text generation without chat history</td> <td>{"model":"deepseek-r1","prompt":"Write a short essay about AI"}</td> </tr>
<tr> <td>/api/tags</td> <td>GET</td> <td>List installed models</td> <td>-</td> </tr>
<tr> <td>/api/pull</td> <td>POST</td> <td>Install a model</td> <td>{"model":"deepseek-r1"}</td> </tr>
<tr> <td>/api/generate</td> <td>POST</td> <td>Unload a model from memory (no dedicated stop endpoint exists)</td> <td>{"model":"deepseek-r1","keep_alive":0}</td> </tr>
</tbody>
</table>

This is enough to build a local “LLM as a service” setup for internal tools, web apps, scripts, and backend systems.
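
Calling the API needs nothing beyond an HTTP client. The sketch below builds a non-streaming request for /api/generate with the Python standard library. The helper names are hypothetical and the default model tag llama3 is an assumption, so use whichever model you have pulled. Setting "stream": false asks the server for one JSON object instead of a stream of partial chunks.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3",
                           base_url: str = "http://localhost:11434"
                           ) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running model.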

Java integration with Spring Boot

For Java teams, Ollama can be integrated into a Spring Boot application cleanly, especially through Spring AI.

Step 1: add dependencies

<!-- Spring AI starter for Ollama (recommended; simplifies development).
     In Spring AI 1.0.0 the starter is named spring-ai-starter-model-ollama. -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-ollama</artifactId>
    <version>1.0.0</version>
</dependency>
<!-- Spring Web (used to expose the HTTP endpoints) -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

Step 2: configure the connection

Put the following into application.yml:

spring:
  ai:
    ollama:
      base-url: http://localhost:11434 # Ollama service address
      chat:
        options:
          model: deepseek-r1 # default model
          temperature: 0.7 # randomness; higher values give more varied output
          num-predict: 1024 # maximum number of tokens to generate

Step 3: write the application code

import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.ollama.OllamaChatModel;
import org.springframework.ai.ollama.api.OllamaOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OllamaChatController {

    // Inject the Ollama chat model auto-configured by Spring AI
    @Autowired
    private OllamaChatModel ollamaChatModel;

    /**
     * Simple chat endpoint (no context)
     */
    @GetMapping("/api/chat/simple")
    public String simpleChat(@RequestParam String msg) {
        // Call the model directly and return the generated text
        return ollamaChatModel.call(msg);
    }

    /**
     * Chat endpoint with caller-supplied context
     */
    @GetMapping("/api/chat/context")
    public String contextChat(@RequestParam String msg, @RequestParam(required = false) String context) {
        // Build the prompt from the conversation history (context)
        String prompt = context == null ? msg : context + "\nUser: " + msg + "\nAssistant: ";
        // Per-request generation options (override application.yml);
        // builder-style API as of Spring AI 1.0.x
        OllamaOptions options = OllamaOptions.builder()
                .temperature(0.5) // less randomness, more focused answers
                .numPredict(2048) // allow longer replies
                .build();
        // Call the model and unwrap the text of the first result
        return ollamaChatModel.call(new Prompt(prompt, options))
                .getResult().getOutput().getText();
    }

    /**
     * Switch models per request
     */
    @GetMapping("/api/chat/switch-model")
    public String switchModelChat(@RequestParam String msg, @RequestParam String modelName) {
        // Override only the model name; the model must already be pulled
        OllamaOptions options = OllamaOptions.builder().model(modelName).build();
        return ollamaChatModel.call(new Prompt(msg, options))
                .getResult().getOutput().getText();
    }
}

Step 4: test the endpoints

After the Spring Boot app starts, test it in a browser or via Postman:

  • Simple chat: http://localhost:8080/api/chat/simple?msg=Explain what microservices are
  • Context-aware chat: http://localhost:8080/api/chat/context?msg=What are its advantages&context=User: Explain what microservices are\nAssistant: Microservices are an architectural style...
  • Switch model dynamically: http://localhost:8080/api/chat/switch-model?msg=Write some Python code&modelName=codellama
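
Browsers usually encode spaces and non-ASCII characters for you, but scripts and some HTTP clients do not, and a real line break in the context parameter has to travel as %0A rather than as the two characters \n. A small helper, shown here as a hypothetical sketch, builds correctly encoded test URLs for the endpoints above:

```python
from urllib.parse import quote, urlencode


def chat_url(path: str, base: str = "http://localhost:8080", **params: str) -> str:
    """Return a test URL with every query parameter percent-encoded."""
    # quote_via=quote encodes spaces as %20 instead of "+"
    return f"{base}{path}?{urlencode(params, quote_via=quote)}"
```

For example, chat_url("/api/chat/simple", msg="hello world") yields http://localhost:8080/api/chat/simple?msg=hello%20world.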

Python integration

Python is even more direct when all you need is a quick local script or service wrapper.

Step 1: install the package

pip install ollama

Step 2: call the model

import ollama

# 1. Simple chat (no context)
def simple_chat(msg, model="deepseek-r1"):
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": msg}]
    )
    return response["message"]["content"]

# 2. Chat with context
def context_chat(msg, context=None, model="deepseek-r1"):
    messages = []
    if context:
        # Pass earlier output back in as a previous assistant turn
        messages.append({"role": "assistant", "content": context})
    messages.append({"role": "user", "content": msg})
    response = ollama.chat(model=model, messages=messages)
    return response["message"]["content"]

# 3. Text generation (no chat format; suited to long-form text)
def generate_text(prompt, model="deepseek-r1"):
    response = ollama.generate(
        model=model,
        prompt=prompt,
        # num_predict caps the number of generated tokens
        options={"temperature": 0.6, "num_predict": 1500}
    )
    return response["response"]

# Quick test
if __name__ == "__main__":
    print(simple_chat("Write a short essay on AI development trends"))
    print(context_chat("Expand on the third point", context="AI development trends include: 1. Lighter-weight large models..."))
    print(generate_text("Write a Python crawler that scrapes blog post titles"))

This pattern works well for automation, local assistants, internal tooling, and quick prototype APIs.
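
For longer conversations it is usually cleaner to keep the whole message list and resend it on every turn, since the chat API is stateless between requests. The class below is a minimal sketch of that idea. The Conversation name and the injected send function are hypothetical, and injecting the sender keeps the history logic testable without a running model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the full message list so each request carries the prior turns."""

    def __init__(self, send: Callable[[List[Message]], str], system: str = ""):
        # send is any function that takes the message list and returns the reply text
        self._send = send
        self.messages: List[Message] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text: str) -> str:
        """Append the user turn, call the model, and record its reply."""
        self.messages.append({"role": "user", "content": text})
        reply = self._send(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With the ollama package installed, send could be a small wrapper such as lambda msgs: ollama.chat(model="your-model", messages=msgs)["message"]["content"], where your-model is a placeholder for a model you have pulled.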

Go integration

Go projects can call Ollama through the official API package as well.

Step 1: install the dependency

go get github.com/ollama/ollama/api

Step 2: create a client and send a request

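package main

import (
    "context"
    "fmt"
    "log"

    "github.com/ollama/ollama/api"
)

func main() {
    // Reads OLLAMA_HOST; falls back to http://127.0.0.1:11434
    client, err := api.ClientFromEnvironment()
    if err != nil {
        log.Fatal(err)
    }

    // Simple chat: ask for one complete response instead of a stream
    stream := false
    req := &api.ChatRequest{
        Model: "deepseek-r1",
        Messages: []api.Message{
            {Role: "user", Content: "Explain what a blockchain is"},
        },
        Stream: &stream,
        Options: map[string]any{
            "temperature": 0.7,
            "num_predict": 1024, // maximum number of tokens to generate
        },
    }

    // Chat delivers each response through a callback
    err = client.Chat(context.Background(), req, func(resp api.ChatResponse) error {
        fmt.Printf("Assistant: %s\n", resp.Message.Content)
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}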

Performance tuning and stability improvements

Once the basics are working, the next challenge is usually efficiency: making response time better, reducing memory pressure, and keeping the service stable under heavier use.

Adjust runtime resources

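Model behavior can be tuned through model parameters and server environment variables, depending on available hardware:

# Start an interactive session, then set the context window in tokens (longer conversations need more memory)
ollama run deepseek-r1
>>> /set parameter num_ctx 4096

# Set how many model layers are offloaded to the GPU (0 keeps inference on the CPU)
>>> /set parameter num_gpu 0

# Limit how many models stay loaded and how long an idle model stays in memory
OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_KEEP_ALIVE=5m ollama serve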

These options matter because local models are always constrained by the machine they run on. A model that feels slow or unstable is not always the wrong model; sometimes it is just mismatched to the available resources.
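
Tuning can also be applied per request through the options field of the API, which avoids restarting anything. The sketch below assembles such a dictionary. The helper is hypothetical, while num_ctx, num_predict, num_gpu, and temperature are parameter names Ollama accepts in options.

```python
from typing import Any, Dict, Optional


def runtime_options(num_ctx: Optional[int] = None,
                    num_predict: Optional[int] = None,
                    num_gpu: Optional[int] = None,
                    temperature: Optional[float] = None) -> Dict[str, Any]:
    """Build an Ollama options dict, leaving out anything that was not set."""
    candidates = {
        "num_ctx": num_ctx,          # context window in tokens
        "num_predict": num_predict,  # cap on generated tokens
        "num_gpu": num_gpu,          # number of layers offloaded to the GPU
        "temperature": temperature,  # sampling randomness
    }
    return {key: value for key, value in candidates.items() if value is not None}
```

The result can be passed straight through, for example ollama.chat(model=..., messages=..., options=runtime_options(num_ctx=4096)) with the Python package, or as the "options" member of a raw JSON request body.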

Practical optimization tips

  • Close unnecessary background applications to free CPU and RAM.
  • Use SSD storage instead of HDD. Model load times can be several times faster on SSDs.
  • Enable GPU acceleration when possible. NVIDIA users generally need a recent GPU driver, while AMD users need a ROCm-supported card and driver.
  • Use quantized models when available. Quantization such as 4-bit or 8-bit can significantly reduce memory use. Most models in the Ollama library are already published as quantized builds, commonly 4-bit by default.

Enterprise deployment and high availability

For personal use, one local instance is often enough. In production, however, availability and maintainability matter just as much as raw model quality.

Recommended production practices

Containerize the service
Package Ollama in Docker and manage it through Docker Compose or Kubernetes.

Run multiple instances behind a load balancer
If request volume grows, deploy several Ollama instances and distribute traffic through Nginx or an API gateway.

Add monitoring and alerts
Use Prometheus and Grafana to observe memory consumption, response latency, and error rates.

Back up models regularly
Export model files on a schedule to avoid unnecessary recovery work after failures.
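
In production the load balancing itself belongs in Nginx or an API gateway, but the underlying idea is easy to show in a few lines. The sketch below rotates through several Ollama base URLs and skips any that fail a health check. The function name, the hostnames in the usage note, and the injected is_up check are all hypothetical.

```python
import itertools
from typing import Callable, Iterable, Iterator


def healthy_backends(urls: Iterable[str],
                     is_up: Callable[[str], bool]) -> Iterator[str]:
    """Cycle through backend URLs forever, skipping ones that fail the check.

    Raises RuntimeError if a full pass finds no healthy backend.
    """
    pool = list(urls)
    ring = itertools.cycle(pool)
    while True:
        for _ in range(len(pool)):
            url = next(ring)
            if is_up(url):
                yield url
                break
        else:
            # Every backend was tried once and none passed the check
            raise RuntimeError("no healthy Ollama backend available")
```

A caller would create the generator once, for instance healthy_backends(["http://ollama-1:11434", "http://ollama-2:11434"], is_up=my_check), and call next() on it before each request.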

Example Docker Compose deployment

version: "3"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-service
    ports:
      - "11434:11434"
    volumes:
      - ./ollama-data:/root/.ollama # persist model data
    deploy:
      resources:
        limits:
          memory: 16G
          cpus: "4"
    restart: always

This gives you persistent model storage, a fixed service port, and basic resource limits in a form that is easy to reproduce across environments.

Common issues and how to avoid them

Model downloads are slow or fail

A common reason is network instability, especially if model files are hosted on overseas infrastructure.

Possible fixes:

  • configure an HTTP or HTTPS proxy
  • download a GGUF model file from a closer mirror and import it with ollama create and a Modelfile
  • rerun ollama pull after an interrupted transfer; parts that already completed are normally reused rather than fetched again
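
On an unstable connection it also helps to retry automatically instead of restarting downloads by hand. The helper below is a generic retry-with-exponential-backoff sketch. It is not part of Ollama, the name is hypothetical, and the sleep function is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(action: Callable[[], T], attempts: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A download could then be wrapped as retry_with_backoff(lambda: subprocess.run(["ollama", "pull", "llama3"], check=True)), where llama3 stands in for the model you need.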

Memory usage is too high after the model starts

Ways to reduce the pressure:

  • move to a smaller model, for example from 13B down to 7B
  • reduce the context window (the num_ctx parameter), or cap memory at the container level when running in Docker
  • close other memory-heavy applications

Generation is too slow

To improve speed:

  • enable GPU acceleration and install the required drivers and dependencies
  • use a smaller model
  • shorten the context window or cap the reply length (num_ctx and num_predict); temperature changes randomness, not speed

Chinese output quality is weak

If Chinese generation is underperforming:

  • choose a model optimized for Chinese, especially DeepSeek or Qwen
  • make the prompt explicit, such as “Explain this in Chinese”
  • increase or tune context settings, for example by raising num_ctx to 4096

Port 11434 is already in use

You can either stop the process occupying that port or start Ollama on another one:

  • free the existing process using port 11434
  • start on a custom port by setting the OLLAMA_HOST environment variable, such as OLLAMA_HOST=127.0.0.1:11435 ollama serve
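
Both options can be scripted. The sketch below checks whether a port can be bound and prepares an environment for starting the server elsewhere. OLLAMA_HOST is the environment variable Ollama reads for its bind address; the helper names and the fallback port 11435 are assumptions for illustration.

```python
import os
import socket
from typing import Dict


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False


def serve_env(host: str = "127.0.0.1", port: int = 11435) -> Dict[str, str]:
    """Environment for running `ollama serve` on a custom address."""
    return {**os.environ, "OLLAMA_HOST": f"{host}:{port}"}
```

A launcher could call subprocess.Popen(["ollama", "serve"], env=serve_env()) when port_is_free("127.0.0.1", 11434) returns False. Clients then need to point at the new port as well.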

Is Ollama worth using?

For local LLM deployment, Ollama has become one of the most practical options because it combines three things that rarely come together cleanly: simplicity, efficiency, and privacy.

It works for several different audiences at once:

  • Developers who want to test ideas quickly or integrate local models into applications without relying on third-party APIs
  • Enterprise IT teams building internal AI services such as knowledge bases or support bots under stricter privacy requirements
  • Non-technical users who prefer a UI over the command line
  • Students and researchers who want a lower-cost way to compare and experiment with different models

Ollama is still evolving quickly, and it is reasonable to expect broader capabilities over time, including richer UI workflows, more advanced multi-model coordination, and potentially stronger support for fine-tuning-related workflows.

If the goal is to run large models locally without drowning in environment setup and dependency management, Ollama makes that path much more approachable. In only a few minutes, it can turn a local machine into a private AI workspace.