Ollama Quick-Start Guide

Ollama is your local AI engine for running fast, private language models - no cloud required. Shinydocs Pro integrates with Ollama to power document enrichment, data extraction, and intelligent tagging, all behind your firewall.

This guide walks you through installing Ollama, binding it to 0.0.0.0 so Shinydocs Pro can connect, and setting up GPU drivers to get the most out of your hardware.

For more detailed information about installing Ollama on your operating system, check out Ollama’s official documentation.

Graphics Processing Unit (GPU) Drivers

For best performance, make sure your system has the latest NVIDIA or AMD drivers installed. Failing to do so may result in Ollama using only your CPU and RAM (very slow).

To confirm your GPU is supported by Ollama, check the official compatibility list here:

https://github.com/ollama/ollama/blob/main/docs/gpu.md

If your card is listed and drivers are up to date, Ollama will automatically take advantage of your GPU when available.

For the latest NVIDIA drivers, check out: https://www.nvidia.com/en-us/drivers/

For the latest AMD drivers, check out: Drivers and Support for Processors and Graphics
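
Once Ollama is installed (see the platform sections below), you can sanity-check GPU usage. For example, on NVIDIA systems nvidia-smi shows the installed driver and detected GPUs, and ollama ps reports whether a loaded model is running on the GPU or falling back to the CPU:

CODE
nvidia-smi
ollama ps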

Sizing models based on your resources

There’s no exact formula for matching hardware to LLM performance. In general, smaller models are faster but less accurate, and larger models are slower but more capable. The key is finding the sweet spot between speed, cost, and the level of insight you need.

General guide for VRAM

Use this as a starting point when deciding which models your hardware can realistically handle:

Model Size | Minimum VRAM Needed | Notes
1–2B | 4 GB | Great for testing, basic tasks, and low-resource setups
3–4B | 6 GB | Good balance for lightweight enrichment and Q&A
7–8B | 8–12 GB | Ideal for general-purpose use on modern GPUs (e.g., RTX 3060+)
12–14B | 16 GB | High-performing models for advanced use cases
70B+ / MoE | 24 GB or more | Workstation-class cards or multi-GPU setups needed

Additional Notes

  • ~1 GB of VRAM per billion parameters is a rough baseline.

  • Actual memory use varies based on model type, quantization, and input length.

  • Using quantized models (like :q4_K_M) significantly lowers VRAM demands.

  • Always test with a small set of documents first to find what works best on your setup.

The goal isn’t just “bigger is better”; it’s about choosing the model that gives good enough answers fast enough for your workflow.
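
As a quick worked example of that baseline: an 8B model needs on the order of 8 GB of VRAM, which lines up with the 8–12 GB row above, while a 4-bit quantized build of the same model cuts the weight footprint roughly in half. Context length and KV cache add overhead on top of the weights, so leave some headroom.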

Windows

Install Ollama

  1. Download the installer: https://ollama.com/download

  2. Run the installer and complete the setup

  3. Open Command Prompt or PowerShell, and test it:

    CODE
    ollama run llama3
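
The first run downloads the model weights (several GB) and then opens an interactive prompt; type /bye to exit. If you get a response, Ollama is working locally.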

Bind to 0.0.0.0 for external server access

To allow Shinydocs Pro to connect to Ollama:

  1. Press Win + S, search for Environment Variables, and open the system settings

  2. Under System Variables, click New

    • Variable name: OLLAMA_HOST

    • Variable value: 0.0.0.0

  3. Click OK on all dialogs to apply

  4. Restart the Ollama app

Ollama will now listen on all interfaces, allowing Shinydocs Pro to connect using your machine's IP address.
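
If you prefer the command line, the same machine-wide variable can be set from an elevated Command Prompt or PowerShell (restart Ollama afterwards):

CODE
setx OLLAMA_HOST 0.0.0.0 /M

Ollama listens on port 11434 by default, so also make sure Windows Firewall allows inbound connections on that port from the Shinydocs Pro server.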

Linux

Install Ollama

  1. Visit Download Ollama on Linux for the latest version
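
    On most distributions, the download page provides a one-line install script; as of this writing it is:

    CODE
    curl -fsSL https://ollama.com/install.sh | sh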

  2. Start the Ollama service:

    CODE
    sudo systemctl start ollama
  3. Enable on boot:

    CODE
    sudo systemctl enable ollama
  4. Test it:

    CODE
    ollama run llama3

Bind to 0.0.0.0 for external server access

To allow network access for Shinydocs Pro:

  1. Run:

    CODE
    sudo systemctl edit ollama.service
  2. Add the following lines:

    CODE
    [Service]
    Environment="OLLAMA_HOST=0.0.0.0"
  3. Save and close

  4. Reload the systemd daemon and restart Ollama:

    CODE
    sudo systemctl daemon-reload
    sudo systemctl restart ollama

Now Ollama will bind to all interfaces.
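
To confirm it is reachable, query the Ollama API from the Shinydocs Pro server (replace the placeholder with your Ollama server’s IP; 11434 is Ollama’s default port):

CODE
curl http://<ollama-server-ip>:11434/api/tags

A JSON list of installed models means the connection works; a timeout usually means a firewall is blocking port 11434.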

Pull some models for use in Shinydocs Pro

We highly recommend using nomic-embed-text as your embedding model. You can pull this from Ollama by running the following command:

CODE
ollama pull nomic-embed-text:latest
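
You can confirm which models are installed locally at any time:

CODE
ollama list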

Ollama supported models for different use cases

Make sure the model you download fits in your GPU’s VRAM (see the sizing guide above).

Feel free to explore Ollama’s models and how they might fit your use cases: Ollama Models

Gemma3 (Google DeepMind)

  Use-cases: document Q&A, creative output, summarization, sentiment

  Pull commands:
    • ollama pull gemma3:4b
    • ollama pull gemma3:12b
    • ollama pull gemma3:27b

  Library page: gemma3

Cogito

  Use-cases: text classification, PII and sensitive info detection, pattern parsing, analysis-based questions, sentiment

  Pull commands:
    • ollama pull cogito:8b
    • ollama pull cogito:14b
    • ollama pull cogito:32b
    • ollama pull cogito:70b

  Library page: cogito

Llama3.2 (Meta)

  Use-cases: general conversation, document Q&A, summarization, creative non-technical prompts

  Pull commands:
    • ollama pull llama3.2:1b
    • ollama pull llama3.2:3b

  Library page: llama3.2

Llama3.1 (Meta)

  Use-cases: high-performance Q&A, structured enrichment, general enterprise analysis

  Pull commands:
    • ollama pull llama3.1:8b
    • ollama pull llama3.1:70b
    • ollama pull llama3.1:405b

  Library page: llama3.1

Mistral (Mistral AI)

  Use-cases: efficient instruction following, general enrichment, fast summarization

  Pull commands:
    • ollama pull mistral:7b

  Library page: mistral

Phi-4 (Microsoft)

  Use-cases: complex context understanding, compliance QA, technical/scientific summary and analysis

  Pull commands:
    • ollama pull phi4:14b

  Library page: phi4

Phi-3 (Microsoft)

  Use-cases: lightweight structured enrichment, low-resource Q&A, efficient summarization

  Pull commands:
    • ollama pull phi3:3.8b

  Library page: phi3

SmolLM2

  Use-cases: basic document Q&A, structured extraction on small text bodies, low-resource summarization

  Pull commands:
    • ollama pull smollm2:135m
    • ollama pull smollm2:360m
    • ollama pull smollm2:1.7b

  Library page: smollm2

Qwen 3 (Alibaba)

  Use-cases: multilingual support, complex text, math/numbers-based insights, code understanding

  Pull commands:
    • ollama pull qwen3:0.6b
    • ollama pull qwen3:1.7b
    • ollama pull qwen3:4b
    • ollama pull qwen3:8b
    • ollama pull qwen3:14b
    • ollama pull qwen3:30b
    • ollama pull qwen3:32b
    • ollama pull qwen3:235b

  Library page: qwen3

IBM Granite 3.3

  Use-cases: complex reasoning, compliance QA, larger (128k+ token) context windows

  Pull commands:
    • ollama pull granite3.3:2b
    • ollama pull granite3.3:8b

  Library page: granite3.3

DeepSeek-R1

  Notes: thinking model (responds with <think> tags to show its reasoning); responses are slower because the reasoning adds extra text

  Use-cases: detailed compliance QA, analytical document workflows

  Pull commands:
    • ollama pull deepseek-r1:1.5b
    • ollama pull deepseek-r1:7b
    • ollama pull deepseek-r1:8b
    • ollama pull deepseek-r1:14b
    • ollama pull deepseek-r1:32b
    • ollama pull deepseek-r1:70b
    • ollama pull deepseek-r1:671b

  Library page: deepseek-r1
