Local Large Language Models and Knowledge Bases

Ollama VS Hugging Face

To get a large model running quickly on your own machine, you can use Ollama or Hugging Face.

With Hugging Face, there are generally two quick ways to access a large model:

Inference API (Serverless)

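import requests

# Hosted Inference API endpoint for the model
API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-hf"
# Replace the placeholder with your own Hugging Face access token
headers = {"Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}

def query(payload):
    # POST the JSON payload and return the decoded JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Can you please let us know more details about your ",
})
print(output)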
Local execution

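# Use a pipeline as a high-level helper
from transformers import pipeline

# Downloads the weights on first use; this gated model requires approved access on Hugging Face
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
# Generate a short continuation and print it
print(pipe("Can you please let us know more details about your ", max_new_tokens=50)[0]["generated_text"])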
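Going through Hugging Face is therefore fairly difficult for people who do not program: first, you need to apply for an API key; second, you need a local Python (or other programming language) environment. Running a local large model with Ollama is much simpler.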

Ollama
Download Ollama from https://ollama.com/download. After installing it, run the following command in a terminal:

ollama run llama2:7b

As a rule of thumb, 7b models need at least 8 GB of RAM, 13b models need 16 GB, and 70b models need 64 GB.
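
These RAM figures can be sanity-checked with simple arithmetic: the memory taken by the weights is roughly the parameter count times the bytes stored per weight. The sketch below assumes 4-bit quantized weights (an assumption; the article does not state the quantization) and ignores the context cache and runtime overhead, which is one reason the recommended RAM is well above the result. `approx_weights_gb` is a hypothetical helper name, not part of Ollama.

```python
def approx_weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


for size in (7, 13, 70):
    print(f"{size}b: ~{approx_weights_gb(size):.1f} GB of weights at 4 bits per weight")
```

The estimates land in the same range as the download sizes in the model table further down, which suggests the default tags are quantized rather than stored at 16 bits per weight.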

Without any additional LLM training and without retrieval-augmented generation (RAG), you will see a response like the following:

function quicksort(arr) {
  if (arr.length <= 1) {
    return arr;
  }

  let pivot = arr[0];
  let left = [];
  let right = [];

  for (let i = 1; i < arr.length; i++) {
    if (arr[i] < pivot) {
      left.push(arr[i]);
    } else {
      right.push(arr[i]);
    }
  }

  return quicksort(left).concat([pivot], quicksort(right));
}

Below are some example models you can download. The full list of supported models is on the official site: https://ollama.com/library

Model    Parameters    Size    Download
Llama 2    7B    3.8GB    ollama run llama2
Mistral    7B    4.1GB    ollama run mistral
Dolphin Phi    2.7B    1.6GB    ollama run dolphin-phi
Phi-2    2.7B    1.7GB    ollama run phi
Neural Chat    7B    4.1GB    ollama run neural-chat
Starling    7B    4.1GB    ollama run starling-lm
Code Llama    7B    3.8GB    ollama run codellama
Llama 2 Uncensored    7B    3.8GB    ollama run llama2-uncensored
Llama 2 13B    13B    7.3GB    ollama run llama2:13b
Llama 2 70B    70B    39GB    ollama run llama2:70b
Orca Mini    3B    1.9GB    ollama run orca-mini
Vicuna    7B    3.8GB    ollama run vicuna
LLaVA    7B    4.5GB    ollama run llava
Gemma    2B    1.4GB    ollama run gemma:2b
Gemma    7B    4.8GB    ollama run gemma:7b

AnythingLLM

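Instead of the terminal, you can interact with the model through a modern interface such as AnythingLLM.

Ollama actually has two modes:

Chat mode
Server mode
Here we use server mode: Ollama runs the large model in the background and exposes an IP address and port for external software to use.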

ollama serve
From a terminal or command line, access http://localhost:11434 to verify that it is running:

curl http://localhost:11434
Ollama is running
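
Once the server answers, external programs can talk to it over HTTP. The following is a minimal sketch using only the Python standard library, assuming the `/api/generate` endpoint from Ollama's REST API and the `llama2:7b` model pulled earlier; the helper names `build_generate_request` and `generate` are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same address as the curl check


def build_generate_request(prompt: str, model: str = "llama2:7b") -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama2:7b") -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Requires `ollama serve` to be running and the model to be available locally:
# print(generate("Write quicksort in JavaScript."))
```

Setting `"stream": False` asks for one JSON object instead of a stream of partial responses, which keeps the client code short.
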
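Building a local knowledge base involves three key components:

LLM Model: the large language model. It processes and understands natural language.
Embedding Model: the embedding model. It maps high-dimensional data into a lower-dimensional embedding space. This processing step is very important in RAG.
Vector Store: the vector database, built specifically to handle large volumes of vector data efficiently.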
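
How the three components fit together can be sketched in a few lines. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the `VectorStore` class is a hypothetical in-memory list rather than a real vector database, and the final prompt is what would be handed to the LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """Toy in-memory vector store: keeps (vector, text) pairs and scans them linearly."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(question: str, store: VectorStore) -> str:
    """Prepend the best-matching document to the question before it goes to the LLM."""
    context = "\n".join(store.search(question))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"


store = VectorStore()
store.add("Ollama serves local models on port 11434.")
store.add("AnythingLLM is reachable on port 3001.")
print(build_prompt("Which port does Ollama use?", store))
```

A real setup replaces `embed` with calls to an embedding model and the linear scan with an indexed vector database, but the retrieve-then-prompt flow stays the same.
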
Installing AnythingLLM locally in a container
Reference: https://github.com/Mintplex-Labs/anything-llm/blob/master/docker/HOW_TO_USE_DOCKER.md

On Linux or macOS, run the following command:

export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm

Open http://localhost:3001 to access it.

The official documentation includes the following note:

If you are in docker and cannot connect to a service running on your host machine running on a local interface or loopback:

localhost
127.0.0.1
0.0.0.0

On linux http://host.docker.internal:xxxx does not work. Use http://172.17.0.1:xxxx instead to emulate this functionality.

Then in docker you need to replace that localhost part with host.docker.internal. For example, if running Ollama on the host machine, bound to http://127.0.0.1:11434 you should put http://host.docker.internal:11434 into the connection URL in AnythingLLM.
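
The substitution rule in the note above can be expressed as a small helper. This is a sketch under the assumptions stated in the quote (`host.docker.internal` by default, `172.17.0.1` on Linux); `to_container_url` is a hypothetical name and not part of AnythingLLM or Docker.

```python
from urllib.parse import urlsplit, urlunsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0"}


def to_container_url(url: str, on_linux: bool = False) -> str:
    """Rewrite a host-loopback URL so it is reachable from inside the container."""
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return url  # already points somewhere other than the host loopback
    host = "172.17.0.1" if on_linux else "host.docker.internal"
    netloc = f"{host}:{parts.port}" if parts.port else host
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(to_container_url("http://127.0.0.1:11434"))
print(to_container_url("http://localhost:11434", on_linux=True))
```

The result is the value to paste into the Ollama connection URL field in AnythingLLM when Ollama runs on the host and AnythingLLM runs in Docker.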

Choosing the local large model
Embedding configuration
You can choose https://ollama.com/library/nomic-embed-text or the embedder built into AnythingLLM.
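
If you pick the Ollama-hosted embedder, it may help to confirm that it responds before pointing AnythingLLM at it. The sketch below assumes the `/api/embeddings` endpoint from Ollama's REST API documentation and a model pulled with `ollama pull nomic-embed-text`; newer Ollama versions may offer a different embedding endpoint, and the helper names are illustrative.

```python
import json
import urllib.request


def build_embedding_request(text: str,
                            base_url: str = "http://localhost:11434",
                            model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build a POST request for the /api/embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str) -> list[float]:
    """Return the embedding vector for `text` from the local Ollama server."""
    with urllib.request.urlopen(build_embedding_request(text)) as resp:
        return json.loads(resp.read())["embedding"]


# Requires a running Ollama server with the embedding model pulled:
# print(len(embed("local knowledge base")))
```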

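Vector database configuration
To build a local vector database, see my earlier WeChat official account article "AI Agent 实战" (AI Agent in Practice) or the blog post https://flyeric.top/archives/setup-langchain-ai-agent-practice . You can also sign up for a free Pinecone trial.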


Last updated: 2024-04-12 13:27 Author: admin