bloomz

package module
v0.0.0-...-1834e77
Published: May 29, 2023 License: MIT Imports: 5 Imported by: 1

README

bloomz.cpp

Inference of HuggingFace's BLOOM-like models in pure C/C++.

The repo was built on top of the amazing llama.cpp repo by @ggerganov, to support BLOOM models. It supports all models that can be loaded using BloomForCausalLM.from_pretrained().

Demo

[Image: bloomz-7b1 demo]

Usage

First, you need to clone the repo and build it:

git clone https://github.com/NouamaneTazi/bloomz.cpp
cd bloomz.cpp
make

Convert weights

Then, you must convert the model weights to the ggml format. Any BLOOM model can be converted.

Some weights hosted on the Hub are already converted. You can find the list here.

Otherwise, the quickest way to convert weights is to use this converter tool. It is a Space hosted on the Hugging Face Hub that converts and quantizes weights for you, then uploads them to the repository of your choice.

If you prefer, you can manually convert the weights on your machine:

# install required libraries
python3 -m pip install torch numpy transformers accelerate

# download and convert the 7B1 model to ggml FP16 format
python3 convert-hf-to-ggml.py bigscience/bloomz-7b1 ./models 
# Note: you can add --use-f32 to convert to FP32 instead of FP16

Optionally, you can quantize the model to 4 bits.

./quantize ./models/ggml-model-bloomz-7b1-f16.bin ./models/ggml-model-bloomz-7b1-f16-q4_0.bin 2

Run inference

Finally, you can run the inference.

./main -m ./models/ggml-model-bloomz-7b1-f16-q4_0.bin -t 8 -n 128

When you pass a prompt with -p, the output should look like this:

make && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin  -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256

I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 13.1.6 (clang-1316.0.21.2.5)
I CXX:      Apple clang version 13.1.6 (clang-1316.0.21.2.5)

make: Nothing to be done for `default'.
main: seed = 1678899845
llama_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 250880
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 1
llama_model_load: n_head  = 32
llama_model_load: n_layer = 30
llama_model_load: f16     = 2
llama_model_load: n_ff    = 16384
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 5312.64 MB
llama_model_load: memory_size =   480.00 MB, n_mem = 15360
llama_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Bonjour, comment ça va?</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  3092.29 ms
main:   sample time =     2.40 ms
main:  predict time =  1003.04 ms / 59.00 ms per token
main:    total time =  5307.23 ms
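
The n_mem and memory_size figures in the log can be reproduced by hand from the other logged values. The sketch below assumes the key/value cache holds one key vector and one value vector of n_embd FP32 elements per layer per context position. The README does not state this layout; it is an assumption that happens to match the logged numbers. The helper name kvCacheBytes is hypothetical.

```go
package main

import "fmt"

// kvCacheBytes estimates the key/value cache size. It assumes one key and one
// value vector of nEmbd elements for every (layer, context position) pair.
// bytesPerElem is 4 for FP32 (assumed here) and would be 2 for FP16.
func kvCacheBytes(nLayer, nCtx, nEmbd, bytesPerElem int) (nMem int, bytes int64) {
	nMem = nLayer * nCtx
	bytes = 2 * int64(nMem) * int64(nEmbd) * int64(bytesPerElem)
	return nMem, bytes
}

func main() {
	// n_layer = 30, n_ctx = 512, n_embd = 4096 are taken from the log above.
	nMem, bytes := kvCacheBytes(30, 512, 4096, 4)
	fmt.Printf("n_mem = %d, memory_size = %.2f MB\n", nMem, float64(bytes)/(1024*1024))
}
```

Under the same assumption, storing the cache in FP16 would halve this figure, which is presumably what the EnableF16Memory model option in the Go API is for.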

Advanced usage

Here's a list of the available options:

usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 4)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 128)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --repeat_last_n N     last n tokens to consider for penalize (default: 64)
  --repeat_penalty N    penalize repeat sequence of tokens (default: 1.3)
  --temp N              temperature (default: 0.8)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/ggml-model-bloomz-7b1-f16-q4_0.bin)
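
If you drive the ./main binary from another program rather than through the Go bindings, the flags above map naturally onto a small config struct. The following is a minimal sketch: runConfig and its args method are hypothetical names introduced here, and only flags that appear in the help text are used.

```go
package main

import (
	"fmt"
	"strconv"
)

// runConfig is a hypothetical struct for this sketch. Its fields mirror a
// few of the flags listed in the help text.
type runConfig struct {
	Model    string
	Prompt   string
	Threads  int
	NPredict int
	Temp     float64
}

// args converts the config into an argument list for ./main.
// The prompt flag is only added when a prompt is set.
func (c runConfig) args() []string {
	out := []string{
		"-m", c.Model,
		"-t", strconv.Itoa(c.Threads),
		"-n", strconv.Itoa(c.NPredict),
		"--temp", strconv.FormatFloat(c.Temp, 'f', -1, 64),
	}
	if c.Prompt != "" {
		out = append(out, "-p", c.Prompt)
	}
	return out
}

func main() {
	cfg := runConfig{
		Model:    "./models/ggml-model-bloomz-7b1-f16-q4_0.bin",
		Prompt:   `Translate "Hi, how are you?" in French:`,
		Threads:  8,
		NPredict: 256,
		Temp:     0.8,
	}
	// To launch the binary, pass this slice to exec.Command("./main", ...)
	// from the os/exec package.
	fmt.Println(cfg.args())
}
```

Passing arguments as a slice avoids shell quoting problems with prompts that contain quotes, such as the translation prompt above.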

Memory usage

Model                 Disk     Mem
bloomz-7b1-f16-q4_0   4.7 GB   5.3 GB

iOS App

The repo includes a proof-of-concept iOS app in the Bloomer directory. You need to provide the converted model weights, placing a file called ggml-model-bloomz-560m-f16.bin inside that folder. This is what it looks like on an iPhone:

[Image: screenshot of the Bloomer app on an iPhone]

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Bloomz

type Bloomz struct {
	// contains filtered or unexported fields
}

func New

func New(model string, opts ...ModelOption) (*Bloomz, error)

func (*Bloomz) Free

func (l *Bloomz) Free()

func (*Bloomz) Predict

func (l *Bloomz) Predict(text string, opts ...PredictOption) (string, error)

type ModelOption

type ModelOption func(p *ModelOptions)

var EnableF16Memory ModelOption = func(p *ModelOptions) {
	p.F16Memory = true
}

func SetContext

func SetContext(c int) ModelOption

SetContext sets the context size.

type ModelOptions

type ModelOptions struct {
	ContextSize int
	F16Memory   bool
}

var DefaultModelOptions ModelOptions = ModelOptions{
	ContextSize: 512,
	F16Memory:   false,
}

func NewModelOptions

func NewModelOptions(opts ...ModelOption) ModelOptions

NewModelOptions creates a new ModelOptions object with the given options.
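
ModelOption follows Go's functional options pattern: each option is a function that mutates a ModelOptions value. The sketch below shows how the documented pieces likely fit together. The type definitions, DefaultModelOptions, and EnableF16Memory are copied from this page; the bodies of SetContext and NewModelOptions are not shown in the documentation and are assumptions.

```go
package main

import "fmt"

// ModelOptions and ModelOption are copied from the documentation.
type ModelOptions struct {
	ContextSize int
	F16Memory   bool
}

type ModelOption func(p *ModelOptions)

var DefaultModelOptions = ModelOptions{
	ContextSize: 512,
	F16Memory:   false,
}

var EnableF16Memory ModelOption = func(p *ModelOptions) {
	p.F16Memory = true
}

// SetContext: only the signature is documented; this body is an assumption.
func SetContext(c int) ModelOption {
	return func(p *ModelOptions) {
		p.ContextSize = c
	}
}

// NewModelOptions: assumed to copy the defaults and apply each option in order.
func NewModelOptions(opts ...ModelOption) ModelOptions {
	p := DefaultModelOptions // struct copy, so the defaults stay untouched
	for _, opt := range opts {
		opt(&p)
	}
	return p
}

func main() {
	fmt.Printf("%+v\n", NewModelOptions())
	fmt.Printf("%+v\n", NewModelOptions(SetContext(1024), EnableF16Memory))
}
```

With the real package, the same options would be passed straight to the constructor, for example New(modelPath, SetContext(1024), EnableF16Memory), where modelPath points to a converted ggml file.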

type PredictOption

type PredictOption func(p *PredictOptions)

func SetPenalty

func SetPenalty(penalty float64) PredictOption

SetPenalty sets the repetition penalty for text generation.

func SetRepeat

func SetRepeat(repeat int) PredictOption

SetRepeat sets the number of times to repeat text generation.

func SetSeed

func SetSeed(seed int) PredictOption

SetSeed sets the random seed for sampling text generation.

func SetTemperature

func SetTemperature(temp float64) PredictOption

SetTemperature sets the temperature value for text generation.

func SetThreads

func SetThreads(threads int) PredictOption

SetThreads sets the number of threads to use for text generation.

func SetTokens

func SetTokens(tokens int) PredictOption

SetTokens sets the number of tokens to generate.

func SetTopK

func SetTopK(topk int) PredictOption

SetTopK sets the value for top-K sampling.

func SetTopP

func SetTopP(topp float64) PredictOption

SetTopP sets the value for nucleus sampling.

type PredictOptions

type PredictOptions struct {
	Seed, Threads, Tokens, TopK, Repeat int
	TopP, Temperature, Penalty          float64
}

var DefaultOptions PredictOptions = PredictOptions{
	Seed:        -1,
	Threads:     runtime.NumCPU(),
	Tokens:      128,
	TopK:        10000,
	TopP:        0.90,
	Temperature: 0.96,
	Penalty:     1,
	Repeat:      64,
}

func NewPredictOptions

func NewPredictOptions(opts ...PredictOption) PredictOptions

NewPredictOptions creates a new PredictOptions object with the given options.
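
Note that DefaultOptions differs from the CLI defaults listed under Advanced usage: TopK is 10000 rather than 40, Temperature is 0.96 rather than 0.8, and Penalty is 1 rather than 1.3. The sketch below overrides the defaults to mirror the CLI values. The struct and DefaultOptions are copied from this page; the setter bodies and the body of NewPredictOptions are assumptions, since only their signatures are documented, and only a subset of the setters is shown.

```go
package main

import (
	"fmt"
	"runtime"
)

// PredictOptions and DefaultOptions are copied from the documentation.
type PredictOptions struct {
	Seed, Threads, Tokens, TopK, Repeat int
	TopP, Temperature, Penalty          float64
}

type PredictOption func(p *PredictOptions)

var DefaultOptions = PredictOptions{
	Seed:        -1,
	Threads:     runtime.NumCPU(),
	Tokens:      128,
	TopK:        10000,
	TopP:        0.90,
	Temperature: 0.96,
	Penalty:     1,
	Repeat:      64,
}

// The setter bodies below are assumptions; only the signatures are documented.
func SetTokens(tokens int) PredictOption {
	return func(p *PredictOptions) { p.Tokens = tokens }
}

func SetTopK(topk int) PredictOption {
	return func(p *PredictOptions) { p.TopK = topk }
}

func SetTemperature(temp float64) PredictOption {
	return func(p *PredictOptions) { p.Temperature = temp }
}

func SetPenalty(penalty float64) PredictOption {
	return func(p *PredictOptions) { p.Penalty = penalty }
}

// NewPredictOptions: assumed to copy DefaultOptions and apply each option in order.
func NewPredictOptions(opts ...PredictOption) PredictOptions {
	p := DefaultOptions
	for _, opt := range opts {
		opt(&p)
	}
	return p
}

func main() {
	// Mirror the CLI sampling defaults: top_k 40, temp 0.8, repeat_penalty 1.3.
	po := NewPredictOptions(SetTopK(40), SetTemperature(0.8), SetPenalty(1.3), SetTokens(256))
	fmt.Printf("%+v\n", po)
}
```

With the real package, the same options would be passed to Predict after the prompt text, for example Predict(prompt, SetTopK(40), SetTemperature(0.8), SetPenalty(1.3)).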
