bloomz

package module

v0.0.0-...-1834e77 Published: May 29, 2023 License: MIT Imports: 5 Imported by: 1

Repository

github.com/go-skynet/bloomz.cpp

README ¶

bloomz.cpp

Inference of HuggingFace's BLOOM-like models in pure C/C++.

The repo was built on top of the amazing llama.cpp repo by @ggerganov, to support BLOOM models. It supports all models that can be loaded using BloomForCausalLM.from_pretrained().

Demo

[Image: bloomz-7b1 demo]

Usage

First, you need to clone the repo and build it:

git clone https://github.com/NouamaneTazi/bloomz.cpp
cd bloomz.cpp
make

Convert weights

Then, you must convert the model weights to the ggml format. Any BLOOM model can be converted.

Some weights hosted on the Hub are already converted. You can find the list here.

Otherwise, the quickest way to convert weights is to use this converter tool. It is a Space hosted on the Hugging Face Hub that converts and quantizes the weights for you and uploads them to the repository of your choice.

If you prefer, you can manually convert the weights on your machine:

# install required libraries
python3 -m pip install torch numpy transformers accelerate

# download and convert the 7B1 model to ggml FP16 format
python3 convert-hf-to-ggml.py bigscience/bloomz-7b1 ./models 
# Note: you can add --use-f32 to convert to FP32 instead of FP16

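Optionally, you can quantize the model to 4 bits. The trailing 2 in the command selects the q4_0 quantization type, which matches the q4_0 suffix of the output file name.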

./quantize ./models/ggml-model-bloomz-7b1-f16.bin ./models/ggml-model-bloomz-7b1-f16-q4_0.bin 2

Run inference

Finally, you can run the inference.

./main -m ./models/ggml-model-bloomz-7b1-f16-q4_0.bin -t 8 -n 128

Your output should look like this:

make && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin  -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256

I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 13.1.6 (clang-1316.0.21.2.5)
I CXX:      Apple clang version 13.1.6 (clang-1316.0.21.2.5)

make: Nothing to be done for `default'.
main: seed = 1678899845
llama_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 250880
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 1
llama_model_load: n_head  = 32
llama_model_load: n_layer = 30
llama_model_load: f16     = 2
llama_model_load: n_ff    = 16384
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 5312.64 MB
llama_model_load: memory_size =   480.00 MB, n_mem = 15360
llama_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Bonjour, comment ça va?</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  3092.29 ms
main:   sample time =     2.40 ms
main:  predict time =  1003.04 ms / 59.00 ms per token
main:    total time =  5307.23 ms
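
The memory_size and n_mem values in the log can be reproduced from the hyperparameters printed just before them. The sketch below assumes the cache holds one key tensor and one value tensor, each with n_layer × n_ctx vectors of n_embd FP32 elements. That formula is inferred from the logged numbers rather than taken from the source, and kvCacheBytes is a hypothetical helper, not part of the repo.

```go
package main

import "fmt"

// kvCacheBytes estimates the size of the key/value cache: two tensors
// (keys and values), each holding nLayer*nCtx vectors of nEmbd elements.
// bytesPerElem is 4 for FP32 and 2 for FP16.
// Hypothetical helper; the formula is an assumption checked against the log.
func kvCacheBytes(nLayer, nCtx, nEmbd, bytesPerElem int) int {
	nMem := nLayer * nCtx // corresponds to n_mem in the log
	return 2 * nMem * nEmbd * bytesPerElem
}

func main() {
	const mib = 1024 * 1024

	// Values from the log: n_layer = 30, n_ctx = 512, n_embd = 4096.
	f32 := kvCacheBytes(30, 512, 4096, 4)
	f16 := kvCacheBytes(30, 512, 4096, 2)

	fmt.Printf("n_mem      = %d\n", 30*512)
	fmt.Printf("FP32 cache = %.2f MB\n", float64(f32)/mib)
	fmt.Printf("FP16 cache = %.2f MB\n", float64(f16)/mib)
}
```

With FP32 elements this gives n_mem = 15360 and 480.00 MB, matching the log. Under the same assumption, storing the cache in FP16 would halve it to 240.00 MB.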

Advanced usage

Here's a list of the available options:

usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 4)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 128)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --repeat_last_n N     last n tokens to consider for penalize (default: 64)
  --repeat_penalty N    penalize repeat sequence of tokens (default: 1.3)
  --temp N              temperature (default: 0.8)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/ggml-model-bloomz-7b1-f16-q4_0.bin)

Memory usage

Model	Disk	Mem
`bloomz-7b1-f16-q4_0`	4.7 GB	5.3 GB

iOS App

The repo includes a proof-of-concept iOS app in the Bloomer directory. You need to provide the converted model weights, placing a file called ggml-model-bloomz-560m-f16.bin inside that folder. This is what it looks like on an iPhone:

[Image: bloom-ios-screenshot]

Documentation ¶

Index ¶

type Bloomz
- func New(model string, opts ...ModelOption) (*Bloomz, error)
- func (l *Bloomz) Free()
- func (l *Bloomz) Predict(text string, opts ...PredictOption) (string, error)
type ModelOption
- func SetContext(c int) ModelOption
type ModelOptions
- func NewModelOptions(opts ...ModelOption) ModelOptions
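type PredictOption
- func SetPenalty(penalty float64) PredictOption
- func SetRepeat(repeat int) PredictOption
- func SetSeed(seed int) PredictOption
- func SetTemperature(temp float64) PredictOption
- func SetThreads(threads int) PredictOption
- func SetTokens(tokens int) PredictOption
- func SetTopK(topk int) PredictOption
- func SetTopP(topp float64) PredictOption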
type PredictOptions
- func NewPredictOptions(opts ...PredictOption) PredictOptions

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Bloomz ¶

type Bloomz struct {
	// contains filtered or unexported fields
}

func New ¶

func New(model string, opts ...ModelOption) (*Bloomz, error)

func (*Bloomz) Free ¶

func (l *Bloomz) Free()

func (*Bloomz) Predict ¶

func (l *Bloomz) Predict(text string, opts ...PredictOption) (string, error)

type ModelOption ¶

type ModelOption func(p *ModelOptions)

var EnableF16Memory ModelOption = func(p *ModelOptions) {
	p.F16Memory = true
}

func SetContext ¶

func SetContext(c int) ModelOption

SetContext sets the context size.

type ModelOptions ¶

type ModelOptions struct {
	ContextSize int
	F16Memory   bool
}

var DefaultModelOptions ModelOptions = ModelOptions{
	ContextSize: 512,
	F16Memory:   false,
}

func NewModelOptions ¶

func NewModelOptions(opts ...ModelOption) ModelOptions

NewModelOptions creates a new ModelOptions object with the given options.
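
ModelOption follows the functional options pattern: each option is a function that mutates a ModelOptions value. The sketch below redeclares the documented types locally so that it runs without the package or a model file. The bodies of SetContext and NewModelOptions are assumptions (start from DefaultModelOptions, then apply each option in order), since only their signatures are documented here.

```go
package main

import "fmt"

// Local mirrors of the documented types, so the sketch is self-contained.
type ModelOptions struct {
	ContextSize int
	F16Memory   bool
}

type ModelOption func(p *ModelOptions)

var DefaultModelOptions = ModelOptions{ContextSize: 512, F16Memory: false}

// EnableF16Memory is reproduced from the documentation.
var EnableF16Memory ModelOption = func(p *ModelOptions) {
	p.F16Memory = true
}

// SetContext: the body is an assumption based on the doc comment.
func SetContext(c int) ModelOption {
	return func(p *ModelOptions) { p.ContextSize = c }
}

// NewModelOptions: assumed implementation, copying the defaults and
// applying the options in the order they were passed.
func NewModelOptions(opts ...ModelOption) ModelOptions {
	p := DefaultModelOptions // struct copy; the defaults are not modified
	for _, opt := range opts {
		opt(&p)
	}
	return p
}

func main() {
	fmt.Printf("%+v\n", NewModelOptions())
	fmt.Printf("%+v\n", NewModelOptions(SetContext(1024), EnableF16Memory))
}
```

With the real package, the same options would be passed as the variadic arguments of New, for example New(path, SetContext(1024), EnableF16Memory).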

type PredictOption ¶

type PredictOption func(p *PredictOptions)

func SetPenalty ¶

func SetPenalty(penalty float64) PredictOption

SetPenalty sets the repetition penalty for text generation.

func SetRepeat ¶

func SetRepeat(repeat int) PredictOption

SetRepeat sets the number of times to repeat text generation.

func SetSeed ¶

func SetSeed(seed int) PredictOption

SetSeed sets the random seed for sampling text generation.

func SetTemperature ¶

func SetTemperature(temp float64) PredictOption

SetTemperature sets the temperature value for text generation.

func SetThreads ¶

func SetThreads(threads int) PredictOption

SetThreads sets the number of threads to use for text generation.

func SetTokens ¶

func SetTokens(tokens int) PredictOption

SetTokens sets the number of tokens to generate.

func SetTopK ¶

func SetTopK(topk int) PredictOption

SetTopK sets the value for top-K sampling.

func SetTopP ¶

func SetTopP(topp float64) PredictOption

SetTopP sets the value for nucleus sampling.
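
To make the effect of SetTopK and SetTopP concrete, here is a generic sketch of top-K followed by nucleus (top-P) filtering over a small, made-up distribution. It is not the code this package or the underlying C++ uses. The candidate type, filterTopKTopP, and the probabilities are all hypothetical, and the threshold is applied to the original probabilities as a simplification.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical token/probability pair.
type candidate struct {
	Token string
	Prob  float64
}

// filterTopKTopP keeps the k most probable candidates, then the shortest
// prefix of those whose cumulative probability reaches p, and finally
// renormalises the survivors so their probabilities sum to 1.
func filterTopKTopP(cands []candidate, k int, p float64) []candidate {
	sorted := append([]candidate(nil), cands...) // do not reorder the input
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Prob > sorted[j].Prob
	})
	if k > 0 && k < len(sorted) {
		sorted = sorted[:k] // top-K cut
	}
	cum, cut := 0.0, len(sorted)
	for i, c := range sorted {
		cum += c.Prob
		if cum >= p {
			cut = i + 1 // nucleus cut
			break
		}
	}
	sorted = sorted[:cut]
	total := 0.0
	for _, c := range sorted {
		total += c.Prob
	}
	if total > 0 {
		for i := range sorted {
			sorted[i].Prob /= total
		}
	}
	return sorted
}

func main() {
	cands := []candidate{
		{"Bonjour", 0.5}, {"Salut", 0.25}, {"Coucou", 0.125}, {"Allo", 0.125},
	}
	for _, c := range filterTopKTopP(cands, 3, 0.7) {
		fmt.Printf("%-8s %.3f\n", c.Token, c.Prob)
	}
}
```

In this example top-K = 3 drops the last candidate, and top-P = 0.7 then keeps only the first two, because their cumulative probability of 0.75 is the first to reach the threshold. The next token would be sampled from those two.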

type PredictOptions ¶

type PredictOptions struct {
	Seed, Threads, Tokens, TopK, Repeat int
	TopP, Temperature, Penalty          float64
}

var DefaultOptions PredictOptions = PredictOptions{
	Seed:        -1,
	Threads:     runtime.NumCPU(),
	Tokens:      128,
	TopK:        10000,
	TopP:        0.90,
	Temperature: 0.96,
	Penalty:     1,
	Repeat:      64,
}

func NewPredictOptions ¶

func NewPredictOptions(opts ...PredictOption) PredictOptions

Create a new PredictOptions object with the given options.

Directories ¶

Path	Synopsis
examples
