Showing posts with label benchmark.

2023-12-22

Benchmarking LLM models

Today we're gonna try to use ollama to generate some code. It's quite easy: you just need to install it using curl like this, then install some models:

curl https://ollama.ai/install.sh | sh

# these models were already downloaded before the first run
time ollama run codellama 'show me inplace mergesort using golang'
#time ollama run deepseek-coder 'show me inplace mergesort using golang'
time ollama run llama2-uncensored 'show me inplace mergesort using golang'
time ollama run mistral 'show me inplace mergesort using golang'

#time ollama run starcoder 'show me inplace mergesort using golang'
time ollama run codeup 'show me inplace mergesort using golang'
time ollama run codebooga 'show me inplace mergesort using golang'
time ollama run magicoder 'show me inplace mergesort using golang'

ollama list
NAME                     ID           SIZE    MODIFIED
codebooga:latest         05b83c5673dc 19 GB   3 seconds ago
codellama:latest         8fdf8f752f6e 3.8 GB  4 minutes ago
deepseek-coder:latest    140a485970a6 776 MB  10 minutes ago
llama2-uncensored:latest 44040b922233 3.8 GB  10 minutes ago
mistral:latest           4d9f4b269c33 4.1 GB  2 days ago
codeup:latest            54289661f7a9 7.4 GB  4 minutes ago
magicoder:latest         8007de06f5d9 3.8 GB  4 minutes ago
starcoder:latest         847e5a7aa26f 1.8 GB  10 minutes ago

Here's the result for each of them; the benchmark runs on a 32-core CPU, since my GPU is still unsupported (gfx1032):

codellama - 62.59s - will overwrite "left" when the regions overlap (the subslices share arr's backing array), but at least the structure is roughly correct

 Sure! Here is an example of how to implement in-place merge sort using Go:
```
package main

import (
        "fmt"
)

func merge(arr []int, low int, mid int, high int) {
        left := arr[low : mid+1]
        right := arr[mid+1 : high+1]

        for i := low; i <= high; i++ {
                if left[i-low] > right[i-mid] {
                        arr[i] = left[i-low]
                } else {
                        arr[i] = right[i-mid]
                }
        }
}

func inplaceMergeSort(arr []int, low int, high int) {
        if low >= high {
                return
        }

        mid := (low + high) / 2
        inplaceMergeSort(arr, low, mid)
        inplaceMergeSort(arr, mid+1, high)
        merge(arr, low, mid, high)
}

func main() {
        arr := []int{5, 3, 7, 9, 2, 4, 6, 8}
        inplaceMergeSort(arr, 0, len(arr)-1)
        fmt.Println("Sorted array:", arr)
}
```
This code sorts the input array in place using the merge sort algorithm. The `merge` function is used to combine two sorted subarrays
into a single sorted array. The `inplaceMergeSort` function recursively splits the array into smaller subarrays until each subarray
contains only one element, at which point the `merge` function is called to combine the subarrays.

The `main` function demonstrates how to use the `inplaceMergeSort` function on an example input array.

deepseek-coder - 23.53s - fast, but messy indentation, and it's spouting quicksort, not mergesort

 
Here's an in-place version of merge sort using Golang:

```go
package main
 
import (
    "fmt"
)     
    
func partition(array []int, low int , high int ) int {
        pivot := array[high] // Picks the last element as a pivot. This is done arbitrarily here for simplicity of code but in
real-world cases it should be chosen by an external algorithm like quicksort or similar with better average case performance then
mergesorts which pick random elements
        i := (low - 1)    // Index of smaller element  
        for j:=  low ;j <= high-1; j++ {     
            if array[j] < pivot{      
                i++                     // Swap arr[i+1] and arra[high], then increment i.  This is the part that makes sure all elements sma
smaller than Pivot go to left of it, while rest are in right side    // The key process is to check if current element 'array'[] has
value less
                //than pivot'. If yes,"then do nothing", else swap values and increment i.
                }                          }      return (i + 1)            }
            func merge( array []int, low int , mid int,high int ) { // The main function that sorts arr[l..r] using auxiliary space
a[]={0...n} and i is index of start element in left subarray 和 right.   pivot such as to be inserted
                // elements into correct position for final sorted array, then call itself recursively      return }            public func m
main() {          arr :=  [7,12,45689,30,-2 ,-5]         fmt.Println(array)    mergeSort(&arr[..])     // Calling the function to sort
array using mergesort
                }` `go run main.go 输出结果为[-5 -2  12  7 30 45689 ]] ) and go playground https://play.golang.org/p/-Ni_kJvKlf for running the code
```     Sorry, I can't provide you with a full in-place merge sort because it depends on your requirements (whether to print or not),
as well complexity analysis and other factors which may affect performance of an algorithm. The above is just how one might implement
such functionality within Go language using Goroutines if necessary for multi threading/parallelism, but would need a different
approach otherwise due the nature of merge sort itself being inherently sequential in operation (each element has to be processed
before it can go on). 
 

llama2-uncensored - 50.87s - err, it's using the built-in sort XD

Sure, here's an implementation of in-place mergesort using Go:
```go
package main
import "fmt"
func mergeSort(arr []int) {
 if len(arr) <= 1 {
 return
 }
 mid := len(arr) / 2
 left := arr[:mid]
 right := arr[mid:]
 fmt.Println("Merging", left, "and", right)
 sort.Ints(left)
 sort.Ints(right)
 mergeSort(left)
 merge(left, right)
}
func merge(left []int, right []int) []int {
 res := make([]int, len(left)+len(right))
 for i := 0; i < len(res); i++ {
 if left[i] <= right[i-1] {
 res[i] = left[i]
 } else {
 res[i] = right[i-1]
 }
 }
 return res
}
func main() {
 arr := []int{5, 3, 2, 1, 7, 8, 4, 6}
 fmt.Println("Original array:", arr)
 mergeSort(arr)
 fmt.Println("Merged array: ", arr)
}
```
This implementation uses the built-in `sort` package to sort the left and right halves of the original array separately, and then
merges them using a recursive function called `merge`. Finally, it applies the sorted array to the original one using the `mergeSort`
function.

mistral - 96.77s - not really in-place XD, it creates copies

 Here's an example of in-place MergeSort implementation in Golang. In this approach, we divide the given slice into smaller sub-arrays
until each sub-array contains only one element. Then we merge these sub-arrays back together to sort the original slice.

```go
func merge(arr []int, left int, mid int, right int) { // Merge function
    n1 := mid - left + 1
    n2 := right - mid

    Left := make([]int, n1)
    Right := make([]int, n2)

    for i := 0; i < n1; i++ {
        Left[i] = arr[left + i]
    }
    for j := 0; j < n2; j++ {
        Right[j] = arr[mid+1+j]
    }

    i := 0
    j := 0
    k := left

    for i < n1 && j < n2 {
        if Left[i] <= Right[j] {
            arr[k] = Left[i]
            i++
        } else {
            arr[k] = Right[j]
            j++
        }
        k++
    }

    for i < n1 { // Copy the rest of Left
        arr[k] = Left[i]
        i++
        k++
    }

    for j < n2 { // Copy the rest of Right
        arr[k] = Right[j]
        j++
        k++
    }
}

func mergeSort(arr []int, left int, right int) { // Merge sort function
    if len(arr) <= 1 {
        return
    }

    mid := (left + right) / 2

    mergeSort(arr, left, mid)
    mergeSort(arr, mid+1, right)
    merge(arr, left, mid, right)
}

func main() {
    arr := []int{12, 11, 13, 5, 6, 7}
    fmt.Println("Unsorted array:")
    printArray(arr)
    mergeSort(arr, 0, len(arr)-1)
    fmt.Println("Sorted array:")
    printArray(arr)
}

func printArray(arr []int) {
    for _, a := range arr {
        fmt.Println(a, " ")
    }
    fmt.Println()
}
```

In this example, the `mergeSort` function is called recursively on the array until each sub-array contains only one element. After
that, the merge function merges these sub-arrays back together, sorting the entire array in place. 

starcoder - 71.65s - BAD, spouting python nonsense

# golang
go run -race./cmd/test
# python
python3.7 -m pytest tests_unit --cov=src
-- spouting nonsense python code --

magicoder - 91.89s - not really in-place

 Here is a simple implementation of the in-place merge sort algorithm in Golang. It's very similar to the traditional merge sort
algorithm but works in-place without needing an additional array space:

```go
package main

import (
        "fmt"
)

func printArray(arr []int) {
        for _, value := range arr {
                fmt.Print(value, " ")
        }
        fmt.Println()
}

func mergeSort(arr []int) {
        var helper [100]int // declare an array to hold the helper values
        mergeSortRecursive(&helper[0], arr, 0, len(arr)-1)
}

func mergeSortRecursive(aux *[100]int, src []int, lo, hi int) {
        if lo >= hi {
                return // base case: array is sorted
        }

        // sort the left half
        mid := (lo + hi) / 2
        mergeSortRecursive(aux, src, lo, mid)

        // sort the right half
        mergeSortRecursive(aux, src, mid+1, hi)

        // merge the sorted halves together
        merge(&helper[0], aux, src, lo, mid, hi)
}

func merge(dst *[100]int, aux *[100]int, src []int, lo, mid, hi int) {
        i := lo
        j := mid + 1

        for k := lo; k <= hi; k++ {
                if i > mid {
                        dst[k] = src[j]
                        j++
                } else if j > hi {
                        dst[k] = src[i]
                        i++
                } else if src[i] < src[j] {
                        dst[k] = src[i]
                        i++
                } else {
                        dst[k] = src[j]
                        j++
                }
        }

        // copy back the sorted values to our original array
        for i, value := range dst[:hi-lo+1] {
                src[i+lo] = value
        }
}

func main() {
        arr := []int{37, 49, 38, 50, 3, 13}
        printArray(arr) // before sorting
        mergeSort(arr)
        printArray(arr) // after sorting
}
```
This code prints the input array, sorts it using merge sort algorithm and prints again to check if the sorting is successful.

codeup - 127.32s - uses temporary slices, so this is just normal mergesort; quite slow

Here is an example of an in-place mergesort implementation in Go:
```
func mergeSort(arr []int, low, high int) {
    if low < high {
        // divide the array into two halves
        mid := (low + high) / 2
        mergeSort(arr, low, mid)
        mergeSort(arr, mid+1, high)
        merge(arr, low, mid, high)
    }
}

func merge(arr []int, low, mid, high int) {
    var result []int
    left := arr[:mid]
    right := arr[mid:]
    for i := 0; i < len(left); i++ {
        result = append(result, left[i])
    }
    for j := 0; j < len(right); j++ {
        result = append(result, right[j])
    }
    return result
}
```
This code sorts the input array in place using the mergesort algorithm. The `mergeSort` function divides the array into two halves, sorts each half recursively, and then merges the two sorted halves back together using the `merge` function. The `merge` function takes four
integer parameters: the left and right indices of the two arrays being merged, and the low and high indices of the resulting array. It appends the elements of the left and right arrays to the resulting array in the correct order.

To use this code, you can simply call the `mergeSort` function with the address of the input array as an argument, like this:
```
arr := []int{1, 2, 3, 4, 5, 6}
mergeSort(arr)
fmt.Println(arr) // [1, 2, 3, 4, 5, 6]
```
This will sort the input array in place and print the resulting sorted array. Note that this implementation of mergesort has a time complexity of O(n log n), which means it will take longer to sort larger arrays.

codebooga - 490.72s - uses temporary slices, so this is just normal mergesort; slow AF

Here is an implementation of in-place merge sort using Go:

```go
package main

import (
        "fmt"
)

func merge(arr []int, low int, mid int, high int) {
        n1 := mid - low + 1
        n2 := high - mid

        L := make([]int, n1)
        R := make([]int, n2)

        for i := 0; i < n1; i++ {
                L[i] = arr[low+i]
        }
        for j := 0; j < n2; j++ {
                R[j] = arr[mid+1+j]
        }

        i, j, k := 0, 0, low

        for i < n1 && j < n2 {
                if L[i] <= R[j] {
                        arr[k] = L[i]
                        i++
                } else {
                        arr[k] = R[j]
                        j++
                }
                k++
        }

        for i < n1 {
                arr[k] = L[i]
                i++
                k++
        }

        for j < n2 {
                arr[k] = R[j]
                j++
                k++
        }
}

func mergeSort(arr []int, low int, high int) {
        if low < high {
                mid := (low + high) / 2
                mergeSort(arr, low, mid)
                mergeSort(arr, mid+1, high)
                merge(arr, low, mid, high)
        }
}

func inPlaceMergeSort(arr []int) {
        if len(arr) > 1 {
                mergeSort(arr, 0, len(arr)-1)
        }
}

func main() {
        arr := []int{12, 11, 13, 5, 6, 7}
        fmt.Println("Given array is: ", arr)
        inPlaceMergeSort(arr)
        fmt.Println("Sorted array is: ", arr)
}
```

This program sorts the given array using in-place merge sort algorithm. The `merge` function takes three parameters, an array `arr`, and two indices `low` and `high`. It merges the two sorted subarrays of `arr`, one starting from index `low` to `mid` (inclusive) and another from
`mid+1` to `high` (inclusive), into a single sorted subarray. The `mergeSort` function is a recursive function that sorts the array using merge sort algorithm by dividing it into two halves until each half has only one element, then calling the `merge` function to combine them
back in order. Finally, the `inPlaceMergeSort` function calls `mergeSort` on the input array and sorts it in place. The `main` function demonstrates how to use this implementation by sorting an example array.

Conclusion

So as you can see, the size of the model determines how fast the answer is resolved, and none of them are good enough yet to implement the in-place merge part.
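
For reference, a genuinely in-place merge is possible by shifting elements instead of copying into a temporary slice; here's a minimal sketch (O(n²) worst case, which is why real in-place implementations use more complex block-merge schemes):

```go
package main

import "fmt"

// mergeInPlace merges the sorted runs arr[low..mid] and arr[mid+1..high]
// using O(1) extra space by shifting elements to the right.
func mergeInPlace(arr []int, low, mid, high int) {
	i, j := low, mid+1
	for i <= mid && j <= high {
		if arr[i] <= arr[j] {
			i++ // already in order
			continue
		}
		// arr[j] belongs at position i: shift arr[i..j-1] right by one.
		v := arr[j]
		copy(arr[i+1:j+1], arr[i:j]) // copy handles overlapping slices
		arr[i] = v
		i++
		mid++
		j++
	}
}

func mergeSort(arr []int, low, high int) {
	if low >= high {
		return
	}
	mid := (low + high) / 2
	mergeSort(arr, low, mid)
	mergeSort(arr, mid+1, high)
	mergeInPlace(arr, low, mid, high)
}

func main() {
	arr := []int{5, 3, 7, 9, 2, 4, 6, 8}
	mergeSort(arr, 0, len(arr)-1)
	fmt.Println(arr) // [2 3 4 5 6 7 8 9]
}
```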

2023-10-14

Benchmarking docker-volume vs mount-fs vs tmpfs

So today we're gonna benchmark docker-volume (bind to a docker-managed volume), bind/mount-fs (bind to the host filesystem), and tmpfs. Which one is the fastest? Here's the docker compose:

version: "3.7"
services:
  web:
    image: ubuntu
    command: "sleep 3600"
    volumes:
        - ./temp1:/temp1 # mountfs
        - temp2:/temp2   # dockvol
        - temp3:/temp3   # tmpfs

volumes:
  temp2:
  temp3:
    driver_opts:
      type: tmpfs
      device: tmpfs


The docker compose file is in a sibling directory of docker's data-root, to ensure the same SSD is used. For the first benchmark we're gonna clone this repository, then copy it, create 100 x 10MB files, compile it, and do 2 sequential writes (small and large). Here's the result (some steps are not pasted below, e.g. removing files when running a benchmark twice):

apt install git g++ make time
alias time='/usr/bin/time -f "\nCPU: %Us\tReal: %es\tRAM: %MKB"'

cd /temp3 # tmpfs
git clone https://github.com/nikolausmayer/file-IO-benchmark.git

### copy small files

time cp -R /temp3/file-IO-benchmark /temp2 # dockvol
CPU: 0.00s      Real: 1.02s     RAM: 2048KB

time cp -R /temp3/file-IO-benchmark /temp1 # bindfs
CPU: 0.00s      Real: 1.00s     RAM: 2048KB

### create 100 x 10MB files

cd /temp3/file*
time make data # tmpfs
CPU: 0.41s      Real: 0.91s     RAM: 3072KB

cd /temp2/file*
time make data # dockvol
CPU: 0.44s      Real: 1.94s     RAM: 2816KB

cd /temp1/file*
time make data # mountfs
CPU: 0.51s      Real: 1.83s     RAM: 2816KB

### compile

cd /temp3/file*
time make # tmpfs
CPU: 2.93s  Real: 3.23s RAM: 236640KB

cd /temp2/file*
time make # dockvol
CPU: 2.94s  Real: 3.22s RAM: 236584KB

cd /temp1/file*
time make # mountfs
CPU: 2.89s  Real: 3.13s RAM: 236300KB

### sequential small

cd /temp3 # tmpfs
time dd if=/dev/zero of=./test.img count=10 bs=200M

2097152000 bytes (2.1 GB, 2.0 GiB) copied, 0.910784 s, 2.3 GB/s

cd /temp2 # dockvol
time dd if=/dev/zero of=./test.img count=10 bs=200M

2097152000 bytes (2.1 GB, 2.0 GiB) copied, 2.26261 s, 927 MB/s

cd /temp1 # mountfs
time dd if=/dev/zero of=./test.img count=10 bs=200M
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 2.46954 s, 849 MB/s

### sequential large

cd /temp3 # tmpfs
time dd if=/dev/zero of=./test.img count=10 bs=1G
10737418240 bytes (11 GB, 10 GiB) copied, 4.95956 s, 2.2 GB/s

cd /temp2 # dockvol
time dd if=/dev/zero of=./test.img count=10 bs=1G
10737418240 bytes (11 GB, 10 GiB) copied, 81.8511 s, 131 MB/s
10737418240 bytes (11 GB, 10 GiB) copied, 44.2367 s, 243 MB/s
# ^ running twice because I'm not sure why it's so slow

cd /temp1 # mountfs
time dd if=/dev/zero of=./test.img count=10 bs=1G
10737418240 bytes (11 GB, 10 GiB) copied, 12.7516 s, 842 MB/s

The conclusion is: docker volume is a bit faster (+10%) for small sequential writes, but significantly slower (-72% to -84%) for large sequential writes compared to bind/mount-fs; for the other cases there seems to be no noticeable difference. I always prefer bind/mount-fs over docker volume because of safety: for example, if you accidentally run docker volume rm $(docker volume ls -q), it would delete all your docker volumes (I did this multiple times on my own dev PC); also, with bind/mount-fs you can easily backup/rsync/copy/manage the files. For cases where you don't care about losing files and need high performance (as long as your RAM is enough), just use tmpfs.
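
If you want to reproduce the sequential-write test without dd, a rough Go equivalent looks like this (note that dd without conv=fsync doesn't flush the page cache at the end, so numbers will differ somewhat; this sketch does call Sync):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Roughly: dd if=/dev/zero of=./test.img count=10 bs=200M
	block := make([]byte, 200*1024*1024) // zero-filled 200MB block
	f, err := os.Create("test.img")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	start := time.Now()
	for i := 0; i < 10; i++ {
		if _, err := f.Write(block); err != nil {
			panic(err)
		}
	}
	if err := f.Sync(); err != nil { // flush to the device before timing stops
		panic(err)
	}
	elapsed := time.Since(start)
	gb := float64(10*len(block)) / 1e9
	fmt.Printf("%.1f GB in %s (%.2f GB/s)\n", gb, elapsed, gb/elapsed.Seconds())
}
```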

2023-06-14

Simple Websocket Echo Benchmark

Today we're gonna benchmark nodejs/bun+uWebSockets against golang+nbio. The code is here, both taken from their own examples. The benchmark plan: create 10K clients/connections, send a text/binary string, receive it back from the server, then sleep for 1s, 100ms, 10ms, or 1ms. The result is as expected:

go 1.20.5 nbio 1.3.16
rps: 19157.16 avg/max latency = 1.66ms/319.88ms elapsed 10.2s
102 MB 0.6 core usage
rps: 187728.05 avg/max latency = 0.76ms/167.76ms elapsed 10.2s
104 MB 5 core usage
rps: 501232.80 avg/max latency = 12.48ms/395.01ms elapsed 10.1s
rps: 498869.28 avg/max latency = 12.67ms/425.04ms elapsed 10.1s
134 MB 15 core usage

bun 0.6.9
rps: 17420.17 avg/max latency = 5.57ms/257.61ms elapsed 10.1s
48 MB 0.2 core usage
rps: 95992.29 avg/max latency = 29.93ms/242.74ms elapsed 10.4s
rps: 123589.91 avg/max latency = 40.67ms/366.15ms elapsed 10.2s
rps: 123171.42 avg/max latency = 62.74ms/293.29ms elapsed 10.1s
55 MB 1 core usage

node 18.16.0
rps: 18946.51 avg/max latency = 6.64ms/229.28ms elapsed 10.3s
59 MB 0.2 core usage
rps: 97032.08 avg/max latency = 44.06ms/196.41ms elapsed 11.1s
rps: 114449.91 avg/max latency = 72.62ms/295.33ms elapsed 10.3s
rps: 109512.05 avg/max latency = 79.27ms/226.03ms elapsed 10.2s
59 MB 1 core usage


The first through fourth lines are with 1s, 100ms, 10ms, and 1ms delay before the next request. Since Golang/nbio can utilize multiple cores by default, it can handle ~50 rps per client, while Bun/Nodejs handle 11-12 rps per client. If you find a bug, or want to contribute another language (or create a better client), just create a pull request on the github link above.
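
For illustration, each benchmark client's loop is roughly like the sketch below (shown here with gorilla/websocket; the repo has its own client code):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	const conns = 10_000
	delay := 100 * time.Millisecond // one of: 1s, 100ms, 10ms, 1ms

	var total, maxLat int64
	deadline := time.Now().Add(10 * time.Second)

	var wg sync.WaitGroup
	for i := 0; i < conns; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c, _, err := websocket.DefaultDialer.Dial("ws://localhost:8080/ws", nil)
			if err != nil {
				return
			}
			defer c.Close()
			for time.Now().Before(deadline) {
				start := time.Now()
				if err := c.WriteMessage(websocket.TextMessage, []byte("ping")); err != nil {
					return
				}
				if _, _, err := c.ReadMessage(); err != nil { // echo comes back
					return
				}
				lat := time.Since(start).Nanoseconds()
				atomic.AddInt64(&total, 1)
				for { // lock-free max-latency update
					old := atomic.LoadInt64(&maxLat)
					if lat <= old || atomic.CompareAndSwapInt64(&maxLat, old, lat) {
						break
					}
				}
				time.Sleep(delay)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("rps: %.2f, max latency: %v\n",
		float64(total)/10.0, time.Duration(maxLat))
}
```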

2022-12-24

CockroachDB Benchmark on Different Disk Types

Today we're going to benchmark CockroachDB, one of the databases I used this year to create an embedded application. I use CockroachDB because I don't want to use SQLite or any other embedded database that lacks tooling or cannot be accessed by multiple programs at the same time. With CockroachDB I only need to distribute my application binary and the cockroachdb binary, and that's it; offline backup is also quite simple, just rsync the directory, or do manual row exports like with other PostgreSQL-like databases. Scaling out is also quite simple.

Here's the result:

Disk Type        Ins Dur (s)  Upd Dur (s)  Sel Dur (s)  Many Dur (s)  Insert Q/s  Update Q/s  Select1 Q/s  SelMany Row/s  SelMany Q/s
TMPFS (RAM)      1.3          2.1          4.9          1.5           31419       19275       81274        8194872        20487
NVMe DA 1TB      2.7          3.7          5.0          1.5           15072       10698       80558        8019435        20048
NVMe Team 1TB    3.8          3.7          4.9          1.5           10569       10678       81820        8209889        20524
SSD GALAX 250GB  8.0          7.1          5.0          1.5           4980        5655        79877        7926162        19815
HDD WD 8TB       32.1         31.7         4.9          3.9           1244        1262        81561        3075780        7689

From the table we can see that TMPFS (RAM, obviously) is the fastest in all cases, especially the insert and update benchmarks; NVMe is faster than SSD, and a standard magnetic HDD is the slowest. The query part doesn't show much difference, probably because the dataset is so small that all of it fits in the cache.

The test was done with 100 goroutines, 400 records inserted/updated per goroutine; each record is only an integer and a string. Queries were done 10x for select and 300x for select-many. As shown there, sending small queries reaches the limit of ~80K rps, inserts can reach 31K rps, and multirow queries/updates can reach ~20K rps. A minimal sketch of the insert workload is shown below.
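
A minimal sketch of that insert workload (assuming CockroachDB's default port 26257 and the lib/pq Postgres driver; the real benchmark code is in the repository below):

```go
package main

import (
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS bench (id SERIAL PRIMARY KEY, n INT, s STRING)`); err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for g := 0; g < 100; g++ { // 100 goroutines
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < 400; i++ { // 400 records per goroutine
				if _, err := db.Exec(`INSERT INTO bench (n, s) VALUES ($1, $2)`,
					g*400+i, fmt.Sprintf("row-%d-%d", g, i)); err != nil {
					panic(err)
				}
			}
		}(g)
	}
	wg.Wait()
}
```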

The repository is here if you want to run the benchmark on your own machine.

2022-03-20

1 million Go goroutine vs C# task

Let's compare 1 million goroutines with 1 million tasks: which one is more efficient in CPU and memory usage? The code is forked from kjpgit's techdemo.
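
The Go side boils down to something like this simplified sketch (not kjpgit's exact code; each goroutine just sleeps to simulate waiting work):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	const n = 1_000_000
	var done int64
	for i := 0; i < n; i++ {
		go func() {
			time.Sleep(10 * time.Second) // simulated waiting work
			atomic.AddInt64(&done, 1)
		}()
	}
	for atomic.LoadInt64(&done) < n {
		time.Sleep(time.Second) // poll for completion
	}
	fmt.Println("all goroutines finished")
}
```

The results: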

Name     UserCPU    SysCPU     AvgRSS       MaxRSS    Wall
c#_1t      38.96    0.68      582,951      636,136    1:00
c#_2t      88.33    0.95      623,956      820,620    1:02
c#_4t     142.86    1.09      687,365      814,028    1:03
c#_8t     235.80    1.71      669,882      820,704    1:05
c#_16t    434.76    4.01      734,545      771,240    1:08
c#_32t    717.39    4.81      720,235      769,888    1:11
go_1t      58.77    0.65    2,635,380    2,635,632    1:04
go_2t      64.48    0.71    2,639,206    2,642,752    1:00
go_4t      72.55    1.42    2,651,086    2,654,972    1:00
go_8t      80.87    2.82    2,641,664    2,643,392    1:00
go_16t     83.18    4.03    2,673,404    2,681,100    1:00
go_32t     86.65    4.30    2,645,494    2,657,580    1:00


The result is as expected, because all tasks/goroutines are spawned first before processing: Go's scheduler is more efficient in CPU usage, but the C# runtime is more efficient in memory usage, which is normal since goroutines require a minimum of 2KB overhead each, a way higher cost than spawning a task. What if we increase to 10 million tasks/goroutines, and let the spawning be done in another task/goroutine, so that when a goroutine finishes it can give its memory back to the GC? Here's the result:

Name     UserCPU    SysCPU     AvgRSS        MaxRSS    Wall
c#_1t      12.78    1.28    2,459,190     5,051,528    0:13
c#_2t      22.60    1.54    2,692,439     5,934,796    0:18
c#_4t      42.09    1.54    2,370,239     5,538,280    0:21
c#_8t      88.54    2.29    2,522,053     6,334,176    0:29
c#_16t    204.39    3.32    2,395,001     5,803,808    0:34
c#_32t    259.09    3.25    1,842,458     4,710,012    0:28
go_1t      13.97    0.97    4,514,200     6,151,088    0:14
go_2t      12.35    1.51    5,595,418     9,506,076    0:07
go_4t      22.09    2.40    6,394,162    12,517,848    0:07
go_8t      31.00    3.09    7,115,281    13,428,344    0:06
go_16t     40.32    3.52    7,126,851    13,764,940    0:06
go_32t     58.58    3.58    7,104,882    12,145,396    0:06

The result seems normal: the high memory usage is caused by a lot of goroutines being spawned at the same time on different threads, without blocking the main thread, but after they're done the memory gets collected by the GC (previously the exit condition was time-based; this time it exits after all processing is done, since I moved the sleep before the atomic increment). What if we go back down to 1 million, with the same exit rule, spawning executed in a different task/goroutine, and completion checked every 1s? Here's the result:

Name    UserCPU SysCPU    AvgRSS     MaxRSS    Wall
c#_1t    1.18    0.16     328,134     511,652    0:02
c#_2t    2.18    0.22     294,608     554,488    0:02
c#_4t    3.19    0.20     305,336     554,064    0:02
c#_8t    7.77    0.31     292,281     530,368    0:02
c#_16t  12.33    0.25     304,352     569,460    0:02
c#_32t  37.90    1.25     337,837     684,252    0:03
go_1t    2.72    0.42   1,592,978   2,519,040    0:03
go_2t    3.04    0.47   1,852,084   2,637,532    0:03
go_4t    3.65    0.54   1,936,626   2,637,272    0:03
go_8t    3.27    0.59   1,768,540   2,655,208    0:02
go_16t   4.01    0.71   1,770,673   2,664,504    0:02
go_32t   4.96    0.72   1,770,354   2,669,244    0:02

The difference in processing time is negligible, but the CPU and memory usage contrast quite a bit. Next, let's try to spawn in bursts (100K per second), by adding a 1 second sleep after every 100K-th task/goroutine, since it's not realistic even for a DDOS'ed server to receive that much at once (unless the server is finely tuned). Here's the result:

Name    UserCPU  SysCPU    AvgRSS     MaxRSS     Wall
c#_1t     0.61    0.08    146,849     284,436    0:05
c#_2t     1.17    0.10    131,778     261,720    0:05
c#_4t     1.53    0.08    133,505     289,584    0:05
c#_8t     4.17    0.15    131,924     284,960    0:05
c#_16t   10.94    0.68    135,446     289,028    0:05
c#_32t   19.86    3.01    130,533     284,924    0:05
go_1t     1.84    0.24    731,872   1,317,796    0:06
go_2t     1.87    0.26    659,382   1,312,220    0:05
go_4t     2.00    0.30    661,296   1,322,152    0:05
go_8t     2.37    0.34    660,641   1,324,684    0:05
go_16t    2.82    0.39    660,225   1,323,932    0:05
go_32t    3.36    0.45    659,176   1,327,264    0:05

And for 5 million:

Name    UserCPU    SysCPU    AvgRSS       MaxRSS    Wall
c#_1t     3.39    0.24      309,103      573,772    0:11
c#_2t     8.30    0.26      278,683      553,592    0:11
c#_4t    13.65    0.32      274,679      658,104    0:11
c#_8t    23.20    0.46      286,336      641,376    0:12
c#_16t   45.85    1.32      286,311      640,336    0:12
c#_32t   64.83    2.46      264,866      615,552    0:12
go_1t     6.25    0.50    1,397,434    2,629,936    0:13
go_2t     6.20    0.56    1,386,336    2,631,580    0:11
go_4t     7.52    0.65    1,410,523    2,625,308    0:11
go_8t     8.21    0.86    1,441,080    2,779,456    0:11
go_16t   11.17    0.96    1,436,220    2,687,908    0:11
go_32t   12.97    1.06    1,430,573    2,668,816    0:11

And for 25 million:

Name    UserCPU   SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     15.94    0.69     590,411    1,190,340    0:24
c#_2t     34.88    0.84     699,288    1,615,372    0:32
c#_4t     59.95    0.89     761,308    1,794,116    0:34
c#_8t    100.64    1.36     758,161    1,845,944    0:36
c#_16t   199.56    2.99     765,791    2,014,856    0:38
c#_32t   332.02    4.07     811,809    1,972,400    0:41
go_1t     21.76    0.71   2,846,565    4,413,968    0:29
go_2t     25.77    1.03   2,949,433    5,553,608    0:25
go_4t     28.74    1.24   2,920,447    5,800,088    0:24
go_8t     37.28    1.96   2,869,074    5,502,776    0:23
go_16t    43.46    2.67   2,987,114    5,769,356    0:24
go_32t    43.77    2.92   3,027,179    5,867,084    0:24

How about 25 million with a sleep per 200K?

Name    UserCPU   SysCPU      AvgRSS       MaxRSS    Wall
c#_1t     18.47    0.91      842,492    1,820,788    0:22
c#_2t     40.32    0.93    1,070,555    2,454,324    0:31
c#_4t     62.39    1.16    1,103,741    2,581,476    0:33
c#_8t    100.84    1.34    1,074,820    2,377,580    0:34
c#_16t   218.26    2.91    1,062,642    2,726,700    0:37
c#_32t   339.00    6.51    1,042,254    2,275,644    0:40
go_1t     22.61    0.88    3,474,195    5,071,944    0:27
go_2t     25.83    1.20    3,912,071    6,964,640    0:20
go_4t     37.98    1.68    4,180,188    7,392,800    0:20
go_8t     38.56    2.44    4,189,265    8,481,852    0:18
go_16t    44.49    3.19    4,187,142    8,483,236    0:18
go_32t    48.82    3.44    4,218,591    8,424,200    0:18

And lastly, 25 million with a sleep per 400K?

Name    UserCPU    SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     18.66    0.98    1,183,313    2,622,464    0:20
c#_2t     41.27    1.14    1,326,415    3,155,948    0:31
c#_4t     67.21    1.11    1,436,280    3,015,212    0:33
c#_8t    107.14    1.56    1,492,179    3,378,688    0:35
c#_16t   233.50    2.45    1,498,421    3,732,368    0:41
c#_32t   346.87    3.74    1,335,756    2,882,676    0:39
go_1t     24.13    0.82    4,048,937    5,099,220    0:26
go_2t     28.85    1.41    4,936,677    8,023,568    0:18
go_4t     31.51    1.95    5,193,653    9,537,080    0:14
go_8t     45.27    2.65    5,461,107    9,499,308    0:14
go_16t    53.43    3.19    5,183,009    9,476,084    0:14
go_32t    61.98    3.86    5,589,156   10,587,788    0:14

How to read the results above: Wall = how much time it took to complete, lower is better; AvgRSS/MaxRSS = average/max memory usage, lower is better; UserCPU = CPU time used, where >100% means more than 1 full core's compute time was used, lower is better. Versions used in this benchmark:

go version go1.17.6 linux/amd64
dotnet --version
6.0.201

2022-02-22

C# vs Go in Simple Benchmark

Today we're gonna retry two of my favorite languages in an associative array and comb sort benchmark (compile and run, not just runtime performance, because waiting for compilation during development also matters), like in the past benchmark. For installing DotNet:

wget https://packages.microsoft.com/config/ubuntu/21.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb
sudo apt install apt-transport-https
sudo apt-get update
sudo apt-get install -y dotnet-sdk-6.0 aspnetcore-runtime-6.0

For installing Golang:

sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt-get update
sudo apt install -y golang-1.17

Result (best of 3 runs)

cd assoc; time dotnet run
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.16s     Real: 14.41s    RAM: 1945904KB

cd assoc; time go run map.go
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.80s     Real: 12.01s    RAM: 2305384KB

This is a bit weird: usually I see Go using less memory but being slower, but in this benchmark it's C# that uses less memory while being a bit slower (14.41s vs 12.01s), possibly because the compilation time is also included.

cd num-assoc; time dotnet run
CPU: 2.21s      Real: 2.19s     RAM: 169208KB

cd num-assoc; time go run comb.go
CPU: 0.46s      Real: 0.44s     RAM: 83100KB

What if we increase the N from 1 million to 10 million?

cd num-assoc; time dotnet run
CPU: 19.25s     Real: 19.16s    RAM: 802296KB

cd num-assoc; time go run comb.go
CPU: 4.60s      Real: 4.67s     RAM: 808940KB
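
For reference, comb sort itself is short; the benchmarked comb.go presumably looks something like this sketch (the repo has the real version):

```go
package main

import "fmt"

// combSort sorts in place by bubble-sorting with a gap that
// shrinks by a factor of ~1.3 each round until it reaches 1.
func combSort(arr []int) {
	gap := len(arr)
	swapped := true
	for gap > 1 || swapped {
		gap = gap * 10 / 13 // shrink factor ≈ 1.3
		if gap < 1 {
			gap = 1
		}
		swapped = false
		for i := 0; i+gap < len(arr); i++ {
			if arr[i] > arr[i+gap] {
				arr[i], arr[i+gap] = arr[i+gap], arr[i]
				swapped = true
			}
		}
	}
}

func main() {
	arr := []int{5, 2, 9, 1, 7}
	combSort(arr)
	fmt.Println(arr) // [1 2 5 7 9]
}
```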

If you want to contribute (if I made a mistake when coding the C# or Go version of the algorithm, or if there's a more efficient data structure), just fork and create a pull request, and I will redo the benchmark.

2021-11-22

Kafka vs RedPanda Benchmark (also Tarantool and Clickhouse as queue)

Using default settings from their docker-compose examples, today we're gonna benchmark some of the popular MQ/PubSub software. I've never used MQs extensively before (only NATS, Google PubSub, ActiveMQ, and Amazon SQS); usually a standard database that stores events is sufficient (consumers pull, tailing from the last primary key counter, and if you need to fan out, just use multiple goroutines and channels), because my projects have never been latency-sensitive applications. A sketch of that pull pattern is shown below.
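
As an illustration only (hypothetical events table and column names, using database/sql), that pull-tailing pattern looks roughly like:

```go
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // any SQL driver works; Postgres used for illustration
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:5432/db?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	lastID := int64(0) // persist this offset durably in real code
	for {
		rows, err := db.Query(
			`SELECT id, payload FROM events WHERE id > $1 ORDER BY id LIMIT 100`, lastID)
		if err != nil {
			panic(err)
		}
		n := 0
		for rows.Next() {
			var id int64
			var payload string
			if err := rows.Scan(&id, &payload); err != nil {
				panic(err)
			}
			fmt.Println("consume:", id, payload) // fan out to worker goroutines here
			lastID = id
			n++
		}
		rows.Close()
		if n == 0 {
			time.Sleep(100 * time.Millisecond) // idle; or wait for a push signal (e.g. NATS)
		}
	}
}
```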

Some issues:
  1. the benchmark has locking (atomic counters, sync.Map, etc), so the consumer might not utilize all CPU cores.
  2. confluent's kafka docker always errors on startup because /var/lib/kafka/data is not writable, so I bind to /var/lib/kafka instead. Clickhouse also always failed to start when bound to /var/lib/clickhouse/data, so I don't bind a volume for Clickhouse.
  3. RedPanda failed to start because of fs.aio-max-nr, even though it was already ~1 million (originally only 64K), so I set it to 4194304.

The benchmark: 1000 goroutines publishing 2000 messages each, with 100 goroutines consuming in parallel (a minimal publish/consume sketch follows).
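
As an illustration only (the repo has its own client code), a minimal publish/consume pair with the segmentio/kafka-go client looks roughly like:

```go
package main

import (
	"context"
	"fmt"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Publisher
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "bench",
	}
	if err := w.WriteMessages(ctx, kafka.Message{Value: []byte("hello")}); err != nil {
		panic(err)
	}
	w.Close()

	// Consumer: readers sharing a GroupID split the partitions between them
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "bench-group",
		Topic:   "bench",
	})
	m, err := r.ReadMessage(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Printf("consumed: %s\n", m.Value)
	r.Close()
}
```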

REDPANDA version: v21.10.1 (rev e7b6714)

=== redpanda single:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2387
MaxLatency (ms):  2125
AvgLatency (ms):  432
Total (s) 3.457646367s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2408
MaxLatency (ms):  2663
AvgLatency (ms):  490
Total (s) 3.459949739s

=== redpanda multi:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  4187
MaxLatency (ms):  12146
AvgLatency (ms):  9701
Total (s) 13.610533861s 

# ^ weird, maybe startup not yet complete?
# retried reinit docker-compose, 1st time always slow
# but 2nd time always fast:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2413
MaxLatency (ms):  2704
AvgLatency (ms):  467
Total (s) 3.545496041s


KAFKA version: 7.0.0-ccs (Commit:c6d7e3013b411760)
equivalent to Kafka 3.0.0

=== kafka single:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  6634
MaxLatency (ms):  12052
AvgLatency (ms):  8579
Total (s) 13.722706977s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  6380
MaxLatency (ms):  11856
AvgLatency (ms):  8636
Total (s) 13.625928209s

=== kafka multi:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  6596
MaxLatency (ms):  11932
AvgLatency (ms):  8523
Total (s) 13.659630863s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  6535
MaxLatency (ms):  11903
AvgLatency (ms):  8588
Total (s) 13.677644818s

These benchmarks use the default settings found in the docker examples, except SMP (I set it to the same number of cores as the benchmark server, to make it fair against Kafka, whose JVM can utilize all cores by default -- apparently this had an insignificant impact). The current conclusion is that RedPanda is way faster than Kafka, in terms of publishing speed (around ~1μs per message, 477K-837K msg/s) and consuming latency (432ms to 2.7s per message), versus Kafka's ~3μs per message (301K-313K msg/s) and 8.5s to 12s per message. The RAM statistics, though: RedPanda uses 12GB for each node (10% of the server's RAM), while Kafka only uses 355MB, 375MB, and 788MB for its nodes, plus 120MB for zookeeper. The repo to reproduce this benchmark is here, in the 2021mq directory.

Btw, if you're looking for a Kafka/RedPanda GUI, try Kowl; it's way more beautiful than ActiveMQ's default Web UI.

Bonus rounds: using one of the fastest OLTP databases (Tarantool) and one of the fastest OLAP databases (Clickhouse) as a queue, by leveraging a sequence (auto increment) or an internal function to generate a sequence. The differences: there's only one consumer group (you have to manually fan out using goroutines), and no JSON encode/decode since it's a structured database:


TARANTOOL version: 2.8.2

=== tarantool single (memtx):

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  11238
MaxLatency (ms):  1071
AvgLatency (ms):  101
Total (s) 11.244551225s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  9596
MaxLatency (ms):  816
AvgLatency (ms):  61
Total (s) 9.957516119s

=== tarantool single (vinyl):

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  11383
MaxLatency (ms):  1076
AvgLatency (ms):  157
Total (s) 11.388865281s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  9104
MaxLatency (ms):  102
AvgLatency (ms):  13
Total (s) 9.196549551s


CLICKHOUSE version: 21.11.4.14

=== clickhouse single:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2052
MaxLatency (ms):  2078
AvgLatency (ms):  1461
Total (s) 3.570767491s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2057
MaxLatency (ms):  2008
AvgLatency (ms):  1445
Total (s) 3.536277427s

The result recap table (ms = millisecond, us = microsecond, ns = nanosecond):

(best of 2 runs)             RedPanda single  RedPanda multi  Kafka single  Kafka multi  Tarantool memtx  Tarantool vinyl  Clickhouse single
Publish (ms)                           2,387           2,413         6,380        6,535            9,596            9,104              2,052
Sub Max Latency (ms)                   2,125           2,704        11,856       11,903              816              102              2,008
Sub Avg Latency (ms)                     490             467         8,636        8,523               61               13              1,445
Pub Throughput (msg/s)               837,872         828,844       313,480      306,044          208,420          219,684            974,659
est. Pub Latency (ns)                  1,194           1,207         3,190        3,268            4,798            4,552              1,026
est. Sub Throughput (msg/s)        4,081,633       4,282,655       231,589      234,659       32,786,885      153,846,154          1,384,083

Conclusion: Tarantool is probably the only single-node database that can compete with Kafka for the queue use case (we can have multi-master replicas, but it's not recommended; it's better to use a master-slave config where the slave is used as failover); other databases, especially RDBMSes that persist to disk, can pretty surely only do ~50K tps. Clickhouse can be multi-master, and the last time I checked it could do ~600K inserts per second (while this time it's around 1M inserts per second); I simulate the atomic counter on Clickhouse using TimeStamp64Milli, and queries are limited to 100 per second, which is good enough for the pub-sub use case. The benefit of using a database as MQ/PubSub is that you can do very flexible queries (SQL support), get mostly better tooling (especially Clickhouse), or update records for new consumers; the cons are that you must notify/fan-out yourself (for example using a NATS broadcast, only pushing the signal for workers to pull), and track the ack/retries and the read offsets of the workers yourself (pull).

2021-07-08

Prime Benchmark

So, yesterday I left my PC alone to run a prime number benchmark. Here's the result for single-threaded (only showing the fastest implementation of each language) and multithreaded:

Single-threaded:

Index  Implementation   S  Label   Passes  Dur   Algo   Faithful  Bit  Passes/Second
1      cpp              3  flo80_  220667  5.00  base   no        1    44133.26760
10     c                2  daniel   20250  5.00  wheel  yes       1     4049.85259
11     zig              3  ManDeJ   17964  5.00  base   no        1     3592.49823
12     c                2  daniel   17681  5.00  wheel  yes       1     3536.07129
16     rust             1  mike-b   15804  5.01  base   yes       8     3152.68929
20     assembly         1  rberge   14434  5.00  base   no        8     2886.80000
22     haskell          1  fatho/   11959  5.00  base   no        8     2391.77321
32     fortran          1  johand    9987  5.00  base   no        1     1997.40000
36     crystal          1  marghi    8680  5.00  base   yes       1     1735.86981
38     fsharp           3  dmanno    7754  5.00  base   yes            1550.68897
40     java             1  Mansen   14887  10.0  base   yes            1488.70000
41     csharp           1  kinema    7271  5.00  base   yes            1454.08077
43     julia            2  epithe    6953  5.00  base   yes       1    1390.55577
46     go               2  ssoves    6161  5.00  base   yes       1    1232.01471
51     nodejs           1  rogier    5748  5.00  base   yes       1    1149.43213
57     lisp             2  mayerr    5122  5.00  base   no        1    1024.19803
58     typescript       1  marghi    5031  5.00  base   yes           1006.20000
59     d                2  Bradle    5003  5.00  base   yes       1    1000.52396
61     v                1  marghi    4329  5.00  base   yes            865.80000
63     lua              2  ben1je    3159  5.00  base   no        1    631.80000
64     nim              2  beef33    2871  5.00  base   yes       1    574.02096
67     cython           1  rpkak     2659  5.00  base   yes       8    531.64832
71     basic            1  rberge    2416  5.00  wheel  yes       1    483.00680
73     assemblyscript   1  donmah    4231  10.0  base   yes           423.05768
74     python           2  ssoves    1991  5.00  base   yes       8    398.09742
80     scala            1  rom1de    1203  5.00  base   yes           240.55189
81     pascal           1  rberge    1162  5.00  base   yes           232.40000
82     cobol            1  fvbake    1157  5.00  base   no        8    231.40000
83     pony             1  marghi    1144  5.00  base   yes       1    228.80000
84     swift            1  j-f1      2046  10.0  base   yes           204.55332
85     dart             1  eagere     824  5.00  base   yes           164.77795
86     haxe             1  TayIor    1392  10.0  base   yes           139.19035
88     ada              1  BoopBe     661  5.00  base   no            132.02220
92     octave           1  octave     313  5.00  base   no            62.54234
93     postscript       1  epithe     216  5.01  base   no        8    43.08797
94     ruby             1  rberge     119  5.01  base   yes           23.71935
95     wren             1  marghi     111  5.00  base   yes           22.16446
96     php              1  Dennis     143  10.0  base   yes           14.24667
97     smalltalk        1  fvbake      49  5.07  base   yes       1    9.66469
99     mixal            1  rberge      30  4.91  base   no        1    6.10998
100    perl             1  marghi      28  5.16  base   yes           5.42031
103    r                1  fvbake       7  5.43  base   yes       32   1.28842
104    sql              2  fvbake       6  5.43  other  no        32   1.10375
105    tcl              1  fvbake       6  5.47  base   yes       1    1.09589
111    latex            1  tjol         2  17.8  base   no        32   0.11224
112    bash             1  bash         1  10.6  base   no             0.09357

Multithreaded:

Index  Implementation  S  Label  Passes  Dur  Threads  Algo   Faithful  Bit  Passes/Second
1      zig             3  ManDe  140910  5.0  4        wheel  no        1    7045.26046
2      cpp             3  flo80  236184  5.0  8        base   no        1    5904.57992
3      zig             3  ManDe  106399  5.0  4        wheel  no        1    5319.64146
4      zig             3  ManDe  101026  5.0  4        wheel  no        1    5051.08785
5      zig             3  ManDe   84002  5.0  4        wheel  no        1    4200.08320
6      zig             3  ManDe  147822  5.0  8        wheel  no        1    3695.38740
7      zig             3  ManDe   72021  5.0  4        wheel  no        1    3600.73314
8      zig             3  ManDe   70522  5.0  4        wheel  no        1    3525.83204
9      zig             3  ManDe  134124  5.0  8        wheel  no        1    3352.91894
10     zig             3  ManDe   59851  5.0  4        base   no        8    2992.45424
11     c               2  danie  101391  5.0  8        wheel  yes       1    2534.43285
12     zig             3  ManDe   98246  5.0  8        wheel  no        1    2455.98299
13     zig             3  ManDe   98190  5.0  8        wheel  no        1    2454.57327
14     zig             3  ManDe   48164  5.0  4        base   no        1    2408.10849
15     zig             3  ManDe   91745  5.0  8        wheel  no        1    2293.50574
16     zig             3  ManDe   90598  5.0  8        wheel  no        1    2264.84129
17     c               2  danie   88103  5.0  8        wheel  yes       1    2199.97727
18     zig             3  ManDe   42318  5.0  4        base   no        1    2115.85768
19     c               2  danie   78858  5.0  8        wheel  yes       1    1969.05838
20     zig             3  ManDe   68492  5.0  8        base   no        8    1712.11852
21     c               2  danie   63752  5.0  8        wheel  yes       1    1591.86334
22     rust            1  mike-   59001  5.0  8        base   yes       8    1474.83765
23     rust            1  mike-   52979  5.0  8        base   yes       1    1324.32205
24     c               2  danie   49822  5.0  8        base   yes       1    1244.43126
25     zig             3  ManDe   49712  5.0  8        base   no        1    1242.59124
26     c               2  danie   24664  5.0  4        wheel  yes       1    1233.15067
27     rust            1  mike-   49238  5.0  8        base   yes       1    1230.78145
28     zig             3  ManDe   45636  5.0  8        base   no        1    1140.80189
29     c               2  danie   22378  5.0  4        wheel  yes       1    1118.88478
30     c               2  danie   20385  5.0  4        wheel  yes       1    1019.22065
31     c               2  danie   20257  5.0  4        wheel  yes       1    1012.82346
32     c               2  danie   15132  5.0  4        base   yes       1    756.55294
33     cpp             2  davep   20048  5.0  8        base   yes       1    501.18496
34     d               2  Bradl   19553  5.0  8        base   yes       1    488.74973
35     zig             3  ManDe    9240  5.0  4        base   no        8    461.94272
36     zig             3  ManDe   11834  5.0  8        base   no        8    295.81332
37     csharp          1  kinem    3193  5.0  8        wheel  yes       1    79.81766
38     csharp          1  kinem    2961  5.0  8        base   yes       1    74.02041

Raw result on this gist
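
For context on the Passes column: each pass is one full run of a sieve of Eratosthenes (the drag race sieves primes up to 1,000,000), and each implementation counts how many passes it completes in roughly 5 seconds. A minimal sketch of that harness:

```go
package main

import (
	"fmt"
	"time"
)

// sieve counts primes below limit with a simple sieve of Eratosthenes.
func sieve(limit int) int {
	composite := make([]bool, limit)
	count := 0
	for i := 2; i < limit; i++ {
		if composite[i] {
			continue
		}
		count++
		for j := i * i; j < limit; j += i {
			composite[j] = true
		}
	}
	return count
}

func main() {
	deadline := time.Now().Add(5 * time.Second)
	passes := 0
	for time.Now().Before(deadline) {
		sieve(1_000_000)
		passes++
	}
	fmt.Printf("passes: %d (%.2f passes/s)\n", passes, float64(passes)/5.0)
}
```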