Showing posts with label benchmark.

2023-12-22

Benchmarking LLM models

Today we're gonna try to use ollama to generate some code. It's quite easy: you just need to install it using curl like this, then install some models:

curl https://ollama.ai/install.sh | sh

# these models were already downloaded before the first run
time ollama run codellama 'show me inplace mergesort using golang'
#time ollama run deepseek-coder 'show me inplace mergesort using golang'
time ollama run llama2-uncensored 'show me inplace mergesort using golang'
time ollama run mistral 'show me inplace mergesort using golang'

#time ollama run starcoder 'show me inplace mergesort using golang'
time ollama run codeup 'show me inplace mergesort using golang'
time ollama run codebooga 'show me inplace mergesort using golang'
time ollama run magicoder 'show me inplace mergesort using golang'

ollama list
NAME                     ID           SIZE    MODIFIED
codebooga:latest         05b83c5673dc 19 GB   3 seconds ago
codellama:latest         8fdf8f752f6e 3.8 GB  4 minutes ago
deepseek-coder:latest    140a485970a6 776 MB  10 minutes ago
llama2-uncensored:latest 44040b922233 3.8 GB  10 minutes ago
mistral:latest           4d9f4b269c33 4.1 GB  2 days ago
codeup:latest            54289661f7a9 7.4 GB  4 minutes ago
magicoder:latest         8007de06f5d9 3.8 GB  4 minutes ago
starcoder:latest         847e5a7aa26f 1.8 GB  10 minutes ago

Here's the result for each of them; the benchmark runs on a 32-core CPU, since my GPU is still unsupported (gfx1032):

codellama - 62.59s - will overwrite "left" when the regions overlap (the subslices share arr's backing array), but at least the structure is roughly correct

 Sure! Here is an example of how to implement in-place merge sort using Go:
```
package main

import (
        "fmt"
)

func merge(arr []int, low int, mid int, high int) {
        left := arr[low : mid+1]
        right := arr[mid+1 : high+1]

        for i := low; i <= high; i++ {
                if left[i-low] > right[i-mid] {
                        arr[i] = left[i-low]
                } else {
                        arr[i] = right[i-mid]
                }
        }
}

func inplaceMergeSort(arr []int, low int, high int) {
        if low >= high {
                return
        }

        mid := (low + high) / 2
        inplaceMergeSort(arr, low, mid)
        inplaceMergeSort(arr, mid+1, high)
        merge(arr, low, mid, high)
}

func main() {
        arr := []int{5, 3, 7, 9, 2, 4, 6, 8}
        inplaceMergeSort(arr, 0, len(arr)-1)
        fmt.Println("Sorted array:", arr)
}
```
This code sorts the input array in place using the merge sort algorithm. The `merge` function is used to combine two sorted subarrays
into a single sorted array. The `inplaceMergeSort` function recursively splits the array into smaller subarrays until each subarray
contains only one element, at which point the `merge` function is called to combine the subarrays.

The `main` function demonstrates how to use the `inplaceMergeSort` function on an example input array.

deepseek-coder - 23.53s - fast, but messy indentation, and it's spouting quicksort, not mergesort

 
Here's an in-place version of merge sort using Golang:

```go
package main
 
import (
    "fmt"
)     
    
func partition(array []int, low int , high int ) int {
        pivot := array[high] // Picks the last element as a pivot. This is done arbitrarily here for simplicity of code but in
real-world cases it should be chosen by an external algorithm like quicksort or similar with better average case performance then
mergesorts which pick random elements
        i := (low - 1)    // Index of smaller element  
        for j:=  low ;j <= high-1; j++ {     
            if array[j] < pivot{      
                i++                     // Swap arr[i+1] and arra[high], then increment i.  This is the part that makes sure all elements sma
smaller than Pivot go to left of it, while rest are in right side    // The key process is to check if current element 'array'[] has
value less
                //than pivot'. If yes,"then do nothing", else swap values and increment i.
                }                          }      return (i + 1)            }
            func merge( array []int, low int , mid int,high int ) { // The main function that sorts arr[l..r] using auxiliary space
a[]={0...n} and i is index of start element in left subarray 和 right.   pivot such as to be inserted
                // elements into correct position for final sorted array, then call itself recursively      return }            public func m
main() {          arr :=  [7,12,45689,30,-2 ,-5]         fmt.Println(array)    mergeSort(&arr[..])     // Calling the function to sort
array using mergesort
                }` `go run main.go 输出结果为[-5 -2  12  7 30 45689 ]] ) and go playground https://play.golang.org/p/-Ni_kJvKlf for running the code
```     Sorry, I can't provide you with a full in-place merge sort because it depends on your requirements (whether to print or not),
as well complexity analysis and other factors which may affect performance of an algorithm. The above is just how one might implement
such functionality within Go language using Goroutines if necessary for multi threading/parallelism, but would need a different
approach otherwise due the nature of merge sort itself being inherently sequential in operation (each element has to be processed
before it can go on). 
 

llama2-uncensored - 50.87s - err, it's using the built-in sort XD

Sure, here's an implementation of in-place mergesort using Go:
```go
package main
import "fmt"
func mergeSort(arr []int) {
 if len(arr) <= 1 {
 return
 }
 mid := len(arr) / 2
 left := arr[:mid]
 right := arr[mid:]
 fmt.Println("Merging", left, "and", right)
 sort.Ints(left)
 sort.Ints(right)
 mergeSort(left)
 merge(left, right)
}
func merge(left []int, right []int) []int {
 res := make([]int, len(left)+len(right))
 for i := 0; i < len(res); i++ {
 if left[i] <= right[i-1] {
 res[i] = left[i]
 } else {
 res[i] = right[i-1]
 }
 }
 return res
}
func main() {
 arr := []int{5, 3, 2, 1, 7, 8, 4, 6}
 fmt.Println("Original array:", arr)
 mergeSort(arr)
 fmt.Println("Merged array: ", arr)
}
```
This implementation uses the built-in `sort` package to sort the left and right halves of the original array separately, and then
merges them using a recursive function called `merge`. Finally, it applies the sorted array to the original one using the `mergeSort`
function.

mistral - 96.77s - not really in-place XD, it creates copies

 Here's an example of in-place MergeSort implementation in Golang. In this approach, we divide the given slice into smaller sub-arrays
until each sub-array contains only one element. Then we merge these sub-arrays back together to sort the original slice.

```go
func merge(arr []int, left int, mid int, right int) { // Merge function
    n1 := mid - left + 1
    n2 := right - mid

    Left := make([]int, n1)
    Right := make([]int, n2)

    for i := 0; i < n1; i++ {
        Left[i] = arr[left + i]
    }
    for j := 0; j < n2; j++ {
        Right[j] = arr[mid+1+j]
    }

    i := 0
    j := 0
    k := left

    for i < n1 && j < n2 {
        if Left[i] <= Right[j] {
            arr[k] = Left[i]
            i++
        } else {
            arr[k] = Right[j]
            j++
        }
        k++
    }

    for i < n1 { // Copy the rest of Left
        arr[k] = Left[i]
        i++
        k++
    }

    for j < n2 { // Copy the rest of Right
        arr[k] = Right[j]
        j++
        k++
    }
}

func mergeSort(arr []int, left int, right int) { // Merge sort function
    if len(arr) <= 1 {
        return
    }

    mid := (left + right) / 2

    mergeSort(arr, left, mid)
    mergeSort(arr, mid+1, right)
    merge(arr, left, mid, right)
}

func main() {
    arr := []int{12, 11, 13, 5, 6, 7}
    fmt.Println("Unsorted array:")
    printArray(arr)
    mergeSort(arr, 0, len(arr)-1)
    fmt.Println("Sorted array:")
    printArray(arr)
}

func printArray(arr []int) {
    for _, a := range arr {
        fmt.Println(a, " ")
    }
    fmt.Println()
}
```

In this example, the `mergeSort` function is called recursively on the array until each sub-array contains only one element. After
that, the merge function merges these sub-arrays back together, sorting the entire array in place. 

starcoder - 71.65s - BAD, spouting python nonsense

# golang
go run -race./cmd/test
# python
python3.7 -m pytest tests_unit --cov=src
-- spouting nonsense python code --

magicoder - 91.89s - not really in-place

 Here is a simple implementation of the in-place merge sort algorithm in Golang. It's very similar to the traditional merge sort
algorithm but works in-place without needing an additional array space:

```go
package main

import (
        "fmt"
)

func printArray(arr []int) {
        for _, value := range arr {
                fmt.Print(value, " ")
        }
        fmt.Println()
}

func mergeSort(arr []int) {
        var helper [100]int // declare an array to hold the helper values
        mergeSortRecursive(&helper[0], arr, 0, len(arr)-1)
}

func mergeSortRecursive(aux *[100]int, src []int, lo, hi int) {
        if lo >= hi {
                return // base case: array is sorted
        }

        // sort the left half
        mid := (lo + hi) / 2
        mergeSortRecursive(aux, src, lo, mid)

        // sort the right half
        mergeSortRecursive(aux, src, mid+1, hi)

        // merge the sorted halves together
        merge(&helper[0], aux, src, lo, mid, hi)
}

func merge(dst *[100]int, aux *[100]int, src []int, lo, mid, hi int) {
        i := lo
        j := mid + 1

        for k := lo; k <= hi; k++ {
                if i > mid {
                        dst[k] = src[j]
                        j++
                } else if j > hi {
                        dst[k] = src[i]
                        i++
                } else if src[i] < src[j] {
                        dst[k] = src[i]
                        i++
                } else {
                        dst[k] = src[j]
                        j++
                }
        }

        // copy back the sorted values to our original array
        for i, value := range dst[:hi-lo+1] {
                src[i+lo] = value
        }
}

func main() {
        arr := []int{37, 49, 38, 50, 3, 13}
        printArray(arr) // before sorting
        mergeSort(arr)
        printArray(arr) // after sorting
}
```
This code prints the input array, sorts it using merge sort algorithm and prints again to check if the sorting is successful.

codeup - 127.32s - uses temporary slices, so this is just normal mergesort; quite slow

Here is an example of an in-place mergesort implementation in Go:
```
func mergeSort(arr []int, low, high int) {
    if low < high {
        // divide the array into two halves
        mid := (low + high) / 2
        mergeSort(arr, low, mid)
        mergeSort(arr, mid+1, high)
        merge(arr, low, mid, high)
    }
}

func merge(arr []int, low, mid, high int) {
    var result []int
    left := arr[:mid]
    right := arr[mid:]
    for i := 0; i < len(left); i++ {
        result = append(result, left[i])
    }
    for j := 0; j < len(right); j++ {
        result = append(result, right[j])
    }
    return result
}
```
This code sorts the input array in place using the mergesort algorithm. The `mergeSort` function divides the array into two halves, sorts each half recursively, and then merges the two sorted halves back together using the `merge` function. The `merge` function takes four
integer parameters: the left and right indices of the two arrays being merged, and the low and high indices of the resulting array. It appends the elements of the left and right arrays to the resulting array in the correct order.

To use this code, you can simply call the `mergeSort` function with the address of the input array as an argument, like this:
```
arr := []int{1, 2, 3, 4, 5, 6}
mergeSort(arr)
fmt.Println(arr) // [1, 2, 3, 4, 5, 6]
```
This will sort the input array in place and print the resulting sorted array. Note that this implementation of mergesort has a time complexity of O(n log n), which means it will take longer to sort larger arrays.

codebooga - 490.72s - uses temporary slices, so this is just normal mergesort; slow AF

Here is an implementation of in-place merge sort using Go:

```go
package main

import (
        "fmt"
)

func merge(arr []int, low int, mid int, high int) {
        n1 := mid - low + 1
        n2 := high - mid

        L := make([]int, n1)
        R := make([]int, n2)

        for i := 0; i < n1; i++ {
                L[i] = arr[low+i]
        }
        for j := 0; j < n2; j++ {
                R[j] = arr[mid+1+j]
        }

        i, j, k := 0, 0, low

        for i < n1 && j < n2 {
                if L[i] <= R[j] {
                        arr[k] = L[i]
                        i++
                } else {
                        arr[k] = R[j]
                        j++
                }
                k++
        }

        for i < n1 {
                arr[k] = L[i]
                i++
                k++
        }

        for j < n2 {
                arr[k] = R[j]
                j++
                k++
        }
}

func mergeSort(arr []int, low int, high int) {
        if low < high {
                mid := (low + high) / 2
                mergeSort(arr, low, mid)
                mergeSort(arr, mid+1, high)
                merge(arr, low, mid, high)
        }
}

func inPlaceMergeSort(arr []int) {
        if len(arr) > 1 {
                mergeSort(arr, 0, len(arr)-1)
        }
}

func main() {
        arr := []int{12, 11, 13, 5, 6, 7}
        fmt.Println("Given array is: ", arr)
        inPlaceMergeSort(arr)
        fmt.Println("Sorted array is: ", arr)
}
```

This program sorts the given array using in-place merge sort algorithm. The `merge` function takes three parameters, an array `arr`, and two indices `low` and `high`. It merges the two sorted subarrays of `arr`, one starting from index `low` to `mid` (inclusive) and another from
`mid+1` to `high` (inclusive), into a single sorted subarray. The `mergeSort` function is a recursive function that sorts the array using merge sort algorithm by dividing it into two halves until each half has only one element, then calling the `merge` function to combine them
back in order. Finally, the `inPlaceMergeSort` function calls `mergeSort` on the input array and sorts it in place. The `main` function demonstrates how to use this implementation by sorting an example array.

Conclusion

So as you can see, the size of the model determines how fast the answer is resolved, and none of them are good enough yet to implement the in-place merge part.
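
For reference, a genuinely in-place merge is possible by shifting elements instead of copying into a temporary slice; here's a minimal sketch (O(n²) worst case, which is why real in-place implementations use more complex block-merge schemes):

```go
package main

import "fmt"

// mergeInPlace merges the sorted runs arr[low..mid] and arr[mid+1..high]
// using O(1) extra space by shifting elements to the right.
func mergeInPlace(arr []int, low, mid, high int) {
	i, j := low, mid+1
	for i <= mid && j <= high {
		if arr[i] <= arr[j] {
			i++ // already in order
			continue
		}
		// arr[j] belongs at position i: shift arr[i..j-1] right by one.
		v := arr[j]
		copy(arr[i+1:j+1], arr[i:j]) // copy handles overlapping slices
		arr[i] = v
		i++
		mid++
		j++
	}
}

func mergeSort(arr []int, low, high int) {
	if low >= high {
		return
	}
	mid := (low + high) / 2
	mergeSort(arr, low, mid)
	mergeSort(arr, mid+1, high)
	mergeInPlace(arr, low, mid, high)
}

func main() {
	arr := []int{5, 3, 7, 9, 2, 4, 6, 8}
	mergeSort(arr, 0, len(arr)-1)
	fmt.Println(arr) // [2 3 4 5 6 7 8 9]
}
```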

2023-10-14

Benchmarking docker-volume vs mount-fs vs tmpfs

So today we're gonna benchmark docker-volume (bind to a docker-managed volume), bind/mount-fs (bind to the host filesystem), and tmpfs. Which one is the fastest? Here's the docker compose:

version: "3.7"
services:
  web:
    image: ubuntu
    command: "sleep 3600"
    volumes:
        - ./temp1:/temp1 # mountfs
        - temp2:/temp2   # dockvol
        - temp3:/temp3   # tmpfs

volumes:
  temp2:
  temp3:
    driver_opts:
      type: tmpfs
      device: tmpfs


The docker compose file is in a sibling directory of docker's data-root, to ensure the same SSD is used. For the first benchmark we're gonna clone this repository, then copy it, create 100 x 10MB files, compile it, and do 2 sequential writes (small and large). Here's the result (some steps are not pasted below, e.g. removing files when running a benchmark twice):

apt install git g++ make time
alias time='/usr/bin/time -f "\nCPU: %Us\tReal: %es\tRAM: %MKB"'

cd /temp3 # tmpfs
git clone https://github.com/nikolausmayer/file-IO-benchmark.git

### copy small files

time cp -R /temp3/file-IO-benchmark /temp2 # dockvol
CPU: 0.00s      Real: 1.02s     RAM: 2048KB

time cp -R /temp3/file-IO-benchmark /temp1 # bindfs
CPU: 0.00s      Real: 1.00s     RAM: 2048KB

### create 100 x 10MB files

cd /temp3/file*
time make data # tmpfs
CPU: 0.41s      Real: 0.91s     RAM: 3072KB

cd /temp2/file*
time make data # dockvol
CPU: 0.44s      Real: 1.94s     RAM: 2816KB

cd /temp1/file*
time make data # mountfs
CPU: 0.51s      Real: 1.83s     RAM: 2816KB

### compile

cd /temp3/file*
time make # tmpfs
CPU: 2.93s  Real: 3.23s RAM: 236640KB

cd /temp2/file*
time make # dockvol
CPU: 2.94s  Real: 3.22s RAM: 236584KB

cd /temp1/file*
time make # mountfs
CPU: 2.89s  Real: 3.13s RAM: 236300KB

### sequential small

cd /temp3 # tmpfs
time dd if=/dev/zero of=./test.img count=10 bs=200M

2097152000 bytes (2.1 GB, 2.0 GiB) copied, 0.910784 s, 2.3 GB/s

cd /temp2 # dockvol
time dd if=/dev/zero of=./test.img count=10 bs=200M

2097152000 bytes (2.1 GB, 2.0 GiB) copied, 2.26261 s, 927 MB/s

cd /temp1 # mountfs
time dd if=/dev/zero of=./test.img count=10 bs=200M
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 2.46954 s, 849 MB/s

### sequential large

cd /temp3 # tmpfs
time dd if=/dev/zero of=./test.img count=10 bs=1G
10737418240 bytes (11 GB, 10 GiB) copied, 4.95956 s, 2.2 GB/s

cd /temp2 # dockvol
time dd if=/dev/zero of=./test.img count=10 bs=1G
10737418240 bytes (11 GB, 10 GiB) copied, 81.8511 s, 131 MB/s
10737418240 bytes (11 GB, 10 GiB) copied, 44.2367 s, 243 MB/s
# ^ running twice because I'm not sure why it's so slow

cd /temp1 # mountfs
time dd if=/dev/zero of=./test.img count=10 bs=1G
10737418240 bytes (11 GB, 10 GiB) copied, 12.7516 s, 842 MB/s

The conclusion is: docker volume is a bit faster (+10%) for small sequential writes, but significantly slower (-72% to -84%) for large sequential writes compared to bind/mount-fs; for the other cases there seems to be no noticeable difference. I always prefer bind/mount-fs over docker volume because of safety: for example, if you accidentally run docker volume rm $(docker volume ls -q), it would delete all your docker volumes (I did this multiple times on my own dev PC); also, with bind/mount-fs you can easily backup/rsync/copy/manage the files. For cases where you don't care about losing files and need high performance (as long as your RAM is enough), just use tmpfs.
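
If you want to reproduce the sequential-write test without dd, a rough Go equivalent looks like this (note that dd without conv=fsync doesn't flush the page cache at the end, so numbers will differ somewhat; this sketch does call Sync):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Roughly: dd if=/dev/zero of=./test.img count=10 bs=200M
	block := make([]byte, 200*1024*1024) // zero-filled 200MB block
	f, err := os.Create("test.img")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	start := time.Now()
	for i := 0; i < 10; i++ {
		if _, err := f.Write(block); err != nil {
			panic(err)
		}
	}
	if err := f.Sync(); err != nil { // flush to the device before timing stops
		panic(err)
	}
	elapsed := time.Since(start)
	gb := float64(10*len(block)) / 1e9
	fmt.Printf("%.1f GB in %s (%.2f GB/s)\n", gb, elapsed, gb/elapsed.Seconds())
}
```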

2023-06-14

Simple Websocket Echo Benchmark

Today we're gonna benchmark nodejs/bun+uWebSockets against golang+nbio. The code is here, both taken from their own examples. The benchmark plan: create 10K clients/connections, send a text/binary string, receive it back from the server, then sleep for 1s, 100ms, 10ms, or 1ms. The result is as expected:

go 1.20.5 nbio 1.3.16
rps: 19157.16 avg/max latency = 1.66ms/319.88ms elapsed 10.2s
102 MB 0.6 core usage
rps: 187728.05 avg/max latency = 0.76ms/167.76ms elapsed 10.2s
104 MB 5 core usage
rps: 501232.80 avg/max latency = 12.48ms/395.01ms elapsed 10.1s
rps: 498869.28 avg/max latency = 12.67ms/425.04ms elapsed 10.1s
134 MB 15 core usage

bun 0.6.9
rps: 17420.17 avg/max latency = 5.57ms/257.61ms elapsed 10.1s
48 MB 0.2 core usage
rps: 95992.29 avg/max latency = 29.93ms/242.74ms elapsed 10.4s
rps: 123589.91 avg/max latency = 40.67ms/366.15ms elapsed 10.2s
rps: 123171.42 avg/max latency = 62.74ms/293.29ms elapsed 10.1s
55 MB 1 core usage

node 18.16.0
rps: 18946.51 avg/max latency = 6.64ms/229.28ms elapsed 10.3s
59 MB 0.2 core usage
rps: 97032.08 avg/max latency = 44.06ms/196.41ms elapsed 11.1s
rps: 114449.91 avg/max latency = 72.62ms/295.33ms elapsed 10.3s
rps: 109512.05 avg/max latency = 79.27ms/226.03ms elapsed 10.2s
59 MB 1 core usage


The first through fourth lines are with 1s, 100ms, 10ms, and 1ms delay before the next request. Since Golang/nbio can utilize multiple cores by default, it can handle ~50 rps per client, while Bun/Nodejs handle 11-12 rps per client. If you find a bug, or want to contribute another language (or create a better client), just create a pull request on the github link above.
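
For illustration, each benchmark client's loop is roughly like the sketch below (shown here with gorilla/websocket; the repo has its own client code):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	const conns = 10_000
	delay := 100 * time.Millisecond // one of: 1s, 100ms, 10ms, 1ms

	var total, maxLat int64
	deadline := time.Now().Add(10 * time.Second)

	var wg sync.WaitGroup
	for i := 0; i < conns; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c, _, err := websocket.DefaultDialer.Dial("ws://localhost:8080/ws", nil)
			if err != nil {
				return
			}
			defer c.Close()
			for time.Now().Before(deadline) {
				start := time.Now()
				if err := c.WriteMessage(websocket.TextMessage, []byte("ping")); err != nil {
					return
				}
				if _, _, err := c.ReadMessage(); err != nil { // echo comes back
					return
				}
				lat := time.Since(start).Nanoseconds()
				atomic.AddInt64(&total, 1)
				for { // lock-free max-latency update
					old := atomic.LoadInt64(&maxLat)
					if lat <= old || atomic.CompareAndSwapInt64(&maxLat, old, lat) {
						break
					}
				}
				time.Sleep(delay)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("rps: %.2f, max latency: %v\n",
		float64(total)/10.0, time.Duration(maxLat))
}
```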

2022-12-24

CockroachDB Benchmark on Different Disk Types

Today we're going to benchmark CockroachDB, one of the databases I used this year to create an embedded application. I use CockroachDB because I don't want to use SQLite or any other embedded database that lacks tooling or cannot be accessed by multiple programs at the same time. With CockroachDB I only need to distribute my application binary and the cockroachdb binary, and that's it; offline backup is also quite simple, just rsync the directory, or do manual row exports like with other PostgreSQL-like databases. Scaling out is also quite simple.

Here's the result:

Disk Type        Ins Dur (s)  Upd Dur (s)  Sel Dur (s)  Many Dur (s)  Insert Q/s  Update Q/s  Select1 Q/s  SelMany Row/s  SelMany Q/s
TMPFS (RAM)      1.3          2.1          4.9          1.5           31419       19275       81274        8194872        20487
NVMe DA 1TB      2.7          3.7          5.0          1.5           15072       10698       80558        8019435        20048
NVMe Team 1TB    3.8          3.7          4.9          1.5           10569       10678       81820        8209889        20524
SSD GALAX 250GB  8.0          7.1          5.0          1.5           4980        5655        79877        7926162        19815
HDD WD 8TB       32.1         31.7         4.9          3.9           1244        1262        81561        3075780        7689

From the table we can see that TMPFS (RAM, obviously) is the fastest in all cases, especially the insert and update benchmarks; NVMe is faster than SSD, and a standard magnetic HDD is the slowest. The query part doesn't show much difference, probably because the dataset is so small that all of it fits in the cache.

The test was done with 100 goroutines, 400 records inserted/updated per goroutine; each record is only an integer and a string. Queries were done 10x for select and 300x for select-many. As shown there, sending small queries reaches the limit of ~80K rps, inserts can reach 31K rps, and multirow queries/updates can reach ~20K rps. A minimal sketch of the insert workload is shown below.
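
A minimal sketch of that insert workload (assuming CockroachDB's default port 26257 and the lib/pq Postgres driver; the real benchmark code is in the repository below):

```go
package main

import (
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS bench (id SERIAL PRIMARY KEY, n INT, s STRING)`); err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for g := 0; g < 100; g++ { // 100 goroutines
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < 400; i++ { // 400 records per goroutine
				if _, err := db.Exec(`INSERT INTO bench (n, s) VALUES ($1, $2)`,
					g*400+i, fmt.Sprintf("row-%d-%d", g, i)); err != nil {
					panic(err)
				}
			}
		}(g)
	}
	wg.Wait()
}
```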

The repository is here if you want to run the benchmark on your own machine.

2022-03-20

1 million Go goroutine vs C# task

Let's compare 1 million goroutines with 1 million tasks: which one is more efficient in CPU and memory usage? The code is forked from kjpgit's techdemo.
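
The Go side boils down to something like this simplified sketch (not kjpgit's exact code; each goroutine just sleeps to simulate waiting work):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	const n = 1_000_000
	var done int64
	for i := 0; i < n; i++ {
		go func() {
			time.Sleep(10 * time.Second) // simulated waiting work
			atomic.AddInt64(&done, 1)
		}()
	}
	for atomic.LoadInt64(&done) < n {
		time.Sleep(time.Second) // poll for completion
	}
	fmt.Println("all goroutines finished")
}
```

The results: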

Name     UserCPU    SysCPU     AvgRSS       MaxRSS    Wall
c#_1t      38.96    0.68      582,951      636,136    1:00
c#_2t      88.33    0.95      623,956      820,620    1:02
c#_4t     142.86    1.09      687,365      814,028    1:03
c#_8t     235.80    1.71      669,882      820,704    1:05
c#_16t    434.76    4.01      734,545      771,240    1:08
c#_32t    717.39    4.81      720,235      769,888    1:11
go_1t      58.77    0.65    2,635,380    2,635,632    1:04
go_2t      64.48    0.71    2,639,206    2,642,752    1:00
go_4t      72.55    1.42    2,651,086    2,654,972    1:00
go_8t      80.87    2.82    2,641,664    2,643,392    1:00
go_16t     83.18    4.03    2,673,404    2,681,100    1:00
go_32t     86.65    4.30    2,645,494    2,657,580    1:00


The result is as expected, because all tasks/goroutines are spawned first before processing: Go's scheduler is more efficient in CPU usage, but the C# runtime is more efficient in memory usage, which is normal since goroutines require a minimum of 2KB overhead each, a way higher cost than spawning a task. What if we increase to 10 million tasks/goroutines, and let the spawning be done in another task/goroutine, so that when a goroutine finishes it can give its memory back to the GC? Here's the result:

Name     UserCPU    SysCPU     AvgRSS        MaxRSS    Wall
c#_1t      12.78    1.28    2,459,190     5,051,528    0:13
c#_2t      22.60    1.54    2,692,439     5,934,796    0:18
c#_4t      42.09    1.54    2,370,239     5,538,280    0:21
c#_8t      88.54    2.29    2,522,053     6,334,176    0:29
c#_16t    204.39    3.32    2,395,001     5,803,808    0:34
c#_32t    259.09    3.25    1,842,458     4,710,012    0:28
go_1t      13.97    0.97    4,514,200     6,151,088    0:14
go_2t      12.35    1.51    5,595,418     9,506,076    0:07
go_4t      22.09    2.40    6,394,162    12,517,848    0:07
go_8t      31.00    3.09    7,115,281    13,428,344    0:06
go_16t     40.32    3.52    7,126,851    13,764,940    0:06
go_32t     58.58    3.58    7,104,882    12,145,396    0:06

The result seems normal: the high memory usage is caused by a lot of goroutines being spawned at the same time on different threads, without blocking the main thread, but after they're done the memory gets collected by the GC (previously the exit condition was time-based; this time it exits after all processing is done, since I moved the sleep before the atomic increment). What if we go back down to 1 million, with the same exit rule, spawning executed in a different task/goroutine, and completion checked every 1s? Here's the result:

Name    UserCPU SysCPU    AvgRSS     MaxRSS    Wall
c#_1t    1.18    0.16     328,134     511,652    0:02
c#_2t    2.18    0.22     294,608     554,488    0:02
c#_4t    3.19    0.20     305,336     554,064    0:02
c#_8t    7.77    0.31     292,281     530,368    0:02
c#_16t  12.33    0.25     304,352     569,460    0:02
c#_32t  37.90    1.25     337,837     684,252    0:03
go_1t    2.72    0.42   1,592,978   2,519,040    0:03
go_2t    3.04    0.47   1,852,084   2,637,532    0:03
go_4t    3.65    0.54   1,936,626   2,637,272    0:03
go_8t    3.27    0.59   1,768,540   2,655,208    0:02
go_16t   4.01    0.71   1,770,673   2,664,504    0:02
go_32t   4.96    0.72   1,770,354   2,669,244    0:02

The difference in processing time is negligible, but the CPU and memory usage contrast quite a bit. Next, let's try to spawn in bursts (100K per second), by adding a 1 second sleep after every 100K-th task/goroutine, since it's not realistic even for a DDOS'ed server to receive that much at once (unless the server is finely tuned). Here's the result:

Name    UserCPU  SysCPU    AvgRSS     MaxRSS     Wall
c#_1t     0.61    0.08    146,849     284,436    0:05
c#_2t     1.17    0.10    131,778     261,720    0:05
c#_4t     1.53    0.08    133,505     289,584    0:05
c#_8t     4.17    0.15    131,924     284,960    0:05
c#_16t   10.94    0.68    135,446     289,028    0:05
c#_32t   19.86    3.01    130,533     284,924    0:05
go_1t     1.84    0.24    731,872   1,317,796    0:06
go_2t     1.87    0.26    659,382   1,312,220    0:05
go_4t     2.00    0.30    661,296   1,322,152    0:05
go_8t     2.37    0.34    660,641   1,324,684    0:05
go_16t    2.82    0.39    660,225   1,323,932    0:05
go_32t    3.36    0.45    659,176   1,327,264    0:05

And for 5 million:

Name    UserCPU    SysCPU    AvgRSS       MaxRSS    Wall
c#_1t     3.39    0.24      309,103      573,772    0:11
c#_2t     8.30    0.26      278,683      553,592    0:11
c#_4t    13.65    0.32      274,679      658,104    0:11
c#_8t    23.20    0.46      286,336      641,376    0:12
c#_16t   45.85    1.32      286,311      640,336    0:12
c#_32t   64.83    2.46      264,866      615,552    0:12
go_1t     6.25    0.50    1,397,434    2,629,936    0:13
go_2t     6.20    0.56    1,386,336    2,631,580    0:11
go_4t     7.52    0.65    1,410,523    2,625,308    0:11
go_8t     8.21    0.86    1,441,080    2,779,456    0:11
go_16t   11.17    0.96    1,436,220    2,687,908    0:11
go_32t   12.97    1.06    1,430,573    2,668,816    0:11

And for 25 million:

Name    UserCPU   SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     15.94    0.69     590,411    1,190,340    0:24
c#_2t     34.88    0.84     699,288    1,615,372    0:32
c#_4t     59.95    0.89     761,308    1,794,116    0:34
c#_8t    100.64    1.36     758,161    1,845,944    0:36
c#_16t   199.56    2.99     765,791    2,014,856    0:38
c#_32t   332.02    4.07     811,809    1,972,400    0:41
go_1t     21.76    0.71   2,846,565    4,413,968    0:29
go_2t     25.77    1.03   2,949,433    5,553,608    0:25
go_4t     28.74    1.24   2,920,447    5,800,088    0:24
go_8t     37.28    1.96   2,869,074    5,502,776    0:23
go_16t    43.46    2.67   2,987,114    5,769,356    0:24
go_32t    43.77    2.92   3,027,179    5,867,084    0:24

How about 25 million with a sleep per 200K?

Name    UserCPU   SysCPU      AvgRSS       MaxRSS    Wall
c#_1t     18.47    0.91      842,492    1,820,788    0:22
c#_2t     40.32    0.93    1,070,555    2,454,324    0:31
c#_4t     62.39    1.16    1,103,741    2,581,476    0:33
c#_8t    100.84    1.34    1,074,820    2,377,580    0:34
c#_16t   218.26    2.91    1,062,642    2,726,700    0:37
c#_32t   339.00    6.51    1,042,254    2,275,644    0:40
go_1t     22.61    0.88    3,474,195    5,071,944    0:27
go_2t     25.83    1.20    3,912,071    6,964,640    0:20
go_4t     37.98    1.68    4,180,188    7,392,800    0:20
go_8t     38.56    2.44    4,189,265    8,481,852    0:18
go_16t    44.49    3.19    4,187,142    8,483,236    0:18
go_32t    48.82    3.44    4,218,591    8,424,200    0:18

And lastly, 25 million with a sleep per 400K?

Name    UserCPU    SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     18.66    0.98    1,183,313    2,622,464    0:20
c#_2t     41.27    1.14    1,326,415    3,155,948    0:31
c#_4t     67.21    1.11    1,436,280    3,015,212    0:33
c#_8t    107.14    1.56    1,492,179    3,378,688    0:35
c#_16t   233.50    2.45    1,498,421    3,732,368    0:41
c#_32t   346.87    3.74    1,335,756    2,882,676    0:39
go_1t     24.13    0.82    4,048,937    5,099,220    0:26
go_2t     28.85    1.41    4,936,677    8,023,568    0:18
go_4t     31.51    1.95    5,193,653    9,537,080    0:14
go_8t     45.27    2.65    5,461,107    9,499,308    0:14
go_16t    53.43    3.19    5,183,009    9,476,084    0:14
go_32t    61.98    3.86    5,589,156   10,587,788    0:14

How to read the results above: Wall = how much time it took to complete, lower is better; AvgRSS/MaxRSS = average/max memory usage, lower is better; UserCPU = CPU time used, where >100% means more than 1 full core's compute time was used, lower is better. Versions used in this benchmark:

go version go1.17.6 linux/amd64
dotnet --version
6.0.201

2022-02-22

C# vs Go in Simple Benchmark

Today we're gonna retry two of my favorite languages in an associative array and comb sort benchmark (compile and run, not just runtime performance, because waiting for compilation during development also matters), like in the past benchmark. For installing DotNet:

wget https://packages.microsoft.com/config/ubuntu/21.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb
sudo apt install apt-transport-https
sudo apt-get update
sudo apt-get install -y dotnet-sdk-6.0 aspnetcore-runtime-6.0

For installing Golang:

sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt-get update
sudo apt install -y golang-1.17

Result (best of 3 runs)

cd assoc; time dotnet run
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.16s     Real: 14.41s    RAM: 1945904KB

cd assoc; time go run map.go
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.80s     Real: 12.01s    RAM: 2305384KB

This is a bit weird: usually I see Go using less memory but being slower, but in this benchmark it's C# that uses less memory while being a bit slower (14.41s vs 12.01s), possibly because the compilation time is also included.

cd num-assoc; time dotnet run
CPU: 2.21s      Real: 2.19s     RAM: 169208KB

cd num-assoc; time go run comb.go
CPU: 0.46s      Real: 0.44s     RAM: 83100KB

What if we increase the N from 1 million to 10 million?

cd num-assoc; time dotnet run
CPU: 19.25s     Real: 19.16s    RAM: 802296KB

cd num-assoc; time go run comb.go
CPU: 4.60s      Real: 4.67s     RAM: 808940KB
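
For reference, comb sort itself is short; the benchmarked comb.go presumably looks something like this sketch (the repo has the real version):

```go
package main

import "fmt"

// combSort sorts in place by bubble-sorting with a gap that
// shrinks by a factor of ~1.3 each round until it reaches 1.
func combSort(arr []int) {
	gap := len(arr)
	swapped := true
	for gap > 1 || swapped {
		gap = gap * 10 / 13 // shrink factor ≈ 1.3
		if gap < 1 {
			gap = 1
		}
		swapped = false
		for i := 0; i+gap < len(arr); i++ {
			if arr[i] > arr[i+gap] {
				arr[i], arr[i+gap] = arr[i+gap], arr[i]
				swapped = true
			}
		}
	}
}

func main() {
	arr := []int{5, 2, 9, 1, 7}
	combSort(arr)
	fmt.Println(arr) // [1 2 5 7 9]
}
```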

If you want to contribute (if I made a mistake when coding the C# or Go version of the algorithm, or if there's a more efficient data structure), just fork and create a pull request, and I will redo the benchmark.

2021-11-22

Kafka vs RedPanda Benchmark (also Tarantool and Clickhouse as queue)

Using default settings from their docker-compose examples, today we're gonna benchmark some of the popular MQ/PubSub software. I've never used MQs extensively before (only NATS, Google PubSub, ActiveMQ, and Amazon SQS); usually a standard database that stores events is sufficient (consumers pull, tailing from the last primary key counter, and if you need to fan out, just use multiple goroutines and channels), because my projects have never been latency-sensitive applications. A sketch of that pull pattern is shown below.
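
As an illustration only (hypothetical events table and column names, using database/sql), that pull-tailing pattern looks roughly like:

```go
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // any SQL driver works; Postgres used for illustration
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:5432/db?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	lastID := int64(0) // persist this offset durably in real code
	for {
		rows, err := db.Query(
			`SELECT id, payload FROM events WHERE id > $1 ORDER BY id LIMIT 100`, lastID)
		if err != nil {
			panic(err)
		}
		n := 0
		for rows.Next() {
			var id int64
			var payload string
			if err := rows.Scan(&id, &payload); err != nil {
				panic(err)
			}
			fmt.Println("consume:", id, payload) // fan out to worker goroutines here
			lastID = id
			n++
		}
		rows.Close()
		if n == 0 {
			time.Sleep(100 * time.Millisecond) // idle; or wait for a push signal (e.g. NATS)
		}
	}
}
```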

Some issues:
  1. the benchmark has locking (atomic counters, sync.Map, etc), so the consumer might not utilize all CPU cores.
  2. confluent's kafka docker always errors on startup because /var/lib/kafka/data is not writable, so I bind to /var/lib/kafka instead. Clickhouse also always failed to start when bound to /var/lib/clickhouse/data, so I don't bind a volume for Clickhouse.
  3. RedPanda failed to start because of fs.aio-max-nr, even though it was already ~1 million (originally only 64K), so I set it to 4194304.

The benchmark: 1000 goroutines publishing 2000 messages each, with 100 goroutines consuming in parallel (a minimal publish/consume sketch follows).
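
As an illustration only (the repo has its own client code), a minimal publish/consume pair with the segmentio/kafka-go client looks roughly like:

```go
package main

import (
	"context"
	"fmt"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Publisher
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "bench",
	}
	if err := w.WriteMessages(ctx, kafka.Message{Value: []byte("hello")}); err != nil {
		panic(err)
	}
	w.Close()

	// Consumer: readers sharing a GroupID split the partitions between them
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "bench-group",
		Topic:   "bench",
	})
	m, err := r.ReadMessage(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Printf("consumed: %s\n", m.Value)
	r.Close()
}
```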

REDPANDA version: v21.10.1 (rev e7b6714)

=== redpanda single:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2387
MaxLatency (ms):  2125
AvgLatency (ms):  432
Total (s) 3.457646367s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2408
MaxLatency (ms):  2663
AvgLatency (ms):  490
Total (s) 3.459949739s

=== redpanda multi:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  4187
MaxLatency (ms):  12146
AvgLatency (ms):  9701
Total (s) 13.610533861s 

# ^ weird, maybe startup not yet complete?
# retried reinit docker-compose, 1st time always slow
# but 2nd time always fast:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2413
MaxLatency (ms):  2704
AvgLatency (ms):  467
Total (s) 3.545496041s


KAFKA version: 7.0.0-ccs (Commit:c6d7e3013b411760)
equivalent to Kafka 3.0.0

=== kafka single:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  6634
MaxLatency (ms):  12052
AvgLatency (ms):  8579
Total (s) 13.722706977s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  6380
MaxLatency (ms):  11856
AvgLatency (ms):  8636
Total (s) 13.625928209s

=== kafka multi:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  6596
MaxLatency (ms):  11932
AvgLatency (ms):  8523
Total (s) 13.659630863s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  6535
MaxLatency (ms):  11903
AvgLatency (ms):  8588
Total (s) 13.677644818s

These benchmarks use the default settings found in the docker examples, except SMP (I set it to the same number of cores as the benchmark server, to make it fair against Kafka, whose JVM can utilize all cores by default -- apparently this had an insignificant impact). The current conclusion is that RedPanda is way faster than Kafka, in terms of publishing speed (around ~1μs per message, 477K-837K msg/s) and consuming latency (432ms to 2.7s per message), versus Kafka's ~3μs per message (301K-313K msg/s) and 8.5s to 12s per message. The RAM statistics, though: RedPanda uses 12GB for each node (10% of the server's RAM), while Kafka only uses 355MB, 375MB, and 788MB for its nodes, plus 120MB for zookeeper. The repo to reproduce this benchmark is here, in the 2021mq directory.

Btw, if you're looking for a Kafka/RedPanda GUI, try Kowl; it's way more beautiful than ActiveMQ's default Web UI.

Bonus rounds: using one of the fastest OLTP databases (Tarantool) and one of the fastest OLAP databases (Clickhouse) as a queue, by leveraging a sequence (auto increment) or an internal function to generate a sequence. The differences: there's only one consumer group (you have to manually fan out using goroutines), and no JSON encode/decode since it's a structured database:


TARANTOOL version: 2.8.2

=== tarantool single (memtx):

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  11238
MaxLatency (ms):  1071
AvgLatency (ms):  101
Total (s) 11.244551225s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  9596
MaxLatency (ms):  816
AvgLatency (ms):  61
Total (s) 9.957516119s

=== tarantool single (vinyl):

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  11383
MaxLatency (ms):  1076
AvgLatency (ms):  157
Total (s) 11.388865281s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  9104
MaxLatency (ms):  102
AvgLatency (ms):  13
Total (s) 9.196549551s


CLICKHOUSE version: 21.11.4.14

=== clickhouse single:

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2052
MaxLatency (ms):  2078
AvgLatency (ms):  1461
Total (s) 3.570767491s

FailProduce:  0
FailConsume:  0
DoubleConsume:  0
Produced (ms):  2057
MaxLatency (ms):  2008
AvgLatency (ms):  1445
Total (s) 3.536277427s

The result recap table (ms = millisecond, us = microsecond, ns = nanosecond):

(best of 2 runs)             RedPanda single  RedPanda multi  Kafka single  Kafka multi  Tarantool memtx  Tarantool vinyl  Clickhouse single
Publish (ms)                           2,387           2,413         6,380        6,535            9,596            9,104              2,052
Sub Max Latency (ms)                   2,125           2,704        11,856       11,903              816              102              2,008
Sub Avg Latency (ms)                     490             467         8,636        8,523               61               13              1,445
Pub Throughput (msg/s)               837,872         828,844       313,480      306,044          208,420          219,684            974,659
est. Pub Latency (ns)                  1,194           1,207         3,190        3,268            4,798            4,552              1,026
est. Sub Throughput (msg/s)        4,081,633       4,282,655       231,589      234,659       32,786,885      153,846,154          1,384,083

Conclusion: Tarantool is probably the only single-node database that can compete with Kafka for the queue use case (we can have multi-master replicas, but it's not recommended; it's better to use a master-slave config where the slave is used as failover); other databases, especially RDBMSes that persist to disk, can pretty surely only do ~50K tps. Clickhouse can be multi-master, and the last time I checked it could do ~600K inserts per second (while this time it's around 1M inserts per second); I simulate the atomic counter on Clickhouse using TimeStamp64Milli, and queries are limited to 100 per second, which is good enough for the pub-sub use case. The benefit of using a database as MQ/PubSub is that you can do very flexible queries (SQL support), get mostly better tooling (especially Clickhouse), or update records for new consumers; the cons are that you must notify/fan-out yourself (for example using a NATS broadcast, only pushing the signal for workers to pull), and track the ack/retries and the read offsets of the workers yourself (pull).

2021-07-08

Prime Benchmark

So, yesterday I left my PC alone to run a prime number benchmark. Here's the result for single-threaded (only showing the fastest implementation of each language) and multithreaded:

Single-threaded:

Index  Implementation   S  Label   Passes  Dur   Algo   Faithful  Bit  Passes/Second
1      cpp              3  flo80_  220667  5.00  base   no        1    44133.26760
10     c                2  daniel   20250  5.00  wheel  yes       1     4049.85259
11     zig              3  ManDeJ   17964  5.00  base   no        1     3592.49823
12     c                2  daniel   17681  5.00  wheel  yes       1     3536.07129
16     rust             1  mike-b   15804  5.01  base   yes       8     3152.68929
20     assembly         1  rberge   14434  5.00  base   no        8     2886.80000
22     haskell          1  fatho/   11959  5.00  base   no        8     2391.77321
32     fortran          1  johand    9987  5.00  base   no        1     1997.40000
36     crystal          1  marghi    8680  5.00  base   yes       1     1735.86981
38     fsharp           3  dmanno    7754  5.00  base   yes            1550.68897
40     java             1  Mansen   14887  10.0  base   yes            1488.70000
41     csharp           1  kinema    7271  5.00  base   yes            1454.08077
43     julia            2  epithe    6953  5.00  base   yes       1    1390.55577
46     go               2  ssoves    6161  5.00  base   yes       1    1232.01471
51     nodejs           1  rogier    5748  5.00  base   yes       1    1149.43213
57     lisp             2  mayerr    5122  5.00  base   no        1    1024.19803
58     typescript       1  marghi    5031  5.00  base   yes           1006.20000
59     d                2  Bradle    5003  5.00  base   yes       1    1000.52396
61     v                1  marghi    4329  5.00  base   yes            865.80000
63     lua              2  ben1je    3159  5.00  base   no        1    631.80000
64     nim              2  beef33    2871  5.00  base   yes       1    574.02096
67     cython           1  rpkak     2659  5.00  base   yes       8    531.64832
71     basic            1  rberge    2416  5.00  wheel  yes       1    483.00680
73     assemblyscript   1  donmah    4231  10.0  base   yes           423.05768
74     python           2  ssoves    1991  5.00  base   yes       8    398.09742
80     scala            1  rom1de    1203  5.00  base   yes           240.55189
81     pascal           1  rberge    1162  5.00  base   yes           232.40000
82     cobol            1  fvbake    1157  5.00  base   no        8    231.40000
83     pony             1  marghi    1144  5.00  base   yes       1    228.80000
84     swift            1  j-f1      2046  10.0  base   yes           204.55332
85     dart             1  eagere     824  5.00  base   yes           164.77795
86     haxe             1  TayIor    1392  10.0  base   yes           139.19035
88     ada              1  BoopBe     661  5.00  base   no            132.02220
92     octave           1  octave     313  5.00  base   no            62.54234
93     postscript       1  epithe     216  5.01  base   no        8    43.08797
94     ruby             1  rberge     119  5.01  base   yes           23.71935
95     wren             1  marghi     111  5.00  base   yes           22.16446
96     php              1  Dennis     143  10.0  base   yes           14.24667
97     smalltalk        1  fvbake      49  5.07  base   yes       1    9.66469
99     mixal            1  rberge      30  4.91  base   no        1    6.10998
100    perl             1  marghi      28  5.16  base   yes           5.42031
103    r                1  fvbake       7  5.43  base   yes       32   1.28842
104    sql              2  fvbake       6  5.43  other  no        32   1.10375
105    tcl              1  fvbake       6  5.47  base   yes       1    1.09589
111    latex            1  tjol         2  17.8  base   no        32   0.11224
112    bash             1  bash         1  10.6  base   no             0.09357

Multithreaded:

Index  Implementation  S  Label  Passes  Dur  Threads  Algo   Faithful  Bit  Passes/Second
1      zig             3  ManDe  140910  5.0  4        wheel  no        1    7045.26046
2      cpp             3  flo80  236184  5.0  8        base   no        1    5904.57992
3      zig             3  ManDe  106399  5.0  4        wheel  no        1    5319.64146
4      zig             3  ManDe  101026  5.0  4        wheel  no        1    5051.08785
5      zig             3  ManDe   84002  5.0  4        wheel  no        1    4200.08320
6      zig             3  ManDe  147822  5.0  8        wheel  no        1    3695.38740
7      zig             3  ManDe   72021  5.0  4        wheel  no        1    3600.73314
8      zig             3  ManDe   70522  5.0  4        wheel  no        1    3525.83204
9      zig             3  ManDe  134124  5.0  8        wheel  no        1    3352.91894
10     zig             3  ManDe   59851  5.0  4        base   no        8    2992.45424
11     c               2  danie  101391  5.0  8        wheel  yes       1    2534.43285
12     zig             3  ManDe   98246  5.0  8        wheel  no        1    2455.98299
13     zig             3  ManDe   98190  5.0  8        wheel  no        1    2454.57327
14     zig             3  ManDe   48164  5.0  4        base   no        1    2408.10849
15     zig             3  ManDe   91745  5.0  8        wheel  no        1    2293.50574
16     zig             3  ManDe   90598  5.0  8        wheel  no        1    2264.84129
17     c               2  danie   88103  5.0  8        wheel  yes       1    2199.97727
18     zig             3  ManDe   42318  5.0  4        base   no        1    2115.85768
19     c               2  danie   78858  5.0  8        wheel  yes       1    1969.05838
20     zig             3  ManDe   68492  5.0  8        base   no        8    1712.11852
21     c               2  danie   63752  5.0  8        wheel  yes       1    1591.86334
22     rust            1  mike-   59001  5.0  8        base   yes       8    1474.83765
23     rust            1  mike-   52979  5.0  8        base   yes       1    1324.32205
24     c               2  danie   49822  5.0  8        base   yes       1    1244.43126
25     zig             3  ManDe   49712  5.0  8        base   no        1    1242.59124
26     c               2  danie   24664  5.0  4        wheel  yes       1    1233.15067
27     rust            1  mike-   49238  5.0  8        base   yes       1    1230.78145
28     zig             3  ManDe   45636  5.0  8        base   no        1    1140.80189
29     c               2  danie   22378  5.0  4        wheel  yes       1    1118.88478
30     c               2  danie   20385  5.0  4        wheel  yes       1    1019.22065
31     c               2  danie   20257  5.0  4        wheel  yes       1    1012.82346
32     c               2  danie   15132  5.0  4        base   yes       1    756.55294
33     cpp             2  davep   20048  5.0  8        base   yes       1    501.18496
34     d               2  Bradl   19553  5.0  8        base   yes       1    488.74973
35     zig             3  ManDe    9240  5.0  4        base   no        8    461.94272
36     zig             3  ManDe   11834  5.0  8        base   no        8    295.81332
37     csharp          1  kinem    3193  5.0  8        wheel  yes       1    79.81766
38     csharp          1  kinem    2961  5.0  8        base   yes       1    74.02041

Raw result on this gist
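
For context on the Passes column: each pass is one full run of a sieve of Eratosthenes (the drag race sieves primes up to 1,000,000), and each implementation counts how many passes it completes in roughly 5 seconds. A minimal sketch of that harness:

```go
package main

import (
	"fmt"
	"time"
)

// sieve counts primes below limit with a simple sieve of Eratosthenes.
func sieve(limit int) int {
	composite := make([]bool, limit)
	count := 0
	for i := 2; i < limit; i++ {
		if composite[i] {
			continue
		}
		count++
		for j := i * i; j < limit; j += i {
			composite[j] = true
		}
	}
	return count
}

func main() {
	deadline := time.Now().Add(5 * time.Second)
	passes := 0
	for time.Now().Before(deadline) {
		sieve(1_000_000)
		passes++
	}
	fmt.Printf("passes: %d (%.2f passes/s)\n", passes, float64(passes)/5.0)
}
```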