asm's Introduction

Go library providing algorithms that use the full power of modern CPUs to get the best performance.

Motivation

The cloud makes it easier than ever to access large-scale compute capacity, and it has become common to run distributed systems deployed across dozens or sometimes hundreds of CPUs. Because projects now run on so many cores, program performance and efficiency matter more today than ever before.

Modern CPUs are complex machines with performance characteristics that may vary by orders of magnitude depending on how they are used. Features like branch prediction, instruction reordering, pipelining, or caching are all input variables that determine the compute throughput that a CPU can achieve. While compilers keep being improved, and often employ micro-optimizations that would be counter-productive for human developers to be responsible for, there are limitations to what they can do, and Assembly still has a role to play in optimizing algorithms on hot code paths of large scale applications.

SIMD instruction sets offer interesting opportunities for software engineers. Taking advantage of these instructions often requires rethinking how the program represents and manipulates data, which is beyond the realm of optimizations that can be implemented by a compiler. When renting CPU time from a Cloud provider, programs that fail to leverage the full sets of instructions available are therefore paying for features they do not use.

This package aims to provide such algorithms, optimized to leverage advanced instruction sets of modern CPUs to maximize throughput and take the best advantage of the available compute power. Users of the package will find functions that have often been designed to work on arrays of values, which is where SIMD and branchless algorithms shine.
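To give a feel for what "working on arrays of values" without per-element branches can mean, here is a minimal sketch in portable Go. It checks that a byte slice contains only ASCII by OR-ing eight bytes at a time into an accumulator and testing the high bits once at the end. The function name validASCII is hypothetical and this is not the library's implementation; the assembly versions rely on wider SIMD registers, but the idea of replacing per-byte decisions with bulk arithmetic is the same.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// validASCII reports whether every byte of p is below 0x80.
// Hypothetical helper for illustration only; it is not the
// library's ascii.Valid.
func validASCII(p []byte) bool {
	var acc uint64
	// Accumulate eight bytes per iteration. The loop never branches
	// on the value of an individual byte.
	for len(p) >= 8 {
		acc |= binary.LittleEndian.Uint64(p)
		p = p[8:]
	}
	// Fold the remaining 0 to 7 bytes into a single byte.
	var tail byte
	for _, c := range p {
		tail |= c
	}
	// A set high bit anywhere means a non-ASCII byte was seen.
	return (acc&0x8080808080808080)|uint64(tail&0x80) == 0
}

func main() {
	fmt.Println(validASCII([]byte("hello, world")))
	fmt.Println(validASCII([]byte("h\xc3\xa9llo")))
}
```

The only data-dependent decision is the final comparison, which is what makes this style friendly to branch predictors and a natural fit for vector instructions.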

The functions in this library have been used in high-throughput production environments at Segment; we hope that they will be useful to other developers using Go in performance-sensitive software.

Usage

The library is composed of multiple Go packages intended to act as logical groups of functions sharing similar properties:

Package     Purpose
ascii       library of functions designed to work on ASCII inputs
base64      standard library compatible base64 encodings
bswap       byte swapping algorithms working on arrays of fixed-size items
cpu         definition of the ABI used to detect CPU features
mem         functions operating on byte arrays
qsort       quick-sort implementations for arrays of fixed-size items
slices      functions performing computations on pairs of slices
sortedset   functions working on sorted arrays of fixed-size items

When no assembly version of a function is available for the target platform, the package provides a generic implementation in Go which is automatically picked up by the compiler.
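As an illustration of what such a generic fallback can look like, the sketch below reverses the byte order of every 8-byte item in a slice using only the standard library. The name swap64 and the choice to leave trailing bytes untouched are assumptions made for this example, not a copy of the library's code. In a real package the choice between the assembly and Go versions is made at compile time through file name suffixes and build constraints, which a single-file example cannot show.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// swap64 reverses the byte order of each 8-byte item in b, in place.
// Hypothetical portable fallback; trailing bytes that do not form a
// full item are left untouched (an assumption of this sketch).
func swap64(b []byte) {
	for len(b) >= 8 {
		// Reading little-endian and writing big-endian reverses
		// the eight bytes of the item.
		v := binary.LittleEndian.Uint64(b)
		binary.BigEndian.PutUint64(b, v)
		b = b[8:]
	}
}

func main() {
	b := []byte{1, 2, 3, 4, 5, 6, 7, 8}
	swap64(b)
	fmt.Println(b) // prints [8 7 6 5 4 3 2 1]
}
```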

Showcase

The purpose of this library being to improve the runtime efficiency of Go programs, we compiled a few snapshots of benchmark runs to showcase the kind of improvements that these code paths can expect from leveraging SIMD and branchless optimizations:

goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
pkg: github.com/segmentio/asm/ascii
name                  old time/op    new time/op     delta
EqualFoldString/0512     276ns ± 1%       21ns ± 2%    -92.50%  (p=0.008 n=5+5)

name                  old speed      new speed       delta
EqualFoldString/0512  3.71GB/s ± 1%  49.44GB/s ± 2%  +1232.79%  (p=0.008 n=5+5)
pkg: github.com/segmentio/asm/bswap
name    old time/op    new time/op     delta
Swap64    11.2µs ± 1%      0.9µs ± 9%    -92.06%  (p=0.008 n=5+5)

name    old speed      new speed       delta
Swap64  5.83GB/s ± 1%  73.67GB/s ± 9%  +1162.98%  (p=0.008 n=5+5)
pkg: github.com/segmentio/asm/qsort
name            old time/op    new time/op     delta
Sort16/1000000     269ms ± 2%       46ms ± 3%   -83.08%  (p=0.008 n=5+5)

name            old speed      new speed       delta
Sort16/1000000  59.4MB/s ± 2%  351.2MB/s ± 3%  +491.24%  (p=0.008 n=5+5)
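For readers who want to produce comparable numbers, the following is a minimal sketch of a Go benchmark that reports both time per operation and throughput. It uses strings.EqualFold from the standard library as a stand-in rather than the library's own functions, and the helper throughputGBps is hypothetical. Counting both inputs as processed bytes is an assumption; it is consistent with the EqualFoldString figures above, where 1024 bytes in 276ns gives about 3.71GB/s.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// benchEqualFold measures strings.EqualFold (a standard-library
// stand-in, not the library's ascii.EqualFoldString) on two
// 512-byte inputs.
func benchEqualFold(b *testing.B) {
	x := strings.Repeat("a", 512)
	y := strings.Repeat("A", 512)
	// Declaring the bytes processed per iteration is what makes the
	// testing package report a speed column next to time/op.
	b.SetBytes(int64(len(x) + len(y)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if !strings.EqualFold(x, y) {
			b.Fatal("inputs should be equal under case folding")
		}
	}
}

// throughputGBps converts bytes per operation and nanoseconds per
// operation into GB/s (one byte per nanosecond equals 1 GB/s).
func throughputGBps(bytesPerOp int64, nsPerOp float64) float64 {
	return float64(bytesPerOp) / nsPerOp
}

func main() {
	r := testing.Benchmark(benchEqualFold)
	nsPerOp := float64(r.T.Nanoseconds()) / float64(r.N)
	fmt.Printf("%d iterations, %.1f ns/op, %.2f GB/s\n",
		r.N, nsPerOp, throughputGBps(r.Bytes, nsPerOp))
}
```

In practice the same function body would live in a _test.go file as BenchmarkEqualFold and be run with go test -bench, with a tool such as benchstat comparing the old and new runs to produce the delta column.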

Maintenance

The assembly code is generated with AVO, and orchestrated by a Makefile which helps maintainers rebuild the assembly source code when the AVO files are modified.

The repository contains two Go modules; the main module is declared as github.com/segmentio/asm at the root of the repository, and the second module is found in the build subdirectory.

The build module is used to isolate build dependencies from programs that import the main module. Through this mechanism, AVO does not become a dependency of programs using github.com/segmentio/asm, keeping the dependency management overhead minimal for the users, and allowing maintainers to make modifications to the build package.

Versioning of the two modules is managed independently; while we aim to provide stable APIs on the main package, breaking changes may be introduced on the build package more often, as it is intended to be ground for more experimental constructs in the project.

Requirements

Some packages have purpose-built assembly for both amd64 and arm64. Others (qsort) support only amd64. Search for a .s file matching your architecture to confirm that you are getting the assembly-optimized implementation rather than the generic Go fallback.

The Go code requires Go 1.17 or above. These versions contain significant performance improvements compared to previous Go versions.

asm version v1.1.5 and earlier maintain compatibility with Go 1.16.

purego

Programs in the build module should add the following declaration:

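import . "github.com/mmcloughlin/avo/build"

func init() {
	// Add a "!purego" build constraint to every generated file.
	ConstraintExpr("!purego")
}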

It instructs AVO to inject the !purego tag in the generated files, allowing the libraries to be compiled without any assembly optimizations with a build command such as:

go build -tags purego ...

This is mainly useful to compare the impact of using the assembly optimized versions instead of the simpler Go-only implementations.

asm's People

Contributors

achille-roussel, chriso, dependabot[bot], kalamay, kevinburkesegment, mmcloughlin, pelletier, peterdemartini, pryz, udhaybegyall

asm's Issues

undefined reference to `github.com/segmentio/asm/cpu.X86'

When segmentio/asm v1.2.0 is vendored into a module, building a Docker image fails.

build layer in Dockerfile:

RUN CGO_ENABLED=1 GO111MODULE=off go install /cmd/main.go

error logs:

[2022-12-06 16:56:34]  INFO docker build: #10 37.51 # github.com/segmentio/<REPO>/cmd/<BINARY>
  | [2022-12-06 16:56:34]  INFO docker build: #10 37.51 /usr/local/go/pkg/tool/linux_amd64/link: running gcc failed: exit status 1
  | [2022-12-06 16:56:34]  INFO docker build: #10 37.51 /usr/bin/ld: /tmp/go-link-506352844/go.o: in function `github.com/segmentio/<REPO>/vendor/github.com/segmentio/asm/bswap.swap64.abi0':
  | [2022-12-06 16:56:34]  INFO docker build: #10 37.51 /go/src/github.com/segmentio/<REPO>/vendor/github.com/segmentio/asm/bswap/swap64_amd64.s:14: undefined reference to `github.com/segmentio/asm/cpu.X86'
  | [2022-12-06 16:56:34]  INFO docker build: #10 37.51 collect2: error: ld returned 1 exit status
  | [2022-12-06 16:56:34]  INFO docker build: #10 37.51
  | [2022-12-06 16:56:35]  INFO docker build: #10 37.83 # github.com/segmentio/<REPO>/cmd/<BINARY>
  | [2022-12-06 16:56:35]  INFO docker build: #10 37.83 /usr/local/go/pkg/tool/linux_amd64/link: running gcc failed: exit status 1
  | [2022-12-06 16:56:35]  INFO docker build: #10 37.83 /usr/bin/ld: /tmp/go-link-65482911/go.o: in function `github.com/segmentio/<REPO>/vendor/github.com/segmentio/asm/bswap.swap64.abi0':
  | [2022-12-06 16:56:35]  INFO docker build: #10 37.83 /go/src/github.com/segmentio/<REPO>/vendor/github.com/segmentio/asm/bswap/swap64_amd64.s:14: undefined reference to `github.com/segmentio/asm/cpu.X86'
  | [2022-12-06 16:56:35]  INFO docker build: #10 37.83 collect2: error: ld returned 1 exit status
  | [2022-12-06 16:56:35]  INFO docker build: #10 37.83

It seems like this is the line of code within segmentio/asm that's missing a reference but I'm unsure how to fix this.

use x/sys/cpu

Prefer x/sys/cpu over the current asm/cpu package. It also has the advantage of taking into account whether an OS supports XMM and YMM registers.

"checkptr: pointer arithmetic result points to invalid allocation" on ARM

While running a test suite on Go 1.17.3 on an M1 Mac with github.com/segmentio/encoding v0.3.2 and github.com/segmentio/asm v1.1.0, I ran into segmentio/encoding#84 again. I believe the fix in segmentio/encoding#85 was incomplete and may not apply to non-Intel CPUs, or maybe the rules are slightly different in newer Go and/or on ARM.

The same test code triggers it:

package main

import (
	"testing"

	"github.com/segmentio/encoding/json"
)

type Foo struct {
	Source struct {
		Table string
	}
}

func TestUnmarshal(t *testing.T) {
	input := []byte(`{"source": {"table": "1234567"}}`)
	r := &Foo{}
	json.Unmarshal(input, r)
}

Run the same way:

go mod init segtest
go mod tidy
go test -v -race -trimpath ./...

And the results are:

=== RUN   TestUnmarshal
fatal error: checkptr: pointer arithmetic result points to invalid allocation

goroutine 35 [running]:
runtime.throw({0x1010ceb85, 0x40})
	runtime/panic.go:1198 +0x54 fp=0xc00004cb00 sp=0xc00004cad0 pc=0x100f81794
runtime.checkptrArithmetic(0xc00016a040, {0xc00004cb90, 0x1, 0x1})
	runtime/checkptr.go:69 +0xbc fp=0xc00004cb30 sp=0xc00004cb00 pc=0x100f5106c
github.com/segmentio/asm/ascii.ValidPrintString({0xc00016a020, 0x20})
	github.com/segmentio/asm@v1.1.0/ascii/valid_print_default.go:16 +0x80 fp=0xc00004cba0 sp=0xc00004cb30 pc=0x101092be0
github.com/segmentio/asm/ascii.ValidPrint(...)
	github.com/segmentio/asm@v1.1.0/ascii/valid_print.go:7
github.com/segmentio/encoding/ascii.ValidPrint(...)
	github.com/segmentio/encoding@v0.3.2/ascii/valid_print.go:10
github.com/segmentio/encoding/json.internalParseFlags({0xc00016a020, 0x20, 0x20})
	github.com/segmentio/encoding@v0.3.2/json/parse.go:33 +0x148 fp=0xc00004cc50 sp=0xc00004cba0 pc=0x1010b9788
github.com/segmentio/encoding/json.Parse({0xc00016a020, 0x20, 0x20}, {0x101113200, 0xc000112530}, 0x0)
	github.com/segmentio/encoding@v0.3.2/json/json.go:303 +0xc8 fp=0xc00004cdb0 sp=0xc00004cc50 pc=0x1010b91c8
github.com/segmentio/encoding/json.Unmarshal({0xc00016a020, 0x20, 0x20}, {0x101113200, 0xc000112530})
	github.com/segmentio/encoding@v0.3.2/json/json.go:285 +0x58 fp=0xc00004ce60 sp=0xc00004cdb0 pc=0x1010b8ff8
segtest.TestUnmarshal(0xc000107520)
	segtest/seg_test.go:18 +0xe4 fp=0xc00004cec0 sp=0xc00004ce60 pc=0x1010c3584
testing.tRunner(0xc000107520, 0x1011405f0)
	testing/testing.go:1259 +0x19c fp=0xc00004cfc0 sp=0xc00004cec0 pc=0x101035e1c
runtime.goexit()
	runtime/asm_arm64.s:1133 +0x4 fp=0xc00004cfc0 sp=0xc00004cfc0 pc=0x100fb59f4
created by testing.(*T).Run
	testing/testing.go:1306 +0x5bc

I've tried to run on an Intel CPU with -tags purego, but it won't compile due to duplicated function definitions. Perhaps add a build !purego to the generated _amd64.go files to get this to work and add go test -race -tags purego to your test suite?

Thanks!
-- Aaron

Assembly version for small strings

Maybe replace the < 32 bytes implementation to avoid the double scan + two functions call, if it happens to be an issue with production data.

ARM64 work would lead to Apple Silicon support?

Apologies if this isn't the right place to ask this question, but I couldn't find any other means.
Noticed ARM64 being worked on, does that imply there might be support for Apple Silicon in the future?
Apple Silicon deviates from ARM64 by supporting W^X. Don't know if it's applicable to your algorithms though.

Performance comparison between assembly and cgo

Hi folks,
This library is accelerated using assembly. Have you compared its performance against using cgo for the same acceleration? Although cgo has a notable overhead of around 100ns per call, that might be negligible for larger batch inputs. On the other hand, assembly has its own shortcoming in that assembly functions cannot be inlined, so it would be interesting to see a performance comparison if one is available. Thank you~

Granting more permissive license to integrate upstream

Some of the implementations in this repository are drop-in replacements for standard Go library functions. Contributing them back upstream, to golang/go, could benefit the ecosystem as a whole with little drawback to Twilio Segment, as segmentio/asm is already released under the permissive MIT license. However, we did not anticipate that the attribution requirement of this license would prevent its adoption in the standard library (example: utf8.Valid CL). Including derivatives of MIT-licensed code in the Go standard library is a non-starter, because it would force everyone who distributes Go program binaries to include the license and copyright notice attributing Segment.

As a result, would you consider re-licensing the code in this repository to permit its integration into Go without attribution?

For example: MIT-0 ("MIT No Attribution License").

Thank you!
