c2goasm's Introduction

c2goasm: C to Go Assembly

Introduction

This is a tool to convert assembly as generated by a C/C++ compiler into Golang assembly. It is meant to be used in combination with asm2plan9s in order to automatically generate pure Go wrappers for C/C++ code (that may for instance take advantage of compiler SIMD intrinsics or template<> code).

Mode of operation:

$ c2goasm -a /path/to/some/great/c-code.s /path/to/now/great/golang-code_amd64.s

Optionally, pass the -f flag to format the generated code with asmfmt.

This project was developed as part of building a Go wrapper around Simd. However, it should also work with other projects and libraries. Keep in mind, though, that it is not intended to 'port' a complete C/C++ project in a single action. Instead, convert on a case-by-case basis per function or source file (and write accompanying high-level Go code to call into the assembly code).

Command line options

$ c2goasm --help
Usage of c2goasm:
  -a	Immediately invoke asm2plan9s
  -c	Compact byte codes
  -f	Format using asmfmt
  -s	Strip comments

A simple example

Here is a simple C function doing an AVX2 intrinsics computation:

void MultiplyAndAdd(float* arg1, float* arg2, float* arg3, float* result) {
    __m256 vec1 = _mm256_load_ps(arg1);
    __m256 vec2 = _mm256_load_ps(arg2);
    __m256 vec3 = _mm256_load_ps(arg3);
    __m256 res  = _mm256_fmadd_ps(vec1, vec2, vec3);
    _mm256_storeu_ps(result, res);
}

Compiling into assembly gives the following

__ZN14MultiplyAndAddEPfS1_S1_S1_: ## @_ZN14MultiplyAndAddEPfS1_S1_S1_
## BB#0:
        push          rbp
        mov           rbp, rsp
        vmovups       ymm0, ymmword ptr [rdi]
        vmovups       ymm1, ymmword ptr [rsi]
        vfmadd213ps   ymm1, ymm0, ymmword ptr [rdx]
        vmovups       ymmword ptr [rcx], ymm1
        pop           rbp
        vzeroupper
        ret

Running c2goasm will generate the following Go assembly (e.g. saved in MultiplyAndAdd_amd64.s):

//+build !noasm !appengine
// AUTO-GENERATED BY C2GOASM -- DO NOT EDIT

TEXT ·_MultiplyAndAdd(SB), $0-32

	MOVQ vec1+0(FP), DI
	MOVQ vec2+8(FP), SI
	MOVQ vec3+16(FP), DX
	MOVQ result+24(FP), CX

	LONG $0x0710fcc5             // vmovups    ymm0, yword [rdi]
	LONG $0x0e10fcc5             // vmovups    ymm1, yword [rsi]
	LONG $0xa87de2c4; BYTE $0x0a // vfmadd213ps    ymm1, ymm0, yword [rdx]
	LONG $0x0911fcc5             // vmovups    yword [rcx], ymm1

	VZEROUPPER
	RET

This needs to be accompanied by the following Go code (in MultiplyAndAdd_amd64.go)

//go:noescape
func _MultiplyAndAdd(vec1, vec2, vec3, result unsafe.Pointer)

func MultiplyAndAdd(someObj Object) {
	_MultiplyAndAdd(someObj.GetVec1(), someObj.GetVec2(), someObj.GetVec3(), someObj.GetResult())
}

As you may have gathered, the amd64.go file needs to be in place so that the argument names can be derived (and so that go vet succeeds).
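To illustrate how argument names and frame offsets can be derived from the Go declaration, here is a minimal sketch that uses the standard go/parser package. It is not c2goasm's actual implementation: the helper name argOffsets and the embedded stub source are hypothetical, and the sketch assumes that every argument and return value occupies 8 bytes.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// stub is a hypothetical copy of the declaration from MultiplyAndAdd_amd64.go.
const stub = `package demo

import "unsafe"

//go:noescape
func _MultiplyAndAdd(vec1, vec2, vec3, result unsafe.Pointer)
`

// argOffsets parses Go source and, for the named function, returns one
// "name+offset(FP)" reference per named parameter and result.
// Assumption: every value is 8 bytes wide; unnamed results are skipped.
func argOffsets(src, funcName string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "stub.go", src, 0)
	if err != nil {
		return nil, err
	}
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Name.Name != funcName {
			continue
		}
		var refs []string
		offset := 0
		add := func(fields *ast.FieldList) {
			if fields == nil {
				return
			}
			for _, field := range fields.List {
				for _, name := range field.Names {
					refs = append(refs, fmt.Sprintf("%s+%d(FP)", name.Name, offset))
					offset += 8
				}
			}
		}
		add(fn.Type.Params)
		add(fn.Type.Results)
		return refs, nil
	}
	return nil, fmt.Errorf("function %s not found", funcName)
}

func main() {
	refs, err := argOffsets(stub, "_MultiplyAndAdd")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ref := range refs {
		fmt.Println(ref)
	}
}
```

For the declaration above, the references correspond to the operands of the four MOVQ instructions in the generated assembly.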

Benchmark against cgo

We have run benchmarks of c2goasm versus cgo for both Go version 1.7.5 and 1.8.1. You can find the c2goasm benchmark test in test/ and the cgo test in cgocmp/ respectively. Here are the results for both versions:

$ benchcmp ../cgocmp/cgo-1.7.5.out c2goasm.out 
benchmark                      old ns/op     new ns/op     delta
BenchmarkMultiplyAndAdd-12     382           10.9          -97.15%
$ benchcmp ../cgocmp/cgo-1.8.1.out c2goasm.out 
benchmark                      old ns/op     new ns/op     delta
BenchmarkMultiplyAndAdd-12     236           10.9          -95.38%

As you can see Golang 1.8 has made a significant improvement (38.2%) over 1.7.5, but it is still about 20x slower than directly calling into assembly code as wrapped by c2goasm.
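The percentages and the speed ratio can be re-derived from the ns/op figures in the benchcmp output above. The short sketch below does the arithmetic; the helper name deltaPercent is hypothetical, and the three timings are copied from the table.

```go
package main

import "fmt"

// deltaPercent returns the relative change from oldNs to newNs in percent,
// the quantity shown in benchcmp's "delta" column.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	const (
		cgo175  = 382.0 // ns/op, cgo on Go 1.7.5
		cgo181  = 236.0 // ns/op, cgo on Go 1.8.1
		c2goasm = 10.9  // ns/op, assembly wrapped by c2goasm
	)
	fmt.Printf("cgo 1.7.5 -> c2goasm: %+.2f%%\n", deltaPercent(cgo175, c2goasm))
	fmt.Printf("cgo 1.8.1 -> c2goasm: %+.2f%%\n", deltaPercent(cgo181, c2goasm))
	fmt.Printf("cgo 1.7.5 -> cgo 1.8.1: %+.2f%%\n", deltaPercent(cgo175, cgo181))
	fmt.Printf("cgo 1.8.1 / c2goasm: %.1fx\n", cgo181/c2goasm)
}
```

The last line is the basis for the "about 20x" figure: 236 / 10.9 is roughly 21.7.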

Converted projects

Internals

The basic process is to set up the stack and registers (in the prologue) the way the C code expects them and, upon exiting the subroutine (in the epilogue), to return to the Go world and pass a return value back if required. In more detail:

  • Define assembly subroutine with proper golang decoration in terms of needed stack space and overall size of arguments plus return value.
  • Function arguments are loaded from the golang stack into registers and prior to starting the C code any arguments beyond 6 are stored in C stack space.
  • Stack space is reserved and set up for the C code. Depending on the C code, the stack pointer may be aligned on a certain boundary (especially needed for code that takes advantage of SIMD instructions such as AVX etc.).
  • A constants table is generated (if needed) and any rip-based references are replaced with proper offsets to where Go will put the table.

Limitations

  • Arguments need (for now) to be 64-bit size, meaning either a value or a pointer (this requirement will be lifted)
  • Maximum of 14 arguments (hard limit -- if you hit this, maybe you should rethink your API anyway...)
  • Generally no call statements (thus inline your C code) with a couple of exceptions for functions such as memset and memcpy (see clib_amd64.s)

Generate assembly from C/C++

For projects using cmake, for example, here is how to see a list of assembly targets:

$ make help | grep "\.s"

To see the actual command to generate the assembly

$ make -n SimdAvx2BgraToGray.s

Supported golang architectures

For now just the AMD64 architecture is supported. Also ARM64 should work just fine in a similar fashion but support is lacking at the moment.

Compatible compilers

The following compilers have been tested:

  • clang (Apple LLVM version) on OSX/darwin
  • clang on linux

Compiler flags:

-masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti

Flag                             Explanation
-masm=intel                      Output Intel syntax for assembly
-mno-red-zone                    Do not write below stack pointer (avoid red zone)
-mstackrealign                   Use explicit stack initialization
-mllvm -inline-threshold=1000    Higher limit for inlining heuristic (default=255)
-fno-asynchronous-unwind-tables  Do not generate unwind tables (for debug purposes)
-fno-exceptions                  Disable exception handling
-fno-rtti                        Disable run-time type information

The following flags are only available in clang -cc1 frontend mode (see below):

Flag              Explanation
-fno-jump-tables  Do not use jump tables, as may be generated for switch statements

clang vs clang -cc1

As per the clang FAQ, clang -cc1 is the frontend, and clang is a (mostly GCC compatible) driver for the frontend. To see all options that the driver passes on to the frontend, use -### like this:

$ clang -### -c hello.c
"/usr/lib/llvm/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" etc. etc. etc.

Command line flags for clang

To see all command line flags use either clang --help or clang --help-hidden for the clang driver or clang -cc1 -help for the frontend.

Further optimization and fine tuning

Using the LLVM optimizer (opt) you can further optimize the code generation. Use opt -help or opt -help-hidden for all available options.

An option can be passed in via clang using the -mllvm <value> option, such as -mllvm -inline-threshold=1000 as discussed above.

Also LLVM allows you to tune specific functions via function attributes like define void @f() alwaysinline norecurse { ... }.

What about GCC support?

For now GCC code will not work out of the box. However there is no reason why GCC should not work fundamentally (PRs are welcome).

Resources

License

c2goasm is released under the Apache License v2.0. You can find the complete text in the file LICENSE.

Contributing

Contributions are welcome, please send PRs for any enhancements.

c2goasm's People

Contributors

fwessels, harshavardhana, lrita, pwaller, silbinarywolf

c2goasm's Issues

How to prevent generating a huge constant table?

I have the following C code:

#include <stdint.h>
#if defined ENABLE_AVX2
#define NAME(x) x##_avx2
#elif defined ENABLE_AVX
#define NAME(x) x##_avx
#elif defined ENABLE_SSE4_2
#define NAME(x) x##_sse4_2
#endif

int64_t NAME(sample_sum)(int64_t *beg, int64_t len) {
    int64_t sum = 0;
    int64_t *end = beg + len;
    while (beg < end) {
        sum += *beg++;
    }
    return sum;
}

int64_t NAME(sample_max)(int64_t *beg, int64_t len) {
    int64_t max = 0x8000000000000000;
    int64_t *end = beg + len;
    if (len == 0) {
        return 0;
    }
    while (beg < end) {
        if (*beg > max) {
            max = *beg;
        }
        beg++;
    }
    return max;
}

I compile it to assembly with:

clang -S -DENABLE_AVX2 -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -mavx2 lib/sample.c -o lib/sample_avx2.s

I found that clang/LLVM turns the local variable int64_t max = 0x8000000000000000; into a global constant:

.LBB1_5:
	vpbroadcastq	ymm0, qword ptr [rip + .LCPI1_0]
	vmovdqa	ymm3, ymm0
	vmovdqa	ymm2, ymm0
	vmovdqa	ymm1, ymm0

....

.LCPI1_0:
	.quad	-9223372036854775808    # 0x8000000000000000
	.section	.rodata,"a",@progbits
	.align	32
.LCPI1_1:
	.long	0                       # 0x0
	.long	2                       # 0x2
	.long	4                       # 0x4
	.long	6                       # 0x6
	.zero	4
	.zero	4
	.zero	4
	.zero	4
	.text
	.globl	sample_max_avx2

....

	.ident	"Apple LLVM version 8.0.0 (clang-800.0.42.1)"
	.section	".note.GNU-stack","",@progbits

Thus, when I use c2goasm to generate the Go assembly, getFirstLabelConstants picks up the following lines for the first label, and defineTable then generates a huge constant table from them:

	.quad	-9223372036854775808    # 0x8000000000000000
	.section	.rodata,"a",@progbits
	.align	32

Thanks.
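For scale, the data directives in the excerpt above add up to only a few dozen bytes. The sketch below tallies them per label. It is a hypothetical helper (tableSizes), not c2goasm's parser: it knows only the .quad, .long and .zero directives that appear in the excerpt, counts data bytes without alignment padding, and treats every other directive as contributing nothing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// directiveSize maps fixed-width data directives to their size in bytes.
var directiveSize = map[string]int{".quad": 8, ".long": 4}

// tableSizes sums the data bytes that follow each label.
func tableSizes(lines []string) map[string]int {
	sizes := make(map[string]int)
	label := ""
	for _, line := range lines {
		// Drop trailing comments before splitting into fields.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		switch {
		case strings.HasSuffix(fields[0], ":"):
			label = strings.TrimSuffix(fields[0], ":")
		case label == "":
			// Ignore anything that precedes the first label.
		case fields[0] == ".zero" && len(fields) > 1:
			if n, err := strconv.Atoi(fields[1]); err == nil {
				sizes[label] += n
			}
		default:
			// Non-data directives such as .section or .align add 0.
			sizes[label] += directiveSize[fields[0]]
		}
	}
	return sizes
}

func main() {
	excerpt := []string{
		".LCPI1_0:",
		"\t.quad\t-9223372036854775808    # 0x8000000000000000",
		"\t.section\t.rodata,\"a\",@progbits",
		"\t.align\t32",
		".LCPI1_1:",
		"\t.long\t0", "\t.long\t2", "\t.long\t4", "\t.long\t6",
		"\t.zero\t4", "\t.zero\t4", "\t.zero\t4", "\t.zero\t4",
	}
	sizes := tableSizes(excerpt)
	fmt.Println(sizes[".LCPI1_0"], sizes[".LCPI1_1"])
}
```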

Vectorized loop crashing?

I tried compiling the following C code with clang 7.0.0 (trunk 338352) using the following command:

clang -O3 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -S count.c

int CountFilledEntries(char* entries, int len) {
  int result = 0;
  for (int i = 0; i < len; i++) {
    if (entries[i] != 0) result++;
  }

  return result;
}

Given the stub file:

//go:noescape
func _CountFilledEntries(entries unsafe.Pointer, len uint64) (count uint64)

func CountFilledEntries(entries []byte) uint {
	return uint(_CountFilledEntries(
		unsafe.Pointer((*reflect.SliceHeader)(unsafe.Pointer(&entries)).Data),
		uint64(len(entries))),
	)
}

Running the test below causes the program to crash. Any possible insight as to why?

func TestCount(t *testing.T) {
	a := []byte{0, 1, 2, 9, 4, 0, 3, 0, 0}
	fmt.Println(CountFilledEntries(a))
}
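For comparison, a pure-Go version of the same loop gives the value a working assembly routine would be expected to return for this input. The name countFilledEntriesRef is hypothetical; it mirrors the C function and does not call the generated assembly.

```go
package main

import "fmt"

// countFilledEntriesRef mirrors the C loop: it counts the non-zero bytes.
func countFilledEntriesRef(entries []byte) uint {
	var result uint
	for _, e := range entries {
		if e != 0 {
			result++
		}
	}
	return result
}

func main() {
	a := []byte{0, 1, 2, 9, 4, 0, 3, 0, 0}
	fmt.Println(countFilledEntriesRef(a)) // prints 5
}
```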

Assembly:

//+build !noasm !appengine
// AUTO-GENERATED BY C2GOASM -- DO NOT EDIT

TEXT ·_CountFilledEntries(SB), $0-24

    MOVQ entries+0(FP), DI
    MOVQ len+8(FP), SI

    WORD $0xf685                 // test    esi, esi
	JLE LBB0_1
    WORD $0xf189                 // mov    ecx, esi
    WORD $0xfe83; BYTE $0x07     // cmp    esi, 7
	JA LBB0_4
    WORD $0xd231                 // xor    edx, edx
    WORD $0xc031                 // xor    eax, eax
	JMP LBB0_11
LBB0_1:
    WORD $0xc031                 // xor    eax, eax
    MOVQ AX, count+16(FP)
    RET
LBB0_4:
    WORD $0xca89                 // mov    edx, ecx
    WORD $0xe283; BYTE $0xf8     // and    edx, -8
    LONG $0xf8728d48             // lea    rsi, [rdx - 8]
    WORD $0x8948; BYTE $0xf0     // mov    rax, rsi
    LONG $0x03e8c148             // shr    rax, 3
    LONG $0x01c08348             // add    rax, 1
    WORD $0x8941; BYTE $0xc0     // mov    r8d, eax
    LONG $0x01e08341             // and    r8d, 1
    WORD $0x8548; BYTE $0xf6     // test    rsi, rsi
	JE LBB0_5
    LONG $0x000001be; BYTE $0x00 // mov    esi, 1
    WORD $0x2948; BYTE $0xc6     // sub    rsi, rax
    WORD $0x014c; BYTE $0xc6     // add    rsi, r8
    LONG $0xffc68348             // add    rsi, -1
    LONG $0xd2ef0f66             // pxor    xmm2, xmm2
    WORD $0xc031                 // xor    eax, eax
    LONG $0xdb760f66             // pcmpeqd    xmm3, xmm3
    LONG $0xc0ef0f66             // pxor    xmm0, xmm0
    LONG $0xc9ef0f66             // pxor    xmm1, xmm1
LBB0_7:
    LONG $0x246e0f66; BYTE $0x07 // movd    xmm4, dword [rdi + rax]
    LONG $0xe2600f66             // punpcklbw    xmm4, xmm2
    LONG $0xe2610f66             // punpcklwd    xmm4, xmm2
    LONG $0x6c6e0f66; WORD $0x0407 // movd    xmm5, dword [rdi + rax + 4]
    LONG $0xea600f66             // punpcklbw    xmm5, xmm2
    LONG $0xea610f66             // punpcklwd    xmm5, xmm2
    LONG $0xe2760f66             // pcmpeqd    xmm4, xmm2
    LONG $0xe3ef0f66             // pxor    xmm4, xmm3
    LONG $0xc4fa0f66             // psubd    xmm0, xmm4
    LONG $0xea760f66             // pcmpeqd    xmm5, xmm2
    LONG $0xebef0f66             // pxor    xmm5, xmm3
    LONG $0xcdfa0f66             // psubd    xmm1, xmm5
    LONG $0x646e0f66; WORD $0x0807 // movd    xmm4, dword [rdi + rax + 8]
    LONG $0xe2600f66             // punpcklbw    xmm4, xmm2
    LONG $0xe2610f66             // punpcklwd    xmm4, xmm2
    LONG $0x6c6e0f66; WORD $0x0c07 // movd    xmm5, dword [rdi + rax + 12]
    LONG $0xea600f66             // punpcklbw    xmm5, xmm2
    LONG $0xea610f66             // punpcklwd    xmm5, xmm2
    LONG $0xe2760f66             // pcmpeqd    xmm4, xmm2
    LONG $0xe3ef0f66             // pxor    xmm4, xmm3
    LONG $0xc4fa0f66             // psubd    xmm0, xmm4
    LONG $0xea760f66             // pcmpeqd    xmm5, xmm2
    LONG $0xebef0f66             // pxor    xmm5, xmm3
    LONG $0xcdfa0f66             // psubd    xmm1, xmm5
    LONG $0x10c08348             // add    rax, 16
    LONG $0x02c68348             // add    rsi, 2
	JNE LBB0_7
    WORD $0x854d; BYTE $0xc0     // test    r8, r8
	JE LBB0_10
LBB0_9:
    LONG $0x546e0f66; WORD $0x0407 // movd    xmm2, dword [rdi + rax + 4]
    LONG $0xdbef0f66             // pxor    xmm3, xmm3
    LONG $0xd3600f66             // punpcklbw    xmm2, xmm3
    LONG $0xd3610f66             // punpcklwd    xmm2, xmm3
    LONG $0xd3760f66             // pcmpeqd    xmm2, xmm3
    LONG $0xe4760f66             // pcmpeqd    xmm4, xmm4
    LONG $0xd4ef0f66             // pxor    xmm2, xmm4
    LONG $0xcafa0f66             // psubd    xmm1, xmm2
    LONG $0x146e0f66; BYTE $0x07 // movd    xmm2, dword [rdi + rax]
    LONG $0xd3600f66             // punpcklbw    xmm2, xmm3
    LONG $0xd3610f66             // punpcklwd    xmm2, xmm3
    LONG $0xd3760f66             // pcmpeqd    xmm2, xmm3
    LONG $0xd4ef0f66             // pxor    xmm2, xmm4
    LONG $0xc2fa0f66             // psubd    xmm0, xmm2
LBB0_10:
    LONG $0xc1fe0f66             // paddd    xmm0, xmm1
    LONG $0xc8700f66; BYTE $0x4e // pshufd    xmm1, xmm0, 78
    LONG $0xc8fe0f66             // paddd    xmm1, xmm0
    LONG $0xc1700f66; BYTE $0xe5 // pshufd    xmm0, xmm1, 229
    LONG $0xc1fe0f66             // paddd    xmm0, xmm1
    LONG $0xc07e0f66             // movd    eax, xmm0
    WORD $0x3948; BYTE $0xca     // cmp    rdx, rcx
	JE LBB0_12
LBB0_11:
    LONG $0x01173c80             // cmp    byte [rdi + rdx], 1
    WORD $0xd883; BYTE $0xff     // sbb    eax, -1
    LONG $0x01c28348             // add    rdx, 1
    WORD $0x3948; BYTE $0xd1     // cmp    rcx, rdx
	JNE LBB0_11
LBB0_12:
    WORD $0x8948; BYTE $0xec     // mov    rsp, rbp
    BYTE $0x5d                   // pop    rbp
    BYTE $0xc3                   // ret
LBB0_5:
    LONG $0xc0ef0f66             // pxor    xmm0, xmm0
    WORD $0xc031                 // xor    eax, eax
    LONG $0xc9ef0f66             // pxor    xmm1, xmm1
    WORD $0x854d; BYTE $0xc0     // test    r8, r8
	JNE LBB0_9
	JMP LBB0_10

Add support for gcc

Differences to clang:

  • instructions intermixed within header
  • detection of functions
  • loading of rsp using prefix format
  • constants table

	.file	"SimdSse2BgraToYuv.cpp"
	.intel_syntax noprefix
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDB13:
	.section	.text._ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTB13:
	.p2align 4,,15
	.weak	_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m
	.type	_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m, @function
_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m:
	push	r15
	push	r14
	push	r13
	push	r12
	mov	r12, rsi
	push	rbp
	push	rbx
	and	r12, -32
	sub	rsp, 160
	test	rdx, rdx
	mov	QWORD PTR 120[rsp], rdx
	mov	QWORD PTR 144[rsp], r9
	mov	rbx, QWORD PTR 216[rsp]
	mov	rbp, QWORD PTR 232[rsp]
	je	.L1
	lea	rax, [rcx+rcx]
	lea	r15, [r9+r9]
	lea	r10, [r8+r9]
	movdqa	xmm9, XMMWORD PTR .LC2[rip]
	xor	r14d, r14d
	mov	QWORD PTR 112[rsp], rax
	lea	rax, -32[rsi]
	movdqa	xmm8, XMMWORD PTR .LC3[rip]
	mov	rdx, rax
	lea	r13, 0[0+rax*4]
	mov	QWORD PTR 128[rsp], rax
	shr	rdx
	movdqa	xmm10, XMMWORD PTR .LC4[rip]
	mov	QWORD PTR 136[rsp], rdx
	.p2align 4,,10
	.p2align 3
.L9:
	test	r12, r12
	je	.L6
	movdqa	xmm7, XMMWORD PTR .LC10[rip]
	lea	rax, [rdi+rcx]
	mov	rdx, rdi
	xor	r9d, r9d
	movaps	XMMWORD PTR 56[rsp], xmm7
	movdqa	xmm6, XMMWORD PTR .LC12[rip]
	movdqa	xmm7, XMMWORD PTR .LC11[rip]
	movaps	XMMWORD PTR 88[rsp], xmm6
	movaps	XMMWORD PTR 72[rsp], xmm7
	.p2align 4,,10
	.p2align 3
.L7:
	sub	rdx, -128
	sub	rax, -128
	movdqa	xmm4, XMMWORD PTR -80[rdx]
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	pand	xmm5, xmm4
	psrldq	xmm4, 1
	pand	xmm4, XMMWORD PTR .LC1[rip]
	movdqa	xmm0, XMMWORD PTR -96[rdx]
	pshufd	xmm6, xmm5, 8
	movdqa	xmm3, XMMWORD PTR .LC0[rip]
	pand	xmm3, xmm0
	psrldq	xmm0, 1
	por	xmm4, xmm9
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pshufd	xmm7, xmm3, 8
	pshufd	xmm2, xmm3, 13
	movdqa	xmm1, xmm4
	pshufd	xmm4, xmm5, 13
	por	xmm0, xmm9
	paddd	xmm2, xmm7
	movdqa	xmm7, xmm2
	pshufd	xmm2, xmm1, 13
	paddd	xmm4, xmm6
	pshufd	xmm6, xmm1, 8
	pmaddwd	xmm1, xmm8
	punpcklqdq	xmm7, xmm4
	pshufd	xmm4, xmm0, 13
	paddd	xmm2, xmm6
	movaps	XMMWORD PTR -120[rsp], xmm7
	pshufd	xmm7, xmm0, 8
	pmaddwd	xmm0, xmm8
	paddd	xmm4, xmm7
	movdqa	xmm6, xmm4
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	punpcklqdq	xmm6, xmm2
	movdqa	xmm2, xmm5
	pxor	xmm5, xmm5
	pmaddwd	xmm2, xmm10
	paddd	xmm1, xmm2
	movdqa	xmm2, xmm3
	psrad	xmm1, 14
	movdqa	xmm3, XMMWORD PTR -112[rdx]
	pmaddwd	xmm2, xmm10
	paddd	xmm2, xmm0
	psrad	xmm2, 14
	packssdw	xmm2, xmm1
	movdqa	xmm0, XMMWORD PTR -128[rdx]
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm5
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	pand	xmm5, xmm3
	psrldq	xmm3, 1
	movaps	XMMWORD PTR -104[rsp], xmm6
	pand	xmm0, XMMWORD PTR .LC1[rip]
	movdqa	xmm6, xmm5
	pshufd	xmm11, xmm4, 8
	pand	xmm3, XMMWORD PTR .LC1[rip]
	pshufd	xmm7, xmm6, 8
	pshufd	xmm5, xmm4, 13
	pmaddwd	xmm4, xmm10
	pshufd	xmm1, xmm6, 13
	pmaddwd	xmm6, xmm10
	por	xmm0, xmm9
	por	xmm3, xmm9
	paddd	xmm5, xmm11
	paddd	xmm1, xmm7
	punpcklqdq	xmm5, xmm1
	pshufd	xmm11, xmm0, 8
	pshufd	xmm7, xmm3, 8
	movaps	XMMWORD PTR -88[rsp], xmm5
	pshufd	xmm1, xmm0, 13
	pmaddwd	xmm0, xmm8
	paddd	xmm0, xmm4
	pshufd	xmm5, xmm3, 13
	psrad	xmm0, 14
	pmaddwd	xmm3, xmm8
	paddd	xmm3, xmm6
	psrad	xmm3, 14
	packssdw	xmm0, xmm3
	pxor	xmm3, xmm3
	paddd	xmm1, xmm11
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	paddd	xmm5, xmm7
	punpcklqdq	xmm1, xmm5
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm3
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, xmm2
	movaps	XMMWORD PTR -72[rsp], xmm1
	movaps	XMMWORD PTR [r8+r9*2], xmm0
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	movdqa	xmm0, XMMWORD PTR -32[rdx]
	movdqa	xmm1, XMMWORD PTR -16[rdx]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm2, xmm1
	psrldq	xmm1, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pand	xmm1, XMMWORD PTR .LC1[rip]
	pshufd	xmm5, xmm4, 13
	pshufd	xmm7, xmm4, 8
	por	xmm0, xmm9
	por	xmm1, xmm9
	pshufd	xmm6, xmm2, 8
	paddd	xmm7, xmm5
	movdqa	xmm15, xmm7
	pshufd	xmm3, xmm2, 13
	pmaddwd	xmm2, xmm10
	pshufd	xmm7, xmm0, 8
	pshufd	xmm5, xmm1, 13
	paddd	xmm3, xmm6
	pshufd	xmm6, xmm1, 8
	pmaddwd	xmm1, xmm8
	punpcklqdq	xmm15, xmm3
	paddd	xmm1, xmm2
	movdqa	xmm2, xmm4
	pshufd	xmm3, xmm0, 13
	psrad	xmm1, 14
	pmaddwd	xmm0, xmm8
	paddd	xmm5, xmm6
	pxor	xmm4, xmm4
	pmaddwd	xmm2, xmm10
	paddd	xmm2, xmm0
	psrad	xmm2, 14
	packssdw	xmm2, xmm1
	paddd	xmm3, xmm7
	punpcklqdq	xmm3, xmm5
	movdqa	xmm1, XMMWORD PTR -64[rdx]
	movaps	XMMWORD PTR -56[rsp], xmm15
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm4
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR -40[rsp], xmm3
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	movdqa	xmm3, XMMWORD PTR -48[rdx]
	pand	xmm4, xmm1
	psrldq	xmm1, 1
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	pand	xmm1, XMMWORD PTR .LC1[rip]
	pand	xmm5, xmm3
	psrldq	xmm3, 1
	pand	xmm3, XMMWORD PTR .LC1[rip]
	pshufd	xmm11, xmm4, 8
	pshufd	xmm6, xmm4, 13
	pmaddwd	xmm4, xmm10
	por	xmm1, xmm9
	por	xmm3, xmm9
	pshufd	xmm7, xmm5, 8
	paddd	xmm6, xmm11
	movdqa	xmm14, xmm6
	pshufd	xmm0, xmm5, 13
	pmaddwd	xmm5, xmm10
	pshufd	xmm15, xmm1, 8
	pshufd	xmm6, xmm3, 13
	paddd	xmm0, xmm7
	pshufd	xmm7, xmm3, 8
	pmaddwd	xmm3, xmm8
	punpcklqdq	xmm14, xmm0
	paddd	xmm3, xmm5
	psrad	xmm3, 14
	pshufd	xmm0, xmm1, 13
	pmaddwd	xmm1, xmm8
	paddd	xmm1, xmm4
	pxor	xmm4, xmm4
	psrad	xmm1, 14
	paddd	xmm6, xmm7
	packssdw	xmm1, xmm3
	paddd	xmm0, xmm15
	punpcklqdq	xmm0, xmm6
	movaps	XMMWORD PTR -24[rsp], xmm14
	paddw	xmm1, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm1, xmm4
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm2
	movaps	XMMWORD PTR -8[rsp], xmm0
	movaps	XMMWORD PTR 16[r8+r9*2], xmm1
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	movdqa	xmm0, XMMWORD PTR -96[rax]
	movdqa	xmm1, XMMWORD PTR -80[rax]
	pand	xmm2, xmm0
	psrldq	xmm0, 1
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pand	xmm4, xmm1
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	por	xmm0, xmm9
	movdqa	xmm3, xmm4
	pshufd	xmm7, xmm2, 8
	pshufd	xmm5, xmm2, 13
	pmaddwd	xmm2, xmm10
	por	xmm1, xmm9
	pshufd	xmm6, xmm3, 8
	pmaddwd	xmm3, xmm10
	pshufd	xmm4, xmm4, 13
	paddd	xmm5, xmm7
	movdqa	xmm12, xmm5
	pshufd	xmm7, xmm0, 8
	pshufd	xmm5, xmm0, 13
	pmaddwd	xmm0, xmm8
	paddd	xmm0, xmm2
	paddd	xmm4, xmm6
	pshufd	xmm6, xmm1, 8
	psrad	xmm0, 14
	punpcklqdq	xmm12, xmm4
	movdqa	xmm2, XMMWORD PTR -112[rax]
	pshufd	xmm4, xmm1, 13
	paddd	xmm5, xmm7
	pmaddwd	xmm1, xmm8
	movdqa	xmm13, xmm5
	movdqa	xmm5, xmm0
	paddd	xmm1, xmm3
	psrad	xmm1, 14
	movdqa	xmm0, XMMWORD PTR -128[rax]
	paddd	xmm4, xmm6
	packssdw	xmm5, xmm1
	punpcklqdq	xmm13, xmm4
	movdqa	xmm1, XMMWORD PTR .LC5[rip]
	pxor	xmm4, xmm4
	paddw	xmm1, xmm5
	movaps	XMMWORD PTR 8[rsp], xmm12
	pmaxsw	xmm1, xmm4
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR 24[rsp], xmm13
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	movdqa	xmm3, xmm4
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	pshufd	xmm11, xmm3, 8
	pand	xmm4, xmm2
	psrldq	xmm2, 1
	pand	xmm2, XMMWORD PTR .LC1[rip]
	por	xmm0, xmm9
	pshufd	xmm7, xmm4, 8
	pshufd	xmm5, xmm3, 13
	pmaddwd	xmm3, xmm10
	por	xmm2, xmm9
	pshufd	xmm6, xmm4, 13
	pmaddwd	xmm4, xmm10
	pshufd	xmm12, xmm0, 8
	paddd	xmm5, xmm11
	pshufd	xmm11, xmm2, 8
	paddd	xmm6, xmm7
	pshufd	xmm7, xmm0, 13
	pmaddwd	xmm0, xmm8
	punpcklqdq	xmm5, xmm6
	paddd	xmm0, xmm3
	psrad	xmm0, 14
	pshufd	xmm6, xmm2, 13
	pmaddwd	xmm2, xmm8
	paddd	xmm2, xmm4
	pxor	xmm4, xmm4
	psrad	xmm2, 14
	paddd	xmm7, xmm12
	packssdw	xmm0, xmm2
	paddd	xmm6, xmm11
	movdqa	xmm11, xmm7
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm4
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, xmm1
	punpcklqdq	xmm11, xmm6
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	movaps	XMMWORD PTR [r10+r9*2], xmm0
	movaps	XMMWORD PTR 40[rsp], xmm11
	movdqa	xmm0, XMMWORD PTR -32[rax]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	paddw	xmm5, XMMWORD PTR -88[rsp]
	paddw	xmm5, XMMWORD PTR .LC7[rip]
	movdqa	xmm1, XMMWORD PTR -16[rax]
	psrlw	xmm5, 2
	pshufd	xmm6, xmm4, 13
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	pshufd	xmm7, xmm4, 8
	pmaddwd	xmm4, xmm10
	pand	xmm2, xmm1
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	paddd	xmm7, xmm6
	por	xmm0, xmm9
	pshufd	xmm11, xmm2, 8
	pshufd	xmm3, xmm2, 13
	pmaddwd	xmm2, xmm10
	por	xmm1, xmm9
	pshufd	xmm12, xmm0, 8
	paddd	xmm3, xmm11
	punpcklqdq	xmm7, xmm3
	pshufd	xmm11, xmm1, 8
	pshufd	xmm3, xmm0, 13
	pmaddwd	xmm0, xmm8
	paddd	xmm0, xmm4
	pshufd	xmm6, xmm1, 13
	psrad	xmm0, 14
	pmaddwd	xmm1, xmm8
	paddd	xmm1, xmm2
	movdqa	xmm2, xmm0
	psrad	xmm1, 14
	pxor	xmm4, xmm4
	paddd	xmm3, xmm12
	movdqa	xmm12, XMMWORD PTR .LC0[rip]
	packssdw	xmm2, xmm1
	paddd	xmm6, xmm11
	paddw	xmm7, XMMWORD PTR -56[rsp]
	movdqa	xmm1, XMMWORD PTR -64[rax]
	punpcklqdq	xmm3, xmm6
	paddw	xmm7, XMMWORD PTR .LC7[rip]
	psrlw	xmm7, 2
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm4
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	pand	xmm4, xmm1
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	movdqa	xmm11, xmm4
	movdqa	xmm4, XMMWORD PTR -48[rax]
	pshufd	xmm14, xmm11, 8
	pand	xmm12, xmm4
	psrldq	xmm4, 1
	pand	xmm4, XMMWORD PTR .LC1[rip]
	por	xmm1, xmm9
	pshufd	xmm13, xmm12, 8
	pshufd	xmm6, xmm11, 13
	pmaddwd	xmm11, xmm10
	por	xmm4, xmm9
	pshufd	xmm0, xmm12, 13
	pmaddwd	xmm12, xmm10
	pshufd	xmm15, xmm1, 8
	paddd	xmm6, xmm14
	pshufd	xmm14, xmm4, 8
	paddd	xmm0, xmm13
	pshufd	xmm13, xmm4, 13
	pmaddwd	xmm4, xmm8
	punpcklqdq	xmm6, xmm0
	paddd	xmm4, xmm12
	psrad	xmm4, 14
	pshufd	xmm0, xmm1, 13
	pmaddwd	xmm1, xmm8
	paddd	xmm1, xmm11
	psrad	xmm1, 14
	packssdw	xmm1, xmm4
	paddd	xmm13, xmm14
	pxor	xmm4, xmm4
	paddw	xmm6, XMMWORD PTR -24[rsp]
	paddw	xmm6, XMMWORD PTR .LC7[rip]
	paddd	xmm0, xmm15
	punpcklqdq	xmm0, xmm13
	movdqa	xmm13, xmm7
	paddw	xmm1, XMMWORD PTR .LC5[rip]
	psrlw	xmm6, 2
	movdqa	xmm12, xmm6
	pmaxsw	xmm1, xmm4
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm2
	movdqa	xmm4, XMMWORD PTR -104[rsp]
	paddw	xmm0, XMMWORD PTR -8[rsp]
	paddw	xmm0, XMMWORD PTR .LC7[rip]
	psrlw	xmm0, 2
	pxor	xmm14, xmm14
	movaps	XMMWORD PTR 16[r10+r9*2], xmm1
	paddw	xmm4, XMMWORD PTR 24[rsp]
	paddw	xmm4, XMMWORD PTR .LC7[rip]
	psrlw	xmm4, 2
	movdqa	xmm2, XMMWORD PTR -120[rsp]
	movdqa	xmm1, XMMWORD PTR -72[rsp]
	paddw	xmm2, XMMWORD PTR 8[rsp]
	paddw	xmm2, XMMWORD PTR .LC7[rip]
	psrlw	xmm2, 2
	paddw	xmm1, XMMWORD PTR 40[rsp]
	paddw	xmm1, XMMWORD PTR .LC7[rip]
	paddw	xmm3, XMMWORD PTR -40[rsp]
	paddw	xmm3, XMMWORD PTR .LC7[rip]
	psrlw	xmm3, 2
	movdqa	xmm11, xmm3
	pmaddwd	xmm13, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm12, XMMWORD PTR .LC9[rip]
	movdqa	xmm15, XMMWORD PTR 56[rsp]
	pmaddwd	xmm11, XMMWORD PTR .LC8[rip]
	paddd	xmm13, xmm11
	movdqa	xmm11, xmm0
	psrad	xmm13, 14
	psrlw	xmm1, 2
	pmaddwd	xmm11, XMMWORD PTR .LC8[rip]
	paddd	xmm12, xmm11
	psrad	xmm12, 14
	packssdw	xmm12, xmm13
	movdqa	xmm11, xmm4
	movdqa	xmm13, xmm2
	pmaddwd	xmm11, XMMWORD PTR .LC8[rip]
	paddw	xmm12, xmm15
	pmaddwd	xmm13, XMMWORD PTR .LC9[rip]
	pmaxsw	xmm12, xmm14
	paddd	xmm13, xmm11
	movdqa	xmm14, xmm5
	movdqa	xmm11, xmm1
	psrad	xmm13, 14
	pminsw	xmm12, XMMWORD PTR .LC6[rip]
	pmaddwd	xmm14, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm11, XMMWORD PTR .LC8[rip]
	paddd	xmm11, xmm14
	pxor	xmm14, xmm14
	psrad	xmm11, 14
	packssdw	xmm11, xmm13
	movdqa	xmm13, XMMWORD PTR 88[rsp]
	pmaddwd	xmm7, xmm13
	pmaddwd	xmm2, xmm13
	paddw	xmm11, xmm15
	pmaxsw	xmm11, xmm14
	movdqa	xmm14, XMMWORD PTR 72[rsp]
	pminsw	xmm11, XMMWORD PTR .LC6[rip]
	packuswb	xmm11, xmm12
	pmaddwd	xmm3, xmm14
	paddd	xmm7, xmm3
	movdqa	xmm3, xmm6
	pmaddwd	xmm0, xmm14
	psrad	xmm7, 14
	pmaddwd	xmm4, xmm14
	pmaddwd	xmm3, xmm13
	paddd	xmm3, xmm0
	movdqa	xmm0, xmm5
	psrad	xmm3, 14
	paddd	xmm2, xmm4
	packssdw	xmm3, xmm7
	psrad	xmm2, 14
	pxor	xmm7, xmm7
	pmaddwd	xmm1, xmm14
	pmaddwd	xmm0, xmm13
	paddd	xmm0, xmm1
	psrad	xmm0, 14
	packssdw	xmm0, xmm2
	paddw	xmm3, xmm15
	pmaxsw	xmm3, xmm7
	pminsw	xmm3, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR [rbx+r9], xmm11
	paddw	xmm0, xmm15
	pmaxsw	xmm0, xmm7
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, xmm3
	movaps	XMMWORD PTR 0[rbp+r9], xmm0
	add	r9, 16
	lea	r11, [r9+r9]
	cmp	r12, r11
	ja	.L7
.L6:
	cmp	rsi, r12
	je	.L5
	movdqa	xmm3, XMMWORD PTR .LC0[rip]
	mov	rax, QWORD PTR 128[rsp]
	movdqu	xmm2, XMMWORD PTR 32[rdi+r13]
	movdqa	xmm11, xmm3
	movdqa	xmm7, xmm3
	movdqu	xmm1, XMMWORD PTR 48[rdi+r13]
	pand	xmm11, xmm2
	psrldq	xmm2, 1
	pand	xmm7, xmm1
	psrldq	xmm1, 1
	movdqa	xmm6, XMMWORD PTR .LC1[rip]
	pshufd	xmm4, xmm11, 13
	pshufd	xmm13, xmm11, 8
	pshufd	xmm12, xmm7, 8
	movdqa	xmm5, XMMWORD PTR .LC2[rip]
	pand	xmm2, xmm6
	pshufd	xmm0, xmm7, 13
	paddd	xmm13, xmm4
	movdqa	xmm4, xmm13
	pand	xmm1, xmm6
	por	xmm2, xmm5
	paddd	xmm0, xmm12
	punpcklqdq	xmm4, xmm0
	por	xmm1, xmm5
	pshufd	xmm13, xmm2, 8
	movaps	XMMWORD PTR -120[rsp], xmm4
	pshufd	xmm4, xmm2, 13
	pshufd	xmm12, xmm1, 8
	pshufd	xmm0, xmm1, 13
	paddd	xmm13, xmm4
	movdqa	xmm15, xmm13
	movdqa	xmm4, xmm1
	movdqa	xmm1, XMMWORD PTR .LC4[rip]
	paddd	xmm0, xmm12
	punpcklqdq	xmm15, xmm0
	movdqa	xmm0, XMMWORD PTR .LC3[rip]
	pmaddwd	xmm7, xmm1
	pmaddwd	xmm4, xmm0
	paddd	xmm7, xmm4
	movdqa	xmm4, xmm2
	movdqa	xmm2, xmm11
	psrad	xmm7, 14
	movdqa	xmm11, xmm3
	pmaddwd	xmm4, xmm0
	movaps	XMMWORD PTR -104[rsp], xmm15
	pmaddwd	xmm2, xmm1
	paddd	xmm2, xmm4
	psrad	xmm2, 14
	packssdw	xmm2, xmm7
	pxor	xmm7, xmm7
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm7
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR -88[rsp], xmm2
	movdqa	xmm7, xmm3
	movdqu	xmm4, XMMWORD PTR 16[rdi+r13]
	movdqu	xmm2, XMMWORD PTR [rdi+r13]
	pand	xmm11, xmm4
	psrldq	xmm4, 1
	pand	xmm7, xmm2
	psrldq	xmm2, 1
	pand	xmm4, xmm6
	pshufd	xmm14, xmm11, 8
	pshufd	xmm15, xmm7, 8
	pshufd	xmm13, xmm7, 13
	pmaddwd	xmm7, xmm1
	pshufd	xmm12, xmm11, 13
	pmaddwd	xmm11, xmm1
	pand	xmm2, xmm6
	paddd	xmm13, xmm15
	por	xmm4, xmm5
	paddd	xmm12, xmm14
	movdqa	xmm14, xmm13
	por	xmm2, xmm5
	punpcklqdq	xmm14, xmm12
	pshufd	xmm12, xmm4, 13
	pshufd	xmm15, xmm2, 8
	movaps	XMMWORD PTR -72[rsp], xmm14
	pshufd	xmm13, xmm2, 13
	pmaddwd	xmm2, xmm0
	paddd	xmm2, xmm7
	pshufd	xmm14, xmm4, 8
	psrad	xmm2, 14
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm11
	psrad	xmm4, 14
	packssdw	xmm2, xmm4
	paddd	xmm13, xmm15
	movdqa	xmm7, xmm3
	paddd	xmm12, xmm14
	punpcklqdq	xmm13, xmm12
	pxor	xmm12, xmm12
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	movaps	XMMWORD PTR -56[rsp], xmm13
	pmaxsw	xmm2, xmm12
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	packuswb	xmm2, XMMWORD PTR -88[rsp]
	movdqa	xmm12, xmm3
	movups	XMMWORD PTR [r8+rax], xmm2
	lea	rax, 32[r13+rcx]
	movdqu	xmm11, XMMWORD PTR 112[rdi+r13]
	movdqu	xmm4, XMMWORD PTR 96[rdi+r13]
	pand	xmm7, xmm11
	psrldq	xmm11, 1
	pand	xmm12, xmm4
	psrldq	xmm4, 1
	pand	xmm11, xmm6
	movdqa	xmm13, xmm7
	pand	xmm4, xmm6
	pshufd	xmm15, xmm12, 8
	pshufd	xmm14, xmm13, 8
	pshufd	xmm7, xmm12, 13
	pmaddwd	xmm12, xmm1
	pshufd	xmm2, xmm13, 13
	pmaddwd	xmm13, xmm1
	por	xmm4, xmm5
	por	xmm11, xmm5
	paddd	xmm7, xmm15
	paddd	xmm2, xmm14
	punpcklqdq	xmm7, xmm2
	pshufd	xmm15, xmm4, 8
	pshufd	xmm14, xmm11, 8
	movaps	XMMWORD PTR -88[rsp], xmm7
	pshufd	xmm2, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm12
	pshufd	xmm7, xmm11, 13
	psrad	xmm4, 14
	pmaddwd	xmm11, xmm0
	pxor	xmm12, xmm12
	paddd	xmm11, xmm13
	psrad	xmm11, 14
	paddd	xmm2, xmm15
	packssdw	xmm4, xmm11
	movdqa	xmm11, xmm3
	paddd	xmm7, xmm14
	punpcklqdq	xmm2, xmm7
	movdqa	xmm7, xmm3
	paddw	xmm4, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm4, xmm12
	pminsw	xmm4, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR -40[rsp], xmm2
	movaps	XMMWORD PTR -24[rsp], xmm4
	movdqu	xmm2, XMMWORD PTR 64[rdi+r13]
	movdqu	xmm4, XMMWORD PTR 80[rdi+r13]
	pand	xmm7, xmm2
	psrldq	xmm2, 1
	pand	xmm11, xmm4
	psrldq	xmm4, 1
	pand	xmm2, xmm6
	pshufd	xmm15, xmm7, 8
	pshufd	xmm13, xmm7, 13
	pmaddwd	xmm7, xmm1
	pshufd	xmm14, xmm11, 8
	pand	xmm4, xmm6
	pshufd	xmm12, xmm11, 13
	paddd	xmm13, xmm15
	movdqa	xmm15, xmm13
	por	xmm2, xmm5
	pmaddwd	xmm11, xmm1
	por	xmm4, xmm5
	paddd	xmm12, xmm14
	punpcklqdq	xmm15, xmm12
	pshufd	xmm13, xmm2, 13
	pshufd	xmm14, xmm4, 8
	movaps	XMMWORD PTR -8[rsp], xmm15
	pshufd	xmm12, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm11
	pshufd	xmm15, xmm2, 8
	psrad	xmm4, 14
	pmaddwd	xmm2, xmm0
	pxor	xmm11, xmm11
	paddd	xmm2, xmm7
	psrad	xmm2, 14
	paddd	xmm12, xmm14
	packssdw	xmm2, xmm4
	movdqa	xmm7, xmm3
	paddd	xmm13, xmm15
	movdqa	xmm14, xmm13
	punpcklqdq	xmm14, xmm12
	movaps	XMMWORD PTR 8[rsp], xmm14
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm11
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	packuswb	xmm2, XMMWORD PTR -24[rsp]
	movdqa	xmm11, xmm3
	movups	XMMWORD PTR -16[r8+rsi], xmm2
	mov	rdx, QWORD PTR 144[rsp]
	movdqu	xmm2, XMMWORD PTR [rdi+rax]
	lea	rax, 48[r13+rcx]
	movdqu	xmm4, XMMWORD PTR [rdi+rax]
	lea	rax, [rcx+r13]
	pand	xmm7, xmm2
	psrldq	xmm2, 1
	pand	xmm11, xmm4
	psrldq	xmm4, 1
	pand	xmm2, xmm6
	pshufd	xmm15, xmm7, 8
	pshufd	xmm14, xmm11, 8
	pshufd	xmm13, xmm7, 13
	pmaddwd	xmm7, xmm1
	pshufd	xmm12, xmm11, 13
	pmaddwd	xmm11, xmm1
	pand	xmm4, xmm6
	por	xmm2, xmm5
	paddd	xmm13, xmm15
	paddd	xmm12, xmm14
	punpcklqdq	xmm13, xmm12
	por	xmm4, xmm5
	pshufd	xmm15, xmm2, 8
	movaps	XMMWORD PTR -24[rsp], xmm13
	pshufd	xmm13, xmm2, 13
	pmaddwd	xmm2, xmm0
	paddd	xmm2, xmm7
	pshufd	xmm14, xmm4, 8
	psrad	xmm2, 14
	movdqa	xmm7, xmm3
	pshufd	xmm12, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm11
	paddd	xmm13, xmm15
	movdqa	xmm15, xmm13
	psrad	xmm4, 14
	packssdw	xmm2, xmm4
	movdqa	xmm11, xmm3
	paddd	xmm12, xmm14
	punpcklqdq	xmm15, xmm12
	pxor	xmm12, xmm12
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	movaps	XMMWORD PTR 24[rsp], xmm15
	pmaxsw	xmm2, xmm12
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR 40[rsp], xmm2
	movdqu	xmm2, XMMWORD PTR [rdi+rax]
	lea	rax, 16[r13+rcx]
	movdqu	xmm4, XMMWORD PTR [rdi+rax]
	lea	rax, -32[rsi+rdx]
	pand	xmm7, xmm2
	psrldq	xmm2, 1
	pand	xmm11, xmm4
	psrldq	xmm4, 1
	pand	xmm2, xmm6
	pshufd	xmm15, xmm7, 8
	pshufd	xmm14, xmm11, 8
	pshufd	xmm13, xmm7, 13
	pmaddwd	xmm7, xmm1
	pshufd	xmm12, xmm11, 13
	pmaddwd	xmm11, xmm1
	pand	xmm4, xmm6
	paddd	xmm13, xmm15
	por	xmm2, xmm5
	paddd	xmm12, xmm14
	movdqa	xmm14, xmm13
	por	xmm4, xmm5
	punpcklqdq	xmm14, xmm12
	pshufd	xmm15, xmm2, 8
	pshufd	xmm13, xmm2, 13
	pmaddwd	xmm2, xmm0
	paddd	xmm2, xmm7
	movaps	XMMWORD PTR 56[rsp], xmm14
	psrad	xmm2, 14
	pshufd	xmm14, xmm4, 8
	pshufd	xmm12, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm11
	pxor	xmm11, xmm11
	psrad	xmm4, 14
	movdqa	xmm7, xmm3
	packssdw	xmm2, xmm4
	paddd	xmm13, xmm15
	paddd	xmm12, xmm14
	punpcklqdq	xmm13, xmm12
	movdqa	xmm12, xmm3
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm11
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	packuswb	xmm2, XMMWORD PTR 40[rsp]
	movaps	XMMWORD PTR 72[rsp], xmm13
	movups	XMMWORD PTR [r8+rax], xmm2
	lea	rax, 96[r13+rcx]
	movdqu	xmm4, XMMWORD PTR [rdi+rax]
	lea	rax, 112[r13+rcx]
	movdqu	xmm11, XMMWORD PTR [rdi+rax]
	lea	rax, 64[r13+rcx]
	pand	xmm12, xmm4
	psrldq	xmm4, 1
	pand	xmm7, xmm11
	psrldq	xmm11, 1
	pand	xmm4, xmm6
	pshufd	xmm15, xmm12, 8
	movdqa	xmm13, xmm7
	pand	xmm11, xmm6
	pshufd	xmm7, xmm12, 13
	pmaddwd	xmm12, xmm1
	pshufd	xmm14, xmm13, 8
	pshufd	xmm2, xmm13, 13
	pmaddwd	xmm13, xmm1
	por	xmm4, xmm5
	paddd	xmm7, xmm15
	por	xmm11, xmm5
	paddd	xmm2, xmm14
	punpcklqdq	xmm7, xmm2
	pshufd	xmm15, xmm4, 8
	pshufd	xmm14, xmm11, 8
	movaps	XMMWORD PTR 40[rsp], xmm7
	pshufd	xmm2, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm12
	pshufd	xmm7, xmm11, 13
	psrad	xmm4, 14
	pmaddwd	xmm11, xmm0
	paddd	xmm11, xmm13
	psrad	xmm11, 14
	pxor	xmm12, xmm12
	paddd	xmm2, xmm15
	paddd	xmm7, xmm14
	punpcklqdq	xmm2, xmm7
	movdqu	xmm7, XMMWORD PTR [rdi+rax]
	lea	rax, 80[r13+rcx]
	movaps	XMMWORD PTR 88[rsp], xmm2
	movdqa	xmm2, xmm4
	packssdw	xmm2, xmm11
	movdqu	xmm4, XMMWORD PTR [rdi+rax]
	movdqa	xmm11, xmm3
	lea	rax, -16[rsi+rdx]
	pand	xmm11, xmm7
	psrldq	xmm7, 1
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pand	xmm3, xmm4
	psrldq	xmm4, 1
	pmaxsw	xmm2, xmm12
	pand	xmm4, xmm6
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	pand	xmm7, xmm6
	pshufd	xmm13, xmm11, 8
	pshufd	xmm12, xmm3, 8
	por	xmm7, xmm5
	pshufd	xmm6, xmm3, 13
	pmaddwd	xmm3, xmm1
	pmaddwd	xmm1, xmm11
	por	xmm5, xmm4
	pshufd	xmm4, xmm11, 13
	paddd	xmm6, xmm12
	pshufd	xmm12, xmm7, 13
	pshufd	xmm14, xmm5, 8
	paddd	xmm4, xmm13
	pshufd	xmm13, xmm7, 8
	punpcklqdq	xmm4, xmm6
	pshufd	xmm6, xmm5, 13
	pmaddwd	xmm5, xmm0
	pmaddwd	xmm0, xmm7
	paddd	xmm12, xmm13
	paddd	xmm5, xmm3
	paddd	xmm0, xmm1
	psrad	xmm5, 14
	psrad	xmm0, 14
	packssdw	xmm0, xmm5
	paddd	xmm6, xmm14
	punpcklqdq	xmm12, xmm6
	pxor	xmm6, xmm6
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm6
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, xmm2
	movdqa	xmm6, XMMWORD PTR -72[rsp]
	movups	XMMWORD PTR [r8+rax], xmm0
	paddw	xmm6, XMMWORD PTR 56[rsp]
	paddw	xmm4, XMMWORD PTR -8[rsp]
	mov	rax, QWORD PTR 136[rsp]
	movdqa	xmm5, XMMWORD PTR -56[rsp]
	movdqa	xmm1, XMMWORD PTR .LC7[rip]
	paddw	xmm5, XMMWORD PTR 72[rsp]
	movdqa	xmm13, xmm5
	movdqa	xmm5, XMMWORD PTR -40[rsp]
	paddw	xmm6, xmm1
	movdqa	xmm15, xmm6
	paddw	xmm13, xmm1
	paddw	xmm4, xmm1
	movdqa	xmm11, xmm13
	paddw	xmm5, XMMWORD PTR 88[rsp]
	movdqa	xmm6, XMMWORD PTR -120[rsp]
	movdqa	xmm2, xmm5
	psrlw	xmm4, 2
	movdqa	xmm14, xmm4
	psrlw	xmm15, 2
	movdqa	xmm7, XMMWORD PTR -88[rsp]
	paddw	xmm6, XMMWORD PTR -24[rsp]
	paddw	xmm2, xmm1
	paddw	xmm6, xmm1
	psrlw	xmm2, 2
	movdqa	xmm5, xmm2
	movdqa	xmm3, XMMWORD PTR -104[rsp]
	paddw	xmm7, XMMWORD PTR 40[rsp]
	paddw	xmm7, xmm1
	psrlw	xmm7, 2
	movdqa	xmm13, xmm7
	psrlw	xmm6, 2
	movdqa	xmm0, XMMWORD PTR 8[rsp]
	paddw	xmm3, XMMWORD PTR 24[rsp]
	paddw	xmm3, xmm1
	psrlw	xmm3, 2
	psrlw	xmm11, 2
	paddw	xmm0, xmm12
	movdqa	xmm12, XMMWORD PTR .LC9[rip]
	paddw	xmm0, xmm1
	psrlw	xmm0, 2
	movdqa	xmm1, XMMWORD PTR .LC8[rip]
	pmaddwd	xmm13, xmm12
	pmaddwd	xmm14, xmm12
	pmaddwd	xmm5, xmm1
	paddd	xmm13, xmm5
	movdqa	xmm5, xmm0
	psrad	xmm13, 14
	pmaddwd	xmm5, xmm1
	paddd	xmm5, xmm14
	psrad	xmm5, 14
	packssdw	xmm5, xmm13
	pxor	xmm13, xmm13
	movdqa	xmm14, XMMWORD PTR .LC10[rip]
	movaps	XMMWORD PTR 56[rsp], xmm14
	paddw	xmm5, xmm14
	movdqa	xmm14, xmm6
	pmaxsw	xmm5, xmm13
	pminsw	xmm5, XMMWORD PTR .LC6[rip]
	movdqa	xmm13, xmm3
	pmaddwd	xmm14, xmm12
	pmaddwd	xmm12, xmm15
	pmaddwd	xmm13, xmm1
	pmaddwd	xmm1, xmm11
	paddd	xmm13, xmm14
	paddd	xmm1, xmm12
	psrad	xmm13, 14
	pxor	xmm12, xmm12
	psrad	xmm1, 14
	packssdw	xmm1, xmm13
	movdqa	xmm14, XMMWORD PTR 56[rsp]
	paddw	xmm1, xmm14
	pmaxsw	xmm1, xmm12
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm5
	movdqa	xmm5, XMMWORD PTR .LC11[rip]
	movups	XMMWORD PTR [rbx+rax], xmm1
	pmaddwd	xmm2, xmm5
	pmaddwd	xmm0, xmm5
	pmaddwd	xmm3, xmm5
	pmaddwd	xmm11, xmm5
	movdqa	xmm1, XMMWORD PTR .LC12[rip]
	pmaddwd	xmm7, xmm1
	pmaddwd	xmm4, xmm1
	paddd	xmm2, xmm7
	paddd	xmm0, xmm4
	psrad	xmm2, 14
	psrad	xmm0, 14
	pmaddwd	xmm6, xmm1
	packssdw	xmm0, xmm2
	paddd	xmm3, xmm6
	pmaddwd	xmm15, xmm1
	psrad	xmm3, 14
	paddd	xmm11, xmm15
	psrad	xmm11, 14
	packssdw	xmm11, xmm3
	paddw	xmm0, xmm14
	pmaxsw	xmm0, xmm12
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	paddw	xmm11, xmm14
	pmaxsw	xmm11, xmm12
	pminsw	xmm11, XMMWORD PTR .LC6[rip]
	packuswb	xmm11, xmm0
	movups	XMMWORD PTR 0[rbp+rax], xmm11
.L5:
	add	r8, r15
	add	rbx, QWORD PTR 224[rsp]
	add	rbp, QWORD PTR 240[rsp]
	add	r14, 2
	add	r10, r15
	add	rdi, QWORD PTR 112[rsp]
	cmp	QWORD PTR 120[rsp], r14
	ja	.L9
.L1:
	add	rsp, 160
	pop	rbx
	pop	rbp
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	ret
	.size	_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m, .-_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDE13:
	.section	.text._ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTE13:
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDB14:
	.section	.text._ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTB14:
	.p2align 4,,15
	.weak	_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m
	.type	_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m, @function
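# Simd::Sse2::BgraToYuv420p<false>: the template bool argument is false, per "ILb0E" in the mangled name.
_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m: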
	push	r15
	push	r14
	mov	rax, rsi
	push	r13
	push	r12
	and	rax, -32
	push	rbp
	push	rbx
	sub	rsp, 208
	test	rdx, rdx
	mov	QWORD PTR 144[rsp], rsi
	mov	QWORD PTR 168[rsp], rdx
	mov	QWORD PTR 120[rsp], rax
	je	.L12
	lea	rax, [r9+r9]
	movdqa	xmm9, XMMWORD PTR .LC2[rip]
	mov	r10, rcx
	mov	r11, r9
	mov	QWORD PTR 136[rsp], 0
	mov	QWORD PTR 152[rsp], rax
	lea	rax, [rcx+rcx]
	movdqa	xmm8, XMMWORD PTR .LC3[rip]
	mov	QWORD PTR 160[rsp], rax
	mov	rax, rsi
	sub	rax, 32
	movdqa	xmm10, XMMWORD PTR .LC4[rip]
	mov	rbx, rax
	mov	QWORD PTR 176[rsp], rax
	lea	rax, 0[0+rax*4]
	shr	rbx
	mov	QWORD PTR 184[rsp], rbx
	mov	QWORD PTR 192[rsp], rax
	.p2align 4,,10
	.p2align 3
.L20:
	cmp	QWORD PTR 120[rsp], 0
	je	.L17
	movdqa	xmm7, XMMWORD PTR .LC10[rip]
	lea	r9, 80[r10]
	lea	r15, 32[r10]
	mov	rsi, QWORD PTR 264[rsp]
	mov	rcx, QWORD PTR 280[rsp]
	lea	r14, 48[r10]
	movaps	XMMWORD PTR 56[rsp], xmm7
	mov	QWORD PTR 128[rsp], r9
	lea	r9, 16[r11]
	lea	r13, 16[r10]
	lea	r12, 96[r10]
	lea	rbp, 112[r10]
	lea	rbx, 64[r10]
	xor	eax, eax
	xor	edx, edx
	movdqa	xmm7, XMMWORD PTR .LC11[rip]
	mov	QWORD PTR 72[rsp], r9
	movdqa	xmm6, XMMWORD PTR .LC12[rip]
	movaps	XMMWORD PTR 88[rsp], xmm7
	movaps	XMMWORD PTR 104[rsp], xmm6
	.p2align 4,,10
	.p2align 3
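# Inner loop: each pass reads 128 BGRA bytes (32 pixels) from each of two adjacent rows,
# stores 32 bytes per row through r8 and 16 bytes through each of the pointers in rsi and rcx.
.L18: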
	lea	r9, [r15+rax]
	add	rsi, 16
	add	rcx, 16
	movdqu	xmm4, XMMWORD PTR 48[rdi+rax]
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	pand	xmm5, xmm4
	psrldq	xmm4, 1
	pand	xmm4, XMMWORD PTR .LC1[rip]
	movdqu	xmm0, XMMWORD PTR 32[rdi+rax]
	pshufd	xmm6, xmm5, 8
	movdqa	xmm3, XMMWORD PTR .LC0[rip]
	pand	xmm3, xmm0
	psrldq	xmm0, 1
	por	xmm4, xmm9
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pshufd	xmm7, xmm3, 8
	pshufd	xmm2, xmm3, 13
	movdqa	xmm1, xmm4
	pshufd	xmm4, xmm5, 13
	por	xmm0, xmm9
	paddd	xmm2, xmm7
	movdqa	xmm7, xmm2
	pshufd	xmm2, xmm1, 13
	paddd	xmm4, xmm6
	pshufd	xmm6, xmm1, 8
	pmaddwd	xmm1, xmm8
	punpcklqdq	xmm7, xmm4
	pshufd	xmm4, xmm0, 13
	paddd	xmm2, xmm6
	movaps	XMMWORD PTR -120[rsp], xmm7
	pshufd	xmm7, xmm0, 8
	pmaddwd	xmm0, xmm8
	paddd	xmm4, xmm7
	movdqa	xmm6, xmm4
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	punpcklqdq	xmm6, xmm2
	movdqa	xmm2, xmm5
	pxor	xmm5, xmm5
	pmaddwd	xmm2, xmm10
	paddd	xmm1, xmm2
	movdqa	xmm2, xmm3
	movaps	XMMWORD PTR -104[rsp], xmm6
	psrad	xmm1, 14
	pmaddwd	xmm2, xmm10
	paddd	xmm2, xmm0
	psrad	xmm2, 14
	packssdw	xmm2, xmm1
	movdqu	xmm3, XMMWORD PTR 16[rdi+rax]
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm5
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	movdqu	xmm0, XMMWORD PTR [rdi+rax]
	pand	xmm5, xmm3
	psrldq	xmm3, 1
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm3, XMMWORD PTR .LC1[rip]
	movdqa	xmm6, xmm5
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pshufd	xmm11, xmm4, 8
	pshufd	xmm7, xmm6, 8
	pshufd	xmm5, xmm4, 13
	pmaddwd	xmm4, xmm10
	pshufd	xmm1, xmm6, 13
	pmaddwd	xmm6, xmm10
	por	xmm0, xmm9
	por	xmm3, xmm9
	paddd	xmm5, xmm11
	paddd	xmm1, xmm7
	punpcklqdq	xmm5, xmm1
	pshufd	xmm11, xmm0, 8
	pshufd	xmm7, xmm3, 8
	movaps	XMMWORD PTR -88[rsp], xmm5
	pshufd	xmm1, xmm0, 13
	pmaddwd	xmm0, xmm8
	paddd	xmm0, xmm4
	pshufd	xmm5, xmm3, 13
	psrad	xmm0, 14
	pmaddwd	xmm3, xmm8
	paddd	xmm3, xmm6
	psrad	xmm3, 14
	packssdw	xmm0, xmm3
	pxor	xmm3, xmm3
	paddd	xmm1, xmm11
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	paddd	xmm5, xmm7
	punpcklqdq	xmm1, xmm5
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm3
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, xmm2
	movaps	XMMWORD PTR -72[rsp], xmm1
	movups	XMMWORD PTR [r8+rdx], xmm0
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	movdqu	xmm0, XMMWORD PTR 96[rdi+rax]
	movdqu	xmm1, XMMWORD PTR 112[rdi+rax]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm2, xmm1
	psrldq	xmm1, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pand	xmm1, XMMWORD PTR .LC1[rip]
	pshufd	xmm5, xmm4, 13
	pshufd	xmm7, xmm4, 8
	por	xmm0, xmm9
	por	xmm1, xmm9
	pshufd	xmm6, xmm2, 8
	paddd	xmm7, xmm5
	movdqa	xmm15, xmm7
	pshufd	xmm3, xmm2, 13
	pmaddwd	xmm2, xmm10
	pshufd	xmm7, xmm0, 8
	pshufd	xmm5, xmm1, 13
	paddd	xmm3, xmm6
	pshufd	xmm6, xmm1, 8
	pmaddwd	xmm1, xmm8
	punpcklqdq	xmm15, xmm3
	paddd	xmm1, xmm2
	movdqa	xmm2, xmm4
	pshufd	xmm3, xmm0, 13
	psrad	xmm1, 14
	pmaddwd	xmm0, xmm8
	paddd	xmm5, xmm6
	pxor	xmm4, xmm4
	pmaddwd	xmm2, xmm10
	movaps	XMMWORD PTR -56[rsp], xmm15
	paddd	xmm2, xmm0
	psrad	xmm2, 14
	paddd	xmm3, xmm7
	punpcklqdq	xmm3, xmm5
	packssdw	xmm2, xmm1
	movaps	XMMWORD PTR -40[rsp], xmm3
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm4
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	movdqu	xmm1, XMMWORD PTR 64[rdi+rax]
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	movdqu	xmm3, XMMWORD PTR 80[rdi+rax]
	pand	xmm4, xmm1
	psrldq	xmm1, 1
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	pand	xmm1, XMMWORD PTR .LC1[rip]
	pand	xmm5, xmm3
	psrldq	xmm3, 1
	pand	xmm3, XMMWORD PTR .LC1[rip]
	pshufd	xmm11, xmm4, 8
	pshufd	xmm6, xmm4, 13
	pmaddwd	xmm4, xmm10
	por	xmm1, xmm9
	por	xmm3, xmm9
	pshufd	xmm7, xmm5, 8
	paddd	xmm6, xmm11
	movdqa	xmm14, xmm6
	pshufd	xmm0, xmm5, 13
	pmaddwd	xmm5, xmm10
	pshufd	xmm15, xmm1, 8
	pshufd	xmm6, xmm3, 13
	paddd	xmm0, xmm7
	pshufd	xmm7, xmm3, 8
	pmaddwd	xmm3, xmm8
	punpcklqdq	xmm14, xmm0
	paddd	xmm3, xmm5
	psrad	xmm3, 14
	pshufd	xmm0, xmm1, 13
	pmaddwd	xmm1, xmm8
	paddd	xmm1, xmm4
	pxor	xmm4, xmm4
	psrad	xmm1, 14
	paddd	xmm6, xmm7
	packssdw	xmm1, xmm3
	paddd	xmm0, xmm15
	punpcklqdq	xmm0, xmm6
	movaps	XMMWORD PTR -24[rsp], xmm14
	paddw	xmm1, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm1, xmm4
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm2
	movaps	XMMWORD PTR -8[rsp], xmm0
	movups	XMMWORD PTR 16[r8+rdx], xmm1
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	movdqu	xmm0, XMMWORD PTR [rdi+r9]
	lea	r9, [r14+rax]
	pand	xmm2, xmm0
	psrldq	xmm0, 1
	movdqu	xmm1, XMMWORD PTR [rdi+r9]
	lea	r9, [r10+rax]
	pand	xmm0, XMMWORD PTR .LC1[rip]
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	pshufd	xmm7, xmm2, 8
	pand	xmm4, xmm1
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	por	xmm0, xmm9
	movdqa	xmm3, xmm4
	pshufd	xmm5, xmm2, 13
	pmaddwd	xmm2, xmm10
	pshufd	xmm4, xmm4, 13
	por	xmm1, xmm9
	pshufd	xmm6, xmm3, 8
	paddd	xmm5, xmm7
	movdqa	xmm12, xmm5
	pshufd	xmm7, xmm0, 8
	pmaddwd	xmm3, xmm10
	pshufd	xmm5, xmm0, 13
	pmaddwd	xmm0, xmm8
	paddd	xmm0, xmm2
	paddd	xmm4, xmm6
	pshufd	xmm6, xmm1, 8
	psrad	xmm0, 14
	punpcklqdq	xmm12, xmm4
	pshufd	xmm4, xmm1, 13
	paddd	xmm5, xmm7
	movdqa	xmm13, xmm5
	movdqa	xmm5, xmm0
	pmaddwd	xmm1, xmm8
	paddd	xmm1, xmm3
	psrad	xmm1, 14
	movdqu	xmm0, XMMWORD PTR [rdi+r9]
	lea	r9, 0[r13+rax]
	paddd	xmm4, xmm6
	packssdw	xmm5, xmm1
	punpcklqdq	xmm13, xmm4
	movdqa	xmm1, XMMWORD PTR .LC5[rip]
	pxor	xmm4, xmm4
	paddw	xmm1, xmm5
	movdqu	xmm2, XMMWORD PTR [rdi+r9]
	lea	r9, [r11+rdx]
	pmaxsw	xmm1, xmm4
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	movaps	XMMWORD PTR 8[rsp], xmm12
	movdqa	xmm3, xmm4
	movaps	XMMWORD PTR 24[rsp], xmm13
	pshufd	xmm11, xmm3, 8
	por	xmm0, xmm9
	pshufd	xmm5, xmm3, 13
	pmaddwd	xmm3, xmm10
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	pshufd	xmm12, xmm0, 8
	pand	xmm4, xmm2
	psrldq	xmm2, 1
	paddd	xmm5, xmm11
	pand	xmm2, XMMWORD PTR .LC1[rip]
	pshufd	xmm7, xmm4, 8
	pshufd	xmm6, xmm4, 13
	pmaddwd	xmm4, xmm10
	por	xmm2, xmm9
	paddd	xmm6, xmm7
	pshufd	xmm7, xmm0, 13
	pmaddwd	xmm0, xmm8
	punpcklqdq	xmm5, xmm6
	paddd	xmm0, xmm3
	psrad	xmm0, 14
	pshufd	xmm11, xmm2, 8
	pshufd	xmm6, xmm2, 13
	pmaddwd	xmm2, xmm8
	paddd	xmm2, xmm4
	pxor	xmm4, xmm4
	psrad	xmm2, 14
	paddd	xmm7, xmm12
	packssdw	xmm0, xmm2
	paddd	xmm6, xmm11
	movdqa	xmm11, xmm7
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm4
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, xmm1
	punpcklqdq	xmm11, xmm6
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	movups	XMMWORD PTR [r8+r9], xmm0
	lea	r9, [r12+rax]
	movaps	XMMWORD PTR 40[rsp], xmm11
	movdqu	xmm0, XMMWORD PTR [rdi+r9]
	lea	r9, 0[rbp+rax]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	paddw	xmm5, XMMWORD PTR -88[rsp]
	paddw	xmm5, XMMWORD PTR .LC7[rip]
	movdqu	xmm1, XMMWORD PTR [rdi+r9]
	lea	r9, [rbx+rax]
	psrlw	xmm5, 2
	pshufd	xmm6, xmm4, 13
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	pshufd	xmm7, xmm4, 8
	pmaddwd	xmm4, xmm10
	pand	xmm2, xmm1
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	paddd	xmm7, xmm6
	por	xmm0, xmm9
	pshufd	xmm11, xmm2, 8
	pshufd	xmm3, xmm2, 13
	pmaddwd	xmm2, xmm10
	por	xmm1, xmm9
	pshufd	xmm12, xmm0, 8
	paddd	xmm3, xmm11
	punpcklqdq	xmm7, xmm3
	pshufd	xmm11, xmm1, 8
	pshufd	xmm3, xmm0, 13
	pmaddwd	xmm0, xmm8
	paddd	xmm0, xmm4
	pshufd	xmm6, xmm1, 13
	psrad	xmm0, 14
	pmaddwd	xmm1, xmm8
	paddd	xmm1, xmm2
	movdqa	xmm2, xmm0
	psrad	xmm1, 14
	pxor	xmm4, xmm4
	paddd	xmm3, xmm12
	movdqa	xmm12, XMMWORD PTR .LC0[rip]
	packssdw	xmm2, xmm1
	paddd	xmm6, xmm11
	paddw	xmm7, XMMWORD PTR -56[rsp]
	movdqu	xmm1, XMMWORD PTR [rdi+r9]
	punpcklqdq	xmm3, xmm6
	paddw	xmm7, XMMWORD PTR .LC7[rip]
	mov	r9, QWORD PTR 128[rsp]
	psrlw	xmm7, 2
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm4
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	pand	xmm4, xmm1
	add	r9, rax
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	sub	rax, -128
	movdqa	xmm11, xmm4
	movdqu	xmm4, XMMWORD PTR [rdi+r9]
	pshufd	xmm14, xmm11, 8
	mov	r9, QWORD PTR 72[rsp]
	pand	xmm12, xmm4
	psrldq	xmm4, 1
	pand	xmm4, XMMWORD PTR .LC1[rip]
	add	r9, rdx
	add	rdx, 32
	por	xmm1, xmm9
	pshufd	xmm13, xmm12, 8
	pshufd	xmm6, xmm11, 13
	pmaddwd	xmm11, xmm10
	por	xmm4, xmm9
	pshufd	xmm0, xmm12, 13
	pmaddwd	xmm12, xmm10
	pshufd	xmm15, xmm1, 8
	paddd	xmm6, xmm14
	pshufd	xmm14, xmm4, 8
	paddd	xmm0, xmm13
	pshufd	xmm13, xmm4, 13
	pmaddwd	xmm4, xmm8
	punpcklqdq	xmm6, xmm0
	paddd	xmm4, xmm12
	psrad	xmm4, 14
	pshufd	xmm0, xmm1, 13
	pmaddwd	xmm1, xmm8
	paddd	xmm1, xmm11
	psrad	xmm1, 14
	packssdw	xmm1, xmm4
	paddd	xmm13, xmm14
	pxor	xmm4, xmm4
	paddw	xmm6, XMMWORD PTR -24[rsp]
	paddw	xmm6, XMMWORD PTR .LC7[rip]
	paddd	xmm0, xmm15
	punpcklqdq	xmm0, xmm13
	movdqa	xmm13, xmm7
	paddw	xmm1, XMMWORD PTR .LC5[rip]
	psrlw	xmm6, 2
	movdqa	xmm12, xmm6
	pmaxsw	xmm1, xmm4
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm2
	movdqa	xmm4, XMMWORD PTR -104[rsp]
	pxor	xmm14, xmm14
	movups	XMMWORD PTR [r8+r9], xmm1
	paddw	xmm4, XMMWORD PTR 24[rsp]
	paddw	xmm4, XMMWORD PTR .LC7[rip]
	psrlw	xmm4, 2
	movdqa	xmm2, XMMWORD PTR -120[rsp]
	movdqa	xmm1, XMMWORD PTR -72[rsp]
	paddw	xmm2, XMMWORD PTR 8[rsp]
	paddw	xmm2, XMMWORD PTR .LC7[rip]
	psrlw	xmm2, 2
	paddw	xmm1, XMMWORD PTR 40[rsp]
	paddw	xmm1, XMMWORD PTR .LC7[rip]
	paddw	xmm0, XMMWORD PTR -8[rsp]
	paddw	xmm3, XMMWORD PTR -40[rsp]
	paddw	xmm3, XMMWORD PTR .LC7[rip]
	psrlw	xmm3, 2
	movdqa	xmm11, xmm3
	paddw	xmm0, XMMWORD PTR .LC7[rip]
	pmaddwd	xmm13, XMMWORD PTR .LC9[rip]
	psrlw	xmm0, 2
	pmaddwd	xmm12, XMMWORD PTR .LC9[rip]
	movdqa	xmm15, XMMWORD PTR 56[rsp]
	pmaddwd	xmm11, XMMWORD PTR .LC8[rip]
	paddd	xmm13, xmm11
	movdqa	xmm11, xmm0
	psrad	xmm13, 14
	psrlw	xmm1, 2
	pmaddwd	xmm11, XMMWORD PTR .LC8[rip]
	paddd	xmm12, xmm11
	psrad	xmm12, 14
	packssdw	xmm12, xmm13
	movdqa	xmm11, xmm4
	movdqa	xmm13, xmm2
	pmaddwd	xmm11, XMMWORD PTR .LC8[rip]
	paddw	xmm12, xmm15
	pmaddwd	xmm13, XMMWORD PTR .LC9[rip]
	pmaxsw	xmm12, xmm14
	paddd	xmm13, xmm11
	movdqa	xmm14, xmm5
	movdqa	xmm11, xmm1
	psrad	xmm13, 14
	pminsw	xmm12, XMMWORD PTR .LC6[rip]
	pmaddwd	xmm14, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm11, XMMWORD PTR .LC8[rip]
	paddd	xmm11, xmm14
	pxor	xmm14, xmm14
	psrad	xmm11, 14
	packssdw	xmm11, xmm13
	paddw	xmm11, xmm15
	pmaxsw	xmm11, xmm14
	pminsw	xmm11, XMMWORD PTR .LC6[rip]
	packuswb	xmm11, xmm12
	movups	XMMWORD PTR -16[rsi], xmm11
	movdqa	xmm14, XMMWORD PTR 88[rsp]
	movdqa	xmm13, XMMWORD PTR 104[rsp]
	pmaddwd	xmm3, xmm14
	pmaddwd	xmm0, xmm14
	pmaddwd	xmm4, xmm14
	pmaddwd	xmm1, xmm14
	pmaddwd	xmm7, xmm13
	paddd	xmm7, xmm3
	movdqa	xmm3, xmm6
	psrad	xmm7, 14
	pmaddwd	xmm2, xmm13
	paddd	xmm2, xmm4
	pmaddwd	xmm3, xmm13
	paddd	xmm3, xmm0
	movdqa	xmm0, xmm5
	psrad	xmm3, 14
	psrad	xmm2, 14
	packssdw	xmm3, xmm7
	pmaddwd	xmm0, xmm13
	pxor	xmm7, xmm7
	paddd	xmm0, xmm1
	psrad	xmm0, 14
	packssdw	xmm0, xmm2
	paddw	xmm3, xmm15
	pmaxsw	xmm3, xmm7
	pminsw	xmm3, XMMWORD PTR .LC6[rip]
	paddw	xmm0, xmm15
	pmaxsw	xmm0, xmm7
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, xmm3
	movups	XMMWORD PTR -16[rcx], xmm0
	cmp	rdx, QWORD PTR 120[rsp]
	jb	.L18
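# Tail: 120[rsp] holds the width rounded down to a multiple of 32, 144[rsp] the full width.
# If they differ, the last 32 pixels of the row pair (starting at pixel width-32) are converted separately.
.L17: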
	mov	rsi, QWORD PTR 120[rsp]
	cmp	QWORD PTR 144[rsp], rsi
	je	.L16
	mov	rbx, QWORD PTR 192[rsp]
	mov	rsi, QWORD PTR 176[rsp]
	movdqa	xmm3, XMMWORD PTR .LC0[rip]
	movdqu	xmm2, XMMWORD PTR 32[rdi+rbx]
	lea	rax, 32[rbx+r10]
	movdqa	xmm11, xmm3
	movdqu	xmm1, XMMWORD PTR 48[rdi+rbx]
	movdqa	xmm7, xmm3
	pand	xmm11, xmm2
	psrldq	xmm2, 1
	pand	xmm7, xmm1
	movdqa	xmm6, XMMWORD PTR .LC1[rip]
	psrldq	xmm1, 1
	pshufd	xmm4, xmm11, 13
	pshufd	xmm13, xmm11, 8
	movdqa	xmm5, XMMWORD PTR .LC2[rip]
	pshufd	xmm12, xmm7, 8
	pand	xmm2, xmm6
	pshufd	xmm0, xmm7, 13
	paddd	xmm13, xmm4
	movdqa	xmm4, xmm13
	pand	xmm1, xmm6
	por	xmm2, xmm5
	paddd	xmm0, xmm12
	punpcklqdq	xmm4, xmm0
	por	xmm1, xmm5
	pshufd	xmm13, xmm2, 8
	movaps	XMMWORD PTR -120[rsp], xmm4
	pshufd	xmm4, xmm2, 13
	pshufd	xmm12, xmm1, 8
	pshufd	xmm0, xmm1, 13
	paddd	xmm13, xmm4
	movdqa	xmm15, xmm13
	movdqa	xmm4, xmm1
	movdqa	xmm1, XMMWORD PTR .LC4[rip]
	paddd	xmm0, xmm12
	punpcklqdq	xmm15, xmm0
	movdqa	xmm0, XMMWORD PTR .LC3[rip]
	pmaddwd	xmm7, xmm1
	pmaddwd	xmm4, xmm0
	paddd	xmm7, xmm4
	movdqa	xmm4, xmm2
	movdqa	xmm2, xmm11
	psrad	xmm7, 14
	movdqa	xmm11, xmm3
	pmaddwd	xmm4, xmm0
	movaps	XMMWORD PTR -104[rsp], xmm15
	pmaddwd	xmm2, xmm1
	paddd	xmm2, xmm4
	psrad	xmm2, 14
	packssdw	xmm2, xmm7
	pxor	xmm7, xmm7
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm7
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR -88[rsp], xmm2
	movdqa	xmm7, xmm3
	movdqu	xmm4, XMMWORD PTR 16[rdi+rbx]
	movdqu	xmm2, XMMWORD PTR [rdi+rbx]
	pand	xmm11, xmm4
	psrldq	xmm4, 1
	pand	xmm7, xmm2
	psrldq	xmm2, 1
	pand	xmm4, xmm6
	pshufd	xmm14, xmm11, 8
	pshufd	xmm15, xmm7, 8
	pshufd	xmm13, xmm7, 13
	pmaddwd	xmm7, xmm1
	pshufd	xmm12, xmm11, 13
	pmaddwd	xmm11, xmm1
	pand	xmm2, xmm6
	paddd	xmm13, xmm15
	por	xmm4, xmm5
	paddd	xmm12, xmm14
	movdqa	xmm14, xmm13
	por	xmm2, xmm5
	punpcklqdq	xmm14, xmm12
	pshufd	xmm12, xmm4, 13
	pshufd	xmm15, xmm2, 8
	movaps	XMMWORD PTR -72[rsp], xmm14
	pshufd	xmm13, xmm2, 13
	pmaddwd	xmm2, xmm0
	paddd	xmm2, xmm7
	pshufd	xmm14, xmm4, 8
	psrad	xmm2, 14
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm11
	psrad	xmm4, 14
	packssdw	xmm2, xmm4
	paddd	xmm13, xmm15
	movdqa	xmm7, xmm3
	paddd	xmm12, xmm14
	punpcklqdq	xmm13, xmm12
	pxor	xmm12, xmm12
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	movaps	XMMWORD PTR -56[rsp], xmm13
	pmaxsw	xmm2, xmm12
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	packuswb	xmm2, XMMWORD PTR -88[rsp]
	movdqa	xmm12, xmm3
	movups	XMMWORD PTR [r8+rsi], xmm2
	movdqu	xmm11, XMMWORD PTR 112[rdi+rbx]
	movdqu	xmm4, XMMWORD PTR 96[rdi+rbx]
	pand	xmm7, xmm11
	psrldq	xmm11, 1
	pand	xmm12, xmm4
	psrldq	xmm4, 1
	pand	xmm11, xmm6
	movdqa	xmm13, xmm7
	pand	xmm4, xmm6
	pshufd	xmm15, xmm12, 8
	pshufd	xmm14, xmm13, 8
	pshufd	xmm7, xmm12, 13
	pmaddwd	xmm12, xmm1
	pshufd	xmm2, xmm13, 13
	pmaddwd	xmm13, xmm1
	por	xmm4, xmm5
	por	xmm11, xmm5
	paddd	xmm7, xmm15
	paddd	xmm2, xmm14
	punpcklqdq	xmm7, xmm2
	pshufd	xmm15, xmm4, 8
	pshufd	xmm14, xmm11, 8
	movaps	XMMWORD PTR -88[rsp], xmm7
	pshufd	xmm2, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm12
	pshufd	xmm7, xmm11, 13
	psrad	xmm4, 14
	pmaddwd	xmm11, xmm0
	pxor	xmm12, xmm12
	paddd	xmm11, xmm13
	psrad	xmm11, 14
	paddd	xmm2, xmm15
	packssdw	xmm4, xmm11
	movdqa	xmm11, xmm3
	paddd	xmm7, xmm14
	punpcklqdq	xmm2, xmm7
	movdqa	xmm7, xmm3
	paddw	xmm4, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm4, xmm12
	pminsw	xmm4, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR -40[rsp], xmm2
	movaps	XMMWORD PTR -24[rsp], xmm4
	movdqu	xmm2, XMMWORD PTR 64[rdi+rbx]
	movdqu	xmm4, XMMWORD PTR 80[rdi+rbx]
	pand	xmm7, xmm2
	psrldq	xmm2, 1
	pand	xmm11, xmm4
	psrldq	xmm4, 1
	pand	xmm2, xmm6
	pshufd	xmm15, xmm7, 8
	pshufd	xmm13, xmm7, 13
	pmaddwd	xmm7, xmm1
	pshufd	xmm14, xmm11, 8
	pand	xmm4, xmm6
	pshufd	xmm12, xmm11, 13
	paddd	xmm13, xmm15
	movdqa	xmm15, xmm13
	por	xmm2, xmm5
	pmaddwd	xmm11, xmm1
	por	xmm4, xmm5
	paddd	xmm12, xmm14
	punpcklqdq	xmm15, xmm12
	pshufd	xmm13, xmm2, 13
	pshufd	xmm14, xmm4, 8
	movaps	XMMWORD PTR -8[rsp], xmm15
	pshufd	xmm12, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm11
	pshufd	xmm15, xmm2, 8
	psrad	xmm4, 14
	pmaddwd	xmm2, xmm0
	pxor	xmm11, xmm11
	paddd	xmm2, xmm7
	psrad	xmm2, 14
	paddd	xmm12, xmm14
	packssdw	xmm2, xmm4
	movdqa	xmm7, xmm3
	paddd	xmm13, xmm15
	movdqa	xmm14, xmm13
	punpcklqdq	xmm14, xmm12
	movaps	XMMWORD PTR 8[rsp], xmm14
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm11
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	packuswb	xmm2, XMMWORD PTR -24[rsp]
	movdqa	xmm11, xmm3
	mov	rcx, QWORD PTR 144[rsp]
	movups	XMMWORD PTR -16[r8+rcx], xmm2
	movdqu	xmm2, XMMWORD PTR [rdi+rax]
	lea	rax, 48[rbx+r10]
	movdqu	xmm4, XMMWORD PTR [rdi+rax]
	mov	rax, rbx
	pand	xmm7, xmm2
	psrldq	xmm2, 1
	add	rax, r10
	pand	xmm11, xmm4
	psrldq	xmm4, 1
	pand	xmm2, xmm6
	pshufd	xmm15, xmm7, 8
	pshufd	xmm14, xmm11, 8
	pshufd	xmm13, xmm7, 13
	pmaddwd	xmm7, xmm1
	pshufd	xmm12, xmm11, 13
	pmaddwd	xmm11, xmm1
	pand	xmm4, xmm6
	por	xmm2, xmm5
	paddd	xmm13, xmm15
	paddd	xmm12, xmm14
	punpcklqdq	xmm13, xmm12
	por	xmm4, xmm5
	pshufd	xmm15, xmm2, 8
	movaps	XMMWORD PTR -24[rsp], xmm13
	pshufd	xmm13, xmm2, 13
	pmaddwd	xmm2, xmm0
	paddd	xmm2, xmm7
	pshufd	xmm14, xmm4, 8
	psrad	xmm2, 14
	movdqa	xmm7, xmm3
	pshufd	xmm12, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm11
	paddd	xmm13, xmm15
	movdqa	xmm15, xmm13
	psrad	xmm4, 14
	packssdw	xmm2, xmm4
	movdqa	xmm11, xmm3
	paddd	xmm12, xmm14
	punpcklqdq	xmm15, xmm12
	pxor	xmm12, xmm12
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	movaps	XMMWORD PTR 24[rsp], xmm15
	pmaxsw	xmm2, xmm12
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR 40[rsp], xmm2
	movdqu	xmm2, XMMWORD PTR [rdi+rax]
	lea	rax, 16[rbx+r10]
	movdqu	xmm4, XMMWORD PTR [rdi+rax]
	lea	rax, -32[rcx+r11]
	pand	xmm7, xmm2
	psrldq	xmm2, 1
	pand	xmm11, xmm4
	psrldq	xmm4, 1
	pand	xmm2, xmm6
	pshufd	xmm15, xmm7, 8
	pshufd	xmm14, xmm11, 8
	pshufd	xmm13, xmm7, 13
	pmaddwd	xmm7, xmm1
	pshufd	xmm12, xmm11, 13
	pmaddwd	xmm11, xmm1
	pand	xmm4, xmm6
	paddd	xmm13, xmm15
	por	xmm2, xmm5
	paddd	xmm12, xmm14
	movdqa	xmm14, xmm13
	por	xmm4, xmm5
	punpcklqdq	xmm14, xmm12
	pshufd	xmm15, xmm2, 8
	pshufd	xmm13, xmm2, 13
	pmaddwd	xmm2, xmm0
	paddd	xmm2, xmm7
	movaps	XMMWORD PTR 56[rsp], xmm14
	psrad	xmm2, 14
	pshufd	xmm14, xmm4, 8
	pshufd	xmm12, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm11
	pxor	xmm11, xmm11
	psrad	xmm4, 14
	movdqa	xmm7, xmm3
	packssdw	xmm2, xmm4
	paddd	xmm13, xmm15
	paddd	xmm12, xmm14
	punpcklqdq	xmm13, xmm12
	movdqa	xmm12, xmm3
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm2, xmm11
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	packuswb	xmm2, XMMWORD PTR 40[rsp]
	movaps	XMMWORD PTR 72[rsp], xmm13
	movups	XMMWORD PTR [r8+rax], xmm2
	lea	rax, 96[rbx+r10]
	movdqu	xmm4, XMMWORD PTR [rdi+rax]
	lea	rax, 112[rbx+r10]
	movdqu	xmm11, XMMWORD PTR [rdi+rax]
	lea	rax, 64[rbx+r10]
	pand	xmm12, xmm4
	psrldq	xmm4, 1
	pand	xmm7, xmm11
	psrldq	xmm11, 1
	pand	xmm4, xmm6
	pshufd	xmm15, xmm12, 8
	movdqa	xmm13, xmm7
	pand	xmm11, xmm6
	pshufd	xmm7, xmm12, 13
	pmaddwd	xmm12, xmm1
	pshufd	xmm14, xmm13, 8
	pshufd	xmm2, xmm13, 13
	pmaddwd	xmm13, xmm1
	por	xmm4, xmm5
	paddd	xmm7, xmm15
	por	xmm11, xmm5
	paddd	xmm2, xmm14
	punpcklqdq	xmm7, xmm2
	pshufd	xmm15, xmm4, 8
	pshufd	xmm14, xmm11, 8
	movaps	XMMWORD PTR 40[rsp], xmm7
	pshufd	xmm2, xmm4, 13
	pmaddwd	xmm4, xmm0
	paddd	xmm4, xmm12
	pshufd	xmm7, xmm11, 13
	psrad	xmm4, 14
	pmaddwd	xmm11, xmm0
	paddd	xmm11, xmm13
	psrad	xmm11, 14
	pxor	xmm12, xmm12
	paddd	xmm2, xmm15
	paddd	xmm7, xmm14
	punpcklqdq	xmm2, xmm7
	movdqu	xmm7, XMMWORD PTR [rdi+rax]
	lea	rax, 80[rbx+r10]
	movaps	XMMWORD PTR 88[rsp], xmm2
	movdqa	xmm2, xmm4
	packssdw	xmm2, xmm11
	movdqu	xmm4, XMMWORD PTR [rdi+rax]
	movdqa	xmm11, xmm3
	lea	rax, -16[rcx+r11]
	pand	xmm11, xmm7
	psrldq	xmm7, 1
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	pand	xmm3, xmm4
	psrldq	xmm4, 1
	pmaxsw	xmm2, xmm12
	pand	xmm4, xmm6
	pminsw	xmm2, XMMWORD PTR .LC6[rip]
	pand	xmm7, xmm6
	pshufd	xmm13, xmm11, 8
	pshufd	xmm12, xmm3, 8
	por	xmm7, xmm5
	pshufd	xmm6, xmm3, 13
	pmaddwd	xmm3, xmm1
	pmaddwd	xmm1, xmm11
	por	xmm5, xmm4
	pshufd	xmm4, xmm11, 13
	paddd	xmm6, xmm12
	pshufd	xmm12, xmm7, 13
	pshufd	xmm14, xmm5, 8
	paddd	xmm4, xmm13
	pshufd	xmm13, xmm7, 8
	punpcklqdq	xmm4, xmm6
	pshufd	xmm6, xmm5, 13
	pmaddwd	xmm5, xmm0
	pmaddwd	xmm0, xmm7
	paddd	xmm12, xmm13
	paddd	xmm5, xmm3
	paddd	xmm0, xmm1
	psrad	xmm5, 14
	psrad	xmm0, 14
	packssdw	xmm0, xmm5
	paddd	xmm6, xmm14
	punpcklqdq	xmm12, xmm6
	pxor	xmm6, xmm6
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm6
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, xmm2
	movdqa	xmm6, XMMWORD PTR -72[rsp]
	movups	XMMWORD PTR [r8+rax], xmm0
	paddw	xmm6, XMMWORD PTR 56[rsp]
	paddw	xmm4, XMMWORD PTR -8[rsp]
	mov	rbx, QWORD PTR 264[rsp]
	mov	rax, QWORD PTR 184[rsp]
	movdqa	xmm5, XMMWORD PTR -56[rsp]
	movdqa	xmm1, XMMWORD PTR .LC7[rip]
	paddw	xmm5, XMMWORD PTR 72[rsp]
	movdqa	xmm13, xmm5
	movdqa	xmm5, XMMWORD PTR -40[rsp]
	paddw	xmm6, xmm1
	movdqa	xmm15, xmm6
	paddw	xmm13, xmm1
	paddw	xmm4, xmm1
	movdqa	xmm11, xmm13
	paddw	xmm5, XMMWORD PTR 88[rsp]
	movdqa	xmm6, XMMWORD PTR -120[rsp]
	movdqa	xmm2, xmm5
	psrlw	xmm4, 2
	movdqa	xmm14, xmm4
	psrlw	xmm15, 2
	movdqa	xmm7, XMMWORD PTR -88[rsp]
	paddw	xmm6, XMMWORD PTR -24[rsp]
	paddw	xmm2, xmm1
	paddw	xmm6, xmm1
	psrlw	xmm2, 2
	movdqa	xmm5, xmm2
	movdqa	xmm3, XMMWORD PTR -104[rsp]
	paddw	xmm7, XMMWORD PTR 40[rsp]
	paddw	xmm7, xmm1
	psrlw	xmm7, 2
	movdqa	xmm13, xmm7
	psrlw	xmm6, 2
	movdqa	xmm0, XMMWORD PTR 8[rsp]
	paddw	xmm3, XMMWORD PTR 24[rsp]
	paddw	xmm3, xmm1
	psrlw	xmm3, 2
	psrlw	xmm11, 2
	paddw	xmm0, xmm12
	movdqa	xmm12, XMMWORD PTR .LC9[rip]
	paddw	xmm0, xmm1
	psrlw	xmm0, 2
	movdqa	xmm1, XMMWORD PTR .LC8[rip]
	pmaddwd	xmm13, xmm12
	pmaddwd	xmm14, xmm12
	pmaddwd	xmm5, xmm1
	paddd	xmm13, xmm5
	movdqa	xmm5, xmm0
	psrad	xmm13, 14
	pmaddwd	xmm5, xmm1
	paddd	xmm5, xmm14
	psrad	xmm5, 14
	packssdw	xmm5, xmm13
	pxor	xmm13, xmm13
	movdqa	xmm14, XMMWORD PTR .LC10[rip]
	movaps	XMMWORD PTR 56[rsp], xmm14
	paddw	xmm5, xmm14
	movdqa	xmm14, xmm6
	pmaxsw	xmm5, xmm13
	pminsw	xmm5, XMMWORD PTR .LC6[rip]
	movdqa	xmm13, xmm3
	pmaddwd	xmm14, xmm12
	pmaddwd	xmm12, xmm15
	pmaddwd	xmm13, xmm1
	pmaddwd	xmm1, xmm11
	paddd	xmm13, xmm14
	paddd	xmm1, xmm12
	psrad	xmm13, 14
	pxor	xmm12, xmm12
	psrad	xmm1, 14
	packssdw	xmm1, xmm13
	movdqa	xmm14, XMMWORD PTR 56[rsp]
	paddw	xmm1, xmm14
	pmaxsw	xmm1, xmm12
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm5
	movdqa	xmm5, XMMWORD PTR .LC11[rip]
	movups	XMMWORD PTR [rbx+rax], xmm1
	pmaddwd	xmm2, xmm5
	pmaddwd	xmm0, xmm5
	pmaddwd	xmm3, xmm5
	pmaddwd	xmm11, xmm5
	mov	rsi, QWORD PTR 280[rsp]
	movdqa	xmm1, XMMWORD PTR .LC12[rip]
	pmaddwd	xmm7, xmm1
	pmaddwd	xmm4, xmm1
	paddd	xmm2, xmm7
	paddd	xmm0, xmm4
	psrad	xmm2, 14
	psrad	xmm0, 14
	pmaddwd	xmm6, xmm1
	packssdw	xmm0, xmm2
	paddd	xmm3, xmm6
	pmaddwd	xmm15, xmm1
	psrad	xmm3, 14
	paddd	xmm11, xmm15
	psrad	xmm11, 14
	packssdw	xmm11, xmm3
	paddw	xmm0, xmm14
	pmaxsw	xmm0, xmm12
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	paddw	xmm11, xmm14
	pmaxsw	xmm11, xmm12
	pminsw	xmm11, XMMWORD PTR .LC6[rip]
	packuswb	xmm11, xmm0
	movups	XMMWORD PTR [rsi+rax], xmm11
.L16:
	add	QWORD PTR 136[rsp], 2
	mov	rax, QWORD PTR 272[rsp]
	add	r8, QWORD PTR 152[rsp]
	add	QWORD PTR 264[rsp], rax
	add	rdi, QWORD PTR 160[rsp]
	mov	rax, QWORD PTR 288[rsp]
	add	QWORD PTR 280[rsp], rax
	mov	rax, QWORD PTR 136[rsp]
	cmp	QWORD PTR 168[rsp], rax
	ja	.L20
.L12:
	add	rsp, 208
	pop	rbx
	pop	rbp
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	ret
	.size	_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m, .-_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDE14:
	.section	.text._ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTE14:
	.section	.text.unlikely,"ax",@progbits
.LCOLDB15:
	.text
.LHOTB15:
	.p2align 4,,15
	.globl	_ZN4Simd4Sse213BgraToYuv420pEPKhmmmPhmS3_mS3_m
	.type	_ZN4Simd4Sse213BgraToYuv420pEPKhmmmPhmS3_mS3_m, @function
_ZN4Simd4Sse213BgraToYuv420pEPKhmmmPhmS3_mS3_m:
	test	r8b, 15
	push	rbx
	mov	rax, QWORD PTR 16[rsp]
	mov	r10, QWORD PTR 24[rsp]
	mov	r11, QWORD PTR 32[rsp]
	mov	rbx, QWORD PTR 40[rsp]
	je	.L26
.L23:
	mov	QWORD PTR 40[rsp], rbx
	mov	QWORD PTR 32[rsp], r11
	mov	QWORD PTR 24[rsp], r10
	mov	QWORD PTR 16[rsp], rax
	pop	rbx
	jmp	_ZN4Simd4Sse213BgraToYuv420pILb0EEEvPKhmmmPhmS4_mS4_m@PLT
	.p2align 4,,10
	.p2align 3
.L26:
	test	r9b, 15
	jne	.L23
	test	al, 15
	jne	.L23
	test	r10b, 15
	jne	.L23
	test	r11b, 15
	jne	.L23
	test	bl, 15
	jne	.L23
	test	dil, 15
	jne	.L23
	test	cl, 15
	jne	.L23
	pop	rbx
	jmp	_ZN4Simd4Sse213BgraToYuv420pILb1EEEvPKhmmmPhmS4_mS4_m@PLT
	.size	_ZN4Simd4Sse213BgraToYuv420pEPKhmmmPhmS3_mS3_m, .-_ZN4Simd4Sse213BgraToYuv420pEPKhmmmPhmS3_mS3_m
	.section	.text.unlikely
.LCOLDE15:
	.text
.LHOTE15:
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDB17:
	.section	.text._ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTB17:
	.p2align 4,,15
	.weak	_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m
	.type	_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m, @function
_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m:
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbp
	mov	rbp, rsi
	push	rbx
	and	rbp, -32
	test	rdx, rdx
	mov	r11, QWORD PTR 56[rsp]
	mov	rbx, QWORD PTR 72[rsp]
	je	.L27
	lea	rax, -32[rsi]
	lea	r13, 32[rdi]
	xor	r15d, r15d
	movdqa	xmm7, XMMWORD PTR .LC4[rip]
	mov	r14, rax
	lea	r12, [rdi+rax*4]
	mov	QWORD PTR -16[rsp], rax
	movdqa	xmm8, XMMWORD PTR .LC6[rip]
	shr	r14
	movdqa	xmm9, XMMWORD PTR .LC16[rip]
	.p2align 4,,10
	.p2align 3
.L34:
	test	rbp, rbp
	je	.L32
	movdqa	xmm6, XMMWORD PTR .LC11[rip]
	mov	rax, r13
	xor	edi, edi
	movaps	XMMWORD PTR -56[rsp], xmm6
	movdqa	xmm6, XMMWORD PTR .LC12[rip]
	movdqa	xmm5, XMMWORD PTR .LC10[rip]
	movaps	XMMWORD PTR -40[rsp], xmm6
	movdqa	xmm15, xmm5
	.p2align 4,,10
	.p2align 3
.L29:
	sub	rax, -128
	movdqa	xmm0, XMMWORD PTR -128[rax]
	movdqa	xmm1, XMMWORD PTR -112[rax]
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	pand	xmm5, xmm0
	psrldq	xmm0, 1
	pand	xmm2, xmm1
	psrldq	xmm1, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pand	xmm1, XMMWORD PTR .LC1[rip]
	pshufd	xmm4, xmm5, 13
	pshufd	xmm6, xmm2, 8
	pshufd	xmm11, xmm5, 8
	pshufd	xmm3, xmm2, 13
	pmaddwd	xmm2, xmm7
	por	xmm1, XMMWORD PTR .LC2[rip]
	paddd	xmm11, xmm4
	por	xmm0, XMMWORD PTR .LC2[rip]
	paddd	xmm3, xmm6
	movdqa	xmm6, xmm11
	punpcklqdq	xmm6, xmm3
	pshufd	xmm3, xmm1, 13
	pshufd	xmm10, xmm0, 8
	movaps	XMMWORD PTR -120[rsp], xmm6
	pshufd	xmm4, xmm0, 13
	pmaddwd	xmm0, XMMWORD PTR .LC3[rip]
	pshufd	xmm6, xmm1, 8
	paddd	xmm4, xmm10
	paddd	xmm3, xmm6
	punpcklqdq	xmm4, xmm3
	movdqa	xmm3, xmm1
	movdqa	xmm1, xmm5
	pxor	xmm5, xmm5
	pmaddwd	xmm3, XMMWORD PTR .LC3[rip]
	pmaddwd	xmm1, xmm7
	paddd	xmm2, xmm3
	paddd	xmm1, xmm0
	psrad	xmm2, 14
	psrad	xmm1, 14
	packssdw	xmm1, xmm2
	movdqa	xmm0, XMMWORD PTR -160[rax]
	movdqa	xmm2, XMMWORD PTR -144[rax]
	paddw	xmm1, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm1, xmm5
	pminsw	xmm1, xmm8
	movdqa	xmm3, XMMWORD PTR .LC0[rip]
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	pand	xmm3, xmm0
	psrldq	xmm0, 1
	pand	xmm5, xmm2
	psrldq	xmm2, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	movaps	XMMWORD PTR -104[rsp], xmm4
	pand	xmm2, XMMWORD PTR .LC1[rip]
	movdqa	xmm4, xmm5
	pshufd	xmm11, xmm3, 8
	por	xmm0, XMMWORD PTR .LC2[rip]
	por	xmm2, XMMWORD PTR .LC2[rip]
	pshufd	xmm10, xmm4, 8
	pshufd	xmm5, xmm3, 13
	pmaddwd	xmm3, xmm7
	pshufd	xmm6, xmm4, 13
	pmaddwd	xmm4, xmm7
	pshufd	xmm12, xmm0, 8
	paddd	xmm5, xmm11
	pshufd	xmm11, xmm2, 8
	paddd	xmm6, xmm10
	pshufd	xmm10, xmm0, 13
	pmaddwd	xmm0, XMMWORD PTR .LC3[rip]
	punpcklqdq	xmm5, xmm6
	paddd	xmm0, xmm3
	psrad	xmm0, 14
	pshufd	xmm6, xmm2, 13
	pmaddwd	xmm2, XMMWORD PTR .LC3[rip]
	paddd	xmm2, xmm4
	pxor	xmm3, xmm3
	psrad	xmm2, 14
	paddd	xmm10, xmm12
	packssdw	xmm0, xmm2
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	paddw	xmm5, xmm9
	paddd	xmm6, xmm11
	punpcklqdq	xmm10, xmm6
	psrlw	xmm5, 1
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm3
	pminsw	xmm0, xmm8
	packuswb	xmm0, xmm1
	movaps	XMMWORD PTR -88[rsp], xmm10
	movaps	XMMWORD PTR [r8+rdi*2], xmm0
	movdqa	xmm0, XMMWORD PTR -64[rax]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	movdqa	xmm1, XMMWORD PTR -48[rax]
	pshufd	xmm6, xmm4, 13
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	pshufd	xmm10, xmm4, 8
	pand	xmm2, xmm1
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	paddd	xmm10, xmm6
	pshufd	xmm11, xmm2, 8
	pshufd	xmm3, xmm2, 13
	pmaddwd	xmm2, xmm7
	por	xmm0, XMMWORD PTR .LC2[rip]
	por	xmm1, XMMWORD PTR .LC2[rip]
	paddd	xmm3, xmm11
	punpcklqdq	xmm10, xmm3
	pshufd	xmm12, xmm0, 8
	pshufd	xmm3, xmm0, 13
	pmaddwd	xmm0, XMMWORD PTR .LC3[rip]
	pshufd	xmm11, xmm1, 8
	paddw	xmm10, xmm9
	psrlw	xmm10, 1
	pshufd	xmm6, xmm1, 13
	pmaddwd	xmm1, XMMWORD PTR .LC3[rip]
	paddd	xmm1, xmm2
	movdqa	xmm2, xmm4
	paddd	xmm3, xmm12
	movdqa	xmm4, XMMWORD PTR -80[rax]
	psrad	xmm1, 14
	paddd	xmm6, xmm11
	pmaddwd	xmm2, xmm7
	movdqa	xmm11, XMMWORD PTR .LC0[rip]
	paddd	xmm2, xmm0
	punpcklqdq	xmm3, xmm6
	psrad	xmm2, 14
	movdqa	xmm0, XMMWORD PTR -96[rax]
	packssdw	xmm2, xmm1
	pxor	xmm1, xmm1
	pand	xmm11, xmm0
	psrldq	xmm0, 1
	movdqa	xmm12, XMMWORD PTR .LC0[rip]
	pand	xmm0, XMMWORD PTR .LC1[rip]
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	paddw	xmm3, xmm9
	pmaxsw	xmm2, xmm1
	pminsw	xmm2, xmm8
	psrlw	xmm3, 1
	pand	xmm12, xmm4
	psrldq	xmm4, 1
	pand	xmm4, XMMWORD PTR .LC1[rip]
	por	xmm0, XMMWORD PTR .LC2[rip]
	pshufd	xmm14, xmm11, 8
	pshufd	xmm6, xmm11, 13
	pmaddwd	xmm11, xmm7
	por	xmm4, XMMWORD PTR .LC2[rip]
	pshufd	xmm13, xmm12, 8
	paddd	xmm6, xmm14
	pshufd	xmm1, xmm12, 13
	pmaddwd	xmm12, xmm7
	pshufd	xmm14, xmm0, 8
	paddd	xmm1, xmm13
	pshufd	xmm13, xmm4, 13
	movaps	XMMWORD PTR -72[rsp], xmm14
	punpcklqdq	xmm6, xmm1
	pshufd	xmm14, xmm4, 8
	pmaddwd	xmm4, XMMWORD PTR .LC3[rip]
	paddd	xmm4, xmm12
	pshufd	xmm1, xmm0, 13
	psrad	xmm4, 14
	pmaddwd	xmm0, XMMWORD PTR .LC3[rip]
	paddd	xmm0, xmm11
	psrad	xmm0, 14
	packssdw	xmm0, xmm4
	paddd	xmm13, xmm14
	movdqa	xmm11, XMMWORD PTR -120[rsp]
	paddw	xmm6, xmm9
	paddd	xmm1, XMMWORD PTR -72[rsp]
	punpcklqdq	xmm1, xmm13
	psrlw	xmm6, 1
	pxor	xmm13, xmm13
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	movdqa	xmm4, XMMWORD PTR -104[rsp]
	movdqa	xmm12, xmm6
	paddw	xmm11, xmm9
	psrlw	xmm11, 1
	paddw	xmm1, xmm9
	psrlw	xmm1, 1
	paddw	xmm4, xmm9
	pmaxsw	xmm0, xmm13
	pminsw	xmm0, xmm8
	packuswb	xmm0, xmm2
	movdqa	xmm13, xmm10
	movdqa	xmm2, xmm3
	psrlw	xmm4, 1
	movdqa	xmm14, xmm5
	movaps	XMMWORD PTR 16[r8+rdi*2], xmm0
	pmaddwd	xmm2, XMMWORD PTR .LC8[rip]
	movdqa	xmm0, XMMWORD PTR -88[rsp]
	pmaddwd	xmm13, XMMWORD PTR .LC9[rip]
	paddd	xmm13, xmm2
	movdqa	xmm2, xmm1
	psrad	xmm13, 14
	pmaddwd	xmm12, XMMWORD PTR .LC9[rip]
	paddw	xmm0, xmm9
	psrlw	xmm0, 1
	pmaddwd	xmm14, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm2, XMMWORD PTR .LC8[rip]
	paddd	xmm12, xmm2
	pxor	xmm2, xmm2
	psrad	xmm12, 14
	packssdw	xmm12, xmm13
	movdqa	xmm13, xmm11
	pmaddwd	xmm13, XMMWORD PTR .LC9[rip]
	paddw	xmm12, xmm15
	pmaxsw	xmm12, xmm2
	movdqa	xmm2, xmm4
	pminsw	xmm12, xmm8
	pmaddwd	xmm2, XMMWORD PTR .LC8[rip]
	paddd	xmm13, xmm2
	movdqa	xmm2, xmm0
	psrad	xmm13, 14
	pmaddwd	xmm2, XMMWORD PTR .LC8[rip]
	paddd	xmm2, xmm14
	pxor	xmm14, xmm14
	psrad	xmm2, 14
	packssdw	xmm2, xmm13
	paddw	xmm2, xmm15
	pmaxsw	xmm2, xmm14
	pminsw	xmm2, xmm8
	packuswb	xmm2, xmm12
	movdqa	xmm14, XMMWORD PTR -40[rsp]
	movaps	XMMWORD PTR [r11+rdi], xmm2
	pmaddwd	xmm6, xmm14
	pmaddwd	xmm10, xmm14
	pmaddwd	xmm11, xmm14
	pmaddwd	xmm5, xmm14
	movdqa	xmm2, XMMWORD PTR -56[rsp]
	pmaddwd	xmm3, xmm2
	pmaddwd	xmm1, xmm2
	paddd	xmm3, xmm10
	paddd	xmm1, xmm6
	psrad	xmm3, 14
	pxor	xmm6, xmm6
	psrad	xmm1, 14
	pmaddwd	xmm4, xmm2
	packssdw	xmm1, xmm3
	paddd	xmm4, xmm11
	pmaddwd	xmm0, xmm2
	psrad	xmm4, 14
	paddd	xmm0, xmm5
	psrad	xmm0, 14
	packssdw	xmm0, xmm4
	paddw	xmm1, xmm15
	pmaxsw	xmm1, xmm6
	pminsw	xmm1, xmm8
	paddw	xmm0, xmm15
	pmaxsw	xmm0, xmm6
	pminsw	xmm0, xmm8
	packuswb	xmm0, xmm1
	movaps	XMMWORD PTR [rbx+rdi], xmm0
	add	rdi, 16
	lea	r10, [rdi+rdi]
	cmp	rbp, r10
	ja	.L29
.L32:
	cmp	rsi, rbp
	je	.L31
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	mov	rax, QWORD PTR -16[rsp]
	movdqu	xmm0, XMMWORD PTR 32[r12]
	movdqa	xmm11, xmm4
	movdqa	xmm10, xmm4
	movdqu	xmm3, XMMWORD PTR 48[r12]
	pand	xmm11, xmm0
	psrldq	xmm0, 1
	pand	xmm10, xmm3
	psrldq	xmm3, 1
	movdqa	xmm6, XMMWORD PTR .LC1[rip]
	pshufd	xmm13, xmm11, 8
	pshufd	xmm2, xmm11, 13
	pshufd	xmm12, xmm10, 8
	movdqa	xmm5, XMMWORD PTR .LC2[rip]
	pand	xmm0, xmm6
	pshufd	xmm1, xmm10, 13
	paddd	xmm2, xmm13
	pand	xmm3, xmm6
	por	xmm0, xmm5
	paddd	xmm1, xmm12
	punpcklqdq	xmm2, xmm1
	por	xmm3, xmm5
	pshufd	xmm13, xmm0, 8
	movaps	XMMWORD PTR -120[rsp], xmm2
	pshufd	xmm2, xmm0, 13
	pshufd	xmm12, xmm3, 8
	pshufd	xmm1, xmm3, 13
	paddd	xmm2, xmm13
	movdqa	xmm15, xmm2
	movdqa	xmm2, XMMWORD PTR .LC4[rip]
	paddd	xmm1, xmm12
	punpcklqdq	xmm15, xmm1
	pmaddwd	xmm10, xmm2
	movdqa	xmm1, XMMWORD PTR .LC3[rip]
	pmaddwd	xmm3, xmm1
	paddd	xmm10, xmm3
	movdqa	xmm3, xmm0
	movdqa	xmm0, xmm11
	psrad	xmm10, 14
	movdqa	xmm11, xmm4
	pmaddwd	xmm3, xmm1
	movaps	XMMWORD PTR -104[rsp], xmm15
	pmaddwd	xmm0, xmm2
	paddd	xmm0, xmm3
	pxor	xmm3, xmm3
	psrad	xmm0, 14
	packssdw	xmm0, xmm10
	movdqa	xmm10, xmm4
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm3
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR -88[rsp], xmm0
	movdqu	xmm0, XMMWORD PTR [r12]
	movdqu	xmm3, XMMWORD PTR 16[r12]
	pand	xmm10, xmm0
	psrldq	xmm0, 1
	pand	xmm11, xmm3
	psrldq	xmm3, 1
	pand	xmm0, xmm6
	pshufd	xmm15, xmm10, 8
	pshufd	xmm14, xmm11, 8
	pand	xmm3, xmm6
	pshufd	xmm13, xmm10, 13
	pmaddwd	xmm10, xmm2
	pshufd	xmm12, xmm11, 13
	pmaddwd	xmm11, xmm2
	por	xmm0, xmm5
	por	xmm3, xmm5
	paddd	xmm13, xmm15
	paddd	xmm12, xmm14
	punpcklqdq	xmm13, xmm12
	pshufd	xmm15, xmm0, 8
	pshufd	xmm14, xmm3, 8
	movaps	XMMWORD PTR -72[rsp], xmm13
	pshufd	xmm12, xmm3, 13
	pmaddwd	xmm3, xmm1
	paddd	xmm3, xmm11
	pshufd	xmm13, xmm0, 13
	psrad	xmm3, 14
	pmaddwd	xmm0, xmm1
	paddd	xmm0, xmm10
	pxor	xmm10, xmm10
	psrad	xmm0, 14
	packssdw	xmm0, xmm3
	paddd	xmm12, xmm14
	movdqa	xmm11, xmm4
	paddd	xmm13, xmm15
	movdqa	xmm14, xmm13
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm10
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, XMMWORD PTR -88[rsp]
	punpcklqdq	xmm14, xmm12
	movdqa	xmm12, xmm4
	movaps	XMMWORD PTR -56[rsp], xmm14
	movups	XMMWORD PTR [r8+rax], xmm0
	movdqu	xmm0, XMMWORD PTR 96[r12]
	movdqu	xmm3, XMMWORD PTR 112[r12]
	pand	xmm11, xmm0
	psrldq	xmm0, 1
	pand	xmm12, xmm3
	psrldq	xmm3, 1
	pand	xmm0, xmm6
	pshufd	xmm15, xmm11, 8
	pshufd	xmm14, xmm12, 8
	pand	xmm3, xmm6
	pshufd	xmm10, xmm11, 13
	pmaddwd	xmm11, xmm2
	pshufd	xmm13, xmm12, 13
	pmaddwd	xmm12, xmm2
	por	xmm0, xmm5
	por	xmm3, xmm5
	paddd	xmm10, xmm15
	paddd	xmm13, xmm14
	punpcklqdq	xmm10, xmm13
	pshufd	xmm15, xmm0, 8
	pshufd	xmm13, xmm0, 13
	pmaddwd	xmm0, xmm1
	paddd	xmm0, xmm11
	movaps	XMMWORD PTR -88[rsp], xmm10
	psrad	xmm0, 14
	pshufd	xmm14, xmm3, 8
	pshufd	xmm10, xmm3, 13
	paddd	xmm13, xmm15
	movdqa	xmm15, xmm13
	pmaddwd	xmm3, xmm1
	paddd	xmm3, xmm12
	psrad	xmm3, 14
	movdqa	xmm12, xmm4
	paddd	xmm10, xmm14
	punpcklqdq	xmm15, xmm10
	movaps	XMMWORD PTR -40[rsp], xmm15
	movdqa	xmm15, xmm0
	pxor	xmm0, xmm0
	packssdw	xmm15, xmm3
	movdqa	xmm3, XMMWORD PTR .LC5[rip]
	movdqu	xmm10, XMMWORD PTR 64[r12]
	paddw	xmm3, xmm15
	pmaxsw	xmm3, xmm0
	pminsw	xmm3, XMMWORD PTR .LC6[rip]
	movdqu	xmm0, XMMWORD PTR 80[r12]
	pand	xmm12, xmm10
	psrldq	xmm10, 1
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm10, xmm6
	pand	xmm0, xmm6
	pshufd	xmm13, xmm4, 8
	por	xmm10, xmm5
	pshufd	xmm6, xmm12, 13
	por	xmm5, xmm0
	pshufd	xmm11, xmm12, 8
	pshufd	xmm0, xmm4, 13
	pmaddwd	xmm4, xmm2
	pmaddwd	xmm2, xmm12
	pshufd	xmm14, xmm10, 8
	paddd	xmm11, xmm6
	pshufd	xmm6, xmm5, 13
	paddd	xmm0, xmm13
	pshufd	xmm13, xmm5, 8
	pmaddwd	xmm5, xmm1
	pmaddwd	xmm1, xmm10
	paddd	xmm5, xmm4
	paddd	xmm1, xmm2
	psrad	xmm5, 14
	psrad	xmm1, 14
	packssdw	xmm1, xmm5
	pxor	xmm5, xmm5
	paddd	xmm6, xmm13
	punpcklqdq	xmm11, xmm0
	pshufd	xmm0, xmm10, 13
	paddw	xmm1, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm1, xmm5
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm3
	movdqa	xmm4, xmm11
	paddd	xmm0, xmm14
	punpcklqdq	xmm0, xmm6
	movups	XMMWORD PTR -16[r8+rsi], xmm1
	movdqa	xmm3, XMMWORD PTR .LC16[rip]
	movdqa	xmm6, XMMWORD PTR -120[rsp]
	paddw	xmm4, xmm3
	paddw	xmm0, xmm3
	psrlw	xmm4, 1
	psrlw	xmm0, 1
	movdqa	xmm14, xmm4
	movdqa	xmm5, XMMWORD PTR -72[rsp]
	paddw	xmm6, xmm3
	psrlw	xmm6, 1
	movdqa	xmm10, XMMWORD PTR -88[rsp]
	paddw	xmm5, xmm3
	movdqa	xmm15, xmm5
	movdqa	xmm13, XMMWORD PTR -56[rsp]
	paddw	xmm10, xmm3
	psrlw	xmm10, 1
	movdqa	xmm12, xmm10
	psrlw	xmm15, 1
	movdqa	xmm2, XMMWORD PTR -104[rsp]
	paddw	xmm13, xmm3
	psrlw	xmm13, 1
	movdqa	xmm1, XMMWORD PTR .LC8[rip]
	paddw	xmm2, xmm3
	paddw	xmm3, XMMWORD PTR -40[rsp]
	psrlw	xmm3, 1
	movdqa	xmm5, xmm3
	psrlw	xmm2, 1
	movdqa	xmm11, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm5, xmm1
	pmaddwd	xmm12, xmm11
	paddd	xmm12, xmm5
	movdqa	xmm5, xmm0
	pmaddwd	xmm14, xmm11
	psrad	xmm12, 14
	pmaddwd	xmm5, xmm1
	paddd	xmm5, xmm14
	psrad	xmm5, 14
	packssdw	xmm5, xmm12
	pxor	xmm12, xmm12
	movdqa	xmm14, XMMWORD PTR .LC10[rip]
	movaps	XMMWORD PTR -120[rsp], xmm14
	paddw	xmm5, xmm14
	movdqa	xmm14, xmm6
	pmaxsw	xmm5, xmm12
	pminsw	xmm5, XMMWORD PTR .LC6[rip]
	movdqa	xmm12, xmm2
	pmaddwd	xmm14, xmm11
	pmaddwd	xmm11, xmm15
	pmaddwd	xmm12, xmm1
	pmaddwd	xmm1, xmm13
	paddd	xmm12, xmm14
	paddd	xmm1, xmm11
	psrad	xmm12, 14
	pxor	xmm11, xmm11
	psrad	xmm1, 14
	packssdw	xmm1, xmm12
	movdqa	xmm14, XMMWORD PTR -120[rsp]
	paddw	xmm1, xmm14
	pmaxsw	xmm1, xmm11
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm5
	movdqa	xmm5, XMMWORD PTR .LC11[rip]
	movups	XMMWORD PTR [r11+r14], xmm1
	pmaddwd	xmm3, xmm5
	pmaddwd	xmm0, xmm5
	pmaddwd	xmm2, xmm5
	pmaddwd	xmm13, xmm5
	movdqa	xmm1, XMMWORD PTR .LC12[rip]
	pmaddwd	xmm10, xmm1
	pmaddwd	xmm4, xmm1
	paddd	xmm3, xmm10
	paddd	xmm0, xmm4
	psrad	xmm3, 14
	psrad	xmm0, 14
	pmaddwd	xmm6, xmm1
	packssdw	xmm0, xmm3
	paddd	xmm2, xmm6
	pmaddwd	xmm15, xmm1
	psrad	xmm2, 14
	paddd	xmm13, xmm15
	psrad	xmm13, 14
	packssdw	xmm13, xmm2
	paddw	xmm0, xmm14
	pmaxsw	xmm0, xmm11
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	paddw	xmm13, xmm14
	pmaxsw	xmm13, xmm11
	pminsw	xmm13, XMMWORD PTR .LC6[rip]
	packuswb	xmm13, xmm0
	movups	XMMWORD PTR [rbx+r14], xmm13
.L31:
	add	r15, 1
	add	r8, r9
	add	r11, QWORD PTR 64[rsp]
	add	r12, rcx
	add	r13, rcx
	add	rbx, QWORD PTR 80[rsp]
	cmp	rdx, r15
	jne	.L34
.L27:
	pop	rbx
	pop	rbp
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	ret
	.size	_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m, .-_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDE17:
	.section	.text._ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTE17:
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDB18:
	.section	.text._ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTB18:
	.p2align 4,,15
	.weak	_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m
	.type	_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m, @function
_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m:
	push	r15
	mov	r11, rsi
	push	r14
	push	r13
	push	r12
	and	r11, -32
	push	rbp
	push	rbx
	test	rdx, rdx
	mov	rbp, QWORD PTR 56[rsp]
	mov	r12, QWORD PTR 72[rsp]
	mov	QWORD PTR -24[rsp], r9
	je	.L37
	lea	rax, -32[rsi]
	lea	r13, 32[rdi]
	xor	r15d, r15d
	movdqa	xmm7, XMMWORD PTR .LC4[rip]
	mov	r14, rax
	lea	rbx, [rdi+rax*4]
	mov	QWORD PTR -16[rsp], rax
	movdqa	xmm8, XMMWORD PTR .LC6[rip]
	shr	r14
	movdqa	xmm9, XMMWORD PTR .LC16[rip]
	.p2align 4,,10
	.p2align 3
.L45:
	test	r11, r11
	je	.L42
	movdqa	xmm6, XMMWORD PTR .LC11[rip]
	mov	r10, rbp
	mov	r9, r12
	mov	rax, r13
	xor	edi, edi
	movaps	XMMWORD PTR -56[rsp], xmm6
	movdqa	xmm6, XMMWORD PTR .LC12[rip]
	movdqa	xmm5, XMMWORD PTR .LC10[rip]
	movaps	XMMWORD PTR -40[rsp], xmm6
	movdqa	xmm15, xmm5
	.p2align 4,,10
	.p2align 3
.L43:
	sub	rax, -128
	add	r10, 16
	add	r9, 16
	movdqu	xmm0, XMMWORD PTR -128[rax]
	movdqu	xmm1, XMMWORD PTR -112[rax]
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	pand	xmm5, xmm0
	psrldq	xmm0, 1
	pand	xmm2, xmm1
	psrldq	xmm1, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pand	xmm1, XMMWORD PTR .LC1[rip]
	pshufd	xmm4, xmm5, 13
	pshufd	xmm6, xmm2, 8
	pshufd	xmm11, xmm5, 8
	pshufd	xmm3, xmm2, 13
	pmaddwd	xmm2, xmm7
	por	xmm1, XMMWORD PTR .LC2[rip]
	paddd	xmm11, xmm4
	por	xmm0, XMMWORD PTR .LC2[rip]
	paddd	xmm3, xmm6
	movdqa	xmm6, xmm11
	punpcklqdq	xmm6, xmm3
	pshufd	xmm3, xmm1, 13
	pshufd	xmm10, xmm0, 8
	movaps	XMMWORD PTR -120[rsp], xmm6
	pshufd	xmm4, xmm0, 13
	pmaddwd	xmm0, XMMWORD PTR .LC3[rip]
	pshufd	xmm6, xmm1, 8
	paddd	xmm4, xmm10
	paddd	xmm3, xmm6
	punpcklqdq	xmm4, xmm3
	movdqa	xmm3, xmm1
	movdqa	xmm1, xmm5
	pxor	xmm5, xmm5
	movaps	XMMWORD PTR -104[rsp], xmm4
	pmaddwd	xmm3, XMMWORD PTR .LC3[rip]
	pmaddwd	xmm1, xmm7
	paddd	xmm2, xmm3
	paddd	xmm1, xmm0
	psrad	xmm2, 14
	psrad	xmm1, 14
	packssdw	xmm1, xmm2
	movdqu	xmm0, XMMWORD PTR -160[rax]
	paddw	xmm1, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm1, xmm5
	pminsw	xmm1, xmm8
	movdqu	xmm2, XMMWORD PTR -144[rax]
	movdqa	xmm3, XMMWORD PTR .LC0[rip]
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	pand	xmm3, xmm0
	psrldq	xmm0, 1
	pand	xmm5, xmm2
	psrldq	xmm2, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	pand	xmm2, XMMWORD PTR .LC1[rip]
	movdqa	xmm4, xmm5
	pshufd	xmm11, xmm3, 8
	por	xmm0, XMMWORD PTR .LC2[rip]
	por	xmm2, XMMWORD PTR .LC2[rip]
	pshufd	xmm10, xmm4, 8
	pshufd	xmm5, xmm3, 13
	pmaddwd	xmm3, xmm7
	pshufd	xmm6, xmm4, 13
	pmaddwd	xmm4, xmm7
	pshufd	xmm12, xmm0, 8
	paddd	xmm5, xmm11
	pshufd	xmm11, xmm2, 8
	paddd	xmm6, xmm10
	pshufd	xmm10, xmm0, 13
	pmaddwd	xmm0, XMMWORD PTR .LC3[rip]
	punpcklqdq	xmm5, xmm6
	paddd	xmm0, xmm3
	psrad	xmm0, 14
	pshufd	xmm6, xmm2, 13
	pmaddwd	xmm2, XMMWORD PTR .LC3[rip]
	paddd	xmm2, xmm4
	pxor	xmm3, xmm3
	psrad	xmm2, 14
	paddd	xmm10, xmm12
	packssdw	xmm0, xmm2
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	paddw	xmm5, xmm9
	paddd	xmm6, xmm11
	punpcklqdq	xmm10, xmm6
	psrlw	xmm5, 1
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm3
	pminsw	xmm0, xmm8
	packuswb	xmm0, xmm1
	movaps	XMMWORD PTR -88[rsp], xmm10
	movups	XMMWORD PTR [r8+rdi], xmm0
	movdqu	xmm0, XMMWORD PTR -64[rax]
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm0, XMMWORD PTR .LC1[rip]
	movdqu	xmm1, XMMWORD PTR -48[rax]
	pshufd	xmm6, xmm4, 13
	movdqa	xmm2, XMMWORD PTR .LC0[rip]
	pshufd	xmm10, xmm4, 8
	pand	xmm2, xmm1
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	paddd	xmm10, xmm6
	pshufd	xmm11, xmm2, 8
	pshufd	xmm3, xmm2, 13
	pmaddwd	xmm2, xmm7
	por	xmm0, XMMWORD PTR .LC2[rip]
	por	xmm1, XMMWORD PTR .LC2[rip]
	paddd	xmm3, xmm11
	punpcklqdq	xmm10, xmm3
	pshufd	xmm12, xmm0, 8
	pshufd	xmm3, xmm0, 13
	pmaddwd	xmm0, XMMWORD PTR .LC3[rip]
	pshufd	xmm11, xmm1, 8
	paddw	xmm10, xmm9
	psrlw	xmm10, 1
	pshufd	xmm6, xmm1, 13
	pmaddwd	xmm1, XMMWORD PTR .LC3[rip]
	paddd	xmm1, xmm2
	movdqa	xmm2, xmm4
	paddd	xmm3, xmm12
	movdqu	xmm4, XMMWORD PTR -80[rax]
	psrad	xmm1, 14
	paddd	xmm6, xmm11
	pmaddwd	xmm2, xmm7
	movdqa	xmm11, XMMWORD PTR .LC0[rip]
	paddd	xmm2, xmm0
	punpcklqdq	xmm3, xmm6
	psrad	xmm2, 14
	movdqu	xmm0, XMMWORD PTR -96[rax]
	packssdw	xmm2, xmm1
	pxor	xmm1, xmm1
	pand	xmm11, xmm0
	psrldq	xmm0, 1
	movdqa	xmm12, XMMWORD PTR .LC0[rip]
	pand	xmm0, XMMWORD PTR .LC1[rip]
	paddw	xmm2, XMMWORD PTR .LC5[rip]
	paddw	xmm3, xmm9
	pmaxsw	xmm2, xmm1
	pminsw	xmm2, xmm8
	psrlw	xmm3, 1
	pand	xmm12, xmm4
	psrldq	xmm4, 1
	pand	xmm4, XMMWORD PTR .LC1[rip]
	por	xmm0, XMMWORD PTR .LC2[rip]
	pshufd	xmm14, xmm11, 8
	pshufd	xmm6, xmm11, 13
	pmaddwd	xmm11, xmm7
	por	xmm4, XMMWORD PTR .LC2[rip]
	pshufd	xmm13, xmm12, 8
	paddd	xmm6, xmm14
	pshufd	xmm1, xmm12, 13
	pmaddwd	xmm12, xmm7
	pshufd	xmm14, xmm0, 8
	paddd	xmm1, xmm13
	pshufd	xmm13, xmm4, 13
	movaps	XMMWORD PTR -72[rsp], xmm14
	punpcklqdq	xmm6, xmm1
	pshufd	xmm14, xmm4, 8
	pmaddwd	xmm4, XMMWORD PTR .LC3[rip]
	paddd	xmm4, xmm12
	pshufd	xmm1, xmm0, 13
	psrad	xmm4, 14
	pmaddwd	xmm0, XMMWORD PTR .LC3[rip]
	paddd	xmm0, xmm11
	psrad	xmm0, 14
	packssdw	xmm0, xmm4
	paddd	xmm13, xmm14
	paddw	xmm6, xmm9
	psrlw	xmm6, 1
	paddd	xmm1, XMMWORD PTR -72[rsp]
	punpcklqdq	xmm1, xmm13
	pxor	xmm13, xmm13
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	movdqa	xmm12, xmm6
	movdqa	xmm14, xmm5
	paddw	xmm1, xmm9
	psrlw	xmm1, 1
	pmaxsw	xmm0, xmm13
	pminsw	xmm0, xmm8
	packuswb	xmm0, xmm2
	movdqa	xmm13, xmm10
	movdqa	xmm2, xmm3
	movups	XMMWORD PTR 16[r8+rdi], xmm0
	pmaddwd	xmm2, XMMWORD PTR .LC8[rip]
	add	rdi, 32
	movdqa	xmm11, XMMWORD PTR -120[rsp]
	movdqa	xmm0, XMMWORD PTR -88[rsp]
	paddw	xmm11, xmm9
	psrlw	xmm11, 1
	movdqa	xmm4, XMMWORD PTR -104[rsp]
	pmaddwd	xmm13, XMMWORD PTR .LC9[rip]
	paddd	xmm13, xmm2
	movdqa	xmm2, xmm1
	psrad	xmm13, 14
	pmaddwd	xmm12, XMMWORD PTR .LC9[rip]
	paddw	xmm4, xmm9
	psrlw	xmm4, 1
	paddw	xmm0, xmm9
	pmaddwd	xmm2, XMMWORD PTR .LC8[rip]
	paddd	xmm12, xmm2
	pxor	xmm2, xmm2
	psrad	xmm12, 14
	packssdw	xmm12, xmm13
	movdqa	xmm13, xmm11
	psrlw	xmm0, 1
	pmaddwd	xmm14, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm13, XMMWORD PTR .LC9[rip]
	paddw	xmm12, xmm15
	pmaxsw	xmm12, xmm2
	movdqa	xmm2, xmm4
	pminsw	xmm12, xmm8
	pmaddwd	xmm2, XMMWORD PTR .LC8[rip]
	paddd	xmm13, xmm2
	movdqa	xmm2, xmm0
	psrad	xmm13, 14
	pmaddwd	xmm2, XMMWORD PTR .LC8[rip]
	paddd	xmm2, xmm14
	pxor	xmm14, xmm14
	psrad	xmm2, 14
	packssdw	xmm2, xmm13
	paddw	xmm2, xmm15
	pmaxsw	xmm2, xmm14
	pminsw	xmm2, xmm8
	packuswb	xmm2, xmm12
	movups	XMMWORD PTR -16[r10], xmm2
	movdqa	xmm2, XMMWORD PTR -56[rsp]
	movdqa	xmm14, XMMWORD PTR -40[rsp]
	pmaddwd	xmm3, xmm2
	pmaddwd	xmm1, xmm2
	pmaddwd	xmm4, xmm2
	pmaddwd	xmm0, xmm2
	pmaddwd	xmm6, xmm14
	pmaddwd	xmm10, xmm14
	paddd	xmm1, xmm6
	paddd	xmm3, xmm10
	pxor	xmm6, xmm6
	psrad	xmm3, 14
	psrad	xmm1, 14
	pmaddwd	xmm11, xmm14
	packssdw	xmm1, xmm3
	paddd	xmm4, xmm11
	pmaddwd	xmm5, xmm14
	psrad	xmm4, 14
	paddd	xmm0, xmm5
	psrad	xmm0, 14
	packssdw	xmm0, xmm4
	paddw	xmm1, xmm15
	pmaxsw	xmm1, xmm6
	pminsw	xmm1, xmm8
	paddw	xmm0, xmm15
	pmaxsw	xmm0, xmm6
	pminsw	xmm0, xmm8
	packuswb	xmm0, xmm1
	movups	XMMWORD PTR -16[r9], xmm0
	cmp	rdi, r11
	jb	.L43
.L42:
	cmp	rsi, r11
	je	.L41
	movdqa	xmm4, XMMWORD PTR .LC0[rip]
	mov	rax, QWORD PTR -16[rsp]
	movdqu	xmm0, XMMWORD PTR 32[rbx]
	movdqa	xmm11, xmm4
	movdqa	xmm10, xmm4
	movdqu	xmm3, XMMWORD PTR 48[rbx]
	pand	xmm11, xmm0
	psrldq	xmm0, 1
	pand	xmm10, xmm3
	psrldq	xmm3, 1
	movdqa	xmm6, XMMWORD PTR .LC1[rip]
	pshufd	xmm13, xmm11, 8
	pshufd	xmm2, xmm11, 13
	pshufd	xmm12, xmm10, 8
	movdqa	xmm5, XMMWORD PTR .LC2[rip]
	pand	xmm0, xmm6
	pshufd	xmm1, xmm10, 13
	paddd	xmm2, xmm13
	pand	xmm3, xmm6
	por	xmm0, xmm5
	paddd	xmm1, xmm12
	punpcklqdq	xmm2, xmm1
	por	xmm3, xmm5
	pshufd	xmm13, xmm0, 8
	movaps	XMMWORD PTR -120[rsp], xmm2
	pshufd	xmm2, xmm0, 13
	pshufd	xmm12, xmm3, 8
	pshufd	xmm1, xmm3, 13
	paddd	xmm2, xmm13
	movdqa	xmm15, xmm2
	movdqa	xmm2, XMMWORD PTR .LC4[rip]
	paddd	xmm1, xmm12
	punpcklqdq	xmm15, xmm1
	pmaddwd	xmm10, xmm2
	movdqa	xmm1, XMMWORD PTR .LC3[rip]
	pmaddwd	xmm3, xmm1
	paddd	xmm10, xmm3
	movdqa	xmm3, xmm0
	movdqa	xmm0, xmm11
	psrad	xmm10, 14
	movdqa	xmm11, xmm4
	pmaddwd	xmm3, xmm1
	movaps	XMMWORD PTR -104[rsp], xmm15
	pmaddwd	xmm0, xmm2
	paddd	xmm0, xmm3
	pxor	xmm3, xmm3
	psrad	xmm0, 14
	packssdw	xmm0, xmm10
	movdqa	xmm10, xmm4
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm3
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	movaps	XMMWORD PTR -88[rsp], xmm0
	movdqu	xmm0, XMMWORD PTR [rbx]
	movdqu	xmm3, XMMWORD PTR 16[rbx]
	pand	xmm10, xmm0
	psrldq	xmm0, 1
	pand	xmm11, xmm3
	psrldq	xmm3, 1
	pand	xmm0, xmm6
	pshufd	xmm15, xmm10, 8
	pshufd	xmm14, xmm11, 8
	pand	xmm3, xmm6
	pshufd	xmm13, xmm10, 13
	pmaddwd	xmm10, xmm2
	pshufd	xmm12, xmm11, 13
	pmaddwd	xmm11, xmm2
	por	xmm0, xmm5
	por	xmm3, xmm5
	paddd	xmm13, xmm15
	paddd	xmm12, xmm14
	punpcklqdq	xmm13, xmm12
	pshufd	xmm15, xmm0, 8
	pshufd	xmm14, xmm3, 8
	movaps	XMMWORD PTR -72[rsp], xmm13
	pshufd	xmm12, xmm3, 13
	pmaddwd	xmm3, xmm1
	paddd	xmm3, xmm11
	pshufd	xmm13, xmm0, 13
	psrad	xmm3, 14
	pmaddwd	xmm0, xmm1
	paddd	xmm0, xmm10
	pxor	xmm10, xmm10
	psrad	xmm0, 14
	packssdw	xmm0, xmm3
	paddd	xmm12, xmm14
	movdqa	xmm11, xmm4
	paddd	xmm13, xmm15
	movdqa	xmm14, xmm13
	paddw	xmm0, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm0, xmm10
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	packuswb	xmm0, XMMWORD PTR -88[rsp]
	punpcklqdq	xmm14, xmm12
	movdqa	xmm12, xmm4
	movaps	XMMWORD PTR -56[rsp], xmm14
	movups	XMMWORD PTR [r8+rax], xmm0
	movdqu	xmm0, XMMWORD PTR 96[rbx]
	movdqu	xmm3, XMMWORD PTR 112[rbx]
	pand	xmm11, xmm0
	psrldq	xmm0, 1
	pand	xmm12, xmm3
	psrldq	xmm3, 1
	pand	xmm0, xmm6
	pshufd	xmm15, xmm11, 8
	pshufd	xmm14, xmm12, 8
	pand	xmm3, xmm6
	pshufd	xmm10, xmm11, 13
	pmaddwd	xmm11, xmm2
	pshufd	xmm13, xmm12, 13
	pmaddwd	xmm12, xmm2
	por	xmm0, xmm5
	por	xmm3, xmm5
	paddd	xmm10, xmm15
	paddd	xmm13, xmm14
	punpcklqdq	xmm10, xmm13
	pshufd	xmm15, xmm0, 8
	pshufd	xmm13, xmm0, 13
	pmaddwd	xmm0, xmm1
	paddd	xmm0, xmm11
	movaps	XMMWORD PTR -88[rsp], xmm10
	psrad	xmm0, 14
	pshufd	xmm14, xmm3, 8
	pshufd	xmm10, xmm3, 13
	paddd	xmm13, xmm15
	movdqa	xmm15, xmm13
	pmaddwd	xmm3, xmm1
	paddd	xmm3, xmm12
	psrad	xmm3, 14
	movdqa	xmm12, xmm4
	paddd	xmm10, xmm14
	punpcklqdq	xmm15, xmm10
	movaps	XMMWORD PTR -40[rsp], xmm15
	movdqa	xmm15, xmm0
	pxor	xmm0, xmm0
	packssdw	xmm15, xmm3
	movdqa	xmm3, XMMWORD PTR .LC5[rip]
	movdqu	xmm10, XMMWORD PTR 64[rbx]
	paddw	xmm3, xmm15
	pmaxsw	xmm3, xmm0
	pminsw	xmm3, XMMWORD PTR .LC6[rip]
	movdqu	xmm0, XMMWORD PTR 80[rbx]
	pand	xmm12, xmm10
	psrldq	xmm10, 1
	pand	xmm4, xmm0
	psrldq	xmm0, 1
	pand	xmm10, xmm6
	pand	xmm0, xmm6
	pshufd	xmm13, xmm4, 8
	por	xmm10, xmm5
	pshufd	xmm6, xmm12, 13
	por	xmm5, xmm0
	pshufd	xmm11, xmm12, 8
	pshufd	xmm0, xmm4, 13
	pmaddwd	xmm4, xmm2
	pmaddwd	xmm2, xmm12
	pshufd	xmm14, xmm10, 8
	paddd	xmm11, xmm6
	pshufd	xmm6, xmm5, 13
	paddd	xmm0, xmm13
	pshufd	xmm13, xmm5, 8
	pmaddwd	xmm5, xmm1
	pmaddwd	xmm1, xmm10
	paddd	xmm5, xmm4
	paddd	xmm1, xmm2
	psrad	xmm5, 14
	psrad	xmm1, 14
	packssdw	xmm1, xmm5
	pxor	xmm5, xmm5
	paddd	xmm6, xmm13
	punpcklqdq	xmm11, xmm0
	pshufd	xmm0, xmm10, 13
	paddw	xmm1, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm1, xmm5
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm3
	movdqa	xmm4, xmm11
	paddd	xmm0, xmm14
	punpcklqdq	xmm0, xmm6
	movups	XMMWORD PTR -16[r8+rsi], xmm1
	movdqa	xmm3, XMMWORD PTR .LC16[rip]
	movdqa	xmm6, XMMWORD PTR -120[rsp]
	paddw	xmm4, xmm3
	paddw	xmm0, xmm3
	psrlw	xmm4, 1
	psrlw	xmm0, 1
	movdqa	xmm14, xmm4
	movdqa	xmm5, XMMWORD PTR -72[rsp]
	paddw	xmm6, xmm3
	psrlw	xmm6, 1
	movdqa	xmm10, XMMWORD PTR -88[rsp]
	paddw	xmm5, xmm3
	movdqa	xmm15, xmm5
	movdqa	xmm13, XMMWORD PTR -56[rsp]
	paddw	xmm10, xmm3
	psrlw	xmm10, 1
	movdqa	xmm12, xmm10
	psrlw	xmm15, 1
	movdqa	xmm2, XMMWORD PTR -104[rsp]
	paddw	xmm13, xmm3
	psrlw	xmm13, 1
	movdqa	xmm1, XMMWORD PTR .LC8[rip]
	paddw	xmm2, xmm3
	paddw	xmm3, XMMWORD PTR -40[rsp]
	psrlw	xmm3, 1
	movdqa	xmm5, xmm3
	psrlw	xmm2, 1
	movdqa	xmm11, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm5, xmm1
	pmaddwd	xmm12, xmm11
	paddd	xmm12, xmm5
	movdqa	xmm5, xmm0
	pmaddwd	xmm14, xmm11
	psrad	xmm12, 14
	pmaddwd	xmm5, xmm1
	paddd	xmm5, xmm14
	psrad	xmm5, 14
	packssdw	xmm5, xmm12
	pxor	xmm12, xmm12
	movdqa	xmm14, XMMWORD PTR .LC10[rip]
	movaps	XMMWORD PTR -120[rsp], xmm14
	paddw	xmm5, xmm14
	movdqa	xmm14, xmm6
	pmaxsw	xmm5, xmm12
	pminsw	xmm5, XMMWORD PTR .LC6[rip]
	movdqa	xmm12, xmm2
	pmaddwd	xmm14, xmm11
	pmaddwd	xmm11, xmm15
	pmaddwd	xmm12, xmm1
	pmaddwd	xmm1, xmm13
	paddd	xmm12, xmm14
	paddd	xmm1, xmm11
	psrad	xmm12, 14
	pxor	xmm11, xmm11
	psrad	xmm1, 14
	packssdw	xmm1, xmm12
	movdqa	xmm14, XMMWORD PTR -120[rsp]
	paddw	xmm1, xmm14
	pmaxsw	xmm1, xmm11
	pminsw	xmm1, XMMWORD PTR .LC6[rip]
	packuswb	xmm1, xmm5
	movdqa	xmm5, XMMWORD PTR .LC11[rip]
	movups	XMMWORD PTR 0[rbp+r14], xmm1
	pmaddwd	xmm3, xmm5
	pmaddwd	xmm0, xmm5
	pmaddwd	xmm2, xmm5
	pmaddwd	xmm13, xmm5
	movdqa	xmm1, XMMWORD PTR .LC12[rip]
	pmaddwd	xmm10, xmm1
	pmaddwd	xmm4, xmm1
	paddd	xmm3, xmm10
	paddd	xmm0, xmm4
	psrad	xmm3, 14
	psrad	xmm0, 14
	pmaddwd	xmm6, xmm1
	packssdw	xmm0, xmm3
	paddd	xmm2, xmm6
	pmaddwd	xmm15, xmm1
	psrad	xmm2, 14
	paddd	xmm13, xmm15
	psrad	xmm13, 14
	packssdw	xmm13, xmm2
	paddw	xmm0, xmm14
	pmaxsw	xmm0, xmm11
	pminsw	xmm0, XMMWORD PTR .LC6[rip]
	paddw	xmm13, xmm14
	pmaxsw	xmm13, xmm11
	pminsw	xmm13, XMMWORD PTR .LC6[rip]
	packuswb	xmm13, xmm0
	movups	XMMWORD PTR [r12+r14], xmm13
.L41:
	add	r15, 1
	add	r8, QWORD PTR -24[rsp]
	add	rbp, QWORD PTR 64[rsp]
	add	rbx, rcx
	add	r13, rcx
	add	r12, QWORD PTR 80[rsp]
	cmp	rdx, r15
	jne	.L45
.L37:
	pop	rbx
	pop	rbp
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	ret
	.size	_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m, .-_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDE18:
	.section	.text._ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTE18:
	.section	.text.unlikely
.LCOLDB19:
	.text
.LHOTB19:
	.p2align 4,,15
	.globl	_ZN4Simd4Sse213BgraToYuv422pEPKhmmmPhmS3_mS3_m
	.type	_ZN4Simd4Sse213BgraToYuv422pEPKhmmmPhmS3_mS3_m, @function
_ZN4Simd4Sse213BgraToYuv422pEPKhmmmPhmS3_mS3_m:
	test	r8b, 15
	push	rbx
	mov	rax, QWORD PTR 16[rsp]
	mov	r10, QWORD PTR 24[rsp]
	mov	r11, QWORD PTR 32[rsp]
	mov	rbx, QWORD PTR 40[rsp]
	je	.L51
.L48:
	mov	QWORD PTR 40[rsp], rbx
	mov	QWORD PTR 32[rsp], r11
	mov	QWORD PTR 24[rsp], r10
	mov	QWORD PTR 16[rsp], rax
	pop	rbx
	jmp	_ZN4Simd4Sse213BgraToYuv422pILb0EEEvPKhmmmPhmS4_mS4_m@PLT
	.p2align 4,,10
	.p2align 3
.L51:
	test	r9b, 15
	jne	.L48
	test	al, 15
	jne	.L48
	test	r10b, 15
	jne	.L48
	test	r11b, 15
	jne	.L48
	test	bl, 15
	jne	.L48
	test	dil, 15
	jne	.L48
	test	cl, 15
	jne	.L48
	pop	rbx
	jmp	_ZN4Simd4Sse213BgraToYuv422pILb1EEEvPKhmmmPhmS4_mS4_m@PLT
	.size	_ZN4Simd4Sse213BgraToYuv422pEPKhmmmPhmS3_mS3_m, .-_ZN4Simd4Sse213BgraToYuv422pEPKhmmmPhmS3_mS3_m
	.section	.text.unlikely
.LCOLDE19:
	.text
.LHOTE19:
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDB20:
	.section	.text._ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTB20:
	.p2align 4,,15
	.weak	_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m
	.type	_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m, @function
_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m:
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbp
	push	rbx
	mov	rbx, rsi
	and	rbx, -16
	test	rdx, rdx
	mov	r10, QWORD PTR 56[rsp]
	mov	r11, QWORD PTR 72[rsp]
	mov	r13, QWORD PTR 80[rsp]
	je	.L52
	lea	r12, -16[rsi]
	movdqa	xmm15, XMMWORD PTR .LC6[rip]
	xor	r15d, r15d
	lea	r14, 0[0+r12*4]
	lea	rbp, [rdi+r14]
	.p2align 4,,10
	.p2align 3
.L60:
	test	rbx, rbx
	je	.L57
	pxor	xmm3, xmm3
	mov	rdi, rbp
	xor	eax, eax
	movdqa	xmm11, XMMWORD PTR .LC11[rip]
	sub	rdi, r14
	movdqa	xmm13, XMMWORD PTR .LC12[rip]
	.p2align 4,,10
	.p2align 3
.L58:
	add	rdi, 64
	movdqa	xmm9, XMMWORD PTR -16[rdi]
	movdqa	xmm2, XMMWORD PTR -32[rdi]
	movdqa	xmm7, xmm9
	pand	xmm9, XMMWORD PTR .LC0[rip]
	movdqa	xmm6, XMMWORD PTR -48[rdi]
	movdqa	xmm1, xmm2
	psrldq	xmm7, 1
	pand	xmm7, XMMWORD PTR .LC1[rip]
	movdqa	xmm5, XMMWORD PTR -64[rdi]
	movdqa	xmm4, xmm6
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	movdqa	xmm0, xmm5
	psrldq	xmm4, 1
	pand	xmm4, XMMWORD PTR .LC1[rip]
	movdqa	xmm12, xmm9
	psrldq	xmm0, 1
	por	xmm7, XMMWORD PTR .LC2[rip]
	pmaddwd	xmm12, XMMWORD PTR .LC4[rip]
	pand	xmm0, XMMWORD PTR .LC1[rip]
	por	xmm1, XMMWORD PTR .LC2[rip]
	movdqa	xmm8, xmm7
	pand	xmm2, XMMWORD PTR .LC0[rip]
	pmaddwd	xmm8, XMMWORD PTR .LC3[rip]
	paddd	xmm12, xmm8
	psrad	xmm12, 14
	por	xmm4, XMMWORD PTR .LC2[rip]
	movdqa	xmm8, xmm1
	pand	xmm6, XMMWORD PTR .LC0[rip]
	movdqa	xmm10, xmm2
	pmaddwd	xmm8, XMMWORD PTR .LC3[rip]
	por	xmm0, XMMWORD PTR .LC2[rip]
	pmaddwd	xmm10, XMMWORD PTR .LC4[rip]
	paddd	xmm10, xmm8
	movdqa	xmm8, xmm4
	pand	xmm5, XMMWORD PTR .LC0[rip]
	psrad	xmm10, 14
	pmaddwd	xmm8, XMMWORD PTR .LC3[rip]
	packssdw	xmm10, xmm12
	movdqa	xmm12, xmm6
	movdqa	xmm14, xmm5
	pmaddwd	xmm12, XMMWORD PTR .LC4[rip]
	paddd	xmm12, xmm8
	movdqa	xmm8, xmm0
	paddw	xmm10, XMMWORD PTR .LC5[rip]
	psrad	xmm12, 14
	pmaxsw	xmm10, xmm3
	pminsw	xmm10, xmm15
	pmaddwd	xmm8, XMMWORD PTR .LC3[rip]
	pmaddwd	xmm14, XMMWORD PTR .LC4[rip]
	paddd	xmm8, xmm14
	psrad	xmm8, 14
	packssdw	xmm8, xmm12
	movdqa	xmm12, xmm9
	movdqa	xmm14, xmm5
	pmaddwd	xmm9, xmm13
	paddw	xmm8, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm8, xmm3
	pminsw	xmm8, xmm15
	packuswb	xmm8, xmm10
	pmaddwd	xmm12, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm14, XMMWORD PTR .LC9[rip]
	movdqa	xmm10, xmm2
	pmaddwd	xmm2, xmm13
	movaps	XMMWORD PTR [r8+rax], xmm8
	pmaddwd	xmm10, XMMWORD PTR .LC9[rip]
	movdqa	xmm8, xmm7
	pmaddwd	xmm7, xmm11
	paddd	xmm7, xmm9
	psrad	xmm7, 14
	pmaddwd	xmm8, XMMWORD PTR .LC8[rip]
	paddd	xmm12, xmm8
	movdqa	xmm8, xmm1
	psrad	xmm12, 14
	pmaddwd	xmm1, xmm11
	paddd	xmm2, xmm1
	pmaddwd	xmm8, XMMWORD PTR .LC8[rip]
	paddd	xmm10, xmm8
	movdqa	xmm8, xmm4
	psrad	xmm10, 14
	packssdw	xmm10, xmm12
	movdqa	xmm12, xmm6
	pmaddwd	xmm8, XMMWORD PTR .LC8[rip]
	movdqa	xmm1, xmm0
	psrad	xmm2, 14
	pmaddwd	xmm12, XMMWORD PTR .LC9[rip]
	paddd	xmm12, xmm8
	movdqa	xmm8, xmm0
	movdqa	xmm0, xmm5
	psrad	xmm12, 14
	packssdw	xmm2, xmm7
	pmaddwd	xmm8, XMMWORD PTR .LC8[rip]
	pmaddwd	xmm4, xmm11
	paddd	xmm8, xmm14
	pmaddwd	xmm6, xmm13
	psrad	xmm8, 14
	paddd	xmm4, xmm6
	packssdw	xmm8, xmm12
	psrad	xmm4, 14
	pmaddwd	xmm1, xmm11
	pmaddwd	xmm0, xmm13
	paddd	xmm0, xmm1
	psrad	xmm0, 14
	packssdw	xmm0, xmm4
	paddw	xmm10, XMMWORD PTR .LC10[rip]
	paddw	xmm2, XMMWORD PTR .LC10[rip]
	pmaxsw	xmm10, xmm3
	paddw	xmm8, XMMWORD PTR .LC10[rip]
	pminsw	xmm10, xmm15
	pmaxsw	xmm8, xmm3
	pmaxsw	xmm2, xmm3
	pminsw	xmm8, xmm15
	pminsw	xmm2, xmm15
	packuswb	xmm8, xmm10
	paddw	xmm0, XMMWORD PTR .LC10[rip]
	pmaxsw	xmm0, xmm3
	pminsw	xmm0, xmm15
	packuswb	xmm0, xmm2
	movaps	XMMWORD PTR [r10+rax], xmm8
	movaps	XMMWORD PTR [r11+rax], xmm0
	add	rax, 16
	cmp	rax, rbx
	jb	.L58
.L57:
	cmp	rsi, rbx
	je	.L56
	movdqu	xmm5, XMMWORD PTR 0[rbp]
	movdqa	xmm3, xmm5
	movaps	XMMWORD PTR -56[rsp], xmm5
	psrldq	xmm3, 1
	movaps	XMMWORD PTR -40[rsp], xmm3
	movdqu	xmm4, XMMWORD PTR 48[rbp]
	movdqa	xmm12, XMMWORD PTR .LC1[rip]
	movdqa	xmm2, xmm4
	movdqu	xmm7, XMMWORD PTR 32[rbp]
	psrldq	xmm2, 1
	pand	xmm2, xmm12
	movdqu	xmm6, XMMWORD PTR 16[rbp]
	movdqa	xmm0, xmm7
	movdqa	xmm14, xmm7
	movdqa	xmm11, XMMWORD PTR .LC2[rip]
	movdqa	xmm1, xmm6
	psrldq	xmm0, 1
	pand	xmm0, xmm12
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	por	xmm2, xmm11
	psrldq	xmm1, 1
	pand	xmm1, xmm12
	pand	xmm4, xmm5
	movdqa	xmm8, XMMWORD PTR .LC4[rip]
	pand	xmm12, XMMWORD PTR -40[rsp]
	movdqa	xmm13, xmm2
	movdqa	xmm2, XMMWORD PTR .LC3[rip]
	movdqa	xmm9, xmm4
	por	xmm0, xmm11
	movdqa	xmm3, xmm13
	pand	xmm14, xmm5
	pmaddwd	xmm9, xmm8
	por	xmm1, xmm11
	pmaddwd	xmm3, xmm2
	pand	xmm6, xmm5
	pand	xmm5, XMMWORD PTR -56[rsp]
	movdqa	xmm7, xmm14
	movaps	XMMWORD PTR -24[rsp], xmm4
	movdqa	xmm4, xmm9
	pmaddwd	xmm7, xmm8
	por	xmm12, xmm11
	movdqa	xmm10, xmm1
	movaps	XMMWORD PTR -72[rsp], xmm0
	paddd	xmm4, xmm3
	psrad	xmm4, 14
	movdqa	xmm3, xmm0
	pmaddwd	xmm10, xmm2
	movdqa	xmm0, xmm6
	movdqa	xmm11, xmm6
	pmaddwd	xmm3, xmm2
	pmaddwd	xmm2, xmm12
	paddd	xmm3, xmm7
	pmaddwd	xmm0, xmm8
	psrad	xmm3, 14
	pmaddwd	xmm8, xmm5
	packssdw	xmm3, xmm4
	paddd	xmm10, xmm0
	paddd	xmm2, xmm8
	pxor	xmm4, xmm4
	psrad	xmm10, 14
	psrad	xmm2, 14
	packssdw	xmm2, xmm10
	movdqa	xmm9, XMMWORD PTR .LC6[rip]
	movdqa	xmm10, xmm14
	movdqa	xmm7, XMMWORD PTR .LC5[rip]
	paddw	xmm3, xmm7
	paddw	xmm7, xmm2
	pmaxsw	xmm3, xmm4
	pmaxsw	xmm7, xmm4
	pminsw	xmm3, xmm9
	pminsw	xmm7, xmm9
	packuswb	xmm7, xmm3
	movdqa	xmm0, XMMWORD PTR -24[rsp]
	movdqa	xmm3, xmm13
	movdqa	xmm2, XMMWORD PTR .LC8[rip]
	movdqa	xmm8, xmm0
	movups	XMMWORD PTR [r8+r12], xmm7
	pmaddwd	xmm3, xmm2
	movdqa	xmm7, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm8, xmm7
	paddd	xmm8, xmm3
	movdqa	xmm3, XMMWORD PTR -72[rsp]
	pmaddwd	xmm10, xmm7
	pmaddwd	xmm11, xmm7
	psrad	xmm8, 14
	pmaddwd	xmm3, xmm2
	paddd	xmm3, xmm10
	movdqa	xmm10, xmm1
	pmaddwd	xmm7, xmm5
	psrad	xmm3, 14
	packssdw	xmm3, xmm8
	pmaddwd	xmm10, xmm2
	pmaddwd	xmm2, xmm12
	paddd	xmm10, xmm11
	paddd	xmm2, xmm7
	psrad	xmm10, 14
	psrad	xmm2, 14
	packssdw	xmm2, xmm10
	movdqa	xmm8, XMMWORD PTR .LC10[rip]
	paddw	xmm3, xmm8
	paddw	xmm2, xmm8
	pmaxsw	xmm3, xmm4
	pmaxsw	xmm2, xmm4
	pminsw	xmm3, xmm9
	pminsw	xmm2, xmm9
	packuswb	xmm2, xmm3
	movdqa	xmm11, XMMWORD PTR .LC11[rip]
	movdqa	xmm3, xmm0
	movups	XMMWORD PTR [r10+r12], xmm2
	pmaddwd	xmm1, xmm11
	pmaddwd	xmm12, xmm11
	movdqa	xmm2, xmm13
	pmaddwd	xmm2, xmm11
	movdqa	xmm13, XMMWORD PTR .LC12[rip]
	movdqa	xmm0, XMMWORD PTR -72[rsp]
	pmaddwd	xmm3, xmm13
	paddd	xmm2, xmm3
	movdqa	xmm3, xmm14
	psrad	xmm2, 14
	pmaddwd	xmm6, xmm13
	pmaddwd	xmm0, xmm11
	paddd	xmm1, xmm6
	pmaddwd	xmm5, xmm13
	pmaddwd	xmm3, xmm13
	psrad	xmm1, 14
	paddd	xmm0, xmm3
	paddd	xmm12, xmm5
	psrad	xmm0, 14
	psrad	xmm12, 14
	packssdw	xmm0, xmm2
	packssdw	xmm12, xmm1
	paddw	xmm0, xmm8
	pmaxsw	xmm0, xmm4
	pminsw	xmm0, xmm9
	paddw	xmm8, xmm12
	pmaxsw	xmm4, xmm8
	pminsw	xmm9, xmm4
	packuswb	xmm9, xmm0
	movups	XMMWORD PTR [r11+r12], xmm9
.L56:
	add	r15, 1
	add	r8, r9
	add	r10, QWORD PTR 64[rsp]
	add	r11, r13
	add	rbp, rcx
	cmp	rdx, r15
	jne	.L60
.L52:
	pop	rbx
	pop	rbp
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	ret
	.size	_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m, .-_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDE20:
	.section	.text._ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTE20:
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDB21:
	.section	.text._ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTB21:
	.p2align 4,,15
	.weak	_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m
	.type	_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m, @function
_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m:
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbp
	push	rbx
	mov	rbx, rsi
	and	rbx, -16
	test	rdx, rdx
	mov	r10, QWORD PTR 56[rsp]
	mov	r11, QWORD PTR 72[rsp]
	mov	r13, QWORD PTR 80[rsp]
	je	.L62
	lea	r12, -16[rsi]
	movdqa	xmm15, XMMWORD PTR .LC6[rip]
	xor	r15d, r15d
	lea	r14, 0[0+r12*4]
	lea	rbp, [rdi+r14]
	.p2align 4,,10
	.p2align 3
.L70:
	test	rbx, rbx
	je	.L67
	pxor	xmm3, xmm3
	mov	rdi, rbp
	xor	eax, eax
	movdqa	xmm11, XMMWORD PTR .LC11[rip]
	sub	rdi, r14
	movdqa	xmm13, XMMWORD PTR .LC12[rip]
	.p2align 4,,10
	.p2align 3
.L68:
	add	rdi, 64
	movdqu	xmm9, XMMWORD PTR -16[rdi]
	movdqu	xmm2, XMMWORD PTR -32[rdi]
	movdqa	xmm7, xmm9
	pand	xmm9, XMMWORD PTR .LC0[rip]
	movdqu	xmm6, XMMWORD PTR -48[rdi]
	movdqa	xmm1, xmm2
	psrldq	xmm7, 1
	pand	xmm7, XMMWORD PTR .LC1[rip]
	movdqu	xmm5, XMMWORD PTR -64[rdi]
	movdqa	xmm4, xmm6
	psrldq	xmm1, 1
	pand	xmm1, XMMWORD PTR .LC1[rip]
	movdqa	xmm0, xmm5
	psrldq	xmm4, 1
	pand	xmm4, XMMWORD PTR .LC1[rip]
	movdqa	xmm12, xmm9
	psrldq	xmm0, 1
	por	xmm7, XMMWORD PTR .LC2[rip]
	pmaddwd	xmm12, XMMWORD PTR .LC4[rip]
	pand	xmm0, XMMWORD PTR .LC1[rip]
	por	xmm1, XMMWORD PTR .LC2[rip]
	movdqa	xmm8, xmm7
	pand	xmm2, XMMWORD PTR .LC0[rip]
	pmaddwd	xmm8, XMMWORD PTR .LC3[rip]
	paddd	xmm12, xmm8
	psrad	xmm12, 14
	por	xmm4, XMMWORD PTR .LC2[rip]
	movdqa	xmm8, xmm1
	pand	xmm6, XMMWORD PTR .LC0[rip]
	movdqa	xmm10, xmm2
	pmaddwd	xmm8, XMMWORD PTR .LC3[rip]
	por	xmm0, XMMWORD PTR .LC2[rip]
	pmaddwd	xmm10, XMMWORD PTR .LC4[rip]
	paddd	xmm10, xmm8
	movdqa	xmm8, xmm4
	pand	xmm5, XMMWORD PTR .LC0[rip]
	psrad	xmm10, 14
	pmaddwd	xmm8, XMMWORD PTR .LC3[rip]
	packssdw	xmm10, xmm12
	movdqa	xmm12, xmm6
	movdqa	xmm14, xmm5
	pmaddwd	xmm12, XMMWORD PTR .LC4[rip]
	paddd	xmm12, xmm8
	movdqa	xmm8, xmm0
	paddw	xmm10, XMMWORD PTR .LC5[rip]
	psrad	xmm12, 14
	pmaxsw	xmm10, xmm3
	pminsw	xmm10, xmm15
	pmaddwd	xmm8, XMMWORD PTR .LC3[rip]
	pmaddwd	xmm14, XMMWORD PTR .LC4[rip]
	paddd	xmm8, xmm14
	psrad	xmm8, 14
	packssdw	xmm8, xmm12
	movdqa	xmm12, xmm9
	movdqa	xmm14, xmm5
	pmaddwd	xmm9, xmm13
	paddw	xmm8, XMMWORD PTR .LC5[rip]
	pmaxsw	xmm8, xmm3
	pminsw	xmm8, xmm15
	packuswb	xmm8, xmm10
	pmaddwd	xmm12, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm14, XMMWORD PTR .LC9[rip]
	movdqa	xmm10, xmm2
	pmaddwd	xmm2, xmm13
	movups	XMMWORD PTR [r8+rax], xmm8
	pmaddwd	xmm10, XMMWORD PTR .LC9[rip]
	movdqa	xmm8, xmm7
	pmaddwd	xmm7, xmm11
	paddd	xmm7, xmm9
	psrad	xmm7, 14
	pmaddwd	xmm8, XMMWORD PTR .LC8[rip]
	paddd	xmm12, xmm8
	movdqa	xmm8, xmm1
	psrad	xmm12, 14
	pmaddwd	xmm1, xmm11
	paddd	xmm2, xmm1
	pmaddwd	xmm8, XMMWORD PTR .LC8[rip]
	paddd	xmm10, xmm8
	movdqa	xmm8, xmm4
	psrad	xmm10, 14
	packssdw	xmm10, xmm12
	movdqa	xmm12, xmm6
	pmaddwd	xmm8, XMMWORD PTR .LC8[rip]
	movdqa	xmm1, xmm0
	psrad	xmm2, 14
	pmaddwd	xmm12, XMMWORD PTR .LC9[rip]
	paddd	xmm12, xmm8
	movdqa	xmm8, xmm0
	movdqa	xmm0, xmm5
	psrad	xmm12, 14
	packssdw	xmm2, xmm7
	pmaddwd	xmm8, XMMWORD PTR .LC8[rip]
	pmaddwd	xmm4, xmm11
	paddd	xmm8, xmm14
	pmaddwd	xmm6, xmm13
	psrad	xmm8, 14
	paddd	xmm4, xmm6
	packssdw	xmm8, xmm12
	psrad	xmm4, 14
	pmaddwd	xmm1, xmm11
	pmaddwd	xmm0, xmm13
	paddd	xmm0, xmm1
	psrad	xmm0, 14
	packssdw	xmm0, xmm4
	paddw	xmm10, XMMWORD PTR .LC10[rip]
	paddw	xmm2, XMMWORD PTR .LC10[rip]
	pmaxsw	xmm10, xmm3
	paddw	xmm8, XMMWORD PTR .LC10[rip]
	pminsw	xmm10, xmm15
	pmaxsw	xmm8, xmm3
	pmaxsw	xmm2, xmm3
	pminsw	xmm8, xmm15
	pminsw	xmm2, xmm15
	packuswb	xmm8, xmm10
	paddw	xmm0, XMMWORD PTR .LC10[rip]
	pmaxsw	xmm0, xmm3
	pminsw	xmm0, xmm15
	packuswb	xmm0, xmm2
	movups	XMMWORD PTR [r10+rax], xmm8
	movups	XMMWORD PTR [r11+rax], xmm0
	add	rax, 16
	cmp	rax, rbx
	jb	.L68
.L67:
	cmp	rsi, rbx
	je	.L66
	movdqu	xmm5, XMMWORD PTR 0[rbp]
	movdqa	xmm3, xmm5
	movaps	XMMWORD PTR -56[rsp], xmm5
	psrldq	xmm3, 1
	movaps	XMMWORD PTR -40[rsp], xmm3
	movdqu	xmm4, XMMWORD PTR 48[rbp]
	movdqa	xmm12, XMMWORD PTR .LC1[rip]
	movdqa	xmm2, xmm4
	movdqu	xmm7, XMMWORD PTR 32[rbp]
	psrldq	xmm2, 1
	pand	xmm2, xmm12
	movdqu	xmm6, XMMWORD PTR 16[rbp]
	movdqa	xmm0, xmm7
	movdqa	xmm14, xmm7
	movdqa	xmm11, XMMWORD PTR .LC2[rip]
	movdqa	xmm1, xmm6
	psrldq	xmm0, 1
	pand	xmm0, xmm12
	movdqa	xmm5, XMMWORD PTR .LC0[rip]
	por	xmm2, xmm11
	psrldq	xmm1, 1
	pand	xmm1, xmm12
	pand	xmm4, xmm5
	movdqa	xmm8, XMMWORD PTR .LC4[rip]
	pand	xmm12, XMMWORD PTR -40[rsp]
	movdqa	xmm13, xmm2
	movdqa	xmm2, XMMWORD PTR .LC3[rip]
	movdqa	xmm9, xmm4
	por	xmm0, xmm11
	movdqa	xmm3, xmm13
	pand	xmm14, xmm5
	pmaddwd	xmm9, xmm8
	por	xmm1, xmm11
	pmaddwd	xmm3, xmm2
	pand	xmm6, xmm5
	pand	xmm5, XMMWORD PTR -56[rsp]
	movdqa	xmm7, xmm14
	movaps	XMMWORD PTR -24[rsp], xmm4
	movdqa	xmm4, xmm9
	pmaddwd	xmm7, xmm8
	por	xmm12, xmm11
	movdqa	xmm10, xmm1
	movaps	XMMWORD PTR -72[rsp], xmm0
	paddd	xmm4, xmm3
	psrad	xmm4, 14
	movdqa	xmm3, xmm0
	pmaddwd	xmm10, xmm2
	movdqa	xmm0, xmm6
	movdqa	xmm11, xmm6
	pmaddwd	xmm3, xmm2
	pmaddwd	xmm2, xmm12
	paddd	xmm3, xmm7
	pmaddwd	xmm0, xmm8
	psrad	xmm3, 14
	pmaddwd	xmm8, xmm5
	packssdw	xmm3, xmm4
	paddd	xmm10, xmm0
	paddd	xmm2, xmm8
	pxor	xmm4, xmm4
	psrad	xmm10, 14
	psrad	xmm2, 14
	packssdw	xmm2, xmm10
	movdqa	xmm9, XMMWORD PTR .LC6[rip]
	movdqa	xmm10, xmm14
	movdqa	xmm7, XMMWORD PTR .LC5[rip]
	paddw	xmm3, xmm7
	paddw	xmm7, xmm2
	pmaxsw	xmm3, xmm4
	pmaxsw	xmm7, xmm4
	pminsw	xmm3, xmm9
	pminsw	xmm7, xmm9
	packuswb	xmm7, xmm3
	movdqa	xmm0, XMMWORD PTR -24[rsp]
	movdqa	xmm3, xmm13
	movdqa	xmm2, XMMWORD PTR .LC8[rip]
	movdqa	xmm8, xmm0
	movups	XMMWORD PTR [r8+r12], xmm7
	pmaddwd	xmm3, xmm2
	movdqa	xmm7, XMMWORD PTR .LC9[rip]
	pmaddwd	xmm8, xmm7
	paddd	xmm8, xmm3
	movdqa	xmm3, XMMWORD PTR -72[rsp]
	pmaddwd	xmm10, xmm7
	pmaddwd	xmm11, xmm7
	psrad	xmm8, 14
	pmaddwd	xmm3, xmm2
	paddd	xmm3, xmm10
	movdqa	xmm10, xmm1
	pmaddwd	xmm7, xmm5
	psrad	xmm3, 14
	packssdw	xmm3, xmm8
	pmaddwd	xmm10, xmm2
	pmaddwd	xmm2, xmm12
	paddd	xmm10, xmm11
	paddd	xmm2, xmm7
	psrad	xmm10, 14
	psrad	xmm2, 14
	packssdw	xmm2, xmm10
	movdqa	xmm8, XMMWORD PTR .LC10[rip]
	paddw	xmm3, xmm8
	paddw	xmm2, xmm8
	pmaxsw	xmm3, xmm4
	pmaxsw	xmm2, xmm4
	pminsw	xmm3, xmm9
	pminsw	xmm2, xmm9
	packuswb	xmm2, xmm3
	movdqa	xmm11, XMMWORD PTR .LC11[rip]
	movdqa	xmm3, xmm0
	movups	XMMWORD PTR [r10+r12], xmm2
	pmaddwd	xmm1, xmm11
	pmaddwd	xmm12, xmm11
	movdqa	xmm2, xmm13
	pmaddwd	xmm2, xmm11
	movdqa	xmm13, XMMWORD PTR .LC12[rip]
	movdqa	xmm0, XMMWORD PTR -72[rsp]
	pmaddwd	xmm3, xmm13
	paddd	xmm2, xmm3
	movdqa	xmm3, xmm14
	psrad	xmm2, 14
	pmaddwd	xmm6, xmm13
	pmaddwd	xmm0, xmm11
	paddd	xmm1, xmm6
	pmaddwd	xmm5, xmm13
	pmaddwd	xmm3, xmm13
	psrad	xmm1, 14
	paddd	xmm0, xmm3
	paddd	xmm12, xmm5
	psrad	xmm0, 14
	psrad	xmm12, 14
	packssdw	xmm0, xmm2
	packssdw	xmm12, xmm1
	paddw	xmm0, xmm8
	pmaxsw	xmm0, xmm4
	pminsw	xmm0, xmm9
	paddw	xmm8, xmm12
	pmaxsw	xmm4, xmm8
	pminsw	xmm9, xmm4
	packuswb	xmm9, xmm0
	movups	XMMWORD PTR [r11+r12], xmm9
.L66:
	add	r15, 1
	add	r8, r9
	add	r10, QWORD PTR 64[rsp]
	add	r11, r13
	add	rbp, rcx
	cmp	rdx, r15
	jne	.L70
.L62:
	pop	rbx
	pop	rbp
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	ret
	.size	_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m, .-_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m
	.section	.text.unlikely._ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LCOLDE21:
	.section	.text._ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m,"axG",@progbits,_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m,comdat
.LHOTE21:
	.section	.text.unlikely
.LCOLDB22:
	.text
.LHOTB22:
	.p2align 4,,15
	.globl	_ZN4Simd4Sse213BgraToYuv444pEPKhmmmPhmS3_mS3_m
	.type	_ZN4Simd4Sse213BgraToYuv444pEPKhmmmPhmS3_mS3_m, @function
_ZN4Simd4Sse213BgraToYuv444pEPKhmmmPhmS3_mS3_m:
	test	r8b, 15
	push	rbx
	mov	rax, QWORD PTR 16[rsp]
	mov	r10, QWORD PTR 24[rsp]
	mov	r11, QWORD PTR 32[rsp]
	mov	rbx, QWORD PTR 40[rsp]
	je	.L76
.L73:
	mov	QWORD PTR 40[rsp], rbx
	mov	QWORD PTR 32[rsp], r11
	mov	QWORD PTR 24[rsp], r10
	mov	QWORD PTR 16[rsp], rax
	pop	rbx
	jmp	_ZN4Simd4Sse213BgraToYuv444pILb0EEEvPKhmmmPhmS4_mS4_m@PLT
	.p2align 4,,10
	.p2align 3
.L76:
	test	r9b, 15
	jne	.L73
	test	al, 15
	jne	.L73
	test	r10b, 15
	jne	.L73
	test	r11b, 15
	jne	.L73
	test	bl, 15
	jne	.L73
	test	dil, 15
	jne	.L73
	test	cl, 15
	jne	.L73
	pop	rbx
	jmp	_ZN4Simd4Sse213BgraToYuv444pILb1EEEvPKhmmmPhmS4_mS4_m@PLT
	.size	_ZN4Simd4Sse213BgraToYuv444pEPKhmmmPhmS3_mS3_m, .-_ZN4Simd4Sse213BgraToYuv444pEPKhmmmPhmS3_mS3_m
	.section	.text.unlikely
.LCOLDE22:
	.text
.LHOTE22:
	.section	.text.unlikely
.LCOLDB23:
	.section	.text.startup,"ax",@progbits
.LHOTB23:
	.p2align 4,,15
	.type	_GLOBAL__sub_I_SimdSse2BgraToYuv.cpp, @function
_GLOBAL__sub_I_SimdSse2BgraToYuv.cpp:
	lea	rdi, _ZStL8__ioinit[rip]
	sub	rsp, 8
	call	_ZNSt8ios_base4InitC1Ev@PLT
	mov	rdi, QWORD PTR _ZNSt8ios_base4InitD1Ev@GOTPCREL[rip]
	lea	rdx, __dso_handle[rip]
	lea	rsi, _ZStL8__ioinit[rip]
	add	rsp, 8
	jmp	__cxa_atexit@PLT
	.size	_GLOBAL__sub_I_SimdSse2BgraToYuv.cpp, .-_GLOBAL__sub_I_SimdSse2BgraToYuv.cpp
	.section	.text.unlikely
.LCOLDE23:
	.section	.text.startup
.LHOTE23:
	.section	.init_array,"aw"
	.align 8
	.quad	_GLOBAL__sub_I_SimdSse2BgraToYuv.cpp
	.local	_ZStL8__ioinit
	.comm	_ZStL8__ioinit,1,1
	.section	.rodata.cst16,"aM",@progbits,16
	.align 16
.LC0:
	.quad	71777214294589695
	.quad	71777214294589695
	.align 16
.LC1:
	.quad	1095216660735
	.quad	1095216660735
	.align 16
.LC2:
	.quad	281474976776192
	.quad	281474976776192
	.align 16
.LC3:
	.value	8258
	.value	8192
	.value	8258
	.value	8192
	.value	8258
	.value	8192
	.value	8258
	.value	8192
	.align 16
.LC4:
	.value	1606
	.value	4211
	.value	1606
	.value	4211
	.value	1606
	.value	4211
	.value	1606
	.value	4211
	.align 16
.LC5:
	.value	16
	.value	16
	.value	16
	.value	16
	.value	16
	.value	16
	.value	16
	.value	16
	.align 16
.LC6:
	.value	255
	.value	255
	.value	255
	.value	255
	.value	255
	.value	255
	.value	255
	.value	255
	.align 16
.LC7:
	.value	2
	.value	2
	.value	2
	.value	2
	.value	2
	.value	2
	.value	2
	.value	2
	.align 16
.LC8:
	.value	-4768
	.value	8192
	.value	-4768
	.value	8192
	.value	-4768
	.value	8192
	.value	-4768
	.value	8192
	.align 16
.LC9:
	.value	7193
	.value	-2425
	.value	7193
	.value	-2425
	.value	7193
	.value	-2425
	.value	7193
	.value	-2425
	.align 16
.LC10:
	.value	128
	.value	128
	.value	128
	.value	128
	.value	128
	.value	128
	.value	128
	.value	128
	.align 16
.LC11:
	.value	-6029
	.value	8192
	.value	-6029
	.value	8192
	.value	-6029
	.value	8192
	.value	-6029
	.value	8192
	.align 16
.LC12:
	.value	-1163
	.value	7193
	.value	-1163
	.value	7193
	.value	-1163
	.value	7193
	.value	-1163
	.value	7193
	.align 16
.LC16:
	.value	1
	.value	1
	.value	1
	.value	1
	.value	1
	.value	1
	.value	1
	.value	1
	.hidden	__dso_handle
	.ident	"GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
	.section	.note.GNU-stack,"",@progbits

Can companion declarations be avoided for private functions?

I'm trying to convert a big library in which many functions call each other and only a few are public. Even so, I have to write declarations in the companion file for every one of the private assembly functions, which is tedious. Are they really needed for private functions?

best clang and gcc flags and more examples from start to finish

What are the best clang and gcc flags for getting everything to work well in c2goasm?

I should clarify. I am attempting to reproduce the example build in a way that resembles the method used for influxdata.com's Apache Arrow Go implementation:
https://www.influxdata.com/blog/influxdata-apache-arrow-go-implementation/
https://github.com/influxdata/arrow

I answered my own question.
github.com/influxdata/arrow/
holds a thorough example of c2goasm usage.

Here's the output of the influxdata/arrow/ build:

make -B
make[1]: Entering directory '/home/dma2/Code/go/src/github.com/influxdata/arrow/memory'
clang -S -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -fno-jump-tables -I_lib -mavx2 -mfma -mllvm -force-vector-width=32 _lib/memory.c -o _lib/memory_avx2.s ; perl -i -pe 's/(ro[rl]\s+\w{2,3})$/\1, 1/' _lib/memory_avx2.s
c2goasm -a -f -a -f _lib/memory_avx2.s memory_avx2_amd64.s
Processing _lib/memory_avx2.s
Invoking asm2plan9s on memory_avx2_amd64.s
clang -S -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -fno-jump-tables -I_lib -msse4 _lib/memory.c -o _lib/memory_sse4.s ; perl -i -pe 's/(ro[rl]\s+\w{2,3})$/\1, 1/' _lib/memory_sse4.s
c2goasm -a -f -a -f _lib/memory_sse4.s memory_sse4_amd64.s
Processing _lib/memory_sse4.s
Invoking asm2plan9s on memory_sse4_amd64.s
make[1]: Leaving directory '/home/dma2/Code/go/src/github.com/influxdata/arrow/memory'
make[1]: Entering directory '/home/dma2/Code/go/src/github.com/influxdata/arrow/math'
clang -S -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -fno-jump-tables -I_lib -mavx2 -mfma -mllvm -force-vector-width=32 _lib/float64.c -o _lib/float64_avx2.s ; perl -i -pe 's/(ro[rl]\s+\w{2,3})$/\1, 1/' _lib/float64_avx2.s
c2goasm -a -f -a -f _lib/float64_avx2.s float64_avx2_amd64.s
Processing _lib/float64_avx2.s
Invoking asm2plan9s on float64_avx2_amd64.s
clang -S -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -fno-jump-tables -I_lib -msse4 _lib/float64.c -o _lib/float64_sse4.s ; perl -i -pe 's/(ro[rl]\s+\w{2,3})$/\1, 1/' _lib/float64_sse4.s
c2goasm -a -f -a -f _lib/float64_sse4.s float64_sse4_amd64.s
Processing _lib/float64_sse4.s
Invoking asm2plan9s on float64_sse4_amd64.s
clang -S -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -fno-jump-tables -I_lib -mavx2 -mfma -mllvm -force-vector-width=32 _lib/int64.c -o _lib/int64_avx2.s ; perl -i -pe 's/(ro[rl]\s+\w{2,3})$/\1, 1/' _lib/int64_avx2.s
c2goasm -a -f -a -f _lib/int64_avx2.s int64_avx2_amd64.s
Processing _lib/int64_avx2.s
Invoking asm2plan9s on int64_avx2_amd64.s
clang -S -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -fno-jump-tables -I_lib -msse4 _lib/int64.c -o _lib/int64_sse4.s ; perl -i -pe 's/(ro[rl]\s+\w{2,3})$/\1, 1/' _lib/int64_sse4.s
c2goasm -a -f -a -f _lib/int64_sse4.s int64_sse4_amd64.s
Processing _lib/int64_sse4.s
Invoking asm2plan9s on int64_sse4_amd64.s
clang -S -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -fno-jump-tables -I_lib -mavx2 -mfma -mllvm -force-vector-width=32 _lib/uint64.c -o _lib/uint64_avx2.s ; perl -i -pe 's/(ro[rl]\s+\w{2,3})$/\1, 1/' _lib/uint64_avx2.s
c2goasm -a -f -a -f _lib/uint64_avx2.s uint64_avx2_amd64.s
Processing _lib/uint64_avx2.s
Invoking asm2plan9s on uint64_avx2_amd64.s
clang -S -target x86_64-unknown-none -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -O3 -fno-builtin -ffast-math -fno-jump-tables -I_lib -msse4 _lib/uint64.c -o _lib/uint64_sse4.s ; perl -i -pe 's/(ro[rl]\s+\w{2,3})$/\1, 1/' _lib/uint64_sse4.s
c2goasm -a -f -a -f _lib/uint64_sse4.s uint64_sse4_amd64.s
Processing _lib/uint64_sse4.s
Invoking asm2plan9s on uint64_sse4_amd64.s

Here is an example from start to finish for others to follow:

cd github.com/ermig1979/Simd/prj/cmake
rm CMakeCache.txt
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
cmake -DTOOLCHAIN="" -DTARGET="" -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON .
make -B

go get -u github.com/minio/c2goasm
go get -u github.com/klauspost/asmfmt/cmd/asmfmt
go get -u github.com/minio/asm2plan9s

cat original.c

#include <x86intrin.h>
#include_next <immintrin.h>

void MultiplyAndAdd(float* arg1, float* arg2, float* arg3, float* result) {
    __m256 vec1 = _mm256_load_ps(arg1);
    __m256 vec2 = _mm256_load_ps(arg2);
    __m256 vec3 = _mm256_load_ps(arg3);
    __m256 res  = _mm256_fmadd_ps(vec1, vec2, vec3);
    _mm256_storeu_ps(result, res);
}

cat original.c.plan9s.go
//go:noescape
func _MultiplyAndAdd(vec1, vec2, vec3, result unsafe.Pointer)

func MultiplyAndAdd(someObj Object) {

	_MultiplyAndAdd(someObj.GetVec1(), someObj.GetVec2(), someObj.GetVec3(), someObj.GetResult())
}

1) generate the clang assembler code

/usr/bin/clang -mno-red-zone -mstackrealign -fPIC -mavx2 -mavx512bw -o original.c.s -S original.c

original.c.s has been generated.

2) we need original.c.plan9s.go,
   which declares the functions defined in original.c.s

3) generate the asm2plan9s assembler code

c2goasm -a original.c.s original.c.plan9s.s

4) original.c.plan9s.s has been generated.

cat original.c.plan9s.s
//+build !noasm !appengine
// AUTO-GENERATED BY C2GOASM -- DO NOT EDIT

TEXT ·_MultiplyAndAdd(SB), $0-32

    MOVQ vec1+0(FP), DI
    MOVQ vec2+8(FP), SI
    MOVQ vec3+16(FP), DX
    MOVQ result+24(FP), CX

	.cfi_startproc
                                 // pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
                                 // movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
                                 // andq	$-32, %rsp
                                 // subq	$384, %rsp
                                 // movq	%rdi, 176(%rsp)
                                 // movq	%rsi, 168(%rsp)
                                 // movq	%rdx, 160(%rsp)
                                 // movq	%rcx, 152(%rsp)
                                 // movq	176(%rsp), %rcx
                                 // movq	%rcx, 184(%rsp)
                                 // movq	184(%rsp), %rcx
                                 // vmovaps	(%rcx), %ymm0
                                 // vmovaps	%ymm0, 96(%rsp)
                                 // movq	168(%rsp), %rcx
                                 // movq	%rcx, 360(%rsp)
                                 // movq	360(%rsp), %rcx
                                 // vmovaps	(%rcx), %ymm0
                                 // vmovaps	%ymm0, 64(%rsp)
                                 // movq	160(%rsp), %rcx
                                 // movq	%rcx, 352(%rsp)
                                 // movq	352(%rsp), %rcx
                                 // vmovaps	(%rcx), %ymm0
                                 // vmovaps	%ymm0, 32(%rsp)
                                 // vmovaps	96(%rsp), %ymm0
                                 // vmovaps	64(%rsp), %ymm1
                                 // vmovaps	32(%rsp), %ymm2
                                 // vmovaps	%ymm0, 320(%rsp)
                                 // vmovaps	%ymm1, 288(%rsp)
                                 // vmovaps	%ymm2, 256(%rsp)
                                 // vmovaps	320(%rsp), %ymm0
                                 // vmovaps	288(%rsp), %ymm1
                                 // vmovaps	256(%rsp), %ymm2
                                 // vfmadd213ps	%ymm2, %ymm0, %ymm1
                                 // vmovaps	%ymm1, (%rsp)
                                 // movq	152(%rsp), %rcx
                                 // vmovaps	(%rsp), %ymm0
                                 // movq	%rcx, 248(%rsp)
                                 // vmovaps	%ymm0, 192(%rsp)
                                 // vmovaps	192(%rsp), %ymm0
                                 // movq	248(%rsp), %rcx
                                 // vmovups	%ymm0, (%rcx)
                                 // movq	%rbp, %rsp
                                 // popq	%rbp
                                 // vzeroupper
                                 // retq
				 

c++ function names

Hi,
I tried c2goasm on ubuntu18.04 with go 1.11 and encountered 2 little issues:

IMHO assemble.sh should contain clang and not c++.
Using c++ (set for me to clang++ via update-alternatives) produces one underscore in front of function identifiers instead of the two expected by c2goasm.

Here's my clang++ version
clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

The second problem is in the MultiplyAndAdd example.
MultiplyAndAdd didn't work until I replaced _mm256_load_ps with _mm256_loadu_ps, probably because the arguments are not 32-byte aligned.
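
On the Go side, one way to keep the aligned load usable is to pass the assembly buffers that already start on a 32-byte boundary. The following is a minimal sketch, not part of c2goasm: alignedFloat32 and isAligned are hypothetical helper names, and the approach assumes the Go runtime does not move heap allocations (true of current Go releases, but not promised by the language spec).

```go
package main

import (
	"fmt"
	"unsafe"
)

// isAligned reports whether the first element of s sits on an
// align-byte boundary. An empty slice is treated as aligned.
func isAligned(s []float32, align int) bool {
	if len(s) == 0 {
		return true
	}
	return uintptr(unsafe.Pointer(&s[0]))%uintptr(align) == 0
}

// alignedFloat32 (hypothetical helper) returns a slice of n float32
// values whose first element is align-byte aligned. align must be a
// power of two and at least 4. It over-allocates by align/4 elements
// and skips forward to the first aligned element.
func alignedFloat32(n, align int) []float32 {
	pad := align / 4
	buf := make([]float32, n+pad)
	rem := int(uintptr(unsafe.Pointer(&buf[0])) % uintptr(align))
	off := 0
	if rem != 0 {
		// float32 data is 4-byte aligned, so rem is a multiple of 4.
		off = (align - rem) / 4
	}
	return buf[off : off+n : off+n]
}

func main() {
	a := alignedFloat32(8, 32)
	fmt.Println(len(a), isAligned(a, 32))
}
```

If the caller cannot control allocation, switching to the unaligned load as described above remains the simpler option.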

Best regards

"panic: Offset for higher number argument asked for than reserved"

I get panic: Offset for higher number argument asked for than reserved for code generated using clang -O3 -masm=intel -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -S mdb.c
with clang 7.0.0 on Windows 7 x64.

When this check panics, the values are iarg=5; len(registers)=6.

After defining the go signature for func _mdb_txn_renew(txn unsafe.Pointer) (errCode int) the code panics the first time. I think the following asm belongs to it.

	.def	 mdb_txn_renew;
	.scl	2;
	.type	32;
	.endef
	.globl	mdb_txn_renew           # -- Begin function mdb_txn_renew
	.p2align	4, 0x90
mdb_txn_renew:                          # @mdb_txn_renew
# %bb.0:
	push	rbp
	sub	rsp, 32
	lea	rbp, [rsp + 32]
	and	rsp, -16
	test	rcx, rcx
	je	.LBB9_3
# %bb.1:
	mov	eax, 131073
	and	eax, dword ptr [rcx + 124]
	cmp	eax, 131073
	jne	.LBB9_3
# %bb.2:
	call	mdb_txn_renew0
	mov	rsp, rbp
	pop	rbp
	ret
.LBB9_3:
	mov	eax, 22
	mov	rsp, rbp
	pop	rbp
	ret

companion '.go' file is missing

Steps to reproduce:

$ cat <<EOF > hello.c
#include <stdio.h>

int main()
{
	printf("Hello World!\n");
	return 0;
}
EOF
$ gcc -S hello.c # produces hello.s
$ c2goasm hello.s hello-go.s
error: companion '.go' file is missing for hello-go.s
$ touch hello-go.go # doesn't help

How to create this companion file and why is it required?

Crashes on a function with no arguments

I am using 64 bit Linux, clang 6 and latest c2goasm (61af18e) and asm2plan9s (90679c2ae83a1f72e371ea3476e862f619044d47).

$ cat f.c
int buz() {
    return 42;
}

$ clang-6.0 -O3 -mavx -mfma -masm=intel -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -S -o f.s f.c

$ cat f.s
        .text
        .intel_syntax noprefix
        .file   "f.c"
        .globl  buz                     # -- Begin function buz
        .p2align        4, 0x90
        .type   buz,@function
buz:                                    # @buz
# %bb.0:
        mov     eax, 42
        ret
.Lfunc_end0:
        .size   buz, .Lfunc_end0-buz
                                        # -- End function

        .ident  "clang version 6.0.0-1~bpo9+1 (tags/RELEASE_600/final)"
        .section        ".note.GNU-stack","",@progbits

$ cat gofunc/f_amd64.go
package gofunc

//go:noescape
func _buz() (result int)

$ c2goasm -a f.s gofunc/f_amd64.s
Processing f.s
panic: runtime error: index out of range

goroutine 1 [running]:
main.getGolangArgs(0xc0000bc308, 0x3, 0xc000071140, 0xf, 0x0, 0x60dea8, 0x0, 0x0, 0x60dea8, 0x0, ...)
        c2goasm/arguments.go:89 +0x78e
main.parseCompanionFile(0xc000072e00, 0x11, 0xc0000bc308, 0x3, 0x0, 0x0, 0x1, 0x5235e8, 0xc0000a3930, 0xc000068c00)
        c2goasm/arguments.go:67 +0xc9
main.process(0xc0000c8100, 0x10, 0x10, 0xc000072e00, 0x11, 0x0, 0x0, 0x0, 0x0, 0x0)
        c2goasm/c2goasm.go:86 +0x203
main.main()
        c2goasm/c2goasm.go:264 +0x37e

I debugged this a bit and found that the match variable is []string{"func _buz() (result int)", "_buz", "", " (result int)"}. The subsequent code attempts to parse match[2], which is an empty string.

The following patch fixes the issue:

diff --git a/arguments.go b/arguments.go
index ee466aa..ea310ea 100644
--- a/arguments.go
+++ b/arguments.go
@@ -85,8 +85,10 @@ func getGolangArgs(protoName, goline string) (isFunc bool, args, rets []string,
                if match[1] == "_"+protoName {

                        args, rets = []string{}, []string{}
-                       for _, arg := range strings.Split(match[2], ",") {
-                               args = append(args, strings.Fields(arg)[0])
+                       if match[2] != "" {
+                               for _, arg := range strings.Split(match[2], ",") {
+                                       args = append(args, strings.Fields(arg)[0])
+                               }
                        }

                        trailer := strings.TrimSpace(match[3])

c2goasm generates asm which triggers expected identifier, found "."

I am trying to build a C function that computes the variance of a population. I am building it with only C99, c2goasm and clang for Mac.
clang works as expected and c2goasm runs without errors. However, the assembly produced by c2goasm fails in the Go build step.
The Go compiler outputs:

./stats_nocgo.s:431: expected identifier, found "."
asm: assembly of ./stats_nocgo.s failed

I have posted a working example of this issue as a small GitHub project on my profile.
Run make test to see the message above.

Gist of the Code

My C code looks like this:

#include <stdint.h>

#define SIZE 16

void variance_uint32_nocgo_darwin(uint32_t buf[], int len, double *result) {
    double intermediate[SIZE];

    uint32_t sum = 0;
    for (int i = 0; i < len; i++) {
        uint32_t value = buf[i];
        sum += value;
    }
    double mean = sum / ((double)len);

    double accumulator = 0.0;
    for (int j = 0; j < len / SIZE; j++) {
        for (int i = 0; i < SIZE; i++) {
            intermediate[i] = (double)buf[(j * SIZE) + i];
        }
        for (int i = 0; i < SIZE; i++) {
            double value = intermediate[i];
            double diff = value - mean;
            intermediate[i] = diff;
        }
        for (int i = 0; i < SIZE; i++) {
            double diff = intermediate[i];
            accumulator += diff * diff;
        }
    }

    int start = (len - (len % SIZE));
    for (int i = start; i < len; i++) {
        intermediate[i - start] = (double)buf[i];
    }
    int end = (len % SIZE);
    for (int i = 0; i < end; i++) {
        double value = intermediate[i];
        double diff = value - mean;
        intermediate[i] = diff;
    }
    for (int i = 0; i < end; i++) {
        double diff = intermediate[i];
        accumulator += diff * diff;
    }

    double variance = accumulator / (double)(len - 1);
    *result = variance;
}

I have also uploaded the output assembly from clang and from c2goasm in separate folders called examples.

Build and run

I am building with

clang -S -O3 -mavx2 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti     -c stats_nocgo.c -o stats_intermediate.s
c2goasm stats_intermediate.s stats_nocgo.s;
rm stats_intermediate.s
CGO_ENABLED=0 go build .

The

System Information

Go version 1.13.

$ go version
go version go1.13.1 darwin/amd64
$ clang --version
Apple clang version 11.0.0 (clang-1100.0.33.12)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Running on macOS 10.14.6

Address alignment in go

I am using AVX instructions (basically MultiplyAndAdd.cpp). I modified the .cpp code a bit; below are snippets of the new and the old (unmodified) MultiplyAndAdd.s files:

New:

    vmovaps ymm0, YMMWORD PTR [rdi]
    vmovaps ymm1, YMMWORD PTR [rdx]
    vfmadd132ps     ymm0, ymm1, YMMWORD PTR [rsi]
    vmovups YMMWORD PTR [rcx], ymm0
    vzeroupper
    ret

Old:

push	rbp
mov	rbp, rsp
vmovups	ymm0, ymmword ptr [rdi]
vmovups	ymm1, ymmword ptr [rsi]
vfmadd213ps	ymm1, ymm0, ymmword ptr [rdx]
vmovups	ymmword ptr [rcx], ymm1
pop	rbp
vzeroupper
ret

As you can see, the old code uses "vmovups", which allows unaligned addresses, while the new code uses "vmovaps", which requires aligned addresses (256-bit alignment, since this is AVX). Go seems to allow at most 128-bit alignment, when using complex128, so 256-bit alignment can never be guaranteed. I am assuming "vmovups" will be slower than "vmovaps".

So the question is: is there a way to get 256-bit alignment in Go?

Also, there seems to be an issue. When I use c2goasm/test/cpp/assembler.sh, the .s file that I get is not compatible with c2goasm. The reason is that the source is compiled as C++, so the function name is mangled and c2goasm cannot find the corresponding function declaration in the .go file. I have to use extern "C" {} in the .cpp file to ensure the function names are not mangled.

So another question is: are the .s files (in the c2goasm/test/cpp/ folder) hand-crafted rather than generated with assembler.sh?

go BLAS

Hi,

I am trying to convert OpenBLAS to Go with this project.

A simple C example:

// degmm.c
#include <cblas.h>

void degmm()
{
  int i=0;
  double A[6] = {1.0,2.0,1.0,-3.0,4.0,-1.0};
  double B[6] = {1.0,2.0,1.0,-3.0,4.0,-1.0};
  double C[9] = {.5,.5,.5,.5,.5,.5,.5,.5,.5};
  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,3,3,2,1,A, 3, B, 3,2,C,3);
}

int main () {
    degmm();
}

Compile it with the following command:

clang-6.0 -O3 -mavx -mfma -masm=intel -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -S degmm.c

This produces the following assembly, degmm.s:

	.text
	.intel_syntax noprefix
	.file	"degmm.c"
	.section	.rodata.cst32,"aM",@progbits,32
	.p2align	5               # -- Begin function degmm
.LCPI0_0:
	.quad	4607182418800017408     # double 1
	.quad	4611686018427387904     # double 2
	.quad	4607182418800017408     # double 1
	.quad	-4609434218613702656    # double -3
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3
.LCPI0_1:
	.quad	4607182418800017408     # double 1
.LCPI0_2:
	.quad	4611686018427387904     # double 2
	.text
	.globl	degmm
	.p2align	4, 0x90
	.type	degmm,@function
degmm:                                  # @degmm
# %bb.0:
	sub	rsp, 168
	vmovaps	ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = [1.000000e+00,2.000000e+00,1.000000e+00,-3.000000e+00]
	vmovups	ymmword ptr [rsp + 48], ymm0
	movabs	rax, 4616189618054758400
	mov	qword ptr [rsp + 80], rax
	movabs	rcx, -4616189618054758400
	mov	qword ptr [rsp + 88], rcx
	vmovups	ymmword ptr [rsp], ymm0
	mov	qword ptr [rsp + 32], rax
	mov	qword ptr [rsp + 40], rcx
	mov	rax, qword ptr [rip + .Ldegmm.C+64]
	mov	qword ptr [rsp + 160], rax
	vmovups	ymm0, ymmword ptr [rip + .Ldegmm.C+32]
	vmovups	ymmword ptr [rsp + 128], ymm0
	vmovups	ymm0, ymmword ptr [rip + .Ldegmm.C]
	vmovups	ymmword ptr [rsp + 96], ymm0
	lea	rax, [rsp + 96]
	mov	r10, rsp
	lea	r11, [rsp + 48]
	vmovsd	xmm0, qword ptr [rip + .LCPI0_1] # xmm0 = mem[0],zero
	vmovsd	xmm1, qword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero
	mov	edi, 102
	mov	esi, 111
	mov	edx, 112
	mov	ecx, 3
	mov	r8d, 3
	mov	r9d, 2
	push	3
	push	rax
	push	3
	push	r10
	push	3
	push	r11
	vzeroupper
	call	cblas_dgemm
	add	rsp, 216
	ret
.Lfunc_end0:
	.size	degmm, .Lfunc_end0-degmm
                                        # -- End function
	.section	.rodata.cst32,"aM",@progbits,32
	.p2align	5               # -- Begin function main
.LCPI1_0:
	.quad	4607182418800017408     # double 1
	.quad	4611686018427387904     # double 2
	.quad	4607182418800017408     # double 1
	.quad	-4609434218613702656    # double -3
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3
.LCPI1_1:
	.quad	4607182418800017408     # double 1
.LCPI1_2:
	.quad	4611686018427387904     # double 2
	.text
	.globl	main
	.p2align	4, 0x90
	.type	main,@function
main:                                   # @main
# %bb.0:
	sub	rsp, 168
	vmovaps	ymm0, ymmword ptr [rip + .LCPI1_0] # ymm0 = [1.000000e+00,2.000000e+00,1.000000e+00,-3.000000e+00]
	vmovups	ymmword ptr [rsp + 48], ymm0
	movabs	rax, 4616189618054758400
	mov	qword ptr [rsp + 80], rax
	movabs	rcx, -4616189618054758400
	mov	qword ptr [rsp + 88], rcx
	vmovups	ymmword ptr [rsp], ymm0
	mov	qword ptr [rsp + 32], rax
	mov	qword ptr [rsp + 40], rcx
	mov	rax, qword ptr [rip + .Ldegmm.C+64]
	mov	qword ptr [rsp + 160], rax
	vmovups	ymm0, ymmword ptr [rip + .Ldegmm.C+32]
	vmovups	ymmword ptr [rsp + 128], ymm0
	vmovups	ymm0, ymmword ptr [rip + .Ldegmm.C]
	vmovups	ymmword ptr [rsp + 96], ymm0
	lea	rax, [rsp + 96]
	mov	r10, rsp
	lea	r11, [rsp + 48]
	vmovsd	xmm0, qword ptr [rip + .LCPI1_1] # xmm0 = mem[0],zero
	vmovsd	xmm1, qword ptr [rip + .LCPI1_2] # xmm1 = mem[0],zero
	mov	edi, 102
	mov	esi, 111
	mov	edx, 112
	mov	ecx, 3
	mov	r8d, 3
	mov	r9d, 2
	push	3
	push	rax
	push	3
	push	r10
	push	3
	push	r11
	vzeroupper
	call	cblas_dgemm
	add	rsp, 48
	xor	eax, eax
	add	rsp, 168
	ret
.Lfunc_end1:
	.size	main, .Lfunc_end1-main
                                        # -- End function
	.type	.Ldegmm.C,@object       # @degmm.C
	.section	.rodata,"a",@progbits
	.p2align	4
.Ldegmm.C:
	.quad	4602678819172646912     # double 0.5
	.quad	4602678819172646912     # double 0.5
	.quad	4602678819172646912     # double 0.5
	.quad	4602678819172646912     # double 0.5
	.quad	4602678819172646912     # double 0.5
	.quad	4602678819172646912     # double 0.5
	.quad	4602678819172646912     # double 0.5
	.quad	4602678819172646912     # double 0.5
	.quad	4602678819172646912     # double 0.5
	.size	.Ldegmm.C, 72


	.ident	"clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final)"
	.section	".note.GNU-stack","",@progbits

Then convert it to Go assembly:

c2goasm -a degmm.s Degmm.s

This reports the following error:

Processing cpp/degmm.s
panic: 'sub rsp' found but in unexpected scenario:      sub     rsp, 168

goroutine 1 [running]:
main.(*Epilogue).isPrologueInstruction(0xc0000575b8, 0xc0000171a0, 0xd, 0xd)
        /home/bzhou/go/src/github.com/minio/c2goasm/epilogue.go:205 +0x3d4
main.getPrologueLines(0xc0000d0150, 0x26, 0xeb, 0xc0000575b8, 0x0)
        /home/bzhou/go/src/github.com/minio/c2goasm/subroutine.go:231 +0xb4
main.(*Subroutine).removePrologueLines(0xc000057590, 0xc0000d0000, 0x8d, 0x100, 0x15, 0x3b)
        /home/bzhou/go/src/github.com/minio/c2goasm/subroutine.go:134 +0x87
main.extractSubroutine(0x3a, 0xc0000d0000, 0x8d, 0x100, 0x11, 0xc000017188, 0x5, 0x14, 0x0, 0x0, ...)
        /home/bzhou/go/src/github.com/minio/c2goasm/subroutine.go:127 +0x2cd
main.segmentSource(0xc0000d0000, 0x8d, 0x100, 0x20, 0xf, 0xc)
        /home/bzhou/go/src/github.com/minio/c2goasm/subroutine.go:85 +0x20e
main.process(0xc0000d0000, 0x8d, 0x100, 0xc000017058, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/bzhou/go/src/github.com/minio/c2goasm/c2goasm.go:78 +0x80
main.main()
        /home/bzhou/go/src/github.com/minio/c2goasm/c2goasm.go:264 +0x37e

Any suggestions?

"Too many fields found: 3" on a line like ".p2align 2, 0x90"

As a workaround, I used:

diff --git a/constants.go b/constants.go
index 90cfafb..d4bae45 100644
--- a/constants.go
+++ b/constants.go
@@ -127,7 +127,14 @@ func defineTable(constants []string, tableName string) Table {
                        bytes = append(bytes, byte(v>>48))
                        bytes = append(bytes, byte(v>>56))
                } else if strings.Contains(line, ".align") || strings.Contains(line, ".p2align") {
-                       bits := getSingleNumber(line)
+                       fields := strings.FieldsFunc(line, func(c rune) bool { return c == ',' || c == ' ' || c == '\t' })
+                       if len(fields) <= 1 || 4 <= len(fields) {
+                               panic(fmt.Sprintf(".p2align must have 2 or 3 arguments; got %v", fields))
+                       }
+                       bits, err := strconv.ParseInt(fields[1], 10, 64)
+                       if err != nil {
+                               panic(err)
+                       }
                        align := 1 << uint(bits)
                        for len(bytes)%align != 0 {
                                bytes = append(bytes, 0)

Note that this erroneously ignores the padding character.
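
A variant that keeps the padding byte could look like the sketch below. It is not c2goasm code: parseP2Align and pad are hypothetical helpers, and the handling of an empty second operand (as in the GCC form ".p2align 4,,15") as a zero fill is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseP2Align is a hypothetical parser (not c2goasm code) for lines such as
// ".p2align 4, 0x90". It returns the power-of-two exponent and the fill byte;
// the fill is 0 when the second operand is absent or empty.
func parseP2Align(line string) (uint, byte, error) {
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i] // drop a trailing assembler comment
	}
	fields := strings.Fields(line)
	if len(fields) < 2 || fields[0] != ".p2align" {
		return 0, 0, fmt.Errorf("not a .p2align directive: %q", line)
	}
	// Re-join the operands so "4, 0x90" and "4,,15" both split on commas.
	operands := strings.Split(strings.Join(fields[1:], ""), ",")
	if len(operands) > 3 {
		return 0, 0, fmt.Errorf(".p2align takes at most 3 operands, got %d", len(operands))
	}
	bits, err := strconv.ParseUint(operands[0], 0, 8)
	if err != nil {
		return 0, 0, err
	}
	var fill byte
	if len(operands) > 1 && operands[1] != "" {
		f, err := strconv.ParseUint(operands[1], 0, 8)
		if err != nil {
			return 0, 0, err
		}
		fill = byte(f)
	}
	return uint(bits), fill, nil
}

// pad appends fill bytes until len(data) is a multiple of 1<<bits.
func pad(data []byte, bits uint, fill byte) []byte {
	for len(data)%(1<<bits) != 0 {
		data = append(data, fill)
	}
	return data
}

func main() {
	bits, fill, err := parseP2Align("\t.p2align\t4, 0x90")
	fmt.Println(bits, fill, err)                    // 4 144 <nil>
	fmt.Println(len(pad(make([]byte, 5), 2, fill))) // 8
}
```

Parsing with base 0 lets strconv accept both decimal and 0x-prefixed operands, which is why the fill value 0x90 does not need special handling.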

Linux GCC 8: function name error

env

  • system: linux amd64(archlinux)
  • gcc: v8.1.0
  • clang & llvm: v6.0.0
  • yasm: v1.3.0
  • c2goasm: latest master
  • asm2plan9s: latest master

problem

run

git clone https://github.com/minio/c2goasm.git
cd c2goasm/test/cpp
./assemble.sh ./MaddConstant.cpp
cd ../../
go build
./c2goasm -a ./test/cpp/MaddConstant.s ./test/MaddConstant_amd64.s

error

Processing ./test/cpp/MaddConstant.s
panic: Failed to find function prototype for _Z12MaddConstantPfS_S_

goroutine 1 [running]:
main.parseCompanionFile(0xc4200131c0, 0x1c, 0xc420013268, 0x16, 0x0, 0x0, 0x1, 0x40e6d9, 0x51eb80, 0xc42004d850)
        /home/zero/go-works/src/c2goasm/arguments.go:75 +0x24c
main.process(0xc420094600, 0x1e, 0x20, 0xc4200131c0, 0x1c, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/zero/go-works/src/c2goasm/c2goasm.go:86 +0x205
main.main()
        /home/zero/go-works/src/c2goasm/c2goasm.go:264 +0x390

Attempted fix

diff --git a/subroutine.go b/subroutine.go

index 00d98f7..c6b3ec4 100644
--- a/subroutine.go
+++ b/subroutine.go
@@ -266,7 +266,7 @@ func extractNamePart(part string) (int, string) {
 func extractName(name string) string {

        // Only proceed for C++ mangled names
-       if !(strings.HasPrefix(name, "_ZN") || strings.HasPrefix(name, "__Z")) {
+       if !(strings.HasPrefix(name, "_ZN") || strings.HasPrefix(name, "__Z") || strings.HasPrefix(name, "_Z")) {
                return name
        }
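
The prefix check can be tried on its own. The sketch below is a hypothetical helper, not the c2goasm implementation: simpleName handles only the simplest Itanium-style manglings (an optional extra leading underscore as emitted on macOS, "_Z", an optional "N", then a length-prefixed identifier) and returns anything else unchanged. Note that every name starting with "_ZN" also starts with "_Z", so the added check subsumes the first one.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// simpleName is a hypothetical helper (not c2goasm code). It extracts the
// first length-prefixed identifier from a simple mangled C++ symbol and
// returns the input unchanged when the symbol does not look mangled.
func simpleName(sym string) string {
	s := sym
	if strings.HasPrefix(s, "__Z") {
		s = s[1:] // macOS prepends an extra underscore
	}
	if !strings.HasPrefix(s, "_Z") {
		return sym // plain C name
	}
	s = strings.TrimPrefix(s[2:], "N")
	i := 0
	for i < len(s) && s[i] >= '0' && s[i] <= '9' {
		i++
	}
	if i == 0 {
		return sym
	}
	n, err := strconv.Atoi(s[:i])
	if err != nil || i+n > len(s) {
		return sym // length runs past the end: do not slice out of range
	}
	return s[i : i+n]
}

func main() {
	fmt.Println(simpleName("_Z12MaddConstantPfS_S_")) // MaddConstant
	fmt.Println(simpleName("XORShift128Plus"))        // XORShift128Plus
}
```

The bounds check before slicing matters: without it, a symbol whose digits do not describe a valid length would trigger a slice-bounds panic.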

However, running go test -v ./... then fails:

FAIL    c2goasm/cgocmp [build failed]
=== RUN   TestMaddArgs10
--- PASS: TestMaddArgs10 (0.00s)
=== RUN   TestMaddConstant
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x4e7ba3]

goroutine 6 [running]:
runtime.throw(0x52fa3b, 0x5)
        /usr/lib/go/src/runtime/panic.go:616 +0x81 fp=0xc420039e00 sp=0xc420039de0 pc=0x428a91
runtime.sigpanic()
        /usr/lib/go/src/runtime/signal_unix.go:395 +0x211 fp=0xc420039e50 sp=0xc420039e00 pc=0x43bc61
c2goasm/test._MaddConstant(0xc420039ea0, 0xc420039ec0, 0xc420039e70)
        /home/zero/go-works/src/c2goasm/test/MaddConstant_amd64.s:11 +0x13 fp=0xc420039e58 sp=0xc420039e50 pc=0x4e7ba3
c2goasm/test.MaddConstant(0x3f80000000000000, 0x4040000040000000, 0x40a0000040800000, 0x40e0000040c00000, 0x4000000000000000, 0x40c0000040800000, 0x4120000041000000, 0x4160000041400000, 0x0, 0x0, ...)
        /home/zero/go-works/src/c2goasm/test/MaddConstant_amd64.go:17 +0x5c fp=0xc420039ea0 sp=0xc420039e58 pc=0x4e5f3c
c2goasm/test.TestMaddConstant(0xc4200a21e0)
        /home/zero/go-works/src/c2goasm/test/MaddConstant_test.go:17 +0xd2 fp=0xc420039fa8 sp=0xc420039ea0 pc=0x4e65f2
testing.tRunner(0xc4200a21e0, 0x538520)
        /usr/lib/go/src/testing/testing.go:777 +0xd0 fp=0xc420039fd0 sp=0xc420039fa8 pc=0x4ad900
runtime.goexit()
        /usr/lib/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420039fd8 sp=0xc420039fd0 pc=0x454c51
created by testing.(*T).Run
        /usr/lib/go/src/testing/testing.go:824 +0x2e0

goroutine 1 [chan receive]:
testing.(*T).Run(0xc4200a21e0, 0x53176a, 0x10, 0x538520, 0x465d01)
        /usr/lib/go/src/testing/testing.go:825 +0x301
testing.runTests.func1(0xc4200a2000)
        /usr/lib/go/src/testing/testing.go:1063 +0x64
testing.tRunner(0xc4200a2000, 0xc42004ddf8)
        /usr/lib/go/src/testing/testing.go:777 +0xd0
testing.runTests(0xc42000a080, 0x5da340, 0x7, 0x7, 0x40e959)
        /usr/lib/go/src/testing/testing.go:1061 +0x2c4
testing.(*M).Run(0xc42009e000, 0x0)
        /usr/lib/go/src/testing/testing.go:978 +0x171
main.main()
        _testmain.go:56 +0x151
FAIL    c2goasm/test    0.003s

asm source

test/cpp/MaddConstant.s

	.file	"MaddConstant.cpp"
	.intel_syntax noprefix
	.text
	.p2align 4,,15
	.globl	_Z12MaddConstantPfS_S_
	.type	_Z12MaddConstantPfS_S_, @function
_Z12MaddConstantPfS_S_:
	vmovaps	ymm0, YMMWORD PTR [rdi]
	vmovaps	ymm1, YMMWORD PTR _ZL1a[rip]
	vfmadd132ps	ymm0, ymm1, YMMWORD PTR [rsi]
	vmovups	XMMWORD PTR [rdx], xmm0
	vextractf128	XMMWORD PTR 16[rdx], ymm0, 0x1
	vzeroupper
	ret
	.size	_Z12MaddConstantPfS_S_, .-_Z12MaddConstantPfS_S_
	.section	.rodata
	.align 32
	.type	_ZL1a, @object
	.size	_ZL1a, 32
_ZL1a:
	.long	1065353216
	.long	1073741824
	.long	1077936128
	.long	1082130432
	.long	1084227584
	.long	1086324736
	.long	1088421888
	.long	1090519040
	.ident	"GCC: (GNU) 8.1.0"
	.section	.note.GNU-stack,"",@progbits

test/MaddConstant_amd64.s

//+build !noasm !appengine
// AUTO-GENERATED BY C2GOASM -- DO NOT EDIT

TEXT ·_MaddConstant(SB), $0-24

    MOVQ vec1+0(FP), DI
    MOVQ vec2+8(FP), SI
    MOVQ result+16(FP), DX

    LONG $0x0728fcc5             // vmovaps    ymm0, YMMWORD PTR [rdi]
    QUAD $0x000000000d28fcc5     // vmovaps    ymm1, YMMWORD PTR _ZL1a[rip]
    LONG $0x9875e2c4; BYTE $0x06 // vfmadd132ps    ymm0, ymm1, YMMWORD PTR [rsi]
    LONG $0x0211f8c5             // vmovups    XMMWORD PTR [rdx], xmm0
    LONG $0x197de3c4; WORD $0x1042; BYTE $0x01 // vextractf128    XMMWORD PTR 16[rdx], ymm0, 0x1
    VZEROUPPER
    RET

The source is in my fork.

Takes forever to generate the Go assembly

Hi, I am testing c2goasm. For a simple function it works, but for a bigger one it takes forever. Is there a problem in my configuration, or is this a bug?


build command: clang -S -O3 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti $1

Generated Clang assembly
	.text
	.intel_syntax noprefix
	.file	"encoder.c"
	.globl	qoi_write_32            # -- Begin function qoi_write_32
	.p2align	4, 0x90
	.type	qoi_write_32,@function
qoi_write_32:                           # @qoi_write_32
# %bb.0:
	push	rbp
	mov	rbp, rsp
	and	rsp, -8
	mov	eax, edx
	shr	eax, 24
	movsxd	r8, dword ptr [rsi]
	lea	ecx, [r8 + 1]
	mov	dword ptr [rsi], ecx
	mov	byte ptr [rdi + r8], al
	mov	eax, edx
	shr	eax, 16
	movsxd	r8, dword ptr [rsi]
	lea	ecx, [r8 + 1]
	mov	dword ptr [rsi], ecx
	mov	byte ptr [rdi + r8], al
	movsxd	rax, dword ptr [rsi]
	lea	ecx, [rax + 1]
	mov	dword ptr [rsi], ecx
	mov	byte ptr [rdi + rax], dh
	movsxd	rax, dword ptr [rsi]
	lea	ecx, [rax + 1]
	mov	dword ptr [rsi], ecx
	mov	byte ptr [rdi + rax], dl
	mov	rsp, rbp
	pop	rbp
	ret
.Lfunc_end0:
	.size	qoi_write_32, .Lfunc_end0-qoi_write_32
                                        # -- End function
	.globl	qoi_read_32             # -- Begin function qoi_read_32
	.p2align	4, 0x90
	.type	qoi_read_32,@function
qoi_read_32:                            # @qoi_read_32
# %bb.0:
	push	rbp
	mov	rbp, rsp
	and	rsp, -8
	movsxd	rcx, dword ptr [rsi]
	lea	rax, [rcx + 1]
	mov	dword ptr [rsi], eax
	movzx	r8d, byte ptr [rdi + rcx]
	lea	rax, [rcx + 2]
	mov	dword ptr [rsi], eax
	movzx	r9d, byte ptr [rdi + rcx + 1]
	lea	rax, [rcx + 3]
	mov	dword ptr [rsi], eax
	movzx	eax, byte ptr [rdi + rcx + 2]
	lea	edx, [rcx + 4]
	mov	dword ptr [rsi], edx
	movzx	ecx, byte ptr [rdi + rcx + 3]
	shl	r8d, 24
	shl	r9d, 16
	or	r9d, r8d
	shl	eax, 8
	or	eax, r9d
	or	eax, ecx
	mov	rsp, rbp
	pop	rbp
	ret
.Lfunc_end1:
	.size	qoi_read_32, .Lfunc_end1-qoi_read_32
                                        # -- End function
	.globl	pixel_cpy               # -- Begin function pixel_cpy
	.p2align	4, 0x90
	.type	pixel_cpy,@function
pixel_cpy:                              # @pixel_cpy
# %bb.0:
	push	rbp
	mov	rbp, rsp
	and	rsp, -8
	test	rdx, rdx
	je	.LBB2_20
# %bb.1:
	cmp	rdx, 32
	jb	.LBB2_13
# %bb.2:
	lea	rax, [rsi + rdx]
	cmp	rax, rdi
	jbe	.LBB2_4
# %bb.3:
	lea	rax, [rdi + rdx]
	cmp	rax, rsi
	jbe	.LBB2_4
.LBB2_13:
	lea	r8, [rdx - 1]
	mov	r9, rdx
	and	r9, 7
	je	.LBB2_17
.LBB2_14:
	xor	ecx, ecx
	.p2align	4, 0x90
.LBB2_15:                               # =>This Inner Loop Header: Depth=1
	movzx	eax, byte ptr [rsi + rcx]
	mov	byte ptr [rdi + rcx], al
	add	rcx, 1
	cmp	r9, rcx
	jne	.LBB2_15
# %bb.16:
	sub	rdx, rcx
	add	rsi, rcx
	add	rdi, rcx
.LBB2_17:
	cmp	r8, 7
	jb	.LBB2_20
# %bb.18:
	xor	eax, eax
	.p2align	4, 0x90
.LBB2_19:                               # =>This Inner Loop Header: Depth=1
	movzx	ecx, byte ptr [rsi + rax]
	mov	byte ptr [rdi + rax], cl
	movzx	ecx, byte ptr [rsi + rax + 1]
	mov	byte ptr [rdi + rax + 1], cl
	movzx	ecx, byte ptr [rsi + rax + 2]
	mov	byte ptr [rdi + rax + 2], cl
	movzx	ecx, byte ptr [rsi + rax + 3]
	mov	byte ptr [rdi + rax + 3], cl
	movzx	ecx, byte ptr [rsi + rax + 4]
	mov	byte ptr [rdi + rax + 4], cl
	movzx	ecx, byte ptr [rsi + rax + 5]
	mov	byte ptr [rdi + rax + 5], cl
	movzx	ecx, byte ptr [rsi + rax + 6]
	mov	byte ptr [rdi + rax + 6], cl
	movzx	ecx, byte ptr [rsi + rax + 7]
	mov	byte ptr [rdi + rax + 7], cl
	add	rax, 8
	cmp	rdx, rax
	jne	.LBB2_19
	jmp	.LBB2_20
.LBB2_4:
	mov	r8, rdx
	and	r8, -32
	lea	rax, [r8 - 32]
	mov	rcx, rax
	shr	rcx, 5
	add	rcx, 1
	mov	r9d, ecx
	and	r9d, 3
	cmp	rax, 96
	jae	.LBB2_6
# %bb.5:
	xor	eax, eax
	jmp	.LBB2_8
.LBB2_6:
	sub	rcx, r9
	xor	eax, eax
	.p2align	4, 0x90
.LBB2_7:                                # =>This Inner Loop Header: Depth=1
	movups	xmm0, xmmword ptr [rsi + rax]
	movups	xmm1, xmmword ptr [rsi + rax + 16]
	movups	xmmword ptr [rdi + rax], xmm0
	movups	xmmword ptr [rdi + rax + 16], xmm1
	movups	xmm0, xmmword ptr [rsi + rax + 32]
	movups	xmm1, xmmword ptr [rsi + rax + 48]
	movups	xmmword ptr [rdi + rax + 32], xmm0
	movups	xmmword ptr [rdi + rax + 48], xmm1
	movups	xmm0, xmmword ptr [rsi + rax + 64]
	movups	xmm1, xmmword ptr [rsi + rax + 80]
	movups	xmmword ptr [rdi + rax + 64], xmm0
	movups	xmmword ptr [rdi + rax + 80], xmm1
	movups	xmm0, xmmword ptr [rsi + rax + 96]
	movups	xmm1, xmmword ptr [rsi + rax + 112]
	movups	xmmword ptr [rdi + rax + 96], xmm0
	movups	xmmword ptr [rdi + rax + 112], xmm1
	sub	rax, -128
	add	rcx, -4
	jne	.LBB2_7
.LBB2_8:
	test	r9, r9
	je	.LBB2_11
# %bb.9:
	add	rax, 16
	neg	r9
	.p2align	4, 0x90
.LBB2_10:                               # =>This Inner Loop Header: Depth=1
	movups	xmm0, xmmword ptr [rsi + rax - 16]
	movups	xmm1, xmmword ptr [rsi + rax]
	movups	xmmword ptr [rdi + rax - 16], xmm0
	movups	xmmword ptr [rdi + rax], xmm1
	add	rax, 32
	inc	r9
	jne	.LBB2_10
.LBB2_11:
	cmp	r8, rdx
	jne	.LBB2_12
.LBB2_20:
	mov	eax, 1
	mov	rsp, rbp
	pop	rbp
	ret
.LBB2_12:
	and	edx, 31
	add	rsi, r8
	add	rdi, r8
	lea	r8, [rdx - 1]
	mov	r9, rdx
	and	r9, 7
	jne	.LBB2_14
	jmp	.LBB2_17
.Lfunc_end2:
	.size	pixel_cpy, .Lfunc_end2-pixel_cpy
                                        # -- End function
	.globl	qoi_pixel_encoder       # -- Begin function qoi_pixel_encoder
	.p2align	4, 0x90
	.type	qoi_pixel_encoder,@function
qoi_pixel_encoder:                      # @qoi_pixel_encoder
# %bb.0:
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -8
	mov	ebx, dword ptr [rbp + 16]
	mov	r10, qword ptr [rbp + 24]
	mov	r11, qword ptr [rbp + 32]
	xor	eax, eax
	cmp	r11, r10
	sete	al
	add	r9d, -1
	xor	r9d, ecx
	add	ebx, -1
	xor	ebx, r8d
	or	ebx, r9d
	sete	cl
	mov	r9b, byte ptr [r11]
	mov	r13b, byte ptr [r11 + 1]
	mov	r14b, byte ptr [r11 + 2]
	mov	r8b, byte ptr [r11 + 3]
	mov	ebx, dword ptr [rdx + 4*rax]
	cmp	r11, r10
	jne	.LBB3_3
# %bb.1:
	test	cl, cl
	jne	.LBB3_3
# %bb.2:
	cmp	ebx, 8224
	jne	.LBB3_7
.LBB3_3:
	shl	rax, 2
	cmp	ebx, 32
	jg	.LBB3_5
# %bb.4:
	add	bl, -1
	or	bl, 64
	mov	r12d, 1
	mov	r15, rsi
	jmp	.LBB3_6
.LBB3_5:
	add	ebx, -33
	mov	dword ptr [rdx + rax], ebx
	shr	ebx, 8
	or	bl, 96
	lea	r15, [rsi + 4]
	movsxd	rcx, dword ptr [rsi]
	mov	byte ptr [rdi + rcx], bl
	mov	bl, byte ptr [rdx + rax]
	mov	r12d, 2
.LBB3_6:
	lea	rsi, [rsi + 4*r12]
	movsxd	rcx, dword ptr [r15]
	mov	byte ptr [rdi + rcx], bl
	mov	dword ptr [rdx + rax], 0
.LBB3_7:
	cmp	r11, r10
	je	.LBB3_25
# %bb.8:
	mov	rax, qword ptr [rbp + 40]
	xor	r13b, r9b
	xor	r13b, r14b
	xor	r13b, r8b
	movzx	ecx, r13b
	shl	rcx, 5
	mov	rax, qword ptr [rax + rcx]
	cmp	rax, r11
	je	.LBB3_26
# %bb.9:
	mov	cl, byte ptr [r11]
	mov	byte ptr [rax], cl
	mov	cl, byte ptr [r11 + 1]
	mov	byte ptr [rax + 1], cl
	mov	cl, byte ptr [r11 + 2]
	mov	byte ptr [rax + 2], cl
	mov	cl, byte ptr [r11 + 3]
	mov	byte ptr [rax + 3], cl
	movsx	ecx, byte ptr [r11]
	movsx	eax, byte ptr [r10]
	sub	ecx, eax
	movsx	r9d, byte ptr [r11 + 1]
	movsx	eax, byte ptr [r10 + 1]
	sub	r9d, eax
	movsx	edx, byte ptr [r11 + 2]
	movsx	eax, byte ptr [r10 + 2]
	sub	edx, eax
	movsx	r8d, byte ptr [r11 + 3]
	movsx	eax, byte ptr [r10 + 3]
	sub	r8d, eax
	lea	r15d, [rcx + 16]
	lea	eax, [r9 + 16]
	or	eax, r15d
	lea	r14d, [rdx + 16]
	lea	r10d, [r8 + 16]
	mov	ebx, r14d
	or	ebx, r10d
	or	ebx, eax
	cmp	ebx, 32
	jae	.LBB3_16
# %bb.10:
	lea	r11d, [rdx + 2]
	cmp	r11d, 3
	ja	.LBB3_13
# %bb.11:
	lea	eax, [rcx + 2]
	lea	ebx, [r9 + 2]
	or	ebx, eax
	and	ebx, -4
	or	ebx, r8d
	jne	.LBB3_13
# %bb.12:
	shl	ecx, 4
	add	ecx, 32
	lea	eax, [4*r9 + 8]
	or	eax, ecx
	or	eax, r11d
	or	al, -128
	movsxd	rcx, dword ptr [rsi]
	mov	byte ptr [rdi + rcx], al
	jmp	.LBB3_25
.LBB3_26:
	movsxd	rax, dword ptr [rsi]
	mov	byte ptr [rdi + rax], r13b
	jmp	.LBB3_25
.LBB3_16:
	test	ecx, ecx
	setne	al
	shl	al, 3
	test	r9d, r9d
	setne	bl
	shl	bl, 2
	or	bl, al
	test	edx, edx
	setne	al
	add	al, al
	or	al, bl
	test	r8d, r8d
	setne	bl
	or	bl, al
	or	bl, -16
	movsxd	rax, dword ptr [rsi]
	mov	byte ptr [rdi + rax], bl
	test	ecx, ecx
	je	.LBB3_17
# %bb.18:
	mov	al, byte ptr [r11]
	movsxd	rcx, dword ptr [rsi + 4]
	add	rsi, 8
	mov	byte ptr [rdi + rcx], al
	test	r9d, r9d
	je	.LBB3_21
.LBB3_20:
	mov	al, byte ptr [r11 + 1]
	movsxd	rcx, dword ptr [rsi]
	add	rsi, 4
	mov	byte ptr [rdi + rcx], al
.LBB3_21:
	test	edx, edx
	je	.LBB3_23
# %bb.22:
	mov	al, byte ptr [r11 + 2]
	movsxd	rcx, dword ptr [rsi]
	add	rsi, 4
	mov	byte ptr [rdi + rcx], al
.LBB3_23:
	test	r8d, r8d
	je	.LBB3_25
# %bb.24:
	mov	al, byte ptr [r11 + 3]
	movsxd	rcx, dword ptr [rsi]
	mov	byte ptr [rdi + rcx], al
	jmp	.LBB3_25
.LBB3_13:
	lea	eax, [r9 + 8]
	add	edx, 8
	or	eax, edx
	and	eax, -16
	or	eax, r8d
	je	.LBB3_14
# %bb.15:
	mov	eax, r15d
	shr	al
	or	al, -32
	movsxd	rcx, dword ptr [rsi]
	mov	byte ptr [rdi + rcx], al
	shl	r15d, 7
	lea	eax, [4*r9 + 64]
	or	eax, r15d
	mov	ecx, r14d
	shr	ecx, 3
	or	ecx, eax
	movsxd	rax, dword ptr [rsi + 4]
	mov	byte ptr [rdi + rax], cl
	shl	r14d, 7
	or	r10d, r14d
	movsxd	rax, dword ptr [rsi + 8]
	mov	byte ptr [rdi + rax], r10b
	jmp	.LBB3_25
.LBB3_17:
	add	rsi, 4
	test	r9d, r9d
	jne	.LBB3_20
	jmp	.LBB3_21
.LBB3_14:
	or	r15b, -64
	movsxd	rax, dword ptr [rsi]
	mov	byte ptr [rdi + rax], r15b
	shl	r9d, 4
	sub	r9d, -128
	or	edx, r9d
	movsxd	rax, dword ptr [rsi + 4]
	mov	byte ptr [rdi + rax], dl
.LBB3_25:
	mov	eax, 1
	lea	rsp, [rbp - 40]
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret
.Lfunc_end3:
	.size	qoi_pixel_encoder, .Lfunc_end3-qoi_pixel_encoder
                                        # -- End function
	.ident	"clang version 10.0.0-4ubuntu1 "
	.section	".note.GNU-stack","",@progbits
	.addrsig
C code
#ifndef QOI_ENCODER_
#define QOI_ENCODER_

#include <stddef.h>
#include "qoi.h"

void pixel_cpy(char *dst, char *src, size_t sz)
{
    while (sz--)
    {
        *dst++ = *src++;
    }
}

int qoi_pixel_encoder(
    char *data, int *cur, int *run,
    const int x, const int y,
    const int maxX, const int maxY,
    const char *px_prev, char *px,
    char **index) // [64][4]
{
    qoi_rgba_t px_ = {.rgba = {
                          .r = px[0],
                          .g = px[1],
                          .b = px[2],
                          .a = px[3],
                      }};

    if (px == px_prev)
    {
        *run++;
    }

    int last_pixel = x == maxX - 1 && y == (maxY - 1);
    if (*run > 0 && *run == 0x2020 || px != px_prev || last_pixel)
    {
        if (*run < 33)
        {
            *(data + *cur++) = QOI_RUN_8 | *run - 1;
        }
        else
        {
            *run -= 33;
            *(data + *cur++) = QOI_RUN_16 | *run >> 8;
            *(data + *cur++) = *run & 0xFF;
        }
        *run = 0;
    }

    if (px != px_prev)
    {
        int index_pos = QOI_COLOR_HASH(px_);
        if (index[index_pos * 4] == px)
        {
            *(data + *cur++) = QOI_INDEX | index_pos;
        }
        else
        {
            pixel_cpy(index[index_pos * 4], px, 4);
            int vr = px[0] - px_prev[0];
            int vg = px[1] - px_prev[1];
            int vb = px[2] - px_prev[2];
            int va = px[3] - px_prev[3];

            if (
                vr > -17 && vr < 16 &&
                vg > -17 && vg < 16 &&
                vb > -17 && vb < 16 &&
                va > -17 && va < 16)
            {
                if (
                    va == 0 &&
                    vr > -3 && vr < 2 &&
                    vg > -3 && vg < 2 &&
                    vb > -3 && vb < 2)
                {
                    *(data + *cur++) = QOI_DIFF_8 | (vr + 2) << 4 | (vg + 2) << 2 | (vb + 2);
                }
                else if (
                    va == 0 &&
                    vr > -17 && vr < 16 &&
                    vg > -9 && vg < 8 &&
                    vb > -9 && vb < 8)
                {
                    *(data + *cur++) = QOI_DIFF_16 | (vr + 16);
                    *(data + *cur++) = (vg + 8) << 4 | (vb + 8);
                }
                else
                {
                    *(data + *cur++) = QOI_DIFF_24 | (vr + 16) >> 1;
                    *(data + *cur++) = (vr + 16) << 7 | (vg + 16) << 2 | (vb + 16) >> 3;
                    *(data + *cur++) = (vb + 16) << 7 | (va + 16);
                }
            }
            else
            {
                *(data + *cur++) = QOI_COLOR | (vr ? 8 : 0) | (vg ? 4 : 0) | (vb ? 2 : 0) | (va ? 1 : 0);
                if (vr)
                {
                    *(data + *cur++) = px[0];
                }
                if (vg)
                {
                    *(data + *cur++) = px[1];
                }
                if (vb)
                {
                    *(data + *cur++) = px[2];
                }
                if (va)
                {
                    *(data + *cur++) = px[3];
                }
            }
        }
        px_prev = px;
    }
    return 1;
}
#endif

panic: runtime error: slice bounds out of range

This sounded like a cool project, so I tried it out!

go env:

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/media/data/go"
GORACE=""
GOROOT="/opt/go/latest"
GOTOOLDIR="/opt/go/latest/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build111986249=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

C: (xor128plus.c)

#include <stdint.h>

uint64_t XORShift128Plus(uint64_t *s) {
        uint64_t s1 = s[0];
        const uint64_t s0 = s[1];
        const uint64_t result = s0 + s1;
        s[0] = s0;
        s1 ^= s1 << 23; // a
        s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
        return result; 
}

Compiling this with

clang -O3 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -S xor128plus.c

yields (xor128plus.s):

        .text
        .file   "xor128plus.c"
        .globl  XORShift128Plus
        .align  16, 0x90
        .type   XORShift128Plus,@function
XORShift128Plus:                        # @XORShift128Plus
# BB#0:
        push    rbp
        mov     rbp, rsp
        and     rsp, -8
        mov     rcx, qword ptr [rdi]
        mov     rdx, qword ptr [rdi + 8]
        lea     rax, qword ptr [rdx + rcx]
        mov     qword ptr [rdi], rdx
        mov     rsi, rcx
        shl     rsi, 23
        xor     rsi, rcx
        mov     rcx, rsi
        xor     rcx, rdx
        shr     rsi, 18
        shr     rdx, 5
        xor     rdx, rcx
        xor     rdx, rsi
        mov     qword ptr [rdi + 8], rdx
        mov     rsp, rbp
        pop     rbp
        ret
.Ltmp0:
        .size   XORShift128Plus, .Ltmp0-XORShift128Plus


        .ident  "Debian clang version 3.5.0-10 (tags/RELEASE_350/final) (based on LLVM 3.5.0)"
        .section        ".note.GNU-stack","",@progbits

Accompanied by xor128plusg_amd64.go:

import "unsafe"

//go:noescape
func _XORShift128Plus(s unsafe.Pointer) uint64

func XORShift128Plus(s [2]uint64) uint64 {
    return _XORShift128Plus(unsafe.Pointer(&s))
}
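
For checking the generated assembly once the conversion works, a pure Go reference is handy. The sketch below is a direct transcription of the C function above; the name xorShift128Plus is an assumption for this example. It takes a pointer to the state because a [2]uint64 passed by value, as in the wrapper above, is copied, so the state update would not be visible to the caller.

```go
package main

import "fmt"

// xorShift128Plus is a pure Go transcription of the C function above.
// It advances the two-word state in place and returns the next value.
func xorShift128Plus(s *[2]uint64) uint64 {
	s1 := s[0]
	s0 := s[1]
	result := s0 + s1
	s[0] = s0
	s1 ^= s1 << 23                          // a
	s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5) // b, c
	return result
}

func main() {
	state := [2]uint64{1, 2}
	fmt.Println(xorShift128Plus(&state)) // 3
	fmt.Println(state)                   // [2 8388643]
}
```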

And finally running c2goasm -a -f xor128plus.s xor128plus_amd64.s yields:

Processing xor128plus.s
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
main.extractNamePart(0xc42000b790, 0x7, 0xe, 0xc420043510, 0x4a0731)
	/media/data/go/src/github.com/minio/c2goasm/subroutine.go:262 +0xf6
main.extractName(0xc42000b788, 0xf, 0xc42000b788, 0xf)
	/media/data/go/src/github.com/minio/c2goasm/subroutine.go:273 +0x145
main.splitOnGlobals(0xc4200a6c00, 0x21, 0x40, 0x7f88b0c56428, 0x1b, 0x7f88b0c575f8)
	/media/data/go/src/github.com/minio/c2goasm/subroutine.go:50 +0x167
main.segmentSource(0xc4200a6c00, 0x21, 0x40, 0xc40000000c, 0x0, 0x4)
	/media/data/go/src/github.com/minio/c2goasm/subroutine.go:64 +0x5d
main.process(0xc4200a6c00, 0x21, 0x40, 0xc42000b6c0, 0x13, 0x0, 0x0, 0xc42008bae0, 0xc420043e60, 0x4bc2bb)
	/media/data/go/src/github.com/minio/c2goasm/c2goasm.go:78 +0x80
main.main()
	/media/data/go/src/github.com/minio/c2goasm/c2goasm.go:264 +0x403
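
The arguments in the trace suggest a possible cause, although this is an inference and not confirmed against the c2goasm source. extractName received a string of length 0xf (15), which matches the length of "XORShift128Plus", and extractNamePart received a string of length 7 starting 8 bytes later, which matches "128Plus". That is consistent with the digits "128" being read as the length prefix of a C++ mangled name (as in __ZN14MultiplyAndAdd... from the README), followed by an attempt to slice 128 bytes out of "Plus". The sketch below shows a bounds-checked variant of that idea; extractNameSafe is a hypothetical helper, not c2goasm's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// extractNameSafe is a hypothetical helper (not part of c2goasm).
// For symbols of the form _ZN<length><name>..., it returns <name>.
// Any other symbol, including plain C names that happen to contain
// digits, is returned unchanged instead of being sliced out of range.
func extractNameSafe(sym string) string {
	trimmed := strings.TrimLeft(sym, "_")
	if !strings.HasPrefix(trimmed, "ZN") {
		return sym // not a mangled name: treat as a plain C symbol
	}
	rest := trimmed[2:]

	// Collect the decimal length prefix.
	end := 0
	for end < len(rest) && rest[end] >= '0' && rest[end] <= '9' {
		end++
	}
	n, err := strconv.Atoi(rest[:end])
	if err != nil || n <= 0 || n > len(rest)-end {
		return sym // missing or implausible length: fall back
	}
	return rest[end : end+n]
}

func main() {
	fmt.Println(extractNameSafe("__ZN14MultiplyAndAddEPfS1_S1_S1_")) // MultiplyAndAdd
	fmt.Println(extractNameSafe("XORShift128Plus"))                  // XORShift128Plus
}
```

If this reading is right, renaming the C function so that it contains no digits, or compiling it as C++ so that the symbol is mangled, might avoid the panic; neither workaround has been verified here.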

Q: c2goasm

Sorry for posting here, but c2goasm does not have issues enabled and this seems like the "main" project here.

I just found the c2goasm project by accident and am wondering whether it is capable of making the ffmpeg library Go-native?

GOBIN keeps being reported as not set

Hello,

I have set the GOBIN and GOPATH variables as follows:

GoBinandGoPath

Your repository keeps reporting an error about GOBIN.

I have run 'go test *_test.go' and 'go build *.go', then 'go install *.go'.

In addition to the GOBIN error, many other variables are reported as not set.

Thank you in advance for helping me fix these problems with your repository.

I have placed your repository here:

'C:\Users\doria\go\src\github.com\minio\c2goasm'

Have a nice new year,

Regards.

Azaretdodo.
