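A Quick Guide to Improving Go Performance

Original article: Simple techniques to optimise Go programs

I'm obsessed with performance. I find it hard to explain exactly why. Slow services and programs really frustrate me, and it seems I'm not alone in this.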

In A/B tests, we tried delaying the page in increments of 100 milliseconds and found that even very small delays would result in substantial and costly drops in revenue. - Greg Linden, Amazon.com

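In my experience, poor performance typically comes from one of two sources:

1. Operations that perform well at small scale but degrade as the number of users grows. These are usually O(N) or O(N^2) operations. While your user base is small they perform fine, and they are often written that way to get the product to market quickly. As the user base grows, though, you start to hit problems you never anticipated, and your service may fall over at runtime.

2. Many individual sources of small optimisation, AKA 'death by a thousand crufts'. (Translator's note: roughly the proverb that a single ant hole can bring down a thousand-mile dike.)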

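I've spent most of my career either doing data science in Python or building services in Go, and I have a lot of experience optimising the latter. The services I write in Go usually aren't the bottleneck: they are mostly IO-bound programs that talk to a database. The machine learning programs I work on, however, tend to be much more CPU-bound. When a Go program burns too much CPU and that starts to hurt, there are various strategies for mitigating it.

This post covers techniques that can significantly improve performance with very little effort. I'm deliberately ignoring techniques that require a large effort or a restructuring of the program.

Before you start

Before changing your program, invest time in establishing a baseline. Without one you're fumbling in the dark, with no way to tell whether your changes made a meaningful difference. Write benchmarks first, and grab profiles for use with pprof. In the best case this is a Go benchmark, which makes it easy to use pprof for detailed CPU and memory profiling. The benchcmp tool is also very helpful for comparing two benchmark runs.

If your code isn't easy to benchmark, you can profile it with runtime/pprof instead.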
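
As a rough illustration of the runtime/pprof route, here is a minimal sketch that records a CPU profile around a piece of work. The helper profileToFile, the workload sumMod and the file name cpu.prof are assumptions made up for this example; only the pprof.StartCPUProfile and pprof.StopCPUProfile calls come from the standard library.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// profileToFile runs work while recording a CPU profile to path.
// (Hypothetical helper for this sketch.)
func profileToFile(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	// Deferred calls run last-in first-out, so the profile is stopped
	// (and flushed) before the file is closed.
	defer pprof.StopCPUProfile()

	work()
	return nil
}

// sumMod is a hypothetical CPU-bound workload to profile.
func sumMod(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i % 7
	}
	return total
}

func main() {
	err := profileToFile("cpu.prof", func() {
		fmt.Println(sumMod(50000000))
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "profiling failed:", err)
		os.Exit(1)
	}
}
```

The resulting file can then be inspected with go tool pprof cpu.prof.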

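Let's get started.

Use sync.Pool to re-use previously allocated objects

sync.Pool implements a free list. It lets you re-use objects that have already been allocated instead of creating new ones, which can greatly reduce pressure on the garbage collector. The API is very simple: implement a New function that creates an instance of your object and returns a pointer to it. For example: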

var bufpool = sync.Pool{
    New: func() interface{} {
        buf := make([]byte, 512)
        return &buf
    }}

With this in place, call bufpool.Get to take an object from the pool, and bufpool.Put to return it once you're done.

// sync.Pool returns an interface{}: you must type-assert it to the
// underlying type before you use it.
bp := bufpool.Get().(*[]byte)
b := *bp
defer func() {
    *bp = b
    bufpool.Put(bp)
}()

// Now, go do interesting things with your byte buffer.
buf := bytes.NewBuffer(b)

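A few caveats. Before Go 1.13, the pool was emptied on every garbage collection, which could hurt programs that allocate heavily. As of 1.13, it seems that pooled objects are no longer all freed in a single GC.

You must zero out the fields of a struct before returning it to the pool with Put.

If you don't, you'll get back a "dirty" object containing data from its previous use. That is a serious security risk.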

type AuthenticationResponse struct {
    Token  string
    UserID string
}

rsp := authPool.Get().(*AuthenticationResponse)
defer authPool.Put(rsp)

// If we don't hit this if statement, we might return data from other users! 😱
if blah {
    rsp.UserID = "user-1"
    rsp.Token = "super-secret"
}

return rsp

The safe way to guarantee that you always zero the memory is to do it explicitly:

// reset resets all fields of the AuthenticationResponse before pooling it.
func (a *AuthenticationResponse) reset() {
    a.Token = ""
    a.UserID = ""
}

rsp := authPool.Get().(*AuthenticationResponse)
defer func() {
    rsp.reset()
    authPool.Put(rsp)
}()
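
To show the reset pattern end to end, here is a small self-contained sketch. The pool definition authPool and the handler describe are assumptions made up for this example (the fragments above do not show them); the point is only that a second caller never sees the first caller's token.

```go
package main

import (
	"fmt"
	"sync"
)

type AuthenticationResponse struct {
	Token  string
	UserID string
}

// reset clears every field before the object goes back to the pool.
func (a *AuthenticationResponse) reset() {
	a.Token = ""
	a.UserID = ""
}

// authPool is an assumed pool definition for this sketch.
var authPool = sync.Pool{
	New: func() interface{} { return new(AuthenticationResponse) },
}

// describe is a hypothetical handler that fills a pooled response only
// when the caller is authenticated.
func describe(authenticated bool, userID, token string) string {
	rsp := authPool.Get().(*AuthenticationResponse)
	defer func() {
		rsp.reset()
		authPool.Put(rsp)
	}()

	if authenticated {
		rsp.UserID = userID
		rsp.Token = token
	}
	// The return value is computed before the deferred reset and Put run,
	// so no reference to the pooled object escapes.
	return fmt.Sprintf("user=%q token=%q", rsp.UserID, rsp.Token)
}

func main() {
	fmt.Println(describe(true, "user-1", "super-secret"))
	fmt.Println(describe(false, "", ""))
}
```

Note that the sketch copies the data out of the pooled object rather than returning the pointer itself, since a pointer that has been handed back to the pool may be re-used by another goroutine.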

The one case where this isn't a problem is when you only ever read back exactly the memory you wrote, so stale data is never used. For example:

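var (
    r io.Reader
    w io.Writer
)

// Obtain a buffer from the pool and keep the pointer, so that Put
// returns the same pointer instead of allocating a new one.
bp := bufpool.Get().(*[]byte)
defer bufpool.Put(bp)
buf := *bp

// We only write to w exactly what we read from r, and no more. 😌
nr, err := r.Read(buf)
if err != nil && err != io.EOF {
    return err
}
if nr > 0 {
    if _, err := w.Write(buf[:nr]); err != nil {
        return err
    }
}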
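
For readers who want to run the idea, the following is a minimal, self-contained sketch of the same pattern. The function copyChunk and the helper echo are hypothetical names introduced for this example; the pool mirrors the bufpool definition shown earlier.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

var bufpool = sync.Pool{
	New: func() interface{} {
		buf := make([]byte, 512)
		return &buf
	},
}

// copyChunk reads once from r into a pooled buffer and writes exactly
// the bytes that were read to w. The buffer is never zeroed, which is
// safe here because stale bytes beyond nr are never written.
func copyChunk(w io.Writer, r io.Reader) (int, error) {
	bp := bufpool.Get().(*[]byte)
	defer bufpool.Put(bp)
	buf := *bp

	nr, rerr := r.Read(buf)
	if nr > 0 {
		if _, werr := w.Write(buf[:nr]); werr != nil {
			return 0, werr
		}
	}
	if rerr != nil && rerr != io.EOF {
		return nr, rerr
	}
	return nr, nil
}

// echo is a hypothetical helper that pushes a string through copyChunk.
func echo(s string) string {
	var out bytes.Buffer
	if _, err := copyChunk(&out, strings.NewReader(s)); err != nil {
		return ""
	}
	return out.String()
}

func main() {
	fmt.Println(echo("a longer first message"))
	fmt.Println(echo("short"))
}
```

Even if the second call happens to receive the buffer used by the first, only the freshly read bytes reach the writer.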

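Avoid using structures containing pointers as map keys for large maps

Plenty has been written about how large heaps affect the performance of Go programs. During a garbage collection, the runtime scans objects that contain pointers and marks them. If you have a very large map[string]int, the GC has to check every string in the map on every collection, because strings contain pointers.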

Translator's note (by chet): in the Go runtime source file runtime.h (present in older releases, when the runtime was written in C), a string is defined as follows:

struct String
{
    byte* str;
    int len;
};

In this example we create a map[string]int with 10 million elements and then print how long garbage collections take. We allocate the map at package scope to make sure it lives on the heap.

package main

import (
    "fmt"
    "runtime"
    "strconv"
    "time"
)

const (
    numElements = 10000000
)

var foo = map[string]int{}

func timeGC() {
    t := time.Now()
    runtime.GC()
    fmt.Printf("gc took: %s\n", time.Since(t))
}

func main() {
    for i := 0; i < numElements; i++ {
        foo[strconv.Itoa(i)] = i
    }

    for {
        timeGC()
        time.Sleep(1 * time.Second)
    }
}

Running this program gives the following output:
gc took: 98.726321ms
gc took: 105.524633ms
gc took: 102.829451ms
gc took: 102.71908ms
gc took: 103.084104ms
gc took: 104.821989ms

That's a long time in computer terms! 😰

What can we do to improve this? Removing the pointers is a good idea, so let's see how a map[int]int does.

package main

import (
    "fmt"
    "runtime"
    "time"
)

const (
    numElements = 10000000
)

var foo = map[int]int{}

func timeGC() {
    t := time.Now()
    runtime.GC()
    fmt.Printf("gc took: %s\n", time.Since(t))
}

func main() {
    for i := 0; i < numElements; i++ {
        foo[i] = i
    }

    for {
        timeGC()
        time.Sleep(1 * time.Second)
    }
}

Running this program gives the following output:
gc took: 3.608993ms
gc took: 3.926913ms
gc took: 3.955706ms
gc took: 4.063795ms
gc took: 3.91519ms
gc took: 3.75226ms

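Much better. We've cut GC time by about 96% (roughly 103ms down to about 3.9ms). In production you would need to convert the strings into integers, for example by hashing them, before inserting them into the map.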
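
One possible way to do that conversion, sketched under assumptions: hash each string key with FNV-1a from the standard library's hash/fnv package and use the resulting uint64 as the map key. The helper keyFor and the sample names are made up for this example. Be aware that two different strings can in principle hash to the same value, in which case their entries would be merged, and that the original strings cannot be recovered from the keys.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keyFor hashes s to a pointer-free uint64 key using FNV-1a.
// (Hypothetical helper; collisions are possible and not handled here.)
func keyFor(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error.
	return h.Sum64()
}

func main() {
	// The map now holds only integers, so the GC has no pointers to chase.
	counts := make(map[uint64]int)
	for _, name := range []string{"alice", "bob", "alice"} {
		counts[keyFor(name)]++
	}
	fmt.Println(counts[keyFor("alice")], counts[keyFor("bob")])
}
```

Whether the collision risk is acceptable depends on the application, so treat this as a starting point rather than a drop-in replacement.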

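There are many other ways to escape the GC. If you allocate giant pointer-free arrays, such as an [n]int array or a byte slice, the GC won't scan them, so you pay almost nothing in GC overhead. These techniques usually require substantial reworking of a program, though, so we won't go deeper into them here.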

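Code-generate marshalling code to avoid runtime reflection

Marshalling and unmarshalling structs to and from formats such as JSON is a common operation, especially when building microservices. In fact, you'll often find that serialisation is most of what a microservice actually does. Functions like json.Marshal and json.Unmarshal rely on runtime reflection to serialise struct fields to bytes and back. This can be slow: reflection is nowhere near as fast as explicit code.

However, it doesn't have to be this way. The mechanics of marshalling JSON look a little like this: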

package json

// Simplified sketch of the dispatch logic, not the real encoding/json source.

// Marshal takes an object and returns its representation in JSON.
func Marshal(obj interface{}) ([]byte, error) {
    // Check if this object knows how to marshal itself to JSON
    // by satisfying the Marshaler interface.
    if m, ok := obj.(Marshaler); ok {
        return m.MarshalJSON()
    }

    // It doesn't know how to marshal itself. Fall back to the default
    // reflection-based marshalling.
    return marshal(obj)
}

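If a struct knows how to marshal itself to JSON, we have a hook for avoiding runtime reflection. But we don't want to hand-write marshalling code for every struct, so what do we do? We get the computer to write it for us. The code generator easyjson looks at a struct and generates highly optimised marshalling code that is compatible with json.Marshaler.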
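
To make the hook concrete, here is a hand-written MarshalJSON of the general kind a generator might emit. This is only a sketch: the User type and the encode helper are invented for this example, it is not easyjson's actual output, and it uses strconv.AppendQuote for the string field, which is an assumption that only holds for simple names (Go's quoting rules differ from JSON's for some characters).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// User is a hypothetical type for this sketch.
type User struct {
	ID   int
	Name string
}

// MarshalJSON builds the JSON by hand, so json.Marshal does not need
// reflection to walk the struct's fields.
func (u User) MarshalJSON() ([]byte, error) {
	buf := make([]byte, 0, 32+len(u.Name))
	buf = append(buf, `{"id":`...)
	buf = strconv.AppendInt(buf, int64(u.ID), 10)
	buf = append(buf, `,"name":`...)
	// Assumption: Name is plain printable text, where Go quoting and
	// JSON quoting agree.
	buf = strconv.AppendQuote(buf, u.Name)
	buf = append(buf, '}')
	return buf, nil
}

// encode marshals u via json.Marshal, which finds and calls MarshalJSON.
func encode(u User) string {
	out, err := json.Marshal(u)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	fmt.Println(encode(User{ID: 1, Name: "gopher"}))
}
```

json.Marshal still validates the bytes returned by MarshalJSON, so a malformed hand-written encoder surfaces as an error rather than as corrupt output.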

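Download the package, then run the easyjson generator on file.go, the file containing your structs, to produce the marshalling code.

You should find that a file_easyjson.go file has been generated. Because easyjson has implemented the json.Marshaler interface for you, the generated functions are called instead of the reflection-based default. Congratulations: you've just sped up your JSON marshalling code by 3x. There are plenty of other things you can tune to improve performance further.

I recommend this package because I've used it before and had good results with it. Please don't take this as an invitation to start a debate with me about which JSON marshalling package is fastest.

Note that you need to regenerate the marshalling code whenever a struct changes. If you forget, newly added fields won't be serialised or deserialised, which can be very confusing. To keep things in sync, you can set up tooling that regenerates the marshalling code whenever the struct changes.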

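Use strings.Builder to build strings

Strings in Go are immutable: think of them as read-only slices of bytes. This means that every time you create a string you allocate new memory, and potentially create more work for the garbage collector.

strings.Builder was introduced in Go 1.10 as an efficient way to build strings. Internally it writes to a byte buffer, and only creates the string when you call String(). It relies on some unsafe trickery to return the underlying bytes as a string with no additional allocation. The original article links to a blog post with more detail on how this works.

Let's compare the performance of the two approaches: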

// main.go
package main

import "strings"

var strs = []string{
    "here's",
    "a",
    "some",
    "long",
    "list",
    "of",
    "strings",
    "for",
    "you",
}

func buildStrNaive() string {
    var s string

    for _, v := range strs {
        s += v
    }

    return s
}

func buildStrBuilder() string {
    b := strings.Builder{}

    // Grow the buffer to a decent length, so we don't have to continually
    // re-allocate.
    b.Grow(60)

    for _, v := range strs {
        b.WriteString(v)
    }

    return b.String()
}

On my MacBook I get the following results:
goos: darwin
goarch: amd64
pkg: github.com/sjwhitworth/perfblog/strbuild
BenchmarkStringBuildNaive-8 5000000 255 ns/op 216 B/op 8 allocs/op
BenchmarkStringBuildBuilder-8 20000000 54.9 ns/op 64 B/op 1 allocs/op

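As we can see, strings.Builder is about 4.6x faster, makes 1/8th as many allocations, and allocates less than a third of the memory (64 B versus 216 B).

Where performance matters, use strings.Builder. In general I recommend it for anything but the most trivial cases of string building.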
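
If you just want a quick look at the allocation counts without writing a benchmark file, one option is testing.AllocsPerRun from the standard library. The sketch below is an assumption-laden illustration rather than the benchmark that produced the numbers above: the names parts, sink, buildNaive, buildBuilder, allocsNaive and allocsBuilder are all introduced here for the example.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

var parts = []string{"here's", "a", "some", "long", "list", "of", "strings", "for", "you"}

// sink keeps results alive so the compiler cannot discard the work.
var sink string

// buildNaive concatenates with +=, creating a new string on each step.
func buildNaive() string {
	var s string
	for _, v := range parts {
		s += v
	}
	return s
}

// buildBuilder writes into one pre-grown buffer.
func buildBuilder() string {
	var b strings.Builder
	b.Grow(60)
	for _, v := range parts {
		b.WriteString(v)
	}
	return b.String()
}

// allocsNaive reports the average number of allocations per call.
func allocsNaive() float64 {
	return testing.AllocsPerRun(1000, func() { sink = buildNaive() })
}

// allocsBuilder reports the average number of allocations per call.
func allocsBuilder() float64 {
	return testing.AllocsPerRun(1000, func() { sink = buildBuilder() })
}

func main() {
	fmt.Println("naive allocs/op:  ", allocsNaive())
	fmt.Println("builder allocs/op:", allocsBuilder())
}
```

This only measures allocation counts, not time, so it complements a proper benchmark rather than replacing it.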

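Use strconv instead of fmt

fmt is one of the best-known packages in Go. You probably used it in your first Go program to print "hello, world". However, when it comes to converting integers and floats into strings, it's nowhere near as fast as its lower-level cousin strconv. That package gives you a big performance improvement for very small changes to your code.

fmt takes interface{} arguments. This has two downsides:
1. You lose type safety, which for me is the bigger problem.
2. It can increase the number of allocations. Passing a non-pointer value as an interface{} usually causes an extra heap allocation. The original article links to a blog post that explains why.

The program below shows the difference in performance: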

// main.go
package main

import (
    "fmt"
    "strconv"
)

func strconvFmt(a string, b int) string {
    return a + ":" + strconv.Itoa(b)
}

func fmtFmt(a string, b int) string {
    return fmt.Sprintf("%s:%d", a, b)
}

func main() {}

// main_test.go
package main

import (
    "testing"
)

var (
    a    = "boo"
    blah = 42
    box  = ""
)

func BenchmarkStrconv(b *testing.B) {
    for i := 0; i < b.N; i++ {
        box = strconvFmt(a, blah)
    }
    a = box
}

func BenchmarkFmt(b *testing.B) {
    for i := 0; i < b.N; i++ {
        box = fmtFmt(a, blah)
    }
    a = box
}

On my Mac I get the following results:
goos: darwin
goarch: amd64
pkg: github.com/sjwhitworth/perfblog/strfmt
BenchmarkStrconv-8 30000000 39.5 ns/op 32 B/op 1 allocs/op
BenchmarkFmt-8 10000000 143 ns/op 72 B/op 3 allocs/op

We can see that strconv is about 3.6x faster than fmt, makes a third as many allocations, and allocates less than half the memory.

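Specify capacity in make to avoid reallocation

Before we get to the performance improvements, let's quickly recap slices. The slice is a very useful construct in Go. It provides a resizable array and can take different views onto the same underlying memory. If you peek under the hood, a slice is made up of three elements: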

type slice struct {
    // pointer to underlying data in the slice.
    data uintptr
    // the number of elements in the slice.
    len int
    // the number of elements that the slice can 
    // grow to before a new underlying array
    // is allocated.
    cap int     
}

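What do these fields mean?

data: a pointer to the underlying data
len: the number of elements currently in the slice
cap: the number of elements the slice can hold before it has to be resized

Under the hood, a slice is a dynamic array. When an append would exceed cap, a new, larger array is allocated (roughly double the size for small slices), the old data is copied into it, and the old array is left for the garbage collector.

I often see code like the following, which allocates a slice with zero capacity even though the upper bound is known up front: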

var userIDs []string
for _, bar := range rsp.Users {
    userIDs = append(userIDs, bar.ID)
}

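In this example the slice starts with zero length and zero capacity. After receiving the response we append each user's ID to the slice. Given the slice mechanics recapped above, with 8 users this causes 4 allocations as the capacity grows through 1, 2, 4 and 8.

A much more efficient approach is: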

userIDs := make([]string, 0, len(rsp.Users))

for _, bar := range rsp.Users {
    userIDs = append(userIDs, bar.ID)
}

By passing the capacity to make explicitly, we allocate the slice at its final size up front, so the appends no longer trigger reallocation and copying.
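
The difference can be observed directly by watching cap change while appending. The following is a small sketch; countReallocs is a hypothetical helper written for this example, and the exact number of reallocations in the non-preallocated case depends on the runtime's growth policy, so the code prints it rather than assuming it.

```go
package main

import "fmt"

// countReallocs appends n IDs to a slice and counts how many times the
// capacity changes, which is a proxy for how often the backing array
// was reallocated. (Hypothetical helper for this sketch.)
func countReallocs(n int, preallocate bool) int {
	var ids []string
	if preallocate {
		ids = make([]string, 0, n)
	}

	reallocs := 0
	lastCap := cap(ids)
	for i := 0; i < n; i++ {
		ids = append(ids, "id")
		if cap(ids) != lastCap {
			reallocs++
			lastCap = cap(ids)
		}
	}
	return reallocs
}

func main() {
	fmt.Println("without capacity:", countReallocs(8, false))
	fmt.Println("with capacity:   ", countReallocs(8, true))
}
```

With the capacity given up front the count stays at zero, because append never has to grow the backing array.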

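If you don't know how much capacity to allocate because it is dynamic or only determined at runtime, use an estimate. I usually look at the sizes the slice reaches while the program is running, take a value around the 90th percentile, and hardcode it in the program.

This advice also applies to maps.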

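Use methods that allow you to pass byte slices

When using packages, look for methods that accept a byte slice: they usually give you more control over allocation.
time.Format and time.AppendFormat are a good example. time.Format returns a string. Under the hood it allocates a new byte slice and calls time.AppendFormat on it. time.AppendFormat takes a byte buffer, writes the formatted representation of the time into it, and returns the extended slice. This pattern is common in the standard library: see strconv.AppendFloat or bytes.NewBuffer.

Why does this improve performance? Because you can now pass in byte slices that you own, perhaps obtained from a sync.Pool, instead of allocating a new buffer every time. Alternatively, you can size the buffer appropriately at initialisation to avoid reallocation and copying as it grows.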
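
As an illustration of the append-style APIs, the sketch below formats a few records into one re-used buffer. The function appendEvent and the record layout are assumptions made up for this example; strconv.AppendInt and the AppendFormat method on time.Time are the standard library calls being demonstrated.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// appendEvent appends "<id> <RFC 3339 timestamp>" to buf and returns the
// extended slice. (Hypothetical helper for this sketch.)
func appendEvent(buf []byte, id int64, unixSec int64) []byte {
	buf = strconv.AppendInt(buf, id, 10)
	buf = append(buf, ' ')
	return time.Unix(unixSec, 0).UTC().AppendFormat(buf, time.RFC3339)
}

func main() {
	// One buffer, sized up front and re-used for every record.
	buf := make([]byte, 0, 64)
	for id := int64(1); id <= 3; id++ {
		buf = appendEvent(buf[:0], id, 0)
		fmt.Println(string(buf))
	}
}
```

Re-slicing with buf[:0] keeps the backing array and resets only the length, so as long as each record fits in the existing capacity the loop does not allocate a new buffer. In a server the buffer could equally come from a sync.Pool.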

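Summary

After reading this post you should be able to apply these techniques to your own code. Over time you'll develop a performance-aware way of thinking about Go programs, which will help a great deal with your designs.

Finally, treat these guidelines as advice rather than gospel. Practice is the only real test: measure for yourself.

Improving performance can be a very satisfying experience for an engineer: the problems are usually interesting and the results are immediately visible. But how much an optimisation is worth depends heavily on the situation. If your service takes 10ms to respond and the network round trip takes 90ms, halving your response time from 10ms to 5ms isn't worth much, because the total is still 95ms. Even getting from 10ms down to 1ms still leaves 91ms. In that situation you probably have more important things to optimise, so try looking at the problem from a different angle.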

References

If you want more detail, the further reading listed in the original article is a good source of inspiration.
