美文网首页
从C++和OCaml程序员的视角看Rust (Part 2)

从C++和OCaml程序员的视角看Rust (Part 2)

作者: 朱小虎XiaohuZhu | 来源:发表于2019-06-04 15:38 被阅读0次

    本文来源地址

    SEPTEMBER 29, 2017

    by GUILLAUME ENDIGNOUX
    @GEndignoux

    This post is the second of my series about Rust compared to other languages. After a first post about the basics of the language, I’ll now discuss memory management, a cornerstone of Rust’s safety guarantees. You will hopefully get a clear idea about how moves, borrows and other lifetimes work!

    Despite the title, in this post, I will mostly compare Rust to C++, because there isn’t much memory management to do in OCaml, which is garbage-collected.


    Allocation and memory access

    Memory management is a core aspect of programming. The fundamental reason is that most complex algorithms require a variable amount of memory, depending on their input. Therefore, we obtain such memory at runtime with allocators. Most of the time, we also want to deallocate this memory once we are done using it, to avoid exhausting memory resources (otherwise we have a memory leak). But we must make sure that we do not access the memory again after deallocation, because this memory may have been reallocated for other unrelated objects (use-after-free bug).

    Several strategies have been implemented to manage memory resources.

    • In C, memory management is manual with the malloc() and free() functions. This gives full control to the programmer, but the code is tedious and repetitive, and a single mistake can lead to (security) bugs (memory leak, use-after-free, double-free, etc.).
    • In C++, the <abbr title="Resource acquisition is initialization" style="box-sizing: border-box; margin: 0px; padding: 0px; border-bottom: 1px dotted; cursor: help;">RAII</abbr> paradigm automates memory (and more generally resources) management with constructors and destructors. If destructors are properly implemented, the resources acquired by a variable are released as soon as the variable goes out-of-scope. The destructing code is automatically added by the compiler, and this ensures that neither memory leaks nor use-after-freehappen.
    • In most modern languages (e.g. Java, Python, OCaml), a garbage collector (<abbr title="Garbage collector" style="box-sizing: border-box; margin: 0px; padding: 0px; border-bottom: 1px dotted; cursor: help;">GC</abbr>) runs from time to time and looks for unreachable memory by traversing all allocated memory.

    Although garbage collection avoids pitfalls of manual memory management and is easy to use for the programmer (there is essentially nothing to do), it introduces some overhead (due to memory traversal) and unpredictability. There is no way to know when the <abbr title="Garbage collector" style="box-sizing: border-box; margin: 0px; padding: 0px; border-bottom: 1px dotted; cursor: help;">GC</abbr> will run. But every time it runs there is some delay before the program resumes, which can be problematic for highly-interactive scenarios. Also, <abbr title="Garbage collector" style="box-sizing: border-box; margin: 0px; padding: 0px; border-bottom: 1px dotted; cursor: help;">GC</abbr> only cares about memory, so other resources (e.g. files, network connections) must be managed by other means, e.g. back with the manual strategy, or with finalizers.

    Note for the curious: In OCaml, the native integer type has one bit reserved by the <abbr title="Garbage collector" style="box-sizing: border-box; margin: 0px; padding: 0px; border-bottom: 1px dotted; cursor: help;">GC</abbr>, to avoid wrapping the value into a box. So int is a 63-bit integer (on 64-bit architectures), i.e. in range <math><semantics><annotation encoding="application/x-tex">[-2^{62}, 2^{62}-1]</annotation></semantics></math>[−262,262−1].

    Another problem is unsafe access to memory. In C (and C++), because you have pointers with unrestricted arithmetic operations, you can (accidentally) access unallocated memory. This often boils down to access an array index i larger than the array’s length l, i.e. an out-of-bound access (or buffer overflow). This can be avoided at the cost of a comparison i < l, as implemented in C++’s std::vector::at() function. In OCaml, there are no pointers, and array access always checks bounds, so you don’t have these safety problems.

    In Rust, the <abbr title="Resource acquisition is initialization" style="box-sizing: border-box; margin: 0px; padding: 0px; border-bottom: 1px dotted; cursor: help;">RAII</abbr> strategy of allocation is chosen for its predictability and performance properties, and its safety is enhanced compared to C++. There is no direct access to pointers (unless you explicitly write unsafe code), and we will now see that safe access is enforced by the compiler thanks to more static checks!

    Ownership

    Rust goes beyond <abbr title="Resource acquisition is initialization" style="box-sizing: border-box; margin: 0px; padding: 0px; border-bottom: 1px dotted; cursor: help;">RAII</abbr>, by introducing ownership rules that are enforced statically at compile time.

    Some objects are not copyable, which means that only one variable can have ownership of a given object. This is, for example, the case of heap-allocated objects (the Box type), that implement deallocation in their destructor. Indeed, if such an object could be copied verbatim, then each copy would call the deallocation and we would obtain a double-free bug.

    Instead, ownership can be transferred via a move. This is similar to std::move() in C++11, but in Rust it’s automatic and the compiler prevents access to the moved-from variable.

    // C++14
    auto a = std::make_unique<int>(1);
    // The compiler forbids a copy "b = a", we need to move explicitly.
    auto b = std::move(a);
    std::cout << "b = " << *b << std::endl;
    // The following is accepted by the compiler, despite "a" being in an unspecified state at runtime.
    std::cout << "a = " << *a << std::endl;
    
    
    // Rust
    let a = Box::new(1);
    // The box is moved from "a" to "b" and ownership is transfered.
    let b = a;
    println!("b = {}", b);
    // This does not compile because "a" has lost ownership.
    println!("a = {}", a);
    // error[E0382]: use of moved value: `a`
    //  --> src/main.rs:7:24
    //   |
    // 4 |     let b = a;
    //   |         - value moved here
    // ...
    // 7 |     println!("a = {}", a);
    //   |                        ^ value used here after move
    //   |
    //   = note: move occurs because `a` has type `std::boxed::Box<i32>`, which does not implement the `Copy` trait
    
    

    Although transferring ownership is safe, we often want to have several variables pointing to the same object, in particular for function calls. Typically, the calling function wants to give access to some data to the called function (via an argument), but needs to access it after the call, so transferring ownership to the called function does not work.

    For this purpose, Rust introduces a concept of references, that borrow objects from their owner. Similarly to C/C++, a reference can be obtained with the & operator, and the corresponding reference type is &T.

    // Type annotations are optional, written here for clarity.
    let a : i32 = 1;
    let b : &i32 = &a;
    println!("a = {}", a);
    println!("b = {}", b);
    
    

    Borrows and mutability

    By default, Rust variables are immutable. If you want to borrow a variable to modify it, you need to use a mutable borrow, as follows.

    // The original variable must be mutable to obtain a mutable borrow
    let mut a = 1;
    {
        let b = &mut a; // Mutable borrow
        *b += 1; // Use "*" to dereference
    }
    println!("a = {}", a);
    // a = 2
    
    

    Here, you can notice that we used a temporary scope to operate on b. This is because Rust has strict rules about borrows, enforced by the so-called borrow checker. These rules can be formalized as follows.

    • (Paraphrased from the Rust book) You may have one or the other of these two kinds of borrows, but not both at the same time:
      • one or more immutable references (&T) to a resource,
      • exactly one mutable reference (&mut T).
    • When a variable is mutably borrowed, it cannot be read. Likewise, when a variable is (mutably or immutably) borrowed, it cannot be modified.

    To summarize, as long as a variable is mutably borrowed, nothing else can access this variable. This rules may seem strict but are here to avoid race-conditions. Indeed, « threads without data races » is one of the main goals of Rust.

    But these rules also avoid data races in single-threaded programs! For example, in C++ you can obtain a reference to a vector cell, and then resize this vector, obtaining a dangling reference…

    // C++
    std::vector<int> v(10, 0); // Vector of length 10.
    int& x = v[9];
    v.resize(5); // The resize invalidates reference "x".
    // Undefined behavior.
    std::cout << "x = " << x << std::endl;
    
    

    In Rust, this would violate the third rule: v cannot be modified as long as reference x is in scope. This example might seem obvious, but if you hide references behind iterators and wrap this code in a more complex program, it becomes hard to debug in practice!

    Warning: Despite the name and the type syntax, references in Rust really are safe pointers. Contrary to C++ references, they can be rebound to another object, as long as they are declared mutable (mut). Note that this is orthogonal to a mutable borrow, for which the referred-to object is mutable.

    let a = 1;
    let b = 2;
    let mut r = &a;
    println!("r = {}", r);
    r = &b;
    println!("r = {}", r);
    // r = 1
    // r = 2
    
    

    Lifetimes

    Beyond the basic rules on mutability and borrows, Rust introduces the concept of lifetimes, to avoid dangling references. A typical example of dangling reference is to return a reference to a temporary.

    fn bad() -> &i32 {
        let a : i32 = 1;
        return &a;
    }
    // error[E0106]: missing lifetime specifier
    //  --> src/main.rs:1:13
    //   |
    // 1 | fn bad() -> &i32 {
    //   |             ^ expected lifetime parameter
    //   |
    
    

    This does not compile, and Rust requests a « lifetime specifier » for the return type. Returning a reference is not forbidden in itself, and there are indeed valid reasons to do it. For example, you may want to access an array cell or some internal data in an object, to modify it or to read it without an expensive copy.

    Let’s take a more concrete example with a Point structure over integers, and define a getx() function that returns a reference to the x coordinate (i.e. takes an argument of type &Point and returns a &i32). The returned reference to the x coordinate will logically have the same lifetime as the Point parameter because x is part of the point. But we don’t know exactly what this lifetime is, and in fact, this varies for each call to the function getx(), depending on the parameter. So we use a generic lifetime 'a (or any word preceded by an apostrophe) and write the function as follows.

    #[derive(Debug)]
    struct Point { x: i32, y: i32 }
    
    fn getx<'a>(p : &'a Point) -> &'a i32 {
        return &p.x;
    }
    
    fn main() {
        let p = Point {x: 1, y: 2};
        let px = getx(&p);
        println!("p.x = {}", px);
        println!("p = {:?}", p);
    }
    // p.x = 1
    // p = Point { x: 1, y: 2 }
    
    

    Here, the diamond syntax getx<'a> means that getx() is parametrized by a generic lifetime 'a. This syntax is similar to templates in C++. Then, &'a Point represents a reference to Point which has lifetime 'a, and the result &'a i32 is a reference to i32 which has the same lifetime 'a.

    In fact, I lied a bit to you because in practice you don’t need to explicit lifetimes for simple functions such as getx(). This is due to lifetime elision rules, that allows the programmer to remove lifetime annotations in non-ambiguous cases, where the compiler can figure them out itself. For example, if there is only one reference parameter, a reference result can only come from this parameter (reference to temporaries have an invalid lifetime), so the compiler infers that they share a common lifetime 'a. So you can rewrite getx() as follows.

    fn getx(p : &Point) -> &i32 {
        return &p.x;
    }
    
    

    Lifetimes in structures

    Thanks to elision rules, you probably won’t have to write lifetimes in functions, unless you write complex stuff. However, a common case where you won’t avoid lifetimes is if you use references in structures.

    To give a more concrete example, let’s consider a common exercise in functional programming: writing a linked list. Typically, a list is either empty or contains a pointer to the first node. In turn, each node contains a value and a pointer to the remaining of the list, which is itself a list. Let’s try to formalize this.

    enum List { Nil(), Next(Node) }
    struct Node { value: i32, next: List }
    // error[E0072]: recursive type `List` has infinite size
    //  --> src/main.rs:1:1
    //   |
    // 1 | enum List { Nil(), Next(Node) }
    //   | ^^^^^^^^^               ----- recursive without indirection
    //   | |
    //   | recursive type has infinite size
    //   |
    //   = help: insert indirection (e.g., a `Box`, `Rc`, or `&`) at some point to make `List` representable
    
    

    As we can see, if we don’t use any kind of indirection and put the next node directly inside the current node, we obtain a recursive structure and the compiler detects that it is not representable. Instead, let’s try to use a reference to the next node.

    enum List { Nil(), Next(&Node) }
    // error[E0106]: missing lifetime specifier
    //  --> src/main.rs:1:25
    //   |
    // 1 | enum List { Nil(), Next(&Node) }
    //   |                         ^ expected lifetime parameter
    
    

    Now, we are back to lifetimes. You cannot write a reference inside a structure (or enumeration) without specifying its lifetime.This makes sense because such a reference points to another object, which must be alive at least as long as the current object exists (otherwise we have a dangling reference).

    So let’s try with a generic lifetime 'a.

    enum List { Nil(), Next(&'a Node) }
    // error[E0261]: use of undeclared lifetime name `'a`
    //  --> src/main.rs:1:26
    //   |
    // 1 | enum List { Nil(), Next(&'a Node) }
    //   |                          ^^ undeclared lifetime
    
    

    This lifetime must be declared. For this, we add it to the List enumeration, using the diamond notation. The following means that the List has some lifetime 'a, and that the reference to the next Node must have at least this lifetime 'a.

    enum List<'a> { Nil(), Next(&'a Node) }
    
    

    Now, we rewrite the Node structure to also declare and use the lifetime 'a.

    struct Node<'a> { value: i32, next: List<'a> }
    // error[E0106]: missing lifetime specifier
    //  --> src/main.rs:1:33
    //   |
    // 1 | enum List<'a> { Nil(), Next(&'a Node) }
    //   |                                 ^^^^ expected lifetime parameter
    
    

    By doing this, we also need to propagate the lifetime to the Node instance in the List definition. The following code finally works and implements our linked list!

    #[derive(Debug)]
    enum List<'a> { Nil(), Next(&'a Node<'a>) }
    #[derive(Debug)]
    struct Node<'a> { value: i32, next: List<'a> }
    
    fn main() {
        let n1 = Node {value: 1, next: List::Nil()};
        let n2 = Node {value: 2, next: List::Next(&n1)};
        println!("node = {:?}", n2);
    }
    // node = Node { value: 2, next: Next(Node { value: 1, next: Nil }) }
    
    

    Now, let’s see if the borrow checker really checks lifetimes. In the following example, we replace the next node of n2 by a node n3which does not live long enough, due to a temporary scope. We can verify that the borrow checker complains!

    fn main() {
        let n1 = Node {value: 1, next: List::Nil()};
        let mut n2 = Node {value: 2, next: List::Next(&n1)};
        {
            let n3 = Node {value: 3, next: List::Nil()};
            n2.next = List::Next(&n3);
        }
        println!("node = {:?}", n2);
    }
    // error[E0597]: `n3` does not live long enough
    //   --> src/main.rs:12:5
    //    |
    // 11 |         n2.next = List::Next(&n3);
    //    |                               -- borrow occurs here
    // 12 |     }
    //    |     ^ `n3` dropped here while still borrowed
    // 13 |     println!("node = {:?}", n2);
    // 14 | }
    //    | - borrowed value needs to live until here
    
    

    Compare this to pointers in C/C++, for which the compiler does not complain about such unsafe code!

    // DO NOT DO THIS!
    struct Node {
        Node(int v) : value(v), next(nullptr) {}
        Node(int v, Node* n) : value(v), next(n) {}
    
        int value;
        Node* next;
    }
    
    void print(Node n);
    
    int main() {
        Node n1(1);
        Node n2(2, &n1);
        {
            Node n3(3);
            n2.next = &n3;
        }
        // n2.next is now invalid but the compiler does not complain...
        print(n2);
    }
    
    

    In summary, the borrow checker verifies mutable borrows and lifetimes, to ensure safe access to all references. There are more interesting and powerful features about lifetimes, such as bounds and coercion, but this post is already long enough so I invite the curious reader to have a look at them in the Rust book.

    Conclusion

    Memory management is a ubiquitous programming problem, and there is no simple solution. You can hide this complexity with a simple API (malloc/free) or an invisible garbage collector, but with complex programs, you can end up with security problems or performance issues. Instead, Rust chose to expose powerful abstractions that are not easy to understand at first glance but provide compile-time safety without compromising on performance.

    There is still a lot to say about Rust, so get ready for my next post in this series, and more Rust projects! In the meantime, you can find me at RustFest this weekend.

    相关文章

      网友评论

          本文标题:从C++和OCaml程序员的视角看Rust (Part 2)

          本文链接:https://www.haomeiwen.com/subject/uxxrxctx.html