Storing UTF-8 Encoded Text with Strings

We talked about strings in the ownership chapter, but we'll look at them in more depth now. New programmers commonly get stuck on strings for a combination of three reasons: the language's propensity for exposing possible errors, strings being a more complicated data structure than many programmers give them credit for, and UTF-8. These factors combine in a way that can seem difficult when you're coming from other programming languages.

We discuss strings in the context of collections because strings are implemented as a collection of bytes, plus some methods to provide useful functionality when those bytes are interpreted as text. In this section, we'll talk about the operations on String that every collection type has, such as creating, updating, and reading. We'll also discuss the ways in which String is different from the other collections, namely how indexing into a String is complicated by the differences between how people and computers interpret String data.

What Is a String?

Oxide has only one string type in the core language: the string slice str, which is usually seen in its borrowed form, &str. In the ownership chapter, we talked about string slices, which are references to some UTF-8 encoded string data stored elsewhere. String literals, for example, are stored in the program's binary and are therefore string slices.

The String type, which is provided by the standard library rather than coded into the core language, is a growable, mutable, owned, UTF-8 encoded string type. When we refer to "strings" in Oxide, we might be referring to either the String or the string slice &str types, not just one of those types. Although this section is largely about String, both types are used heavily in the standard library, and both String and string slices are UTF-8 encoded.

Creating a New String

Many of the same operations available with Vec<T> are available with String as well because String is actually implemented as a wrapper around a vector of bytes with some extra guarantees, restrictions, and capabilities. An example of a function that works the same way with Vec<T> and String is the new function to create an instance:

fn main() {
    let s = String.new()
}

This line creates a new, empty string called s, into which we can then load data. Often, we'll have some initial data with which we want to start the string. For that, we use the toString method, which is available on any type that implements the Display trait, as string literals do:

fn main() {
    let data = "initial contents"
    let s = data.toString()

    // Or more directly:
    let s = "initial contents".toString()
}

This code creates a string containing initial contents.

We can also use the function String.from to create a String from a string literal:

fn main() {
    let s = String.from("initial contents")
}

Because strings are used for so many things, we can use many different APIs for strings, providing us with a lot of options. In this case, String.from and toString do the same thing, so which one you choose is a matter of style and readability.

Rust comparison: Oxide uses dot notation (String.new(), String.from()) instead of path notation (String::new(), String::from()). Also, Oxide uses toString() in camelCase instead of to_string().

fn main() {
// Rust
let s = String::new();
let s = "initial contents".to_string();
let s = String::from("initial contents");
}

Remember that strings are UTF-8 encoded, so we can include any properly encoded data in them:

fn main() {
    let hello = String.from("Hola")
    let hello = String.from("Hello")
    let hello = String.from("Здравствуйте")
    let hello = String.from("Bonjour")
    let hello = String.from("Hallo")
    let hello = String.from("Ciao")
    let hello = String.from("Olá")
}

All of these are valid String values.

String Interpolation

One of Oxide's most convenient features for working with strings is string interpolation. Instead of using the format! macro with placeholders, you can embed expressions directly in string literals using \(expression) syntax:

fn main() {
    let name = "Alice"
    let age = 30

    // String interpolation
    let greeting = "Hello, \(name)! You are \(age) years old."
    println!("\(greeting)")

    // Expressions work too
    let message = "Next year you'll be \(age + 1)."
    println!("\(message)")
}

This is much more readable than the equivalent code using format!:

fn main() {
    let name = "Alice"
    let age = 30

    // Using format! macro (also works)
    let greeting = format!("Hello, {}! You are {} years old.", name, age)
}

Rust comparison: Rust requires the format! macro for string formatting. Oxide's \(expr) syntax is inspired by Swift and provides a cleaner alternative.

fn main() {
// Rust
let name = "Alice";
let age = 30;
let greeting = format!("Hello, {}! You are {} years old.", name, age);
}

Updating a String

A String can grow in size and its contents can change, just like the contents of a Vec<T>, if you push more data into it. In addition, you can conveniently use the + operator or string interpolation to concatenate String values.

Appending with `pushStr` and `push`

We can grow a String by using the pushStr method to append a string slice:

fn main() {
    var s = String.from("foo")
    s.pushStr("bar")
    println!("\(s)")  // Prints: foobar
}

After these two lines, s will contain foobar. The pushStr method takes a string slice because we don't necessarily want to take ownership of the parameter. For example, in the following code, we want to be able to use s2 after appending its contents to s1:

fn main() {
    var s1 = String.from("foo")
    let s2 = "bar"
    s1.pushStr(s2)
    println!("s2 is \(s2)")  // s2 is still valid!
}

If the pushStr method took ownership of s2, we wouldn't be able to print its value on the last line. However, this code works as we'd expect!
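The same borrowing behavior can be sketched in Rust. This is Rust rather than Oxide, on the assumption (suggested by the Rust comparisons in this chapter) that pushStr maps onto Rust's push_str; the helper name append is hypothetical.

```rust
/// Hypothetical helper: appends `suffix` to `base` in place.
/// `suffix` is only borrowed, so the caller keeps ownership of it.
fn append(base: &mut String, suffix: &str) {
    base.push_str(suffix);
}

fn main() {
    let mut s1 = String::from("foo");
    let s2 = String::from("bar");

    // `&s2` is a `&String`, which Rust coerces to `&str` at the call site.
    append(&mut s1, &s2);

    // Both values are still usable after the call.
    println!("s1 = {}, s2 = {}", s1, s2);
}
```

Because the second parameter is a string slice, the same helper accepts string literals and borrowed String values alike.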

The push method takes a single character as a parameter and adds it to the String:

fn main() {
    var s = String.from("lo")
    s.push('l')
    println!("\(s)")  // Prints: lol
}

Rust comparison: Oxide uses camelCase method names: pushStr instead of push_str.

fn main() {
// Rust
let mut s = String::from("foo");
s.push_str("bar");
s.push('l');
}

Concatenating with `+` or String Interpolation

Often, you'll want to combine two existing strings. One way to do so is to use the + operator:

fn main() {
    let s1 = String.from("Hello, ")
    let s2 = String.from("world!")
    let s3 = s1 + &s2  // Note: s1 has been moved here and can no longer be used
}

The string s3 will contain Hello, world!. The reason s1 is no longer valid after the addition, and the reason we used a reference to s2, has to do with the signature of the method that's called when we use the + operator. The + operator uses the add method, whose signature looks something like this:

consuming fn add(s: &str): String

This means s1 will be moved into the add call and will no longer be valid after that. So, although let s3 = s1 + &s2 looks like it will copy both strings and create a new one, this statement actually takes ownership of s1, appends a copy of the contents of s2, and then returns ownership of the result.
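If the left operand is still needed afterwards, one option is to hand + a copy instead. The following is a minimal Rust sketch of that idea, assuming Oxide's + behaves like Rust's here; the helper name concat_keeping_left is hypothetical.

```rust
/// Hypothetical helper: concatenates with `+` without giving up the left
/// operand, at the cost of one extra allocation for the copy.
fn concat_keeping_left(left: &str, right: &str) -> String {
    left.to_owned() + right
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    // Only a copy of s1 is moved into `+`, so s1 itself stays valid.
    let kept = concat_keeping_left(&s1, &s2);
    println!("{} (s1 is still {:?})", kept, s1);

    // Here s1 itself is moved; using it after this line would not compile.
    let s3 = s1 + &s2;
    println!("{}", s3);
}
```

The copy is the price of keeping s1; when s1 is not needed again, the plain s1 + &s2 form avoids it.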

If we need to concatenate multiple strings, the behavior of the + operator gets unwieldy:

fn main() {
    let s1 = String.from("tic")
    let s2 = String.from("tac")
    let s3 = String.from("toe")

    let s = s1 + "-" + &s2 + "-" + &s3
}

At this point, s will be tic-tac-toe. With all of the + and " characters, it's difficult to see what's going on. For combining strings in more complicated ways, we can instead use string interpolation:

fn main() {
    let s1 = String.from("tic")
    let s2 = String.from("tac")
    let s3 = String.from("toe")

    let s = "\(s1)-\(s2)-\(s3)"
}

This code also sets s to tic-tac-toe. String interpolation is much easier to read, and unlike the + operator, it doesn't take ownership of any of its parameters because it uses references internally.
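For comparison, Rust's format! macro has the same non-consuming behavior. The sketch below is Rust, not Oxide, and the helper name dash_join is hypothetical.

```rust
/// Hypothetical helper: joins three borrowed strings with dashes.
/// `format!` only reads its arguments, so nothing is moved.
fn dash_join(a: &str, b: &str, c: &str) -> String {
    format!("{}-{}-{}", a, b, c)
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = dash_join(&s1, &s2, &s3);
    println!("{}", s);

    // All three inputs can still be read after building the result.
    println!("{} {} {}", s1, s2, s3);
}
```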

Indexing into Strings

In many other programming languages, accessing individual characters in a string by referencing them by index is a valid and common operation. However, if you try to access parts of a String using indexing syntax in Oxide, you'll get an error:

fn main() {
    let s1 = String.from("hello")
    let h = s1[0]  // Error! Strings cannot be indexed by integers
}

The error tells the story: Oxide strings don't support indexing. But why not? To answer that question, we need to discuss how Oxide stores strings in memory.

Internal Representation

A String is a wrapper over a Vec<UInt8>. Let's look at some of our properly encoded UTF-8 example strings. First, this one:

let hello = String.from("Hola")

In this case, len will be 4, which means the vector storing the string "Hola" is 4 bytes long. Each of these letters takes 1 byte when encoded in UTF-8. The following line, however, may surprise you (note that this string begins with the capital Cyrillic letter Ze, not the number 3):

let hello = String.from("Здравствуйте")  // Russian greeting

If you were asked how long the string is, you might say 12. In fact, Oxide's answer is 24: that's the number of bytes it takes to encode "Здравствуйте" in UTF-8, because each Unicode scalar value in that string takes 2 bytes of storage. Therefore, an index into the string's bytes will not always correlate to a valid Unicode scalar value.

Suppose indexing were allowed and we stored the result of hello[0] in a variable named answer. You already know that answer will not be З, the first letter. When encoded in UTF-8, the first byte of З is 208 and the second is 151, so it would seem that answer should in fact be 208, but 208 is not a valid character on its own. Returning 208 is likely not what a user would want if they asked for the first letter of this string; however, that's the only data that Oxide has at byte index 0.

The answer, then, is that to avoid returning an unexpected value and causing bugs that might not be discovered immediately, Oxide doesn't compile this code at all and prevents misunderstandings early in the development process.
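The numbers in this section can be checked with a short Rust sketch. It is written in Rust rather than Oxide, assuming both store strings as the same UTF-8 bytes; the helper names are hypothetical.

```rust
/// Number of bytes in the UTF-8 encoding (what `len` reports).
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values; this has to scan the whole string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// First raw byte of the encoding, if the string is non-empty.
fn first_byte(s: &str) -> Option<u8> {
    s.as_bytes().first().copied()
}

/// First Unicode scalar value, if the string is non-empty.
fn first_scalar(s: &str) -> Option<char> {
    s.chars().next()
}

fn main() {
    for word in ["Hola", "Здравствуйте"] {
        println!(
            "{}: {} bytes, {} scalar values, first byte {:?}, first scalar {:?}",
            word,
            byte_len(word),
            scalar_count(word),
            first_byte(word),
            first_scalar(word)
        );
    }
}
```

Asking explicitly for a byte or for a scalar value removes the ambiguity that a bare index would have.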

Bytes, Scalar Values, and Grapheme Clusters

Another point about UTF-8 is that there are actually three relevant ways to look at strings from Oxide's perspective: as bytes, scalar values, and grapheme clusters (the closest thing to what we would call letters).

If we look at the Hindi word "नमस्ते" (namaste) written in the Devanagari script, it is stored as a vector of UInt8 values that looks like this:

[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]

That's 18 bytes and is how computers ultimately store this data. If we look at them as Unicode scalar values, which are what Oxide's Char type is, those bytes look like this:

['न', 'म', 'स', '्', 'त', 'े']

There are six Char values here, but the fourth and sixth are not letters: they're diacritics that don't make sense on their own. Finally, if we look at them as grapheme clusters, we'd get what a person would call the four letters that make up the Hindi word: "न", "म", "स्", and "ते".
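The byte and scalar-value views can be reproduced with the following Rust sketch (Rust rather than Oxide; the helper names are hypothetical). The grapheme-cluster view is left out because Rust's standard library does not provide it.

```rust
/// The raw UTF-8 bytes of a string, collected into a vector.
fn byte_view(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

/// The Unicode scalar values of a string, collected into a vector.
fn scalar_view(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let word = "नमस्ते";

    let bytes = byte_view(word);
    let scalars = scalar_view(word);

    println!("{} bytes: {:?}", bytes.len(), bytes);
    println!("{} scalar values: {:?}", scalars.len(), scalars);
}
```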

Oxide provides different ways of interpreting the raw string data that computers store so that each program can choose the interpretation it needs, no matter what human language the data is in.

A final reason Oxide doesn't allow us to index into a String to get a character is that indexing operations are expected to always take constant time (O(1)). But it isn't possible to guarantee that performance with a String, because Oxide would have to walk through the contents from the beginning to the index to determine how many valid characters there were.
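To make the cost concrete, here is a minimal Rust sketch of what finding the nth character involves. It assumes Oxide strings have the same variable-width layout as Rust's; the helper name byte_offset_of_nth_char is hypothetical.

```rust
/// Hypothetical helper: returns the byte offset at which the scalar value
/// with zero-based position `n` starts, or `None` if the string is too short.
/// The iterator decodes every earlier character, so the cost grows with `n`.
fn byte_offset_of_nth_char(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(offset, _)| offset)
}

fn main() {
    // In ASCII text, character position and byte offset coincide.
    println!("{:?}", byte_offset_of_nth_char("hello", 3));

    // In 2-byte Cyrillic text, they do not.
    println!("{:?}", byte_offset_of_nth_char("Здравствуйте", 3));
}
```

Since the offset depends on the width of every preceding character, there is no way to jump straight to it.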

Slicing Strings

Indexing into a string is often a bad idea because it's not clear what the return type of the string-indexing operation should be: a byte value, a character, a grapheme cluster, or a string slice. If you really need to use indices to create string slices, Oxide asks you to be more specific.

Rather than indexing using [] with a single number, you can use [] with a range to create a string slice containing particular bytes:

let hello = "Здравствуйте"  // Russian greeting in Cyrillic

let s = &hello[0..4]

Here, s will be a &str that contains the first 4 bytes of the string. Earlier, we mentioned that each of these characters was 2 bytes, which means s will be the first two Cyrillic characters.

If we were to try to slice only part of a character's bytes with something like &hello[0..1], Oxide would panic at runtime in the same way as if an invalid index were accessed in a vector:

thread 'main' panicked at 'byte index 1 is not a char boundary'

You should use caution when creating string slices with ranges, because doing so can crash your program.
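In Rust, a non-panicking alternative is the get method on str, which returns None when a range is out of bounds or does not fall on character boundaries. Whether Oxide exposes the same method under the same name is an assumption; the sketch below is plain Rust, and the helper name safe_prefix is hypothetical.

```rust
/// Hypothetical helper: returns the first `end` bytes as a string slice,
/// or `None` if `end` is out of bounds or splits a character.
fn safe_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(0..end)
}

fn main() {
    let hello = "Здравствуйте";

    // Byte 4 sits between two characters, so the slice is valid.
    println!("{:?}", safe_prefix(hello, 4));

    // Byte 1 is in the middle of the first character: None, not a panic.
    println!("{:?}", safe_prefix(hello, 1));

    // Boundaries can also be tested directly before slicing.
    println!("{}", hello.is_char_boundary(1));
    println!("{}", hello.is_char_boundary(2));
}
```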

Iterating Over Strings

The best way to operate on pieces of strings is to be explicit about whether you want characters or bytes. For individual Unicode scalar values, use the chars method. Calling chars on "Hello" separates out and returns five values of type Char, and you can iterate over the result to access each element:

fn main() {
    for c in "Hello".chars() {
        println!("\(c)")
    }
}

This code will print:

H
e
l
l
o

Alternatively, the bytes method returns each raw byte, which might be appropriate for your domain:

fn main() {
    for b in "Hello".bytes() {
        println!("\(b)")
    }
}

This code will print the bytes that make up this string:

72
101
108
108
111

But be sure to remember that valid Unicode scalar values may be made up of more than 1 byte.
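As a rough illustration in Rust (not Oxide), the width of each scalar value can be inspected with char::len_utf8. The helper name bytes_per_char is hypothetical.

```rust
/// Hypothetical helper: pairs each scalar value with the number of bytes
/// its UTF-8 encoding occupies (always between 1 and 4).
fn bytes_per_char(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    // An ASCII letter, a Cyrillic letter, the euro sign, and an emoji.
    for (c, width) in bytes_per_char("aЗ€😀") {
        println!("{:?} takes {} byte(s)", c, width);
    }
}
```

Summing the widths gives the same value that len reports for the whole string.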

Getting grapheme clusters from strings is complex, so this functionality is not provided by the standard library. Crates are available on crates.io if this is the functionality you need.

Common String Methods

Here are some commonly used methods on String and &str:

fn main() {
    let s = String.from("Hello, World!")

    // Check if empty
    let empty = s.isEmpty()  // false

    // Get length in bytes
    let len = s.len()  // 13

    // Check if string contains a substring
    let hasWorld = s.contains("World")  // true

    // Replace occurrences
    let replaced = s.replace("World", "Oxide")  // "Hello, Oxide!"

    // Convert to uppercase/lowercase
    let upper = s.toUppercase()  // "HELLO, WORLD!"
    let lower = s.toLowercase()  // "hello, world!"

    // Trim whitespace
    let padded = "  hello  "
    let trimmed = padded.trim()  // "hello"

    // Split into parts
    let csv = "a,b,c"
    for part in csv.split(',') {
        println!("\(part)")
    }
}

Rust comparison: Method names use camelCase in Oxide: isEmpty instead of is_empty, toUppercase instead of to_uppercase.

fn main() {
// Rust
let s = String::from("Hello, World!");
let empty = s.is_empty();
let upper = s.to_uppercase();
let lower = s.to_lowercase();
}

Strings Are Not So Simple

To summarize, strings are complicated. Different programming languages make different choices about how to present this complexity to the programmer. Oxide has chosen to make the correct handling of String data the default behavior for all Oxide programs, which means programmers have to put more thought into handling UTF-8 data up front. This trade-off exposes more of the complexity of strings than is apparent in other programming languages, but it prevents you from having to handle errors involving non-ASCII characters later in your development life cycle.

The good news is that the standard library offers a lot of functionality built off the String and &str types to help handle these complex situations correctly. Be sure to check out the documentation for useful methods like contains for searching in a string and replace for substituting parts of a string with another string.

Let's switch to something a bit less complex: hash maps!

The Oxide Programming Language