Book Review: Elasticsearch in Ac

Author: 马文Marvin | Published 2018-09-13 22:55 | Read 50 times

    Authors: Radu Gheorghe / Matthew Lee Hinman / Roy Russo
    Publisher: Manning
    Published: 2014
    Source: downloaded PDF version
    Goodreads: 4.2 (46 ratings)
    Douban: 8.0 (11 ratings)

    There was no book to teach us that, so we had to learn the hard way: lots of experiments, lots of questions and answers to the mailing list. The upside was that I got to know a lot of nice people that posted there regularly. This is how I came to work at Sematext, where I could concentrate on Elasticsearch full-time, and this is why Manning asked me if I would be interested in writing about Elasticsearch.
    Of course I was. They warned me it was hard work, but told me that Lee Hinman was also interested, so we joined forces. With two authors, we thought it was going to be easy, especially as Lee and I really clicked and provided useful feedback to one another. Little did we know that it’s much easier to present features in the early chapters than to combine those features into best practices for various use cases in later chapters. Then, with feedback from our reviewers, we found that it’s even more work to fit everything together, so our pace became slower and slower. That’s when Roy Russo joined us and helped with that final push.
    After two and a half years of early mornings, late nights, and weekends, I can finally say we’re done. It was a tough experience, but a rich one as well. I would surely have loved to have this book in my hands four years ago, and I hope you’ll enjoy it, too.

    By default, the algorithm used to calculate a document’s relevancy score is TF-IDF. We’ll discuss scoring and TF-IDF more in chapters 4 and 6, which are about searching and relevancy, but here’s the basic idea: TF-IDF stands for term frequency–inverse document frequency, which are the two factors that influence relevancy score.
    ■ Term frequency—The more times the words you’re looking for appear in a document, the higher the score.
    ■ Inverse document frequency—The weight of each word is higher if the word is uncommon across other documents.
    For example, if you’re looking for “bicycle race” on a cyclist’s blog, the word “bicycle” counts much less for the score than “race.” But the more times both words appear in a document, the higher that document’s score. In addition to choosing an algorithm, Elasticsearch provides many other built-in features to influence the relevancy score to suit your needs.
    For example, you can “boost” the score of a particular field, such as the title of a post, to be more important than the body. This gives higher scores to documents that match your search criteria in the title, compared to similar documents that match only the body. You can make exact matches count more than partial matches, and you can even use a script to add custom criteria to the way the score is calculated. For example, if you let users like posts, you can boost the score based on the number of likes, or you can make newer posts have higher scores than similar, older posts.
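
To make the two factors above concrete, here is a minimal sketch of a textbook-style TF-IDF score with a field boost. It is not the exact formula Elasticsearch or Lucene uses; the smoothing in `idf`, the sample corpus, and the names `tf_idf_score` and `boosted_score` are all assumptions made for illustration.

```python
import math

def idf(term, corpus):
    """Inverse document frequency: rarer terms get a higher weight (smoothed, illustrative)."""
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + docs_with_term)) + 1.0

def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum over the query terms of term frequency times inverse document frequency."""
    return sum(doc_tokens.count(term) * idf(term, corpus) for term in query_terms)

def boosted_score(query_terms, title, body, corpus, title_boost=2.0):
    """Count matches in the title more than matches in the body (hypothetical boost value)."""
    return (title_boost * tf_idf_score(query_terms, title, corpus)
            + tf_idf_score(query_terms, body, corpus))

# Hypothetical posts on a cyclist's blog: "bicycle" is everywhere, "race" is rarer.
corpus = [
    "bicycle repair tips".split(),
    "bicycle race report bicycle race".split(),
    "new bicycle lanes".split(),
    "bicycle race calendar".split(),
]
query = ["bicycle", "race"]
scores = [tf_idf_score(query, doc, corpus) for doc in corpus]
best = scores.index(max(scores))  # the post that repeats both words most often
```

In this toy corpus "race" carries more weight than "bicycle" because fewer posts contain it, and the post that mentions both words twice ranks first.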

    With Elasticsearch you have options to make your searches intuitive and go beyond exactly matching what the user types in. These options are handy when the user enters a typo or uses a synonym or a derived word different than what you’ve stored. They’re also handy when the user doesn’t know exactly what to search for in the first place.

    Like other NoSQL data stores, Elasticsearch doesn’t support transactions. In chapter 3, you’ll see how you can use versioning to manage concurrency, but if you need transactions, consider using another database as the “source of truth.” Also, regular backups are a good practice when you’re using a single data store.
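
The idea of using versions to manage concurrency can be sketched as optimistic concurrency control: a write succeeds only if the document is still at the version the writer last read. The in-memory `VersionedStore` below is a hypothetical toy, not Elasticsearch's API; it only illustrates the check.

```python
class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""

class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if someone else changed the document in the meantime.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version

store = VersionedStore()
store.index("post-1", {"likes": 0})        # creates version 1
version, source = store.get("post-1")      # two writers both read version 1

# The first writer wins and moves the document to version 2.
store.index("post-1", {"likes": source["likes"] + 1}, expected_version=version)

# The second writer still holds version 1, so its write is rejected.
try:
    store.index("post-1", {"likes": source["likes"] + 1}, expected_version=version)
    conflict = False
except VersionConflict:
    conflict = True  # it would have to re-read the document and retry
```

This prevents lost updates on a single document, but it is not a transaction: there is no way to change several documents atomically.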

    Logical layout — What your search application needs to be aware of. The unit you’ll use for indexing and searching is a document, and you can think of it like a row in a relational database. Documents are grouped into types, which contain documents in a way similar to how tables contain rows. Finally, one or multiple types live in an index, the biggest container, similar to a database in the SQL world.
    Physical layout — How Elasticsearch handles your data in the background. Elasticsearch divides each index into shards, which can migrate between servers that make up a cluster. Typically, applications don’t care about this because they work with Elasticsearch in much the same way, whether it’s one or more servers. But when you’re administering the cluster, you care because the way you configure the physical layout determines its performance, scalability, and availability.
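
The logical layout can be pictured as nested containers, with a document addressed by index, type, and ID. The sketch below models that with plain dictionaries; the type name `group` and the sample document are hypothetical, and nothing here reflects how Elasticsearch stores data physically.

```python
from collections import defaultdict

# index name -> type name -> document ID -> document
indices = defaultdict(lambda: defaultdict(dict))

def index_document(index, doc_type, doc_id, source):
    """Store a document under its index and type, like a row in a table in a database."""
    indices[index][doc_type][doc_id] = source

def get_document(index, doc_type, doc_id):
    return indices[index][doc_type][doc_id]

index_document("get-together", "group", "1", {"name": "Elasticsearch Denver"})
doc = get_document("get-together", "group", "1")
```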

    A node is an instance of Elasticsearch. When you start Elasticsearch on your server, you have a node. If you start Elasticsearch on another server, it’s another node. You can even have more nodes on the same server by starting multiple Elasticsearch processes.
    Multiple nodes can join the same cluster. As we’ll discuss later in this chapter, starting nodes with the same cluster name and otherwise default settings is enough to make a cluster. With a cluster of multiple nodes, the same data can be spread across multiple servers. This helps performance because Elasticsearch has more resources to work with. It also helps reliability: if you have at least one replica per shard, any node can disappear and Elasticsearch will still serve you all the data. For an application that’s using Elasticsearch, having one or more nodes in a cluster is transparent. By default, you can connect to any node from the cluster and work with the whole data just as if you had a single node.
    Although clustering is good for performance and availability, it has its disadvantages: you have to make sure nodes can communicate with each other quickly enough and that you won’t have a split brain (two parts of the cluster that can’t communicate and think the other part dropped out).
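
A common guard against split brain is to let a part of the cluster elect a master only when it can see a majority of the master-eligible nodes. The excerpt does not describe this mechanism, so treat the sketch below as background arithmetic under that assumption; the function names are hypothetical, and how the rule is configured depends on the Elasticsearch version.

```python
def majority(master_eligible_nodes):
    """Smallest number of nodes that is more than half of the total."""
    return master_eligible_nodes // 2 + 1

def can_elect_master(visible_nodes, master_eligible_nodes):
    """A partition may elect a master only if it sees a majority of eligible nodes."""
    return visible_nodes >= majority(master_eligible_nodes)

# Three nodes split into a group of two and a group of one:
# only the larger side keeps working, so two masters can never coexist.
larger_side = can_elect_master(2, 3)
smaller_side = can_elect_master(1, 3)
```

With only two master-eligible nodes the majority is also two, so a split leaves neither side able to elect a master. That is one reason an odd number of such nodes is usually preferred.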

    You can change the number of replicas per shard at any time because replicas can always be created or removed. This doesn’t apply to the number of primary shards an index is divided into; you have to decide on the number of shards before creating the index.
    Keep in mind that too few shards limit how much you can scale, but too many shards impact performance. The default setting of five is typically a good start. You’ll learn more in chapter 9, which is all about scaling. We'll also explain how to add/remove replica shards dynamically.
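
One way to see why the number of primary shards is fixed at index creation: Elasticsearch picks a document's shard by hashing a routing value (the document ID by default) and taking the remainder by the number of primary shards. The sketch below uses `zlib.crc32` as a stand-in for the real hash function, and the document IDs are made up.

```python
import zlib

def shard_for(doc_id, number_of_primary_shards):
    """Pick a shard from the document ID; crc32 is a stand-in hash for illustration."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards

doc_ids = [f"doc-{i}" for i in range(100)]
with_five = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
with_six = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

# Documents that would be looked up on a different shard if the count changed.
moved = [doc_id for doc_id in doc_ids if with_five[doc_id] != with_six[doc_id]]
```

Changing the shard count would send lookups for most existing documents to the wrong shard, so the index has to be rebuilt. Replicas are copies of existing shards and do not take part in this calculation, which is why their number can change at any time.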

    The simplest Elasticsearch cluster has one node: one machine running one Elasticsearch process. When you installed Elasticsearch in chapter 1 and started it, you created a one-node cluster.
    As you add more nodes to the same cluster, existing shards get balanced between all nodes. As a result, both indexing and search requests that work with those shards benefit from the extra power of your added nodes. Scaling this way (by adding nodes to a cluster) is called horizontal scaling; you add more nodes, and requests are then distributed so they all share the work. The alternative to horizontal scaling is to scale vertically; you add more resources to your Elasticsearch node, perhaps by dedicating more processors to it if it’s a virtual machine, or adding RAM to a physical machine. Although vertical scaling helps performance almost every time, it’s not always possible or cost-effective. Using shards enables you to scale horizontally.
    Suppose you want to scale your get-together index, which currently has two primary shards and no replicas. As shown in figure 2.7, the first option is to scale vertically by upgrading the node: for example, adding more RAM, more CPUs, faster disks, and so on. The second option is to scale horizontally by adding another node and having your data distributed between the two nodes.
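
The horizontal option can be sketched as redistributing the same two primary shards over more nodes. The round-robin `allocate` function and the node names below are hypothetical stand-ins; Elasticsearch's real balancing logic takes more factors into account.

```python
def allocate(shards, nodes):
    """Assign shards to nodes round-robin; a simplified stand-in for real balancing."""
    layout = {node: [] for node in nodes}
    for i, shard in enumerate(shards):
        layout[nodes[i % len(nodes)]].append(shard)
    return layout

shards = ["get-together[0]", "get-together[1]"]

one_node = allocate(shards, ["node1"])            # both shards share one machine
two_nodes = allocate(shards, ["node1", "node2"])  # one shard per machine
```

With two primary shards, a third node would have nothing to hold, which illustrates how too few shards limit how far an index can scale out.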

    For most snippets in this book you’ll use the cURL binary. cURL is a command-line tool for transferring data over HTTP. You’ll use the curl command to make HTTP requests, as it has become a convention to use cURL for Elasticsearch code snippets. That’s because it’s easy to translate a cURL example into any programming language. In fact, if you ask for help on the official mailing list for Elasticsearch, it’s recommended that you provide a curl recreation of your problem. A curl recreation is a command or a sequence of curl commands that reproduces the problem you’re experiencing, and anyone who has Elasticsearch installed locally can run it.
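
As an illustration of how a curl snippet translates into a programming language, here is a sketch using only Python's standard library. It assumes a node listening on `http://localhost:9200` and an index named `get-together`; the function names are hypothetical, and the request is only sent if you call `run_search` yourself against a running node.

```python
import json
import urllib.request

def build_search_request(base_url, index, query_text):
    """Build the HTTP request that a curl search command would send."""
    url = f"{base_url}/{index}/_search"
    body = json.dumps({"query": {"query_string": {"query": query_text}}})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_search(request):
    """Send the request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)

request = build_search_request("http://localhost:9200", "get-together", "elasticsearch")
# results = run_search(request)  # uncomment when Elasticsearch is running locally
```

The URL, HTTP method, and JSON body map one-to-one onto the parts of a curl command, which is why curl snippets are easy to port.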
