Site Reliability Engineering (MOBI): Free High-Speed Cloud-Drive Download Links


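Site Reliability Engineering: Book Details

ISBN: 9781491929124
Authors: Betsy Beyer / Chris Jones / Jennifer Petoff / Niall Richard Murphy
Publisher: O'Reilly Media
Publication date: 2016-4-16
Pages: 552
Price: USD 44.99
Paper: not available
Binding: Paperback
Trim size: not available
Language: not available
Audience: IT professionals, system administrators, DevOps engineers, network engineers, enterprise architects, data center managers, systems analysts
Tags: cloud computing / DevOps / reliability engineering / enterprise architecture / systems operations
Douban rating: 9
Last updated: 2025-05-19 19:08:55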

Summary:

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

Table of Contents:

Chapter 1: Introduction
    The Sysadmin Approach to Service Management
    Google’s Approach to Service Management: Site Reliability Engineering
    Tenets of SRE
    The End of the Beginning
Chapter 2: The Production Environment at Google, from the Viewpoint of an SRE
    Hardware
    System Software That “Organizes” the Hardware
    Other System Software
    Our Software Infrastructure
    Our Development Environment
    Shakespeare: A Sample Service

Principles

Chapter 3: Embracing Risk
    Managing Risk
    Measuring Service Risk
    Risk Tolerance of Services
    Motivation for Error Budgets
Chapter 4: Service Level Objectives
    Service Level Terminology
    Indicators in Practice
    Objectives in Practice
    Agreements in Practice
Chapter 5: Eliminating Toil
    Toil Defined
    Why Less Toil Is Better
    What Qualifies as Engineering?
    Is Toil Always Bad?
    Conclusion
Chapter 6: Monitoring Distributed Systems
    Definitions
    Why Monitor?
    Setting Reasonable Expectations for Monitoring
    Symptoms Versus Causes
    Black-Box Versus White-Box
    The Four Golden Signals
    Worrying About Your Tail (or, Instrumentation and Performance)
    Choosing an Appropriate Resolution for Measurements
    As Simple as Possible, No Simpler
    Tying These Principles Together
    Monitoring for the Long Term
    Conclusion
Chapter 7: The Evolution of Automation at Google
    The Value of Automation
    The Value for Google SRE
    The Use Cases for Automation
    Automate Yourself Out of a Job: Automate ALL the Things!
    Soothing the Pain: Applying Automation to Cluster Turnups
    Borg: Birth of the Warehouse-Scale Computer
    Reliability Is the Fundamental Feature
    Recommendations
Chapter 8: Release Engineering
    The Role of a Release Engineer
    Philosophy
    Continuous Build and Deployment
    Configuration Management
    Conclusions
Chapter 9: Simplicity
    System Stability Versus Agility
    The Virtue of Boring
    I Won’t Give Up My Code!
    The “Negative Lines of Code” Metric
    Minimal APIs
    Modularity
    Release Simplicity
    A Simple Conclusion

Practices

Chapter 10: Practical Alerting from Time-Series Data
    The Rise of Borgmon
    Instrumentation of Applications
    Collection of Exported Data
    Storage in the Time-Series Arena
    Rule Evaluation
    Alerting
    Sharding the Monitoring Topology
    Black-Box Monitoring
    Maintaining the Configuration
    Ten Years On…
Chapter 11: Being On-Call
    Introduction
    Life of an On-Call Engineer
    Balanced On-Call
    Feeling Safe
    Avoiding Inappropriate Operational Load
    Conclusions
Chapter 12: Effective Troubleshooting
    Theory
    In Practice
    Negative Results Are Magic
    Case Study
    Making Troubleshooting Easier
    Conclusion
Chapter 13: Emergency Response
    What to Do When Systems Break
    Test-Induced Emergency
    Change-Induced Emergency
    Process-Induced Emergency
    All Problems Have Solutions
    Learn from the Past. Don’t Repeat It.
    Conclusion
Chapter 14: Managing Incidents
    Unmanaged Incidents
    The Anatomy of an Unmanaged Incident
    Elements of Incident Management Process
    A Managed Incident
    When to Declare an Incident
    In Summary
Chapter 15: Postmortem Culture: Learning from Failure
    Google’s Postmortem Philosophy
    Collaborate and Share Knowledge
    Introducing a Postmortem Culture
    Conclusion and Ongoing Improvements
Chapter 16: Tracking Outages
    Escalator
    Outalator
Chapter 17: Testing for Reliability
    Types of Software Testing
    Creating a Test and Build Environment
    Testing at Scale
    Conclusion
Chapter 18: Software Engineering in SRE
    Why Is Software Engineering Within SRE Important?
    Auxon Case Study: Project Background and Problem Space
    Intent-Based Capacity Planning
    Fostering Software Engineering in SRE
    Conclusions
Chapter 19: Load Balancing at the Frontend
    Power Isn’t the Answer
    Load Balancing Using DNS
    Load Balancing at the Virtual IP Address
Chapter 20: Load Balancing in the Datacenter
    The Ideal Case
    Identifying Bad Tasks: Flow Control and Lame Ducks
    Limiting the Connections Pool with Subsetting
    Load Balancing Policies
Chapter 21: Handling Overload
    The Pitfalls of “Queries per Second”
    Per-Customer Limits
    Client-Side Throttling
    Criticality
    Utilization Signals
    Handling Overload Errors
    Load from Connections
    Conclusions
Chapter 22: Addressing Cascading Failures
    Causes of Cascading Failures and Designing to Avoid Them
    Preventing Server Overload
    Slow Startup and Cold Caching
    Triggering Conditions for Cascading Failures
    Testing for Cascading Failures
    Immediate Steps to Address Cascading Failures
    Closing Remarks
Chapter 23: Managing Critical State: Distributed Consensus for Reliability
    Motivating the Use of Consensus: Distributed Systems Coordination Failure
    How Distributed Consensus Works
    System Architecture Patterns for Distributed Consensus
    Distributed Consensus Performance
    Deploying Distributed Consensus-Based Systems
    Monitoring Distributed Consensus Systems
    Conclusion
Chapter 24: Distributed Periodic Scheduling with Cron
    Cron
    Cron Jobs and Idempotency
    Cron at Large Scale
    Building Cron at Google
    Summary
Chapter 25: Data Processing Pipelines
    Origin of the Pipeline Design Pattern
    Initial Effect of Big Data on the Simple Pipeline Pattern
    Challenges with the Periodic Pipeline Pattern
    Trouble Caused By Uneven Work Distribution
    Drawbacks of Periodic Pipelines in Distributed Environments
    Introduction to Google Workflow
    Stages of Execution in Workflow
    Ensuring Business Continuity
    Summary and Concluding Remarks
Chapter 26: Data Integrity: What You Read Is What You Wrote
    Data Integrity’s Strict Requirements
    Google SRE Objectives in Maintaining Data Integrity and Availability
    How Google SRE Faces the Challenges of Data Integrity
    Case Studies
    General Principles of SRE as Applied to Data Integrity
    Conclusion
Chapter 27: Reliable Product Launches at Scale
    Launch Coordination Engineering
    Setting Up a Launch Process
    Developing a Launch Checklist
    Selected Techniques for Reliable Launches
    Development of LCE
    Conclusion

Management

Chapter 28: Accelerating SREs to On-Call and Beyond
    You’ve Hired Your Next SRE(s), Now What?
    Initial Learning Experiences: The Case for Structure Over Chaos
    Creating Stellar Reverse Engineers and Improvisational Thinkers
    Five Practices for Aspiring On-Callers
    On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
    Closing Thoughts
Chapter 29: Dealing with Interrupts
    Managing Operational Load
    Factors in Determining How Interrupts Are Handled
    Imperfect Machines
Chapter 30: Embedding an SRE to Recover from Operational Overload
    Phase 1: Learn the Service and Get Context
    Phase 2: Sharing Context
    Phase 3: Driving Change
    Conclusion
Chapter 31: Communication and Collaboration in SRE
    Communications: Production Meetings
    Collaboration within SRE
    Case Study of Collaboration in SRE: Viceroy
    Collaboration Outside SRE
    Case Study: Migrating DFP to F1
    Conclusion
Chapter 32: The Evolving SRE Engagement Model
    SRE Engagement: What, How, and Why
    The PRR Model
    The SRE Engagement Model
    Production Readiness Reviews: Simple PRR Model
    Evolving the Simple PRR Model: Early Engagement
    Evolving Services Development: Frameworks and SRE Platform
    Conclusion

Conclusions

Chapter 33: Lessons Learned from Other Industries
    Meet Our Industry Veterans
    Preparedness and Disaster Testing
    Postmortem Culture
    Automating Away Repetitive Work and Operational Overhead
    Structured and Rational Decision Making
    Conclusions
Chapter 34: Conclusion

Appendix: Availability Table
Appendix: A Collection of Best Practices for Production Services
    Fail Sanely
    Progressive Rollouts
    Define SLOs Like a User
    Error Budgets
    Monitoring
    Postmortems
    Capacity Planning
    Overloads and Failure
    SRE Teams
Appendix: Example Incident State Document
Appendix: Example Postmortem
    Lessons Learned
    Timeline
    Supporting information
Appendix: Launch Coordination Checklist
Appendix: Example Production Meeting Minutes

About the Authors:

Betsy Beyer
Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.

Chris Jones
Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google’s advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He’s also a licensed professional engineer.

Jennifer Petoff
Jennifer Petoff is a Program Manager for Google’s Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.

Niall Richard Murphy
Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland’s peering hub. He is the author or coauthor of a number of technical papers and/or books, including "IPv6 Network Administration" for O’Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.

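Download Ratings

Grateful (163+)
Student (1164+)
Clear (387+)
Pleasant surprise (1270+)
A real treasure (998+)
Latest (245+)
Complete content (598+)
Smooth (189+)
Cloud sync (334+)
Reflowed (907+)
Copyable (801+)
Garbled (330+)
Lossless (467+)
Rigorous logic (904+)
MOBI (517+)
Stable (590+)
Moving (128+)
Bilingual (923+)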

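Download Comments

User 1733556381: ( 2024-12-07 15:26:21 )
Instant download of EPUB/AZW3 files; a high-definition periodical worth bookmarking, and easy to use.
User 1728610719: ( 2024-10-11 09:38:39 )
A high-quality edition of this report resource; the EPUB/MOBI formats work on all kinds of reading devices, and it is easy to use.
User 1741784131: ( 2025-03-12 20:55:31 )
After a long search I finally found a high-definition version. The layout is clear and the reading experience is great!
User 1727608044: ( 2024-09-29 19:07:24 )
A high-quality textbook resource; the multi-format design improves the reading experience, and it is easy to use.
User 1744426141: ( 2025-04-12 10:49:01 )
A complete edition of this novel resource; the PDF/TXT formats work on all kinds of reading devices, and the quality is high.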

Related Reviews

Site Reliability Engineering

Ops vs. Iteration, Manual vs. Automation

The Antidote to Chaos
