Making Sense Out of the Reliability Prediction Business
Reliability predictions are commonly used in the development of products and systems to compare alternative design approaches and to assess progress toward reliability design goals. They’re often criticized as not being accurate forecasts of field reliability performance because they don’t usually account for all the factors that cause field failures. Nevertheless, predictions are a valuable form of analysis that also provides insight into safety, maintenance and warranty costs, and other product considerations.
Commonly used electronic reliability prediction approaches include:
- MIL-HDBK-217 “Reliability Prediction of Electronic Equipment” – Even though this handbook is no longer being kept up to date by the US military, it remains the most widely used approach by both commercial and military analysts.
- Bellcore (now Telcordia) TR-332 – The Bellcore approach is widely used in the telecommunications industry and was updated to SR-332 in May 2001. It is very similar to MIL-HDBK-217.
- RDF 2000 – This is the latest and most comprehensive of the European methodologies developed by CNET. It hasn’t yet received much attention in the US, but it could evolve into the new international standard should MIL-HDBK-217 continue to fall out of date. Like the PRISM approach, it also addresses thermal cycling and dormant system modeling.
- PRISM – PRISM is a new technique developed by the Reliability Analysis Center which has the ability to model the effects of thermal cycling and dormancy. At this time it’s rather limited from a device coverage standpoint but it shows potential for community acceptance as it matures. Quanterion’s David Mahar developed the software currently being marketed.
- Physics-of-Failure – This family of approaches differs significantly from the other empirical methodologies listed above and is used primarily at the sub-device level during the design stage.
- The IEEE Gold Book – IEEE STD 493-1997, IEEE Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems, provides data on commercial power distribution systems.
Mechanical equipment has always presented special challenges in terms of reliability prediction because of the uniqueness and variety of components and assemblies. These systems are often susceptible to wearout, which is usually not an issue with electronic systems. There are two basic approaches for predicting the reliability of mechanical systems:
- NPRD-95 – The Nonelectronic Parts Reliability Data (NPRD-95) publication is a widely used data book from the Reliability Analysis Center that provides a compendium of historical field failure rate data on a wide array of mechanical assemblies.
- NSWC-94/L07 – Handbook of Reliability Prediction Procedures for Mechanical Equipment. This handbook takes a unique approach to predicting mechanical component reliability by presenting failure rate models for fundamental classes of mechanical components.
Brief descriptions of the various prediction approaches follow:
MIL-HDBK-217 has been the mainstay of reliability predictions for about 40 years, but it has not been updated since 1995 and there are no plans by the military to update it in the future. For more than ten years Quanterion’s Seymour Morris was DoD program manager for MIL-HDBK-217. The handbook includes a series of empirical failure rate models developed using historical piece part failure data for a wide array of component types. There are models for virtually all electrical/electronic parts and a number of electromechanical parts as well. All models predict reliability in terms of failures per million operating hours and assume an exponential distribution (constant failure rate), which allows part failure rates to be added to determine the failure rate of a higher-level assembly. The handbook contains two prediction approaches, the parts stress technique and the parts count technique, and covers 14 separate operational environments, such as ground fixed, airborne inhabited, etc. As the names imply, the parts stress technique requires knowledge of the stress levels on each part to determine its failure rate, while the parts count technique assumes average stress levels as a means of providing an early design estimate of the failure rate. Typical factors used in determining a part’s failure rate include a temperature factor (πT), power factor (πP), power stress factor (πS), quality factor (πQ) and environmental factor (πE) in addition to the base failure rate (λb). For example, the model for a resistor is as follows:
λResistor = λb πT πP πS πQ πE
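The multiplicative resistor model and the additive roll-up to assembly level can be sketched in a few lines of Python. This is only an illustration of the arithmetic: every numeric value below is a made-up placeholder rather than a handbook value, and the function names `part_failure_rate` and `reliability` are hypothetical helpers, not part of any standard tool.

```python
import math

def part_failure_rate(base_rate, pi_factors):
    """Failures per million hours: base rate times the product of the pi factors."""
    return base_rate * math.prod(pi_factors)

def reliability(rate_per_million_hours, hours):
    """Survival probability over `hours` for a constant (exponential) failure rate."""
    return math.exp(-rate_per_million_hours * hours / 1e6)

# Hypothetical resistor inputs (placeholders, NOT taken from MIL-HDBK-217).
resistor_rate = part_failure_rate(
    base_rate=0.002,                       # lambda_b
    pi_factors=[1.5, 0.5, 1.2, 3.0, 4.0],  # pi_T, pi_P, pi_S, pi_Q, pi_E
)

# Hypothetical failure rates for two other parts in the same assembly.
other_part_rates = [0.05, 0.10]

# Constant failure rates add, so the assembly rate is a simple sum.
assembly_rate = resistor_rate + sum(other_part_rates)
mtbf_hours = 1e6 / assembly_rate

print(f"resistor: {resistor_rate:.4f} failures per million hours")
print(f"assembly: {assembly_rate:.4f} failures per million hours")
print(f"MTBF: {mtbf_hours:,.0f} hours")
print(f"R(10,000 h): {reliability(assembly_rate, 10_000):.6f}")
```

With these placeholder inputs the resistor contributes 0.0216 failures per million hours. The simple sum is valid only because every model assumes a constant failure rate; it would not hold for parts with wearout behavior.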
Bellcore’s approach is very similar to that of MIL-HDBK-217 but it’s based primarily on telecommunications data and covers five separate use environments. The approach also assumes an exponential failure distribution and calculates reliability in terms of failures per billion part operating hours, or FITs. Its empirically based models are in three categories: the Method I parts count approach that applies when there is no field failure data available, the Method II modification to Method I to include lab test data, and the Method III variation that includes field failure tracking. Method I includes a first year modifier to account for infant mortality. Method II includes a Bayes weighting procedure that covers three approaches depending on the level of previous burn-in the part or unit has undergone. Method III includes a Bayes weighting procedure as well but it is based on three different cases depending on how similar the equipment is to that from which the data was collected. For the most widely used Method I case where the burn-in varies, the steady-state failure rate depends on the basic part steady-state failure rate and the quality, electrical stress and temperature factors as follows:
λSSi = λGi πQi πSi πTi
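Because the Bellcore models work in FITs (failures per billion hours) while MIL-HDBK-217 works in failures per million hours, comparing results from the two requires a factor of 1,000. The sketch below evaluates the steady-state formula above and converts the result; the input values and the function names are hypothetical and chosen only to show the unit handling.

```python
def steady_state_fits(generic_fits, pi_q, pi_s, pi_t):
    """Device steady-state failure rate in FITs (failures per billion hours)."""
    return generic_fits * pi_q * pi_s * pi_t

def fits_to_per_million_hours(fits):
    """Convert FITs to failures per million hours (the MIL-HDBK-217 unit)."""
    return fits / 1000.0

# Hypothetical inputs (placeholders, NOT taken from TR-332 / SR-332).
device_fits = steady_state_fits(generic_fits=10.0, pi_q=1.0, pi_s=1.3, pi_t=2.0)
device_per_million = fits_to_per_million_hours(device_fits)
device_mtbf_hours = 1e9 / device_fits

print(f"{device_fits:.1f} FITs = {device_per_million:.3f} failures per million hours")
print(f"MTBF: {device_mtbf_hours:,.0f} hours")
```

Here the placeholder device comes out at 26 FITs, which is 0.026 failures per million hours.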
RDF 2000 is the new version of the CNET UTE C 80-810 reliability prediction standard that covers most of the same components as MIL-HDBK-217. The models take into account power on/off cycling as well as temperature cycling and are very complex. Predictions for integrated circuits, for example, require information on equipment outside ambient and printed circuit ambient temperatures, type of technology, number of transistors, year of manufacture, junction temperature, working time ratio, storage time ratio, thermal expansion characteristics, number of thermal cycles, thermal amplitude of variation and application of the device, as well as per transistor, technology related and package related base failure rates. As this standard becomes more widely used it could become the international successor to the US MIL-HDBK-217.
PRISM is a new approach released in 2000 based on the DoD Reliability Analysis Center’s databases. It provides the ability to update predictions based on test data and addresses factors such as development process robustness. Available as an automated tool (as opposed to a handbook compendium of models like the others), PRISM interfaces directly with RAC’s electronic and nonelectronic automated databases and provides an elaborate methodology to assess the quality of the system development process. It provides a means to include software reliability but is limited by the fact that it does not yet include models for all commonly used devices. The PRISM system reliability model is:
λS = λIA ( πP πIM πE + πD πG + πM πIM + πE πG + πS πG + πI πE + πN + πW πE ) + λSW
where λIA is the initial assessment failure rate (based on “RACRates” component failure rate models incorporated into PRISM) for the system based on its parts, and the remaining factors address parts processes (πP), infant mortality (πIM), environment (πE), design processes (πD), reliability growth (πG), manufacturing processes (πM), system management processes (πS), induced processes (πI), no-defect processes (πN), and wear-out processes (πW). λSW is the software failure rate. Quantitative values for the individual factors are determined through an extensive question and answer process intended to benchmark the extent to which measures known to enhance reliability are used in design, manufacturing and management processes.
Physics-of-Failure approaches attempt to identify the “weakest link” of a design to ensure that the required equipment life is exceeded by the design. The methodology generally ignores the issue of defects escaping from the manufacturing process and assumes that product reliability is strictly governed by the predicted life of the weakest link. Example models address microcircuit die attach fatigue, bond wire flexure fatigue and die fatigue cracking. The models are very complex and require detailed device geometry information and materials properties. In general, the models are thought to be most useful in the early stages of designing devices (e.g., hybrids) but not at the assembly level, when flexibility no longer exists to change device designs.
The IEEE Gold Book provides reliability data for equipment used in industrial and commercial power distribution systems. Reliability data for different types of equipment are provided along with other aspects of reliability analysis for power distribution systems, such as basic concepts of reliability analysis, probability methods, fundamentals of power system reliability evaluation, economic evaluation of reliability, and cost of power outage data. The handbook was updated in 1997; however, the most recent reliability data reflected in the document runs only through 1989.
NPRD-95 data provides failure rates for a wide variety of items, including mechanical and electromechanical parts and assemblies. The document provides detailed failure rate data on over 25,000 parts for numerous part categories grouped by environment and quality level. Because the data does not include time-to-failure, the document is forced to report average failure rates to account for both defects and wearout. Cumulatively, the database represents approximately 2.5 trillion part hours and 387,000 failures accumulated from the early 1970’s through 1994. The environments addressed include the same ones covered by MIL-HDBK-217; however, data is often very limited for some environments and specific part types. For these cases, it then becomes necessary to use the “rolled up” estimates provided, which make use of all data available for a broader class of parts and environments. Although the data book approach is generally thought to be less desirable, it remains an economical means of estimating “ballpark” reliability for mechanical components.
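An average failure rate of the kind a data book reports is simply observed failures divided by accumulated part hours. As a rough illustration, the Python sketch below applies that arithmetic to the cumulative totals quoted above. The function name is hypothetical, and the resulting figure is an aggregate across every part type in the database, so it shows the calculation only and is not a usable rate for any particular component; it also does not reproduce the data book’s own procedure for producing “rolled up” estimates.

```python
def average_failure_rate(failures, part_hours):
    """Average failure rate in failures per million part hours."""
    return failures / part_hours * 1e6

# Cumulative totals quoted in the text for the whole database.
overall_rate = average_failure_rate(failures=387_000, part_hours=2.5e12)

print(f"{overall_rate:.4f} failures per million part hours")
```

The result is roughly 0.15 failures per million part hours. Because no time-to-failure information enters the calculation, early-life defects and wearout failures are averaged together into that single number.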
NSWC-94/L07 – Handbook of Reliability Prediction Procedures for Mechanical Equipment. This handbook, developed by the Naval Surface Warfare Center – Carderock Division, provides failure rate models for fundamental classes of mechanical components. Examples of the specific mechanical devices addressed by the document include belts, springs, bearings, seals, brakes, slider-crank mechanisms, and clutches. Failure rate models include factors that are known to impact the reliability of the components. For example, the most common failure modes for springs are fracture due to fatigue and excessive loss of load due to stress relaxation. The reliability of a spring will therefore depend on the material, design characteristics and the operating environment. NSWC-94/L07 models attempt to predict spring reliability based on these input characteristics. The drawback of the approach is that, like the physics-of-failure models for electronics, the models require a significant amount of detailed input data (e.g., material properties, applied forces, etc.) that is often not readily available. They also do not address the issue of manufacturing defects.
Summary: Even though MIL-HDBK-217 is increasingly outdated, it remains the most widely used technique for electronics. TR-332 is widely used in the telecommunications industry and is generally believed to more accurately predict the reliability of telecom equipment. New and more robust methodologies such as the RAC’s PRISM model provide improved modeling capability but will need to be expanded to include more part categories, and further evaluated by industry, prior to widespread adoption. For mechanical components, NPRD-95 is the most widely used, with approaches such as NSWC-94/L07 offering a more accurate alternative if the required detailed input data is available and manufacturing defects can be ignored. Many of the approaches are available in automated form from Relex Software, ITEM Software, Isograph Software, ALD, Oerlikon-Contraves and a number of others. The packages typically are integrated with other reliability and maintainability analyses, greatly reducing the labor required for multiple analyses. At this time none of the tools includes RDF 2000. PRISM is a stand-alone package that is marketed by RAC and several resellers.