10. Software and Control Risks

The GMTO Risk Management Plan [Rayb13] describes how the project identifies, evaluates, and tracks risks to ensure that they are handled appropriately. The Risk Management Plan does not include risks associated with the safety of personnel or equipment; those are covered by the Design Safety Process Plan [Sawy13b], which is being executed by the Systems Engineering Group.

10.1. Risk Mitigation Plans

The risk mitigations identified in the risk register generally fall under categories that include: Agile development, prototyping, simulation, continuous integration and delivery, scenarios, interfaces, user involvement, development staff, operations staff, de-scoping design, standard software practices and collaboration. Examples of how some of the SWC risks will be mitigated are listed for these categories in the following sections.

10.2. Agile Development

A continuous stream of requirement changes could delay the development of some subsystems. Agile techniques provide an efficient way to adapt projects to changes in the requirements.

Unrealistic schedule, cost estimation and staffing plans can all be addressed by an Agile process deployment that will help to identify issues early on in the project. Performance metrics should be used systematically. An Agile process that enforces a predictable development model can forecast or mitigate the likelihood of delays in software delivery.

Incorrect priorities can exhaust resources and deplete the contingency budget. Therefore user scenarios need to be prioritized according to realistic end-values so that critical scenarios are implemented first.
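One way to picture this prioritization is as an ordering of the scenario backlog: critical scenarios first, then by end value, with everything below the capacity cut becoming a de-scope candidate. The sketch below is illustrative only. The `Scenario` fields, the numeric `end_value` score, the effort units, and the scenario names are hypothetical and do not come from the GMT risk register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    end_value: int   # hypothetical score: higher means more value to users
    critical: bool   # True if the scenario must be implemented first
    effort: int      # hypothetical effort estimate (e.g. story points)


def plan(scenarios, capacity):
    """Split a backlog into (scheduled, descoped) under a fixed capacity."""
    # Critical scenarios sort first; ties are broken by descending end value.
    ordered = sorted(scenarios, key=lambda s: (not s.critical, -s.end_value))
    scheduled, descoped, used = [], [], 0
    for s in ordered:
        # Strict priority cut: once one scenario does not fit,
        # every lower-priority scenario is a de-scope candidate too.
        if not descoped and used + s.effort <= capacity:
            scheduled.append(s)
            used += s.effort
        else:
            descoped.append(s)
    return scheduled, descoped


backlog = [
    Scenario("export engineering logs", end_value=3, critical=False, effort=4),
    Scenario("point telescope", end_value=8, critical=True, effort=5),
    Scenario("custom dashboard themes", end_value=1, critical=False, effort=2),
    Scenario("acquire guide star", end_value=9, critical=False, effort=3),
]
scheduled, descoped = plan(backlog, capacity=10)
print([s.name for s in scheduled])
print([s.name for s in descoped])
```

The strict cut is a deliberate simplification: it never schedules a low-value scenario ahead of a higher-value one just because it happens to fit in the remaining capacity.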

10.3. Prototyping and Simulation

An incorrect requirement can lead to the development of the wrong functionality, which wastes observatory resources and may introduce unforeseen risks. Activities such as prototyping, early integration, and simulation help to reduce cost and to improve safety, efficiency, and performance.

Extensive tests that simulate the M1 cell and supports should be carried out before installing the real mirror, so as to protect the primary mirror from damage in case of an M1 Control System malfunction.

The ultra-low latency bus is critical for the AO/AcO system performance. To minimize unforeseen effects of the ultra-low latency bus on AO performance, it is essential to prototype the hardware-software combination early on and to conduct a thorough analysis of its performance.
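As an illustration of what such a performance analysis might report for a prototype, the following sketch summarizes a set of measured latencies against a budget. It is a minimal example under stated assumptions: the sample values, the 20 microsecond budget, and the function name `summarize_latency` are hypothetical, and no GMT measurement or requirement is implied.

```python
import statistics


def summarize_latency(samples_us, budget_us):
    """Summarize measured bus latencies (microseconds) against a budget."""
    ordered = sorted(samples_us)
    n = len(ordered)
    # Nearest-rank 99th percentile; the rank is ceil(0.99 * n),
    # computed with integer arithmetic to avoid rounding surprises.
    rank = (99 * n + 99) // 100
    return {
        "mean_us": statistics.fmean(ordered),
        "p99_us": ordered[rank - 1],
        "max_us": ordered[-1],
        "over_budget": sum(1 for x in ordered if x > budget_us),
    }


# Hypothetical measurements: mostly 10 us, with two slower outliers.
samples = [10.0] * 98 + [12.0, 50.0]
report = summarize_latency(samples, budget_us=20.0)
print(report)
```

Reporting the tail (99th percentile and maximum) alongside the mean matters here because a control loop is typically limited by its worst-case latency rather than by its average.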

10.4. Continuous Integration and Delivery

The M1 Control System is a complex software component that needs to be fully tested under realistic conditions. Performing the integration with a dummy primary mirror, and implementing a safety system independently and in parallel with the main controller will mitigate risks when testing the control algorithms.
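The idea of a safety system that is independent of the main controller, and that applies more than one safety strategy, can be sketched as a separate check that sees every commanded value and can veto it. The following Python sketch is purely illustrative: the class names, the force limits, and the dummy mirror model are hypothetical and are not taken from the M1 Control System design. In a real system the monitor would presumably run on separate hardware rather than in the same process.

```python
from dataclasses import dataclass, field


@dataclass
class DummyMirror:
    """Stand-in for the real mirror during integration tests."""
    applied_forces: list = field(default_factory=list)

    def apply(self, force_n: float) -> None:
        self.applied_forces.append(force_n)


class SafetyMonitor:
    """Shares no code or state with the controller that issues commands."""

    def __init__(self, max_force_n: float, max_step_n: float) -> None:
        self.max_force_n = max_force_n  # hypothetical absolute force limit
        self.max_step_n = max_step_n    # hypothetical limit on change per step
        self._last = 0.0

    def permits(self, force_n: float) -> bool:
        # Strategy 1: absolute limit on the commanded force.
        within_limit = abs(force_n) <= self.max_force_n
        # Strategy 2: limit on the change since the last accepted command.
        within_rate = abs(force_n - self._last) <= self.max_step_n
        ok = within_limit and within_rate
        if ok:
            self._last = force_n
        return ok


def run(commands, mirror, monitor) -> bool:
    """Apply commands until the monitor vetoes one; True if all passed."""
    for force_n in commands:
        if not monitor.permits(force_n):
            return False  # trip: stop driving the supports
        mirror.apply(force_n)
    return True


mirror = DummyMirror()
monitor = SafetyMonitor(max_force_n=100.0, max_step_n=30.0)
completed = run([10.0, 35.0, 60.0, 140.0, 20.0], mirror, monitor)
print(completed, mirror.applied_forces)
```

Testing against `DummyMirror` first means a faulty control algorithm trips the monitor, or damages only the stand-in, before it is ever connected to the real mirror.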

Poor-quality software and hardware procured externally can result in additional rework. It is therefore necessary to establish an adequate acceptance procedure, and to require early delivery of software components so that their quality and performance can be assessed.

A poor architectural design compounds a system’s complexity. To mitigate this issue, it is necessary to deliver complete end-to-end functionality early on, with enough slack time in the schedule to allow for architectural changes (e.g., refactoring components) when excessive complexity starts to develop.

Poor documentation impacts the efficiency of operation and maintenance tasks. Documentation is treated as an integral part of the product and will be integrated into the Semantic Model, so that it stays current at all times. It is also necessary to require external contractors to provide quality documentation.

10.5. Interfaces

Large or complex interfaces require significant effort to define and manage. The number of interfaces must therefore be kept to the minimum necessary, and each interface should be as narrow as possible.

10.6. User Involvement

Early integration of stakeholders into the development team reduces the likelihood of incorrect or unstable requirements, wrong functionality, and delays in subsystem development.

10.7. Development and Operations Staff

The departure of key staff members affects team productivity, so competitive employment conditions need to be provided. For the same reason, some level of redundancy between members of the development team must be allowed, and code ownership should be avoided.

Poor operations staff involvement during the design phase results in a knowledge gap between project and operations that may impact operational efficiency. Operations staff involvement should start early on in the project.

Poor maintainability results in efficiency losses during the design and operation phases. To counteract those inefficiencies, operations staff must be involved early on in the project.

10.8. De-scope Design or Relax Requirements

Incorrect priorities exhaust resources and deplete the contingency budget. Low priority scenarios ought to be de-scoped in those cases.

10.9. Standard Software Practices

Peer review of software code is common in software development, especially in critical areas. The GMT primary mirror control system is an area that can benefit from peer reviews. A project of this scale must have configuration control for establishing and maintaining consistency in: telescope software performance, functional adherence to requirements, design, and operational information throughout its life. Diagnostic tools should be implemented throughout to correct poor maintainability, which results in efficiency losses during design and operation phases.

10.10. Collaboration

Establishing a strong software community for the project plays an important role in remedying a lack of support to external groups, which would otherwise result in poor implementation and delays.

10.11. Standards Adoption

The platform technologies, when acquired from a single vendor, can end up placing the project in a lock-in situation, which may result in a lack of support and eventually may require the cleanroom implementation of some components. In this case, the use of open standards and the ability to encapsulate product dependencies can mitigate risks.
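Encapsulating product dependencies usually means that application code talks to a small project-owned interface, while vendor-specific calls live in a single adapter that can be swapped out. A minimal sketch, assuming a hypothetical `MotorDrive` interface and two invented vendor adapters (none of these names or unit conventions come from GMT software or from any real vendor SDK):

```python
from typing import Protocol


class MotorDrive(Protocol):
    """Project-owned interface; application code depends only on this."""

    def move_to(self, position_mm: float) -> None: ...

    def position(self) -> float: ...


class VendorADrive:
    """Adapter: the only place that would call vendor A's SDK (simulated)."""

    def __init__(self) -> None:
        self._counts = 0  # made-up convention: 1000 encoder counts per mm

    def move_to(self, position_mm: float) -> None:
        self._counts = round(position_mm * 1000)

    def position(self) -> float:
        return self._counts / 1000


class VendorBDrive:
    """Adapter for a second, equally hypothetical vendor using microns."""

    def __init__(self) -> None:
        self._microns = 0.0

    def move_to(self, position_mm: float) -> None:
        self._microns = position_mm * 1000.0

    def position(self) -> float:
        return self._microns / 1000.0


def park(drive: MotorDrive) -> float:
    # Application logic is written once, against the interface only.
    drive.move_to(12.5)
    return drive.position()


print(park(VendorADrive()), park(VendorBDrive()))
```

If a vendor disappears, only its adapter needs to be replaced or reimplemented; `park` and any other code written against `MotorDrive` stays unchanged.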

During the course of the project, the technology adopted might become obsolete, compromising support and maintainability. To mitigate that risk, commercial off-the-shelf (COTS) products based on open standards should be adopted. Furthermore, acquisition can be delayed if the integration plan for a component allows for it. Open source components with an excellent track record of community support should also be considered. Finally, it is necessary to keep enough spares to allow the observatory to operate for the duration of its lifecycle.

10.12. Risk Register

The Software and Controls risks currently maintained in the risk register are summarized in the following tables. Table 10.1 shows the medium exposure risks for the SWC Design and Development Phase, and Table 10.2 shows the low exposure risks. Note that the SWCS does not have high exposure risks.

Table 10.1 Medium Exposure SWC Risks

| Risk | Description | Risk Type | Impact | Likelihood | Risk Exposure | Mitigations |
| --- | --- | --- | --- | --- | --- | --- |
| RISK0054: M1 Support control system | A malfunction in the control system of M1 could eventually break the primary mirror. | Technical | 5 - Significant | 2 - Unlikely | 10 | MIT0122: Integration with a dummy mirror; MIT0123: Implement safety system independently of the main controller; MIT0124: Safety system shall implement independent parallel and different safety strategies; MIT0125: Peer review of critical code parts |
| RISK0044: Incorrect Priorities | Developing software without appropriate priorities could consume resources that don’t add value to the system, eventually exceeding the contingency budget. | Technical | 2 - Minor | 4 - Probable | 8 | MIT0093: Prioritize user scenarios; MIT0094: Ensure that user scenarios provide real end value; MIT0095: Ensure that critical scenarios are implemented first; MIT0096: De-scope low priority scenarios |
| RISK0043: Incorrect requirements | Developing the wrong functionality could be a waste of resources. | Schedule | 2 - Minor | 4 - Probable | 8 | MIT0088: Early user involvement; MIT0089: Prototyping and early integration; MIT0090: Simulation; MIT0091: Analysis of requirements; MIT0092: Configuration control |
| RISK0042: Interface complexity | Large or complex interfaces could require a significant effort to define and manage them. | Technical | 2 - Minor | 4 - Probable | 8 | MIT0086: Few interfaces; MIT0087: Narrow interfaces |
| RISK0036: Requirements stability | Continuous stream of requirement changes could delay the development of some subsystems. | Schedule | 2 - Minor | 4 - Probable | 8 | MIT0065: Early user involvement; MIT0066: Change control management; MIT0067: Agile development |

Table 10.2 Low Exposure SWC Risks

| Risk | Description | Risk Type | Impact | Likelihood | Risk Exposure | Mitigations |
| --- | --- | --- | --- | --- | --- | --- |
| RISK0079: Requirements change propagation | Propagation of requirement changes when external developers are involved may result in increased costs. | Schedule | 2 - Minor | 3 - Possible | 6 | MIT0167: Change control management |
| RISK0078: External quality control | Shortfalls in externally procured software and hardware may result in additional rework. | Technical | 2 - Minor | 3 - Possible | 6 | MIT0164: Establish adequate acceptance procedures; MIT0165: Request early delivery of software components to assess performance of the external organization; MIT0166: Ensure external organization has adequate CMMI or equivalent level |
| RISK0056: Complexity | Poor architectural design may produce a system too complex to operate and understand. | Technical | 2 - Minor | 3 - Possible | 6 | MIT0128: Early delivery of complete end-to-end functionality will assess the adequacy of the architecture; MIT0129: Allow enough slack in the schedule to refactor components when architecture starts to show excessive complexity |
| RISK0053: Unrealistic plan | Unrealistic schedule, cost estimation or staffing plans. | Schedule | 2 - Minor | 3 - Possible | 6 | MIT0119: Deployment of an Agile process will help to identify those issues early in the project; MIT0120: Periodic reviews and retrospectives; MIT0121: Systematic use of performance metrics |
| RISK0052: Stability of staff | Key members that leave the project can affect the productivity of the development team. | Schedule | 2 - Minor | 3 - Possible | 6 | MIT0115: Competitive employment conditions; MIT0116: Allow for some level of redundancy between members of the development team; MIT0117: Avoid code ownership; MIT0118: Maintain high team motivation |
| RISK0051: Operations staff overlap | Lack of operations staff involvement in the design could result in a knowledge gap between project and operations, with the consequent impact on efficiency. | Schedule | 2 - Minor | 3 - Possible | 6 | MIT0114: Involve operations staff early on in the project |
| RISK0050: Maintainability | A system that is difficult to maintain could result in efficiency losses during design and operation phases. | Technical | 2 - Minor | 3 - Possible | 6 | MIT0112: Involve operations staff early on; MIT0113: Implement diagnostic tools throughout |
| RISK0047: External deadlines | Late delivery of software to external developers may result in delays in their schedule. | Schedule | 2 - Minor | 3 - Possible | 6 | MIT0102: Agile process to develop a predictable development process; MIT0103: Realistic milestones |
| RISK0046: Support to external groups | Lack of adequate support to external groups developing software for GMT may result in poor implementation and delays. | Schedule | 2 - Minor | 3 - Possible | 6 | MIT0100: Adequate sizing of the support effort; MIT0101: Strong software community support |
| RISK0045: Ultra-low latency bus | The ultra-low latency bus is critical to the performance of a modular AO/AcO system. | Technical | 2 - Minor | 3 - Possible | 6 | MIT0097: Early prototyping; MIT0098: Performance analysis; MIT0099: Agile delivery of critical scenarios |
| RISK0041: Poor documentation | Poor documentation could impact the efficiency of operation and maintenance tasks. | Technical | 2 - Minor | 3 - Possible | 6 | MIT0083: Consider documentation as an integral part of the product; MIT0084: Integrate documentation in the semantic model so it always stays current; MIT0085: Ensure quality of documentation produced by external providers |
| RISK0039: Vendor lock-in | Single vendor lock-in could expose the project to lack of support of some components if the vendor goes out of business. | Technical | 2 - Minor | 3 - Possible | 6 | MIT0076: Use of open standards; MIT0077: Encapsulate product dependencies |
| RISK0055: Mount Control System | A malfunction of the mount servo system could result in a system that doesn’t meet specs. | Technical | 2 - Minor | 2 - Unlikely | 4 | MIT0126: Modeling; MIT0127: System shall allow easy tuning of any parameter that can affect the performance of the servo loop |
| RISK0049: Scalability | Inadequate architectural design may result in a system that doesn’t scale properly in the production phase. | Technical | 2 - Minor | 2 - Unlikely | 4 | MIT0109: Early prototyping; MIT0110: Stress testing; MIT0111: Redesign components causing scalability bottleneck |
| RISK0048: Technology obsolescence | Technology adopted could become obsolete, or the vendor could go out of business, due to the long life span of the project phase, making it difficult to ensure support and maintainability. | Technical | 1 - Low | 3 - Possible | 3 | MIT0104: Delay acquisition when component planned integration allows it; MIT0105: Use COTS products based on open standards; MIT0106: Consider the use of open source components with an excellent track record of community support; MIT0107: Plan for enough spares to guarantee operation life; MIT0108: Avoid single vendor lock-in |
| RISK0040: Inadequate technology | Technology with poor reliability or that doesn’t perform as expected could make it difficult to meet the performance requirements. | Technical | 1 - Low | 3 - Possible | 3 | MIT0078: Prototype early; MIT0079: Adopt different technology if prototype shows that the chosen one is not adequate; MIT0080: Use fault tolerance techniques on critical systems; MIT0081: Check experience of other users; MIT0082: Use conservative specs |
| RISK0038: External software overhead | Inappropriate management of software developed externally could take excessive resources from the core development team. | Schedule | 1 - Low | 3 - Possible | 3 | MIT0072: Strong GMT software community support; MIT0073: Well defined software standards; MIT0074: Well defined interfaces; MIT0075: Adequate estimation of the support needed |
| RISK0037: Process adequacy | An insufficient or inadequate development process could delay the completion of the system. | Technical | 1 - Low | 3 - Possible | 3 | MIT0068: Iterative development makes it possible to assess the adequacy of the process and identify areas to improve; MIT0069: Agile development; MIT0070: Periodic reviews and retrospectives; MIT0071: Develop systematic metrics to assess the development effort |
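
In every row of the two risk tables above, the Risk Exposure value equals the product of the numeric Impact and Likelihood ratings (for example, 5 × 2 = 10 for RISK0054 and 1 × 3 = 3 for RISK0048). A minimal Python sketch of that arithmetic follows. The band cut-offs in `exposure_band` are an assumption: the lower bound of the medium band is inferred from the tabulated values (medium risks score 8 to 10, low risks 6 or less), while the `high_min` value is purely illustrative. The Risk Management Plan [Rayb13] remains the authoritative source for the actual bands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    risk_id: str
    impact: int      # numeric part of a rating such as "5 - Significant"
    likelihood: int  # numeric part of a rating such as "2 - Unlikely"

    @property
    def exposure(self) -> int:
        # Exposure as impact times likelihood, matching the tabulated values.
        return self.impact * self.likelihood


def exposure_band(exposure: int, medium_min: int = 8, high_min: int = 15) -> str:
    # ASSUMPTION: both thresholds are illustrative, not taken from [Rayb13].
    if exposure >= high_min:
        return "high"
    if exposure >= medium_min:
        return "medium"
    return "low"


# Ratings copied from four rows of the risk tables.
risks = [
    Risk("RISK0054", impact=5, likelihood=2),
    Risk("RISK0044", impact=2, likelihood=4),
    Risk("RISK0079", impact=2, likelihood=3),
    Risk("RISK0048", impact=1, likelihood=3),
]
for r in sorted(risks, key=lambda r: r.exposure, reverse=True):
    print(r.risk_id, r.exposure, exposure_band(r.exposure))
```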