AI Development - Tips & Tricks

1.    AI Based Application Development – Tips & Tricks

At the beginning of a new AI-oriented project, besides defining the required functionality, it is critical to ask the questions listed in the article Artificial Intelligence in your Cockpit. Finding the right answers is not an easy task, and the right answers may add work to the program – the "temptation" to ignore at least some of the questions may therefore be high. However, if any question is missed or not addressed in the project plan, later corrections of problems may be expensive or even impossible.

1.1.     Safety First

Obviously, your AI function needs to work. But – as you will use it in the cockpit – it must also be safe. Now – "How can we know that the AI-based function's decisions are always safe?"

The core is to understand the hazards related to the AI function. Pay attention to reviewing the defined hazards with pilots / users of your AI application so that crew impacts are assessed correctly.

Another "catch" is situations that were not covered in the training data. If a situation is 'far away' from the scenarios covered in the data, the AI may struggle to provide good answers. Map the operational space and compare it with your data coverage. Understand the corner cases well. Check the system response at the operational boundaries and during corner-case scenarios with good, robust testing. Try what happens when inputs are completely outside the data coverage – for example, if you trained the AI to recognize animal species from photos, use an orange to test the response. The right AI answer in this case is: "No animal recognized."
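As an illustration, a minimal sketch of such a rejection behavior is shown below (Python; the class names are made up, and a softmax confidence threshold is used as an admittedly weak out-of-coverage signal – a real application would typically need a dedicated out-of-distribution detector):

```python
import numpy as np

def classify_with_rejection(probabilities, labels, confidence_threshold=0.8):
    """Return a label only when the model is confident enough.

    probabilities: softmax output of the classifier for one input
    labels: class names in the same order as the probabilities
    """
    best = int(np.argmax(probabilities))
    if probabilities[best] < confidence_threshold:
        # Flat, low-confidence output suggests the input lies outside the data coverage
        return "no class recognized"
    return labels[best]

labels = ["cat", "dog", "horse"]
# An out-of-coverage input (the orange) yields a flat distribution -> rejected
print(classify_with_rejection(np.array([0.40, 0.35, 0.25]), labels))  # "no class recognized"
print(classify_with_rejection(np.array([0.93, 0.05, 0.02]), labels))  # "cat"
```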

Prepare a prototype/mockup and analyze user interaction with the AI application. Focus on unexpected interactions (not considered at the beginning) and check whether new hazards may be introduced.

Keep in mind that "always safe" means "the probability of a failure with safety impact is low enough". Based on the selected AI algorithms, use proper statistical/analytical methods to determine the probability of incorrect output. Note that this differs from regular safety analysis – here we are not looking at the failure of some component (e.g., a CPU hardware failure) and how it contributes to the analyzed hazard; instead, we analyze the performance of a complex AI algorithm and look at the probability of a "wrong decision taken by the AI".
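For a classification-style output, one possible way to quantify this is to put a confidence interval around the error rate observed during testing. A minimal sketch (Python with SciPy, using the exact Clopper-Pearson interval; the counts are purely illustrative):

```python
from scipy.stats import beta

def error_rate_interval(errors, trials, confidence=0.95):
    """Clopper-Pearson (exact) confidence interval for the probability of a wrong AI output."""
    alpha = 1.0 - confidence
    lower = 0.0 if errors == 0 else beta.ppf(alpha / 2, errors, trials - errors + 1)
    upper = 1.0 if errors == trials else beta.ppf(1 - alpha / 2, errors + 1, trials - errors)
    return lower, upper

# Illustrative numbers: 12 wrong decisions observed in 10,000 evaluated scenarios
low, high = error_rate_interval(12, 10_000)
print(f"Observed error rate 0.12%, 95% CI [{low:.4%}, {high:.4%}]")
```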

Keep order in the data – do not lose track of which data were used for what (training, testing, …) and when (initial training phase, informal validation, re-training to improve performance, …). Data management has to be established from the early phases of the program.
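A minimal illustration of such bookkeeping is a manifest that records, for every data batch, what it was used for and in which program phase. The sketch below (Python; the batch names and fields are invented for illustration) writes such a manifest as a CSV file – a real program would typically keep this under configuration management:

```python
import csv
from datetime import date

# Illustrative dataset manifest: one row per data batch, recording its usage and program phase
manifest_rows = [
    {"batch_id": "flights_2021_Q3", "usage": "training", "phase": "initial training", "added": date(2021, 10, 1)},
    {"batch_id": "flights_2021_Q4", "usage": "testing", "phase": "informal validation", "added": date(2022, 1, 15)},
    {"batch_id": "flights_2022_Q1", "usage": "training", "phase": "re-training", "added": date(2022, 4, 20)},
]

with open("data_manifest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["batch_id", "usage", "phase", "added"])
    writer.writeheader()
    writer.writerows(manifest_rows)
```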

And of course – the whole regular safety assessment process of the system which hosts the AI application still applies.

1.2.     Human Centric Design

Your AI function is supposed to assist a pilot during flight, and thus your design has to be oriented toward pilot needs. At least the following areas should be considered in AI design:

  • The user interface design. This may not differ too much from designing other avionics functions – but if the provided functionality is novel to the industry, there may not yet be well-established ways to interact with the user. However, the generic rules still apply – the UI should be intuitive, easy to use, not distracting and, of course, satisfying.
  • Another aspect, more specific to AI, is the human perception of the AI. Typically, the first reaction to an AI function is "wow, that's cool" (if it's working). Then the user gets used to the function and the 'wow effect' is over. Instead of the initial enchantment, sensitivity to errors may show up. For example – with a new voice assistant you may initially be excited about the great functionality. Later, you get used to the fact that it works well most of the time and instead you start to pay attention to the assistant's errors. If the number of errors is higher than ~5%, the function starts to be perceived as unreliable. Now – if a system for advertisement recommendations makes more than 5% errors, you will probably sometimes be surprised by the displayed advertisement, but that's it. However, if your cockpit assistant makes 5% or more weird recommendations, you stop trusting it and the remaining 95% of good advice is compromised. It means that deploying an AI-based application into the field too early (even if there are no safety impacts) may compromise the function, even if the overall failure rate is relatively low and acceptable from a safety point of view.
  • Your function may acquire cultural/racial/gender/… biases. There are various sources of bias. For example – the Artificial Instructor's responses to detected piloting problems are based on recommendations from skilled pilot instructors, so the system is influenced by this fact. In other words – for the design of expert systems, the selection of the human experts who help set the system behavior and responses may lead to bias. Another source of bias may be your training data. For example – if your AI application processes pilot face images, the selection of images for training may cause bias. If you pick only images of white bald men with glasses for your training set, the AI application may respond incorrectly when someone does not fit well into your training set. The general advice for minimizing these biases is high variance in the inputs (experts from various countries/cultures equally represented during the definition of correct responses of the expert system, a balanced training set, …) – see the sketch after this list.
  • Unexpected interactions with AI – people may react to the AI function in unexpected ways that the designers did not consider. It is important to prepare a representative mockup/prototype for early evaluation to discover these interactions, assess them and – if needed – adjust the user interface/functionality as required.
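The sketch promised in the bias bullet above: one simple, illustrative way to spot an unbalanced training set is to count samples per value of a labelled attribute. The attribute and file names below are made up:

```python
from collections import Counter

def representation_report(samples, attribute):
    """Count training samples per value of a labelled attribute (e.g. region, gender, aircraft type)."""
    counts = Counter(sample[attribute] for sample in samples)
    total = sum(counts.values())
    for value, count in counts.most_common():
        print(f"{attribute}={value}: {count} samples ({count / total:.1%})")

# Illustrative metadata for a face-processing training set
training_samples = [
    {"image": "img_0001.png", "region": "EU"},
    {"image": "img_0002.png", "region": "EU"},
    {"image": "img_0003.png", "region": "ASIA"},
    # ... remaining records omitted
]
representation_report(training_samples, "region")
```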

image

How to visually indicate the need for a quick action?

image

Maybe this way?

 

1.3.     The More Data – The Better AI?

Many people think that the more data is used for machine learning, the better the AI will be. In general – yes and no. To train a good AI, we need the right amount of data – in other words, we need good coverage of the space in which the AI will operate. If you have a huge amount of data, where most of the data covers just a limited subset of the operational space, your AI may be over-trained for some scenarios and may fail to respond correctly in others. In such a case, the problem may not be solved just by adding more data – you need to add data to cover the gaps in operational space coverage, and maybe reduce the previous data set to make sure the scenarios in your training set are correctly equalized. A good data management process supported by iterative data-set analysis may help to reduce the risks related to improper usage/balance of the data. The right data management may also help you not to 'over-collect' data. If your project depends on flight data, data collection may be an expensive process, and collecting unnecessary data will primarily be a waste of money and time rather than a real contribution to the quality of the AI function.
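One possible way to run such an iterative data-set analysis is to histogram the collected data over a few key operational parameters and flag sparse regions. The sketch below (Python/NumPy) reports bins of an altitude/airspeed grid that contain too few samples; the parameters, bin edges, threshold and placeholder data are all illustrative:

```python
import numpy as np

def coverage_gaps(altitude_ft, airspeed_kt, alt_edges, spd_edges, min_samples=50):
    """Histogram the collected data over a 2-D operational space and report sparse bins."""
    counts, _, _ = np.histogram2d(altitude_ft, airspeed_kt, bins=[alt_edges, spd_edges])
    gaps = []
    for i in range(counts.shape[0]):
        for j in range(counts.shape[1]):
            if counts[i, j] < min_samples:
                gaps.append(((alt_edges[i], alt_edges[i + 1]),
                             (spd_edges[j], spd_edges[j + 1]),
                             int(counts[i, j])))
    return gaps

# Placeholder data standing in for recorded flight parameters
altitude_ft = np.random.uniform(0, 10_000, size=5_000)
airspeed_kt = np.random.uniform(60, 250, size=5_000)
for alt_bin, spd_bin, n in coverage_gaps(altitude_ft, airspeed_kt,
                                         alt_edges=np.linspace(0, 10_000, 6),
                                         spd_edges=np.linspace(60, 250, 6)):
    print(f"altitude {alt_bin}, airspeed {spd_bin}: only {n} samples")
```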

Another way the cost of data collection may be reduced (in the aerospace industry) is the use of simulation. This path may help to speed up data collection and keep the cost reasonable; however, the data must be checked for applicability. For example, during the AFI-1 program we discovered that data from a simulator can do a great job for the flight phase detection AI function, while for some other modules (e.g. monitoring of approach correctness) the simulator data were not representative enough. Further analysis showed that the reason was minor differences in the way pilots fly in the simulator versus how they fly in the real airplane. The possible solutions were to build a high-fidelity simulator or to collect enough real flight data for approaches.
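One way such an applicability check could be done – this is an illustration, not necessarily the method used in the AFI-1 program – is a statistical comparison of the same recorded parameter in simulator and flight data, for example with a two-sample Kolmogorov-Smirnov test. The pitch-angle arrays below are placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_channel(sim_values, flight_values, name, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test on one recorded parameter."""
    statistic, p_value = ks_2samp(sim_values, flight_values)
    verdict = "similar" if p_value > alpha else "DIFFERENT - review simulator fidelity"
    print(f"{name}: KS statistic {statistic:.3f}, p-value {p_value:.4f} -> {verdict}")

# Placeholder arrays standing in for pitch angles recorded on final approach
sim_pitch_deg = np.random.normal(loc=-3.0, scale=0.5, size=800)
flight_pitch_deg = np.random.normal(loc=-3.2, scale=0.9, size=300)
compare_channel(sim_pitch_deg, flight_pitch_deg, "pitch on final approach")
```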

Another way to optimize data for AI algorithms is the number of inputs into the algorithm. Sometimes designers think that injecting many inputs into the training is a good idea – "the machine learning of the neural network will sort this out." Well, selecting the right inputs for your application may help you reduce the computational demand, and the analysis of why the algorithm took some decision also becomes easier. The goal should be to minimize the inputs to the AI algorithm – but do not remove something critical. Watch the video and see how a small change in the input set may impact the results in a critical phase of the flight. The video (5x faster than real flight time) demonstrates the ability of two neural networks to detect the pattern phase which the airplane currently flies (together with a manual pattern phase tag created by an operator).

Both networks were trained with the same set of data and both networks have the same architecture. The only difference is that one network uses information about engine RPM while the other does not. You can see that both networks behave in a very similar way most of the time. The only noticeable difference is at the end of the flight. The network without RPM changes its state from landing to take off (for about 1-2 seconds of real flight time) and then back to landing (landing is the correct state), while the network with RPM continues to detect landing. A 1-2 second error in pattern phase detection seems like a negligible performance problem (as the whole flight time is 371 seconds) – however, the impact is significant. And why significant? The erroneous detection of take off during the landing phase (even for 1-2 seconds) may generate an Artificial Instructor output which tries to correct errors of the take off phase – but such a recommendation will be misleading in the landing phase.
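A simple way to study the effect of a single input during early experiments is an ablation comparison: train the same model twice, once with and once without the input, and compare the results. The sketch below uses a scikit-learn random forest on synthetic data purely as a stand-in for the neural networks described above; the feature names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def ablation_accuracy(X_train, y_train, X_test, y_test, feature_names, dropped):
    """Train twice - with and without one input - and compare test accuracy."""
    keep = [i for i, name in enumerate(feature_names) if name != dropped]
    full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    reduced = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train[:, keep], y_train)
    acc_full = accuracy_score(y_test, full.predict(X_test))
    acc_reduced = accuracy_score(y_test, reduced.predict(X_test[:, keep]))
    print(f"all inputs: {acc_full:.3f}, without '{dropped}': {acc_reduced:.3f}")

# Illustrative feature set for pattern-phase detection, with synthetic data
feature_names = ["airspeed", "altitude", "vertical_speed", "heading", "engine_rpm"]
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, len(feature_names)))
y = (X[:, 4] + 0.5 * X[:, 0] > 0).astype(int)   # synthetic labels that depend on engine_rpm
X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]
ablation_accuracy(X_train, y_train, X_test, y_test, feature_names, dropped="engine_rpm")
```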

1.4.     Performance Evaluation – Statistical vs. Subjective 

Once you are done with AI training and implementation, it is time to verify the AI functionality. This verification is done at multiple levels and by various means. The very first step is validation of the learning process results. Based on the selected AI algorithm, the proper method of learning process validation is selected. Let's take a look at the two neural networks for detection of the pattern phase (whose ability to detect the pattern phase was compared in the previous part – you remember the video, right?).

As discussed previously, both networks have the same architecture and the same inputs – except that the second network also utilizes information about engine RPM. And as we have seen in the video, both networks behave almost equally – except in the landing phase. During the landing phase, the network without information about the actual engine setting may confuse landing with take off. Now let's take a look at whether we could have recognized, in an early phase of the project, that the first network is prone to such a problem.

The figure below shows the confusion matrices for the two networks (captured after the initial training phase). Class 1 corresponds to the take off and Class 8 corresponds to the landing. If you check the overall performance of both networks after the initial training, you will find that both networks deliver a similar quality of outputs – the network without the engine RPM input reached 96.5% correct classifications while the network with the engine RPM input reached 97.2%. This difference seems minor, and the additional cost of making this input available to the AI application may seem not worth the investment. However, if we check the confusion matrix of the first network carefully, we can see that Class 1 and Class 8 have approximately 10% erroneous detections. After adding the engine RPM information, the error ratio drops to approximately 5% – and if you check the confusion matrix really carefully, you will see that the scenario where class 1 was classified as class 8 (or vice versa) is almost completely removed, while the confusion between class 1 and class 9 remains almost the same.

And a bonus question – what can you say about the initial data set by looking at the confusion matrix? It can be seen that the number of data points for the various pattern phases differs significantly – and better balancing of the data set may improve the performance as well.

image

Confusion matrix for the neural network without the engine RPM input.

 

image

Confusion matrix for the neural network with the engine RPM input.
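A per-class look at a confusion matrix like the ones above can be automated. The sketch below (Python with scikit-learn, run on synthetic labels for illustration) prints the overall accuracy, the per-class error rate and the class support – the row sums that reveal the data-set imbalance mentioned in the bonus question:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_report(y_true, y_pred, n_classes):
    """Overall accuracy, per-class error rate and class support from the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(1, n_classes + 1)))
    support = cm.sum(axis=1)      # row sums = samples per true class (reveals data imbalance)
    correct = np.diag(cm)
    print(f"overall accuracy: {correct.sum() / cm.sum():.1%}")
    for cls, n, ok in zip(range(1, n_classes + 1), support, correct):
        error = 1.0 - ok / n if n else 0.0
        print(f"class {cls}: {n} samples, error rate {error:.1%}")

# Synthetic pattern-phase labels (1 = take off, 8 = landing) standing in for real results
y_true = np.random.randint(1, 9, size=1000)
y_pred = np.where(np.random.rand(1000) < 0.9, y_true, np.random.randint(1, 9, size=1000))
per_class_report(y_true, y_pred, n_classes=8)
```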

Once the model seems to be performing well, the next step is the implementation of the model and its verification. Let's skip the software-level verification, which ensures the correctness of the SW implementation of the model, and take a look at the system-level testing of the final system. At the system level, statistical evaluation and data analysis may help to understand the algorithm performance and the probability of an incorrect decision, and they may also support explainability at the algorithm level. The statistical evaluation and analysis of the delivered outputs help to verify the design objectives and safety measures – however, they do not address how a pilot / end user perceives the interaction with the AI during real-time operational scenarios. For this reason, it is important to perform a set of tests focused on human interactions with the AI and to understand whether the designed algorithms and their user interface help to build the user's trust in the AI outputs.

The system-level testing of AI-based functionality (both scenarios above) may require extensive flight hours. The question whether simulation may help to reduce the cost and/or time of the testing is a valid one. The answer is – maybe yes / maybe no – it depends. When we tested the AFI-1 prototype, we considered utilization of the simulator as well as real flight tests. During the tests, test pilots performed a subjective evaluation of the AFI-1 recommendations – both during scenarios when a pilot error was intentionally inserted and during flights when no failure was intentionally made. After the initial set of tests, a statistical comparison of the subjective evaluations at the simulator and in the real airplane was done. It was found that the AFI-1 corrections presented to the test pilots during situations when no fault was intentionally inserted were subjectively evaluated differently. At the (no-motion) simulator, pilots typically accepted the corrections, while in the real airplane they considered the corrections 'too strict' or 'incorrect' more frequently. This difference was statistically significant. The table below shows the pilots' subjective evaluation of the AFI-1 outputs after the initial tests.

AI Output Evaluation   Correct    Sooner   Too Soon   Later    Too Late   Too Strict   Incorrect   Missing
Simulator Data         584        0        5          0        4          28           70          0
                       84.52%     0.00%    0.72%      0.00%    0.58%      4.05%        10.13%      0.00%
Airplane Data          184        1        0          2        1          35           43          3
                       68.40%     0.37%    0.00%      0.74%    0.37%      13.01%       15.99%      1.12%

Comparison of pilots’ subjective evaluation of AI outputs at simulator and in real airplane.
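For completeness, a sketch of how the statistical significance of such a difference could be checked (an illustration, not necessarily the exact test used in the program): a chi-square test on the 'Correct' vs. 'other' counts taken from the table above.

```python
from scipy.stats import chi2_contingency

# Counts from the table above: outputs rated 'Correct' vs. all other ratings
simulator = [584, 691 - 584]   # 691 simulator evaluations in total
airplane = [184, 269 - 184]    # 269 airplane evaluations in total

chi2, p_value, dof, expected = chi2_contingency([simulator, airplane])
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.2e}")  # small p-value -> difference unlikely to be chance
```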

The major contributor to this difference in judgment was the fact that at the no-motion simulator the pilots had a 'worse feeling' of the flight and were more willing to accept recommendations and corrections. A slight reduction of the AFI-1 'strictness' level improved the perception of the AI function in the real flight tests as well. And why does a small reduction of 'strictness' improve the subjective perception? Because after the 'strictness' reduction, the test pilots started to see the developing error as well – right after the AFI-1 prototype recommendation. And a note – the number of outputs classified as 'Incorrect' was primarily tied to the assessment of errors during the approach. The root cause of the relatively high number of 'Incorrect' evaluations in this phase during the initial round of testing was an incomplete airplane energy assessment.

In general, we learnt two main things:

  • A pilot is willing to accept AI recommendations if he understands why/how the AI came to the conclusion. This does not mean that pilots need to be experts on AI algorithms; it means that the AI outcomes for a specific situation must be understandable in their content to the human user.
  • Data collection and explainability at the engineering level are critical for corrections and changes in the AI logic – not just because of regulatory rules or to support the safety analysis, but also during the development phase: without algorithm explainability, the engineering team will not be able to effectively tweak the algorithms and/or the training data to improve the performance criteria and/or the user perception of the AI functionality.