Reinforcement learning tech demo


Guest

A reinforcement learning system (SARSA, on-policy, with an epsilon-greedy policy) implemented in OFP. Two examples are included. The first one, QlearnSimple, demonstrates the learning by showing how an M113 learns to navigate to a goal.

The second example is more complex: it teaches a squad to fight another squad.
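For readers unfamiliar with the algorithm, below is a minimal sketch in Python of tabular SARSA with an epsilon-greedy policy, the combination named above. The environment interface, action names, and constants are placeholder assumptions for illustration; the actual demo is implemented in OFP scripting.

Code:

import random
from collections import defaultdict

# Minimal tabular SARSA with an epsilon-greedy policy (illustrative only).
# The environment is assumed to expose reset() -> state and
# step(action) -> (next_state, reward, done); it stands in for the
# OFP mission logic, which is not reproduced here.

ALPHA = 0.1    # learning rate
GAMMA = 0.9    # discount factor
EPSILON = 0.1  # exploration probability

ACTIONS = ["north", "south", "east", "west"]  # placeholder action set
Q = defaultdict(float)                        # Q[(state, action)] -> value

def choose_action(state):
    # Epsilon-greedy: explore at random with probability EPSILON,
    # otherwise pick the action with the highest current estimate.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_episode(env):
    # On-policy: the update target uses the action actually chosen next.
    state = env.reset()
    action = choose_action(state)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = choose_action(next_state)
        target = reward + GAMMA * Q[(next_state, next_action)]
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state, action = next_state, next_action

Because the update target uses the action the policy actually selects next, SARSA is on-policy; the epsilon term keeps the agent occasionally exploring so all state-action pairs get visited.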

Brief introduction:

Quote:

- Definition...

Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance. Simple reward feedback is required for the agent to learn its behaviour; this is known as the reinforcement signal.

There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In the problem, an agent is supposed to decide on the best action to select based on its current state. When this step is repeated, the problem is known as a Markov Decision Process.

- Why Reinforcement Learning?

Reinforcement Learning allows the machine or software agent to learn its behaviour based on feedback from the environment. This behaviour can be learnt once and for all, or it can keep adapting as time goes by. If the problem is modelled with care, some Reinforcement Learning algorithms can converge to the global optimum; this is the ideal behaviour that maximises the reward.

This automated learning scheme implies that there is little need for a human expert who knows about the domain of application. Much less time will be spent designing a solution, since there is no need for hand-crafting complex sets of rules as with Expert Systems, and all that is required is someone familiar with Reinforcement Learning.

- Technology...

As mentioned, there are many different solutions to the problem. The most popular, however, allow the software agent to select an action that will maximise the reward in the long term (and not only in the immediate future). Such algorithms are known to have an infinite horizon.

In practice, this is done by learning to estimate the value of a particular state. This estimate is adjusted over time by propagating part of the next state's reward. If all the states and all the actions are tried a sufficient number of times, this will allow an optimal policy to be defined; the action which maximises the value of the next state is picked.

- When does Reinforcement Learning fail?

There are many challenges in current Reinforcement Learning research. Firstly, it is often too expensive in memory to store values for each state, since the problems can be quite complex. Solving this involves looking into value approximation techniques, such as Decision Trees or Neural Networks. There are many consequences of introducing these imperfect value estimations, and research tries to minimise their impact on the quality of the solution.

Moreover, problems are also generally very modular; similar behaviours reappear often, and modularity can be introduced to avoid learning everything all over again. Hierarchical approaches are commonplace for this, but doing this automatically is proving a challenge. Finally, due to limited perception, it is often impossible to fully determine the current state. This also affects the performance of the algorithm, and much work has been done to compensate for this perceptual aliasing.

- Who uses Reinforcement Learning?

The possible applications of Reinforcement Learning are abundant, due to the generality of the problem specification. As a matter of fact, a very large number of problems in Artificial Intelligence can be fundamentally mapped to a decision process. This is a distinct advantage, since the same theory can be applied to many different domain-specific problems with little effort.

In practice, this ranges from controlling robotic arms to find the most efficient motor combination, to robot navigation where collision avoidance behaviour can be learnt by negative feedback from bumping into obstacles. Logic games are also well-suited to Reinforcement Learning, as they are traditionally defined as a sequence of decisions: games such as poker, backgammon, Othello, and chess have been tackled more or less successfully.


Description taken from AI Depot.
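To make the value-propagation idea from the quoted Technology section concrete, here is a one-step temporal-difference (TD(0)) sketch in Python; the states, reward, and constants below are invented purely for illustration.

Code:

# One-step temporal-difference (TD(0)) update of state values: part of the
# next state's value (and the immediate reward) is propagated backwards.

ALPHA = 0.1  # step size
GAMMA = 0.9  # discount factor

V = {}  # V[state] -> estimated long-term value

def td_update(state, reward, next_state):
    v = V.get(state, 0.0)
    v_next = V.get(next_state, 0.0)
    # Move the estimate towards reward + discounted next-state value.
    V[state] = v + ALPHA * (reward + GAMMA * v_next - v)

# Invented two-step trajectory: s0 -> s1 -> goal, reward only at the goal.
# Repeated passes propagate the goal reward back towards s0.
for _ in range(100):
    td_update("s0", 0.0, "s1")
    td_update("s1", 1.0, "goal")

Repeated passes over such transitions propagate the goal reward back towards earlier states, which is the mechanism the quote describes.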

The global definitions are in Qinit.sqf, while the implementation-specific files are in the env directory.

This system will in the future be integrated with the Chain of Command core system, the Command Engine, and will be used for high-level AI simulation. This version uses a matrix for Q-function approximation, which will in the future be replaced by a full neural network.
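As a rough illustration of what a matrix-backed Q function might look like (the table sizes and indexing here are assumptions, not taken from the addon):

Code:

# Hypothetical Q "matrix": one row per discretized state, one column per
# action. The sizes below are assumptions, not the addon's actual values.

N_STATES = 64
N_ACTIONS = 4

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def greedy_action(state_index):
    # Greedy lookup is just an argmax over the state's row.
    row = Q[state_index]
    return max(range(N_ACTIONS), key=lambda a: row[a])

A neural network would replace this fixed table with a function that can generalise across similar states, which is presumably the motivation for the planned replacement.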

Download:

Qlearn Simple (Basic demonstration)

Qlearn 1.5 (Full demonstration)


Cool program! I combined it with an artillery script to have it learn ballistic targeting. Works quite well.

It took some time to get the hang of the examples, but I finally got it to work.

A small suggestion: write a manual, please.

Good work!


Could this be used to teach fighters how to use dumb bombs? Because right now the bastard AI drops them about 500m away from their target.

Guest

Quote (uiox @ Dec. 22 2002, 23:38):
manual?

No, not yet.

Quote:
Could this be used to teach fighters how to use dumb bombs? Because right now the bastard AI drops them about 500m away from their target.

Yes, it could. I am, however, currently making a couple of different neural networks for OFP, which would be much more suitable for the task.
