A Study on Automatic Fixture Design Using Reinforcement Learning
Darren Low Wei Wen1, Dennis Neo Wee Keong2 and A Senthil Kumar1
1Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive
1, Block EA, Singapore 117576, Singapore
2Singapore Institute of Manufacturing Technology (SIMTech), 73 Nanyang Drive, Singapore
Email: darren.low@u.nus.edu; mpeask@nus.edu.sg
Abstract
Fixtures are used to locate and secure workpieces for subsequent machining or measurement processes.
Designing these fixtures remains costly because of the significant technical know-how required.
Automated fixture design can mitigate much of this cost by reducing the dependence on skilled
labour, making it an attractive endeavour. Historical attempts at automated fixture design have
predominantly relied on case-based reasoning (CBR) to generate fixtures by extrapolating from
previously proven designs. These approaches are limited by their dependence on a fixturing library.
Rule-based reasoning (RBR) has also been attempted, but has proven difficult to implement
comprehensively. Reinforcement learning, on the other hand, requires no fixturing library; instead,
the agent builds experience by interacting with an environment. This paper discusses the use of
reinforcement learning to generate optimised fixturing solutions through a proposed reinforcement
learning driven fixture design (RL-FD) framework. In response to the fixturing environment,
adjustments to the exploration phase of the reinforcement learning process are studied. A case study
is presented, comparing a conventional exploration method with an adjusted one. Both agents show
improved average results over time, with the adjusted exploration model learning faster.
Key words
Fixture design, reinforcement learning, deep learning, locator positions
1. Introduction
Fixtures are an integral aspect of manufacturing, providing the essential workpiece locating and
clamping elements prior to subsequent processes. Well designed fixtures are crucial to achieving
consistent and accurate manufacturing outcomes. Designing these fixtures is challenging due to the
many engineering considerations and optimisation strategies involved. Conventionally, fixture design
relies on the vast heuristic experience of human designers [1], which demands years of apprenticeship
to acquire. Reliance on these skilled professionals is essential to creating optimal fixtures, which
makes fixtures costly. Moreover, the experience-heavy nature of fixture design creates a
significant knowledge gap for junior fixture design engineers, which can limit the effectiveness of
their contributions. Much research has therefore been done on automating fixture design, which
would potentially reduce design costs, human error and lead time.
1.1. Automated fixture design systems
There has been significant research into the ideal model of an automated fixture design solution.
CBR has been the predominant approach in these attempts [2-5]. CBR matches a given workpiece to
similar proven fixturing designs, then provides the necessary design adjustments [2]. In essence,
CBR requires feature recognition, an indexed design library, design retrieval and evaluation, and
final adjustments to work.
The need to properly index a design library poses a major limitation of CBR. McSherry [6]
argues that CBR is limited by inseparability cases, whereby poor definition of indexing parameters
could result in two equally weighted solutions. This rare scenario could result in a less optimal
solution being used as the reference design. McSherry [6] suggests that this occurs due to
inadequate representation of the indexing mechanisms used. Therefore, the selection of CBR
indexing parameters can be difficult to optimise.
Another limitation of CBR is its reliance on extensive and comprehensive fixture design
libraries to produce fixturing solutions, which limits its ability to process unique
and unorthodox workpieces. If a given workpiece is significantly different from those found in the
library, CBR may not produce a valid result [7]. In other words, although CBR has been shown to be
effective at solving experience-based problems through inference from a library, it cannot adapt
to significantly different situations. This inflexibility limits CBR's potential in handling edge
cases, and CBR ultimately still depends on fixture designers to provide high-quality examples.
Apart from CBR approaches, RBR has also been studied for generating fixture designs [1,
8-10]. Unlike CBR, RBR uses a set of defined rules to convert geometric information into
suggested positions for fixture locating elements.
RBR's main disadvantage is the difficulty of accurately and comprehensively defining the
rules needed to encompass all possible fixture designs. Prentzas and Hatzilygeroudis [11] describe
the complications of converting a domain expert's knowledge and experience into rules precisely
and exhaustively. Additionally, too few rules would result in poor coverage of possible
fixturing problems, whereas too many rules would make the code significantly more complex.
Zhang, et al. [12] published a case study on the use of a combination of CBR and RBR to
generate fixture design solutions. Such approaches attempt to combine the benefits of both
methods, but still suffer from inheriting their individual limitations as discussed previously.
Machine learning has also been explored for automating fixture design [13]. However, much
of the reported work remains conceptual.
1.2. Reinforcement learning
Research into AI has greatly accelerated in the past decade, spurred on by widespread adoption
and availability of machine learning.
Reinforcement learning is a subset of machine learning in which an artificial agent learns
through interaction with an environment. Conventional reinforcement learning agents leverage
deep convolutional neural networks (CNNs) to mimic neurological vision processing [14]. CNNs
are highly successful neural networks capable of processing 2-dimensional (2D) data such as
images or videos [15]. In essence, this neural network receives a 2D input and produces a
relevant agent action as the output.
In reinforcement learning, the agent is trained using a reward function for performance
feedback. This reward function evaluates the action chosen by the agent, resulting in a positive or
negative reward. The agent ultimately strives to achieve the maximum cumulative reward, which
mimics a human performing all the optimal actions required for the task.
Reinforcement learning has been shown to be effective at training agents to accomplish
complex tasks. Classic Atari games have been used extensively to demonstrate effective
reinforcement learning [14, 16]. Reinforcement learning has also shown successes in real-world
problems like traffic engineering [17] and supply-chain management [18].
1.3. Discussion on reinforcement learning driven automated fixture design
The use of reinforcement learning to automate fixture design has significant advantages over
current approaches in CBR and RBR driven fixture design.
Reinforcement learning removes the requirement for a fixturing library and for fixture design
rules, which are prerequisites for CBR and RBR based approaches respectively. Instead,
reinforcement learning enables the adoption of a single fundamental rule: the workpiece must not
move. The agent then trains itself to adapt to the given environment under that rule. This
approach therefore avoids many of the limitations and assumptions affecting CBR and RBR based
fixture design.
The use of reinforcement learning also brings more flexibility and dynamism to generating
fixture designs. As the agent improves over time through interaction with the fixturing
environment, novel solutions could be generated for even the most unorthodox workpieces. CBR
on the other hand, struggles in generating fixturing solutions for workpieces that are significantly
different from those present in the fixture library. Workpieces that require fundamentally different
design rules would also struggle with the rigidity of RBR’s rules. Therefore, reinforcement
learning can be used to train a smart and adaptive agent in generating novel fixturing solutions for
even the most unconventional of workpieces.
In summary, automatic fixture design using CBR and RBR has been well researched, but
critical assumptions in their fundamental mechanisms limit their practicality for fixture design
use. Reinforcement learning provides a radically different approach to generating fixturing
solutions, whereby an agent is progressively trained to make better fixturing decisions.
2. Application Framework
In this paper, we propose a reinforcement learning driven fixture design (RL-FD) framework as a
novel approach to automating fixture design decisions. The RL-FD architecture is shown in
Figure 1. A 2D silhouette of the given workpiece is first generated, after which the initial
locator positions are generated. Using an interactive physics-based silhouette-locator environment,
the fixture's ability to constrain the workpiece is tested. A reinforcement learning agent,
built on a CNN, is trained against this environment. With rewards given for acceptable locator
removals, the agent should progressively get better at fixturing decision making. Eventually, the
agent is able to create highly optimised fixturing solutions, which are stored and listed out to
the user for selection. The selected solution can then be used to automatically generate a
computer-aided design (CAD) fixture.
Figure 1: RL-FD framework
Several key requirements are necessary for reinforcement learning to function; these are
discussed in the following sections.
2.1. Generating workpiece silhouette
In any 3-dimensional fixture system, the outer surface tends to be the most important for locating
and clamping. This surface can be defined using a 2D top-down silhouette of the workpiece, as shown
in Figure 2. Using a 2D silhouette also simplifies the physics and improves compatibility with
readily available CNNs.
Figure 2: Conversion of workpiece to silhouette (a) Isometric view (b) Top view (c) Workpiece Silhouette
2.2. Physics simulation and state representation
To determine if a fixture is properly constraining a particular workpiece silhouette, it is necessary
to represent physics interactions in the environment. This paper utilises pybox2d, a rigid-body
physics simulation library created by Catto [19] and later adapted for Python by Lauer [20]. The
physics environment was programmed to be compatible with OpenAI's Gym interface, so it can be
easily used with other Gym-compatible agents.
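The Gym-style interface this implies can be sketched as follows. This is a plain-Python stub mirroring the `reset`/`step` contract (the paper does not publish its code, so the class and method names here are illustrative, and the pybox2d physics test is stubbed out):

```python
import numpy as np

class FixtureEnv:
    """Sketch of the fixture environment's Gym-style interface.

    The real environment wraps a pybox2d world containing the workpiece
    silhouette (a dynamic body) and the locators (static circles); here
    the physics test is stubbed out so only the interface is shown.
    """
    KEEP, REMOVE = 0, 1

    def __init__(self, n_locators=12, frame_shape=(84, 84)):
        self.n_locators = n_locators
        self.frame_shape = frame_shape
        self.reset()

    def reset(self):
        self.active = set(range(self.n_locators))          # all locators present
        self.pending = list(np.random.permutation(self.n_locators))
        return self._render_state()

    def step(self, action):
        locator = self.pending.pop(0)                      # randomly ordered locator
        reward, done = 0, False
        if action == self.REMOVE:
            self.active.discard(locator)
            if self._workpiece_constrained():
                reward = 1                                 # successful removal
            else:
                done = True                                # fixture invalid: reset
        if not self.pending or len(self.active) <= 3:
            done = True                                    # episode termination rules
        return self._render_state(), reward, done, {}

    def _workpiece_constrained(self):
        return True  # placeholder for the pybox2d force/torque test

    def _render_state(self):
        # placeholder for the rendered, downscaled greyscale frame
        return np.zeros(self.frame_shape, dtype=np.uint8)
```

A Gym-compatible agent would then call `reset()` once per episode and `step(action)` repeatedly, exactly as with any other Gym environment.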
This programmed environment contains two types of objects, as shown in Figure 3a. First, the
silhouette is represented as a dynamic object, in this case a rectangle with a pair of opposing
chamfers. The locators are represented as static circle shapes to replicate point contacts on
the silhouette.
Figure 3: Observation scaling to reduce input size (a) Original (b) Scaled down (c) Greyscale
Reinforcement learning also requires information on the current state of the simulation, which is
used to determine the agent's subsequent actions and behavioural adjustments. In this paper,
pyglet is used to render a 2D image from the current physics state reported by pybox2d. The
rendered image is then scaled down and greyscaled to significantly reduce the state size, as shown
in Figure 3.
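The downscale-and-greyscale step can be sketched with a simple block-average over the rendered RGB frame (the paper does not give its exact resolutions or method, so the scale factor and channel-averaging here are assumptions):

```python
import numpy as np

def preprocess(frame_rgb, scale=4):
    """Downscale an RGB frame by block-averaging and convert to greyscale.

    frame_rgb: uint8 array of shape (H, W, 3) as rendered by pyglet.
    scale: integer downscaling factor (H and W assumed divisible by it).
    """
    h, w, _ = frame_rgb.shape
    grey = frame_rgb.mean(axis=2)                      # greyscale: average the channels
    small = grey.reshape(h // scale, scale,
                         w // scale, scale).mean(axis=(1, 3))
    return small.astype(np.uint8)                      # compact state for the CNN
```

For example, a 160x160x3 render becomes a 40x40 single-channel array, shrinking the state by a factor of 48.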
2.3. Action space and reward feedback
Only two actions are available to the agent: remove or keep the currently selected locator.
These two actions represent the absolute minimum necessary for proper interaction with the
physics simulation, which helps speed up the learning process.
Table 1: Fixture testing (tests 1-6; the force and torque diagrams for each test appear as images in the original)
To encourage the agent to reach a more optimal fixture (one with the fewest locators), a
positive reward is given for each successful remove action. Successful removals are evaluated by
applying a series of forces and torques, each for 30 simulation steps, to the workpiece
silhouette, as summarised in Table 1. The evaluation simply identifies significant linear or
angular movement beyond acceptable limits, which indicates that the workpiece is no longer secure
and the fixture is therefore invalid.
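This force/torque validity check can be sketched against the pybox2d interface (`world.Step`, `body.ApplyForceToCenter`, `body.ApplyTorque`). The displacement and rotation limits and the test list below are illustrative values, not those used in the paper:

```python
import math

def fixture_is_valid(world, body, tests, steps=30,
                     max_disp=0.05, max_rot=math.radians(2)):
    """Apply each test force/torque for `steps` simulation steps and check
    that the silhouette body stays within acceptable movement limits.

    `world` and `body` follow the pybox2d interface; `tests` is a list of
    ((fx, fy), torque) pairs, as summarised in Table 1.
    """
    for force, torque in tests:
        x0, y0, a0 = body.position[0], body.position[1], body.angle
        for _ in range(steps):
            body.ApplyForceToCenter(force, True)
            body.ApplyTorque(torque, True)
            world.Step(1.0 / 60, 6, 2)               # pybox2d step signature
        dx = math.hypot(body.position[0] - x0, body.position[1] - y0)
        da = abs(body.angle - a0)
        if dx > max_disp or da > max_rot:
            return False                             # workpiece moved: fixture invalid
    return True
```

The function returns as soon as any single test moves the workpiece beyond the limits, since one failed test is enough to invalidate the fixture.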
Table 2: Example of possible actions and rewards

Frame              | 1    | 2      | 3      | 4
Action             | Keep | Remove | Remove | -
Reward             | 0    | 1      | 0      | -
Cumulative Reward  | 0    | 1      | 1      | 0

(The Image row of environment snapshots is not reproducible in text.)
In RL-FD, the decision to perform either action is made by the neural network, discussed
subsequently in Section 2.5. To better illustrate the working mechanisms of the environment,
Table 2 demonstrates an example where the following user-selected actions are made:
Frame 1
- A random locator is chosen and highlighted green.
- The agent chooses the keep action, which provides no reward.
Frame 2
- A different randomly chosen locator is highlighted green.
- The agent chooses the remove action, which removes the circle and begins the fixture test.
- The test identifies a constrained workpiece silhouette. The fixture solution is therefore
acceptable, and a reward of 1 is given.
Frame 3
- A different randomly chosen locator is highlighted green.
- The agent chooses the remove action, which removes the circle and begins the fixture test.
- The test identifies an unconstrained workpiece silhouette. The fixture solution is therefore
unacceptable, and a reward of 0 is given. The cumulative reward for this episode is therefore 1.
Frame 4
- As the fixture is considered invalid, the simulation is reset to its original state and
another random locator is chosen.
The simulation resets when any one of the following conditions is met:
a) The fixture is unable to constrain the workpiece.
b) All listed locators have been acted upon by the agent.
c) Only 3 locators remain¹.
Randomised selection of locators provides even representation of all test locators. If the
locators were iterated in a fixed sequence, locators towards the end of the list would have fewer
chances to be acted upon by the agent. Randomising therefore provides wider case representation.
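This randomised presentation order amounts to shuffling the locator list once per episode; a minimal sketch (the function name and optional seed are illustrative):

```python
import random

def locator_order(locators, seed=None):
    """Return the locators in a random order, so every locator has an equal
    chance of being presented early to the agent, rather than the tail of a
    fixed list being starved of decisions."""
    order = list(locators)
    random.Random(seed).shuffle(order)
    return order
```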
¹ Three locators are typically the optimal result in most scenarios, but this goal can be user-adjusted for special cases.
2.4. Automatic generation of initial locators
Figure 4: Locator geometries
Workpiece silhouette geometries are defined using the X- and Y-coordinates of the polygon
vertices within the physics engine. The list of given coordinates can be used to generate the
initial locators around the silhouette using the following equations:

\theta = \arctan2\left(y_B - y_A,\; x_B - x_A\right) \quad (1)

n \in \left[0,\; \frac{\sqrt{(x_B - x_A)^2 + (y_B - y_A)^2}}{2r}\right] \quad (2)

x_n = x_A + r(2n+1)\cos\theta \qquad y_n = y_A + r(2n+1)\sin\theta \quad (3)

x'_n = x_n + r\cos(90^\circ - \theta) \qquad y'_n = y_n - r\sin(90^\circ - \theta) \quad (4)

where (x_A, y_A) and (x_B, y_B) represent the initial and subsequent silhouette geometrical
coordinates respectively, r is the locator radius, and (x_n, y_n) are coordinates between A and B,
where n is a range of integers defined by Eq. (2). (x'_n, y'_n), the centre of each circular
locator, can then be determined using Eq. (4). The conversion continues until all edges of the
given polygon have been acted upon. These final converted coordinates are used to automatically
populate the initial locator positions based on the given geometry.
Locator geometries can also be inserted manually, which is necessary in cases where specific
edges of the workpiece are involved in machining processes.
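Equations (1)-(4) can be implemented directly. This sketch walks each polygon edge A->B in steps of one locator diameter and offsets each centre by the radius normal to the edge (the function name and vertex-list convention are illustrative):

```python
import math

def generate_locators(vertices, r):
    """Place candidate locators of radius r along each polygon edge,
    following Eqs. (1)-(4): walk edge A->B in steps of one locator
    diameter (Eqs. 1-3), then offset each centre by r normal to the
    edge (Eq. 4) so the circle touches the silhouette at a point."""
    locators = []
    m = len(vertices)
    for i in range(m):
        (xa, ya), (xb, yb) = vertices[i], vertices[(i + 1) % m]
        theta = math.atan2(yb - ya, xb - xa)                     # Eq. (1)
        count = int(math.hypot(xb - xa, yb - ya) / (2 * r))      # Eq. (2)
        for n in range(count):
            xn = xa + r * (2 * n + 1) * math.cos(theta)          # Eq. (3)
            yn = ya + r * (2 * n + 1) * math.sin(theta)
            xp = xn + r * math.cos(math.radians(90) - theta)     # Eq. (4)
            yp = yn - r * math.sin(math.radians(90) - theta)
            locators.append((xp, yp))
    return locators
```

For a 4x4 square traversed counter-clockwise with r = 1, this yields two locators per edge, each centred one radius outside the silhouette.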
2.5. Neural network
This paper studies the use of CNNs to arrive at effective fixturing decisions. CNNs are
convenient in this study as a 2D pixel array (Section 2.2) can be used as the input.
The CNN used in this study is shown in Figure 5. This neural network was adapted from
Muntean [21] and was originally designed for Atari game environments. Changes include widening
the neural network input layer to accommodate the larger pixel array and adjustments to grey
scaling.
Figure 5: Neural network (Not drawn to scale)
In this paper, the converted image from Figure 3 is passed through the CNN, which consists of
three convolutional layers. These convolutions reduce the size of the input image between layers,
which reduces downstream computational load while maintaining the spatial relationships of the
original image. A flatten operation is then performed on the output, transforming the resulting
image into 1-dimensional data. The flattened output is then passed through two fully connected
layers of 512 and 2 nodes respectively, with the final 2 nodes representing the decision to
perform either Keep or Remove.
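A network of this shape can be sketched in PyTorch. The three-conv / flatten / 512 / 2 structure follows the description above; the kernel sizes and strides are taken from the Atari DQN of Mnih et al. [14], from which the paper's network was adapted, since the exact values used here are not reported:

```python
import torch
import torch.nn as nn

class FixtureDQN(nn.Module):
    """Sketch of the decision network: three convolutional layers, a
    flatten, then fully connected layers of 512 and 2 nodes (Keep/Remove).
    Kernel sizes and strides are assumptions borrowed from the Atari DQN."""

    def __init__(self, in_size=84):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),   # conv 1
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # conv 2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # conv 3
            nn.Flatten(),
        )
        with torch.no_grad():                       # infer the flattened size
            n_flat = self.conv(torch.zeros(1, 1, in_size, in_size)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_flat, 512), nn.ReLU(),
            nn.Linear(512, 2),                      # Q-values: Keep, Remove
        )

    def forward(self, x):
        return self.head(self.conv(x / 255.0))      # scale uint8 pixels to [0, 1]
```

Computing the flattened size with a dummy forward pass keeps the head correct if the input layer is widened, mirroring the paper's adjustment for its larger pixel array.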
A key consideration in reinforcement learning is the incorporation of an exploration method,
which provides adequate and representative training on all possible actions available to the
agent. At the start of the learning process, the agent picks actions from this exploration
function instead of the neural network. Over time, dependence on this exploration mechanism
decreases and the neural network is used to make decisions instead.
Figure 6: Exploration mechanism. (a) Equal chance of keep or remove actions in the initial exploration phase. (b) 75% chance of remove and 25% chance of keep in the initial exploration phase.
A widely adopted exploration method is to select actions randomly, with each possible action
having an equal chance of being selected. For the environment in this paper, the agent only
needs to decide between keep and remove. The exploration mechanism for these two actions is
therefore as illustrated in Figure 6a, where both actions are chosen 50% of the time.
However, in the proposed fixture design environment, it can be observed that an ideal agent
would perform the remove action frequently in order to maximise its cumulative reward. This
paper therefore also compares a chance-adjusted model in which the remove and keep actions are
performed 75% and 25% of the time respectively, as shown in Figure 6b. This adjustment should
theoretically allow the agent to learn faster by bringing it closer to an ideal state, while
still maintaining sufficient representation of the keep action.
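The chance-adjusted exploration fits naturally into a standard epsilon-greedy scheme; a minimal sketch (the function names and the epsilon decay being handled by the caller are assumptions):

```python
import random

def explore_action(p_remove=0.75):
    """Biased exploration: choose Remove with probability p_remove and
    Keep otherwise. p_remove=0.5 gives the conventional 50-50 scheme;
    p_remove=0.75 gives the chance-adjusted scheme of Figure 6b."""
    KEEP, REMOVE = 0, 1
    return REMOVE if random.random() < p_remove else KEEP

def select_action(q_values, epsilon, p_remove=0.75):
    """Epsilon-greedy wrapper: explore with probability epsilon (decayed
    over training by the caller), otherwise act greedily on the
    network's Q-values."""
    if random.random() < epsilon:
        return explore_action(p_remove)
    return max(range(len(q_values)), key=q_values.__getitem__)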
3. Case study
This paper utilises the workpiece shown in Figure 2a as a case study. Using the proposed RL-FD
framework, a feasible fixture is generated for machining the top surface of this workpiece.
As discussed in Section 2.5, two different agents were trained on this environment until
5,000,000 steps² were performed. These are:
a) 50-50 Agent: performs 50% Remove and 50% Keep when exploring
b) 75-25 Agent: performs 75% Remove and 25% Keep when exploring
\text{Average Final Reward} = \frac{\sum \text{Final Rewards}}{\text{Number of runs}} \quad (5)
The average final reward is used to evaluate learning performance and is calculated at every
step using Eq. (5). Final rewards are recorded when the minimum number of active constraints is
achieved, or when the solution is invalid. In both cases, the physics environment is then reset.
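Eq. (5) amounts to a running average of recorded final rewards; a minimal sketch (the class name is illustrative):

```python
class RewardTracker:
    """Running average of final episode rewards per Eq. (5): the sum of
    recorded final rewards divided by the number of completed runs."""

    def __init__(self):
        self.total = 0.0
        self.runs = 0

    def record(self, final_reward):
        self.total += final_reward
        self.runs += 1

    @property
    def average(self):
        return self.total / self.runs if self.runs else 0.0
```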
² A step counts as a single action performed by the agent. 5,000,000 steps was chosen as the basis of comparison between the two agents, as preliminary runs showed stable results prior to this step count.
Figure 7: Average Final Rewards obtained during training; a single step refers to either a Remove or Keep action.
As shown in Figure 7, both agents show performance improvements over time. This demonstrates
their capability to learn from the fixture design environment and to progressively make better
fixture design decisions. However, the 50-50 Agent initially started with a noticeably lower
average reward and took significantly longer to reach higher scores than the 75-25 Agent. This
observation can be attributed to the more optimal exploration phase of the 75-25 Agent, which
resulted in more favourable learning performance.
Using the same computer in both instances, faster learning was observed from the 75-25 Agent
compared to the 50-50 Agent. The 75-25 Agent also obtained the optimal fixture (highest reward)
in 20 hours, compared to 35 hours for the 50-50 Agent. The 43% time reduction exhibited by the
75-25 Agent is preferable, resulting in an overall faster fixture optimisation process which
consequently reduces fixture lead times.
Figure 8: Selection of training results.
Generated solutions that receive high cumulative rewards are saved and listed out as shown
in Figure 8. The presented solutions have passed the fixturing test described in Section 2.3 and
therefore represent the valid and optimal fixturing solutions for the given workpiece.
Figure 9: Generated fixture (a) Raw workpiece prior to machining (b) Workpiece during intermediate machining processes
Upon selecting any of the fixturing solutions in Figure 8, the user selects which of the given
locators should be converted into clamping elements. The completed CAD model can then be
generated, as shown in Figure 9.
3.1. Discussions
The case study has demonstrated the use of reinforcement learning to automate desirable fixture
designs. As neither fixture design rules nor prior fixture design cases were required, the
previously discussed limitations posed by RBR and CBR respectively were avoided. The adaptive
nature of reinforcement learning empowers the agent to work with varying workpiece shapes,
progressively becoming better at making fixturing decisions through training.
It is also important to plan the relevant machining processes prior to performing RL-FD.
In the case study demonstrated, only the machining of the top surface was considered. Therefore,
the silhouette was obtained with reference to the top surface. RL-FD would need to be performed
again for machining processes that are performed on other surfaces with differing silhouettes.
Optimisation of the exploration method has also exhibited an improvement in learning
performance. Adjustments to the decisions made during the exploration phase led to a significant
reduction in the time required to generate an optimal solution. Such training optimisation
strategies are vital, as solution lead time is a significant consideration for practical use.
Results generated by the proposed automated RL-FD framework can also be applied to robot
grippers. Solutions generated through the studied method can be interpreted as gripper finger
positions, ensuring that a picked-up object remains secure during subsequent movements.
Figure 10: Difficulty in accurately evaluating fixtures
One limitation observed in the described automatic RL-FD framework is the inaccuracy of the
fixturing test performed. As illustrated in Figure 10, when Force A is applied, the workpiece
remains in position and the fixture is therefore valid. On the other hand, Force B would cause a
counter-clockwise rotation of the workpiece, resulting in an invalid fixture. The number,
location and direction of the forces applied during the test are therefore essential to accurate
evaluation of the fixture. This problem could be addressed by modifying the physics engine to
perform a more comprehensive physics test, which would be more computationally expensive if not
defined optimally. Consideration of the required machining processes is a possible avenue for
optimising this fixturing test.
4. Conclusions
Fixture design relies heavily on experience and awareness of a multitude of engineering
considerations. Automating this process is therefore vital, as it can reduce the reliance on
experienced designers and thereby improve overall efficiency. Historical attempts at automated
fixture design have largely adopted CBR and RBR, which exhibit certain fundamental limitations
in practical use. This paper demonstrates an automated RL-FD framework using reinforcement
learning. The framework was described and demonstrated through a case study. Reinforcement
learning was found to be capable of training a neural network to make better fixture design
decisions over time. Optimisation of the exploration mechanism was also shown to be effective in
improving the training time. Further research is ongoing to encompass more complex workpieces
using non-silhouette-based approaches.
References
[1] A. Y. C. Nee and A. S. Kumar, "A Framework for an Object/Rule-Based Automated
Fixture Design System," CIRP Annals, vol. 40, no. 1, pp. 147-151, 1991.
[2] A. S. Kumar and A. Y. C. Nee, "A framework for a variant fixture design system using
case-based reasoning technique," Manufacturing Science and Engineering, ASME, vol. 3,
pp. 763-775, 1995.
[3] S. H. Sun and J. L. Chen, "A fixture design system using case-based reasoning,"
Engineering Applications of Artificial Intelligence, vol. 9, no. 5, pp. 533-540, 1996.
[4] W. Li, P. Li, and Y. Rong, "Case-based agile fixture design," Journal of Materials
Processing Technology, vol. 128, no. 1, pp. 7-18, 2002.
[5] H. Hashemi, A. M. Shaharoun, and I. Sudin, "A case-based reasoning approach for design
of machining fixture," The International Journal of Advanced Manufacturing Technology,
vol. 74, no. 1, pp. 113-124, 2014.
[6] D. McSherry, "The Inseparability Problem in Interactive Case-Based Reasoning," in
Research and Development in Intelligent Systems XVIII, London: Springer, 2002, pp. 109-122.
[7] P. S. Szczepaniak and A. Duraj, "Case-Based Reasoning: The Search for Similar Solutions
and Identification of Outliers," Complexity, vol. 2018, Art. no. 9280787, 2018.
[8] A. Y. C. Nee, A. S. Kumar, S. Prombanpong, and K. Y. Puah, "A Feature-Based
Classification Scheme for Fixtures," CIRP Annals, vol. 41, no. 1, pp. 189-192, 1992.
[9] X. Dong, W. R. DeVries, and M. J. Wozny, "Feature-Based Reasoning in Fixture Design,"
CIRP Annals, vol. 40, no. 1, pp. 111-114, 1991.
[10] A. S. Kumar, A. Y. C. Nee, and S. Prombanpong, "Expert fixture-design system for an
automated manufacturing environment," Computer-Aided Design, vol. 24, no. 6, pp. 316-326, 1992.
[11] J. Prentzas and I. Hatzilygeroudis, "Categorizing approaches combining rule-based and
case-based reasoning," Expert Systems, vol. 24, no. 2, pp. 97-122, 2007.
[12] F. P. Zhang, D. Wu, T. H. Zhang, Y. Yan, and S. I. Butt, "Knowledge component-based
intelligent method for fixture design," The International Journal of Advanced
Manufacturing Technology, vol. 94, no. 9, pp. 4139-4157, 2018.
[13] A. S. Kumar, V. Subramaniam, and T. Boon Teck, "Conceptual design of fixtures using
machine learning techniques," The International Journal of Advanced Manufacturing
Technology, vol. 16, no. 3, pp. 176-181, 2000.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I.
Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level
control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, 2015.
[15] I. Arel, D. C. Rose, and T. P. Karnowski, "Deep Machine Learning - A New Frontier in
Artificial Intelligence Research [Research Frontier]," IEEE Computational Intelligence
Magazine, vol. 5, no. 4, pp. 13-18, 2010.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M.
Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint
arXiv:1312.5602, 2013.
[17] M. D. Pendrith, "Distributed reinforcement learning for a traffic engineering application,"
in Proceedings of the Fourth International Conference on Autonomous Agents, 2000, pp. 404-411.
[18] S. K. Chaharsooghi, J. Heydari, and S. H. Zegordi, "A reinforcement learning model for
supply chain ordering management: An application to the beer game," Decision Support
Systems, vol. 45, no. 4, pp. 949-959, 2008.
[19] E. Catto. (2011). Box2D. Available: https://github.com/erincatto/Box2D
[20] K. Lauer. (2011). pybox2d. Available: https://github.com/pybox2d/pybox2d
[21] A. Muntean. (2017). Deep Q-Learning. Available:
https://github.com/andreimuntean/Deep-Q-Learning