Introduction
1.1 Simulation-Based Optimization
Computer simulations are used extensively as models of
real systems to evaluate output responses. Applications of simulation are
widely found in many areas including supply chain management, finance,
manufacturing, engineering design and medical treatment [42, 48, 60, 83, 96].
The choice of optimal simulation parameters can lead to improved operation, but
configuring them well remains a challenging problem. Historically, the
parameters are chosen by selecting the best from a set of candidate parameter
settings. Simulation-based
optimization [3, 40, 41, 47, 69, 90] is an emerging field which integrates
optimization techniques into simulation analysis. The corresponding objective
function is a performance measurement associated with an experimental
simulation. Due to the complexity of the simulation, the objective function may
be difficult and
expensive to evaluate. Moreover, noise in the objective function often
complicates the optimization process, and deterministic optimization tools may
return inaccurate solutions. Indeed, derivative information is typically
unavailable, so many derivative-dependent methods are not applicable to these
problems.
Figure 1: The microwave coaxial antenna, showing the floating sleeve, outer conductor, sleeve position, and Teflon catheter.
A particular example is an engineering design problem
for a microwave coaxial antenna used in hepatic tumor ablation. (More details can
be found in Section 4.2.) We aim to determine the optimal dimensions of the
antenna, such as the slot size, the dipole tip length, and the thicknesses of
the Teflon coatings (see Figure 1 for an illustration of the antenna), to yield
a desired treatment performance. Finite element models (MultiPhysics 3.2) are
used to simulate the electromagnetic (EM) field distribution in the liver for a
particular design. Since the dielectric properties of the tissue are assumed to
vary within ±10% of their average values, it is important to ensure that the
antenna used in ablation is robust, i.e., relatively insensitive to variations
in the physical properties of the tissue.
Although real-world problems take many forms, in this thesis we consider the
following bounded stochastic formulation:

    min_{x∈Ω} f(x) = E[F(x, ξ(ω))],   (1.1)

where

    Ω = {x ∈ ℝⁿ : l ≤ x ≤ u}.
Here, l and u are the lower and upper bounds on the input parameter x,
respectively. We focus on the case where the parameter x comes from the
continuous set Ω rather than a finite set. ξ(ω) is a random vector defined on a
probability space. The sample response function F : ℝⁿ × ℝᵈ → ℝ takes two
inputs: the simulation parameters x ∈ ℝⁿ and a random sample of ξ(ω) in ℝᵈ.
Note that the variables x correspond to the design parameters (e.g., slot size,
dipole tip length, etc.) in the above example, and the random realizations
correspond to the different dielectric properties of individuals. As in the
example, our approaches concentrate on cases where the number of design
parameters is small.
F(x, ξ(ω)) corresponds to a particular evaluation using the finite element
model. Given a random realization ξᵢ of ξ(ω), F(x, ξᵢ) can be evaluated via a
single simulation run. The underlying objective function f(x) is computed by
taking an expectation over the sample response function and has no explicit
form. (Formulations other than the expectation are acceptable; for example, the
maximum, minimum, or quantiles of the sample objectives [4, 32].) A basic
assumption requires that the expectation function f(x) is well defined: for any
x ∈ ℝⁿ the function F(x, ·) is measurable, and either E[F(x, ξ(ω))⁺] or
E[F(x, ξ(ω))⁻] is finite; see page 57 of [91].
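To make the setting concrete, the following minimal sketch estimates f(x) in (1.1) by averaging independent simulation runs. The sample response F and the distribution of ξ are entirely hypothetical stand-ins for an expensive simulator.

```python
import numpy as np

def F(x, xi):
    """Hypothetical sample response: a smooth function of the design
    parameters x perturbed by the random realization xi."""
    return np.sum((x - 1.0) ** 2) + xi @ x

def f_estimate(x, n_samples=1000, seed=None):
    """Monte Carlo estimate of f(x) = E[F(x, xi)] by averaging
    independent replications."""
    rng = np.random.default_rng(seed)
    draws = [F(x, rng.normal(size=x.size)) for _ in range(n_samples)]
    return float(np.mean(draws))

x = np.array([0.5, 2.0])
print(f_estimate(x, n_samples=10_000, seed=0))  # near (0.5-1)^2 + (2-1)^2 = 1.25
```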
1.1.1 Simulation-Based Optimization Methods
There are several excellent review papers on the
subject of simulation-based optimization methods, such as [3, 40, 41]. These
papers also provide a good survey on optimization add-ons for discrete-event
simulation software, which have been undergoing fast development over the last
decade; for example, OptQuest (www.optquest.com) is available for Arena
(www.arena.com) and for SIMUL8 (www.simul8.com).
A detailed summary of commercial simulation software and optimization add-ons
can be found in Table 1 of [41]. Contemporary methods for continuous
simulation-based optimization, which are implemented in these add-ons, can be
classified into several categories.
Response surface methodology  The idea of Response Surface Methodology (RSM)
[12, 90] is to construct one (or multiple) mathematical model A, called a
surrogate model, to approximate the underlying function f, so that it can be
easily and cheaply evaluated at each parameter point x. Optimization procedures
are executed over the inexpensive model A instead of the expensive cost
function. Moreover, the approximating model A may provide estimated gradient
values that enable the application of more efficient optimization algorithms.
We determine the model A(·) by minimizing the difference between A(·) and the
function F(·) over a representative set of points S:

    min  Σ_{xᵢ∈S} φ( F(xᵢ, ξ) − A(xᵢ) )   (1.2)
    s.t. A(·) ∈ 𝒜.
Here φ(·) is a merit function, typically chosen as the ℓ₂-norm; ξ is one sample
realization; and the set 𝒜 is a class of tunable functional forms. The most
popular choices of 𝒜 are linear and quadratic models. Polynomial models are
compact forms and easy to evaluate, but they often have limited capacity to
model complex functions of arbitrary shape. Splines are an alternative tool;
they are more flexible in fitting and capable of handling large-scale data, but
in high-dimensional approximation the (tensor-product) spline requires sample
data on a grid mesh for efficient implementation. Kriging is an interpolative
tool originally developed in geostatistics. It is flexible enough to be applied
to high-dimensional spatial data and has been used in many engineering
problems. However, its performance is limited by the total number of design
points; for example, increasing the number of points in S can quadratically
increase the Kriging model construction time.
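As a concrete instance of (1.2), the sketch below takes φ to be the squared ℓ₂-norm and 𝒜 the class of one-dimensional quadratics; the noisy sample responses are synthetic. Solving the resulting least-squares problem yields a surrogate A that is cheap to evaluate anywhere.

```python
import numpy as np

# Noisy sample responses F(x_i, xi) at design points S (synthetic data).
rng = np.random.default_rng(0)
S = np.linspace(-2.0, 2.0, 15)
F_vals = S**2 - S + rng.normal(scale=0.1, size=S.size)

# A(x) = a x^2 + b x + c: minimize sum_i (F(x_i, xi) - A(x_i))^2 over (a, b, c).
design = np.column_stack([S**2, S, np.ones_like(S)])
coeffs, *_ = np.linalg.lstsq(design, F_vals, rcond=None)

# The fitted surrogate is inexpensive, so optimization can run on A instead of F.
A = lambda x: coeffs[0] * x**2 + coeffs[1] * x + coeffs[2]
print(coeffs, A(0.5))
```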
Heuristic methods Heuristic methods have proven
to be practically useful in many real-world applications. We will briefly
introduce the three most popular methods: genetic algorithms, tabu search and
simulated annealing.
Genetic algorithms [13, 94, 95] are inspired by the process of biological
evolution. The algorithm is initialized with a finite set of potential
solutions called the population. Each potential solution, referred to as an
individual, may be coded as a binary string, as a real-coded vector, or over a
fixed alphabet of characters. These solutions are evaluated by a fitness
function (normally the objective function), and the fittest individuals are
assigned a high probability to "reproduce" in the next generation of solutions,
a sort of survival-of-the-fittest scheme. In the reproduction process, new
offspring inherit traits from their parents (crossover) and are subject to
random mutation. The new offspring are accepted using various criteria to form
a new population of candidate solutions. The algorithm proceeds to generate
more favorable solutions in each iteration and eventually converges to a
population with a distribution of good solutions.
Tabu search [45, 46] is a metaheuristic based on the local search method, which
iteratively moves the current iterate to a neighbor solution until certain
criteria are satisfied. The algorithm allows a move to a neighbor solution that
has a worse objective value. Meanwhile, in order to prevent cycling back to
previous solutions or infinite loops, a list of tabu or forbidden moves is kept
and updated at each iteration. A short-term memory tabu list is the most used
type and is often referred to as a tabu tenure.
Simulated annealing [23, 31] randomly selects local moves from a list of
candidate neighbor points. If a better neighbor point is encountered, it
replaces the current iterate with probability one; if a worse point is found,
it replaces the iterate with a probability strictly less than one. The
appropriate probability value is determined by the difference of the objective
values. For the algorithm to converge, the probability of moving towards a
worse point should decrease over the iterations according to the decrease of a
certain 'temperature' value, which changes based on a cooling schedule. All
three algorithms are considered global optimization methods, since they are
able to move iterates out of regions where locally optimal solutions reside.
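A minimal sketch of the simulated annealing acceptance rule described above, using the common Metropolis form exp(−Δ/T) for the probability of accepting a worse point; the objective, neighborhood, and cooling parameters are illustrative choices, not prescribed by the thesis.

```python
import math
import random

def simulated_annealing(f, x0, neighbors, T0=1.0, cooling=0.95, iters=500):
    """Minimize f by random local moves; a worse move is accepted with
    probability exp(-delta/T), which shrinks as the temperature cools."""
    x, T = x0, T0
    for _ in range(iters):
        y = random.choice(neighbors(x))
        delta = f(y) - f(x)
        if delta < 0 or random.random() < math.exp(-delta / T):
            x = y          # accept the move (always, if it improves f)
        T *= cooling       # cooling schedule drives acceptance toward greedy
    return x

# Illustrative example: minimize a 1-d quadratic over integer moves.
f = lambda x: (x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
random.seed(0)
print(simulated_annealing(f, x0=20, neighbors=neighbors))  # near 3
```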
Stochastic approximation  Stochastic approximation [67, 104] falls into the
category of gradient-based approaches. Typically, the method iteratively
updates the current solution by

    x_{k+1} = x_k − a_k ∇̂f(x_k),

where the estimated gradient ∇̂f(x_k) can be calculated by various gradient
estimation tools and a_k is the step length. Since the method is an extension
of the line search method, it is inherently a local optimization method. Under
proper conditions, such as when the error in the gradient approximation and the
step length converge to zero at appropriate rates, the stochastic approximation
method can be shown to converge to a local optimum of the underlying function.
The approximate gradient ∇̂f(x_k) is estimated using sample responses F.
Popular gradient estimation methods include perturbation analysis and the
likelihood ratio method [43]. Spall's lab has developed simultaneous
perturbation (SP) and finite difference (FD) based stochastic approximation
algorithms, SPSA and FDSA; the lab's web site (www.jhuapl.edu/SPSA) provides a
good source of references on these two algorithms.
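A minimal sketch of a simultaneous perturbation (SPSA-style) gradient estimate and the resulting update: two noisy function evaluations per iteration estimate the whole gradient, regardless of dimension. The quadratic test function is hypothetical, and the gain-decay exponents follow Spall's commonly cited recommendations.

```python
import numpy as np

def sp_gradient(f, x, c, rng):
    """Simultaneous perturbation estimate of the gradient of f at x
    from two evaluations along a random +/-1 (Rademacher) direction."""
    delta = rng.choice([-1.0, 1.0], size=x.size)
    g = (f(x + c * delta) - f(x - c * delta)) / (2.0 * c)
    return g / delta  # componentwise division by +/-1

def spsa_minimize(f, x0, a=0.2, c=0.1, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        a_k = a / k**0.602          # decaying step length
        c_k = c / k**0.101          # decaying perturbation size
        x = x - a_k * sp_gradient(f, x, c_k, rng)
    return x

# Noisy quadratic: minimum at (2, ..., 2).
rng = np.random.default_rng(1)
f = lambda x: np.sum((x - 2.0) ** 2) + rng.normal(scale=0.01)
print(spsa_minimize(f, np.zeros(5)))
```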
Derivative-free optimization methods  Derivative-free optimization methods [80]
are a class of methods that do not utilize or directly estimate the gradient,
and thus are a good fit for the optimization problem (1.1). Compared with
stochastic approximation algorithms, derivative-free methods avoid the gradient
estimation step, which is sensitive and crucial to the convergence of those
algorithms. In many practical examples, we find that gradient estimation tools
often become inaccurate and problematic when the gradient value gets close to
zero (i.e., near a local solution).
There are two categories of derivative-free methods: the so-called model-based
approach [20, 21] and the pattern- or geometry-based approach. The model-based
approach typically constructs a chain of local models that approximate the
objective function, and the algorithm proceeds based on model predictions; the
pattern-based approach instead directly uses the function output at locations
specified by geometric arguments, as in the pattern search method [70].
One of the optimization methods we apply in the thesis is the UOBYQA
(Unconstrained Optimization BY Quadratic Approximation) algorithm [85], which
is a model-based method designed for solving nonlinear problems with a moderate
number of dimensions. The method shares similarities with response surface
methodology (RSM): both construct a series of models of the simulation response
during the optimization process. However, many features of UOBYQA, such as the
quadratic model update and the local region shift, have advantages over
classical RSM, and UOBYQA is more mathematically rigorous in its convergence.
Dynamic programming and neuro-dynamic programming  Dynamic Programming (DP) [6]
problems are a special type of simulation-based optimization problem with
internal time stages and state transitions. The objective function is not a
single black-box output, but typically a combination of intermediate costs
incurred during state transitions plus a cost measurement of the final state.
Appropriate controls are determined at each time stage, typically in a
sequential fashion.
Neuro-Dynamic Programming (NDP) [9, 47, 106] is a class of reinforcement
learning methods for solving complex DP problems. The central idea of NDP is to
approximate the optimal cost-to-go function using simple, structured functions,
such as neural networks, or simulation evaluations. NDP obtains sub-optimal
controls, trading solution quality for reduced computational cost. This feature
distinguishes NDP methods from standard DP methods.
1.1.2 White Noise vs. Common Random Numbers
To handle stochastic function output, we adopt a simple and effective approach:
sample multiple replications per point and take the average to reduce the
uncertainty. Algorithms without such a procedure can obtain a good solution but
not an exact solution, because a single-replication observation is inevitably
affected by noise. The tricky question of setting the replication number is
actually algorithm dependent. We have applied tools from Bayesian inference in
a novel way to facilitate this computation.
We separate our optimization approaches into two cases, according to an
important property of the simulation system concerning its replications.
The white noise case corresponds to the situation where simulation outputs
F(x, ξ(ω)) are independent across simulation runs, whatever the input x is. The
random component ξ(ω) of the function cannot be fixed, and the simulation
outputs are observed to be perturbed by white noise. The second case is the
implementation of Common Random Numbers (CRN), which assumes the random factor
ξ(ω) can be fixed; for example, this can be achieved by fixing the random seed
in the computer simulation. One of the advantageous properties of using CRN is
that, given a realized sample ξ, the sample response function F(x, ξ) is a
deterministic function of x. The introduction of CRN significantly reduces the
level of randomness and facilitates comparisons between different x.
When CRN is used, the expected value function f(x) in (1.1) can be approximated
by the deterministic averaged sample response function

    f(x) ≈ f_N(x) := (1/N) Σ_{i=1}^{N} F(x, ξᵢ),   (1.3)

where N is the number of samples. This is the core idea of the sample-path
method [35, 49, 50, 83, 84, 89], which is a well-recognized method in
simulation-based optimization. The sample-path method is sometimes called the
Monte Carlo sampling approach [99] or the sample average approximation method
[52, 53, 63, 98, 100, 101]. It has been applied in many settings, including
buffer allocation, tandem queue servers, network design, etc. A wealth of
deterministic optimization techniques can be employed to solve the sample-path
problem

    min_{x∈Ω} f_N(x),   (1.4)
which serves as a substitute for (1.1). An optimal solution x*_N to the problem
(1.4) is then treated as an approximation of x*, the solution of (1.1). Note
that the method is not restricted to bounded problems, but in more general
settings it requires appropriate deterministic tools (i.e., constrained
optimization methods) to be used.
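Under CRN, fixing the realizations ξ₁, ..., ξ_N makes f_N in (1.3) a deterministic function, so an off-the-shelf deterministic (here derivative-free) optimizer can be applied to the sample-path problem (1.4). The sample response F below is a hypothetical stand-in for a simulator.

```python
import numpy as np
from scipy.optimize import minimize

def F(x, xi):
    """Hypothetical sample response for one simulation run."""
    return np.sum((x - 1.0) ** 2) * (1.0 + 0.1 * xi[0]) + xi[1]

# Common random numbers: draw xi_1, ..., xi_N once and reuse them for
# every x, which makes f_N a deterministic function of x.
N = 200
xis = np.random.default_rng(42).normal(size=(N, 2))

def f_N(x):
    return np.mean([F(x, xi) for xi in xis])

# Solve the sample-path problem (1.4) with a deterministic method.
res = minimize(f_N, x0=np.zeros(3), method="Nelder-Mead")
print(res.x)  # x*_N, an approximation of the true solution x*
```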
Convergence proofs of the sample-path method are given in [89, 97]. Suppose
there is a unique solution x* to the problem (1.1); then, under the assumption
that the sequence of functions {f_N} epiconverges to the function f (the
functions f_N epiconverge to f if the epigraphs of f_N converge to the epigraph
of f [91]), the optimal solution sequence {x*_N} converges to x* almost surely
for all sample paths. See Figure 2 for an illustration of the sample-path
optimization method: starting from x₀, for a given N, a deterministic algorithm
is applied to solve the sample-path problem, and the above analysis then allows
us to use the solution x*_N as an approximation to the true solution x*.
1.2 Bayesian Analysis in Simulation
We have extensively used Bayesian analysis in our
optimization methods. Our analysis is inspired by its application in selecting the best system, which is one of the
fundamental problems in discrete simulation-based optimization.
In Bayesian analysis, a posterior distribution for the
simulation output, including both mean and variance information, is derived.
Since the posterior distribution has a parametric form, it can be readily
assembled and used in optimization methods.
Another important usage of Bayesian analysis is Monte Carlo validation. When
Bayesian posterior distributions cannot be used directly in the algorithms, the
Monte Carlo validation step serves as an alternative. This idea is motivated by
the use of Monte Carlo simulation of random variables to construct statistical
estimators for unknown quantities. For example, to compute the expectation of a
random variable, a simple approach is to generate a large number of
realizations of the random variable and take the arithmetic mean (the sample
mean). This provides a so-called Monte Carlo estimator for the unknown
expectation. Note that one crucial part of the Monte Carlo method is an
efficient way to generate random samples from a given distribution.
We adopt the same idea in Monte Carlo validation. Assume that the Bayesian
posterior distributions capture the real uncertainty of the unknown random
quantity well from a practical standpoint. In order to validate a given
variance criterion, we can extract a large number of samples from the derived
posterior distributions and plug them into the criterion. For example, if we
observe that a fraction 1 − α of all the samples satisfy the criterion, we can
assert that the criterion is satisfied with an accuracy of 1 − α. The Monte
Carlo validation step avoids using the forms of the Bayesian distributions
directly.
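A minimal sketch of this validation step, assuming a normal posterior for an unknown mean (as in (1.11) below) and a purely illustrative criterion that the mean lies below a threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed posterior for the unknown mean: mu | X ~ N(mu_hat, s2 / N).
mu_hat, s2, N = 0.8, 2.5, 50
draws = rng.normal(mu_hat, np.sqrt(s2 / N), size=100_000)

# Illustrative criterion: the unknown mean is below the threshold c.
c, alpha = 1.0, 0.05
frac = np.mean(draws < c)  # fraction of posterior samples satisfying it

if frac >= 1 - alpha:
    print(f"criterion validated with accuracy {1 - alpha} (frac = {frac:.3f})")
else:
    print(f"criterion not validated (frac = {frac:.3f})")
```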
We will introduce the problem of selecting the best system in Section 1.2.1,
along with the role that Bayesian analysis plays. In Section 1.2.2, we describe
the construction of Bayesian posterior estimations, and in Section 1.2.3, we
illustrate how the analysis is applied to a simple selecting-the-best-system
problem.
1.2.1 Selecting the Best System
The goal of selecting the best system is to identify the best system from a
collection of candidate systems. This type of problem is encountered many times
(in various forms) in our algorithmic designs, as will be shown later.
Selecting the best system is equivalent to solving a discrete optimization
problem:

    argmin_{i} E[Yᵢ],   (1.5)

where Yᵢ is a random variable representing the output of system i,
i = 1, 2, ..., K. (Without loss of generality, we assume that the desired
system has the smallest expected mean.) Let μᵢ be the unknown mean of Yᵢ and
μ_[1] ≥ μ_[2] ≥ ... ≥ μ_[K] be the ordered sequence of the means; such a
sequence is unknown. We wish to determine the index [K], and a probability of
correct selection (PCS) value quantifies the correctness:

    PCS = Pr(select Y_[K] | μ_[j] > μ_[K], j = 1, 2, ..., K − 1).   (1.6)
The difficulty of selecting the best system lies in designing a statistically
accurate procedure that identifies the best system with

    PCS ≥ 1 − α,

where α is a threshold value, while using the smallest number of realized
samples of Yᵢ.
The standard approach to selecting the best system is Indifference-Zone Ranking
and Selection (IZ-R&S). An indifference-zone parameter δ is first prescribed:
when the means of two systems are within a tolerance δ, one is indifferent to
choosing either system. One of the most influential IZ-R&S approaches is the
two-stage sampling strategy described by Rinott [88] in 1978. First, r₀
replications of each system are simulated to estimate the preliminary sample
variance. Second, additional replications are evaluated for each system, where
the number of replications is determined by the corresponding variance
information. The best system is then chosen as the one having the smallest
sample mean. Kim and Nelson [61] include a screening step designed to eliminate
a set of inferior systems from consideration at an early stage. A good survey
of IZ-R&S procedures can be found in [62].
The Bayesian method is an alternative to the standard IZ-R&S [16, 17, 19]. The
idea is to construct posterior distributions for the underlying means μᵢ using
observations and acquired knowledge. The selection accuracy is accordingly
estimated via joint posterior distributions. There are many practically
efficient procedures, including Chen's OCBA (Optimal Computing Budget
Allocation) [16] and Chick's 0-1(B) [17, 19]. The report [56] offers an
empirical comparison of these methods, as well as Rinott's two-stage IZ-R&S
method.
The Bayesian approach has advantages over the traditional IZ-R&S approach.
First, it is not necessary to set the indifference-zone parameter δ; in fact,
it deals with the ranking and selection problem (1.5) directly. Moreover, it
utilizes both the mean and variance information of the data, while IZ-R&S only
uses the variance information directly.
1.2.2 Constructing Bayesian Posterior Estimations
The Bayesian approach described in this subsection is
standard and close to that presented in Chick's papers [17, 19].
Given a set of points {x¹, x², ..., x^L}, we assume the simulation output at
these points,

    F = (F(x¹, ξ(ω)), F(x², ξ(ω)), ..., F(x^L, ξ(ω))),

is a multivariate normal variable with mean μ = (μ(x¹), ..., μ(x^L)) and
covariance matrix Σ:

    F ~ N(μ, Σ).   (1.7)
In the white noise case, since simulation outputs are independent, Σ is a
diagonal matrix; in the CRN case, it is typically not. Suppose we evaluate N
replications of simulation runs at each point; the existing data X can then be
accumulated as an N × L matrix. Let μ̂ and Σ̂ denote the sample mean and sample
covariance matrix of the data. For simplicity, we introduce the notation
sᵢ = (F(x¹, ξᵢ), ..., F(x^L, ξᵢ)), i = 1, ..., N, so that X stacks the row
vectors:

    X = (s₁; s₂; ...; s_N).
The sample mean and sample covariance matrix are calculated as

    μ̂ = (1/N) Σ_{i=1}^{N} sᵢ = (f_N(x¹), ..., f_N(x^L)),   (1.8)

and

    Σ̂ = (1/(N−1)) Σ_{i=1}^{N} (sᵢ − μ̂)ᵀ (sᵢ − μ̂).   (1.9)
Since we do not have any prior assumption on the distributions of μ and Σ, we
assign non-informative prior distributions to them. In doing so, the joint
posterior distributions of μ and Σ are derived as

    Σ | X ~ Wishart_L(Σ̂, N + L − 2),
    μ | Σ, X ~ N(μ̂, Σ/N).   (1.10)

Here the Wishart distribution Wishart_L(V, m) has covariance matrix V and m
degrees of freedom [25]. The Wishart distribution is a multivariate
generalization of the χ² distribution.
The distribution of the mean value is of most interest to us. When the sample
size is large, we can replace the covariance matrix Σ in (1.10) with the sample
covariance matrix Σ̂ and asymptotically derive the posterior distribution of
μ | X as

    μ | X ~ N(μ̂, Σ̂/N).   (1.11)

It should be noted that, with an exact computation, the marginal distribution
of μ | X inferred from (1.10) (eliminating Σ) is

    μ | X ~ St_L(μ̂, N Σ̂⁻¹, N − 1),   (1.12)

where a random variable with Student's t-distribution St_L(μ, κ, m) has mean μ,
precision κ, and m degrees of freedom. The normal formulation (1.11) is more
convenient to manipulate than the t-version (1.12). We have applied both forms
in our algorithms when appropriate, but typically prefer the simpler normal
form.
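A minimal sketch of this construction: compute μ̂ and Σ̂ from an N × L data matrix and draw from the asymptotic normal posterior (1.11). The data here are synthetic, correlated across points as they might be under CRN.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic N x L data matrix X: N replications at L design points.
N, L = 50, 3
true_mu = np.array([1.0, 0.5, 2.0])
cov = 0.5 * np.eye(L) + 0.3          # correlated outputs, as under CRN
X = rng.multivariate_normal(true_mu, cov, size=N)

mu_hat = X.mean(axis=0)              # sample mean, eq. (1.8)
Sigma_hat = np.cov(X, rowvar=False)  # sample covariance, eq. (1.9)

# Asymptotic posterior mu | X ~ N(mu_hat, Sigma_hat / N), eq. (1.11).
draws = rng.multivariate_normal(mu_hat, Sigma_hat / N, size=10_000)
print(mu_hat, draws.mean(axis=0))
```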
In particular, in the white noise case, the distribution in (1.12) is
separable. We have the componentwise posterior distribution

    μ(xⁱ) | X ~ N(μ̂(xⁱ), σ̂²(xⁱ)/N),

where μ̂(xⁱ) is the sample mean at the point xⁱ and σ̂²(xⁱ) is the i-th diagonal
component of the matrix Σ̂.
While the multivariate normal assumption (1.7) is not always valid, several
relevant points indicate that it is likely to be satisfactory in practice [18].

• The form (1.7) is only used to derive the (normal) posterior distribution
μ | X.

• Other types of distributional assumptions may be appropriate in different
circumstances. For example, when a simulation output follows a Bernoulli 0-1
distribution, it is easier to perform parameter analysis using beta prior and
posterior distributions. The normal assumption (1.7) is most relevant to
continuous simulation output with unknown mean and variance.

• The normal assumption is asymptotically valid for many applications. Many
regular distributions, such as distributions from the exponential family, are
normal-like, and the analysis using normal distributions is asymptotically
correct.
1.2.3 The Bayesian Method for Selecting the Best System
As an introductory application, we consider the preliminary case of selecting
the best system when there are only two systems involved, K = 2. A simplified
analysis of how to compute the PCS is presented as follows. Interested readers
can find details (e.g., multiple comparisons and procedures) in [17, 19].
Let X = {y_{i,j}, i = 1, 2, j = 1, 2, ..., rᵢ} denote the two sequences of
output of both systems. The sample mean μ̂ᵢ and sample variance σ̂ᵢ² are defined
as

    μ̂ᵢ = Σ_{j=1}^{rᵢ} y_{i,j} / rᵢ   and   σ̂ᵢ² = Σ_{j=1}^{rᵢ} (y_{i,j} − μ̂ᵢ)² / (rᵢ − 1).
A decision is drawn by directly comparing the two sample means. The system with
the smaller average output is selected:

    select = 1, if μ̂₁ < μ̂₂;  2, otherwise.

Without loss of generality, we assume that we observe the order of the sample
means μ̂₁ < μ̂₂ and select the first system as the winner.
In the Bayesian framework, the posterior distribution of μᵢ | X can be
asymptotically estimated as

    μᵢ | X ~ N(μ̂ᵢ, σ̂ᵢ²/rᵢ).   (1.13)

Following our earlier assumption, the PCS corresponds to the probability of the
event {μ₁ < μ₂}:

    PCS ≈ Pr(μ₁ < μ₂ | X) = Pr(μ₁ | X − μ₂ | X < 0).   (1.14)

The difference of the two normal random variables μ₁ | X and μ₂ | X remains a
normal random variable; it is straightforward to show

    μ₁ | X − μ₂ | X ~ N(μ̂₁ − μ̂₂, σ̂₁²/r₁ + σ̂₂²/r₂).

Therefore, the PCS defined in (1.14) is a tail probability of a normal
distribution:

    PCS ≈ Pr(N(μ̂₁ − μ̂₂, σ̂₁²/r₁ + σ̂₂²/r₂) < 0).   (1.15)
The computation corresponds to a single cdf evaluation of a one-dimensional
normal distribution. The value can be computed either by looking it up in a
standard cdf table or by using routines in mathematical software, e.g.,
Matlab's function normcdf. In general, for formulation (1.15), one may expect
the PCS value to increase as the numbers of replications r₁ and r₂ increase.
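The computation in (1.15) is a one-line cdf evaluation; the sketch below uses Python's scipy.stats.norm in place of Matlab's normcdf, with synthetic outputs for the two systems.

```python
import numpy as np
from scipy.stats import norm

# Synthetic outputs of two systems (r_i replications each).
rng = np.random.default_rng(3)
y1 = rng.normal(1.0, 2.0, size=40)
y2 = rng.normal(1.5, 2.0, size=40)

mu1, mu2 = y1.mean(), y2.mean()
v1 = y1.var(ddof=1) / y1.size   # sigma_1^2 / r_1
v2 = y2.var(ddof=1) / y2.size   # sigma_2^2 / r_2

# PCS as the tail probability in (1.15): Pr(N(mu1 - mu2, v1 + v2) < 0).
pcs = norm.cdf(0.0, loc=mu1 - mu2, scale=np.sqrt(v1 + v2))
print(f"PCS ~= {pcs:.3f}")
```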
1.3 The Two-Phase Optimization Framework
The thesis focuses on a two-phase optimization framework for solving (1.1),
which is motivated by response surface methodology. Recalling the model
construction step (1.2), the overall approximation error comes from two
sources: random error and model error. Random error is the type of error
directly induced by the uncertainty ξ(ω) in the sample response; model error is
due to an inappropriate choice of functional forms to fit the data. For
example, fitting a cubic curve with a quadratic function will result in a large
discrepancy when the domain of interest is not appropriately restricted. This
type of error exists independently of the random term ξ(ω). In the global view,
the model error dominates the random error; therefore, constructing a surrogate
function aims at reducing the model error to the maximum extent. However, in
the local view, reducing the random error becomes the primary consideration.
For this reason, we design a two-phase framework for simulation-based
optimization which addresses these distinct goals:
1. Phase I is a global exploration and rough search step. The algorithm
explores the entire domain and proceeds to determine potentially good
subregions for future investigation. We assume the observed randomness in the
subregion detection process is insignificant and can be ignored. Appropriate
criteria to determine when a good subregion has been identified are required.

2. Phase II is a local exploitation step. Local optimization algorithms are
applied in each subregion to determine the exact optimum. The algorithms are
required to deal with the randomness explicitly, since in local subregions
random error is considered the dominating feature.
Phase I typically produces one or multiple good points representing centers of
good local subregions. These points are used as starting points for the Phase
II local search algorithms. Alternatively, Phase I may generate geometric
information about the location and shape of the subregions, which is more
useful for analyzing the significance, interactions, and patterns of individual
dimensions.
1.3.1 The WISOPT Structure
We design an optimization package WISOPT (which stands
for Wisconsin Simulation OPTimization) incorporating different optimization
methodologies, based on the two-phase framework. See Figure 3 for a flow
chart.
Phase I is a global exploration step over the entire
domain. Algorithms in Phase I should be able to generate (and evaluate) densely
distributed samples in promising subregions and sparsely distributed samples in
inferior subregions. The entire set of samples will be passed to a phase
transition procedure between Phase I and Phase II, which implements a
non-parametric statistical method to determine starting points and surrounding
subregions for multistart Phase II optimizations.
One of our Phase I methods employs classification tools to facilitate the
global search process. By learning a surrogate from existing data, the approach
identifies promising subregions and generates dense samples in those
subregions. Additional features of the method are: (a) more reliable
predictions obtained using a voting scheme that combines the opinions of
multiple classifiers, and (b) a data pre-processing step that copes with
imbalanced training data. Another Phase I method is the Noisy DIRECT (DIviding
RECTangles) algorithm, which is an extension of the DIRECT optimization
algorithm to the noisy case. As with the classification-based global search,
the method returns a collection of promising points together with surrounding
rectangles.
Phase II performs local derivative-free optimization based on the UOBYQA
(Unconstrained Optimization BY Quadratic Approximation) algorithm in each of
the identified subregions. We distinguish whether CRN can be implemented in the
simulator: the VNSP-UOBYQA (Variable-Number Sample-Path UOBYQA) algorithm
handles the CRN case, and the Noisy UOBYQA algorithm handles the white noise
case. Both algorithms apply Bayesian techniques to guide appropriate sampling
strategies while simultaneously enhancing algorithmic efficiency, in order to
obtain solutions of a desired accuracy. The statistically accurate scheme
determines the number of simulation runs and guarantees the global convergence
of the algorithms.
A key component of the extended algorithms is the incorporation of distribution
information provided by Bayesian estimation. Bayesian inference is an effective
tool in input modeling and uncertainty analysis. In particular, in order to
prevent unnecessary exploration of hyperrectangles in inferior regions, Noisy
DIRECT tests the accuracy of selecting hyperrectangles for future partitioning
using a Monte Carlo validation process: trial samples are extracted from the
Bayesian posterior distributions to validate a predefined variance criterion.
In the UOBYQA algorithms, we derive the posterior volatility for the local
model construction step, and thereby control the uncertainty of solutions in
the trust-region subproblems.
Certainly, the variety of methods in both phases is not
restricted to what we present. Additional methods that satisfy the purposes of
both phases may work as new modules and can be plugged into the two-phase
framework.
1.3.2 Introduction to the Methodologies
We give a brief introduction to the methodologies we
use in WISOPT (refer to Figure 3). Details will be discussed in later chapters.
Classification-based global optimization  A good surrogate model for the entire
space (often called a metamodel) may require a large number of simulations and
can be very expensive to compute, but it may be capable of approximating the
global behavior of the underlying function f. Since Phase I only attempts to
determine promising subregions of the search space, the model can be
constructed in a coarser manner.
Figure 3: The two-phase WISOPT structure. Phase I applies classification-based
global optimization or Noisy DIRECT; a phase transition step passes starting
points and subregions to Phase II, which applies the VNSP-UOBYQA or the Noisy
UOBYQA algorithm.
We are indeed interested in the behavior of a simple indicator function

    I(x) = 1, if f(x) ≤ c;  0, otherwise,   (1.16)

for a threshold value c, where the promising subregions in the method
correspond to level sets of f. This function gives sufficient information to
determine where a subregion is located. Approximating the indicator function I
is simpler than approximating the underlying function f, especially in the
high-dimensional case. Normally, we sample dense designs inside the subregion
and sparse designs outside it.
In the classification-based global search, we utilize the predictive power of a
classifier: the classifier works as a surrogate for the indicator function I,
which reveals the location of promising subregions. A classifier is a cheap
mechanism for predicting whether new samples are in promising subregions or
not; thus we can generate a dense collection of points in these subregions.
Figure 4 illustrates the local regions (the dotted circles and the level sets)
and the samples (the '+'s). The method fits well with the theme of Phase I
optimization, which is to identify local regions. In fact, the method is
relatively insensitive to noise, because the simplification step (1.16) smooths
out the occurrence of noise. That is the reason we normally do not use
replicated samples in training the classifiers. Setting up the algorithm may
require specific parameters; for example, we potentially need to know the
proper number of samples to use in training the classifiers.
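A minimal sketch of the idea, using scikit-learn's SVC as the classifier (a library choice made here for illustration; the thesis does not prescribe one): label initial samples with the indicator (1.16), train the classifier, and keep only candidate points predicted to be promising.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
f = lambda X: np.sum(X**2, axis=1)  # hypothetical objective

# Initial space-filling samples, labeled by the indicator (1.16).
X_train = rng.uniform(-10, 10, size=(200, 2))
c = np.quantile(f(X_train), 0.2)           # threshold defining the level set
y_train = (f(X_train) <= c).astype(int)    # 1 = promising, 0 = not

# The trained classifier is a cheap surrogate for the indicator I.
clf = SVC(kernel="rbf").fit(X_train, y_train)

# Generate many candidates and densely sample the predicted-promising ones.
candidates = rng.uniform(-10, 10, size=(5000, 2))
dense = candidates[clf.predict(candidates) == 1]
print(len(dense), "points generated in predicted promising subregions")
```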
Classification forms a major branch of machine learning/data mining [51, 77].
In the past couple of years, there has been increasing interest in techniques
interfacing optimization and machine learning. For example, Kim and Ding [60]
have implemented data mining tools in an engineering design optimization
problem, and Mangasarian [74] has enhanced optimization robustness in support
vector machines.
Figure 4: Predicted local subregions of the Griewank function. The function has
local optima in each subregion (circles) and the global optimum at [0, 0].
The Noisy DIRECT algorithm  The DIRECT optimization method [37, 38, 58, 59] is
a deterministic global optimization algorithm for bound-constrained problems.
The algorithm, originally motivated by Lipschitz optimization [59], has proven
to be effective in a wide range of application domains. It centers around a
space-partitioning scheme that divides large hyperrectangles into smaller ones;
promising hyperrectangles are subject to further division. Figure 5 provides an
illustration of the algorithm on the Goldstein-Price function. The algorithm
thus proceeds to gradually explore promising subregions.
Figure 5: The DIRECT optimization algorithm on the Goldstein-Price function.
The contour plot of the Goldstein-Price function is shown in Figure 15.
When the objective function is subject to uncertainty, some crucial operational
steps of the DIRECT algorithm are affected. For example, the choice of
potentially optimal hyperrectangles becomes incorrect because of the noisy
function values, possibly misleading the algorithm into searching inferior
regions. We modify the original DIRECT algorithm using a simple approach:
multiple replications are sampled to reduce output uncertainty, so that the
algorithm can be expected to proceed correctly, as in the deterministic case.
However, we must handle the tradeoff between two design goals: efficiency of
the algorithm versus total computational effort. Since the objective function
is often computationally expensive to evaluate, we must be very cautious in
using function evaluations; on the other hand, we need to maintain a certain
precision in the function values for correctness of the algorithm. In our
modification, we apply Bayesian techniques to derive a posterior distribution
for the function output at each point, and incorporate the distribution
information into the algorithm to determine an appropriate number of
replications to be used.
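A minimal sketch of the replication-control idea, deliberately simplified: keep adding replications at a point until the posterior standard deviation of its mean output (the normal approximation, as in (1.11)) falls below a tolerance. The tolerance and the simulated output are illustrative.

```python
import numpy as np

def sample_until_precise(simulate, tol=0.05, n0=5, n_max=200, seed=None):
    """Add replications until the posterior std of the mean, sqrt(s^2/n),
    drops below tol, or the replication budget n_max is exhausted."""
    rng = np.random.default_rng(seed)
    obs = [simulate(rng) for _ in range(n0)]
    while len(obs) < n_max:
        post_std = np.sqrt(np.var(obs, ddof=1) / len(obs))
        if post_std < tol:
            break
        obs.append(simulate(rng))
    return np.mean(obs), len(obs)

# Hypothetical noisy simulation output at a single point.
simulate = lambda rng: 1.3 + rng.normal(scale=0.5)
mean, n_used = sample_until_precise(simulate, seed=11)
print(mean, n_used)
```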
The parameters for this algorithm are easy to set. Since deterministic DIRECT
is designed for both global and local optimization, one should note that it is
desirable to terminate the algorithm at a reasonable time, by which sufficient
information about the local regions has been identified. This avoids
unnecessary function evaluations spent on comparisons that are really dominated
by noise.
The phase transition  Using the samples evaluated in Phase I, the phase
transition procedure applies a non-parametric local quadratic regression method
to determine the appropriate subregion size. Unlike regular regression methods,
which use the entire set of samples in the domain to construct one model, local
regression makes a prediction at a point using a local model based on samples
within a 'window size'; the approach thus emphasizes the local behavior of the
function. 'Non-parametric' means the regression model is not from a single
parametric family. It is presumed that samples outside the local region have
little relevance to the current prediction. In our procedure, we treat the
resulting 'window size' as the subregion radius.
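A minimal one-dimensional sketch of local quadratic regression: only samples within the window around the prediction point receive nonzero weight (a tricube kernel here, an illustrative choice), and a weighted quadratic fit supplies the prediction.

```python
import numpy as np

def local_quadratic_predict(x0, X, y, window):
    """Predict y(x0) from a weighted least-squares quadratic fit,
    using only samples within `window` of x0 (tricube weights)."""
    d = np.abs(X - x0) / window
    w = np.where(d < 1, (1 - d**3) ** 3, 0.0)  # zero weight outside window
    A = np.column_stack([np.ones_like(X), X - x0, (X - x0) ** 2])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)
    return beta[0]  # fitted value at x0

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, 100)
y = np.cos(X) + rng.normal(scale=0.1, size=X.size)
print(local_quadratic_predict(0.0, X, y, window=0.5))  # close to cos(0) = 1
```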
A sequence of good starting
points is generated, satisfying the criteria: (a) each starting point is the
center of a subregion, (b) the subregions are mutually separated. The sequence
of starting points and the subregion sizes are passed to Phase II for local
processing, possibly in a parallel setting.
Extended UOBYQA algorithms In Phase II, the deterministic
UOBYQA algorithm is applied as the base local search method and is extended for
noisy function optimization. The method is an iterative algorithm in a trust
region framework [80], but it differs from a classical trust region method in
that it creates quadratic models by interpolating a set of sample points instead
of using the gradient and Hessian values of the objective function (thus making
it a derivative-free tool). Besides UOBYQA, other model-based software includes
WEDGE [75] and NEWUOA [86]. We choose UOBYQA because it is well suited to
problems with a moderate number of dimensions and because its explicit
quadratic models combine naturally with the Bayesian analysis developed in this
thesis.
We developed variants of the original UOBYQA, called the VNSP-UOBYQA and the
Noisy UOBYQA, that are adapted for noisy optimization problems. The extension
idea is similar to that of the Noisy DIRECT algorithm: we sample multiple
replications per point to reduce variance and apply Bayesian techniques to
guide appropriate sampling strategies for estimating the objective function.
The two algorithms employ different mechanisms in the sampling process: the
VNSP-UOBYQA determines appropriate replication numbers by whether sufficient
reduction is identified in the trust-region subproblem, while the Noisy UOBYQA
determines the number by whether the quadratic models can be shown to be
stable. Generally speaking, when CRN is implemented the noise is relatively
easy to handle, because it is correlated across sites; accordingly, the
VNSP-UOBYQA can be shown to have better convergence properties.
1.4 Outline of the Thesis
The general theme of the thesis centers around the WISOPT package, which is
designed based on the two-phase optimization framework. Phase I techniques are
described in Chapter 2 and Phase II techniques are described in Chapter 3.
More specifically, the first section of Chapter 2
presents the classification-based global search. The detailed procedures of
applying several types of classifiers are included, together with two special
features: a voting scheme to assemble multiple classifiers and imbalanced data
handling. The second section introduces and explains the development of the
Noisy DIRECT algorithm. We first describe the deterministic DIRECT algorithm,
then analyze how to enhance the algorithm under uncertainty. Several numerical
examples and a simulation problem are presented. The third section describes
the phase transition procedure.
Chapter 3 contains the details of the Phase II noisy versions of the UOBYQA
algorithm. The ideas behind the extensions of both the VNSP-UOBYQA and the
Noisy UOBYQA are similar: Bayesian posterior distributions for the parameters
of the quadratic model are derived, from which we can estimate the stability of
the algorithms. Since the VNSP-UOBYQA operates in the CRN setting, the
corresponding section is written in a more rigorous manner, with a convergence
proof of the algorithm provided.
We present in Chapter 4 two real-world simulation optimization examples. First,
we aim to fine-tune the simulation parameters in the Wisconsin Breast Cancer
Epidemiology model, which is from a collaborative project with researchers in
the Medical School at the University of Wisconsin. In the second example, we
optimize the shape of a microwave coaxial antenna for hepatic tumor ablation,
which is a current ongoing project with the Biomedical Engineering Department
at the University of Wisconsin.
In Chapter 5 we demonstrate a special simulation-based problem in dynamic
programming. This type of simulation problem has an internal time structure:
the objective function is not a pure black-box stochastic function, but is
constituted of a sequence of costs along the time stages. We modify the rollout
algorithm from neuro-dynamic programming, using a Bayesian analysis similar to
that outlined above to improve the simulation efficiency in training and in
deriving optimal controls at each stage. We illustrate the effectiveness of our
new algorithm using an example in fractionated radiotherapy treatment. This
approach to using Bayesian analysis in neuro-dynamic programming effectively
generalizes the methods of this thesis to time-domain problems, and leads to
further possible applications in other optimization contexts.