Abstract

In this paper, we consider the optimal control of material microstructures. Such material microstructures are modeled by the so-called phase-field model. We study the underlying physical structure of the model and propose a data-based approach for its optimal control, along with a comparison to the control using a state-of-the-art reinforcement learning (RL) algorithm. Simulation results show the feasibility of optimally controlling such microstructures to attain desired material properties and complex target microstructures.

1 Introduction

In this paper, we consider the optimal control of complex and high degree-of-freedom material microstructure dynamics, through which we aim to develop a robust tool for exploring the fundamental limits of materials microstructure control during processing. The microstructure dynamics process is governed by the so-called phase-field model [1], which is a powerful tool for modeling microstructure evolution of materials. It represents the spatial distribution of physical quantities of interest by an order parameter field governed by one or a set of partial differential equations (PDEs). The phase-field method can naturally handle material systems with complicated interfaces, thanks to which it has been successfully applied to the simulation of a wide range of materials microstructure evolution processes, such as alloy solidification [2,3], alloy phase transformation [4,5], grain growth [6,7], and failure mechanisms of materials [8,9]. For this work, we consider the evolution of the interface in multiphase microstructures to be guided by the Allen–Cahn [10] and Cahn–Hilliard [11] PDEs. Despite significant research on the optimal control of the Allen–Cahn [12–14] and Cahn–Hilliard [15,16] PDEs in the recent decade, material processing through data-driven optimal control of material microstructure dynamics still faces challenging obstacles for real-time implementation. Utilizing nonlinear convex-optimization algorithms to solve for the optimal control of the nonlinear Allen–Cahn and Cahn–Hilliard PDEs is viable for smaller instances. However, this approach encounters the well-known "curse of dimensionality," manifesting as increased computational complexity as the discretization scales to finer grids to capture intricate dynamics. These algorithms are also known to have convergence issues, owing in part to the existence of multiple local minima. Furthermore, the evolution of microstructures is inherently susceptible to uncertainties stemming from material properties, external conditions, and initial configurations. The development of control strategies resilient to these uncertainties remains an ongoing challenge. Finally, navigating the selection of an appropriate boundary condition, capable of steering microstructure evolution in desired directions, remains a substantial hurdle in the pursuit of effective control methodologies.

The phase-field model is complex, nonlinear, and high degree-of-freedom, and does not admit analytical solutions. Thus, data-based approaches are natural to consider for the control of these systems. In recent work [17], we proposed a novel decoupled data-based control (D2C) algorithm for learning to control an unknown nonlinear dynamical system. Our approach introduced a rigorous decoupling of the open-loop (planning) problem from the closed-loop (feedback control) problem. This decoupling allows us to come up with a highly efficient approach to solve the problem in a completely data-based fashion. Our approach proceeds in two steps: (i) first, we optimize the nominal open-loop trajectory of the system using a "black-box simulation" model, and (ii) then we identify the linear system governing perturbations from the nominal trajectory using random input–output perturbation data, and design a linear-quadratic regulator (LQR) controller for this linearized system. We have shown that the performance of the D2C algorithm is approximately optimal, in the sense that the decoupled design is near-optimal to second order in a suitably defined noise parameter [17]. In this work, we apply the D2C approach to the optimal control of the evolution of material microstructures governed by the phase-field model. For comparison, we consider reinforcement learning (RL) techniques [18]. The past several years have seen significant progress in deep neural network-based reinforcement learning approaches for controlling unknown dynamical systems, with applications in many areas like playing games [19], locomotion [20], and robotic hand manipulation [21]. A number of new algorithms that show promising performance have been proposed [22–24], and various improvements and innovations continue to be developed. However, despite excellent performance on a number of tasks, RL is highly data intensive. The training time for such algorithms is typically very large, and high variance and reproducibility issues mar the performance [25]. Thus, material microstructure control is intractable for current RL techniques. Recent work in RL has also focused on PDE control using techniques like the deep deterministic policy gradient (DDPG) with modifications such as action descriptors to limit the large action-spaces in infinite dimensional systems [26], as such algorithms tend to degrade in performance with an increase in dimensionality of the action-space. However, such an approach, in general, is not feasible for the microstructure problem (Fig. 1).

Fig. 1
Material microstructure control: the goal is to take the microstructure from the initial to the goal state (top row). The microstructure evolution is governed by a nonlinear PDE, which makes its control computationally intractable. A comparison between a state-of-the-art data-based RL technique and the proposed D2C technique (bottom row) is shown for the reliable and tractable solution of this control problem. (a) Initial state, (b) goal state, (c) D2C final state, and (d) DDPG final state.

The control design of systems governed by PDEs is known to be computationally intractable [27]. Owing to the infinite dimensionality of the system, model reduction techniques are typically used to reduce the size of the state space. The most widely used method is the proper orthogonal decomposition method, which finds a reduced basis for the approximation of the nonlinear PDE using a simulation of the same [28]. However, in general, the basis eigenfunctions can be quite different around the optimal trajectory when compared to an initial trajectory, and an approximation using the latter's reduced basis can lead to unacceptable performance [29,30]. In our approach, the nominal optimization problem is solved directly, which does not need any reduced order modeling. Furthermore, the feedback design is accomplished by identifying a linear time-varying (LTV) model of the system based purely on input–output data, which facilitates the use of linear control design techniques such as the linear-quadratic-Gaussian/iterative linear quadratic regulator (ILQR) on the problem, as opposed to having to solve a nonlinear control problem as in the methods stated previously. Furthermore, the entire control design for the problem is done purely in a data-based fashion, as the technique only queries a black-box simulation model of the process.

The contributions of the paper are as follows: we show how to model the dynamics of a multiphase microstructure and unveil its underlying structure. We present the application of the D2C approach, and a state-of-the-art RL technique (DDPG), to the control of such material systems, and show that the local D2C approach outperforms global RL approaches such as DDPG when operating on higher dimensional phase-field PDEs. We show that D2C is highly robust to noise in practical noise regimes, and that global optimality can be recovered at higher noise levels through an open-loop replanning approach. Further, we can exploit the peculiar physical properties of material microstructure dynamics to optimize the control actuation architecture, which leads to highly computationally efficient control that does not sacrifice much performance. We envisage that this work is a first step toward the construction of a systematic feedback control synthesis approach for the control of material microstructures, potentially scalable to real applications.

The rest of the paper is organized as follows: In Sec. 2, the microstructure dynamics are expanded upon, describing the different PDEs tested. We propose the D2C approach in Sec. 3 and discuss the discretization of the control actuation in Sec. 4. In Sec. 5, the proposed approach is illustrated through custom-defined test problems of varying dimensions, along with testing robustness to noise, followed by improvements to the algorithms which exploit the physics of the system.

2 Microstructure Dynamics: Classes of Partial Differential Equations Implemented

This section provides a brief overview of the nonlinear dynamics model of a general multiphase microstructure. We assume an infinitely large, two-dimensional (2D) structure satisfying periodic boundary conditions.

2.1 The Material System.

The evolution of material microstructures can be represented by two types of partial differential equations, i.e., the Allen–Cahn equation [10] representing the evolution of a nonconserved quantity, and the Cahn–Hilliard equation [11] representing the evolution of a conserved quantity. The Allen–Cahn equation has a general form of
$$\frac{\partial \phi}{\partial t} = -M\left(\frac{\partial F}{\partial \phi} - \gamma \nabla^2 \phi\right) \qquad (1)$$
while the Cahn–Hilliard equation has the form
$$\frac{\partial \phi}{\partial t} = M \nabla^2\left(\frac{\partial F}{\partial \phi} - \gamma \nabla^2 \phi\right) \qquad (2)$$

where $\phi=\phi(x,t)$ is called the "order parameter," which is a spatially varying quantity. In controls parlance, ϕ is the state of the system and is infinite dimensional, i.e., a spatiotemporally varying function. It reflects the component proportion of each phase of the material system. For the two-phase system studied in this work, ϕ = −1 represents one pure phase and ϕ = 1 represents the other, while ϕ ∈ (−1, 1) stands for a combined state of both phases on the boundary interface between the two pure phases; M is a parameter related to the mobility of the material, which is assumed constant in this study; F is an energy function with a nonlinear dependence on ϕ; and γ is a gradient coefficient controlling the diffusion level or thickness of the boundary interface.

In essence, the phase-field method is a gradient flow method, which means the evolution process follows the path of steepest descent in an energy landscape, starting from an initial state until arriving at a local minimum. Thus, the behavior of microstructures in phase-field modeling depends strongly on the selection of the energy density function F. For instance, the double-well potential function, having two minima at ϕ = ±1 separated by a maximum at ϕ = 0 (as plotted in Fig. 2), will cause the field ϕ to separate into regions where ϕ ≈ ±1, divided by the boundary interface, while the single-well potential function, having a single minimum at ϕ = 0 (see Fig. 2), predicts a gradual smoothing of any initial nonuniform ϕ field, yielding a uniform field with ϕ = 0 in the long-time limit.

Fig. 2
Illustration of double-well and single-well potential function F

Accordingly, the evolution of the material microstructure can be governed by selecting a proper form of the energy density function F. Controlling the evolution process of microstructures is completely different for the Allen–Cahn and Cahn–Hilliard equations. The former is governed by directly generating or deleting order parameter, while the latter is governed by guiding the transport and redistribution of the conserved order parameter across the whole domain.

In this study, we adopt the following general form of energy density function F:
(3)
Here, the parameters T and h are not governed by the A–C and C–H equations. Instead, they may be externally set to any value and may be spatially as well as temporally varying.

2.2 Discretized Model for Partial Differential Equation Control.

The phase-field is discretized into a 2D grid of dimension N × N. Let $\phi_t^{i,j}$ denote the phase-field of the cell defined by grid location (i, j), for all $i,j \in \{1,\ldots,N\}$, at time t.

From Eqs. (1) and (3), we have the A–C equation of the form
(4)
Now, we apply the above discretization to Eq. (4), and use a central-difference scheme to get the high-dimensional 2D phase-field model. Re-arranging the terms gives the time-variation of the local phase-field
(5)
We can further separate the state and control terms to get
(6)

with the periodic boundary conditions $\phi_t^{i,N+1}=\phi_t^{i,1}$ and $\phi_t^{N+1,j}=\phi_t^{1,j}$.

Similarly, from Eq. (2), we can write a central-difference scheme for the C–H equation
(7)

where $\mu_t^{i,j} = 4(\phi_t^{i,j})^3 + 2T_t^{i,j}\phi_t^{i,j} + h_t^{i,j} - \gamma\,\dfrac{\phi_t^{i+1,j}+\phi_t^{i-1,j}+\phi_t^{i,j+1}+\phi_t^{i,j-1}-4\phi_t^{i,j}}{\Delta x^2}$.

Given the phase-field vector $\phi_t$ and the control input vector $u_t$ at any time t, defined as stacks of all states $\{\phi_t^{i,j}\}$ and control inputs $u_t^{i,j}=\{T_t^{i,j},h_t^{i,j}\}^T$, respectively, we can simulate the phase-field at the next time-step from Eqs. (6) and (7). We also observe the dynamics to be affine in the control for both cases.

The system can, thus, be described in the discretized state-space form as
$$\Phi_{t+1} = f(\Phi_t) + g(\Phi_t)\,U_t \qquad (8)$$

where $\Phi_t=\{\phi_t^{i,j}\}$ is the phase variable on the spatially discretized finite difference grid at time t, and $U_t=\{u_t^{i,j}\}$ represents the control variables on the same grid at time t.

Remark 1. We consider the system available to us as sufficiently finely discretized to be able to accurately represent the dynamics of the infinite-dimensional PDE system. All our claims regarding the optimality of the feedback solution are with respect to this high, albeit finite, dimensional problem.
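To make the discretized update of Sec. 2.2 concrete, the following is a minimal sketch of one explicit central-difference step of the Allen–Cahn field with periodic boundary conditions. It is not the Fortran solver used in our simulations; the function names, step sizes, and the illustrative form of dF/dϕ (which simply mirrors the terms appearing in the discretized scheme above) are assumptions for illustration only.

```python
import numpy as np

def allen_cahn_step(phi, T, h, dF_dphi, M=1.0, gamma=1.0, dx=1.0, dt=1e-3):
    """One explicit Euler step of the discretized Allen-Cahn equation.

    phi, T, h : (N, N) arrays for the order parameter and the two control fields.
    dF_dphi   : callable returning dF/dphi as a function of (phi, T, h).
    The Laplacian is a five-point central-difference stencil with periodic
    boundaries, implemented with np.roll.
    """
    lap = (np.roll(phi, 1, axis=0) + np.roll(phi, -1, axis=0) +
           np.roll(phi, 1, axis=1) + np.roll(phi, -1, axis=1) - 4.0 * phi) / dx**2
    # Allen-Cahn: dphi/dt = -M * (dF/dphi - gamma * laplacian(phi))
    return phi + dt * (-M) * (dF_dphi(phi, T, h) - gamma * lap)

if __name__ == "__main__":
    N = 10
    rng = np.random.default_rng(0)
    phi = 0.1 * rng.standard_normal((N, N))        # small random initial field
    T = np.full((N, N), -1.0)                      # control field 1 (illustrative value)
    h = np.zeros((N, N))                           # control field 2
    dF = lambda p, T, h: 4 * p**3 + 2 * T * p + h  # illustrative, mirrors Sec. 2.2
    for _ in range(100):
        phi = allen_cahn_step(phi, T, h, dF)
    print(phi.min(), phi.max())
```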

2.3 Minimization of Energy Function F for Controlling Microstructure Dynamics.

The control variables T and h govern the evolution process of the microstructure by influencing the slope and the minima of the energy density function F, as shown in Fig. 3. Suppose that h = 0; a positive T then yields a double-well F, while a negative T gives a single-well F, as seen in Fig. 3(a). As explained earlier, under h = 0, T determines whether the ϕ field will go to a uniform state or a phase-separating state in the long-time limit. When h ≠ 0, the symmetric energy profile shown in Fig. 3(a) becomes skewed, as shown in Fig. 3(b). Under this condition, the parameter h determines which phase is favored during evolution. If h > 0, the minimum is located on the right-hand side, and therefore the phase represented by ϕ = 1 is favored; if h < 0, the ϕ field evolves more easily to the phase ϕ = −1.

Fig. 3
Effect of T and h on energy function F: (a) effect of T on F and (b) effect of h on F

In most applications, T and h represent specific input conditions, i.e., representing temperature and an external force field (e.g., electromagnetic or mechanical), respectively. Since we aim at investigating the control of material microstructure evolution during processing, we suppose that T and h can be adjusted to any value instantaneously, and that the feedback to the controller is the entire phase-field ϕ (please see Sec. 5 for details).

2.4 Difference of Allen–Cahn and Cahn–Hilliard Control.

Because the control variables (T and h) appear under different orders of derivatives in the two evolution equations, the evolution of the order parameter field from an initial state to the desired goal state is completely different under the Allen–Cahn and Cahn–Hilliard equations. For the Allen–Cahn equation, no derivative is applied to T and h, and the value of the order parameter can therefore be changed directly by choosing proper T and h (see a0–a4 in Fig. 4). For the Cahn–Hilliard equation, a Laplacian is applied to the term containing T and h, which causes the order parameter field to evolve to the desired state through the transport of the order parameter (see b0–b4 in Fig. 4). Consequently, the cost demanded to control the Allen–Cahn model from a given initial state to a goal state depends on the discrepancy of the initial state from the desired goal state, while that for the Cahn–Hilliard model depends on the distribution of the order parameter field as well.

Fig. 4
Evolution process of the order parameter field governed by the Allen–Cahn (a0–a4) and Cahn–Hilliard (b0–b4) equations. a0 and b0 represent the initial states that are controlled to evolve to the final states represented by a4 and b4, respectively.

3 Decoupled Data-Based Control Algorithm

In this section, we propose a data-based approach to solve the material phase-separation control problem. In particular, we apply the so-called D2C algorithm that we recently proposed [17], and extend it for high-dimensional applications. The D2C algorithm is a highly data-efficient RL method that has been shown to be much superior to state-of-the-art RL algorithms such as DDPG in terms of data efficiency and training stability while retaining similar or better performance. In the following, we give a very brief introduction to the D2C algorithm (please see Refs. [17], [31], and [32] for the relevant details).

Let us reconsider the dynamics of the system given in Eqs. (1) and (2) and rewrite it in a discrete time, noise perturbed state space form as follows:
$$\phi_{t+1} = f(\phi_t) + g(\phi_t)\,u_t + \epsilon\, w_t \qquad (9)$$

where $\phi_t$ is the state, $u_t$ is the control, $w_t$ is a white noise perturbation to the system, and $\epsilon$ is a small parameter that modulates the noise in the system. Suppose now that we have a finite horizon objective function $J(\phi_0)=E\big[\sum_{t=0}^{T-1} c(\phi_t,u_t)+\Psi(\phi_T)\big]$, where $c(\phi,u)$ denotes a running incremental cost, $\Psi(\phi)$ denotes a terminal cost function, and $E[\cdot]$ is an expectation operator taken over the sample paths of the system. The objective of the control design is then to design a feedback policy $u_t(\phi_t)$ that minimizes the cost function above, i.e., $J^*(\phi_0)=\min_{u_t(\cdot)} E\big[\sum_{t=0}^{T-1} c(\phi_t,u_t)+\Psi(\phi_T)\big]$.
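The specific running and terminal costs used in our experiments are not reproduced here; as a hedged illustration only, a quadratic tracking cost of the following form is one natural choice, consistent with the terminal-state, intermediate-state, and control-cost decomposition discussed later (Fig. 11). The weights q, r, and q_T are assumptions.

```python
import numpy as np

def running_cost(phi, u, phi_goal, q=1.0, r=1e-3):
    """Illustrative incremental cost c(phi_t, u_t): tracking error plus control effort."""
    return q * np.sum((phi - phi_goal) ** 2) + r * np.sum(u ** 2)

def terminal_cost(phi, phi_goal, q_T=100.0):
    """Illustrative terminal cost Psi(phi_T)."""
    return q_T * np.sum((phi - phi_goal) ** 2)
```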

The D2C algorithm then proposes a three step procedure to approximate the solution to the above problem.

First, a noiseless open-loop optimization problem is solved to find an optimal control sequence, u¯t*, that generates the nominal state trajectory ϕ¯t*
$$\{\bar{u}_t^*\}_{t=0}^{T-1} = \arg\min_{\{u_t\}} \sum_{t=0}^{T-1} c(\phi_t,u_t)+\Psi(\phi_T) \qquad (10)$$

subject to the zero-noise nominal dynamics $\phi_{t+1}=f(\phi_t)+g(\phi_t)u_t$.

Second, the perturbed time-varying linear system about the nominal, given by $\delta\phi_{t+1}=A_t\,\delta\phi_t+B_t\,\delta u_t$, is identified in terms of the time-varying system matrices $A_t$ and $B_t$.

Third, an LQR controller is designed for the above time-varying linear system, with time-varying gain $K_t$. Finally, the control applied to the system is given by $u_t(\phi_t)=\bar{u}_t^*+K_t\,\delta\phi_t$.

Sections 3.1–3.4 provide details for each of the above-mentioned steps of the D2C algorithm.

3.1 Open-Loop Trajectory Optimization: Iterative Linear Quadratic Regulator-Based Approach.

We utilize an ILQR-based method to solve the open-loop optimization problem. ILQR typically requires the availability of analytical system Jacobians and thus cannot be directly applied when such analytical gradient information is unavailable (much like nonlinear programming software whose efficiency depends on the availability of analytical gradients and Hessians). In order to make it an (analytical) model-free algorithm, it is sufficient to obtain estimates of the system Jacobians from simulations, and a sample-efficient randomized way of doing so is described in Sec. 3.1.1. Since ILQR is a well-established framework, we skip the details and instead present pseudocode in Algorithm 1.

3.1.1 Estimation of Jacobians: Linear Least Squares by Central Difference.

Using Taylor expansions of "h" (for generality, h denotes the nonlinear model of Sec. 2) about the nominal trajectory $(\bar\phi_t,\bar u_t)$ on both the positive and the negative sides, we obtain the following central-difference relation: $h(\bar\phi_t+\delta\phi_t,\bar u_t+\delta u_t)-h(\bar\phi_t-\delta\phi_t,\bar u_t-\delta u_t)=2\,[h_{\phi_t}\;\; h_{u_t}]\,[\delta\phi_t^T\;\;\delta u_t^T]^T+O(\|\delta\phi_t\|^3+\|\delta u_t\|^3)$. Multiplying both sides by $[\delta\phi_t^T\;\;\delta u_t^T]$ and applying the standard least-squares method yields the estimate of $[h_{\phi_t}\;\; h_{u_t}]$,

where $n_s$ is the number of samples of each of the random variables $\delta\phi_t$ and $\delta u_t$. Denote the random samples as $\delta\Phi_t=[\delta\phi_t^{(1)}\;\delta\phi_t^{(2)}\cdots\delta\phi_t^{(n_s)}]$, $\delta U_t=[\delta u_t^{(1)}\;\delta u_t^{(2)}\cdots\delta u_t^{(n_s)}]$, and $\delta Y_t=[\delta\Phi_t^T\;\;\delta U_t^T]^T$.

We are free to choose the distributions of $\delta\phi_t$ and $\delta u_t$. Given that we perform rollouts of the system, where $\{\delta u_t^{(i)}\}$ is a Gaussian white noise sequence for each rollout i, the state perturbations $\delta\phi_t^{(i)}$ are also Gaussian and independent across rollouts, for any given time t. This ensures that $\delta Y_t\,\delta Y_t^T$ is very well conditioned. To see this, consider the blocks of the matrix $\delta Y_t\,\delta Y_t^T=\begin{bmatrix}\delta\Phi_t\delta\Phi_t^T & \delta\Phi_t\delta U_t^T\\ \delta U_t\delta\Phi_t^T & \delta U_t\delta U_t^T\end{bmatrix}$, with $\delta\Phi_t\delta\Phi_t^T=\sum_{i=1}^{n_s}\delta\phi_t^{(i)}\delta\phi_t^{(i)T}$, and similarly $\delta U_t\delta U_t^T=\sum_{i=1}^{n_s}\delta u_t^{(i)}\delta u_t^{(i)T}$, $\delta U_t\delta\Phi_t^T=\sum_{i=1}^{n_s}\delta u_t^{(i)}\delta\phi_t^{(i)T}$, and $\delta\Phi_t\delta U_t^T=\sum_{i=1}^{n_s}\delta\phi_t^{(i)}\delta u_t^{(i)T}$.

From the definition of the sample variance, for a large enough $n_s$, the off-diagonal blocks above vanish (by the independence of the perturbations) while the diagonal blocks approach $n_s$ times the corresponding covariances; the matrix is therefore approximately block diagonal and well conditioned.
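A minimal sketch of this Jacobian estimate, assuming access to a black-box one-step simulator f(phi, u) operating on flattened state and control vectors, is given below; the perturbation scale sigma, the sample count ns, and the function names are our own assumptions.

```python
import numpy as np

def estimate_jacobians(f, phi_bar, u_bar, sigma=1e-3, ns=300, rng=None):
    """Estimate h_phi and h_u at a nominal point via central differences and
    linear least squares, using only rollouts of the black-box one-step map f.
    Requires ns >= phi_bar.size + u_bar.size for the normal matrix to be invertible."""
    rng = rng or np.random.default_rng(0)
    n_phi, n_u = phi_bar.size, u_bar.size
    dY = sigma * rng.standard_normal((n_phi + n_u, ns))   # stacked samples [dphi; du]
    dZ = np.empty((n_phi, ns))
    for i in range(ns):
        d_phi, d_u = dY[:n_phi, i], dY[n_phi:, i]
        dZ[:, i] = f(phi_bar + d_phi, u_bar + d_u) - f(phi_bar - d_phi, u_bar - d_u)
    # Central difference: dZ ~= 2 [h_phi h_u] dY, solved in the least-squares sense
    J = 0.5 * (dZ @ dY.T) @ np.linalg.inv(dY @ dY.T)
    return J[:, :n_phi], J[:, n_phi:]                      # h_phi, h_u
```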

3.2 Linear Time-Varying System Identification.

Closed-loop control design in step 2 of D2C requires the knowledge of the linearized system parameters At and Bt for 0tT1. Here, we use the standard least square method to estimate these parameters from input–output experiment data.

First, start from the perturbed linear system about the nominal trajectory and estimate the system parameters $A_t$ and $B_t$ from $\delta\phi_{t+1}=\hat{A}_t\,\delta\phi_t+\hat{B}_t\,\delta u_t$, where $\delta u_t^{(n)}$ is the control perturbation vector fed to the system at step t of the nth simulation, and $\delta\phi_t^{(n)}$ is the state perturbation vector obtained from the forward simulations/rollouts under the above control perturbations. All perturbations are zero-mean, independent and identically distributed Gaussian noise with covariance matrix $\sigma I$, where $\sigma$ is a small value (relative to the nominal control magnitude) selected by the user. $\delta\phi_{t+1}^{(n)}$ denotes the deviation of the output state vector from the nominal state after propagating for one step.

Run N simulations for each time-step and collect the input–output data into $Y=[\hat{A}_t\,|\,\hat{B}_t]\,\Phi$, whose components are
$$Y=\big[\delta\phi_{t+1}^{(1)}\;\;\delta\phi_{t+1}^{(2)}\;\cdots\;\delta\phi_{t+1}^{(N)}\big],\qquad \Phi=\begin{bmatrix}\delta\phi_t^{(1)} & \delta\phi_t^{(2)} & \cdots & \delta\phi_t^{(N)}\\ \delta u_t^{(1)} & \delta u_t^{(2)} & \cdots & \delta u_t^{(N)}\end{bmatrix} \qquad (11)$$

Finally, using the standard least-squares method, the linearized system parameters are estimated as $[\hat{A}_t\,|\,\hat{B}_t]=Y\,\Phi^T(\Phi\,\Phi^T)^{-1}$.
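A minimal sketch of this per-time-step system identification, assuming a black-box one-step simulator rollout(phi, u) and flattened nominal state/control arrays (the function and variable names are assumptions), is given below.

```python
import numpy as np

def identify_ltv(rollout, phi_nom, u_nom, sigma=1e-3, n_rollouts=300, rng=None):
    """Estimate (A_t, B_t) along a nominal trajectory from input-output data.

    phi_nom: (T+1, n_phi) nominal states; u_nom: (T, n_u) nominal controls.
    Returns lists of A_t and B_t estimated by least squares at each time-step."""
    rng = rng or np.random.default_rng(0)
    T, n_phi, n_u = u_nom.shape[0], phi_nom.shape[1], u_nom.shape[1]
    A_hat, B_hat = [], []
    for t in range(T):
        dphi = sigma * rng.standard_normal((n_phi, n_rollouts))
        du = sigma * rng.standard_normal((n_u, n_rollouts))
        Y = np.empty((n_phi, n_rollouts))
        for n in range(n_rollouts):
            Y[:, n] = rollout(phi_nom[t] + dphi[:, n], u_nom[t] + du[:, n]) - phi_nom[t + 1]
        Phi = np.vstack([dphi, du])                        # stacked regressors
        AB = (Y @ Phi.T) @ np.linalg.inv(Phi @ Phi.T)      # [A_t | B_t]
        A_hat.append(AB[:, :n_phi])
        B_hat.append(AB[:, n_phi:])
    return A_hat, B_hat
```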

3.3 Closed-Loop Control Design.

Given the estimated perturbed linear system, we design a finite-horizon, discrete-time LQR [33] along the trajectory for each time-step to minimize the cost function $J=\delta\phi_T^T Q\,\delta\phi_T+\sum_{t=0}^{T-1}\big(\delta\phi_t^T Q\,\delta\phi_t+u_t^T R\,u_t+2\,\delta\phi_t^T N u_t\big)$, subject to $\delta\phi_{t+1}=\hat{A}_t\,\delta\phi_t+\hat{B}_t\,\delta u_t$, where $u_t=-K_t\,\delta\phi_t$. The feedback gains are calculated as $K_t=(R+B_t^T P_{t+1} B_t)^{-1}(B_t^T P_{t+1} A_t+N^T)$, where $P_t$ is solved in a back-propagation fashion from the Riccati equation $P_{t-1}=A_{t-1}^T P_t A_{t-1}-(A_{t-1}^T P_t B_{t-1}+N)(R+B_{t-1}^T P_t B_{t-1})^{-1}(B_{t-1}^T P_t A_{t-1}+N^T)+Q$, with $P_T=Q$ and $N=0$. The closed-loop control policy is $u_t(\phi_t)=\bar{u}_t^*-K_t\,\delta\phi_t$, where $\delta\phi_t$ is the state deviation vector from the nominal state at time-step t.
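A minimal sketch of the backward Riccati recursion for the time-varying gains (with the cross term N = 0, as in this study) is given below; A_list and B_list are the identified LTV parameters, and the weights Q, R, and Qf are design choices.

```python
import numpy as np

def tv_lqr_gains(A_list, B_list, Q, R, Qf=None):
    """Finite-horizon, time-varying LQR: backward Riccati recursion for K_t."""
    T = len(A_list)
    P = Qf if Qf is not None else Q          # terminal condition P_T
    K = [None] * T
    for t in reversed(range(T)):
        A, B = A_list[t], B_list[t]
        K[t] = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - (A.T @ P @ B) @ K[t]
    return K
```

The resulting gains are then wrapped around the nominal control sequence about the nominal trajectory, per the sign convention stated above.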

3.4 Decoupled Data-Based Control Algorithm: Summary.

The D2C algorithm is summarized in Algorithm 2.

Algorithm 1

Open-loop trajectory optimization via model-free ILQR

Input: Initial state x_0, system parameters P.
m ← 1. /* Initialize the iteration number m to 1. */
forward_pass_flag = true.
/* Run until the change in cost between subsequent iterations is less than an ε fraction of the former cost. */
while m == 1 or cost(T_nom^{m−1}) / cost(T_nom^m) > 1 + ε do
   /* Each iteration has a backward pass followed by a forward pass. */
   {k_{0:N−1}^m, K_{0:N−1}^m}, backward_pass_success_flag ← Backward Pass(T_nom^m, P).
   if backward_pass_success_flag == true then
      T_nom^{m+1}, forward_pass_flag ← Forward Pass(T_nom^m, {k_{0:N−1}^m, K_{0:N−1}^m}, P).
      while forward_pass_flag == false do
         T_nom^{m+1}, forward_pass_flag ← Forward Pass(T_nom^m, {k_{0:N−1}^m, K_{0:N−1}^m}, P).
         Reduce α from P.
      end while
   else
      Increase μ from P. /* Regularization step */
   end if
   m ← m + 1.
   T_nom^* ← T_nom^m.
end while
return T_nom^*
Algorithm 2

D2C algorithm

(1) Solve the deterministic open-loop optimization problem for the optimal open-loop control sequence and state trajectory $(\{\bar{u}_t^*\}_{t=0}^{T-1},\{\bar{x}_t^*\}_{t=0}^{T})$ using ILQR (Sec. 3.1).
(2) Linearize and identify the LTV system parameters $(\hat{A}_t,\hat{B}_t)$ via least squares (Sec. 3.2).
(3) Solve the Riccati equations for each step along the nominal trajectory for the feedback gains $\{K_t\}_{t=0}^{T-1}$.
(4) Apply the closed-loop control policy: while t < T do
$$u_t = \bar{u}_t^* + K_t\,\delta\phi_t \qquad (12)$$
   t = t + 1.
end while

4 Control Discretization

The system dynamics and algorithm described in Secs. 2 and 3 assume the state to be fully observed, and the actuation was designed to have grid-level discretization in the control inputs. However, such a detailed actuation scheme is not practical, owing to (a) a lack of actuators at such precision scales, and (b) the number of actuation channels growing rapidly for realistic grid discretizations. Thus, there is a need to find a suitable control discretization, akin to a boundary condition, to suitably direct the phase-field dynamics.

PDEs generally tend to have an underlying low-dimensional structure owing to sparsity. An existing adaptive strategy for the Allen–Cahn equation without control uses a fine mesh on the interface and a coarser mesh on the bulk regions [13]; however, incorporating control with such a scheme may result in the nucleation of a phase [34]. Hence, there is a need to fine-tune the control discretization in order to balance controllability and feasibility.

4.1 Tuning Control Discretization.

Selection of the ideal control discretization warrants consideration of the features present in the dynamics, along with the interfaces between the multiple phases present in the microstructure. Typically, a grid-scale control actuation would be ideal for maintaining full control over the microstructure dynamics. However, from Sec. 3.2, we know that the computation of the LTV system parameters $(A_t, B_t)$ scales as $O(n_\phi+n_u)$, with a full grid-scale control discretization resulting in $n_u=2N^2$. As we scale the grid to realistic meshes, the cost of the LTV system identification thus grows rapidly, making the algorithm infeasible. Hence, there is a need to reduce the control discretization to minimize computational complexity; a coarse actuation grid can then be mapped onto the fine phase grid, as sketched below. On the contrary, too coarse a control mesh would result in the inability to dictate the dynamics to achieve specific features, leading to a reduction in the reachability of the target microstructure.
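A minimal sketch of one such coarse-to-fine mapping, assuming each coarse control value actuates a square block of grid cells (block-constant actuation, an assumption for illustration), is:

```python
import numpy as np

def upsample_control(u_coarse, N):
    """Broadcast an (n x n) coarse control field onto the (N x N) phase grid.

    N must be a multiple of n; each coarse value actuates an (N//n x N//n) block."""
    n = u_coarse.shape[0]
    assert N % n == 0, "fine grid must be a multiple of the coarse grid"
    return np.kron(u_coarse, np.ones((N // n, N // n)))
```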

In order to empirically determine the effect of different control discretization schemes and identify the optimal choice, we vary the discretization from the phase grid-level to coarser grids. We aim to achieve a banded microstructure configuration as our target and vary the actuation scheme. This results in situations where the actuation grid is either smaller or larger than the features of the target microstructure (Figs. 5(a) and 5(b)). For both these cases, it is observed that the reduction in control effort, as compared to the original grid-scale control discretization, comes at the expense of reachability of the target microstructure (Fig. 5(d)).

Fig. 5
Controlling a microstructure configuration wherein the features of the target microstructure are (a) smaller than, (b) larger than the control resolution, and (c) zonal actuation corresponding to similar target states. We observe that the control is feasible only for the target-oriented zonal actuation scheme (d). (a) Smaller actuation discretization, (b) larger actuation discretization, (c) target-oriented zonal actuation, and (d) cost comparison.

We also observe that, given a segment of the grid with the same goal state, if we choose our control resolution such that the same control input is fed to the grid sections with the same corresponding target state, we are able to achieve the desired state (Figs. 5(c) and 5(d)). Thus, it is essential to balance a grid fine enough to capture the features of the target microstructure against one sparse enough to reduce the computational burden and ease the application, without significant loss of precision.

Hence, we arrive at a spatial control strategy based on the distribution of the order parameter in the goal state, named "goal state-oriented" (GS-o) control in this study.

4.2 Goal State-Oriented Scheme of Actuation Discretization.

While the desired order parameter field always consists of a certain combination of pure phases represented by ϕ = ±1, numerical experiments reveal that the order parameter can be guided from any initial value to either pure phase by using the same control sequence (T and h).

This can be done through a two-step process. Starting from any arbitrary distribution of the order parameter, we can homogenize the microstructure through uniform heating, giving a structure with a uniform initial order parameter, which is considered as our new initial state. Now, we can realize the control by applying an identical control sequence to the regions whose target values in the goal state are the same (Fig. 6). That is, we do not control every pixel across the whole domain; instead, we consistently control two regions: one represented by ϕ = −1 and the other by ϕ = +1 in the goal state (Fig. 5(c)). Thus, the same control inputs (T, h) are used to actuate all the regions corresponding to ϕ = −1, while a different control input is used to actuate all the regions corresponding to ϕ = +1. This actuation scheme is typical of these material microstructure systems, and computational results on using such a scheme are presented in Sec. 5.

Fig. 6
The GS-o process: starting from a nonuniform initial microstructure (a), we apply uniform heating and homogenize the microstructure (b) to generate same initial conditions throughout the grid. Thus, the problem reduces to the control of two zones corresponding to the phases ϕ=±1 of the target microstructure (c). (a) Nonuniform initial state, (b) homogenized new initial state, and (c) target microstructure.

This treatment significantly reduces the number of control inputs from the order of the spatial discretization ($2N^2$) to just 4, and therefore greatly reduces the computational cost. Moreover, such a control resolution scheme is more feasible in practical scenarios as well.
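A minimal sketch of the GS-o actuation map, assuming the two zones are identified from the sign of the goal field (the function and argument names are our own), is:

```python
import numpy as np

def gso_to_grid(phi_goal, T_zone, h_zone):
    """Broadcast the 4 GS-o control inputs onto the full grid.

    phi_goal       : (N, N) target order-parameter field (values near +/-1).
    T_zone, h_zone : length-2 sequences holding the (T, h) controls for the
                     phi = -1 zone (index 0) and the phi = +1 zone (index 1)."""
    plus = phi_goal > 0                        # zone assignment from the goal state
    T = np.where(plus, T_zone[1], T_zone[0])
    h = np.where(plus, h_zone[1], h_zone[0])
    return T, h
```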

5 Simulation Results

In this section, we compare the training and performance of the data-based D2C approach, with the RL-based DDPG, on a material with control inputs as the temperature and external field inputs on subsets of grid points.

5.1 Structure and Task.

We simulated the phase-separation dynamics in Python, calling an explicit, second-order solver subroutine written in Fortran. The system and its tasks are defined as follows:

Material Microstructure. The material model simulated consists of a two-dimensional grid of dimensions 10 × 10, 20 × 20, and 50 × 50, i.e., 100, 400, and 2500 fully observable states, respectively. The order parameter at each grid point can be varied in the range [−1, 1]. Furthermore, the model is propagated using an explicit, second-order, central-difference based scheme. The control inputs (T, h) are actuated at each grid point separately; thus, the number of control variables is twice the total number of observable states (200, 800, and 5000, respectively). The control task is to influence the material dynamics to converge to a

  1. Banded phase-distribution (Fig. 7(b)), with 100 state variables and 200 control channels.

  2. Checkerboard phase-distribution (Fig. 7(c)), with 400 state variables and 800 control channels.

  3. Custom phase-distribution (Fig. 7(d)), with 2500 state variables and 5000 control channels.

Fig. 7
Model simulated in python: (a) initial state, (b) goal state-I, (c) goal state-II, and (d) goal state-III

The initial and the desired final states of the model are shown in Fig. 7.

5.2 Algorithms Tested.

Deep deterministic policy gradient is a policy-gradient based, off-policy reinforcement learning algorithm that operates in continuous state and action spaces. It relies on two function approximators, one each for the actor and the critic. The critic network estimates the Q(s, a) value given the state and the action taken, while the actor network generates a policy given the current state. Deep neural networks are employed to represent both.

Further details regarding the algorithm and the implementation are provided in Appendix A2.

5.3 Training and Testing.

Decoupled data-based control implementation is done in three stages corresponding to those mentioned in Sec. 3, and a black-box phase-field model is simulated in Python.

Training: The open-loop training plots in Fig. 8 show the cost curve during training. After the cost curves converge, we obtain the optimal control sequence that drives the systems to accomplish their tasks. The training parameters and outcomes are summarized in Tables 1–3. We note that the training time is dominated by the Python simulation code and can be made significantly faster with optimized and parallelized code. Thus, we envisage this work as an initial proof of concept for controlling material microstructures.

Fig. 8
Convergence of episodic cost for goal-state I (left column), goal-state II (middle column) for Allen–Cahn PDE, and goal-state I (right column) for Cahn–Hilliard PDE. (a) DDPG convergence (Allen–Cahn, GS-I), (b) DDPG convergence (Allen–Cahn, GS-II), (c) DDPG convergence (Cahn–Hilliard, GS-I), (d) D2C–ILQR convergence (Allen–Cahn, GS-I), (e) D2C–ILQR convergence (Allen–Cahn, GS-II), and (f) D2C–ILQR convergence (Cahn–Hilliard, GS-I).
Table 1

Comparison of the training outcomes of D2C with DDPG for the Allen–Cahn model

Training time (s)

Goal state            D2C         DDPG
Goal-I (10 × 10)      15.83       2567.71*
Goal-II (20 × 20)     55.206      4419.36*
Goal-III (50 × 50)    3289.354    **

Notes: (1) The open-loop training is run on a laptop with a two-core CPU @ 2.9 GHz and 12 GB RAM; no multithreading is used at this point. (2) * indicates that the algorithm cannot converge to the goal state. (3) ** indicates that the system runs out of memory at run-time.

Table 2

Comparison of the training outcomes of D2C with DDPG for the Cahn–Hilliard model

Training time (s)

Goal state            D2C         DDPG
Goal-I (10 × 10)      141.573     3598.298*
Goal-II (20 × 20)     5083.223    9956.454*

Notes: (1) The open-loop training is run on a laptop with a two-core CPU @ 2.9 GHz and 12 GB RAM; no multithreading is used at this point. (2) * indicates that the algorithm cannot converge to the goal state.

Table 3

Parameter size comparison between D2C and DDPG

System      No. of steps    No. of actuators    No. of parameters optimized in D2C    No. of parameters optimized in DDPG
Goal-I      10              200                 2000                                  461,901
Goal-II     10              800                 8000                                  1,122,501
Goal-III    10              5000                50,000                                5,746,701

Testing Criterion: Given that RL algorithms perform a search over a nonlinear parametrization for optimization, it might be expected for DDPG to yield a global feedback policy, while D2C, with a linear feedback structure, should provide a locally valid feedback solution. However, from the training outcomes, we observe that DDPG fails to converge to the target, and hence using DDPG solutions as a global feedback law is infeasible in this context. Thus, for the closed-loop design, we proceed with the system identification and feedback gain design step of the D2C algorithm mentioned in Secs. 3.2 and 3.3 to get the closed-loop control policy.

To observe how the closed-loop feedback solution performs globally, we benchmark the performance of the open-loop and the closed-loop D2C control policies under different noise levels. The open-loop control policy applies the optimal control sequence solved in the open-loop training step without any feedback. Adding perturbations to this control sequence drives the state off the nominal trajectory, increasing the episodic cost with a corresponding increase in noise level. Hence, zero-mean, independent and identically distributed Gaussian noise is added to every control channel at each step, with the standard deviation proportional to the maximum control signal in the designed optimal control sequence for D2C, sweeping over a range of noise levels to explore the "global" behavior. One hundred rollouts are simulated for each noise level, and the mean and standard deviation of the total episodic cost are used as the benchmark criteria.
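A minimal sketch of one such noisy closed-loop rollout, assuming a black-box one-step simulator step(phi, u) and the illustrative cost functions of Sec. 3 (all names are our own, and the feedback sign follows the convention of Sec. 3), is:

```python
import numpy as np

def noisy_rollout(step, running_cost, terminal_cost, phi_nom, u_nom, K, noise_std, rng):
    """One closed-loop episode under control-channel noise.

    phi_nom, u_nom : nominal state and control trajectories from the open-loop solve.
    K              : list of LQR gains; feedback is K[t] @ (phi_t - phi_nom[t]).
    noise_std      : standard deviation of the i.i.d. Gaussian noise added to
                     every control channel (swept as a fraction of max |u_nom|)."""
    phi, J = phi_nom[0].copy(), 0.0
    for t in range(len(u_nom)):
        u = u_nom[t] + K[t] @ (phi - phi_nom[t])
        u = u + noise_std * rng.standard_normal(u.shape)
        J += running_cost(phi, u)
        phi = step(phi, u)
    return J + terminal_cost(phi)
```

Averaging this episodic cost over 100 rollouts at each noise level reproduces the benchmark described above.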

The feedback policy obtained appears to be near-optimal and highly robust for low noise levels (Fig. 9), which is as expected from the O(ϵ4)-optimality of the algorithm used [32]. But when operating in high noise regimes, the robustness of the feedback policy quickly degenerates (Fig. 10).

Fig. 9
Performance comparison between D2C open-loop and closed-loop control policy

It must be noted that the range of noise levels (at least up until about 100% of the maximum nominal control signal) that we consider here is far beyond what is typically encountered in practical scenarios. Generally, noise in the system is a result of either external disturbances in the ambient conditions or small disturbances in the control channels, arising potentially due to faulty actuators, signal attenuation, cross-talk, etc. In practice, control actuators have a typical signal-to-noise ratio of 20–80 dB, resulting in noise on the order of 1% of the control signal or less. Thus, as evident from the results, the feedback policy should maintain robustness and near-optimality in such practical regimes.

5.4 Optimization for Computational Efficiency.

From the experiments, we are able to converge the given microstructure to any of our given target states within finite time through iterative optimization. Thus, the question arises: what is the optimal actuation scheme that minimizes the control cost (analogous to energy expenditure in practical applications) without significantly impacting the ease of application of such a scheme? Are there approaches for a smarter selection of the initial guess wherein we can leverage knowledge about the physics of the system in order to ensure faster convergence?

From the set of observations, there are three regimes wherein we tackle this problem.

Fig. 10
Comparing robustness to process noise between feedback-policy and recursive model-predictive control

5.4.1 Actuation Discretization Using Goal State-Oriented Control.

The advantage of GS-o control over fully actuated control in terms of computational efficiency is shown in Fig. 11: (i) thanks to the smaller number of control variables, GS-o converges much faster than fully actuated control, as seen in Fig. 11(a), and takes much less computation time, as shown in Fig. 11(b); and (ii) the price of this computational efficiency is that GS-o control either requires a relatively higher control cost to guide the microstructure to terminal states with the same discrepancy from the goal state, as seen in Fig. 11(c), or attains a terminal state with a relatively larger discrepancy for the same control cost. This is caused by the increased difficulty of controlling the interface region under GS-o control. Considering that the sharp interface in the goal state is in reality somewhat smoothed, the error localized near the interface is of limited significance. Therefore, it is reasonable to conclude that GS-o is able to control the system with much better computational efficiency.

Fig. 11
Comparison of (a) convergence rate, (b) computation time, and (c) cost between fully discretized control (O(state-space)) and GS-o control (O(4)). In (c), cost1, cost2, and cost3 stand for the terminal cost, cost contributed by intermediate states, and control cost, respectively. (a) Convergence rate, (b) computation time, and (c) cost comparison.

5.4.2 Temporal Discretization.

The update rate of the control inputs (i.e., the temporal discretization of the horizon) also impacts the total control cost. We, thus, need to test whether holding a control input over longer step sizes is more efficient than updating the controls at a faster frequency.

From simulations held for a fixed period of the dynamics, it is observed that increasing the number of control update steps (while decreasing the average hold times) is significantly more cost-efficient (Fig. 12).
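A minimal sketch of the hold-time comparison set-up, assuming a zero-order hold of each control update over a fixed number of simulation steps (the helper name is our own), is:

```python
def zero_order_hold(u_updates, hold):
    """Expand a sequence of control updates into a per-simulation-step sequence
    by holding each update constant for `hold` steps; fewer updates with longer
    holds and more updates with shorter holds then cover the same fixed horizon."""
    return [u for u in u_updates for _ in range(hold)]
```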

Fig. 12
Comparing control cost at different time-discretizations: (a) Allen–Cahn model and (b) Cahn–Hilliard model

5.4.3 Warm Starting (Model-Based Approach).

In this section, we derive a simple control scheme to control the inputs to each grid point in the material, so as to reach the desired phase distribution. In order to achieve this, we exploit the Allen–Cahn dynamics along with the assumptions of a periodic boundary condition to get a better initial estimate for the control scheme.

The Allen–Cahn equation has an interesting structure. Suppose we assume our system to be a single grid cell/pixel, i.e., discretized as a 1 × 1 matrix. Owing to our assumption of a periodic boundary condition, the diffusion term effectively vanishes, i.e., 2ϕ=0. Thus, our dynamics reduce to
$$\frac{\partial \phi}{\partial t} = -M\,\frac{\partial F}{\partial \phi} \qquad (13)$$

Thus, the simplified model reduces to a temporal differential equation instead of a complex spatiotemporal partial differential equation. This single pixel system is then fed to our ILQR-based control scheme, to generate a look-up table (of sorts) for a control trajectory from the given initial state ϕ0 to both stable final states ϕf=±1.

The control trajectories thus obtained are mapped to any desired final state structure and serve as an improved initial guess for the main algorithm, i.e., now the algorithm only has to optimize for the diffusion term.
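A minimal sketch of this warm-start construction, assuming the single-pixel dynamics of Eq. (13) with a user-supplied dF/dϕ and precomputed single-pixel control sequences to ϕ = +1 and ϕ = −1 (all names and step sizes are assumptions), is:

```python
import numpy as np

def single_pixel_trajectory(phi0, u_seq, dF_dphi, M=1.0, dt=1e-3):
    """Roll out the zero-diffusion single-pixel dynamics of Eq. (13) under a
    control sequence u_seq = [(T_0, h_0), (T_1, h_1), ...]."""
    phi, traj = phi0, [phi0]
    for T, h in u_seq:
        phi = phi + dt * (-M) * dF_dphi(phi, T, h)
        traj.append(phi)
    return np.array(traj)

def warm_start_from_lookup(phi_goal, u_plus, u_minus):
    """Map the precomputed single-pixel control sequences onto every grid point
    according to its target phase, giving an initial guess for the full ILQR solve."""
    n_steps = len(u_plus)
    guess = np.empty((n_steps,) + phi_goal.shape + (2,))
    for t in range(n_steps):
        for k in range(2):                     # k = 0 -> T channel, k = 1 -> h channel
            guess[t, ..., k] = np.where(phi_goal > 0, u_plus[t][k], u_minus[t][k])
    return guess
```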

We compare the D2C training plots in Fig. 13, with a random initial guess, along with optimizing the single-pixel model-based control trajectory (Fig. 14). From the training convergence plots, it is clear that the control trajectory obtained from the model-based approach has a performance within 10% of that of the optimized trajectory which includes the diffusive-aspects of the Allen–Cahn dynamics.

Fig. 13
Open-loop training with the starting guess initialized as (a) random control trajectory with zero mean and (b) model-based approach

Our approach, thus, cuts the training time by nearly 40%, and the total convergence time (including generating the look-up table for the single-pixel model) is still less than that of running the full-scale model with a random initial guess.

We also observe that the final optimal control trajectory for different spatial-points of the material, converging to the same final phase ϕf, is the same. Thus, mapping the same initial guess for all such points is justified.

Discussion: The total time comparison in Tables 1 and 2 shows that D2C learns the optimal policy significantly faster than DDPG. The primary reason for this disparity is the feedback parametrization of the two methods: the DDPG deep neural nets are complex parametrizations that are difficult to search over, when compared to the highly compact open-loop + linear feedback parametrization of D2C, as is apparent from Table 3. We observe from experiments that the DDPG algorithm is not able to converge to the goal state for the material system of state-dimension over 5 × 5 discretization (25 states and 50 controls), while the D2C algorithm can quickly generate the closed-loop feedback solution even for the more complex systems.

From Fig. 11(c), we also observe that, for the Allen–Cahn dynamics, the additional cost from switching to a control-grid of smaller discretization, mainly stemming from an increase in control cost, is significantly smaller when compared to the ease of application, and computation time advantage (Fig. 11(b)) gained from such an approach.

Limitations: Despite the computational advantages of a tighter parametrization, along with the near-optimality guarantees that the D2C algorithm holds, scaling such an optimal control approach to realistic microstructure grids will be computationally intractable, owing to the sampling/rollouts required for system identification of the "black box." In order to exploit the sparsity of the microstructure dynamics, we have proposed a new approach to address this scaling complexity in our recent work [35]. Also, the D2C algorithm relies on full-state feedback to generate the optimal trajectories, which may not be an option for realistic-scale grids owing to sensor constraints. In order to incorporate our proposed GS-o approach into such problems, we have proposed a partially observed application of the ILQR algorithm in our recent work [36].

Fig. 14
Comparing trajectories from model-based single-pixel method (no diffusion) with the optimized trajectory (with diffusion)

6 Conclusions and Future Work

In this article, we have provided an overview of the modeling of multiphase microstructures in accordance with the Allen–Cahn and Cahn–Hilliard equations. We have also compared learning/data-based approaches to the control of such structures, outlining their relative merits and demerits.

In the future, we further plan to extend the methodology to real-world materials, leading to the development of an experimental testbed on which the proposed approach for microstructure control can be verified and modified to facilitate the development of designer materials.

Funding Data

  • NSF CDS&E program (Grant No. 1802867; Funder ID: 10.13039/100000078).

Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

Footnotes

1

This paper was presented at the American Control Conference 2024.

Appendix

A1 Iterative Linear Quadratic Regulator Details.

We present details of the D2C–ILQR algorithm in the following. The “forward pass” and “backward pass” algorithms are summarized in Algorithms 3 and 4, respectively. Algorithm 5 presents the differential dynamic programming (DDP) feedback algorithm used to calculate the optimal feedback gain.

Algorithm 3

Forward pass

Input: Nominal trajectory T_nom^k, previous iteration policy parameters {k_{0:N−1}, K_{0:N−1}}, and system and cost parameters P.
{x̄_t^{prev}, ū_t^{prev}} ← T_nom^k.
t ← 0.
while t < N do
   /* Simulate one step forward (α is the line-search parameter). */
   ū_t = ū_t^{prev} + α k_t + K_t (x̄_t − x̄_t^{prev}),  x̄_{t+1} = simulate_forward_step(x̄_t, ū_t).
   t ← t + 1.
end while
T_nom^{k+1} ← {x̄_{0:N}, ū_{0:N−1}}.
if the cost of T_nom^{k+1} relative to T_nom^k is acceptable then
   return T_nom^{k+1}, true.
else
   return T_nom^k, false.
end if
Algorithm 4

Backward pass

Input: Nominal trajectory T_nom^k, previous iteration policy parameters {k_{0:N−1}, K_{0:N−1}}, horizon N, and system and cost parameters P.
/* The backward pass starts from the final time-step, i.e., N − 1. */
t ← N − 1.
Compute J_{x_N} and J_{x_N x_N} using the boundary conditions.
/* Keep a copy of the previous policy gains. */
k_old ← k_{0:N−1} and K_old ← K_{0:N−1}.
while t >= 0 do
   /* Obtain the Jacobians from simulator rollouts as shown in Sec. 3.1.1: */
   h_{x_t}, h_{u_t} ← model_free_jacobian(x̄_t, ū_t).
   /* Obtain the partials of the Q function as follows: */
   Q_{x_t} = c_{x_t} + h_{x_t}^T J_{x_{t+1}},
   Q_{u_t} = c_{u_t} + h_{u_t}^T J_{x_{t+1}},
   Q_{x_t x_t} = c_{x_t x_t} + h_{x_t}^T J_{x_{t+1} x_{t+1}} h_{x_t},
   Q_{u_t x_t} = c_{u_t x_t} + h_{u_t}^T (J_{x_{t+1} x_{t+1}} + μ I_{n_x×n_x}) h_{x_t},
   Q_{u_t u_t} = c_{u_t u_t} + h_{u_t}^T (J_{x_{t+1} x_{t+1}} + μ I_{n_x×n_x}) h_{u_t}.
   if Q_{u_t u_t} is positive-definite then
      k_t = −Q_{u_t u_t}^{−1} Q_{u_t},  K_t = −Q_{u_t u_t}^{−1} Q_{u_t x_t}.
   else
      /* If Q_{u_t u_t} is not positive-definite, abort the backward pass. */
      return {k_old, K_old}, false.
   end if
   /* Obtain the partials of the value function J_t as follows: */
   J_{x_t} = Q_{x_t} + K_t^T Q_{u_t u_t} k_t + K_t^T Q_{u_t} + Q_{u_t x_t}^T k_t,
   J_{x_t x_t} = Q_{x_t x_t} + K_t^T Q_{u_t u_t} K_t + K_t^T Q_{u_t x_t} + Q_{u_t x_t}^T K_t.
   t ← t − 1.
end while
k_new = k_{0:N−1},  K_new = K_{0:N−1}.
return {k_new, K_new}, true.
Algorithm 5

DDP feedback

Input: Nominal trajectory T_nom^k, horizon N, and system and cost parameters P.
/* Start from the final time-step, i.e., N − 1. */
t ← N − 1.
Compute J_{x_N} and J_{x_N x_N} using the boundary conditions.
while t >= 0 do
   /* Obtain the Jacobians from simulator rollouts as shown in Sec. 3.1.1: */
   h_{x_t}, h_{u_t} ← model_free_jacobian(x̄_t, ū_t).
   /* Obtain the Hessians from simulator rollouts in a similar fashion: */
   h_{x_t x_t}, h_{u_t x_t}, h_{u_t u_t} ← model_free_hessian(x̄_t, ū_t).
   /* Obtain the partials of the Q function as follows: */
   Q_{x_t} = c_{x_t} + h_{x_t}^T J_{x_{t+1}},
   Q_{u_t} = c_{u_t} + h_{u_t}^T J_{x_{t+1}},
   Q_{x_t x_t} = c_{x_t x_t} + h_{x_t}^T J_{x_{t+1} x_{t+1}} h_{x_t} + J_{x_{t+1}} · h_{x_t x_t},
   Q_{u_t x_t} = c_{u_t x_t} + h_{u_t}^T (J_{x_{t+1} x_{t+1}} + μ I_{n_x×n_x}) h_{x_t} + J_{x_{t+1}} · h_{u_t x_t},
   Q_{u_t u_t} = c_{u_t u_t} + h_{u_t}^T (J_{x_{t+1} x_{t+1}} + μ I_{n_x×n_x}) h_{u_t} + J_{x_{t+1}} · h_{u_t u_t}.
   if Q_{u_t u_t} is positive-definite then
      k_t = −Q_{u_t u_t}^{−1} Q_{u_t},  K_t = −Q_{u_t u_t}^{−1} Q_{u_t x_t}.
   else
      /* If Q_{u_t u_t} is not positive-definite, abort the backward pass. */
      return error.
   end if
   /* Obtain the partials of the value function J_t as follows: */
   J_{x_t} = Q_{x_t} + K_t^T Q_{u_t u_t} k_t + K_t^T Q_{u_t} + Q_{u_t x_t}^T k_t,
   J_{x_t x_t} = Q_{x_t x_t} + K_t^T Q_{u_t u_t} K_t + K_t^T Q_{u_t x_t} + Q_{u_t x_t}^T K_t.
   t ← t − 1.
end while
K = K_{0:N−1}.
return {K}, true.

DDP Feedback Gain Calculation. Once the optimal nominal trajectory is obtained with ILQR, one DDP back pass is conducted to find the linear optimal feedback gain as shown in Algorithm 5. Then, the linear feedback is wrapped around the nominal control sequence (ut=u¯t+Ktδxt), where δxt is the state deviation from the nominal state x¯t.

A2 Deep Deterministic Policy Gradient Algorithm Implementation Details.

The off-policy characteristic of the algorithm employs a separate behavioral policy by introducing additive noise to the policy output obtained from the actor network. The critic network minimizes loss based on the temporal-difference error, and the actor network uses the deterministic policy gradient theorem to update its policy gradient as shown below:

Critic update by minimizing the loss: $L=\frac{1}{N_b}\sum_i\big(y_i-Q(s_i,a_i\,|\,\theta^Q)\big)^2$, with the targets $y_i=r_i+\gamma\,Q'\big(s_{i+1},\mu'(s_{i+1}\,|\,\theta^{\mu'})\,|\,\theta^{Q'}\big)$.
Actor policy gradient update: $\nabla_{\theta^\mu}J\approx\frac{1}{N_b}\sum_i\nabla_a Q(s,a\,|\,\theta^Q)\big|_{s=s_i,\,a=\mu(s_i)}\,\nabla_{\theta^\mu}\mu(s\,|\,\theta^\mu)\big|_{s_i}$, where $N_b$ is the minibatch size, $\gamma$ is the discount factor, and the primed networks denote the target networks.

The actor and the critic networks each consist of two hidden layers, with the first layer containing 400 ("relu" activated) units followed by a second layer containing 300 ("relu" activated) units. The output layer of the actor network has a number of ("tanh" activated) units equal to the number of actions in the action space.

Target networks, one each for the actor and the critic, are employed for a gradual update of the network parameters, thereby reducing oscillations and improving training stability. The target networks are updated with τ = 0.001. Finally, the networks are compiled using the Adam optimizer with a learning rate of 0.001.

To account for state-space exploration, the behavioral policy includes an off-policy term drawn from a random process. We obtain discrete samples from the Ornstein–Uhlenbeck process to generate the noise, as in the original DDPG method. The Ornstein–Uhlenbeck process is mean-reverting and produces temporally correlated noise samples according to

dx_t = Θ(μ − x_t) dt + σ dW_t

where W_t is a standard Wiener process, Θ indicates how fast the process reverts to its mean, μ is the equilibrium (mean) value, and σ corresponds to the degree of volatility of the process. Θ is set to 0.15, μ to 0, and σ is annealed from 0.35 to 0.05 over the course of training.
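
A minimal sketch of such an Ornstein–Uhlenbeck noise generator, using a simple Euler–Maruyama discretization, is given below; the time-step dt and the class interface are illustrative assumptions.

# Ornstein-Uhlenbeck exploration noise (illustrative sketch; theta = 0.15, mu = 0,
# sigma annealed externally from 0.35 to 0.05 over training).
import numpy as np

class OUNoise:
    def __init__(self, dim, theta=0.15, mu=0.0, sigma=0.35, dt=1e-2):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.x = np.full(dim, mu)

    def sample(self):
        # dx = theta * (mu - x) dt + sigma * sqrt(dt) * N(0, I)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

# The sampled noise is added to the actor output, and sigma may be annealed after
# each episode, e.g., noise.sigma = max(0.05, noise.sigma * decay).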

A3 Deep Deterministic Policy Gradient Trajectories.

The corresponding trajectories generated by the DDPG algorithm are shown in Fig. 15. It is observed that the DDPG algorithm fails to converge to the goal state for discretizations finer than 5 × 5.

Fig. 15: DDPG trajectories for a material model discretized as 5 × 5 (panels (a)–(c)) and 20 × 20 (panels (d)–(f)): (a) initial, (b) t = 0.50 s, (c) t = 1.00 s, (d) initial, (e) t = 0.50 s, and (f) t = 1.00 s.
