A Benchmark Time Series Dataset for Semiconductor Fabrication Manufacturing Constructed using Component-based Discrete-Event Simulation Models

Vamsi Krishna Pendyala
Arizona State University
Tempe, AZ 85281
vpendya2@asu.edu
Hessam S. Sarjoughian
Arizona State University
Tempe, AZ 85281
hessam.sarjoughian@asu.edu
Bala Sujith Potineni
Arizona State University
Tempe, AZ 85281
bpotinen@asu.edu
Edward J. Yellig
Intel Corporation
Chandler, AZ 85226
edward.j.yellig@intel.com

Abstract

Advancements in high-performance computing devices increase the need for new and improved understanding and development of smart manufacturing factories. Discrete-event models with simulators have been shown to be critical to architecting, designing, building, and operating the manufacturing of semiconductor chips. The diffusion, implantation, and lithography machines have intricate processes due to their feedforward and feedback connectivity. Datasets collected from simulations of the factory models hold the promise of generating valuable machine-learning models. As surrogate data-based models, their executions are highly efficient compared to their physics-based counterparts. For the development of surrogate models, it is beneficial to have publicly available benchmark simulation models that are grounded in factory models with concise structures and accurate behaviors. Hence, in this research, a dataset is devised and constructed based on a benchmark model of an Intel semiconductor fabrication factory. The model is formalized using the Parallel Discrete-Event System Specification and executed using the DEVS-Suite simulator. The time series dataset is constructed using discrete-event time trajectories. This dataset is further analyzed and used to develop baseline univariate and multivariate machine learning models. The dataset can also be utilized in the machine learning community for behavioral analysis based on formalized and scalable component-based discrete-event models and simulations.

1 Introduction

Simulation of automated manufacturing processes is essential for the efficient use of complex and expensive machines. Efficient use requires effective, coordinated decision-making that accounts for a variety of factors, including market conditions and processes spanning days to months, with end-to-end control of networked machines on the scale of minutes. Extensive resources are employed to create accurate physics-based manufacturing models to meet these well-understood needs. In addition, designing, conducting, and evaluating simulation experiments are major undertakings. To reduce cost and time for practitioners, research on machine learning methods that support building digital twins of manufacturing systems is attracting greater attention among academic and industry researchers. Machine learning models offer many benefits for analytics. Executing data-based models, compared with their physics-based counterparts, is expected to be computationally scale-free. Models generated primarily from data are intended to respond faster to changes, which helps in determining how manufacturing systems should best operate under short- and long-horizon requirements. Furthermore, libraries of generated ML models can be updated more frequently than physics-based models as additional data becomes available. It is also possible to generate synthetic data using physics-based simulation of semiconductor factories. This class of data is strictly prescriptive and structured. The basic entities that undergo various kinds of processing include dies on wafers, wafers making up lots, and wafer lots assembled into batches. Inventories, machines, and transportation are the entities used for processing semiconductor chips. Every entity can be ascribed one or more quantitative or qualitative variables. Every variable has a measurable value at any instant within a finite period. 
Other data are also measured and aggregated, such as average inventory volume, work-in-progress in machines, and wafers’ transportation routes. Every machine commonly behaves either deterministically or stochastically. The time associated with each entity in a factory can be continuous or discrete. The collected data can have arbitrary accuracy and precision. Formal models of factories are suitable for creating sound datasets for use in developing regression and deep machine-learning models. Given the event-driven nature of manufacturing systems, Discrete-Event Simulation (DES) is widely used. This approach naturally defines inputs and outputs as events that can occur at arbitrary times. The events can interrupt any process, and the processes can be combined to form factories using well-formed input and output relationships. However, the development of ML models depends on having rich data combined with expert domain knowledge, among other factors. While massive amounts of live data are collected from semiconductor factories (e.g., from work centers to enterprise supply chain systems [29]), it is challenging to generate ML models from them. In a semiconductor fabrication factory, data should be collected, organized, maintained, and synchronized across individual and networked machines with varying logical and physical topologies in time and space. Because collecting data from actual semiconductor factories is restricted by companies’ proprietary constraints and by the resources required for data engineering, benchmark datasets can instead be developed using simulations of physics-based models. Indeed, various efforts have been undertaken to collect data from physics-based models for use in statistical and machine-learning studies [15, 23].

2 Background

Manufacturing systems can be modeled using continuous (differential equation), discrete-time, and discrete-event specification languages [30]. Following system theory, they can be modeled as a set of atomic and coupled components. Every atomic model specifies input, output, and state variables together with state transition, output, and timing functions. Every coupled component has its own input and output variables and contains a set of atomic and other coupled models. Atomic and coupled models communicate by sending outputs (output variables) and receiving inputs (input variables). Simulation output (data) is therefore naturally partitioned, with each part belonging to an atomic or coupled component.

2.1 Discrete-event modeling

Discrete-event simulation has been extensively used to model and simulate semiconductor manufacturing (e.g., [19]). One formalism for DES is the Parallel Discrete Event System Specification (PDEVS) [8]. An atomic model has distinct input, state, and output variables, each with a finite set of values. The variables can have primitive and compound alphanumeric values, including null values. The input and output trajectories are event-based (i.e., every trajectory has only a finite number of values over a finite period). Every state trajectory is piecewise constant and has a value at every time instant. Every atomic model has external transition, internal transition, confluent, output, and timing functions. The external transition function processes input events and the resulting changes of state. The internal transition function changes the state in the absence of input events. The confluent function defines the order in which the external and internal transition functions are applied when they coincide. Input events may be received, and outputs may be sent, at nonuniform time intervals. Multiple input and output events may occur concurrently at arbitrary time instants. State transitions can occur either due to receiving input events (external transition function) or without them (internal transition function). Atomic and coupled models can be hierarchically composed to create other coupled models using external input, external output, and internal coupling relationships. Well-formed inputs and outputs for atomic and coupled components have concise physics-based syntax and semantics.
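The roles of the five atomic-model functions can be illustrated with a minimal Python sketch. This is not the DEVS-Suite API: the class name `ProcessorAtomic`, its method names, the default service time, and the choice to apply the internal transition before the external one in the confluent function are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from math import inf


@dataclass
class ProcessorAtomic:
    """Toy PDEVS-style atomic model: a single-server processor with a queue."""
    service_time: float = 5.0      # hypothetical processing duration
    phase: str = "passive"         # state variable: "passive" or "busy"
    sigma: float = inf             # time left until the next internal event
    queue: list = field(default_factory=list)

    def time_advance(self) -> float:
        # Timing function: how long the model stays in its current state.
        return self.sigma

    def output(self) -> list:
        # Output function: called just before an internal transition.
        return [self.queue[0]] if self.phase == "busy" else []

    def delta_int(self) -> None:
        # Internal transition: no input arrived, so finish the current job.
        self.queue.pop(0)
        self._start_next()

    def delta_ext(self, elapsed: float, inputs: list) -> None:
        # External transition: a bag of inputs arrived after `elapsed` time.
        if self.phase == "busy":
            self.sigma -= elapsed  # keep working on the current job
        self.queue.extend(inputs)
        if self.phase == "passive":
            self._start_next()

    def delta_conf(self, inputs: list) -> None:
        # Confluent function: internal and external events coincide.
        # Assumed ordering: internal transition first, then external.
        self.delta_int()
        self.delta_ext(0.0, inputs)

    def _start_next(self) -> None:
        if self.queue:
            self.phase, self.sigma = "busy", self.service_time
        else:
            self.phase, self.sigma = "passive", inf


m = ProcessorAtomic()
m.delta_ext(0.0, ["lot-1", "lot-2"])  # two lots arrive concurrently at t = 0
first_out = m.output()                 # emitted when the first job completes
m.delta_int()                          # model moves on to "lot-2"
```

Because the state only changes inside the transition functions, the state trajectory is piecewise constant between events, which is the property the dataset construction relies on later.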

Parallel DEVS models are causal, providing concise understanding and rich interpretations of simulated behavior. Execution of these event-based atomic and coupled models results in time trajectories whose data do not necessarily have uniform time intervals (i.e., the values of any input, output, or state variable may be separated by arbitrary time intervals). These models can be simulated using a variety of simulators available for popular programming languages and executable on single- and multi-processor computing platforms supported by distributed technologies [20].

2.2 PDEVS Semiconductor Fabrication Model

Single-stage and multi-stage semiconductor fabrication factory models are developed based on the PDEVS formalism [8]. They are based on the description of a single-stage benchmark model named MiniFab [25]. The factory is modeled as a coupled model that has Diffusion, Implantation, and Lithography machines. A machine is either in processing or repair mode at any given time. Every machine processes wafer lots in consecutive, non-interruptible loading, processing, unloading, and transportation phases, each with configurable duration and stochasticity. Each machine can enter repair mode either after processing a given number of lots or after an amount of time representing the mean time between failures. A coordinator dispatches wafer lots to the Diffusion machines. Similarly, another coordinator dispatches wafer lots to the Implantation machines. The assignment of wafer lots to these machines is instantaneous.
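The phase cycle of a single machine can be sketched as a simple timeline computation. The function name `machine_timeline`, the phase durations, the repair duration, and the uniform jitter used for stochasticity are hypothetical; only the lot-count repair trigger is shown, not the trigger based on mean time between failures.

```python
import random

PHASES = ("loading", "processing", "unloading", "transportation")


def machine_timeline(n_lots, durations, lots_between_repairs, repair_time,
                     jitter=0.0, seed=0):
    """Return (events, end_time) for one machine handling n_lots in sequence.

    Each event is a (start_time, label) pair. Phases run back to back and are
    never interrupted; a repair is inserted after every
    `lots_between_repairs` processed lots.
    """
    rng = random.Random(seed)
    t, events = 0.0, []
    for lot in range(1, n_lots + 1):
        for phase in PHASES:
            events.append((t, f"lot{lot}:{phase}"))
            # Optional stochasticity: add a uniform delay in [0, jitter].
            t += durations[phase] + rng.uniform(0.0, jitter)
        if lot % lots_between_repairs == 0:
            events.append((t, "repair"))
            t += repair_time
    return events, t


# Hypothetical durations in arbitrary time units (8 units per lot in total).
durations = {"loading": 1.0, "processing": 4.0,
             "unloading": 1.0, "transportation": 2.0}
events, end_time = machine_timeline(3, durations,
                                    lots_between_repairs=2, repair_time=10.0)
```

With these made-up numbers the machine handles two lots, spends 10 time units in repair, and then handles the third lot.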

The factory models can receive Product a (Pa), Product b (Pb), and Test wafer (Tw) lots. The lots form a batch with a size of three before being chronologically processed in six steps starting from Diffusion to Implantation and ending in Lithography. Steps 1 and 5 are assigned to each of the diffusion machines named A and B, steps 2 and 4 are assigned to each of the implantation machines named C and D, and steps 3 and 6 are assigned to the lithography machine named E. Feedforward and feedback relationships among the machines define the ordering of the six steps (refer to Figure 1(a) for the diagram illustration of the single-stage factory [25]).
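The step-to-machine assignment described above can be written down as a small lookup table. The names `ROUTE`, `make_batches`, and `group_sequence` are illustrative, and holding back leftover lots until a full batch of three is available is an assumption rather than something the model description states.

```python
# Step number -> (machine group, candidate machines), as described in the text.
ROUTE = {
    1: ("diffusion", ("A", "B")),
    2: ("implantation", ("C", "D")),
    3: ("lithography", ("E",)),
    4: ("implantation", ("C", "D")),
    5: ("diffusion", ("A", "B")),
    6: ("lithography", ("E",)),
}
BATCH_SIZE = 3


def make_batches(lots):
    """Group lots into full batches of three; return (batches, leftover)."""
    full = len(lots) - len(lots) % BATCH_SIZE
    batches = [lots[i:i + BATCH_SIZE] for i in range(0, full, BATCH_SIZE)]
    return batches, lots[full:]


def group_sequence():
    """Machine groups visited in step order 1..6."""
    return [ROUTE[step][0] for step in sorted(ROUTE)]


batches, leftover = make_batches(["Pa1", "Pa2", "Pb1", "Pb2", "Tw1", "Pa3", "Pb3"])
sequence = group_sequence()
```

In the resulting sequence, the move from step 3 (lithography) back to step 4 (implantation) and then step 5 (diffusion) is where the feedback relationships among the machines appear.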

The values of Pa and Pb can be 0, 2, 3, 6, 9, 12, 18, 24, 27, 30, 36, 45, 48, 54, 60, 72, 81, 90, and the value of Tw can be 0, 1, 3, 6, 9, 12, 15, 18, 27 [25]. Atomic models are developed to generate wafer lots uniformly. The Pa, Pb, and Tw wafer lots are generated every 8, 16, and 24 units of time (hours), respectively. Sinusoidal wafer lots are also generated at the same frequencies, with sizes following the simple pattern 1, 2, 3, 2, 1. Transducers are devised to collect data from the machines and coordinators in each part of the factory. These transducers do not influence the operations of the MiniFab.
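The two generation patterns can be sketched as follows. The release periods and the 1, 2, 3, 2, 1 size pattern come from the text; the function names, the fixed lot size of 1 for the uniform case, the first release occurring one period after time zero, and the cyclic repetition of the pattern are assumptions.

```python
from itertools import cycle

PERIOD_HOURS = {"Pa": 8, "Pb": 16, "Tw": 24}   # release periods from the text
SINUSOIDAL_PATTERN = (1, 2, 3, 2, 1)           # size pattern from the text


def uniform_arrivals(product, horizon_hours, lot_size=1):
    """(time, product, size) tuples: one fixed-size release per period."""
    period = PERIOD_HOURS[product]
    return [(t, product, lot_size)
            for t in range(period, horizon_hours + 1, period)]


def sinusoidal_arrivals(product, horizon_hours):
    """Same release times, but sizes cycle through the 1, 2, 3, 2, 1 pattern."""
    period = PERIOD_HOURS[product]
    times = range(period, horizon_hours + 1, period)
    return [(t, product, size)
            for t, size in zip(times, cycle(SINUSOIDAL_PATTERN))]


uniform_pa = uniform_arrivals("Pa", 24)
sinusoidal_pa = sinusoidal_arrivals("Pa", 48)
```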

3 Related work

Machine learning (ML) has been pivotal in advancing the traditional semiconductor manufacturing process [13, 2]. As Jiang et al. [14] discuss, ML can be effectively utilized to improve the yield in semiconductor manufacturing. Various methods have been explored for collecting the data needed to develop these ML models. For instance, Liu et al. [17] review the use of different ML algorithms and available datasets for enhancing semiconductor manufacturing. Shin and Park [26] derive data from manufacturing logs, while Saif M. Khan [24] highlights a publicly available dataset of the semiconductor manufacturing process. The dataset [18] offers time series data for a specific factory configuration, which limits the flexibility to alter factory settings and parameters.

In this research, we focus on using Discrete Event Simulation (DES) to generate data [6] for a given factory configuration, which can then be used to develop ML algorithms. Specifically, we aim to predict the overall throughput of the factory at a given time instance. Although Godahewa et al. [11] discuss various time series datasets and evaluation metrics along with the implementation of time series models, there is no dataset dedicated to semiconductor manufacturing. The work by Singgih [27] uses the MiniFab model description [28] to develop a model in AnyLogic. A dataset is generated from this DES model and used to develop ML classification models. We extend the functionalities of the simulation model to incorporate features such as the repair state and wafer generation patterns. One key aspect of semiconductor manufacturing is predicting the throughput of a fabrication plant for a given time instance and factory configuration [7]. We perform data analysis with different factory configurations and analyze parameters such as the throughput and turnaround time of a semiconductor manufacturing process at different time instances.

4 Dataset generation

The PDEVS models are used to simulate a set of experiments for the single-stage and multi-stage factories using the DEVS-Suite simulator [1]. Four transducer models are used to measure and collect input, output, and state information at every simulation execution step. An eight-stage cascade model is created using the single-stage model. This model generates scenarios that exhibit higher structural and behavioral complexity. Based on the logic of wafer processing in the MiniFab, we formed 93 different tuples of Pa, Pb, and Tw values, chosen to give small, medium, and large lot configurations (relative to each other) with unique lot sizes. In addition to the lot size and lot configuration, we can also specify how the model processes a particular batch of wafers, that is, its repair mode and wafer generation pattern (see Table 1). Simulating the 93 lot configurations under each of the four combinations in Table 1 gives 372 simulation scenarios for the eight-stage cascade factory models.
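The scenario count follows from crossing the lot configurations with the four repair and generator combinations of Table 1. The sketch below enumerates scenarios this way; the function name and dictionary keys are illustrative, and only three example tuples (the ones quoted later in Section 5.1) are used because the full list of 93 tuples is not reproduced here. The lot size is taken as Pa + Pb + Tw, which matches those quoted examples.

```python
# The four (repair state, wafer generation) combinations listed in Table 1.
SCENARIO_TYPES = [
    ("Processing Steps", "Uniform"),
    ("Mean-time between failure", "Uniform"),
    ("No Repair", "Uniform"),
    ("No Repair", "Sinusoidal"),
]


def enumerate_scenarios(lot_configs):
    """Cross every (Pa, Pb, Tw) tuple with every Table 1 combination."""
    return [
        {"Pa": pa, "Pb": pb, "Tw": tw, "lot_size": pa + pb + tw,
         "repair": repair, "generator": generator}
        for (pa, pb, tw) in lot_configs
        for (repair, generator) in SCENARIO_TYPES
    ]


# Three example tuples only; the dataset itself uses 93 of them.
example_configs = [(15, 36, 9), (10, 90, 20), (150, 24, 18)]
scenarios = enumerate_scenarios(example_configs)
total_for_93_configs = 93 * len(SCENARIO_TYPES)
```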

Lot Configurations (Pa, Pb, Tw, Lot Size)	Repair State	Wafer Generation
93 Lot Configurations	Processing Steps	Uniform
93 Lot Configurations	Mean-time between failure	Uniform
93 Lot Configurations	No Repair	Uniform
93 Lot Configurations	No Repair	Sinusoidal

Table 1: Simulation Scenarios

The output generated from each simulation is stored in comma-separated value (CSV) files for the individual atomic and coupled factory model components. Each component's file has several rows, each representing a time instance associated with the columns of data for that model (e.g., the throughput shown in Figure 1). For example, data on loading and transportation time trajectories are essential for determining optimal factory operation under different factory configurations. Data collected for throughput and turnaround time can be used to build surrogate machine-learning models, which can be trained to mimic the simulated behavior of the factory models as closely as possible. Every simulation output file has a column containing the logical time of the simulation; hence, plotting different values against time provides time series data. The temporal nature of these values provides insight into wafer processing at each step. The focus of this research is to analyze the throughput of a semiconductor manufacturing factory under different configurations. We plot the throughput values against time, with an interval of 1 minute between successive time values, as depicted in Figure 1.

In these plots, there are empty values at certain time instances because our discrete-event simulation records values only when an event occurs. Converting these values into a complete time series dataset requires a value at every time instance (Section 4.1). Even though the plots show a similar monotonically increasing trend for different configurations, the values differ at a granular level. These differences can be seen in the plots for the different factory configurations.

4.1 Time series dataset

From the simulation output, we have data at different time instances of a MiniFab factory. Since the simulation does not directly provide values at certain time intervals, we used the forward-fill (front-fill) method to fill the missing values when converting the data into a time series. Forward filling is justified because the throughput values are piecewise constant: a value persists until it is changed by the occurrence of an event, so the value at a time instance where none is recorded equals the value at the previous instance. This pre-processing yields a complete time series dataset with a time granularity of 1 minute. Because the dataset has a file for each stage’s values, both univariate and multivariate time series analyses can be performed, which illustrates the versatility of the dataset. The time series plots for multiple stages of a factory are shown in Figure 2. These plots show the throughput values of each stage of an 8-stage cascade MiniFab factory for a given configuration, and the ‘Cascade Factory Throughput’ in these plots indicates the overall throughput of the factory. The plots indicate that introducing a repair state in the factory increases the level of uncertainty in the data. As per Section 2, every stage of the MiniFab model is connected to the next stage (connected components), indicating causality in the throughput time series between stages. Hence, we can also perform a multivariate time series analysis to predict the throughput of each stage based on the throughput of other stages, which is discussed further in Section 5. For a particular factory configuration, we can also configure the number of stages in the factory; here we have considered an 8-stage cascade MiniFab factory.
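The forward-fill step can be sketched in plain Python as below. The function name `forward_fill`, the toy event list, and the initial value of 0.0 used before the first event are assumptions; the actual preprocessing code of the dataset may differ.

```python
def forward_fill(events, horizon, initial=0.0):
    """Resample event-based (minute, value) pairs onto a 1-minute grid.

    `events` must be sorted by time. Each grid point takes the value of the
    most recent event at or before it, so the series stays piecewise constant.
    """
    series, current, i = [], initial, 0
    for minute in range(horizon + 1):
        # Consume every event that has occurred up to this minute.
        while i < len(events) and events[i][0] <= minute:
            current = events[i][1]
            i += 1
        series.append(current)
    return series


# Toy throughput events recorded only when something changes.
events = [(0, 0.0), (3, 0.001), (7, 0.002)]
filled = forward_fill(events, horizon=9)
```

With a tabular library the same effect is commonly obtained by reindexing onto the full time grid and forward filling, for example with pandas `reindex` followed by `ffill`.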

4.2 Feature extraction and analysis

To better understand the relationship between the throughput values of different configurations, we extracted around 6,995 features of the ‘Cascade Factory Throughput’ time series for all 372 simulation runs using Python’s tsfresh library [9]. These include lag features (auto-correlation function, partial auto-correlation function, etc., for different lag values), trend, skewness, quantile changes, entropy, and others. The values of all these features are available on GitHub (repository: https://github.com/comses/SCFM.git). The Principal Component Analysis (PCA) depiction of the Linear Trend r-value, Skewness, Lempel Ziv Complexity [31], Kurtosis [21], and Permutation Entropy is shown in Figure 3. Skewness provides valuable insights into the asymmetry of the data distribution, indicating potential irregularities or outliers in the manufacturing process. The linear trend r-values, which are close to 1 in our case for all the time series, signify a strong linear relationship over time, enabling predictive analysis of future outcomes and trends in the manufacturing process. We have observed negative kurtosis (K) values for all the time series, indicative of a platykurtic distribution, suggesting a more stable and uniform process with fewer extreme outliers, which facilitates early detection of deviations from expected distributions. Lempel Ziv Complexity (LZC) measures the complexity of the manufacturing process, helping to identify patterns and irregularities that may impact production quality. Permutation entropy (PE) assesses the randomness and unpredictability within the data, offering insights into the information content and complexity of the manufacturing process. By leveraging these features collectively, semiconductor manufacturers can gain a comprehensive understanding of their production processes, identify areas for improvement, and optimize operational efficiency [4]. 
The majority of the actual skewness values for all time series are positive, and the actual linear trend r-values are close to ‘1’ because the overall throughput increases monotonically with time (refer to Figure 2). The kurtosis values across all conditions are negative, suggesting that the series are platykurtic, with a lack of extreme values and a more uniform distribution, which reflects stability in the process. The features can be visualized using their PCA loading plots, as shown in Figure 3. In the scenario where the wafer generator is uniform and there is no repair state, as depicted in Figure 3 (a), data points cluster around the center. Skewness and trend emerge as the dominant features influencing the variation in the data, suggesting a uniform process with some asymmetry. For factories with sinusoidal wafer generation, as shown in Figure 3 (b), fewer wafer lots are queued for processing, resulting in a comparatively smaller skewness vector and indicating that generator configurations govern skewness. In the same scenario, data points are more dispersed, reflecting increased variability. Permutation entropy gains significance alongside skewness, indicating added complexity. In another scenario, with a uniform generator and repair based on mean-time between failure (Figure 3 (c)), data points cluster closely, showing consistency. Skewness and trend remain significant, indicating a stable process with some asymmetry. Lastly, when the uniform generator is coupled with repair based on processing steps, as in Figure 3 (d), distinct clustering patterns emerge. The influence of complexity measures such as Lempel Ziv Complexity and Permutation Entropy diminishes, reflecting reduced complexity. A comparison of these values based on the PCA loading plots indicates that the factory with a uniform generator and no repair has a lower overall throughput. Additionally, when a repair state is added, more cluster points lie near the trend vector, indicating a higher throughput.
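To see why a monotonically increasing, piecewise-constant series gives a linear trend r-value near 1 and negative kurtosis, three of the features can be computed by hand on a toy staircase series. This sketch uses population (biased) moments and a made-up series; it is not the tsfresh implementation, and library estimators may apply bias corrections that change the numbers slightly.

```python
from math import sqrt


def skew_and_excess_kurtosis(x):
    """Population skewness and excess kurtosis (normal distribution gives 0)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def linear_trend_r(x):
    """Pearson correlation between the series and its time index."""
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = sum(x) / n
    cov = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
    var_t = sum((t - t_mean) ** 2 for t in range(n))
    var_x = sum((v - x_mean) ** 2 for v in x)
    return cov / sqrt(var_t * var_x)


# Toy staircase: the value rises by 0.001 every 5 minutes (hypothetical data).
series = [0.001 * (t // 5) for t in range(100)]
skewness, kurtosis = skew_and_excess_kurtosis(series)
r_value = linear_trend_r(series)
```

For this evenly stepped staircase the values are spread almost uniformly over their range, so the excess kurtosis is negative and the skewness is essentially zero; the positive skewness reported for the real series comes from their uneven step pattern, which this toy series does not reproduce.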

5 Demonstration of benchmark datasets

The simulation output is transformed from a discrete-event series (Figure 1) to a discrete-time time series (Figure 2) by filling the missing values with previous values. To construct baseline models, we used multiple time series forecasting models: Autoregressive Integrated Moving Average (ARIMA), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Temporal Convolutional Network (TCN), and Temporal Fusion Transformer (TFT). These models were selected based on the characteristics of the dataset: its large size (25,000 instances) and its monotonically increasing nature, with granular differences in throughput values across configurations. ARIMA captures linear trends and seasonality [5], RNN and LSTM handle sequential data and dependencies [10, 12], TCN models long-range temporal dependencies [3], and TFT considers the impact of static variables (e.g., repair state, lot size, etc.) on time series forecasting [16]. As per the feature analysis in Section 4.2, some models may outperform others. Choosing the right metrics to evaluate the performance of these models on the dataset is important. Mean-square error (MSE) is important for evaluating forecasting performance, but it is not a scale-free error metric: because throughput values are small ( $\sim 10^{-3}$ ), the MSE values are also small ( $\sim 10^{-6}$ ). It is therefore convenient to use MSE as a relative measure for comparing models. The Mean Absolute Percentage Error (MAPE) can be used to assess the performance of a model because it is scale-free, but since MAPE cannot handle cases where the actual throughput is zero, we consider only non-zero throughput values when computing MAPE. 
Next, we used $R^{2}$ scores to evaluate the performance. The $R^{2}$ score is scale-independent and can handle zero values in the dataset, but it can be misleading for non-linear relationships, is sensitive to outliers, and does not account for overfitting or model complexity. Therefore, $R^{2}$ should be used with other metrics for a comprehensive evaluation of model performance. Finally, we use the Mean-Forecast Error (MFE), which directly compares the actual and predicted values and is therefore scale-dependent; for throughput prediction, it can be of the same scale as the actual values ( $\sim 10^{-3}$ ). All of the above metrics can be used in combination to evaluate the performance of a model: since the dataset contains multiple zero values and has a small scale, relying on a single error metric can lead to ambiguous results. In addition to these metrics, it is also pertinent to visualize the prediction plots, as there are granular differences in the overall throughput values. The performance of the different time series models for an exemplary simulation configuration (Pa = 10, Pb = 90, Tw = 20, Lot Size = 120, Repair State = No Repair, Wafer Generator = Uniform) is depicted in Figure 4 (a). Since the TFT model also considers the static covariates of a time series, the TFT model has been evaluated for different configurations, as seen in Figure 4 (b). The time series models predict throughput values based on a look-back window of size 10, and the error metrics for the model evaluation are given in Table 2.
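The four metrics and the look-back windowing can be sketched in plain Python as below. The function names, the toy actual and predicted values, and the sign convention for MFE (actual minus predicted) are assumptions made for illustration; the numbers are not results from the dataset.

```python
def mse(actual, predicted):
    """Mean-square error; scale-dependent (squares the unit of the data)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mfe(actual, predicted):
    """Mean forecast error; assumed sign convention: actual - predicted."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


def mape_nonzero(actual, predicted):
    """MAPE computed only where the actual value is non-zero."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def r2_score(actual, predicted):
    """Coefficient of determination; negative if worse than the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def make_windows(series, look_back=10):
    """Pair each run of `look_back` values with the value that follows it."""
    return [(series[i:i + look_back], series[i + look_back])
            for i in range(len(series) - look_back)]


# Toy values on the 1e-3 scale of the throughput data (hypothetical).
actual = [0.0, 0.001, 0.002, 0.004]
predicted = [0.0, 0.001, 0.003, 0.003]
scores = {
    "MSE": mse(actual, predicted),
    "MFE": mfe(actual, predicted),
    "MAPE": mape_nonzero(actual, predicted),
    "R2": r2_score(actual, predicted),
}
windows = make_windows(list(range(12)), look_back=10)
```

In this toy case errors of size $10^{-3}$ give an MSE of order $10^{-7}$, which looks negligible, while MAPE is 25% and the positive and negative errors cancel in MFE. This is the kind of ambiguity that motivates reading the metrics together.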

Error Metric	ARIMA	RNN	LSTM	TCN	TFT
MSE	3E-07	1.35E-08	9.73E-09	2.29E-09	2.11E-08
$R^{2}$ Score	-6.006	0.681	0.771	0.946	0.9917
MFE	5.07E-04	1.04E-04	-9.35E-05	-4.56E-06	-9.59E-03
MAPE	0.0813	0.0167	0.0153	0.0051	0.1327

Table 2: 8-stage MiniFab time series model comparison

The ARIMA model has an MSE of $3\times 10^{-7}$ and a MAPE of about $8\%$, yet its prediction plot shows sub-optimal performance. Hence, for throughput prediction, error metrics considered together with the prediction plot are a stronger basis for evaluation than error metrics alone.
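The four metrics discussed above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' evaluation code: MAPE is reported as a fraction and computed only over non-zero actual values, MFE uses the sign convention actual minus predicted, and the function name `evaluate_forecast` and the sample values are hypothetical.

```python
from typing import Dict, Sequence


def evaluate_forecast(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """Compute MSE, MFE, MAPE (non-zero actuals only), and the R^2 score."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equally long")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n
    # Assumed sign convention: positive MFE means under-prediction on average.
    mfe = sum(errors) / n

    # MAPE is undefined where the actual value is zero, so those points are skipped.
    nonzero = [(a, e) for a, e in zip(actual, errors) if a != 0]
    mape = (
        sum(abs(e / a) for a, e in nonzero) / len(nonzero)
        if nonzero else float("nan")
    )

    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "mfe": mfe, "mape": mape, "r2": r2}


# Hypothetical values: a nearly flat series predicted with a constant offset.
actual = [0.0010, 0.0011, 0.0010, 0.0011]
predicted = [0.0011, 0.0012, 0.0011, 0.0012]
metrics = evaluate_forecast(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3e}")
```

In this toy case the MSE is on the order of $10^{-8}$ and the MAPE is below $10\%$, while the $R^{2}$ score is about $-3$, because the offset is large relative to the small variance of the actual values. It shows how a small-scale series can yield a low MSE alongside a poor fit, which is why a single metric can be ambiguous here.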

5.1 Univariate time series forecasting

The PCA loading plots in Figure 3 show that the data points for different lot sizes are spread according to lot configuration, which suggests that lot size does not share a linear relationship with time series forecasting. With our dataset derived from various lot configurations (Section 4), we aim to identify an optimal lot size for training a time series model, both to enhance accuracy and to understand the impact of lot size on model performance. We employed a TCN model and trained it separately on three lot size categories: small, $M_{small}$ (60: Pa=15, Pb=36, Tw=9); medium, $M_{medium}$ (120: Pa=10, Pb=90, Tw=20); and large, $M_{large}$ (192: Pa=150, Pb=24, Tw=18), each with a uniform wafer generator and no repair. Each trained model was then tested across all 93 configurations to provide insight into achieving high accuracy in time series models based on the lot size used for training. The prediction $R^{2}$ scores for models $M_{small}$ , $M_{medium}$ , and $M_{large}$ are depicted in Figure 5. The results corroborate our observations in Section 4.2: the model trained on the medium lot size ( $M_{medium}$ ) performs better than the other models. Additional results for turnaround time prediction show similar behavior, as seen in Figure 5 (b).
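The evaluation protocol described here (train on one lot size, then score on every configuration) can be sketched as a loop. The sketch below does not use a TCN: `DriftForecaster` is a deliberately simple stand-in that predicts the last window value plus the mean step seen in training, and the three ramp series are synthetic placeholders rather than MiniFab data. All names are hypothetical, and the resulting scores say nothing about the results in Figure 5.

```python
from typing import Dict, List, Sequence


def r2_score(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination; NaN when the actual values are constant."""
    mean_actual = sum(actual) / len(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")


class DriftForecaster:
    """Stand-in for the TCN (assumption): last window value plus mean training step."""

    def fit(self, series: Sequence[float]) -> "DriftForecaster":
        steps = [b - a for a, b in zip(series, series[1:])]
        self.drift = sum(steps) / len(steps)
        return self

    def predict(self, window: Sequence[float]) -> float:
        return window[-1] + self.drift


def cross_config_r2(
    train_series: Sequence[float],
    test_sets: Dict[str, Sequence[float]],
    lookback: int = 10,
) -> Dict[str, float]:
    """Fit on one configuration, then report R^2 on every test configuration."""
    model = DriftForecaster().fit(train_series)
    scores: Dict[str, float] = {}
    for name, series in test_sets.items():
        actual: List[float] = list(series[lookback:])
        predicted = [
            model.predict(series[end - lookback:end])
            for end in range(lookback, len(series))
        ]
        scores[name] = r2_score(actual, predicted)
    return scores


# Synthetic ramps standing in for three lot-size configurations.
configs = {
    "small": [0.0005 * t for t in range(30)],
    "medium": [0.001 * t for t in range(30)],
    "large": [0.002 * t for t in range(30)],
}
scores = cross_config_r2(configs["medium"], configs, lookback=10)
for name, score in scores.items():
    print(f"trained on medium, tested on {name}: R2 = {score:.4f}")
```

Repeating the call with `configs["small"]` and `configs["large"]` as the training series gives one row of scores per trained model, which is the shape of the comparison reported for $M_{small}$, $M_{medium}$, and $M_{large}$.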

5.2 Multivariate time series forecasting

Section 2 shows that each stage of the MiniFab model is a component connected to other stages, which suggests that a multivariate analysis can be performed across the throughput values of the stages. The plots in Figure 2 show that the throughput values of each stage follow a similar trend. We performed multivariate analysis using the TCN model to predict the overall throughput or the throughput of a selected stage, taking various combinations of throughput values from multiple stages as input. The TCN model demonstrated strong performance in multivariate analysis, as shown in Figure 6. These findings emphasize the importance of considering stage interdependence in predictive modeling for semiconductor manufacturing, with higher accuracy observed towards the end of the simulation and when more stages are included.
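The multivariate setup can be illustrated by how the model inputs are assembled: each sample stacks the recent throughput of several stages, and the target is the next throughput value of a chosen stage. The sketch below assumes equally long, time-aligned stage series and a look-back of 10; the function name, the stage names, and the sample values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def make_multivariate_windows(
    stages: Dict[str, Sequence[float]],
    input_stages: Sequence[str],
    target_stage: str,
    lookback: int = 10,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Build (lookback x n_input_stages) windows and next-step targets."""
    length = len(stages[target_stage])
    if any(len(stages[name]) != length for name in input_stages):
        raise ValueError("all stage series must share the same length")
    inputs: List[List[List[float]]] = []
    targets: List[float] = []
    for end in range(lookback, length):
        # One row per time step, one column per input stage.
        window = [
            [stages[name][t] for name in input_stages]
            for t in range(end - lookback, end)
        ]
        inputs.append(window)
        targets.append(stages[target_stage][end])
    return inputs, targets


# Hypothetical throughput series for three of the eight stages.
stages = {
    "stage_1": [0.001 * t for t in range(15)],
    "stage_2": [0.002 * t for t in range(15)],
    "stage_8": [0.0005 * t for t in range(15)],
}
X, y = make_multivariate_windows(
    stages, input_stages=["stage_1", "stage_2"], target_stage="stage_8", lookback=10
)
print(len(X), len(X[0]), len(X[0][0]))  # prints: 5 10 2
```

Changing `input_stages` selects which stage combinations feed the model, and changing `target_stage` switches between predicting a selected stage and, under the assumption that the last stage represents overall output, the overall throughput.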

6 Conclusion

Machine learning (ML) models suitable for time series can benefit from concise and accurate datasets collected from physics-based simulations. These ML models are desirable due to their significant execution speedup compared to simulating causal models. The ML datasets facilitate a wide range of metrics and measurements. Principal Component Analysis (PCA) captures key time series features and helps understand the role and impact of various semiconductor fabrication factory configurations on essential operational measures. We have developed datasets using PDEVS models and Intel’s benchmark factory description. We detailed dataset generation for multi-stage factories for uniform (with and without repair/maintenance) and sinusoidal wafer lot experimental configurations. The utility of these datasets is demonstrated by developing time series models, such as TCN and TFT. They can make predictions comparable to those obtained from PDEVS simulations, with TFT accounting for static covariates. Future work includes developing additional models and benchmark datasets for semiconductor supply-chain systems.

Acknowledgments and Disclosure of Funding

This research is funded by Intel Corporation, Chandler, Arizona, USA.

References

ACIMS [2023] ACIMS. DEVS-Suite simulator, version 7.0.0, 2023. https://acims.asu.edu/devs-suite/ [Accessed: 10th January].
Ailisto et al. [2023] Heikki Ailisto, Heli Helaakoski, and Anssi Neuvonen. Benefits of machine learning in the manufacturing industry. 2023.
Bai et al. [2018] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. In Proceedings of the 36th International Conference on Machine Learning, pages 556–565. PMLR, 2018.
Bojer and Meldgaard [2021] Casper Solheim Bojer and Jens Peder Meldgaard. Kaggle forecasting competitions: An overlooked learning opportunity. International Journal of Forecasting, 37(2):587–603, 2021.
Box and Jenkins [1970] George EP Box and Gwilym M Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day, 1970.
Chan et al. [2022] KC Chan, Marsel Rabaev, and Handy Pratama. Generation of synthetic manufacturing datasets for machine learning using discrete-event simulation. Production & Manufacturing Research, 10(1):337–353, 2022.
Chong and Ng [2016] Kuan Eng Chong and Kam Choi Ng. Relationship between overall equipment effectiveness, throughput and production part cost in semiconductor manufacturing industry. In 2016 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pages 75–79. IEEE, 2016.
Chow and Zeigler [1994] Alex Chung Hen Chow and Bernard P Zeigler. Parallel DEVS: A parallel, hierarchical, modular modeling formalism. In Proceedings of Winter Simulation Conference, pages 716–722. IEEE, 1994.
Christ et al. [2018] Maximilian Christ, Nils Braun, Julius Neuffer, and Andreas W Kempa-Liehr. Time series feature extraction on basis of scalable hypothesis tests (tsfresh–a python package). Neurocomputing, 307:72–77, 2018.
Elman [1990] Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
Godahewa et al. [2021] Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I Webb, Rob J Hyndman, and Pablo Montero-Manso. Monash time series forecasting archive. arXiv preprint arXiv:2105.06643, 2021.
Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Irani et al. [1993] Keki B Irani, Jie Cheng, Usama M Fayyad, and Zhaogang Qian. Applying machine learning to semiconductor manufacturing. IEEE Expert, 8(1):41–47, 1993.
Jiang et al. [2020] Dan Jiang, Weihua Lin, and Nagarajan Raghavan. A novel framework for semiconductor manufacturing final test yield classification using machine learning techniques. IEEE Access, 8:197885–197895, 2020.
Korosteleva and Lee [2021] Maria Korosteleva and Sung-Hee Lee. Generating datasets of 3d garments with sewing patterns. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/013d407166ec4fa56eb1e1f8cbe183b9-Paper-round1.pdf.
Lim et al. [2021] Bryan Lim, Sercan O Arik, Nico Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. In International Journal of Forecasting, volume 37, pages 1748–1764. Elsevier, 2021.
Liu et al. [2022] Duan-Yang Liu, Li-Ming Xu, Xu-Min Lin, Xing Wei, Wen-Jie Yu, Yang Wang, and Zhong-Ming Wei. Machine learning for semiconductors. Chip, 1(4):100033, 2022.
McCann and Johnston [2008] Michael McCann and Adrian Johnston. SECOM. UCI Machine Learning Repository, 2008. doi: https://doi.org/10.24432/C54305.
Mönch et al. [2012] Lars Mönch, John W Fowler, and Scott J Mason. Production planning and control for semiconductor wafer fabrication facilities: modeling, analysis, and systems, volume 52. Springer Science & Business Media, 2012.
Ören et al. [2023] Tuncer Ören, Bernard P Zeigler, and Andreas Tolk. Body of Knowledge for Modeling and Simulation: A Handbook by the Society for Modeling and Simulation International. Springer Nature, 2023.
Pearson [1905] Karl Pearson. Das fehlergesetz und seine verallgemeinerungen durch fechner und pearson. a rejoinder. Biometrika, 4(1/2):169–212, 1905.
Pendyala et al. [2024] Vamsi Krishna Pendyala, Hessam S. Sarjoughian, and Edward J. Yellig. Generating temporal convolutional network models from parallel DEVS models: Semiconductor manufacturing systems. In 2024 Winter Simulation Conference (WSC), 2024. Accepted for publication.
Rühling Cachay et al. [2021] Salva Rühling Cachay, Venkatesh Ramesh, Jason Cole, Howard Barker, and David Rolnick. Climart: A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/f718499c1c8cef6730f9fd03c8125cab-Paper-round2.pdf.
Saif M. Khan [2021] Dahlia Peterson Saif M. Khan, Alexander Mann. Emerging technology observatory advanced semiconductor supply chain dataset (2022 release). The Semiconductor Supply Chain: Assessing National Competitiveness (Center for Security and Emerging Technology, January 2021), 2021. URL https://eto.tech/dataset-docs/chipexplorer/.
Sarjoughian et al. [2023] Hessam S. Sarjoughian, Forouzan Fallah, Seyyedamirhossein Saeidi, and Edward J. Yellig. Transforming discrete event models to machine learning models. In 2023 Winter Simulation Conference (WSC), pages 2662–2673, 2023.
Shin and Park [2000] Chung Kwan Shin and Sang Chan Park. A machine learning approach to yield management in semiconductor manufacturing. International Journal of Production Research, 38(17):4261–4271, 2000.
Singgih [2021] Ivan Kristianto Singgih. Production flow analysis in a semiconductor fab using machine learning techniques. Processes, 9(3), 2021. ISSN 2227-9717. doi: 10.3390/pr9030407. URL https://www.mdpi.com/2227-9717/9/3/407.
Spier and Kempf [1995] Jonathan Spier and Karl Kempf. Simulation of emergent behavior in manufacturing systems. In Proceedings of SEMI Advanced Semiconductor Manufacturing Conference and Workshop, pages 90–94. IEEE, 1995.
Yuan and Ponsignon [2014] Jingjing Yuan and Thomas Ponsignon. Towards a semiconductor supply chain simulation library (scsc-simlib). In 2014 Winter Simulation Conference (WSC), pages 2522–2532, 2014.
Zeigler et al. [2018] Bernard P Zeigler, Alexandre Muzy, and Ernesto Kofman. Theory of modeling and simulation: discrete event & iterative system computational foundations. Academic Press, 2018.
Ziv and Lempel [1977] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
