PPO 学习中的 AttributeError

[I 2025-03-12 18:36:32,364] A new study created in RDB with name: OHLC_EconomicCalender_ppo_study Using cuda device CustomFeaturesExtractor: n_assets=1, features_dim=96 [W 2025-03-12 18:36:34,723] Trial 0 failed with parameters: {'learning_rate': 0.00037110990690939903, 'n_steps': 5120, 'total_timesteps': 2000000, 'batch_size': 512, 'gamma': 0.9921798261736212, 'gae_lambda': 0.8697772993247844, 'clip_range': 0.2201477433324902} because of the following error: AttributeError("'SquashedDiagGaussianDistribution' object has no attribute 'update'"). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-15-8571b78beb29>", line 166, in objective model.learn(total_timesteps=total_timesteps, callback=eval_callback) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 311, in learn return super().learn( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 323, in learn continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 202, in collect_rollouts actions, values, log_probs = self.policy(obs_tensor) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "<ipython-input-13-c61d479822a6>", line 687, in forward self.action_dist.update(mean_actions, log_std) AttributeError: 'SquashedDiagGaussianDistribution' object has no attribute 'update' [W 2025-03-12 18:36:34,726] Trial 0 failed with value None. --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-15-8571b78beb29> in <cell line: 183>() 181 load_if_exists=True 182 ) --> 183 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 184 185 # Best parameters 11 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-15-8571b78beb29> in objective(trial) 164 165 # print(model.policy) # Should show mlp_extractor with in_features=95 --> 166 model.learn(total_timesteps=total_timesteps, callback=eval_callback) 167 168 val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 309 progress_bar: bool = False, 310 ) -> SelfPPO: --> 311 return super().learn( 312 total_timesteps=total_timesteps, 313 callback=callback, /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 321 322 while self.num_timesteps < total_timesteps: --> 323 continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) 324 325 if not continue_training: /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps) 200 # Convert to pytorch tensor or to TensorDict 201 obs_tensor = obs_as_tensor(self._last_obs, self.device) --> 202 actions, values, log_probs = self.policy(obs_tensor) 203 actions = actions.cpu().numpy() 204 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs) 1737 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1738 else: -> 1739 return self._call_impl(*args, **kwargs) 1740 1741 # torchrec tests the code consistency with the following code /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs) 1748 or _global_backward_pre_hooks or _global_backward_hooks 1749 or _global_forward_hooks or _global_forward_pre_hooks): -> 1750 return forward_call(*args, **kwargs) 1751 1752 result = None <ipython-input-13-c61d479822a6> in forward(self, obs, deterministic) 685 686 # Update the distribution with current parameters --> 687 self.action_dist.update(mean_actions, log_std) 688 689 # Sample actions or get deterministic actions AttributeError: 'SquashedDiagGaussianDistribution' object has no attribute 'update' ``` import datetime import math import random import ast import gymnasium as gym from gymnasium import spaces from gymnasium.utils import seeding import os import csv import numpy as np import pandas_ta as ta from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor, MlpExtractor from stable_baselines3.common.policies import ActorCriticPolicy from stable_baselines3.common.torch_layers import MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution import torch import torch.nn as nn from stable_baselines3 import PPO from stable_baselines3.common.policies import ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart import TradingChart from meta.env_fx_trading.util.read_config import EnvConfig class tgym(gym.Env): """forex/future/option trading gym environment 1. Three action space (0 Buy, 1 Sell, 2 Nothing) 2. Multiple trading pairs (EURUSD, GBPUSD...) under same time frame 3. Timeframe from 1 min to daily as long as use candlestick bar (Open, High, Low, Close) 4. Use StopLose, ProfitTaken to realize rewards. each pair can configure it own SL and PT in configure file 5. Configure over night cash penalty and each pair's transaction fee and overnight position holding penalty 6. Split dataset into daily, weekly or monthly..., with fixed time steps, at end of len(df). The business logic will force to Close all positions at last Close price (game over). 7. Must have df column name: [(time_col),(asset_col), Open,Close,High,Low,day] (case sensitive) 8. Addition indicators can add during the data process. 78 available TA indicator from Finta 9. Customized observation list handled in json config file. 10. ProfitTaken = fraction_action * max_profit_taken + SL. 11. SL is pre-fixed 12. Limit order can be configure, if limit_order == True, the action will preset buy or sell at Low or High of the bar, with a limit_order_expiration (n bars). It will be triggered if the price go cross. otherwise, it will be drop off 13. render mode: human -- display each steps realized reward on console file -- create a transaction log graph -- create transaction in graph (under development) 14. 15. Reward, we want to incentivize profit that is sustained over long periods of time. At each step, we will set the reward to the account balance multiplied by some fraction of the number of time steps so far.The purpose of this is to delay rewarding the agent too fast in the early stages and allow it to explore sufficiently before optimizing a single strategy too deeply. It will also reward agents that maintain a higher balance for longer, rather than those who rapidly gain money using unsustainable strategies. 16. Observation_space contains all of the input variables we want our agent to consider before making, or not making a trade. We want our agent to “see” the forex data points (Open price, High, Low, Close, time serial, TA) in the game window, as well a couple other data points like its account balance, current positions, and current profit.The intuition here is that for each time step, we want our agent to consider the price action leading up to the current price, as well as their own portfolio’s status in order to make an informed decision for the next action. 17. reward is forex trading unit Point, it can be configure for each trading pair 18. To make the unrealized profit reward reflect market conditions, we’ll compute ATR for each asset and use it to scale the reward dynamically. """ metadata = {"render.modes": ["graph", "human", "file", "none"]} def __init__( self, df, event_map, currency_map, env_config_file="./neo_finrl/env_fx_trading/config/gdbusd-test-1.json", ): assert df.ndim == 2 super(tgym, self).__init__() self.cf = EnvConfig(env_config_file) self.observation_list = self.cf.env_parameters("observation_list") # Economic data mappings self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy() if 'events' not in self.df.columns: raise ValueError("DataFrame must contain an 'events' column") def parse_events(x): if isinstance(x, str): try: parsed = ast.literal_eval(x) return parsed if isinstance(parsed, list) else [] except (ValueError, SyntaxError): return [] return x if isinstance(x, list) else [] self.df['events'] = self.df['events'].apply(parse_events) if not isinstance(self.df['events'].iloc[0], list): raise ValueError("'events' must be a list") if self.df['events'].iloc[0] and not isinstance(self.df['events'].iloc[0][0], dict): raise ValueError("Elements in 'events' must be dictionaries") self.balance_initial = self.cf.env_parameters("balance") self.over_night_cash_penalty = self.cf.env_parameters("over_night_cash_penalty") self.asset_col = self.cf.env_parameters("asset_col") self.time_col = self.cf.env_parameters("time_col") self.random_start = self.cf.env_parameters("random_start") self.log_filename = ( self.cf.env_parameters("log_filename") + datetime.datetime.now().strftime("%Y%m%d%H%M%S") + ".csv" ) self.analyze_transaction_history_log_filename = ("transaction_history_log" + datetime.datetime.now().strftime("%Y%m%d%H%M%S") + ".csv") self.df["_time"] = self.df[self.time_col] self.df["_day"] = self.df["weekday"] self.assets = self.df[self.asset_col].unique() self.dt_datetime = self.df[self.time_col].sort_values().unique() self.df = self.df.set_index(self.time_col) self.visualization = False # Calculate ATR and save DataFrame self.calculate_atr() self.df.to_csv("processed_df_with_atr.csv") print("Saved processed DataFrame to 'processed_df_with_atr.csv'") # Reset values self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_step = 0 self.episode = 0 # Start from 0, increment on episode end self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True # Cache data self.cached_ohlc_data = [self.get_observation_vector(_dt) for _dt in self.dt_datetime] self.cached_economic_data = [self.get_economic_vector(_dt) for _dt in self.dt_datetime] self.cached_time_serial = ( self.df[["_time", "_day"]].sort_values("_time").drop_duplicates().values.tolist() ) self.reward_range = (-np.inf, np.inf) self.action_space = spaces.Box(low=0, high=3, shape=(len(self.assets),), dtype=np.float32) self.observation_space = spaces.Dict({ "ohlc_data": spaces.Box(low=-np.inf, high=np.inf, shape=(len(self.assets) * len(self.observation_list),), dtype=np.float32), "event_ids": spaces.Box(low=0, high=len(self.event_map)-1, shape=(self.max_events,), dtype=np.int32), "currency_ids": spaces.Box(low=0, high=len(self.currency_map)-1, shape=(self.max_events,), dtype=np.int32), "economic_numeric": spaces.Box(low=-np.inf, high=np.inf, shape=(self.max_events * 6,), dtype=np.float32), "portfolio_data": spaces.Box(low=-np.inf, high=np.inf, shape=(3 + 2 * len(self.assets),), dtype=np.float32) }) print( f"initial done:\n" f"observation_list:{self.observation_list}\n" f"assets:{self.assets}\n" f"time serial: {min(self.dt_datetime)} -> {max(self.dt_datetime)} length: {len(self.dt_datetime)}\n" f"events: {len(self.event_map)}, currencies: {len(self.currency_map)}" ) self._seed() def _seed(self, seed=None): self.np_random, seed = seeding.np_random(seed) return [seed] # Assuming self.df contains OHLC data with columns: Open, High, Low, Close, asset_col, time_col def calculate_atr(self): # Group by asset to calculate ATR for each trading pair atr_dfs = [] time_col = "_time" for asset in self.assets: asset_df = self.df[self.df[self.asset_col] == asset][["Open", "High", "Low", "Close"]].copy() # Calculate ATR with default period=14 asset_df["ATR"] = ta.atr( high=asset_df["High"], low=asset_df["Low"], close=asset_df["Close"], length=14 ) asset_df["asset"] = asset asset_df[time_col] = self.df[self.df[self.asset_col] == asset][time_col] atr_dfs.append(asset_df[[time_col, "asset", "ATR"]]) # Combine ATR data into the main DataFrame atr_df = pd.concat(atr_dfs).set_index(time_col) self.df = self.df.merge(atr_df, left_index=True, right_index=True, how="left") self.df["ATR"] = self.df["ATR"].fillna(method="ffill") # Forward fill NaNs # Update observation_list to include ATR if "ATR" not in self.observation_list: self.observation_list.append("ATR") # Recache observations with ATR self.cached_ohlc_data = [self.get_observation_vector(_dt) for _dt in self.dt_datetime] def _take_action(self, actions, done): # action = math.floor(x), # profit_taken = math.ceil((x- math.floor(x)) * profit_taken_max - stop_loss_max ) # _actions = np.floor(actions).astype(int) # _profit_takens = np.ceil((actions - np.floor(actions)) *self.cf.symbol(self.assets[i],"profit_taken_max")).astype(int) _action = 2 _profit_taken = 0 rewards = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # need use multiply assets for i, action in enumerate(actions): # Actions are now floats between 0 and 3 self._o = self.get_observation(self.current_step, i, "Open") self._h = self.get_observation(self.current_step, i, "High") self._l = self.get_observation(self.current_step, i, "Low") self._c = self.get_observation(self.current_step, i, "Close") self._t = self.get_observation(self.current_step, i, "_time") self._day = self.get_observation(self.current_step, i, "_day") # Extract integer action type and fractional part _action = math.floor(action) # 0=Buy, 1=Sell, 2=Nothing rewards[i] = self._calculate_reward(i, done, _action) # Pass action for exploration reward print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}") if self.cf.symbol(self.assets[i], "limit_order"): self._limit_order_process(i, _action, done) if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")): # Dynamically calculate PT using action fraction _profit_taken = math.ceil( (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max") ) + self.cf.symbol(self.assets[i], "stop_loss_max") self.ticket_id += 1 if self.cf.symbol(self.assets[i], "limit_order"): transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._l if _action == 0 else self._h, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": -1, "CloseStep": -1, } self.transaction_limit_order.append(transaction) else: transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._c, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": self.current_step, "CloseStep": -1, } self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction) return sum(rewards) def _calculate_reward(self, i, done, action): _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[:]: # Copy to avoid modification issues if tr["Symbol"] == self.assets[i]: _point = self.cf.symbol(self.assets[i], "point") # cash discount overnight if self._day > tr["DateDuration"]: tr["DateDuration"] = self._day tr["Reward"] -= self.cf.symbol(self.assets[i], "over_night_penalty") if tr["Type"] == 0: # Buy # stop loss trigger _sl_price = tr["ActionPrice"] - tr["SL"] / _point _pt_price = tr["ActionPrice"] + tr["PT"] / _point if done: p = (self._c - tr["ActionPrice"]) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._l <= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._h >= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: # still open self.current_draw_downs[i] = int((self._l - tr["ActionPrice"]) * _point) _max_draw_down += self.current_draw_downs[i] if self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i]: tr["MaxDD"] = self.current_draw_downs[i] elif tr["Type"] == 1: # Sell # stop loss trigger _sl_price = tr["ActionPrice"] + tr["SL"] / _point _pt_price = tr["ActionPrice"] - tr["PT"] / _point if done: p = (tr["ActionPrice"] - self._c) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._h >= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._l <= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: self.current_draw_downs[i] = int( (tr["ActionPrice"] - self._h) * _point ) _max_draw_down += self.current_draw_downs[i] if ( self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i] ): tr["MaxDD"] = self.current_draw_downs[i] if _max_draw_down > self.max_draw_downs[i]: self.max_draw_downs[i] = _max_draw_down return _total_reward def _limit_order_process(self, i, _action, done): for tr in self.transaction_limit_order[:]: if tr["Symbol"] == self.assets[i]: if tr["Type"] != _action or done: self.transaction_limit_order.remove(tr) tr["Status"] = 3 tr["CloseStep"] = self.current_step self.transaction_history.append(tr) elif (tr["ActionPrice"] >= self._l and _action == 0) or ( tr["ActionPrice"] <= self._h and _action == 1): tr["ActionStep"] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_limit_order.remove(tr) self.transaction_live.append(tr) self.tranaction_open_this_step.append(tr) elif (tr["LimitStep"] + self.cf.symbol(self.assets[i], "limit_order_expiration") > self.current_step): tr["CloseStep"] = self.current_step tr["Status"] = 4 self.transaction_limit_order.remove(tr) self.transaction_history.append(tr) def _manage_tranaction(self, tr, _p, close_price, status=1): self.transaction_live.remove(tr) tr["ClosePrice"] = close_price tr["Point"] = int(_p) tr["Reward"] = int(tr["Reward"] + _p) # Realized profit/loss tr["Status"] = status # 1=SL/PT, 2=Forced close, 3=Canceled limit, 4=Expired limit tr["CloseTime"] = self._t tr["CloseStep"] = self.current_step self.balance += int(tr["Reward"]) self.total_equity -= int(abs(tr["Reward"])) self.tranaction_close_this_step.append(tr) self.transaction_history.append(tr) def analyze_transaction_history(self, log_file): if not self.transaction_history: metrics = {"trades": 0, "win_rate": 0.0, "profit_factor": 0.0, "sharpe_ratio": 0.0, "total_profit": 0.0} else: trades = len(self.transaction_history) rewards = [tr["Reward"] for tr in self.transaction_history] wins = sum(1 for r in rewards if r > 0) losses = sum(1 for r in rewards if r < 0) gross_profit = sum(r for r in rewards if r > 0) gross_loss = abs(sum(r for r in rewards if r < 0)) win_rate = wins / trades if trades > 0 else 0.0 profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf") # Sharpe Ratio (simplified, assumes risk-free rate = 0) returns = np.array(rewards, dtype=np.float32) sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0.0 total_profit = sum(rewards) metrics = { "trades": trades, "win_rate": win_rate, "profit_factor": profit_factor, "sharpe_ratio": sharpe_ratio, "total_profit": total_profit } # Append to log file with open(self.analyze_transaction_history_log_filename, 'a', newline='') as f: writer = csv.DictWriter(f, fieldnames=["timestamp", "episode", "trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]) metrics["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") metrics["episode"] = self.episode writer.writerow(metrics) return metrics def step(self, actions): self.current_step += 1 # Define termination and truncation conditions terminated = self.balance <= 0 # Episode ends due to bankruptcy (terminal state) truncated = self.current_step == len(self.dt_datetime) - 1 # Episode ends due to max steps (time limit) done = terminated or truncated # Combine into a single 'done' flag for VecEnv # For rendering or episode tracking, you might still check if either condition is true if done: self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n" self.visualization = True self.episode += 1 # Increment episode counter # Calculate base trading reward base_reward = self._take_action(actions, done) # Calculate unrealized profit from open positions unrealized_profit = 0 atr_scaling = 0 # For market condition scaling for i, asset in enumerate(self.assets): atr = self.get_observation(self.current_step, i, "ATR") atr_scaling += atr # Sum ATR across assets for normalization for tr in self.transaction_live: if tr["Symbol"] == asset: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1 # Avoid division by 0 # Sustained reward: only applies to unrealized/realized profits, scaled by ATR # adjust 0.01 to 0.05 sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0 # Penalty for inaction if no positions are held if not self.transaction_live and all(math.floor(a) == 2 for a in actions): sustained_reward -= 0.1 # Small penalty to encourage exploration total_reward = base_reward + sustained_reward if self._day > self.current_day: self.current_day = self._day self.balance -= self.over_night_cash_penalty if self.balance != 0: self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100) obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}") # Info dictionary remains unchanged info = {"Close": self.tranaction_close_this_step} return obs, total_reward, terminated, truncated, info def get_observation(self, _step, _iter=0, col=None): if col is None: return self.cached_ohlc_data[_step] if col == "_day": return self.cached_time_serial[_step][1] elif col == "_time": return self.cached_time_serial[_step][0] try: col_pos = self.observation_list.index(col) except ValueError: raise ValueError(f"Column '{col}' not found in observation_list") return self.cached_ohlc_data[_step][_iter * len(self.observation_list) + col_pos] def get_observation_vector(self, _dt, cols=None): cols = self.observation_list if cols is None else cols v = [] for a in self.assets: subset = self.df.query(f'{self.asset_col} == "{a}" & {self.time_col} == "{_dt}"') assert not subset.empty v += subset.loc[_dt, cols].tolist() assert len(v) == len(self.assets) * len(cols) return v def get_economic_vector(self, _dt): subset = self.df.loc[_dt] events = subset['events'] if isinstance(subset, pd.Series) else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) currency_ids = [self.currency_map.get(e['currency'], 0) for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) numeric_fields = ['actual_norm', 'forecast_norm', 'previous_norm', 'surprise_norm', 'event_freq', 'impact_code'] numeric = [e[field] for e in events[:self.max_events] for field in numeric_fields] + [0] * (self.max_events * 6 - len(events) * 6) return { "event_ids": np.array(event_ids, dtype=np.int32), "currency_ids": np.array(currency_ids, dtype=np.int32), "numeric": np.array(numeric, dtype=np.float32) } def reset(self, seed=None, options=None): # Set the seed for reproducibility if seed is not None: self._seed(seed) if self.random_start: self.current_step = random.choice(range(int(len(self.dt_datetime) * 0.5))) else: self.current_step = 0 self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True self.visualization = False obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } info = {} return obs, info def render(self, mode="human", title=None, **kwargs): if mode in ("human", "file"): printout = mode == "human" pm = { "log_header": self.log_header, "log_filename": self.log_filename, "printout": printout, "balance": self.balance, "balance_initial": self.balance_initial, "tranaction_close_this_step": self.tranaction_close_this_step, "done_information": self.done_information, } render_to_file(**pm) if self.log_header: self.log_header = False elif mode == "graph" and self.visualization: print("plotting...") p = TradingChart(self.df, self.transaction_history) p.plot() def close(self): pass def get_sb_env(self): e = DummyVecEnv([lambda: self]) obs = e.reset() return e, obs class CustomFeaturesExtractor(BaseFeaturesExtractor): def __init__(self, observation_space): n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2 ohlc_dim = observation_space.spaces["ohlc_data"].shape[0] max_events = observation_space.spaces["event_ids"].shape[0] economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0] portfolio_dim = observation_space.spaces["portfolio_data"].shape[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim super().__init__(observation_space, features_dim=features_dim) # num_embeddings=num of unique data self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events) self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events) print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}") def forward(self, obs): ohlc_data = obs["ohlc_data"] event_ids = obs["event_ids"].to(torch.long) currency_ids = obs["currency_ids"].to(torch.long) economic_numeric = obs["economic_numeric"] portfolio_data = obs["portfolio_data"] event_emb = self.event_embedding(event_ids).mean(dim=1) currency_emb = self.currency_embedding(currency_ids).mean(dim=1) features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1) return features class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Extract action space bounds self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch["pi"], activation_fn=nn.ReLU, device=self.device ) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Update the distribution with current parameters self.action_dist.update(mean_actions, log_std) # Sample actions or get deterministic actions if deterministic: actions = self.action_dist.mode() else: actions = self.action_dist.sample() # Map from [-1, 1] to [0, 3] actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities log_prob = self.action_dist.log_prob(actions).sum(dim=-1, keepdim=True) # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) self.action_dist.update(mean_actions, log_std) # Actions need to be unsquashed back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 log_prob = self.action_dist.log_prob(unsquashed_actions).sum(dim=-1, keepdim=True) entropy = self.action_dist.entropy().sum(dim=-1, keepdim=True) values = self.value_net(latent_vf) return values, log_prob, entropy ```
[我 2025-03-12 18：36：32,364]在 RDB 中创建的新研究，名称为：OHLC_EconomicCalender_ppo_study使用 cuda 设备 CustomFeaturesExtractor： n_assets=1， features_dim=96 [W 2025-03-12 18：36：34,723] 试验 0 失败，参数为：{'learning_rate'： 0.00037110990690939903， 'n_steps'： 5120， 'total_timesteps'： 2000000， 'batch_size'： 512， 'gamma'： 0.9921798261736212， 'gae_lambda'： 0.8697772993247844， 'clip_range'： 0.2201477433324902} 因为以下错误： AttributeError（“'SquashedDiagGaussianDistribution' 对象' 对象没有属性 'update'”）。回溯（最近调用最后一次）：文件 “/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py”，第 197 行，_run_trial value_or_values = func（trial）文件“<ipython-input-15-8571b78beb29>”，第 166 行，在 objective model.learn（total_timesteps=total_timesteps， callback=eval_callback）文件 “/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py”，第 311 行，在 learn 中返回 super（）.learn（文件 “/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py“，第 323 行，在 learn continue_training = self.collect_rollouts（self.env， callback， self.rollout_buffer， n_rollout_steps=self.n_steps）文件 ”/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py“，第 202 行，在 collect_rollouts作、值log_probs = self.policy（obs_tensor）文件中 ”/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py“，第 1739 行，_wrapped_call_impl返回 self._call_impl（*args， **kwargs）文件“/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py”，第 1750 行，_call_impl返回 forward_call（*args， **kwargs）文件“<ipython-input-13-c61d479822a6>”，第 687 行，在转发 self.action_dist.update（mean_actions， log_std） AttributeError： 'SquashedDiagGaussianDistribution' 对象没有属性 'update' [W 2025-03-12 18：36：34,726] 试验 0 失败，值为 None。 --------------------------------------------------------------------------- AttributeError Traceback （最近调用最后） <cell line中 <ipython-input-15-8571b78beb29> in <cell line： 183>（） 181 load_if_exists=True 182 ） --> 183 study.optimize（objective， n_trials=1） # 根据资源调整试验次数 184 185 # 最佳参数 11 帧 /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize（self， func， n_trials， timeout， n_jobs， catch、callbacks、gc_after_trial、show_progress_bar） 473 如果发生此方法的嵌套调用。474 “”“ --> 475 _optimize（ 476 study=self， 477 func=func， /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize（study， func， n_trials， timeout， n_jobs， catch， callbacks， gc_after_trial， show_progress_bar） 61 try： 62 if n_jobs == 1： ---> 63 _optimize_sequential（ 64 study， 65 func， /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential（study， func， n_trials， timeout， catch， callbacks， gc_after_trial， reseed_sampler_rng， time_start， progress_bar） 158 159 try： --> 160 frozen_trial = _run_trial（study， func， catch） 161 finally： 162 # 以下行缓解了 _run_trial（study， func，）中某些 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py 中可能发生的内存问题 catch） 246 而不是 isinstance（func_err， catch） 247 ）： --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial（study， func， catch） 195 with get_heartbeat_thread（trial._trial_id， study._storage）： 196 try： --> 197 value_or_values = func（trial） 198 除了例外。TrialPruned as e： 199 # TODO（mamu）：处理多目标案例。 <ipython-input-15-8571b78beb29> in objective（trial） 164 165 # print（model.policy） # 应该显示in_features=95 的mlp_extractor --> 166 model.learn（total_timesteps=total_timesteps， callback=eval_callback） 167 168 val_avg_profit = evaluate（model， val_env_vec， n_episodes=10） # 评估函数 /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in learn（self， total_timesteps， callback， log_interval， tb_log_name， reset_num_timesteps， progress_bar） 309 progress_bar： bool = False， 310 ） -> SelfPPO： --> 311 return super（）.learn（ 312 total_timesteps=total_timesteps， 313 callback=callback， /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in learn（self， total_timesteps， callback， log_interval， tb_log_name， reset_num_timesteps， progress_bar） 321 322 while self.num_timesteps < total_timesteps： --> 323 continue_training = self.collect_rollouts（self.env， callback， self.rollout_buffer， n_rollout_steps=self.n_steps） 324 325 如果不continue_training：/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in collect_rollouts（self， env， callback， rollout_buffer， n_rollout_steps） 200 # 转换为 pytorch 张量或 TensorDict 201 obs_tensor = obs_as_tensor（self._last_obs， self.device） --> 202作、值log_probs = self.policy（obs_tensor） 203作 = actions.cpu（）.numpy（） 204 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl（self， *args， **kwargs） 1737 return self._compiled_call_impl（*args， **kwargs） # type： ignore[misc] 1738 else： -> 1739 return self._call_impl（*args， **kwargs） 1740 1741 # torchrec 使用以下代码测试代码一致性 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl（self， *args， **kwargs） 1748 or _global_backward_pre_hooks or _global_backward_hooks 1749 or _global_forward_hooks or _global_forward_pre_hooks）： -> 1750 return forward_call（*args， **kwargs） 1751 1752 result = None <ipython-input-13-c61d479822a6> in forward（self， obs， deterministic） 685 686 # 使用当前参数更新发行版 --> 687 self.action_dist.update（mean_actions， log_std） 688 689 # 示例作或获取确定性作 AttributeError： 'SquashedDiagGaussianDistribution' 对象没有属性 'update' ''' import datetime import math import random import ast import gymnasium as gym from gymnasium.utils import spaces from gymnasium.utils import os import csv import numpy as np import pandas_ta as ta from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor， MlpExtractor from stable_baselines3.common.policies import ActorCriticPolicy from stable_baselines3.common.torch_layers import MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution import torch import torch.nn as nn from stable_baselines3 import PPO from stable_baselines3.common.policies import ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart import TradingChart frommeta.env_fx_trading.util.read_config import EnvConfig 类 tgym（gym.Env）： “”“外汇/期货/期权交易健身房环境 1. 三个作空间（0 买入，1 卖出，2 无） 2.同一时间框架下的多个交易对（EURUSD， GBPUSD...） 3.时间范围从 1 分钟到每天，只要使用 K 线（开盘价、最高价、最低价、收盘价） 4.使用 StopLose、ProfitTaken 实现奖励。每对都可以在配置文件 5 中配置自己的 SL 和 PT。配置隔夜现金罚金以及每对的交易手续费和隔夜持仓罚金 6.将数据集拆分为每日、每周或每月...，具有固定的时间步长，在 len（df）结束时。业务逻辑将强制以最后收盘价平仓所有持仓（游戏结束）。7. 必须具有 df 列名称： [（time_col），（asset_col）， Open，Close，High，Low，day] （区分大小写） 8.添加指示符可以在数据处理过程中添加。Finta 9 的 78 个可用 TA 指标。在 json 配置文件中处理的自定义观察列表。10. ProfitTaken = fraction_action * max_profit_taken + SL. 11.SL 是前缀 12。可以设置限价单，如果 limit_order == True，则动作将预设在柱线的 Low 或 High 处买入或卖出，并带有 limit_order_expiration（n 个柱线）。如果价格交叉，它将触发。否则，它将是 drop off 13。渲染模式：人工 -- 在控制台文件上显示每一步实现的奖励 -- 创建事务日志图 -- 在图中创建事务（开发中） 14.15. 奖励，我们希望激励长期持续的利润。在每个步骤中，我们将奖励设置为账户余额乘以到目前为止时间步长数量的一小部分。这样做的目的是在早期阶段延迟过快地奖励代理，并允许它在过于深入地优化单个策略之前进行充分探索。它还将奖励那些在更长时间内保持较高余额的代理商，而不是那些使用不可持续的策略快速赚钱的代理商。16. Observation_space 包含我们希望代理在进行交易或不进行交易之前考虑的所有输入变量。我们希望我们的代理在游戏窗口中“看到”外汇数据点（开盘价、最高价、最低价、收盘价、时间序列、技术分析），以及其他几个数据点，如其账户余额、当前头寸和当前利润。这里的直觉是，对于每个时间步长，我们希望我们的代理考虑导致当前价格的价格行为，以及他们自己的投资组合的状态，以便为下一步行动做出明智的决定。17.奖励为外汇交易单位积分，可为每个交易对配置 18 个。为了使未实现利润奖励反映市场状况，我们将计算每种资产的 ATR，并使用它来动态扩展奖励。 “”“ 元数据 = {”render.modes“： [”graph“， ”human“， ”file“， ”none“]} def __init__（ self， df， event_map， currency_map， env_config_file=”./neo_finrl/env_fx_trading/config/gdbusd-test-1.json“，）： assert df.ndim == 2 super（tgym， self）.__init__（） self.cf = EnvConfig（env_config_file） self.observation_list = self.cf.env_parameters（”observation_list“） # 经济数据映射 self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy（）如果 'events' 不在 self.df 中。列： raise ValueError（“数据帧必须包含'事件'列”） def parse_events（x）： if isinstance（x， str）： try： parsed = ast.literal_eval（x） return parsed if isinstance（parsed， list） else [] except （ValueError， SyntaxError）： return [] return x if isinstance（x， list） else [] self.df['events'] = self.df['events'].apply（parse_events） if not isinstance（self.df['events'].iloc[0]， list）： raise ValueError（“'事件'必须是一个列表”） if self.df['events'].iloc[0] 而不是 isinstance（self.df['events'].iloc[0][0]， dict）： raise ValueError（“'events' 中的元素必须是字典”） self.balance_initial = self.cf.env_parameters（“balance”） self.over_night_cash_penalty = self.cf.env_parameters（“over_night_cash_penalty”） self.asset_col = self.cf.env_parameters（“asset_col”） self.time_col = self.cf.env_parameters（“time_col”） self.random_start = self.cf.env_parameters（“random_start”） self.log_filename = （ self.cf.env_parameters（“log_filename”） + datetime.datetime.now（）.strftime（“%Y%m%d%H%M%S”） + “.csv” ） self.analyze_transaction_history_log_filename = （“transaction_history_log” + datetime.datetime.now（）.strftime（“%Y%m%d%H%M%S”） + “.csv”） self.df[“_time”] = self.df[self.time_col] self.df[“_day”] = self.df[“工作日”] self.assets = self.df[self.asset_col].unique（） self.dt_datetime = self.df[self.time_col].sort_values（）.unique（） self.df = self.df.set_index（self.time_col） self.visualization = False # 计算 ATR 并保存 DataFrame self.calculate_atr（） self.df.to_csv（“processed_df_with_atr.csv”） print（“已保存已处理的DataFrame to 'processed_df_with_atr.csv'”） # 重置值 self.equity_list = [0] * len（self.assets） self.balance = self.balance_initial self.total_equity = self.balance + sum（self.equity_list） self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len（self.assets） self.max_draw_downs = [0.0] * len（self.assets） self.max_draw_down_pct = sum（self.max_draw_downs） / self.balance * 100 self.current_step = 0 self.episode = 0 # 从 0 开始，在剧集结束时递增 self.current_holding = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = “” self.log_header = True # 缓存数据 self.cached_ohlc_data = [self.get_observation_vector（_dt） for _dt in self.dt_datetime] self.cached_economic_data = [self.get_economic_vector（_dt） for _dt in self.dt_datetime] self.cached_time_serial = （ self.df[[“_time”， “_day”]].sort_values（“_time”）.drop_duplicates（）.values.tolist（）） self.reward_range = （-np.inf， np.inf） self.action_space = 空格。Box（low=0， high=3， shape=（len（self.assets），）， dtype=np.float32） self.observation_space = 空格。dict（{ “ohlc_data”：空格.Box（low=-np.inf， high=np.inf， shape=（len（self.assets） * len（self.observation_list），）， dtype=np.float32）， “event_ids”：空格。Box（low=0， high=len（self.event_map）-1， shape=（self.max_events，）， dtype=np.int32）， “currency_ids”：空格。Box（low=0， high=len（self.currency_map）-1， shape=（self.max_events，）， dtype=np.int32）， “economic_numeric”：空格。Box（low=-np.inf， high=np.inf， shape=（self.max_events * 6，）， dtype=np.float32）， “portfolio_data”：空格。Box（low=-np.inf， high=np.inf， shape=（3 + 2 * len（self.assets），）， dtype=np.float32） }） print（ f“初始完成：\n” f“observation_list：{self.observation_list}\n” f“assets：{self.assets}\n” f“时间序列： {min（self.dt_datetime）} -> {max（self.dt_datetime）} 长度： {len（self.dt_datetime）}\n” f“事件： {len（self.event_map）}，货币： {len（self.currency_map）}” ） self._seed（） def _seed（self， seed=None）： self.np_random， seed = seeding.np_random（seed） return [seed] # 假设 self.df 包含 OHLC 数据，列为：开盘价、最高价、最低价、收盘价、asset_col time_col def calculate_atr（self）： # 按资产分组以计算每个交易对的 ATR atr_dfs = [] time_col = “_time” for asset in self.assets： asset_df = self.df[self.df[self.asset_col] == asset][[“Open”， “high”， “Low”， “Close”]].copy（） # 计算默认 period=14 的 ATR asset_df[“ATR”] = ta.atr（ high=asset_df[“High”]， low=asset_df[“Low”]， close=asset_df[“Close”]， length=14 ） asset_df[“asset”] = asset asset_df[time_col] = self.df[self.asset_col] == asset][time_col] atr_dfs.append（asset_df[[time_col， “asset”， “ATR”]]） # 将 ATR 数据合并到主 DataFrame 中 atr_df = pd.concat（atr_dfs）.set_index（time_col） self.df = self.df.merge（atr_df， left_index=True， right_index=True， how=“left”） self.df[“ATR”] = self.df[“ATR”].fillna（method=“ffill”） # 正向填充 NaN # 如果 “ATR” 不在 self.observation_list 中，则更新 observation_list 以包含 ATR： self.observation_list.append（“ATR”） # 使用 ATR 重新缓存观察值 self.cached_ohlc_data = [self.get_observation_vector（_dt） for _dt in self.dt_datetime] def _take_action（self， actions， done）： # action = math.floor（x）， # profit_taken = math.ceil（（x- math.floor（x）） * profit_taken_max - stop_loss_max ） # _actions = np.floor（actions）.astype（int） # _profit_takens = np.ceil（（actions - np.floor（actions）） *self.cf.symbol（self.assets[i]，“profit_taken_max”））.astype（int） _action = 2 _profit_taken = 0 rewards = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # 需要对 i， action in enumerate（actions）使用乘法资产： # 动作现在在 0 到 3 之间浮动 self._o = self.get_observation（self.current_step， i， “Open”） self._h = self.get_observation（self.current_step， i， “High”） self._l = self.get_observation（self.current_step， i， “Low”） self._c = self.get_observation（self.current_step， i， “Close”） self._t = self.get_observation（self.current_step， i， “_time”） self._day = self.get_observation（self.current_step， i， “_day”） # 提取整数动作类型和小数部分 _action = math.floor（action） # 0=买入， 1=卖出， 2=什么都没有 rewards[i] = self._calculate_reward（i， done， _action） # 通过探索奖励的行动 print（f“Asset {self.assets[i]}： action={action}， reward={rewards[i]}， Holding={self.current_holding[i]}”） if self.cf.symbol（self.assets[i]， “limit_order”）： self._limit_order_process（i， _action， done） if （ _action in （0， 1） and not done and self.current_holding[i] < self.cf.symbol（self.assets[i]， “max_current_holding”））： # 使用动作分数动态计算 PT _profit_taken = math.ceil（（action - _action） * self.cf.symbol（self.assets[i]， “profit_taken_max”）） + self.cf.symbol（self.assets[i]， “stop_loss_max”） self.ticket_id += 1 if self.cf.symbol（self.assets[i]， “limit_order”）： transaction = { “Ticket”： self.ticket_id， “Symbol”： self.assets[i]， “ActionTime”： self._t， “Type”： _action， “Lot”： 1， “ActionPrice”： self._l if _action == 0 else self._h， “SL”： self.cf.symbol（self.assets[i]， “stop_loss_max”）， “PT”： _profit_taken， “MaxDD”： 0， “Swap”： 0.0， “CloseTime”： “”， “ClosePrice”： 0.0， “Point”： 0， “Reward”： -self.cf.symbol（self.assets[i]， “transaction_fee”）， “DateDuration”： self._day， “Status”： 0， “LimitStep”： self.current_step， “ActionStep”： -1， “CloseStep”： -1， } self.transaction_limit_order.append（transaction） else： transaction = { “Ticket”： self.ticket_id， “Symbol”： self.assets[i]， “ActionTime”： self._t， “Type”： _action， “Lot”： 1， “ActionPrice”： self._c， “SL”： self.cf.symbol（self.assets[i]， “stop_loss_max”）， “PT”： _profit_taken， “MaxDD”： 0， “Swap”： 0.0， “CloseTime”： “”， “ClosePrice”： 0.0， “Point”： 0， “Reward”： -self.cf.symbol（self.assets[i]， “transaction_fee”）， “DateDuration”： self._day， “Status”： 0， “LimitStep”： self.current_step， “ActionStep”： self.current_step， “CloseStep”： -1， } self.current_holding[i] += 1 self.tranaction_open_this_step.append（transaction） self.balance -= self.cf.symbol（self.assets[i]， “transaction_fee”） self.transaction_live.append（transaction） return sum（rewards） def _calculate_reward（self， i， done， action）： _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[：]： # 复制以避免修改问题 if tr[“Symbol”] == self.assets[i]： _point = self.cf.symbol（self.assets[i]， “point”） # 如果 self._day > 隔夜现金折扣 tr[“DateDuration”]： tr[“DateDuration”] = self._day tr[“奖励”] -= self.cf.symbol（self.assets[i]， “over_night_penalty”） if tr[“Type”] == 0： # 买入 # 止损触发 _sl_price = tr[“ActionPrice”] - tr[“SL”] / _point _pt_price = tr[“ActionPrice”] + tr[“PT”] / _point 如果完成： p = （self._c - tr[“ActionPrice”]） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p elif self._l <= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._h >= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 else： # 仍然打开 self.current_draw_downs[i] = int（（self._l - tr[“ActionPrice”]） * _point） _max_draw_down += self.current_draw_downs[i] if self.current_draw_downs[i] < 0 and tr[“MaxDD”] > self.current_draw_downs[i]： tr[“MaxDD”] = self.current_draw_downs[i] elif tr[“Type”] == 1： # 卖出 # 止损触发 _sl_price = tr[“ActionPrice”] + tr[“SL”] / _point _pt_price = tr[“ActionPrice”] - tr[“PT”] / _point 如果完成： p = （tr[“ActionPrice”] - self._c） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p elif self._h >= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._l <= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 else： self.current_draw_downs[i] = int（（tr[“ActionPrice”] - self._h） * _point ） _max_draw_down += self.current_draw_downs[i] if （ self.current_draw_downs[i] < 0 and tr[“MaxDD”] > self.current_draw_downs[i] ）： tr[“MaxDD”] = self.current_draw_downs[i] if _max_draw_down > self.max_draw_downs[i]： self.max_draw_downs[i] = _max_draw_down return _total_reward def _limit_order_process （self， i， _action， done）：对于self.transaction_limit_order中的tr[：]： if tr[“Symbol”] == self.assets[i]： if tr[“Type”] ！= _action or done： self.transaction_limit_order.remove（tr） tr[“Status”] = 3 tr[“CloseStep”] = self.current_step self.transaction_history.append（tr） elif （tr[“ActionPrice”] >= self._l and _action == 0） or （ tr[“ActionPrice”] <= self._h and _action == 1）： tr[“ActionStep”] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol（self.assets[i]， “transaction_fee”） self.transaction_limit_order.remove（tr） self.transaction_live.append（tr） self.tranaction_open_this_step.append（tr） elif （tr[“LimitStep”] + self.cf.symbol（self.assets[i]， “limit_order_expiration”） > self.current_step）： tr[“CloseStep”] = self.current_step tr[“Status”] = 4 self.transaction_limit_order.remove（tr） self.transaction_history.append（tr） def _manage_tranaction（self， tr， _p， close_price， status=1）： self.transaction_live.remove（tr） tr[“ClosePrice”] = close_price tr[“Point”] = int（_p） tr[“Reward”] = int（tr[“Reward”] + _p） # 已实现盈亏 tr[“Status”] = 状态 # 1=止损/PT，2=强制平仓，3=取消限价，4=过期限价 tr[“CloseTime”] = self._t tr[“CloseStep”] = self.current_step self.balance += int（tr[“Reward”]） self.total_equity -= int（abs（tr[“Reward”]）） self.tranaction_close_this_step.append（tr） self.transaction_history.append（tr） def analyze_transaction_history（self， log_file）：如果不是self.transaction_history： metrics = {“trades”： 0， “win_rate”： 0.0， “profit_factor”： 0.0， “sharpe_ratio”： 0.0， “total_profit”： 0.0} else： trades = len（self.transaction_history） rewards = [tr[“Reward”] for tr in self.transaction_history] wins = sum（1 for r in r）如果r > 0） losses = sum（1 for r r在奖励中如果r < 0） gross_profit = sum（r for r如果 r > 0） gross_loss = abs（sum（r for r in rewards if r < 0）） win_rate = 赢 / 交易如果交易 > 0 else 0.0 profit_factor = gross_profit / gross_loss 如果 gross_loss > 0 else float（“inf”） # 夏普比率（简化，假设无风险利率 = 0） returns = np.array（rewards， dtype=np.float32） sharpe_ratio = np.mean（returns） / np.std（returns） if np.std（returns） > 0 else 0.0 total_profit = sum（rewards） metrics = { “trades”： trades， “win_rate”： win_rate， “profit_factor”： profit_factor， “sharpe_ratio”： sharpe_ratio， “total_profit”： total_profit } # 以 open（self.analyze_transaction_history_log_filename， 'a'， newline=''）作为 f： writer = csv 附加到日志文件。DictWriter（f， fieldnames=[“timestamp”， “episode”， “trades”， “win_rate”， “profit_factor”， “sharpe_ratio”， “total_profit”]） metrics[“timestamp”] = datetime.datetime.now（）.strftime（“%Y-%m-%d %H：%M：%S”） metrics[“episode”] = self.episode writer.writerow（metrics） return metrics def step（self， actions）： self.current_step += 1 # 定义终止和截断条件 terminate = self.balance <= 0 # 剧集因破产结束（最终状态）截断 = self.current_step == len（self.dt_datetime） - 1 # 剧集因最大步数（时间限制）而结束 = 终止或截断 # 合并为 VecEnv 的单个“完成”标志 # 对于渲染或剧集跟踪，如果完成，您仍然可以检查任一条件是否为真： self.done_information += f“Episode： {self.episode} 余额： {self.balance} 步数： {self.current_step}\n” self.visualization = True self.episode += 1 # 增加剧集计数器 # 计算基础交易奖励 base_reward = self._take_action（actions， done） # 计算未平仓头寸的未实现利润 unrealized_profit = 0 atr_scaling = 0 # 对于i的市场条件缩放，资产在枚举中（self.assets： atr = self.get_observation（self.current_step， i， “ATR”） atr_scaling += atr # 对self.transaction_live中的tr进行归一化的资产的ATR总和： if tr[“Symbol”] == asset： if tr[“Type”] == 0： # 买入未实现的 = = （self._c - tr[“ActionPrice”]） * self.cf.symbol（asset， “point”） else： # 卖出未实现 = （tr[“ActionPrice”] - self._c） * self.cf.symbol（asset， “point”） unrealized_profit += 未实现 atr_scaling = atr_scaling / len（self.assets） if atr_scaling > 0 else 1 # 避免被 0 除以 # 持续奖励：仅适用于未实现/已实现的利润，按 ATR 缩放 # 调整 0.01 至 0.05 sustained_reward = （unrealized_profit + base_reward） * 0.01 / atr_scaling if self.transaction_live else 0 # 如果没有持仓，则因不作为而受到处罚。transaction_live and all（math.floor（a） == 2 for a in actions）： sustained_reward -= 0.1 # 鼓励探索的小惩罚 total_reward = base_reward + sustained_reward 如果 self._day > self.current_day： self.current_day = self._day self.balance -= self.over_night_cash_penalty 如果 self.balance ！= 0： self.max_draw_down_pct = abs（sum（self.max_draw_downs） / self.balance * 100） obs = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric”： self.cached_economic_data[self.current_step][“数字”]， “portfolio_data”： np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs， dtype=np.float32 ） } print（f“Step {self.current_step}：基础奖励={base_reward}，持续奖励={sustained_reward}，总计={total_reward}，余额={self.balance}“） # 信息字典保持不变 info = {”Close“： self.tranaction_close_this_step} return obs， total_reward， terminated， truncated， info def get_observation（self， _step， _iter=0， col=None）：如果 col 为 None：返回 self.cached_ohlc_data[_step] if col == ”_day“： return self.cached_time_serial[_step][1] elif col == ”_time“： return self.cached_time_serial[_step][0] try： col_pos = self.observation_list.index（col） except ValueError： raise ValueError（f“Column '{col}' not found in observation_list”） return self.cached_ohlc_data[_step][_iter * len（self.observation_list） + col_pos] def get_observation_vector（self， _dt， cols=None）： cols = self.observation_list if cols is None else cols v = [] for a in self.assets： subset = self.df.query（f'{self.asset_col} == “{a}” & {self.time_col} == “{_dt}”'） assert not subset.empty v += subset.loc[_dt， cols].tolist（） assert len（v） == len（self.assets） * len（cols） return v def get_economic_vector（self， _dt）： subset = self.df.loc[_dt] events = subset['events'] if isinstance（subset， pd.Series） else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * （self.max_events - len（events）） currency_ids = [self.currency_map.get（e['currency']， 0） for e in events[:self.max_events]] + [0] * （self.max_events - len（events）） numeric_fields = ['actual_norm'， 'forecast_norm'， 'previous_norm'， 'surprise_norm'， 'event_freq'， 'impact_code'] numeric = [e[field] for e in events[：self.max_events] for field in numeric_fields] + [0] * （self.max_events * 6 - len（events） * 6） return { “event_ids”： np.array（event_ids， dtype=np.int32）， “currency_ids”： np.array（currency_ids， dtype=np.int32）， “numeric”： np.array（numeric， dtype=np.float32） } def reset（self， seed=None， options=None）： # 如果 seed 不是 None： self._seed（seed） if self.random_start： self.current_step = random.choice（range（int（len（self.dt_datetime） * 0.5））） else： self.current_step = 0 self.equity_list = [0] * len（self.assets） self.balance = self.balance_initial self.total_equity = self.balance + sum（self.equity_list） self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len（self.assets） self.max_draw_downs = [0.0] * len（self.assets） self.max_draw_down_pct = sum（self.max_draw_downs） / self.balance * 100 self.current_holding = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = “” self.log_header = True self.visualization = False obs = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric“： self.cached_economic_data[self.current_step][”numeric“]， ”portfolio_data“： np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs， dtype=np.float32 ） } info = {} return obs， info def render（self， mode=”human“， title=None， **kwargs）： if mode in （”human“， ”file“）： printout = mode == ”human“ pm = { ”log_header“： self.log_header， ”log_filename“： self.log_filename， ”printout“： printout， ”balance“： self.balance， “balance_initial”： self.balance_initial， “tranaction_close_this_step”： self.tranaction_close_this_step， “done_information”： self.done_information， } render_to_file（**pm） if self.log_header： self.log_header = False elif mode == “graph” and self.visualization： print（“plotting...”） p = TradingChart（self.df， self.transaction_history） p.plot（） def close（self）： pass def get_sb_env（self）： e = DummyVecEnv（[lambda： self]） obs = e.reset（） return e， obs 类 CustomFeaturesExtractor（BaseFeaturesExtractor）： def __init__（self， observation_space）： n_assets = （observation_space.spaces[“portfolio_data”].shape[0] - 3） // 2 ohlc_dim = observation_space.spaces[“ohlc_data”].shape[0] max_events = observation_space.spaces[“event_ids”].shape[0] economic_numeric_dim = observation_space.spaces[“economic_numeric”].shape[0] portfolio_dim = observation_space.spaces[“portfolio_data”].shape[0] features_dim = ohlc_dim + 2 * max_events +economic_numeric_dim + portfolio_dim super（）.__init__（observation_space， features_dim=features_dim） # num_embeddings=唯一数据数 self.event_embedding = nn.Embedding（num_embeddings=129， embedding_dim=max_events） self.currency_embedding = nn.Embedding（num_embeddings=6， embedding_dim=max_events） print（f“CustomFeaturesExtractor： n_assets={n_assets}， features_dim={features_dim}”） def forward（self， obs）： ohlc_data = obs[“ohlc_data”] event_ids = obs[“event_ids”].to（torch.long） currency_ids = obs[“currency_ids”].to（torch.long） economic_numeric = obs[“economic_numeric”] portfolio_data = obs[“portfolio_data”] event_emb = self.event_embedding（event_ids）.mean（dim=1） currency_emb = self.currency_embedding（currency_ids）.mean（dim=1） features = torch.cat（[ohlc_data， event_emb， currency_emb， economic_numeric， portfolio_data]， dim=1） return features class CustomMultiInputPolicy（ActorCriticPolicy）： def __init__（self， observation_space， action_space， lr_schedule， *args， **kwargs）： # 提取动作空间边界 self.action_space_low = torch.tensor（action_space.low， dtype=torch.float32） self.action_space_high = torch.tensor（action_space.high， dtype=torch.float32） action_dim = action_space.shape[0] # 数字资产 super（）.__init__（ observation_space， action_space， lr_schedule， features_extractor_class=CustomFeaturesExtractor， features_extractor_kwargs={}， net_arch=dict（pi=[64， 64]， vf=[64， 64]）， *args， **kwargs ） features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor（ features_dim， net_arch=self.net_arch[“pi”]， activation_fn=nn.ReLU， device=self.device ） # 定义输出高斯 self.action_net = nn 的均值和log_std的动作网络。Linear（64， action_dim * 2） # 输出每种资产的平均值和log_std self.value_net = nn.Linear（64， 1） # 初始化分布 self.action_dist = SquashedDiagGaussianDistribution（action_dim） self.num_timesteps = 0 def forward（self， obs， deterministic=False）： # 在每次前向传递上增加时间步长 self.num_timesteps += 1 # 提取特征 features = self.extract_features（obs） latent_pi， latent_vf = self.mlp_extractor（features） # 从 action_net action_params 获取平均值和log_std = self.action_net（latent_pi） # [batch， n_assets * 2] mean_actions， log_std = action_params.chunk（2， dim=-1） # 拆分为均值和 log_std log_std = torch.clamp（log_std， min=-20， max=2） # 稳定log_std # 使用当前参数更新分布 self.action_dist.update（mean_actions， log_std） # 如果确定性，则获取确定性作的示例作或获取确定性作： actions = self.action_dist.mode（） else： actions = self.action_dist.sample（） # 从 [-1， 1] 到 [0， 3] actions = self._squash_to_range（actions， self.action_space_low， self.action_space_high） # 计算对数概率 log_prob = self.action_dist.log_prob（actions）.sum（dim=-1， keepdim=True） # 值预测值 = self.value_net（latent_vf） if self.num_timesteps % 1000 == 0： print（f“Step {self.num_timesteps}， training： {self.training}， actions： {actions}， mean： {mean_actions.mean（）}， log std： {log_std.mean（）}“） return actions， values， log_prob def _squash_to_range（self， actions， low， high）： ”“将压缩的动作从 [-1， 1] 缩放到 [low， high].”“ return （actions + 1） * （high - low） / 2 + low def extract_features（self， obs）： return self.features_extractor（obs） def predict_values（self， obs）： features = self.extract_features（obs） _， latent_vf = self.mlp_extractor（features） return self.value_net（latent_vf） def evaluate_actions（self， obs， actions）： “”“评估训练作（由 PPO 使用）.”“” features = self.extract_features（obs） latent_pi， latent_vf = self.mlp_extractor（features） action_params = self.action_net（latent_pi） mean_actions， log_std = action_params.chunk（2， dim=-1） log_std = torch.clamp（log_std， min=-20， max=2） self.action_dist.update（mean_actions， log_std） #作需要取消压缩回 [-1， 1] 对于log_prob计算 unsquashed_actions = 2 * （作 - self.action_space_low） / （self.action_space_high - self.action_space_low） - 1 log_prob = self.action_dist.log_prob（unsquashed_actions）.sum（dim=-1， keepdim=True）熵 = self.action_dist.entropy（）.sum（dim=-1， keepdim=True）值 = self.value_net（latent_vf）返回值， log_prob，熵 '''

The error you're encountering, 'SquashedDiagGaussianDistribution' object has no attribute 'update', stems from an incorrect assumption about the SquashedDiagGaussianDistribution class in Stable-Baselines3 (SB3). Unlike some other distribution classes, SquashedDiagGaussianDistribution does not have an update method to dynamically set its parameters (mean_actions and log_std). Instead, SB3 expects the policy to directly compute the distribution parameters (mean_actions and log_std) and pass them to the distribution's methods (like sample, log_prob, etc.) during each forward pass.
您遇到的错误“SquashedDiagGaussianDistribution”对象没有属性“update”，源于对 Stable-Baselines3 （SB3）中 SquashedDiagGaussianDistribution 类的错误假设。与其他一些分布类不同，SquashedDiagGaussianDistribution 没有动态设置其参数（mean_actions 和 log_std）的更新方法。相反，SB3 希望策略直接计算分配参数（mean_actions 和 log_std），并在每次正向传递期间将它们传递给分配的方法（如 sample、log_prob 等）。

This issue arises in your CustomMultiInputPolicy class where you attempt to call self.action_dist.update(mean_actions, log_std). Let’s fix this by modifying the policy to work with SB3’s design, ensuring the distribution is handled correctly without relying on an update method.
此问题出现在您尝试调用 self.action_dist.update（mean_actions， log_std）的 CustomMultiInputPolicy 类中。让我们通过修改策略以使用 SB3 的设计来解决此问题，确保正确处理分发，而无需依赖更新方法。

Root Cause

The SquashedDiagGaussianDistribution in SB3 is designed to be stateless between calls—it doesn’t maintain or update internal parameters via an update method. Instead, it expects the policy to provide mean_actions and log_std each time an action is sampled or evaluated.
Your current implementation assumes you can update the distribution’s parameters and reuse it, which isn’t how SB3’s implementation works.

Solution

You need to:

Remove the update call and directly use the distribution methods (sample, log_prob, etc.) with the computed mean_actions and log_std in each forward pass.
Ensure the distribution is reinitialized or correctly utilized within the forward and evaluate_actions methods without relying on a persistent state.

Here’s a corrected version of your CustomMultiInputPolicy class:

class CustomMultiInputPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs):
        # Extract action space bounds
        self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32)
        self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32)
        action_dim = action_space.shape[0]  # Number of assets

        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            features_extractor_class=CustomFeaturesExtractor,
            features_extractor_kwargs={},
            net_arch=dict(pi=[64, 64], vf=[64, 64]),
            *args,
            **kwargs
        )

        features_dim = self.features_extractor.features_dim
        self.mlp_extractor = MlpExtractor(
            features_dim,
            net_arch=self.net_arch,
            activation_fn=nn.ReLU,
            device=self.device
        )

        # Define the action network to output mean and log_std for the Gaussian
        self.action_net = nn.Linear(64, action_dim * 2)  # Outputs mean and log_std for each asset
        self.value_net = nn.Linear(64, 1)

        # Initialize the distribution (no need to store it as a persistent object)
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)
        self.num_timesteps = 0

    def forward(self, obs, deterministic=False):
        self.num_timesteps += 1

        # Extract features
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        # Get mean and log_std from action_net
        action_params = self.action_net(latent_pi)  # [batch, n_assets * 2]
        mean_actions, log_std = action_params.chunk(2, dim=-1)  # Split into mean and log_std
        log_std = torch.clamp(log_std, min=-20, max=2)  # Stabilize log_std

        # Create a fresh distribution instance with current parameters
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)
        
        # Sample actions or get deterministic actions
        actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)
        
        # Map from [-1, 1] to [0, 3]
        actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)

        # Compute log probabilities
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions)

        # Value prediction
        values = self.value_net(latent_vf)

        if self.num_timesteps % 1000 == 0:
            print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}")

        return actions, values, log_prob

    def _squash_to_range(self, actions, low, high):
        """Scale squashed actions from [-1, 1] to [low, high]."""
        return (actions + 1) * (high - low) / 2 + low

    def extract_features(self, obs):
        return self.features_extractor(obs)

    def predict_values(self, obs):
        features = self.extract_features(obs)
        _, latent_vf = self.mlp_extractor(features)
        return self.value_net(latent_vf)

    def evaluate_actions(self, obs, actions):
        """Evaluate actions for training (used by PPO)."""
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)
        
        # Actions need to be unsquashed back to [-1, 1] for log_prob calculation
        unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1
        
        # Compute log_prob and entropy
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, unsquashed_actions)
        entropy = distribution.entropy_from_params(mean_actions, log_std)
        values = self.value_net(latent_vf)

        return values, log_prob, entropy

Key Changes

Removed self.action_dist.update:
- Instead of calling update, we use the distribution’s methods like actions_from_params and log_prob_from_params, which take mean_actions and log_std directly as inputs.
  我们不使用 update 来调用，而是使用发行版的方法（如 actions_from_params 和 log_prob_from_params），它们直接将 mean_actions 和 log_std 作为输入。
Fresh Distribution Instance:
新鲜分发实例：
- In both forward and evaluate_actions, a new SquashedDiagGaussianDistribution instance is created for each call. This avoids the need to update an existing instance and aligns with SB3’s design.
  在正向和evaluate_actions中，都会为每个调用创建一个新的 SquashedDiagGaussianDistribution 实例。这避免了更新现有实例的需要，并与 SB3 的设计保持一致。
Updated Action Sampling and Log Probability Calculation:
更新了作采样和对数概率计算：
- Use actions_from_params to sample actions and log_prob_from_params to compute log probabilities, passing the computed mean_actions and log_std directly.
  使用 actions_from_params 对作进行采样，使用 log_prob_from_params 计算对数概率，直接传递计算mean_actions和log_std。
Entropy Calculation: 熵计算：
- Added entropy_from_params in evaluate_actions to compute the entropy of the distribution, which is required by PPO for policy optimization.
  在 evaluate_actions 中添加了 entropy_from_params 以计算分布的熵，这是 PPO 进行策略优化所需的。

Additional Notes 其他说明

Debugging Output: The print statement in forward remains to help you monitor training progress. You might want to adjust the frequency (e.g., every 10,000 steps) depending on your needs.
调试输出：forward 中的 print 语句仍可帮助你监视训练进度。您可能希望根据需要调整频率（例如，每 10,000 步）。
Action Space Consistency: Ensure your environment’s action space (Box(low=0, high=3, shape=(n_assets,), dtype=np.float32)) matches the policy’s expectations. The squashing from [-1, 1] to [0, 3] should work as intended.
作空间一致性：确保环境的作空间（Box（low=0， high=3， shape=（n_assets，）， dtype=np.float32））与策略的预期匹配。从 [-1， 1] 到 [0， 3] 的压缩应该按预期工作。
Dependencies: Verify that your versions of stable_baselines3, torch, and other libraries are compatible (e.g., stable_baselines3>=2.0.0).
依赖：验证 stable_baselines3、torch 和其他库的版本是否兼容（例如，stable_baselines3>=2.0.0）。

Testing the Fix 测试修复程序

After applying these changes, rerun your training loop. The error should disappear, and the PPO algorithm should proceed with learning. Monitor the output to ensure actions are being sampled correctly within the [0, 3] range and that the policy is updating as expected.
应用这些更改后，重新运行训练循环。错误应该消失，PPO 算法应该继续学习。监控输出以确保在 [0， 3] 范围内正确采样作，并确保策略按预期更新。

If you encounter further issues (e.g., NaNs in rewards or unexpected behavior), let me know, and I can help debug further!
如果您遇到其他问题（例如，奖励中的 NaN 或意外行为），请告诉我，我可以帮助进一步调试！

make it only use cude. [I 2025-03-13 04:04:21,994] Using an existing study with name 'OHLC_EconomicCalender_ppo_study' instead of creating a new one. [W 2025-03-13 04:04:22,099] Trial 2 failed with parameters: {'learning_rate': 0.0006209627114896912, 'n_steps': 3072, 'total_timesteps': 900000, 'batch_size': 256, 'gamma': 0.9661851780352002, 'gae_lambda': 0.8556369419581837, 'clip_range': 0.2349170083964893} because of the following error: RuntimeError('Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!'). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-22-5234907ae9e5>", line 166, in objective model.learn(total_timesteps=total_timesteps, callback=eval_callback) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 311, in learn return super().learn( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 323, in learn continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 202, in collect_rollouts actions, values, log_probs = self.policy(obs_tensor) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "<ipython-input-21-eae563be9ffc>", line 658, in forward actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) File "<ipython-input-21-eae563be9ffc>", line 673, in _squash_to_range return (actions + 1) * (high - low) / 2 + low RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! [W 2025-03-13 04:04:22,100] Trial 2 failed with value None. Using cuda device CustomFeaturesExtractor: n_assets=1, features_dim=96 --------------------------------------------------------------------------- RuntimeError Traceback (most recent call last) <ipython-input-22-5234907ae9e5> in <cell line: 183>() 181 load_if_exists=True 182 ) --> 183 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 184 185 # Best parameters 12 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-22-5234907ae9e5> in objective(trial) 164 165 # print(model.policy) # Should show mlp_extractor with in_features=95 --> 166 model.learn(total_timesteps=total_timesteps, callback=eval_callback) 167 168 val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 309 progress_bar: bool = False, 310 ) -> SelfPPO: --> 311 return super().learn( 312 total_timesteps=total_timesteps, 313 callback=callback, /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 321 322 while self.num_timesteps < total_timesteps: --> 323 continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) 324 325 if not continue_training: /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps) 200 # Convert to pytorch tensor or to TensorDict 201 obs_tensor = obs_as_tensor(self._last_obs, self.device) --> 202 actions, values, log_probs = self.policy(obs_tensor) 203 actions = actions.cpu().numpy() 204 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs) 1737 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1738 else: -> 1739 return self._call_impl(*args, **kwargs) 1740 1741 # torchrec tests the code consistency with the following code /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs) 1748 or _global_backward_pre_hooks or _global_backward_hooks 1749 or _global_forward_hooks or _global_forward_pre_hooks): -> 1750 return forward_call(*args, **kwargs) 1751 1752 result = None <ipython-input-21-eae563be9ffc> in forward(self, obs, deterministic) 656 657 # Map from [-1, 1] to [0, 3] --> 658 actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) 659 660 # Compute log probabilities <ipython-input-21-eae563be9ffc> in _squash_to_range(self, actions, low, high) 671 def _squash_to_range(self, actions, low, high): 672 """Scale squashed actions from [-1, 1] to [low, high].""" --> 673 return (actions + 1) * (high - low) / 2 + low 674 675 def extract_features(self, obs): RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The error RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! occurs because some tensors in your CustomMultiInputPolicy class are not consistently placed on the CUDA device, even though you're using CUDA for training. Specifically, the issue arises in the _squash_to_range method where self.action_space_low and self.action_space_high are initialized as PyTorch tensors but not explicitly moved to the CUDA device, while other tensors (like actions) are on cuda:0.

To fix this, you need to ensure that all tensors and operations in your policy are performed on the CUDA device. Since your environment indicates "Using cuda device," I'll modify the CustomMultiInputPolicy class to explicitly use CUDA (cuda:0) for all tensors and operations.

Updated Code

Here’s the corrected version of CustomMultiInputPolicy that ensures everything runs on CUDA:

import torch
import torch.nn as nn
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.torch_layers import MlpExtractor
from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

class CustomMultiInputPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs):
        # Set device to CUDA explicitly
        self.device = torch.device("cuda:0")

        # Extract action space bounds and move them to CUDA
        self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=self.device)
        self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=self.device)
        action_dim = action_space.shape[0]  # Number of assets

        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            features_extractor_class=CustomFeaturesExtractor,
            features_extractor_kwargs={},
            net_arch=dict(pi=[64, 64], vf=[64, 64]),
            *args,
            **kwargs
        )

        features_dim = self.features_extractor.features_dim
        self.mlp_extractor = MlpExtractor(
            features_dim,
            net_arch=self.net_arch,
            activation_fn=nn.ReLU,
            device=self.device
        ).to(self.device)

        # Define the action network to output mean and log_std for the Gaussian
        self.action_net = nn.Linear(64, action_dim * 2).to(self.device)
        self.value_net = nn.Linear(64, 1).to(self.device)

        # Initialize the distribution (no need to store it as a persistent object)
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)
        self.num_timesteps = 0

    def forward(self, obs, deterministic=False):
        self.num_timesteps += 1

        # Extract features (ensure obs is on CUDA via SB3's obs_as_tensor)
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        # Get mean and log_std from action_net
        action_params = self.action_net(latent_pi)  # [batch, n_assets * 2]
        mean_actions, log_std = action_params.chunk(2, dim=-1)  # Split into mean and log_std
        log_std = torch.clamp(log_std, min=-20, max=2)  # Stabilize log_std

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)
        
        # Sample actions or get deterministic actions
        actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)
        
        # Map from [-1, 1] to [0, 3]
        actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)

        # Compute log probabilities
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions)

        # Value prediction
        values = self.value_net(latent_vf)

        if self.num_timesteps % 1000 == 0:
            print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}")

        return actions, values, log_prob

    def _squash_to_range(self, actions, low, high):
        """Scale squashed actions from [-1, 1] to [low, high]."""
        return (actions + 1) * (high - low) / 2 + low

    def extract_features(self, obs):
        return self.features_extractor(obs)

    def predict_values(self, obs):
        features = self.extract_features(obs)
        _, latent_vf = self.mlp_extractor(features)
        return self.value_net(latent_vf)

    def evaluate_actions(self, obs, actions):
        """Evaluate actions for training (used by PPO)."""
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)
        
        # Actions need to be unsquashed back to [-1, 1] for log_prob calculation
        unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1
        
        # Compute log_prob and entropy
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, unsquashed_actions)
        entropy = distribution.entropy_from_params(mean_actions, log_std)
        values = self.value_net(latent_vf)

        return values, log_prob, entropy

Key Changes to Ensure CUDA-Only Usage

Explicit CUDA Device Specification:
- Set self.device = torch.device("cuda:0") explicitly in __init__ to ensure all operations use CUDA.
Move Tensors to CUDA:
- Initialized self.action_space_low and self.action_space_high with device=self.device to place them on cuda:0 from the start.
Move Neural Networks to CUDA:
- Added .to(self.device) to self.mlp_extractor, self.action_net, and self.value_net to ensure all layers are on CUDA.
Ensure Input Observations Are on CUDA:
- Stable-Baselines3’s obs_as_tensor (called in collect_rollouts) automatically moves observations to the policy’s device (set to cuda:0 here). This ensures obs in forward and evaluate_actions is already on CUDA, and thus features and subsequent tensors are as well.
No CPU Operations:
- Removed any potential CPU fallback by ensuring all tensor operations (e.g., in _squash_to_range) use tensors already on CUDA.

Additional Modifications

You’ll also need to ensure the CustomFeaturesExtractor class uses CUDA consistently. Here’s the updated version:

class CustomFeaturesExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space):
        n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2
        ohlc_dim = observation_space.spaces["ohlc_data"].shape[0]
        max_events = observation_space.spaces["event_ids"].shape[0]
        economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0]
        portfolio_dim = observation_space.spaces["portfolio_data"].shape[0]
        features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim
        super().__init__(observation_space, features_dim=features_dim)

        # Move embeddings to CUDA
        self.device = torch.device("cuda:0")
        self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device)
        self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device)

        print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}")

    def forward(self, obs):
        ohlc_data = obs["ohlc_data"]
        event_ids = obs["event_ids"].to(self.device, dtype=torch.long)
        currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long)
        economic_numeric = obs["economic_numeric"]
        portfolio_data = obs["portfolio_data"]

        event_emb = self.event_embedding(event_ids).mean(dim=1)
        currency_emb = self.currency_embedding(currency_ids).mean(dim=1)
        features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1)
        return features

Verifying the Fix

Check Device Consistency:
- Add debugging prints in forward to confirm all tensors are on cuda:0:
  python
  print(f"Actions device: {actions.device}, Mean device: {mean_actions.device}, Log_std device: {log_std.device}") print(f"Action_space_low device: {self.action_space_low.device}, Action_space_high device: {self.action_space_high.device}")
- These should all output cuda:0.
Run the Training:
- With these changes, rerun your training loop (study.optimize(objective, n_trials=1)). The device mismatch error should be resolved.
Environment Compatibility:
- Ensure your environment (tgym) returns observations as NumPy arrays or tensors that SB3 can convert to CUDA tensors. Since SB3 handles this via obs_as_tensor, this should already work if your environment follows Gymnasium standards.

Final Notes

CUDA Availability: This code assumes a CUDA-capable GPU is available (cuda:0). If you need to support environments without CUDA, you’d add a check (e.g., torch.cuda.is_available()), but your logs indicate CUDA is in use.
Performance: Running everything on CUDA should improve training speed, especially for larger models or datasets.
Dependencies: Ensure your PyTorch installation supports CUDA (e.g., torch.version.cuda should return a version like 11.8).

Let me know if you encounter any further issues!

1. write an if else statement, if found cuda, use it else use cpu ``` class CustomFeaturesExtractor(BaseFeaturesExtractor): def __init__(self, observation_space): n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2 ohlc_dim = observation_space.spaces["ohlc_data"].shape[0] max_events = observation_space.spaces["event_ids"].shape[0] economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0] portfolio_dim = observation_space.spaces["portfolio_data"].shape[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim super().__init__(observation_space, features_dim=features_dim) # num_embeddings=num of unique data self.device = torch.device("cuda:0") self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device) self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device) print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}") def forward(self, obs): ohlc_data = obs["ohlc_data"] event_ids = obs["event_ids"].to(self.device, dtype=torch.long) currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long) economic_numeric = obs["economic_numeric"] portfolio_data = obs["portfolio_data"] event_emb = self.event_embedding(event_ids).mean(dim=1) currency_emb = self.currency_embedding(currency_ids).mean(dim=1) features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1) return features class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Set device to CUDA explicitly self.device = torch.device("cuda:0") # Extract action space bounds and move them to CUDA self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=self.device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=self.device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=self.device ).to(self.device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(self.device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(self.device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features (ensure obs is on CUDA via SB3's obs_as_tensor) features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions) # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Actions need to be unsquashed back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob_from_params(mean_actions, log_std, unsquashed_actions) entropy = distribution.entropy_from_params(mean_actions, log_std) values = self.value_net(latent_vf) return values, log_prob, entropy ``` 2. is this two CustomMultiInputPolicy will return same value? code 1: ``` class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Extract action space bounds self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch["pi"], activation_fn=nn.ReLU, device=self.device ) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Update the distribution with current parameters self.action_dist.update(mean_actions, log_std) # Sample actions or get deterministic actions if deterministic: actions = self.action_dist.mode() else: actions = self.action_dist.sample() # Map from [-1, 1] to [0, 3] actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities log_prob = self.action_dist.log_prob(actions).sum(dim=-1, keepdim=True) # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) self.action_dist.update(mean_actions, log_std) # Actions need to be unsquashed back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 log_prob = self.action_dist.log_prob(unsquashed_actions).sum(dim=-1, keepdim=True) entropy = self.action_dist.entropy().sum(dim=-1, keepdim=True) values = self.value_net(latent_vf) return values, log_prob, entropy ``` code 2: ``` class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Set device to CUDA explicitly self.device = torch.device("cuda:0") # Extract action space bounds and move them to CUDA self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=self.device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=self.device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=self.device ).to(self.device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(self.device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(self.device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features (ensure obs is on CUDA via SB3's obs_as_tensor) features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions) # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Actions need to be unsquashed back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob_from_params(mean_actions, log_std, unsquashed_actions) entropy = distribution.entropy_from_params(mean_actions, log_std) values = self.value_net(latent_vf) return values, log_prob, entropy ```

Let’s address your two requests step-by-step.

1. Adding an If-Else Statement for CUDA/CPU Selection

To make your code flexible and use CUDA if available, otherwise fall back to CPU, you can use torch.cuda.is_available() to check for CUDA support. Below is the modified code for both CustomFeaturesExtractor and CustomMultiInputPolicy with this logic:

Modified CustomFeaturesExtractor

class CustomFeaturesExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space):
        n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2
        ohlc_dim = observation_space.spaces["ohlc_data"].shape[0]
        max_events = observation_space.spaces["event_ids"].shape[0]
        economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0]
        portfolio_dim = observation_space.spaces["portfolio_data"].shape[0]
        features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim
        super().__init__(observation_space, features_dim=features_dim)

        # Check if CUDA is available, else use CPU
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")

        # Move embeddings to the selected device
        self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device)
        self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device)

        print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}")

    def forward(self, obs):
        ohlc_data = obs["ohlc_data"].to(self.device)
        event_ids = obs["event_ids"].to(self.device, dtype=torch.long)
        currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long)
        economic_numeric = obs["economic_numeric"].to(self.device)
        portfolio_data = obs["portfolio_data"].to(self.device)

        event_emb = self.event_embedding(event_ids).mean(dim=1)
        currency_emb = self.currency_embedding(currency_ids).mean(dim=1)
        features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1)
        return features

Modified CustomMultiInputPolicy

class CustomMultiInputPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs):
        # Check if CUDA is available, else use CPU
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")

        # Extract action space bounds and move them to the selected device
        self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=self.device)
        self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=self.device)
        action_dim = action_space.shape[0]  # Number of assets

        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            features_extractor_class=CustomFeaturesExtractor,
            features_extractor_kwargs={},
            net_arch=dict(pi=[64, 64], vf=[64, 64]),
            *args,
            **kwargs
        )

        features_dim = self.features_extractor.features_dim
        self.mlp_extractor = MlpExtractor(
            features_dim,
            net_arch=self.net_arch,
            activation_fn=nn.ReLU,
            device=self.device
        ).to(self.device)

        # Define the action network and value network, move to the selected device
        self.action_net = nn.Linear(64, action_dim * 2).to(self.device)  # Outputs mean and log_std for each asset
        self.value_net = nn.Linear(64, 1).to(self.device)

        # Initialize the distribution
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)
        self.num_timesteps = 0

    def forward(self, obs, deterministic=False):
        self.num_timesteps += 1

        # Extract features (observations are moved to device by SB3's obs_as_tensor)
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        # Get mean and log_std from action_net
        action_params = self.action_net(latent_pi)  # [batch, n_assets * 2]
        mean_actions, log_std = action_params.chunk(2, dim=-1)  # Split into mean and log_std
        log_std = torch.clamp(log_std, min=-20, max=2)  # Stabilize log_std

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

        # Sample actions or get deterministic actions
        actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)

        # Map from [-1, 1] to [0, 3]
        actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)

        # Compute log probabilities
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions)

        # Value prediction
        values = self.value_net(latent_vf)

        if self.num_timesteps % 1000 == 0:
            print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}")

        return actions, values, log_prob

    def _squash_to_range(self, actions, low, high):
        """Scale squashed actions from [-1, 1] to [low, high]."""
        return (actions + 1) * (high - low) / 2 + low

    def extract_features(self, obs):
        return self.features_extractor(obs)

    def predict_values(self, obs):
        features = self.extract_features(obs)
        _, latent_vf = self.mlp_extractor(features)
        return self.value_net(latent_vf)

    def evaluate_actions(self, obs, actions):
        """Evaluate actions for training (used by PPO)."""
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)
        
        # Actions need to be unsquashed back to [-1, 1] for log_prob calculation
        unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1
        
        # Compute log_prob and entropy
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, unsquashed_actions)
        entropy = distribution.entropy_from_params(mean_actions, log_std)
        values = self.value_net(latent_vf)

        return values, log_prob, entropy

Key Changes:

Added self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") to both classes to dynamically select the device.
Ensured all tensors (e.g., action_space_low, action_space_high) and models (e.g., mlp_extractor, action_net, value_net, event_embedding, currency_embedding) are moved to self.device using .to(self.device).
In CustomFeaturesExtractor.forward, explicitly moved observation tensors to self.device to ensure consistency (though SB3’s obs_as_tensor typically handles this, it’s good practice to enforce it here).
Added a print statement to log the selected device for debugging.

This modification ensures your code works on CUDA if available, falling back to CPU otherwise.

2. Will the Two CustomMultiInputPolicy Classes Return the Same Value?

Let’s compare Code 1 and Code 2 to determine if they produce the same outputs given identical inputs (observations, random seeds, and model weights).

Key Differences Between Code 1 and Code 2

Device Handling:
- Code 1: Does not explicitly set a device. Tensors like action_space_low and action_space_high are created on CPU by default (torch.tensor without device argument), and neural networks (action_net, value_net) are not explicitly moved to CUDA. However, Stable-Baselines3 might move them to CUDA during training if device="cuda" is set in the PPO model.
- Code 2: Explicitly sets self.device = torch.device("cuda:0") and moves all tensors and models to CUDA. This ensures consistent device placement.
Distribution Handling:
- Code 1: Uses self.action_dist.update(mean_actions, log_std) to update the distribution, then calls sample() or mode(). This is incorrect because SquashedDiagGaussianDistribution in SB3 doesn’t have an update method, leading to runtime errors (as you encountered earlier).
- Code 2: Creates a fresh SquashedDiagGaussianDistribution instance and uses actions_from_params and log_prob_from_params, which is the correct approach for SB3.
Network Architecture:
- Both codes use the same network architecture (net_arch=dict(pi=[64, 64], vf=[64, 64]), action_net = nn.Linear(64, action_dim * 2), value_net = nn.Linear(64, 1)), so the computation logic (aside from distribution handling) is identical.
Randomness:
- Both use the same distribution class (SquashedDiagGaussianDistribution), but the way actions are sampled differs due to the incorrect update in Code 1. If Code 1 were fixed to match Code 2’s distribution logic, randomness would depend on PyTorch’s RNG state, which can be controlled with a seed.

Will They Return the Same Value?

No, they won’t return the same value as-is:
- Code 1 Fails: Code 1 will raise an AttributeError due to self.action_dist.update not existing, so it doesn’t even run to produce a value. Even if you fixed this by removing update and using actions_from_params (as in Code 2), other differences remain.
- Device Mismatch: If Code 1 runs on CPU (default) and Code 2 runs on CUDA, floating-point computations might differ slightly due to hardware differences (e.g., CUDA’s fused multiply-add operations). However, this difference is typically negligible unless precision is critical.
- Distribution Logic: Code 1’s action sampling (sample()/mode()) after an invalid update wouldn’t match Code 2’s actions_from_params, even if corrected, unless the RNG state is identical.
If Fixed and Seeded:
- If you fix Code 1 to use the same distribution logic as Code 2 (remove update, use actions_from_params and log_prob_from_params), move all tensors/models to the same device (e.g., CUDA), and set the same random seed (e.g., torch.manual_seed(42)), then yes, they would return the same values for the same inputs and initial weights. The core computation (feature extraction, MLP, action scaling) is identical; only the implementation details differ.

Example Fixed Code 1 (for Comparison)

Here’s what Code 1 would look like if corrected to match Code 2’s logic:

class CustomMultiInputPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs):
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=self.device)
        self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=self.device)
        action_dim = action_space.shape[0]

        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            features_extractor_class=CustomFeaturesExtractor,
            features_extractor_kwargs={},
            net_arch=dict(pi=[64, 64], vf=[64, 64]),
            *args,
            **kwargs
        )

        features_dim = self.features_extractor.features_dim
        self.mlp_extractor = MlpExtractor(features_dim, net_arch=self.net_arch["pi"], activation_fn=nn.ReLU, device=self.device).to(self.device)
        self.action_net = nn.Linear(64, action_dim * 2).to(self.device)
        self.value_net = nn.Linear(64, 1).to(self.device)
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)
        self.num_timesteps = 0

    def forward(self, obs, deterministic=False):
        self.num_timesteps += 1
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)
        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)
        actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)
        actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions)
        values = self.value_net(latent_vf)

        if self.num_timesteps % 1000 == 0:
            print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}")

        return actions, values, log_prob

    def _squash_to_range(self, actions, low, high):
        return (actions + 1) * (high - low) / 2 + low

    def extract_features(self, obs):
        return self.features_extractor(obs)

    def predict_values(self, obs):
        features = self.extract_features(obs)
        _, latent_vf = self.mlp_extractor(features)
        return self.value_net(latent_vf)

    def evaluate_actions(self, obs, actions):
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)
        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)
        unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, unsquashed_actions)
        entropy = distribution.entropy_from_params(mean_actions, log_std)
        values = self.value_net(latent_vf)

        return values, log_prob, entropy

With this fix, if you:

Use the same random seed (torch.manual_seed(seed)),
Initialize weights identically (e.g., same torch.manual_seed before model creation),
Provide the same input observations,
Run on the same device (e.g., both on CUDA or both on CPU),

then Code 1 (fixed) and Code 2 will return the same actions, values, and log_prob for a given forward pass.

Conclusion

Original Code 1 vs. Code 2: No, they won’t return the same values because Code 1 fails due to update and has device inconsistencies.
Fixed Code 1 vs. Code 2: Yes, they will return the same values under controlled conditions (same seed, device, and inputs).

Let me know if you need help testing this or further clarification!

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code) [I 2025-03-13 04:24:23,646] Using an existing study with name 'OHLC_EconomicCalender_ppo_study' instead of creating a new one. [W 2025-03-13 04:24:23,744] Trial 3 failed with parameters: {'learning_rate': 3.024644462283983e-05, 'n_steps': 3072, 'total_timesteps': 1000000, 'batch_size': 64, 'gamma': 0.9656795908895551, 'gae_lambda': 0.8606602748352157, 'clip_range': 0.2560344361591046} because of the following error: AttributeError("can't set attribute 'device'"). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-26-5234907ae9e5>", line 139, in objective model = PPO( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 171, in __init__ self._setup_model() File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 174, in _setup_model super()._setup_model() File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 135, in _setup_model self.policy = self.policy_class( # type: ignore[assignment] File "<ipython-input-24-024273e8a02a>", line 609, in __init__ self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2029, in __setattr__ super().__setattr__(name, value) AttributeError: can't set attribute 'device' [W 2025-03-13 04:24:23,745] Trial 3 failed with value None. Using cuda device --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-26-5234907ae9e5> in <cell line: 183>() 181 load_if_exists=True 182 ) --> 183 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 184 185 # Best parameters 10 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-26-5234907ae9e5> in objective(trial) 137 138 # Train PPO model on training set --> 139 model = PPO( 140 CustomMultiInputPolicy, 141 train_env_vec, /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in __init__(self, policy, env, learning_rate, n_steps, batch_size, n_epochs, gamma, gae_lambda, clip_range, clip_range_vf, normalize_advantage, ent_coef, vf_coef, max_grad_norm, use_sde, sde_sample_freq, rollout_buffer_class, rollout_buffer_kwargs, target_kl, stats_window_size, tensorboard_log, policy_kwargs, verbose, seed, device, _init_setup_model) 169 170 if _init_setup_model: --> 171 self._setup_model() 172 173 def _setup_model(self) -> None: /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in _setup_model(self) 172 173 def _setup_model(self) -> None: --> 174 super()._setup_model() 175 176 # Initialize schedules for policy/value clipping /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in _setup_model(self) 133 **self.rollout_buffer_kwargs, 134 ) --> 135 self.policy = self.policy_class( # type: ignore[assignment] 136 self.observation_space, self.action_space, self.lr_schedule, use_sde=self.use_sde, **self.policy_kwargs 137 ) <ipython-input-24-024273e8a02a> in __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs) 607 def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): 608 # Check if CUDA is available, else use CPU --> 609 self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 610 print(f"CustomMultiInputPolicy using device: {self.device}") 611 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in __setattr__(self, name, value) 2027 # === HACK END === 2028 else: -> 2029 super().__setattr__(name, value) 2030 2031 def __delattr__(self, name): AttributeError: can't set attribute 'device'

The error AttributeError: can't set attribute 'device' occurs because ActorCriticPolicy (the parent class of CustomMultiInputPolicy) in Stable-Baselines3 already defines self.device as an attribute, and PyTorch’s nn.Module (which ActorCriticPolicy inherits from) prevents overwriting it directly in the child class’s __init__ method after the parent’s __init__ is called. This is due to how PyTorch manages module attributes and device handling internally.

To fix this, you should:

Pass the device to the parent class (ActorCriticPolicy) via the device argument in the super().__init__() call, rather than setting self.device directly in the child class.
Ensure all tensors and modules are moved to the chosen device after initialization.

Here’s how to modify your CustomMultiInputPolicy and CustomFeaturesExtractor classes to handle this correctly with an if-else statement for CUDA/CPU selection:

Fixed Code

CustomFeaturesExtractor

import torch
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class CustomFeaturesExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space):
        n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2
        ohlc_dim = observation_space.spaces["ohlc_data"].shape[0]
        max_events = observation_space.spaces["event_ids"].shape[0]
        economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0]
        portfolio_dim = observation_space.spaces["portfolio_data"].shape[0]
        features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim
        
        # Determine device before calling super
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        print(f"CustomFeaturesExtractor using device: {self.device}")
        
        super().__init__(observation_space, features_dim=features_dim)

        # Move embeddings to the selected device
        self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device)
        self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device)

        print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}")

    def forward(self, obs):
        ohlc_data = obs["ohlc_data"].to(self.device)
        event_ids = obs["event_ids"].to(self.device, dtype=torch.long)
        currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long)
        economic_numeric = obs["economic_numeric"].to(self.device)
        portfolio_data = obs["portfolio_data"].to(self.device)

        event_emb = self.event_embedding(event_ids).mean(dim=1)
        currency_emb = self.currency_embedding(currency_ids).mean(dim=1)
        features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1)
        return features

CustomMultiInputPolicy

import torch
import torch.nn as nn
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.torch_layers import MlpExtractor
from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

class CustomMultiInputPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs):
        # Determine device before calling super
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        print(f"CustomMultiInputPolicy using device: {device}")

        # Extract action space bounds and move them to the selected device
        self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device)
        self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device)
        action_dim = action_space.shape[0]  # Number of assets

        # Pass the device to the parent class
        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            features_extractor_class=CustomFeaturesExtractor,
            features_extractor_kwargs={},
            net_arch=dict(pi=[64, 64], vf=[64, 64]),
            device=device,  # Pass device here instead of setting self.device
            *args,
            **kwargs
        )

        features_dim = self.features_extractor.features_dim
        self.mlp_extractor = MlpExtractor(
            features_dim,
            net_arch=self.net_arch,
            activation_fn=nn.ReLU,
            device=device
        ).to(device)

        # Define the action network and value network, move to the selected device
        self.action_net = nn.Linear(64, action_dim * 2).to(device)
        self.value_net = nn.Linear(64, 1).to(device)

        # Initialize the distribution
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)
        self.num_timesteps = 0

    def forward(self, obs, deterministic=False):
        self.num_timesteps += 1

        # Extract features (observations are moved to device by SB3's obs_as_tensor)
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        # Get mean and log_std from action_net
        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

        # Sample actions or get deterministic actions
        actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)

        # Map from [-1, 1] to [0, 3]
        actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)

        # Compute log probabilities
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions)

        # Value prediction
        values = self.value_net(latent_vf)

        if self.num_timesteps % 1000 == 0:
            print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}")

        return actions, values, log_prob

    def _squash_to_range(self, actions, low, high):
        """Scale squashed actions from [-1, 1] to [low, high]."""
        return (actions + 1) * (high - low) / 2 + low

    def extract_features(self, obs):
        return self.features_extractor(obs)

    def predict_values(self, obs):
        features = self.extract_features(obs)
        _, latent_vf = self.mlp_extractor(features)
        return self.value_net(latent_vf)

    def evaluate_actions(self, obs, actions):
        """Evaluate actions for training (used by PPO)."""
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)
        
        # Actions need to be unsquashed back to [-1, 1] for log_prob calculation
        unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1
        
        # Compute log_prob and entropy
        log_prob = distribution.log_prob_from_params(mean_actions, log_std, unsquashed_actions)
        entropy = distribution.entropy_from_params(mean_actions, log_std)
        values = self.value_net(latent_vf)

        return values, log_prob, entropy

Key Changes

Device Assignment:
- Instead of setting self.device directly in CustomMultiInputPolicy.__init__, the device is determined before calling super().__init__ and passed as an argument to the parent class via the device parameter. This leverages Stable-Baselines3’s built-in device management, avoiding the attribute overwrite issue.
- In CustomFeaturesExtractor, self.device is set before super().__init__ because BaseFeaturesExtractor doesn’t manage the device attribute itself, making it safe to define here.
Consistent Device Usage:
- All tensors (action_space_low, action_space_high) and modules (mlp_extractor, action_net, value_net, event_embedding, currency_embedding) are explicitly moved to the selected device using .to(device).
Debugging Output:
- Added print statements to confirm which device is being used (cuda:0 or cpu).

Why the Original Code Failed

Stable-Baselines3’s ActorCriticPolicy (via nn.Module) sets up self.device during its initialization based on the device argument passed to __init__. Attempting to overwrite self.device afterward in the child class (self.device = torch.device(...)) triggers the AttributeError because PyTorch protects certain attributes after module initialization.
By passing the device to super().__init__, we let the parent class handle device setup properly.

Testing the Fix

To verify this works:

Run your training loop again (study.optimize(objective, n_trials=1)).
再次运行训练循环（study.optimize（objective， n_trials=1））。
Check the output for the device selection message (e.g., CustomMultiInputPolicy using device: cuda:0 or cpu).
检查设备选择消息的输出（例如，使用 device： cuda：0 或 cpu 的 CustomMultiInputPolicy）。
Ensure no AttributeError occurs and training proceeds.
确保没有发生 AttributeError 并继续训练。

If CUDA is available, it will use cuda:0; otherwise, it will fall back to cpu. For example:
如果 CUDA 可用，它将使用 cuda：0;否则，它将回退到 CPU。例如：

On a CUDA-enabled system: CustomMultiInputPolicy using device: cuda:0
在支持 CUDA 的系统上：使用设备的 CustomMultiInputPolicy：cuda：0
On a non-CUDA system: CustomMultiInputPolicy using device: cpu
在非 CUDA 系统上：使用 device： cpu 的 CustomMultiInputPolicy

Additional Notes 其他说明

Deprecation Warning: The DeprecationWarning about should_run_async is unrelated to your code and comes from IPython. It won’t affect functionality but indicates a future change in IPython’s async handling. You can ignore it for now or update your IPython version and adjust your notebook accordingly if needed.
弃用警告：关于 should_run_async 的 DeprecationWarning 与您的代码无关，并且来自 IPython。它不会影响功能，但指示 IPython 的异步处理将来会发生变化。您可以暂时忽略它，也可以更新您的 IPython 版本并根据需要相应地调整您的笔记本。
Environment Compatibility: Ensure your train_env_vec and observation space are compatible with PyTorch tensors on the chosen device. Stable-Baselines3’s obs_as_tensor should handle this automatically.
环境兼容性：确保您的train_env_vec和观察空间与所选设备上的 PyTorch 张量兼容。Stable-Baselines3 的 obs_as_tensor 应该会自动处理这个问题。

Let me know if you encounter any further issues!
如果您遇到任何其他问题，请告诉我！

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code) [I 2025-03-13 04:46:44,907] Using an existing study with name 'OHLC_EconomicCalender_ppo_study' instead of creating a new one. Using cuda device CustomMultiInputPolicy using device: cuda:0 CustomFeaturesExtractor using device: cuda:0 CustomFeaturesExtractor: n_assets=1, features_dim=96 [W 2025-03-13 04:46:45,193] Trial 5 failed with parameters: {'learning_rate': 0.000862351915210243, 'n_steps': 8192, 'total_timesteps': 1900000, 'batch_size': 256, 'gamma': 0.9384572784518019, 'gae_lambda': 0.8117121313629536, 'clip_range': 0.2555030351906416} because of the following error: TypeError('SquashedDiagGaussianDistribution.log_prob_from_params() takes 3 positional arguments but 4 were given'). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-33-5234907ae9e5>", line 166, in objective model.learn(total_timesteps=total_timesteps, callback=eval_callback) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 311, in learn return super().learn( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 323, in learn continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 202, in collect_rollouts actions, values, log_probs = self.policy(obs_tensor) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) File "<ipython-input-31-e14142e1ae6a>", line 668, in forward log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions) TypeError: SquashedDiagGaussianDistribution.log_prob_from_params() takes 3 positional arguments but 4 were given [W 2025-03-13 04:46:45,194] Trial 5 failed with value None. --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-33-5234907ae9e5> in <cell line: 183>() 181 load_if_exists=True 182 ) --> 183 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 184 185 # Best parameters 11 frames <ipython-input-31-e14142e1ae6a> in forward(self, obs, deterministic) 666 667 # Compute log probabilities --> 668 log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions) 669 670 # Value prediction TypeError: SquashedDiagGaussianDistribution.log_prob_from_params() takes 3 positional arguments but 4 were given ``` import datetime import math import random import ast import torch import torch.nn as nn import csv import numpy as np import gymnasium as gym from gymnasium import spaces from gymnasium.utils import seeding from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor, MlpExtractor from stable_baselines3.common.policies import ActorCriticPolicy from stable_baselines3.common.torch_layers import MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution from stable_baselines3 import PPO from stable_baselines3.common.policies import ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart import TradingChart from meta.env_fx_trading.util.read_config import EnvConfig class tgym(gym.Env): """forex/future/option trading gym environment 1. Three action space (0 Buy, 1 Sell, 2 Nothing) 2. Multiple trading pairs (EURUSD, GBPUSD...) under same time frame 3. Timeframe from 1 min to daily as long as use candlestick bar (Open, High, Low, Close) 4. Use StopLose, ProfitTaken to realize rewards. each pair can configure it own SL and PT in configure file 5. Configure over night cash penalty and each pair's transaction fee and overnight position holding penalty 6. Split dataset into daily, weekly or monthly..., with fixed time steps, at end of len(df). The business logic will force to Close all positions at last Close price (game over). 7. Must have df column name: [(time_col),(asset_col), Open,Close,High,Low,day] (case sensitive) 8. Addition indicators can add during the data process. 78 available TA indicator from Finta 9. Customized observation list handled in json config file. 10. ProfitTaken = fraction_action * max_profit_taken + SL. 11. SL is pre-fixed 12. Limit order can be configure, if limit_order == True, the action will preset buy or sell at Low or High of the bar, with a limit_order_expiration (n bars). It will be triggered if the price go cross. otherwise, it will be drop off 13. render mode: human -- display each steps realized reward on console file -- create a transaction log graph -- create transaction in graph (under development) 14. 15. Reward, we want to incentivize profit that is sustained over long periods of time. At each step, we will set the reward to the account balance multiplied by some fraction of the number of time steps so far.The purpose of this is to delay rewarding the agent too fast in the early stages and allow it to explore sufficiently before optimizing a single strategy too deeply. It will also reward agents that maintain a higher balance for longer, rather than those who rapidly gain money using unsustainable strategies. 16. Observation_space contains all of the input variables we want our agent to consider before making, or not making a trade. We want our agent to “see” the forex data points (Open price, High, Low, Close, time serial, TA) in the game window, as well a couple other data points like its account balance, current positions, and current profit.The intuition here is that for each time step, we want our agent to consider the price action leading up to the current price, as well as their own portfolio’s status in order to make an informed decision for the next action. 17. reward is forex trading unit Point, it can be configure for each trading pair 18. To make the unrealized profit reward reflect market conditions, we’ll compute ATR for each asset and use it to scale the reward dynamically. """ metadata = {"render.modes": ["graph", "human", "file", "none"]} def __init__( self, df, event_map, currency_map, env_config_file="./neo_finrl/env_fx_trading/config/gdbusd-test-1.json", ): assert df.ndim == 2 super(tgym, self).__init__() self.cf = EnvConfig(env_config_file) self.observation_list = self.cf.env_parameters("observation_list") # Economic data mappings self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy() if 'events' not in self.df.columns: raise ValueError("DataFrame must contain an 'events' column") def parse_events(x): if isinstance(x, str): try: parsed = ast.literal_eval(x) return parsed if isinstance(parsed, list) else [] except (ValueError, SyntaxError): return [] return x if isinstance(x, list) else [] self.df['events'] = self.df['events'].apply(parse_events) if not isinstance(self.df['events'].iloc[0], list): raise ValueError("'events' must be a list") if self.df['events'].iloc[0] and not isinstance(self.df['events'].iloc[0][0], dict): raise ValueError("Elements in 'events' must be dictionaries") self.balance_initial = self.cf.env_parameters("balance") self.over_night_cash_penalty = self.cf.env_parameters("over_night_cash_penalty") self.asset_col = self.cf.env_parameters("asset_col") self.time_col = self.cf.env_parameters("time_col") self.random_start = self.cf.env_parameters("random_start") self.log_filename = ( self.cf.env_parameters("log_filename") + datetime.datetime.now().strftime("%Y%m%d%H%M%S") + ".csv" ) self.analyze_transaction_history_log_filename = ("transaction_history_log" + datetime.datetime.now().strftime("%Y%m%d%H%M%S") + ".csv") self.df["_time"] = self.df[self.time_col] self.df["_day"] = self.df["weekday"] self.assets = self.df[self.asset_col].unique() self.dt_datetime = self.df[self.time_col].sort_values().unique() self.df = self.df.set_index(self.time_col) self.visualization = False # Reset values self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_step = 0 self.episode = 0 # Start from 0, increment on episode end self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True # Cache data self.cached_ohlc_data = [self.get_observation_vector(_dt) for _dt in self.dt_datetime] self.cached_economic_data = [self.get_economic_vector(_dt) for _dt in self.dt_datetime] self.cached_time_serial = ( self.df[["_time", "_day"]].sort_values("_time").drop_duplicates().values.tolist() ) self.reward_range = (-np.inf, np.inf) self.action_space = spaces.Box(low=0, high=3, shape=(len(self.assets),), dtype=np.float32) self.observation_space = spaces.Dict({ "ohlc_data": spaces.Box(low=-np.inf, high=np.inf, shape=(len(self.assets) * len(self.observation_list),), dtype=np.float32), "event_ids": spaces.Box(low=0, high=len(self.event_map)-1, shape=(self.max_events,), dtype=np.int32), "currency_ids": spaces.Box(low=0, high=len(self.currency_map)-1, shape=(self.max_events,), dtype=np.int32), "economic_numeric": spaces.Box(low=-np.inf, high=np.inf, shape=(self.max_events * 6,), dtype=np.float32), "portfolio_data": spaces.Box(low=-np.inf, high=np.inf, shape=(3 + 2 * len(self.assets),), dtype=np.float32) }) print( f"initial done:\n" f"observation_list:{self.observation_list}\n" f"assets:{self.assets}\n" f"time serial: {min(self.dt_datetime)} -> {max(self.dt_datetime)} length: {len(self.dt_datetime)}\n" f"events: {len(self.event_map)}, currencies: {len(self.currency_map)}" ) self._seed() def _seed(self, seed=None): self.np_random, seed = seeding.np_random(seed) return [seed] def _take_action(self, actions, done): # action = math.floor(x), # profit_taken = math.ceil((x- math.floor(x)) * profit_taken_max - stop_loss_max ) # _actions = np.floor(actions).astype(int) # _profit_takens = np.ceil((actions - np.floor(actions)) *self.cf.symbol(self.assets[i],"profit_taken_max")).astype(int) _action = 2 _profit_taken = 0 rewards = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # need use multiply assets for i, action in enumerate(actions): # Actions are now floats between 0 and 3 self._o = self.get_observation(self.current_step, i, "Open") self._h = self.get_observation(self.current_step, i, "High") self._l = self.get_observation(self.current_step, i, "Low") self._c = self.get_observation(self.current_step, i, "Close") self._t = self.get_observation(self.current_step, i, "_time") self._day = self.get_observation(self.current_step, i, "_day") # Extract integer action type and fractional part _action = math.floor(action) # 0=Buy, 1=Sell, 2=Nothing rewards[i] = self._calculate_reward(i, done, _action) # Pass action for exploration reward print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}") if self.cf.symbol(self.assets[i], "limit_order"): self._limit_order_process(i, _action, done) if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")): # Dynamically calculate PT using action fraction _profit_taken = math.ceil( (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max") ) + self.cf.symbol(self.assets[i], "stop_loss_max") self.ticket_id += 1 if self.cf.symbol(self.assets[i], "limit_order"): transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._l if _action == 0 else self._h, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": -1, "CloseStep": -1, } self.transaction_limit_order.append(transaction) else: transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._c, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": self.current_step, "CloseStep": -1, } self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction) return sum(rewards) def _calculate_reward(self, i, done, action): _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[:]: # Copy to avoid modification issues if tr["Symbol"] == self.assets[i]: _point = self.cf.symbol(self.assets[i], "point") # cash discount overnight if self._day > tr["DateDuration"]: tr["DateDuration"] = self._day tr["Reward"] -= self.cf.symbol(self.assets[i], "over_night_penalty") if tr["Type"] == 0: # Buy # stop loss trigger _sl_price = tr["ActionPrice"] - tr["SL"] / _point _pt_price = tr["ActionPrice"] + tr["PT"] / _point if done: p = (self._c - tr["ActionPrice"]) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._l <= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._h >= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: # still open self.current_draw_downs[i] = int((self._l - tr["ActionPrice"]) * _point) _max_draw_down += self.current_draw_downs[i] if self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i]: tr["MaxDD"] = self.current_draw_downs[i] elif tr["Type"] == 1: # Sell # stop loss trigger _sl_price = tr["ActionPrice"] + tr["SL"] / _point _pt_price = tr["ActionPrice"] - tr["PT"] / _point if done: p = (tr["ActionPrice"] - self._c) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._h >= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._l <= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: self.current_draw_downs[i] = int( (tr["ActionPrice"] - self._h) * _point ) _max_draw_down += self.current_draw_downs[i] if ( self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i] ): tr["MaxDD"] = self.current_draw_downs[i] if _max_draw_down > self.max_draw_downs[i]: self.max_draw_downs[i] = _max_draw_down return _total_reward def _limit_order_process(self, i, _action, done): for tr in self.transaction_limit_order[:]: if tr["Symbol"] == self.assets[i]: if tr["Type"] != _action or done: self.transaction_limit_order.remove(tr) tr["Status"] = 3 tr["CloseStep"] = self.current_step self.transaction_history.append(tr) elif (tr["ActionPrice"] >= self._l and _action == 0) or ( tr["ActionPrice"] <= self._h and _action == 1): tr["ActionStep"] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_limit_order.remove(tr) self.transaction_live.append(tr) self.tranaction_open_this_step.append(tr) elif (tr["LimitStep"] + self.cf.symbol(self.assets[i], "limit_order_expiration") > self.current_step): tr["CloseStep"] = self.current_step tr["Status"] = 4 self.transaction_limit_order.remove(tr) self.transaction_history.append(tr) def _manage_tranaction(self, tr, _p, close_price, status=1): self.transaction_live.remove(tr) tr["ClosePrice"] = close_price tr["Point"] = int(_p) tr["Reward"] = int(tr["Reward"] + _p) # Realized profit/loss tr["Status"] = status # 1=SL/PT, 2=Forced close, 3=Canceled limit, 4=Expired limit tr["CloseTime"] = self._t tr["CloseStep"] = self.current_step self.balance += int(tr["Reward"]) self.total_equity -= int(abs(tr["Reward"])) self.tranaction_close_this_step.append(tr) self.transaction_history.append(tr) def analyze_transaction_history(self, log_file): if not self.transaction_history: metrics = {"trades": 0, "win_rate": 0.0, "profit_factor": 0.0, "sharpe_ratio": 0.0, "total_profit": 0.0} else: trades = len(self.transaction_history) rewards = [tr["Reward"] for tr in self.transaction_history] wins = sum(1 for r in rewards if r > 0) losses = sum(1 for r in rewards if r < 0) gross_profit = sum(r for r in rewards if r > 0) gross_loss = abs(sum(r for r in rewards if r < 0)) win_rate = wins / trades if trades > 0 else 0.0 profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf") # Sharpe Ratio (simplified, assumes risk-free rate = 0) returns = np.array(rewards, dtype=np.float32) sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0.0 total_profit = sum(rewards) metrics = { "trades": trades, "win_rate": win_rate, "profit_factor": profit_factor, "sharpe_ratio": sharpe_ratio, "total_profit": total_profit } # Append to log file with open(self.analyze_transaction_history_log_filename, 'a', newline='') as f: writer = csv.DictWriter(f, fieldnames=["timestamp", "episode", "trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]) metrics["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") metrics["episode"] = self.episode writer.writerow(metrics) return metrics def step(self, actions): self.current_step += 1 # Define termination and truncation conditions terminated = self.balance <= 0 # Episode ends due to bankruptcy (terminal state) truncated = self.current_step == len(self.dt_datetime) - 1 # Episode ends due to max steps (time limit) done = terminated or truncated # Combine into a single 'done' flag for VecEnv # For rendering or episode tracking, you might still check if either condition is true if done: self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n" self.visualization = True self.episode += 1 # Increment episode counter # Calculate base trading reward base_reward = self._take_action(actions, done) # Calculate unrealized profit from open positions unrealized_profit = 0 atr_scaling = 0 # For market condition scaling for i, asset in enumerate(self.assets): atr = self.get_observation(self.current_step, i, "ATR") atr_scaling += atr # Sum ATR across assets for normalization for tr in self.transaction_live: if tr["Symbol"] == asset: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1 # Avoid division by 0 # Sustained reward: only applies to unrealized/realized profits, scaled by ATR # adjust 0.01 to 0.05 sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0 # Penalty for inaction if no positions are held if not self.transaction_live and all(math.floor(a) == 2 for a in actions): sustained_reward -= 0.1 # Small penalty to encourage exploration total_reward = base_reward + sustained_reward if self._day > self.current_day: self.current_day = self._day self.balance -= self.over_night_cash_penalty if self.balance != 0: self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100) obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}") # Info dictionary remains unchanged info = {"Close": self.tranaction_close_this_step} return obs, total_reward, terminated, truncated, info def get_observation(self, _step, _iter=0, col=None): if col is None: return self.cached_ohlc_data[_step] if col == "_day": return self.cached_time_serial[_step][1] elif col == "_time": return self.cached_time_serial[_step][0] try: col_pos = self.observation_list.index(col) except ValueError: raise ValueError(f"Column '{col}' not found in observation_list") return self.cached_ohlc_data[_step][_iter * len(self.observation_list) + col_pos] def get_observation_vector(self, _dt, cols=None): cols = self.observation_list if cols is None else cols v = [] for a in self.assets: subset = self.df.query(f'{self.asset_col} == "{a}" & {self.time_col} == "{_dt}"') assert not subset.empty v += subset.loc[_dt, cols].tolist() assert len(v) == len(self.assets) * len(cols) return v def get_economic_vector(self, _dt): subset = self.df.loc[_dt] events = subset['events'] if isinstance(subset, pd.Series) else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) currency_ids = [self.currency_map.get(e['currency'], 0) for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) numeric_fields = ['actual_norm', 'forecast_norm', 'previous_norm', 'surprise_norm', 'event_freq', 'impact_code'] numeric = [e[field] for e in events[:self.max_events] for field in numeric_fields] + [0] * (self.max_events * 6 - len(events) * 6) return { "event_ids": np.array(event_ids, dtype=np.int32), "currency_ids": np.array(currency_ids, dtype=np.int32), "numeric": np.array(numeric, dtype=np.float32) } def reset(self, seed=None, options=None): # Set the seed for reproducibility if seed is not None: self._seed(seed) if self.random_start: self.current_step = random.choice(range(int(len(self.dt_datetime) * 0.5))) else: self.current_step = 0 self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True self.visualization = False obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } info = {} return obs, info def render(self, mode="human", title=None, **kwargs): if mode in ("human", "file"): printout = mode == "human" pm = { "log_header": self.log_header, "log_filename": self.log_filename, "printout": printout, "balance": self.balance, "balance_initial": self.balance_initial, "tranaction_close_this_step": self.tranaction_close_this_step, "done_information": self.done_information, } render_to_file(**pm) if self.log_header: self.log_header = False elif mode == "graph" and self.visualization: print("plotting...") p = TradingChart(self.df, self.transaction_history) p.plot() def close(self): pass def get_sb_env(self): e = DummyVecEnv([lambda: self]) obs = e.reset() return e, obs class CustomFeaturesExtractor(BaseFeaturesExtractor): def __init__(self, observation_space): n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2 ohlc_dim = observation_space.spaces["ohlc_data"].shape[0] max_events = observation_space.spaces["event_ids"].shape[0] economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0] portfolio_dim = observation_space.spaces["portfolio_data"].shape[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # Check if CUDA is available, else use CPU self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomFeaturesExtractor using device: {self.device}") super().__init__(observation_space, features_dim=features_dim) # Move embeddings to the selected device self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device) self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device) print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}") def forward(self, obs): ohlc_data = obs["ohlc_data"].to(self.device) event_ids = obs["event_ids"].to(self.device, dtype=torch.long) currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long) economic_numeric = obs["economic_numeric"].to(self.device) portfolio_data = obs["portfolio_data"].to(self.device) event_emb = self.event_embedding(event_ids).mean(dim=1) currency_emb = self.currency_embedding(currency_ids).mean(dim=1) features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1) return features class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Check if CUDA is available, else use CPU device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomMultiInputPolicy using device: {device}") # Extract action space bounds and move them to the selected device self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=device ).to(device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features (ensure obs is on CUDA via SB3's obs_as_tensor) features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities log_prob = distribution.log_prob_from_params(mean_actions, log_std, actions) # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Actions need to be unsquashed back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob_from_params(mean_actions, log_std, unsquashed_actions) entropy = distribution.entropy_from_params(mean_actions, log_std) values = self.value_net(latent_vf) return values, log_prob, entropy ```

The error TypeError: SquashedDiagGaussianDistribution.log_prob_from_params() takes 3 positional arguments but 4 were given occurs because the log_prob_from_params method of SquashedDiagGaussianDistribution in Stable-Baselines3 expects only two arguments (mean_actions and log_std) plus the implicit self, but your code is passing three arguments (mean_actions, log_std, and actions) plus self. This mismatch happens in both the forward and evaluate_actions methods of your CustomMultiInputPolicy.

In Stable-Baselines3, log_prob_from_params computes the log probability of actions sampled from the distribution defined by mean_actions and log_std, and it doesn’t take an additional actions argument directly. Instead, you should first sample the actions using actions_from_params and then use log_prob to compute the log probability of those specific actions, or adjust how you call the method.

Let’s fix this by ensuring the correct usage of SquashedDiagGaussianDistribution methods. Below is the corrected version of your CustomMultiInputPolicy class, addressing the error and maintaining consistency with Stable-Baselines3’s design.

Fixed CustomMultiInputPolicy

import torch
import torch.nn as nn
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.torch_layers import MlpExtractor
from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

class CustomMultiInputPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs):
        # Check if CUDA is available, else use CPU
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        print(f"CustomMultiInputPolicy using device: {device}")

        # Extract action space bounds and move them to the selected device
        self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device)
        self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device)
        action_dim = action_space.shape[0]  # Number of assets

        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            features_extractor_class=CustomFeaturesExtractor,
            features_extractor_kwargs={},
            net_arch=dict(pi=[64, 64], vf=[64, 64]),
            device=device,  # Pass device to parent class
            *args,
            **kwargs
        )

        features_dim = self.features_extractor.features_dim
        self.mlp_extractor = MlpExtractor(
            features_dim,
            net_arch=self.net_arch,
            activation_fn=nn.ReLU,
            device=device
        ).to(device)

        # Define the action network to output mean and log_std for the Gaussian
        self.action_net = nn.Linear(64, action_dim * 2).to(device)  # Outputs mean and log_std for each asset
        self.value_net = nn.Linear(64, 1).to(device)

        # Initialize the distribution
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)
        self.num_timesteps = 0

    def forward(self, obs, deterministic=False):
        # Increment timestep on each forward pass
        self.num_timesteps += 1

        # Extract features (ensure obs is on the correct device via SB3's obs_as_tensor)
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        # Get mean and log_std from action_net
        action_params = self.action_net(latent_pi)  # [batch, n_assets * 2]
        mean_actions, log_std = action_params.chunk(2, dim=-1)  # Split into mean and log_std
        log_std = torch.clamp(log_std, min=-20, max=2)  # Stabilize log_std

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

        # Sample actions or get deterministic actions
        actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)

        # Map from [-1, 1] to [0, 3]
        squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)

        # Compute log probabilities of the unsquashed actions
        log_prob = distribution.log_prob(actions)  # Use log_prob on the unsquashed actions

        # Value prediction
        values = self.value_net(latent_vf)

        if self.num_timesteps % 1000 == 0:
            print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}")

        return squashed_actions, values, log_prob

    def _squash_to_range(self, actions, low, high):
        """Scale squashed actions from [-1, 1] to [low, high]."""
        return (actions + 1) * (high - low) / 2 + low

    def extract_features(self, obs):
        return self.features_extractor(obs)

    def predict_values(self, obs):
        features = self.extract_features(obs)
        _, latent_vf = self.mlp_extractor(features)
        return self.value_net(latent_vf)

    def evaluate_actions(self, obs, actions):
        """Evaluate actions for training (used by PPO)."""
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

        # Unsquash the actions back to [-1, 1] for log_prob calculation
        unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1

        # Compute log_prob and entropy
        log_prob = distribution.log_prob(unsquashed_actions)  # Use log_prob on unsquashed actions
        entropy = distribution.entropy()  # Entropy doesn’t need actions
        values = self.value_net(latent_vf)

        return values, log_prob, entropy

Key Fixes

Corrected log_prob_from_params Usage:
- In the original code, you called distribution.log_prob_from_params(mean_actions, log_std, actions), but log_prob_from_params only takes mean_actions and log_std (plus self). It computes the log probability of the actions sampled internally.
- Fix: Replaced it with distribution.log_prob(actions) in forward, where actions are the unsquashed actions sampled from actions_from_params. This computes the log probability of the specific actions taken.
- In evaluate_actions, used distribution.log_prob(unsquashed_actions) to compute the log probability of the provided actions (after unsquashing them back to the [-1, 1] range).
Action Squashing:
- Renamed the returned actions in forward to squashed_actions for clarity, since they are mapped to [0, 3]. The log_prob is computed on the unsquashed actions (in [-1, 1]), which is correct for the Gaussian distribution.
Entropy Calculation:
- In evaluate_actions, replaced entropy_from_params with entropy(), as SquashedDiagGaussianDistribution provides an entropy() method that computes the differential entropy of the distribution directly from its parameters, without needing additional arguments.
Device Consistency:
- The device handling remains correct, using torch.device("cuda:0" if torch.cuda.is_available() else "cpu") and passing it to the parent class and all modules/tensors.

Why the Error Occurred

SquashedDiagGaussianDistribution.log_prob_from_params is designed to compute the log probability of actions sampled internally from the distribution defined by mean_actions and log_std. It doesn’t accept an additional actions argument because it assumes you want the log probability of the sampled actions, not arbitrary ones.
Your code passed actions as an extra argument, causing the TypeError. The correct approach is to sample actions first (via actions_from_params) and then compute their log probability (via log_prob).

Verifying the Fix

To ensure this works:

Run the Training Loop:
- Replace your CustomMultiInputPolicy with this corrected version and run study.optimize(objective, n_trials=1). The TypeError should no longer occur.
Check Outputs:
- Verify that actions, values, and log_prob are returned correctly in forward. The debug print statement will show the squashed actions in [0, 3].
Monitor Training:
- Ensure the PPO training progresses without errors, and the policy learns as expected.

Additional Notes

Deprecation Warning: The DeprecationWarning from IPython is unrelated to your code and can be ignored for now. It’s a notebook-specific warning about async execution and won’t affect your training.
Environment Consistency: Your tgym environment looks fine, and its observation/action spaces match the policy’s expectations (Box(low=0, high=3, shape=(n_assets,)) for actions).
Stable-Baselines3 Version: Ensure you’re using a recent version of Stable-Baselines3 (e.g., >=2.0.0) to avoid potential API mismatches. You can check with import stable_baselines3; print(stable_baselines3.__version__).

Let me know if you encounter any further issues or need additional debugging help!

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code) [I 2025-03-13 04:59:20,437] Using an existing study with name 'OHLC_EconomicCalender_ppo_study' instead of creating a new one. [W 2025-03-13 04:59:20,530] Trial 6 failed with parameters: {'learning_rate': 0.0006996125755627991, 'n_steps': 1024, 'total_timesteps': 2000000, 'batch_size': 512, 'gamma': 0.9925093577931308, 'gae_lambda': 0.8250343770383468, 'clip_range': 0.1597628892936832} because of the following error: TypeError("ActorCriticPolicy.__init__() got an unexpected keyword argument 'device'"). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-36-5234907ae9e5>", line 139, in objective model = PPO( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 171, in __init__ self._setup_model() File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 174, in _setup_model super()._setup_model() File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 135, in _setup_model self.policy = self.policy_class( # type: ignore[assignment] File "<ipython-input-34-7ec042f512c9>", line 618, in __init__ super().__init__( TypeError: ActorCriticPolicy.__init__() got an unexpected keyword argument 'device' [W 2025-03-13 04:59:20,531] Trial 6 failed with value None. Using cuda device CustomMultiInputPolicy using device: cuda:0 --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-36-5234907ae9e5> in <cell line: 183>() 181 load_if_exists=True 182 ) --> 183 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 184 185 # Best parameters 9 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-36-5234907ae9e5> in objective(trial) 137 138 # Train PPO model on training set --> 139 model = PPO( 140 CustomMultiInputPolicy, 141 train_env_vec, /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in __init__(self, policy, env, learning_rate, n_steps, batch_size, n_epochs, gamma, gae_lambda, clip_range, clip_range_vf, normalize_advantage, ent_coef, vf_coef, max_grad_norm, use_sde, sde_sample_freq, rollout_buffer_class, rollout_buffer_kwargs, target_kl, stats_window_size, tensorboard_log, policy_kwargs, verbose, seed, device, _init_setup_model) 169 170 if _init_setup_model: --> 171 self._setup_model() 172 173 def _setup_model(self) -> None: /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in _setup_model(self) 172 173 def _setup_model(self) -> None: --> 174 super()._setup_model() 175 176 # Initialize schedules for policy/value clipping /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in _setup_model(self) 133 **self.rollout_buffer_kwargs, 134 ) --> 135 self.policy = self.policy_class( # type: ignore[assignment] 136 self.observation_space, self.action_space, self.lr_schedule, use_sde=self.use_sde, **self.policy_kwargs 137 ) <ipython-input-34-7ec042f512c9> in __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs) 616 action_dim = action_space.shape[0] # Number of assets 617 --> 618 super().__init__( 619 observation_space, 620 action_space, TypeError: ActorCriticPolicy.__init__() got an unexpected keyword argument 'device' ``` class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Check if CUDA is available, else use CPU device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomMultiInputPolicy using device: {device}") # Extract action space bounds and move them to the selected device self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), device=device, # Pass device to parent class *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=device ).to(device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities of the unsquashed actions log_prob = distribution.log_prob(actions) # Use log_prob on the unsquashed actions # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return squashed_actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Unsquash the actions back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob(unsquashed_actions) # Use log_prob on unsquashed actions entropy = distribution.entropy() # Entropy doesn’t need actions values = self.value_net(latent_vf) return values, log_prob, entropy ```

The error TypeError: ActorCriticPolicy.__init__() got an unexpected keyword argument 'device' occurs because the ActorCriticPolicy class in Stable-Baselines3 does not accept a device keyword argument directly in its __init__ method in the version you’re using. Instead, the device is typically set when instantiating the PPO model (e.g., PPO(..., device="cuda")), and Stable-Baselines3 handles device placement internally.

To fix this, you should:

Remove the device argument from the super().__init__() call in CustomMultiInputPolicy, as it’s not expected by ActorCriticPolicy.
Determine the device (CUDA or CPU) before initialization and use it to move tensors and modules explicitly, relying on Stable-Baselines3 to set self.device correctly based on the PPO model’s device setting.

Below is the corrected version of your CustomMultiInputPolicy class:

Fixed CustomMultiInputPolicy

import torch
import torch.nn as nn
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.torch_layers import MlpExtractor
from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

class CustomMultiInputPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs):
        # Check if CUDA is available, else use CPU
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        print(f"CustomMultiInputPolicy using device: {device}")

        # Extract action space bounds and move them to the selected device
        self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device)
        self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device)
        action_dim = action_space.shape[0]  # Number of assets

        # Remove 'device' from super().__init__ call
        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            features_extractor_class=CustomFeaturesExtractor,
            features_extractor_kwargs={},
            net_arch=dict(pi=[64, 64], vf=[64, 64]),
            *args,
            **kwargs
        )

        # Use the device set by Stable-Baselines3 (self.device) or override with our choice
        self.device = device  # This will be overridden by PPO if specified differently
        features_dim = self.features_extractor.features_dim
        self.mlp_extractor = MlpExtractor(
            features_dim,
            net_arch=self.net_arch,
            activation_fn=nn.ReLU,
            device=self.device
        ).to(self.device)

        # Define the action network to output mean and log_std for the Gaussian
        self.action_net = nn.Linear(64, action_dim * 2).to(self.device)  # Outputs mean and log_std for each asset
        self.value_net = nn.Linear(64, 1).to(self.device)

        # Initialize the distribution
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)
        self.num_timesteps = 0

    def forward(self, obs, deterministic=False):
        # Increment timestep on each forward pass
        self.num_timesteps += 1

        # Extract features (obs is moved to self.device by SB3's obs_as_tensor)
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        # Get mean and log_std from action_net
        action_params = self.action_net(latent_pi)  # [batch, n_assets * 2]
        mean_actions, log_std = action_params.chunk(2, dim=-1)  # Split into mean and log_std
        log_std = torch.clamp(log_std, min=-20, max=2)  # Stabilize log_std

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

        # Sample actions or get deterministic actions
        actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)

        # Map from [-1, 1] to [0, 3]
        squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)

        # Compute log probabilities of the unsquashed actions
        log_prob = distribution.log_prob(actions)  # Use log_prob on the unsquashed actions

        # Value prediction
        values = self.value_net(latent_vf)

        if self.num_timesteps % 1000 == 0:
            print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}")

        return squashed_actions, values, log_prob

    def _squash_to_range(self, actions, low, high):
        """Scale squashed actions from [-1, 1] to [low, high]."""
        return (actions + 1) * (high - low) / 2 + low

    def extract_features(self, obs):
        return self.features_extractor(obs)

    def predict_values(self, obs):
        features = self.extract_features(obs)
        _, latent_vf = self.mlp_extractor(features)
        return self.value_net(latent_vf)

    def evaluate_actions(self, obs, actions):
        """Evaluate actions for training (used by PPO)."""
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

        # Unsquash the actions back to [-1, 1] for log_prob calculation
        unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1

        # Compute log_prob and entropy
        log_prob = distribution.log_prob(unsquashed_actions)  # Use log_prob on unsquashed actions
        entropy = distribution.entropy()  # Entropy doesn’t need actions
        values = self.value_net(latent_vf)

        return values, log_prob, entropy

Key Changes

Removed device from super().__init__:
- The ActorCriticPolicy class does not accept a device argument directly in its __init__ in Stable-Baselines3 (at least not in the version you’re using). Instead, the device is set by the PPO model when it initializes the policy, based on the device argument passed to PPO(...).
Device Handling:
- Determined the device (cuda:0 or cpu) before calling super().__init__ and used it to initialize self.action_space_low and self.action_space_high.
- After super().__init__, set self.device explicitly to our chosen device. However, note that if you specify a different device when creating the PPO model (e.g., PPO(..., device="cpu")), Stable-Baselines3 will override self.device. To ensure consistency, you should pass the device to the PPO constructor (see below).
Consistency with Stable-Baselines3:
- Stable-Baselines3’s obs_as_tensor in collect_rollouts will move observations to self.device, which is set by the PPO model. By aligning our tensors and modules with this device, we avoid device mismatch errors.

Updating

Interrupted

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code) [I 2025-03-13 06:18:47,024] Using an existing study with name 'OHLC_EconomicCalender_ppo_study' instead of creating a new one. [W 2025-03-13 06:18:47,130] Trial 7 failed with parameters: {'learning_rate': 1.0524647940358103e-05, 'n_steps': 7168, 'total_timesteps': 1300000, 'batch_size': 512, 'gamma': 0.9870184012478126, 'gae_lambda': 0.8054870201604556, 'clip_range': 0.21413881244972097} because of the following error: AttributeError("can't set attribute 'device'"). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-40-5234907ae9e5>", line 139, in objective model = PPO( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 171, in __init__ self._setup_model() File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 174, in _setup_model super()._setup_model() File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 135, in _setup_model self.policy = self.policy_class( # type: ignore[assignment] File "<ipython-input-38-ca7aa31740ab>", line 629, in __init__ self.device = device File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2029, in __setattr__ super().__setattr__(name, value) AttributeError: can't set attribute 'device' [W 2025-03-13 06:18:47,131] Trial 7 failed with value None. Using cuda device CustomMultiInputPolicy using device: cuda:0 CustomFeaturesExtractor using device: cuda:0 CustomFeaturesExtractor: n_assets=1, features_dim=96 --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-40-5234907ae9e5> in <cell line: 183>() 181 load_if_exists=True 182 ) --> 183 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 184 185 # Best parameters 10 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-40-5234907ae9e5> in objective(trial) 137 138 # Train PPO model on training set --> 139 model = PPO( 140 CustomMultiInputPolicy, 141 train_env_vec, /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in __init__(self, policy, env, learning_rate, n_steps, batch_size, n_epochs, gamma, gae_lambda, clip_range, clip_range_vf, normalize_advantage, ent_coef, vf_coef, max_grad_norm, use_sde, sde_sample_freq, rollout_buffer_class, rollout_buffer_kwargs, target_kl, stats_window_size, tensorboard_log, policy_kwargs, verbose, seed, device, _init_setup_model) 169 170 if _init_setup_model: --> 171 self._setup_model() 172 173 def _setup_model(self) -> None: /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in _setup_model(self) 172 173 def _setup_model(self) -> None: --> 174 super()._setup_model() 175 176 # Initialize schedules for policy/value clipping /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in _setup_model(self) 133 **self.rollout_buffer_kwargs, 134 ) --> 135 self.policy = self.policy_class( # type: ignore[assignment] 136 self.observation_space, self.action_space, self.lr_schedule, use_sde=self.use_sde, **self.policy_kwargs 137 ) <ipython-input-38-ca7aa31740ab> in __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs) 627 ) 628 --> 629 self.device = device 630 features_dim = self.features_extractor.features_dim 631 self.mlp_extractor = MlpExtractor( /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in __setattr__(self, name, value) 2027 # === HACK END === 2028 else: -> 2029 super().__setattr__(name, value) 2030 2031 def __delattr__(self, name): AttributeError: can't set attribute 'device' ``` class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Check if CUDA is available, else use CPU device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomMultiInputPolicy using device: {device}") # Extract action space bounds and move them to the selected device self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) self.device = device features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=self.device ).to(self.device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(self.device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(self.device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities of the unsquashed actions log_prob = distribution.log_prob(actions) # Use log_prob on the unsquashed actions # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return squashed_actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Unsquash the actions back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob(unsquashed_actions) # Use log_prob on unsquashed actions entropy = distribution.entropy() # Entropy doesn’t need actions values = self.value_net(latent_vf) return values, log_prob, entropy ```

The error AttributeError: can't set attribute 'device' occurs because ActorCriticPolicy (a subclass of torch.nn.Module) has already set up its internal state, including the device attribute, by the time super().__init__() is called. PyTorch’s nn.Module protects certain attributes like device after initialization, and attempting to overwrite self.device directly in the child class (self.device = device) after the parent’s initialization triggers this error.

To resolve this:

Avoid Overwriting self.device: Instead of setting self.device after super().__init__(), determine the device beforehand and use it only for initializing tensors and modules. Let Stable-Baselines3 manage self.device based on the device passed to the PPO constructor.
Pass Device to PPO: Ensure the device is specified when creating the PPO model, so Stable-Baselines3 propagates it correctly to the policy.

Here’s the corrected version of your CustomMultiInputPolicy:

Fixed CustomMultiInputPolicy

import torch
import torch.nn as nn
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.torch_layers import MlpExtractor
from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution

class CustomMultiInputPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs):
        # Determine device before initialization
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        print(f"CustomMultiInputPolicy initializing with device: {device}")

        # Extract action space bounds and move them to the selected device
        self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device)
        self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device)
        action_dim = action_space.shape[0]  # Number of assets

        # Call parent constructor without overriding device
        super().__init__(
            observation_space,
            action_space,
            lr_schedule,
            features_extractor_class=CustomFeaturesExtractor,
            features_extractor_kwargs={},
            net_arch=dict(pi=[64, 64], vf=[64, 64]),
            *args,
            **kwargs
        )

        # Use the device from Stable-Baselines3 (set by PPO) or our chosen device
        features_dim = self.features_extractor.features_dim
        self.mlp_extractor = MlpExtractor(
            features_dim,
            net_arch=self.net_arch,
            activation_fn=nn.ReLU,
            device=device  # Use local device variable here
        ).to(device)

        # Define the action network and value network, move to the selected device
        self.action_net = nn.Linear(64, action_dim * 2).to(device)  # Outputs mean and log_std for each asset
        self.value_net = nn.Linear(64, 1).to(device)

        # Initialize the distribution
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)
        self.num_timesteps = 0

    def forward(self, obs, deterministic=False):
        # Increment timestep on each forward pass
        self.num_timesteps += 1

        # Extract features (obs is moved to self.device by SB3's obs_as_tensor)
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        # Get mean and log_std from action_net
        action_params = self.action_net(latent_pi)  # [batch, n_assets * 2]
        mean_actions, log_std = action_params.chunk(2, dim=-1)  # Split into mean and log_std
        log_std = torch.clamp(log_std, min=-20, max=2)  # Stabilize log_std

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

        # Sample actions or get deterministic actions
        actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)

        # Map from [-1, 1] to [0, 3]
        squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)

        # Compute log probabilities of the unsquashed actions
        log_prob = distribution.log_prob(actions)  # Use log_prob on the unsquashed actions

        # Value prediction
        values = self.value_net(latent_vf)

        if self.num_timesteps % 1000 == 0:
            print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}")

        return squashed_actions, values, log_prob

    def _squash_to_range(self, actions, low, high):
        """Scale squashed actions from [-1, 1] to [low, high]."""
        return (actions + 1) * (high - low) / 2 + low

    def extract_features(self, obs):
        return self.features_extractor(obs)

    def predict_values(self, obs):
        features = self.extract_features(obs)
        _, latent_vf = self.mlp_extractor(features)
        return self.value_net(latent_vf)

    def evaluate_actions(self, obs, actions):
        """Evaluate actions for training (used by PPO)."""
        features = self.extract_features(obs)
        latent_pi, latent_vf = self.mlp_extractor(features)

        action_params = self.action_net(latent_pi)
        mean_actions, log_std = action_params.chunk(2, dim=-1)
        log_std = torch.clamp(log_std, min=-20, max=2)

        # Create a fresh distribution instance
        distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

        # Unsquash the actions back to [-1, 1] for log_prob calculation
        unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1

        # Compute log_prob and entropy
        log_prob = distribution.log_prob(unsquashed_actions)  # Use log_prob on unsquashed actions
        entropy = distribution.entropy()  # Entropy doesn’t need actions
        values = self.value_net(latent_vf)

        return values, log_prob, entropy

Key Changes

Removed self.device = device:
- Instead of setting self.device after super().__init__(), we define a local device variable before the parent initialization and use it to place tensors and modules. This avoids conflicting with the self.device attribute set by ActorCriticPolicy.
Device Propagation:
- Stable-Baselines3 sets self.device based on the device argument passed to the PPO constructor (e.g., PPO(..., device="cuda")). By not overwriting it, we rely on this mechanism while ensuring our tensors and modules are initialized on the correct device.
Consistency:
- All tensors (self.action_space_low, self.action_space_high) and modules (self.mlp_extractor, self.action_net, self.value_net) are moved to the local device variable during initialization, ensuring they match the device Stable-Baselines3 will use.

Updating the PPO Instantiation

To ensure the device is correctly set, update your objective function to pass the device to the PPO constructor explicitly. Here’s an example:

from stable_baselines3 import PPO

def objective(trial):
    # Hyperparameters from Optuna
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical("n_steps", [1024, 2048, 4096, 8192])
    total_timesteps = trial.suggest_categorical("total_timesteps", [1000000, 1500000, 2000000])
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    gae_lambda = trial.suggest_float("gae_lambda", 0.8, 0.95)
    clip_range = trial.suggest_float("clip_range", 0.1, 0.3)

    # Choose device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device for PPO: {device}")

    # Train PPO model on training set
    model = PPO(
        CustomMultiInputPolicy,
        train_env_vec,  # Your training environment
        learning_rate=learning_rate,
        n_steps=n_steps,
        batch_size=batch_size,
        gamma=gamma,
        gae_lambda=gae_lambda,
        clip_range=clip_range,
        verbose=1,
        device=device,  # Pass device explicitly
    )

    # Training
    model.learn(total_timesteps=total_timesteps, callback=eval_callback)

    # Evaluation logic (e.g., profit calculation)
    val_avg_profit = evaluate(model, val_env_vec, n_episodes=10)
    return val_avg_profit

# Run optimization
study.optimize(objective, n_trials=1)

Why the Error Occurred

ActorCriticPolicy inherits from torch.nn.Module, and after super().__init__() is called, PyTorch sets up module attributes, including device, based on the device specified in the PPO constructor. Attempting to overwrite self.device afterward violates PyTorch’s attribute protection, resulting in the AttributeError.
Stable-Baselines3 manages the device internally when you pass it to the PPO constructor, and the policy inherits this setting via self.device. Overwriting it manually in the child class conflicts with this mechanism.

Verifying the Fix

Update Your Code:
- Replace your CustomMultiInputPolicy with the corrected version above.
- Update your objective function to pass the device argument to PPO as shown.
Run the Training:
- Execute study.optimize(objective, n_trials=1). The AttributeError should be resolved, and training should proceed.
Check Device Usage:
- The print statement in CustomMultiInputPolicy.__init__ will confirm the device (e.g., cuda:0 or cpu). During training, ensure all tensors (e.g., actions, mean_actions) are on the same device as self.device (set by PPO).

Additional Notes

Stable-Baselines3 Version: Verify your version with import stable_baselines3; print(stable_baselines3.__version__). If it’s outdated (e.g., <2.0.0), consider upgrading (pip install --upgrade stable-baselines3) to ensure compatibility with modern device handling.
Deprecation Warning: The IPython warning is unrelated to your code and can be ignored. It’s a notebook-specific issue about async execution.
Device Consistency: By passing device to PPO and avoiding manual overwriting of self.device, you ensure consistency between the policy and the training process.

Let me know if you need further assistance or encounter additional errors!

[I 2025-03-13 06:29:49,206] Using an existing study with name 'OHLC_EconomicCalender_ppo_study' instead of creating a new one. Using device for PPO: cuda Using cuda device CustomMultiInputPolicy using device: cuda:0 CustomFeaturesExtractor using device: cuda:0 CustomFeaturesExtractor: n_assets=1, features_dim=96 [W 2025-03-13 06:29:49,410] Trial 8 failed with parameters: {'learning_rate': 0.00017208292151274393, 'n_steps': 1024, 'total_timesteps': 1300000, 'batch_size': 256, 'gamma': 0.979195921305903, 'gae_lambda': 0.8985151006724164, 'clip_range': 0.2814463936700019} because of the following error: ValueError('not enough values to unpack (expected 5, got 4)'). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-46-fd8dcce349f6>", line 171, in objective model.learn(total_timesteps=total_timesteps, callback=eval_callback) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 311, in learn return super().learn( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 323, in learn continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 224, in collect_rollouts if not callback.on_step(): File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/callbacks.py", line 114, in on_step return self._on_step() File "<ipython-input-46-fd8dcce349f6>", line 109, in _on_step current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit) File "<ipython-input-46-fd8dcce349f6>", line 16, in evaluate obs, rewards, terminated, truncated, info = env_vec.step(action) ValueError: not enough values to unpack (expected 5, got 4) [W 2025-03-13 06:29:49,413] Trial 8 failed with value None. Asset XAUUSD: Action=0.0, Reward=0, Holding=0 Step 1: Base Reward=0, Sustained Reward=0, Total=0, Balance=99900 Asset XAUUSD: Action=0.0, Reward=0, Holding=0 Step 1: Base Reward=0, Sustained Reward=0, Total=0, Balance=99900 --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-46-fd8dcce349f6> in <cell line: 188>() 186 load_if_exists=True 187 ) --> 188 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 189 190 # Choose device 11 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-46-fd8dcce349f6> in objective(trial) 169 170 # print(model.policy) # Should show mlp_extractor with in_features=95 --> 171 model.learn(total_timesteps=total_timesteps, callback=eval_callback) 172 173 val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 309 progress_bar: bool = False, 310 ) -> SelfPPO: --> 311 return super().learn( 312 total_timesteps=total_timesteps, 313 callback=callback, /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 321 322 while self.num_timesteps < total_timesteps: --> 323 continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) 324 325 if not continue_training: /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps) 222 # Give access to local variables 223 callback.update_locals(locals()) --> 224 if not callback.on_step(): 225 return False 226 /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/callbacks.py in on_step(self) 112 self.num_timesteps = self.model.num_timesteps 113 --> 114 return self._on_step() 115 116 def on_training_end(self) -> None: <ipython-input-46-fd8dcce349f6> in _on_step(self) 107 if self.last_mean_reward is not None: 108 # Use profit or reward based on use_profit flag --> 109 current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit) 110 if current_metric > self.best_metric + self.min_delta: 111 self.best_metric = current_metric <ipython-input-46-fd8dcce349f6> in evaluate(model, env_vec, n_episodes, return_mean_reward) 14 while not np.all(done) and step_count < max_steps: 15 action, _ = model.predict(obs, deterministic=True) ---> 16 obs, rewards, terminated, truncated, info = env_vec.step(action) 17 done = np.logical_or(terminated, truncated) 18 episode_rewards += rewards ValueError: not enough values to unpack (expected 5, got 4) ``` # Evaluation function def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False): total_rewards = [] total_profits = [] # Track actual trading profit metrics = [] for _ in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 # Sum of realized profits/losses step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=True) obs, rewards, terminated, truncated, info = env_vec.step(action) done = np.logical_or(terminated, truncated) episode_rewards += rewards # Extract profit from closed transactions for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] # Add realized reward (profit/loss) step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) # Logs to file here mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) avg_metrics = { k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys() } print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Average Metrics: {avg_metrics}") return mean_reward if return_mean_reward else mean_profit # Return profit instead of reward for optimization # Evaluation function (with returns for QuantStats) def final_evaluate(model, env_vec, n_episodes=20, return_mean_reward=False): total_rewards = [] total_profits = [] metrics = [] returns = [] # For QuantStats for ep in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) episode_returns = [] # For QuantStats while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=True) obs, rewards, terminated, truncated, info = env_vec.step(action) done = np.logical_or(terminated, truncated) episode_rewards += rewards for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] episode_returns.append(tr["Reward"]) # Track per-trade returns step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) returns.extend(episode_returns if episode_returns else [episode_profit]) # Fallback to total profit mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()} print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Metrics: {avg_metrics}") # QuantStats reporting (for test evaluation) returns_series = pd.Series(returns, index=pd.date_range(start="2025-03-12", periods=len(returns), freq="D")) qs.reports.html(returns_series, output="quantstats_report.html", title="FX Trading Performance") return mean_reward if return_mean_reward else mean_profit # Custom early stopping callback class EarlyStoppingCallback(EvalCallback): def __init__(self, eval_env, eval_freq, n_eval_episodes, patience, min_delta, verbose=0, use_profit=False, best_model_save_path=None): super().__init__( eval_env=eval_env, eval_freq=eval_freq, n_eval_episodes=n_eval_episodes, verbose=verbose, deterministic=True ) self.patience = patience self.min_delta = min_delta self.best_metric = -float('inf') self.no_improvement_count = 0 self.use_profit = use_profit # Toggle between reward and profit self.best_model_save_path = best_model_save_path def _on_step(self): continue_training = super()._on_step() if not continue_training: return False if self.last_mean_reward is not None: # Use profit or reward based on use_profit flag current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit) if current_metric > self.best_metric + self.min_delta: self.best_metric = current_metric self.no_improvement_count = 0 if self.verbose > 0: print(f"New best {'profit' if self.use_profit else 'reward'}: {self.best_metric:.2f}") if self.best_model_save_path: self.model.save(self.best_model_save_path) else: self.no_improvement_count += 1 if self.verbose > 0: print(f"No improvement for {self.no_improvement_count}/{self.patience} evaluations") if self.no_improvement_count >= self.patience: if self.verbose > 0: print(f"Early stopping triggered after {self.patience} evaluations without improvement") return False return True # Objective function for Optuna def objective(trial): # Define hyperparameter search space learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True) n_steps = trial.suggest_int("n_steps", 1024, 8192, step=1024) total_timesteps = trial.suggest_int("total_timesteps", 500000, 2000000, step=100000) batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512]) gamma = trial.suggest_float("gamma", 0.9, 0.9999) gae_lambda = trial.suggest_float("gae_lambda", 0.8, 0.99) clip_range = trial.suggest_float("clip_range", 0.1, 0.3) # Choose device device = "cuda" if torch.cuda.is_available() else "cpu" print(f"Using device for PPO: {device}") # Train PPO model on training set model = PPO( CustomMultiInputPolicy, train_env_vec, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, n_epochs=100, # Large fixed value gamma=gamma, gae_lambda=gae_lambda, clip_range=clip_range, verbose=1, # 0: No output during training, # 1: Prints basic training progress, # 2: More detailed output (Additional details like optimization steps, loss values (e.g., policy loss, value loss), and learning rate updates.) device=device, # Pass device explicitly ) eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # More frequent n_eval_episodes=10, # More episodes for stability patience=5, # Stop if no improvement after 5 evaluations min_delta=0.01, # Minimum improvement to consider verbose=1, use_profit=True, # Track profit for trading focus best_model_save_path=f"./best_model/trial_{trial.number}/" # Save best model ) # print(model.policy) # Should show mlp_extractor with in_features=95 model.learn(total_timesteps=total_timesteps, callback=eval_callback) val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function print(f"Validation Average Profit: {val_avg_profit:.2f}") return val_avg_profit # Maximize reward # Specify the SQLite database file db_path = 'optuna_study.db' # Run optimization study = optuna.create_study( study_name='OHLC_EconomicCalender_ppo_study', storage=f'sqlite:///{db_path}', direction="maximize", load_if_exists=True ) study.optimize(objective, n_trials=1) # Adjust number of trials based on resources # Choose device device = "cuda" if torch.cuda.is_available() else "cpu" print(f"Using device for PPO: {device}") # Best parameters print("Best hyperparameters:", study.best_params) print("Best validation reward:", study.best_value) best_params = study.best_params # Train final model with best parameters on training set best_model = PPO( CustomMultiInputPolicy, train_env_vec, # Use full training environment learning_rate=best_params["learning_rate"], n_steps=best_params["n_steps"], batch_size=best_params["batch_size"], n_epochs=100, # Large fixed value gamma=best_params["gamma"], gae_lambda=best_params["gae_lambda"], clip_range=best_params["clip_range"], verbose=1, tensorboard_log="./tensorboard_logs/", device=device, # Pass device explicitly ) # Early stopping callback on validation set best_trial = study.best_trial.number eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # more frequent evaluations (e.g., every 5,000 steps) n_eval_episodes=10, patience=5, min_delta=0.01, verbose=1, use_profit=True, # use mean_profits best_model_save_path=f"./best_model/trial_{best_trial}/" ) # Train the final model best_model.learn(total_timesteps=best_params["total_timesteps"], callback=eval_callback) # Save the final model final_model.save(f"ppo_xauusd_optimized_trial_{best_trial}") # Evaluate on test data with QuantStats print("\nEvaluating Final Model on Test Data:") test_avg_profit = final_evaluate(best_model, test_env_vec, n_episodes=20) print(f"Test Average Profit: {test_avg_profit:.2f}") # Optional: Load and re-evaluate to verify # loaded_model = PPO.load("ppo_xauusd_optimized") # test_avg_profit_loaded = final_evaluate(loaded_model, test_env_vec, n_episodes=10) # print(f"Test Average Profit (Loaded Model): {test_avg_profit_loaded:.2f}") # Clean up train_env_vec.close() val_env_vec.close() test_env_vec.close() ``` ``` def step(self, actions): self.current_step += 1 # Define termination and truncation conditions terminated = self.balance <= 0 # Episode ends due to bankruptcy (terminal state) truncated = self.current_step == len(self.dt_datetime) - 1 # Episode ends due to max steps (time limit) done = terminated or truncated # Combine into a single 'done' flag for VecEnv # For rendering or episode tracking, you might still check if either condition is true if done: self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n" self.visualization = True self.episode += 1 # Increment episode counter # Calculate base trading reward base_reward = self._take_action(actions, done) # Calculate unrealized profit from open positions unrealized_profit = 0 atr_scaling = 0 # For market condition scaling for i, asset in enumerate(self.assets): atr = self.get_observation(self.current_step, i, "ATR") atr_scaling += atr # Sum ATR across assets for normalization for tr in self.transaction_live: if tr["Symbol"] == asset: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1 # Avoid division by 0 # Sustained reward: only applies to unrealized/realized profits, scaled by ATR # adjust 0.01 to 0.05 sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0 # Penalty for inaction if no positions are held if not self.transaction_live and all(math.floor(a) == 2 for a in actions): sustained_reward -= 0.1 # Small penalty to encourage exploration total_reward = base_reward + sustained_reward if self._day > self.current_day: self.current_day = self._day self.balance -= self.over_night_cash_penalty if self.balance != 0: self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100) obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}") # Info dictionary remains unchanged info = {"Close": self.tranaction_close_this_step} return obs, total_reward, terminated, truncated, info ``` my step() return 5 value, why it just got 4?

The error ValueError: not enough values to unpack (expected 5, got 4) occurs because your step() method in the tgym environment returns 5 values (obs, total_reward, terminated, truncated, info), but Stable-Baselines3’s vectorized environment (VecEnv) expects the step() method to return only 4 values: obs, rewards, dones, infos. This mismatch happens because Stable-Baselines3 (and Gymnasium prior to version 0.26.0) does not natively support the newer Gymnasium API where step() returns 5 values (obs, reward, terminated, truncated, info).

Why This Happens

Your Environment (tgym): You’ve updated step() to return 5 values (obs, total_reward, terminated, truncated, info) to align with the modern Gymnasium API (introduced in Gymnasium 0.26.0+), which separates terminated (goal reached or failure) and truncated (time limit or external interruption) as distinct signals.
Stable-Baselines3: Stable-Baselines3 uses the older Gym API (or a compatible version of Gymnasium) where step() returns 4 values: obs, reward, done, info. Here, done combines both termination and truncation into a single boolean flag. Stable-Baselines3’s VecEnv (e.g., DummyVecEnv) expects this 4-tuple format when calling env_vec.step(action).

When you call obs, rewards, terminated, truncated, info = env_vec.step(action) in your evaluate function, Stable-Baselines3’s DummyVecEnv provides only 4 values, but your code expects 5, leading to the ValueError.

How to Fix It

To resolve this, you need to align your environment’s step() method with what Stable-Baselines3 expects: a 4-tuple return value (obs, reward, done, info). You can combine terminated and truncated into a single done flag, as Stable-Baselines3 does not distinguish between them internally.

Here’s the corrected step() method for your tgym class:

Fixed step() Method

def step(self, actions):
    self.current_step += 1

    # Define termination and truncation conditions
    terminated = self.balance <= 0  # Episode ends due to bankruptcy (terminal state)
    truncated = self.current_step == len(self.dt_datetime) - 1  # Episode ends due to max steps (time limit)
    done = terminated or truncated  # Combine into a single 'done' flag for VecEnv

    # For rendering or episode tracking, you might still check if either condition is true
    if done:
        self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n"
        self.visualization = True
        self.episode += 1  # Increment episode counter

    # Calculate base trading reward
    base_reward = self._take_action(actions, done)

    # Calculate unrealized profit from open positions
    unrealized_profit = 0
    atr_scaling = 0  # For market condition scaling
    for i, asset in enumerate(self.assets):
        atr = self.get_observation(self.current_step, i, "ATR")
        atr_scaling += atr  # Sum ATR across assets for normalization
        for tr in self.transaction_live:
            if tr["Symbol"] == asset:
                if tr["Type"] == 0:  # Buy
                    unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point")
                else:  # Sell
                    unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point")
                unrealized_profit += unrealized

    atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1  # Avoid division by 0

    # Sustained reward: only applies to unrealized/realized profits, scaled by ATR
    sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0
    
    # Penalty for inaction if no positions are held
    if not self.transaction_live and all(math.floor(a) == 2 for a in actions):
        sustained_reward -= 0.1  # Small penalty to encourage exploration

    total_reward = base_reward + sustained_reward

    if self._day > self.current_day:
        self.current_day = self._day
        self.balance -= self.over_night_cash_penalty

    if self.balance != 0:
        self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100)

    obs = {
        "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32),
        "event_ids": self.cached_economic_data[self.current_step]["event_ids"],
        "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"],
        "economic_numeric": self.cached_economic_data[self.current_step]["numeric"],
        "portfolio_data": np.array(
            [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs,
            dtype=np.float32
        )
    }
    print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}")

    # Info dictionary remains unchanged
    info = {"Close": self.tranaction_close_this_step}

    # Return 4 values as expected by Stable-Baselines3
    return obs, total_reward, done, info

Key Changes

Combined terminated and truncated into done:
- Instead of returning terminated and truncated separately, compute done = terminated or truncated and return it as the third value. This matches the Stable-Baselines3 VecEnv expectation.
Return a 4-tuple:
- Changed the return statement from return obs, total_reward, terminated, truncated, info (5 values) to return obs, total_reward, done, info (4 values).

Updating the Evaluation Functions

Since your evaluate and final_evaluate functions expect 5 values from env_vec.step(action), you need to update them to unpack 4 values instead. Here’s the corrected version of evaluate (apply the same change to final_evaluate):

Fixed evaluate Function

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []  # Track actual trading profit
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0  # Sum of realized profits/losses
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)

        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, done, info = env_vec.step(action)  # Changed to 4 values
            episode_rewards += rewards

            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]  # Add realized reward (profit/loss)

            step_count += 1

        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())  # Logs to file here

    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {
        k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()
    }
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit  # Return profit instead of reward for optimization

Notes for final_evaluate

Apply the same change: replace obs, rewards, terminated, truncated, info = env_vec.step(action) with obs, rewards, done, info = env_vec.step(action).
Update the done logic accordingly (it’s already correct as done in your current code).

Why It Returned 4 Values

Stable-Baselines3’s VecEnv: When you use DummyVecEnv (via get_sb_env()), it wraps your tgym environment and calls its step() method. However, VecEnv processes the output and returns a 4-tuple (obs, rewards, dones, infos) to match the older Gym API, ignoring the 5-tuple format of your step() method. This is why your code received only 4 values despite returning 5.

Verifying the Fix

Update Your Code:
- Replace the step() method in tgym with the corrected version above.
- Update evaluate and final_evaluate to unpack 4 values instead of 5.
Run the Training:
- Execute study.optimize(objective, n_trials=1). The ValueError should be resolved, and training should proceed.
Check Outputs:
- Ensure the step() method’s print statements (e.g., Step 1: Base Reward=...) appear as expected, and the evaluation functions log rewards and profits correctly.

Additional Notes

Gymnasium Compatibility: If you want to use the modern Gymnasium API (5-tuple step()), you’d need a version of Stable-Baselines3 that supports it (e.g., a fork like stable-baselines3-contrib or a custom patch) or wrap your environment to convert the 5-tuple to a 4-tuple explicitly. For now, aligning with the 4-tuple format is simpler and compatible with the standard Stable-Baselines3.
Stable-Baselines3 Version: Check your version (import stable_baselines3; print(stable_baselines3.__version__)). If it’s <2.0.0, upgrading might help with compatibility, though the 4-tuple expectation remains standard.
Deprecation Warning: The IPython warning is unrelated and can be ignored.

Let me know if you need further clarification or encounter additional issues!

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code) [I 2025-03-13 06:43:32,020] Using an existing study with name 'OHLC_EconomicCalender_ppo_study' instead of creating a new one. [W 2025-03-13 06:43:32,173] Trial 10 failed with parameters: {'learning_rate': 1.630840922258771e-05, 'n_steps': 5120, 'total_timesteps': 700000, 'batch_size': 128, 'gamma': 0.9014507112976935, 'gae_lambda': 0.9771284555507461, 'clip_range': 0.22747081231760874} because of the following error: ValueError('not enough values to unpack (expected 5, got 4)'). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-52-002c30ec2605>", line 172, in objective model.learn(total_timesteps=total_timesteps, callback=eval_callback) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 311, in learn return super().learn( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 323, in learn continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 218, in collect_rollouts new_obs, rewards, dones, infos = env.step(clipped_actions) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 207, in step return self.step_wait() File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 59, in step_wait obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step( # type: ignore[assignment] ValueError: not enough values to unpack (expected 5, got 4) [W 2025-03-13 06:43:32,174] Trial 10 failed with value None. Using device for PPO: cuda Using cuda device CustomMultiInputPolicy using device: cuda:0 CustomFeaturesExtractor using device: cuda:0 CustomFeaturesExtractor: n_assets=1, features_dim=96 Asset XAUUSD: Action=0.0, Reward=0, Holding=0 Step 1: Base Reward=0, Sustained Reward=0, Total=0, Balance=99900 --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-52-002c30ec2605> in <cell line: 189>() 187 load_if_exists=True 188 ) --> 189 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 190 191 # Choose device 10 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-52-002c30ec2605> in objective(trial) 170 171 # print(model.policy) # Should show mlp_extractor with in_features=95 --> 172 model.learn(total_timesteps=total_timesteps, callback=eval_callback) 173 174 val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 309 progress_bar: bool = False, 310 ) -> SelfPPO: --> 311 return super().learn( 312 total_timesteps=total_timesteps, 313 callback=callback, /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 321 322 while self.num_timesteps < total_timesteps: --> 323 continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) 324 325 if not continue_training: /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps) 216 clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high) 217 --> 218 new_obs, rewards, dones, infos = env.step(clipped_actions) 219 220 self.num_timesteps += env.num_envs /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/base_vec_env.py in step(self, actions) 205 """ 206 self.step_async(actions) --> 207 return self.step_wait() 208 209 def get_images(self) -> Sequence[Optional[np.ndarray]]: /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/dummy_vec_env.py in step_wait(self) 57 # Avoid circular imports 58 for env_idx in range(self.num_envs): ---> 59 obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step( # type: ignore[assignment] 60 self.actions[env_idx] 61 ) ValueError: not enough values to unpack (expected 5, got 4) if step() return 5 show error, if step() return 4 value also show error.

The issue you’re encountering—ValueError: not enough values to unpack (expected 5, got 4) when step() returns 4 values, and a similar error when it returns 5—stems from a mismatch between what your tgym environment’s step() method returns and what Stable-Baselines3’s DummyVecEnv expects. Let’s clarify this and resolve the confusion.

Root Cause

Stable-Baselines3 Expectation: Stable-Baselines3’s VecEnv (e.g., DummyVecEnv) expects the step() method of the underlying environment to return exactly 4 values: obs, reward, done, info. This is based on the older Gym API (pre-Gymnasium 0.26.0), where done is a single boolean combining termination and truncation.
Your Current Code:
- When step() returns 5 values (obs, reward, terminated, truncated, info), DummyVecEnv tries to unpack this into 5 variables (obs, rew, terminated, truncated, info), but Stable-Baselines3’s internal logic only handles 4 return values, causing an earlier error.
- When you corrected step() to return 4 values (obs, reward, done, info), the error persists because the DummyVecEnv implementation in your Stable-Baselines3 version is still trying to unpack 5 values internally (obs, rew, terminated, truncated, info), which indicates a version mismatch or a bug in how DummyVecEnv is handling the return values.

Why Both 4 and 5 Values Fail

5 Values (obs, reward, terminated, truncated, info): This matches the modern Gymnasium API, but Stable-Baselines3 (up to its latest stable release, e.g., 2.3.0) does not fully support this natively. The DummyVecEnv.step_wait() method expects 4 values and doesn’t handle the separate terminated and truncated correctly unless patched or using a newer, Gymnasium-compatible fork.
4 Values (obs, reward, done, info): This should work with Stable-Baselines3, but the error expected 5, got 4 in DummyVecEnv.step_wait() suggests your Stable-Baselines3 version has been updated or modified to expect the 5-value Gymnasium format internally, yet your environment now returns only 4 values, causing a mismatch.

Diagnosing the Problem

The traceback shows the error occurs in DummyVecEnv.step_wait():

obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step(...)

This line expects 5 values from self.envs[env_idx].step(), which is your tgym.step(). However:

If step() returns 4 values, it fails because DummyVecEnv is coded to unpack 5.
If step() returns 5 values, it fails earlier because Stable-Baselines3’s broader logic (e.g., collect_rollouts) expects 4.

This indicates your Stable-Baselines3 installation might be:

A version or fork expecting the Gymnasium 5-tuple API (e.g., a pre-release or custom build).
Inconsistent with your environment’s implementation.

Solution

To resolve this, you need to:

Ensure step() Returns 4 Values: Align your tgym.step() with the Stable-Baselines3 expectation of 4 values (obs, reward, done, info).
Fix DummyVecEnv Compatibility: If your Stable-Baselines3 version expects 5 values, either downgrade to a stable release (e.g., 2.3.0) or use a wrapper to adapt your 4-value step() to the 5-value format.

Step 1: Confirm Stable-Baselines3 Version

Check your version:

import stable_baselines3
print(stable_baselines3.__version__)

If it’s >=2.0.0 but <2.3.0, it should expect 4 values.
If it’s a pre-release or custom build (e.g., 2.3.0a0), it might expect 5 values due to Gymnasium integration.
如果它是预发布或自定义版本（例如 2.3.0a0），由于 Gymnasium 集成，它可能需要 5 个值。

Step 2: Fix step() to Return 4 Values
第 2 步：修复 step（）以返回 4 个值

Here’s your step() method corrected to return 4 values consistently with Stable-Baselines3’s expectation:
以下是更正后的 step（）方法，以返回 4 个与 Stable-Baselines3 的期望一致的值：

def step(self, actions):
    self.current_step += 1

    # Define termination and truncation conditions
    terminated = self.balance <= 0  # Episode ends due to bankruptcy
    truncated = self.current_step == len(self.dt_datetime) - 1  # Episode ends due to max steps
    done = terminated or truncated  # Combine into a single 'done' flag

    # For rendering or episode tracking
    if done:
        self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n"
        self.visualization = True
        self.episode += 1  # Increment episode counter

    # Calculate base trading reward
    base_reward = self._take_action(actions, done)

    # Calculate unrealized profit from open positions
    unrealized_profit = 0
    atr_scaling = 0  # For market condition scaling
    for i, asset in enumerate(self.assets):
        atr = self.get_observation(self.current_step, i, "ATR")
        atr_scaling += atr  # Sum ATR across assets for normalization
        for tr in self.transaction_live:
            if tr["Symbol"] == asset:
                if tr["Type"] == 0:  # Buy
                    unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point")
                else:  # Sell
                    unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point")
                unrealized_profit += unrealized

    atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1  # Avoid division by 0

    # Sustained reward: only applies to unrealized/realized profits, scaled by ATR
    sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0
    
    # Penalty for inaction if no positions are held
    if not self.transaction_live and all(math.floor(a) == 2 for a in actions):
        sustained_reward -= 0.1  # Small penalty to encourage exploration

    total_reward = base_reward + sustained_reward

    if self._day > self.current_day:
        self.current_day = self._day
        self.balance -= self.over_night_cash_penalty

    if self.balance != 0:
        self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100)

    obs = {
        "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32),
        "event_ids": self.cached_economic_data[self.current_step]["event_ids"],
        "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"],
        "economic_numeric": self.cached_economic_data[self.current_step]["numeric"],
        "portfolio_data": np.array(
            [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs,
            dtype=np.float32
        )
    }
    print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}")

    # Info dictionary
    info = {"Close": self.tranaction_close_this_step}

    # Return 4 values as expected by Stable-Baselines3
    return obs, total_reward, done, info

Step 3: Adjust Evaluation Functions
第 3 步：调整评估函数

Your evaluate and final_evaluate functions still expect 5 values from env_vec.step(). Update them to expect 4 values:
您的 evaluate 和 final_evaluate 函数仍然需要 env_vec.step（）中的 5 个值。将它们更新为预期 4 个值：

Fixed evaluate Function
固定 evaluate 函数

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []  # Track actual trading profit
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0  # Sum of realized profits/losses
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)

        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, done, info = env_vec.step(action)  # Expect 4 values
            episode_rewards += rewards

            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]  # Add realized reward (profit/loss)

            step_count += 1

        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())  # Logs to file here

    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {
        k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()
    }
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit

Apply the Same to final_evaluate
将相同内容应用于 final_evaluate

Replace obs, rewards, terminated, truncated, info = env_vec.step(action) with obs, rewards, done, info = env_vec.step(action).
将 obs， rewards， terminated， truncated， info = env_vec.step（action）替换为 obs， rewards， done， info = env_vec.step（action）。

Step 4: Handle Version Mismatch (If Needed)
第 4 步：处理版本不匹配（如果需要）

If the above changes still result in expected 5, got 4, your Stable-Baselines3 version expects the 5-tuple Gymnasium API. You have two options:
如果上述更改仍然导致预期的 5，得到 4，则您的 Stable-Baselines3 版本需要 5 元组的 Gymnasium API。您有两个选项：

Downgrade Stable-Baselines3:
降级 Stable-Baselines3：
- Install a stable version that uses the 4-tuple API:
  安装使用 4 元组 API 的稳定版本：
  bash 重击
  pip install stable-baselines3==2.2.1
- This ensures DummyVecEnv expects 4 values, matching your corrected step().
  这可确保 DummyVecEnv 需要 4 个值，与更正后的 step（）匹配。
Adapt to 5 Values: 适应 5 个值：
- If you want to keep the current version and use the 5-tuple API, modify step() to return 5 values and ensure compatibility:
  如果您想保留当前版本并使用 5 元组 API，请修改 step（）以返回 5 个值并确保兼容性：
  python 蟒
  def step(self, actions): # ... (same logic as above) ... terminated = self.balance <= 0 truncated = self.current_step == len(self.dt_datetime) - 1 done = terminated or truncated # Still compute for internal use # ... (rest of the logic) ... return obs, total_reward, terminated, truncated, info # Return 5 values
- However, this requires Stable-Baselines3 to support Gymnasium’s 5-tuple API (e.g., a pre-release or fork like stable-baselines3-contrib).
  但是，这需要 Stable-Baselines3 支持 Gymnasium 的 5 元组 API（例如，像 stable-baselines3-contrib 这样的预发布或分叉）。

Recommended Fix 建议的修复

Since Stable-Baselines3’s stable releases (up to 2.3.0) expect 4 values, the simplest solution is:
由于 Stable-Baselines3 的稳定版本（直到 2.3.0）需要 4 个值，因此最简单的解决方案是：

Use the corrected step() returning 4 values (obs, reward, done, info).
使用更正后的 step（）返回 4 个值（obs， reward， done， info）。
Update evaluate and final_evaluate to unpack 4 values.
更新 evaluate 和 final_evaluate 以解压缩 4 个值。
Downgrade to stable-baselines3==2.2.1 if the error persists with your current version.
如果当前版本仍然存在错误，请降级到 stable-baselines3==2.2.1。

Verifying the Fix 验证修复

Update Your Code: 更新您的代码：
- Replace step() with the 4-value version above.
  将 step（）替换为上面的 4 值版本。
- Update evaluate and final_evaluate as shown.
  更新 evaluate 和 final_evaluate，如图所示。
- Optionally downgrade Stable-Baselines3: pip install stable-baselines3==2.2.1.
  （可选）降级 Stable-Baselines3：pip install stable-baselines3==2.2.1。
Run the Training: 运行训练：
- Execute study.optimize(objective, n_trials=1). The ValueError should resolve.
  执行 study.optimize（objective， n_trials=1）。ValueError 应该会解决。
Check Outputs: 检查输出：
- Ensure training proceeds and evaluation logs rewards/profits correctly.
  确保培训收益和评估日志奖励/利润正确。

Additional Debugging 其他调试

If the error persists: 如果错误仍然存在：

Share your Stable-Baselines3 version (print(stable_baselines3.__version__)).
共享您的 Stable-Baselines3 版本（print（stable_baselines3.__version__））。
Confirm whether get_sb_env() uses DummyVecEnv or another VecEnv subclass.
确认 get_sb_env（）使用的是 DummyVecEnv 还是其他 VecEnv 子类。

Let me know your version or if you need further assistance!
请告诉我您的版本，或者如果您需要进一步的帮助！

import stable_baselines3 print(stable_baselines3.__version__) 2.5.0 upgrade which package to make it available for 5 value?

20 web pages

Your current Stable-Baselines3 version is 2.5.0, which is a stable release that expects the 4-tuple return value (obs, reward, done, info) from the step() method of environments, as per the older Gym API. The error you're encountering (ValueError: not enough values to unpack (expected 5, got 4)) indicates that your version of Stable-Baselines3’s DummyVecEnv is trying to unpack 5 values (obs, reward, terminated, truncated, info), which aligns with the modern Gymnasium API, but your environment’s step() method is returning only 4 values—or vice versa.

Why This Happens

Stable-Baselines3 <= 2.5.0: These versions are designed to work with Gym (up to 0.26) and Gymnasium (via the shimmy compatibility layer), but their vectorized environments (VecEnv) expect a 4-tuple return from step(). The terminated and truncated distinction (5-tuple) was introduced in Gymnasium 0.26.0+, and full support for this requires changes in Stable-Baselines3’s internals.
Your Situation: The mismatch suggests either:
您的情况：不匹配表明：
1. Your step() returns 4 values, but DummyVecEnv expects 5 (due to a version or configuration issue).
2. Your step() returns 5 values, but Stable-Baselines3 2.5.0 processes it as 4, causing an earlier error.
  您的 step（）返回 5 个值，但 Stable-Baselines3 2.5.0 将其处理为 4，从而导致早期错误。

Upgrade Path for 5-Tuple Support
5 元组支持的升级路径

Stable-Baselines3 began integrating full Gymnasium support (including the 5-tuple step() API) starting with version 2.0.0, but complete compatibility with the terminated/truncated distinction in VecEnv was refined in later pre-release versions (e.g., 2.6.0aX). As of March 12, 2025, the latest stable release is 2.5.0 (released January 26, 2025), but pre-release versions beyond this (e.g., 2.6.0a2) include better Gymnasium support.
从 2.0.0 版本开始，Stable-Baselines3 开始集成完整的 Gymnasium 支持（包括 5 元组 step（） API），但在后来的预发布版本（例如 2.6.0aX）中，与 VecEnv 中终止/截断区别的完全兼容性得到了改进。截至 2025 年 3 月 12 日，最新的稳定版本是 2.5.0（2025 年 1 月 26 日发布），但在此之后的预发布版本（例如 2.6.0a2）包含更好的 Gymnasium 支持。

To enable native support for the 5-tuple step() return (obs, reward, terminated, truncated, info), you should upgrade to a pre-release version of Stable-Baselines3 that fully supports Gymnasium’s API. Here’s how:
要启用对 5 元组 step（）返回（obs， reward， terminated， truncated， info）的原生支持，您应该升级到完全支持 Gymnasium API 的 Stable-Baselines3 预发布版本。方法如下：

Upgrade Stable-Baselines3
升级 stable-baselines3

Upgrade to the latest pre-release version from the GitHub repository, as the stable 2.5.0 does not fully handle the 5-tuple API in VecEnv. The pre-release versions (e.g., 2.6.0a2) include updates for Gymnasium compatibility.
从 GitHub 存储库升级到最新的预发布版本，因为稳定的 2.5.0 无法完全处理 VecEnv 中的 5 元组 API。预发布版本（例如 2.6.0a2）包括 Gymnasium 兼容性更新。

pip install git+https://github.com/DLR-RM/stable-baselines3.git@master

This installs the latest development version from the master branch, which, as of March 12, 2025, should include Gymnasium 5-tuple support (post-2.5.0 changes).
这将从 master 分支安装最新的开发版本，自 2025 年 3 月 12 日起，该版本应包括 Gymnasium 5 元组支持（2.5.0 后的更改）。
Verify the installed version:
验证已安装的版本：
python 蟒
import stable_baselines3 print(stable_baselines3.__version__) # Should show something like 2.6.0aX

Additional Dependencies 其他依赖项

Ensure you have Gymnasium installed, as it’s the primary backend for newer Stable-Baselines3 versions:
确保您已安装 Gymnasium，因为它是较新 Stable-Baselines3 版本的主要后端：

pip install gymnasium>=0.29.1

Stable-Baselines3 2.5.0+ requires Gymnasium 0.29.1 or higher for full compatibility.
Stable-Baselines3 2.5.0+ 需要 Gymnasium 0.29.1 或更高版本才能完全兼容。

Verify Your Environment 验证您的环境

Update your tgym.step() to return 5 values if it doesn’t already. Here’s the corrected version based on your previous code:
更新你的 tgym.step（）以返回 5 个值（如果它还没有）。以下是基于您之前代码的更正版本：

def step(self, actions):
    self.current_step += 1

    # Define termination and truncation conditions
    terminated = self.balance <= 0  # Episode ends due to bankruptcy
    truncated = self.current_step == len(self.dt_datetime) - 1  # Episode ends due to max steps
    done = terminated or truncated  # For internal use or rendering

    # For rendering or episode tracking
    if done:
        self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n"
        self.visualization = True
        self.episode += 1

    # Calculate base trading reward
    base_reward = self._take_action(actions, done)

    # Calculate unrealized profit from open positions
    unrealized_profit = 0
    atr_scaling = 0
    for i, asset in enumerate(self.assets):
        atr = self.get_observation(self.current_step, i, "ATR")
        atr_scaling += atr
        for tr in self.transaction_live:
            if tr["Symbol"] == asset:
                if tr["Type"] == 0:  # Buy
                    unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point")
                else:  # Sell
                    unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point")
                unrealized_profit += unrealized

    atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1
    sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0
    
    if not self.transaction_live and all(math.floor(a) == 2 for a in actions):
        sustained_reward -= 0.1

    total_reward = base_reward + sustained_reward

    if self._day > self.current_day:
        self.current_day = self._day
        self.balance -= self.over_night_cash_penalty

    if self.balance != 0:
        self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100)

    obs = {
        "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32),
        "event_ids": self.cached_economic_data[self.current_step]["event_ids"],
        "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"],
        "economic_numeric": self.cached_economic_data[self.current_step]["numeric"],
        "portfolio_data": np.array(
            [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs,
            dtype=np.float32
        )
    }
    print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}")

    info = {"Close": self.tranaction_close_this_step}

    # Return 5 values for Gymnasium compatibility
    return obs, total_reward, terminated, truncated, info

Update Evaluation Functions
更新评估函数

If your evaluate and final_evaluate functions expect 4 values, update them to handle 5 values from env_vec.step():
如果你的 evaluate 和 final_evaluate 函数需要 4 个值，请更新它们以处理 env_vec.step（）中的 5 个值：

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)

        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, terminated, truncated, info = env_vec.step(action)  # Expect 5 values
            done = np.logical_or(terminated, truncated)  # Combine for loop condition
            episode_rewards += rewards

            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]

            step_count += 1

        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())

    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()}
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit

Why Upgrade to Pre-Release?
为什么要升级到预发布版？

Stable-Baselines3 2.5.0: Supports Gymnasium but processes VecEnv.step() as a 4-tuple internally (dones combines terminated and truncated), which may cause issues if your environment or code expects 5 values.
稳定基线3 2.5.0：支持 Gymnasium，但在内部将 VecEnv.step（）作为 4 元组处理（dones 结合了 terminated 和 truncated），如果您的环境或代码需要 5 个值，这可能会导致问题。
Pre-Release (e.g., 2.6.0a2): Fully aligns VecEnv with Gymnasium’s 5-tuple API, as noted in the changelog (e.g., updated DummyVecEnv to handle terminated and truncated separately).
预发布版（例如 2.6.0a2）：如更改日志中所述，将 VecEnv 与 Gymnasium 的 5 元组 API 完全对齐（例如，更新了 DummyVecEnv 以分别处理终止和截断）。

Alternative: Downgrade for 4-Tuple
替代方案：降级为 4 元组

If you prefer to stick with the 4-tuple API (and avoid pre-release instability):
如果您更喜欢坚持使用 4 元组 API（并避免发布前的不稳定性）：

Keep Stable-Baselines3 at 2.5.0.
将 stable-baselines3 保持在 2.5.0。
Ensure step() returns 4 values (obs, reward, done, info).
确保 step（）返回 4 个值（obs， reward， done， info）。
Downgrade Gymnasium to a version compatible with 4-tuple expectations (e.g., via shimmy):
将 Gymnasium 降级到与 4 元组预期兼容的版本（例如，通过 shimmy）：
bash 重击
pip install gymnasium==0.29.1 shimmy[gym-v26]

However, since you asked for 5-tuple support, upgrading to the pre-release is the recommended path.
但是，由于您要求支持 5 元组，因此建议升级到预发行版。

Verify the Fix 验证修复

Install Pre-Release: 安装预发布版：
bash 重击
pip install git+https://github.com/DLR-RM/stable-baselines3.git@master pip install gymnasium>=0.29.1
Update step() and Evaluation Functions: Use the 5-tuple versions above.
更新 step（）和 Evaluation 函数：使用上面的 5 元组版本。
Run Training: 运行训练：
python 蟒
study.optimize(objective, n_trials=1)
Check for errors and ensure training proceeds.
检查错误并确保培训继续进行。

If issues persist, share the updated error message or confirm your Gymnasium version (import gymnasium; print(gymnasium.__version__)).
如果问题仍然存在，请分享更新的错误消息或确认您的 Gymnasium 版本（import gymnasium; print（gymnasium.__version__））。

https://github.com/DLR-RM/stable-baselines3/releases the latest stable-baselines3 is v2.5.0: New algorithm (SimBa in SBX) and NumPy 2.0 support. not 2.6.0, should i upgrade stable-baselines3?

github.com

Based on the information you’ve provided and the release notes from the GitHub page, here’s an analysis of your situation and whether you should upgrade Stable-Baselines3 (SB3) beyond version 2.5.0 to support the 5-tuple step() return value (obs, reward, terminated, truncated, info).

Current Situation

Your SB3 Version: You’re running 2.5.0, which is the latest stable release as of January 27, 2025, according to the GitHub releases page (https://github.com/DLR-RM/stable-baselines3/releases).
Error Encountered: You’re seeing ValueError: not enough values to unpack (expected 5, got 4) when your environment’s step() returns 4 values (obs, reward, done, info), indicating a mismatch between what your environment returns and what SB3’s DummyVecEnv expects.
Desired Behavior: You want your environment’s step() to return 5 values (obs, reward, terminated, truncated, info) to align with the modern Gymnasium API, and you want SB3 to handle this correctly.

Stable-Baselines3 v2.5.0 Behavior

Release Date: January 27, 2025.
Gymnasium Support: SB3 v2.5.0 supports Gymnasium (minimum version 0.29.1, with support for v1.0 added in v2.4.0), but its vectorized environments (VecEnv, including DummyVecEnv) are designed to work with the older 4-tuple API (obs, reward, done, info). Internally, it combines terminated and truncated into a single done signal when processing environment steps.
5-Tuple Handling: While v2.5.0 can interact with Gymnasium environments that return 5 values (via compatibility layers like shimmy), the VecEnv.step_wait() method in stable releases up to 2.5.0 expects to process a 4-tuple output. The error you’re seeing (expected 5, got 4) suggests that your DummyVecEnv is attempting to unpack 5 values, which is inconsistent with the stable branch’s documented behavior.

Why the Confusion?

Error Mismatch: The error expected 5, got 4 in DummyVecEnv.step_wait():
python
obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step(...)
indicates that your SB3 installation expects a 5-tuple return, which is not the default behavior for v2.5.0’s stable release. This could mean:
1. Version Misidentification: Your installed version might not actually be 2.5.0 but a pre-release or custom build (e.g., 2.6.0aX) that expects 5 values. Re-verify with:
  python
  import stable_baselines3 print(stable_baselines3.__version__)
2. Environment Wrapper Issue: An external wrapper or your get_sb_env() function might be altering the step() return format.
3. Gymnasium Version: Your Gymnasium version might be forcing a 5-tuple expectation that SB3 2.5.0 isn’t fully handling in its stable VecEnv.
Latest Stable Release: The GitHub releases confirm v2.5.0 is the latest stable version, with no mention of v2.6.0 as a stable release yet. Pre-release versions (e.g., 2.6.0aX) exist on the master branch but aren’t officially tagged as stable.

Should You Upgrade?

To support a 5-tuple step() return natively, you need a version of SB3 that fully integrates Gymnasium’s 5-value API in its VecEnv implementation. Here’s the decision:

Option 1: Upgrade to Pre-Release (Recommended for 5-Tuple Support)

Why Upgrade: SB3 v2.5.0 does not fully support the 5-tuple step() return in its stable VecEnv implementation. Pre-release versions (e.g., 2.6.0aX) from the master branch include updates to handle terminated and truncated separately, aligning with Gymnasium’s API.
How to Upgrade:
bash
pip install git+https://github.com/DLR-RM/stable-baselines3.git@master pip install gymnasium>=0.29.1
Pros:
- Native support for 5-tuple step() (obs, reward, terminated, truncated, info).
- Matches your goal of using the modern Gymnasium API.
- Includes latest bug fixes and features beyond v2.5.0.
Cons:
- Pre-release versions (e.g., 2.6.0a2 as of now) may be unstable or have undocumented changes.
- Requires testing to ensure compatibility with your codebase.

Option 2: Stay with v2.5.0 and Use 4-Tuple

Why Stay: If you’re satisfied with the 4-tuple API (obs, reward, done, info), v2.5.0 is stable and well-tested. You can modify your environment to combine terminated and truncated into done.
How to Adjust:
- Update step() to return 4 values (as previously provided):
  python
  def step(self, actions): # ... (existing logic) ... terminated = self.balance <= 0 truncated = self.current_step == len(self.dt_datetime) - 1 done = terminated or truncated # ... (rest of logic) ... return obs, total_reward, done, info # 4 values
- Update evaluate and final_evaluate to expect 4 values:
  python
  obs, rewards, done, info = env_vec.step(action)
Pros:
- Stable, production-ready version.
- No need to deal with pre-release instability.
Cons:
- Loses the distinction between terminated and truncated, which might matter for your trading logic.

Option 3: Downgrade (Not Recommended)

Downgrading to an earlier version (e.g., 2.2.1) ensures strict 4-tuple compatibility but sacrifices newer features and Gymnasium support in v2.5.0. This isn’t ideal given your v2.5.0 is already installed.

Recommendation: Upgrade to Pre-Release

Since you specifically want 5-tuple support (obs, reward, terminated, truncated, info) and v2.5.0 doesn’t fully provide this in its stable VecEnv, upgrading to the latest pre-release from the master branch is the best path. The GitHub master branch (as of March 12, 2025) includes post-2.5.0 commits that enhance Gymnasium compatibility, likely fixing the DummyVecEnv to handle 5 values correctly.

Steps to Upgrade and Test

Upgrade SB3:
bash
pip install git+https://github.com/DLR-RM/stable-baselines3.git@master
Verify Version:
python
import stable_baselines3 print(stable_baselines3.__version__) # Expect 2.6.0aX or similar
Ensure Gymnasium:
bash
pip install gymnasium>=0.29.1
Verify:
python
import gymnasium print(gymnasium.__version__) # Should be >= 0.29.1
Update step() to Return 5 Values: Use the 5-tuple version provided earlier:
python
return obs, total_reward, terminated, truncated, info
Update Evaluation Functions: Ensure evaluate and final_evaluate expect 5 values:
python
obs, rewards, terminated, truncated, info = env_vec.step(action) done = np.logical_or(terminated, truncated) # Combine for loop logic
Run Training:
python
study.optimize(objective, n_trials=1)
Check for errors and ensure training proceeds.
检查错误并确保培训继续进行。

Why Not v2.6.0 Stable? 为什么不是 v2.6.0 稳定版？

As of the GitHub releases page (last updated January 27, 2025, for v2.5.0), there’s no stable v2.6.0 release yet. The “2.6.0” mention might refer to pre-release tags (e.g., 2.6.0a2) or a misunderstanding. The master branch is your best bet for the latest 5-tuple support until a stable v2.6.0 is tagged.
截至 GitHub 版本页面（上次更新时间为 2025 年 1 月 27 日，适用于 v2.5.0），目前还没有稳定的 v2.6.0 版本。提及“2.6.0”可能是指预发布标签（例如 2.6.0a2）或误解。在标记稳定的 v2.6.0 之前，master 分支是获得最新 5 元组支持的最佳选择。

Troubleshooting 故障排除

If Error Persists After Upgrade:
如果升级后错误仍然存在：
- Confirm the exact SB3 version post-upgrade.
  升级后确认确切的 SB3 版本。
- Check your get_sb_env() implementation—ensure it uses DummyVecEnv or a compatible VecEnv.
  检查您的 get_sb_env（）实现 — 确保它使用 DummyVecEnv 或兼容的 VecEnv。
- Share the new error message.
  分享新的错误消息。
Environment Wrapper Check: Ensure no custom wrappers are altering the step() return format unexpectedly.
环境包装器检查：确保没有自定义包装器意外更改 step（）返回格式。

Upgrading to the pre-release should resolve your issue and align SB3 with your 5-tuple requirement. Let me know how it goes or if you need further assistance!
升级到预发布版应该可以解决您的问题，并使 SB3 与您的 5 元组要求保持一致。让我知道进展如何，或者您是否需要进一步的帮助！

still error show, why? how to fix it? ``` # Evaluation function def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False): total_rewards = [] total_profits = [] # Track actual trading profit metrics = [] for _ in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 # Sum of realized profits/losses step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=True) obs, rewards, done, info = env_vec.step(action) # obs, rewards, terminated, truncated, info = env_vec.step(action) # done = np.logical_or(terminated, truncated) episode_rewards += rewards # Extract profit from closed transactions for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] # Add realized reward (profit/loss) step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) # Logs to file here mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) avg_metrics = { k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys() } print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Average Metrics: {avg_metrics}") return mean_reward if return_mean_reward else mean_profit # Return profit instead of reward for optimization # Evaluation function (with returns for QuantStats) def final_evaluate(model, env_vec, n_episodes=20, return_mean_reward=False): total_rewards = [] total_profits = [] metrics = [] returns = [] # For QuantStats for ep in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) episode_returns = [] # For QuantStats while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=True) obs, rewards, done, info = env_vec.step(action) # obs, rewards, terminated, truncated, info = env_vec.step(action) # done = np.logical_or(terminated, truncated) episode_rewards += rewards for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] episode_returns.append(tr["Reward"]) # Track per-trade returns step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) returns.extend(episode_returns if episode_returns else [episode_profit]) # Fallback to total profit mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()} print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Metrics: {avg_metrics}") # QuantStats reporting (for test evaluation) returns_series = pd.Series(returns, index=pd.date_range(start="2025-03-12", periods=len(returns), freq="D")) qs.reports.html(returns_series, output="quantstats_report.html", title="FX Trading Performance") return mean_reward if return_mean_reward else mean_profit # Custom early stopping callback class EarlyStoppingCallback(EvalCallback): def __init__(self, eval_env, eval_freq, n_eval_episodes, patience, min_delta, verbose=0, use_profit=False, best_model_save_path=None): super().__init__( eval_env=eval_env, eval_freq=eval_freq, n_eval_episodes=n_eval_episodes, verbose=verbose, deterministic=True ) self.patience = patience self.min_delta = min_delta self.best_metric = -float('inf') self.no_improvement_count = 0 self.use_profit = use_profit # Toggle between reward and profit self.best_model_save_path = best_model_save_path def _on_step(self): continue_training = super()._on_step() if not continue_training: return False if self.last_mean_reward is not None: # Use profit or reward based on use_profit flag current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit) if current_metric > self.best_metric + self.min_delta: self.best_metric = current_metric self.no_improvement_count = 0 if self.verbose > 0: print(f"New best {'profit' if self.use_profit else 'reward'}: {self.best_metric:.2f}") if self.best_model_save_path: self.model.save(self.best_model_save_path) else: self.no_improvement_count += 1 if self.verbose > 0: print(f"No improvement for {self.no_improvement_count}/{self.patience} evaluations") if self.no_improvement_count >= self.patience: if self.verbose > 0: print(f"Early stopping triggered after {self.patience} evaluations without improvement") return False return True # Objective function for Optuna def objective(trial): # Define hyperparameter search space learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True) n_steps = trial.suggest_int("n_steps", 1024, 8192, step=1024) total_timesteps = trial.suggest_int("total_timesteps", 500000, 2000000, step=100000) batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512]) gamma = trial.suggest_float("gamma", 0.9, 0.9999) gae_lambda = trial.suggest_float("gae_lambda", 0.8, 0.99) clip_range = trial.suggest_float("clip_range", 0.1, 0.3) # Train PPO model on training set model = PPO( CustomMultiInputPolicy, train_env_vec, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, n_epochs=100, # Large fixed value gamma=gamma, gae_lambda=gae_lambda, clip_range=clip_range, verbose=1, # 0: No output during training, # 1: Prints basic training progress, # 2: More detailed output (Additional details like optimization steps, loss values (e.g., policy loss, value loss), and learning rate updates.) ) eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # More frequent n_eval_episodes=10, # More episodes for stability patience=5, # Stop if no improvement after 5 evaluations min_delta=0.01, # Minimum improvement to consider verbose=1, use_profit=True, # Track profit for trading focus best_model_save_path=f"./best_model/trial_{trial.number}/" # Save best model ) # print(model.policy) # Should show mlp_extractor with in_features=95 model.learn(total_timesteps=total_timesteps, callback=eval_callback) val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function print(f"Validation Average Profit: {val_avg_profit:.2f}") return val_avg_profit # Maximize reward # Specify the SQLite database file db_path = 'optuna_study.db' # Run optimization study = optuna.create_study( study_name='OHLC_EconomicCalender_ppo_study', storage=f'sqlite:///{db_path}', direction="maximize", load_if_exists=True ) study.optimize(objective, n_trials=1) # Adjust number of trials based on resources # Best parameters print("Best hyperparameters:", study.best_params) print("Best validation reward:", study.best_value) best_params = study.best_params # Train final model with best parameters on training set best_model = PPO( CustomMultiInputPolicy, train_env_vec, # Use full training environment learning_rate=best_params["learning_rate"], n_steps=best_params["n_steps"], batch_size=best_params["batch_size"], n_epochs=100, # Large fixed value gamma=best_params["gamma"], gae_lambda=best_params["gae_lambda"], clip_range=best_params["clip_range"], verbose=1, tensorboard_log="./tensorboard_logs/" ) # Early stopping callback on validation set best_trial = study.best_trial.number eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # more frequent evaluations (e.g., every 5,000 steps) n_eval_episodes=10, patience=5, min_delta=0.01, verbose=1, use_profit=True, # use mean_profits best_model_save_path=f"./best_model/trial_{best_trial}/" ) # Train the final model best_model.learn(total_timesteps=best_params["total_timesteps"], callback=eval_callback) # Save the final model best_model.save(f"ppo_xauusd_optimized_trial_{best_trial}") # Evaluate on test data with QuantStats print("\nEvaluating Final Model on Test Data:") test_avg_profit = final_evaluate(best_model, test_env_vec, n_episodes=20) print(f"Test Average Profit: {test_avg_profit:.2f}") # Optional: Load and re-evaluate to verify # loaded_model = PPO.load("ppo_xauusd_optimized") # test_avg_profit_loaded = final_evaluate(loaded_model, test_env_vec, n_episodes=10) # print(f"Test Average Profit (Loaded Model): {test_avg_profit_loaded:.2f}") # Clean up train_env_vec.close() val_env_vec.close() test_env_vec.close() ``` ``` def step(self, actions): self.current_step += 1 # Define termination and truncation conditions terminated = self.balance <= 0 # Episode ends due to bankruptcy (terminal state) truncated = self.current_step == len(self.dt_datetime) - 1 # Episode ends due to max steps (time limit) done = terminated or truncated # Combine into a single 'done' flag for VecEnv # For rendering or episode tracking, you might still check if either condition is true if done: self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n" self.visualization = True self.episode += 1 # Increment episode counter # Calculate base trading reward base_reward = self._take_action(actions, done) # Calculate unrealized profit from open positions unrealized_profit = 0 atr_scaling = 0 # For market condition scaling for i, asset in enumerate(self.assets): atr = self.get_observation(self.current_step, i, "ATR") atr_scaling += atr # Sum ATR across assets for normalization for tr in self.transaction_live: if tr["Symbol"] == asset: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1 # Avoid division by 0 # Sustained reward: only applies to unrealized/realized profits, scaled by ATR # adjust 0.01 to 0.05 sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0 # Penalty for inaction if no positions are held if not self.transaction_live and all(math.floor(a) == 2 for a in actions): sustained_reward -= 0.1 # Small penalty to encourage exploration total_reward = base_reward + sustained_reward if self._day > self.current_day: self.current_day = self._day self.balance -= self.over_night_cash_penalty if self.balance != 0: self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100) obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}") # Info dictionary remains unchanged info = {"Close": self.tranaction_close_this_step} return obs, total_reward, done, info ``` /usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code) [I 2025-03-13 08:51:18,238] Using an existing study with name 'OHLC_EconomicCalender_ppo_study' instead of creating a new one. [W 2025-03-13 08:51:18,345] Trial 13 failed with parameters: {'learning_rate': 1.164777225090814e-05, 'n_steps': 5120, 'total_timesteps': 1700000, 'batch_size': 64, 'gamma': 0.9280405388692499, 'gae_lambda': 0.8046323591596111, 'clip_range': 0.2605072265150349} because of the following error: ValueError('not enough values to unpack (expected 5, got 4)'). Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial value_or_values = func(trial) File "<ipython-input-60-1aea10becb13>", line 168, in objective model.learn(total_timesteps=total_timesteps, callback=eval_callback) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py", line 311, in learn return super().learn( File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 323, in learn continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 218, in collect_rollouts new_obs, rewards, dones, infos = env.step(clipped_actions) File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 207, in step return self.step_wait() File "/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 59, in step_wait obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step( # type: ignore[assignment] ValueError: not enough values to unpack (expected 5, got 4) [W 2025-03-13 08:51:18,346] Trial 13 failed with value None. Using cuda device CustomMultiInputPolicy using device: cuda:0 CustomFeaturesExtractor using device: cuda:0 CustomFeaturesExtractor: n_assets=1, features_dim=96 Asset XAUUSD: Action=0.0, Reward=0, Holding=0 Step 1: Base Reward=0, Sustained Reward=0, Total=0, Balance=99900 --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-60-1aea10becb13> in <cell line: 185>() 183 load_if_exists=True 184 ) --> 185 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 186 187 # Best parameters 10 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-60-1aea10becb13> in objective(trial) 166 167 # print(model.policy) # Should show mlp_extractor with in_features=95 --> 168 model.learn(total_timesteps=total_timesteps, callback=eval_callback) 169 170 val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 309 progress_bar: bool = False, 310 ) -> SelfPPO: --> 311 return super().learn( 312 total_timesteps=total_timesteps, 313 callback=callback, /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 321 322 while self.num_timesteps < total_timesteps: --> 323 continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) 324 325 if not continue_training: /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps) 216 clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high) 217 --> 218 new_obs, rewards, dones, infos = env.step(clipped_actions) 219 220 self.num_timesteps += env.num_envs /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/base_vec_env.py in step(self, actions) 205 """ 206 self.step_async(actions) --> 207 return self.step_wait() 208 209 def get_images(self) -> Sequence[Optional[np.ndarray]]: /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/dummy_vec_env.py in step_wait(self) 57 # Avoid circular imports 58 for env_idx in range(self.num_envs): ---> 59 obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step( # type: ignore[assignment] 60 self.actions[env_idx] 61 ) ValueError: not enough values to unpack (expected 5, got 4)
还是 error show，为什么？怎么解决？ ``` # 评估函数 def evaluate（model， env_vec， n_episodes=10， return_mean_reward=False）： total_rewards = [] total_profits = [] # 跟踪实际交易利润指标 = [] 对于 range（n_episodes）中的 _： obs = env_vec.reset（）完成 = np.array（[假] * env_vec.num_envs） episode_rewards = np.zeros（env_vec.num_envs） episode_profit = 0 # 已实现盈亏之和 step_count = 0 max_steps = len（env_vec.envs[0].dt_datetime） while not np.all（done）且 step_count < max_steps：作， _ = model.predict（obs， deterministic=True） obs、rewards、done、info = env_vec.step（action） # obs， rewards， terminated， truncated， info = env_vec.step（action） # 完成 = np.logical_or（终止，截断） episode_rewards += 奖励 # 从已平仓交易中提取利润对于 range （env_vec.num_envs）中的env_idx： if info[env_idx][“关闭”]：对于 info[env_idx][“Close”] 中的 tr： episode_profit += tr[“Reward”] # 添加已实现的奖励（盈利/亏损） step_count += 1 total_rewards.extend（episode_rewards） total_profits.append（episode_profit） metrics.append（env_vec.envs[0].analyze_transaction_history（）） # 在此处提交的日志 mean_reward = np.mean（total_rewards） mean_profit = np.mean（total_profits） avg_metrics = { k： np.mean（[m[k] for m in metrics]） for k in metrics[0].keys（） } print（f“平均奖励： {mean_reward：.2f}，平均利润： {mean_profit：.2f}”） print（f“平均度量： {avg_metrics}”） return mean_reward if return_mean_reward else mean_profit # 返回利润而不是优化奖励 # 评估函数（返回 QuantStats） def final_evaluate（模型， env_vec， n_episodes=20， return_mean_reward=False）： total_rewards = [] total_profits = [] 指标 = [] returns = [] # 对于 QuantStats 对于 EP in range （n_episodes）： obs = env_vec.reset（）完成 = np.array（[假] * env_vec.num_envs） episode_rewards = np.zeros（env_vec.num_envs） episode_profit = 0 step_count = 0 max_steps = len（env_vec.envs[0].dt_datetime） episode_returns = [] # 对于 QuantStats while not np.all（done）且 step_count < max_steps：作， _ = model.predict（obs， deterministic=True） obs、rewards、done、info = env_vec.step（action） # obs， rewards， terminated， truncated， info = env_vec.step（action） # 完成 = np.logical_or（终止，截断） episode_rewards += 奖励对于 range （env_vec.num_envs）中的env_idx： if info[env_idx][“关闭”]：对于 info[env_idx][“Close”] 中的 tr： episode_profit += tr[“奖励”] episode_returns.append（tr[“Reward”]） # 追踪每笔交易的回报 step_count += 1 total_rewards.extend（episode_rewards） total_profits.append（episode_profit） metrics.append（env_vec.envs[0].analyze_transaction_history（）） returns.extend（episode_returns if episode_returns else [episode_profit]） # 回退到总利润 mean_reward = np.mean（total_rewards） mean_profit = np.mean（total_profits） avg_metrics = {k： np.mean（[m[k] for m in metrics]） for k in metrics[0].keys（）} print（f“平均奖励： {mean_reward：.2f}，平均利润： {mean_profit：.2f}”） print（f“度量： {avg_metrics}”） # QuantStats 报告（用于测试评估） returns_series = pd。Series（返回值， index=pd.date_range（start=“2025-03-12”， periods=len（returns）， freq=“D”）） qs.reports.html（returns_series， output=“quantstats_report.html”， title=“外汇交易表现”）如果 return_mean_reward else mean_profit，则返回 mean_reward # 自定义提前停止回调类 EarlyStoppingCallback（EvalCallback）： def __init__（self， eval_env， eval_freq， n_eval_episodes， patience， min_delta， verbose=0， use_profit=False， best_model_save_path=None）： super（）.__init__（ eval_env=eval_env、 eval_freq=eval_freq、 n_eval_episodes=n_eval_episodes， verbose=verbose， deterministic=真 ) self.patience = 耐心 self.min_delta = min_delta self.best_metric = -float（'inf'） self.no_improvement_count = 0 self.use_profit = use_profit # 在奖励和利润之间切换 self.best_model_save_path = best_model_save_path def _on_step（个体经营）： continue_training = super（）._on_step（）如果不continue_training： return False 如果 self.last_mean_reward 不是 None： # 使用基于use_profit标志的利润或奖励 current_metric = evaluate（self.model， self.eval_env， self.n_eval_episodes， return_mean_reward=not self.use_profit）如果 current_metric > self.best_metric + self.min_delta： self.best_metric = current_metric self.no_improvement_count = 0 如果 self.verbose > 0： print（f“新最佳 {'profit' if self.use_profit else 'reward'}： {self.best_metric：.2f}”）如果self.best_model_save_path： self.model.save（self.best_model_save_path）还： self.no_improvement_count += 1 如果 self.verbose > 0： print（f“{self.no_improvement_count}/{self.patience} evaluations没有改善”） if self.no_improvement_count >= self.patience：如果 self.verbose > 0： print（f“在 {self.patience} 评估后触发提前停止，但无改善”） return False 返回 True # Optuna 的目标函数 def 目标（试用）： # 定义超参数搜索空间 learning_rate = trial.suggest_float（“learning_rate”， 1e-5， 1e-3， log=True） n_steps = trial.suggest_int（“n_steps”， 1024， 8192， step=1024） total_timesteps = trial.suggest_int（“total_timesteps”， 500000， 2000000， step=100000） batch_size = trial.suggest_categorical（“batch_size”， [64， 128， 256， 512]）伽玛 = trial.suggest_float（“伽玛”， 0.9， 0.9999） gae_lambda = trial.suggest_float（“gae_lambda”， 0.8， 0.99） clip_range = trial.suggest_float（“clip_range”， 0.1， 0.3） # 在训练集上训练 PPO 模型模型 = PPO（ CustomMultiInputPolicy 和 train_env_vec， learning_rate=learning_rate、 n_steps=n_steps、 batch_size=batch_size， n_epochs=100， # 大的固定值 gamma=伽玛， gae_lambda=gae_lambda、 clip_range=clip_range、 verbose=1， # 0：训练时无输出， # 1：打印基础训练进度， # 2：更详细的输出（其他详细信息，如优化步骤、损失值（例如，策略损失、值损失）和学习率更新。 ) eval_callback = EarlyStoppingCallback（ eval_env=val_env_vec、 eval_freq=5000， # 更频繁 n_eval_episodes=10， # 更多剧集以保持稳定 patience=5， # 如果 5 次评估后没有改善，则停止 min_delta=0.01， # 要考虑的最小改进 verbose=1， use_profit=True， # 跟踪利润以专注于交易 best_model_save_path=f“./best_model/trial_{trial.number}/” # 保存最佳模型 ) # print（model.policy） # 应该显示 in_features=95 的 mlp_extractor model.learn（total_timesteps=total_timesteps， callback=eval_callback） val_avg_profit = evaluate（model， val_env_vec， n_episodes=10） # 评估函数 print（f“验证平均利润： {val_avg_profit：.2f}”） return val_avg_profit # 最大化奖励 # 指定 SQLite 数据库文件 db_path = 'optuna_study.db' # 运行优化研究 = optuna.create_study（ study_name='OHLC_EconomicCalender_ppo_study'，存储=f'sqlite：///{db_path}'， direction=“maximize”， load_if_exists=真 ) study.optimize（objective， n_trials=1） # 根据资源调整 Trial 次数 # 最佳参数 print（“最佳超参数：”， study.best_params） print（“最佳验证奖励：”， study.best_value） best_params = study.best_params # 使用训练集上的最佳参数训练最终模型 best_model = PPO（ CustomMultiInputPolicy 和 train_env_vec， # 使用完整的训练环境 learning_rate=best_params[“learning_rate”]， n_steps=best_params[“n_steps”]， batch_size=best_params[“batch_size”]， n_epochs=100， # 大的固定值伽玛=best_params[“伽玛”]， gae_lambda=best_params[“gae_lambda”]， clip_range=best_params[“clip_range”]， verbose=1， tensorboard_log=“./tensorboard_logs/” ) # 在验证集上提前停止回调 best_trial = study.best_trial.number eval_callback = EarlyStoppingCallback（ eval_env=val_env_vec、 eval_freq=5000， # 更频繁的评估（例如，每 5,000 个步骤） n_eval_episodes=10，耐心 = 5， min_delta=0.01， verbose=1， use_profit=True， # 使用 mean_profits best_model_save_path=f“./best_model/trial_{best_trial}/” ) # 训练最终模型 best_model.learn（total_timesteps=best_params[“total_timesteps”]， callback=eval_callback） # 保存最终模型 best_model.save（f“ppo_xauusd_optimized_trial_{best_trial}”） # 使用 QuantStats 评估测试数据 print（“\n根据测试数据评估最终模型：”） test_avg_profit = final_evaluate（best_model， test_env_vec， n_episodes=20） print（f“测试平均利润： {test_avg_profit：.2f}”） # 可选：加载并重新评估以验证 # loaded_model = PPO.load（“ppo_xauusd_optimized”） # test_avg_profit_loaded = final_evaluate（loaded_model， test_env_vec， n_episodes=10） # print（f“测试平均利润（加载的模型）：{test_avg_profit_loaded：.2f}”） # 清理 train_env_vec.close（） val_env_vec.close（） test_env_vec.close（） ``` ``` def step（self， actions）： self.current_step += 1 # 定义终止和截断条件 terminated = self.balance <= 0 # 剧集因破产而结束（最终状态） truncated = self.current_step == len（self.dt_datetime） - 1 # 由于最大步数（时间限制）而结束剧集 done = terminated or truncated # 合并为 VecEnv 的单个 'done' 标志 # 对于渲染或剧集跟踪，您仍然可以检查任一条件是否为 true 如果完成： self.done_information += f“集数： {self.episode} 平衡： {self.balance} 步数： {self.current_step}\n” self.visualization = 真 self.episode += 1 # 增加 episode 计数器 # 计算基础交易奖励 base_reward = self._take_action（作，完成） # 计算持仓的未实现利润 unrealized_profit = 0 atr_scaling = 0 # 用于市场条件缩放对于 i，枚举（self.assets）中的 asset： atr = self.get_observation（self.current_step， i， “ATR”） atr_scaling += atr # 用于标准化的资产的 ATR 总和对于 self.transaction_live 中的 TR： if tr[“Symbol”] == asset： if tr[“Type”] == 0： # 买入未实现 = （self._c - tr[“ActionPrice”]） * self.cf.symbol（asset， “point”） else： # 卖出未实现 = （tr[“ActionPrice”] - self._c） * self.cf.symbol（asset， “点”） unrealized_profit += 未实现 atr_scaling = atr_scaling / len（self.assets） if atr_scaling > 0 else 1 # 避免被 0 除以 # 持续奖励：仅适用于未实现/已实现的利润，由 ATR 缩放 # 调整 0.01 到 0.05 sustained_reward = （unrealized_profit + base_reward） * 0.01 / atr_scaling 如果self.transaction_live否则为 0 # 如果没有持仓，则对不作为的处罚如果不是 self.transaction_live 和 all（math.floor（a） == 2 for a in actions）： sustained_reward -= 0.1 # 鼓励探索的小额惩罚 total_reward = base_reward + sustained_reward 如果 self._day > self.current_day： self.current_day = self._day self.balance -= self.over_night_cash_penalty 如果 self.balance ！= 0： self.max_draw_down_pct = abs（sum（self.max_draw_downs） / self.balance * 100） OBS = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric”： self.cached_economic_data[self.current_step][“numeric”]， “portfolio_data”：np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs，数据类型=np.float32 ) } print（f“步骤 {self.current_step}：基础奖励={base_reward}，持续奖励={sustained_reward}，总计={total_reward}，余额={self.balance}”） # 信息字典保持不变 info = {“关闭”： self.tranaction_close_this_step} 返回 OBS、total_reward、完成、信息 ``` /usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py：283：DeprecationWarning：“should_run_async”将来不会自动调用“transform_cell”。请将结果传递给 'transformed_cell' 参数以及 IPython 7.17 及更高版本中 'preprocessing_exc_tuple' 转换期间发生的任何异常。和 should_run_async（代码） [我 2025-03-13 08：51：18,238]使用名称为“OHLC_EconomicCalender_ppo_study”的现有研究，而不是创建新研究。 [周 2025-03-13 08：51：18,345]试验 13 失败，参数为：{'learning_rate'： 1.164777225090814e-05， 'n_steps'： 5120， 'total_timesteps'： 1700000， 'batch_size'： 64， 'gamma'： 0.9280405388692499， 'gae_lambda'： 0.8046323591596111， 'clip_range'： 0.2605072265150349}}，因为以下错误：ValueError（'not enough values to unpack （expected 5， got 4）'）。回溯（最近调用最后）：文件 “/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py”，第 197 行，_run_trial value_or_values = func（试用）文件 “<ipython-input-60-1aea10becb13>”，第 168 行，目标 model.learn（total_timesteps=total_timesteps， callback=eval_callback）文件 “/usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py”，第 311 行，在 learn 中返回 super（）.learn（文件 “/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py”，第 323 行，在 learn 中 continue_training = self.collect_rollouts（self.env， callback， self.rollout_buffer， n_rollout_steps=self.n_steps）文件 “/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py”，第 218 行，collect_rollouts new_obs、rewards、dones、infos = env.step（clipped_actions）文件 “/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/base_vec_env.py”，第 207 行，步骤返回 self.step_wait（）文件 “/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/dummy_vec_env.py”，第 59 行，step_wait obs， self.buf_rews[env_idx]， terminated， truncated， self.buf_infos[env_idx] = self.envs[env_idx].step（ # type： ignore[assignment] ValueError：没有足够的值来解包（预期为 5，得到 4） [周 2025-03-13 08：51：18,346]试验 13 失败，值为 None。使用 cuda 设备使用设备的 CustomMultiInputPolicy：cuda：0 CustomFeaturesExtractor 使用设备：cuda：0 CustomFeaturesExtractor：n_assets=1，features_dim=96 资产 XAUUSD：作 = 0.0，奖励 = 0，持仓 = 0 第 1 步：基础奖励 = 0，持续奖励 = 0，总计 = 0，余额 = 99900 --------------------------------------------------------------------------- ValueError Traceback （最近调用最后） <cell 行中的 <ipython-input-60-1aea10becb13>：185>（） 183 load_if_exists=真 184 ) --> 185 study.optimize（objective， n_trials=1） # 根据资源调整 Trial 数量 186 187 # 最佳参数 10 帧 /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize（self， func， n_trials， timeout， n_jobs， catch， callbacks， gc_after_trial， show_progress_bar） 473 如果发生此方法的嵌套调用。 474 """ --> 475 _optimize（ 476 study=自我， 477 func=func， _optimize中的 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py（study、func、n_trials、timeout、n_jobs、catch、callbacks、gc_after_trial、show_progress_bar） 61 次尝试：如果 n_jobs == 1，则为 62： ---> 63 _optimize_sequential（ 64 项研究， 65 函数， _optimize_sequential 中的 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py（study、func、n_trials、timeout、catch、callbacks、gc_after_trial、reseed_sampler_rng、time_start、progress_bar） 158 159 次尝试： --> 160 frozen_trial = _run_trial（研究、func、catch） 161 最后： 162 # 以下行缓解了某些 _run_trial中的 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py（study， func， catch） 246 而不是 isinstance（func_err， catch） 247 ): --> 248 加注 func_err 249 返回 frozen_trial 250 _run_trial中的 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py（study， func， catch） 195 带 get_heartbeat_thread（trial._trial_id， study._storage）： 196 次尝试： --> 197 value_or_values = func（试用） 198 个，例外情况除外。TrialPruned 为 e： 199 # TODO（mamu）：处理多目标案例。 <ipython-input-60-1aea10becb13> 在目标（试用） 166 167 # print（model.policy） # 应该显示 in_features=95 的mlp_extractor --> 168 model.learn（total_timesteps=total_timesteps， callback=eval_callback） 169 170 val_avg_profit = evaluate（model， val_env_vec， n_episodes=10） # 评估函数 /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py 在 learn（self， total_timesteps， callback， log_interval， tb_log_name， reset_num_timesteps， progress_bar） 309 progress_bar： bool = False， 310 ） -> SelfPPO： --> 311 return super（）.learn（ 312 total_timesteps=total_timesteps， 313 callback=callback 中， /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py 在 learn（self， total_timesteps， callback， log_interval， tb_log_name， reset_num_timesteps， progress_bar） 321 322 同时self.num_timesteps < total_timesteps： --> 323 continue_training = self.collect_rollouts（self.env， callback， self.rollout_buffer， n_rollout_steps=self.n_steps） 324 325 如果不continue_training： collect_rollouts 中的 /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py（self， env， callback， rollout_buffer， n_rollout_steps） 216 clipped_actions = np.clip（作， self.action_space.low， self.action_space.high） 217 --> 218 new_obs、 rewards、dones、 infos = env.step（clipped_actions） 219 220 self.num_timesteps += env.num_envs /usr/local/lib/python3.10/dist-packages//stable_baselines3/common/vec_env/base_vec_env.py in step（self， actions） 205 """ 206 self.step_async（作） --> 207 返回 self.step_wait（） 208 209 def get_images（self） -> 序列[可选[np.ndarray]]： /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/dummy_vec_env.py in step_wait（个体经营） 57 # 避免循环导入 58 代表范围内env_idx （self.num_envs）： ---> 59 obs， self.buf_rews[env_idx]，终止，截断， self.buf_infos[env_idx] = self.envs[env_idx].step（ # 类型： ignore[assignment] 60 个 self.actions[env_idx] 61 ) ValueError：没有足够的值来解包（预期为 5，得到 4）

The error ValueError: not enough values to unpack (expected 5, got 4) persists because there’s still a mismatch between what your environment’s step() method returns and what Stable-Baselines3’s DummyVecEnv expects. Let’s break this down and fix it.
错误 ValueError： not enough values to unpack （expected 5， got 4）仍然存在，因为环境的 step（）方法返回的内容与 Stable-Baselines3 的 DummyVecEnv 期望的值之间仍然存在不匹配。让我们分解并修复它。

Current State 当前状态

Your step() Method: Returns 4 values: obs, total_reward, done, info. This aligns with the older Gym API where done is a single boolean combining termination and truncation.
您的 step（）方法：返回 4 个值：obs、total_reward、done、info。这与旧的 Gym API 一致，其中 done 是一个结合了终止和截断的布尔值。
Stable-Baselines3 Expectation: The error occurs in DummyVecEnv.step_wait():
Stable-Baselines3 期望：错误发生在 DummyVecEnv.step_wait（）中：
python 蟒
obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step(...)
This line shows that DummyVecEnv expects 5 values (obs, reward, terminated, truncated, info), which is the modern Gymnasium API.
此行显示 DummyVecEnv 需要 5 个值（obs， reward， terminated， truncated， info），这是现代的 Gymnasium API。
Your SB3 Version: You’ve reported 2.5.0, but the behavior suggests you might have upgraded to a pre-release (e.g., 2.6.0aX) after installing from git+https://github.com/DLR-RM/stable-baselines3.git@master. However, your environment hasn’t been updated to match this expectation.
您的 SB3 版本：您已报告 2.5.0，但该行为表明您可能在从 git+https：//github.com/DLR-RM/stable-baselines3.git@master 安装后已升级到预发行版（例如 2.6.0aX）。但是，您的环境尚未更新以符合此预期。

Why the Error Persists 错误持续存在的原因

Mismatch: Your step() returns 4 values, but after upgrading to a pre-release version of Stable-Baselines3 (post-2.5.0 from the master branch), DummyVecEnv now expects 5 values to align with Gymnasium’s API. The stable v2.5.0 expects 4 values, but the pre-release expects 5, and your code hasn’t fully adapted.
失配：你的 step（）返回 4 个值，但在升级到 Stable-Baselines3 的预发布版本（从 master 分支升级到 2.5.0 之后）后，DummyVecEnv 现在需要 5 个值与 Gymnasium 的 API 保持一致。稳定版 v2.5.0 需要 4 个值，但预发布版需要 5 个值，并且您的代码尚未完全适应。
Evaluation Functions: Your evaluate and final_evaluate functions expect 4 values (obs, rewards, done, info), which worked with v2.5.0 but fails with the pre-release expecting 5.
评估功能：您的 evaluate 和 final_evaluate 函数需要 4 个值（obs、rewards、done、info），这些值在 v2.5.0 中有效，但在预发布版中需要 5 个值时失败。

How to Fix It 怎么修

Since you’ve likely upgraded to a pre-release version of Stable-Baselines3 that supports the 5-tuple Gymnasium API (or want to use it), you need to:
由于您可能已经升级到支持 5 元组 Gymnasium API（或想要使用它）的 Stable-Baselines3 预发布版本，因此您需要：

Update your step() method to return 5 values (obs, reward, terminated, truncated, info).
更新您的 step（）方法以返回 5 个值（obs、reward、terminated、truncated、info）。
Update your evaluate and final_evaluate functions to handle 5 values from env_vec.step().
更新您的 evaluate 和 final_evaluate 函数以处理 env_vec.step（）中的 5 个值。

Step 1: Update step() to Return 5 Values
第 1 步：更新 step（）以返回 5 个值

Modify your tgym.step() method to return terminated and truncated separately, aligning with Gymnasium’s API:
修改你的 tgym.step（）方法，分别返回 terminated 和 truncated，与 Gymnasium 的 API 保持一致：

def step(self, actions):
    self.current_step += 1

    # Define termination and truncation conditions
    terminated = self.balance <= 0  # Episode ends due to bankruptcy (terminal state)
    truncated = self.current_step == len(self.dt_datetime) - 1  # Episode ends due to max steps (time limit)
    done = terminated or truncated  # Still compute for internal use or rendering

    # For rendering or episode tracking
    if done:
        self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n"
        self.visualization = True
        self.episode += 1  # Increment episode counter

    # Calculate base trading reward
    base_reward = self._take_action(actions, done)

    # Calculate unrealized profit from open positions
    unrealized_profit = 0
    atr_scaling = 0  # For market condition scaling
    for i, asset in enumerate(self.assets):
        atr = self.get_observation(self.current_step, i, "ATR")
        atr_scaling += atr  # Sum ATR across assets for normalization
        for tr in self.transaction_live:
            if tr["Symbol"] == asset:
                if tr["Type"] == 0:  # Buy
                    unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point")
                else:  # Sell
                    unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point")
                unrealized_profit += unrealized

    atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1  # Avoid division by 0

    # Sustained reward: only applies to unrealized/realized profits, scaled by ATR
    sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0
    
    # Penalty for inaction if no positions are held
    if not self.transaction_live and all(math.floor(a) == 2 for a in actions):
        sustained_reward -= 0.1  # Small penalty to encourage exploration

    total_reward = base_reward + sustained_reward

    if self._day > self.current_day:
        self.current_day = self._day
        self.balance -= self.over_night_cash_penalty

    if self.balance != 0:
        self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100)

    obs = {
        "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32),
        "event_ids": self.cached_economic_data[self.current_step]["event_ids"],
        "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"],
        "economic_numeric": self.cached_economic_data[self.current_step]["numeric"],
        "portfolio_data": np.array(
            [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs,
            dtype=np.float32
        )
    }
    print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}")

    # Info dictionary remains unchanged
    info = {"Close": self.tranaction_close_this_step}

    # Return 5 values for Gymnasium compatibility
    return obs, total_reward, terminated, truncated, info

Key Change: Replace return obs, total_reward, done, info with return obs, total_reward, terminated, truncated, info.
主要变化：将 return obs、total_reward、done、info 替换为 return obs、total_reward、terminated、truncated、info。

Step 2: Update Evaluation Functions
第 2 步：更新评估函数

Update evaluate and final_evaluate to unpack 5 values and combine terminated and truncated into done for the loop condition:
更新 evaluate 和 final_evaluate 以解压缩 5 个值，并将 terminated 和 truncated 合并为 done 作为循环条件：

Fixed evaluate Function
固定 evaluate 函数

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []  # Track actual trading profit
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0  # Sum of realized profits/losses
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, terminated, truncated, info = env_vec.step(action)  # Expect 5 values
            done = np.logical_or(terminated, truncated)  # Combine for loop condition
            episode_rewards += rewards
            
            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]  # Add realized reward (profit/loss)
        
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())  # Logs to file here
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {
        k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()
    }
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit  # Return profit instead of reward for optimization

Fixed final_evaluate Function
固定final_evaluate功能

def final_evaluate(model, env_vec, n_episodes=20, return_mean_reward=False):
    total_rewards = []
    total_profits = []
    metrics = []
    returns = []  # For QuantStats
    for ep in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        episode_returns = []   # For QuantStats
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, terminated, truncated, info = env_vec.step(action)  # Expect 5 values
            done = np.logical_or(terminated, truncated)  # Combine for loop condition
            episode_rewards += rewards
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]
                        episode_returns.append(tr["Reward"])  # Track per-trade returns
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())
        returns.extend(episode_returns if episode_returns else [episode_profit])  # Fallback to total profit
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()}
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Metrics: {avg_metrics}")
    
    # QuantStats reporting (for test evaluation)
    returns_series = pd.Series(returns, index=pd.date_range(start="2025-03-12", periods=len(returns), freq="D"))
    qs.reports.html(returns_series, output="quantstats_report.html", title="FX Trading Performance")
    
    return mean_reward if return_mean_reward else mean_profit

Key Changes: 主要变化：

Replace obs, rewards, done, info = env_vec.step(action) with obs, rewards, terminated, truncated, info = env_vec.step(action).
将 obs， rewards， done， info = env_vec.step（action）替换为 obs， rewards， terminated， truncated， info = env_vec.step（action）。
Add done = np.logical_or(terminated, truncated) to maintain the loop logic.
添加 done = np.logical_or（terminated， truncated）以维护循环逻辑。

Step 3: Verify Your SB3 Version
第 3 步：验证您的 SB3 版本

Since you’ve upgraded to the pre-release, confirm the version:
由于您已升级到预发布版，请确认版本：

import stable_baselines3
print(stable_baselines3.__version__)  # Should be 2.6.0aX or similar

If it’s still 2.5.0, the upgrade didn’t take effect. Re-run:
如果仍为 2.5.0，则表示升级未生效。重新运行：

pip install git+https://github.com/DLR-RM/stable-baselines3.git@master --force-reinstall

Why This Fixes It 为什么这会修复它

Pre-Release SB3 (e.g., 2.6.0aX): The master branch updates DummyVecEnv to expect 5 values from step(), matching Gymnasium’s API. Your original 4-value return caused the unpack error.
预发布版 SB3（例如 2.6.0aX）：master 分支更新 DummyVecEnv 以期望来自 step（）的 5 个值，与 Gymnasium 的 API 匹配。您最初的 4 值返回导致了解压缩错误。
Alignment: By updating step() to return 5 values and adjusting your evaluation functions, you align with the pre-release’s expectations.
对准：通过更新 step（）以返回 5 个值并调整评估函数，您可以与预发布版的期望保持一致。

Additional Notes 其他说明

Stable v2.5.0: If you’re still on 2.5.0 (not pre-release), it expects 4 values, and you’d need to revert step() to 4 values and downgrade:
稳定版 v2.5.0：如果您仍在使用 2.5.0（非预发布版），它需要 4 个值，您需要将 step（）恢复为 4 个值并降级：
bash 重击
pip install stable-baselines3==2.5.0
But since you want 5-tuple support, the pre-release is the right choice.
但是，由于您需要 5 元组支持，因此预发布版是正确的选择。
Gymnasium Version: Ensure compatibility:
体育馆版：确保兼容性：
python 蟒
import gymnasium print(gymnasium.__version__) # Should be >= 0.29.1

Verify the Fix 验证修复

Update Code: Apply the changes to step(), evaluate, and final_evaluate.
更新代码：将更改应用于 step（）、evaluate 和 final_evaluate。
Run Training: 运行训练：
python 蟒
study.optimize(objective, n_trials=1)
Check Output: Ensure no ValueError occurs and training proceeds.
检查输出：确保没有发生 ValueError 并继续训练。

If the error persists, share the updated SB3 version and any new error messages. This should resolve the issue based on your pre-release upgrade intent!
如果错误仍然存在，请共享更新的 SB3 版本和任何新的错误消息。这应该会根据您的预发布升级意图解决问题！

no, my stable-baseline3 still is 2.5.0. not is the gymnasium need 5 value or stable-baseline3 need 5 value? import gymnasium print(gymnasium.__version__) 1.0.0 in this code, can use use if detect 4 value use "obs, rewards, done, info = env_vec.step(action)" else if detect 5 value use "obs, rewards, terminated, truncated, info = env_vec.step(action)"? ``` # Evaluation function def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False): total_rewards = [] total_profits = [] # Track actual trading profit metrics = [] for _ in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 # Sum of realized profits/losses step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=True) obs, rewards, done, info = env_vec.step(action) # obs, rewards, terminated, truncated, info = env_vec.step(action) # done = np.logical_or(terminated, truncated) episode_rewards += rewards # Extract profit from closed transactions for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] # Add realized reward (profit/loss) step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) # Logs to file here mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) avg_metrics = { k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys() } print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Average Metrics: {avg_metrics}") return mean_reward if return_mean_reward else mean_profit # Return profit instead of reward for optimization ```

2 / 2

Let’s clarify the situation and address your questions based on the current versions and your code.

Current Versions

Stable-Baselines3 (SB3): You’re using version 2.5.0, which is the latest stable release as of January 27, 2025.
Gymnasium: You’re using version 1.0.0, which fully implements the 5-tuple step() return (obs, reward, terminated, truncated, info).

Who Expects 5 Values?

Gymnasium 1.0.0: The step() method of a raw Gymnasium environment returns 5 values (obs, reward, terminated, truncated, info). This is the modern standard introduced in Gymnasium 0.26.0+ to distinguish between termination (e.g., goal reached or failure) and truncation (e.g., time limit).
Stable-Baselines3 2.5.0: SB3 v2.5.0 supports Gymnasium (minimum 0.29.1, up to 1.0.0), but its vectorized environments (VecEnv, like DummyVecEnv) are designed to work with the older 4-tuple API (obs, reward, done, info). Internally, SB3 combines terminated and truncated into a single done signal when processing steps through VecEnv. However, the error you’re seeing (expected 5, got 4) comes from DummyVecEnv.step_wait(), which attempts to unpack 5 values from your environment’s step():
python
obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step(...)
This indicates that SB3 v2.5.0’s DummyVecEnv expects your raw environment’s step() to return 5 values (Gymnasium-style), even though SB3’s broader logic processes it as a 4-tuple downstream.

Why the Error?

Your step() Method: Returns 4 values (obs, total_reward, done, info), which matches the older Gym API and what SB3 v2.5.0 should theoretically handle.
Mismatch: The DummyVecEnv in SB3 v2.5.0 expects 5 values from the raw environment’s step() because it’s designed to be compatible with Gymnasium’s API. When your step() returns only 4 values, the unpack fails, causing the ValueError.

Should You Use 4 or 5 Values?

SB3 v2.5.0 Expectation: Despite supporting Gymnasium, SB3 v2.5.0’s stable release assumes your environment’s step() returns 5 values (Gymnasium-style), which VecEnv then processes into a 4-tuple (obs, rewards, dones, infos) for internal use. This is a compatibility layer issue: DummyVecEnv directly calls your environment’s step() and expects 5 values, but your step() provides 4.
Conclusion: You need to update your step() to return 5 values to match what DummyVecEnv expects in SB3 v2.5.0 when used with Gymnasium 1.0.0.

Can You Detect 4 vs. 5 Values Dynamically?

Yes, you can modify your evaluate function to handle both 4-value and 5-value returns dynamically. This is useful for compatibility across different SB3 versions or environments. Here’s how:

Updated evaluate Function with Dynamic Unpacking

import numpy as np
import gymnasium as gym
import stable_baselines3 as sb3

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []  # Track actual trading profit
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0  # Sum of realized profits/losses
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            # Attempt to step and handle 4 or 5 return values
            step_result = env_vec.step(action)
            if len(step_result) == 4:
                # 4-tuple: obs, rewards, done, info
                obs, rewards, done, info = step_result
            elif len(step_result) == 5:
                # 5-tuple: obs, rewards, terminated, truncated, info
                obs, rewards, terminated, truncated, info = step_result
                done = np.logical_or(terminated, truncated)
            else:
                raise ValueError(f"Unexpected number of values returned from step(): {len(step_result)}")
            
            episode_rewards += rewards
            
            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]  # Add realized reward (profit/loss)
        
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())  # Logs to file here
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {
        k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()
    }
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit  # Return profit instead of reward for optimization

# Verify versions
print("Stable-Baselines3 version:", sb3.__version__)  # Should be 2.5.0
print("Gymnasium version:", gym.__version__)      # Should be 1.0.0

Key Changes: 主要变化：

Dynamic Unpacking: Use step_result = env_vec.step(action) and check len(step_result) to determine if it’s 4 or 5 values.
动态解包：使用 step_result = env_vec.step（action）并检查 len（step_result）以确定它是 4 个值还是 5 个值。
4-Value Case: Directly unpack as obs, rewards, done, info.
4 值大小写：直接解压为 obs、rewards、done、info。
5-Value Case: Unpack as obs, rewards, terminated, truncated, info and compute done = np.logical_or(terminated, truncated) for the loop condition.
5 值大小写：对于循环条件，解压缩为 obs、rewards、terminated、truncated、info 并计算完成 = np.logical_or（terminated， truncated）。
Error Handling: Raise an exception if the number of values is neither 4 nor 5.
错误处理：如果值的数量既不是 4 也不是 5，则引发异常。

Fixing the Root Cause 修复根本原因

While dynamic unpacking makes your code flexible, the error occurs because your step() method returns 4 values, but SB3 v2.5.0’s DummyVecEnv expects 5 with Gymnasium 1.0.0. To fix this properly:
虽然动态解包使您的代码变得灵活，但出现错误是因为您的 step（）方法返回 4 个值，但 SB3 v2.5.0 的 DummyVecEnv 在 Gymnasium 1.0.0 中期望返回 5 个值。要正确修复此问题，请执行以下作：

Update step() to Return 5 Values
更新 step（）以返回 5 个值

Modify your step() method to return 5 values, matching Gymnasium’s API and SB3 v2.5.0’s DummyVecEnv expectation:
修改你的 step（）方法以返回 5 个值，匹配 Gymnasium 的 API 和 SB3 v2.5.0 的 DummyVecEnv 期望：

def step(self, actions):
    self.current_step += 1

    # Define termination and truncation conditions
    terminated = self.balance <= 0  # Episode ends due to bankruptcy (terminal state)
    truncated = self.current_step == len(self.dt_datetime) - 1  # Episode ends due to max steps (time limit)
    done = terminated or truncated  # For internal use or rendering

    # For rendering or episode tracking
    if done:
        self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n"
        self.visualization = True
        self.episode += 1  # Increment episode counter

    # Calculate base trading reward
    base_reward = self._take_action(actions, done)

    # Calculate unrealized profit from open positions
    unrealized_profit = 0
    atr_scaling = 0  # For market condition scaling
    for i, asset in enumerate(self.assets):
        atr = self.get_observation(self.current_step, i, "ATR")
        atr_scaling += atr  # Sum ATR across assets for normalization
        for tr in self.transaction_live:
            if tr["Symbol"] == asset:
                if tr["Type"] == 0:  # Buy
                    unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point")
                else:  # Sell
                    unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point")
                unrealized_profit += unrealized

    atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1  # Avoid division by 0
    sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0
    
    if not self.transaction_live and all(math.floor(a) == 2 for a in actions):
        sustained_reward -= 0.1  # Small penalty to encourage exploration

    total_reward = base_reward + sustained_reward

    if self._day > self.current_day:
        self.current_day = self._day
        self.balance -= self.over_night_cash_penalty

    if self.balance != 0:
        self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100)

    obs = {
        "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32),
        "event_ids": self.cached_economic_data[self.current_step]["event_ids"],
        "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"],
        "economic_numeric": self.cached_economic_data[self.current_step]["numeric"],
        "portfolio_data": np.array(
            [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs,
            dtype=np.float32
        )
    }
    print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}")

    info = {"Close": self.tranaction_close_this_step}

    # Return 5 values for Gymnasium compatibility
    return obs, total_reward, terminated, truncated, info

Update evaluate to Expect 5 Values
将 evaluate 更新为 Expect 5 Values

If you update step() to return 5 values, adjust evaluate (and final_evaluate) to match:
如果您更新 step（）以返回 5 个值，请调整 evaluate（和 final_evaluate）以匹配：

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []  # Track actual trading profit
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0  # Sum of realized profits/losses
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, terminated, truncated, info = env_vec.step(action)  # Expect 5 values
            done = np.logical_or(terminated, truncated)  # Combine for loop
            episode_rewards += rewards
            
            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]  # Add realized reward (profit/loss)
        
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())  # Logs to file here
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {
        k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()
    }
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit  # Return profit instead of reward for optimization

Recommendation 建议

Best Fix: Update your step() to return 5 values (obs, total_reward, terminated, truncated, info) to match Gymnasium 1.0.0 and SB3 v2.5.0’s DummyVecEnv expectation. Then, update evaluate and final_evaluate to handle 5 values consistently. This avoids the need for dynamic detection and resolves the error directly.
最佳修复：更新你的 step（）以返回 5 个值（obs、total_reward、terminated、truncated、info）以匹配 Gymnasium 1.0.0 和 SB3 v2.5.0 的 DummyVecEnv 期望。然后，更新 evaluate 和 final_evaluate 以一致地处理 5 个值。这样就无需进行动态检测，并直接解决错误。
Dynamic Approach: If you want flexibility across SB3 versions (e.g., 2.5.0 vs. pre-release), use the dynamic unpacking version. However, since you’re on v2.5.0 with Gymnasium 1.0.0, aligning with 5 values is cleaner and future-proof.
动态方法：如果您希望在 SB3 版本之间具有灵活性（例如，2.5.0 与预发布版），请使用动态解包版本。但是，由于您使用的是 v2.5.0 和 Gymnasium 1.0.0，因此与 5 个值对齐更简洁且面向未来。

Verify the Fix 验证修复

Update step(): Use the 5-value version above.
更新 step（）：使用上面的 5 值版本。
Update evaluate: Use the 5-value version (or dynamic version if preferred).
更新评估：使用 5 值版本（如果愿意，请使用 dynamic version）。
Run Training: 运行训练：
python 蟒
study.optimize(objective, n_trials=1)
Check Output: Ensure no ValueError occurs.
检查输出：确保没有发生 ValueError。

If you stick with v2.5.0 and don’t want to upgrade, the 5-value step() is still required due to DummyVecEnv’s behavior with Gymnasium. Let me know if you need help applying this to final_evaluate or encounter further issues!
如果你坚持使用 v2.5.0 并且不想升级，由于 DummyVecEnv 对 Gymnasium 的行为，仍然需要 5 个值的 step（）。如果您需要帮助将其应用于final_evaluate或遇到进一步的问题，请告诉我！

but using this "Detect 4 vs. 5 Values Dynamically", will the EnvVec know the done is combine of (terminated, truncated)? or EnvVec will not use this done, will it affect how the EnvVec work? becasue i am worry that the step return 5 value obs, rewards, terminated, truncated, info, but the EnvVec only detected the obs, rewards, terminated, info, not count on truncated or only detect obs, rewards, terminated, truncated ignore info, make it run wrongly. will this happen? or EnvVec only use the step() in my evaluate()?

Your concerns are valid, and I’ll address them step-by-step to clarify how the dynamic detection approach interacts with VecEnv (specifically DummyVecEnv in Stable-Baselines3) and whether it could lead to issues with how terminated and truncated are handled. Let’s break this down:

Recap of the Situation

Your Current Setup:
- Stable-Baselines3 (SB3) version: 2.5.0.
- Gymnasium version: 1.0.0.
- Your step() returns: 4 values (obs, total_reward, done, info).
- Error: DummyVecEnv.step_wait() expects 5 values (obs, reward, terminated, truncated, info), causing ValueError: not enough values to unpack (expected 5, got 4).
Dynamic Detection Proposal:
python
step_result = env_vec.step(action) if len(step_result) == 4: obs, rewards, done, info = step_result elif len(step_result) == 5: obs, rewards, terminated, truncated, info = step_result done = np.logical_or(terminated, truncated) else: raise ValueError(f"Unexpected number of values returned from step(): {len(step_result)}")

How VecEnv Works in SB3

Stable-Baselines3’s VecEnv (e.g., DummyVecEnv):
- When you call env_vec.step(action) in your evaluate() function, it triggers DummyVecEnv.step_wait(), which iterates over each environment in the vector and calls their individual step() methods:
  python
  # From stable_baselines3/common/vec_env/dummy_vec_env.py (v2.5.0) def step_wait(self): for env_idx in range(self.num_envs): obs, self.buf_rews[env_idx], terminated, truncated, self.buf_infos[env_idx] = self.envs[env_idx].step( self.actions[env_idx] ) # ... (handles reset if terminated or truncated) return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones), deepcopy(self.buf_infos))
- Key Point: DummyVecEnv expects your environment’s step() to return 5 values (obs, reward, terminated, truncated, info). It then combines terminated and truncated into self.buf_dones (a single boolean array) and returns a 4-tuple to the caller: (obs, rewards, dones, infos).
Your evaluate() Function:
- When you call env_vec.step(action), you’re interacting with the VecEnv API, not your raw environment’s step() directly. The VecEnv.step() method processes the raw environment’s 5-value return and always returns a 4-tuple (obs, rewards, dones, infos) to your code, regardless of what your step() returns internally.

Does Dynamic Detection Affect VecEnv’s done Handling?

Short Answer: No, the dynamic detection in evaluate() doesn’t affect how VecEnv internally handles terminated and truncated. Here’s why:
- Inside VecEnv: When DummyVecEnv.step_wait() calls your environment’s step(), it expects 5 values. If your step() returns 4 values (obs, total_reward, done, info), it fails before your evaluate() code even runs (as seen in the error). If it returns 5 values (obs, total_reward, terminated, truncated, info), VecEnv processes them correctly, combines terminated and truncated into dones, and passes a 4-tuple back to your evaluate().
- Your Dynamic Code: The if len(step_result) == 4 or elif len(step_result) == 5 logic only applies to the tuple returned by env_vec.step(), which is always 4 values in SB3 v2.5.0. Your dynamic detection won’t see 5 values because VecEnv standardizes the output to 4.
Conclusion: The done = np.logical_or(terminated, truncated) line in your dynamic detection will never execute with SB3 v2.5.0 because env_vec.step() always returns 4 values (obs, rewards, dones, infos), where dones is already the combined result of terminated or truncated from VecEnv.

Will VecEnv Ignore truncated or info?

Your Worry: If step() returns 5 values (obs, rewards, terminated, truncated, info), could VecEnv ignore truncated or info, causing incorrect behavior?
Answer: No, this won’t happen with SB3 v2.5.0, provided your step() returns 5 values correctly:
- truncated Handling: DummyVecEnv uses both terminated and truncated to set self.buf_dones (via terminated or truncated) and to decide whether to reset the environment. It won’t ignore truncated.
- info Handling: DummyVecEnv stores info in self.buf_infos and returns it in the 4-tuple as the last element. Your evaluate() code accesses info[env_idx]["Close"], so it’s fully utilized.
- No Partial Detection: VecEnv doesn’t “detect” only some values and ignore others—it either unpacks all 5 correctly from your step() or raises an error if the count mismatches (as it does now with 4 values).
Current Issue: Your step() returns 4 values, so VecEnv fails before returning anything to evaluate(). It’s not ignoring truncated or info—it’s crashing because it expects 5 values and gets 4.

Does VecEnv Use step() Only in evaluate()?

No: VecEnv uses your environment’s step() method in two main contexts:
1. Training (via model.learn()): During PPO.learn(), collect_rollouts() calls env.step() (a VecEnv method), which triggers your step() via DummyVecEnv.step_wait(). This is where the error originates in your traceback:
  python
  new_obs, rewards, dones, infos = env.step(clipped_actions) # In collect_rollouts
2. Evaluation (via evaluate()): Your evaluate() function also calls env_vec.step(), which again uses your step() via DummyVecEnv.
Impact: The error occurs during training, not just evaluation, because VecEnv expects 5 values everywhere it calls your step(). Fixing it in evaluate() alone won’t help—you need to fix the root cause in your environment’s step().

Recommended Fix: Align step() with 5 Values

Since you’re using SB3 v2.5.0 with Gymnasium 1.0.0, and DummyVecEnv expects 5 values from your step(), update your environment to return 5 values consistently. Then, update evaluate() to expect the 4-value output from VecEnv. Here’s how:

Updated step() (5 Values)

def step(self, actions):
    self.current_step += 1

    # Define termination and truncation conditions
    terminated = self.balance <= 0  # Episode ends due to bankruptcy
    truncated = self.current_step == len(self.dt_datetime) - 1  # Episode ends due to max steps
    done = terminated or truncated  # For internal use or rendering

    # For rendering or episode tracking
    if done:
        self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n"
        self.visualization = True
        self.episode += 1

    # Calculate base trading reward
    base_reward = self._take_action(actions, done)

    # Calculate unrealized profit from open positions
    unrealized_profit = 0
    atr_scaling = 0
    for i, asset in enumerate(self.assets):
        atr = self.get_observation(self.current_step, i, "ATR")
        atr_scaling += atr
        for tr in self.transaction_live:
            if tr["Symbol"] == asset:
                if tr["Type"] == 0:  # Buy
                    unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point")
                else:  # Sell
                    unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point")
                unrealized_profit += unrealized

    atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1
    sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0
    
    if not self.transaction_live and all(math.floor(a) == 2 for a in actions):
        sustained_reward -= 0.1

    total_reward = base_reward + sustained_reward

    if self._day > self.current_day:
        self.current_day = self._day
        self.balance -= self.over_night_cash_penalty

    if self.balance != 0:
        self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100)

    obs = {
        "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32),
        "event_ids": self.cached_economic_data[self.current_step]["event_ids"],
        "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"],
        "economic_numeric": self.cached_economic_data[self.current_step]["numeric"],
        "portfolio_data": np.array(
            [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs,
            dtype=np.float32
        )
    }
    print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}")

    info = {"Close": self.tranaction_close_this_step}

    # Return 5 values for Gymnasium compatibility
    return obs, total_reward, terminated, truncated, info

Updated evaluate() (Expects 4 Values from VecEnv)

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, done, info = env_vec.step(action)  # VecEnv returns 4 values
            episode INTERNAL_LINKrewards += rewards
            
            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]
        
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()}
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit

Why Dynamic Detection Isn’t Needed Here

SB3 v2.5.0 Behavior: With your current setup, env_vec.step() will always return 4 values to evaluate(), even if your step() returns 5 values internally. The dynamic detection is overkill because:
SB3 v2.5.0 行为：在你当前的设置中，env_vec.step（）将始终返回 4 个值给 evaluate（），即使你的 step（）内部返回 5 个值。动态检测有点过头了，因为：
- If step() returns 4 values, VecEnv fails before evaluate() gets the result (current error).
  如果 step（）返回 4 个值，则 VecEnv 在 evaluate（）获取结果之前失败（当前错误）。
- If step() returns 5 values, VecEnv processes it and returns 4 values consistently.
  如果 step（）返回 5 个值，则 VecEnv 会对其进行处理并始终返回 4 个值。
No Risk of Ignoring Values: Once step() returns 5 values, VecEnv handles terminated, truncated, and info correctly and won’t “ignore” them.
没有忽略值的风险：一旦 step（）返回 5 个值，VecEnv 就会正确处理 terminated、truncated 和 info，并且不会 “忽略” 它们。

Final Recommendation 最终建议

Fix the Root Cause: Update your step() to return 5 values as shown above. Keep evaluate() simple with the 4-value unpack, since that’s what VecEnv provides.
修复根本原因：更新你的 step（）以返回 5 个值，如上所示。使用 4 值解包保持 evaluate（）简单，因为这就是 VecEnv 提供的。
Avoid Dynamic Detection: It’s unnecessary here and could confuse maintenance, as VecEnv standardizes the output.
避免动态检测：这里没有必要，并且可能会混淆维护，因为 VecEnv 对输出进行了标准化。

Verify the Fix 验证修复

Update your step() to return 5 values.
更新您的 step（）以返回 5 个值。
Use the 4-value evaluate() version.
使用 4 值 evaluate（）版本。
Run: 跑：
python 蟒
study.optimize(objective, n_trials=1)
Check for errors. 检查错误。

This ensures VecEnv works correctly with your Gymnasium-compatible environment and avoids any risk of mishandling truncated or info. Let me know if you need the same fix for final_evaluate() or encounter issues!
这可确保 VecEnv 与您的 Gymnasium 兼容环境正常工作，并避免任何错误处理截断或信息的风险。如果您需要对 final_evaluate（）进行相同的修复或遇到问题，请告诉我！

use which code for evaluate()? ``` def step(self, actions): self.current_step += 1 # Define termination and truncation conditions terminated = self.balance <= 0 # Episode ends due to bankruptcy (terminal state) truncated = self.current_step == len(self.dt_datetime) - 1 # Episode ends due to max steps (time limit) done = terminated or truncated # Combine into a single 'done' flag for VecEnv # For rendering or episode tracking, you might still check if either condition is true if done: self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n" self.visualization = True self.episode += 1 # Increment episode counter # Calculate base trading reward base_reward = self._take_action(actions, done) # Calculate unrealized profit from open positions unrealized_profit = 0 atr_scaling = 0 # For market condition scaling for i, asset in enumerate(self.assets): atr = self.get_observation(self.current_step, i, "ATR") atr_scaling += atr # Sum ATR across assets for normalization for tr in self.transaction_live: if tr["Symbol"] == asset: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1 # Avoid division by 0 # Sustained reward: only applies to unrealized/realized profits, scaled by ATR # adjust 0.01 to 0.05 sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0 # Penalty for inaction if no positions are held if not self.transaction_live and all(math.floor(a) == 2 for a in actions): sustained_reward -= 0.1 # Small penalty to encourage exploration total_reward = base_reward + sustained_reward if self._day > self.current_day: self.current_day = self._day self.balance -= self.over_night_cash_penalty if self.balance != 0: self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100) obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}") # Info dictionary remains unchanged info = {"Close": self.tranaction_close_this_step} ``` code 1: ``` step_result = env_vec.step(action) if len(step_result) == 4: obs, rewards, done, info = step_result elif len(step_result) == 5: obs, rewards, terminated, truncated, info = step_result done = np.logical_or(terminated, truncated) else: raise ValueError(f"Unexpected number of values returned from step(): {len(step_result)}") ``` code 2: ``` def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False): total_rewards = [] total_profits = [] metrics = [] for _ in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=True) obs, rewards, done, info = env_vec.step(action) # VecEnv returns 4 values episode INTERNAL_LINKrewards += rewards # Extract profit from closed transactions for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()} print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Average Metrics: {avg_metrics}") return mean_reward if return_mean_reward else mean_profit ```

2 / 2

Given your updated step() method and your current setup (Stable-Baselines3 v2.5.0 and Gymnasium 1.0.0), let’s determine which evaluate() code to use—Code 1 (dynamic detection) or Code 2 (static 4-value unpack)—and why.

Analysis of Your step() Method

Your provided step() method still returns 4 values:

return obs, total_reward, done, info

Current Behavior: This matches the older Gym API, where done is a single boolean combining terminated and truncated.
Problem: Stable-Baselines3 v2.5.0’s DummyVecEnv.step_wait() expects your environment’s step() to return 5 values (obs, reward, terminated, truncated, info) when used with Gymnasium 1.0.0. This mismatch causes the error:
text
ValueError: not enough values to unpack (expected 5, got 4)
Fix Needed: You should update step() to return 5 values to align with DummyVecEnv’s expectation. However, since you’ve shared a 4-value version, I’ll address both scenarios.

Comparing Code 1 and Code 2

Code 1: Dynamic Detection

step_result = env_vec.step(action)
if len(step_result) == 4:
    obs, rewards, done, info = step_result
elif len(step_result) == 5:
    obs, rewards, terminated, truncated, info = step_result
    done = np.logical_or(terminated, truncated)
else:
    raise ValueError(f"Unexpected number of values returned from step(): {len(step_result)}")

What It Does: Checks the number of values returned by env_vec.step() and unpacks accordingly:
- 4 values: obs, rewards, done, info.
- 5 values: obs, rewards, terminated, truncated, info, then computes done.
Applicability: Useful if env_vec.step() could return either 4 or 5 values, providing flexibility across SB3 versions or environments.
Reality with SB3 v2.5.0: env_vec.step() (from DummyVecEnv) always returns 4 values (obs, rewards, dones, infos) to the caller, regardless of what your raw step() returns internally. The 5-value expectation is only at the raw environment level inside DummyVecEnv.step_wait(). Thus, the elif len(step_result) == 5 branch will never execute in your current setup.

Code 2: Static 4-Value Unpack

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, done, info = env_vec.step(action)  # VecEnv returns 4 values
            episode_rewards += rewards  # Fixed typo: episode INTERNAL_LINKrewards -> episode_rewards
            
            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]
        
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()}
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit

What It Does: Assumes env_vec.step() returns exactly 4 values (obs, rewards, done, info), which is the standard output of VecEnv in SB3 v2.5.0.
Applicability: Matches the actual behavior of DummyVecEnv.step() in your version, which processes your raw step() output (5 values expected) and returns a 4-tuple to the caller.

Which Code to Use?

If You Keep step() Returning 4 Values

Problem: Your current step() returns 4 values, which causes the error in DummyVecEnv.step_wait() because it expects 5. Neither Code 1 nor Code 2 will work yet, as the error occurs before evaluate() gets the result.
Solution: You must update step() to return 5 values first (see below). Keeping it at 4 values isn’t viable with SB3 v2.5.0 and Gymnasium 1.0.0.

If You Update step() to Return 5 Values

Here’s the corrected step():

def step(self, actions):
    self.current_step += 1

    # Define termination and truncation conditions
    terminated = self.balance <= 0  # Episode ends due to bankruptcy
    truncated = self.current_step == len(self.dt_datetime) - 1  # Episode ends due to max steps
    done = terminated or truncated  # For internal use or rendering

    # For rendering or episode tracking
    if done:
        self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n"
        self.visualization = True
        self.episode += 1

    # Calculate base trading reward
    base_reward = self._take_action(actions, done)

    # Calculate unrealized profit from open positions
    unrealized_profit = 0
    atr_scaling = 0
    for i, asset in enumerate(self.assets):
        atr = self.get_observation(self.current_step, i, "ATR")
        atr_scaling += atr
        for tr in self.transaction_live:
            if tr["Symbol"] == asset:
                if tr["Type"] == 0:  # Buy
                    unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point")
                else:  # Sell
                    unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point")
                unrealized_profit += unrealized

    atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1
    sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0
    
    if not self.transaction_live and all(math.floor(a) == 2 for a in actions):
        sustained_reward -= 0.1

    total_reward = base_reward + sustained_reward

    if self._day > self.current_day:
        self.current_day = self._day
        self.balance -= self.over_night_cash_penalty

    if self.balance != 0:
        self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100)

    obs = {
        "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32),
        "event_ids": self.cached_economic_data[self.current_step]["event_ids"],
        "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"],
        "economic_numeric": self.cached_economic_data[self.current_step]["numeric"],
        "portfolio_data": np.array(
            [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs,
            dtype=np.float32
        )
    }
    print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}")

    info = {"Close": self.tranaction_close_this_step}

    # Return 5 values for Gymnasium compatibility
    return obs, total_reward, terminated, truncated, info

With 5-Value step():
- DummyVecEnv.step_wait() expects and processes 5 values from your step(), combines terminated and truncated into dones, and returns 4 values (obs, rewards, dones, infos) to evaluate().
- Code 2 is correct because it matches the 4-value output of env_vec.step().
- Code 1 is unnecessary since env_vec.step() will never return 5 values in SB3 v2.5.0—only your raw step() needs to provide 5 values.

Recommendation: Use Code 2

Why Code 2?
- After updating step() to return 5 values, env_vec.step() will consistently return 4 values (obs, rewards, done, info) to evaluate(), as designed in SB3 v2.5.0.
- Code 2 correctly unpacks this 4-tuple without extra complexity.
- Code 1’s dynamic detection is redundant because VecEnv standardizes the output to 4 values, and the 5-value case won’t occur at the evaluate() level.
Steps to Implement:
1. Replace your step() return with the 5-value version: return obs, total_reward, terminated, truncated, info.
2. Use Code 2 for evaluate() (with the typo episode INTERNAL_LINKrewards fixed to episode_rewards).
3. Apply the same 4-value unpack to final_evaluate() if needed.

Final evaluate() Code

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, done, info = env_vec.step(action)  # VecEnv returns 4 values
            episode_rewards += rewards
            
            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]
        
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()}
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit

Verify the Fix

Update step() to return 5 values.
Use the above evaluate() (Code 2).
Run:
python
study.optimize(objective, n_trials=1)
Check for errors.

This will resolve the ValueError and ensure VecEnv processes terminated and truncated correctly internally, while evaluate() handles the standardized 4-value output. Let me know if you need the same adjustment for final_evaluate()!

explain what is the action return value? ``` class CustomFeaturesExtractor(BaseFeaturesExtractor): def __init__(self, observation_space): n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2 ohlc_dim = observation_space.spaces["ohlc_data"].shape[0] max_events = observation_space.spaces["event_ids"].shape[0] economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0] portfolio_dim = observation_space.spaces["portfolio_data"].shape[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # Check if CUDA is available, else use CPU self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomFeaturesExtractor using device: {self.device}") super().__init__(observation_space, features_dim=features_dim) # Move embeddings to the selected device self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device) self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device) print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}") def forward(self, obs): ohlc_data = obs["ohlc_data"].to(self.device) event_ids = obs["event_ids"].to(self.device, dtype=torch.long) currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long) economic_numeric = obs["economic_numeric"].to(self.device) portfolio_data = obs["portfolio_data"].to(self.device) event_emb = self.event_embedding(event_ids).mean(dim=1) currency_emb = self.currency_embedding(currency_ids).mean(dim=1) features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1) return features class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Check if CUDA is available, else use CPU device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomMultiInputPolicy using device: {device}") # Extract action space bounds and move them to the selected device self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=device ).to(device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities of the unsquashed actions log_prob = distribution.log_prob(actions) # Use log_prob on the unsquashed actions # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return squashed_actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Unsquash the actions back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob(unsquashed_actions) # Use log_prob on unsquashed actions entropy = distribution.entropy() # Entropy doesn’t need actions values = self.value_net(latent_vf) return values, log_prob, entropy ```
解释什么是 action return value？ ``` class CustomFeaturesExtractor（BaseFeaturesExtractor）： def __init__（self， observation_space）： n_assets = （observation_space.spaces[“portfolio_data”].shape[0] - 3） // 2 ohlc_dim = observation_space.spaces[“ohlc_data”].shape[0] max_events = observation_space.spaces[“event_ids”].shape[0] economic_numeric_dim = observation_space.spaces[“economic_numeric”].shape[0] portfolio_dim = observation_space.spaces[“portfolio_data”].形状[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # 检查 CUDA 是否可用，否则使用 CPU self.device = torch.device（“cuda：0” if torch.cuda.is_available（） else “cpu”） print（f“CustomFeaturesExtractor using device： {self.device}”） super（）.__init__（observation_space， features_dim=features_dim） # 将 embedding 移动到所选设备 self.event_embedding = nn.Embedding（num_embeddings=129， embedding_dim=max_events）.to（self.device） self.currency_embedding = nn。Embedding（num_embeddings=6， embedding_dim=max_events）.to（self.device） print（f“自定义功能提取器： n_assets={n_assets}， features_dim={features_dim}”） def forward（self， obs）： ohlc_data = obs[“ohlc_data”].to（self.device） event_ids = obs[“event_ids”].to（self.device， dtype=torch.long） currency_ids = obs[“currency_ids”].to（self.device， dtype=torch.long） economic_numeric = obs[“economic_numeric”].to（self.device） portfolio_data = obs[“portfolio_data”].to（self.device） event_emb = self.event_embedding（event_ids）.均值（dim=1） currency_emb = self.currency_embedding（currency_ids）.mean（dim=1）特征 = torch.cat（[ohlc_data， event_emb， currency_emb， economic_numeric， portfolio_data]， dim=1）返回功能类 CustomMultiInputPolicy（ActorCriticPolicy）： def __init__（self， observation_space， action_space， lr_schedule， *args， **kwargs）： # 检查 CUDA 是否可用，否则使用 CPU 设备 = torch.device（“cuda：0” if torch.cuda.is_available（） else “cpu”） print（f“CustomMultiInputPolicy using device： {device}”） # 提取动作空间边界并将其移动到所选设备 self.action_space_low = torch.tensor （action_space.low， dtype =torch.float32， device=device） self.action_space_high = torch.tensor （action_space.high， dtype =torch.float32， device=device） action_dim = action_space.shape[0] # 资产数量 super（）.__init__（ observation_space， action_space， lr_schedule， features_extractor_class=CustomFeaturesExtractor， features_extractor_kwargs={}， net_arch=dict（pi=[64， 64]， vf=[64， 64]）， *args、 **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor（ features_dim， net_arch=self.net_arch， activation_fn=nn.ReLU ， device=设备）.to（设备） # 定义动作网络以输出高斯的均值和log_std self.action_net = nn.Linear（64， action_dim * 2）.to（device） # 输出每个资产的平均值和log_std self.value_net = nn.线性（64， 1）.to（设备） # 初始化发行版 self.action_dist = SquashedDiagGaussianDistribution （action_dim） self.num_timesteps = 0 def forward（self， obs， deterministic=False）： # 在每次前向传递上增加时间步长 self.num_timesteps += 1 # 提取特征特征 = self.extract_features（OBS） latent_pi，latent_vf = self.mlp_extractor（特征） # 从 action_net 获取 mean 和 log_std action_params = self.action_net（latent_pi） # [批量， n_assets * 2] mean_actions， log_std = action_params.chunk（2， dim=-1） # 拆分为均值和log_std log_std = torch.clamp（log_std， min=-20， max=2） # 稳定log_std # 使用当前参数创建新的 distribution 实例分布 = SquashedDiagGaussianDistribution（self.action_dist.action_dim） # 示例作或获取确定性作作 = distribution.actions_from_params（mean_actions， log_std， deterministic=deterministic） # 从 [-1， 1] 到 [0， 3] 的映射 squashed_actions = self._squash_to_range（作、self.action_space_low、self.action_space_high） # 计算未压缩的 action 的对数概率 log_prob = distribution.log_prob（actions） # 对未压缩的动作使用log_prob # 价值预测值 = self.value_net（latent_vf）如果 self.num_timesteps % 1000 == 0： print（f“步骤 {self.num_timesteps}，训练： {self.training}，作： {squashed_actions}，平均值： {mean_actions.mean（）}，对数标准： {log_std.mean（）}”）返回 squashed_actions、值 log_prob def _squash_to_range（self， actions， low， high）： “”“将压缩的作从 [-1， 1] 缩放到 [low， high]。”“”” 返回（作数 + 1） * （最高价 - 最低价） / 2 + 最低价 def extract_features（自身，OBS）：返回self.features_extractor （OBS） def predict_values（自身， OBS）：特征 = self.extract_features（OBS） _， latent_vf = self.mlp_extractor（特征）返回 self.value_net（latent_vf） def evaluate_actions（自我、OBS、作）： “”“评估训练作（由 PPO 使用）。”“” 特征 = self.extract_features（OBS） latent_pi，latent_vf = self.mlp_extractor（特征） action_params = self.action_net（latent_pi） mean_actions，log_std = action_params.chunk（2， dim=-1） log_std = torch.clamp（log_std，最小=-20，最大=2） # 创建新的 distribution 实例分布 = SquashedDiagGaussianDistribution（self.action_dist.action_dim） # 将 action 解压缩回 [-1， 1] 以进行log_prob计算 unsquashed_actions = 2 * （作 - self.action_space_low） / （self.action_space_high - self.action_space_low） - 1 # 计算log_prob和熵 log_prob = distribution.log_prob（unsquashed_actions） # 对未压缩的动作使用 log_prob entropy = distribution.entropy（） # 熵不需要动作值 = self.value_net（latent_vf）返回值、log_prob、熵 ```

Let’s break down what the action return value represents in your provided code, specifically within the context of your CustomMultiInputPolicy class, which is a custom implementation of ActorCriticPolicy from Stable-Baselines3 (SB3). This class defines how actions are generated, processed, and returned during training and evaluation (e.g., in your evaluate() function). I’ll explain step-by-step based on your code.
让我们分解一下作返回值在您提供的代码中表示的内容，特别是在 CustomMultiInputPolicy 类的上下文中，该类是 Stable-Baselines3 （SB3）中 ActorCriticPolicy 的自定义实现。此类定义在训练和评估期间（例如，在 evaluate（）函数中）如何生成、处理和返回作。我将根据您的代码逐步解释。

Overview of CustomMultiInputPolicy
CustomMultiInputPolicy 概述

Purpose: This is a policy network for a PPO agent that takes a multi-input observation space (e.g., OHLC data, economic events, portfolio data) and outputs actions for trading (e.g., buy, sell, or hold for each asset).
目的：这是一个 PPO 代理的策略网络，它采用多输入观察空间（例如，OHLC 数据、经济事件、投资组合数据）并输出交易作（例如，为每个资产购买、出售或持有）。
Action Space: Assumed to be a continuous Box space (since you’re using a Gaussian distribution), with bounds defined by action_space.low and action_space.high.
作空间：假设是一个连续的 Box 空间（因为您使用的是 Gaussian 分布），其边界由 action_space.low 和 action_space.high 定义。
Key Components: 关键组件：
- CustomFeaturesExtractor: Processes the observation into a flat feature vector.
  CustomFeaturesExtractor：将观察结果处理为平面特征向量。
- MlpExtractor: Splits features into policy (pi) and value (vf) latent representations.
  MlpExtractor：将特征拆分为策略（pi）和值（vf）潜在表示。
- action_net: Outputs parameters (mean and log standard deviation) for a Gaussian distribution.
  action_net：输出高斯分布的参数（平均值和对数标准差）。
- SquashedDiagGaussianDistribution: Generates actions, squashing them to a specific range.
  SquashedDiagGaussianDistribution：生成作，将它们压缩到特定范围。

What is the action Return Value?
什么是作 Return Value？

The action return value comes from the forward() method when you call model.predict(obs) in your evaluate() function:
当您在 evaluate（）函数中调用 model.predict（obs）时，作返回值来自 forward（）方法：

action, _ = model.predict(obs, deterministic=True)

This action is then passed to env_vec.step(action) to interact with the environment. Let’s trace its generation and final form.
然后，此作将传递给 env_vec.step（action）以与环境交互。让我们追溯它的生成和最终形式。

1. Action Generation in forward()
1. forward（）中的动作生成

Here’s the relevant part of forward():
这是 forward（）的相关部分：

def forward(self, obs, deterministic=False):
    # Extract features
    features = self.extract_features(obs)
    latent_pi, latent_vf = self.mlp_extractor(features)

    # Get mean and log_std from action_net
    action_params = self.action_net(latent_pi)  # [batch, n_assets * 2]
    mean_actions, log_std = action_params.chunk(2, dim=-1)  # Split into mean and log_std
    log_std = torch.clamp(log_std, min=-20, max=2)  # Stabilize log_std

    # Create a fresh distribution instance
    distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim)

    # Sample actions or get deterministic actions
    actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic)

    # Map from [-1, 1] to [0, 3]
    squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high)

    # Compute log probabilities of the unsquashed actions
    log_prob = distribution.log_prob(actions)

    # Value prediction
    values = self.value_net(latent_vf)

    return squashed_actions, values, log_prob

Steps: 步骤：
1. Feature Extraction: obs (a dictionary of observation data) is processed by CustomFeaturesExtractor into a tensor features.
  Feature Extraction：obs（观察数据的字典）由 CustomFeaturesExtractor 处理成张量特征。
2. Latent Representation: MlpExtractor splits features into latent_pi (for the policy) and latent_vf (for the value function).
  潜在表示：MlpExtractor 将特征拆分为 latent_pi（用于策略）和 latent_vf（用于值函数）。
3. Action Parameters: action_net outputs action_params, a tensor of shape [batch_size, action_dim * 2], where:
  作参数：action_net 输出 action_params，即形状为 [batch_size， action_dim * 2] 的张量，其中：
  - action_dim is the number of assets (e.g., 1 for XAUUSD alone, or more if multiple assets).
    action_dim 是资产数量（例如，单独 XAUUSD 为 1，如果有多个资产，则更多）。
  - Split into mean_actions (mean of the Gaussian) and log_std (log standard deviation), each of shape [batch_size, action_dim].
    分为 mean_actions（高斯均值）和 log_std（对数标准差），每个形状为 [batch_size， action_dim]。
4. Gaussian Distribution: SquashedDiagGaussianDistribution takes mean_actions and log_std to create a distribution:
  Gaussian Distribution：SquashedDiagGaussianDistribution 需要 mean_actions 和 log_std来创建分布：
  - If deterministic=True (as in evaluate()), it uses mean_actions directly.
    如果 deterministic=True （如 evaluate（）），则直接使用 mean_actions。
  - If deterministic=False, it samples from the Gaussian distribution.
    如果 deterministic=False，则从高斯分布中采样。
5. Initial Actions: actions are sampled (or taken as mean) in the range [-1, 1] due to the squashing in SquashedDiagGaussianDistribution.
  初始作：由于 SquashedDiagGaussianDistribution 中的压缩，作在 [-1， 1] 范围内采样（或取平均值）。
6. Squashing to Action Space: _squash_to_range() maps actions from [-1, 1] to your action space bounds [low, high] (e.g., [0, 3] if that’s your intent).
  压缩到作空间：_squash_to_range（）将作从 [-1， 1] 映射到你的作空间边界 [low， high]（例如，如果这是你的意图，则为 [0， 3]）。
7. Return Value: squashed_actions is the first element of the returned tuple.
  Return Value：squashed_actions 是返回的元组的第一个元素。

2. Action Space and _squash_to_range()
2. Action Space 和 _squash_to_range（）

Action Space Definition: Your action_space is a gym.spaces.Box with bounds low and high:
作空间定义：你的action_space是一个 gym.spaces.Box，其边界为 low 和 high：
python 蟒
self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device)
- Example: If action_space = Box(low=0, high=3, shape=(n_assets,)), then low = 0 and high = 3 for each asset.
  示例：如果 action_space = Box（low=0， high=3， shape=（n_assets，）），则每个资产的 low = 0 和 high = 3。
- action_dim = action_space.shape[0] is the number of assets (e.g., 1 for a single asset like XAUUSD).
  action_dim = action_space.shape[0] 是资产的数量（例如，像 XAUUSD 这样的单一资产为 1）。
Squashing Function: 挤压功能：
python 蟒
def _squash_to_range(self, actions, low, high): return (actions + 1) * (high - low) / 2 + low
- Input: actions in [-1, 1].
  输入：[-1， 1] 中的作。
- Output: squashed_actions in [low, high].
  输出：squashed_actions [low， high]。
- Formula: Maps [-1, 1] linearly to [low, high].
  公式：将 [-1， 1] 线性映射到 [low， high]。
  - For low = 0, high = 3:
    对于最低价 = 0，最高价 = 3：
    - -1 → ( (-1 + 1) * (3 - 0) / 2 + 0 ) = 0.
      -1 → （（-1 + 1） * （3 - 0） / 2 + 0 ） = 0。
    - 1 → ( (1 + 1) * (3 - 0) / 2 + 0 ) = 3.
      1 → （（1 + 1） * （3 - 0） / 2 + 0 ） = 3。
    - 0 → ( (0 + 1) * (3 - 0) / 2 + 0 ) = 1.5.
      0 → （（0 + 1） * （3 - 0） / 2 + 0 ） = 1.5。

3. Final action Return Value
3. 最终作返回值

In predict(): When you call model.predict(obs, deterministic=True):
在 predict（）中：当您调用 model.predict（obs， deterministic=True）时：
- predict() internally calls forward(obs, deterministic=True) and returns squashed_actions (and possibly other outputs, but you discard _).
  predict（）在内部调用 forward（obs， deterministic=True）并返回 squashed_actions（可能还有其他输出，但你丢弃了 _）。
- Shape: [n_envs, action_dim] (e.g., [1, 1] for 1 environment and 1 asset).
  形状：[n_envs， action_dim]（例如，[1， 1] 表示 1 个环境和 1 个资产）。
- Range: [action_space.low, action_space.high] (e.g., [0, 3]).
  范围：[action_space.low， action_space.high]（例如，[0， 3]）。
- Type: PyTorch tensor on the device (e.g., cuda:0 or cpu).
  类型：设备上的 PyTorch 张量（例如 cuda：0 或 cpu）。
In Your evaluate():
在 evaluate（）中：
python 蟒
action, _ = model.predict(obs, deterministic=True) obs, rewards, done, info = env_vec.step(action)
- action is the squashed_actions tensor, scaled to your action space bounds (e.g., [0, 3]).
  action 是 squashed_actions 张量，缩放到您的动作空间边界（例如 [0， 3]）。
- Passed to env_vec.step(), which expects a NumPy array, so SB3 converts it automatically if needed.
  传递给 env_vec.step（），它需要一个 NumPy 数组，因此 SB3 会在需要时自动转换它。

What Does the Action Represent?
动作代表什么？

Interpretation: Each value in the action tensor corresponds to an action for one asset:
解释：action 张量中的每个值对应于一个资产的一个作：
- Range [0, 3] suggests a continuous action space, possibly interpreted as:
  范围 [0， 3] 表示连续的动作空间，可能解释为：
  - 0: Strong sell or no action.
    0：强烈卖出或不采取行动。
  - 1.5: Neutral/hold (midpoint).
    1.5：中性/保持（中点）。
  - 3: Strong buy. 3：强烈买入。
- Exact meaning depends on how your environment’s _take_action() interprets these values (e.g., buy/sell thresholds, position sizes).
  确切的含义取决于您环境的 _take_action（）如何解释这些值（例如，买入/卖出阈值、头寸大小）。
Example: 例：
- For 1 asset (e.g., XAUUSD): action = tensor([[1.2]]) might mean a weak buy.
  对于 1 种资产（例如 XAUUSD）：action = tensor（[[1.2]]）可能意味着弱买入。
- For 2 assets: action = tensor([[0.5, 2.8]]) might mean sell asset 1, buy asset 2.
  对于 2 个资产：action = tensor（[[0.5， 2.8]]）可能意味着卖出资产 1，买入资产 2。

Key Properties of the action Return Value
作 Return Value 的关键属性

Shape: [n_envs, action_dim], where:
Shape：[n_envs， action_dim]，其中：
- n_envs is the number of environments in env_vec (e.g., 1 for a single DummyVecEnv).
  n_envs 是 env_vec 中的环境数量（例如，单个 DummyVecEnv 为 1）。
- action_dim is the number of assets (e.g., 1 or more).
  action_dim 是资产的数量（例如，1 个或更多）。
Range: [action_space.low, action_space.high] (e.g., [0, 3]).
范围：[action_space.low， action_space.high] （例如，[0， 3]）。
Type: PyTorch tensor, automatically converted to NumPy by SB3 when passed to env_vec.step().
类型：PyTorch 张量，当传递给 env_vec.step（）时，SB3 会自动转换为 NumPy。
Deterministic in evaluate(): Since deterministic=True, it’s the mean of the Gaussian (after squashing), not a random sample.
在 evaluate（）中具有确定性：由于 deterministic=True，它是高斯的平均值（压缩后），而不是随机样本。

How It’s Used in Your Environment
如何在您的环境中使用它

In step(): 在 step（）中：
python 蟒
base_reward = self._take_action(actions, done)
- actions is the NumPy-converted squashed_actions.
  actions 是 NumPy 转换的squashed_actions。
- _take_action() interprets these values (e.g., as buy/sell signals or position sizes) to update the environment state and compute base_reward.
  _take_action（）解释这些值（例如，作为买入/卖出信号或仓位大小）以更新环境状态并计算base_reward。

Potential Issues to Check
需要检查的潜在问题

Action Space Mismatch: Ensure your environment’s action_space matches [0, 3] (or whatever bounds you intend). If it’s different (e.g., [-1, 1]), the squashing in _squash_to_range() won’t align, causing unexpected behavior.
作空间不匹配：确保环境的 action_space 与 [0， 3] （或您想要的任何边界）匹配。如果它不同（例如，[-1， 1]），则 _squash_to_range（）中的压缩不会对齐，从而导致意外行为。
Interpretation: Verify _take_action() handles the [0, 3] range correctly (e.g., thresholds like < 1 for sell, > 2 for buy).
解释：验证 _take_action（）是否正确处理了 [0， 3] 范围（例如，%3C 1 表示卖出，> 2 表示买入）。
Debugging: Add a print statement in evaluate():
调试：在 evaluate（）中添加 print 语句：
python 蟒
print(f"Action: {action}, Type: {type(action)}, Shape: {action.shape}")

Summary 总结

Action Return Value: A tensor of shape [n_envs, action_dim] with values in [action_space.low, action_space.high] (e.g., [0, 3]), representing trading actions per asset, generated deterministically in evaluate() via a squashed Gaussian mean.
Action 返回值：形状为 [n_envs， action_dim] 的张量，值为 [action_space.low， action_space.high]（例如，[0， 3]），代表每个资产的交易作，在 evaluate（）中通过压缩的高斯均值确定性生成。
Flow: Observation → Features → Gaussian (mean/log_std) → Squashed Actions → Environment.
流：观测→特征→高斯（均值/log_std） → →环境的挤压作用。

Let me know if you need clarification on how _take_action() uses these actions or if your action space bounds differ from [0, 3]!
如果您需要澄清 _take_action（）如何使用这些作，或者您的作空间边界是否与 [0， 3] 不同，请告诉我！

Step 10310: Base Reward=0, Sustained Reward=-9.317110000000003, Total=-9.317110000000003, Balance=99100 Asset XAUUSD: Action=0.0, Reward=0, Holding=50 Step 10311: Base Reward=0, Sustained Reward=-9.441610000000003, Total=-9.441610000000003, Balance=99100 Asset XAUUSD: Action=0.0, Reward=0, Holding=50 Step 10312: Base Reward=0, Sustained Reward=-9.493610000000004, Total=-9.493610000000004, Balance=99100 Asset XAUUSD: Action=0.0, Reward=0, Holding=50 Step 10313: Base Reward=0, Sustained Reward=-9.534109999999998, Total=-9.534109999999998, Balance=99100 Asset XAUUSD: Action=0.0, Reward=0, Holding=50 Step 10314: Base Reward=0, Sustained Reward=-9.65761, Total=-9.65761, Balance=99100 Asset XAUUSD: Action=0.0, Reward=-961.3609999999994, Holding=50 Step 10315: Base Reward=-961.3609999999994, Sustained Reward=0, Total=-961.3609999999994, Balance=96157 --------------------------------------------------------------------------- UFuncTypeError Traceback (most recent call last) <ipython-input-70-b83ff300d59f> in <cell line: 182>() 180 load_if_exists=True 181 ) --> 182 study.optimize(objective, n_trials=1) # Adjust number of trials based on resources 183 184 # Best parameters 14 frames /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 473 If nested invocation of this method occurs. 474 """ --> 475 _optimize( 476 study=self, 477 func=func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar) 61 try: 62 if n_jobs == 1: ---> 63 _optimize_sequential( 64 study, 65 func, /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar) 158 159 try: --> 160 frozen_trial = _run_trial(study, func, catch) 161 finally: 162 # The following line mitigates memory problems that can be occurred in some /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 246 and not isinstance(func_err, catch) 247 ): --> 248 raise func_err 249 return frozen_trial 250 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py in _run_trial(study, func, catch) 195 with get_heartbeat_thread(trial._trial_id, study._storage): 196 try: --> 197 value_or_values = func(trial) 198 except exceptions.TrialPruned as e: 199 # TODO(mamu): Handle multi-objective cases. <ipython-input-70-b83ff300d59f> in objective(trial) 163 164 # print(model.policy) # Should show mlp_extractor with in_features=95 --> 165 model.learn(total_timesteps=total_timesteps, callback=eval_callback) 166 167 val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 309 progress_bar: bool = False, 310 ) -> SelfPPO: --> 311 return super().learn( 312 total_timesteps=total_timesteps, 313 callback=callback, /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar) 321 322 while self.num_timesteps < total_timesteps: --> 323 continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps) 324 325 if not continue_training: /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py in collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps) 222 # Give access to local variables 223 callback.update_locals(locals()) --> 224 if not callback.on_step(): 225 return False 226 /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/callbacks.py in on_step(self) 112 self.num_timesteps = self.model.num_timesteps 113 --> 114 return self._on_step() 115 116 def on_training_end(self) -> None: <ipython-input-70-b83ff300d59f> in _on_step(self) 106 if self.last_mean_reward is not None: 107 # Use profit or reward based on use_profit flag --> 108 current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit) 109 if current_metric > self.best_metric + self.min_delta: 110 self.best_metric = current_metric <ipython-input-70-b83ff300d59f> in evaluate(model, env_vec, n_episodes, return_mean_reward) 31 mean_reward = np.mean(total_rewards) 32 mean_profit = np.mean(total_profits) ---> 33 avg_metrics = { 34 k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys() 35 } <ipython-input-70-b83ff300d59f> in <dictcomp>(.0) 32 mean_profit = np.mean(total_profits) 33 avg_metrics = { ---> 34 k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys() 35 } 36 print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims, where) 3502 return mean(axis=axis, dtype=dtype, out=out, **kwargs) 3503 -> 3504 return _methods._mean(a, axis=axis, dtype=dtype, 3505 out=out, **kwargs) 3506 /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims, where) 116 is_float16_result = True 117 --> 118 ret = umr_sum(arr, axis, dtype, out, keepdims, where=where) 119 if isinstance(ret, mu.ndarray): 120 with _no_nep50_warning(): UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U19'), dtype('<U19')) -> None ``` import datetime import math import random import ast import torch import torch.nn as nn import csv import numpy as np import gymnasium as gym from gymnasium import spaces from gymnasium.utils import seeding from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor, MlpExtractor from stable_baselines3.common.policies import ActorCriticPolicy from stable_baselines3.common.torch_layers import MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution from stable_baselines3 import PPO from stable_baselines3.common.policies import ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart import TradingChart from meta.env_fx_trading.util.read_config import EnvConfig class tgym(gym.Env): """forex/future/option trading gym environment 1. Three action space (0 Buy, 1 Sell, 2 Nothing) 2. Multiple trading pairs (EURUSD, GBPUSD...) under same time frame 3. Timeframe from 1 min to daily as long as use candlestick bar (Open, High, Low, Close) 4. Use StopLose, ProfitTaken to realize rewards. each pair can configure it own SL and PT in configure file 5. Configure over night cash penalty and each pair's transaction fee and overnight position holding penalty 6. Split dataset into daily, weekly or monthly..., with fixed time steps, at end of len(df). The business logic will force to Close all positions at last Close price (game over). 7. Must have df column name: [(time_col),(asset_col), Open,Close,High,Low,day] (case sensitive) 8. Addition indicators can add during the data process. 78 available TA indicator from Finta 9. Customized observation list handled in json config file. 10. ProfitTaken = fraction_action * max_profit_taken + SL. 11. SL is pre-fixed 12. Limit order can be configure, if limit_order == True, the action will preset buy or sell at Low or High of the bar, with a limit_order_expiration (n bars). It will be triggered if the price go cross. otherwise, it will be drop off 13. render mode: human -- display each steps realized reward on console file -- create a transaction log graph -- create transaction in graph (under development) 14. 15. Reward, we want to incentivize profit that is sustained over long periods of time. At each step, we will set the reward to the account balance multiplied by some fraction of the number of time steps so far.The purpose of this is to delay rewarding the agent too fast in the early stages and allow it to explore sufficiently before optimizing a single strategy too deeply. It will also reward agents that maintain a higher balance for longer, rather than those who rapidly gain money using unsustainable strategies. 16. Observation_space contains all of the input variables we want our agent to consider before making, or not making a trade. We want our agent to “see” the forex data points (Open price, High, Low, Close, time serial, TA) in the game window, as well a couple other data points like its account balance, current positions, and current profit.The intuition here is that for each time step, we want our agent to consider the price action leading up to the current price, as well as their own portfolio’s status in order to make an informed decision for the next action. 17. reward is forex trading unit Point, it can be configure for each trading pair 18. To make the unrealized profit reward reflect market conditions, we’ll compute ATR for each asset and use it to scale the reward dynamically. """ metadata = {"render.modes": ["graph", "human", "file", "none"]} def __init__( self, df, event_map, currency_map, env_config_file="./neo_finrl/env_fx_trading/config/gdbusd-test-1.json", ): assert df.ndim == 2 super(tgym, self).__init__() self.cf = EnvConfig(env_config_file) self.observation_list = self.cf.env_parameters("observation_list") # Economic data mappings self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy() if 'events' not in self.df.columns: raise ValueError("DataFrame must contain an 'events' column") def parse_events(x): if isinstance(x, str): try: parsed = ast.literal_eval(x) return parsed if isinstance(parsed, list) else [] except (ValueError, SyntaxError): return [] return x if isinstance(x, list) else [] self.df['events'] = self.df['events'].apply(parse_events) if not isinstance(self.df['events'].iloc[0], list): raise ValueError("'events' must be a list") if self.df['events'].iloc[0] and not isinstance(self.df['events'].iloc[0][0], dict): raise ValueError("Elements in 'events' must be dictionaries") self.balance_initial = self.cf.env_parameters("balance") self.over_night_cash_penalty = self.cf.env_parameters("over_night_cash_penalty") self.asset_col = self.cf.env_parameters("asset_col") self.time_col = self.cf.env_parameters("time_col") self.random_start = self.cf.env_parameters("random_start") log_file_datetime = datetime.datetime.now().strftime("%Y%m%d%H%M%S") self.log_filename = ( self.cf.env_parameters("log_filename") + log_file_datetime + ".csv" ) self.analyze_transaction_history_log_filename = ("transaction_history_log" + log_file_datetime + ".csv") self.df["_time"] = self.df[self.time_col] self.df["_day"] = self.df["weekday"] self.assets = self.df[self.asset_col].unique() self.dt_datetime = self.df[self.time_col].sort_values().unique() self.df = self.df.set_index(self.time_col) self.visualization = False # Reset values self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_step = 0 self.episode = 0 # Start from 0, increment on episode end self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True # Cache data self.cached_ohlc_data = [self.get_observation_vector(_dt) for _dt in self.dt_datetime] self.cached_economic_data = [self.get_economic_vector(_dt) for _dt in self.dt_datetime] self.cached_time_serial = ( self.df[["_time", "_day"]].sort_values("_time").drop_duplicates().values.tolist() ) self.reward_range = (-np.inf, np.inf) self.action_space = spaces.Box(low=0, high=3, shape=(len(self.assets),), dtype=np.float32) self.observation_space = spaces.Dict({ "ohlc_data": spaces.Box(low=-np.inf, high=np.inf, shape=(len(self.assets) * len(self.observation_list),), dtype=np.float32), "event_ids": spaces.Box(low=0, high=len(self.event_map)-1, shape=(self.max_events,), dtype=np.int32), "currency_ids": spaces.Box(low=0, high=len(self.currency_map)-1, shape=(self.max_events,), dtype=np.int32), "economic_numeric": spaces.Box(low=-np.inf, high=np.inf, shape=(self.max_events * 6,), dtype=np.float32), "portfolio_data": spaces.Box(low=-np.inf, high=np.inf, shape=(3 + 2 * len(self.assets),), dtype=np.float32) }) print( f"initial done:\n" f"observation_list:{self.observation_list}\n" f"assets:{self.assets}\n" f"time serial: {min(self.dt_datetime)} -> {max(self.dt_datetime)} length: {len(self.dt_datetime)}\n" f"events: {len(self.event_map)}, currencies: {len(self.currency_map)}" ) self._seed() def _seed(self, seed=None): self.np_random, seed = seeding.np_random(seed) return [seed] def _take_action(self, actions, done): # action = math.floor(x), # profit_taken = math.ceil((x- math.floor(x)) * profit_taken_max - stop_loss_max ) # _actions = np.floor(actions).astype(int) # _profit_takens = np.ceil((actions - np.floor(actions)) *self.cf.symbol(self.assets[i],"profit_taken_max")).astype(int) _action = 2 _profit_taken = 0 rewards = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # need use multiply assets for i, action in enumerate(actions): # Actions are now floats between 0 and 3 self._o = self.get_observation(self.current_step, i, "Open") self._h = self.get_observation(self.current_step, i, "High") self._l = self.get_observation(self.current_step, i, "Low") self._c = self.get_observation(self.current_step, i, "Close") self._t = self.get_observation(self.current_step, i, "_time") self._day = self.get_observation(self.current_step, i, "_day") # Extract integer action type and fractional part _action = math.floor(action) # 0=Buy, 1=Sell, 2=Nothing rewards[i] = self._calculate_reward(i, done, _action) # Pass action for exploration reward print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}") if self.cf.symbol(self.assets[i], "limit_order"): self._limit_order_process(i, _action, done) if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")): # Dynamically calculate PT using action fraction _profit_taken = math.ceil( (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max") ) + self.cf.symbol(self.assets[i], "stop_loss_max") self.ticket_id += 1 if self.cf.symbol(self.assets[i], "limit_order"): transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._l if _action == 0 else self._h, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": -1, "CloseStep": -1, } self.transaction_limit_order.append(transaction) else: transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._c, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": self.current_step, "CloseStep": -1, } self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction) return sum(rewards) def _calculate_reward(self, i, done, action): _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[:]: # Copy to avoid modification issues if tr["Symbol"] == self.assets[i]: _point = self.cf.symbol(self.assets[i], "point") # cash discount overnight if self._day > tr["DateDuration"]: tr["DateDuration"] = self._day tr["Reward"] -= self.cf.symbol(self.assets[i], "over_night_penalty") if tr["Type"] == 0: # Buy # stop loss trigger _sl_price = tr["ActionPrice"] - tr["SL"] / _point _pt_price = tr["ActionPrice"] + tr["PT"] / _point if done: p = (self._c - tr["ActionPrice"]) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._l <= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._h >= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: # still open self.current_draw_downs[i] = int((self._l - tr["ActionPrice"]) * _point) _max_draw_down += self.current_draw_downs[i] if self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i]: tr["MaxDD"] = self.current_draw_downs[i] elif tr["Type"] == 1: # Sell # stop loss trigger _sl_price = tr["ActionPrice"] + tr["SL"] / _point _pt_price = tr["ActionPrice"] - tr["PT"] / _point if done: p = (tr["ActionPrice"] - self._c) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._h >= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._l <= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: self.current_draw_downs[i] = int( (tr["ActionPrice"] - self._h) * _point ) _max_draw_down += self.current_draw_downs[i] if ( self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i] ): tr["MaxDD"] = self.current_draw_downs[i] if _max_draw_down > self.max_draw_downs[i]: self.max_draw_downs[i] = _max_draw_down return _total_reward def _limit_order_process(self, i, _action, done): for tr in self.transaction_limit_order[:]: if tr["Symbol"] == self.assets[i]: if tr["Type"] != _action or done: self.transaction_limit_order.remove(tr) tr["Status"] = 3 tr["CloseStep"] = self.current_step self.transaction_history.append(tr) elif (tr["ActionPrice"] >= self._l and _action == 0) or ( tr["ActionPrice"] <= self._h and _action == 1): tr["ActionStep"] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_limit_order.remove(tr) self.transaction_live.append(tr) self.tranaction_open_this_step.append(tr) elif (tr["LimitStep"] + self.cf.symbol(self.assets[i], "limit_order_expiration") > self.current_step): tr["CloseStep"] = self.current_step tr["Status"] = 4 self.transaction_limit_order.remove(tr) self.transaction_history.append(tr) def _manage_tranaction(self, tr, _p, close_price, status=1): self.transaction_live.remove(tr) tr["ClosePrice"] = close_price tr["Point"] = int(_p) tr["Reward"] = int(tr["Reward"] + _p) # Realized profit/loss tr["Status"] = status # 1=SL/PT, 2=Forced close, 3=Canceled limit, 4=Expired limit tr["CloseTime"] = self._t tr["CloseStep"] = self.current_step self.balance += int(tr["Reward"]) self.total_equity -= int(abs(tr["Reward"])) self.tranaction_close_this_step.append(tr) self.transaction_history.append(tr) def analyze_transaction_history(self): if not self.transaction_history: metrics = {"trades": 0, "win_rate": 0.0, "profit_factor": 0.0, "sharpe_ratio": 0.0, "total_profit": 0.0} else: trades = len(self.transaction_history) rewards = [tr["Reward"] for tr in self.transaction_history] wins = sum(1 for r in rewards if r > 0) losses = sum(1 for r in rewards if r < 0) gross_profit = sum(r for r in rewards if r > 0) gross_loss = abs(sum(r for r in rewards if r < 0)) win_rate = wins / trades if trades > 0 else 0.0 profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf") # Sharpe Ratio (simplified, assumes risk-free rate = 0) returns = np.array(rewards, dtype=np.float32) sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0.0 total_profit = sum(rewards) metrics = { "trades": trades, "win_rate": win_rate, "profit_factor": profit_factor, "sharpe_ratio": sharpe_ratio, "total_profit": total_profit } # Append to log file with open(self.analyze_transaction_history_log_filename, 'a', newline='') as f: writer = csv.DictWriter(f, fieldnames=["timestamp", "episode", "trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]) metrics["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") metrics["episode"] = self.episode writer.writerow(metrics) return metrics def step(self, actions): self.current_step += 1 # Define termination and truncation conditions terminated = self.balance <= 0 # Episode ends due to bankruptcy (terminal state) truncated = self.current_step == len(self.dt_datetime) - 1 # Episode ends due to max steps (time limit) done = terminated or truncated # Combine into a single 'done' flag for VecEnv # For rendering or episode tracking, you might still check if either condition is true if done: self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n" self.visualization = True self.episode += 1 # Increment episode counter # Calculate base trading reward base_reward = self._take_action(actions, done) # Calculate unrealized profit from open positions unrealized_profit = 0 atr_scaling = 0 # For market condition scaling for i, asset in enumerate(self.assets): atr = self.get_observation(self.current_step, i, "ATR") atr_scaling += atr # Sum ATR across assets for normalization for tr in self.transaction_live: if tr["Symbol"] == asset: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1 # Avoid division by 0 # Sustained reward: only applies to unrealized/realized profits, scaled by ATR # adjust 0.01 to 0.05 sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0 # Penalty for inaction if no positions are held if not self.transaction_live and all(math.floor(a) == 2 for a in actions): sustained_reward -= 0.1 # Small penalty to encourage exploration total_reward = base_reward + sustained_reward if self._day > self.current_day: self.current_day = self._day self.balance -= self.over_night_cash_penalty if self.balance != 0: self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100) obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}") # Info dictionary remains unchanged info = {"Close": self.tranaction_close_this_step} return obs, total_reward, terminated, truncated, info def get_observation(self, _step, _iter=0, col=None): if col is None: return self.cached_ohlc_data[_step] if col == "_day": return self.cached_time_serial[_step][1] elif col == "_time": return self.cached_time_serial[_step][0] try: col_pos = self.observation_list.index(col) except ValueError: raise ValueError(f"Column '{col}' not found in observation_list") return self.cached_ohlc_data[_step][_iter * len(self.observation_list) + col_pos] def get_observation_vector(self, _dt, cols=None): cols = self.observation_list if cols is None else cols v = [] for a in self.assets: subset = self.df.query(f'{self.asset_col} == "{a}" & {self.time_col} == "{_dt}"') assert not subset.empty v += subset.loc[_dt, cols].tolist() assert len(v) == len(self.assets) * len(cols) return v def get_economic_vector(self, _dt): subset = self.df.loc[_dt] events = subset['events'] if isinstance(subset, pd.Series) else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) currency_ids = [self.currency_map.get(e['currency'], 0) for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) numeric_fields = ['actual_norm', 'forecast_norm', 'previous_norm', 'surprise_norm', 'event_freq', 'impact_code'] numeric = [e[field] for e in events[:self.max_events] for field in numeric_fields] + [0] * (self.max_events * 6 - len(events) * 6) return { "event_ids": np.array(event_ids, dtype=np.int32), "currency_ids": np.array(currency_ids, dtype=np.int32), "numeric": np.array(numeric, dtype=np.float32) } def reset(self, seed=None, options=None): # Set the seed for reproducibility if seed is not None: self._seed(seed) if self.random_start: self.current_step = random.choice(range(int(len(self.dt_datetime) * 0.5))) else: self.current_step = 0 self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True self.visualization = False obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } info = {} return obs, info def render(self, mode="human", title=None, **kwargs): if mode in ("human", "file"): printout = mode == "human" pm = { "log_header": self.log_header, "log_filename": self.log_filename, "printout": printout, "balance": self.balance, "balance_initial": self.balance_initial, "tranaction_close_this_step": self.tranaction_close_this_step, "done_information": self.done_information, } render_to_file(**pm) if self.log_header: self.log_header = False elif mode == "graph" and self.visualization: print("plotting...") p = TradingChart(self.df, self.transaction_history) p.plot() def close(self): pass def get_sb_env(self): e = DummyVecEnv([lambda: self]) obs = e.reset() return e, obs class CustomFeaturesExtractor(BaseFeaturesExtractor): def __init__(self, observation_space): n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2 ohlc_dim = observation_space.spaces["ohlc_data"].shape[0] max_events = observation_space.spaces["event_ids"].shape[0] economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0] portfolio_dim = observation_space.spaces["portfolio_data"].shape[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # Check if CUDA is available, else use CPU self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomFeaturesExtractor using device: {self.device}") super().__init__(observation_space, features_dim=features_dim) # Move embeddings to the selected device self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device) self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device) print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}") def forward(self, obs): ohlc_data = obs["ohlc_data"].to(self.device) event_ids = obs["event_ids"].to(self.device, dtype=torch.long) currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long) economic_numeric = obs["economic_numeric"].to(self.device) portfolio_data = obs["portfolio_data"].to(self.device) event_emb = self.event_embedding(event_ids).mean(dim=1) currency_emb = self.currency_embedding(currency_ids).mean(dim=1) features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1) return features class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Check if CUDA is available, else use CPU device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomMultiInputPolicy using device: {device}") # Extract action space bounds and move them to the selected device self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=device ).to(device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities of the unsquashed actions log_prob = distribution.log_prob(actions) # Use log_prob on the unsquashed actions # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return squashed_actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Unsquash the actions back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob(unsquashed_actions) # Use log_prob on unsquashed actions entropy = distribution.entropy() # Entropy doesn’t need actions values = self.value_net(latent_vf) return values, log_prob, entropy ```
步骤 10310：基础奖励 = 0，持续奖励 =-9.317110000000003，总计 = -9.317110000000003，余额 = 99100 资产 XAUUSD：作 = 0.0，奖励 = 0，持仓 = 50 步骤 10311：基础奖励=0，持续奖励=-9.441610000000003，总计=-9.441610000000003，余额=99100 资产 XAUUSD：作 = 0.0，奖励 = 0，持仓 = 50 步骤 10312：基础奖励 = 0，持续奖励 = -9.493610000000004，总计 = -9.493610000000004，余额 = 99100 资产 XAUUSD：作 = 0.0，奖励 = 0，持仓 = 50 步骤 10313：基础奖励=0，持续奖励=-9.534109999999998，总计=-9.534109999999998，余额=99100 资产 XAUUSD：作 = 0.0，奖励 = 0，持仓 = 50 步骤 10314：基础奖励 = 0，持续奖励 = -9.65761，总计 = -9.65761，余额 = 99100 资产 XAUUSD：作 = 0.0，奖励 = -961.3609999999994，持仓 = 50 步骤 10315：基础奖励=-961.3609999999994，持续奖励=0，总计=-961.3609999999994，余额=96157 --------------------------------------------------------------------------- UFuncTypeError 回溯（最近调用最后） <cell 行中的 <ipython-input-70-b83ff300d59f>：182>（） 180 load_if_exists=真 181 ) --> 182 study.optimize（objective， n_trials=1） # 根据资源调整 Trial 数量 183 184 # 最佳参数 14 帧 /usr/local/lib/python3.10/dist-packages/optuna/study/study.py in optimize（self， func， n_trials， timeout， n_jobs， catch， callbacks， gc_after_trial， show_progress_bar） 473 如果发生此方法的嵌套调用。 474 """ --> 475 _optimize（ 476 study=自我， 477 func=func， _optimize中的 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py（study、func、n_trials、timeout、n_jobs、catch、callbacks、gc_after_trial、show_progress_bar） 61 次尝试：如果 n_jobs == 1，则为 62： ---> 63 _optimize_sequential（ 64 项研究， 65 函数， _optimize_sequential 中的 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py（study、func、n_trials、timeout、catch、callbacks、gc_after_trial、reseed_sampler_rng、time_start、progress_bar） 158 159 次尝试： --> 160 frozen_trial = _run_trial（研究、func、catch） 161 最后： 162 # 以下行缓解了某些 _run_trial中的 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py（study， func， catch） 246 而不是 isinstance（func_err， catch） 247 ): --> 248 加注 func_err 249 返回 frozen_trial 250 _run_trial中的 /usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py（study， func， catch） 195 带 get_heartbeat_thread（trial._trial_id， study._storage）： 196 次尝试： --> 197 value_or_values = func（试用） 198 个，例外情况除外。TrialPruned 为 e： 199 # TODO（mamu）：处理多目标案例。 <ipython-input-70-b83ff300d59f> 在目标（试用）中 163 164 # print（model.policy） # 应该显示 mlp_extractor 和 in_features=95 --> 165 model.learn（total_timesteps=total_timesteps， callback=eval_callback） 166 167 val_avg_profit = evaluate（model， val_env_vec， n_episodes=10） # 评估函数 /usr/local/lib/python3.10/dist-packages/stable_baselines3/ppo/ppo.py 在 learn（self， total_timesteps， callback， log_interval， tb_log_name， reset_num_timesteps， progress_bar） 309 progress_bar： bool = False， 310 ） -> SelfPPO： --> 311 return super（）.learn（ 312 total_timesteps=total_timesteps， 313 callback=callback 中， /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py 在 learn（self， total_timesteps， callback， log_interval， tb_log_name， reset_num_timesteps， progress_bar） 321 322 同时self.num_timesteps < total_timesteps： --> 323 continue_training = self.collect_rollouts（self.env， callback， self.rollout_buffer， n_rollout_steps=self.n_steps） 324 325 如果不continue_training： collect_rollouts 中的 /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/on_policy_algorithm.py（self， env， callback， rollout_buffer， n_rollout_steps） 222 # 授予对局部变量的访问权限 223 callback.update_locals（当地人（）） --> 224 如果不是 callback.on_step（）： 225 返回 False 226 /usr/local/lib/python3.10/dist-packages/stable_baselines3/common/callbacks.py 在 on_step（个体经营） 112 self.num_timesteps = self.model.num_timesteps 113 --> 114 返回 self._on_step（） 115 116 def on_training_end（个体） -> 无： <ipython-input-70-b83ff300d59f> 以_on_step方式（个体经营）如果 self.last_mean_reward 不是 None，则为 106： 107 # 使用基于use_profit标志的利润或奖励 --> 108 current_metric = evaluate（self.model， self.eval_env， self.n_eval_episodes， return_mean_reward=not self.use_profit）如果 current_metric > self.best_metric + self.min_delta，则为 109： 110 self.best_metric = current_metric <ipython-input-70-b83ff300d59f> 在评估中（model， env_vec， n_episodes， return_mean_reward） 31 mean_reward = np.mean（total_rewards） 32 mean_profit = np.mean（total_profits） ---> 33 avg_metrics = { 34 k： np.mean（[m[k] for m in metrics]） for k in metrics[0].keys（） 35 } <ipython-input-70-b83ff300d59f> in <dictcomp>（.0） 32 mean_profit = np.mean（total_profits） 33 avg_metrics = { ---> 34 k： np.mean（[m[k] for m in metrics]） for k in metrics[0].keys（） 35 } 36 print（f“平均奖励： {mean_reward：.2f}，平均利润： {mean_profit：.2f}”） /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py in mean（a， axis， dtype， out， keepdims， where） 3502 返回 mean（axis=axis， dtype=dtype， out=out， **kwargs） 3503 -> 3504 返回 _methods._mean（a， axis=axis， dtype=dtype， 3505 out=out，**kwargs） 3506 /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py in _mean（a， axis， dtype， out， keepdims， where） 116 is_float16_result = 真 117 --> 118 ret = umr_sum（arr， axis， dtype， out， keepdims， where=where） 119 if isinstance（ret， mu.ndarray）： 120 带 _no_nep50_warning（）： UFuncTypeError：ufunc 'add' 不包含具有签名匹配类型的循环（dtype（'<U19'）， dtype（'<U19'）） -> 无 ``` 导入 datetime 导入数学导入随机导入 AST Import Torch （导入火炬）将 torch.nn 导入为 nn 导入 CSV 将 numpy 导入为 NP 将 gymnasium 导入为 Gym 从 Gymnasium 导入空间 from gymnasium.utils import seeding from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor， MlpExtractor 从 stable_baselines3.common.policies 导入 ActorCriticPolicy 从 stable_baselines3.common.torch_layers 导入 MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution 从 stable_baselines3 进口 PPO 从 stable_baselines3.common.policies 导入 ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart 导入 TradingChart from meta.env_fx_trading.util.read_config import EnvConfig class tgym（健身房.Env）： “”“外汇/期货/期权交易 GYM 环境 1. 三个作空间（0 买入、1 卖出、2 无） 2. 同一时间范围内的多个交易对（EURUSD、GBPUSD... 3. 时间范围从 1 分钟到每天，只要使用烛台柱（开盘价、最高价、最低价、收盘价） 4. 使用 StopLose、ProfitTaken 实现奖励。每对都可以在配置文件中配置自己的 SL 和 PT 5. 配置隔夜现金罚金以及每对的交易手续费和隔夜持仓罚金 6. 将数据集拆分为每日、每周或每月...，具有固定的时间步长，在 len（df）结束时。业务 logic 将强制在最后收盘价平仓所有持仓（游戏结束）。 7. 必须具有 df 列名称：[（time_col），（asset_col），Open，Close，High，Low，day]（区分大小写） 8. 加法指标可以在数据处理过程中添加。Finta 提供 78 种 TA 指标 9. 在 json 配置文件中处理的自定义观察列表。 10. ProfitTaken = fraction_action * max_profit_taken + 止损。 11. SL 是预先固定的 12. 限价单可以配置，如果 limit_order == True，则动作将预设在柱线的低点或高点买入或卖出，带 limit_order_expiration （n 条）。如果价格交叉，它将触发。否则，它将是 drop off 13. 渲染模式： human -- 在控制台上显示每个步骤实现的奖励 file -- 创建事务日志 graph -- 在 Graph 中创建交易（开发中） 14. 15. 奖励，我们希望激励长期持续的利润。在每个步骤中，我们将奖励设置为账户余额乘以到目前为止时间步数的一小部分。这样做的目的是拖延在早期阶段过快地奖励代理并允许其探索在过于深入地优化单个策略之前。它还将奖励在更长时间内保持较高余额的代理，而不是那些使用不可持续的策略快速赚钱的人。 16. Observation_space 包含我们需要代理的所有输入变量在进行交易或不进行交易之前考虑。我们希望我们的代理 “看到” 游戏窗口中的外汇数据点（开盘价、最高价、最低价、收盘价、时间序列、TA），以及其他一些数据点，如账户余额、当前头寸、和当前利润。这里的直觉是，对于每个时间步，我们都需要我们的代理考虑导致当前价格的价格行为，以及他们的拥有投资组合的状态，以便为下一步行动做出明智的决策。 17. 奖励为外汇交易单位积分，可为每个交易对配置 18. 为了使未实现利润奖励反映市场状况，我们将计算每种资产的 ATR，并使用它来动态扩展奖励。 """ 元数据 = {“render.modes”： [“graph”， “human”， “file”， “none”]} 防守 __init__（自我 DF / event_map， currency_map， env_config_file=“./neo_finrl/env_fx_trading/config/gdbusd-test-1.json”， ): 断言 df.ndim == 2 超级（tgym， self）.__init__（） self.cf = 环境配置（env_config_file） self.observation_list = self.cf.env_parameters（“observation_list”） # 经济数据映射 self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy（）如果 'events' 不在 self.df.columns 中： raise ValueError（“DataFrame 必须包含一个 'events' 列”） def parse_events（x）：如果 isinstance（x， str）：尝试：解析 = ast.literal_eval（x）返回已解析 if isinstance（parsed， list） else [] except （ValueError， SyntaxError）：返回 [] 返回 x if isinstance（x， list） else [] self.df['事件'] = self.df['事件'].apply（parse_events）如果不是 isinstance（self.df['events'].iloc[0]， list）： raise ValueError（“'事件'必须是一个列表”）如果 self.df['events'].iloc[0] 而不是 isinstance（self.df['events'].iloc[0][0]， dict）： raise ValueError（“'events' 中的元素必须是字典”） self.balance_initial = self.cf.env_parameters（“余额”） self.over_night_cash_penalty = self.cf.env_parameters（“over_night_cash_penalty”） self.asset_col = self.cf.env_parameters（“asset_col”） self.time_col = self.cf.env_parameters（“time_col”） self.random_start = self.cf.env_parameters（“random_start”） log_file_datetime = datetime.datetime.now（）.strftime（“%Y%m%d%H%M%S”） self.log_filename = （ self.cf.env_parameters（“log_filename”） + log_file_datetime + “.csv” ) self.analyze_transaction_history_log_filename = （“transaction_history_log” + log_file_datetime + “.csv”） self.df[“_time”] = self.df[self.time_col] self.df[“_day”] = self.df[“工作日”] self.assets = self.df[self.asset_col].unique（） self.dt_datetime = self.df[self.time_col].sort_values（）.unique（） self.df = self.df.set_index（self.time_col） self.visualization = False # 重置值 self.equity_list = [0] * len（self.assets） self.balance = self.balance_initial self.total_equity = self.balance + sum（self.equity_list） self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len（self.assets） self.max_draw_downs = [0.0] * len（self.assets） self.max_draw_down_pct = sum（self.max_draw_downs） / self.balance * 100 self.current_step = 0 self.episode = 0 # 从 0 开始，在剧集结束时递增 self.current_holding = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = “” self.log_header = 真 # 缓存数据 self.cached_ohlc_data = [self.get_observation_vector（_dt）表示 self.dt_datetime _dt] self.cached_economic_data = [self.get_economic_vector（_dt）表示 self.dt_datetime 年的 _dt] self.cached_time_serial = （ self.df[[“_time”， “_day”]].sort_values（“_time”）.drop_duplicates（）.values.tolist（） ) self.reward_range = （-np.inf， np.inf） self.action_space = 空格。盒子（low=0， high=3， shape=（len（self.assets），）， dtype=np.float32） self.observation_space = 空格。字典（{ “ohlc_data”：空格。框（low=-np.inf， high=np.inf， shape=（len（self.assets） * len（self.observation_list），）， dtype=np.float32）， “event_ids”：空格。框（low=0， high=len（self.event_map）-1， shape=（self.max_events，）， dtype=np.int32）， “currency_ids”：空格。框（低=0，高=len（self.currency_map）-1， shape=（self.max_events，）， dtype=np.int32）， “economic_numeric”：空格。盒子（low=-np.inf， high=np.inf， shape=（self.max_events * 6，）， dtype=np.float32）， “portfolio_data”：空格。盒子（low=-np.inf， high=np.inf， shape=（3 + 2 * len（self.assets），）， dtype=np.float32） }) 打印（ f“初始完成：\n” f“observation_list：{self.observation_list}\n” f“资产：{self.assets}\n” f“时间序列： {min（self.dt_datetime）} -> {max（self.dt_datetime）} 长度： {len（self.dt_datetime）}\n” f“事件：{len（self.event_map）}，货币：{len（self.currency_map）}” ) self._seed（） def _seed（self， seed=None）： self.np_random，种子 = seeding.np_random（种子）返回 [种子] def _take_action（self， actions， done）： #作 = math.floor（x）， # profit_taken = math.ceil（（x- math.floor（x）） * profit_taken_max - stop_loss_max ） # _actions = np.floor（作）.astype（int） # _profit_takens = np.ceil（（作 - np.floor（作）） *self.cf.symbol（self.assets[i]，“profit_taken_max”））.astype（int） _action = 2 _profit_taken = 0 奖励 = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # 需要使用 multiply 资产对于 i， action in enumerate（actions）： # 动作现在在 0 和 3 之间浮动 self._o = self.get_observation（self.current_step， i， “打开”） self._h = self.get_observation（self.current_step， i， “高”） self._l = self.get_observation（self.current_step， i， “低”） self._c = self.get_observation（self.current_step， i， “关闭”） self._t = self.get_observation（self.current_step， i， “_time”） self._day = self.get_observation（self.current_step， i， “_day”） # 提取整数动作类型和小数部分 _action = math.floor（action） # 0=买入，1=卖出，2=什么都没有 rewards[i] = self._calculate_reward（i， done， _action） # 通过行动以获得探索奖励 print（f“资产 {self.assets[i]}： action={action}， reward={rewards[i]}， Holding={self.current_holding[i]}”）如果 self.cf.symbol（self.assets[i]， “limit_order”）： self._limit_order_process（i， _action，完成）如果（ _action 英寸（0， 1）并且未完成和 self.current_holding[i] < self.cf.symbol（self.assets[i]， “max_current_holding”））： # 使用 action fraction 动态计算 PT _profit_taken = math.ceil（（作 - _action） * self.cf.symbol（self.assets[i]， “profit_taken_max”）） + self.cf.symbol（self.assets[i]， “stop_loss_max”） self.ticket_id += 1 如果 self.cf.symbol（self.assets[i]， “limit_order”）：交易 = { “票证”：self.ticket_id、 “Symbol”：self.assets[i]， “ActionTime”：self._t、 “类型”：_action、 “手数”： 1， “ActionPrice”： self._l if _action == 0 else self._h， “止损”： self.cf.symbol（self.assets[i]， “stop_loss_max”）， “PT”：_profit_taken、 “MaxDD”：0、 “Swap（掉期）”：0.0、 “CloseTime”： “”， ///// “ClosePrice（收盘价）”： 0.0， “点”：0、 “奖励”： -self.cf.symbol（self.assets[i]， “transaction_fee”）， “DateDuration”：self._day、 “状态”：0、 “LimitStep”：self.current_step、 “ActionStep”： -1， “CloseStep”：-1、 } self.transaction_limit_order.append（事务）还：交易 = { “票证”：self.ticket_id、 “Symbol”：self.assets[i]， “ActionTime”：self._t、 “类型”：_action、 “手数”： 1， “ActionPrice”：self._c、 “止损”： self.cf.symbol（self.assets[i]， “stop_loss_max”）， “PT”：_profit_taken、 “MaxDD”：0、 “Swap（掉期）”：0.0、 “CloseTime”： “”， ///// “ClosePrice（收盘价）”： 0.0， “点”：0、 “奖励”： -self.cf.symbol（self.assets[i]， “transaction_fee”）， “DateDuration”：self._day、 “状态”：0、 “LimitStep”：self.current_step、 “ActionStep”：self.current_step、 “CloseStep”：-1、 } self.current_holding[i] += 1 self.tranaction_open_this_step.append（事务） self.balance -= self.cf.symbol（self.assets[i]， “transaction_fee”） self.transaction_live.append（事务） return sum（奖励） def _calculate_reward（self， i， done， action）： _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[：]： # 复制以避免修改问题 if tr[“Symbol”] == self.assets[i]： _point = self.cf.symbol（self.assets[i]， “点”） # 隔夜现金折扣如果 self._day > tr[“DateDuration”]： tr[“DateDuration”] = self._day tr[“奖励”] -= self.cf.symbol（self.assets[i]， “over_night_penalty”） if tr[“Type”] == 0： # 买入 # 止损触发器 _sl_price = tr[“ActionPrice”] - tr[“SL”] / _point _pt_price = tr[“ActionPrice”] + tr[“PT”] / _point 如果完成： p = （self._c - tr[“ActionPrice”]） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p elif self._l <= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._h >= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 else： # 仍然打开 self.current_draw_downs[i] = int（（self._l - tr[“ActionPrice”]） * _point） _max_draw_down += self.current_draw_downs[i] 如果 self.current_draw_downs[i] < 0 和 tr[“MaxDD”] > self.current_draw_downs[i]： tr[“最大DD”] = self.current_draw_downs[i] elif tr[“Type”] == 1： # 卖出 # 止损触发器 _sl_price = tr[“ActionPrice”] + tr[“SL”] / _point _pt_price = tr[“ActionPrice”] - tr[“PT”] / _point 如果完成： p = （tr[“ActionPrice”] - self._c） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p elif self._h >= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._l <= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 还： self.current_draw_downs[i] = int（（tr[“ActionPrice”] - self._h） * _point ) _max_draw_down += self.current_draw_downs[i] 如果（ self.current_draw_downs[i] < 0 和 tr[“MaxDD”] > self.current_draw_downs[i] ): tr[“最大DD”] = self.current_draw_downs[i] 如果 _max_draw_down > self.max_draw_downs[i]： self.max_draw_downs[i] = _max_draw_down 返回 _total_reward def _limit_order_process（self， i， _action， done）：对于 self.transaction_limit_order 中的 tr[：]： if tr[“Symbol”] == self.assets[i]： if tr[“Type”] ！= _action 或 done： self.transaction_limit_order.删除（tr） tr[“状态”] = 3 tr[“CloseStep”] = self.current_step self.transaction_history.append（tr） elif （tr[“ActionPrice”] >= self._l 和 _action == 0）或（ tr[“ActionPrice”] <= self._h 和 _action == 1）： tr[“ActionStep”] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol（self.assets[i]， “transaction_fee”） self.transaction_limit_order.删除（tr） self.transaction_live.append（tr） self.tranaction_open_this_step.append（tr） elif （tr[“LimitStep”] + self.cf.symbol（self.assets[i]， “limit_order_expiration”） > self.current_step）： tr[“CloseStep”] = self.current_step tr[“状态”] = 4 self.transaction_limit_order.删除（tr） self.transaction_history.append（tr） def _manage_tranaction（self， tr， _p， close_price， status=1）： self.transaction_live.remove（tr） tr[“收盘价”] = close_price tr[“点”] = int（_p） tr[“奖励”] = int（tr[“奖励”] + _p） # 已实现盈/亏 tr[“状态”] = 状态 # 1=止损/太平洋时间，2=强制平仓，3=取消限制，4=过期限制 tr[“CloseTime”] = self._t tr[“CloseStep”] = self.current_step self.balance += int（tr[“奖励”]） self.total_equity -= int（abs（tr[“奖励”]）） self.tranaction_close_this_step.append（tr） self.transaction_history.append（tr） def analyze_transaction_history（个体经营）：如果不self.transaction_history：指标 = {“trades”： 0， “win_rate”： 0.0， “profit_factor”： 0.0， “sharpe_ratio”： 0.0， “total_profit”： 0.0} 还：交易 = len（self.transaction_history） rewards = [tr[“奖励”] 对于 tr self.transaction_history] wins = sum（如果 r > 0，则奖励中的 r 为 1） losses = sum（如果 r < 0，则奖励中的 r 为 1） gross_profit = sum（r for r for r in rewards if r > 0） gross_loss = abs（sum（r for r for r in rewards if r < 0）） win_rate = 盈利 / 交易如果交易 > 0 否则 0.0 profit_factor = gross_profit / gross_loss 如果 gross_loss > 0 else float（“inf”） # 夏普比率（简化，假设无风险利率 = 0）返回值 = np.array（rewards， dtype=np.float32） sharpe_ratio = np.mean（返回） / np.std（返回）如果 np.std（返回） > 0 else 0.0 total_profit = sum（rewards）指标 = { “trades”：交易、 “win_rate”：win_rate、 “profit_factor”：profit_factor、 “sharpe_ratio”：sharpe_ratio、 “total_profit”：total_profit } # 附加到日志文件其中 open（self.analyze_transaction_history_log_filename， 'a'， newline=''）为 f： writer = csv.DictWriter（f， fieldnames=[“时间戳”， “剧集”， “交易”， “win_rate”， “profit_factor”， “sharpe_ratio”， “total_profit”]） metrics[“timestamp”] = datetime.datetime.now（）.strftime（“%Y-%m-%d %H：%M：%S”） metrics[“episode”] = self.episode writer.writerow（指标）返回指标 def step（self， actions）： self.current_step += 1 # 定义终止和截断条件 terminated = self.balance <= 0 # 剧集因破产而结束（最终状态） truncated = self.current_step == len（self.dt_datetime） - 1 # 由于最大步数（时间限制）而结束剧集 done = terminated or truncated # 合并为 VecEnv 的单个 'done' 标志 # 对于渲染或剧集跟踪，您仍然可以检查任一条件是否为 true 如果完成： self.done_information += f“集数： {self.episode} 平衡： {self.balance} 步数： {self.current_step}\n” self.visualization = 真 self.episode += 1 # 增加 episode 计数器 # 计算基础交易奖励 base_reward = self._take_action（作，完成） # 计算持仓的未实现利润 unrealized_profit = 0 atr_scaling = 0 # 用于市场条件缩放对于 i，枚举（self.assets）中的 asset： atr = self.get_observation（self.current_step， i， “ATR”） atr_scaling += atr # 用于标准化的资产的 ATR 总和对于 self.transaction_live 中的 TR： if tr[“Symbol”] == asset： if tr[“Type”] == 0： # 买入未实现 = （self._c - tr[“ActionPrice”]） * self.cf.symbol（asset， “point”） else： # 卖出未实现 = （tr[“ActionPrice”] - self._c） * self.cf.symbol（asset， “点”） unrealized_profit += 未实现 atr_scaling = atr_scaling / len（self.assets） if atr_scaling > 0 else 1 # 避免被 0 除以 # 持续奖励：仅适用于未实现/已实现的利润，由 ATR 缩放 # 调整 0.01 到 0.05 sustained_reward = （unrealized_profit + base_reward） * 0.01 / atr_scaling 如果self.transaction_live否则为 0 # 如果没有持仓，则对不作为的处罚如果不是 self.transaction_live 和 all（math.floor（a） == 2 for a in actions）： sustained_reward -= 0.1 # 鼓励探索的小额惩罚 total_reward = base_reward + sustained_reward 如果 self._day > self.current_day： self.current_day = self._day self.balance -= self.over_night_cash_penalty 如果 self.balance ！= 0： self.max_draw_down_pct = abs（sum（self.max_draw_downs） / self.balance * 100） OBS = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric”： self.cached_economic_data[self.current_step][“numeric”]， “portfolio_data”：np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs，数据类型=np.float32 ) } print（f“步骤 {self.current_step}：基础奖励={base_reward}，持续奖励={sustained_reward}，总计={total_reward}，余额={self.balance}”） # 信息字典保持不变 info = {“关闭”： self.tranaction_close_this_step} 返回 obs、total_reward、terminated、truncated、info def get_observation（self， _step， _iter=0， col=None）：如果 col 为 None：返回 self.cached_ohlc_data[_step] 如果 col == “_day”：返回 self.cached_time_serial[_step][1] elif col == “_time”：返回 self.cached_time_serial[_step][0] 尝试： col_pos = self.observation_list.index（col） except ValueError 之外： raise ValueError（f“在 observation_list 中找不到列 '{col}”）返回 self.cached_ohlc_data[_step][_iter * len（self.observation_list） + col_pos] def get_observation_vector（self， _dt， cols=无）： cols = self.observation_list 如果 cols 不是其他 cols v = [] for a in self.assets 中：子集 = self.df.query（f'{self.asset_col} == “{a}” & {self.time_col} == “{_dt}”'） assert not subset.empty v += subset.loc[_dt， cols].tolist（）断言 len（v） == len（self.assets） * len（cols）返回 v def get_economic_vector（self， _dt）：子集 = self.df.loc[_dt] events = subset['events'] if isinstance（subset， pd.系列） else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * （self.max_events - len（events）） currency_ids = [self.currency_map.get（e['currency']， 0） for e in events[:self.max_events]] + [0] * （self.max_events - len（events）） numeric_fields = ['actual_norm'， 'forecast_norm'， 'previous_norm'， 'surprise_norm'， 'event_freq'， 'impact_code'] 数字 = 事件中 e 的 [e[字段] 对于numeric_fields中的字段 :self.max_events] + [0] * （self.max_events * 6 - len（事件） * 6）返回 { “event_ids”： np.array（event_ids， dtype=np.int32）， “currency_ids”： np.array（currency_ids， dtype=np.int32）， “数字”： np.array（数字， dtype=np.float32） } def reset（self， seed=None， options=None）： # 设置可重复性的种子如果 seed 不是 None： self._seed（种子）如果self.random_start： self.current_step = random.choice（range（int（len（self.dt_datetime） * 0.5）））还： self.current_step = 0 self.equity_list = [0] * len（self.assets） self.balance = self.balance_initial self.total_equity = self.balance + sum（self.equity_list） self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len（self.assets） self.max_draw_downs = [0.0] * len（self.assets） self.max_draw_down_pct = sum（self.max_draw_downs） / self.balance * 100 self.current_holding = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = “” self.log_header = 真 self.visualization = False OBS = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric”： self.cached_economic_data[self.current_step][“numeric”]， “portfolio_data”：np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs，数据类型=np.float32 ) } 信息 = {} 返回 OBS、INFO def render（self， mode=“human”， title=None， **kwargs）： if mode in （“human”， “file”）：打印输出 = 模式 == “人类” 下午 = { “log_header”： self.log_header， “log_filename”： self.log_filename， “printout”：打印输出， “balance”： self.balance， “balance_initial”：self.balance_initial、 “tranaction_close_this_step”：self.tranaction_close_this_step、 “done_information”：self.done_information、 } render_to_file（**pm）如果 self.log_header： self.log_header = 假 elif 模式 == “graph” 和 self.visualization： print（“绘图...”） p = TradingChart（self.df， self.transaction_history） p.plot（） def close（个体）：通过 def get_sb_env（个体经营）： e = DummyVecEnv（[lambda： self]） obs = e.reset（）返回 E、OBS class CustomFeaturesExtractor（BaseFeaturesExtractor）： def __init__（self， observation_space）： n_assets = （observation_space.spaces[“portfolio_data”].shape[0] - 3） // 2 ohlc_dim = observation_space.spaces[“ohlc_data”].shape[0] max_events = observation_space.spaces[“event_ids”].shape[0] economic_numeric_dim = observation_space.spaces[“economic_numeric”].shape[0] portfolio_dim = observation_space.spaces[“portfolio_data”].形状[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # 检查 CUDA 是否可用，否则使用 CPU self.device = torch.device（“cuda：0” if torch.cuda.is_available（） else “cpu”） print（f“CustomFeaturesExtractor using device： {self.device}”） super（）.__init__（observation_space， features_dim=features_dim） # 将 embedding 移动到所选设备 self.event_embedding = nn.Embedding（num_embeddings=129， embedding_dim=max_events）.to（self.device） self.currency_embedding = nn。Embedding（num_embeddings=6， embedding_dim=max_events）.to（self.device） print（f“自定义功能提取器： n_assets={n_assets}， features_dim={features_dim}”） def forward（self， obs）： ohlc_data = obs[“ohlc_data”].to（self.device） event_ids = obs[“event_ids”].to（self.device， dtype=torch.long） currency_ids = obs[“currency_ids”].to（self.device， dtype=torch.long） economic_numeric = obs[“economic_numeric”].to（self.device） portfolio_data = obs[“portfolio_data”].to（self.device） event_emb = self.event_embedding（event_ids）.均值（dim=1） currency_emb = self.currency_embedding（currency_ids）.mean（dim=1）特征 = torch.cat（[ohlc_data， event_emb， currency_emb， economic_numeric， portfolio_data]， dim=1）返回功能类 CustomMultiInputPolicy（ActorCriticPolicy）： def __init__（self， observation_space， action_space， lr_schedule， *args， **kwargs）： # 检查 CUDA 是否可用，否则使用 CPU 设备 = torch.device（“cuda：0” if torch.cuda.is_available（） else “cpu”） print（f“CustomMultiInputPolicy using device： {device}”） # 提取动作空间边界并将其移动到所选设备 self.action_space_low = torch.tensor （action_space.low， dtype =torch.float32， device=device） self.action_space_high = torch.tensor （action_space.high， dtype =torch.float32， device=device） action_dim = action_space.shape[0] # 资产数量 super（）.__init__（ observation_space， action_space， lr_schedule， features_extractor_class=CustomFeaturesExtractor， features_extractor_kwargs={}， net_arch=dict（pi=[64， 64]， vf=[64， 64]）， *args、 **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor（ features_dim， net_arch=self.net_arch， activation_fn=nn.ReLU ， device=设备）.to（设备） # 定义动作网络以输出高斯的均值和log_std self.action_net = nn.Linear（64， action_dim * 2）.to（device） # 输出每个资产的平均值和log_std self.value_net = nn.线性（64， 1）.to（设备） # 初始化发行版 self.action_dist = SquashedDiagGaussianDistribution （action_dim） self.num_timesteps = 0 def forward（self， obs， deterministic=False）： # 在每次前向传递上增加时间步长 self.num_timesteps += 1 # 提取特征特征 = self.extract_features（OBS） latent_pi，latent_vf = self.mlp_extractor（特征） # 从 action_net 获取 mean 和 log_std action_params = self.action_net（latent_pi） # [批量， n_assets * 2] mean_actions， log_std = action_params.chunk（2， dim=-1） # 拆分为均值和log_std log_std = torch.clamp（log_std， min=-20， max=2） # 稳定log_std # 使用当前参数创建新的 distribution 实例分布 = SquashedDiagGaussianDistribution（self.action_dist.action_dim） # 示例作或获取确定性作作 = distribution.actions_from_params（mean_actions， log_std， deterministic=deterministic） # 从 [-1， 1] 到 [0， 3] 的映射 squashed_actions = self._squash_to_range（作、self.action_space_low、self.action_space_high） # 计算未压缩的 action 的对数概率 log_prob = distribution.log_prob（actions） # 对未压缩的动作使用log_prob # 价值预测值 = self.value_net（latent_vf）如果 self.num_timesteps % 1000 == 0： print（f“步骤 {self.num_timesteps}，训练： {self.training}，作： {squashed_actions}，平均值： {mean_actions.mean（）}，对数标准： {log_std.mean（）}”）返回 squashed_actions、值 log_prob def _squash_to_range（self， actions， low， high）： “”“将压缩的作从 [-1， 1] 缩放到 [low， high]。”“”” 返回（作数 + 1） * （最高价 - 最低价） / 2 + 最低价 def extract_features（自身，OBS）：返回self.features_extractor （OBS） def predict_values（自身， OBS）：特征 = self.extract_features（OBS） _， latent_vf = self.mlp_extractor（特征）返回 self.value_net（latent_vf） def evaluate_actions（自我、OBS、作）： “”“评估训练作（由 PPO 使用）。”“” 特征 = self.extract_features（OBS） latent_pi，latent_vf = self.mlp_extractor（特征） action_params = self.action_net（latent_pi） mean_actions，log_std = action_params.chunk（2， dim=-1） log_std = torch.clamp（log_std，最小=-20，最大=2） # 创建新的 distribution 实例分布 = SquashedDiagGaussianDistribution（self.action_dist.action_dim） # 将 action 解压缩回 [-1， 1] 以进行log_prob计算 unsquashed_actions = 2 * （作 - self.action_space_low） / （self.action_space_high - self.action_space_low） - 1 # 计算log_prob和熵 log_prob = distribution.log_prob（unsquashed_actions） # 对未压缩的动作使用 log_prob entropy = distribution.entropy（） # 熵不需要动作值 = self.value_net（latent_vf）返回值、log_prob、熵 ```

The error you’re encountering—UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U19'), dtype('<U19')) -> None—occurs in your evaluate() function when computing avg_metrics. This suggests a problem with how NumPy is handling the metrics returned by analyze_transaction_history(). Let’s diagnose and fix this step-by-step, then address your log output to ensure everything aligns.
在计算avg_metrics时，您遇到的错误 UFuncTypeError： ufunc 'add' does not contain a loop with signature matching types （dtype（'<U19'）， dtype（'<U19'）） -> None）发生在 evaluate（）函数中。这表明 NumPy 处理 analyze_transaction_history（）返回的指标的方式存在问题。让我们逐步诊断和修复此问题，然后处理您的日志输出以确保一切一致。

Error Analysis 误差分析

Traceback Breakdown 回溯细分

Location: The error happens in evaluate():
位置：错误发生在 evaluate（）中：
python 蟒
avg_metrics = { k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys() }
- This line computes the mean of each metric across episodes using NumPy’s mean().
  这条线使用 NumPy 的 mean（）计算跨剧集的每个指标的平均值。
Error Message: UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U19'), dtype('<U19')) -> None
错误消息：UFuncType错误：ufunc 'add' 不包含具有签名匹配类型（dtype（' 无
- This indicates NumPy tried to perform an addition operation (ufunc 'add') on incompatible types—specifically, strings (dtype('<U19'), Unicode strings of length ≤ 19) instead of numbers.
  这表明 NumPy 尝试对不兼容的类型执行加法作（ufunc 'add'），具体来说，字符串（dtype（'<U19'）、长度≤ 19 的 Unicode 字符串）而不是数字。
Root Cause: The metrics dictionary returned by analyze_transaction_history() contains non-numeric values (e.g., strings) for some keys, and np.mean() can’t compute a mean over strings.
根源：analyze_transaction_history（）返回的 metrics 字典包含某些键的非数字值（例如字符串），并且 np.mean（）无法计算字符串的平均值。

Examining analyze_transaction_history()
检查 analyze_transaction_history（）

def analyze_transaction_history(self):
    if not self.transaction_history:
        metrics = {"trades": 0, "win_rate": 0.0, "profit_factor": 0.0, "sharpe_ratio": 0.0, "total_profit": 0.0}
    else:
        trades = len(self.transaction_history)
        rewards = [tr["Reward"] for tr in self.transaction_history]
        wins = sum(1 for r in rewards if r > 0)
        losses = sum(1 for r in rewards if r < 0)
        gross_profit = sum(r for r in rewards if r > 0)
        gross_loss = abs(sum(r for r in rewards if r < 0))
        
        win_rate = wins / trades if trades > 0 else 0.0
        profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf")
        returns = np.array(rewards, dtype=np.float32)
        sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0.0
        total_profit = sum(rewards)
        metrics = {
            "trades": trades,
            "win_rate": win_rate,
            "profit_factor": profit_factor,
            "sharpe_ratio": sharpe_ratio,
            "total_profit": total_profit
        }
    
    # Append to log file
    with open(self.analyze_transaction_history_log_filename, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "episode", "trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"])
        metrics["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        metrics["episode"] = self.episode
        writer.writerow(metrics)
    
    return metrics

Expected Output: A dictionary with numeric values for trades, win_rate, profit_factor, sharpe_ratio, and total_profit, plus timestamp and episode added for logging.
预期输出：一个字典，其中包含交易、win_rate、profit_factor、sharpe_ratio 和 total_profit 的数值，以及用于日志记录的时间戳和剧集。
Problem: After logging, metrics includes timestamp (a string) and episode (an integer), but evaluate() tries to compute np.mean() over all keys in metrics[0].keys(), including the non-numeric timestamp.
问题：记录后，指标包括 timestamp（字符串）和 episode（整数），但 evaluate（）尝试计算 metrics[0].keys（）中所有键的 np.mean（），包括非数字时间戳。

Why the Error Occurs 发生错误的原因

In evaluate(): 在 evaluate（）中：
python 蟒
avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in metrics[0].keys()}
- metrics[0].keys() includes "timestamp", which is a string (e.g., "2025-03-13 08:51:18").
  metrics[0].keys（）包含“timestamp”，这是一个字符串（例如，“2025-03-13 08：51：18”）。
- [m["timestamp"] for m in metrics] is a list of strings, and np.mean() fails because it can’t average strings.
  [m[“timestamp”] for m in metrics] 是一个字符串列表，np.mean（）失败，因为它无法平均字符串。

Fixing the Error 修复错误

Solution: Exclude Non-Numeric Keys
解决方案：排除非数字键

Modify evaluate() to only compute means for numeric metrics, excluding timestamp and optionally episode (since averaging episode numbers might not be meaningful). Here’s the corrected version:
修改 evaluate（）以仅计算数字指标的均值，不包括 timestamp 和可选的 episode （因为平均剧集编号可能没有意义）。以下是更正后的版本：

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, done, info = env_vec.step(action)  # VecEnv returns 4 values
            episode_rewards += rewards
            
            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]
        
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    # Define numeric keys to average
    numeric_keys = ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in numeric_keys}
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit

Changes: 变化：

Replaced metrics[0].keys() with a hardcoded list of numeric keys (numeric_keys).
将 metrics[0].keys（）替换为硬编码的数字键列表（numeric_keys）。
Excluded timestamp (string) and episode (integer, optional to exclude) from averaging.
从平均值中排除了时间戳（字符串）和剧集（整数，可选排除）。

Why This Works 为什么有效

numeric_keys ensures only numeric values are passed to np.mean(), avoiding the string issue.
numeric_keys 确保仅将数值传递给 np.mean（），从而避免字符串问题。
The error is eliminated because NumPy no longer tries to average incompatible types.
该错误被消除，因为 NumPy 不再尝试平均不兼容的类型。

Verifying with Your Log Output
使用日志输出进行验证

Let’s connect this fix to your log output to ensure it aligns with the environment’s behavior:
让我们将此修复程序连接到您的日志输出，以确保它与环境的行为保持一致：

Step 10310: Base Reward=0, Sustained Reward=-9.317110000000003, Total=-9.317110000000003, Balance=99100
Asset XAUUSD: Action=0.0, Reward=0, Holding=50
Step 10311: Base Reward=0, Sustained Reward=-9.441610000000003, Total=-9.441610000000003, Balance=99100
Asset XAUUSD: Action=0.0, Reward=0, Holding=50
...
Step 10314: Base Reward=0, Sustained Reward=-9.65761, Total=-9.65761, Balance=99100
Asset XAUUSD: Action=0.0, Reward=-961.3609999999994, Holding=50
Step 10315: Base Reward=-961.3609999999994, Sustained Reward=0, Total=-961.3609999999994, Balance=96157

Observations: 观察：
- Action=0.0: Consistently 0.0, suggesting a buy action (math.floor(0.0) = 0), but no new positions open (holding stays at 50, likely due to max_current_holding limit).
  作 = 0.0：始终为 0.0，表示买入作（math.floor（0.0） = 0），但没有新头寸（持有 50，可能是由于 max_current_holding 限制）。
- Base Reward: Mostly 0 until step 10315, where a large negative reward (-961.361) occurs, likely a stop-loss hit (see _calculate_reward()).
  基本奖励：在第 10315 步之前，大部分是 0，此时会出现较大的负奖励（-961.361），可能是止损命中（参见 _calculate_reward（））。
- Sustained Reward: Negative and increasing until step 10315, then 0, indicating unrealized losses from open positions (50 holdings) until they close.
  持续奖励：负数并增加直到第 10315 步，然后是 0，表示未平仓头寸（50 笔持仓）在平仓前未实现的损失。
- Balance Drop: From 99100 to 96157 at step 10315, reflecting the realized loss.
  余额下降：在第 10315 步从 99100 到 96157，反映已实现的亏损。
Connection to Metrics: 与 Metrics 的连接：
- At step 10315, a transaction closes (likely stop-loss), added to transaction_history.
  在步骤 10315 中，交易关闭（可能是止损），添加到 transaction_history。
- analyze_transaction_history() computes metrics based on transaction_history, including total_profit (e.g., -961.361 for this episode).
  analyze_transaction_history（）根据transaction_history计算指标，包括 total_profit（例如，本集为 -961.361）。
- The error occurs during evaluate() when averaging metrics across episodes, triggered by the timestamp string.
  在 evaluate（）期间，在跨剧集平均量度时，会发生此错误，由 timestamp 字符串触发。

Additional Notes 其他说明

Holding=50: Your environment allows up to max_current_holding (configured in config.json). If it’s 50, no new trades open (Action=0.0 ignored), explaining the sustained negative rewards from unrealized losses.
持仓 = 50：您的环境最多允许 max_current_holding（以 config.json 配置）。如果是 50，则没有新交易（忽略 Action=0.0），这解释了未实现亏损的持续负回报。
Reward Calculation: The large negative base_reward at step 10315 matches a stop-loss (SL) hit, reducing balance and clearing sustained_reward (no live transactions left).
奖励计算：步骤 10315 的大负base_reward与止损（SL）命中匹配，减少余额并清算sustained_reward（没有留下实时交易）。

Final Fixed Code 最终固定代码

Update your evaluate() as shown above. Here’s how it fits into your full script (partial snippet for brevity):
如上所示更新你的 evaluate（）。以下是它如何适应您的完整脚本（为简洁起见，部分代码段）：

# ... (rest of your imports and classes unchanged)

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False):
    total_rewards = []
    total_profits = []
    metrics = []
    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            obs, rewards, done, info = env_vec.step(action)
            episode_rewards += rewards
            
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]
        
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    numeric_keys = ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in numeric_keys}
    print(f"Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    return mean_reward if return_mean_reward else mean_profit

# ... (rest of your script unchanged up to study.optimize)
study.optimize(objective, n_trials=1)

Verify the Fix 验证修复

Replace your evaluate() with the fixed version.
将 evaluate（）替换为 fixed 版本。
Run your script again (study.optimize(objective, n_trials=1)).
再次运行脚本（study.optimize（objective， n_trials=1））。
Check that: 检查：
- No UFuncTypeError occurs.
  未发生 UFuncTypeError。
- avg_metrics prints only numeric values (e.g., {'trades': 1.0, 'win_rate': 0.0, ...}).
  avg_metrics 仅打印数值（例如，{'trades'： 1.0， 'win_rate'： 0.0， ...}）。

Bonus: Log Output Insight
奖励：Log Output Insight

Why Action=0.0 Repeatedly? Your policy might be stuck (e.g., low exploration due to deterministic=True in evaluate(), or model convergence to a suboptimal action). Check CustomMultiInputPolicy.forward() output (e.g., mean_actions) to debug.
为什么 action=0.0 反复出现？您的策略可能卡住了（例如，由于 evaluate（）中的 deterministic=True 导致勘探率低，或者模型收敛到次优作）。检查 CustomMultiInputPolicy.forward（）输出（例如，mean_actions）进行调试。
Balance Drop: Matches expected behavior when a stop-loss triggers, confirming _calculate_reward() and _manage_tranaction() work as intended.
余额下降：匹配止损触发时的预期行为，确认 _calculate_reward（）和 _manage_tranaction（）按预期工作。

If you see further issues or want to debug the Action=0.0 behavior, let me know!
如果您看到更多问题或想要调试 Action=0.0 行为，请告诉我！

give header to this log file: ``` def analyze_transaction_history(self): if not self.transaction_history: metrics = {"trades": 0, "win_rate": 0.0, "profit_factor": 0.0, "sharpe_ratio": 0.0, "total_profit": 0.0} else: trades = len(self.transaction_history) rewards = [tr["Reward"] for tr in self.transaction_history] wins = sum(1 for r in rewards if r > 0) losses = sum(1 for r in rewards if r < 0) gross_profit = sum(r for r in rewards if r > 0) gross_loss = abs(sum(r for r in rewards if r < 0)) win_rate = wins / trades if trades > 0 else 0.0 profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf") # Sharpe Ratio (simplified, assumes risk-free rate = 0) returns = np.array(rewards, dtype=np.float32) sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0.0 total_profit = sum(rewards) metrics = { "trades": trades, "win_rate": win_rate, "profit_factor": profit_factor, "sharpe_ratio": sharpe_ratio, "total_profit": total_profit } # Append to log file with open(self.analyze_transaction_history_log_filename, 'a', newline='') as f: writer = csv.DictWriter(f, fieldnames=["timestamp", "episode", "trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]) metrics["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") metrics["episode"] = self.episode writer.writerow(metrics) return metrics ```
Give 标头添加到此日志文件中： ``` def analyze_transaction_history（个体经营）：如果不self.transaction_history：指标 = {“trades”： 0， “win_rate”： 0.0， “profit_factor”： 0.0， “sharpe_ratio”： 0.0， “total_profit”： 0.0} 还：交易 = len（self.transaction_history） rewards = [tr[“奖励”] 对于 tr self.transaction_history] wins = sum（如果 r > 0，则奖励中的 r 为 1） losses = sum（如果 r < 0，则奖励中的 r 为 1） gross_profit = sum（r for r for r in rewards if r > 0） gross_loss = abs（sum（r for r for r in rewards if r < 0）） win_rate = 盈利 / 交易如果交易 > 0 否则 0.0 profit_factor = gross_profit / gross_loss 如果 gross_loss > 0 else float（“inf”） # 夏普比率（简化，假设无风险利率 = 0）返回值 = np.array（rewards， dtype=np.float32） sharpe_ratio = np.mean（返回） / np.std（返回）如果 np.std（返回） > 0 else 0.0 total_profit = sum（rewards）指标 = { “trades”：交易、 “win_rate”：win_rate、 “profit_factor”：profit_factor、 “sharpe_ratio”：sharpe_ratio、 “total_profit”：total_profit } # 附加到日志文件其中 open（self.analyze_transaction_history_log_filename， 'a'， newline=''）为 f： writer = csv.DictWriter（f， fieldnames=[“时间戳”， “剧集”， “交易”， “win_rate”， “profit_factor”， “sharpe_ratio”， “total_profit”]） metrics[“timestamp”] = datetime.datetime.now（）.strftime（“%Y-%m-%d %H：%M：%S”） metrics[“episode”] = self.episode writer.writerow（指标）返回指标 ```

To add a header to the CSV log file generated by your analyze_transaction_history() method, you need to ensure that the header is written only once when the file is first created, rather than appending it repeatedly with each call. Currently, your code uses 'a' (append mode) and writes rows without checking if the file is new or empty, which works fine for appending data but doesn’t handle the initial header explicitly.
要将标头添加到 analyze_transaction_history（）方法生成的 CSV 日志文件中，您需要确保在首次创建文件时仅写入一次标头，而不是在每次调用时重复附加它。目前，您的代码使用 'a' （append 模式）并写入行，而不检查文件是新文件还是空文件，这适用于附加数据，但不会显式处理初始标头。

Here’s how to modify your analyze_transaction_history() method to include a header in the log file (transaction_history_log<datetime>.csv) when it’s first created:
以下是修改 analyze_transaction_history（）方法的方法，以便在首次创建日志文件（transaction_history_log<datetime>.csv）时在日志文件（）中包含标头：

Modified Code with Header
带标头的修改代码

def analyze_transaction_history(self):
    if not self.transaction_history:
        metrics = {"trades": 0, "win_rate": 0.0, "profit_factor": 0.0, "sharpe_ratio": 0.0, "total_profit": 0.0}
    else:
        trades = len(self.transaction_history)
        rewards = [tr["Reward"] for tr in self.transaction_history]
        wins = sum(1 for r in rewards if r > 0)
        losses = sum(1 for r in rewards if r < 0)
        gross_profit = sum(r for r in rewards if r > 0)
        gross_loss = abs(sum(r for r in rewards if r < 0))
        
        win_rate = wins / trades if trades > 0 else 0.0
        profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf")
        
        # Sharpe Ratio (simplified, assumes risk-free rate = 0)
        returns = np.array(rewards, dtype=np.float32)
        sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0.0
        total_profit = sum(rewards)
        metrics = {
            "trades": trades,
            "win_rate": win_rate,
            "profit_factor": profit_factor,
            "sharpe_ratio": sharpe_ratio,
            "total_profit": total_profit
        }
    
    # Prepare metrics with timestamp and episode
    metrics["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    metrics["episode"] = self.episode
    
    # Check if file exists and is empty to write header
    import os
    file_exists = os.path.exists(self.analyze_transaction_history_log_filename)
    file_empty = file_exists and os.stat(self.analyze_transaction_history_log_filename).st_size == 0
    
    # Append to log file with header if it's new or empty
    with open(self.analyze_transaction_history_log_filename, 'a', newline='') as f:
        fieldnames = ["timestamp", "episode", "trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        
        # Write header only if file doesn't exist or is empty
        if not file_exists or file_empty:
            writer.writeheader()
        
        writer.writerow(metrics)
    
    return metrics

Explanation of Changes 变更说明

Check File Status: 检查文件状态：
- os.path.exists(self.analyze_transaction_history_log_filename): Checks if the log file already exists.
  os.path.exists（self.analyze_transaction_history_log_filename）：检查日志文件是否已经存在。
- os.stat(self.analyze_transaction_history_log_filename).st_size == 0: Checks if the file is empty (size = 0 bytes).
  os.stat（self.analyze_transaction_history_log_filename）.st_size == 0：检查文件是否为空（大小 = 0 字节）。
- Combined into file_exists and file_empty conditions.
  组合成 file_exists 和 file_empty 条件。
Write Header Conditionally:
有条件地写入标头：
- writer.writeheader(): Adds the header row (timestamp,episode,trades,win_rate,profit_factor,sharpe_ratio,total_profit) to the CSV.
  writer.writeheader（）：将标题行（timestamp，episode，trades，win_rate，profit_factor，sharpe_ratio，total_profit）添加到 CSV。
- Only called if the file doesn’t exist (not file_exists) or exists but is empty (file_empty).
  仅当文件不存在（不是 file_exists）或存在但为空（file_empty）时调用。
- Ensures the header is written exactly once when the file is first created.
  确保在首次创建文件时只写入一次标头。
Append Data: 追加数据：
- writer.writerow(metrics): Appends the metrics row as before, now guaranteed to follow the header.
  writer.writerow（metrics）：像以前一样附加 metrics 行，现在保证跟随标头。

Resulting Log File 生成的日志文件

After running your script with this change, the CSV file (e.g., transaction_history_log20250313123456.csv) will look like this:
在运行包含此更改的脚本后，CSV 文件（例如 transaction_history_log20250313123456.csv）将如下所示：

timestamp,episode,trades,win_rate,profit_factor,sharpe_ratio,total_profit
2025-03-13 12:34:56,1,1,0.0,0.0,-1.23,-961.361
2025-03-13 12:35:00,2,2,0.5,1.2,0.45,50.0
...

Header Row: timestamp,episode,trades,win_rate,profit_factor,sharpe_ratio,total_profit is written once at the top.
标题行：timestamp，episode，trades，win_rate，profit_factor，sharpe_ratio，total_profit 在顶部写入一次。
Data Rows: Subsequent calls append metric rows without duplicating the header.
数据行：后续调用会附加指标行，而不会复制标头。

Why This is Necessary 为什么这是必要的

Without this check, using 'a' mode alone would append the header repeatedly if you manually called writeheader() each time, leading to a cluttered file:
如果没有此检查，如果您每次都手动调用 writeheader（），则单独使用 'a' 模式将重复附加标头，从而导致文件混乱：
text 发短信
timestamp,episode,trades,win_rate,profit_factor,sharpe_ratio,total_profit 2025-03-13 12:34:56,1,1,0.0,0.0,-1.23,-961.361 timestamp,episode,trades,win_rate,profit_factor,sharpe_ratio,total_profit 2025-03-13 12:35:00,2,2,0.5,1.2,0.45,50.0
The conditional check ensures a clean, standard CSV format.
条件检查可确保干净、标准的 CSV 格式。

Integration with Your Environment
与您的环境集成

Existing Logic: Your analyze_transaction_history() is called in evaluate() after each episode, and the log file is unique per environment instance (due to the timestamp in self.analyze_transaction_history_log_filename set in __init__).
现有逻辑：你的 analyze_transaction_history（）在每个事件之后在 evaluate（）中调用，并且日志文件对于每个环境实例都是唯一的（由于 self.analyze_transaction_history_log_filename 中的时间戳在 __init__中设置）。
No Conflicts: Since each environment instance gets its own log file (e.g., transaction_history_log20250313123456.csv), the header is written once per file, and subsequent calls append data correctly.
无冲突：由于每个环境实例都有自己的日志文件（例如，transaction_history_log20250313123456.csv），因此每个文件都会写入一次标头，并且后续调用会正确附加数据。

Verify the Change 验证更改

Update analyze_transaction_history() with the code above.
使用上面的代码更新 analyze_transaction_history（）。
Run your script (study.optimize(objective, n_trials=1)).
运行脚本（study.optimize（objective， n_trials=1））。
Check the generated CSV file:
检查生成的 CSV 文件：
- Open transaction_history_log<datetime>.csv in a text editor or spreadsheet.
  在文本编辑器或电子表格中打开 transaction_history_log<datetime>.csv。
- Confirm the header appears once at the top, followed by data rows.
  确认标题在顶部显示一次，然后是数据行。

Additional Notes 其他说明

Alternative Approach: You could initialize the file with a header in __init__ or reset(), but since analyze_transaction_history() is where logging happens, handling it here keeps the logic centralized.
替代方法：您可以使用 __init__ 或 reset（）中的 header 初始化文件，但由于 analyze_transaction_history（）是进行日志记录的地方，因此在此处处理它可以保持逻辑集中。
File Reset: If you want to reset the log file per episode or trial, you’d need to modify reset() or use 'w' mode (overwriting), but that’s not your current intent based on the timestamp naming.
文件重置：如果你想重置每个剧集或试用版的日志文件，你需要修改 reset（）或使用 'w' 模式（覆盖），但这不是你基于时间戳命名的当前意图。

This modification ensures your log file is properly formatted with a header, aligning with standard CSV practices. Let me know if you need further tweaks!
此修改可确保您的日志文件使用标头正确格式化，从而与标准 CSV 做法保持一致。如果您需要进一步调整，请告诉我！

deterministic is True better or False will learn and result better?
确定性是 True 更好还是 False 会学习和结果更好？

In the context of your reinforcement learning (RL) setup with Stable-Baselines3 (SB3) and your CustomMultiInputPolicy, the choice of deterministic=True versus deterministic=False in model.predict() (used in your evaluate() function) has significant implications for how your agent behaves during evaluation and training. Let’s explore which is "better" for learning and achieving better results, considering your specific environment and goals.
在使用 Stable-Baselines3 （SB3）和 CustomMultiInputPolicy 的强化学习（RL）设置的上下文中，model.predict（）（在 evaluate（）函数中使用）中选择 deterministic=True 与 deterministic=False 对代理在评估和训练期间的行为有重大影响。让我们探讨一下哪个“更好”有利于学习和取得更好的结果，同时考虑到您的具体环境和目标。

Understanding deterministic in SB3
了解 SB3 中的确定性

model.predict(obs, deterministic=...):
model.predict（obs， deterministic=...）中：
- This method calls your CustomMultiInputPolicy.forward(obs, deterministic=...) to generate actions.
  该方法调用您的 CustomMultiInputPolicy.forward（obs， deterministic=...）生成作。
- In your policy: 在您的策略中：
  python 蟒
  def forward(self, obs, deterministic=False): # ... (feature extraction and MLP) actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) return squashed_actions, values, log_prob
  - deterministic=True: The policy outputs the mean action of the Gaussian distribution (mean_actions) without sampling. This is deterministic behavior—given the same observation, it always produces the same action.
    deterministic=True：该策略输出高斯分布（mean_actions）的平均作用，而不进行采样。这是确定性行为 — 给定相同的观察值，它总是产生相同的作。
  - deterministic=False: The policy samples an action from the Gaussian distribution defined by mean_actions and log_std. This introduces stochasticity (randomness), so the same observation can yield different actions.
    deterministic=False 中：该策略从由 mean_actions 和 log_std 定义的高斯分布中对作进行采样。这会引入随机性（随机性），因此相同的观察结果可能会产生不同的作。
Where It’s Used: 使用位置：
- In your evaluate() function:
  在 evaluate（）函数中：
  python 蟒
  action, _ = model.predict(obs, deterministic=True)
  - Currently, you’re using deterministic=True during evaluation.
    目前，你在评估期间使用 deterministic=True。
- During training (model.learn()), SB3’s PPO algorithm uses deterministic=False by default in collect_rollouts() to explore the environment, unless explicitly overridden.
  在训练（model.learn（））期间，SB3 的 PPO 算法在 collect_rollouts（）中默认使用 deterministic=False 来探索环境，除非明确覆盖。

deterministic=True vs. deterministic=False: Pros and Cons
deterministic=True vs. deterministic=False：优点和缺点

deterministic=True deterministic=真

Behavior: The agent always picks the "best" action it has learned (the mean of the policy distribution), ignoring randomness.
行为：代理总是选择它学到的 “最佳”作（策略分布的平均值），忽略随机性。
Pros: 优点：
1. Consistency: During evaluation, you get reproducible results—same inputs yield the same actions, making it easier to assess the policy’s performance reliably.
  一致性：在评估过程中，您可以获得可重复的结果 — 相同的输入会产生相同的作，从而更容易可靠地评估策略的绩效。
2. Exploits Learned Policy: It reflects the agent’s optimal strategy based on training, assuming the mean action is the most promising choice.
  Exploits Learned 策略：它反映了代理基于训练的最佳策略，假设平均动作是最有希望的选择。
3. Stable Metrics: Useful for comparing performance across trials or models (e.g., in study.optimize()), as randomness doesn’t skew results.
  稳定的指标：对于比较不同试验或模型的性能（例如，在 study.optimize（）中）很有用，因为随机性不会扭曲结果。
Cons: 缺点：
1. No Exploration: It doesn’t test how the agent handles uncertainty or adapts to variability, potentially overestimating performance if the environment has stochastic elements.
  无探索：它不会测试代理如何处理不确定性或适应可变性，如果环境具有随机元素，则可能会高估性能。
2. Stuck in Local Optima: If training didn’t explore sufficiently, the mean action might be suboptimal, and you won’t see alternative strategies during evaluation.
  卡在本地最理想中：如果训练没有充分探索，则平均作可能不是最优的，并且您在评估期间不会看到替代策略。
When to Use: Best for final evaluation or when you want to deploy the trained policy in a real-world scenario where consistency is critical (e.g., live trading).
何时使用：最适合最终评估，或者当您想在一致性至关重要的实际场景中部署经过训练的策略时（例如，实时交易）。

deterministic=False

Behavior: The agent samples actions from the Gaussian distribution, introducing randomness based on log_std.
行为：代理从高斯分布中对作进行采样，引入基于log_std的随机性。
Pros: 优点：
1. Exploration: It mimics the training process, allowing the agent to explore different actions, which can reveal robustness or adaptability in the policy.
  勘探：它模拟训练过程，允许代理探索不同的作，从而揭示策略的稳健性或适应性。
2. Better Learning Insight: During evaluation, you see how the agent balances exploration and exploitation, potentially identifying if the policy’s variance (log_std) is too high or low.
  更好的学习洞察：在评估过程中，您可以看到代理如何平衡探索和利用，从而可能识别策略的差异（log_std）是太高还是太低。
3. Realistic Performance: In a noisy or stochastic environment (like forex trading with market fluctuations), it tests how the agent performs under uncertainty, closer to real-world conditions.
  逼真的性能：在嘈杂或随机的环境中（如市场波动的外汇交易），它测试代理在不确定性下的表现，更接近现实世界的条件。
Cons: 缺点：
1. Inconsistent Results: Random sampling means evaluation metrics (e.g., mean profit) vary across runs, making it harder to compare policies directly.
  不一致的结果：随机抽样意味着评估指标（例如，平均利润）因运行而异，因此更难直接比较策略。
2. Suboptimal Actions: Sampled actions might deviate from the mean, leading to worse short-term performance, even if the policy is well-trained.
  次优作：抽样作可能会偏离平均值，从而导致更差的短期性能，即使策略训练有素。
When to Use: Best for training diagnostics or early evaluation to understand the policy’s exploration behavior and variance.
何时使用：最适合训练诊断或早期评估，以了解策略的探索行为和差异。

Your Specific Context 您的具体情况

Current Behavior (from Logs)
当前行为（来自日志）

Step 10310: Base Reward=0, Sustained Reward=-9.31711, Total=-9.31711, Balance=99100
Asset XAUUSD: Action=0.0, Reward=0, Holding=50
Step 10311: Base Reward=0, Sustained Reward=-9.44161, Total=-9.44161, Balance=99100
Asset XAUUSD: Action=0.0, Reward=0, Holding=50
...
Step 10315: Base Reward=-961.361, Sustained Reward=0, Total=-961.361, Balance=96157

Observation: Your agent repeatedly outputs Action=0.0 during evaluation (deterministic=True), suggesting:
观察：您的代理在评估期间重复输出 Action=0.0 （deterministic=True），建议：
- The policy’s mean action (mean_actions) is stuck at or near 0.0 (mapped to [0, 3] via _squash_to_range()).
  策略的平均作（mean_actions）卡在 0.0 或附近（通过 _squash_to_range（）映射到 [0， 3]）。
- No new trades open (Holding=50 constant) until a stop-loss triggers at step 10315, likely due to reaching max_current_holding.
  在第 10315 步触发止损之前，没有新交易开放（持有 = 50 常数），可能是由于达到 max_current_holding。
Implication: With deterministic=True, the agent isn’t exploring alternative actions (e.g., sell or hold) and may be overfitting to a single strategy (buying until max holding).
蕴涵：使用 deterministic=True 时，代理没有探索其他作（例如，卖出或持有），并且可能过度拟合单个策略（买入至最大持有量）。

Environment and Policy 环境与政策

Action Space: Box(low=0, high=3, shape=(n_assets,))
作空间：框（low=0， high=3， shape=（n_assets，））
- 0 → Buy, 1 → Sell, 2 → Nothing (interpreted via math.floor(action) in _take_action()).
  0 → 买入，1 → 卖出，2 → 无（通过 _take_action（）中的 math.floor（action）解释）。
Policy: CustomMultiInputPolicy uses a SquashedDiagGaussianDistribution:
Policy：CustomMultiInputPolicy 使用 SquashedDiagGaussianDistribution：
- Outputs mean_actions and log_std, squashed from [-1, 1] to [0, 3].
  输出 mean_actions 和 log_std，从 [-1， 1] 压缩到 [0， 3]。
- deterministic=True takes mean_actions, while False samples around it.
  deterministic=True 接受 mean_actions，而 False 围绕它进行采样。

Training vs. Evaluation 训练与评估

Training: PPO uses deterministic=False by default in collect_rollouts(), allowing exploration via sampling. Your agent learns a distribution over actions.
训练：PPO 在 collect_rollouts（）中默认使用 deterministic=False，允许通过采样进行探索。您的代理学习作的分配。
Evaluation: You’re overriding this with deterministic=True, testing only the mean action.
评估：您将使用 deterministic=True 覆盖此函数，仅测试平均作。

Which is Better for Learning and Results?
哪个更适合学习和结果？

For Learning (Training Phase)
用于学习（培训阶段）

deterministic=False is Better During Training:
deterministic=False 在训练期间更好：
- Why: RL relies on exploration to discover optimal strategies. Sampling actions (deterministic=False) during training allows the agent to try different actions (e.g., buy, sell, hold) and learn from their outcomes.
  为什么：RL 依靠探索来发现最佳策略。训练期间的采样作（deterministic=False）允许代理尝试不同的作（例如，购买、出售、持有）并从其结果中学习。
- Your Case: If the agent only sees Action=0.0 (buy) during evaluation, it suggests insufficient exploration during training. Low log_std (variance) or premature convergence might limit the policy to a suboptimal mean.
  您的案例：如果代理在评估期间只看到 Action=0.0 （buy），则表明在训练期间探索不足。低log_std（方差）或过早收敛可能会将策略限制为次优均值。
- Evidence: Your log shows Action=0.0 repeatedly, and Holding=50 suggests the agent isn’t adapting (e.g., selling to realize profits or avoid losses).
  证据：您的日志反复显示 Action=0.0，而 Holding=50 表明代理没有适应（例如，出售以实现利润或避免损失）。
- Fix: Ensure training explores adequately:
  修复：确保培训充分探索：
  - Check log_std in forward() (e.g., print(f"Log Std: {log_std.mean()}")). If it’s too low (e.g., near -20), increase exploration by adjusting PPO’s clip_range or policy entropy regularization (e.g., ent_coef).
    检查 forward（）中的 log_std（例如，print（f“Log Std： {log_std.mean（）}”））。如果它太低（例如，接近 -20），则通过调整 PPO 的 clip_range 或策略熵正则化（例如 ent_coef）来增加探索。

For Results (Evaluation Phase)
结果（评估阶段）

deterministic=True is Better for Final Results:
deterministic=True 更适合最终结果：
- Why: Once trained, you want the agent to exploit its best strategy (the mean action) to maximize profit consistently, especially for deployment or final performance metrics.
  为什么：经过训练后，您希望代理利用其最佳策略（平均行动）来始终如一地最大化利润，尤其是对于部署或最终性能指标。
- Your Goal: In forex trading, you likely want stable, repeatable profits (e.g., in study.optimize()), which deterministic=True provides by sticking to the learned policy’s “best guess.”
  您的目标：在外汇交易中，您可能希望获得稳定、可重复的利润（例如，在 study.optimize（）中），通过坚持学习策略的“最佳猜测”来提供 deterministic=True 利润。
- Caveat: If training didn’t converge to a good policy (e.g., stuck at Action=0.0), deterministic evaluation will reflect that suboptimal result.
  警告：如果训练没有收敛到一个好的策略（例如，卡在 Action=0.0），确定性评估将反映该次优结果。
deterministic=False Can Improve Early Evaluation:
deterministic=False可以提高早期评估：
- Why: Early in development, sampling actions during evaluation helps diagnose whether the policy has learned a reasonable distribution (e.g., balancing buy/sell/hold) or is overly confident in a bad strategy.
  为什么：在开发初期，评估期间的抽样作有助于诊断策略是否已经学会了合理的分布（例如，平衡买入/卖出/持有）或对糟糕的策略过于自信。
- Your Case: Switching to deterministic=False in evaluate() might reveal if the agent can produce varied actions (e.g., 1.5 or 2.8) and adapt, rather than fixating on 0.0.
  您的案例：在 evaluate（）中切换到 deterministic=False 可能会揭示代理是否可以产生不同的动作（例如，1.5 或 2.8）并适应，而不是固定在 0.0 上。

Recommendation 建议

Short-Term: Test Both in Evaluation
短期：在评估中同时测试

Modify evaluate() to Compare:
修改 evaluate（）以比较：
python 蟒
def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False, deterministic=True): total_rewards = [] total_profits = [] metrics = [] for _ in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=deterministic) obs, rewards, done, info = env_vec.step(action) episode_rewards += rewards print(f"Action: {action}") # Debug action output for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) numeric_keys = ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"] avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in numeric_keys} print(f"Deterministic={deterministic}, Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Average Metrics: {avg_metrics}") return mean_reward if return_mean_reward else mean_profit # Test both print("Evaluating with deterministic=True:") evaluate(model, env_vec, deterministic=True) print("\nEvaluating with deterministic=False:") evaluate(model, env_vec, deterministic=False)
What to Look For: 需要注意的事项
- True: Consistent Action=0.0 with stable (but possibly poor) profits.
  真：一致的 Action=0.0 具有稳定（但可能很差）的利润。
- False: Varied actions (e.g., [0.5], [2.1])—if still 0.0, the policy’s log_std is too low, indicating poor exploration.
  假：各种作（例如，[0.5]、[2.1]） — 如果仍为 0.0，则策略的 log_std 太低，表示探索不佳。

Long-Term: Optimize Training, Then Use True
Long-term：优化训练，然后使用 true

Improve Training Exploration:
改进训练探索：
- Increase ent_coef in PPO (e.g., ent_coef=0.01) to encourage higher log_std and more diverse actions:
  提高 PPO 的 ent_coef（例如，ent_coef=0.01），以鼓励更高的log_std和更多样化的行动：
  python 蟒
  model = PPO( CustomMultiInputPolicy, train_env_vec, learning_rate=lr, n_steps=n_steps, batch_size=batch_size, n_epochs=n_epochs, ent_coef=0.01, # Encourage exploration verbose=1 )
- Check log_std during training (already logged every 1000 steps in forward()). Aim for values closer to 0 or 1, not -20.
  在训练期间检查 log_std（已在 forward（）中每 1000 步记录一次）。目标是接近 0 或 1 的值，而不是 -20。
Evaluate with deterministic=False Early:
使用 deterministic=False Early 进行计算：
- Use this to verify the policy learns a distribution (e.g., actions spread across [0, 3]).
  使用此选项可验证策略是否学习了分布（例如，作分布在 [0， 3] 中）。
- If actions remain stuck, adjust hyperparameters (e.g., clip_range, n_steps).
  如果作仍然卡住，请调整超参数（例如，clip_range、n_steps）。
Switch to deterministic=True for Final Results:
切换到 deterministic=True 以获得最终结果：
- Once training produces a policy with varied actions and good profits (e.g., positive mean_profit in logs), use deterministic=True in evaluate() to lock in the best strategy.
  一旦训练产生了一个具有不同作和良好利润的策略（例如，日志中的正mean_profit），在 evaluate（）中使用 deterministic=True 来锁定最佳策略。

Conclusion 结论

For Learning: deterministic=False during training and early evaluation is better to ensure the agent explores and learns a robust policy. Your current Action=0.0 loop suggests exploration is lacking—test False in evaluate() to diagnose.
对于 Learning：deterministic=False 在训练和早期评估期间更好地确保代理探索和学习稳健的策略。您当前的 Action=0.0 循环表明缺少探索 - 在 evaluate（）中测试 False 以进行诊断。
For Results: deterministic=True is better once the policy is well-trained, as it maximizes exploitation of the learned strategy for consistent profits.
对于策略经过良好训练后，最好使用 Result：deterministic=True 来获得一致的利润，因为它最大限度地利用了所学的策略以获得一致的利润。

Next Steps: 后续步骤：

Run the modified evaluate() with both settings.
使用这两个设置运行修改后的 evaluate（）。
If deterministic=False still yields 0.0, tweak training (e.g., ent_coef) and share the updated logs or log_std values for further tuning advice!
如果 deterministic=False 仍然产生 0.0，请调整训练（例如，ent_coef）并共享更新的日志或 log_std 值以获得进一步的优化建议！

if action = 1.0, it is buy or sell, because 1 should be sell but the balance didn't change: Step 1366: Base Reward=0, Sustained Reward=4.791014388264544, Total=4.791014388264544, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1367: Base Reward=0, Sustained Reward=6.3687861051046175, Total=6.3687861051046175, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1368: Base Reward=0, Sustained Reward=9.076477421540147, Total=9.076477421540147, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1369: Base Reward=0, Sustained Reward=17.769505169795874, Total=17.769505169795874, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1370: Base Reward=0, Sustained Reward=3.500852647613906, Total=3.500852647613906, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1371: Base Reward=0, Sustained Reward=0.8837990080245854, Total=0.8837990080245854, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1372: Base Reward=0, Sustained Reward=0.5982223854193366, Total=0.5982223854193366, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1373: Base Reward=0, Sustained Reward=0.7290066962613, Total=0.7290066962613, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1374: Base Reward=0, Sustained Reward=0.8488516918526922, Total=0.8488516918526922, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1375: Base Reward=0, Sustained Reward=1.0502403431910352, Total=1.0502403431910352, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1376: Base Reward=0, Sustained Reward=0.957284694103678, Total=0.957284694103678, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1377: Base Reward=0, Sustained Reward=1.1278450366935873, Total=1.1278450366935873, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1378: Base Reward=0, Sustained Reward=1.3804403294097483, Total=1.3804403294097483, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1379: Base Reward=0, Sustained Reward=1.691404887297575, Total=1.691404887297575, Balance=99100 Asset XAUUSD: Action=1.0, Reward=0, Holding=50 Step 1380: Base Reward=0, Sustained Reward=2.110638898428104, Total=2.110638898428104, Balance=99100 ``` import datetime import math import random import ast import torch import torch.nn as nn import csv import numpy as np import gymnasium as gym from gymnasium import spaces from gymnasium.utils import seeding from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor, MlpExtractor from stable_baselines3.common.policies import ActorCriticPolicy from stable_baselines3.common.torch_layers import MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution from stable_baselines3 import PPO from stable_baselines3.common.policies import ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart import TradingChart from meta.env_fx_trading.util.read_config import EnvConfig class tgym(gym.Env): """forex/future/option trading gym environment 1. Three action space (0 Buy, 1 Sell, 2 Nothing) 2. Multiple trading pairs (EURUSD, GBPUSD...) under same time frame 3. Timeframe from 1 min to daily as long as use candlestick bar (Open, High, Low, Close) 4. Use StopLose, ProfitTaken to realize rewards. each pair can configure it own SL and PT in configure file 5. Configure over night cash penalty and each pair's transaction fee and overnight position holding penalty 6. Split dataset into daily, weekly or monthly..., with fixed time steps, at end of len(df). The business logic will force to Close all positions at last Close price (game over). 7. Must have df column name: [(time_col),(asset_col), Open,Close,High,Low,day] (case sensitive) 8. Addition indicators can add during the data process. 78 available TA indicator from Finta 9. Customized observation list handled in json config file. 10. ProfitTaken = fraction_action * max_profit_taken + SL. 11. SL is pre-fixed 12. Limit order can be configure, if limit_order == True, the action will preset buy or sell at Low or High of the bar, with a limit_order_expiration (n bars). It will be triggered if the price go cross. otherwise, it will be drop off 13. render mode: human -- display each steps realized reward on console file -- create a transaction log graph -- create transaction in graph (under development) 14. 15. Reward, we want to incentivize profit that is sustained over long periods of time. At each step, we will set the reward to the account balance multiplied by some fraction of the number of time steps so far.The purpose of this is to delay rewarding the agent too fast in the early stages and allow it to explore sufficiently before optimizing a single strategy too deeply. It will also reward agents that maintain a higher balance for longer, rather than those who rapidly gain money using unsustainable strategies. 16. Observation_space contains all of the input variables we want our agent to consider before making, or not making a trade. We want our agent to “see” the forex data points (Open price, High, Low, Close, time serial, TA) in the game window, as well a couple other data points like its account balance, current positions, and current profit.The intuition here is that for each time step, we want our agent to consider the price action leading up to the current price, as well as their own portfolio’s status in order to make an informed decision for the next action. 17. reward is forex trading unit Point, it can be configure for each trading pair 18. To make the unrealized profit reward reflect market conditions, we’ll compute ATR for each asset and use it to scale the reward dynamically. """ metadata = {"render.modes": ["graph", "human", "file", "none"]} def __init__( self, df, event_map, currency_map, env_config_file="./neo_finrl/env_fx_trading/config/gdbusd-test-1.json", ): assert df.ndim == 2 super(tgym, self).__init__() self.cf = EnvConfig(env_config_file) self.observation_list = self.cf.env_parameters("observation_list") # Economic data mappings self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy() if 'events' not in self.df.columns: raise ValueError("DataFrame must contain an 'events' column") def parse_events(x): if isinstance(x, str): try: parsed = ast.literal_eval(x) return parsed if isinstance(parsed, list) else [] except (ValueError, SyntaxError): return [] return x if isinstance(x, list) else [] self.df['events'] = self.df['events'].apply(parse_events) if not isinstance(self.df['events'].iloc[0], list): raise ValueError("'events' must be a list") if self.df['events'].iloc[0] and not isinstance(self.df['events'].iloc[0][0], dict): raise ValueError("Elements in 'events' must be dictionaries") self.balance_initial = self.cf.env_parameters("balance") self.over_night_cash_penalty = self.cf.env_parameters("over_night_cash_penalty") self.asset_col = self.cf.env_parameters("asset_col") self.time_col = self.cf.env_parameters("time_col") self.random_start = self.cf.env_parameters("random_start") log_file_datetime = datetime.datetime.now().strftime("%Y%m%d%H%M%S") self.log_filename = ( self.cf.env_parameters("log_filename") + log_file_datetime + ".csv" ) self.analyze_transaction_history_log_filename = ("transaction_history_log" + log_file_datetime + ".csv") self.df["_time"] = self.df[self.time_col] self.df["_day"] = self.df["weekday"] self.assets = self.df[self.asset_col].unique() self.dt_datetime = self.df[self.time_col].sort_values().unique() self.df = self.df.set_index(self.time_col) self.visualization = False # Reset values self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_step = 0 self.episode = 0 # Start from 0, increment on episode end self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True # Cache data self.cached_ohlc_data = [self.get_observation_vector(_dt) for _dt in self.dt_datetime] self.cached_economic_data = [self.get_economic_vector(_dt) for _dt in self.dt_datetime] self.cached_time_serial = ( self.df[["_time", "_day"]].sort_values("_time").drop_duplicates().values.tolist() ) self.reward_range = (-np.inf, np.inf) self.action_space = spaces.Box(low=0, high=3, shape=(len(self.assets),), dtype=np.float32) self.observation_space = spaces.Dict({ "ohlc_data": spaces.Box(low=-np.inf, high=np.inf, shape=(len(self.assets) * len(self.observation_list),), dtype=np.float32), "event_ids": spaces.Box(low=0, high=len(self.event_map)-1, shape=(self.max_events,), dtype=np.int32), "currency_ids": spaces.Box(low=0, high=len(self.currency_map)-1, shape=(self.max_events,), dtype=np.int32), "economic_numeric": spaces.Box(low=-np.inf, high=np.inf, shape=(self.max_events * 6,), dtype=np.float32), "portfolio_data": spaces.Box(low=-np.inf, high=np.inf, shape=(3 + 2 * len(self.assets),), dtype=np.float32) }) print( f"initial done:\n" f"observation_list:{self.observation_list}\n" f"assets:{self.assets}\n" f"time serial: {min(self.dt_datetime)} -> {max(self.dt_datetime)} length: {len(self.dt_datetime)}\n" f"events: {len(self.event_map)}, currencies: {len(self.currency_map)}" ) self._seed() def _seed(self, seed=None): self.np_random, seed = seeding.np_random(seed) return [seed] def _take_action(self, actions, done): # action = math.floor(x), # profit_taken = math.ceil((x- math.floor(x)) * profit_taken_max - stop_loss_max ) # _actions = np.floor(actions).astype(int) # _profit_takens = np.ceil((actions - np.floor(actions)) *self.cf.symbol(self.assets[i],"profit_taken_max")).astype(int) _action = 2 _profit_taken = 0 rewards = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # need use multiply assets for i, action in enumerate(actions): # Actions are now floats between 0 and 3 self._o = self.get_observation(self.current_step, i, "Open") self._h = self.get_observation(self.current_step, i, "High") self._l = self.get_observation(self.current_step, i, "Low") self._c = self.get_observation(self.current_step, i, "Close") self._t = self.get_observation(self.current_step, i, "_time") self._day = self.get_observation(self.current_step, i, "_day") # Extract integer action type and fractional part _action = math.floor(action) # 0=Buy, 1=Sell, 2=Nothing rewards[i] = self._calculate_reward(i, done, _action) # Pass action for exploration reward print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}") if self.cf.symbol(self.assets[i], "limit_order"): self._limit_order_process(i, _action, done) if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")): # Dynamically calculate PT using action fraction _profit_taken = math.ceil( (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max") ) + self.cf.symbol(self.assets[i], "stop_loss_max") self.ticket_id += 1 if self.cf.symbol(self.assets[i], "limit_order"): transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._l if _action == 0 else self._h, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": -1, "CloseStep": -1, } self.transaction_limit_order.append(transaction) else: transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._c, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": self.current_step, "CloseStep": -1, } self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction) return sum(rewards) def _calculate_reward(self, i, done, action): _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[:]: # Copy to avoid modification issues if tr["Symbol"] == self.assets[i]: _point = self.cf.symbol(self.assets[i], "point") # cash discount overnight if self._day > tr["DateDuration"]: tr["DateDuration"] = self._day tr["Reward"] -= self.cf.symbol(self.assets[i], "over_night_penalty") if tr["Type"] == 0: # Buy # stop loss trigger _sl_price = tr["ActionPrice"] - tr["SL"] / _point _pt_price = tr["ActionPrice"] + tr["PT"] / _point if done: p = (self._c - tr["ActionPrice"]) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._l <= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._h >= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: # still open self.current_draw_downs[i] = int((self._l - tr["ActionPrice"]) * _point) _max_draw_down += self.current_draw_downs[i] if self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i]: tr["MaxDD"] = self.current_draw_downs[i] elif tr["Type"] == 1: # Sell # stop loss trigger _sl_price = tr["ActionPrice"] + tr["SL"] / _point _pt_price = tr["ActionPrice"] - tr["PT"] / _point if done: p = (tr["ActionPrice"] - self._c) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._h >= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._l <= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: self.current_draw_downs[i] = int( (tr["ActionPrice"] - self._h) * _point ) _max_draw_down += self.current_draw_downs[i] if ( self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i] ): tr["MaxDD"] = self.current_draw_downs[i] if _max_draw_down > self.max_draw_downs[i]: self.max_draw_downs[i] = _max_draw_down return _total_reward def _limit_order_process(self, i, _action, done): for tr in self.transaction_limit_order[:]: if tr["Symbol"] == self.assets[i]: if tr["Type"] != _action or done: self.transaction_limit_order.remove(tr) tr["Status"] = 3 tr["CloseStep"] = self.current_step self.transaction_history.append(tr) elif (tr["ActionPrice"] >= self._l and _action == 0) or ( tr["ActionPrice"] <= self._h and _action == 1): tr["ActionStep"] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_limit_order.remove(tr) self.transaction_live.append(tr) self.tranaction_open_this_step.append(tr) elif (tr["LimitStep"] + self.cf.symbol(self.assets[i], "limit_order_expiration") > self.current_step): tr["CloseStep"] = self.current_step tr["Status"] = 4 self.transaction_limit_order.remove(tr) self.transaction_history.append(tr) def _manage_tranaction(self, tr, _p, close_price, status=1): self.transaction_live.remove(tr) tr["ClosePrice"] = close_price tr["Point"] = int(_p) tr["Reward"] = int(tr["Reward"] + _p) # Realized profit/loss tr["Status"] = status # 1=SL/PT, 2=Forced close, 3=Canceled limit, 4=Expired limit tr["CloseTime"] = self._t tr["CloseStep"] = self.current_step self.balance += int(tr["Reward"]) self.total_equity -= int(abs(tr["Reward"])) self.tranaction_close_this_step.append(tr) self.transaction_history.append(tr) def analyze_transaction_history(self): if not self.transaction_history: metrics = {"trades": 0, "win_rate": 0.0, "profit_factor": 0.0, "sharpe_ratio": 0.0, "total_profit": 0.0} else: trades = len(self.transaction_history) rewards = [tr["Reward"] for tr in self.transaction_history] wins = sum(1 for r in rewards if r > 0) losses = sum(1 for r in rewards if r < 0) gross_profit = sum(r for r in rewards if r > 0) gross_loss = abs(sum(r for r in rewards if r < 0)) win_rate = wins / trades if trades > 0 else 0.0 profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf") # Sharpe Ratio (simplified, assumes risk-free rate = 0) returns = np.array(rewards, dtype=np.float32) sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0.0 total_profit = sum(rewards) metrics = { "trades": trades, "win_rate": win_rate, "profit_factor": profit_factor, "sharpe_ratio": sharpe_ratio, "total_profit": total_profit } # Prepare metrics with timestamp and episode metrics["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") metrics["episode"] = self.episode # Check if file exists and is empty to write header import os file_exists = os.path.exists(self.analyze_transaction_history_log_filename) file_empty = file_exists and os.stat(self.analyze_transaction_history_log_filename).st_size == 0 # Append to log file with header if it's new or empty with open(self.analyze_transaction_history_log_filename, 'a', newline='') as f: fieldnames = ["timestamp", "episode", "trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"] writer = csv.DictWriter(f, fieldnames=fieldnames) # Write header only if file doesn't exist or is empty if not file_exists or file_empty: writer.writeheader() writer.writerow(metrics) return metrics def step(self, actions): self.current_step += 1 # Define termination and truncation conditions terminated = self.balance <= 0 # Episode ends due to bankruptcy (terminal state) truncated = self.current_step == len(self.dt_datetime) - 1 # Episode ends due to max steps (time limit) done = terminated or truncated # Combine into a single 'done' flag for VecEnv # For rendering or episode tracking, you might still check if either condition is true if done: self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n" self.visualization = True self.episode += 1 # Increment episode counter # Calculate base trading reward base_reward = self._take_action(actions, done) # Calculate unrealized profit from open positions unrealized_profit = 0 atr_scaling = 0 # For market condition scaling for i, asset in enumerate(self.assets): atr = self.get_observation(self.current_step, i, "ATR") atr_scaling += atr # Sum ATR across assets for normalization for tr in self.transaction_live: if tr["Symbol"] == asset: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1 # Avoid division by 0 # Sustained reward: only applies to unrealized/realized profits, scaled by ATR # adjust 0.01 to 0.05 sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0 # Penalty for inaction if no positions are held if not self.transaction_live and all(math.floor(a) == 2 for a in actions): sustained_reward -= 0.1 # Small penalty to encourage exploration total_reward = base_reward + sustained_reward if self._day > self.current_day: self.current_day = self._day self.balance -= self.over_night_cash_penalty if self.balance != 0: self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100) obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}") # Info dictionary remains unchanged info = {"Close": self.tranaction_close_this_step} return obs, total_reward, terminated, truncated, info def get_observation(self, _step, _iter=0, col=None): if col is None: return self.cached_ohlc_data[_step] if col == "_day": return self.cached_time_serial[_step][1] elif col == "_time": return self.cached_time_serial[_step][0] try: col_pos = self.observation_list.index(col) except ValueError: raise ValueError(f"Column '{col}' not found in observation_list") return self.cached_ohlc_data[_step][_iter * len(self.observation_list) + col_pos] def get_observation_vector(self, _dt, cols=None): cols = self.observation_list if cols is None else cols v = [] for a in self.assets: subset = self.df.query(f'{self.asset_col} == "{a}" & {self.time_col} == "{_dt}"') assert not subset.empty v += subset.loc[_dt, cols].tolist() assert len(v) == len(self.assets) * len(cols) return v def get_economic_vector(self, _dt): subset = self.df.loc[_dt] events = subset['events'] if isinstance(subset, pd.Series) else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) currency_ids = [self.currency_map.get(e['currency'], 0) for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) numeric_fields = ['actual_norm', 'forecast_norm', 'previous_norm', 'surprise_norm', 'event_freq', 'impact_code'] numeric = [e[field] for e in events[:self.max_events] for field in numeric_fields] + [0] * (self.max_events * 6 - len(events) * 6) return { "event_ids": np.array(event_ids, dtype=np.int32), "currency_ids": np.array(currency_ids, dtype=np.int32), "numeric": np.array(numeric, dtype=np.float32) } def reset(self, seed=None, options=None): # Set the seed for reproducibility if seed is not None: self._seed(seed) if self.random_start: self.current_step = random.choice(range(int(len(self.dt_datetime) * 0.5))) else: self.current_step = 0 self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True self.visualization = False obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } info = {} return obs, info def render(self, mode="human", title=None, **kwargs): if mode in ("human", "file"): printout = mode == "human" pm = { "log_header": self.log_header, "log_filename": self.log_filename, "printout": printout, "balance": self.balance, "balance_initial": self.balance_initial, "tranaction_close_this_step": self.tranaction_close_this_step, "done_information": self.done_information, } render_to_file(**pm) if self.log_header: self.log_header = False elif mode == "graph" and self.visualization: print("plotting...") p = TradingChart(self.df, self.transaction_history) p.plot() def close(self): pass def get_sb_env(self): e = DummyVecEnv([lambda: self]) obs = e.reset() return e, obs class CustomFeaturesExtractor(BaseFeaturesExtractor): def __init__(self, observation_space): n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2 ohlc_dim = observation_space.spaces["ohlc_data"].shape[0] max_events = observation_space.spaces["event_ids"].shape[0] economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0] portfolio_dim = observation_space.spaces["portfolio_data"].shape[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # Check if CUDA is available, else use CPU self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomFeaturesExtractor using device: {self.device}") super().__init__(observation_space, features_dim=features_dim) # Move embeddings to the selected device self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device) self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device) print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}") def forward(self, obs): ohlc_data = obs["ohlc_data"].to(self.device) event_ids = obs["event_ids"].to(self.device, dtype=torch.long) currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long) economic_numeric = obs["economic_numeric"].to(self.device) portfolio_data = obs["portfolio_data"].to(self.device) event_emb = self.event_embedding(event_ids).mean(dim=1) currency_emb = self.currency_embedding(currency_ids).mean(dim=1) features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1) return features class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Check if CUDA is available, else use CPU device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomMultiInputPolicy using device: {device}") # Extract action space bounds and move them to the selected device self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=device ).to(device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities of the unsquashed actions log_prob = distribution.log_prob(actions) # Use log_prob on the unsquashed actions # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return squashed_actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Unsquash the actions back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob(unsquashed_actions) # Use log_prob on unsquashed actions entropy = distribution.entropy() # Entropy doesn’t need actions values = self.value_net(latent_vf) return values, log_prob, entropy ```
如果 action = 1.0，则为 Buy 或 Sell，因为 1 应该是 SELL，但余额没有变化：步骤1366：基础奖励=0，持续奖励=4.791014388264544，总计=4.791014388264544，余额=99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1367：基础奖励=0，持续奖励=6.3687861051046175，总计=6.3687861051046175，余额=99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1368：基础奖励 = 0，持续奖励 = 9.076477421540147，总计 = 9.076477421540147，余额 = 99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1369：基础奖励 = 0，持续奖励 = 17.769505169795874，总计 = 17.769505169795874，余额 = 99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1370：基础奖励 = 0，持续奖励 = 3.500852647613906，总计 = 3.500852647613906，余额 = 99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1371：基础奖励 = 0，持续奖励 = 0.8837990080245854，总计 = 0.8837990080245854，余额 = 99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1372：基础奖励=0，持续奖励=0.5982223854193366，总计=0.5982223854193366，余额=99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1373：基础奖励 = 0，持续奖励 = 0.7290066962613，总计 = 0.7290066962613，余额 = 99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1374：基础奖励=0，持续奖励=0.8488516918526922，总计=0.8488516918526922，余额=99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1375：基础奖励=0，持续奖励=1.0502403431910352，总计=1.0502403431910352，余额=99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1376：基础奖励 = 0，持续奖励 = 0.957284694103678，总计 = 0.957284694103678，余额 = 99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1377：基础奖励=0，持续奖励=1.1278450366935873，总计=1.1278450366935873，余额=99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1378：基础奖励=0，持续奖励=1.3804403294097483，总计=1.3804403294097483，余额=99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤 1379：基础奖励 = 0，持续奖励 = 1.691404887297575，总计 = 1.691404887297575，余额 = 99100 资产 XAUUSD：作 = 1.0，奖励 = 0，持仓 = 50 步骤1380：基础奖励=0，持续奖励=2.110638898428104，总=2.110638898428104，余额=99100 ``` 导入 datetime 导入数学导入随机导入 AST Import Torch （导入火炬）将 torch.nn 导入为 nn 导入 CSV 将 numpy 导入为 NP 将 gymnasium 导入为 Gym 从 Gymnasium 导入空间 from gymnasium.utils import seeding from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor， MlpExtractor 从 stable_baselines3.common.policies 导入 ActorCriticPolicy 从 stable_baselines3.common.torch_layers 导入 MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution 从 stable_baselines3 进口 PPO 从 stable_baselines3.common.policies 导入 ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart 导入 TradingChart from meta.env_fx_trading.util.read_config import EnvConfig class tgym（健身房.Env）： “”“外汇/期货/期权交易 GYM 环境 1. 三个作空间（0 买入、1 卖出、2 无） 2. 同一时间范围内的多个交易对（EURUSD、GBPUSD... 3. 时间范围从 1 分钟到每天，只要使用烛台柱（开盘价、最高价、最低价、收盘价） 4. 使用 StopLose、ProfitTaken 实现奖励。每对都可以在配置文件中配置自己的 SL 和 PT 5. 配置隔夜现金罚金以及每对的交易手续费和隔夜持仓罚金 6. 将数据集拆分为每日、每周或每月...，具有固定的时间步长，在 len（df）结束时。业务 logic 将强制在最后收盘价平仓所有持仓（游戏结束）。 7. 必须具有 df 列名称：[（time_col），（asset_col），Open，Close，High，Low，day]（区分大小写） 8. 加法指标可以在数据处理过程中添加。Finta 提供 78 种 TA 指标 9. 在 json 配置文件中处理的自定义观察列表。 10. ProfitTaken = fraction_action * max_profit_taken + 止损。 11. SL 是预先固定的 12. 限价单可以配置，如果 limit_order == True，则动作将预设在柱线的低价或高点买入或卖出，带 limit_order_expiration （n 条）。如果价格交叉，它将触发。否则，它将是 drop off 13. 渲染模式： human -- 在控制台上显示每个步骤实现的奖励 file -- 创建事务日志 graph -- 在 Graph 中创建交易（开发中） 14. 15. 奖励，我们希望激励长期持续的利润。在每个步骤中，我们将奖励设置为账户余额乘以到目前为止时间步数的一小部分。这样做的目的是拖延在早期阶段过快地奖励代理并允许其探索在过于深入地优化单个策略之前。它还将奖励在更长时间内保持较高余额的代理，而不是那些使用不可持续的策略快速赚钱的人。 16. Observation_space 包含我们需要代理的所有输入变量在进行交易或不进行交易之前考虑。我们希望我们的代理 “看到” 游戏窗口中的外汇数据点（开盘价、最高价、最低价、收盘价、时间序列、TA），以及其他一些数据点，如账户余额、当前头寸、和当前利润。这里的直觉是，对于每个时间步，我们都需要我们的代理考虑导致当前价格的价格行为，以及它们的拥有投资组合的状态，以便为下一步行动做出明智的决策。 17. 奖励为外汇交易单位积分，可为每个交易对配置 18. 为了使未实现利润奖励反映市场状况，我们将计算每种资产的 ATR，并使用它来动态扩展奖励。 """ 元数据 = {“render.modes”： [“graph”， “human”， “file”， “none”]} 防守 __init__（自我 DF / event_map， currency_map， env_config_file=“./neo_finrl/env_fx_trading/config/gdbusd-test-1.json”， ): 断言 df.ndim == 2 超级（tgym， self）.__init__（） self.cf = 环境配置（env_config_file） self.observation_list = self.cf.env_parameters（“observation_list”） # 经济数据映射 self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy（）如果 'events' 不在 self.df.columns 中： raise ValueError（“DataFrame 必须包含一个 'events' 列”） def parse_events（x）：如果 isinstance（x， str）：尝试：解析 = ast.literal_eval（x）返回已解析 if isinstance（parsed， list） else [] except （ValueError， SyntaxError）：返回 [] 返回 x if isinstance（x， list） else [] self.df['事件'] = self.df['事件'].apply（parse_events）如果不是 isinstance（self.df['events'].iloc[0]， list）： raise ValueError（“'事件'必须是一个列表”）如果 self.df['events'].iloc[0] 而不是 isinstance（self.df['events'].iloc[0][0]， dict）： raise ValueError（“'events' 中的元素必须是字典”） self.balance_initial = self.cf.env_parameters（“余额”） self.over_night_cash_penalty = self.cf.env_parameters（“over_night_cash_penalty”） self.asset_col = self.cf.env_parameters（“asset_col”） self.time_col = self.cf.env_parameters（“time_col”） self.random_start = self.cf.env_parameters（“random_start”） log_file_datetime = datetime.datetime.now（）.strftime（“%Y%m%d%H%M%S”） self.log_filename = （ self.cf.env_parameters（“log_filename”） + log_file_datetime + “.csv” ) self.analyze_transaction_history_log_filename = （“transaction_history_log” + log_file_datetime + “.csv”） self.df[“_time”] = self.df[self.time_col] self.df[“_day”] = self.df[“工作日”] self.assets = self.df[self.asset_col].unique（） self.dt_datetime = self.df[self.time_col].sort_values（）.unique（） self.df = self.df.set_index（self.time_col） self.visualization = False # 重置值 self.equity_list = [0] * len（self.assets） self.balance = self.balance_initial self.total_equity = self.balance + sum（self.equity_list） self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len（self.assets） self.max_draw_downs = [0.0] * len（self.assets） self.max_draw_down_pct = sum（self.max_draw_downs） / self.balance * 100 self.current_step = 0 self.episode = 0 # 从 0 开始，在剧集结束时递增 self.current_holding = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = “” self.log_header = 真 # 缓存数据 self.cached_ohlc_data = [self.get_observation_vector（_dt）表示 self.dt_datetime _dt] self.cached_economic_data = [self.get_economic_vector（_dt）表示 self.dt_datetime 年的 _dt] self.cached_time_serial = （ self.df[[“_time”， “_day”]].sort_values（“_time”）.drop_duplicates（）.values.tolist（） ) self.reward_range = （-np.inf， np.inf） self.action_space = 空格。盒子（low=0， high=3， shape=（len（self.assets），）， dtype=np.float32） self.observation_space = 空格。字典（{ “ohlc_data”：空格。框（low=-np.inf， high=np.inf， shape=（len（self.assets） * len（self.observation_list），）， dtype=np.float32）， “event_ids”：空格。框（low=0， high=len（self.event_map）-1， shape=（self.max_events，）， dtype=np.int32）， “currency_ids”：空格。框（低=0，高=len（self.currency_map）-1， shape=（self.max_events，）， dtype=np.int32）， “economic_numeric”：空格。盒子（low=-np.inf， high=np.inf， shape=（self.max_events * 6，）， dtype=np.float32）， “portfolio_data”：空格。盒子（low=-np.inf， high=np.inf， shape=（3 + 2 * len（self.assets），）， dtype=np.float32） }) 打印（ f“初始完成：\n” f“observation_list：{self.observation_list}\n” f“资产：{self.assets}\n” f“时间序列： {min（self.dt_datetime）} -> {max（self.dt_datetime）} 长度： {len（self.dt_datetime）}\n” f“事件：{len（self.event_map）}，货币：{len（self.currency_map）}” ) self._seed（） def _seed（self， seed=None）： self.np_random，种子 = seeding.np_random（种子）返回 [种子] def _take_action（self， actions， done）： #作 = math.floor（x）， # profit_taken = math.ceil（（x- math.floor（x）） * profit_taken_max - stop_loss_max ） # _actions = np.floor（作）.astype（int） # _profit_takens = np.ceil（（作 - np.floor（作）） *self.cf.symbol（self.assets[i]，“profit_taken_max”））.astype（int） _action = 2 _profit_taken = 0 奖励 = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # 需要使用 multiply 资产对于 i， action in enumerate（actions）： # 动作现在在 0 和 3 之间浮动 self._o = self.get_observation（self.current_step， i， “打开”） self._h = self.get_observation（self.current_step， i， “高”） self._l = self.get_observation（self.current_step， i， “低”） self._c = self.get_observation（self.current_step， i， “关闭”） self._t = self.get_observation（self.current_step， i， “_time”） self._day = self.get_observation（self.current_step， i， “_day”） # 提取整数动作类型和小数部分 _action = math.floor（action） # 0=买入，1=卖出，2=什么都没有 rewards[i] = self._calculate_reward（i， done， _action） # 通过行动以获得探索奖励 print（f“资产 {self.assets[i]}： action={action}， reward={rewards[i]}， Holding={self.current_holding[i]}”）如果 self.cf.symbol（self.assets[i]， “limit_order”）： self._limit_order_process（i， _action，完成）如果（ _action 英寸（0， 1）并且未完成和 self.current_holding[i] < self.cf.symbol（self.assets[i]， “max_current_holding”））： # 使用 action fraction 动态计算 PT _profit_taken = math.ceil（（作 - _action） * self.cf.symbol（self.assets[i]， “profit_taken_max”）） + self.cf.symbol（self.assets[i]， “stop_loss_max”） self.ticket_id += 1 如果 self.cf.symbol（self.assets[i]， “limit_order”）：交易 = { “票证”：self.ticket_id、 “Symbol”：self.assets[i]， “ActionTime”：self._t、 “类型”：_action、 “手数”： 1， “ActionPrice”： self._l if _action == 0 else self._h， “止损”： self.cf.symbol（self.assets[i]， “stop_loss_max”）， “PT”：_profit_taken、 “MaxDD”：0、 “Swap（掉期）”：0.0、 “CloseTime”： “”， ///// “ClosePrice（收盘价）”： 0.0， “点”：0、 “奖励”： -self.cf.symbol（self.assets[i]， “transaction_fee”）， “DateDuration”：self._day、 “状态”：0、 “LimitStep”：self.current_step、 “ActionStep”： -1， “CloseStep”：-1、 } self.transaction_limit_order.append（事务）还：交易 = { “票证”：self.ticket_id、 “Symbol”：self.assets[i]， “ActionTime”：self._t、 “类型”：_action、 “手数”： 1， “ActionPrice”：self._c、 “止损”： self.cf.symbol（self.assets[i]， “stop_loss_max”）， “PT”：_profit_taken、 “MaxDD”：0、 “Swap（掉期）”：0.0、 “CloseTime”： “”， ///// “ClosePrice（收盘价）”： 0.0， “点”：0、 “奖励”： -self.cf.symbol（self.assets[i]， “transaction_fee”）， “DateDuration”：self._day、 “状态”：0、 “LimitStep”：self.current_step、 “ActionStep”：self.current_step、 “CloseStep”：-1、 } self.current_holding[i] += 1 self.tranaction_open_this_step.append（事务） self.balance -= self.cf.symbol（self.assets[i]， “transaction_fee”） self.transaction_live.append（事务） return sum（奖励） def _calculate_reward（self， i， done， action）： _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[：]： # 复制以避免修改问题 if tr[“Symbol”] == self.assets[i]： _point = self.cf.symbol（self.assets[i]， “点”） # 隔夜现金折扣如果 self._day > tr[“DateDuration”]： tr[“DateDuration”] = self._day tr[“奖励”] -= self.cf.symbol（self.assets[i]， “over_night_penalty”） if tr[“Type”] == 0： # 买入 # 止损触发器 _sl_price = tr[“ActionPrice”] - tr[“SL”] / _point _pt_price = tr[“ActionPrice”] + tr[“PT”] / _point 如果完成： p = （self._c - tr[“ActionPrice”]） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p elif self._l <= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._h >= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 else： # 仍然打开 self.current_draw_downs[i] = int（（self._l - tr[“ActionPrice”]） * _point） _max_draw_down += self.current_draw_downs[i] 如果 self.current_draw_downs[i] < 0 和 tr[“MaxDD”] > self.current_draw_downs[i]： tr[“最大DD”] = self.current_draw_downs[i] elif tr[“Type”] == 1： # 卖出 # 止损触发器 _sl_price = tr[“ActionPrice”] + tr[“SL”] / _point _pt_price = tr[“ActionPrice”] - tr[“PT”] / _point 如果完成： p = （tr[“ActionPrice”] - self._c） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p elif self._h >= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._l <= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 还： self.current_draw_downs[i] = int（（tr[“ActionPrice”] - self._h） * _point ) _max_draw_down += self.current_draw_downs[i] 如果（ self.current_draw_downs[i] < 0 和 tr[“MaxDD”] > self.current_draw_downs[i] ): tr[“最大DD”] = self.current_draw_downs[i] 如果 _max_draw_down > self.max_draw_downs[i]： self.max_draw_downs[i] = _max_draw_down 返回 _total_reward def _limit_order_process（self， i， _action， done）：对于 self.transaction_limit_order 中的 tr[：]： if tr[“Symbol”] == self.assets[i]： if tr[“Type”] ！= _action 或 done： self.transaction_limit_order.删除（tr） tr[“状态”] = 3 tr[“CloseStep”] = self.current_step self.transaction_history.append（tr） elif （tr[“ActionPrice”] >= self._l 和 _action == 0）或（ tr[“ActionPrice”] <= self._h 和 _action == 1）： tr[“ActionStep”] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol（self.assets[i]， “transaction_fee”） self.transaction_limit_order.删除（tr） self.transaction_live.append（tr） self.tranaction_open_this_step.append（tr） elif （tr[“LimitStep”] + self.cf.symbol（self.assets[i]， “limit_order_expiration”） > self.current_step）： tr[“CloseStep”] = self.current_step tr[“状态”] = 4 self.transaction_limit_order.删除（tr） self.transaction_history.append（tr） def _manage_tranaction（self， tr， _p， close_price， status=1）： self.transaction_live.remove（tr） tr[“收盘价”] = close_price tr[“点”] = int（_p） tr[“奖励”] = int（tr[“奖励”] + _p） # 已实现盈/亏 tr[“状态”] = 状态 # 1=止损/太平洋时间，2=强制平仓，3=取消限制，4=过期限制 tr[“CloseTime”] = self._t tr[“CloseStep”] = self.current_step self.balance += int（tr[“奖励”]） self.total_equity -= int（abs（tr[“奖励”]）） self.tranaction_close_this_step.append（tr） self.transaction_history.append（tr） def analyze_transaction_history（个体经营）：如果不self.transaction_history：指标 = {“trades”： 0， “win_rate”： 0.0， “profit_factor”： 0.0， “sharpe_ratio”： 0.0， “total_profit”： 0.0} 还：交易 = len（self.transaction_history） rewards = [tr[“奖励”] 对于 tr self.transaction_history] wins = sum（如果 r > 0，则奖励中的 r 为 1） losses = sum（如果 r < 0，则奖励中的 r 为 1） gross_profit = sum（r 为 r for r for r in rewards if r > 0） gross_loss = abs（sum（r for r for r in rewards if r < 0）） win_rate = 盈利 / 交易如果交易 > 0 否则 0.0 profit_factor = gross_profit / gross_loss 如果 gross_loss > 0 else float（“inf”） # 夏普比率（简化，假设无风险利率 = 0）返回值 = np.array（rewards， dtype=np.float32） sharpe_ratio = np.mean（返回） / np.std（返回）如果 np.std（返回） > 0 else 0.0 total_profit = sum（rewards）指标 = { “trades”：交易、 “win_rate”：win_rate、 “profit_factor”：profit_factor、 “sharpe_ratio”：sharpe_ratio、 “total_profit”：total_profit } # 准备带有 timestamp 和 episode 的指标 metrics[“timestamp”] = datetime.datetime.now（）.strftime（“%Y-%m-%d %H：%M：%S”） metrics[“episode”] = self.episode # 检查文件是否存在且为空以写入 header 导入作系统 file_exists = os.path.exists（self.analyze_transaction_history_log_filename） file_empty = file_exists 和 os.stat（self.analyze_transaction_history_log_filename）.st_size == 0 # 如果日志文件是新的或空的，则附加到带有 header 的日志文件其中 open（self.analyze_transaction_history_log_filename， 'a'， newline=''）为 f：字段名称 = [“时间戳”， “剧集”， “交易”， “win_rate”， “profit_factor”， “sharpe_ratio”， “total_profit”] writer = csv.DictWriter （f，字段名称=字段名称） # 仅当 file 不存在或为空时才写入 header 如果不是 file_exists 或 file_empty： writer.writeheader（） writer.writerow（指标）返回指标 def step（self， actions）： self.current_step += 1 # 定义终止和截断条件 terminated = self.balance <= 0 # 剧集因破产而结束（最终状态） truncated = self.current_step == len（self.dt_datetime） - 1 # 由于最大步数（时间限制）而结束剧集 done = terminated or truncated # 合并为 VecEnv 的单个 'done' 标志 # 对于渲染或剧集跟踪，您仍然可以检查任一条件是否为 true 如果完成： self.done_information += f“集数： {self.episode} 平衡： {self.balance} 步数： {self.current_step}\n” self.visualization = 真 self.episode += 1 # 增加 episode 计数器 # 计算基础交易奖励 base_reward = self._take_action（作，完成） # 计算持仓的未实现利润 unrealized_profit = 0 atr_scaling = 0 # 用于市场条件缩放对于 i，枚举（self.assets）中的 asset： atr = self.get_observation（self.current_step， i， “ATR”） atr_scaling += atr # 用于标准化的资产的 ATR 总和对于 self.transaction_live 中的 TR： if tr[“Symbol”] == asset： if tr[“Type”] == 0： # 买入未实现 = （self._c - tr[“ActionPrice”]） * self.cf.symbol（asset， “point”） else： # 卖出未实现 = （tr[“ActionPrice”] - self._c） * self.cf.symbol（asset， “点”） unrealized_profit += 未实现 atr_scaling = atr_scaling / len（self.assets） if atr_scaling > 0 else 1 # 避免被 0 除以 # 持续奖励：仅适用于未实现/已实现的利润，由 ATR 缩放 # 调整 0.01 到 0.05 sustained_reward = （unrealized_profit + base_reward） * 0.01 / atr_scaling 如果self.transaction_live否则为 0 # 如果没有持仓，则对不作为的处罚如果不是 self.transaction_live 和 all（math.floor（a） == 2 for a in actions）： sustained_reward -= 0.1 # 鼓励探索的小额惩罚 total_reward = base_reward + sustained_reward 如果 self._day > self.current_day： self.current_day = self._day self.balance -= self.over_night_cash_penalty 如果 self.balance ！= 0： self.max_draw_down_pct = abs（sum（self.max_draw_downs） / self.balance * 100） OBS = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric”： self.cached_economic_data[self.current_step][“numeric”]， “portfolio_data”：np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs，数据类型=np.float32 ) } print（f“步骤 {self.current_step}：基础奖励={base_reward}，持续奖励={sustained_reward}，总计={total_reward}，余额={self.balance}”） # 信息字典保持不变 info = {“关闭”： self.tranaction_close_this_step} 返回 obs、total_reward、terminated、truncated、info def get_observation（self， _step， _iter=0， col=None）：如果 col 为 None：返回 self.cached_ohlc_data[_step] 如果 col == “_day”：返回 self.cached_time_serial[_step][1] elif col == “_time”：返回 self.cached_time_serial[_step][0] 尝试： col_pos = self.observation_list.index（col） except ValueError 之外： raise ValueError（f“在 observation_list 中找不到列 '{col}”）返回 self.cached_ohlc_data[_step][_iter * len（self.observation_list） + col_pos] def get_observation_vector（self， _dt， cols=无）： cols = self.observation_list 如果 cols 不是其他 cols v = [] for a in self.assets 中：子集 = self.df.query（f'{self.asset_col} == “{a}” & {self.time_col} == “{_dt}”'） assert not subset.empty v += subset.loc[_dt， cols].tolist（）断言 len（v） == len（self.assets） * len（cols）返回 v def get_economic_vector（self， _dt）：子集 = self.df.loc[_dt] events = subset['events'] if isinstance（subset， pd.系列） else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * （self.max_events - len（events）） currency_ids = [self.currency_map.get（e['currency']， 0） for e in events[:self.max_events]] + [0] * （self.max_events - len（events）） numeric_fields = ['actual_norm'， 'forecast_norm'， 'previous_norm'， 'surprise_norm'， 'event_freq'， 'impact_code'] 数字 = 事件中 e 的 [e[字段] 对于numeric_fields中的字段 :self.max_events] + [0] * （self.max_events * 6 - len（事件） * 6）返回 { “event_ids”： np.array（event_ids， dtype=np.int32）， “currency_ids”： np.array（currency_ids， dtype=np.int32）， “数字”： np.array（数字， dtype=np.float32） } def reset（self， seed=None， options=None）： # 设置可重复性的种子如果 seed 不是 None： self._seed（种子）如果self.random_start： self.current_step = random.choice（range（int（len（self.dt_datetime） * 0.5）））还： self.current_step = 0 self.equity_list = [0] * len（self.assets） self.balance = self.balance_initial self.total_equity = self.balance + sum（self.equity_list） self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len（self.assets） self.max_draw_downs = [0.0] * len（self.assets） self.max_draw_down_pct = sum（self.max_draw_downs） / self.balance * 100 self.current_holding = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = “” self.log_header = 真 self.visualization = False OBS = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric”： self.cached_economic_data[self.current_step][“numeric”]， “portfolio_data”：np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs，数据类型=np.float32 ) } 信息 = {} 返回 OBS、INFO def render（self， mode=“human”， title=None， **kwargs）： if mode in （“human”， “file”）：打印输出 = 模式 == “人类” 下午 = { “log_header”： self.log_header， “log_filename”： self.log_filename， “printout”：打印输出， “balance”： self.balance， “balance_initial”：self.balance_initial、 “tranaction_close_this_step”：self.tranaction_close_this_step、 “done_information”：self.done_information、 } render_to_file（**pm）如果 self.log_header： self.log_header = 假 elif 模式 == “graph” 和 self.visualization： print（“绘图...”） p = TradingChart（self.df， self.transaction_history） p.plot（） def close（个体）：通过 def get_sb_env（个体经营）： e = DummyVecEnv（[lambda： self]） obs = e.reset（）返回 E、OBS class CustomFeaturesExtractor（BaseFeaturesExtractor）： def __init__（self， observation_space）： n_assets = （observation_space.spaces[“portfolio_data”].shape[0] - 3） // 2 ohlc_dim = observation_space.spaces[“ohlc_data”].shape[0] max_events = observation_space.spaces[“event_ids”].shape[0] economic_numeric_dim = observation_space.spaces[“economic_numeric”].shape[0] portfolio_dim = observation_space.spaces[“portfolio_data”].形状[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # 检查 CUDA 是否可用，否则使用 CPU self.device = torch.device（“cuda：0” if torch.cuda.is_available（） else “cpu”） print（f“CustomFeaturesExtractor using device： {self.device}”） super（）.__init__（observation_space， features_dim=features_dim） # 将 embedding 移动到所选设备 self.event_embedding = nn.Embedding（num_embeddings=129， embedding_dim=max_events）.to（self.device） self.currency_embedding = nn。Embedding（num_embeddings=6， embedding_dim=max_events）.to（self.device） print（f“CustomFeaturesExtractor： n_assets={n_assets}， features_dim={features_dim}”） def forward（self， obs）： ohlc_data = obs[“ohlc_data”].to（self.device） event_ids = obs[“event_ids”].to（self.device， dtype=torch.long） currency_ids = obs[“currency_ids”].to（self.device， dtype=torch.long） economic_numeric = obs[“economic_numeric”].to（self.device） portfolio_data = obs[“portfolio_data”].to（self.device） event_emb = self.event_embedding（event_ids）.均值（dim=1） currency_emb = self.currency_embedding（currency_ids）.mean（dim=1）特征 = torch.cat（[ohlc_data， event_emb， currency_emb， economic_numeric， portfolio_data]， dim=1）返回功能类 CustomMultiInputPolicy（ActorCriticPolicy）： def __init__（self， observation_space， action_space， lr_schedule， *args， **kwargs）： # 检查 CUDA 是否可用，否则使用 CPU 设备 = torch.device（“cuda：0” if torch.cuda.is_available（） else “cpu”） print（f“CustomMultiInputPolicy using device： {device}”） # 提取动作空间边界并将其移动到所选设备 self.action_space_low = torch.tensor （action_space.low， dtype =torch.float32， device=device） self.action_space_high = torch.tensor （action_space.high， dtype =torch.float32， device=device） action_dim = action_space.shape[0] # 资产数量 super（）.__init__（ observation_space， action_space， lr_schedule， features_extractor_class=CustomFeaturesExtractor， features_extractor_kwargs={}， net_arch=dict（pi=[64， 64]， vf=[64， 64]）， *args、 **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor（ features_dim， net_arch=self.net_arch， activation_fn=nn.ReLU ， device=设备）.to（设备） # 定义动作网络以输出高斯的均值和log_std self.action_net = nn.Linear（64， action_dim * 2）.to（device） # 输出每个资产的平均值和log_std self.value_net = nn.线性（64， 1）.to（设备） # 初始化发行版 self.action_dist = SquashedDiagGaussianDistribution （action_dim） self.num_timesteps = 0 def forward（self， obs， deterministic=False）： # 在每次前向传递上增加时间步长 self.num_timesteps += 1 # 提取特征特征 = self.extract_features（OBS） latent_pi，latent_vf = self.mlp_extractor（特征） # 从 action_net 获取 mean 和 log_std action_params = self.action_net（latent_pi） # [批量， n_assets * 2] mean_actions， log_std = action_params.chunk（2， dim=-1） # 拆分为均值和log_std log_std = torch.clamp（log_std， min=-20， max=2） # 稳定log_std # 使用当前参数创建新的 distribution 实例分布 = SquashedDiagGaussianDistribution（self.action_dist.action_dim） # 示例作或获取确定性作作 = distribution.actions_from_params（mean_actions， log_std， deterministic=deterministic） # 从 [-1， 1] 到 [0， 3] 的映射 squashed_actions = self._squash_to_range（作、self.action_space_low、self.action_space_high） # 计算未压缩的 action 的对数概率 log_prob = distribution.log_prob（actions） # 对未压缩的动作使用log_prob # 价值预测值 = self.value_net（latent_vf）如果 self.num_timesteps % 1000 == 0： print（f“步骤 {self.num_timesteps}，训练： {self.training}，作： {squashed_actions}，平均值： {mean_actions.mean（）}，对数标准： {log_std.mean（）}”）返回 squashed_actions、值 log_prob def _squash_to_range（self， actions， low， high）： “”“将压缩的作从 [-1， 1] 缩放到 [low， high]。”“”” 返回（作数 + 1） * （最高价 - 最低价） / 2 + 最低价 def extract_features（自身，OBS）：返回self.features_extractor （OBS） def predict_values（自身， OBS）：特征 = self.extract_features（OBS） _， latent_vf = self.mlp_extractor（特征）返回 self.value_net（latent_vf） def evaluate_actions（自我、OBS、作）： “”“评估训练作（由 PPO 使用）。”“” 特征 = self.extract_features（OBS） latent_pi，latent_vf = self.mlp_extractor（特征） action_params = self.action_net（latent_pi） mean_actions，log_std = action_params.chunk（2， dim=-1） log_std = torch.clamp（log_std，最小=-20，最大=2） # 创建新的 distribution 实例分布 = SquashedDiagGaussianDistribution（self.action_dist.action_dim） # 将 action 解压缩回 [-1， 1] 以进行log_prob计算 unsquashed_actions = 2 * （作 - self.action_space_low） / （self.action_space_high - self.action_space_low） - 1 # 计算log_prob和熵 log_prob = distribution.log_prob（unsquashed_actions） # 对未压缩的动作使用 log_prob entropy = distribution.entropy（） # 熵不需要动作值 = self.value_net（latent_vf）返回值、log_prob、熵 ```

Let’s analyze why Action=1.0 is interpreted as a "Sell" in your environment but doesn’t change the balance, based on your log output and the provided code.
我们来分析一下为什么 Action=1.0 在您的环境中被解释为 “Sell” 但不改变余额，具体取决于您的日志输出和提供的代码。

Log Output Analysis 日志输出分析

Step 1366: Base Reward=0, Sustained Reward=4.791014388264544, Total=4.791014388264544, Balance=99100
Asset XAUUSD: Action=1.0, Reward=0, Holding=50
Step 1367: Base Reward=0, Sustained Reward=6.3687861051046175, Total=6.3687861051046175, Balance=99100
Asset XAUUSD: Action=1.0, Reward=0, Holding=50
...
Step 1380: Base Reward=0, Sustained Reward=2.110638898428104, Total=2.110638898428104, Balance=99100
Asset XAUUSD: Action=1.0, Reward=0, Holding=50

Observations: 观察：
- Action=1.0: Consistently output across steps 1366–1380.
  作 = 1.0：在步骤 1366–1380 之间输出一致。
- Base Reward=0: No immediate realized reward from trades.
  基础奖励 = 0：没有立即从交易中实现的回报。
- Sustained Reward: Positive and fluctuating, indicating unrealized profits from existing positions.
  持续奖励：正数和波动性，表示现有头寸的未实现利润。
- Balance=99100: Unchanged throughout, despite Action=1.0.
  余额 = 99100：尽管 Action=1.0，但始终保持不变。
- Holding=50: Constant at 50, suggesting no new positions opened or closed.
  持仓 = 50：常数为 50，表示没有新头寸开仓或平仓。

Action Interpretation in Your Environment
环境中的动作解释

In your tgym environment, actions are processed in _take_action():
在您的 tgym 环境中，作在 _take_action（）中处理：

def _take_action(self, actions, done):
    _action = 2  # Default
    _profit_taken = 0
    rewards = [0] * len(self.assets)
    self.tranaction_open_this_step = []
    self.tranaction_close_this_step = []

    for i, action in enumerate(actions):
        # Extract integer action type and fractional part
        _action = math.floor(action)  # 0=Buy, 1=Sell, 2=Nothing
        rewards[i] = self._calculate_reward(i, done, _action)
        print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}")

        if self.cf.symbol(self.assets[i], "limit_order"):
            self._limit_order_process(i, _action, done)

        if (
            _action in (0, 1) 
            and not done 
            and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")
        ):
            _profit_taken = math.ceil(
                (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max")
            ) + self.cf.symbol(self.assets[i], "stop_loss_max")
            self.ticket_id += 1
            if self.cf.symbol(self.assets[i], "limit_order"):
                # Limit order logic
                transaction = {...}
                self.transaction_limit_order.append(transaction)
            else:
                # Market order logic
                transaction = {...}
                self.current_holding[i] += 1
                self.tranaction_open_this_step.append(transaction)
                self.balance -= self.cf.symbol(self.assets[i], "transaction_fee")
                self.transaction_live.append(transaction)

    return sum(rewards)

Action Parsing: 动作解析：
- _action = math.floor(action):
  _action = math.floor（作）：
  - Action=1.0 → _action = math.floor(1.0) = 1.
    作 = 1.0 → _action = math.floor（1.0） = 1。
  - Per your environment’s documentation: 0=Buy, 1=Sell, 2=Nothing.
    根据您的环境文档：0 = 买入，1 = 卖出，2 = 无。
  - Conclusion: Action=1.0 is interpreted as a Sell.
    结论：Action=1.0 被解释为卖出。
- Fractional part (action - _action): 1.0 - 1 = 0, so _profit_taken is just stop_loss_max (no additional profit target scaling).
  小数部分（作 - _action）：1.0 - 1 = 0，因此_profit_taken只是stop_loss_max（没有额外的利润目标缩放）。
Conditions for Opening a Position:
开仓条件：
- _action in (0, 1): 1 (Sell) qualifies.
  _action in （0， 1）：1 （卖出）符合条件。
- not done: Assuming the episode isn’t ending, this is true.
  未完成：假设剧集没有结束，这是真的。
- self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding"): This is the critical condition.
  self.current_holding[i] < self.cf.symbol（self.assets[i]， “max_current_holding”）：这是临界条件。

Why Balance Doesn’t Change
为什么余额不会改变

Holding=50: Your log shows Holding=50 consistently. If max_current_holding (from your config file, e.g., gdbusd-test-1.json) is set to 50, then:
持仓 = 50：您的日志始终显示 Holding=50。如果 max_current_holding（来自您的配置文件，例如 gdbusd-test-1.json）设置为 50，则：
- self.current_holding[i] < 50 is False (50 is not less than 50).
  self.current_holding[i] < 50 为 False（50 不小于 50）。
- The if block to open a new sell position doesn’t execute:
  用于打开新卖出头寸的 if 模块不会执行：
  python 蟒
  self.current_holding[i] += 1 self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction)
- Result: No new sell position opens, no transaction fee is deducted, and balance remains unchanged.
  结果：没有新的卖出头寸，没有扣除交易费用，余额保持不变。
Base Reward=0: In _calculate_reward():
基础奖励 = 0：在 _calculate_reward（）中：
python 蟒
def _calculate_reward(self, i, done, action): _total_reward = 0 for tr in self.transaction_live[:]: if tr["Symbol"] == self.assets[i]: if tr["Type"] == 0: # Buy # Stop-loss or profit-take logic elif tr["Type"] == 1: # Sell # Stop-loss or profit-take logic return _total_reward
- _total_reward only updates if an existing position hits a stop-loss (-SL) or profit-take (PT), or if done=True forces closure.
  _total_reward仅在现有头寸达到止损（-SL）或止盈（PT），或者 done=True 强制平仓时更新。
- Since no new sell position opens (due to Holding=50), and no existing positions close (no SL/PT triggered), rewards[i] = 0.
  由于没有新的卖出头寸打开（由于 Holding=50），也没有现有头寸平仓（未触发 SL/PT），因此 rewards[i] = 0。
Sustained Reward: Positive values come from step():
持续奖励：正值来自 step（）：
python 蟒
unrealized_profit = 0 for tr in self.transaction_live: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0
- Positive sustained_reward suggests existing positions (likely buys, given Holding=50) are in profit (current price _c > ActionPrice).
  正sustained_reward表明现有仓位（可能买入，假设 Holding=50）盈利（当前价格_c > ActionPrice）。
- Action=1.0 (Sell) doesn’t affect this because it doesn’t open a new position.
  Action=1.0 （Sell）不会影响这一点，因为它不会开立新头寸。

Is Action=1.0 Buy or Sell?
Action=1.0 是买入还是卖出？

Answer: Action=1.0 is a Sell, as math.floor(1.0) = 1, and your environment defines 1=Sell.
答案：Action=1.0 是 Sell，因为 math.floor（1.0） = 1，并且您的环境定义了 1=Sell。
Why No Balance Change: The agent can’t open a new sell position because Holding=50 equals max_current_holding (assumed 50 from your config). The environment prevents additional trades, so no transaction fee or position change occurs.
为什么没有余额变化：代理无法开立新的卖出头寸，因为 Holding=50 等于 max_current_holding（根据您的配置假设为 50）。该环境阻止了额外的交易，因此不会发生交易费用或头寸变化。

Debugging and Verification
调试和验证

Check max_current_holding:
检查 max_current_holding：
- In your config file (gdbusd-test-1.json), look for:
  在您的配置文件（gdbusd-test-1.json）中，查找：
  json JSON 格式
  "XAUUSD": { "max_current_holding": 50, ... }
- If it’s 50, that explains why Holding=50 blocks new sells.
  如果是 50，那就解释了为什么 Holding=50 会阻止新的卖出。
Existing Positions: 现有职位：
- Holding=50 and positive sustained_reward suggest 50 open buy positions (Type=0). Check self.transaction_live:
  持仓 = 50 和正 sustained_reward 表示 50 个未平仓买入头寸（Type=0）。检查 self.transaction_live：
  python 蟒
  print(f"Live Transactions: {[tr['Type'] for tr in self.transaction_live]}")
  Add this in _take_action() after the loop to confirm.
  在 _take_action（）循环后添加此代码以确认。
Test with Lower Holding: 使用 Lower Holding 进行测试：
- Temporarily reduce Holding by closing positions (e.g., force a sell in _calculate_reward() for testing) or set max_current_holding higher (e.g., 100) in the config. Then, Action=1.0 should:
  通过平仓（例如，在 _calculate_reward（）中强制卖出进行测试）或在配置中将 max_current_holding 设置为更高的值（例如，100）来暂时减少持仓。那么，Action=1.0 应该：
  - Open a sell position. 开立卖出头寸。
  - Deduct transaction_fee from balance.
    从余额中扣除 transaction_fee。
  - Increment self.current_holding[i].
    递增 self.current_holding[i]。

Why Doesn’t It Sell Existing Positions?
为什么它不出售现有头寸？

Your environment only opens new positions in _take_action(); it doesn’t close existing ones based on the action. Closing happens in _calculate_reward() only via:
您的环境仅在 _take_action（）中打开新位置;它不会根据作关闭现有 KEY。在 _calculate_reward（）中仅通过以下方式关闭：
- Stop-loss (SL). 止损（SL）。
- Profit-take (PT).
  获利回吐（PT）。
- Episode end (done=True).
  剧集结束（done=True）。
Fix Suggestion: If you want Action=1.0 to close an existing buy position:
修复建议：如果您希望 Action=1.0 关闭现有的买入仓位：
python 蟒
def _take_action(self, actions, done): rewards = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] for i, action in enumerate(actions): _action = math.floor(action) rewards[i] = self._calculate_reward(i, done, _action) print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}") if self.cf.symbol(self.assets[i], "limit_order"): self._limit_order_process(i, _action, done) # Close existing buy position on sell action if _action == 1 and not done: for tr in self.transaction_live[:]: if tr["Symbol"] == self.assets[i] and tr["Type"] == 0: # Buy position p = (self._c - tr["ActionPrice"]) * self.cf.symbol(self.assets[i], "point") self._manage_tranaction(tr, p, self._c, status=1) self.current_holding[i] -= 1 rewards[i] += p break # Close one position per step # Open new position if under max holding if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding") ): _profit_taken = math.ceil( (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max") ) + self.cf.symbol(self.assets[i], "stop_loss_max") self.ticket_id += 1 transaction = {...} # Same as before self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction) return sum(rewards)
- Effect: Action=1.0 closes one buy position (if any), realizing profit/loss, and updates balance.
  Effect：Action=1.0 平仓 1 个买入持仓（如有），实现盈利/亏损，并更新余额。

Conclusion 结论

Action=1.0 is Sell: Correctly interpreted as _action=1.
Action=1.0 为 Sell：正确解释为 _action=1。
No Balance Change: Holding=50 hits max_current_holding, preventing new sell positions. Existing buys remain open, contributing to sustained_reward.
无余额变化：持仓=50 次命中max_current_holding，阻止新的卖出头寸。现有的购买仍然开放，有助于sustained_reward。
Next Steps: 后续步骤：
- Confirm max_current_holding=50 in your config.
  确认 max_current_holding=50 在您的配置中。
- If you want sells to close positions, implement the suggested fix.
  如果您希望 sells 平仓，请实施建议的修复。
- Test with print(self.transaction_live) to see open position types.
  使用 print（self.transaction_live）进行测试以查看未平仓头寸类型。

Let me know if you need help verifying the config or adjusting the logic further!
如果您需要帮助验证配置或进一步调整 logic，请告诉我！

the max_current_holding is means the number of open trade or the price of the trade? it should be the number of openning trade, and when the trade close, the current_holding must - 1, is my current_holding calculate correctly in the code? ``` import datetime import math import random import ast import torch import torch.nn as nn import csv import numpy as np import gymnasium as gym from gymnasium import spaces from gymnasium.utils import seeding from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor, MlpExtractor from stable_baselines3.common.policies import ActorCriticPolicy from stable_baselines3.common.torch_layers import MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution from stable_baselines3 import PPO from stable_baselines3.common.policies import ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart import TradingChart from meta.env_fx_trading.util.read_config import EnvConfig class tgym(gym.Env): """forex/future/option trading gym environment 1. Three action space (0 Buy, 1 Sell, 2 Nothing) 2. Multiple trading pairs (EURUSD, GBPUSD...) under same time frame 3. Timeframe from 1 min to daily as long as use candlestick bar (Open, High, Low, Close) 4. Use StopLose, ProfitTaken to realize rewards. each pair can configure it own SL and PT in configure file 5. Configure over night cash penalty and each pair's transaction fee and overnight position holding penalty 6. Split dataset into daily, weekly or monthly..., with fixed time steps, at end of len(df). The business logic will force to Close all positions at last Close price (game over). 7. Must have df column name: [(time_col),(asset_col), Open,Close,High,Low,day] (case sensitive) 8. Addition indicators can add during the data process. 78 available TA indicator from Finta 9. Customized observation list handled in json config file. 10. ProfitTaken = fraction_action * max_profit_taken + SL. 11. SL is pre-fixed 12. Limit order can be configure, if limit_order == True, the action will preset buy or sell at Low or High of the bar, with a limit_order_expiration (n bars). It will be triggered if the price go cross. otherwise, it will be drop off 13. render mode: human -- display each steps realized reward on console file -- create a transaction log graph -- create transaction in graph (under development) 14. 15. Reward, we want to incentivize profit that is sustained over long periods of time. At each step, we will set the reward to the account balance multiplied by some fraction of the number of time steps so far.The purpose of this is to delay rewarding the agent too fast in the early stages and allow it to explore sufficiently before optimizing a single strategy too deeply. It will also reward agents that maintain a higher balance for longer, rather than those who rapidly gain money using unsustainable strategies. 16. Observation_space contains all of the input variables we want our agent to consider before making, or not making a trade. We want our agent to “see” the forex data points (Open price, High, Low, Close, time serial, TA) in the game window, as well a couple other data points like its account balance, current positions, and current profit.The intuition here is that for each time step, we want our agent to consider the price action leading up to the current price, as well as their own portfolio’s status in order to make an informed decision for the next action. 17. reward is forex trading unit Point, it can be configure for each trading pair 18. To make the unrealized profit reward reflect market conditions, we’ll compute ATR for each asset and use it to scale the reward dynamically. """ metadata = {"render.modes": ["graph", "human", "file", "none"]} def __init__( self, df, event_map, currency_map, env_config_file="./neo_finrl/env_fx_trading/config/gdbusd-test-1.json", ): assert df.ndim == 2 super(tgym, self).__init__() self.cf = EnvConfig(env_config_file) self.observation_list = self.cf.env_parameters("observation_list") # Economic data mappings self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy() if 'events' not in self.df.columns: raise ValueError("DataFrame must contain an 'events' column") def parse_events(x): if isinstance(x, str): try: parsed = ast.literal_eval(x) return parsed if isinstance(parsed, list) else [] except (ValueError, SyntaxError): return [] return x if isinstance(x, list) else [] self.df['events'] = self.df['events'].apply(parse_events) if not isinstance(self.df['events'].iloc[0], list): raise ValueError("'events' must be a list") if self.df['events'].iloc[0] and not isinstance(self.df['events'].iloc[0][0], dict): raise ValueError("Elements in 'events' must be dictionaries") self.balance_initial = self.cf.env_parameters("balance") self.over_night_cash_penalty = self.cf.env_parameters("over_night_cash_penalty") self.asset_col = self.cf.env_parameters("asset_col") self.time_col = self.cf.env_parameters("time_col") self.random_start = self.cf.env_parameters("random_start") log_file_datetime = datetime.datetime.now().strftime("%Y%m%d%H%M%S") self.log_filename = ( self.cf.env_parameters("log_filename") + log_file_datetime + ".csv" ) self.analyze_transaction_history_log_filename = ("transaction_history_log" + log_file_datetime + ".csv") self.df["_time"] = self.df[self.time_col] self.df["_day"] = self.df["weekday"] self.assets = self.df[self.asset_col].unique() self.dt_datetime = self.df[self.time_col].sort_values().unique() self.df = self.df.set_index(self.time_col) self.visualization = False # Reset values self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_step = 0 self.episode = 0 # Start from 0, increment on episode end self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True # Cache data self.cached_ohlc_data = [self.get_observation_vector(_dt) for _dt in self.dt_datetime] self.cached_economic_data = [self.get_economic_vector(_dt) for _dt in self.dt_datetime] self.cached_time_serial = ( self.df[["_time", "_day"]].sort_values("_time").drop_duplicates().values.tolist() ) self.reward_range = (-np.inf, np.inf) self.action_space = spaces.Box(low=0, high=3, shape=(len(self.assets),), dtype=np.float32) self.observation_space = spaces.Dict({ "ohlc_data": spaces.Box(low=-np.inf, high=np.inf, shape=(len(self.assets) * len(self.observation_list),), dtype=np.float32), "event_ids": spaces.Box(low=0, high=len(self.event_map)-1, shape=(self.max_events,), dtype=np.int32), "currency_ids": spaces.Box(low=0, high=len(self.currency_map)-1, shape=(self.max_events,), dtype=np.int32), "economic_numeric": spaces.Box(low=-np.inf, high=np.inf, shape=(self.max_events * 6,), dtype=np.float32), "portfolio_data": spaces.Box(low=-np.inf, high=np.inf, shape=(3 + 2 * len(self.assets),), dtype=np.float32) }) print( f"initial done:\n" f"observation_list:{self.observation_list}\n" f"assets:{self.assets}\n" f"time serial: {min(self.dt_datetime)} -> {max(self.dt_datetime)} length: {len(self.dt_datetime)}\n" f"events: {len(self.event_map)}, currencies: {len(self.currency_map)}" ) self._seed() def _seed(self, seed=None): self.np_random, seed = seeding.np_random(seed) return [seed] def _take_action(self, actions, done): # action = math.floor(x), # profit_taken = math.ceil((x- math.floor(x)) * profit_taken_max - stop_loss_max ) # _actions = np.floor(actions).astype(int) # _profit_takens = np.ceil((actions - np.floor(actions)) *self.cf.symbol(self.assets[i],"profit_taken_max")).astype(int) _action = 2 _profit_taken = 0 rewards = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # need use multiply assets for i, action in enumerate(actions): # Actions are now floats between 0 and 3 self._o = self.get_observation(self.current_step, i, "Open") self._h = self.get_observation(self.current_step, i, "High") self._l = self.get_observation(self.current_step, i, "Low") self._c = self.get_observation(self.current_step, i, "Close") self._t = self.get_observation(self.current_step, i, "_time") self._day = self.get_observation(self.current_step, i, "_day") # Extract integer action type and fractional part _action = math.floor(action) # 0=Buy, 1=Sell, 2=Nothing rewards[i] = self._calculate_reward(i, done, _action) # Pass action for exploration reward print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}") if self.cf.symbol(self.assets[i], "limit_order"): self._limit_order_process(i, _action, done) if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")): # Dynamically calculate PT using action fraction _profit_taken = math.ceil( (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max") ) + self.cf.symbol(self.assets[i], "stop_loss_max") self.ticket_id += 1 if self.cf.symbol(self.assets[i], "limit_order"): transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._l if _action == 0 else self._h, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": -1, "CloseStep": -1, } self.transaction_limit_order.append(transaction) else: transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._c, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": self.current_step, "CloseStep": -1, } self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction) return sum(rewards) def _calculate_reward(self, i, done, action): _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[:]: # Copy to avoid modification issues if tr["Symbol"] == self.assets[i]: _point = self.cf.symbol(self.assets[i], "point") # cash discount overnight if self._day > tr["DateDuration"]: tr["DateDuration"] = self._day tr["Reward"] -= self.cf.symbol(self.assets[i], "over_night_penalty") if tr["Type"] == 0: # Buy # stop loss trigger _sl_price = tr["ActionPrice"] - tr["SL"] / _point _pt_price = tr["ActionPrice"] + tr["PT"] / _point if done: p = (self._c - tr["ActionPrice"]) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._l <= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._h >= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: # still open self.current_draw_downs[i] = int((self._l - tr["ActionPrice"]) * _point) _max_draw_down += self.current_draw_downs[i] if self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i]: tr["MaxDD"] = self.current_draw_downs[i] elif tr["Type"] == 1: # Sell # stop loss trigger _sl_price = tr["ActionPrice"] + tr["SL"] / _point _pt_price = tr["ActionPrice"] - tr["PT"] / _point if done: p = (tr["ActionPrice"] - self._c) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p elif self._h >= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._l <= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: self.current_draw_downs[i] = int( (tr["ActionPrice"] - self._h) * _point ) _max_draw_down += self.current_draw_downs[i] if ( self.current_draw_downs[i] < 0 and tr["MaxDD"] > self.current_draw_downs[i] ): tr["MaxDD"] = self.current_draw_downs[i] if _max_draw_down > self.max_draw_downs[i]: self.max_draw_downs[i] = _max_draw_down return _total_reward def _limit_order_process(self, i, _action, done): for tr in self.transaction_limit_order[:]: if tr["Symbol"] == self.assets[i]: if tr["Type"] != _action or done: self.transaction_limit_order.remove(tr) tr["Status"] = 3 tr["CloseStep"] = self.current_step self.transaction_history.append(tr) elif (tr["ActionPrice"] >= self._l and _action == 0) or ( tr["ActionPrice"] <= self._h and _action == 1): tr["ActionStep"] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_limit_order.remove(tr) self.transaction_live.append(tr) self.tranaction_open_this_step.append(tr) elif (tr["LimitStep"] + self.cf.symbol(self.assets[i], "limit_order_expiration") > self.current_step): tr["CloseStep"] = self.current_step tr["Status"] = 4 self.transaction_limit_order.remove(tr) self.transaction_history.append(tr) def _manage_tranaction(self, tr, _p, close_price, status=1): self.transaction_live.remove(tr) tr["ClosePrice"] = close_price tr["Point"] = int(_p) tr["Reward"] = int(tr["Reward"] + _p) # Realized profit/loss tr["Status"] = status # 1=SL/PT, 2=Forced close, 3=Canceled limit, 4=Expired limit tr["CloseTime"] = self._t tr["CloseStep"] = self.current_step self.balance += int(tr["Reward"]) self.total_equity -= int(abs(tr["Reward"])) self.tranaction_close_this_step.append(tr) self.transaction_history.append(tr) def analyze_transaction_history(self): if not self.transaction_history: metrics = {"trades": 0, "win_rate": 0.0, "profit_factor": 0.0, "sharpe_ratio": 0.0, "total_profit": 0.0} else: trades = len(self.transaction_history) rewards = [tr["Reward"] for tr in self.transaction_history] wins = sum(1 for r in rewards if r > 0) losses = sum(1 for r in rewards if r < 0) gross_profit = sum(r for r in rewards if r > 0) gross_loss = abs(sum(r for r in rewards if r < 0)) win_rate = wins / trades if trades > 0 else 0.0 profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf") # Sharpe Ratio (simplified, assumes risk-free rate = 0) returns = np.array(rewards, dtype=np.float32) sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0.0 total_profit = sum(rewards) metrics = { "trades": trades, "win_rate": win_rate, "profit_factor": profit_factor, "sharpe_ratio": sharpe_ratio, "total_profit": total_profit } # Prepare metrics with timestamp and episode metrics["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") metrics["episode"] = self.episode # Check if file exists and is empty to write header import os file_exists = os.path.exists(self.analyze_transaction_history_log_filename) file_empty = file_exists and os.stat(self.analyze_transaction_history_log_filename).st_size == 0 # Append to log file with header if it's new or empty with open(self.analyze_transaction_history_log_filename, 'a', newline='') as f: fieldnames = ["timestamp", "episode", "trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"] writer = csv.DictWriter(f, fieldnames=fieldnames) # Write header only if file doesn't exist or is empty if not file_exists or file_empty: writer.writeheader() writer.writerow(metrics) return metrics def step(self, actions): self.current_step += 1 # Define termination and truncation conditions terminated = self.balance <= 0 # Episode ends due to bankruptcy (terminal state) truncated = self.current_step == len(self.dt_datetime) - 1 # Episode ends due to max steps (time limit) done = terminated or truncated # Combine into a single 'done' flag for VecEnv # For rendering or episode tracking, you might still check if either condition is true if done: self.done_information += f"Episode: {self.episode} Balance: {self.balance} Step: {self.current_step}\n" self.visualization = True self.episode += 1 # Increment episode counter # Calculate base trading reward base_reward = self._take_action(actions, done) # Calculate unrealized profit from open positions unrealized_profit = 0 atr_scaling = 0 # For market condition scaling for i, asset in enumerate(self.assets): atr = self.get_observation(self.current_step, i, "ATR") atr_scaling += atr # Sum ATR across assets for normalization for tr in self.transaction_live: if tr["Symbol"] == asset: if tr["Type"] == 0: # Buy unrealized = (self._c - tr["ActionPrice"]) * self.cf.symbol(asset, "point") else: # Sell unrealized = (tr["ActionPrice"] - self._c) * self.cf.symbol(asset, "point") unrealized_profit += unrealized atr_scaling = atr_scaling / len(self.assets) if atr_scaling > 0 else 1 # Avoid division by 0 # Sustained reward: only applies to unrealized/realized profits, scaled by ATR # adjust 0.01 to 0.05 sustained_reward = (unrealized_profit + base_reward) * 0.01 / atr_scaling if self.transaction_live else 0 # Penalty for inaction if no positions are held if not self.transaction_live and all(math.floor(a) == 2 for a in actions): sustained_reward -= 0.1 # Small penalty to encourage exploration total_reward = base_reward + sustained_reward if self._day > self.current_day: self.current_day = self._day self.balance -= self.over_night_cash_penalty if self.balance != 0: self.max_draw_down_pct = abs(sum(self.max_draw_downs) / self.balance * 100) obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } print(f"Step {self.current_step}: Base Reward={base_reward}, Sustained Reward={sustained_reward}, Total={total_reward}, Balance={self.balance}") # Info dictionary remains unchanged info = {"Close": self.tranaction_close_this_step} return obs, total_reward, terminated, truncated, info def get_observation(self, _step, _iter=0, col=None): if col is None: return self.cached_ohlc_data[_step] if col == "_day": return self.cached_time_serial[_step][1] elif col == "_time": return self.cached_time_serial[_step][0] try: col_pos = self.observation_list.index(col) except ValueError: raise ValueError(f"Column '{col}' not found in observation_list") return self.cached_ohlc_data[_step][_iter * len(self.observation_list) + col_pos] def get_observation_vector(self, _dt, cols=None): cols = self.observation_list if cols is None else cols v = [] for a in self.assets: subset = self.df.query(f'{self.asset_col} == "{a}" & {self.time_col} == "{_dt}"') assert not subset.empty v += subset.loc[_dt, cols].tolist() assert len(v) == len(self.assets) * len(cols) return v def get_economic_vector(self, _dt): subset = self.df.loc[_dt] events = subset['events'] if isinstance(subset, pd.Series) else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) currency_ids = [self.currency_map.get(e['currency'], 0) for e in events[:self.max_events]] + [0] * (self.max_events - len(events)) numeric_fields = ['actual_norm', 'forecast_norm', 'previous_norm', 'surprise_norm', 'event_freq', 'impact_code'] numeric = [e[field] for e in events[:self.max_events] for field in numeric_fields] + [0] * (self.max_events * 6 - len(events) * 6) return { "event_ids": np.array(event_ids, dtype=np.int32), "currency_ids": np.array(currency_ids, dtype=np.int32), "numeric": np.array(numeric, dtype=np.float32) } def reset(self, seed=None, options=None): # Set the seed for reproducibility if seed is not None: self._seed(seed) if self.random_start: self.current_step = random.choice(range(int(len(self.dt_datetime) * 0.5))) else: self.current_step = 0 self.equity_list = [0] * len(self.assets) self.balance = self.balance_initial self.total_equity = self.balance + sum(self.equity_list) self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len(self.assets) self.max_draw_downs = [0.0] * len(self.assets) self.max_draw_down_pct = sum(self.max_draw_downs) / self.balance * 100 self.current_holding = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = "" self.log_header = True self.visualization = False obs = { "ohlc_data": np.array(self.cached_ohlc_data[self.current_step], dtype=np.float32), "event_ids": self.cached_economic_data[self.current_step]["event_ids"], "currency_ids": self.cached_economic_data[self.current_step]["currency_ids"], "economic_numeric": self.cached_economic_data[self.current_step]["numeric"], "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) } info = {} return obs, info def render(self, mode="human", title=None, **kwargs): if mode in ("human", "file"): printout = mode == "human" pm = { "log_header": self.log_header, "log_filename": self.log_filename, "printout": printout, "balance": self.balance, "balance_initial": self.balance_initial, "tranaction_close_this_step": self.tranaction_close_this_step, "done_information": self.done_information, } render_to_file(**pm) if self.log_header: self.log_header = False elif mode == "graph" and self.visualization: print("plotting...") p = TradingChart(self.df, self.transaction_history) p.plot() def close(self): pass def get_sb_env(self): e = DummyVecEnv([lambda: self]) obs = e.reset() return e, obs class CustomFeaturesExtractor(BaseFeaturesExtractor): def __init__(self, observation_space): n_assets = (observation_space.spaces["portfolio_data"].shape[0] - 3) // 2 ohlc_dim = observation_space.spaces["ohlc_data"].shape[0] max_events = observation_space.spaces["event_ids"].shape[0] economic_numeric_dim = observation_space.spaces["economic_numeric"].shape[0] portfolio_dim = observation_space.spaces["portfolio_data"].shape[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # Check if CUDA is available, else use CPU self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomFeaturesExtractor using device: {self.device}") super().__init__(observation_space, features_dim=features_dim) # Move embeddings to the selected device self.event_embedding = nn.Embedding(num_embeddings=129, embedding_dim=max_events).to(self.device) self.currency_embedding = nn.Embedding(num_embeddings=6, embedding_dim=max_events).to(self.device) print(f"CustomFeaturesExtractor: n_assets={n_assets}, features_dim={features_dim}") def forward(self, obs): ohlc_data = obs["ohlc_data"].to(self.device) event_ids = obs["event_ids"].to(self.device, dtype=torch.long) currency_ids = obs["currency_ids"].to(self.device, dtype=torch.long) economic_numeric = obs["economic_numeric"].to(self.device) portfolio_data = obs["portfolio_data"].to(self.device) event_emb = self.event_embedding(event_ids).mean(dim=1) currency_emb = self.currency_embedding(currency_ids).mean(dim=1) features = torch.cat([ohlc_data, event_emb, currency_emb, economic_numeric, portfolio_data], dim=1) return features class CustomMultiInputPolicy(ActorCriticPolicy): def __init__(self, observation_space, action_space, lr_schedule, *args, **kwargs): # Check if CUDA is available, else use CPU device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print(f"CustomMultiInputPolicy using device: {device}") # Extract action space bounds and move them to the selected device self.action_space_low = torch.tensor(action_space.low, dtype=torch.float32, device=device) self.action_space_high = torch.tensor(action_space.high, dtype=torch.float32, device=device) action_dim = action_space.shape[0] # Number of assets super().__init__( observation_space, action_space, lr_schedule, features_extractor_class=CustomFeaturesExtractor, features_extractor_kwargs={}, net_arch=dict(pi=[64, 64], vf=[64, 64]), *args, **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor( features_dim, net_arch=self.net_arch, activation_fn=nn.ReLU, device=device ).to(device) # Define the action network to output mean and log_std for the Gaussian self.action_net = nn.Linear(64, action_dim * 2).to(device) # Outputs mean and log_std for each asset self.value_net = nn.Linear(64, 1).to(device) # Initialize the distribution self.action_dist = SquashedDiagGaussianDistribution(action_dim) self.num_timesteps = 0 def forward(self, obs, deterministic=False): # Increment timestep on each forward pass self.num_timesteps += 1 # Extract features features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) # Get mean and log_std from action_net action_params = self.action_net(latent_pi) # [batch, n_assets * 2] mean_actions, log_std = action_params.chunk(2, dim=-1) # Split into mean and log_std log_std = torch.clamp(log_std, min=-20, max=2) # Stabilize log_std # Create a fresh distribution instance with current parameters distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Sample actions or get deterministic actions actions = distribution.actions_from_params(mean_actions, log_std, deterministic=deterministic) # Map from [-1, 1] to [0, 3] squashed_actions = self._squash_to_range(actions, self.action_space_low, self.action_space_high) # Compute log probabilities of the unsquashed actions log_prob = distribution.log_prob(actions) # Use log_prob on the unsquashed actions # Value prediction values = self.value_net(latent_vf) if self.num_timesteps % 1000 == 0: print(f"Step {self.num_timesteps}, Training: {self.training}, Actions: {squashed_actions}, Mean: {mean_actions.mean()}, Log Std: {log_std.mean()}") return squashed_actions, values, log_prob def _squash_to_range(self, actions, low, high): """Scale squashed actions from [-1, 1] to [low, high].""" return (actions + 1) * (high - low) / 2 + low def extract_features(self, obs): return self.features_extractor(obs) def predict_values(self, obs): features = self.extract_features(obs) _, latent_vf = self.mlp_extractor(features) return self.value_net(latent_vf) def evaluate_actions(self, obs, actions): """Evaluate actions for training (used by PPO).""" features = self.extract_features(obs) latent_pi, latent_vf = self.mlp_extractor(features) action_params = self.action_net(latent_pi) mean_actions, log_std = action_params.chunk(2, dim=-1) log_std = torch.clamp(log_std, min=-20, max=2) # Create a fresh distribution instance distribution = SquashedDiagGaussianDistribution(self.action_dist.action_dim) # Unsquash the actions back to [-1, 1] for log_prob calculation unsquashed_actions = 2 * (actions - self.action_space_low) / (self.action_space_high - self.action_space_low) - 1 # Compute log_prob and entropy log_prob = distribution.log_prob(unsquashed_actions) # Use log_prob on unsquashed actions entropy = distribution.entropy() # Entropy doesn’t need actions values = self.value_net(latent_vf) return values, log_prob, entropy ```
max_current_holding是指未平仓交易的数量还是交易的价格？它应该是开仓交易的数量，当交易结束时，current_holding必须 - 1，我的current_holding在代码中计算正确吗？ ``` 导入 datetime 导入数学导入随机导入 AST Import Torch （导入火炬）将 torch.nn 导入为 nn 导入 CSV 将 numpy 导入为 NP 将 gymnasium 导入为 Gym 从 Gymnasium 导入空间 from gymnasium.utils import seeding from stable_baselines3.common.vec_env import DummyVecEnv from stable_baselines3.common.torch_layers import BaseFeaturesExtractor， MlpExtractor 从 stable_baselines3.common.policies 导入 ActorCriticPolicy 从 stable_baselines3.common.torch_layers 导入 MlpExtractor from stable_baselines3.common.distributions import SquashedDiagGaussianDistribution 从 stable_baselines3 进口 PPO 从 stable_baselines3.common.policies 导入 ActorCriticPolicy from meta.env_fx_trading.util.log_render import render_to_file from meta.env_fx_trading.util.plot_chart 导入 TradingChart from meta.env_fx_trading.util.read_config import EnvConfig class tgym（健身房.Env）： “”“外汇/期货/期权交易 GYM 环境 1. 三个作空间（0 买入、1 卖出、2 无） 2. 同一时间范围内的多个交易对（EURUSD、GBPUSD... 3. 时间范围从 1 分钟到每天，只要使用烛台柱（开盘价、最高价、最低价、收盘价） 4. 使用 StopLose、ProfitTaken 实现奖励。每对都可以在配置文件中配置自己的 SL 和 PT 5. 配置隔夜现金罚金以及每对的交易手续费和隔夜持仓罚金 6. 将数据集拆分为每日、每周或每月...，具有固定的时间步长，在 len（df）结束时。业务 logic 将强制在最后收盘价平仓所有持仓（游戏结束）。 7. 必须具有 df 列名称：[（time_col），（asset_col），Open，Close，High，Low，day]（区分大小写） 8. 加法指标可以在数据处理过程中添加。Finta 提供 78 种 TA 指标 9. 在 json 配置文件中处理的自定义观察列表。 10. ProfitTaken = fraction_action * max_profit_taken + 止损。 11. SL 是预先固定的 12. 限价单可以配置，如果 limit_order == True，则动作将预设在柱线的低点或高点买入或卖出，带 limit_order_expiration （n 条）。如果价格交叉，它将触发。否则，它将是 drop off 13. 渲染模式： human -- 在控制台上显示每个步骤实现的奖励 file -- 创建事务日志 graph -- 在 Graph 中创建交易（开发中） 14. 15. 奖励，我们希望激励长期持续的利润。在每个步骤中，我们将奖励设置为账户余额乘以到目前为止时间步数的一小部分。这样做的目的是拖延在早期阶段过快地奖励代理并允许其探索在过于深入地优化单个策略之前。它还将奖励在更长时间内保持较高余额的代理，而不是那些使用不可持续的策略快速赚钱的人。 16. Observation_space 包含我们需要代理的所有输入变量在进行交易或不进行交易之前考虑。我们希望我们的代理 “看到” 游戏窗口中的外汇数据点（开盘价、最高价、最低价、收盘价、时间序列、TA），以及其他一些数据点，如账户余额、当前头寸、和当前利润。这里的直觉是，对于每个时间步，我们都需要我们的代理考虑导致当前价格的价格行为，以及他们的拥有投资组合的状态，以便为下一步行动做出明智的决策。 17. 奖励为外汇交易单位积分，可为每个交易对配置 18. 为了使未实现利润奖励反映市场状况，我们将计算每种资产的 ATR，并使用它来动态扩展奖励。 """ 元数据 = {“render.modes”： [“graph”， “human”， “file”， “none”]} 防守 __init__（自我 DF / event_map， currency_map， env_config_file=“./neo_finrl/env_fx_trading/config/gdbusd-test-1.json”， ): 断言 df.ndim == 2 超级（tgym， self）.__init__（） self.cf = 环境配置（env_config_file） self.observation_list = self.cf.env_parameters（“observation_list”） # 经济数据映射 self.event_map = event_map self.currency_map = currency_map self.max_events = 8 self.df = df.copy（）如果 'events' 不在 self.df.columns 中： raise ValueError（“DataFrame 必须包含一个 'events' 列”） def parse_events（x）：如果 isinstance（x， str）：尝试：解析 = ast.literal_eval（x）返回已解析 if isinstance（parsed， list） else [] except （ValueError， SyntaxError）：返回 [] 返回 x if isinstance（x， list） else [] self.df['事件'] = self.df['事件'].apply（parse_events）如果不是 isinstance（self.df['events'].iloc[0]， list）： raise ValueError（“'事件'必须是一个列表”）如果 self.df['events'].iloc[0] 而不是 isinstance（self.df['events'].iloc[0][0]， dict）： raise ValueError（“'events' 中的元素必须是字典”） self.balance_initial = self.cf.env_parameters（“余额”） self.over_night_cash_penalty = self.cf.env_parameters（“over_night_cash_penalty”） self.asset_col = self.cf.env_parameters（“asset_col”） self.time_col = self.cf.env_parameters（“time_col”） self.random_start = self.cf.env_parameters（“random_start”） log_file_datetime = datetime.datetime.now（）.strftime（“%Y%m%d%H%M%S”） self.log_filename = （ self.cf.env_parameters（“log_filename”） + log_file_datetime + “.csv” ) self.analyze_transaction_history_log_filename = （“transaction_history_log” + log_file_datetime + “.csv”） self.df[“_time”] = self.df[self.time_col] self.df[“_day”] = self.df[“工作日”] self.assets = self.df[self.asset_col].unique（） self.dt_datetime = self.df[self.time_col].sort_values（）.unique（） self.df = self.df.set_index（self.time_col） self.visualization = False # 重置值 self.equity_list = [0] * len（self.assets） self.balance = self.balance_initial self.total_equity = self.balance + sum（self.equity_list） self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len（self.assets） self.max_draw_downs = [0.0] * len（self.assets） self.max_draw_down_pct = sum（self.max_draw_downs） / self.balance * 100 self.current_step = 0 self.episode = 0 # 从 0 开始，在剧集结束时递增 self.current_holding = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = “” self.log_header = 真 # 缓存数据 self.cached_ohlc_data = [self.get_observation_vector（_dt）表示 self.dt_datetime _dt] self.cached_economic_data = [self.get_economic_vector（_dt）表示 self.dt_datetime 年的 _dt] self.cached_time_serial = （ self.df[[“_time”， “_day”]].sort_values（“_time”）.drop_duplicates（）.values.tolist（） ) self.reward_range = （-np.inf， np.inf） self.action_space = 空格。盒子（low=0， high=3， shape=（len（self.assets），）， dtype=np.float32） self.observation_space = 空格。字典（{ “ohlc_data”：空格。框（low=-np.inf， high=np.inf， shape=（len（self.assets） * len（self.observation_list），）， dtype=np.float32）， “event_ids”：空格。框（low=0， high=len（self.event_map）-1， shape=（self.max_events，）， dtype=np.int32）， “currency_ids”：空格。框（低=0，高=len（self.currency_map）-1， shape=（self.max_events，）， dtype=np.int32）， “economic_numeric”：空格。盒子（low=-np.inf， high=np.inf， shape=（self.max_events * 6，）， dtype=np.float32）， “portfolio_data”：空格。盒子（low=-np.inf， high=np.inf， shape=（3 + 2 * len（self.assets），）， dtype=np.float32） }) 打印（ f“初始完成：\n” f“observation_list：{self.observation_list}\n” f“资产：{self.assets}\n” f“时间序列： {min（self.dt_datetime）} -> {max（self.dt_datetime）} 长度： {len（self.dt_datetime）}\n” f“事件：{len（self.event_map）}，货币：{len（self.currency_map）}” ) self._seed（） def _seed（self， seed=None）： self.np_random，种子 = seeding.np_random（种子）返回 [种子] def _take_action（self， actions， done）： #作 = math.floor（x）， # profit_taken = math.ceil（（x- math.floor（x）） * profit_taken_max - stop_loss_max ） # _actions = np.floor（作）.astype（int） # _profit_takens = np.ceil（（作 - np.floor（作）） *self.cf.symbol（self.assets[i]，“profit_taken_max”））.astype（int） _action = 2 _profit_taken = 0 奖励 = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # 需要使用 multiply 资产对于 i， action in enumerate（actions）： # 动作现在在 0 和 3 之间浮动 self._o = self.get_observation（self.current_step， i， “打开”） self._h = self.get_observation（self.current_step， i， “高”） self._l = self.get_observation（self.current_step， i， “低”） self._c = self.get_observation（self.current_step， i， “关闭”） self._t = self.get_observation（self.current_step， i， “_time”） self._day = self.get_observation（self.current_step， i， “_day”） # 提取整数动作类型和小数部分 _action = math.floor（action） # 0=买入，1=卖出，2=什么都没有 rewards[i] = self._calculate_reward（i， done， _action） # 通过行动以获得探索奖励 print（f“资产 {self.assets[i]}： action={action}， reward={rewards[i]}， Holding={self.current_holding[i]}”）如果 self.cf.symbol（self.assets[i]， “limit_order”）： self._limit_order_process（i， _action，完成）如果（ _action 英寸（0， 1）并且未完成和 self.current_holding[i] < self.cf.symbol（self.assets[i]， “max_current_holding”））： # 使用 action fraction 动态计算 PT _profit_taken = math.ceil（（作 - _action） * self.cf.symbol（self.assets[i]， “profit_taken_max”）） + self.cf.symbol（self.assets[i]， “stop_loss_max”） self.ticket_id += 1 如果 self.cf.symbol（self.assets[i]， “limit_order”）：交易 = { “票证”：self.ticket_id、 “Symbol”：self.assets[i]， “ActionTime”：self._t、 “类型”：_action、 “手数”： 1， “ActionPrice”： self._l if _action == 0 else self._h， “止损”： self.cf.symbol（self.assets[i]， “stop_loss_max”）， “PT”：_profit_taken、 “MaxDD”：0、 “Swap（掉期）”：0.0、 “CloseTime”： “”， ///// “ClosePrice（收盘价）”： 0.0， “点”：0、 “奖励”： -self.cf.symbol（self.assets[i]， “transaction_fee”）， “DateDuration”：self._day、 “状态”：0、 “LimitStep”：self.current_step、 “ActionStep”： -1， “CloseStep”：-1、 } self.transaction_limit_order.append（事务）还：交易 = { “票证”：self.ticket_id、 “Symbol”：self.assets[i]， “ActionTime”：self._t、 “类型”：_action、 “手数”： 1， “ActionPrice”：self._c、 “止损”： self.cf.symbol（self.assets[i]， “stop_loss_max”）， “PT”：_profit_taken、 “MaxDD”：0、 “Swap（掉期）”：0.0、 “CloseTime”： “”， ///// “ClosePrice（收盘价）”： 0.0， “点”：0、 “奖励”： -self.cf.symbol（self.assets[i]， “transaction_fee”）， “DateDuration”：self._day、 “状态”：0、 “LimitStep”：self.current_step、 “ActionStep”：self.current_step、 “CloseStep”：-1、 } self.current_holding[i] += 1 self.tranaction_open_this_step.append（事务） self.balance -= self.cf.symbol（self.assets[i]， “transaction_fee”） self.transaction_live.append（事务） return sum（奖励） def _calculate_reward（self， i， done， action）： _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[：]： # 复制以避免修改问题 if tr[“Symbol”] == self.assets[i]： _point = self.cf.symbol（self.assets[i]， “点”） # 隔夜现金折扣如果 self._day > tr[“DateDuration”]： tr[“DateDuration”] = self._day tr[“奖励”] -= self.cf.symbol（self.assets[i]， “over_night_penalty”） if tr[“Type”] == 0： # 买入 # 止损触发器 _sl_price = tr[“ActionPrice”] - tr[“SL”] / _point _pt_price = tr[“ActionPrice”] + tr[“PT”] / _point 如果完成： p = （self._c - tr[“ActionPrice”]） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p elif self._l <= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._h >= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 else： # 仍然打开 self.current_draw_downs[i] = int（（self._l - tr[“ActionPrice”]） * _point） _max_draw_down += self.current_draw_downs[i] 如果 self.current_draw_downs[i] < 0 和 tr[“MaxDD”] > self.current_draw_downs[i]： tr[“最大DD”] = self.current_draw_downs[i] elif tr[“Type”] == 1： # 卖出 # 止损触发器 _sl_price = tr[“ActionPrice”] + tr[“SL”] / _point _pt_price = tr[“ActionPrice”] - tr[“PT”] / _point 如果完成： p = （tr[“ActionPrice”] - self._c） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p elif self._h >= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._l <= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 还： self.current_draw_downs[i] = int（（tr[“ActionPrice”] - self._h） * _point ) _max_draw_down += self.current_draw_downs[i] 如果（ self.current_draw_downs[i] < 0 和 tr[“MaxDD”] > self.current_draw_downs[i] ): tr[“最大DD”] = self.current_draw_downs[i] 如果 _max_draw_down > self.max_draw_downs[i]： self.max_draw_downs[i] = _max_draw_down 返回 _total_reward def _limit_order_process（self， i， _action， done）：对于 self.transaction_limit_order 中的 tr[：]： if tr[“Symbol”] == self.assets[i]： if tr[“Type”] ！= _action 或 done： self.transaction_limit_order.删除（tr） tr[“状态”] = 3 tr[“CloseStep”] = self.current_step self.transaction_history.append（tr） elif （tr[“ActionPrice”] >= self._l 和 _action == 0）或（ tr[“ActionPrice”] <= self._h 和 _action == 1）： tr[“ActionStep”] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol（self.assets[i]， “transaction_fee”） self.transaction_limit_order.删除（tr） self.transaction_live.append（tr） self.tranaction_open_this_step.append（tr） elif （tr[“LimitStep”] + self.cf.symbol（self.assets[i]， “limit_order_expiration”） > self.current_step）： tr[“CloseStep”] = self.current_step tr[“状态”] = 4 self.transaction_limit_order.删除（tr） self.transaction_history.append（tr） def _manage_tranaction（self， tr， _p， close_price， status=1）： self.transaction_live.remove（tr） tr[“收盘价”] = close_price tr[“点”] = int（_p） tr[“奖励”] = int（tr[“奖励”] + _p） # 已实现盈/亏 tr[“状态”] = 状态 # 1=止损/太平洋时间，2=强制平仓，3=取消限制，4=过期限制 tr[“CloseTime”] = self._t tr[“CloseStep”] = self.current_step self.balance += int（tr[“奖励”]） self.total_equity -= int（abs（tr[“奖励”]）） self.tranaction_close_this_step.append（tr） self.transaction_history.append（tr） def analyze_transaction_history（个体经营）：如果不self.transaction_history：指标 = {“trades”： 0， “win_rate”： 0.0， “profit_factor”： 0.0， “sharpe_ratio”： 0.0， “total_profit”： 0.0} 还：交易 = len（self.transaction_history） rewards = [tr[“奖励”] 对于 tr self.transaction_history] wins = sum（如果 r > 0，则奖励中的 r 为 1） losses = sum（如果 r < 0，则奖励中的 r 为 1） gross_profit = sum（r for r for r in rewards if r > 0） gross_loss = abs（sum（r for r for r in rewards if r < 0）） win_rate = 盈利 / 交易如果交易 > 0 否则 0.0 profit_factor = gross_profit / gross_loss 如果 gross_loss > 0 else float（“inf”） # 夏普比率（简化，假设无风险利率 = 0）返回值 = np.array（rewards， dtype=np.float32） sharpe_ratio = np.mean（返回） / np.std（返回）如果 np.std（返回） > 0 else 0.0 total_profit = sum（rewards）指标 = { “trades”：交易、 “win_rate”：win_rate、 “profit_factor”：profit_factor、 “sharpe_ratio”：sharpe_ratio、 “total_profit”：total_profit } # 准备带有 timestamp 和 episode 的指标 metrics[“timestamp”] = datetime.datetime.now（）.strftime（“%Y-%m-%d %H：%M：%S”） metrics[“episode”] = self.episode # 检查文件是否存在且为空以写入 header 导入作系统 file_exists = os.path.exists（self.analyze_transaction_history_log_filename） file_empty = file_exists 和 os.stat（self.analyze_transaction_history_log_filename）.st_size == 0 # 如果日志文件是新的或空的，则附加到带有 header 的日志文件其中 open（self.analyze_transaction_history_log_filename， 'a'， newline=''）为 f：字段名称 = [“时间戳”， “剧集”， “交易”， “win_rate”， “profit_factor”， “sharpe_ratio”， “total_profit”] writer = csv.DictWriter （f，字段名称=字段名称） # 仅当 file 不存在或为空时才写入 header 如果不是 file_exists 或 file_empty： writer.writeheader（） writer.writerow（指标）返回指标 def step（self， actions）： self.current_step += 1 # 定义终止和截断条件 terminated = self.balance <= 0 # 剧集因破产而结束（最终状态） truncated = self.current_step == len（self.dt_datetime） - 1 # 由于最大步数（时间限制）而结束剧集 done = terminated or truncated # 合并为 VecEnv 的单个 'done' 标志 # 对于渲染或剧集跟踪，您仍然可以检查任一条件是否为 true 如果完成： self.done_information += f“集数： {self.episode} 平衡： {self.balance} 步数： {self.current_step}\n” self.visualization = 真 self.episode += 1 # 增加 episode 计数器 # 计算基础交易奖励 base_reward = self._take_action（作，完成） # 计算持仓的未实现利润 unrealized_profit = 0 atr_scaling = 0 # 用于市场条件缩放对于 i，枚举（self.assets）中的 asset： atr = self.get_observation（self.current_step， i， “ATR”） atr_scaling += atr # 用于标准化的资产的 ATR 总和对于 self.transaction_live 中的 TR： if tr[“Symbol”] == asset： if tr[“Type”] == 0： # 买入未实现 = （self._c - tr[“ActionPrice”]） * self.cf.symbol（asset， “point”） else： # 卖出未实现 = （tr[“ActionPrice”] - self._c） * self.cf.symbol（asset， “点”） unrealized_profit += 未实现 atr_scaling = atr_scaling / len（self.assets） if atr_scaling > 0 else 1 # 避免被 0 除以 # 持续奖励：仅适用于未实现/已实现的利润，由 ATR 缩放 # 调整 0.01 到 0.05 sustained_reward = （unrealized_profit + base_reward） * 0.01 / atr_scaling 如果self.transaction_live否则为 0 # 如果没有持仓，则对不作为的处罚如果不是 self.transaction_live 和 all（math.floor（a） == 2 for a in actions）： sustained_reward -= 0.1 # 鼓励探索的小额惩罚 total_reward = base_reward + sustained_reward 如果 self._day > self.current_day： self.current_day = self._day self.balance -= self.over_night_cash_penalty 如果 self.balance ！= 0： self.max_draw_down_pct = abs（sum（self.max_draw_downs） / self.balance * 100） OBS = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric”： self.cached_economic_data[self.current_step][“numeric”]， “portfolio_data”：np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs，数据类型=np.float32 ) } print（f“步骤 {self.current_step}：基础奖励={base_reward}，持续奖励={sustained_reward}，总计={total_reward}，余额={self.balance}”） # 信息字典保持不变 info = {“关闭”： self.tranaction_close_this_step} 返回 obs、total_reward、terminated、truncated、info def get_observation（self， _step， _iter=0， col=None）：如果 col 为 None：返回 self.cached_ohlc_data[_step] 如果 col == “_day”：返回 self.cached_time_serial[_step][1] elif col == “_time”：返回 self.cached_time_serial[_step][0] 尝试： col_pos = self.observation_list.index（col） except ValueError 之外： raise ValueError（f“在 observation_list 中找不到列 '{col}”）返回 self.cached_ohlc_data[_step][_iter * len（self.observation_list） + col_pos] def get_observation_vector（self， _dt， cols=无）： cols = self.observation_list 如果 cols 不是其他 cols v = [] for a in self.assets 中：子集 = self.df.query（f'{self.asset_col} == “{a}” & {self.time_col} == “{_dt}”'） assert not subset.empty v += subset.loc[_dt， cols].tolist（）断言 len（v） == len（self.assets） * len（cols）返回 v def get_economic_vector（self， _dt）：子集 = self.df.loc[_dt] events = subset['events'] if isinstance（subset， pd.系列） else subset['events'].iloc[0] event_ids = [self.event_map[e['event']] for e in events[:self.max_events]] + [0] * （self.max_events - len（events）） currency_ids = [self.currency_map.get（e['currency']， 0） for e in events[:self.max_events]] + [0] * （self.max_events - len（events）） numeric_fields = ['actual_norm'， 'forecast_norm'， 'previous_norm'， 'surprise_norm'， 'event_freq'， 'impact_code'] 数字 = 事件中 e 的 [e[字段] 对于numeric_fields中的字段 :self.max_events] + [0] * （self.max_events * 6 - len（事件） * 6）返回 { “event_ids”： np.array（event_ids， dtype=np.int32）， “currency_ids”： np.array（currency_ids， dtype=np.int32）， “数字”： np.array（数字， dtype=np.float32） } def reset（self， seed=None， options=None）： # 设置可重复性的种子如果 seed 不是 None： self._seed（种子）如果self.random_start： self.current_step = random.choice（range（int（len（self.dt_datetime） * 0.5）））还： self.current_step = 0 self.equity_list = [0] * len（self.assets） self.balance = self.balance_initial self.total_equity = self.balance + sum（self.equity_list） self.ticket_id = 0 self.transaction_live = [] self.transaction_history = [] self.transaction_limit_order = [] self.current_draw_downs = [0.0] * len（self.assets） self.max_draw_downs = [0.0] * len（self.assets） self.max_draw_down_pct = sum（self.max_draw_downs） / self.balance * 100 self.current_holding = [0] * len（self.assets） self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] self.current_day = 0 self.done_information = “” self.log_header = 真 self.visualization = False OBS = { “ohlc_data”： np.array（self.cached_ohlc_data[self.current_step]， dtype=np.float32）， “event_ids”： self.cached_economic_data[self.current_step][“event_ids”]， “currency_ids”： self.cached_economic_data[self.current_step][“currency_ids”]， “economic_numeric”： self.cached_economic_data[self.current_step][“numeric”]， “portfolio_data”：np.array（ [self.balance， self.total_equity， self.max_draw_down_pct] + self.current_holding + self.current_draw_downs，数据类型=np.float32 ) } 信息 = {} 返回 OBS、INFO def render（self， mode=“human”， title=None， **kwargs）： if mode in （“human”， “file”）：打印输出 = 模式 == “人类” 下午 = { “log_header”： self.log_header， “log_filename”： self.log_filename， “printout”：打印输出， “balance”： self.balance， “balance_initial”：self.balance_initial、 “tranaction_close_this_step”：self.tranaction_close_this_step、 “done_information”：self.done_information、 } render_to_file（**pm）如果 self.log_header： self.log_header = 假 elif 模式 == “graph” 和 self.visualization： print（“绘图...”） p = TradingChart（self.df， self.transaction_history） p.plot（） def close（个体）：通过 def get_sb_env（个体经营）： e = DummyVecEnv（[lambda： self]） obs = e.reset（）返回 E、OBS class CustomFeaturesExtractor（BaseFeaturesExtractor）： def __init__（self， observation_space）： n_assets = （observation_space.spaces[“portfolio_data”].shape[0] - 3） // 2 ohlc_dim = observation_space.spaces[“ohlc_data”].shape[0] max_events = observation_space.spaces[“event_ids”].shape[0] economic_numeric_dim = observation_space.spaces[“economic_numeric”].shape[0] portfolio_dim = observation_space.spaces[“portfolio_data”].形状[0] features_dim = ohlc_dim + 2 * max_events + economic_numeric_dim + portfolio_dim # 检查 CUDA 是否可用，否则使用 CPU self.device = torch.device（“cuda：0” if torch.cuda.is_available（） else “cpu”） print（f“CustomFeaturesExtractor using device： {self.device}”） super（）.__init__（observation_space， features_dim=features_dim） # 将 embedding 移动到所选设备 self.event_embedding = nn.Embedding（num_embeddings=129， embedding_dim=max_events）.to（self.device） self.currency_embedding = nn。Embedding（num_embeddings=6， embedding_dim=max_events）.to（self.device） print（f“自定义功能提取器： n_assets={n_assets}， features_dim={features_dim}”） def forward（self， obs）： ohlc_data = obs[“ohlc_data”].to（self.device） event_ids = obs[“event_ids”].to（self.device， dtype=torch.long） currency_ids = obs[“currency_ids”].to（self.device， dtype=torch.long） economic_numeric = obs[“economic_numeric”].to（self.device） portfolio_data = obs[“portfolio_data”].to（self.device） event_emb = self.event_embedding（event_ids）.均值（dim=1） currency_emb = self.currency_embedding（currency_ids）.mean（dim=1）特征 = torch.cat（[ohlc_data， event_emb， currency_emb， economic_numeric， portfolio_data]， dim=1）返回功能类 CustomMultiInputPolicy（ActorCriticPolicy）： def __init__（self， observation_space， action_space， lr_schedule， *args， **kwargs）： # 检查 CUDA 是否可用，否则使用 CPU 设备 = torch.device（“cuda：0” if torch.cuda.is_available（） else “cpu”） print（f“CustomMultiInputPolicy using device： {device}”） # 提取动作空间边界并将其移动到所选设备 self.action_space_low = torch.tensor （action_space.low， dtype =torch.float32， device=device） self.action_space_high = torch.tensor （action_space.high， dtype =torch.float32， device=device） action_dim = action_space.shape[0] # 资产数量 super（）.__init__（ observation_space， action_space， lr_schedule， features_extractor_class=CustomFeaturesExtractor， features_extractor_kwargs={}， net_arch=dict（pi=[64， 64]， vf=[64， 64]）， *args、 **kwargs ) features_dim = self.features_extractor.features_dim self.mlp_extractor = MlpExtractor（ features_dim， net_arch=self.net_arch， activation_fn=nn.ReLU ， device=设备）.to（设备） # 定义动作网络以输出高斯的均值和log_std self.action_net = nn.Linear（64， action_dim * 2）.to（device） # 输出每个资产的平均值和log_std self.value_net = nn.线性（64， 1）.to（设备） # 初始化发行版 self.action_dist = SquashedDiagGaussianDistribution （action_dim） self.num_timesteps = 0 def forward（self， obs， deterministic=False）： # 在每次前向传递上增加时间步长 self.num_timesteps += 1 # 提取特征特征 = self.extract_features（OBS） latent_pi，latent_vf = self.mlp_extractor（特征） # 从 action_net 获取 mean 和 log_std action_params = self.action_net（latent_pi） # [批量， n_assets * 2] mean_actions， log_std = action_params.chunk（2， dim=-1） # 拆分为均值和log_std log_std = torch.clamp（log_std， min=-20， max=2） # 稳定log_std # 使用当前参数创建新的 distribution 实例分布 = SquashedDiagGaussianDistribution（self.action_dist.action_dim） # 示例作或获取确定性作作 = distribution.actions_from_params（mean_actions， log_std， deterministic=deterministic） # 从 [-1， 1] 到 [0， 3] 的映射 squashed_actions = self._squash_to_range（作、self.action_space_low、self.action_space_high） # 计算未压缩的 action 的对数概率 log_prob = distribution.log_prob（actions） # 对未压缩的动作使用log_prob # 价值预测值 = self.value_net（latent_vf）如果 self.num_timesteps % 1000 == 0： print（f“步骤 {self.num_timesteps}，训练： {self.training}，作： {squashed_actions}，平均值： {mean_actions.mean（）}，对数标准： {log_std.mean（）}”）返回 squashed_actions、值 log_prob def _squash_to_range（self， actions， low， high）： “”“将压缩的作从 [-1， 1] 缩放到 [low， high]。”“”” 返回（作数 + 1） * （最高价 - 最低价） / 2 + 最低价 def extract_features（自身，OBS）：返回self.features_extractor （OBS） def predict_values（自身， OBS）：特征 = self.extract_features（OBS） _， latent_vf = self.mlp_extractor（特征）返回 self.value_net（latent_vf） def evaluate_actions（自我、OBS、作）： “”“评估训练作（由 PPO 使用）。”“” 特征 = self.extract_features（OBS） latent_pi，latent_vf = self.mlp_extractor（特征） action_params = self.action_net（latent_pi） mean_actions，log_std = action_params.chunk（2， dim=-1） log_std = torch.clamp（log_std，最小=-20，最大=2） # 创建新的 distribution 实例分布 = SquashedDiagGaussianDistribution（self.action_dist.action_dim） # 将 action 解压缩回 [-1， 1] 以进行log_prob计算 unsquashed_actions = 2 * （作 - self.action_space_low） / （self.action_space_high - self.action_space_low） - 1 # 计算log_prob和熵 log_prob = distribution.log_prob（unsquashed_actions） # 对未压缩的动作使用 log_prob entropy = distribution.entropy（） # 熵不需要动作值 = self.value_net（latent_vf）返回值、log_prob、熵 ```

Let’s clarify what max_current_holding represents in your tgym environment and verify whether current_holding is calculated correctly, particularly when trades open and close.
让我们澄清一下 max_current_holding 在您的 tgym 环境中代表什么，并验证current_holding是否计算正确，尤其是在交易开仓和平仓时。

What Does max_current_holding Mean?
max_current_holding 是什么意思？

Intended Meaning: Based on your question and the context of your trading environment, max_current_holding should represent the maximum number of open trades (positions) allowed for a given asset at any time. This is a common concept in trading environments to limit exposure or risk per asset.
预期含义：根据您的问题和交易环境的背景，max_current_holding应代表给定资产在任何时候允许的最大未平仓交易（头寸）数量。这是交易环境中的常见概念，用于限制每项资产的敞口或风险。
Not Price: It’s not the price of a trade (e.g., dollar value or asset price), as that wouldn’t align with how it’s used in the code to control the opening of new positions.
非价格：它不是交易的价格（例如，美元价值或资产价格），因为这与代码中用于控制新头寸开仓的方式不一致。
Evidence in Code: 代码中的证据：
- In _take_action():
  在 _take_action（）中：
  python 蟒
  if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding") ): self.current_holding[i] += 1
  - self.current_holding[i] is an integer incremented when a new trade opens, and the condition checks if it’s below max_current_holding. This confirms max_current_holding is a count of open trades, not a price.
    self.current_holding[i] 是新交易开始时递增的整数，条件检查它是否低于 max_current_holding。这确认了 max_current_holding 是未平仓交易的计数，而不是价格。
Config File Reference: 配置文件参考：
- self.cf.symbol(self.assets[i], "max_current_holding") retrieves this value from your config file (e.g., gdbusd-test-1.json).
  self.cf.symbol（self.assets[i]， “max_current_holding”）从你的配置文件（例如 gdbusd-test-1.json）中检索此值。
- Example config snippet: 示例配置片段：
  json JSON 格式
  "XAUUSD": { "max_current_holding": 50, "transaction_fee": 10, "stop_loss_max": 100, "profit_taken_max": 200, ... }
- Here, max_current_holding=50 means up to 50 open trades are allowed for XAUUSD.
  在这里，max_current_holding=50 表示 XAUUSD 最多允许 50 笔未平仓交易。
Conclusion: max_current_holding indeed means the number of open trades, not the price. Your understanding is correct.
结论：max_current_holding 确实是指未平仓交易的数量，而不是价格。你的理解是正确的。

Is current_holding Calculated Correctly?
current_holding计算正确吗？

Let’s examine how current_holding is updated when trades open and close to ensure it decreases by 1 when a trade closes, as you expect.
让我们看看current_holding在交易开盘和收盘时是如何更新的，以确保它在交易结束时减少 1，正如您所期望的那样。

Initialization 初始化

In __init__ and reset():
在 __init__ 和 reset（）中：
python 蟒
self.current_holding = [0] * len(self.assets)
- Starts at 0 for each asset, representing no open trades initially.
  每个资产从 0 开始，表示最初没有未平仓交易。

Opening Trades 开仓交易

In _take_action() (market orders):
在 _take_action（）（市价单）中：
python 蟒
if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding") ): # ... (create transaction) self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.transaction_live.append(transaction)
- When a buy (_action=0) or sell (_action=1) occurs and the limit isn’t reached, current_holding[i] increases by 1. This is correct for opening trades.
  当买入（_action=0）或卖出（_action=1）发生且未达到限制时，current_holding[i] 增加 1。这对于开仓交易是正确的。
In _limit_order_process() (limit orders triggering):
在 _limit_order_process（）（触发限价单）：
python 蟒
elif (tr["ActionPrice"] >= self._l and _action == 0) or ( tr["ActionPrice"] <= self._h and _action == 1): tr["ActionStep"] = self.current_step self.current_holding[i] += 1 self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_limit_order.remove(tr) self.transaction_live.append(tr)
- When a limit order triggers, current_holding[i] also increases by 1. This is correct.
  当限价单触发时，current_holding[i] 也会增加 1。这是正确的。

Closing Trades 平仓交易

Trades close in _calculate_reward() via stop-loss (SL), profit-take (PT), or episode end (done=True):
通过止损（SL）、止盈（PT）或剧集结束（done=True）在 _calculate_reward（）中平仓：

def _calculate_reward(self, i, done, action):
    _total_reward = 0
    for tr in self.transaction_live[:]:
        if tr["Symbol"] == self.assets[i]:
            _point = self.cf.symbol(self.assets[i], "point")
            if tr["Type"] == 0:  # Buy
                _sl_price = tr["ActionPrice"] - tr["SL"] / _point
                _pt_price = tr["ActionPrice"] + tr["PT"] / _point
                if done:
                    p = (self._c - tr["ActionPrice"]) * _point
                    self._manage_tranaction(tr, p, self._c, status=2)
                    _total_reward += p
                elif self._l <= _sl_price:
                    self._manage_tranaction(tr, -tr["SL"], _sl_price)
                    _total_reward += -tr["SL"]
                    self.current_holding[i] -= 1
                elif self._h >= _pt_price:
                    self._manage_tranaction(tr, tr["PT"], _pt_price)
                    _total_reward += tr["PT"]
                    self.current_holding[i] -= 1
            elif tr["Type"] == 1:  # Sell
                _sl_price = tr["ActionPrice"] + tr["SL"] / _point
                _pt_price = tr["ActionPrice"] - tr["PT"] / _point
                if done:
                    p = (tr["ActionPrice"] - self._c) * _point
                    self._manage_tranaction(tr, p, self._c, status=2)
                    _total_reward += p
                elif self._h >= _sl_price:
                    self._manage_tranaction(tr, -tr["SL"], _sl_price)
                    _total_reward += -tr["SL"]
                    self.current_holding[i] -= 1
                elif self._l <= _pt_price:
                    self._manage_tranaction(tr, tr["PT"], _pt_price)
                    _total_reward += tr["PT"]
                    self.current_holding[i] -= 1
    return _total_reward

Closing Logic: 结束逻辑：
- Stop-Loss (SL): 止损（SL）：
  - Buy: self._l <= _sl_price → self.current_holding[i] -= 1.
    买入： self._l <= _sl_price → self.current_holding[i] -= 1.
  - Sell: self._h >= _sl_price → self.current_holding[i] -= 1.
    卖出： self._h >= _sl_price → self.current_holding[i] -= 1.
- Profit-Take (PT): 获利回吐（PT）：
  - Buy: self._h >= _pt_price → self.current_holding[i] -= 1.
    买入： self._h >= _pt_price → self.current_holding[i] -= 1.
  - Sell: self._l <= _pt_price → self.current_holding[i] -= 1.
    卖出： self._l <= _pt_price → self.current_holding[i] -= 1.
- Episode End (done):
  剧集结束（已完成）：
  - No explicit self.current_holding[i] -= 1 in the done block.
    done 块中没有显式的 self.current_holding[i] -= 1。
_manage_tranaction():
_manage_tranaction（）：
python 蟒
def _manage_tranaction(self, tr, _p, close_price, status=1): self.transaction_live.remove(tr) tr["ClosePrice"] = close_price tr["Point"] = int(_p) tr["Reward"] = int(tr["Reward"] + _p) tr["Status"] = status tr["CloseTime"] = self._t tr["CloseStep"] = self.current_step self.balance += int(tr["Reward"]) self.total_equity -= int(abs(tr["Reward"])) self.tranaction_close_this_step.append(tr) self.transaction_history.append(tr)
- Removes the transaction from transaction_live but doesn’t adjust current_holding directly. This is handled in _calculate_reward().
  从 transaction_live 中删除事务，但不直接调整current_holding。这在 _calculate_reward（）中处理。

Verification 验证

Opening Trades: current_holding[i] += 1 is correctly applied when a trade opens (market or triggered limit order).
开仓交易：current_holding[i] += 1 在交易开仓时正确应用（市价或触发限价单）。
Closing Trades: 平仓交易：
- SL/PT: current_holding[i] -= 1 is correctly applied when a trade closes due to hitting SL or PT.
  SL/PT：current_holding[i] -= 1 在交易因触及 SL 或 PT 而平仓时正确应用。
- Episode End (done): Missing current_holding[i] -= 1 in the done block, but since all positions close, it should decrement for each transaction.
  剧集结束（已完成）：在完成的区块中缺少 current_holding[i] -= 1，但由于所有仓位都关闭，因此每笔交易都应该递减。
Bug Check: 错误检查：
- When done=True, all live transactions close, but current_holding[i] isn’t decremented in the done block:
  当 done=True 时，所有实时交易都会关闭，但 current_holding[i] 在 done 区块中不会递减：
  python 蟒
  if done: p = (self._c - tr["ActionPrice"]) * _point # Buy self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p
- Issue: current_holding[i] doesn’t decrease here, leading to an incorrect count after forced closure. It should reflect the number of open trades (i.e., 0 after all close).
  问题：current_holding[i] 在此处没有减少，导致强制关闭后计数不正确。它应反映未平仓交易的数量（即，所有平仓后为 0）。

Is Your current_holding Correct?
你的current_holding正确吗？

Mostly Yes: 大部分是：
- It increments correctly when trades open (+1 in _take_action() and _limit_order_process()).
  当交易开始时，它会正确递增（_take_action（）和 _limit_order_process（）中的 +1）。
- It decrements correctly when trades close via SL or PT (-1 in _calculate_reward()).
  当交易通过 SL 或 PT 平仓时，它会正确递减（_calculate_reward（））中的 -1）。
Bug When done=True:
Bug When done=True：
- When the episode ends, all positions close, but current_holding[i] isn’t updated in the done block. This means current_holding will overstate the number of open trades until the next reset().
  当剧集结束时，所有位置都会关闭，但 current_holding[i] 不会在 done 块中更新。这意味着current_holding将夸大未平仓交易的数量，直到下一次 reset（）。

Example Log Correlation 示例对数关联

From your previous log: 从你之前的日志中：

Step 10315: Base Reward=-961.361, Sustained Reward=0, Total=-961.361, Balance=96157
Asset XAUUSD: Action=0.0, Reward=-961.361, Holding=50

SL Triggered: Base Reward=-961.361 indicates a stop-loss hit, reducing current_holding from 50 to 49 (correctly decremented in _calculate_reward()).
SL Triggered：Base Reward=-961.361 表示止损命中，current_holding从 50 减少到 49（在 _calculate_reward（）中正确递减）。
Next Step (Hypothetical):
下一步（假设）：
- If done=True occurs, current_holding should drop to 0 as all positions close, but the code misses this adjustment.
  如果发生 done=True，则当所有位置关闭时，current_holding 应降至 0，但代码错过了此调整。

Fix for done=True
修复 done=True

Update _calculate_reward() to decrement current_holding when closing trades at episode end:
更新 _calculate_reward（）以在剧集结束时关闭交易时递减current_holding：

def _calculate_reward(self, i, done, action):
    _total_reward = 0
    _max_draw_down = 0

    for tr in self.transaction_live[:]:
        if tr["Symbol"] == self.assets[i]:
            _point = self.cf.symbol(self.assets[i], "point")
            if tr["Type"] == 0:  # Buy
                _sl_price = tr["ActionPrice"] - tr["SL"] / _point
                _pt_price = tr["ActionPrice"] + tr["PT"] / _point
                if done:
                    p = (self._c - tr["ActionPrice"]) * _point
                    self._manage_tranaction(tr, p, self._c, status=2)
                    _total_reward += p
                    self.current_holding[i] -= 1  # Fix: Decrement here
                elif self._l <= _sl_price:
                    self._manage_tranaction(tr, -tr["SL"], _sl_price)
                    _total_reward += -tr["SL"]
                    self.current_holding[i] -= 1
                elif self._h >= _pt_price:
                    self._manage_tranaction(tr, tr["PT"], _pt_price)
                    _total_reward += tr["PT"]
                    self.current_holding[i] -= 1
                else:
                    # ... (drawdown logic)
            elif tr["Type"] == 1:  # Sell
                _sl_price = tr["ActionPrice"] + tr["SL"] / _point
                _pt_price = tr["ActionPrice"] - tr["PT"] / _point
                if done:
                    p = (tr["ActionPrice"] - self._c) * _point
                    self._manage_tranaction(tr, p, self._c, status=2)
                    _total_reward += p
                    self.current_holding[i] -= 1  # Fix: Decrement here
                elif self._h >= _sl_price:
                    self._manage_tranaction(tr, -tr["SL"], _sl_price)
                    _total_reward += -tr["SL"]
                    self.current_holding[i] -= 1
                elif self._l <= _pt_price:
                    self._manage_tranaction(tr, tr["PT"], _pt_price)
                    _total_reward += tr["PT"]
                    self.current_holding[i] -= 1
                else:
                    # ... (drawdown logic)
            if _max_draw_down > self.max_draw_downs[i]:
                self.max_draw_downs[i] = _max_draw_down
    return _total_reward

Conclusion 结论

max_current_holding: Correctly represents the number of open trades, not price, as intended.
max_current_holding：正确表示预期的未平仓交易数量，而不是价格。
current_holding Calculation:
current_holding计算：
- Correct for Opening: Increments properly when trades open.
  正确打开：当交易开始时正确递增。
- Correct for SL/PT Closures: Decrements properly when trades hit SL or PT.
  校正 SL/PT 瓶盖：当交易达到 SL 或 PT 时正确递减。
- Incorrect for Episode End: Fails to decrement when done=True, overestimating open trades until reset.
  剧集结尾不正确：当 done=True 时无法递减，高估未平仓交易直到重置。
Fix Applied: Add self.current_holding[i] -= 1 in the done blocks of _calculate_reward().
已应用修复：在 _calculate_reward（）的 done 块中添加 self.current_holding[i] -= 1。

Next Steps 后续步骤

Apply the fix and test with a scenario where done=True (e.g., end of episode).
在 done=True 的场景中应用修复和测试（例如，剧集结束）。
Add a debug print in _calculate_reward() to confirm:
在 _calculate_reward（）中添加调试打印以确认：
python 蟒
if done: print(f"Closing trade for {self.assets[i]}, Holding before: {self.current_holding[i]}") self._manage_tranaction(tr, p, self._c, status=2) self.current_holding[i] -= 1 print(f"Holding after: {self.current_holding[i]}")
Verify current_holding drops to 0 when all positions close.
验证 current_holding 在所有持仓平仓时下降到 0。

Your intuition about current_holding decreasing by 1 when a trade closes is correct—only the done case was missing this adjustment! Let me know if you need further validation.
您对交易结束时current_holding减少 1 的直觉是正确的——只有 done 情况缺少此调整！如果您需要进一步的验证，请告诉我。

1. edit this code, make it when training, deterministic=False, when evaluation, deterministic=True. and add ent_coef for the PPO, should i fine tunning ent_coef? if yes, add it. ``` # Evaluation function def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False, deterministic=True, quantstats=False): total_rewards = [] total_profits = [] # Track actual trading profit metrics = [] if (quantstats): returns = [] for _ in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 # Sum of realized profits/losses step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) if (quantstats): episode_returns = [] # For QuantStats while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=deterministic) obs, rewards, done, info = env_vec.step(action) # VecEnv returns 4 values episode_rewards += rewards print(f"Action: {action}") # Debug action output # Extract profit from closed transactions for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] # Add realized reward (profit/loss) if (quantstats): episode_returns.append(tr["Reward"]) # Track per-trade returns step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) # Logs to file here if (quantstats): returns.extend(episode_returns if episode_returns else [episode_profit]) # Fallback to total profit mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) # Define numeric keys to average numeric_keys = ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"] avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in numeric_keys} print(f"Deterministic={deterministic}, Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Average Metrics: {avg_metrics}") if (quantstats): # QuantStats reporting (for test evaluation) returns_series = pd.Series(returns, index=pd.date_range(start="2025-03-12", periods=len(returns), freq="D")) qs.reports.html(returns_series, output="quantstats_report.html", title="FX Trading Performance") return mean_reward if return_mean_reward else mean_profit # Return profit instead of reward for optimization # Custom early stopping callback class EarlyStoppingCallback(EvalCallback): def __init__(self, eval_env, eval_freq, n_eval_episodes, patience, min_delta, verbose=0, use_profit=False, best_model_save_path=None): super().__init__( eval_env=eval_env, eval_freq=eval_freq, n_eval_episodes=n_eval_episodes, verbose=verbose, deterministic=True ) self.patience = patience self.min_delta = min_delta self.best_metric = -float('inf') self.no_improvement_count = 0 self.use_profit = use_profit # Toggle between reward and profit self.best_model_save_path = best_model_save_path def _on_step(self): continue_training = super()._on_step() if not continue_training: return False if self.last_mean_reward is not None: # Use profit or reward based on use_profit flag current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit) if current_metric > self.best_metric + self.min_delta: self.best_metric = current_metric self.no_improvement_count = 0 if self.verbose > 0: print(f"New best {'profit' if self.use_profit else 'reward'}: {self.best_metric:.2f}") if self.best_model_save_path: self.model.save(self.best_model_save_path) else: self.no_improvement_count += 1 if self.verbose > 0: print(f"No improvement for {self.no_improvement_count}/{self.patience} evaluations") if self.no_improvement_count >= self.patience: if self.verbose > 0: print(f"Early stopping triggered after {self.patience} evaluations without improvement") return False return True # Objective function for Optuna def objective(trial): # Define hyperparameter search space learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True) n_steps = trial.suggest_int("n_steps", 1024, 8192, step=1024) total_timesteps = trial.suggest_int("total_timesteps", 500000, 2000000, step=100000) batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512]) gamma = trial.suggest_float("gamma", 0.9, 0.9999) gae_lambda = trial.suggest_float("gae_lambda", 0.8, 0.99) clip_range = trial.suggest_float("clip_range", 0.1, 0.3) # Train PPO model on training set model = PPO( CustomMultiInputPolicy, train_env_vec, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, n_epochs=100, # Large fixed value gamma=gamma, gae_lambda=gae_lambda, clip_range=clip_range, verbose=1, # 0: No output during training, # 1: Prints basic training progress, # 2: More detailed output (Additional details like optimization steps, loss values (e.g., policy loss, value loss), and learning rate updates.) ) eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # More frequent n_eval_episodes=10, # More episodes for stability patience=5, # Stop if no improvement after 5 evaluations min_delta=0.01, # Minimum improvement to consider verbose=1, use_profit=True, # Track profit for trading focus best_model_save_path=f"./best_model/trial_{trial.number}/" # Save best model ) # print(model.policy) # Should show mlp_extractor with in_features=95 model.learn(total_timesteps=total_timesteps, callback=eval_callback) val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function print(f"Validation Average Profit: {val_avg_profit:.2f}") return val_avg_profit # Maximize reward # Specify the SQLite database file db_path = 'optuna_study.db' # Run optimization study = optuna.create_study( study_name='OHLC_EconomicCalender_ppo_study', storage=f'sqlite:///{db_path}', direction="maximize", load_if_exists=True ) study.optimize(objective, n_trials=1) # Adjust number of trials based on resources # Best parameters print("Best hyperparameters:", study.best_params) print("Best validation reward:", study.best_value) best_params = study.best_params # Train final model with best parameters on training set best_model = PPO( CustomMultiInputPolicy, train_env_vec, # Use full training environment learning_rate=best_params["learning_rate"], n_steps=best_params["n_steps"], batch_size=best_params["batch_size"], n_epochs=100, # Large fixed value gamma=best_params["gamma"], gae_lambda=best_params["gae_lambda"], clip_range=best_params["clip_range"], verbose=1, tensorboard_log="./tensorboard_logs/" ) # Early stopping callback on validation set best_trial = study.best_trial.number eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # more frequent evaluations (e.g., every 5,000 steps) n_eval_episodes=10, patience=5, min_delta=0.01, verbose=1, use_profit=True, # use mean_profits best_model_save_path=f"./best_model/trial_{best_trial}/" ) # Train the final model best_model.learn(total_timesteps=best_params["total_timesteps"], callback=eval_callback) # Save the final model best_model.save(f"ppo_xauusd_optimized_trial_{best_trial}") # Evaluate on test data with QuantStats print("\nEvaluating Final Model on Test Data:") test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=20, quantstats=True) print(f"Test Average Profit: {test_avg_profit:.2f}") # Optional: Load and re-evaluate to verify # loaded_model = PPO.load("ppo_xauusd_optimized") # test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=20, quantstats=True) # print(f"Test Average Profit (Loaded Model): {test_avg_profit_loaded:.2f}") # Clean up train_env_vec.close() val_env_vec.close() test_env_vec.close() ``` 2. when to add ent_coef? when deterministic is true or is false? or both can add ent_coef?
1. 编辑这段代码，训练时 deterministic=False，评估时 deterministic=True。并为 PPO 添加 ent_coef，我应该微调 ent_coef 吗？如果是，请添加它。 ``` # 评估函数 def evaluate（model， env_vec， n_episodes=10， return_mean_reward=False， deterministic=True， quantstats=False）： total_rewards = [] total_profits = [] # 跟踪实际交易利润指标 = [] if （quantstats）：返回 = [] 对于 range（n_episodes）中的 _： obs = env_vec.reset（）完成 = np.array（[假] * env_vec.num_envs） episode_rewards = np.zeros（env_vec.num_envs） episode_profit = 0 # 已实现盈亏之和 step_count = 0 max_steps = len（env_vec.envs[0].dt_datetime） if （quantstats）： episode_returns = [] # 对于 QuantStats while not np.all（done）且 step_count < max_steps：作， _ = model.predict（obs， deterministic=deterministic） obs， rewards， done， info = env_vec.step（action） # VecEnv 返回 4 个值 episode_rewards += 奖励 print（f“Action： {action}”） # 调试动作输出 # 从已平仓交易中提取利润对于 range （env_vec.num_envs）中的env_idx： if info[env_idx][“关闭”]：对于 info[env_idx][“Close”] 中的 tr： episode_profit += tr[“Reward”] # 添加已实现的奖励（盈利/亏损） if （quantstats）： episode_returns.append（tr[“Reward”]） # 追踪每笔交易的回报 step_count += 1 total_rewards.extend（episode_rewards） total_profits.append（episode_profit） metrics.append（env_vec.envs[0].analyze_transaction_history（）） # 在此处提交的日志 if （quantstats）： returns.extend（episode_returns if episode_returns else [episode_profit]） # 回退到总利润 mean_reward = np.mean（total_rewards） mean_profit = np.mean（total_profits） # 定义数字键以求平均值 numeric_keys = [“交易”， “win_rate”， “profit_factor”， “sharpe_ratio”， “total_profit”] avg_metrics = {k： np.mean（[m[k] for m in metrics]） for k in numeric_keys} print（f“Deterministic={deterministic}，平均奖励： {mean_reward：.2f}，平均利润： {mean_profit：.2f}”） print（f“平均度量： {avg_metrics}”） if （quantstats）： # QuantStats 报告（用于测试评估） returns_series = pd。Series（返回值， index=pd.date_range（start=“2025-03-12”， periods=len（returns）， freq=“D”）） qs.reports.html（returns_series， output=“quantstats_report.html”， title=“外汇交易表现”） return mean_reward if return_mean_reward else mean_profit # 返回利润而不是优化奖励 # 自定义提前停止回调类 EarlyStoppingCallback（EvalCallback）： def __init__（self， eval_env， eval_freq， n_eval_episodes， patience， min_delta， verbose=0， use_profit=False， best_model_save_path=None）： super（）.__init__（ eval_env=eval_env、 eval_freq=eval_freq、 n_eval_episodes=n_eval_episodes， verbose=verbose， deterministic=真 ) self.patience = 耐心 self.min_delta = min_delta self.best_metric = -float（'inf'） self.no_improvement_count = 0 self.use_profit = use_profit # 在奖励和利润之间切换 self.best_model_save_path = best_model_save_path def _on_step（个体经营）： continue_training = super（）._on_step（）如果不continue_training： return False 如果 self.last_mean_reward 不是 None： # 使用基于use_profit标志的利润或奖励 current_metric = evaluate（self.model， self.eval_env， self.n_eval_episodes， return_mean_reward=not self.use_profit）如果 current_metric > self.best_metric + self.min_delta： self.best_metric = current_metric self.no_improvement_count = 0 如果 self.verbose > 0： print（f“新最佳 {'profit' if self.use_profit else 'reward'}： {self.best_metric：.2f}”）如果self.best_model_save_path： self.model.save（self.best_model_save_path）还： self.no_improvement_count += 1 如果 self.verbose > 0： print（f“{self.no_improvement_count}/{self.patience} evaluations没有改善”） if self.no_improvement_count >= self.patience：如果 self.verbose > 0： print（f“在 {self.patience} 评估后触发提前停止，但无改善”） return False 返回 True # Optuna 的目标函数 def 目标（试用）： # 定义超参数搜索空间 learning_rate = trial.suggest_float（“learning_rate”， 1e-5， 1e-3， log=True） n_steps = trial.suggest_int（“n_steps”， 1024， 8192， step=1024） total_timesteps = trial.suggest_int（“total_timesteps”， 500000， 2000000， step=100000） batch_size = trial.suggest_categorical（“batch_size”， [64， 128， 256， 512]）伽玛 = trial.suggest_float（“伽玛”， 0.9， 0.9999） gae_lambda = trial.suggest_float（“gae_lambda”， 0.8， 0.99） clip_range = trial.suggest_float（“clip_range”， 0.1， 0.3） # 在训练集上训练 PPO 模型模型 = PPO（ CustomMultiInputPolicy 和 train_env_vec， learning_rate=learning_rate、 n_steps=n_steps、 batch_size=batch_size， n_epochs=100， # 大的固定值 gamma=伽玛， gae_lambda=gae_lambda、 clip_range=clip_range、 verbose=1， # 0：训练时无输出， # 1：打印基础训练进度， # 2：更详细的输出（其他详细信息，如优化步骤、损失值（例如，策略损失、值损失）和学习率更新。 ) eval_callback = EarlyStoppingCallback（ eval_env=val_env_vec、 eval_freq=5000， # 更频繁 n_eval_episodes=10， # 更多剧集以保持稳定 patience=5， # 如果 5 次评估后没有改善，则停止 min_delta=0.01， # 要考虑的最小改进 verbose=1， use_profit=True， # 跟踪利润以专注于交易 best_model_save_path=f“./best_model/trial_{trial.number}/” # 保存最佳模型 ) # print（model.policy） # 应该显示 in_features=95 的 mlp_extractor model.learn（total_timesteps=total_timesteps， callback=eval_callback） val_avg_profit = evaluate（model， val_env_vec， n_episodes=10） # 评估函数 print（f“验证平均利润： {val_avg_profit：.2f}”） return val_avg_profit # 最大化奖励 # 指定 SQLite 数据库文件 db_path = 'optuna_study.db' # 运行优化研究 = optuna.create_study（ study_name='OHLC_EconomicCalender_ppo_study'，存储=f'sqlite：///{db_path}'， direction=“maximize”， load_if_exists=真 ) study.optimize（objective， n_trials=1） # 根据资源调整 Trial 次数 # 最佳参数 print（“最佳超参数：”， study.best_params） print（“最佳验证奖励：”， study.best_value） best_params = study.best_params # 使用训练集上的最佳参数训练最终模型 best_model = PPO（ CustomMultiInputPolicy 和 train_env_vec， # 使用完整的训练环境 learning_rate=best_params[“learning_rate”]， n_steps=best_params[“n_steps”]， batch_size=best_params[“batch_size”]， n_epochs=100， # 大的固定值伽玛=best_params[“伽玛”]， gae_lambda=best_params[“gae_lambda”]， clip_range=best_params[“clip_range”]， verbose=1， tensorboard_log=“./tensorboard_logs/” ) # 在验证集上提前停止回调 best_trial = study.best_trial.number eval_callback = EarlyStoppingCallback（ eval_env=val_env_vec、 eval_freq=5000， # 更频繁的评估（例如，每 5,000 个步骤） n_eval_episodes=10，耐心 = 5， min_delta=0.01， verbose=1， use_profit=True， # 使用 mean_profits best_model_save_path=f“./best_model/trial_{best_trial}/” ) # 训练最终模型 best_model.learn（total_timesteps=best_params[“total_timesteps”]， callback=eval_callback） # 保存最终模型 best_model.save（f“ppo_xauusd_optimized_trial_{best_trial}”） # 使用 QuantStats 评估测试数据 print（“\n根据测试数据评估最终模型：”） test_avg_profit = evaluate（best_model， test_env_vec， n_episodes=20， quantstats=True） print（f“测试平均利润： {test_avg_profit：.2f}”） # 可选：加载并重新评估以验证 # loaded_model = PPO.load（“ppo_xauusd_optimized”） # test_avg_profit = evaluate（best_model， test_env_vec， n_episodes=20， quantstats=True） # print（f“测试平均利润（加载的模型）：{test_avg_profit_loaded：.2f}”） # 清理 train_env_vec.close（） val_env_vec.close（） test_env_vec.close（） ``` 2. 何时添加ent_coef？确定性何时为 true 或 false？或者两者都可以添加ent_coef？

Let’s address your questions step-by-step:
让我们逐步解决您的问题：

Modify the code to use deterministic=False during training and deterministic=True during evaluation, and add ent_coef to PPO with fine-tuning via Optuna.
修改代码以在训练期间使用 deterministic=False，在评估期间使用 deterministic=True，并通过 Optuna 将ent_coef添加到 PPO 中。
Explain when to add ent_coef, whether it applies to deterministic=True, False, or both.
说明何时添加ent_coef，是应用于 deterministic=True、False 还是两者。

1. Modified Code 1. 修改后的代码

Here’s the updated code with:
这是更新后的代码：

deterministic=False during training (via model.learn() implicitly, as PPO uses stochastic sampling by default).
deterministic=False（通过 model.learn（）隐式，因为 PPO 默认使用随机采样）。
deterministic=True during evaluation (explicitly set in evaluate() calls).
deterministic=True（在 evaluate（）调用中显式设置）。
Added ent_coef to PPO with Optuna fine-tuning.
通过 Optuna 微调为 PPO 添加了ent_coef。

import numpy as np
import pandas as pd
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
import quantstats as qs

# Assuming CustomMultiInputPolicy, train_env_vec, val_env_vec, test_env_vec are defined elsewhere

# Evaluation function
def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False, deterministic=True, quantstats=False):
    total_rewards = []
    total_profits = []  # Track actual trading profit
    metrics = []
    if quantstats:
        returns = []

    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0  # Sum of realized profits/losses
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        if quantstats:
            episode_returns = []  # For QuantStats
        
        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=deterministic)  # Deterministic explicitly set
            obs, rewards, done, info = env_vec.step(action)  # VecEnv returns 4 values
            episode_rewards += rewards
            print(f"Action: {action}")  # Debug action output
            
            # Extract profit from closed transactions
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]  # Add realized reward (profit/loss)
                        if quantstats:
                            episode_returns.append(tr["Reward"])  # Track per-trade returns
            step_count += 1
        
        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
       (metrics.append(env_vec.envs[0].analyze_transaction_history()))  # Logs to file here
        if quantstats:
            returns.extend(episode_returns if episode_returns else [episode_profit])  # Fallback to total profit
    
    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)

    # Define numeric keys to average
    numeric_keys = ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in numeric_keys}
    print(f"Deterministic={deterministic}, Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")
    
    if quantstats:
        # QuantStats reporting (for test evaluation)
        returns_series = pd.Series(returns, index=pd.date_range(start="2025-03-12", periods=len(returns), freq="D"))
        qs.reports.html(returns_series, output="quantstats_report.html", title="FX Trading Performance")
        
    return mean_reward if return_mean_reward else mean_profit  # Return profit instead of reward for optimization

# Custom early stopping callback
class EarlyStoppingCallback(EvalCallback):
    def __init__(self, eval_env, eval_freq, n_eval_episodes, patience, min_delta, verbose=0, use_profit=False, best_model_save_path=None):
        super().__init__(
            eval_env=eval_env,
            eval_freq=eval_freq,
            n_eval_episodes=n_eval_episodes,
            verbose=verbose,
            deterministic=True  # Always deterministic during evaluation
        )
        self.patience = patience
        self.min_delta = min_delta
        self.best_metric = -float('inf')
        self.no_improvement_count = 0
        self.use_profit = use_profit  # Toggle between reward and profit
        self.best_model_save_path = best_model_save_path

    def _on_step(self):
        continue_training = super()._on_step()
        if not continue_training:
            return False

        if self.last_mean_reward is not None:
            # Use profit or reward based on use_profit flag
            current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, 
                                    return_mean_reward=not self.use_profit, deterministic=True)
            if current_metric > self.best_metric + self.min_delta:
                self.best_metric = current_metric
                self.no_improvement_count = 0
                if self.verbose > 0:
                    print(f"New best {'profit' if self.use_profit else 'reward'}: {self.best_metric:.2f}")
                if self.best_model_save_path:
                    self.model.save(self.best_model_save_path)
            else:
                self.no_improvement_count += 1
                if self.verbose > 0:
                    print(f"No improvement for {self.no_improvement_count}/{self.patience} evaluations")
                if self.no_improvement_count >= self.patience:
                    if self.verbose > 0:
                        print(f"Early stopping triggered after {self.patience} evaluations without improvement")
                    return False
        return True

# Objective function for Optuna
def objective(trial):
    # Define hyperparameter search space
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_int("n_steps", 1024, 8192, step=1024)
    total_timesteps = trial.suggest_int("total_timesteps", 500000, 2000000, step=100000)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    gae_lambda = trial.suggest_float("gae_lambda", 0.8, 0.99)
    clip_range = trial.suggest_float("clip_range", 0.1, 0.3)
    ent_coef = trial.suggest_float("ent_coef", 1e-3, 0.1, log=True)  # Added for fine-tuning

    # Train PPO model on training set
    model = PPO(
        CustomMultiInputPolicy,
        train_env_vec,
        learning_rate=learning_rate,
        n_steps=n_steps,
        batch_size=batch_size,
        n_epochs=100,  # Large fixed value
        gamma=gamma,
        gae_lambda=gae_lambda,
        clip_range=clip_range,
        ent_coef=ent_coef,  # Entropy coefficient for exploration
        verbose=1
    )
    eval_callback = EarlyStoppingCallback(
        eval_env=val_env_vec,
        eval_freq=5000,
        n_eval_episodes=10,
        patience=5,
        min_delta=0.01,
        verbose=1,
        use_profit=True,
        best_model_save_path=f"./best_model/trial_{trial.number}/"
    )

    # Training uses deterministic=False implicitly in PPO's collect_rollouts
    model.learn(total_timesteps=total_timesteps, callback=eval_callback)

    # Evaluation uses deterministic=True
    val_avg_profit = evaluate(model, val_env_vec, n_episodes=10, deterministic=True)
    print(f"Validation Average Profit: {val_avg_profit:.2f}")

    return val_avg_profit  # Maximize profit

# Specify the SQLite database file
db_path = 'optuna_study.db'

# Run optimization
study = optuna.create_study(
    study_name='OHLC_EconomicCalender_ppo_study',
    storage=f'sqlite:///{db_path}',
    direction="maximize",
    load_if_exists=True
)
study.optimize(objective, n_trials=1)  # Adjust number of trials based on resources

# Best parameters
print("Best hyperparameters:", study.best_params)
print("Best validation reward:", study.best_value)

best_params = study.best_params

# Train final model with best parameters on training set
best_model = PPO(
    CustomMultiInputPolicy,
    train_env_vec,
    learning_rate=best_params["learning_rate"],
    n_steps=best_params["n_steps"],
    batch_size=best_params["batch_size"],
    n_epochs=100,
    gamma=best_params["gamma"],
    gae_lambda=best_params["gae_lambda"],
    clip_range=best_params["clip_range"],
    ent_coef=best_params["ent_coef"],  # Use tuned entropy coefficient
    verbose=1,
    tensorboard_log="./tensorboard_logs/"
)

# Early stopping callback on validation set
best_trial = study.best_trial.number
eval_callback = EarlyStoppingCallback(
    eval_env=val_env_vec,
    eval_freq=5000,
    n_eval_episodes=10,
    patience=5,
    min_delta=0.01,
    verbose=1,
    use_profit=True,
    best_model_save_path=f"./best_model/trial_{best_trial}/"
)

# Train the final model (deterministic=False implicitly during training)
best_model.learn(total_timesteps=best_params["total_timesteps"], callback=eval_callback)

# Save the final model
best_model.save(f"ppo_xauusd_optimized_trial_{best_trial}")

# Evaluate on test data with QuantStats (deterministic=True)
print("\nEvaluating Final Model on Test Data:")
test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=20, deterministic=True, quantstats=True)
print(f"Test Average Profit: {test_avg_profit:.2f}")

# Clean up
train_env_vec.close()
val_env_vec.close()
test_env_vec.close()

Key Changes 主要变化

Deterministic Settings: 确定性设置：
- Training: model.learn() uses deterministic=False by default in PPO’s collect_rollouts() for exploration (sampling from the policy distribution). No explicit override is needed, as this is Stable-Baselines3’s default behavior during training.
  Training：model.learn（）在 PPO 的 collect_rollouts（）中默认使用 deterministic=False 进行探索（从策略分布中采样）。不需要显式覆盖，因为这是 Stable-Baselines3 在训练期间的默认行为。
- Evaluation: 评估：
  - evaluate() now explicitly sets deterministic=True in all calls (default parameter already set to True, and explicitly passed in objective and final test evaluation).
    现在，evaluate（）在所有调用中显式设置 deterministic=True（默认参数已设置为 True，并在目标和最终测试评估中显式传递）。
  - EarlyStoppingCallback inherits EvalCallback with deterministic=True, and evaluate() calls within it also use deterministic=True.
    EarlyStoppingCallback 继承了 deterministic=True 的 EvalCallback，其中的 evaluate（）调用也使用 deterministic=True。
Added ent_coef: 新增ent_coef：
- Added ent_coef to the PPO constructor in both objective() and the final model training.
  在 objective（）和最终模型训练中向 PPO 构造函数添加了ent_coef。
- Fine-tuned via Optuna with trial.suggest_float("ent_coef", 1e-3, 0.1, log=True), a reasonable range for entropy regularization (log scale ensures exploration across orders of magnitude).
  通过 Optuna 进行微调，使用 trial.suggest_float（“ent_coef”， 1e-3， 0.1， log=True），熵正则化的合理范围（对数尺度确保跨数量级的探索）。
Should You Fine-Tune ent_coef?
你应该微调 ent_coef 吗？
- Yes: Fine-tuning ent_coef is beneficial because it controls the trade-off between exploration and exploitation in PPO:
  是的：微调ent_coef是有益的，因为它控制了 PPO 中探索和开发之间的权衡：
  - Low ent_coef (e.g., 0.001): Less exploration, faster convergence to a deterministic policy, but risks getting stuck in local optima.
    低ent_coef（例如 0.001）：更少的探索，更快地收敛到确定性策略，但有可能卡在局部最优值中。
  - High ent_coef (e.g., 0.1): More exploration, slower convergence, but better chance of finding globally optimal strategies.
    高ent_coef（例如 0.1）：更多的探索，更慢的收敛，但找到全局最优策略的机会更大。
- Your Case: Since your forex trading environment has complex dynamics (e.g., sustained rewards, multiple assets), fine-tuning ent_coef helps ensure the agent explores diverse actions (buy, sell, hold) rather than fixating early (e.g., on Action=0.0 as seen previously).
  您的案例：由于您的外汇交易环境具有复杂的动态（例如，持续奖励、多种资产），因此微调ent_coef有助于确保代理探索不同的作（买入、卖出、持有），而不是尽早固定（例如，如前所述在 Action=0.0 上）。
- Range: 1e-3 to 0.1 is a standard starting point, adjusted logarithmically to cover small and moderate exploration incentives.
  范围：1e-3 到 0.1 是一个标准起点，对数调整以涵盖小型和中等勘探激励措施。

2. When to Add ent_coef?
2. 何时添加ent_coef？

What is ent_coef?
什么是 ent_coef？

ent_coef (entropy coefficient) is a hyperparameter in PPO that adds an entropy bonus to the loss function:
ent_coef（熵系数）是 PPO 中的一个超参数，它向损失函数添加熵加成：
- Encourages exploration by penalizing overly confident (low-entropy) policies.
  通过惩罚过度自信（低熵）的策略来鼓励探索。
- Loss = Policy Loss - ent_coef * Entropy + Value Loss.
  损失 = 保单损失 - ent_coef * 熵 + 价值损失。
It affects how actions are sampled from the policy distribution during training, increasing the variance (log_std) of the Gaussian distribution in your CustomMultiInputPolicy.
它会影响训练期间从策略分布中采样作的方式，从而增加 CustomMultiInputPolicy 中高斯分布的方差（log_std）。

When Does It Apply? 何时适用？

ent_coef Applies Only During Training:
ent_coef仅在训练期间适用：
- It influences the policy update step in model.learn(), where actions are sampled (deterministic=False) to collect rollouts.
  它会影响 model.learn（）中的策略更新步骤，在该步骤中，对作进行采样（deterministic=False）以收集推出。
- Higher ent_coef increases randomness in action selection, keeping the policy stochastic during training.
  较高的ent_coef会增加作选择的随机性，从而在训练期间保持策略的随机性。
deterministic=True (Evaluation):
deterministic=True （计算）：
- During evaluation, model.predict(..., deterministic=True) uses the mean action (mean_actions) directly, ignoring the distribution’s variance and entropy.
  在评估过程中，model.predict（...， deterministic=True）直接使用均值作（mean_actions），忽略分布的方差和熵。
- Result: ent_coef has no effect when deterministic=True, as no sampling occurs—only the mean is used.
  当 deterministic=True 时，Result：ent_coef 不起作用，因为没有进行采样 - 仅使用平均值。
deterministic=False (Training or Stochastic Evaluation):
deterministic=False（训练或随机评估）：
- When sampling actions (e.g., during model.learn() or if you manually set deterministic=False in evaluate()), ent_coef shapes the policy’s randomness.
  当对作进行采样时（例如，在 model.learn（）期间或在 evaluate（）中手动设置 deterministic=False），ent_coef 会影响策略的随机性。
- Result: ent_coef actively encourages exploration here.
  结果：ent_coef 积极鼓励在这里探索。

Should You Add ent_coef for Both?
您应该为两者添加 ent_coef 吗？

Both Can Use ent_coef in the Model Definition:
两者都可以在模型定义中使用 ent_coef：
- You define ent_coef when initializing the PPO model, and it’s part of the training process regardless of how you use deterministic later.
  您可以在初始化 PPO 模型时定义 ent_coef，无论您以后如何使用确定性模型，它都是训练过程的一部分。
- Example: 例：
  python 蟒
  model = PPO(CustomMultiInputPolicy, env, ent_coef=0.01, ...)
- It’s always present in the model but only impacts behavior when actions are sampled (i.e., deterministic=False).
  它始终存在于模型中，但仅在对动作进行采样时影响行为（即 deterministic=False）。
Practical Usage: 实际用途：
- Training (deterministic=False): Add ent_coef and fine-tune it to balance exploration and exploitation.
  训练（deterministic=False）：添加 ent_coef 并对其进行微调以平衡探索和开发。
- Evaluation (deterministic=True): ent_coef is irrelevant here, as you’re exploiting the learned policy’s mean actions. You don’t need to adjust it separately for evaluation—it’s already baked into the trained policy.
  Evaluation （deterministic=True）：ent_coef 在这里无关紧要，因为您正在利用学习策略的 mean作。您无需单独调整它进行评估，它已经融入到经过训练的策略中。

When to Add ent_coef?
何时添加ent_coef？

Add It Always in PPO Initialization:
始终在 PPO 初始化中添加它：
- Since it only affects training, include it in every PPO instantiation (as done in the modified code).
  由于它只影响训练，因此请将其包含在每个 PPO 实例中（就像在修改后的代码中所做的那样）。
- Fine-tune it via Optuna in objective() to optimize exploration during training, then use the tuned value for the final model.
  通过 objective（）中的 Optuna 对其进行微调，以优化训练期间的探索，然后将调整后的值用于最终模型。
Don’t Tie It to deterministic:
不要将其与确定性捆绑在一起：
- ent_coef isn’t toggled by deterministic. It’s a training hyperparameter, not an evaluation switch.
  ent_coef 不是由 deterministic 切换的。它是一个训练超参数，而不是评估开关。
- Set it once in PPO and let it shape the policy during learn(), while deterministic=True ensures consistent evaluation.
  在 PPO 中设置一次，并让它在 learn（）期间塑造策略，而 deterministic=True 确保评估一致。

Summary 总结

Code Changes: 代码更改：
- Training uses deterministic=False implicitly in model.learn().
  训练在 model.learn（）中隐式使用 deterministic=False。
- Evaluation uses deterministic=True explicitly in evaluate() and EarlyStoppingCallback.
  评估在 evaluate（）和 EarlyStoppingCallback 中显式使用 deterministic=True。
- Added ent_coef with Optuna fine-tuning (1e-3 to 0.1, log scale).
  添加了 Optuna 微调的 ent_coef（1e-3 到 0.1，对数刻度）。
ent_coef Usage: ent_coef用法：
- Add it to PPO for training to control exploration.
  将其添加到 PPO 进行训练以控制探索。
- It applies when deterministic=False (training), not when deterministic=True (evaluation).
  它适用于 deterministic=False（训练）时，而不是 deterministic=True（评估）时适用。
- Fine-tuning is recommended to optimize your trading agent’s performance.
  建议进行微调以优化您的交易代理的性能。

Run the updated code, and monitor the validation profits. If exploration seems insufficient (e.g., actions still stuck), adjust the ent_coef range in Optuna (e.g., up to 0.5) and share the results for further tuning advice!
运行更新的代码，并监控验证利润。如果探索似乎不足（例如，作仍然卡住），请在 Optuna 中调整 ent_coef 范围（例如，最高 0.5）并分享结果以获得进一步的优化建议！

1. is it the need to evaluate the best model on test_env_vec for n_episodes=20? or only test 1 times is enough (n_episodes = 1)? ``` # Evaluation function def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False, deterministic=True, quantstats=False): total_rewards = [] total_profits = [] # Track actual trading profit metrics = [] if (quantstats): returns = [] for _ in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 # Sum of realized profits/losses step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) if (quantstats): episode_returns = [] # For QuantStats while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=deterministic) obs, rewards, done, info = env_vec.step(action) # VecEnv returns 4 values episode_rewards += rewards print(f"Action: {action}") # Debug action output # Extract profit from closed transactions for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] # Add realized reward (profit/loss) if (quantstats): episode_returns.append(tr["Reward"]) # Track per-trade returns step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) # Logs to file here if (quantstats): returns.extend(episode_returns if episode_returns else [episode_profit]) # Fallback to total profit mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) # Define numeric keys to average numeric_keys = ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"] avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in numeric_keys} print(f"Deterministic={deterministic}, Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Average Metrics: {avg_metrics}") if (quantstats): # QuantStats reporting (for test evaluation) returns_series = pd.Series(returns, index=pd.date_range(start="2025-03-12", periods=len(returns), freq="D")) qs.reports.html(returns_series, output="quantstats_report.html", title="FX Trading Performance") return mean_reward if return_mean_reward else mean_profit # Return profit instead of reward for optimization # Custom early stopping callback class EarlyStoppingCallback(EvalCallback): def __init__(self, eval_env, eval_freq, n_eval_episodes, patience, min_delta, verbose=0, use_profit=False, best_model_save_path=None): super().__init__( eval_env=eval_env, eval_freq=eval_freq, n_eval_episodes=n_eval_episodes, verbose=verbose, deterministic=True ) self.patience = patience self.min_delta = min_delta self.best_metric = -float('inf') self.no_improvement_count = 0 self.use_profit = use_profit # Toggle between reward and profit self.best_model_save_path = best_model_save_path def _on_step(self): continue_training = super()._on_step() if not continue_training: return False if self.last_mean_reward is not None: # Use profit or reward based on use_profit flag current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit) if current_metric > self.best_metric + self.min_delta: self.best_metric = current_metric self.no_improvement_count = 0 if self.verbose > 0: print(f"New best {'profit' if self.use_profit else 'reward'}: {self.best_metric:.2f}") if self.best_model_save_path: self.model.save(self.best_model_save_path) else: self.no_improvement_count += 1 if self.verbose > 0: print(f"No improvement for {self.no_improvement_count}/{self.patience} evaluations") if self.no_improvement_count >= self.patience: if self.verbose > 0: print(f"Early stopping triggered after {self.patience} evaluations without improvement") return False return True # Objective function for Optuna def objective(trial): # Define hyperparameter search space learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True) n_steps = trial.suggest_int("n_steps", 1024, 8192, step=1024) total_timesteps = trial.suggest_int("total_timesteps", 500000, 2000000, step=100000) batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512]) gamma = trial.suggest_float("gamma", 0.9, 0.9999) gae_lambda = trial.suggest_float("gae_lambda", 0.8, 0.99) clip_range = trial.suggest_float("clip_range", 0.1, 0.3) # Train PPO model on training set model = PPO( CustomMultiInputPolicy, train_env_vec, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, n_epochs=100, # Large fixed value gamma=gamma, gae_lambda=gae_lambda, clip_range=clip_range, verbose=1, # 0: No output during training, # 1: Prints basic training progress, # 2: More detailed output (Additional details like optimization steps, loss values (e.g., policy loss, value loss), and learning rate updates.) ) eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # More frequent n_eval_episodes=10, # More episodes for stability patience=5, # Stop if no improvement after 5 evaluations min_delta=0.01, # Minimum improvement to consider verbose=1, use_profit=True, # Track profit for trading focus best_model_save_path=f"./best_model/trial_{trial.number}/" # Save best model ) # print(model.policy) # Should show mlp_extractor with in_features=95 model.learn(total_timesteps=total_timesteps, callback=eval_callback) val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) # evaluation function print(f"Validation Average Profit: {val_avg_profit:.2f}") return val_avg_profit # Maximize reward # Specify the SQLite database file db_path = 'optuna_study.db' # Run optimization study = optuna.create_study( study_name='OHLC_EconomicCalender_ppo_study', storage=f'sqlite:///{db_path}', direction="maximize", load_if_exists=True ) study.optimize(objective, n_trials=1) # Adjust number of trials based on resources # Best parameters print("Best hyperparameters:", study.best_params) print("Best validation reward:", study.best_value) best_params = study.best_params # Train final model with best parameters on training set best_model = PPO( CustomMultiInputPolicy, train_env_vec, # Use full training environment learning_rate=best_params["learning_rate"], n_steps=best_params["n_steps"], batch_size=best_params["batch_size"], n_epochs=100, # Large fixed value gamma=best_params["gamma"], gae_lambda=best_params["gae_lambda"], clip_range=best_params["clip_range"], verbose=1, tensorboard_log="./tensorboard_logs/" ) # Early stopping callback on validation set best_trial = study.best_trial.number eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # more frequent evaluations (e.g., every 5,000 steps) n_eval_episodes=10, patience=5, min_delta=0.01, verbose=1, use_profit=True, # use mean_profits best_model_save_path=f"./best_model/trial_{best_trial}/" ) # Train the final model best_model.learn(total_timesteps=best_params["total_timesteps"], callback=eval_callback) # Save the final model best_model.save(f"ppo_xauusd_optimized_trial_{best_trial}") # Evaluate on test data with QuantStats print("\nEvaluating Final Model on Test Data:") test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=20, quantstats=True) print(f"Test Average Profit: {test_avg_profit:.2f}") # Optional: Load and re-evaluate to verify # loaded_model = PPO.load("ppo_xauusd_optimized") # test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=20, quantstats=True) # print(f"Test Average Profit (Loaded Model): {test_avg_profit_loaded:.2f}") # Clean up train_env_vec.close() val_env_vec.close() test_env_vec.close() ```
1. 是否需要在 test_env_vec 上评估 n_episodes=20 的最佳模型？还是只测试 1 次就足够了（n_episodes = 1）？ ``` # 评估函数 def evaluate（model， env_vec， n_episodes=10， return_mean_reward=False， deterministic=True， quantstats=False）： total_rewards = [] total_profits = [] # 跟踪实际交易利润指标 = [] if （quantstats）：返回 = [] 对于 range（n_episodes）中的 _： obs = env_vec.reset（）完成 = np.array（[假] * env_vec.num_envs） episode_rewards = np.zeros（env_vec.num_envs） episode_profit = 0 # 已实现盈亏之和 step_count = 0 max_steps = len（env_vec.envs[0].dt_datetime） if （quantstats）： episode_returns = [] # 对于 QuantStats while not np.all（done）且 step_count < max_steps：作， _ = model.predict（obs， deterministic=deterministic） obs， rewards， done， info = env_vec.step（action） # VecEnv 返回 4 个值 episode_rewards += 奖励 print（f“Action： {action}”） # 调试动作输出 # 从已平仓交易中提取利润对于 range （env_vec.num_envs）中的env_idx： if info[env_idx][“关闭”]：对于 info[env_idx][“Close”] 中的 tr： episode_profit += tr[“Reward”] # 添加已实现的奖励（盈利/亏损） if （quantstats）： episode_returns.append（tr[“Reward”]） # 追踪每笔交易的回报 step_count += 1 total_rewards.extend（episode_rewards） total_profits.append（episode_profit） metrics.append（env_vec.envs[0].analyze_transaction_history（）） # 在此处提交的日志 if （quantstats）： returns.extend（episode_returns if episode_returns else [episode_profit]） # 回退到总利润 mean_reward = np.mean（total_rewards） mean_profit = np.mean（total_profits） # 定义数字键以求平均值 numeric_keys = [“交易”， “win_rate”， “profit_factor”， “sharpe_ratio”， “total_profit”] avg_metrics = {k： np.mean（[m[k] for m in metrics]） for k in numeric_keys} print（f“Deterministic={deterministic}，平均奖励： {mean_reward：.2f}，平均利润： {mean_profit：.2f}”） print（f“平均度量： {avg_metrics}”） if （quantstats）： # QuantStats 报告（用于测试评估） returns_series = pd。Series（返回值， index=pd.date_range（start=“2025-03-12”， periods=len（returns）， freq=“D”）） qs.reports.html（returns_series， output=“quantstats_report.html”， title=“外汇交易表现”） return mean_reward if return_mean_reward else mean_profit # 返回利润而不是优化奖励 # 自定义提前停止回调类 EarlyStoppingCallback（EvalCallback）： def __init__（self， eval_env， eval_freq， n_eval_episodes， patience， min_delta， verbose=0， use_profit=False， best_model_save_path=None）： super（）.__init__（ eval_env=eval_env、 eval_freq=eval_freq、 n_eval_episodes=n_eval_episodes， verbose=verbose， deterministic=真 ) self.patience = 耐心 self.min_delta = min_delta self.best_metric = -float（'inf'） self.no_improvement_count = 0 self.use_profit = use_profit # 在奖励和利润之间切换 self.best_model_save_path = best_model_save_path def _on_step（个体经营）： continue_training = super（）._on_step（）如果不continue_training： return False 如果 self.last_mean_reward 不是 None： # 使用基于use_profit标志的利润或奖励 current_metric = evaluate（self.model， self.eval_env， self.n_eval_episodes， return_mean_reward=not self.use_profit）如果 current_metric > self.best_metric + self.min_delta： self.best_metric = current_metric self.no_improvement_count = 0 如果 self.verbose > 0： print（f“新最佳 {'profit' if self.use_profit else 'reward'}： {self.best_metric：.2f}”）如果self.best_model_save_path： self.model.save（self.best_model_save_path）还： self.no_improvement_count += 1 如果 self.verbose > 0： print（f“{self.no_improvement_count}/{self.patience} evaluations没有改善”） if self.no_improvement_count >= self.patience：如果 self.verbose > 0： print（f“在 {self.patience} 评估后触发提前停止，但无改善”） return False 返回 True # Optuna 的目标函数 def 目标（试用）： # 定义超参数搜索空间 learning_rate = trial.suggest_float（“learning_rate”， 1e-5， 1e-3， log=True） n_steps = trial.suggest_int（“n_steps”， 1024， 8192， step=1024） total_timesteps = trial.suggest_int（“total_timesteps”， 500000， 2000000， step=100000） batch_size = trial.suggest_categorical（“batch_size”， [64， 128， 256， 512]）伽玛 = trial.suggest_float（“伽玛”， 0.9， 0.9999） gae_lambda = trial.suggest_float（“gae_lambda”， 0.8， 0.99） clip_range = trial.suggest_float（“clip_range”， 0.1， 0.3） # 在训练集上训练 PPO 模型模型 = PPO（ CustomMultiInputPolicy 和 train_env_vec， learning_rate=learning_rate、 n_steps=n_steps、 batch_size=batch_size， n_epochs=100， # 大的固定值 gamma=伽玛， gae_lambda=gae_lambda、 clip_range=clip_range、 verbose=1， # 0：训练时无输出， # 1：打印基础训练进度， # 2：更详细的输出（其他详细信息，如优化步骤、损失值（例如，策略损失、值损失）和学习率更新。 ) eval_callback = EarlyStoppingCallback（ eval_env=val_env_vec、 eval_freq=5000， # 更频繁 n_eval_episodes=10， # 更多剧集以保持稳定 patience=5， # 如果 5 次评估后没有改善，则停止 min_delta=0.01， # 要考虑的最小改进 verbose=1， use_profit=True， # 跟踪利润以专注于交易 best_model_save_path=f“./best_model/trial_{trial.number}/” # 保存最佳模型 ) # print（model.policy） # 应该显示 in_features=95 的 mlp_extractor model.learn（total_timesteps=total_timesteps， callback=eval_callback） val_avg_profit = evaluate（model， val_env_vec， n_episodes=10） # 评估函数 print（f“验证平均利润： {val_avg_profit：.2f}”） return val_avg_profit # 最大化奖励 # 指定 SQLite 数据库文件 db_path = 'optuna_study.db' # 运行优化研究 = optuna.create_study（ study_name='OHLC_EconomicCalender_ppo_study'，存储=f'sqlite：///{db_path}'， direction=“maximize”， load_if_exists=真 ) study.optimize（objective， n_trials=1） # 根据资源调整 Trial 次数 # 最佳参数 print（“最佳超参数：”， study.best_params） print（“最佳验证奖励：”， study.best_value） best_params = study.best_params # 使用训练集上的最佳参数训练最终模型 best_model = PPO（ CustomMultiInputPolicy 和 train_env_vec， # 使用完整的训练环境 learning_rate=best_params[“learning_rate”]， n_steps=best_params[“n_steps”]， batch_size=best_params[“batch_size”]， n_epochs=100， # 大的固定值伽玛=best_params[“伽玛”]， gae_lambda=best_params[“gae_lambda”]， clip_range=best_params[“clip_range”]， verbose=1， tensorboard_log=“./tensorboard_logs/” ) # 在验证集上提前停止回调 best_trial = study.best_trial.number eval_callback = EarlyStoppingCallback（ eval_env=val_env_vec、 eval_freq=5000， # 更频繁的评估（例如，每 5,000 个步骤） n_eval_episodes=10，耐心 = 5， min_delta=0.01， verbose=1， use_profit=True， # 使用 mean_profits best_model_save_path=f“./best_model/trial_{best_trial}/” ) # 训练最终模型 best_model.learn（total_timesteps=best_params[“total_timesteps”]， callback=eval_callback） # 保存最终模型 best_model.save（f“ppo_xauusd_optimized_trial_{best_trial}”） # 使用 QuantStats 评估测试数据 print（“\n根据测试数据评估最终模型：”） test_avg_profit = evaluate（best_model， test_env_vec， n_episodes=20， quantstats=True） print（f“测试平均利润： {test_avg_profit：.2f}”） # 可选：加载并重新评估以验证 # loaded_model = PPO.load（“ppo_xauusd_optimized”） # test_avg_profit = evaluate（best_model， test_env_vec， n_episodes=20， quantstats=True） # print（f“测试平均利润（加载的模型）：{test_avg_profit_loaded：.2f}”） # 清理 train_env_vec.close（） val_env_vec.close（） test_env_vec.close（） ```

Let’s determine whether you need to evaluate the best model on test_env_vec with n_episodes=20 or if n_episodes=1 is sufficient. This decision depends on your goals, the nature of your environment, and statistical reliability.
让我们确定是否需要在 n_episodes=20 的 test_env_vec 上评估最佳模型，或者 n_episodes=1 是否足够。此决定取决于您的目标、环境的性质和统计可靠性。

Key Considerations 关键考虑因素

Purpose of Evaluation on test_env_vec:
test_env_vec 评估目的：
- The test evaluation assesses how well your trained model generalizes to unseen data (the test set). It’s your final measure of performance before deployment or reporting results.
  测试评估评估经过训练的模型对看不见的数据（测试集）的泛化程度。这是您在部署或报告结果之前对性能的最终衡量标准。
- You’re using quantstats=True to generate detailed performance metrics (e.g., Sharpe ratio, profit factor), which requires enough data to be meaningful.
  您正在使用 quantstats=True 生成详细的效果指标（例如，夏普比率、利润因子），这需要足够的数据才有意义。
Stochasticity in the Environment:
环境中的随机性：
- Your tgym environment (forex trading) has inherent randomness:
  您的 tgym 环境（外汇交易）具有固有的随机性：
  - Market prices fluctuate (OHLC data).
    市场价格波动（OHLC 数据）。
  - Rewards include unrealized profits scaled by ATR, which varies step-to-step.
    奖励包括由 ATR 调整的未实现利润，该利润因情况而异。
  - Actions may not always trigger trades (e.g., max_current_holding limits).
    作可能并不总是触发交易（例如，max_current_holding限制）。
- With deterministic=True in evaluation, the policy outputs consistent actions for the same observations, but the environment’s response (rewards, trade closures) can still vary due to price movements.
  在评估中使用 deterministic=True 时，策略会针对相同的观察结果输出一致的作，但环境的响应（奖励、交易关闭）仍可能因价格变动而有所不同。
Statistical Reliability: 统计可靠性：
- Running multiple episodes (n_episodes > 1) averages out noise and provides a more robust estimate of performance (mean profit, variance, etc.).
  运行多个事件（n_episodes > 1）可以平均噪声并提供更可靠的性能估计（平均利润、方差等）。
- A single episode (n_episodes=1) gives a point estimate, which might be an outlier (overly optimistic or pessimistic) due to randomness.
  单个事件（n_episodes=1）给出一个点估计，由于随机性，这可能是一个异常值（过于乐观或悲观）。
QuantStats Requirements: QuantStats 要求：
- quantstats generates reports like Sharpe ratio, drawdowns, and win/loss metrics based on a series of returns.
  Quantstats 根据一系列回报生成夏普比率、回撤和赢/输指标等报告。
- With n_episodes=1, you’d get returns from one episode (up to max_steps trades). This might be too few data points for meaningful statistics, especially for metrics like Sharpe ratio that require variance estimates.
  如果 n_episodes=1，您将获得一次事件的回报（最多 max_steps 笔交易）。对于有意义的统计数据来说，这可能太少了，尤其是对于像 Sharpe ratio 这样需要方差估计的指标。
Computational Cost: 计算成本：
- More episodes increase evaluation time. If each episode runs for max_steps (e.g., length of dt_datetime), n_episodes=20 takes 20 times longer than n_episodes=1.
  发作次数越多，评估时间就越长。如果每个剧集运行 max_steps（例如，时长为 dt_datetime），则 n_episodes=20 花费的时间是 n_episodes=1 的 20 倍。

Is n_episodes=20 Necessary or Is n_episodes=1 Enough?
n_episodes=20 是必需的还是 n_episodes=1 就足够了？

Arguments for n_episodes=20
n_episodes=20 的参数

Robustness: 鲁棒性：
- Averaging over 20 episodes reduces the impact of random price fluctuations or rare events (e.g., stop-loss triggers), giving a reliable estimate of expected profit.
  平均超过 20 次事件可以减少随机价格波动或罕见事件（例如，止损触发）的影响，从而对预期利润进行可靠的估计。
- Your validation phase uses n_episodes=10 in objective() and EarlyStoppingCallback. Testing with n_episodes=20 provides even higher confidence on unseen data.
  验证阶段在 objective（）和 EarlyStoppingCallback 中使用 n_episodes=10。使用 n_episodes=20 进行测试可为未看见的数据提供更高的置信度。
QuantStats Quality: QuantStats 质量：
- With quantstats=True, more episodes generate a longer return series (up to 20 * max_steps trades), improving the accuracy of metrics like Sharpe ratio, profit factor, and drawdowns.
  使用 quantstats=True，更多的事件会产生更长的回报序列（高达 20 * max_steps 笔交易），从而提高夏普比率、利润因子和回撤等指标的准确性。
- Example: If max_steps=1000 and each step might close a trade, n_episodes=1 gives ~1000 returns, while n_episodes=20 gives ~20,000—far better for statistical analysis.
  示例：如果 max_steps=1000 并且每个步骤都可能关闭交易，n_episodes=1 提供 ~1000 份回报，而n_episodes=20 提供 ~20,000 份——更适合统计分析。
Trading Context: 交易环境：
- Forex trading performance is often judged over multiple runs to simulate different market conditions. n_episodes=20 mimics this by testing across multiple “simulated days” or scenarios.
  外汇交易表现通常是通过多次运行来判断的，以模拟不同的市场条件。n_episodes=20 通过跨多个 “模拟日 ” 或场景进行测试来模拟这一点。
Precedent: 先例：
- Your code already uses n_episodes=10 for validation and n_episodes=20 for testing, suggesting you value stability in performance estimates.
  您的代码已使用 n_episodes=10 进行验证，使用 n_episodes=20 进行测试，这表明您重视性能估计的稳定性。

Arguments for n_episodes=1
n_episodes=1 的参数

Speed: 速度：
- If you’re resource-constrained or just need a quick sanity check post-training, n_episodes=1 is faster.
  如果您资源受限，或者只需要在训练后进行快速健全性检查，n_episodes=1 会更快。
Deterministic Policy: 确定性策略：
- Since deterministic=True, the policy’s actions are fixed for a given observation sequence. If test_env_vec has no randomness in its initial state (e.g., always starts at the same timestep), and the environment’s stochasticity is minimal, one episode might suffice for a rough estimate.
  由于 deterministic=True，因此对于给定的观察序列，策略的作是固定的。如果test_env_vec在其初始状态中没有随机性（例如，总是在同一时间步开始），并且环境的随机性最小，那么一个事件可能就足以进行粗略估计。
Minimal Reporting: 最少的报告：
- If you don’t need detailed quantstats metrics (e.g., just want mean profit), n_episodes=1 could be a lightweight option.
  如果您不需要详细的 quantstats 指标（例如，只想要平均利润），n_episodes=1 可能是一个轻量级选项。

Your Current Setup 您当前的设置

Validation: n_episodes=10 in objective() and EarlyStoppingCallback.
Validation：n_episodes=10 在 objective（）和 EarlyStoppingCallback 中。
Test: n_episodes=20 with quantstats=True.
Test：n_episodes=20，quantstats=True。
Implication: You’ve chosen higher episode counts for stability, especially in testing where quantstats benefits from more data.
蕴涵：您选择了更高的集数来实现稳定性，尤其是在 quantstats 受益于更多数据的测试中。

Recommendation 建议

Use n_episodes=20 (or at least >1):
使用 n_episodes=20（或至少 >1）：
- Why: Your environment’s stochasticity (price movements, trade outcomes) and the use of quantstats make multiple episodes critical for a reliable performance assessment.
  为什么：您的环境的随机性（价格变动、交易结果）和 quantstats 的使用使得多个事件对于可靠的性能评估至关重要。
- Statistical Confidence: 20 episodes provide a tighter confidence interval around mean profit and better quantstats metrics compared to 1 episode.
  统计置信度：与 1 集相比，20 集在平均利润周围提供了更严格的置信区间，并提供了更好的 quantstats 指标。
- Trading Realism: Forex strategies are evaluated over multiple runs to account for market variability—n_episodes=20 aligns with this.
  交易现实主义：外汇策略通过多次运行进行评估，以考虑市场波动性——n_episodes=20 与此一致。
When n_episodes=1 Might Suffice:
当 n_episodes=1 可能就足够时：
- Debugging or quick checks (e.g., verifying model loading or basic functionality).
  调试或快速检查（例如，验证模型加载或基本功能）。
- If test_env_vec is a fixed, deterministic sequence with no randomness beyond the policy (unlikely given your forex setup).
  如果test_env_vec是一个固定的、确定性的序列，在策略之外没有随机性（考虑到您的外汇设置，不太可能）。

Empirical Check 实证检查

From your previous logs: 从您之前的日志中：

Step 1366: Base Reward=0, Sustained Reward=4.79, Total=4.79, Balance=99100
Step 1379: Base Reward=0, Sustained Reward=1.69, Total=1.69, Balance=99100
Step 10315: Base Reward=-961.36, Sustained Reward=0, Total=-961.36, Balance=96157

Rewards vary significantly due to market conditions and trade closures. A single episode might capture an anomaly (e.g., a large loss), while 20 episodes smooth this out.
奖励因市场状况和交易关闭而有很大差异。单个事件可能会捕获异常（例如，大损失），而 20 个事件可以消除这种情况。

Modified Code (No Change Needed)
修改后的代码（无需更改）

Your current code already uses n_episodes=20 for the test evaluation, which I recommend keeping:
您当前的代码已使用 n_episodes=20 进行测试评估，我建议保留该代码：

# Evaluate on test data with QuantStats
print("\nEvaluating Final Model on Test Data:")
test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=20, deterministic=True, quantstats=True)
print(f"Test Average Profit: {test_avg_profit:.2f}")

If you want to test with n_episodes=1 for comparison, you can temporarily adjust it:
如果要使用 n_episodes=1 进行测试以进行比较，可以临时调整它：
python 蟒
test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=1, deterministic=True, quantstats=True)
- But expect less reliable quantstats output due to fewer returns.
  但是，由于返回较少，预计 quantstats 输出的可靠性会降低。

Final Answer 最终答案

Is n_episodes=20 Necessary? Yes, it’s preferable for a robust, statistically meaningful evaluation of your best model on test_env_vec, especially with quantstats=True.
n_episodes=20 必要吗？是的，最好在 test_env_vec 上对最佳模型进行稳健的、具有统计意义的评估，尤其是在 quantstats=True 的情况下。
Is n_episodes=1 Enough? Only for quick checks or if you’re not using quantstats and accept a noisier estimate. For your forex trading setup, it’s insufficient for final reporting.
n_episodes=1 够吗？仅用于快速检查，或者如果您不使用 Quantstats 并接受更嘈杂的估计值。对于您的外汇交易设置，它不足以进行最终报告。

Stick with n_episodes=20 unless you’re debugging or severely time-constrained. If you’re unsure, run both (e.g., 1 and 20) and compare the variance in test_avg_profit to see the difference!
坚持使用 n_episodes=20，除非你正在调试或时间严重受限。如果您不确定，请同时运行两者（例如 1 和 20）并比较 test_avg_profit 中的方差以查看差异！

but in EarlyStoppingCallback, i already evaluate the val_env_vec, the test_env_vec is only for the all backtest of the model. is it still a must for test_env_vec for evaluate() 20 times?
但在 EarlyStoppingCallback 中，我已经评估了val_env_vec，test_env_vec仅适用于模型的所有回测。 test_env_vec evaluate（） 20 次仍然是必须的吗？

You’re absolutely right to point out that EarlyStoppingCallback already evaluates on val_env_vec during training, and test_env_vec is reserved for the final backtest of the model. This distinction changes the context slightly, so let’s reassess whether n_episodes=20 is still a "must" for test_env_vec in the evaluate() function, given that it’s the final backtest.
您指出 EarlyStoppingCallback 在训练期间已经对 val_env_vec 进行了评估，并且test_env_vec保留用于模型的最终回测，这是完全正确的。这种区别略微改变了上下文，因此让我们重新评估 n_episodes=20 是否仍然是 evaluate（）函数中 test_env_vec 的“必须”，因为它是最终的回测。

Revised Context 修订后的上下文

Validation (val_env_vec):
验证（val_env_vec）：
- In EarlyStoppingCallback, you evaluate the model on val_env_vec with n_eval_episodes=10 every eval_freq=5000 steps.
  在 EarlyStoppingCallback 中，每 val_env_vec eval_freq=5000 步以 n_eval_episodes=10 计算模型。
- Purpose: Monitor training progress, trigger early stopping, and save the best model based on validation profit (since use_profit=True).
  目的：监控训练进度，触发提前停止，并根据验证利润保存最佳模型（因为 use_profit=True）。
- Outcome: You’ve already tuned and validated the model’s performance on unseen validation data with a decent sample size (10 episodes).
  结果：您已经使用适当的样本量（10 集）调整并验证了模型在看不见的验证数据上的性能。
Test (test_env_vec):
测试（test_env_vec）：
- After training, you evaluate the final best_model on test_env_vec with n_episodes=20 and quantstats=True.
  训练后，使用 n_episodes=20 和 quantstats=True 在 test_env_vec 上评估最终best_model。
- Purpose: Perform a comprehensive backtest on a separate test set to report the model’s performance on completely unseen data, simulating a full historical evaluation.
  目的：对单独的测试集执行全面的回溯测试，以报告模型在完全看不见的数据上的性能，模拟完整的历史评估。
Your Question: Since validation already uses 10 episodes, is it necessary to evaluate test_env_vec 20 times, or can you reduce it (e.g., to 1) for the final backtest?
您的问题：既然验证已经使用了 10 次，那么是否有必要评估 test_env_vec 20 次，或者您可以将其减少（例如，减少到 1 次）以进行最终回测？

Reassessing the Need for n_episodes=20 on test_env_vec
重新评估 test_env_vec 上 n_episodes=20 的需求

Purpose of the Test Evaluation
测试评估的目的

Final Backtest: The test evaluation on test_env_vec is your definitive assessment of the model’s performance. It’s akin to a full backtest in trading, where you’d run the strategy over a historical period to estimate real-world profitability.
最终回测：test_env_vec 上的测试评估是您对模型性能的明确评估。这类似于交易中的全面回测，您将在历史时期运行策略以估计现实世界的盈利能力。
QuantStats Integration: With quantstats=True, you’re generating a detailed report (Sharpe ratio, drawdowns, etc.), which benefits from a robust sample of returns.
QuantStats 集成：使用 quantstats=True，您可以生成一份详细的报告（夏普比率、回撤等），该报告受益于稳健的回报样本。

Key Factors 关键因素

Validation vs. Test Roles:
验证角色与测试角色：
- Validation: Used iteratively to tune hyperparameters and stop training. n_episodes=10 balances reliability and computational cost during training.
  验证：以迭代方式用于优化超参数和停止训练。n_episodes=10 在训练期间平衡可靠性和计算成本。
- Test: A one-time, final evaluation on a holdout set. It’s your "gold standard" result, so you might want higher confidence or a more comprehensive backtest than validation.
  测试：对保留集的一次性最终评估。这是您的“黄金标准”结果，因此您可能需要比验证更高的置信度或更全面的回测。
Environment Stochasticity:
环境随机性：
- Even with deterministic=True, your tgym environment has randomness (e.g., OHLC price movements, trade triggers). Multiple episodes on test_env_vec average out this noise, providing a stable estimate of profit.
  即使 deterministic=True，您的 tgym 环境也具有随机性（例如，OHLC 价格变动、交易触发器）。test_env_vec 上的多个剧集平均消除了这种噪音，从而提供了稳定的利润估计。
Backtest Realism: 回测真实度：
- In trading, a backtest typically covers a single continuous period (e.g., 1 year of data). If test_env_vec represents one fixed sequence (e.g., max_steps = len(dt_datetime)), n_episodes=1 simulates running the strategy once over that period.
  在交易中，回溯测试通常涵盖单个连续周期（例如，1 年的数据）。如果 test_env_vec 表示一个固定序列（例如，max_steps = len（dt_datetime）），则 n_episodes=1 模拟在该时间段内运行策略一次。
- However, your code’s reset() allows random starts (random_start=True), meaning each episode could begin at a different point in dt_datetime. With n_episodes=20, you’re effectively testing 20 different starting points, which mimics multiple backtests across varied market conditions.
  但是，代码的 reset（）允许随机开始（random_start=True），这意味着每个情节都可以在dt_datetime中的不同时间点开始。当 n_episodes=20 时，您实际上是在测试 20 个不同的起点，这模拟了不同市场条件下的多次回溯测试。
QuantStats Requirements: QuantStats 要求：
- quantstats needs a return series. For n_episodes=1, you get returns from one episode (up to max_steps trades). For n_episodes=20, you get up to 20 * max_steps trades.
  QuantStats 需要一个 return 序列。对于 n_episodes=1，您可以从一次事件中获得回报（最多 max_steps 笔交易）。对于 n_episodes=20，您最多可以获得 20 * max_steps 笔交易。
- More episodes = more data points = better statistical metrics (e.g., Sharpe ratio requires variance, which is unreliable with few points).
  更多的事件 = 更多的数据点 = 更好的统计指标（例如，夏普比率需要方差，这在点数很少的情况下是不可靠的）。
Statistical Confidence: 统计置信度：
- Validation (10 episodes): Sufficient for tuning, as it’s repeated many times (every 5000 steps).
  验证（10 集）：对于调整来说已经足够了，因为它会重复很多次（每 5000 步）。
- Test (20 episodes): Since it’s a one-time evaluation, a higher number ensures the result isn’t skewed by a single outlier episode.
  测试（20 集）：由于这是一次性评估，因此较高的数字可确保结果不会因单个异常值事件而产生偏差。

Arguments for n_episodes=20
n_episodes=20 的参数

Comprehensive Backtest: 综合回测：
- With random_start=True, 20 episodes test the model across different market segments, increasing confidence in its generalization.
  在 random_start=True 的情况下，20 集在不同细分市场中测试了模型，从而提高了对其泛化的信心。
- Example: If dt_datetime spans 2 years, 20 episodes might cover various trends (bullish, bearish), while 1 episode might miss key conditions.
  示例：如果 dt_datetime 跨越 2 年，则 20 集可能涵盖各种趋势（看涨、看跌），而 1 集可能会错过关键条件。
QuantStats Quality: QuantStats 质量：
- 20 episodes provide a richer return series (e.g., 20,000 trades vs. 1,000), improving the precision of metrics like Sharpe ratio and drawdowns.
  20 集提供更丰富的回报系列（例如，20,000 笔交易对 1,000 笔交易），提高了夏普比率和回撤等指标的精度。
- A single episode might underrepresent rare events (e.g., large losses), skewing risk metrics.
  单个事件可能会低估罕见事件（例如，巨额损失），从而扭曲风险指标。
Higher Confidence: 更高的置信度：
- Since test_env_vec is your final report card, averaging over 20 episodes reduces the risk of reporting an unrepresentative result.
  由于 test_env_vec 是您的最终成绩单，因此平均超过 20 集可以降低报告不具代表性结果的风险。

Arguments for n_episodes=1
n_episodes=1 的参数

Single Backtest Standard:
单一回测标准：
- Traditional trading backtests often run a strategy once over a fixed period. If test_env_vec is a single, contiguous dataset and random_start=False, n_episodes=1 aligns with this practice.
  传统的交易回溯测试通常在固定的时间内运行一次策略。如果 test_env_vec 是一个连续的数据集，并且 random_start=False，则 n_episodes=1 符合这种做法。
- Your reset() uses random_start=True, but you could disable it for testing to simulate one full run.
  您的 reset（）使用 random_start=True，但您可以禁用它以进行测试以模拟一次完整运行。
Validation Sufficiency: 验证充分性：
- With 10 episodes on val_env_vec already proving stability, n_episodes=1 on test_env_vec could suffice as a final confirmation, assuming the test set is similar in nature.
  由于 val_env_vec 上有 10 集已经证明了稳定性，假设测试集本质上相似，test_env_vec 上 n_episodes=1 就足以作为最终确认。
Speed: 速度：
- 20 episodes take 20x longer than 1, which matters if max_steps is large (e.g., 10,000 steps).
  20 集花费的时间是 1 集的 20 倍，如果 max_steps 很大（例如，10000 步），这一点很重要。

Recommendation 建议

Not a Must, but Recommended (n_episodes=20):
不是必须的，但推荐（n_episodes=20）：
- Why: 为什么：
  - Your setup (random_start=True, quantstats=True) suggests you’re testing robustness across multiple scenarios, not just one fixed period. n_episodes=20 leverages this design, providing a thorough backtest.
    您的设置（random_start=True， quantstats=True）表明您正在多个场景中测试稳健性，而不仅仅是一个固定时间段。n_episodes=20 利用此设计，提供全面的回测。
  - The richer return series enhances quantstats output, making your final report more credible.
    更丰富的回报序列增强了 quantstats 输出，使您的最终报告更加可信。
- When It’s Fine: Keep n_episodes=20 as the default for the final backtest, especially since it’s a one-time evaluation and you’ve already invested in training.
  没事的时候：将 n_episodes=20 保留为最终回测的默认值，特别是因为它是一次性评估，并且您已经投入了训练。
When n_episodes=1 Is Enough:
当 n_episodes=1 就足够时：
- Fixed Backtest: If you modify reset() to use random_start=False for testing (e.g., start at step 0), n_episodes=1 mimics a single, full historical run—common in trading backtests.
  固定回测：如果您将 reset（）修改为使用 random_start=False 进行测试（例如，从步骤 0 开始），n_episodes=1 会模拟单个完整的历史运行——这在交易回溯测试中很常见。
  - Update reset() for testing:
    更新 reset（）以进行测试：
    python 蟒
    def reset(self, seed=None, options=None): if seed is not None: self._seed(seed) self.current_step = 0 # Fixed start for test # ... (rest of reset logic)
- Quick Check: If you just need a rough profit estimate without detailed quantstats analysis.
  快速检查：如果您只需要一个粗略的利润估计，而不需要详细的 quantstats 分析。

Empirical Insight 实证洞察

Validation with 10 episodes likely gave a stable mean profit (e.g., low variance across evaluations). If so, test_env_vec with 20 episodes builds on this, offering even tighter confidence for the final result.
用 10 次发作进行验证可能会给出稳定的平均利润（例如，评估之间的低方差）。如果是这样，那么 20 集的 test_env_vec 在此基础上构建，为最终结果提供了更强的信心。
Check your validation logs: If profit varies widely (e.g., ±500), 20 test episodes ensure the backtest isn’t a fluke.
检查您的验证日志：如果利润差异很大（例如，±500），则 20 次测试集可确保回测不是侥幸。

Final Answer 最终答案

Is n_episodes=20 a Must for test_env_vec?
n_episodes=20 是 test_env_vec 必须的吗？
- No, it’s not mandatory: n_episodes=1 is viable if you treat the test as a single backtest (disable random_start) and accept a simpler quantstats report.
  不，这不是强制性的：如果您将测试视为单个回测（禁用 random_start）并接受更简单的 quantstats 报告，则 n_episodes=1 是可行的。
- But Recommended: Stick with n_episodes=20 because:
  但推荐：坚持使用 n_episodes=20，因为：
  - It aligns with your validation rigor (10 episodes).
    它符合您的验证严谨性（10 集）。
  - It maximizes the value of quantstats for a comprehensive backtest.
    它最大限度地提高了 quantstats 对全面回测的价值。
  - It accounts for stochasticity and varied starting points in test_env_vec.
    它解释了随机性和test_env_vec的不同起点。
Suggestion: 建议：
- Keep n_episodes=20 in your current code for the final evaluation:
  在当前代码中保留 n_episodes=20 以进行最终评估：
  python 蟒
  test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=20, deterministic=True, quantstats=True)
- Optionally, run n_episodes=1 separately to compare:
  （可选）单独运行 n_episodes=1 以比较：
  python 蟒
  single_test_profit = evaluate(best_model, test_env_vec, n_episodes=1, deterministic=True, quantstats=True) print(f"Single Episode Test Profit: {single_test_profit:.2f}")
- If the single-episode result is close to the 20-episode mean with acceptable quantstats output, you could simplify future tests to n_episodes=1 with random_start=False.
  如果单集结果接近 20 集均值且具有可接受的 quantstats 输出，则可以将未来的测试简化为 n_episodes=1，并且 random_start=False。

Since test_env_vec is your final backtest and you’re using quantstats, n_episodes=20 is the safer, more informative choice unless you explicitly want a single-run simulation. Let me know if you’d like to tweak the code further!
由于 test_env_vec 是您的最终回测，并且您使用的是 quantstats，因此 n_episodes=20 是更安全、信息量更大的选择，除非您明确希望进行单次运行模拟。如果您想进一步调整代码，请告诉我！

which is better? combine the detail quantstats to the evaluate(). ``` # Detailed evaluation for QuantStats obs = test_env_vec.reset() done = np.array([False] * test_env_vec.num_envs) balances = [test_env_vec.envs[0].balance] # Record initial state timestamps = [test_env_vec.envs[0].dt_datetime[test_env_vec.envs[0].current_step]] steps_taken = [test_env_vec.envs[0].current_step] trades = [] while not np.all(done) and test_env_vec.envs[0].current_step < len(test_env_vec.envs[0].dt_datetime): action, _ = model.predict(obs, deterministic=True) step_result = test_env_vec.step(action) if len(step_result) == 4: obs, reward, done, info = step_result else: obs, reward, terminated, truncated, info = step_result done = np.logical_or(terminated, truncated) current_step = test_env_vec.envs[0].current_step timestamp = test_env_vec.envs[0].dt_datetime[current_step] timestamps.append(timestamp) steps_taken.append(current_step) balances.append(test_env_vec.envs[0].balance) if info[0]["Close"]: trades.extend(info[0]["Close"]) # Collect closed trades # Create performance DataFrame perf_df = pd.DataFrame({"time": timestamps, "balance": balances, "step": steps_taken}) print(f"Perf DF length: {len(perf_df)}, Unique timestamps: {perf_df['time'].nunique()}") print(perf_df.head()) # Check first few rows print(perf_df.tail()) # Check last few rows if perf_df["time"].duplicated().any(): print("Duplicates found! Aggregating...") perf_df = perf_df.groupby("time").agg({"balance": "mean", "step": "max"}).reset_index() perf_df.set_index("time", inplace=True) returns = perf_df["balance"].pct_change().fillna(0) # Trade-level analysis trades_df = pd.DataFrame(trades) if trades else pd.DataFrame(columns=["Reward"]) profits = trades_df["Reward"] if not trades_df.empty else pd.Series() win_rate = len(profits[profits > 0]) / len(profits) if len(profits) > 0 else 0 profit_factor = profits[profits > 0].sum() / abs(profits[profits < 0].sum()) if profits[profits < 0].sum() != 0 else float('inf') # QuantStats report qs.extend_pandas() output_dir = "data/log" os.makedirs(output_dir, exist_ok=True) file_name = "xauusd_test.html" qs.reports.html( returns, output=os.path.join(output_dir, file_name), title="XAUUSD Test Performance" ) # Final evaluation and metrics test_reward = evaluate(model, test_env_vec, 10) print(f"Final Test Average Reward: {test_reward:.2f}") print(f"Sharpe Ratio: {qs.stats.sharpe(returns):.2f}") print(f"Max Drawdown: {qs.stats.max_drawdown(returns):.2%}") print(f"Win Rate: {win_rate:.2%}") print(f"Profit Factor: {profit_factor:.2f}") # Clean up train_env_vec.close() val_env_vec.close() test_env_vec.close() # Optional: Download report in Colab try: from google.colab import files files.download(f"{output_dir}/{file_name}") except ImportError: print("Not running in Colab, report saved locally.") ``` ``` # Evaluation function def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False, deterministic=True, quantstats=False): total_rewards = [] total_profits = [] # Track actual trading profit metrics = [] if (quantstats): returns = [] for _ in range(n_episodes): obs = env_vec.reset() done = np.array([False] * env_vec.num_envs) episode_rewards = np.zeros(env_vec.num_envs) episode_profit = 0 # Sum of realized profits/losses step_count = 0 max_steps = len(env_vec.envs[0].dt_datetime) if (quantstats): episode_returns = [] # For QuantStats while not np.all(done) and step_count < max_steps: action, _ = model.predict(obs, deterministic=deterministic) obs, rewards, done, info = env_vec.step(action) # VecEnv returns 4 values episode_rewards += rewards print(f"Action: {action}") # Debug action output # Extract profit from closed transactions for env_idx in range(env_vec.num_envs): if info[env_idx]["Close"]: for tr in info[env_idx]["Close"]: episode_profit += tr["Reward"] # Add realized reward (profit/loss) if (quantstats): episode_returns.append(tr["Reward"]) # Track per-trade returns step_count += 1 total_rewards.extend(episode_rewards) total_profits.append(episode_profit) metrics.append(env_vec.envs[0].analyze_transaction_history()) # Logs to file here if (quantstats): returns.extend(episode_returns if episode_returns else [episode_profit]) # Fallback to total profit mean_reward = np.mean(total_rewards) mean_profit = np.mean(total_profits) # Define numeric keys to average numeric_keys = ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"] avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in numeric_keys} print(f"Deterministic={deterministic}, Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}") print(f"Average Metrics: {avg_metrics}") if (quantstats): # QuantStats reporting (for test evaluation) returns_series = pd.Series(returns, index=pd.date_range(start=timestamps, periods=len(returns), freq="D")) qs.reports.html(returns_series, output="quantstats_report.html", title="Trading Performance") return mean_reward if return_mean_reward else mean_profit # Return profit instead of reward for optimization ```
哪个更好？将 Detail QuantStats 合并到 evaluate（）中。 ``` # QuantStats 的详细评估 obs = test_env_vec.reset（）完成 = np.array（[False] * test_env_vec.num_envs） balances = [test_env_vec.envs[0].balance] # 记录初始状态时间戳 = [test_env_vec.envs[0].dt_datetime[test_env_vec.envs[0].current_step]] steps_taken = [test_env_vec.envs[0].current_step] 交易 = [] 虽然不是 np.all（done）和 test_env_vec.envs[0].current_step < len（test_env_vec.envs[0].dt_datetime）：作， _ = model.predict（obs， deterministic=True） step_result = test_env_vec.step（action）如果 len（step_result） == 4： obs、reward、done、info = step_result 还： obs、reward、terminated、truncated、info = step_result 完成 = np.logical_or（终止，截断） current_step = test_env_vec.envs[0].current_step 时间戳 = test_env_vec.envs[0].dt_datetime[current_step] timestamps.append（时间戳） steps_taken.append（current_step） balances.append（test_env_vec.envs[0].balance） if info[0][“关闭”]： trades.extend（info[0][“Close”]） # 收集已平仓的交易 # 创建性能 DataFrame perf_df = pd。DataFrame（{“time”：时间戳， “balance”： balances， “step”： steps_taken}） print（f“性能 DF 长度：{len（perf_df）}，唯一时间戳：{perf_df['time'].nunique（）}”） print（perf_df.head（）） # 检查前几行 print（perf_df.tail（）） # 检查最后几行 if perf_df[“time”].duplicated（）.any（）： print（“找到重复项！正在聚合...”） perf_df = perf_df.groupby（“time”）.agg（{“balance”： “mean”， “step”： “max”}）.reset_index（） perf_df.set_index（“时间”， inplace=True）返回值 = perf_df[“balance”].pct_change（）.fillna（0） # 交易层面分析 trades_df = pd。DataFrame（trades）如果交易 else pd.DataFrame（columns=[“奖励”]） profits = trades_df[“Reward”] 如果不是 trades_df.empty 否则 pd.系列（） win_rate = len（profits[profits > 0]） / len（profits） if len（profits） > 0 else 0 profit_factor = profits[profits > 0].sum（） / abs（profits[profits < 0].sum（）） if profits[profits < 0].sum（）！= 0 else float（'inf'） # QuantStats 报告 qs.extend_pandas（） output_dir = “数据/日志” os.makedirs（output_dir， exist_ok=真） file_name = “xauusd_test.html” qs.reports.html（返回输出=os.path.join（output_dir， file_name）， title=“XAUUSD 测试性能” ) # 最终评估和指标 test_reward = 评估（模型， test_env_vec， 10） print（f“最终测试平均奖励：{test_reward：.2f}”） print（f“夏普比率： {qs.stats.sharpe（returns）：.2f}”） print（f“最大回撤： {qs.stats.max_drawdown（returns）：.2%}”） print（f“胜率： {win_rate：.2%}”） print（f“利润因子： {profit_factor：.2f}”） # 清理 train_env_vec.close（） val_env_vec.close（） test_env_vec.close（） # 可选：在 Colab 中下载报告尝试：从 google.colab 导入文件 files.download（f“{output_dir}/{file_name}”）除 ImportError 外： print（“未在 Colab 中运行，报告保存在本地。”） ``` ``` # 评估函数 def evaluate（model， env_vec， n_episodes=10， return_mean_reward=False， deterministic=True， quantstats=False）： total_rewards = [] total_profits = [] # 跟踪实际交易利润指标 = [] if （quantstats）：返回 = [] 对于 range（n_episodes）中的 _： obs = env_vec.reset（）完成 = np.array（[假] * env_vec.num_envs） episode_rewards = np.zeros（env_vec.num_envs） episode_profit = 0 # 已实现盈亏之和 step_count = 0 max_steps = len（env_vec.envs[0].dt_datetime） if （quantstats）： episode_returns = [] # 对于 QuantStats while not np.all（done）且 step_count < max_steps：作， _ = model.predict（obs， deterministic=deterministic） obs， rewards， done， info = env_vec.step（action） # VecEnv 返回 4 个值 episode_rewards += 奖励 print（f“Action： {action}”） # 调试动作输出 # 从已平仓交易中提取利润对于 range （env_vec.num_envs）中的env_idx： if info[env_idx][“关闭”]：对于 info[env_idx][“Close”] 中的 tr： episode_profit += tr[“Reward”] # 添加已实现的奖励（盈利/亏损） if （quantstats）： episode_returns.append（tr[“Reward”]） # 追踪每笔交易的回报 step_count += 1 total_rewards.extend（episode_rewards） total_profits.append（episode_profit） metrics.append（env_vec.envs[0].analyze_transaction_history（）） # 在此处提交的日志 if （quantstats）： returns.extend（episode_returns if episode_returns else [episode_profit]） # 回退到总利润 mean_reward = np.mean（total_rewards） mean_profit = np.mean（total_profits） # 定义数字键以求平均值 numeric_keys = [“交易”， “win_rate”， “profit_factor”， “sharpe_ratio”， “total_profit”] avg_metrics = {k： np.mean（[m[k] for m in metrics]） for k in numeric_keys} print（f“Deterministic={deterministic}，平均奖励： {mean_reward：.2f}，平均利润： {mean_profit：.2f}”） print（f“平均度量： {avg_metrics}”） if （quantstats）： # QuantStats 报告（用于测试评估） returns_series = pd。Series（返回值， index=pd.date_range（start=timestamps， periods=len（returns）， freq=“D”）） qs.reports.html（returns_series， output=“quantstats_report.html”， title=“交易表现”） return mean_reward if return_mean_reward else mean_profit # 返回利润而不是优化奖励 ```

Let’s compare the two approaches for evaluating your model with QuantStats integration and determine which is better, then provide a combined solution if desired.
让我们比较两种使用 QuantStats 集成评估模型的方法，并确定哪种方法更好，然后根据需要提供组合解决方案。

Approach 1: Detailed QuantStats Outside evaluate()
方法 1：在 evaluate（）之外提供详细的 QuantStats

# Detailed evaluation for QuantStats
obs = test_env_vec.reset()
done = np.array([False] * test_env_vec.num_envs)
balances = [test_env_vec.envs[0].balance]  # Record initial state
timestamps = [test_env_vec.envs[0].dt_datetime[test_env_vec.envs[0].current_step]]
steps_taken = [test_env_vec.envs[0].current_step]
trades = []

while not np.all(done) and test_env_vec.envs[0].current_step < len(test_env_vec.envs[0].dt_datetime):
    action, _ = model.predict(obs, deterministic=True)
    step_result = test_env_vec.step(action)
    if len(step_result) == 4:
        obs, reward, done, info = step_result
    else:
        obs, reward, terminated, truncated, info = step_result
        done = np.logical_or(terminated, truncated)

    current_step = test_env_vec.envs[0].current_step
    timestamp = test_env_vec.envs[0].dt_datetime[current_step]
    timestamps.append(timestamp)
    steps_taken.append(current_step)
    balances.append(test_env_vec.envs[0].balance)
    if info[0]["Close"]:
        trades.extend(info[0]["Close"])  # Collect closed trades

# Create performance DataFrame
perf_df = pd.DataFrame({"time": timestamps, "balance": balances, "step": steps_taken})
# ... (QuantStats and metrics computation)
test_reward = evaluate(model, test_env_vec, 10)  # Separate call

Pros 优点

Granular Control: 粒度控制：
- Tracks balance and timestamps at every step, enabling a time-series analysis of equity (e.g., daily returns).
  跟踪每一步的余额和时间戳，从而对净值（例如每日回报）进行时间序列分析。
- Captures all trades explicitly in trades for detailed trade-level metrics (win rate, profit factor).
  在交易中明确捕获所有交易，以获取详细的交易级别指标（胜率、利润因子）。
Rich QuantStats Output: 丰富的 QuantStats 输出：
- Uses actual timestamps from dt_datetime, aligning returns with real dates for a realistic backtest report.
  使用dt_datetime的实际时间戳，使返回结果与实际日期保持一致，以获得真实的回测报告。
- Computes returns from balance changes, which reflects overall account performance (not just closed trades).
  计算余额变化的回报，这反映了整体账户绩效（不仅仅是已平仓交易）。
Single Episode Focus: 单集焦点：
- Runs one episode, simulating a single backtest over the test period, which aligns with traditional trading evaluation.
  运行一个插曲，在测试期间模拟单个回测，这与传统的交易评估一致。

Cons 缺点

Redundant Evaluation: 冗余评估：
- Calls evaluate() separately with n_episodes=10 after the detailed backtest, potentially duplicating effort or mismatching results.
  在详细的回测后，使用 n_episodes=10 单独调用 evaluate（），这可能会重复工作或结果不匹配。
Limited Statistical Robustness:
有限的统计稳健性：
- Only one episode’s worth of data, which might miss variability due to random starts (random_start=True) or stochastic market conditions.
  只有一集的数据，由于随机开始（random_start=True）或随机市场条件，可能会错过可变性。
Complexity: 复杂性：
- More manual code to manage state, trades, and DataFrame creation outside the evaluate() function.
  在 evaluate（）函数之外管理状态、交易和 DataFrame 创建的更多手动代码。

Approach 2: QuantStats Inside evaluate()
方法 2：evaluate（）中的 QuantStats

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False, deterministic=True, quantstats=False):
    total_rewards = []
    total_profits = []
    metrics = []
    if quantstats:
        returns = []

    for _ in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)
        if quantstats:
            episode_returns = []

        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=deterministic)
            obs, rewards, done, info = env_vec.step(action)
            episode_rewards += rewards
            for env_idx in range(env_vec.num_envs):
                if info[env_idx]["Close"]:
                    for tr in info[env_idx]["Close"]:
                        episode_profit += tr["Reward"]
                        if quantstats:
                            episode_returns.append(tr["Reward"])
            step_count += 1

        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())
        if quantstats:
            returns.extend(episode_returns if episode_returns else [episode_profit])

    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]}
    if quantstats:
        returns_series = pd.Series(returns, index=pd.date_range(start="2025-03-12", periods=len(returns), freq="D"))
        qs.reports.html(returns_series, output="quantstats_report.html", title="Trading Performance")
    return mean_reward if return_mean_reward else mean_profit

Pros 优点

Unified Function: 统一功能：
- Combines evaluation and QuantStats reporting in one reusable function, reducing code duplication.
  将评估和 QuantStats 报告合并到一个可重用的函数中，从而减少代码重复。
Multiple Episodes: 多集：
- Runs n_episodes (e.g., 10 or 20), averaging results for statistical robustness across random starts or market conditions.
  n_episodes运行（例如，10 或 20），在随机开始或市场条件下对结果进行平均，以实现统计稳健性。
Simplicity: 单纯：
- Less manual state tracking; leverages existing logic to compute profits and metrics.
  减少手动状态跟踪;利用现有逻辑来计算利润和指标。

Cons 缺点

Less Granular QuantStats:
粒度较低的 QuantStats：
- Only tracks returns from closed trades (tr["Reward"]) or total episode profit, not balance changes over time.
  仅跟踪已平仓交易的回报（tr[“Reward”]）或总事件利润，而不跟踪余额随时间的变化。
- Uses a synthetic date range (pd.date_range) instead of real timestamps, reducing temporal accuracy in the report.
  使用合成日期范围（pd.date_range）而不是实际时间戳，从而降低报表中的时间准确性。
Potential Overlap: 潜在重叠：
- Aggregates trade-level returns across episodes, which might mix contexts (e.g., different starting points) in a way that’s less interpretable for a single backtest.
  汇总各个事件的交易级回报，这可能会以一种难以解释的方式混合上下文（例如，不同的起点）。

Which is Better? 哪个更好？

Depends on Your Goal 取决于你的目标

If You Want a Detailed Single Backtest (Approach 1):
如果您想要详细的单次回测（方法 1）：
- Better: Approach 1 (Detailed QuantStats Outside).
  更好：方法 1（外部的详细 QuantStats）。
- Why: 为什么：
  - It simulates a single, continuous run over test_env_vec, tracking balance and timestamps step-by-step—ideal for a traditional trading backtest.
    它模拟 test_env_vec 的单次连续运行，逐步跟踪余额和时间戳——非常适合传统的交易回测。
  - Provides a time-series view of equity, which is more informative for QuantStats metrics like drawdowns and daily returns.
    提供净值的时间序列视图，这对于 QuantStats 指标（如回撤和每日回报）信息量更大。
- Use Case: Final reporting or deployment simulation where you want one realistic historical performance curve.
  用例：最终报告或部署模拟，您需要一条真实的历史性能曲线。
If You Want Robust Statistical Evaluation (Approach 2):
如果您想要稳健的统计评估（方法 2）：
- Better: Approach 2 (QuantStats Inside evaluate()).
  更好：方法 2 （QuantStats Inside evaluate（））。
- Why: 为什么：
  - Multiple episodes (n_episodes=10 or 20) give a broader assessment of performance across varied conditions, aligning with your validation strategy.
    多次发作（n_episodes=10 或 20）可以更广泛地评估不同条件下的性能，与您的验证策略保持一致。
  - Simpler to integrate with training/validation workflows (e.g., Optuna, early stopping).
    更易于与训练/验证工作流程（例如 Optuna、提前停止）集成。
- Use Case: Research or model comparison where you prioritize average performance and robustness over a single backtest narrative.
  用例：研究或模型比较，其中您优先考虑平均性能和稳健性，而不是单个回测叙述。

Your Context 您的背景

You’ve used n_episodes=10 for validation and n_episodes=20 for testing, suggesting a preference for statistical robustness.
您使用 n_episodes=10 进行验证，使用 n_episodes=20 进行测试，这表明您偏好统计稳健性。
However, calling test_env_vec the “all backtest” implies you might want a single, detailed historical run for the final report, especially with quantstats=True.
但是，将test_env_vec称为“all backtest”意味着您可能希望对最终报告进行单个、详细的历史运行，尤其是在 quantstats=True 的情况下。

Combining the Best of Both
结合两者的优点

Here’s a combined evaluate() function that:
下面是一个组合的 evaluate（）函数：

Supports multiple episodes for robustness.
支持多个剧集以实现稳健性。
Tracks detailed balance and timestamp data for a rich QuantStats report.
跟踪详细的余额和时间戳数据，以获得丰富的 QuantStats 报告。
Avoids redundancy by integrating all metrics in one call.
通过将所有指标集成到一个调用中来避免冗余。

import os
import pandas as pd
import numpy as np
import quantstats as qs

def evaluate(model, env_vec, n_episodes=10, return_mean_reward=False, deterministic=True, quantstats=False):
    total_rewards = []
    total_profits = []
    metrics = []
    if quantstats:
        all_balances = []
        all_timestamps = []
        all_steps = []
        all_trades = []

    for episode in range(n_episodes):
        obs = env_vec.reset()
        done = np.array([False] * env_vec.num_envs)
        episode_rewards = np.zeros(env_vec.num_envs)
        episode_profit = 0
        step_count = 0
        max_steps = len(env_vec.envs[0].dt_datetime)

        if quantstats:
            balances = [env_vec.envs[0].balance]  # Initial balance
            timestamps = [env_vec.envs[0].dt_datetime[env_vec.envs[0].current_step]]
            steps_taken = [env_vec.envs[0].current_step]
            trades = []

        while not np.all(done) and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=deterministic)
            step_result = env_vec.step(action)
            if len(step_result) == 4:
                obs, rewards, done, info = step_result
            else:
                obs, rewards, terminated, truncated, info = step_result
                done = np.logical_or(terminated, truncated)

            episode_rewards += rewards
            step_count += 1

            if quantstats:
                current_step = env_vec.envs[0].current_step
                timestamp = env_vec.envs[0].dt_datetime[current_step]
                timestamps.append(timestamp)
                steps_taken.append(current_step)
                balances.append(env_vec.envs[0].balance)
                if info[0]["Close"]:
                    trades.extend(info[0]["Close"])
                    for tr in info[0]["Close"]:
                        episode_profit += tr["Reward"]

            print(f"Episode {episode}, Step {step_count}: Action={action}, Reward={rewards[0]:.2f}")

        total_rewards.extend(episode_rewards)
        total_profits.append(episode_profit)
        metrics.append(env_vec.envs[0].analyze_transaction_history())

        if quantstats:
            all_balances.extend(balances)
            all_timestamps.extend(timestamps)
            all_steps.extend(steps_taken)
            all_trades.extend(trades)

    mean_reward = np.mean(total_rewards)
    mean_profit = np.mean(total_profits)
    avg_metrics = {k: np.mean([m[k] for m in metrics]) for k in ["trades", "win_rate", "profit_factor", "sharpe_ratio", "total_profit"]}

    if quantstats:
        # Create performance DataFrame with actual timestamps
        perf_df = pd.DataFrame({"time": all_timestamps, "balance": all_balances, "step": all_steps})
        if perf_df["time"].duplicated().any():
            perf_df = perf_df.groupby("time").agg({"balance": "mean", "step": "max"}).reset_index()
        perf_df.set_index("time", inplace=True)
        returns = perf_df["balance"].pct_change().fillna(0)

        # Trade-level analysis
        trades_df = pd.DataFrame(all_trades) if all_trades else pd.DataFrame(columns=["Reward"])
        profits = trades_df["Reward"] if not trades_df.empty else pd.Series()
        win_rate = len(profits[profits > 0]) / len(profits) if len(profits) > 0 else 0
        profit_factor = profits[profits > 0].sum() / abs(profits[profits < 0].sum()) if profits[profits < 0].sum() != 0 else float('inf')

        # QuantStats report
        output_dir = "data/log"
        os.makedirs(output_dir, exist_ok=True)
        file_name = f"xauusd_test_{n_episodes}_episodes.html"
        qs.reports.html(
            returns,
            output=os.path.join(output_dir, file_name),
            title=f"XAUUSD Test Performance ({n_episodes} Episodes)"
        )

        print(f"QuantStats Report Saved: {output_dir}/{file_name}")
        print(f"Sharpe Ratio: {qs.stats.sharpe(returns):.2f}")
        print(f"Max Drawdown: {qs.stats.max_drawdown(returns):.2%}")
        print(f"Win Rate: {win_rate:.2%}")
        print(f"Profit Factor: {profit_factor:.2f}")

    print(f"Deterministic={deterministic}, Episodes={n_episodes}, Mean Reward: {mean_reward:.2f}, Mean Profit: {mean_profit:.2f}")
    print(f"Average Metrics: {avg_metrics}")

    return mean_reward if return_mean_reward else mean_profit

# Usage in your final test
print("\nEvaluating Final Model on Test Data:")
test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=20, deterministic=True, quantstats=True)
print(f"Test Average Profit: {test_avg_profit:.2f}")

Comparison of Combined Solution
组合解决方案的比较

Advantages 优势

Flexibility: 灵活性：
- Works with any n_episodes (1 for a single backtest, 20 for robustness).
  适用于任何n_episodes（1 个用于单个回测，20 个用于稳健性）。
- Enables quantstats=False for quick evaluations without detailed reporting.
  启用 quantstats=False 以进行快速评估，而无需详细报告。
Detailed QuantStats: 详细的 QuantStats：
- Tracks balance and timestamps like Approach 1, using real dt_datetime for accurate time-series returns.
  像方法 1 一样跟踪余额和时间戳，使用真实dt_datetime获得准确的时间序列回报。
- Adds trade-level analysis (win rate, profit factor) within the function.
  在函数中添加交易级分析（赢率、利润因子）。
Statistical Robustness: 统计稳健性：
- Supports multiple episodes like Approach 2, averaging metrics across runs.
  支持像方法 2 一样的多个事件，平均运行之间的指标。
Unified Output: 统一输出：
- Combines mean reward/profit, average metrics, and QuantStats in one call—no separate evaluate() needed.
  在一次调用中组合平均奖励/利润、平均指标和 QuantStats，无需单独的 evaluate（）。

Trade-Offs 权衡

Complexity: More code within evaluate(), but it’s reusable and avoids external redundancy.
复杂性：evaluate（）中有更多的代码，但它是可重用的，并避免了外部冗余。
Performance: With quantstats=True and high n_episodes, it’s slower due to detailed tracking (e.g., 20 episodes * max_steps).
性能：使用 quantstats=True 且 n_episodes 较高时，由于详细跟踪（例如，20 集 * max_steps），速度会变慢。

Final Recommendation 最终建议

Best Option: Use the combined evaluate() function.
最好的选择：使用组合的 evaluate（）函数。
- Why: It offers the best of both worlds:
  为什么：它提供了两全其美的优势：
  - Detailed backtest capability (balance tracking, real timestamps) for a single run (n_episodes=1).
    单次运行的详细回测功能（余额跟踪、真实时间戳）（n_episodes=1）。
  - Robust statistical evaluation across multiple episodes (n_episodes=20).
    跨多次发作的稳健统计评估（n_episodes=20）。
  - Comprehensive QuantStats reporting integrated seamlessly.
    无缝集成全面的 QuantStats 报告。
- How to Use: 如何使用：
  - For a single backtest: evaluate(..., n_episodes=1, quantstats=True).
    对于单个回测：evaluate（...， n_episodes=1， quantstats=True）。
  - For robust testing: evaluate(..., n_episodes=20, quantstats=True) (your current setup).
    对于稳健的测试：evaluate（...， n_episodes=20， quantstats=True）（您当前的设置）。
Your Current Code: Sticking with n_episodes=20 and integrating detailed QuantStats as above aligns with your validation rigor and backtest goals, maximizing insight from test_env_vec.
您当前的代码：坚持使用 n_episodes=20 并集成上述详细的 QuantStats 与您的验证严谨性和回测目标保持一致，从而最大限度地提高test_env_vec的洞察力。

Let me know if you want further tweaks (e.g., simplifying for n_episodes=1 only)!
如果您想要进一步的调整，请告诉我（例如，仅简化 n_episodes=1）！

when if done happen, means the current episode is end, shouldn't it be directly reset the environment? why need to decrease 1 current holding? ``` def _calculate_reward(self, i, done, action): _total_reward = 0 _max_draw_down = 0 for tr in self.transaction_live[:]: if tr["Symbol"] == self.assets[i]: _point = self.cf.symbol(self.assets[i], "point") if tr["Type"] == 0: # Buy _sl_price = tr["ActionPrice"] - tr["SL"] / _point _pt_price = tr["ActionPrice"] + tr["PT"] / _point if done: p = (self._c - tr["ActionPrice"]) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p self.current_holding[i] -= 1 # Fix: Decrement here elif self._l <= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._h >= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: # ... (drawdown logic) elif tr["Type"] == 1: # Sell _sl_price = tr["ActionPrice"] + tr["SL"] / _point _pt_price = tr["ActionPrice"] - tr["PT"] / _point if done: p = (tr["ActionPrice"] - self._c) * _point self._manage_tranaction(tr, p, self._c, status=2) _total_reward += p self.current_holding[i] -= 1 # Fix: Decrement here elif self._h >= _sl_price: self._manage_tranaction(tr, -tr["SL"], _sl_price) _total_reward += -tr["SL"] self.current_holding[i] -= 1 elif self._l <= _pt_price: self._manage_tranaction(tr, tr["PT"], _pt_price) _total_reward += tr["PT"] self.current_holding[i] -= 1 else: # ... (drawdown logic) if _max_draw_down > self.max_draw_downs[i]: self.max_draw_downs[i] = _max_draw_down return _total_reward ```
如果 DONE 发生，则表示当前剧集结束，难道不是应该直接重置环境吗？为什么需要减少 1 个当前持有量？ ``` def _calculate_reward（self， i， done， action）： _total_reward = 0 _max_draw_down = 0 对于 self.transaction_live 中的 tr[：]： if tr[“Symbol”] == self.assets[i]： _point = self.cf.symbol（self.assets[i]， “点”） if tr[“Type”] == 0： # 买入 _sl_price = tr[“ActionPrice”] - tr[“SL”] / _point _pt_price = tr[“ActionPrice”] + tr[“PT”] / _point 如果完成： p = （self._c - tr[“ActionPrice”]） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p self.current_holding[i] -= 1 # 修复：此处递减 elif self._l <= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._h >= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 还： # ...（Drawdown Logic） elif tr[“Type”] == 1： # 卖出 _sl_price = tr[“ActionPrice”] + tr[“SL”] / _point _pt_price = tr[“ActionPrice”] - tr[“PT”] / _point 如果完成： p = （tr[“ActionPrice”] - self._c） * _point self._manage_tranaction（tr， p， self._c， status=2） _total_reward += p self.current_holding[i] -= 1 # 修复：此处递减 elif self._h >= _sl_price： self._manage_tranaction（tr， -tr[“SL”]， _sl_price） _total_reward += -tr[“SL”] self.current_holding[i] -= 1 elif self._l <= _pt_price： self._manage_tranaction（tr， tr[“PT”]， _pt_price） _total_reward += tr[“PT”] self.current_holding[i] -= 1 还： # ...（Drawdown Logic）如果 _max_draw_down > self.max_draw_downs[i]： self.max_draw_downs[i] = _max_draw_down 返回 _total_reward ```

Your question is insightful: when done happens, indicating the end of an episode, why do we need to decrease current_holding by 1 in _calculate_reward() instead of relying on the environment reset to handle it? Let’s break this down and clarify the reasoning, then address whether resetting the environment directly would suffice.
你的问题很有见地：当 DONE 发生时，表示一集的结束，为什么我们需要在 _calculate_reward（）中将current_holding减少 1，而不是依靠环境重置来处理它？让我们分解一下并澄清原因，然后解决直接重置环境是否足够。

Understanding done in Your Environment
在您的环境中完成理解

In your tgym environment, done is set in the step() method:
在你的 tgym 环境中，done 在 step（）方法中设置：

def step(self, actions):
    self.current_step += 1
    terminated = self.balance <= 0  # Bankruptcy
    truncated = self.current_step == len(self.dt_datetime) - 1  # Max steps reached
    done = terminated or truncated  # Combined done flag
    # ... (rest of step logic)
    return obs, total_reward, terminated, truncated, info

done=True Means Episode End:
done=真表示剧集结束：
- Either the agent goes bankrupt (balance <= 0) or the episode reaches the end of the data (current_step == max_steps).
  代理破产（余额 <= 0）或剧集到达数据末尾（current_step == max_steps）。
What Happens Next: 接下来会发生什么：
- In a reinforcement learning loop (e.g., Stable-Baselines3’s learn()), when done=True, the environment is reset via reset() before the next episode begins.
  在强化学习循环中（例如，Stable-Baselines3 的 learn（）），当 done=True 时，环境会在下一集开始之前通过 reset（）重置。

In reset(): 在 reset（）中：

def reset(self, seed=None, options=None):
    if seed is not None:
        self._seed(seed)
    self.current_step = 0 if not self.random_start else random.choice(range(int(len(self.dt_datetime) * 0.5)))
    self.balance = self.balance_initial
    self.transaction_live = []
    self.transaction_history = []
    self.transaction_limit_order = []
    self.current_holding = [0] * len(self.assets)  # Reset to 0
    # ... (rest of reset logic)
    return obs, info

current_holding Reset: self.current_holding = [0] * len(self.assets) resets the number of open trades to 0 for all assets.
current_holding reset：self.current_holding = [0] * len（self.assets）将所有资产的未平仓交易数量重置为 0。

Why Decrease current_holding in _calculate_reward()?
为什么在 _calculate_reward（）中减少 current_holding？

Current Logic 电流逻辑

In _calculate_reward():
在 _calculate_reward（）中：

if done:
    p = (self._c - tr["ActionPrice"]) * _point  # Buy
    self._manage_tranaction(tr, p, self._c, status=2)
    _total_reward += p
    self.current_holding[i] -= 1  # Decrement here

When done=True: All open trades (self.transaction_live) are closed at the current price (self._c), simulating a forced closure (e.g., end-of-episode liquidation).
当 done=True 时：所有未平仓交易（self.transaction_live）均以当前价格（self._c）平仓，模拟强制平仓（例如，事件结束时清算）。
self.current_holding[i] -= 1: Reduces the count of open trades for the asset as each trade is closed.
self.current_holding[i] -= 1：在每笔交易平仓时减少资产的未平仓交易数量。

Why Not Rely on reset() Alone?
为什么不单独依赖 reset（）呢？

Consistency Within the Episode:
剧集内的一致性：
- _calculate_reward() is called within step() via _take_action():
  _calculate_reward（）在 step（）中通过 _take_action（）调用：
  python 蟒
  def step(self, actions): base_reward = self._take_action(actions, done) # ... (reward calculation) return obs, total_reward, terminated, truncated, info
- When done=True, the step() method processes the final state before reset() is called by the training loop.
  当 done=True 时，step（）方法在训练循环调用 reset（）之前处理最终状态。
- If current_holding isn’t decremented here, it remains artificially high during the final step’s computations (e.g., observation, reward logging), misrepresenting the state.
  如果 current_holding 在此处没有递减，则在最后一步的计算（例如，观察、奖励记录）期间，它仍然人为地保持较高水平，从而歪曲了状态。
Observation Accuracy: 观察精度：
- The observation returned by step() includes current_holding:
  step（）返回的观察结果包括current_holding：
  python 蟒
  obs = { "portfolio_data": np.array( [self.balance, self.total_equity, self.max_draw_down_pct] + self.current_holding + self.current_draw_downs, dtype=np.float32 ) }
- If current_holding isn’t updated to reflect closed trades, the final observation (sent to the agent or logged) shows an incorrect number of open positions (e.g., 50 instead of 0), even though transaction_live is empty after _manage_tranaction().
  如果 current_holding 没有更新以反映已平仓交易，则最终观察结果（发送给代理或记录）显示不正确的未平仓头寸数量（例如，50 而不是 0），即使transaction_live 在 _manage_tranaction（）后为空。
Reward Integrity: 奖励诚信：
- Closing trades contributes to _total_reward. Without decrementing current_holding, it might appear (e.g., in debug logs or metrics) that trades are still open, confusing analysis of the episode’s outcome.
  平仓有助于_total_reward。在不减少current_holding的情况下，交易可能看起来（例如，在调试日志或指标中）仍然开放，从而混淆对剧集结果的分析。
Reset Happens After step():
重置发生在 step（）之后：
- reset() is called after the step() that returns done=True, not within it:
  reset（）在返回 done=True 的 step（）之后调用，而不是在其中调用：
  python 蟒
  # Pseudo RL loop obs = env.reset() while True: action = model.predict(obs) obs, reward, done, info = env.step(action) if done: obs = env.reset() # Reset happens here, not in step()
- Thus, current_holding must be adjusted in _calculate_reward() to reflect the state at the end of the episode, before reset() wipes it.
  因此，在 reset（）擦除之前，必须在 _calculate_reward（）中调整 current_holding 以反映剧集结束时的状态。

Example Scenario 示例场景

State Before done=True:
done 之前的状态=True：
- current_holding[i] = 50
- transaction_live has 50 open trades.
  transaction_live 有 50 笔未平仓交易。
Step with done=True:
使用 done=True 的步骤：
- _calculate_reward() closes all 50 trades via _manage_tranaction(), emptying transaction_live.
  _calculate_reward（）通过 _manage_tranaction（）关闭所有 50 笔交易，清空 transaction_live。
- Without self.current_holding[i] -= 1, current_holding[i] stays at 50, even though no trades are open.
  如果没有 self.current_holding[i] -= 1，current_holding[i] 保持 50，即使没有交易未平仓。
- Observation reports current_holding[i] = 50, which is wrong.
  观察报告 current_holding[i] = 50，这是错误的。
After Reset: current_holding[i] = 0, but this is too late for the final step’s output.
重置后：current_holding[i] = 0，但这对于最后一步的输出来说太晚了。

Should It Reset the Environment Directly?
它应该直接重置环境吗？

Why Not Reset in step():
为什么不在 step（）中重置：
1. Gym API Violation: Gym API 冲突：
  - The Gym interface expects step() to return (obs, reward, done, info) and not reset the environment internally. Resetting is the caller’s responsibility (e.g., the RL algorithm).
    Gym 接口期望 step（）返回（obs， reward， done， info）而不是在内部重置环境。重置是调用方的责任（例如，RL 算法）。
  - Resetting in step() breaks this contract, potentially causing issues with frameworks like Stable-Baselines3.
    在 step（）中重置会破坏此协定，可能会导致 Stable-Baselines3 等框架出现问题。
2. Loss of Final State: 最终状态的丢失：
  - If step() resets immediately when done=True, the final observation and reward (reflecting the closed trades) wouldn’t be returned to the agent, disrupting learning or logging.
    如果 step（）在 done=True 时立即重置，则最终观察和奖励（反映已平仓交易）不会返回给代理，从而中断学习或日志记录。
3. Multiple Steps in Episode:
  剧集中的多个步骤：
  - Your loop in _calculate_reward() processes all transaction_live trades when done=True. Resetting mid-loop would interrupt this, leaving some trades unclosed.
    当您在 _calculate_reward（）中的循环处理所有 done=True 时transaction_live交易。重置循环中段会打断这一点，使一些交易未平仓。
Correct Approach: 正确的方法：
- Close trades and update current_holding in _calculate_reward() when done=True.
  当 done=True 时，在 _calculate_reward（）中关闭交易并更新current_holding。
- Let the RL loop call reset() after receiving the final step() output.
  让 RL 循环在收到最后的 step（）输出后调用 reset（）。

Is Decrementing current_holding Necessary?
递减current_holding必要吗？

Yes, It’s Necessary: 是的，这是必要的：
- current_holding must reflect the number of open trades (len(transaction_live)) at every step, including the final one.
  current_holding必须反映每个步骤（包括最后一步）的未平仓交易数量（len（transaction_live））。
- Without decrementing, current_holding becomes inconsistent with transaction_live (0 trades) during the done=True step, affecting:
  如果不递减，current_holding在 done=True 步骤中变得与 transaction_live（0 笔交易）不一致，从而影响：
  - Final observation accuracy.
    最终观察精度。
  - Debugging or logging (e.g., print(f"Holding={self.current_holding[i]}")).
    调试或日志记录（例如，print（f“Holding={self.current_holding[i]}”））。
  - Metrics like analyze_transaction_history() that might rely on current_holding.
    像 analyze_transaction_history（）这样的指标可能依赖于 current_holding。
Alternative (Not Recommended):
备选方案（不推荐）：
- You could skip decrementing in _calculate_reward() and reset current_holding in step() when done=True:
  当 done=True 时，你可以在 _calculate_reward（）中跳过递减并在 step（）中重置 current_holding：
  python 蟒
  def step(self, actions): self.current_step += 1 terminated = self.balance <= 0 truncated = self.current_step == len(self.dt_datetime) - 1 done = terminated or truncated base_reward = self._take_action(actions, done) # ... (reward calculation) if done: for i in range(len(self.assets)): self.current_holding[i] = 0 # Reset here instead return obs, total_reward, terminated, truncated, info
- Downside: 缺点：
  - Still risks observation inconsistency if current_holding isn’t updated before obs is computed.
    如果在计算 obs 之前未更新 current_holding 仍然存在观察不一致的风险。
  - Redundant with reset(), complicating maintenance.
    与 reset（）冗余，使维护复杂化。

Final Answer 最终答案

Why Decrease current_holding Instead of Resetting?
为什么减少 current_holding 而不是重置？
- done=True ends the episode, but step() must return a valid final state (obs, reward) before reset() is called externally.
  done=True 结束剧集，但 step（）必须在外部调用 reset（）之前返回有效的最终状态（obs， reward）。
- Decrementing current_holding ensures the state (e.g., observation) reflects all trades being closed during the final step, maintaining consistency with transaction_live.
  递减current_holding确保状态（例如，观察）反映在最后一步关闭的所有交易，从而与transaction_live保持一致。
Should It Reset Directly?
应该直接重置吗？
- No, resetting the environment in step() or _calculate_reward() violates the Gym API and skips returning the final state. The RL loop handles reset() post-done.
  不会，在 step（）或 _calculate_reward（）中重置环境违反了 Gym API，并跳过返回最终状态。RL 循环在 done-done 后处理 reset（）。
Conclusion: 结论：
- Keep self.current_holding[i] -= 1 in _calculate_reward() when done=True. It’s the correct fix to align current_holding with the actual number of open trades (0) at episode end, before reset() takes over.
  当 done=True 时，将 self.current_holding[i] -= 1 保留在 _calculate_reward（）中。在 reset（）接管之前，在剧集结束时，current_holding与实际未平仓交易的数量（0）保持一致是正确的解决方法。

Your intuition about done signaling an episode end is spot-on, but the decrement is needed to bridge the gap between closing trades and the next reset. Let me know if you want to explore this further!
您对完成标志着剧集结束的直觉是正确的，但需要递减来弥合平仓交易和下一次重置之间的差距。如果您想进一步探索这个问题，请告诉我！

for this code, why after model learn, need to run a evaluate() on the val_env_vec for 10 episodes? becase the eval_callback already done the evaluate(). should this " val_avg_profit = evaluate(model, val_env_vec, n_episodes=10, deterministic=True)" change to only run 1 episodes on the test_env_vec? ``` class EarlyStoppingCallback(EvalCallback): def __init__(self, eval_env, eval_freq, n_eval_episodes, patience, min_delta, verbose=0, use_profit=False, best_model_save_path=None): super().__init__( eval_env=eval_env, eval_freq=eval_freq, n_eval_episodes=n_eval_episodes, verbose=verbose, deterministic=True # Always deterministic during evaluation ) self.patience = patience self.min_delta = min_delta self.best_metric = -float('inf') self.no_improvement_count = 0 self.use_profit = use_profit # Toggle between reward and profit self.best_model_save_path = best_model_save_path def _on_step(self): continue_training = super()._on_step() if not continue_training: return False if self.last_mean_reward is not None: # Use profit or reward based on use_profit flag current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit, deterministic=True) if current_metric > self.best_metric + self.min_delta: self.best_metric = current_metric self.no_improvement_count = 0 if self.verbose > 0: print(f"New best {'profit' if self.use_profit else 'reward'}: {self.best_metric:.2f}") if self.best_model_save_path: self.model.save(self.best_model_save_path) else: self.no_improvement_count += 1 if self.verbose > 0: print(f"No improvement for {self.no_improvement_count}/{self.patience} evaluations") if self.no_improvement_count >= self.patience: if self.verbose > 0: print(f"Early stopping triggered after {self.patience} evaluations without improvement") return False return True # Objective function for Optuna def objective(trial): # Define hyperparameter search space learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True) n_steps = trial.suggest_int("n_steps", 1024, 8192, step=1024) total_timesteps = trial.suggest_int("total_timesteps", 500000, 2000000, step=100000) batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512]) gamma = trial.suggest_float("gamma", 0.9, 0.9999) gae_lambda = trial.suggest_float("gae_lambda", 0.8, 0.99) clip_range = trial.suggest_float("clip_range", 0.1, 0.3) ent_coef = trial.suggest_float("ent_coef", 1e-3, 0.1, log=True) # Train PPO model on training set model = PPO( CustomMultiInputPolicy, train_env_vec, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, n_epochs=100, # Large fixed value gamma=gamma, gae_lambda=gae_lambda, clip_range=clip_range, ent_coef=ent_coef, # Entropy coefficient for exploration verbose=1, # 0: No output during training, # 1: Prints basic training progress, # 2: More detailed output (Additional details like optimization steps, loss values (e.g., policy loss, value loss), and learning rate updates.) ) eval_callback = EarlyStoppingCallback( eval_env=val_env_vec, eval_freq=5000, # More frequent n_eval_episodes=20, # More episodes for stability patience=5, # Stop if no improvement after 5 evaluations min_delta=0.01, # Minimum improvement to consider verbose=1, use_profit=True, # Track profit for trading focus best_model_save_path=f"./best_model/trial_{trial.number}/" # Save best model ) # Training uses deterministic=False implicitly in PPO's collect_rollouts model.learn(total_timesteps=total_timesteps, callback=eval_callback) # Evaluation uses deterministic=True val_avg_profit = evaluate(model, val_env_vec, n_episodes=10, deterministic=True) print(f"Validation Average Profit: {val_avg_profit:.2f}") return val_avg_profit # Maximize reward ```

Your question highlights a potential redundancy in the code: why run evaluate(model, val_env_vec, n_episodes=10) after model.learn() when EarlyStoppingCallback already evaluates on val_env_vec during training? Let’s analyze this, determine its necessity, and decide whether it should be replaced with a single-episode evaluation on test_env_vec.

Current Workflow in objective()

Model Training:
- model.learn(total_timesteps=..., callback=eval_callback) trains the PPO model with the EarlyStoppingCallback.
- During training, the callback evaluates the model on val_env_vec every eval_freq=5000 steps with n_eval_episodes=20.
Callback Evaluation:
- In EarlyStoppingCallback._on_step():
  python
  current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, return_mean_reward=not self.use_profit, deterministic=True)
- Runs evaluate() on val_env_vec with 20 episodes (n_eval_episodes=20) to compute the mean profit (since use_profit=True).
- Tracks the best profit (best_metric) and saves the best model if improved.
Post-Training Evaluation:
- After learn() completes:
  python
  val_avg_profit = evaluate(model, val_env_vec, n_episodes=10, deterministic=True) return val_avg_profit
- Runs evaluate() again on val_env_vec, but with n_episodes=10, and uses this as the objective value for Optuna.

Why Run evaluate() After learn()?

Current Purpose

Optuna Optimization:
- The val_avg_profit returned by objective() is what Optuna uses to compare trials and select the best hyperparameters.
- This post-training evaluation provides a final validation score for the fully trained model (after all total_timesteps).

Redundancy with EarlyStoppingCallback

Overlap:
- The callback already evaluates on val_env_vec periodically (every 5000 steps) with 20 episodes, tracking the best profit and saving the best model.
- If training stops early (e.g., after 5 evaluations without improvement), the final model might not have been evaluated at the end of total_timesteps.
Difference:
- Callback: 20 episodes, runs during training, stops early if no improvement.
- Post-learn(): 10 episodes, runs once after training completes (or stops).

Issues

Redundant Effort:
- If the callback evaluates frequently and saves the best model, why re-evaluate the final model on val_env_vec? The best validation profit is already tracked in best_metric.
- Running 10 episodes post-training repeats work, especially since it’s the same environment (val_env_vec).
Inconsistency:
- Callback uses 20 episodes, but post-training uses 10. This discrepancy might lead to slightly different profit estimates due to sample size.
Missed Test Set:
- The final evaluation in objective() uses val_env_vec, not test_env_vec. In a proper ML workflow, the test set (test_env_vec) should be reserved for the final assessment, not validation.

Should It Be Removed or Changed?

Why It’s There

Final Score for Optuna:
- Optuna needs a single scalar to optimize. The post-learn() evaluate() ensures a consistent final score for each trial, even if early stopping cuts training short.
Post-Training Check:
- It verifies the model’s performance after all total_timesteps, which might differ from the best score during training (e.g., if overfitting occurs late).

Why It’s Redundant

Callback Covers It:
- EarlyStoppingCallback already evaluates regularly and saves the best model based on 20 episodes. The best current_metric could be returned directly instead of re-evaluating.
Best Model Saved:
- The callback saves the best model to best_model_save_path. The final model after learn() isn’t necessarily the best—it’s just the last state.

Proposed Change

Remove Redundant Evaluation on val_env_vec:
- Instead of running evaluate() post-learn(), use the callback’s best_metric as the Optuna objective value.
- This leverages the callback’s rigorous 20-episode evaluations and avoids redundant computation.
Replace with test_env_vec (1 Episode)?
- Not Ideal Here:
  - objective() is for hyperparameter tuning, which should optimize on validation data (val_env_vec), not the test set (test_env_vec). Using test_env_vec in objective() risks overfitting hyperparameters to the test set, violating the train-val-test split.
  - A single episode (n_episodes=1) on test_env_vec could be a final check after tuning, not during Optuna optimization.

Recommended Modification

Updated Code

Modify objective() to use the callback’s best_metric and reserve test_env_vec for the final evaluation outside Optuna:

class EarlyStoppingCallback(EvalCallback):
    def __init__(self, eval_env, eval_freq, n_eval_episodes, patience, min_delta, verbose=0, use_profit=False, best_model_save_path=None):
        super().__init__(
            eval_env=eval_env,
            eval_freq=eval_freq,
            n_eval_episodes=n_eval_episodes,
            verbose=verbose,
            deterministic=True
        )
        self.patience = patience
        self.min_delta = min_delta
        self.best_metric = -float('inf')
        self.no_improvement_count = 0
        self.use_profit = use_profit
        self.best_model_save_path = best_model_save_path

    def _on_step(self):
        continue_training = super()._on_step()
        if not continue_training:
            return False

        if self.last_mean_reward is not None:
            current_metric = evaluate(self.model, self.eval_env, self.n_eval_episodes, 
                                    return_mean_reward=not self.use_profit, deterministic=True)
            if current_metric > self.best_metric + self.min_delta:
                self.best_metric = current_metric
                self.no_improvement_count = 0
                if self.verbose > 0:
                    print(f"New best {'profit' if self.use_profit else 'reward'}: {self.best_metric:.2f}")
                if self.best_model_save_path:
                    self.model.save(self.best_model_save_path)
            else:
                self.no_improvement_count += 1
                if self.verbose > 0:
                    print(f"No improvement for {self.no_improvement_count}/{self.patience} evaluations")
                if self.no_improvement_count >= self.patience:
                    if self.verbose > 0:
                        print(f"Early stopping triggered after {self.patience} evaluations without improvement")
                    return False
        return True

    def get_best_metric(self):
        return self.best_metric  # Expose best metric for Optuna

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_int("n_steps", 1024, 8192, step=1024)
    total_timesteps = trial.suggest_int("total_timesteps", 500000, 2000000, step=100000)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    gae_lambda = trial.suggest_float("gae_lambda", 0.8, 0.99)
    clip_range = trial.suggest_float("clip_range", 0.1, 0.3)
    ent_coef = trial.suggest_float("ent_coef", 1e-3, 0.1, log=True)

    model = PPO(
        CustomMultiInputPolicy,
        train_env_vec,
        learning_rate=learning_rate,
        n_steps=n_steps,
        batch_size=batch_size,
        n_epochs=100,
        gamma=gamma,
        gae_lambda=gae_lambda,
        clip_range=clip_range,
        ent_coef=ent_coef,
        verbose=1
    )
    eval_callback = EarlyStoppingCallback(
        eval_env=val_env_vec,
        eval_freq=5000,
        n_eval_episodes=20,
        patience=5,
        min_delta=0.01,
        verbose=1,
        use_profit=True,
        best_model_save_path=f"./best_model/trial_{trial.number}/"
    )

    model.learn(total_timesteps=total_timesteps, callback=eval_callback)
    
    # Use the best metric from the callback instead of re-evaluating
    val_avg_profit = eval_callback.get_best_metric()
    print(f"Best Validation Average Profit from Callback: {val_avg_profit:.2f}")

    return val_avg_profit  # Maximize profit

# ... (Rest of your code: Optuna study, final model training, test evaluation)

# Final test evaluation (outside objective)
best_trial = study.best_trial.number
best_model = PPO.load(f"./best_model/trial_{best_trial}/")
print("\nEvaluating Best Model on Test Data:")
test_avg_profit = evaluate(best_model, test_env_vec, n_episodes=1, deterministic=True, quantstats=True)
print(f"Test Average Profit (1 Episode): {test_avg_profit:.2f}")

Changes

Remove Post-learn() evaluate():
- Replace val_avg_profit = evaluate(model, val_env_vec, n_episodes=10, deterministic=True) with eval_callback.get_best_metric().
- This uses the best profit from the callback’s 20-episode evaluations, avoiding redundancy.
Keep test_env_vec Separate:
- Move the test evaluation outside objective() to after Optuna tuning, using the best model.
- Set n_episodes=1 for a single backtest on test_env_vec, aligning with a traditional trading evaluation (adjust to 20 if you prefer robustness).

Why This is Better

Efficiency: Avoids re-evaluating on val_env_vec post-training since the callback already does it.
Correct Split: Reserves test_env_vec for final testing, not hyperparameter tuning.
Flexibility:
- val_avg_profit reflects the best validation performance (20 episodes).
- Test evaluation can use 1 episode for a single backtest or more for robustness.

Should It Change to test_env_vec with 1 Episode?

Not in objective():
- Using test_env_vec in objective() is inappropriate for hyperparameter tuning—it’s meant for final evaluation, not optimization. Stick with val_env_vec for validation.
Post-Optuna:
- Yes, evaluate on test_env_vec with n_episodes=1 (or more) after tuning, as shown above. This respects the train-val-test split and aligns with your backtest intent.

When to Use 1 Episode on test_env_vec

Single Backtest: n_episodes=1 simulates one full run over the test period, ideal for a final report.
Robustness Check: If you want statistical confidence, keep n_episodes=20 (as discussed previously).

Final Answer

Why evaluate() After learn()?
- It provides a final validation score for Optuna, but it’s redundant since EarlyStoppingCallback already evaluates on val_env_vec with 20 episodes and tracks the best profit.
Should It Change?
- Yes: Remove val_avg_profit = evaluate(model, val_env_vec, n_episodes=10) and use eval_callback.get_best_metric() in objective().
- No to test_env_vec in objective(): Keep validation separate from testing. Instead, evaluate on test_env_vec outside objective() after tuning.
- 1 Episode on test_env_vec: Good for a single backtest post-tuning (as shown), but use more episodes if you want robustness.

The updated code streamlines the process and respects the ML workflow. Let me know if you’d like further adjustments!

print out the take profit and stop loss for the new openning trade: ``` def _take_action(self, actions, done): # action = math.floor(x), # profit_taken = math.ceil((x- math.floor(x)) * profit_taken_max - stop_loss_max ) # _actions = np.floor(actions).astype(int) # _profit_takens = np.ceil((actions - np.floor(actions)) *self.cf.symbol(self.assets[i],"profit_taken_max")).astype(int) _action = 2 _profit_taken = 0 rewards = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # need use multiply assets for i, action in enumerate(actions): # Actions are now floats between 0 and 3 self._o = self.get_observation(self.current_step, i, "Open") self._h = self.get_observation(self.current_step, i, "High") self._l = self.get_observation(self.current_step, i, "Low") self._c = self.get_observation(self.current_step, i, "Close") self._t = self.get_observation(self.current_step, i, "_time") self._day = self.get_observation(self.current_step, i, "_day") # Extract integer action type and fractional part _action = math.floor(action) # 0=Buy, 1=Sell, 2=Nothing rewards[i] = self._calculate_reward(i, done, _action) # Pass action for exploration reward print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}") if self.cf.symbol(self.assets[i], "limit_order"): self._limit_order_process(i, _action, done) if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")): # Dynamically calculate PT using action fraction _profit_taken = math.ceil( (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max") ) + self.cf.symbol(self.assets[i], "stop_loss_max") self.ticket_id += 1 if self.cf.symbol(self.assets[i], "limit_order"): # Limit order logic transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._l if _action == 0 else self._h, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": -1, "CloseStep": -1, } self.transaction_limit_order.append(transaction) else: # Market order logic transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._c, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": self.current_step, "CloseStep": -1, } self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction) # Debug print(f"Live Transactions: {[tr['Type'] for tr in self.transaction_live]}") return sum(rewards) ```

To print out the take profit (PT) and stop loss (SL) for each newly opened trade in your _take_action() method, you can add a print statement right after creating the transaction dictionary for both market and limit orders. This will display the values of PT and SL whenever a new trade is opened (i.e., when _action is 0 or 1, done is False, and current_holding is below the maximum).

Here’s the modified code with the addition:

def _take_action(self, actions, done):
    _action = 2
    _profit_taken = 0
    rewards = [0] * len(self.assets)
    self.tranaction_open_this_step = []
    self.tranaction_close_this_step = []

    for i, action in enumerate(actions):  # Actions are now floats between 0 and 3
        self._o = self.get_observation(self.current_step, i, "Open")
        self._h = self.get_observation(self.current_step, i, "High")
        self._l = self.get_observation(self.current_step, i, "Low")
        self._c = self.get_observation(self.current_step, i, "Close")
        self._t = self.get_observation(self.current_step, i, "_time")
        self._day = self.get_observation(self.current_step, i, "_day")

        # Extract integer action type and fractional part
        _action = math.floor(action)  # 0=Buy, 1=Sell, 2=Nothing
        rewards[i] = self._calculate_reward(i, done, _action)  # Pass action for exploration reward
        print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}")

        if self.cf.symbol(self.assets[i], "limit_order"):
            self._limit_order_process(i, _action, done)

        if (
            _action in (0, 1) 
            and not done 
            and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")
        ):
            # Dynamically calculate PT using action fraction
            _profit_taken = math.ceil(
                (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max")
            ) + self.cf.symbol(self.assets[i], "stop_loss_max")
            self.ticket_id += 1
            if self.cf.symbol(self.assets[i], "limit_order"):
                # Limit order logic
                transaction = {
                    "Ticket": self.ticket_id,
                    "Symbol": self.assets[i],
                    "ActionTime": self._t,
                    "Type": _action,
                    "Lot": 1,
                    "ActionPrice": self._l if _action == 0 else self._h,
                    "SL": self.cf.symbol(self.assets[i], "stop_loss_max"),
                    "PT": _profit_taken,
                    "MaxDD": 0,
                    "Swap": 0.0,
                    "CloseTime": "",
                    "ClosePrice": 0.0,
                    "Point": 0,
                    "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"),
                    "DateDuration": self._day,
                    "Status": 0,
                    "LimitStep": self.current_step,
                    "ActionStep": -1,
                    "CloseStep": -1,
                }
                print(f"New Limit Order - Asset: {self.assets[i]}, Type: {'Buy' if _action == 0 else 'Sell'}, "
                      f"Take Profit: {_profit_taken}, Stop Loss: {self.cf.symbol(self.assets[i], 'stop_loss_max')}")
                self.transaction_limit_order.append(transaction)
            else:
                # Market order logic
                transaction = {
                    "Ticket": self.ticket_id,
                    "Symbol": self.assets[i],
                    "ActionTime": self._t,
                    "Type": _action,
                    "Lot": 1,
                    "ActionPrice": self._c,
                    "SL": self.cf.symbol(self.assets[i], "stop_loss_max"),
                    "PT": _profit_taken,
                    "MaxDD": 0,
                    "Swap": 0.0,
                    "CloseTime": "",
                    "ClosePrice": 0.0,
                    "Point": 0,
                    "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"),
                    "DateDuration": self._day,
                    "Status": 0,
                    "LimitStep": self.current_step,
                    "ActionStep": self.current_step,
                    "CloseStep": -1,
                }
                print(f"New Market Order - Asset: {self.assets[i]}, Type: {'Buy' if _action == 0 else 'Sell'}, "
                      f"Take Profit: {_profit_taken}, Stop Loss: {self.cf.symbol(self.assets[i], 'stop_loss_max')}")
                self.current_holding[i] += 1
                self.tranaction_open_this_step.append(transaction)
                self.balance -= self.cf.symbol(self.assets[i], "transaction_fee")
                self.transaction_live.append(transaction)
        
    # Debug
    print(f"Live Transactions: {[tr['Type'] for tr in self.transaction_live]}")

    return sum(rewards)

Changes Made

Print Statement for Limit Orders:
- Added after creating the transaction dictionary for limit orders:
  python
  print(f"New Limit Order - Asset: {self.assets[i]}, Type: {'Buy' if _action == 0 else 'Sell'}, " f"Take Profit: {_profit_taken}, Stop Loss: {self.cf.symbol(self.assets[i], 'stop_loss_max')}")
- Displays the asset name, trade type (Buy/Sell), take profit (_profit_taken), and stop loss (stop_loss_max from config).
Print Statement for Market Orders:
- Added after creating the transaction dictionary for market orders:
  python
  print(f"New Market Order - Asset: {self.assets[i]}, Type: {'Buy' if _action == 0 else 'Sell'}, " f"Take Profit: {_profit_taken}, Stop Loss: {self.cf.symbol(self.assets[i], 'stop_loss_max')}")
- Same format as limit orders, showing the PT and SL for the new trade.

Explanation

Where It Prints:
- Inside the if _action in (0, 1) and not done and self.current_holding[i] < ... block, which triggers only when a new trade opens (Buy=0 or Sell=1).
- Separate prints for limit and market orders to distinguish their context.
Values Printed:
- Take Profit (_profit_taken): Dynamically calculated from the fractional part of the action:
  python
  _profit_taken = math.ceil((action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max")) + self.cf.symbol(self.assets[i], "stop_loss_max")
  - Example: If action=0.7, _action=0, profit_taken_max=200, stop_loss_max=100, then _profit_taken = ceil(0.7 * 200) + 100 = 240.
- Stop Loss: Fixed value from self.cf.symbol(self.assets[i], "stop_loss_max") (e.g., 100 from your config).
Sample Output:
text
Asset XAUUSD: Action=0.7, Reward=0, Holding=0 New Market Order - Asset: XAUUSD, Type: Buy, Take Profit: 240, Stop Loss: 100 Live Transactions: [0]

Verification

Config Dependency: Ensure profit_taken_max and stop_loss_max are defined in your config (e.g., gdbusd-test-1.json):
json
"XAUUSD": { "max_current_holding": 50, "transaction_fee": 10, "stop_loss_max": 100, "profit_taken_max": 200 }
Action Range: Actions should be floats between 0 and 2 (e.g., 0 to 1 for Buy, 1 to 2 for Sell), as per your PPO policy output.

This addition will help you debug or monitor the PT and SL settings for each new trade, ensuring they align with your expectations. Let me know if you need further tweaks!

_sl_price = tr["ActionPrice"] - tr["SL"] / _point _pt_price = tr["ActionPrice"] + tr["PT"] / _point _point i set in the config file is 0.01. "stop_loss_max": 300, "profit_taken_max": 100. 1. if current price is 2500, what is the stop loss and take profit? 2. is the _point can set to 0.01?

Let’s analyze your questions step-by-step based on the provided code and configuration details:

Relevant Code and Config

From your _calculate_reward() function:

_point = self.cf.symbol(self.assets[i], "point")  # From config, e.g., 0.01
_sl_price = tr["ActionPrice"] - tr["SL"] / _point  # Stop-loss price for Buy
_pt_price = tr["ActionPrice"] + tr["PT"] / _point  # Take-profit price for Buy

From your config:

"point": 0.01
"stop_loss_max": 300
"profit_taken_max": 100

From _take_action() (previous context):

tr["SL"] = self.cf.symbol(self.assets[i], "stop_loss_max") # e.g., 300
tr["PT"] = _profit_taken # Dynamically calculated, but capped at profit_taken_max (100 in config)

Assumptions:

tr["ActionPrice"] is the price at which the trade was opened (e.g., current price when the trade was initiated).
For simplicity, let’s assume this is a Buy trade (Type == 0), as the formulas differ slightly for Sell trades.

1. If Current Price is 2500, What is the Stop Loss and Take Profit?

Setup

Current Price (tr["ActionPrice"]): 2500 (assuming this is the price when the trade opens, e.g., self._c in _take_action()).
Config Values:
- _point = 0.01
- tr["SL"] = 300 (from stop_loss_max)
- tr["PT"] = 100 (assuming _profit_taken equals profit_taken_max for this example; it could vary based on the action fraction).

Stop-Loss Calculation

For a Buy trade:

_sl_price = tr["ActionPrice"] - tr["SL"] / _point

tr["ActionPrice"] = 2500
tr["SL"] = 300
_point = 0.01

_sl_price = 2500 - \frac{300}{0.01} = 2500 - 30000 = -27500

Stop-Loss Price: -27,500
Interpretation: This is an unrealistic negative price, indicating a potential issue (more on this below).

Take-Profit Calculation

For a Buy trade:

_pt_price = tr["ActionPrice"] + tr["PT"] / _point

tr["ActionPrice"] = 2500
tr["PT"] = 100
_point = 0.01

_pt_price = 2500 + \frac{100}{0.01} = 2500 + 10000 = 12500

Take-Profit Price: 12,500

Reality Check

Current Price: 2500 (e.g., XAUUSD in USD per ounce, a realistic gold price).
Stop-Loss Price: -27,500 (impossible, as prices can’t go negative).
Take-Profit Price: 12,500 (extremely high, unlikely for gold unless massive leverage or a misunderstanding).

Problem: The calculations yield absurd results because _point = 0.01 doesn’t align with the units of SL (300) and PT (100) relative to the asset price (2500). Let’s investigate _point further.

2. Can _point Be Set to 0.01?

What is _point?

_point represents the pip value or the smallest price increment for the asset, typically defined in the trading platform’s context (e.g., MT4/MT5).
For forex pairs (e.g., XAUUSD):
- 1 pip is often 0.01 for gold (XAUUSD), meaning _point = 0.01 is the smallest price movement in USD.
- However, SL and PT are usually specified in pips or points, not raw price units.

Current Issue

Config Mismatch:
- stop_loss_max = 300 and profit_taken_max = 100 are likely intended as pips (e.g., 300 pips = 3.00 USD for XAUUSD if 1 pip = 0.01 USD).
- But dividing by _point = 0.01 in the formula amplifies them massively:
  - 300 / 0.01 = 30,000 (300 pips become 30,000 USD).
  - 100 / 0.01 = 10,000 (100 pips become 10,000 USD).
Correct Interpretation:
- If SL and PT are in pips, and _point = 0.01 is the pip value in USD:
  - Stop-Loss distance: 300 pips * 0.01 = 3 USD
  - Take-Profit distance: 100 pips * 0.01 = 1 USD
- Then:
  - _sl_price = 2500 - 3 = 2497
  - _pt_price = 2500 + 1 = 2501

Can _point Be 0.01?

Yes, but Adjust Units:
- _point = 0.01 is valid for XAUUSD if it’s the pip value (common in forex brokers, where 1 pip for XAUUSD is 0.01 USD).
- However, SL and PT should be in pips, and the formula should multiply by _point, not divide:
  python
  _sl_price = tr["ActionPrice"] - tr["SL"] * _point # For Buy _pt_price = tr["ActionPrice"] + tr["PT"] * _point # For Buy
- Corrected:
  - SL = 300 pips, _sl_price = 2500 - 300 * 0.01 = 2500 - 3 = 2497
  - PT = 100 pips, _pt_price = 2500 + 100 * 0.01 = 2500 + 1 = 2501
No, if Units Are Misaligned:
- If SL and PT are already in USD (not pips), _point shouldn’t be involved, or it should be 1.0:
  python
  _sl_price = tr["ActionPrice"] - tr["SL"] # SL in USD _pt_price = tr["ActionPrice"] + tr["PT"] # PT in USD
  - Then: _sl_price = 2500 - 300 = 2200, _pt_price = 2500 + 100 = 2600.

Corrected Answers

Assumptions

XAUUSD price = 2500 USD/oz.
_point = 0.01 (1 pip = 0.01 USD).
stop_loss_max = 300 pips, profit_taken_max = 100 pips (not USD).

Fixed Formula

Update _calculate_reward():

_sl_price = tr["ActionPrice"] - tr["SL"] * _point  # Multiply, not divide
_pt_price = tr["ActionPrice"] + tr["PT"] * _point  # Multiply, not divide

Stop-Loss Price:
- tr["SL"] = 300, _sl_price = 2500 - 300 * 0.01 = 2500 - 3 = 2497
Take-Profit Price:
- tr["PT"] = 100, _pt_price = 2500 + 100 * 0.01 = 2500 + 1 = 2501

Results

If Current Price = 2500:
- Stop Loss: 2497 USD
- Take Profit: 2501 USD
Is _point = 0.01 Valid?
- Yes, if SL and PT are in pips and you multiply by _point (not divide). For XAUUSD, _point = 0.01 is typical (1 pip = 0.01 USD).
- No, with your current formula (/ _point), as it produces incorrect prices (e.g., -27,500).

Recommendation

Fix the Formula: Change division to multiplication in _calculate_reward():
python
def _calculate_reward(self, i, done, action): _total_reward = 0 for tr in self.transaction_live[:]: if tr["Symbol"] == self.assets[i]: _point = self.cf.symbol(self.assets[i], "point") # 0.01 if tr["Type"] == 0: # Buy _sl_price = tr["ActionPrice"] - tr["SL"] * _point _pt_price = tr["ActionPrice"] + tr["PT"] * _point # ... (rest of logic) elif tr["Type"] == 1: # Sell _sl_price = tr["ActionPrice"] + tr["SL"] * _point _pt_price = tr["ActionPrice"] - tr["PT"] * _point # ... (rest of logic) return _total_reward
Verify Config Units: Confirm stop_loss_max and profit_taken_max are in pips (300 pips = 3 USD, 100 pips = 1 USD for XAUUSD).

With this fix, _point = 0.01 works correctly, and your SL/PT prices will be sensible for a 2500 USD starting price. Let me know if you need further clarification!

why the _profit_taken use action-_action? this action and _action is not same value? ``` def _take_action(self, actions, done): # action = math.floor(x), # profit_taken = math.ceil((x- math.floor(x)) * profit_taken_max - stop_loss_max ) # _actions = np.floor(actions).astype(int) # _profit_takens = np.ceil((actions - np.floor(actions)) *self.cf.symbol(self.assets[i],"profit_taken_max")).astype(int) _action = 2 _profit_taken = 0 rewards = [0] * len(self.assets) self.tranaction_open_this_step = [] self.tranaction_close_this_step = [] # need use multiply assets for i, action in enumerate(actions): # Actions are now floats between 0 and 3 self._o = self.get_observation(self.current_step, i, "Open") self._h = self.get_observation(self.current_step, i, "High") self._l = self.get_observation(self.current_step, i, "Low") self._c = self.get_observation(self.current_step, i, "Close") self._t = self.get_observation(self.current_step, i, "_time") self._day = self.get_observation(self.current_step, i, "_day") # Extract integer action type and fractional part _action = math.floor(action) # 0=Buy, 1=Sell, 2=Nothing rewards[i] = self._calculate_reward(i, done, _action) # Pass action for exploration reward print(f"Asset {self.assets[i]}: Action={action}, Reward={rewards[i]}, Holding={self.current_holding[i]}") if self.cf.symbol(self.assets[i], "limit_order"): self._limit_order_process(i, _action, done) if ( _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding")): # Dynamically calculate PT using action fraction _profit_taken = math.ceil( (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max") ) + self.cf.symbol(self.assets[i], "stop_loss_max") self.ticket_id += 1 if self.cf.symbol(self.assets[i], "limit_order"): # Limit order logic transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._l if _action == 0 else self._h, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": -1, "CloseStep": -1, } # Debug print(f"New Limit Order - Asset: {self.assets[i]}, Type: {'Buy' if _action == 0 else 'Sell'}, " f"Take Profit: {_profit_taken}, Stop Loss: {self.cf.symbol(self.assets[i], 'stop_loss_max')}") self.transaction_limit_order.append(transaction) else: # Market order logic transaction = { "Ticket": self.ticket_id, "Symbol": self.assets[i], "ActionTime": self._t, "Type": _action, "Lot": 1, "ActionPrice": self._c, "SL": self.cf.symbol(self.assets[i], "stop_loss_max"), "PT": _profit_taken, "MaxDD": 0, "Swap": 0.0, "CloseTime": "", "ClosePrice": 0.0, "Point": 0, "Reward": -self.cf.symbol(self.assets[i], "transaction_fee"), "DateDuration": self._day, "Status": 0, "LimitStep": self.current_step, "ActionStep": self.current_step, "CloseStep": -1, } # Debug print(f"New Market Order - Asset: {self.assets[i]}, Type: {'Buy' if _action == 0 else 'Sell'}, " f"Take Profit: {_profit_taken}, Stop Loss: {self.cf.symbol(self.assets[i], 'stop_loss_max')}") self.current_holding[i] += 1 self.tranaction_open_this_step.append(transaction) self.balance -= self.cf.symbol(self.assets[i], "transaction_fee") self.transaction_live.append(transaction) # Debug print(f"Live Transactions: {[tr['Type'] for tr in self.transaction_live]}") return sum(rewards) ```

Let’s dive into why _profit_taken uses action - _action in your _take_action() method and clarify the difference between action and _action. This is a clever design choice that leverages the structure of your action space, so let’s break it down step-by-step.

Key Code Context

In your _take_action() method:

for i, action in enumerate(actions):
    _action = math.floor(action)  # 0=Buy, 1=Sell, 2=Nothing
    # ...
    if _action in (0, 1) and not done and self.current_holding[i] < self.cf.symbol(self.assets[i], "max_current_holding"):
        _profit_taken = math.ceil(
            (action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max")
        ) + self.cf.symbol(self.assets[i], "stop_loss_max")

action: A float value output by your PPO model (e.g., between 0 and 3).
_action: The integer part of action, extracted using math.floor(action) (e.g., 0, 1, or 2).
action - _action: The fractional part of action (e.g., 0.7 if action = 0.7 and _action = 0).

Why Use action - _action for _profit_taken?

Action Space Design

Your PPO model outputs continuous actions (floats) via a Gaussian distribution, typically bounded by your environment’s action space (e.g., [0, 3) as noted in the comment: "Actions are now floats between 0 and 3").
You’ve encoded two pieces of information into this single float:
1. Trade Type (Integer Part):
  - 0 = Buy
  - 1 = Sell
  - 2 = Nothing
  - Extracted with math.floor(action) as _action.
2. Take-Profit Level (Fractional Part):
  - The decimal portion (e.g., 0.0 to 0.999...) determines the take-profit (PT) level relative to profit_taken_max.
Why Separate Them?
- _action decides whether to open a trade (Buy/Sell) or do nothing.
- action - _action scales the take-profit dynamically within a range, allowing the agent to learn not just when to trade but also how aggressive its profit target should be.

How _profit_taken is Calculated

Formula:
python
_profit_taken = math.ceil((action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max")) + self.cf.symbol(self.assets[i], "stop_loss_max")
Fractional Part (action - _action):
- Ranges from 0.0 to <1.0 (e.g., if action = 0.7, then _action = 0, so 0.7 - 0 = 0.7).
Scaling: Multiplies by profit_taken_max (e.g., 100 from your config) to set a PT range.
Offset: Adds stop_loss_max (e.g., 300) to ensure PT exceeds SL, creating a positive risk-reward ratio.
Example:
- Config: "profit_taken_max": 100, "stop_loss_max": 300
- action = 0.7 (Buy):
  - _action = 0
  - action - _action = 0.7
  - _profit_taken = math.ceil(0.7 * 100) + 300 = math.ceil(70) + 300 = 70 + 300 = 370
- Result: Take Profit = 370 pips, Stop Loss = 300 pips.

Why This Approach?

Dynamic PT Control:
- The agent can adjust PT based on its confidence or strategy. A higher fractional part (e.g., 0.9 vs. 0.1) increases PT (90 + 300 = 390 vs. 10 + 300 = 310), allowing flexibility within profit_taken_max.
Single Action Output:
- PPO outputs one float per asset, not separate trade type and PT values. Using action - _action elegantly extracts PT from this single value.
Learning Optimization:
- The agent learns to balance trade initiation (_action) and profit targets (action - _action), optimizing both in one go.

Are action and _action the Same Value?

No, They’re Not the Same:
- action: The raw float from the model (e.g., 0.7, 1.3, 2.4).
- _action: The floored integer part (e.g., 0, 1, 2).
- Difference: action is continuous, while _action is discrete. The fractional part (action - _action) carries additional information.
Examples:
- action = 0.7 → _action = 0, action - _action = 0.7
- action = 1.3 → _action = 1, action - _action = 0.3
- action = 2.4 → _action = 2, action - _action = 0.4
When They’re Equal: Only if action is an exact integer (e.g., 1.0), then action - _action = 0.0, setting PT to the minimum (stop_loss_max).

Does This Make Sense?

Intent

Your design assumes the agent outputs actions where:
- 0.0 to 0.999...: Buy with varying PT (0 to profit_taken_max).
- 1.0 to 1.999...: Sell with varying PT.
- 2.0 to 2.999...: Do nothing (no trade opened).
Current Formula Issue:
- Adding stop_loss_max to PT might not align with your intent. If profit_taken_max is the max PT in pips (e.g., 100), adding stop_loss_max (300) inflates it (e.g., 370 for action = 0.7), possibly exceeding intended bounds.

Suggested Fix

If PT should range from 0 to profit_taken_max:
python
_profit_taken = math.ceil((action - _action) * self.cf.symbol(self.assets[i], "profit_taken_max"))
- action = 0.7 → _profit_taken = math.ceil(0.7 * 100) = 70
- SL = 300, PT = 70 (PT < SL, which might be intentional or a risk-reward choice).
If PT should start at stop_loss_max and go up:
- Current formula is fine, but clarify intent (e.g., PT from 300 to 400 pips).

Final Answer

Why action - _action?
- It extracts the fractional part of action to dynamically set _profit_taken based on the model’s output, allowing the agent to learn optimal PT levels alongside trade type (_action).
Same Value?
- No, action is a float (e.g., 0.7), _action is its integer floor (e.g., 0). The difference (action - _action) is the fractional part used for PT.
  否，action 是浮点数（例如 0.7），_action是其整数下限（例如 0）。差值（action - _action）是用于 PT 的小数部分。
Example Output (Config: profit_taken_max=100, stop_loss_max=300):
输出示例（配置：profit_taken_max=100，stop_loss_max=300）：
- action = 0.7 → _action = 0, _profit_taken = ceil(0.7 * 100) + 300 = 370
  作 = 0.7 → _action = 0，_profit_taken = ceil（0.7 * 100） + 300 = 370
- Print: New Market Order - Asset: XAUUSD, Type: Buy, Take Profit: 370, Stop Loss: 300
  打印：新市价单 - 资产：XAUUSD，类型：买入，止盈：370，止损：300

If PT exceeding SL is unintentional, adjust the formula to cap at profit_taken_max. Let me know if you want to refine this further!
如果 PT 超过 SL 是无意的，请调整公式以达到 profit_taken_max 的上限。如果您想进一步完善此内容，请告诉我！

Root Cause

Solution

Key Changes

Additional Notes 其他说明

Testing the Fix 测试修复程序

Updated Code

Key Changes to Ensure CUDA-Only Usage

Additional Modifications

Verifying the Fix

Final Notes

1. Adding an If-Else Statement for CUDA/CPU Selection

Modified CustomFeaturesExtractor

Modified CustomMultiInputPolicy

2. Will the Two CustomMultiInputPolicy Classes Return the Same Value?

Key Differences Between Code 1 and Code 2

Will They Return the Same Value?

Example Fixed Code 1 (for Comparison)

Conclusion

Fixed Code

CustomFeaturesExtractor

CustomMultiInputPolicy

Key Changes

Why the Original Code Failed

Testing the Fix

Additional Notes 其他说明

Fixed CustomMultiInputPolicy

Key Fixes

Why the Error Occurred

Verifying the Fix

Additional Notes

Fixed CustomMultiInputPolicy

Key Changes

Updating

Fixed CustomMultiInputPolicy

Key Changes

Updating the PPO Instantiation

Why the Error Occurred

Verifying the Fix

Additional Notes

Why This Happens

How to Fix It

Fixed step() Method

Key Changes

Updating the Evaluation Functions

Fixed evaluate Function

Notes for final_evaluate

Why It Returned 4 Values

Verifying the Fix

Additional Notes

Root Cause

Why Both 4 and 5 Values Fail

Diagnosing the Problem

Solution

Step 1: Confirm Stable-Baselines3 Version

Step 2: Fix step() to Return 4 Values第 2 步：修复 step（） 以返回 4 个值

Step 3: Adjust Evaluation Functions第 3 步：调整评估函数

Fixed evaluate Function固定 evaluate 函数

Apply the Same to final_evaluate将相同内容应用于 final_evaluate

Step 4: Handle Version Mismatch (If Needed)第 4 步：处理版本不匹配（如果需要）

Recommended Fix 建议的修复

Verifying the Fix 验证修复

Additional Debugging 其他调试

Why This Happens

Upgrade Path for 5-Tuple Support5 元组支持的升级路径

Upgrade Stable-Baselines3升级 stable-baselines3

Additional Dependencies 其他依赖项

Verify Your Environment 验证您的环境

Update Evaluation Functions更新评估函数

Why Upgrade to Pre-Release?为什么要升级到预发布版？

Alternative: Downgrade for 4-Tuple替代方案：降级为 4 元组

Verify the Fix 验证修复

Current Situation

Stable-Baselines3 v2.5.0 Behavior

Why the Confusion?

Should You Upgrade?

Option 1: Upgrade to Pre-Release (Recommended for 5-Tuple Support)

Option 2: Stay with v2.5.0 and Use 4-Tuple

Option 3: Downgrade (Not Recommended)

Recommendation: Upgrade to Pre-Release

Steps to Upgrade and Test

Step 2: Fix step() to Return 4 Values
第 2 步：修复 step（）以返回 4 个值

Step 3: Adjust Evaluation Functions
第 3 步：调整评估函数

Fixed evaluate Function
固定 evaluate 函数

Apply the Same to final_evaluate
将相同内容应用于 final_evaluate

Step 4: Handle Version Mismatch (If Needed)
第 4 步：处理版本不匹配（如果需要）

Upgrade Path for 5-Tuple Support
5 元组支持的升级路径

Upgrade Stable-Baselines3
升级 stable-baselines3

Update Evaluation Functions
更新评估函数

Why Upgrade to Pre-Release?
为什么要升级到预发布版？

Alternative: Downgrade for 4-Tuple
替代方案：降级为 4 元组

Troubleshooting 故障排除

Step 1: Update step() to Return 5 Values
第 1 步：更新 step（）以返回 5 个值

Step 2: Update Evaluation Functions
第 2 步：更新评估函数

Fixed evaluate Function
固定 evaluate 函数

Fixed final_evaluate Function
固定final_evaluate功能

Step 3: Verify Your SB3 Version
第 3 步：验证您的 SB3 版本

Update step() to Return 5 Values
更新 step（）以返回 5 个值

Update evaluate to Expect 5 Values
将 evaluate 更新为 Expect 5 Values

Overview of CustomMultiInputPolicy
CustomMultiInputPolicy 概述

What is the action Return Value?
什么是作 Return Value？

1. Action Generation in forward()
1. forward（）中的动作生成

2. Action Space and _squash_to_range()
2. Action Space 和 _squash_to_range（）

3. Final action Return Value
3. 最终作返回值

What Does the Action Represent?
动作代表什么？

Key Properties of the action Return Value
作 Return Value 的关键属性

How It’s Used in Your Environment
如何在您的环境中使用它

Potential Issues to Check
需要检查的潜在问题

Examining analyze_transaction_history()
检查 analyze_transaction_history（）

Solution: Exclude Non-Numeric Keys
解决方案：排除非数字键

Verifying with Your Log Output
使用日志输出进行验证

Bonus: Log Output Insight
奖励：Log Output Insight

Modified Code with Header
带标头的修改代码