The primary goal of a phase I clinical trial is to find the maximum tolerable dose of a treatment. In this paper, we propose a new stepwise method based on confidence bounds and information incorporation to determine the maximum tolerable dose among given dose levels. On the one hand, to avoid severe or even fatal toxicity and to reduce the number of experimental subjects, the new method starts from the lowest dose level and proceeds in a stepwise fashion. On the other hand, to improve the accuracy of the recommendation, the final recommendation of the maximum tolerable dose incorporates the information from an additional experimental cohort at the same dose level. Empirical simulation results show that the new method has clear advantages over the modified continual reassessment method.
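To make the stepwise logic concrete, here is a minimal Python sketch. The binomial cohort model, the normal-approximation upper confidence bound, and the target toxicity rate are illustrative assumptions, not the authors' exact design.

```python
import math

def stepwise_mtd(toxicities, cohort_sizes, target=0.3, z=1.645):
    """Escalate from the lowest dose; stop as soon as the upper confidence
    bound on the toxicity rate at a dose exceeds the target rate.

    toxicities[i]   -- observed toxicities in the cohort at dose level i
    cohort_sizes[i] -- number of subjects treated at dose level i
    Returns the index of the recommended dose, or -1 if even the lowest
    dose appears too toxic.
    """
    mtd = -1
    for i, (t, n) in enumerate(zip(toxicities, cohort_sizes)):
        p_hat = t / n
        # normal-approximation upper confidence bound (illustrative choice)
        ucb = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n)
        if ucb > target:
            break          # this dose may be too toxic: stop escalation
        mtd = i            # dose i is acceptable so far
    return mtd

# Example: cohorts of 3 subjects, escalating through four dose levels
print(stepwise_mtd(toxicities=[0, 0, 1, 2], cohort_sizes=[3, 3, 3, 3]))
```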
This paper proposes a reinforcement learning (RL) algorithm to find an optimal scheduling policy that minimizes delay under a given energy constraint in a communication system where environment parameters, such as traffic arrival rates, are not known in advance and can change over time. The problem is formulated as an infinite-horizon constrained Markov decision process (CMDP). To handle the constrained optimization problem, we first adopt the Lagrangian relaxation technique. We then propose Q-greedyUCB, a variant of Q-learning that combines ε-greedy and Upper Confidence Bound (UCB) action selection, to solve the relaxed MDP, and we mathematically prove that Q-greedyUCB converges to an optimal solution. Simulation results show that Q-greedyUCB finds an optimal scheduling strategy and is more efficient than Q-learning with ε-greedy, R-learning, and the average-payoff RL (ARL) algorithm in terms of cumulative regret. We also show that the algorithm can learn and adapt to changes in the environment, so as to obtain an optimal scheduling strategy under a given power constraint in the new environment. Funding: the research fund of Hanyang University (HY-2019-N) and the National Key Research & Development Program 2018YFA0701601.
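The core action-selection idea can be sketched as follows, assuming a tabular Q-learning setting with a Lagrangian-relaxed reward. The bonus constant `c`, the multiplier `lam`, and the discounted update are illustrative choices, not the paper's exact (average-reward) specification.

```python
import math, random
from collections import defaultdict

class QGreedyUCB:
    """Tabular Q-learning whose greedy step is biased by a UCB bonus.
    Reward is Lagrangian-relaxed: r = -delay - lam * energy."""

    def __init__(self, actions, alpha=0.1, gamma=0.99, eps=0.1, c=2.0, lam=1.0):
        self.Q = defaultdict(float)          # Q[(state, action)]
        self.N = defaultdict(int)            # visit counts per (state, action)
        self.t = 0                           # global time step
        self.actions, self.alpha, self.gamma = actions, alpha, gamma
        self.eps, self.c, self.lam = eps, c, lam

    def select(self, s):
        self.t += 1
        if random.random() < self.eps:       # epsilon-greedy exploration
            return random.choice(self.actions)
        # greedy w.r.t. Q plus a UCB exploration bonus for rarely tried actions
        def score(a):
            bonus = self.c * math.sqrt(math.log(self.t) / (self.N[(s, a)] + 1))
            return self.Q[(s, a)] + bonus
        return max(self.actions, key=score)

    def update(self, s, a, delay, energy, s_next):
        r = -delay - self.lam * energy       # Lagrangian-relaxed reward
        target = r + self.gamma * max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])
        self.N[(s, a)] += 1
```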
The authors consider the uniformly most powerful invariant test of the testing problems (I) $H_0: \mu'\Sigma^{-1}\mu \ge C$ vs. $H_1: \mu'\Sigma^{-1}\mu < C$ and (II) $H_{00}: \beta'X'X\beta/\sigma^2 \ge C$ vs. $H_{11}: \beta'X'X\beta/\sigma^2 < C$ under the $m$-dimensional normal population $N_m(\mu, \Sigma)$ and the normal linear model $(Y, X\beta, \sigma^2)$, respectively. Furthermore, an application of the uniformly most powerful invariant test is given.
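For orientation, problem (I) admits a standard invariance reduction to Hotelling's $T^2$; the display below states that classical reduction as background and is not reproduced from the paper.

```latex
% Invariance reduction for problem (I): given an i.i.d. sample
% X_1, \dots, X_n from N_m(\mu, \Sigma), invariant tests depend on the
% data only through Hotelling's statistic
T^2 = n\,\bar{X}'S^{-1}\bar{X},
\qquad
\frac{n-m}{m(n-1)}\,T^2 \sim F_{m,\;n-m}\bigl(n\,\mu'\Sigma^{-1}\mu\bigr),
% so problem (I) becomes a one-sided test about the noncentrality
% parameter n\,\mu'\Sigma^{-1}\mu.
```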
Opinion dynamics have received significant attention in recent years. This paper proposes a bounded-confidence opinion model for a group of agents with two different confidence levels. Each agent in the population is endowed with a confidence interval around her opinion with radius $\alpha d$ or $(1-\alpha)d$, where $\alpha \in (0, 1/2]$ represents the differentiation of confidence levels. We analytically derive the critical confidence bound $d_c = 1/(4\alpha)$ for the two-level opinion dynamics on $\mathbb{Z}$: above this critical value, a single opinion cluster forms with probability 1 regardless of the ratio $p$ of agents with high/low confidence. Extensive numerical simulations illustrate our theoretical results. A clear impact of $p$ on the collective behavior is observed: more agents with high confidence make agreement harder to reach. It is also experimentally revealed that the sharpness of the threshold $d_c$ increases with $\alpha$ but does not depend on $p$.
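A small simulation sketch of the two-level dynamics, assuming the standard Hegselmann-Krause synchronous averaging update on a bounded opinion interval; here `p` is taken, for illustration only, as the fraction of agents with the smaller radius $\alpha d$.

```python
import numpy as np

def two_level_hk(n=200, d=1.0, alpha=0.4, p=0.5, steps=100, seed=0):
    """Hegselmann-Krause dynamics with two confidence levels.
    A fraction p of agents uses radius alpha*d; the rest uses the larger
    radius (1 - alpha)*d (higher confidence)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, n)                      # initial opinions
    radius = np.where(rng.random(n) < p, alpha * d, (1 - alpha) * d)
    for _ in range(steps):
        # each agent averages the opinions within her own confidence radius
        close = np.abs(x[:, None] - x[None, :]) <= radius[:, None]
        x = (close * x[None, :]).sum(axis=1) / close.sum(axis=1)
    return x

opinions = two_level_hk()
clusters = np.unique(np.round(opinions, 3))
print(f"{len(clusters)} opinion cluster(s):", clusters)
```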
Neural architecture search (NAS) plays an important role in many computer vision tasks. However, the high computational cost of forward and backward propagation during the search process restricts its practical application. In this paper, we cast the search process as a multi-armed bandit problem, taking into account the uncertainty caused by the contradiction between the huge search space and the limited number of trials. Bandit NAS optimizes the trade-off between exploitation and exploration for a highly efficient search. Specifically, we sample from a set of operations in each trial, where each operation is weighted by its trial performance plus a bias that allows operations with less training to be selected. We further reduce the search space by abandoning the operation with the lowest potential, significantly reducing the search cost. Experimental results on the CIFAR-10 dataset show that the resulting architecture achieves state-of-the-art accuracy with a search speed approximately two times faster than that of partially connected differentiable architecture search. On ImageNet, we attain a state-of-the-art top-1 accuracy of 75.3% with a search time of 1.8 GPU days. Funding: the National Natural Science Foundation of China (Grant No. 62076016).
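The operation-sampling step can be sketched as a UCB bandit over candidate operations; the weighting, the bias term, and the pruning rule below are illustrative stand-ins for the paper's exact scheme.

```python
import math, random

class OperationBandit:
    """UCB-style sampling over NAS candidate operations.
    Each arm is one operation; its reward is the validation score of
    the architecture trialled with that operation."""

    def __init__(self, ops, c=1.0):
        self.ops = list(ops)
        self.c = c
        self.counts = {op: 0 for op in self.ops}
        self.values = {op: 0.0 for op in self.ops}   # running mean reward
        self.t = 0

    def _ucb(self, op):
        # mean performance plus a bias favouring less-trained operations
        return self.values[op] + self.c * math.sqrt(
            math.log(self.t + 1) / max(self.counts[op], 1))

    def sample(self):
        self.t += 1
        untried = [op for op in self.ops if self.counts[op] == 0]
        if untried:                                   # give every op one trial
            return random.choice(untried)
        return max(self.ops, key=self._ucb)

    def update(self, op, reward):
        self.counts[op] += 1
        self.values[op] += (reward - self.values[op]) / self.counts[op]

    def prune_weakest(self):
        """Abandon the operation with the lowest potential (UCB score)."""
        if len(self.ops) > 1:
            self.ops.remove(min(self.ops, key=self._ucb))
```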
Reinforcement learning (RL) algorithms work well with well-defined rewards, but they fail with sparse/deceptive rewards and require additional exploration strategies. This work introduces a deep exploration method based on an Upper Confidence Bound (UCB) bonus. The proposed method can be plugged into actor-critic algorithms that use deep neural networks as the critic. Based on the regret bound under the linear Markov decision process approximation, we use the feature matrix to calculate the UCB bonus for deep exploration. The proposed method is equivalent to count-based exploration in special cases and is suitable for general situations. Our method uses the last $d$-dimensional feature vector of the critic network and is easy to deploy. We design a simple task, "swim", to demonstrate how the proposed method achieves exploration in sparse/deceptive reward environments. We then perform an empirical evaluation on sparse/deceptive-reward versions of Gym environments and on Ackermann robot control tasks. The results verify that the proposed algorithm performs effective deep exploration in sparse/deceptive reward tasks.
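A sketch of the feature-based bonus, assuming the elliptical-potential form $\beta\sqrt{\varphi^\top\Lambda^{-1}\varphi}$ with $\Lambda$ the regularized covariance of visited feature vectors; `beta`, `lam`, and the feature dimension are illustrative parameters.

```python
import numpy as np

class FeatureUCBBonus:
    """Exploration bonus from the critic's last-layer features.
    Maintains Lambda = lam*I + sum(phi phi^T) over visited features and
    returns beta * sqrt(phi^T Lambda^{-1} phi), the elliptical-potential
    bonus that appears in linear-MDP regret analyses."""

    def __init__(self, d, beta=1.0, lam=1.0):
        self.Lambda = lam * np.eye(d)      # regularized feature covariance
        self.beta = beta

    def update(self, phi):
        phi = np.asarray(phi, dtype=float)
        self.Lambda += np.outer(phi, phi)  # accumulate visited features

    def bonus(self, phi):
        phi = np.asarray(phi, dtype=float)
        # large for directions of feature space that were rarely visited
        return self.beta * float(np.sqrt(phi @ np.linalg.solve(self.Lambda, phi)))

# usage with the critic's d-dimensional feature vector phi:
b = FeatureUCBBonus(d=8)
phi = np.random.randn(8)
b.update(phi)
print(b.bonus(phi))   # intrinsic bonus added to the environment reward
```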