With the rapid development of modern science and technology, traditional randomized controlled trials have become insufficient to meet current scientific research needs, particularly in clinical research. Real-world data studies, which align more closely with actual clinical evidence, have garnered significant attention in recent years. The following is a brief overview of the use of real-world data in drug development, which often involves large sample sizes and analyses covering a relatively diverse population without strict inclusion and exclusion criteria. Real-world data often reflect real clinical practice: treatment options are chosen according to the actual conditions and willingness of patients rather than through random assignment. Analysis based on real-world data also focuses on endpoints highly relevant to clinical benefit and patients' quality of life. Rapidly developing big data technology supports the use of real-world data to accelerate new drug development, serving as an important supplement to traditional clinical trials.
Purpose: Many science, technology and innovation (STI) resources carry several different labels. To automatically assign such labels to an instance of interest, many approaches with good performance on benchmark datasets have been proposed in the literature for the multi-label classification task. Furthermore, several open-source tools implementing these approaches have been developed. However, the characteristics of real-world multi-label patent and publication datasets are not completely in line with those of benchmark ones. Therefore, the main purpose of this paper is to comprehensively evaluate seven multi-label classification methods on real-world datasets. Research limitations: The three real-world datasets differ in the following aspects: statement, data quality, and purposes. Additionally, open-source tools designed for multi-label classification have intrinsic differences in their approaches to data processing and feature selection, which in turn affect the performance of a multi-label classification approach. In the near future, we will enhance experimental precision and reinforce the validity of conclusions by controlling variables more rigorously through expanded parameter settings. Practical implications: The observed Macro F1 and Micro F1 scores on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Approaches leveraging deep learning techniques offer promising solutions by accommodating the hierarchical relationships and interdependencies among labels. With ongoing enhancements in deep learning algorithms and large-scale models, the efficacy of multi-label classification is expected to improve significantly, reaching a level of practical utility in the foreseeable future. Originality/value: (1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.
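The gap between Macro F1 and Micro F1 mentioned above can be illustrated with a small sketch. This is not the evaluation code of the paper; the function names and the toy label sets are assumptions made for illustration. Macro F1 averages the per-label F1 scores, so a rare label counts as much as a frequent one, while Micro F1 pools the counts over all labels first.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred, labels):
    """y_true and y_pred are lists of label sets, one set per document."""
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        per_label.append(f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    macro = sum(per_label) / len(per_label)   # unweighted mean over labels
    micro = f1(tot_tp, tot_fp, tot_fn)        # counts pooled over labels
    return macro, micro


# Hypothetical toy data: label "A" is frequent, label "B" is rare and missed.
y_true = [{"A"}, {"A"}, {"A", "B"}, {"A"}]
y_pred = [{"A"}, {"A"}, {"A"}, {"A"}]
macro, micro = macro_micro_f1(y_true, y_pred, ["A", "B"])
```

In this toy case the missed rare label pulls Macro F1 down much more than Micro F1, which is one reason the two scores are usually reported together on unbalanced document-label distributions.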
Battery pack capacity estimation under real-world operating conditions is important for battery performance optimization and health management, contributing to the reliability and longevity of battery-powered systems. However, complex operating conditions, coupling cell-to-cell inconsistency, and limited labeled data pose great challenges to accurate and robust battery pack capacity estimation. To address these issues, this paper proposes a hierarchical data-driven framework aimed at enhancing the training of machine learning models with fewer labeled data. Unlike traditional data-driven methods that lack interpretability, the hierarchical data-driven framework unveils the "mechanism" of the black box inside the data-driven framework by splitting the final estimation target into cell-level and pack-level intermediate targets. A generalized feature matrix is devised without requiring all cell voltages, significantly reducing the computational cost and memory resources. The generated intermediate target labels and the corresponding features are hierarchically employed to enhance the training of two machine learning models, effectively alleviating the difficulty of learning the relationship from all features with fewer labeled data and addressing the dilemma of requiring extensive labeled data for accurate estimation. Using only 10% of the degradation data, the proposed framework outperforms state-of-the-art battery pack capacity estimation methods, achieving mean absolute percentage errors of 0.608%, 0.601%, and 1.128% for three battery packs whose degradation load profiles represent real-world operating conditions. Its high accuracy, adaptability, and robustness indicate its potential in different application scenarios, which is promising for reducing laborious and expensive aging experiments at the pack level and facilitating the development of battery technology.
Many fields, such as neuroscience, are experiencing a vast proliferation of cellular data, underscoring the need for organizing and interpreting large datasets. A popular approach partitions data into manageable subsets via hierarchical clustering, but objective methods to determine the appropriate classification granularity are missing. We recently introduced a technique to systematically identify when to stop subdividing clusters based on the fundamental principle that cells must differ more between than within clusters. Here we present the corresponding protocol to classify cellular datasets by combining data-driven unsupervised hierarchical clustering with statistical testing. These general-purpose functions are applicable to any cellular dataset that can be organized as two-dimensional matrices of numerical values, including molecular, physiological, and anatomical datasets. We demonstrate the protocol using cellular data from the Janelia MouseLight project to characterize morphological aspects of neurons.
There is a growing body of clinical research on the utility of synthetic data derivatives, an emerging research tool in medicine. In nephrology, clinicians can use machine learning and artificial intelligence as powerful aids in their clinical decision-making while also preserving patient privacy. This is especially important given the epidemiology of chronic kidney disease, renal oncology, and hypertension worldwide. However, there remains a need to create a framework for guidance regarding how to better utilize synthetic data as a practical application in this research.
To improve question answering (QA) performance on real-world web data sets, a new set of question classes and a general answer re-ranking model are defined. Using a pre-defined dictionary and grammatical analysis, the question classifier brings both semantic and grammatical information into information retrieval and machine learning methods in the form of various training features, including the question word, the main verb of the question, the dependency structure, the position of the main auxiliary verb, the main noun of the question, the top hypernym of the main noun, etc. The QA query results are then re-ranked using the question class information. Experiments show that questions in real-world web data sets can be accurately classified by the classifier and that the QA results are clearly improved after re-ranking. This demonstrates that, with both semantic and grammatical information, applications such as QA built upon real-world web data sets can achieve better performance.
Hepatocellular carcinoma (HCC) is a leading cause of cancer-associated mortality worldwide. HCC is an inflammation-associated immunogenic cancer that frequently arises in chronically inflamed livers. Advanced HCC is managed with systemic therapies; the tyrosine kinase inhibitor (TKI) sorafenib has been used in the first-line setting since 2007. Immunotherapies have emerged as promising treatments across solid tumors, including HCC, for which immune checkpoint inhibitors (ICIs) are licensed in the first- and second-line treatment settings. The treatment field of advanced HCC is continuously evolving. Several clinical trials are investigating novel ICI candidates as well as new ICI regimens in combination with other therapeutic modalities, including systemic agents such as other ICIs, TKIs, and anti-angiogenics. Novel immunotherapies including adoptive cell transfer, vaccine-based approaches, and virotherapy are also being brought to the fore. Yet, despite these advances, several challenges persist. The lack of real-world data on the use of immunotherapy for advanced HCC in patients outside of clinical trials is a main limitation, hindering the breadth of application and the generalizability of data to this larger and more diverse patient cohort. Consequently, issues encountered in real-world practice include patient ineligibility for immunotherapy because of contraindications, comorbidities, or poor performance status; lack of response, efficacy, and safety data; and cost-effectiveness. Further real-world data from high-quality, large prospective cohort studies of immunotherapy in patients with advanced HCC are required to aid evidence-based clinical decision-making. This review provides a critical and comprehensive overview of clinical trials and real-world data on immunotherapy for HCC, with a focus on ICIs, as well as novel immunotherapy strategies underway.
Objective: To evaluate the accuracy of identifying cancer patients by use of medical claims data in a health insurance system in China, and to provide a basis for establishing a claims-based cancer surveillance system in China. Methods: We chose Hua County, Henan Province as the study site, and randomly selected 300 and 1,200 qualified inpatient electronic medical records (EMRs), as well as the New Rural Cooperative Medical Scheme (NCMS) claims records, for cancer patients in Hua County People's Hospital (HCPH) and Anyang Cancer Hospital (ACH), respectively, in 2017. Diagnostic information for NCMS claims was evaluated on an individual level, and sensitivity and positive predictive value (PPV) were calculated taking the EMRs as the gold standard. Results: The sensitivity of NCMS was 95.2% (93.8%-96.3%) and 92.0% (88.3%-94.8%) in ACH and HCPH, respectively. The PPV of the NCMS was 97.8% (96.7%-98.5%) in ACH and 89.0% (84.9%-92.3%) in HCPH. Overall, the weighted and combined sensitivity and PPV of NCMS in Hua County were 93.1% and 92.1%, respectively. Significantly higher sensitivity and PPV in identifying patients with common cancers than with non-common cancers were detected in HCPH and ACH separately (P<0.01). Conclusions: Identification of cancer patients by use of the NCMS is accurate on an individual level, and it is therefore feasible to conduct claims-based cancer surveillance in areas not covered by cancer registries in China.
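The two validity measures used above follow directly from their definitions. The sketch below is a minimal illustration, not the study's analysis code: the function name, the record layout (a patient ID mapped to a cancer flag), and the toy records are assumptions, and no confidence intervals are computed. Sensitivity is TP / (TP + FN) and PPV is TP / (TP + FP), with the EMR taken as the gold standard.

```python
def sensitivity_ppv(gold, claims):
    """gold and claims map patient ID -> True if the record codes a cancer.

    gold is the reference standard (here, the EMR); claims is the source
    being validated. Returns (sensitivity, PPV).
    """
    tp = sum(1 for pid, is_case in gold.items() if is_case and claims.get(pid, False))
    fn = sum(1 for pid, is_case in gold.items() if is_case and not claims.get(pid, False))
    fp = sum(1 for pid, flagged in claims.items() if flagged and not gold.get(pid, False))
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("no gold-standard cases or no claims-flagged cases")
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical toy records, not data from the study.
gold = {1: True, 2: True, 3: True, 4: True, 5: False, 6: False}
claims = {1: True, 2: True, 3: True, 4: False, 5: True, 6: True}
sens, ppv = sensitivity_ppv(gold, claims)
```

Here three of the four true cases are found in the claims (sensitivity 0.75), and three of the five claims-flagged patients are true cases (PPV 0.6).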
Randomized clinical trials (RCTs) have long been recognized as the gold standard for regulatory approval in drug development. However, RCTs may not be feasible in some diseases and/or under certain situations, and findings from RCTs may not generalize to real-world patients in routine clinical practice. Real-world evidence (RWE), which is generated from various real-world data (RWD), has become increasingly important for drug development and clinical decision-making in the digital era. This paper describes RWD and real-world data studies (RWDSs), followed by the characteristics of and differences between RCTs and RWDSs. Furthermore, the challenges and limitations of RWD and RWE are discussed. Finally, this paper highlights that efforts must be made throughout RWE generation, from data collection/database selection and study design to statistical analysis and interpretation of the results, to minimize biases and confounding effects.
BACKGROUND Real-world data on tofacitinib (TOF) covering a period of more than 1 year for a sufficient number of Asian patients with ulcerative colitis (UC) are scarce. AIM To investigate the long-term efficacy and safety of TOF treatment for UC, including clinical issues. METHODS We performed a retrospective single-center observational analysis of 111 UC patients administered TOF at Hyogo Medical University, a tertiary inflammatory bowel disease center. All consecutive UC patients who received TOF between May 2018 and February 2020 were enrolled. Patients were followed up until August 2020. The primary outcome was the clinical response rate at week 8. Secondary outcomes included clinical remission at week 8, cumulative persistence rate of TOF administration, colectomy-free survival, relapse after tapering of TOF, and predictors of clinical response at week 8 and week 48. RESULTS The clinical response and remission rates were 66.3% and 50.5% at week 8, and 47.1% and 43.5% at week 48, respectively. The overall cumulative clinical remission rate was 61.7% at week 48, and a history of anti-tumor necrosis factor-alpha (TNF-α) agent use had no influence (P=0.25). The cumulative TOF persistence rate at week 48 was significantly lower in patients without clinical remission than in those with remission at week 8 (30.9% vs 88.1%; P<0.001). Baseline partial Mayo Score was significantly lower in responders vs non-responders at week 8 (odds ratio: 0.61, 95% confidence interval: 0.45-0.82, P=0.001). Relapse occurred in 45.7% of patients after TOF tapering, and 85.7% of patients responded within 4 wk after re-increase. All 6 patients with herpes zoster (HZ) developed the infection after achieving remission with TOF. CONCLUSION TOF was more effective in UC patients with mild activity at baseline, and its efficacy was not affected by previous treatment with anti-TNF-α agents. Nearly half of patients relapsed after tapering off TOF, and most relapsed patients responded again after re-increase of TOF. Special attention is needed for tapering and HZ.
BACKGROUND Although chronic erosive gastritis (CEG) is common, its clinical characteristics have not been fully elucidated. The lack of consensus regarding its treatment has resulted in varied treatment regimens. AIM To explore the clinical characteristics, treatment patterns, and short-term outcomes in CEG patients in China. METHODS We recruited patients with chronic non-atrophic or mild-to-moderate atrophic gastritis with erosion based on endoscopy and pathology. Patients and treating physicians completed a questionnaire regarding history, endoscopic findings, and treatment plans, as well as a follow-up questionnaire to investigate changes in symptoms after 4 wk of treatment. RESULTS Three thousand five hundred sixty-three patients from 42 centers across 24 cities in China were included. Epigastric pain (68.0%), abdominal distension (62.6%), and postprandial fullness (47.5%) were the most common presenting symptoms. Gastritis was classified as chronic non-atrophic in 69.9% of patients. Among those with erosive lesions, 72.1% of patients had lesions in the antrum, 51.0% had multiple lesions, and 67.3% had superficial flat lesions. In patients with epigastric pain, the combination of a mucosal protective agent (MPA) and a proton pump inhibitor was more effective. For those with postprandial fullness, acid regurgitation, early satiety, or nausea, an MPA appeared more promising. CONCLUSION CEG is a multifactorial disease which is common in Asian patients and has non-specific symptoms. Gastroscopy may play a major role in its detection and diagnosis. Treatment should be individualized based on symptom profile.
Objective To study the research status, research hotspots, and development trends in the field of real-world data (RWD) through social network analysis and knowledge graph analysis. Methods Literature on RWD from the past 10 years was retrieved from CNKI, and bibliometric analysis was performed using UCINET and CiteSpace. Results and Conclusion The frequency and centrality of related keywords such as real-world study, hospital information system (HIS), drug combination, data mining, and TCM are high. The clusters labeled as clinical medication and RWD contain more keywords. In the most recent 4 years, more articles have involved the keywords data specification, data authenticity, data security, and information security. Among them, compound Kushen injection, HIS database, and RWD are the top three keywords. Using HIS to study clinical medication, clinical characteristics, diseases, and injections is a long-term research hotspot for both Chinese and western medicine. In addition, research on RWD databases has shifted from construction to standardized collection and governance, which can make RWD effective. Data authenticity, data security, and information security will become new hotspots in RWD research.
With the development of Industry 4.0 and big data technology, the Industrial Internet of Things (IIoT) is hampered by inherent issues such as privacy, security, and fault tolerance, which pose challenges to its rapid development. Blockchain technology offers immutability, decentralization, and autonomy, which can greatly mitigate the inherent defects of the IIoT. In the traditional blockchain, data is stored in a Merkle tree. As data continues to grow, the size of the proofs used to validate it also grows, threatening the efficiency, security, and reliability of blockchain-based IIoT. Accordingly, this paper first analyzes the inefficiency of the traditional blockchain structure in verifying the integrity and correctness of data. To solve this problem, a new Vector Commitment (VC) structure, Partition Vector Commitment (PVC), is proposed by improving the traditional VC structure. Secondly, this paper uses PVC instead of the Merkle tree to store big data generated by IIoT. PVC can improve the efficiency of traditional VC in the process of commitment and opening. Finally, this paper uses PVC to build a blockchain-based IIoT data security storage mechanism and carries out a comparative experimental analysis. This mechanism can greatly reduce communication loss and maximize the rational use of storage space, which is of great significance for maintaining the security and stability of blockchain-based IIoT.
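The Merkle-tree baseline that the abstract contrasts with PVC can be sketched as follows. This illustrates only the conventional structure, not the proposed PVC scheme; the function names, the SHA-256 choice, and the odd-level padding rule (duplicating the last node) are assumptions for illustration. The point it shows is that a membership proof holds one sibling hash per tree level, so its size grows with the logarithm of the number of stored items.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, from hashed leaves up to the single root."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                  # pad odd levels (one common convention)
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1                 # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


# Hypothetical sensor readings standing in for IIoT data items.
leaves = [f"reading-{i}".encode() for i in range(8)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = prove(levels, 5)
big_proof = prove(build_levels([bytes([i % 256, i // 256]) for i in range(1024)]), 7)
```

With 8 leaves the proof carries 3 hashes, and with 1024 leaves it carries 10, which is the growth in proof size that motivates looking at alternative commitment structures.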
To address the problems of a single encryption algorithm, such as low encryption efficiency and unreliable metadata for static data storage on big data platforms in the cloud computing environment, we propose a Hadoop-based big data secure storage scheme. Firstly, to disperse the NameNode service from a single server to multiple servers, we combine the HDFS federation and HDFS high-availability mechanisms, and use the Zookeeper distributed coordination mechanism to coordinate each node to achieve dual-channel storage. Then, we improve the ECC encryption algorithm for the encryption of ordinary data, and adopt a homomorphic encryption algorithm to encrypt data that needs to be computed on. To accelerate encryption, we adopt a dual-thread encryption mode. Finally, the HDFS control module is designed to combine the encryption algorithm with the storage model. Experimental results show that the proposed solution solves the problem of a single point of failure for metadata, performs well in terms of metadata reliability, and achieves server fault tolerance. The improved encryption algorithm integrates the dual-channel storage mode, and the encryption storage efficiency improves by 27.6% on average.
Time-series data provide important information in many fields, and their processing and analysis have been the focus of much research. However, detecting anomalies is very difficult due to data imbalance, temporal dependence, and noise. Therefore, methodologies for data augmentation and for converting time series data into images for analysis have been studied. This paper proposes a fault detection model that uses time series data augmentation and transformation to address the problems of data imbalance, temporal dependence, and robustness to noise. Data augmentation is performed by adding Gaussian noise, with the noise level set to 0.002, to maximize the generalization performance of the model. In addition, we use the Markov Transition Field (MTF) method to effectively visualize the dynamic transitions of the data while converting the time series data into images. This enables the identification of patterns in time series data and helps capture the sequential dependencies of the data. For anomaly detection, the PatchCore model is applied and shows excellent performance, and the detected anomaly areas are represented as heat maps. By applying an anomaly map to the original image, it is possible to capture the areas where anomalies occur. The performance evaluation shows that both F1-score and accuracy are high when time series data is converted to images. Additionally, when the data were processed as images rather than as time series, there was a significant reduction in both the size of the data and the training time. The proposed method can provide an important springboard for research in the field of anomaly detection using time series data. It also helps solve problems such as analyzing complex patterns in data in a lightweight manner.
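The two preprocessing steps named above, Gaussian-noise augmentation and the Markov Transition Field, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's pipeline: the function names are hypothetical, the noise level 0.002 is interpreted here as the standard deviation, the number of bins is arbitrary, and ties in the rank-based quantile binning are broken by position. The MTF entry at (i, j) is the first-order transition probability from the bin of sample i to the bin of sample j.

```python
import random


def add_gaussian_noise(series, sigma=0.002, seed=0):
    """Return a noisy copy of the series (sigma treated as the std. dev.)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in series]


def markov_transition_field(series, n_bins=4):
    n = len(series)
    # Rank-based quantile binning: each bin receives about n / n_bins samples.
    order = sorted(range(n), key=lambda i: series[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n
    # Count first-order transitions between the bins of consecutive samples.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1
    # Row-normalise the counts into a Markov transition matrix W.
    W = []
    for row in counts:
        total = sum(row)
        W.append([c / total if total else 0.0 for c in row])
    # Spread W over all pairs of time steps: M[i][j] = W[bin(x_i)][bin(x_j)].
    return [[W[bins[i]][bins[j]] for j in range(n)] for i in range(n)]


# Hypothetical toy signal; a real input would be a sensor time series.
signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0]
noisy = add_gaussian_noise(signal)
image = markov_transition_field(noisy, n_bins=4)
small = markov_transition_field([1.0, 2.0, 3.0, 4.0], n_bins=2)
```

The result is an n-by-n matrix of values in [0, 1] that can be rendered as a grayscale image and passed to an image-based anomaly detector; for long series, the field is usually downsampled before that step.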
Addressing climate change demands a significant shift away from fossil fuels, with sectors like electricity and transportation relying heavily on renewable energy. Integral to this transition are energy storage systems, notably lithium-ion batteries. Over time, these batteries degrade, affecting their efficiency and posing safety risks. Monitoring and predicting battery aging is essential, especially estimating the state of health (SOH). Various SOH estimation methods exist, from traditional model-based approaches to machine learning approaches.
Mg alloys possess an inherent plastic anisotropy owing to the selective activation of deformation mechanisms depending on the loading condition. This characteristic results in a diverse range of flow curves that vary with the deformation condition. This study proposes a novel approach for accurately predicting the anisotropic deformation behavior of wrought Mg alloys using machine learning (ML) with data augmentation. The developed model combines four key strategies from data science: learning the entire flow curves, generative adversarial networks (GAN), algorithm-driven hyperparameter tuning, and the gated recurrent unit (GRU) architecture. The proposed model, namely GAN-aided GRU, was extensively evaluated for various predictive scenarios, such as interpolation, extrapolation, and a limited dataset size. The model exhibited significant predictability and improved generalizability for estimating the anisotropic compressive behavior of ZK60 Mg alloys under 11 annealing conditions and for three loading directions. The GAN-aided GRU results were superior to those of previous ML models and constitutive equations. The superior performance was attributed to hyperparameter optimization, GAN-based data augmentation, and the inherent predictivity of the GRU for extrapolation. As a first attempt to employ ML techniques other than artificial neural networks, this study proposes a novel perspective on predicting the anisotropic deformation behaviors of wrought Mg alloys.
Reliability evaluation for insulated gate bipolar transistors (IGBTs) on electric vehicles faces challenges such as junction temperature measurement and limited computational and storage resources. In this paper, a junction temperature estimation approach based on a neural network, requiring no additional cost, is proposed, and the lifetime calculation for the IGBT is performed using electric vehicle big data. The inputs are the direct current (DC) voltage, operating current, switching frequency, negative temperature coefficient (NTC) thermistor temperature, and IGBT lifetime; the output is the junction temperature (T_j). With the rain flow counting method, the classified irregular temperatures are fed into the life model to obtain the failure cycles. The fatigue accumulation method is then used to calculate the IGBT lifetime. To overcome the limited computational and storage resources of electric vehicle controllers, the IGBT lifetime calculation runs on a big data platform. The lifetime is then transmitted wirelessly to the electric vehicle as an input for the neural network. Thus, the junction temperature of the IGBT under long-term operating conditions can be accurately estimated. A test platform combining the motor controller with the vehicle big data server is built for the IGBT accelerated aging test. Subsequently, the IGBT lifetime predictions are derived from the junction temperature estimated by the neural network method and by the thermal network method. The experiment shows that the lifetime prediction based on a neural network with big data is more accurate than that of the thermal network, which improves the reliability evaluation of the system.
As the risks associated with air turbulence are intensified by climate change and the growth of the aviation industry, it has become imperative to monitor and mitigate these threats to ensure civil aviation safety. The eddy dissipation rate (EDR) has been established as the standard metric for quantifying turbulence in civil aviation. This study aims to explore a universally applicable symbolic classification approach based on genetic programming to detect turbulence anomalies using quick access recorder (QAR) data. The detection of atmospheric turbulence is approached as an anomaly detection problem. Comparative evaluations demonstrate that this approach performs on par with direct EDR calculation methods in identifying turbulence events. Moreover, comparisons with alternative machine learning techniques indicate that the proposed technique is the optimal methodology currently available. In summary, the use of symbolic classification via genetic programming enables accurate turbulence detection from QAR data, comparable to that with established EDR approaches and surpassing that achieved with machine learning algorithms. This finding highlights the potential of integrating symbolic classifiers into turbulence monitoring systems to enhance civil aviation safety amidst rising environmental and operational hazards.
Funding: the Natural Science Foundation of China (Grant Numbers 72074014 and 72004012).
Abstract: Purpose: Many science, technology and innovation (STI) resources are tagged with several different labels. To automatically assign the resulting labels to an instance of interest, many approaches with good performance on benchmark datasets have been proposed for the multi-label classification task in the literature. Furthermore, several open-source tools implementing these approaches have also been developed. However, the characteristics of real-world multi-label patent and publication datasets are not completely in line with those of benchmark ones. Therefore, the main purpose of this paper is to comprehensively evaluate seven multi-label classification methods on real-world datasets. Research limitations: The three real-world datasets differ in the following aspects: statement, data quality, and purposes. Additionally, open-source tools designed for multi-label classification also have intrinsic differences in their approaches to data processing and feature selection, which in turn impact the performance of a multi-label classification approach. In the near future, we will enhance experimental precision and reinforce the validity of conclusions by employing more rigorous control over variables through expanded parameter settings. Practical implications: The observed Macro F1 and Micro F1 scores on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Approaches leveraging deep learning techniques offer promising solutions by accommodating the hierarchical relationships and interdependencies among labels. With ongoing enhancements in deep learning algorithms and large-scale models, it is expected that the efficacy of multi-label classification tasks will be significantly improved, reaching a level of practical utility in the foreseeable future. Originality/value: (1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with a more complex hierarchical structure of labels and a more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.
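The Macro F1 and Micro F1 scores mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names `f1` and `macro_micro_f1` and the representation of each document's labels as a Python set are assumptions made here for illustration. Macro F1 averages the per-label F1 scores, so rare labels weigh as much as frequent ones, while Micro F1 pools the counts over all labels first.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; taken as 0.0 when the label never occurs at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true: List[Set[str]], y_pred: List[Set[str]]) -> Dict[str, float]:
    """Compute Macro F1 and Micro F1 for multi-label predictions.

    y_true[i] and y_pred[i] are the true and predicted label sets of document i
    (a hypothetical input format chosen for this sketch).
    """
    labels = sorted(set().union(*y_true, *y_pred))
    if not labels:
        return {"macro_f1": 0.0, "micro_f1": 0.0}

    # Per-label counts: [true positives, false positives, false negatives].
    counts = {lab: [0, 0, 0] for lab in labels}
    for truth, pred in zip(y_true, y_pred):
        for lab in labels:
            if lab in pred and lab in truth:
                counts[lab][0] += 1
            elif lab in pred:
                counts[lab][1] += 1
            elif lab in truth:
                counts[lab][2] += 1

    # Macro: unweighted mean of the per-label F1 scores.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: F1 of the counts pooled over all labels.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return {"macro_f1": macro, "micro_f1": f1(tp, fp, fn)}
```

On an unbalanced document-label distribution the two scores can diverge sharply, because a frequent label that is predicted well dominates Micro F1 while poorly predicted rare labels pull Macro F1 down.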
Funding: Supported by the National Outstanding Youth Science Fund Project of the National Natural Science Foundation of China [Grant No. 52222708] and the Natural Science Foundation of Beijing Municipality [Grant No. 3212033].
Abstract: Battery pack capacity estimation under real-world operating conditions is important for battery performance optimization and health management, contributing to the reliability and longevity of battery-powered systems. However, complex operating conditions, coupling cell-to-cell inconsistency, and limited labeled data pose great challenges to accurate and robust battery pack capacity estimation. To address these issues, this paper proposes a hierarchical data-driven framework aimed at enhancing the training of machine learning models with fewer labeled data. Unlike traditional data-driven methods that lack interpretability, the hierarchical data-driven framework unveils the “mechanism” of the black box inside the data-driven framework by splitting the final estimation target into cell-level and pack-level intermediate targets. A generalized feature matrix is devised without requiring all cell voltages, significantly reducing the computational cost and memory resources. The generated intermediate target labels and the corresponding features are hierarchically employed to enhance the training of two machine learning models, effectively alleviating the difficulty of learning the relationship from all features with few labeled data and addressing the dilemma of requiring extensive labeled data for accurate estimation. Using only 10% of the degradation data, the proposed framework outperforms state-of-the-art battery pack capacity estimation methods, achieving mean absolute percentage errors of 0.608%, 0.601%, and 1.128% for three battery packs whose degradation load profiles represent real-world operating conditions. Its high accuracy, adaptability, and robustness indicate its potential in different application scenarios, which is promising for reducing laborious and expensive aging experiments at the pack level and facilitating the development of battery technology.
Funding: Supported in part by NIH grants R01NS39600, U01MH114829, and RF1MH128693 (to GAA).
Abstract: Many fields, such as neuroscience, are experiencing a vast proliferation of cellular data, underscoring the need for organizing and interpreting large datasets. A popular approach partitions data into manageable subsets via hierarchical clustering, but objective methods to determine the appropriate classification granularity are missing. We recently introduced a technique to systematically identify when to stop subdividing clusters based on the fundamental principle that cells must differ more between than within clusters. Here we present the corresponding protocol to classify cellular datasets by combining data-driven unsupervised hierarchical clustering with statistical testing. These general-purpose functions are applicable to any cellular dataset that can be organized as two-dimensional matrices of numerical values, including molecular, physiological, and anatomical datasets. We demonstrate the protocol using cellular data from the Janelia MouseLight project to characterize morphological aspects of neurons.
Abstract: There is a growing body of clinical research on the utility of synthetic data derivatives, an emerging research tool in medicine. In nephrology, clinicians can use machine learning and artificial intelligence as powerful aids in their clinical decision-making while also preserving patient privacy. This is especially important given the epidemiology of chronic kidney disease, renal oncology, and hypertension worldwide. However, there remains a need to create a framework for guidance on how to better utilize synthetic data as a practical application in this research.
Funding: Microsoft Research Asia Internet Services in Academic Research Fund (No. FY07-RES-OPP-116); the Science and Technology Development Program of Tianjin (No. 06YFGZGX05900).
Abstract: To improve question answering (QA) performance based on real-world web data sets, a new set of question classes and a general answer re-ranking model are defined. With a pre-defined dictionary and grammatical analysis, the question classifier draws both semantic and grammatical information into information retrieval and machine learning methods in the form of various training features, including the question word, the main verb of the question, the dependency structure, the position of the main auxiliary verb, the main noun of the question, the top hypernym of the main noun, etc. Then the QA query results are re-ranked by question class information. Experiments show that the questions in real-world web data sets can be accurately classified by the classifier, and that the QA results are clearly improved after re-ranking. This shows that, with both semantic and grammatical information, applications such as QA built upon real-world web data sets can achieve better performance.
Abstract: Hepatocellular carcinoma (HCC) is a leading cause of cancer-associated mortality worldwide. HCC is an inflammation-associated immunogenic cancer that frequently arises in chronically inflamed livers. Advanced HCC is managed with systemic therapies; the tyrosine kinase inhibitor (TKI) sorafenib has been used in the 1st-line setting since 2007. Immunotherapies have emerged as promising treatments across solid tumors including HCC, for which immune checkpoint inhibitors (ICIs) are licensed in the 1st- and 2nd-line treatment settings. The treatment field of advanced HCC is continuously evolving. Several clinical trials are investigating novel ICI candidates as well as new ICI regimens in combination with other therapeutic modalities including systemic agents, such as other ICIs, TKIs, and anti-angiogenics. Novel immunotherapies including adoptive cell transfer, vaccine-based approaches, and virotherapy are also being brought to the fore. Yet, despite advances, several challenges persist. Lack of real-world data on the use of immunotherapy for advanced HCC in patients outside of clinical trials constitutes a main limitation hindering the breadth of application and generalizability of data to this larger and more diverse patient cohort. Consequently, issues encountered in real-world practice include patient ineligibility for immunotherapy because of contraindications, comorbidities, or poor performance status; lack of response, efficacy, and safety data; and cost-effectiveness. Further real-world data from high-quality large prospective cohort studies of immunotherapy in patients with advanced HCC are needed to aid evidence-based clinical decision-making. This review provides a critical and comprehensive overview of clinical trials and real-world data of immunotherapy for HCC, with a focus on ICIs, as well as novel immunotherapy strategies underway.
Funding: Supported by the National Natural Science Foundation of China (No. 30930102, 81473033); the National Key R&D Program of China (No. 2016YFC0901404); the Digestive Medical Coordinated Development Center of Beijing Hospitals Authority (No. XXZ0204); the Science Foundation of Peking University Cancer Hospital (No. 2017-4); and the Open Project funded by the Key Laboratory of Carcinogenesis and Translational Research, Ministry of Education/Beijing (No. 2017-10).
Abstract: Objective: To evaluate the accuracy of identifying cancer patients by use of medical claims data in a health insurance system in China, and to provide the basis for establishing a claims-based cancer surveillance system in China. Methods: We chose Hua County, Henan Province as the study site, and randomly selected 300 and 1,200 qualified inpatient electronic medical records (EMRs) as well as the New Rural Cooperative Medical Scheme (NCMS) claims records for cancer patients in Hua County People’s Hospital (HCPH) and Anyang Cancer Hospital (ACH) in 2017. Diagnostic information for NCMS claims was evaluated on an individual level, and sensitivity and positive predictive value (PPV) were calculated taking the EMRs as the gold standard. Results: The sensitivity of NCMS was 95.2% (93.8%-96.3%) and 92.0% (88.3%-94.8%) in ACH and HCPH, respectively. The PPV of the NCMS was 97.8% (96.7%-98.5%) in ACH and 89.0% (84.9%-92.3%) in HCPH. Overall, the weighted and combined sensitivity and PPV of NCMS in Hua County were 93.1% and 92.1%, respectively. Significantly higher sensitivity and PPV in identifying patients with common cancers than non-common cancers were detected in HCPH and ACH separately (P<0.01). Conclusions: Identification of cancer patients by use of the NCMS is accurate at the individual level, and it is therefore feasible to conduct claims-based cancer surveillance in areas not covered by cancer registries in China.
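Sensitivity and PPV against a gold standard, as used in this abstract, reduce to two ratios of counts. The sketch below is illustrative only: the function names are hypothetical, and the Wilson score interval is one common choice of confidence interval, assumed here because the abstract does not state which interval method the authors used.

```python
import math
from typing import Tuple


def sensitivity(tp: int, fn: int) -> float:
    """Share of gold-standard cancer cases that the claims data also identify."""
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Share of claims-identified cases that the gold standard confirms."""
    return tp / (tp + fp)


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    Assumption: the source abstract does not name its interval method,
    so this is a stand-in, not a reproduction of the reported intervals.
    """
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

For example, with hypothetical counts of 90 true positives, 10 false negatives, and 5 false positives, `sensitivity(90, 10)` is 0.9 and `ppv(90, 5)` is 90/95.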
Abstract: Randomized clinical trials (RCTs) have long been recognized as the gold standard for regulatory approval in drug development. However, RCTs may not be feasible in some diseases and/or under certain situations, and findings from RCTs may not generalize to real-world patients in routine clinical practice. Real-world evidence (RWE), which is generated from various real-world data (RWD), has become increasingly important for drug development and clinical decision-making in the digital era. This paper describes RWD and real-world data studies (RWDSs), followed by the characteristics of and differences between RCTs and RWDSs. Furthermore, the challenges and limitations of RWD and RWE are discussed. Finally, this paper highlights that efforts must be made throughout RWE generation, from data collection/database selection, study design, and statistical analysis to interpretation of the results, to minimize biases and confounding effects.
Abstract: BACKGROUND Real-world data on tofacitinib (TOF) covering a period of more than 1 year for a sufficient number of Asian patients with ulcerative colitis (UC) are scarce. AIM To investigate the long-term efficacy and safety of TOF treatment for UC, including clinical issues. METHODS We performed a retrospective single-center observational analysis of 111 UC patients administered TOF at Hyogo Medical University as a tertiary inflammatory bowel disease center. All consecutive UC patients who received TOF between May 2018 and February 2020 were enrolled. Patients were followed up until August 2020. The primary outcome was the clinical response rate at week 8. Secondary outcomes included clinical remission at week 8, cumulative persistence rate of TOF administration, colectomy-free survival, relapse after tapering of TOF, and predictors of clinical response at week 8 and week 48. RESULTS The clinical response and remission rates were 66.3% and 50.5% at week 8, and 47.1% and 43.5% at week 48, respectively. The overall cumulative clinical remission rate was 61.7% at week 48, and a history of anti-tumor necrosis factor-alpha (TNF-α) agent use had no influence (P=0.25). The cumulative TOF persistence rate at week 48 was significantly lower in patients without clinical remission than in those with remission at week 8 (30.9% vs 88.1%; P<0.001). Baseline partial Mayo Score was significantly lower in responders vs non-responders at week 8 (odds ratio: 0.61, 95% confidence interval: 0.45-0.82, P=0.001). Relapse occurred in 45.7% of patients after TOF tapering, and 85.7% of patients responded within 4 wk after re-increase. All 6 patients with herpes zoster (HZ) developed the infection after achieving remission by TOF. CONCLUSION TOF was more effective in UC patients with mild activity at baseline, and its efficacy was not affected by previous treatment with anti-TNF-α agents. Most relapsed patients responded again after re-increase of TOF, and nearly half relapsed after tapering off TOF. Special attention is needed for tapering and HZ.
Funding: The National Key Clinical Specialty Construction Project, No. ZK108000; CAMS Innovation Fund for Medical Sciences, No. 2021-I2M-C&T-A-001 and No. 2022-I2M-C&T-B-012.
Abstract: BACKGROUND Although chronic erosive gastritis (CEG) is common, its clinical characteristics have not been fully elucidated. The lack of consensus regarding its treatment has resulted in varied treatment regimens. AIM To explore the clinical characteristics, treatment patterns, and short-term outcomes in CEG patients in China. METHODS We recruited patients with chronic non-atrophic or mild-to-moderate atrophic gastritis with erosion based on endoscopy and pathology. Patients and treating physicians completed a questionnaire regarding history, endoscopic findings, and treatment plans as well as a follow-up questionnaire to investigate changes in symptoms after 4 wk of treatment. RESULTS Three thousand five hundred sixty-three patients from 42 centers across 24 cities in China were included. Epigastric pain (68.0%), abdominal distension (62.6%), and postprandial fullness (47.5%) were the most common presenting symptoms. Gastritis was classified as chronic non-atrophic in 69.9% of patients. Among those with erosive lesions, 72.1% of patients had lesions in the antrum, 51.0% had multiple lesions, and 67.3% had superficial flat lesions. In patients with epigastric pain, the combination of a mucosal protective agent (MPA) and proton pump inhibitor was more effective. For those with postprandial fullness, acid regurgitation, early satiety, or nausea, an MPA appeared more promising. CONCLUSION CEG is a multifactorial disease which is common in Asian patients and has non-specific symptoms. Gastroscopy may play a major role in its detection and diagnosis. Treatment should be individualized based on symptom profile.
Abstract: Objective To study the research status, research hotspots and development trends in the field of real-world data (RWD) through social network analysis and knowledge graph analysis. Methods Literature on RWD from the past 10 years was retrieved from CNKI, and bibliometric analysis was performed using UCINET and CiteSpace. Results and Conclusion The frequency and centrality of related keywords such as real-world study, hospital information system (HIS), drug combination, data mining and TCM are high. The clusters labeled as clinical medication and RWD contain more keywords. In the most recent 4 years, more articles have involved the keywords data specification, data authenticity, data security and information security. Among them, compound Kushen injection, HIS database and RWD are the top three keywords. Using HIS to study clinical medication, clinical characteristics, diseases and injections is a long-term research hotspot for Chinese and Western medicine. In addition, research on RWD databases has shifted from construction to standardized collection and governance, which can make RWD effective. Data authenticity, data security and information security will become the new hotspots in RWD research.
Funding: Supported by China’s National Natural Science Foundation (Nos. 62072249, 62072056). This work is also funded by the National Science Foundation of Hunan Province (2020JJ2029).
Abstract: With the development of Industry 4.0 and big data technology, the Industrial Internet of Things (IIoT) is hampered by inherent issues such as privacy, security, and fault tolerance, which pose certain challenges to the rapid development of IIoT. Blockchain technology offers immutability, decentralization, and autonomy, which can greatly mitigate the inherent defects of the IIoT. In the traditional blockchain, data is stored in a Merkle tree. As data continues to grow, the scale of the proofs used to validate it grows, threatening the efficiency, security, and reliability of blockchain-based IIoT. Accordingly, this paper first analyzes the inefficiency of the traditional blockchain structure in verifying the integrity and correctness of data. To solve this problem, a new Vector Commitment (VC) structure, Partition Vector Commitment (PVC), is proposed by improving the traditional VC structure. Secondly, this paper uses PVC instead of the Merkle tree to store big data generated by IIoT. PVC can improve the efficiency of traditional VC in the process of commitment and opening. Finally, this paper uses PVC to build a blockchain-based IIoT data security storage mechanism and carries out a comparative experimental analysis. This mechanism can greatly reduce communication loss and maximize the rational use of storage space, which is of great significance for maintaining the security and stability of blockchain-based IIoT.
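The Merkle-tree baseline that this abstract contrasts with PVC can be sketched as follows. This is a generic illustration of Merkle inclusion proofs, not the paper's PVC scheme and not any particular blockchain's format: the function names, the use of SHA-256, and the convention of duplicating the last node on odd-sized levels are all assumptions made for the sketch. It shows why proof size grows with the data: each proof carries one sibling hash per tree level.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> bytes:
    """SHA-256 digest, used for both leaves and internal nodes."""
    return hashlib.sha256(data).digest()


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, from hashed leaves up to the single root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Assumed convention: duplicate the last node on odd-sized levels.
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Inclusion proof for leaf `index`: (sibling hash, sibling_is_left) per level."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1  # the neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from the leaf to the root and compare."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

With 8 leaves a proof holds 3 hashes, and with about a million leaves it holds 20, so the proof length in this sketch is the tree height, which is the growth the abstract refers to.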
Abstract: To address the problems of a single encryption algorithm, such as low encryption efficiency and unreliable metadata for static data storage on big data platforms in the cloud computing environment, we propose a Hadoop-based big data secure storage scheme. Firstly, in order to disperse the NameNode service from a single server to multiple servers, we combine the HDFS federation and HDFS high-availability mechanisms, and use the Zookeeper distributed coordination mechanism to coordinate each node to achieve dual-channel storage. Then, we improve the ECC encryption algorithm for the encryption of ordinary data, and adopt a homomorphic encryption algorithm to encrypt data that needs to be calculated. To accelerate the encryption, we adopt a dual-thread encryption mode. Finally, the HDFS control module is designed to combine the encryption algorithm with the storage model. Experimental results show that the proposed solution solves the problem of a single point of failure of metadata, performs well in terms of metadata reliability, and can realize the fault tolerance of the server. The improved encryption algorithm integrates the dual-channel storage mode, and the encryption storage efficiency improves by 27.6% on average.
Funding: This research was financially supported by the Ministry of Trade, Industry, and Energy (MOTIE), Korea, under the “Project for Research and Development with Middle Markets Enterprises and DNA (Data, Network, AI) Universities” (AI-based Safety Assessment and Management System for Concrete Structures) (Reference Number P0024559), supervised by the Korea Institute for Advancement of Technology (KIAT).
Abstract: Time-series data provide important information in many fields, and their processing and analysis have been the focus of much research. However, detecting anomalies is very difficult due to data imbalance, temporal dependence, and noise. Therefore, methodologies for data augmentation and for converting time series data into images for analysis have been studied. This paper proposes a fault detection model that uses time series data augmentation and transformation to address the problems of data imbalance, temporal dependence, and robustness to noise. Data augmentation is performed by adding noise: Gaussian noise, with the noise level set to 0.002, is added to maximize the generalization performance of the model. In addition, we use the Markov Transition Field (MTF) method to effectively visualize the dynamic transitions of the data while converting the time series data into images. This enables the identification of patterns in time series data and assists in capturing the sequential dependencies of the data. For anomaly detection, the PatchCore model is applied and shows excellent performance, and the detected anomaly areas are represented as heat maps. This allows anomalies to be detected, and by applying an anomaly map to the original image, it is possible to capture the areas where anomalies occur. The performance evaluation shows that both F1-score and Accuracy are high when time series data is converted to images. Additionally, when the data were processed as images rather than as time series, there was a significant reduction in both the size of the data and the training time. The proposed method can provide an important springboard for research in the field of anomaly detection using time series data. In addition, it helps address problems such as analyzing complex patterns in data in a lightweight manner.
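The two preprocessing steps named in this abstract, Gaussian-noise augmentation and the Markov Transition Field, can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' pipeline: the function names are hypothetical, the "noise level" of 0.002 is interpreted here as the standard deviation of the added noise, the number of bins is arbitrary, and ties in the quantile binning are broken by position for simplicity. The MTF follows the usual construction: discretize the series into quantile bins, estimate a first-order transition matrix between bins, then set pixel (i, j) to the transition probability from the bin at time i to the bin at time j.

```python
import random
from typing import List, Optional


def add_gaussian_noise(series: List[float], sigma: float = 0.002,
                       seed: Optional[int] = None) -> List[float]:
    """Return a noisy copy of the series (sigma is assumed to be a std. dev.)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in series]


def quantile_bins(series: List[float], n_bins: int) -> List[int]:
    """Assign each point to one of n_bins equally populated bins by rank."""
    order = sorted(range(len(series)), key=lambda i: series[i])
    bins = [0] * len(series)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(series)
    return bins


def markov_transition_field(series: List[float], n_bins: int = 4) -> List[List[float]]:
    """Build an n x n image whose pixel (i, j) is P(bin at time i -> bin at time j)."""
    bins = quantile_bins(series, n_bins)

    # Count transitions between consecutive time steps.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1

    # Row-normalize the counts into transition probabilities.
    trans = []
    for row in counts:
        total = sum(row)
        trans.append([c / total if total else 0.0 for c in row])

    return [[trans[bi][bj] for bj in bins] for bi in bins]
```

The resulting square matrix can be rendered as a grayscale image and passed to an image-based anomaly detector; an image model such as PatchCore is outside the scope of this sketch.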
Funding: Supported by the National Natural Science Foundation of China (72201152 and 52207229).
Abstract: Addressing climate change demands a significant shift away from fossil fuels, with sectors like electricity and transportation relying heavily on renewable energy. Integral to this transition are energy storage systems, notably lithium-ion batteries. Over time, these batteries degrade, affecting their efficiency and posing safety risks. Monitoring and predicting battery aging is essential, especially estimating the state of health (SOH). Various SOH estimation methods exist, from traditional model-based approaches to machine learning approaches.
Funding: Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government (Grant No. 20214000000140, Graduate School of Convergence for Clean Energy Integrated Power Generation); Korea Basic Science Institute (National Research Facilities and Equipment Center) grant funded by the Ministry of Education (2021R1A6C101A449); the National Research Foundation of Korea grant funded by the Ministry of Science and ICT (2021R1A2C1095139), Republic of Korea.
Abstract: Mg alloys possess an inherent plastic anisotropy owing to the selective activation of deformation mechanisms depending on the loading condition. This characteristic results in a diverse range of flow curves that vary with the deformation condition. This study proposes a novel approach for accurately predicting the anisotropic deformation behavior of wrought Mg alloys using machine learning (ML) with data augmentation. The developed model combines four key strategies from data science: learning the entire flow curves, generative adversarial networks (GAN), algorithm-driven hyperparameter tuning, and gated recurrent unit (GRU) architecture. The proposed model, namely GAN-aided GRU, was extensively evaluated for various predictive scenarios, such as interpolation, extrapolation, and a limited dataset size. The model exhibited significant predictability and improved generalizability for estimating the anisotropic compressive behavior of ZK60 Mg alloys under 11 annealing conditions and for three loading directions. The GAN-aided GRU results were superior to those of previous ML models and constitutive equations. The superior performance was attributed to hyperparameter optimization, GAN-based data augmentation, and the inherent predictivity of the GRU for extrapolation. As a first attempt to employ ML techniques other than artificial neural networks, this study proposes a novel perspective on predicting the anisotropic deformation behaviors of wrought Mg alloys.
Abstract: Reliability evaluation for insulated gate bipolar transistors (IGBT) on electric vehicles faces challenges such as junction temperature measurement and limited computational and storage resources. In this paper, a junction temperature estimation approach based on a neural network without additional cost is proposed, and the lifetime calculation for IGBT using electric vehicle big data is performed. The direct current (DC) voltage, operation current, switching frequency, negative temperature coefficient (NTC) thermistor temperature and IGBT lifetime are the inputs, and the junction temperature (T_j) is the output. With the rain flow counting method, the classified irregular temperatures are brought into the life model to obtain the failure cycles. The fatigue accumulation method is then used to calculate the IGBT lifetime. To overcome the limited computational and storage resources of electric vehicle controllers, the IGBT lifetime calculation runs on a big data platform. The lifetime is then transmitted wirelessly to electric vehicles as an input for the neural network. Thus the junction temperature of the IGBT under long-term operating conditions can be accurately estimated. A test platform of the motor controller combined with the vehicle big data server is built for the IGBT accelerated aging test. Subsequently, the IGBT lifetime predictions are derived from the junction temperature estimation by the neural network method and the thermal network method. The experiment shows that the lifetime prediction based on a neural network with big data demonstrates a higher accuracy than that of the thermal network, which improves the reliability evaluation of the system.
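The step from counted temperature cycles to a lifetime estimate can be illustrated with linear fatigue accumulation (Miner's rule). This is a sketch under loudly stated assumptions, not the paper's model: the abstract does not give its life model, so a Coffin-Manson-style power law N_f = a * ΔT^(-n) is assumed, the constants `a` and `n` are hypothetical placeholders, and the cycle histogram is taken as already produced by a rainflow counting step that is not implemented here.

```python
from typing import Dict


def cycles_to_failure(delta_t: float, a: float = 3.0e14, n: float = 5.0) -> float:
    """Assumed Coffin-Manson-style life model: N_f = a * delta_t ** (-n).

    `a` and `n` are hypothetical constants for illustration only; real values
    come from accelerated aging tests of the specific device.
    """
    return a * delta_t ** (-n)


def accumulated_damage(cycle_histogram: Dict[float, float]) -> float:
    """Miner's rule: damage D = sum over bins of n_i / N_f(delta_T_i).

    cycle_histogram maps a junction temperature swing (K) to the number of
    cycles counted at that swing during one observation period.
    """
    return sum(count / cycles_to_failure(dt) for dt, count in cycle_histogram.items())


def lifetime_in_periods(cycle_histogram: Dict[float, float]) -> float:
    """Number of observation periods until D reaches 1 (predicted failure)."""
    damage = accumulated_damage(cycle_histogram)
    return float("inf") if damage == 0 else 1.0 / damage
```

Under these placeholder constants, a histogram of `{10.0: 3.0e9}` consumes the whole life budget in one period, and splitting the same cycles over two periods doubles the lifetime, which is the linearity that Miner's rule assumes.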
Funding: Supported by the Meteorological Soft Science Project (Grant No. 2023ZZXM29), the Natural Science Fund Project of Tianjin, China (Grant No. 21JCYBJC00740), and the Key Research and Development-Social Development Program of Jiangsu Province, China (Grant No. BE2021685).
Abstract: As the risks associated with air turbulence are intensified by climate change and the growth of the aviation industry, it has become imperative to monitor and mitigate these threats to ensure civil aviation safety. The eddy dissipation rate (EDR) has been established as the standard metric for quantifying turbulence in civil aviation. This study aims to explore a universally applicable symbolic classification approach based on genetic programming to detect turbulence anomalies using quick access recorder (QAR) data. The detection of atmospheric turbulence is approached as an anomaly detection problem. Comparative evaluations demonstrate that this approach performs on par with direct EDR calculation methods in identifying turbulence events. Moreover, comparisons with alternative machine learning techniques indicate that the proposed technique is the optimal methodology currently available. In summary, the use of symbolic classification via genetic programming enables accurate turbulence detection from QAR data, comparable to that with established EDR approaches and surpassing that achieved with machine learning algorithms. This finding highlights the potential of integrating symbolic classifiers into turbulence monitoring systems to enhance civil aviation safety amidst rising environmental and operational hazards.