935 articles found
Multimodal Deep Neural Networks for Digitized Document Classification
1
Authors: Aigerim Baimakhanova, Ainur Zhumadillayeva, Bigul Mukhametzhanova, Natalya Glazyrina, Rozamgul Niyazova, Nurseit Zhunissov, Aizhan Sambetbayeva. Source: Computer Systems Science & Engineering, 2024, No. 3, pp. 793-811 (19 pages)
As digital technologies have advanced, the number of paper documents converted into digital format has increased exponentially. To respond to the urgent need to categorize the growing number of digitized documents, real-time classification of digitized documents is the primary goal of this study. Document classification is the first stage in automating document control and efficient knowledge discovery with little or no human involvement. Artificial intelligence methods such as deep learning are now combined with segmentation to study and interpret document traits in ways that were not conceivable ten years ago. Deep learning aids in comprehending input patterns so that object classes can be predicted, while segmentation divides the input image into separate segments for a more thorough analysis. This study proposes a deep learning-enabled framework for automated document classification that can be implemented in higher education. To this end, a dataset was developed that includes seven categories: Diplomas, Personal documents, Journal of Accounting of higher education diplomas, Service letters, Orders, Production orders, and Student orders. A deep learning model based on Conv2D layers is then proposed for document classification. Finally, the proposed model is evaluated and compared with other machine-learning techniques. The proposed model outperforms the other machine-learning models, reaching 94.84%, 94.79%, 94.62%, 94.43%, and 94.07% in accuracy, precision, recall, F-score, and AUC-ROC, respectively. These results indicate that the proposed model is suitable for practical use as an assistant to an office worker.
Keywords: document categorization; deep learning; machine learning; classification; digitization
Automatically Constructing an Effective Domain Ontology for Document Classification (cited by: 2)
2
Author: Yi-Hsing Chang. Source: Computer Technology and Application, 2011, No. 3, pp. 182-189 (8 pages)
This paper proposes a method for automatically constructing an effective domain ontology. The main idea is to use Formal Concept Analysis to establish the domain ontology automatically. The ontology then serves as the basis for a Naive Bayes classifier in order to verify its effectiveness for document classification. A set of 1752 documents divided into 10 categories is used for the assessment, with 1252 training documents and 500 testing documents. The F1-measure is the assessment criterion, and three results are obtained. The average recall of the Naive Bayes classifier is 0.94, so its recall is excellent with the automatically constructed ontology. The average precision is 0.81, so its precision is good. The average F1-measure over the 10 categories is 0.86, so the classifier is effective in terms of F1-measure. Thus, the automatically constructed domain ontology can indeed serve as the document categories for effective document classification.
Keywords: Naive Bayes classifier; ontology; formal concept analysis; document classification
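The abstract above reports average recall, precision, and F1-measure separately. As a reading aid, the following minimal Python sketch shows how the F1-measure relates to precision and recall. The function names and the two-category toy values are hypothetical and do not come from the paper. Plugging in the averaged precision (0.81) and recall (0.94) gives roughly 0.87, while the abstract reports 0.86. One possible explanation, which is an assumption here, is that the paper averages per-category F1 values, and that generally differs from the F1 of the averages.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_category):
    """Mean of per-category F1 values, given (precision, recall) pairs."""
    scores = [f1_score(p, r) for p, r in per_category]
    return sum(scores) / len(scores)


# F1 of the averaged precision and recall quoted in the abstract.
f1_of_averages = f1_score(0.81, 0.94)

# Hypothetical two-category example: both averages are 0.75, yet the
# macro-averaged F1 is lower than the F1 computed from those averages.
toy = [(1.0, 0.5), (0.5, 1.0)]
toy_macro = macro_f1(toy)
toy_f1_of_averages = f1_score(0.75, 0.75)

print(round(f1_of_averages, 3), round(toy_macro, 3), toy_f1_of_averages)
```

The toy case illustrates why the two ways of averaging need not agree, which is worth keeping in mind when comparing the figures across papers in this list.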
Automatic Arabic Document Classification via kNN
3
Author: HANI M. O. Iwidat. Source: Computer Aided Drafting, Design and Manufacturing, 2008, No. 2, pp. 65-73 (9 pages)
Many algorithms have been implemented for document categorization. Most work in this area has targeted English text, while very few approaches have been introduced for Arabic text. Arabic text differs in nature from English text, and its preprocessing is more challenging, because Arabic is a highly inflectional and derivational language, which makes document mining a hard and complex task. In this paper, we present an automatic Arabic document classification system based on the kNN algorithm. We also develop an approach to keyword extraction and reduction that uses the Document Frequency (DF) threshold method. The results indicate that kNN handles Arabic text better than the other existing systems. The proposed system reached a micro-recall of 0.95 on 850 Arabic texts in 6 categories.
Keywords: Arabic documents classification; kNN; vector model; keywords extraction
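The abstract above combines a Document Frequency (DF) threshold for keyword reduction with kNN classification. The sketch below shows one plausible shape of that pipeline in plain Python. It is not the paper's system: the cosine similarity, the majority vote, the value of `k`, and the toy English tokens are all assumptions made for illustration, and Arabic preprocessing such as stemming is not shown.

```python
import math
from collections import Counter


def select_by_df(docs, min_df):
    """Keep terms that occur in at least `min_df` documents (DF threshold)."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, n in df.items() if n >= min_df}


def to_vector(doc, vocab):
    """Term-count vector restricted to the selected vocabulary."""
    return Counter(t for t in doc if t in vocab)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(query, train_docs, labels, vocab, k=3):
    """Majority vote among the k training documents most similar to `query`."""
    q = to_vector(query, vocab)
    scored = sorted(
        ((cosine(q, to_vector(d, vocab)), y) for d, y in zip(train_docs, labels)),
        key=lambda s: s[0],
        reverse=True,
    )
    votes = Counter(y for _, y in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-tokenized toy corpus (placeholder English tokens).
train = [
    ["goal", "match", "team"],
    ["team", "league", "goal"],
    ["bank", "market", "price"],
    ["price", "stock", "market"],
]
labels = ["sports", "sports", "economy", "economy"]

vocab = select_by_df(train, min_df=2)
prediction = knn_predict(["team", "goal", "stock"], train, labels, vocab, k=3)
print(sorted(vocab), prediction)
```

Raising `min_df` shrinks the vocabulary, which is the reduction effect the abstract attributes to the DF threshold. The right threshold for a real corpus would have to be tuned.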
Study on Multi-Label Classification of Medical Dispute Documents (cited by: 2)
4
Authors: Baili Zhang, Shan Zhou, Le Yang, Jianhua Lv, Mingjun Zhong. Source: Computers, Materials & Continua (SCIE, EI), 2020, No. 12, pp. 1975-1986 (12 pages)
The Internet of Medical Things (IoMT) will come to be of great importance in the mediation of medical disputes, as it is emerging as the core of intelligent medical treatment. First, IoMT can track the entire medical treatment process in order to provide detailed trace data for medical dispute resolution. Second, IoMT can infiltrate the ongoing treatment and provide timely intelligent decision support to medical staff. This support includes recommendation of similar historical cases, guidance for medical treatment, alerting of hired dispute profiteers, etc. The multi-label classification of medical dispute documents (MDDs) plays an important role as a front-end process for intelligent decision support, especially in the recommendation of similar historical cases. However, MDDs usually appear as long texts containing a large amount of redundant information, and the dataset has a serious distribution imbalance, which directly weakens classification performance. Accordingly, this paper proposes a multi-label classification method for MDDs based on key sentence extraction. The method has two parts. First, an attention-based hierarchical bi-directional long short-term memory (BiLSTM) model is used to extract key sentences from documents. Second, random comprehensive sampling Bagging (RCS-Bagging), an ensemble multi-label classification model, is employed to classify MDDs based on the key sentence sets. This approach greatly improves classification performance. Experiments show that the two models proposed in this paper perform remarkably better than the baseline methods.
Keywords: Internet of Medical Things (IoMT); medical disputes; medical dispute document (MDD); multi-label classification (MLC); key sentence extraction; class imbalance
On the Combination of “The Textual Research on Historical Documents” and “The Comparative Study of Historical Data” —— and a Discussion on “The Law of Quan-ma and Gui-mei” in Chinese Language Studies
5
Author: Lu Guoyao. Source: Macrolinguistics (《宏观语言学》), 2007, No. 1, pp. 46-59 (14 pages)
In Chinese language studies, both “The Textual Research on Historical Documents” and “The Comparative Study of Historical Data” are traditional methodologies, and both deserve to be treasured, passed on, and further developed. Giving either of the two methods unreasonable priority will certainly harm the development of academic research. The author claims that the best, or one of the best, methodologies for the historical study of the Chinese language is the combination of the two, hence a new interpretation of “The Double-proof Method”. This essay also attempts to put forward “The Law of Quan-ma and Gui-mei” in Chinese language studies, under which it is not advisable in linguistic research to treat Gui-mei as Quan-ma or vice versa. It is crucial always to respect the language facts first, which is considered the very soul of linguistics.
Keywords: the history of Chinese language; methodology; The Textual Research on Historical Documents; The Comparative Study of Historical Data; Double-proof method; the law of Quan-ma and Gui-mei
Document classification approach by rough-set-based corner classification neural network (cited by: 1)
6
Authors: 张卫丰, 徐宝文, 崔自峰, 徐峻岭. Source: Journal of Southeast University (English Edition) (EI, CAS), 2006, No. 3, pp. 439-444 (6 pages)
A rough-set-based corner classification neural network, the Rough-CC4, is presented to solve document classification problems such as the representation of documents of different sizes, document feature selection, and document feature encoding. In the Rough-CC4, documents are described by the equivalent classes of approximate words. This reduces the number of dimensions representing the documents, which solves the precision problems caused by different document sizes and also blurs the differences caused by approximate words. The Rough-CC4 also introduces a binary encoding method that encodes the importance of documents relative to each equivalent class. This encoding greatly improves the precision of the Rough-CC4 and reduces its space complexity. The Rough-CC4 can be used for the automatic classification of documents.
Keywords: document classification; neural network; rough set; meta search engine
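Integrating Intra- and Inter-document Evidences for Improving Sentence Sentiment Classification (cited by: 6)
7
Authors: ZHAO Yan-Yan, QIN Bing, LIU Ting. Source: Acta Automatica Sinica (《自动化学报》) (EI, CSCD, PKU Core), 2010, No. 10, pp. 1417-1425 (9 pages)
Keywords: digital camera; pixel; Fuji; optical zoom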
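Stemming Algorithm to Classify Arabic Documents (cited by: 1)
8
Authors: Marwan Ali H. Omer, Shilong Ma. Source: Journal of Communication and Computer (《通讯和计算机(中英文版)》), 2010, No. 9, pp. 1-5 (5 pages)
Keywords: Arabic language; confidential documents; text classification; algorithm; classification system; document classification; Arabic script; experimental data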
EDCMS:A Content Management System for Engineering Documents
9
Authors: Chris McMahon, Mansur Darlington, Steve Culley, Peter Wild. Source: International Journal of Automation and Computing (EI), 2007, No. 1, pp. 56-70 (15 pages)
Engineers often need to find the right pieces of information by sifting through long engineering documents, a tiring and time-consuming job. To address this issue, researchers are increasingly devoting attention to new ways of helping information users, including engineers, to access and retrieve document content. The research reported in this paper explores the use of three key technologies: document decomposition (the study of document structure), document mark-up (with eXtensible Mark-up Language (XML), HyperText Mark-up Language (HTML), and Scalable Vector Graphics (SVG)), and a facetted classification mechanism. Document content extraction is implemented in Java. An Engineering Document Content Management System (EDCMS) developed in this research demonstrates that information providers can make document content more accessible to information users, including engineers. The main features of the EDCMS are: 1) It enables users, especially engineers, to access and retrieve information at the content level rather than the document level. In other words, it provides the right pieces of information to answer specific questions, so that engineers do not need to waste time sifting through a whole document to obtain the required piece of information. 2) Users can access engineering document content via both the data and the metadata of a document. 3) Users can access and retrieve content objects, i.e. text, images, and graphics (including engineering drawings), via multiple views and at different granularities based on decomposition schemes. Experiments with the EDCMS have been conducted on semi-structured documents: a CAD/CAM textbook and a set of project posters in the Engineering Design domain. The experimental results show that the system gives information users a powerful way to access document content.
Keywords: document content management; engineering design; decomposition schemes; document mark-up; facetted classification
Color and document classification in ancient China:The classification-centered functions of color in document
10
Author: Ya ZHOU. Source: Journal of Library Science in China, 2014, No. 1, pp. 231-248 (18 pages)
In ancient China, color was an uncommon means of document classification. From the pre-Qin period to the Qing Dynasty, classifying documents by color existed mainly in the following fields: official documents, book publishing, literature collection, chromatography printing, etc. The functions of color in document classification include distinguishing 'books', distinguishing 'people', and symbolizing 'meanings'; the levels of color-based classification consist of the documentary-unit level and the knowledge-unit level. Classifying documents by color in ancient China was influenced by related factors such as the concepts of five colors and orthodox colors, hierarchical order and the ritual system, and the theory that man is an integral part of nature. It has had an important influence on the color codes used in libraries, on political life, and on Chinese language culture. In short, documents and society are related to each other.
Keywords: classical document; document classification; culture; history; color
Enhancing Domain Knowledge with Semantic Models of Web Documents
11
Author: Anna Rozeva. Source: Journal of Mathematics and System Science, 2013, No. 7, pp. 319-326 (8 pages)
The paper considers the problem of semantic processing of web documents by designing an approach that combines an extracted semantic document model with a domain-related knowledge base. The knowledge base is populated with learnt classification rules that categorize documents into topics. Classification reduces the dimensionality of the document feature space. The semantic model of retrieved web documents is semantically labeled by querying a domain ontology and is processed with a content-based classification method. The resulting model is mapped to the existing knowledge base by an inference algorithm, which enables models of the same semantic type to be recognized and integrated into the knowledge base. The approach supports domain knowledge integration and assists in extracting and modeling the semantics of web documents. Implementation results of the proposed approach are presented.
Keywords: semantic model; knowledge base; document classification; domain ontology; knowledge integration
An improved TF-IDF approach for text classification (cited by: 5)
12
Authors: 张云涛, 龚玲, 王永成. Source: Journal of Zhejiang University-Science A (Applied Physics & Engineering) (SCIE, EI, CAS, CSCD), 2005, No. 1, pp. 49-55 (7 pages)
This paper presents an improved term frequency/inverse document frequency (TF-IDF) approach that uses confidence, support, and characteristic words to enhance the recall and precision of text classification. Synonyms defined by a lexicon are processed in the improved TF-IDF approach. We discuss and analyze in detail the relationship among confidence, recall, and precision. Experiments on science and technology texts gave promising results: the new TF-IDF approach improves the precision and recall of text classification compared with the conventional TF-IDF approach.
Keywords: term frequency/inverse document frequency (TF-IDF); text classification; confidence; support; characteristic words
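For readers unfamiliar with the baseline that the abstract above improves on, here is a minimal Python sketch of one common form of conventional TF-IDF weighting. It does not implement the paper's improved approach, since the abstract does not specify how confidence, support, and characteristic words enter the formula. The particular tf and idf definitions and the toy documents are assumptions, and each document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical toy corpus of tokenized documents.
docs = [
    ["text", "classification", "method"],
    ["text", "retrieval", "method"],
    ["image", "classification"],
]
w = tf_idf(docs)

# "retrieval" appears in one document only, so it outweighs "text",
# which appears in two of the three documents.
print(w[1]["retrieval"] > w[1]["text"])
```

Terms concentrated in few documents receive higher weights, which is the property that makes TF-IDF useful as a classification feature.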
Word Net-based lexical semantic classification for text corpus analysis
13
Authors: 龙军, 王鲁达, 李祖德, 张祖平, 杨柳. Source: Journal of Central South University (SCIE, EI, CAS, CSCD), 2015, No. 5, pp. 1833-1840 (8 pages)
Many text classification methods depend on statistical term measures for document representation. Such representations ignore the lexical semantic content of terms and the distilled mutual information, leading to text classification errors. This work proposes a document representation method, the WordNet-based lexical semantic VSM, to solve the problem. Using WordNet, the method constructs a data structure of semantic-element information to characterize lexical semantic content, and adjusts EM modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, a lexical-semantic eigenvector for document representation is built by calculating the weight of each synset and is applied to the widely recognized NWKNN algorithm. On the Reuters-21578 text corpus and an adjusted version with lexical replacement, the experimental results show that the lexical-semantic eigenvector performs better in F1 measure and in scale of dimension than the term-statistic eigenvector based on TF-IDF. The way these document representation eigenvectors are formed gives the method broad prospects for classification applications in text corpus analysis.
Keywords: document representation; lexical semantic content; classification; eigenvector
A New Wavelet-Based Document Image Segmentation Scheme
14
Authors: 赵健, 李道京, 俞卞章, 耿军平. Source: Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2002, No. 3, pp. 86-90 (5 pages)
Document image segmentation is very useful for printing, faxing, and data processing. An algorithm is developed for segmenting and classifying document images. The feature used for classification is based on the histogram distribution pattern of the different image classes. The key attribute of the algorithm is its use of the wavelet correlation image to enhance the pattern of the raw image, which improves classification accuracy. In this paper a document image is divided into four types: background, photo, text, and graph. First, the document image background is distinguished easily by a conventional method. Second, the three remaining image types are distinguished by their typical histograms; to make the histogram features clearer, the HH wavelet subimage at each resolution is added to the raw image at that resolution. Finally, photo, text, and graph are separated according to how well the features fit the Laplacian distribution by 2 and L . Simulations show that classification accuracy is significantly improved. A comparison with related work shows that the algorithm provides both lower classification error rates and better visual results.
Keywords: document image; segmentation; classification; wavelet; histogram
Meaningful String Extraction Based on Clustering for Improving Webpage Classification
15
Authors: Chen Jie, Tan Jianlong, Liao Hao, Zhou Yanquan. Source: China Communications (SCIE, CSCD), 2012, No. 3, pp. 68-77 (10 pages)
Webpage classification differs from traditional text classification in its irregular words and phrases and its massive, unlabeled features, which make it harder to obtain effective features. To cope with this problem, we propose two scenarios for extracting meaningful strings, based on document clustering and on term clustering, with multiple strategies to optimize a Vector Space Model (VSM) and thereby improve webpage classification. The results show that document clustering works better than term clustering in coping with document content. However, a better overall performance is obtained by spectral clustering with document clustering. Moreover, because images appear in the same webpage as document content, the proposed method is also applied to extract meaningful terms for images, and the experimental results show that this also improves webpage classification.
Keywords: webpage classification; meaningful string extraction; document clustering; term clustering; K-means; spectral clustering
Blockchain Technology Based Information Classification Management Service
16
Authors: Gi-Wan Hong, Jeong-Wook Kim, Hangbae Chang. Source: Computers, Materials & Continua (SCIE, EI), 2021, No. 5, pp. 1489-1501 (13 pages)
Hyper-connectivity in Industry 4.0 has resulted not only in a rapid increase in the amount of information but also in the expansion of the areas and assets to be protected. In terms of information security, it has led to an enormous economic cost because of the many and varied security solutions used to protect the increased assets. It has also made those issues difficult to manage, for reasons such as mutual interference and countless security events and log data. Within this security environment, an organization should identify and classify assets based on the value of data and their security perspective, and then apply protection measures appropriate to each asset's security classification for effective security management. However, difficulties still stem from the need to manage numerous security solutions in order to protect the classified assets. In this paper, we propose an information classification management service based on blockchain, which presents and uses a model of the value of data and the security perspective. It records the transactions of classifying assets and of managing assets by class in a blockchain distributed ledger. By using blockchain, the proposed service reduces the assets to be protected and the security solutions to be applied, and provides security measures at the platform level rather than through individual security solutions. In the rapidly changing security environment of Industry 4.0, the proposed service enables economical security, provides a new integrated security platform, and demonstrates service value.
Keywords: information classification; data integrity; document security; blockchain; CIA
Incrementally Exploiting Sentential Association for Email Classification
17
Authors: 李曲, 何玉, 冯剑琳, 冯玉才. Source: Journal of Southwest Jiaotong University (English Edition), 2006, No. 2, pp. 129-134 (6 pages)
A novel association-based algorithm, EmailInClass, is proposed for incremental Email classification. Because the basic semantic unit in an Email is actually a sentence, and the words within the same sentence are typically more semantically related than words that merely appear in the same Email, EmailInClass views a sentence rather than an Email as a transaction. Extensive experiments on the benchmark Enron corpus show that EmailInClass is more effective than non-incremental alternatives such as NaiveBayes and SAT-MOD. In addition, the classification rules generated by EmailInClass are human-readable and revisable.
Keywords: document frequent itemset; category frequent itemset; MODFIT heuristic; category prefix-tree; incremental classification
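The central idea in the abstract above is to treat each sentence, rather than each whole Email, as a transaction for association mining. The Python sketch below illustrates only that idea and is not the EmailInClass algorithm. The sentence splitter, the restriction to word pairs, and the sample text are simplifying assumptions.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_transactions(email_text):
    """Split an email into sentences; each sentence becomes a set of lowercase words."""
    sentences = re.split(r"[.!?]+", email_text)
    return [set(re.findall(r"[a-z']+", s.lower())) for s in sentences if s.strip()]


def pair_support(transactions):
    """Count in how many transactions each unordered word pair co-occurs."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts


# Hypothetical sample email text.
email = "Please review the budget. The budget meeting is Monday. See you Monday!"
tx = sentence_transactions(email)
support = pair_support(tx)

# "budget" and "the" share two sentences; "monday" and "please" share none,
# although both occur in the same email.
print(len(tx), support[("budget", "the")], support[("monday", "please")])
```

With whole-email transactions, "monday" and "please" would count as co-occurring. Sentence-level transactions drop such weaker associations, which matches the motivation stated in the abstract.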
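Some Thoughts on Strengthening the Rule-of-Law Framework for the Protection of Ancient Books in China (cited by: 1)
18
Authors: 张若冰, 邱奉捷, 赵文友, 胡平. Source: Journal of the National Library of China (《国家图书馆学刊》) (CSSCI, PKU Core), 2024, No. 1, pp. 4-12 (9 pages)
Protecting and making good use of ancient books is of great significance for continuing the Chinese cultural lineage, promoting the national spirit, strengthening the country's cultural soft power, and building a strong socialist culture. In recent years the state has attached great importance to work on ancient books, and legislation on their protection has made some progress. Problems remain, however: laws and regulations on cultural relics protection cannot be fully applied to ancient books, existing provisions on ancient books are either broad or of low legal force, and many practical problems still require legal safeguards. The authors therefore recommend choosing administrative regulations as the entry point for legislation on ancient book protection and promoting the early issuance of a Regulation on the Protection of Ancient Books (《古籍保护条例》); balancing its relationship with existing laws, regulations, and policy documents; and legislating in response to the actual needs and problems of protecting and using ancient books. 2 tables. 14 references.
Keywords: ancient book protection; laws and regulations; policy documents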
Chinese Sentiment Classification Using Extended Word2Vec
19
Authors: 张胜, 张鑫, 程佳军, 王晖. Source: Journal of Donghua University (English Edition) (EI, CAS), 2016, No. 5, pp. 823-826 (4 pages)
Sentiment analysis is increasingly important in modern natural language processing, and sentiment classification is one of its most popular applications. The crucial part of sentiment classification is feature extraction. In this paper, two approaches to feature extraction, feature selection and feature embedding, are compared, and Word2Vec is used as the embedding method. In the experiment, Chinese documents are used as the corpus, and three methods are used to obtain the features of a document: average word vectors, Doc2Vec, and weighted average word vectors. These samples are then fed to three machine learning algorithms for classification, of which the support vector machine (SVM) gives the best result. Finally, the parameters of random forest are analyzed.
Keywords: embedding document segmentation dimensionality suffers projection latter classify preprocessing probabilistic
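Two of the document features named in the abstract above are the average word vector and the weighted average word vector. The Python sketch below shows both on toy data. The function name, the two-dimensional vectors, and the weights are hypothetical. The abstract does not say how the weights are chosen, so the values here stand in for whatever scheme the paper uses, and out-of-vocabulary tokens are simply skipped.

```python
def average_vector(tokens, vectors, weights=None):
    """Average the vectors of known tokens, with optional per-token weights."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary tokens
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]


# Hypothetical 2-dimensional embeddings and token weights.
vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
weights = {"good": 3.0, "movie": 1.0}

plain = average_vector(["good", "movie", "unknown"], vectors)
weighted = average_vector(["good", "movie", "unknown"], vectors, weights)
print(plain, weighted)
```

The resulting fixed-length vector can be passed to any standard classifier. In practice the embeddings would come from a trained Word2Vec model rather than a hand-written dictionary.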
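A Short-Text Classification Method Based on a Word-Topic-Document Heterogeneous Network
20
Authors: 徐涛, 赵星甲, 卢敏. Source: Computer Applications and Software (《计算机应用与软件》) (PKU Core), 2024, No. 1, pp. 146-152, 182 (8 pages)
To address the problem that existing classification methods consider neither the semantic relatedness of long-distance words nor the sharing of latent topics among texts, a short-text classification method based on a word-topic-document heterogeneous network (WTDHN) is proposed. Word2vec is used to train contextual semantic vectors for words. A word-relatedness matrix is built so that ample word co-occurrence information strengthens the semantics of short texts at each level. A heterogeneous network with words, topics, and documents as nodes is then constructed, and graph convolution is used to learn high-order neighborhood information among the nodes, enriching short-text semantics. Compared with baseline classification models, the method improves classification accuracy by 1.56% on average across five public short-text datasets.
Keywords: word-topic-document heterogeneous network; word co-occurrence; document-topic distribution; short-text classification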