Abstract: The development of science and technology has made it not only possible but very convenient for people living in different parts of the world to communicate with each other, giving rise to a new form of communication: computer-mediated communication (CMC). Text-based CMC, in which people send instant messages to others in different settings, is one of the most popular forms of CMC. Since this mode of interaction combines features of both written and spoken language (Greenfield & Subrahmanyam, 2003), it is of great interest whether it follows the same sequential rules as telephone conversation. However, compared to telephone conversations, computer-mediated communication has received much less attention, and text-based CMC less still. The existing literature mostly focuses on content analysis and linguistic features but neglects the sequential organization of such interaction (Paolillo, 1999; Greenfield & Subrahmanyam, 2003; Herring, 1999). In light of this, this paper examines the opening moves of instant message exchanges among Chinese adults in an attempt to identify the features that characterize the way they open an online chat. The framework chosen for data analysis is the sequential model proposed by Schegloff for American telephone openings.
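Abstract: Open-set text recognition (OSTR) is a new task that aims to address language-model bias, together with the recognition and rejection of novel characters, in open-environment text recognition applications. Recent OSTR methods tackle language-model bias by separating contextual information from visual information. However, these methods often overlook the importance of characters' visual details. Given the bias in contextual information, local detail becomes even more important for distinguishing visually similar characters. This paper proposes an OSTR framework based on adaptive character-part representations and builds an open-set text recognition method on a similarity measure over the local structure of text, improving the modeling of local detail features by explicitly modeling different character parts. Unlike radical-based methods, the proposed framework uses a data-driven part design, which makes it language-agnostic and capable of cross-language generalization. In addition, a locality-constraint regularization term is proposed to stabilize model training. Extensive comparative experiments show that the proposed method performs well on both open-set and conventional closed-set text recognition tasks.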
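Abstract: [Purpose/Significance] This study explores how attention is allocated in China's government open-data policies, and where that allocation falls short, in order to improve policy-making. [Method/Process] Based on 98 open-data policy texts issued by China's provincial governments from 2013 to 2022, NVivo 11 and ROST CM 6.0 were used to analyze attention allocation. [Results/Conclusions] Overall, attention in China's local government open-data policies has evolved through three stages: initial opening (2015 and earlier), emphasis on utilization (2016 to 2019), and in-depth utilization (2020 to present). Policy tools are used unevenly. Policy themes concentrate on data opening and construction, data security and safeguards, and data utilization and development, and governments prefer different policy tools at each of these levels. In using policy tools, the share of demand-side policy tools should be increased to make policy more scientific; the internal structure of tool use should be optimized and policy elements innovated; and the use of tools in the data-quality category should be deepened to overcome supply difficulties.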
Abstract: A computer vision approach based on OpenAI's CLIP, a model capable of predicting text-image pairs, is used to create an AI agent for Dixit, a game that requires creative linking between images and text. This paper calculates baseline accuracies for both the ability to match the correct image to a hint and the ability to match human preferences. A dataset created by previous work on Dixit is used for testing. CLIP is used to compare a hint against multiple images and against previous hints, achieving a final accuracy of 0.5011, which surpasses previous results.
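The hint-to-image comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the comparison is scored, so the cosine-similarity ranking below is an assumption, and the three-dimensional vectors are hypothetical stand-ins for the text and image embeddings that CLIP would produce. The helper names `cosine` and `pick_image` are invented for this illustration and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_image(hint_vec, image_vecs):
    """Return (index of the best-matching image, list of all similarity scores)."""
    scores = [cosine(hint_vec, img) for img in image_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical toy embeddings; a real agent would obtain these from CLIP's
# text encoder (for the hint) and image encoder (for each card).
hint = [0.9, 0.1, 0.0]
cards = [
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.0, 1.0],
]

best, scores = pick_image(hint, cards)
print(f"chosen card index: {best}")
```

With these toy vectors the second card (index 1) is chosen because its embedding points in nearly the same direction as the hint's. The comparison against previous hints mentioned in the abstract is not modeled here.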
Funding: Supported by the National Population and Health Scientific Data Sharing Program of China; the Knowledge Centre for Engineering Sciences and Technology (Medical Centre); and the Fundamental Research Funds for the Central Universities (Grant No. 13R0101).
Abstract: Purpose: In the open science era, it is typical to share project-generated scientific data by depositing it in an open and accessible database. Moreover, scientific publications are preserved in a digital library archive. It is challenging to identify the data usage that is mentioned in the literature and associate it with its source. Here, we investigated the data usage of a government-funded cancer genomics project, The Cancer Genome Atlas (TCGA), via a full-text literature analysis. Design/methodology/approach: We focused on identifying articles using the TCGA dataset and constructing linkages between the articles and the specific TCGA dataset. First, we collected 5,372 TCGA-related articles from PubMed Central (PMC). Second, we constructed a benchmark set of 25 full-text articles that truly used the TCGA data in their studies, and we summarized the key features of the benchmark set. Third, the key features were applied to the remaining full-text articles collected from PMC. Findings: The number of publications that use TCGA data has increased significantly since 2011, although the TCGA project was launched in 2005. Additionally, we found that the critical areas of focus in the studies that use the TCGA data were glioblastoma multiforme, lung cancer, and breast cancer; meanwhile, data from the RNA-sequencing (RNA-seq) platform is the most frequently preferred. Research limitations: The current workflow to identify articles that truly used TCGA data is labor-intensive. An automatic method is expected to improve the performance. Practical implications: This study will help cancer genomics researchers determine the latest advancements in cancer molecular therapy, and it will promote data sharing and data-intensive scientific discovery. Originality/value: Few studies have been conducted to investigate data usage by government-funded projects/programs since their launch. In this preliminary study, we extracted articles that use TCGA data from PMC, and we created a link between the full-text articles and the source data.