{"success":true,"database":"eegdash","data":{"_id":"6953f4249276ef1ee07a33a5","dataset_id":"ds004952","associated_paper_doi":null,"authors":["Xinyu Mou","Cuilin He","Liwei Tan","Junjie Yu","Huadong Liang","Jianyu Zhang","Tian Yan","Yu-Fang Yang","Ting Xu","Qing Wang","Miao Cao","Zijiao Chen","Chuan-Peng Hu","Xindi Wang","Quanying Liu","Haiyan Wu"],"bids_version":"1.7.0","contact_info":["Haiyan Wu"],"contributing_labs":null,"data_processed":true,"dataset_doi":"doi:10.18112/openneuro.ds004952.v1.2.2","datatypes":["eeg"],"demographics":{"subjects_count":10,"ages":[],"age_min":null,"age_max":null,"age_mean":null,"species":null,"sex_distribution":{"f":5,"m":5},"handedness_distribution":{"r":10}},"experimental_modalities":null,"external_links":{"source_url":"https://openneuro.org/datasets/ds004952","osf_url":null,"github_url":null,"paper_url":null},"funding":[],"ingestion_fingerprint":"424e77775e7758736c3eef75a58bf607494d5fe4707e8c337704a878d1861fc2","license":"CC0","n_contributing_labs":null,"name":"ChineseEEG: A Chinese Linguistic Corpora EEG Dataset for Semantic Alignment and Neural Decoding","readme":"# ChineseEEG: A Chinese Linguistic Corpora EEG Dataset for Semantic Alignment and Neural Decoding\n## Introduction\n\"ChineseEEG\" (Chinese Linguistic Corpora EEG Dataset) contains high-density EEG data and simultaneous eye-tracking data recorded from 10 participants, each silently reading Chinese text for about 11 hours. This dataset further comprises pre-processed EEG sensor-level data generated under different parameter settings, offering researchers a diverse range of selections. 
Additionally, we provide embeddings of the Chinese text materials encoded with the BERT-base-chinese model, a pre-trained NLP model for Chinese, to aid researchers in exploring the alignment between text embeddings from NLP models and brain information representations.\n## Participant Overview\nIn total, data from 10 participants were used (aged 18-24 years, mean 20.68 years; 5 males and 5 females). No participants reported neurological or psychiatric history. All participants were right-handed and had normal or corrected-to-normal vision.\n## Experiment Materials\nThe experimental materials consist of two novels in Chinese, both in the genre of children's literature. The first is **The Little Prince** and the second is **Garnett Dream**.\nFor **The Little Prince**, the preface was used as material for the practice reading phase. The main body of the novel was then used for seven sessions in the formal reading phase. The first six sessions each included 4 chapters of the novel, while the seventh session included the last two chapters.\nFor **Garnett Dream**, the first 18 chapters were used for 18 sessions in the formal reading stage, with each session including a complete chapter.\nTo properly present the text on the screen during the experiments, the content of each session was segmented into a series of units, each containing no more than 10 Chinese characters. These segmented contents were saved in Excel (.xlsx) format for subsequent usage. During the experiment, three adjacent units from each session's content were displayed on the screen in three separate lines, with the middle line highlighted for the participant to read.\nIn summary, a total of 115,233 characters (24,324 in **The Little Prince** and 90,909 in **Garnett Dream**), of which 2,985 were unique, were used as experimental stimuli in the ChineseEEG dataset.\nThe original and segmented novels are saved in the `derivatives/novels` folder. 
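The 10-character segmentation rule described above can be sketched as follows (a hypothetical illustration, not the authors' actual code, which is available in the GitHub repository referenced in this README; here each unit is cut after 10 Chinese characters, with punctuation carried along and not counted toward the limit):

```python
# Hypothetical sketch of the segmentation rule: units of at most
# 10 Chinese characters; punctuation does not count toward the limit.
import unicodedata

def is_cjk(ch):
    # True for CJK unified ideographs (i.e., Chinese characters).
    return 'CJK UNIFIED' in unicodedata.name(ch, '')

def segment(text, max_chars=10):
    units, current, count = [], [], 0
    for ch in text:
        if is_cjk(ch) and count == max_chars:
            units.append(''.join(current))
            current, count = [], 0
        current.append(ch)
        if is_cjk(ch):
            count += 1
    if current:
        units.append(''.join(current))
    return units

units = segment('小王子住在一颗很小的星球上，他每天都会看日落。')
assert all(sum(map(is_cjk, u)) <= 10 for u in units)
```

During the experiment, three consecutive units like these would be presented as the three on-screen lines.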
The `segmented_novel` folder in the `novels` folder contains two types of Excel files: files whose names end with \"display\" and files without this suffix. The former store the segmented units; the latter store units that have been reassembled according to the experimental presentation format. The files ending with \"display\" are used by the presentation code to drive stimulus display during the experiment.\nThe code for generating these two types of files, as well as the code for experimental presentation, can be found in the GitHub repository: https://github.com/ncclabsustech/Chinese_reading_task_eeg_processing.\n## Experiment Procedures\nParticipants were asked to read a novel while keeping their heads still and their gaze on the highlighted (red) Chinese characters moving across the screen, reading at a pace set by the program. They read an entire novel in multiple runs within a single session. Each run was divided into two phases: an eye-tracker calibration phase and a reading phase.\nThe calibration phase came at the beginning of each run and required participants to fixate a point that appeared sequentially at the four corners and the center of the screen.\nIn the reading phase, the screen initially displayed the serial number of the current chapter. Subsequently, the text appeared with three lines per page, with each line containing no more than ten Chinese characters (excluding punctuation). On each page, the middle line was highlighted as the focal point, while the upper and lower lines were displayed at reduced intensity as background. 
Each character in the middle line was sequentially highlighted in red for 0.35 s, and participants read the novel content following the highlighted cues.\nFor detailed information about the experiment settings and procedures, please refer to our paper at https://doi.org/10.1101/2024.02.08.579481.\n## Markers\nTo precisely co-register EEG segments with individual characters during the experiment, we marked the EEG data with triggers.\n- EYES: Eyetracker starts to record\n- EYEE: Eyetracker stops recording\n- CALS: Eyetracker calibration starts\n- CALE: Eyetracker calibration stops\n- BEGN: EGI starts to record\n- STOP: EGI stops recording\n- CHxx: Beginning of a specific chapter (numbers correspond to chapters)\n- ROWS: Beginning of a row\n- ROWE: End of a row\n- PRES: Beginning of the preface\n- PREE: End of the preface\n## Data Record\nThe raw EEG data has a sampling rate of 1 kHz, while the filtered and pre-processed data have a sampling rate of 256 Hz.\n### Data Structure\nThe dataset is organized following the EEG-BIDS specification using the MNE-BIDS package. The dataset contains the standard BIDS files, 10 participants’ data folders, and a derivatives folder. The stand-alone files offer an overview of the dataset: i) dataset_description.json is a JSON file describing the dataset, such as its name, type, and authors; ii) participants.tsv contains participants’ information, such as age, sex, and handedness; iii) participants.json describes the column attributes in participants.tsv; iv) README.md contains a detailed introduction of the dataset.\nEach participant’s folder contains two folders named ses-LittlePrince and ses-GarnettDream, which store the data of this\nparticipant reading the two novels, respectively. Each of the two folders contains an eeg folder and one file sub-xx_scans.tsv. The tsv\nfile contains information about the scanning time of each file. 
The eeg folder contains the raw source EEG data for several runs,\nalong with channel and marker-event files. Each run includes an eeg.json file, which encompasses detailed information for that run,\nsuch as the sampling rate and the number of channels. Events are stored in events.tsv with onset and event ID. The EEG data\nis converted from raw metafile format (.mff file) to BrainVision format (.vhdr, .vmrk and .eeg files) since EEG-BIDS does not\nofficially support the .mff format.\nThe derivatives folder contains six folders: eyetracking_data, filtered_0.5_80, filtered_0.5_30, preproc, novels, and text_embeddings. The eyetracking_data folder contains all the eye-tracking data. Each eye-tracking recording is packaged as a\n.zip file, with eye-movement trajectories and parameters such as the sampling rate saved in separate files. The filtered_0.5_80\nfolder and filtered_0.5_30 folder contain data processed up to the 0.5-80 Hz and 0.5-30\nHz band-pass filtering step, respectively. This data is suitable for researchers who want to customize\nsubsequent pre-processing steps such as ICA and re-referencing. The preproc folder contains minimally\npre-processed EEG data that has been processed using the whole pre-processing pipeline. It includes four additional types of files\ncompared to the participants’ raw data folders in the root directory: i) bad_channels.json contains bad channels marked during\nthe bad-channel rejection phase. ii) ica_components.npy stores the values of all independent components from the ICA phase. iii)\nica_components.json lists the independent components excluded in ICA (the ICA random seed is fixed, allowing for\nreproducible results). iv) ica_components_topography.png shows the topographic maps of all independent components,\nwhere the excluded components are labeled in grey. The novels folder contains the original and segmented text stimuli materials. 
The original novels are saved in .txt format and the segmented novels corresponding to each experimental run are saved in Excel (.xlsx) files. The text_embeddings folder contains embeddings of the two novels. The embeddings corresponding to each experimental run are stored in NumPy (.npy) files.\nFor an overview of the structure, please refer to our paper at https://doi.org/10.1101/2024.02.08.579481.\n### Pre-processing\nFor the pre-processed data in the derivatives folder, we performed only minimal pre-processing to retain as much useful information as possible. The pre-processing steps include data segmentation, downsampling, filtering, bad-channel interpolation, ICA, and average re-referencing.\nDuring the data segmentation phase, we retained only data from the formal reading phase of the experiment. Based on the\nevent markers during the data collection phase, we segmented the data, removing sections irrelevant to the formal experiment\nsuch as calibration and preface reading. To minimize the impact of subsequent filtering steps on the beginning and end of the\nsignal, an additional 10 seconds of data was retained before the start of the formal reading phase. Subsequently, the signal was\ndownsampled to 256 Hz.\nFollowing this, a 50 Hz notch filter was applied to remove powerline noise from the signal. Next, we applied a\nband-pass overlap-add FIR filter to the signal to eliminate low-frequency direct-current components and high-frequency\nnoise. Two versions of filtered data are offered: the first has a filter band of 0.5-80 Hz and the second has\na filter band of 0.5-30 Hz. Researchers can choose the appropriate version based on their specific needs. After filtering, we\ninterpolated bad channels.\nIndependent Component Analysis (ICA) was then applied to the data, utilizing the infomax algorithm. 
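The downsampling and filtering steps described above can be sketched as follows (an illustration using SciPy on fake data, not the authors' actual pipeline; the FIR filter length and notch quality factor here are assumptions chosen for demonstration):

```python
# Sketch of the described chain on fake data: downsample 1 kHz -> 256 Hz,
# apply a 50 Hz notch for powerline noise, then a 0.5-80 Hz band-pass
# FIR filter applied via overlap-add convolution.
import numpy as np
from scipy import signal

fs_raw, fs_new = 1000, 256
rng = np.random.default_rng(0)
eeg = rng.standard_normal(fs_raw * 10)  # 10 s of fake single-channel EEG

# 1) Polyphase downsampling to 256 Hz (includes anti-alias filtering)
eeg_ds = signal.resample_poly(eeg, fs_new, fs_raw)

# 2) 50 Hz notch filter (Q=30 is an assumed quality factor)
b, a = signal.iirnotch(w0=50, Q=30, fs=fs_new)
eeg_notch = signal.filtfilt(b, a, eeg_ds)

# 3) 0.5-80 Hz band-pass FIR (1025 taps assumed), applied by overlap-add
taps = signal.firwin(1025, [0.5, 80], pass_zero=False, fs=fs_new)
eeg_bp = signal.oaconvolve(eeg_notch, taps, mode='same')

assert eeg_bp.shape == (fs_new * 10,)  # 2560 samples at 256 Hz
```

The 0.5-30 Hz variant would only change the band edges passed to the FIR design.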
The number of independent components was set to 20, ensuring that they capture the majority of the information while not being so numerous as to increase the burden of manual processing. We excluded obvious noise components such as electrooculography (EOG) and electrocardiography (ECG) artifacts. Finally, the data was re-referenced using the average method.\nDetailed information about the pre-processing can be found in our paper at https://doi.org/10.1101/2024.02.08.579481.\n### Text Embeddings\nThe dataset provides embeddings of the two novels calculated using the pre-trained language model BERT-base-Chinese. During the experimental procedure, each displayed line of text contains n Chinese characters. The BERT-base-Chinese model processes these n Chinese characters, yielding an embedding of size (n, 768), where n represents the number of Chinese characters, and 768 the dimensionality of the embedding. To ensure that displayed lines of varying length have embeddings of the same shape, the embeddings are averaged over the first dimension to standardize the embedding size to (1, 768) for each instance.\n### Missing Data\nDue to technical reasons, some raw data were lost:\n- EEG:\n  - Sub-09 ses-LittlePrince run 1-3\n  - Sub-14 ses-GarnettDream run 9\n  - Sub-15 ses-GarnettDream run 12\n  - Sub-07 ses-GarnettDream run 18 (substituted by run 19)\n      Notice: Sub-07 ses-GarnettDream run 19 read chapter 19 of Garnett Dream instead of chapter 18.\n- Eyetracking data:\n  - Sub-08 ses-LittlePrince run 1-2\nDue to poor data quality or other reasons, some pre-processed data were lost:\n- In the 0.5-30 Hz filtered version:\n  - Sub-15 ses-LittlePrince run 1-7\n  - Sub-15 ses-GarnettDream run 1, 2, 3, 7, 10, 16\n- In the 0.5-80 Hz filtered version:\n  - Sub-13 ses-LittlePrince run 4, 5\n  - Sub-15 ses-LittlePrince run 1-7\n  - Sub-13 ses-GarnettDream run 14\n  - Sub-15 ses-GarnettDream run 1, 2, 3, 7, 10, 16\n## Usage Note\nIf you want to know more about the dataset, including the detailed 
parameter settings in our pre-processing steps or how to align the text with EEG segments, please refer to our paper at https://doi.org/10.1101/2024.02.08.579481. You can find the relevant code for text presentation, data processing, and other functions in the GitHub repository: https://github.com/ncclabsustech/Chinese_reading_task_eeg_processing.\n## References\nAppelhoff, S., Sanderson, M., Brooks, T., van Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A., & Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software, 4(44), 1896. https://doi.org/10.21105/joss.01896\nPernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., & Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6, 103. 
https://doi.org/10.1038/s41597-019-0104-8","recording_modality":["eeg"],"senior_author":"Haiyan Wu","sessions":["GarnettDream","LittlePrince"],"size_bytes":222393779876,"source":"openneuro","study_design":null,"study_domain":null,"tasks":["reading"],"timestamps":{"digested_at":"2026-04-22T12:27:08.426683+00:00","dataset_created_at":"2024-02-07T05:08:55.557Z","dataset_modified_at":"2025-03-08T10:37:09.000Z"},"total_files":245,"storage":{"backend":"s3","base":"s3://openneuro.org/ds004952","raw_key":"dataset_description.json","dep_keys":["CHANGES","README","participants.json","participants.tsv"]},"nemar_citation_count":2,"computed_title":"ChineseEEG: A Chinese Linguistic Corpora EEG Dataset for Semantic Alignment and Neural Decoding","nchans_counts":[{"val":128,"count":245}],"sfreq_counts":[{"val":1000.0,"count":245}],"stats_computed_at":"2026-04-22T23:16:00.308749+00:00","tags":{"modality":"Visual","pathology":"Healthy","type":"Attention"},"total_duration_s":438266.677,"author_year":"Mou2024","canonical_name":null}}