Hi there! I am a fourth-year Ph.D. student in Computer Science at the University of Chicago, interested in high-performance computing and autonomous laboratory research. I am a member of Globus Labs, where I am co-advised by Ian Foster and Kyle Chard. I completed my Bachelor's in Computer Science at Zhejiang University and previously worked at Google and Alibaba.
RESEARCH
Scientific discovery can be slowed by tedious assembly and tricky manual operations. The autonomous laboratory project aims to replace tasks traditionally performed by human researchers with automated systems and intelligent algorithms. I currently work with Ian Foster and Chibueze Amanchukwu to build an autonomous laboratory for manufacturing coin-cell batteries. We propose developing generative AI models to identify candidate electrolyte solvents with desired properties (high ionic conductivity, oxidative stability, and Coulombic efficiency) and deploying self-driving labs for electrolyte synthesis and for battery fabrication and testing.
Modern simulations (e.g., particle and climate simulations) can produce huge amounts of data every day. Lossy compression can significantly reduce the data size while preserving the information that matters for analysis. I work with Sheng Di on the compression project. We explore lossy compression for scientific datasets, especially those consisting of floating-point numbers. The data files are usually planar (e.g., the CESM dataset, 1800x3600) or cubic (e.g., the Nyx dataset, 512x512x512). A single extremely large file can exceed 900 GB (e.g., Turbulent Channel Flow, 10240x7680x1536), while other datasets may contain thousands of smaller files. The goal of this project is to provide a user-friendly program for compressing, transferring, and storing these huge datasets.
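To make the error-bounded idea concrete, here is a minimal Python sketch. It uses the zfp compressor's Python bindings (zfpy) purely as an off-the-shelf stand-in (our work centers on SZ), and the synthetic field and tolerance are illustrative assumptions rather than values from the datasets above.

    # Error-bounded lossy compression of a planar scientific field.
    # zfpy stands in for SZ here (pip install zfpy numpy).
    import numpy as np
    import zfpy

    # Synthetic smooth field shaped like a CESM-style 1800x3600 snapshot.
    i, j = np.meshgrid(np.arange(1800), np.arange(3600), indexing="ij")
    data = np.sin(i / 100.0) * np.cos(j / 100.0)

    # Fixed-accuracy mode: every reconstructed value stays within 1e-3.
    compressed = zfpy.compress_numpy(data, tolerance=1e-3)
    restored = zfpy.decompress_numpy(compressed)

    print(f"compression ratio: {data.nbytes / len(compressed):.1f}x")
    assert np.max(np.abs(restored - data)) <= 1e-3  # the bound holds

Smooth fields like this compress very well under a loose bound; the achievable ratio shrinks as the bound tightens, which is exactly the trade-off this project navigates.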
PUBLICATIONS
Optimizing Scientific Data Transfer on Globus with Error-Bounded Lossy Compression
Yuanjian Liu, Sheng Di, Kyle Chard, Ian Foster, Franck Cappello
ICDCS 2023
TLDR | URL | Code | Slides | BibTex | PDF
TLDR: We propose a novel data transfer framework called Ocelot that integrates error-bounded lossy compression into the Globus data transfer infrastructure. We note four key contributions: (1) Ocelot is the first integration of lossy compression in Globus to significantly improve scientific data transfer performance over wide area networks (WAN). (2) We propose an effective machine-learning-based lossy compression quality estimation model that can predict the quality of error-bounded lossy compressors, which is fundamental to ensuring that transferred data are acceptable to users. (3) We develop optimized strategies to reduce the compression time overhead, counter the compute-node waiting time, and improve transfer speed for compressed files. (4) We perform evaluations using many real-world scientific applications across different domains and distributed Globus endpoints. Our experiments show that Ocelot can improve dataset transfer performance substantially, and that the quality of lossy compression (time, ratio, and data distortion) can be predicted accurately for the purpose of quality assurance. (A simplified sketch of the compress-or-not decision follows the BibTex entry below.)
@INPROCEEDINGS{10272494, author={Liu, Yuanjian and Di, Sheng and Chard, Kyle and Foster, Ian and Cappello, Franck}, booktitle={2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS)}, title={Optimizing Scientific Data Transfer on Globus with Error-Bounded Lossy Compression}, year={2023}, pages={703-713}, keywords={Wide area networks;Quality assurance;Estimation;Distributed databases;Data visualization;Machine learning;Predictive models;Lossy Compression;Performance;Data Transfer;Globus;WAN}, doi={10.1109/ICDCS57875.2023.00064}}
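A back-of-the-envelope version of Ocelot's compress-or-not decision can be sketched as follows. The real system relies on a trained machine-learning model for the predictions and pipelines the stages to hide overheads; the closed-form estimate, function name, and numbers below are simplifying assumptions for illustration only.

    # Decide whether to compress a dataset before a WAN transfer.
    # The predicted ratio and throughputs stand in for Ocelot's ML estimates.
    def should_compress(size_bytes, wan_bw, pred_ratio, comp_bw, decomp_bw):
        """True if compress -> transfer -> decompress beats sending raw bytes."""
        raw_time = size_bytes / wan_bw
        pipeline_time = (
            size_bytes / comp_bw                  # compress at the source
            + (size_bytes / pred_ratio) / wan_bw  # move fewer bytes over the WAN
            + size_bytes / decomp_bw              # decompress at the destination
        )
        return pipeline_time < raw_time

    GB = 1e9  # 100 GB dataset, 0.5 GB/s WAN, predicted 10x ratio, 2 GB/s codecs
    print(should_compress(100 * GB, 0.5 * GB, 10.0, 2 * GB, 2 * GB))  # True

Because a mispredicted error bound can render transferred data scientifically unusable, the quality-estimation model matters as much as the speed estimates.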
FastqZip: An Improved Reference-Based Genome Sequence Lossy Compression Framework
Yuanjian Liu, Huihao Luo, Zhijun Han, Yao Hu, Yehui Yang, Kyle Chard, Sheng Di, Ian Foster, Jiesheng Wu |
Preprint
TLDR | PDF
TLDR: Storing and archiving the data produced by next-generation sequencing (NGS) is a huge burden for research institutions. Reference-based compression algorithms are effective for these data. Our work focuses on compressing FASTQ-format files with an improved reference-based compression algorithm to achieve a higher compression ratio than other state-of-the-art algorithms. We propose FastqZip, which uses a new method for mapping sequences to a reference, allows read reordering and lossy quality scores, and applies BSC or ZPAQ for the final lossless compression stage to achieve a higher compression ratio at relatively fast speed.
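The core transform behind reference-based compression can be shown with a toy sketch: a read that aligns to the reference is stored as a position plus its mismatches instead of its full sequence. The brute-force ungapped search below is a deliberate simplification, not FastqZip's actual mapping algorithm, which adds read reordering, lossy quality scores, and a BSC/ZPAQ backend.

    # Toy reference-based read encoding: keep (position, mismatches) only.
    def encode_read(read, reference):
        """Find the best ungapped placement and record only the differences."""
        best = None
        for pos in range(len(reference) - len(read) + 1):
            window = reference[pos:pos + len(read)]
            diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
            if best is None or len(diffs) < len(best[1]):
                best = (pos, diffs)
        return best  # (position, [(offset, substituted base), ...])

    ref = "ACGTACGTTAGCCGAT"
    print(encode_read("ACGTTAGG", ref))  # (4, [(7, 'G')]) -- one mismatch

Storing a position and a short diff list is far cheaper than storing the read itself, and the residual streams are highly compressible by a lossless backend.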
Optimizing Error-Bounded Lossy Compression for Scientific Data With Diverse Constraints
Yuanjian Liu, Sheng Di, Kai Zhao, Sian Jin, Cheng Wang, Kyle Chard, Dingwen Tao, Ian Foster, Franck Cappello |
TPDS 2022
TLDR | URL | Code | Slides | BibTex | PDF
TLDR: Many scientific applications have specific requirements or constraints for lossy compression, in order to guarantee that the reconstructed data are valid for post hoc analysis. We handle lossy compression under several such constraints, including ignoring irrelevant data, applying different error bounds to different value ranges, and preserving diverse precision across multiple regions. Experiments with six real-world applications show that our diverse-constraints-based error-bounded lossy compressor obtains higher visual quality or data fidelity on the reconstructed data with the same or even higher compression ratios compared with the traditional state-of-the-art compressor SZ. (A small numpy sketch of the multi-range idea follows the BibTex entry below.)
@ARTICLE{9844293, author={Liu, Yuanjian and Di, Sheng and Zhao, Kai and Jin, Sian and Wang, Cheng and Chard, Kyle and Tao, Dingwen and Foster, Ian and Cappello, Franck}, journal={IEEE Transactions on Parallel and Distributed Systems}, title={Optimizing Error-Bounded Lossy Compression for Scientific Data With Diverse Constraints}, year={2022}, volume={33}, number={12}, pages={4440-4457}, keywords={Data models;Compressors;Quantization (signal);Predictive models;Analytical models;Encoding;Dark matter;Big data;error-bounded lossy compression;data reduction;large-scale scientific simulation}, doi={10.1109/TPDS.2022.3194695}}
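The multi-range idea can be made concrete in a few lines of numpy: SZ-style linear-scale quantization with bin width 2*eb guarantees a pointwise error of at most eb, so a tight bound inside a range of interest and a loose bound elsewhere preserves fidelity where it matters while leaving more room for compression. The ranges and bounds below are illustrative assumptions, not settings from the paper.

    # Two error bounds on one field: tight inside the range of interest,
    # loose elsewhere. Uniform bins of width 2*eb keep |x - x'| <= eb.
    import numpy as np

    def quantize(x, eb):
        """Reconstruction of x after uniform quantization with error bound eb."""
        return np.round(x / (2 * eb)) * (2 * eb)

    data = np.random.uniform(0.0, 1.0, 1_000_000)
    interest = (data >= 0.4) & (data <= 0.6)  # values the analysis cares about

    recon = np.where(interest,
                     quantize(data, 1e-5),    # tight bound in the key range
                     quantize(data, 1e-2))    # loose bound everywhere else

    assert np.abs(recon - data)[interest].max() <= 1e-5
    assert np.abs(recon - data)[~interest].max() <= 1e-2

In SZ, the quantization codes then pass through entropy coding, so the coarser bins outside the region of interest translate directly into a higher compression ratio.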
Optimizing Multi-Range based Error-Bounded Lossy Compression for Scientific Datasets
Yuanjian Liu, Sheng Di, Kai Zhao, Sian Jin, Cheng Wang, Kyle Chard, Dingwen Tao, Ian Foster, Franck Cappello |
HiPC 2021
TLDR | URL | BibTex | PDF
TLDR: Existing state-of-the-art error-bounded lossy compressors, however, do not support multi-range-based error bounds in lossy compression, leaving a critical gap that hampers their effective use in practice. In this work, we address this issue by proposing a multi-range-based error-bounded lossy compressor built on the state-of-the-art SZ lossy compressor. Our approach allows users to set different error bounds in different value ranges for a compression task.
@INPROCEEDINGS{9680367, author={Liu, Yuanjian and Di, Sheng and Zhao, Kai and Jin, Sian and Wang, Cheng and Chard, Kyle and Tao, Dingwen and Foster, Ian and Cappello, Franck}, booktitle={2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC)}, title={Optimizing Multi-Range based Error-Bounded Lossy Compression for Scientific Datasets}, year={2021}, pages={394-399}, keywords={Visualization;High performance computing;Conferences;Bandwidth;Big Data;Distortion;Compressors}, doi={10.1109/HiPC53243.2021.00036}}
Understanding Effectiveness of Multi-Error-Bounded Lossy Compression for Preserving Ranges of Interest in Scientific Analysis
Yuanjian Liu, Sheng Di, Kai Zhao, Sian Jin, Cheng Wang, Kyle Chard, Dingwen Tao, Ian Foster, Franck Cappello |
DRBSD-7 2021
TLDR | URL | BibTex | PDF
TLDR: Lossy compression frameworks have been proposed as a method to reduce the size of data produced by scientific simulations. However, they do so at the expense of precision, and existing compressors apply a single error bound across the entire dataset. Varying the precision across user-specified ranges of scalar values appears to be a promising approach to further improve compression ratios while retaining precision in specific areas of interest. In this work, we investigate a specific compression method, based on the SZ framework, that can set multiple error bounds. We evaluate its effectiveness by applying it to real-world datasets that have concrete precision requirements. Our results show that multi-error-bounded lossy compression can improve the compression ratio by 15% with negligible overhead in compression time.
@INPROCEEDINGS{9652577, author={Liu, Yuanjian and Di, Sheng and Zhao, Kai and Jin, Sian and Wang, Cheng and Chard, Kyle and Tao, Dingwen and Foster, Ian and Cappello, Franck}, booktitle={2021 7th International Workshop on Data Analysis and Reduction for Big Scientific Data (DRBSD-7)}, title={Understanding Effectiveness of Multi-Error-Bounded Lossy Compression for Preserving Ranges of Interest in Scientific Analysis}, year={2021}, pages={40-46}, keywords={Data analysis;Conferences;Data models;Compressors}, doi={10.1109/DRBSD754563.2021.00010}}