Hi there! I am a fifth-year Ph.D. student in Computer Science at the University of Chicago, interested in high-performance computing and autonomous laboratory research. I am a member of Globus Labs, where I am co-advised by Ian Foster and Kyle Chard. I completed my Bachelor's in Computer Science at Zhejiang University and previously worked at Google and Alibaba.
RESEARCH
Scientific discovery can be slowed by tedious assembly and tricky manual operations. The autonomous laboratory project aims to replace tasks traditionally performed by human researchers with automated systems and intelligent algorithms. I currently work with Ian Foster and Chibueze Amanchukwu to build an autonomous laboratory for manufacturing coin-cell batteries. We propose developing generative AI models to identify candidate electrolyte solvents with desired properties (high ionic conductivity, oxidative stability, and Coulombic efficiency) and deploying self-driving labs for electrolyte synthesis and for battery fabrication and testing.
Modern simulations (e.g., particle and climate simulations) can produce huge amounts of data every day. Lossy compression can significantly reduce data size while preserving the information that matters for analysis. I work with Sheng Di on the compression project. We explore lossy compression of scientific datasets, especially those consisting of floating-point numbers. The data files are usually planar (e.g., a 1800x3600 CESM field) or cubic (e.g., a 512x512x512 Nyx field). A single extremely large file can exceed 900 GB (e.g., the 10240x7680x1536 Turbulent Channel Flow dataset), while other datasets may contain thousands of smaller files. The goal of this project is to provide a user-friendly program for compressing, transferring, and storing these huge datasets.
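To give a flavor of what "error-bounded" means here, below is a toy sketch of the core quantization step, written with numpy for illustration. Real compressors such as SZ add prediction and entropy coding on top of this, so treat it as a simplified model rather than an actual implementation.

```python
import numpy as np

def quantize(data: np.ndarray, eb: float) -> np.ndarray:
    # Map each value to an integer bin of width 2*eb; rounding to the
    # nearest bin center keeps the reconstruction error within +/- eb.
    return np.round(data / (2 * eb)).astype(np.int64)

def dequantize(codes: np.ndarray, eb: float) -> np.ndarray:
    return codes * (2 * eb)

# A synthetic planar field, shaped like a 1800x3600 CESM slice.
field = np.random.rand(1800, 3600).astype(np.float32)
codes = quantize(field, eb=1e-3)
recon = dequantize(codes, eb=1e-3)
print(np.abs(field - recon).max())  # stays (up to float rounding) within 1e-3
```

The integer codes are far more repetitive than the raw floats, which is why a lossless entropy coder applied afterwards yields large compression ratios.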
PUBLICATIONS
Ocelot: An Interactive, Efficient Distributed Compression-As-a-Service Platform With Optimized Data Compression Techniques
Yuanjian Liu, Sheng Di, Jiajun Huang, Zhaorui Zhang, Kyle Chard, Ian Foster
TPDS 2025
TLDR: Large volumes of data generated by scientific simulations, genome sequencing, and other applications need to be moved among clusters for data collection and analysis. Data compression techniques have effectively reduced data storage and transfer costs, but users' requirements for interactively controlling both data quality and compression ratio are non-trivial to fulfill. We propose a novel Compression-as-a-Service (CaaS) platform called Ocelot with four important contributions: (1) it offers real-time visualization, interactive compression, and transfer of scientific datasets; (2) it incorporates new strategies for compressing diverse types of datasets more effectively than traditional methods; (3) it provides an effective method for estimating the compression ratio and execution time of compression tasks; (4) experiments on multiple real-world datasets on geographically distributed computers show that Ocelot can significantly improve data transfer efficiency, with a performance gain of more than 10x in computing clusters with relatively slow networks.
@ARTICLE{11007768,
  author={Liu, Yuanjian and Di, Sheng and Huang, Jiajun and Zhang, Zhaorui and Chard, Kyle and Foster, Ian},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  title={Ocelot: An Interactive, Efficient Distributed Compression-As-a-Service Platform With Optimized Data Compression Techniques},
  year={2025},
  pages={1-15},
  doi={10.1109/TPDS.2025.3568221}
}
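A back-of-the-envelope way to see why this pays off on slow networks: compress-then-transfer wins whenever compression time plus transfer of the smaller file plus decompression beats transferring the raw data. The function and numbers below are hypothetical, chosen only to illustrate the trade-off, not taken from the paper.

```python
def worth_compressing(size_bytes: float, bandwidth_bps: float,
                      comp_s: float, decomp_s: float, ratio: float) -> bool:
    """Compress-then-transfer wins when its end-to-end time beats raw transfer."""
    raw_transfer = size_bytes / bandwidth_bps
    with_compression = comp_s + (size_bytes / ratio) / bandwidth_bps + decomp_s
    return with_compression < raw_transfer

# 100 GB over a 1 Gb/s (125 MB/s) link, a 10x ratio, 60 s each way:
print(worth_compressing(100e9, 125e6, 60, 60, 10))  # True: ~200 s vs ~800 s
```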
Hybrid Lossy Compression Methods Can Confidently Optimize Wide Network Transfer of Complex Datasets
Yuanjian Liu
Dissertation
TLDR: Large volumes of data generated by scientific simulations, genome sequencing, and other applications need to be moved among clusters for data collection and analysis. Data compression techniques have effectively reduced data storage and transfer costs, but users' requirements for interactively controlling both data quality and compression ratio are non-trivial to fulfill, and lossy compression methods must respect several data constraints to be useful in realistic data transfer scenarios. In this thesis, I propose a novel Compression-as-a-Service (CaaS) platform called GlobaZip with five important contributions: (1) a multi-interval/multi-region compression algorithm that supports several data constraints to further limit the distortion in data fidelity even though the compression is lossy; (2) a layer-by-layer compression technique that allows a much higher parallel compression rate on HPC systems and can coordinate CPU cores on multiple compute nodes to compress extremely large files without out-of-memory errors; (3) a decision-tree-based compression performance prediction model that lets users estimate compression characteristics, including compression ratio, time, and data fidelity, with very limited computational overhead; (4) an optimized reference-based genome sequence compression algorithm that exceeds the performance of state-of-the-art algorithms by using a more fine-grained sequence alignment procedure, read reordering, a novel dominant-bitmap method for quality score compression, and a few other small optimizations; (5) a Qt5-based user-facing app that uses Globus Compute and Globus Transfer to give users a universal interface for orchestrating remote data compression and transfer. Experiments on multiple real-world datasets on geographically distributed computers show that GlobaZip can significantly improve data transfer efficiency, with a performance gain of more than 10x in computing clusters with relatively slow networks.
@phdthesis{liu2025thesis,
  author={Liu, Yuanjian},
  title={Hybrid Lossy Compression Methods Can Confidently Optimize Wide Network Transfer of Complex Datasets},
  school={University of Chicago},
  year={2025},
  month={6},
  url={http://knowledge.uchicago.edu/record/15070},
  doi={10.6082/uchicago.15070}
}
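The layer-by-layer idea in contribution (2) can be sketched as follows: stream a huge cubic file one 2D slice at a time, so memory use stays constant and independent slices can be handed to different cores or nodes. In this minimal sketch, zlib stands in for the real error-bounded compressor, and the file path and shape are hypothetical.

```python
import zlib
import numpy as np

def compress_layers(path: str, shape: tuple, dtype=np.float32) -> list:
    nz, ny, nx = shape
    layer_bytes = ny * nx * np.dtype(dtype).itemsize
    compressed = []
    with open(path, "rb") as f:
        for _ in range(nz):
            # Only one layer is ever resident in memory, so a 900 GB file
            # never needs to fit in RAM; layers are independent, so they
            # can be distributed across CPU cores or compute nodes.
            layer = f.read(layer_bytes)
            compressed.append(zlib.compress(layer))
    return compressed

# e.g., compress_layers("turbulence.f32", (1536, 7680, 10240))  # hypothetical file
```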
Optimizing Scientific Data Transfer on Globus with Error-Bounded Lossy Compression
Yuanjian Liu, Sheng Di, Kyle Chard, Ian Foster, Franck Cappello
ICDCS 2023
TLDR: We propose a novel data transfer framework called Ocelot that integrates error-bounded lossy compression into the Globus data transfer infrastructure. We note four key contributions: (1) Ocelot is the first integration of lossy compression in Globus to significantly improve scientific data transfer performance over wide area networks (WAN). (2) We propose an effective machine-learning-based lossy compression quality estimation model that can predict the quality of error-bounded lossy compressors, which is fundamental to ensuring that transferred data are acceptable to users. (3) We develop optimized strategies to reduce the compression time overhead, counter the compute-node waiting time, and improve transfer speed for compressed files. (4) We perform evaluations using many real-world scientific applications across different domains and distributed Globus endpoints. Our experiments show that Ocelot can improve dataset transfer performance substantially, and that the quality of lossy compression (time, ratio, and data distortion) can be predicted accurately for the purpose of quality assurance.
@INPROCEEDINGS{10272494,
  author={Liu, Yuanjian and Di, Sheng and Chard, Kyle and Foster, Ian and Cappello, Franck},
  booktitle={2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS)},
  title={Optimizing Scientific Data Transfer on Globus with Error-Bounded Lossy Compression},
  year={2023},
  pages={703-713},
  doi={10.1109/ICDCS57875.2023.00064}
}
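The estimation idea in contribution (2) can be sketched like this: compute cheap statistics on sampled blocks of the data and let a regression model predict the compression ratio before the compressor ever runs. The features, model, and training data below are placeholders rather than the ones from the paper, and the sketch assumes scikit-learn is available.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def block_features(block: np.ndarray) -> list:
    # Cheap statistics that correlate with compressibility: smooth,
    # low-variance blocks compress much better than noisy ones.
    d = np.diff(block.ravel())
    return [block.std(), np.abs(d).mean(), block.max() - block.min()]

# Placeholder training set: features of past blocks and their measured ratios.
X_train = np.random.rand(200, 3)
y_train = np.random.rand(200) * 50
model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

# Predict the ratio for a newly sampled block without compressing it.
new_block = np.random.rand(64, 64)
predicted_ratio = model.predict([block_features(new_block)])[0]
```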
FastqZip
Yuanjian Liu, Huihao Luo, Zhijun Han, Yao Hu, Yehui Yang, Kyle Chard, Sheng Di, Ian Foster, Jiesheng Wu
Preprint
TLDR: Storing and archiving data produced by next-generation sequencing (NGS) is a huge burden for research institutions. Reference-based compression algorithms are effective for these data. Our work focuses on compressing FASTQ files with an improved reference-based compression algorithm to achieve a higher compression ratio than other state-of-the-art algorithms. We propose FastqZip, which uses a new method for mapping sequences to a reference, allows read reordering and lossy quality scores, and applies BSC or ZPAQ for the final lossless compression stage, reaching a higher compression ratio at relatively fast speed.
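The core of reference-based compression is to record where a read maps on the reference and only the bases that differ, rather than the read itself. Here is a minimal illustration; the function and sequences are made up for this example and are far simpler than FastqZip's actual alignment procedure.

```python
def encode_read(read: str, reference: str, pos: int):
    # Store the mapped position plus (offset, base) pairs for mismatches;
    # when reads align well, this is far smaller than the raw bases.
    mismatches = [(i, base) for i, base in enumerate(read)
                  if reference[pos + i] != base]
    return pos, mismatches

ref = "ACGTACGTACGT"
print(encode_read("ACGAACGT", ref, 0))  # (0, [(3, 'A')]): one mismatch at offset 3
```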
Optimizing Error-Bounded Lossy Compression for Scientific Data With Diverse Constraints
Yuanjian Liu, Sheng Di, Kai Zhao, Sian Jin, Cheng Wang, Kyle Chard, Dingwen Tao, Ian Foster, Franck Cappello
TPDS 2022
TLDR: Many scientific applications have specific requirements or constraints for lossy compression in order to guarantee that the reconstructed data are valid for post hoc analysis. We handle lossy compression under several such constraints, including irrelevant data, different error bounds for different value ranges, and diverse precision over multiple regions. Experiments with six real-world applications show that our diverse-constraints-based error-bounded lossy compressor obtains higher visual quality or data fidelity on reconstructed data, with the same or even higher compression ratios, compared with the traditional state-of-the-art compressor SZ.
@ARTICLE{9844293,
  author={Liu, Yuanjian and Di, Sheng and Zhao, Kai and Jin, Sian and Wang, Cheng and Chard, Kyle and Tao, Dingwen and Foster, Ian and Cappello, Franck},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  title={Optimizing Error-Bounded Lossy Compression for Scientific Data With Diverse Constraints},
  year={2022},
  volume={33},
  number={12},
  pages={4440-4457},
  doi={10.1109/TPDS.2022.3194695}
}
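The "different error bounds for different value ranges" constraint can be illustrated with a small numpy sketch. This is a simplification: the actual compressor integrates range-dependent bounds into SZ's prediction-quantization pipeline, whereas the toy function below quantizes each range independently.

```python
import numpy as np

def reconstruct_multi_range(data: np.ndarray, ranges) -> np.ndarray:
    """Quantize each user-specified value range with its own error bound.
    `ranges` is a list of (lo, hi, eb) triples covering the data's values."""
    recon = np.empty_like(data)
    for lo, hi, eb in ranges:
        mask = (data >= lo) & (data < hi)
        recon[mask] = np.round(data[mask] / (2 * eb)) * (2 * eb)
    return recon

field = np.random.rand(512, 512).astype(np.float32)
# Tight 1e-4 bound in the range of interest, loose 1e-2 everywhere else:
recon = reconstruct_multi_range(field, [(0.0, 0.8, 1e-2), (0.8, 1.1, 1e-4)])
```

Values of little scientific interest tolerate coarse bins that compress extremely well, while the ranges analysts actually inspect keep near-lossless precision.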
Optimizing Multi-Range based Error-Bounded Lossy Compression for Scientific Datasets
Yuanjian Liu, Sheng Di, Kai Zhao, Sian Jin, Cheng Wang, Kyle Chard, Dingwen Tao, Ian Foster, Franck Cappello
HiPC 2021
TLDR: Existing state-of-the-art error-bounded lossy compressors do not support multi-range-based error bounds, leaving a critical gap that hampers their effective use in practice. In this work, we address this issue by proposing a multi-range-based error-bounded lossy compressor built on the state-of-the-art SZ lossy compressor. Our approach allows users to set different error bounds in different value ranges for a compression task.
@INPROCEEDINGS{9680367,
  author={Liu, Yuanjian and Di, Sheng and Zhao, Kai and Jin, Sian and Wang, Cheng and Chard, Kyle and Tao, Dingwen and Foster, Ian and Cappello, Franck},
  booktitle={2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC)},
  title={Optimizing Multi-Range based Error-Bounded Lossy Compression for Scientific Datasets},
  year={2021},
  pages={394-399},
  doi={10.1109/HiPC53243.2021.00036}
}
Understanding Effectiveness of Multi-Error-Bounded Lossy Compression for Preserving Ranges of Interest in Scientific Analysis
Yuanjian Liu, Sheng Di, Kai Zhao, Sian Jin, Cheng Wang, Kyle Chard, Dingwen Tao, Ian Foster, Franck Cappello
DRBSD-7 2021
TLDR: Lossy compression frameworks have been proposed as a method to reduce the size of data produced by scientific simulations. However, they do so at the expense of precision, and existing compressors apply a single error bound across the entire dataset. Varying the precision across user-specified ranges of scalar values is a promising approach to further improve compression ratios while retaining precision in specific areas of interest. In this work, we investigate a compression method, based on the SZ framework, that can set multiple error bounds. We evaluate its effectiveness by applying it to real-world datasets that have concrete precision requirements. Our results show that multi-error-bounded lossy compression can improve the compression ratio by 15% with negligible overhead in compression time.
@INPROCEEDINGS{9652577,
  author={Liu, Yuanjian and Di, Sheng and Zhao, Kai and Jin, Sian and Wang, Cheng and Chard, Kyle and Tao, Dingwen and Foster, Ian and Cappello, Franck},
  booktitle={2021 7th International Workshop on Data Analysis and Reduction for Big Scientific Data (DRBSD-7)},
  title={Understanding Effectiveness of Multi-Error-Bounded Lossy Compression for Preserving Ranges of Interest in Scientific Analysis},
  year={2021},
  pages={40-46},
  doi={10.1109/DRBSD754563.2021.00010}
}