Clone the repository from GitHub

https://github.com/monologg/KoBigBird.git
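
The clone step can also be scripted. Below is a minimal sketch, assuming `git` is installed and on the PATH; the helper names `build_clone_command` and `clone_repo` are hypothetical and are not part of KoBigBird.

```python
import subprocess


def build_clone_command(repo_url, dest_dir=None):
    """Return the git command line that clones repo_url (optionally into dest_dir)."""
    cmd = ["git", "clone", repo_url]
    if dest_dir is not None:
        cmd.append(dest_dir)
    return cmd


def clone_repo(repo_url, dest_dir=None):
    """Run git clone; raises subprocess.CalledProcessError if git exits with an error."""
    subprocess.run(build_clone_command(repo_url, dest_dir), check=True)


# Example (needs git and network access, so it is left commented out):
# clone_repo("https://github.com/monologg/KoBigBird.git")
```

Running `git clone` with the same URL directly in a terminal does the same thing; the wrapper is only useful if the setup is driven from a Python script.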

Download TensorFlow v1 checkpoint

See download_tfv1_ckpt.md in the monologg/KoBigBird repository (master branch)

[Linux] Compressing and extracting tar and gz archives on Linux
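
If the downloaded checkpoint arrives as a tar or tar.gz archive, it can be unpacked from Python as well as with the `tar` command. This is a minimal sketch using the standard-library `tarfile` module; the function name `extract_archive` and the archive file name in the usage comment are hypothetical, and `./init_checkpoint` is assumed as the destination only because the script's default tokenizer path points there.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar or .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) itself.
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


# Example with a hypothetical archive name:
# extract_archive("checkpoint.tar.gz", "./init_checkpoint")
```

Only extract archives from a source you trust, since a tar file can contain paths that point outside the destination directory.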

create_pretraining_data.py

import argparse

import tensorflow as tf


def main():
    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--input_dir", default="./data", help="Location of text or ko_lm_dataformat files.")
    parser.add_argument("--tokenizer_dir", default="./init_checkpoint/tokenizer", help="Location of tokenizer directory.")
    parser.add_argument("--output_dir", default="./output", help="Where to write out the tfrecords.")
    parser.add_argument("--max_seq_length", default=4096, type=int, help="Number of tokens per example.")
    parser.add_argument(
        "--max_predictions_per_seq",
        default=80,
        type=int,
        help="Maximum number of masked LM predictions per sequence.",
    )