Problems Encountered When Deploying JavisDiT for Inference, and How to Fix Them

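1. Project Background

Official repository: https://github.com/JavisVerse/JavisDiT
The goals this time were:

  • Deploy JavisDiT on a single machine with a single A100 GPU
  • Run text-to-video + audio generation inference
  • Make sure inference runs end to end

Components involved:

  • Python 3.10
  • PyTorch 2.5.1 + CUDA 12.1
  • Flash-Attention (core acceleration module)
  • Wan2.1 T2V / AudioLDM2 / VAE / LoRA
  • Custom DiT + attention pipeline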

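2. Problems and Solutions

2.1 GPU memory

I first deployed on an NVIDIA A10 with 24 GB of memory, which ended in an OOM error, so 24 GB is not enough. Deployment and inference later succeeded on a single A100 40 GB card.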
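Before launching inference it can help to check how much memory the card actually reports. The sketch below is only an illustration: the 32 GiB cutoff is an assumption of mine, picked between the two data points above (24 GB failed, 40 GB worked), and the function names are hypothetical. The real requirement depends on resolution and frame count.

```python
GIB = 1024 ** 3


def has_enough_memory(total_bytes: int, min_gib: float = 32.0) -> bool:
    """Return True if the card reports at least `min_gib` GiB.

    The 32 GiB default is an assumed cutoff, not a measured requirement.
    """
    return total_bytes >= min_gib * GIB


def report_gpu(index: int = 0) -> None:
    """Print the name and memory of one visible GPU (needs torch + CUDA)."""
    import torch  # imported lazily so the helper above works without torch

    if not torch.cuda.is_available():
        print("No CUDA device is visible")
        return
    props = torch.cuda.get_device_properties(index)
    verdict = "ok" if has_enough_memory(props.total_memory) else "probably too small"
    print(f"{props.name}: {props.total_memory / GIB:.1f} GiB ({verdict})")
```

Calling report_gpu() inside the javisdit environment prints one line for the selected card. Note that drivers report slightly less than the marketing size, which is why the cutoff is not set to exactly 40.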

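2.2 torch and torchvision version mismatch

The requirements file referenced in the JavisDiT README (as of 2026/7/2) asks for torch 2.5.1 and torchvision 0.21.0, but these two versions do not match: torchvision 0.21.0 is built against torch 2.6, while the release paired with torch 2.5.1 is torchvision 0.20.1. The versions that actually work can be checked with:

python -c "import torch; print('torch:', torch.__version__)"
python -c "import torchvision; print('torchvision:', torchvision.__version__)"
python -c "import torchaudio; print('torchaudio:', torchaudio.__version__)"

Output:

torch: 2.5.1+cu121
torchvision: 0.20.1+cu121
torchaudio: 2.5.1+cu121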
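The pairing rule can also be checked programmatically. This is a minimal sketch: the table only covers a few torch releases that I am sure about, and the function names are my own, not part of torch or torchvision.

```python
from importlib import metadata

# torch minor release -> torchvision minor release built against it
TORCHVISION_FOR_TORCH = {
    "2.3": "0.18",
    "2.4": "0.19",
    "2.5": "0.20",
    "2.6": "0.21",
}


def minor(version: str) -> str:
    """Reduce '2.5.1+cu121' to '2.5' (drop the local tag and patch number)."""
    public = version.split("+", 1)[0]
    return ".".join(public.split(".")[:2])


def versions_match(torch_version: str, torchvision_version: str) -> bool:
    """True if the two versions belong to the same paired release."""
    expected = TORCHVISION_FOR_TORCH.get(minor(torch_version))
    return expected is not None and minor(torchvision_version) == expected


def installed_pair_matches() -> bool:
    """Check the versions installed in the current environment."""
    return versions_match(metadata.version("torch"), metadata.version("torchvision"))
```

With the versions from the README, versions_match("2.5.1", "0.21.0") is False, while the pair that is actually installed above passes.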

2.3 Downloading and building flash-attn

Installing with pip install flash-attn --no-build-isolation, as the official README suggests, ends up stuck on a local build, and the local build will most likely fail, so it is better to download a prebuilt wheel.
Based on my CUDA and torch versions, I picked flash_attn-2.8.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl from the flash-attn releases page.

The install commands are:
pip uninstall -y flash-attn
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
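Choosing the right wheel comes down to matching four things encoded in the file name: CUDA major version, torch minor version, C++11 ABI flag, and Python tag. The sketch below rebuilds the name and URL used above from those parts. The naming pattern is inferred from that single file name, so treat it as an assumption and confirm against the releases page; the function names are hypothetical.

```python
def flash_attn_wheel_name(
    flash_version: str,
    cuda_major: str,
    torch_minor: str,
    cxx11_abi: bool,
    py_tag: str,
    platform: str = "linux_x86_64",
) -> str:
    """Assemble a wheel file name following the pattern seen in this post."""
    abi = "TRUE" if cxx11_abi else "FALSE"
    return (
        f"flash_attn-{flash_version}+cu{cuda_major}torch{torch_minor}"
        f"cxx11abi{abi}-{py_tag}-{py_tag}-{platform}.whl"
    )


def flash_attn_wheel_url(flash_version: str, filename: str) -> str:
    """Assemble the GitHub release download URL for a wheel file."""
    base = "https://github.com/Dao-AILab/flash-attention/releases/download"
    return f"{base}/v{flash_version}/{filename}"


# Values for the environment described in this post.
name = flash_attn_wheel_name("2.8.0.post2", "12", "2.5", False, "cp310")
url = flash_attn_wheel_url("2.8.0.post2", name)
```

In a live environment the inputs can be read from torch.__version__, torch.version.cuda, torch.compiled_with_cxx11_abi() and sys.version_info.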

2.4 The file replacement step fails

The last step of the author's Installation section, cp assets/src/funasr_utils_load_utils.py ${PYTHON_SITE_PACKAGES}/funasr/utils/load_utils.py, fails. FunASR has to be installed first:

pip install funasr

Once it is installed, run:

cp assets/src/funasr_utils_load_utils.py \
  $(python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())")/funasr/utils/load_utils.py
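The same replacement can be done from Python, which avoids the deprecated distutils module and gives a clearer error when funasr is missing. This is a sketch under the assumption that funasr lives in the interpreter's purelib directory; the function names are my own.

```python
import shutil
import sysconfig
from pathlib import Path


def default_site_packages() -> Path:
    """site-packages directory of the running interpreter."""
    return Path(sysconfig.get_paths()["purelib"])


def replace_load_utils(patched_file: Path, site_packages: Path) -> Path:
    """Copy the patched file over funasr/utils/load_utils.py."""
    target = site_packages / "funasr" / "utils" / "load_utils.py"
    if not target.parent.is_dir():
        # Same failure as the cp step, but with an explicit reason.
        raise FileNotFoundError(f"{target.parent} not found; run 'pip install funasr' first")
    shutil.copyfile(patched_file, target)
    return target
```

From the repository root, with the javisdit environment active, the call would be replace_load_utils(Path("assets/src/funasr_utils_load_utils.py"), default_site_packages()).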

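2.5 ModuleNotFoundError: No module named 'pkg_resources'

This one is really annoying. At first I assumed the module was missing because of an incomplete install or incomplete source files. It turned out that the new setuptools 82 has removed pkg_resources, so setuptools has to be downgraded to make the error go away.

pip uninstall setuptools -y
pip install setuptools==68.2.2 --no-cache-dir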
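A quick way to tell an incomplete install apart from a missing pkg_resources is to look at both the module and the setuptools version. A minimal sketch using only the standard library (the helper names are my own):

```python
import importlib.util
from importlib import metadata
from typing import Optional


def module_available(name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def installed_version(dist: str) -> Optional[str]:
    """Installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def diagnose() -> str:
    """Summarize the state relevant to the pkg_resources error."""
    return (
        f"setuptools={installed_version('setuptools')}, "
        f"pkg_resources importable={module_available('pkg_resources')}"
    )
```

If diagnose() shows a setuptools version with pkg_resources not importable, the downgrade above is the fix described in this post.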

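2.6 Slow downloads from HuggingFace

The author gives these commands for downloading the model weights:

# download JavisDiT weights
hf download JavisVerse/JavisDiT-v1.0-jav --local-dir ./checkpoints/JavisDiT-v1.0-jav
# download VAEs
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./checkpoints/Wan2.1-T2V-1.3B
hf download cvssp/audioldm2 --local-dir ./checkpoints/audioldm2

Downloads are extremely slow from networks in mainland China, though. On a senior labmate's advice I switched to a mirror:

export HF_ENDPOINT=https://hf-mirror.com

The speed was then about 5 MB/s, and the download finished overnight.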
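The same downloads can be scripted with huggingface_hub's snapshot_download. In this sketch the endpoint is set before huggingface_hub is imported, because the library reads HF_ENDPOINT at import time. The CHECKPOINTS mapping and the download_all name are mine; the repository IDs and directories are the ones from the commands above.

```python
import os

# Must be set before huggingface_hub is imported; an existing value is kept.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

# repo id -> local directory, as in the hf download commands
CHECKPOINTS = {
    "JavisVerse/JavisDiT-v1.0-jav": "./checkpoints/JavisDiT-v1.0-jav",
    "Wan-AI/Wan2.1-T2V-1.3B": "./checkpoints/Wan2.1-T2V-1.3B",
    "cvssp/audioldm2": "./checkpoints/audioldm2",
}


def download_all(checkpoints=CHECKPOINTS) -> None:
    """Download every repository into its local directory."""
    from huggingface_hub import snapshot_download  # lazy: reads HF_ENDPOINT on import

    for repo_id, local_dir in checkpoints.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Running download_all() does the same work as the three hf download commands.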

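2.7 transformers version conflict

Two libraries in this project pin different transformers versions:
javisdit 0.1.0 → transformers 4.49.0
colossalai 0.5.0 → transformers 4.51.3
A plain pip install fails. I gave colossalai priority and kept transformers 4.51.3, because inference later showed that colossalai is a required module that cannot be removed: it is used not only in training but also in the distributed-related code paths at inference time, so there is no way around installing it.
Adding a flag that makes the install skip dependency resolution avoids the error (but modules will be missing afterwards and have to be installed by hand):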

pip install -v -e . --no-deps
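Since --no-deps skips every declared dependency, it is useful to list what is still missing instead of finding out one ImportError at a time. The sketch below reads the requirements a package declares and reports the ones that are not installed. It compares names only and ignores version pins; the helper names are my own, and the distribution name javisdit is taken from the conflict list above.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Extract the distribution name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return match.group(1)


def missing_requirements(requirements) -> list:
    """Names of required distributions that are not installed."""
    missing = []
    for req in requirements:
        if "extra==" in req.replace(" ", ""):
            continue  # optional extras are not needed for a plain install
        name = requirement_name(req)
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def missing_for(dist_name: str) -> list:
    """Missing dependencies of an installed distribution."""
    return missing_requirements(metadata.requires(dist_name) or [])
```

After the --no-deps install, missing_for("javisdit") gives the list of packages to install by hand.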

2.8 No module named 'colossalai'

As already mentioned in 2.7, this one is also very annoying: colossalai can only be installed when the transformers version is 4.51.3.

2.9 numpy, huggingface_hub, and uvicorn versions too new

numpy must be below 2.0.0:

pip install 'numpy<2.0.0'

huggingface_hub also has to be an older version:

pip install 'huggingface_hub==0.36.2'

So does uvicorn:

pip install 'uvicorn==0.29.0'
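The three constraints can be verified in one pass. This is a simplified sketch: it only handles plain numeric versions and the two operators used here, and it is not a replacement for a full version parser. The PINS table and function names are my own.

```python
import re
from importlib import metadata

# package -> (operator, version), as listed in this section
PINS = {
    "numpy": ("<", "2.0.0"),
    "huggingface_hub": ("==", "0.36.2"),
    "uvicorn": ("==", "0.29.0"),
}


def version_tuple(version: str, width: int = 3) -> tuple:
    """Turn '1.26.4' or '2.5.1+cu121' into a zero-padded tuple of ints."""
    numbers = []
    for piece in version.split("+", 1)[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    numbers += [0] * (width - len(numbers))
    return tuple(numbers)


def satisfies(installed: str, op: str, wanted: str) -> bool:
    """Evaluate a '<' or '==' constraint."""
    a, b = version_tuple(installed), version_tuple(wanted)
    if op == "<":
        return a < b
    if op == "==":
        return a == b
    raise ValueError(f"unsupported operator: {op}")


def check_installed(pins=PINS) -> dict:
    """Map each pinned package to True/False for the current environment."""
    results = {}
    for name, (op, wanted) in pins.items():
        try:
            results[name] = satisfies(metadata.version(name), op, wanted)
        except metadata.PackageNotFoundError:
            results[name] = False
    return results
```

check_installed() returns a dictionary such as {"numpy": True, ...}; any False entry points at a package to reinstall.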

2.10 KeyError: 'Adafactor is already registered in optimizer at torch.optim'

Disable automatic optimizer registration in mmengine and transformers.

Before running, set:

export MMENGINE_DISABLE_OPS=1
export TRANSFORMERS_NO_ADAFACTOR=1

If that still does not solve it, run:

pip install --force-reinstall mmengine==0.10.7
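The error message suggests that two different code paths try to register a class named Adafactor in the same optimizer registry. The toy registry below reproduces that failure mode. It is a simplified model written for illustration, not mmengine's actual implementation, and the class, method, and force parameter here are all my own.

```python
class Registry:
    """Toy name -> class registry that rejects duplicate names."""

    def __init__(self, name: str):
        self.name = name
        self._modules = {}

    def register(self, cls, force: bool = False):
        key = cls.__name__
        if key in self._modules and not force:
            existing = self._modules[key]
            raise KeyError(
                f"{key} is already registered in {self.name} at {existing.__module__}"
            )
        self._modules[key] = cls
        return cls

    def get(self, key: str):
        return self._modules[key]


# Two unrelated classes that happen to share the name "Adafactor".
first = type("Adafactor", (), {})
second = type("Adafactor", (), {})

optimizers = Registry("optimizer")
optimizers.register(first)
try:
    optimizers.register(second)  # second registration under the same name
except KeyError as err:
    print("duplicate registration:", err)
```

In this model the conflict disappears either when only one of the two registrations runs or when the second one is allowed to overwrite the first.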

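2.11 ModuleNotFoundError: No module named 'torchvision.transforms.functional_tensor'

The root cause is that the installed pytorchvideo is too old: it still imports torchvision.transforms.functional_tensor, which recent torchvision releases no longer ship. Upgrade it:

pip install -U pytorchvideo

Part of the source code also has to be changed. On the command line, run:

code <your-home-directory>/anaconda3/envs/javisdit/lib/python3.10/site-packages/pytorchvideo/transforms/augmentations.py

In augmentations.py, change

import torchvision.transforms.functional_tensor as F_t

to

import torchvision.transforms.functional as F_t

so that it matches the new interface.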
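The manual edit can also be applied by a small script, which is handy when the environment is rebuilt. This sketch assumes pytorchvideo sits in the interpreter's purelib directory; it locates the file by path rather than by importing the package, since the import may fail before the patch is applied. Function names are my own.

```python
import sysconfig
from pathlib import Path

OLD_IMPORT = "import torchvision.transforms.functional_tensor as F_t"
NEW_IMPORT = "import torchvision.transforms.functional as F_t"


def patch_source(text: str) -> str:
    """Replace the outdated import line in a source string."""
    return text.replace(OLD_IMPORT, NEW_IMPORT)


def patch_file(path: Path) -> bool:
    """Patch a file in place, keeping a .bak copy. Returns True if changed."""
    original = path.read_text(encoding="utf-8")
    patched = patch_source(original)
    if patched == original:
        return False  # already patched, or the import is not present
    path.with_suffix(path.suffix + ".bak").write_text(original, encoding="utf-8")
    path.write_text(patched, encoding="utf-8")
    return True


def default_target() -> Path:
    """Expected location of augmentations.py in the active environment."""
    site_packages = Path(sysconfig.get_paths()["purelib"])
    return site_packages / "pytorchvideo" / "transforms" / "augmentations.py"
```

Running patch_file(default_target()) inside the javisdit environment performs the same one-line change and leaves augmentations.py.bak next to the file.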

Summary

With everything above resolved, run

CUDA_VISIBLE_DEVICES=2 python scripts/inference.py configs/javisdit-v1-0/inference/sample.py \
  --model-path /data/checkpoints/JavisDiT-v1.0-jav \
  --num-frames 81 --resolution 480p --aspect-ratio 9:16 \
  --prompt "A brown bear is walking towards the camera" \
  --verbose 2

to complete the demo inference.

3. Appendix: Environment List

accelerate==0.29.2
addict==2.4.0
aliyun-python-sdk-core==2.16.0
aliyun-python-sdk-kms==2.16.5
annotated-doc==0.0.4
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.14.1
attrs==26.1.0
audioflux==0.1.9
audioread==3.1.0
av==13.1.0
bcrypt==5.0.0
beartype==0.22.9
beautifulsoup4==4.15.0
bitsandbytes==0.49.2
brotli==1.2.0
certifi==2026.6.17
cffi==2.0.0
cfgv==3.5.0
charset-normalizer==3.4.7
click==8.4.2
colossalai==0.5.0
contexttimer==0.3.3
contourpy==1.3.2
crcmod==1.7
cryptography==49.0.0
cycler==0.12.1
decorator==5.3.1
decord==0.6.0
Deprecated==1.3.1
diffusers==0.29.0
distlib==0.4.3
easydict==1.13
editdistance==0.8.1
einops==0.8.2
exceptiongroup==1.3.1
fabric==3.2.3
fastapi==0.138.2
ffmpeg-python==0.2.0
filelock==3.29.0
flash_attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl#sha256=043bf4bf846a2d68a34c210bf392b0af6e5fb33f1d5b0c7ffa6d5837f1f338e2
fonttools==4.63.0
fsspec==2026.4.0
ftfy==6.3.1
funasr==1.3.14
future==1.0.0
fvcore==0.1.5.post20221221
galore-torch==1.0
google==3.0.0
gradio==6.19.0
gradio_client==2.5.0
groovy==0.1.2
h11==0.16.0
hf-gradio==0.4.1
hf-xet==1.5.1
httpcore==1.0.9
httpx==0.28.1
huggingface_hub==0.36.2
hydra-core==1.3.3
identify==2.6.19
idna==3.18
importlib_metadata==9.0.0
invoke==2.2.1
iopath==0.1.10
ipykernel==7.3.0
ipywidgets==8.1.8
jaconv==0.5.0
jamo==0.4.1
-e git+https://github.com/JavisVerse/JavisDiT@b505b37faa9668b52b982abe364825d3d0a5bdca#egg=javisdit
jieba==0.42.1
Jinja2==3.1.6
jmespath==0.10.0
joblib==1.5.3
jsonschema==4.26.0
jsonschema-specifications==2025.9.1
kaldiio==2.18.1
kaleido==1.3.0
kiwisolver==1.5.0
librosa==0.9.2
llvmlite==0.47.0
markdown-it-py==4.2.0
MarkupSafe==3.0.3
matplotlib==3.10.9
mdurl==0.1.2
mmengine==0.10.7
modelscope==1.38.0
modelscope-hub==0.1.5
mpmath==1.3.0
msgpack==1.2.1
networkx==3.4.2
ninja==1.13.0
nodeenv==1.10.0
numba==0.65.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.9.86
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.1
openai==2.44.0
opencv-python==4.13.0.92
orjson==3.11.9
oss2==2.19.1
packaging @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_packaging_1777103621/work
pandarallel==1.6.5
pandas==2.3.3
parameterized==0.9.0
paramiko==5.0.0
peft==0.13.2
pillow==12.2.0
platformdirs==4.10.0
plotly==6.8.0
plumbum==2.0.1
pooch==1.9.0
portalocker==3.2.0
pre_commit==4.6.0
protobuf==7.35.1
psutil==7.2.2
pyarrow==24.0.0
pycparser==3.0
pycryptodome==3.23.0
pydantic==2.13.4
pydantic_core==2.46.4
pydub==0.25.1
Pygments==2.20.0
PyNaCl==1.6.2
pynndescent==0.6.0
pyparsing==3.3.2
python-dateutil==2.9.0.post0
python-discovery==1.4.2
python-multipart==0.0.32
pytorchvideo @ git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
pytz==2026.2
PyYAML==6.0.3
ray==2.56.0
referencing==0.37.0
regex==2026.6.28
requests==2.34.2
resampy==0.4.3
rich==15.0.0
rotary-embedding-torch==0.5.3
rpds-py==0.30.0
rpyc==6.0.0
safehttpx==0.1.7
safetensors==0.8.0
scikit-learn==1.7.2
scipy==1.14.1
semantic-version==2.10.0
sentencepiece==0.2.1
shellingham==1.5.4
six==1.17.0
soundfile==0.12.1
soupsieve==2.8.4
spaces==0.50.4
starlette==1.3.1
sympy==1.13.1
tabulate==0.10.0
tensorboard==2.21.0
tensorboardX==2.6.5
termcolor==3.3.0
threadpoolctl==3.6.0
tiktoken==0.13.0
timm==0.9.16
tokenizers==0.21.4
tomli==2.4.1
tomlkit==0.14.0
torch==2.5.1+cu121
torch-complex==0.4.4
torchaudio==2.5.1+cu121
torchvision==0.20.1+cu121
tqdm==4.68.3
transformers==4.51.3
triton==3.1.0
typer==0.25.1
typing-inspection==0.4.2
typing_extensions==4.15.0
tzdata==2026.2
umap-learn==0.5.12
urllib3==2.7.0
uvicorn==0.29.0
virtualenv==21.5.1
wandb==0.28.0
wcwidth==0.8.2
wrapt==2.2.2
yacs==0.1.8
yapf==0.43.0
zipp==4.1.0

May the world be free of version conflicts, orz