Unstructured – Linux & Android Dialy

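I looked into the difference between
pip install 'unstructured[pdf]'
and
pip install 'unstructured[all-docs]'

The difference is whether support is installed for PDF only
or for all document formats (PDF, Word, Excel, HTML, and so on).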
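To see what an extra actually pulls in, the requirement strings in the package metadata can be grouped by their extra marker. Below is a minimal sketch of that idea. The helper name requirements_for_extra and the sample list are my own for illustration; the sample is NOT the real dependency list of unstructured.

```python
import re

# Matches markers such as: extra == "pdf"  or  extra == 'pdf'
_EXTRA_RE = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def requirements_for_extra(requires, extra):
    """Return the requirement strings whose marker selects `extra`."""
    selected = []
    for line in requires:
        # Everything after ';' is the environment marker, if any
        requirement, _, marker = line.partition(";")
        if extra in _EXTRA_RE.findall(marker):
            selected.append(requirement.strip())
    return selected


# Illustrative input only; these are NOT the real dependency lists.
sample = [
    'requests',
    'pdf2image; extra == "pdf"',
    'pdf2image; extra == "all-docs"',
    'python-docx; extra == "all-docs"',
]
print(requirements_for_extra(sample, "pdf"))
print(requirements_for_extra(sample, "all-docs"))
```

In a notebook where the package is installed, the real list should be available from importlib.metadata.requires("unstructured"), which can be passed in place of sample. The exact marker text may differ between versions, so treat the regular expression as a starting point.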

I want to use this to extract images and build training data,
so I run it on Colab, where the data can be saved as is.

Following the article
【Python】unstructuredを使って、PDFファイル内の非構造データを抽出する
("[Python] Using unstructured to extract unstructured data from PDF files"),

!pip install unstructured[all-docs]

installs the library.

After the install finishes, Colab asks you to restart the session.

I use the Reiwa 6 (FY2024) question booklet from the IT Passport exam's past papers,
so I downloaded the questions from
https://www3.jitec.ipa.go.jp/JitesCbt/html/openinfo/questions.html

!mkdir data

creates the folder.
Upload the PDF there.

import os
from unstructured.partition.pdf import partition_pdf

# Get the current working directory
current_dir = os.getcwd()

# Path to the 'data' folder
DATA_PAR_PATH = os.path.join(current_dir, 'data')

# Path to the PDF file to process
DATASET_PATH = os.path.join(DATA_PAR_PATH, '2024r06_ip_qs.pdf')

# Path to the output folder for the extracted images
OUTPUT_PATH = os.path.join(DATA_PAR_PATH, 'images')

# Check the folder and file up front, to confirm the upload worked
if not os.path.exists(DATA_PAR_PATH):
    raise FileNotFoundError("The 'data' folder was not found. Upload '2024r06_ip_qs.pdf' to Google Colab.")

if not os.path.exists(DATASET_PATH):
    raise FileNotFoundError("The file '2024r06_ip_qs.pdf' does not exist in the 'data' folder. Please upload it.")

# Partition the contents of the PDF file
raw_pdf_elements = partition_pdf(
    filename=DATASET_PATH,
    chunking_strategy='by_title',
    infer_table_structure=True,
    extract_images_in_pdf=True,
    extract_image_block_output_dir=OUTPUT_PATH
)

# Report that processing succeeded
print("Finished processing the PDF file.")
print(f"Images are saved in the following folder: {OUTPUT_PATH}")

I ran this, but got the following:

WARNING:pdfminer.pdfpage:The PDF <_io.BufferedReader name='/content/data/2024r06_ip_qs.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
yolox_l0.05.onnx: 100%

 217M/217M [00:02<00:00, 149MB/s]
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    580             env["LD_LIBRARY_PATH"] = poppler_path + ":" + env.get("LD_LIBRARY_PATH", "")
--> 581         proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
    582 



15 frames

FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'
During handling of the above exception, another exception occurred:

PDFInfoNotInstalledError                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    605 
    606     except OSError:
--> 607         raise PDFInfoNotInstalledError(
    608             "Unable to get page count. Is poppler installed and in PATH?"
    609         )

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

So the run stops here with an error.

This happens because poppler-utils,
which the pdf2image package uses internally,
is not installed.

!apt-get update
!apt-get install -y poppler-utils

installs it.

Then I ran the script again.

This time the unstructured data is split out as images and written to the directory specified by OUTPUT_PATH.
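Before opening the output folder, it can help to tally what partition_pdf returned. The sketch below counts elements by their category attribute (the same attribute the later script filters on). The helper name count_by_category is my own, and the demo objects are stand-ins so the snippet runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_category(elements):
    """Tally elements by their `category` attribute."""
    return Counter(getattr(el, "category", "Unknown") for el in elements)


# Stand-in objects so the sketch runs without a PDF; in the notebook,
# pass raw_pdf_elements instead.
demo = [
    SimpleNamespace(category="Image"),
    SimpleNamespace(category="Image"),
    SimpleNamespace(category="NarrativeText"),
]
print(count_by_category(demo))
```

If tables never show up in the tally, that points at the partition settings rather than at the image export step.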

Checking this unstructured data showed the following:

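* Only images are detected
* I want tables to be picked up as images too
* The images have to be written out to disk
* I would rather keep the unstructured data in memory as bytes, without writing it out
* The edges of the images are cut off
* How much is cut off varies from image to image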

Based on these results, I reworked how the unstructured data is extracted.
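For the clipped edges, one generic workaround is to grow each bounding box by a margin before cropping, clamped so it never leaves the page. The sketch below shows only that arithmetic. It is not how unstructured crops internally, and pad_box, the margin, and the sample numbers are all assumptions of mine.

```python
def pad_box(box, margin, page_size):
    """Grow (x0, y0, x1, y1) by `margin` pixels, clamped to the page."""
    x0, y0, x1, y1 = box
    width, height = page_size
    return (
        max(0, x0 - margin),
        max(0, y0 - margin),
        min(width, x1 + margin),
        min(height, y1 + margin),
    )


# A box near the left and right page edges: padding is clamped there
print(pad_box((10, 20, 110, 220), 15, (120, 300)))  # (0, 5, 120, 235)
```

The resulting tuple is in the (left, upper, right, lower) order that PIL's Image.crop expects, so it could be applied to a rendered page image. Since the clipping varies from image to image, the margin would need some trial and error.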

However, the run failed with:

WARNING:pdfminer.pdfpage:The PDF <_io.BufferedReader name='/content/data/2024r06_ip_qs.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
yolox_l0.05.onnx: 100%

 217M/217M [00:01<00:00, 237MB/s]
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/unstructured_pytesseract/pytesseract.py in get_tesseract_version()
    450     try:
--> 451         output = subprocess.check_output(
    452             [tesseract_cmd, '--version'],



21 frames

FileNotFoundError: [Errno 2] No such file or directory: 'tesseract'
During handling of the above exception, another exception occurred:

TesseractNotFoundError                    Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/unstructured_pytesseract/pytesseract.py in get_tesseract_version()
    456         )
    457     except OSError:
--> 458         raise TesseractNotFoundError()
    459 
    460     raw_version = output.decode(DEFAULT_ENCODING)

TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

This error occurs because Tesseract, the OCR engine needed to extract text from the images in the PDF, is not installed. Google Colab does not have Tesseract installed by default, so

!apt-get update
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev

installs it.
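Both failures so far came from system binaries missing from PATH (pdfinfo for pdf2image, tesseract for OCR), and each one only surfaced after the model download. A small preflight check along these lines would catch them earlier. The helper name missing_binaries is my own.

```python
import shutil


def missing_binaries(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# pdfinfo comes from poppler-utils, tesseract from tesseract-ocr.
required = ["pdfinfo", "tesseract"]
missing = missing_binaries(required)
if missing:
    print("Missing system binaries:", ", ".join(missing))
else:
    print("All required system binaries were found.")
```

Running this in a cell before partition_pdf shows at a glance which apt-get install is still needed.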

With that done, running the following again:

import base64
import io
import os
from IPython.display import display
from PIL import Image
from unstructured.partition.pdf import partition_pdf

# Path settings for the Google Colab environment
current_dir = os.getcwd()
DATA_PAR_PATH = os.path.join(current_dir, 'data')  # use the 'data' folder
DATASET_PATH = os.path.join(DATA_PAR_PATH, '2024r06_ip_qs.pdf')

# Check that the required folder and file exist
if not os.path.exists(DATA_PAR_PATH):
    raise FileNotFoundError("The 'data' folder does not exist. Upload '2024r06_ip_qs.pdf' to Google Colab.")

if not os.path.exists(DATASET_PATH):
    raise FileNotFoundError("The file '2024r06_ip_qs.pdf' does not exist in the 'data' folder. Please upload it.")

# Partition the contents of the PDF file
raw_pdf_elements = partition_pdf(
    filename=DATASET_PATH,
    infer_table_structure=True,
    strategy='hi_res',
    extract_images_in_pdf=True,
    extract_image_block_types=['Image', 'Table'],
    extract_image_block_to_payload=True
)

# Inspect the unstructured data held in memory as images
for elem in raw_pdf_elements:
    if elem.category in ['Image', 'Table']:
        image_base64 = elem.metadata.image_base64
        decoded_image = base64.b64decode(image_base64)
        image = Image.open(io.BytesIO(decoded_image))
        print(f"Page Number: {elem.metadata.page_number}")
        display(image)  # show the image in Colab

displays the images.
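Since the goal is training data, the in-memory payloads eventually need to land on disk under predictable names. Here is a minimal sketch that decodes base64 payloads and writes them out using only the standard library. The function names, the file naming scheme, and the tuple layout are my own assumptions, not part of unstructured.

```python
import base64
from pathlib import Path


def guess_extension(data):
    """Pick a file extension from the leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    return ".bin"


def save_payloads(items, out_dir):
    """Write (page_number, category, image_base64) tuples to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (page, category, payload) in enumerate(items):
        data = base64.b64decode(payload)
        # Hypothetical naming scheme: <category>-p<page>-<index>.<ext>
        name = f"{category.lower()}-p{page}-{index}{guess_extension(data)}"
        path = out / name
        path.write_bytes(data)
        written.append(path)
    return written
```

In the notebook, the items could be built from the loop above, for example as (elem.metadata.page_number, elem.category, elem.metadata.image_base64) for each element whose category is 'Image' or 'Table'.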

Likewise, the first script again:

import os
from unstructured.partition.pdf import partition_pdf

# Get the current working directory
current_dir = os.getcwd()

# Path to the 'data' folder
DATA_PAR_PATH = os.path.join(current_dir, 'data')

# Path to the PDF file to process
DATASET_PATH = os.path.join(DATA_PAR_PATH, '2024r06_ip_qs.pdf')

# Path to the output folder for the extracted images
OUTPUT_PATH = os.path.join(DATA_PAR_PATH, 'images')

# Check the folder and file up front, to confirm the upload worked
if not os.path.exists(DATA_PAR_PATH):
    raise FileNotFoundError("The 'data' folder was not found. Upload '2024r06_ip_qs.pdf' to Google Colab.")

if not os.path.exists(DATASET_PATH):
    raise FileNotFoundError("The file '2024r06_ip_qs.pdf' does not exist in the 'data' folder. Please upload it.")

# Partition the contents of the PDF file
raw_pdf_elements = partition_pdf(
    filename=DATASET_PATH,
    chunking_strategy='by_title',
    infer_table_structure=True,
    extract_images_in_pdf=True,
    extract_image_block_output_dir=OUTPUT_PATH
)

# Report that processing succeeded
print("Finished processing the PDF file.")
print(f"Images are saved in the following folder: {OUTPUT_PATH}")

Running this creates an images folder inside the data folder,
and the extracted images are saved there.
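Colab storage goes away with the session, so it is worth packing the images folder into a single zip before moving on. A minimal sketch using the standard library follows. The helper name archive_images is my own, and the archive should be written outside the folder being zipped so that it does not try to include itself.

```python
import shutil
from pathlib import Path


def archive_images(image_dir, archive_base):
    """Zip image_dir and return (archive path, number of files packed)."""
    image_dir = Path(image_dir)
    # Count the files first, before the archive exists
    file_count = sum(1 for p in image_dir.rglob("*") if p.is_file())
    archive_path = shutil.make_archive(
        str(archive_base), "zip", root_dir=str(image_dir)
    )
    return archive_path, file_count
```

For example, archive_images(OUTPUT_PATH, os.path.join(current_dir, 'images_backup')) would produce images_backup.zip next to the data folder. In Colab the resulting file can then be downloaded from the file browser, or copied to a mounted Google Drive.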

Next, I'll try this on flyers.


Extracting images from PDFs with Unstructured