
urlopen과 urlretrieve

by 빠니몽 2023. 6. 9.


1. urlretrieve

1-1. What is urlretrieve?

urlretrieve is a function for downloading a network object, denoted by a URL, into a local file.

The only required argument is the URL of the file you want to download.

It returns a two-element tuple: the path of the local copy and its headers.

The headers are the same object that urlopen(url).info() returns.

1-2. Code Example

import urllib.request as req

img_url = 'https://ichef.bbci.co.uk/news/640/cpsprodpb/E172/production/_126241775_getty_cats.png' # A random cat pic
html_url = 'http://google.com'

dest_path1 = 'mypath/img.jpg'
dest_path2 = 'mypath/index.html'

try:
    file1, header1 = req.urlretrieve(img_url, dest_path1)
    file2, header2 = req.urlretrieve(html_url, dest_path2)
except Exception as e:
    print('Download failed')
    print(e)
else:
    print(header1)
    print(header2)

    # Downloaded file info
    print('file1 {}'.format(file1))
    print('file2 {}'.format(file2))

Result

header1 info (date, expiration date, content-type, connection, etc...)
header2 info (date, expiration date, content-type, connection, etc...)

file1 mypath/img.jpg
file2 mypath/index.html
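urlretrieve also takes an optional reporthook callback, which it calls as reporthook(block_number, block_size, total_size) while the transfer runs. A minimal sketch of that hook is below; the file names are temporary placeholders, and a local file:// URL stands in for a real download so the example runs without network access.

```python
import os
import tempfile
import urllib.request as req

# Collect each progress callback urlretrieve makes:
# reporthook(block_number, block_size, total_size)
progress = []

def report(block_num, block_size, total_size):
    progress.append((block_num, block_size, total_size))

# A local file:// URL stands in for a real download URL (illustration only).
src = tempfile.NamedTemporaryFile(delete=False, suffix='.txt')
src.write(b'hello urlretrieve')
src.close()

dest = src.name + '.copy'
filename, headers = req.urlretrieve('file://' + src.name, dest, reporthook=report)

print(filename)                    # path of the local copy (same as dest)
print(headers['Content-Length'])   # size reported in the headers

# Clean up the temporary files
os.remove(src.name)
os.remove(dest)
```

The hook fires once before the first block and once per block read, so even a tiny file triggers it at least twice; this is a common way to wire up a progress bar.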

 

2. urlopen

2-1. What is urlopen?

As its name suggests, urlopen is a function that opens a URL.

It accepts either a string containing a valid, properly encoded URL, or an instance of the Request class provided by the urllib.request module.

It returns a response object that exposes the url, the headers, and the status of the response.
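As a small sketch of the second input form, a Request object can carry extra headers before being opened; the URL and the User-Agent value here are placeholders for illustration.

```python
import urllib.request as req

# urlopen accepts a Request object as well as a plain URL string.
# Attaching a browser-like User-Agent is a common use (placeholder value).
request = req.Request(
    'http://google.com',
    headers={'User-Agent': 'Mozilla/5.0'},
)

print(request.full_url)                  # the URL wrapped by the Request
print(request.get_header('User-agent'))  # Mozilla/5.0

# res = req.urlopen(request)  # opened exactly like a plain URL string
```

Note that Request normalizes header names (capitalizing only the first letter), so they are read back with get_header('User-agent').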

2-2. Code Example

# URLError: raised when a handler runs into a problem (e.g., a network error)
# HTTPError: raised when an HTTP request returns an unsuccessful status code

import os
from dotenv import load_dotenv

import urllib.request as req
from urllib.error import URLError, HTTPError    # For exception handling

load_dotenv()
PATH = os.getenv('DEST_PATH')
path_list = [PATH + '/img.jpg',PATH + '/index.html']
target_list = ['https://img.freepik.com/free-photo/adorable-kitty-looking-like-it-want-to-hunt_23-2149167099.jpg?w=2000', 'http://google.com']

for i, url in enumerate(target_list):
    try:
        # Read web response info
        res = req.urlopen(url)
        contents = res.read()
        print("---------------------------------")

        # Print status info
        print('[{}] Header Info: {}'.format(i, res.info()))
        print('HTTP Status Code: {}'.format(res.getcode()))
        print("---------------------------------")

        with open(path_list[i], 'wb') as c:
            c.write(contents)

    except HTTPError as e:
        print("Download failed")
        print("HTTPError code: ", e.code)
    except URLError as e:
        print("URL Error Reason: ", e.reason)
    # Success
    else:
        print("Download succeeded.")

Result

---------------------------------
[0] Header Info: (response headers: date, content-type, connection, etc.)
HTTP Status Code: 200
---------------------------------
Download succeeded.
---------------------------------
[1] Header Info: (response headers: date, content-type, connection, etc.)
HTTP Status Code: 200
---------------------------------
Download succeeded.
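The url, headers, and status attributes described in 2-1 can also be read directly off the response object. The sketch below assumes an in-process HTTP server as a stand-in for a real site, so it runs without external network access; the HTML body is a placeholder.

```python
import threading
import urllib.request as req
from http.server import BaseHTTPRequestHandler, HTTPServer

# A tiny in-process HTTP server standing in for a real website (illustration only).
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'<html>hello</html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

res = req.urlopen('http://127.0.0.1:{}/'.format(server.server_port))
body = res.read()

print(res.url)                       # the URL that was opened
print(res.status)                    # 200
print(res.headers['Content-Type'])   # text/html

server.shutdown()
```

res.getcode() and res.info(), used in the 2-2 example, are the older spellings of res.status and res.headers; both pairs work on the object urlopen returns.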