2005-05-04

Sending a request through a proxy

>>> import urllib2
>>> req = urllib2.Request("http://www.google.co.jp")
>>> req.set_proxy("proxy.example.com:8080", "http")
>>> req.add_header("User-agent", "python")
>>> req.add_header("Pragma", "no-cache")
>>> site = urllib2.urlopen(req)
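
A minimal sketch of the same request setup, assuming Python 3, where urllib2's Request class lives in urllib.request. It only builds the request and inspects it, so nothing is sent and the example proxy address never has to exist.

```python
# Python 3 sketch: urllib2.Request is urllib.request.Request here.
import urllib.request

req = urllib.request.Request("http://www.google.co.jp")
req.set_proxy("proxy.example.com:8080", "http")  # proxy host:port, scheme
req.add_header("User-agent", "python")
req.add_header("Pragma", "no-cache")

# Nothing has been sent yet; these calls only inspect the request object.
print(req.has_proxy())               # whether a proxy has been set
print(req.host)                      # host the connection would go to
print(req.get_header("User-agent"))  # header as stored on the request
```

Checking the request object like this is a cheap way to confirm the proxy and headers were applied before calling urlopen(req).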

2005-05-04

Sending a request with several extra headers

>>> import urllib2
>>> req = urllib2.Request("http://www.google.co.jp")
>>> req.add_header("User-agent", "python")
>>> req.add_header("Pragma", "no-cache")
>>> urlhandler = urllib2.urlopen(req)

2005-05-04

Sending a request with a specific User-Agent

>>> import urllib2
>>> req = urllib2.Request("http://www.google.co.jp")
>>> req.add_header("User-agent", "python")
>>> urlhandler = urllib2.urlopen(req)

2005-05-04

Fetching a web page

>>> import urllib2
>>> urlhandler = urllib2.urlopen("http://www.google.co.jp")
>>> html = urlhandler.read()

Read one line at a time

>>> html = urlhandler.readline()

Read all lines into a list

>>> html = urlhandler.readlines()
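
One thing worth keeping in mind: read(), readline() and readlines() all advance the same read position, so once read() has consumed the whole page, later calls return nothing. The sketch below shows this with io.BytesIO as an offline stand-in for the object returned by urlopen() (an assumption for illustration; the HTML content is made up).

```python
import io

# Stand-in for the object returned by urlopen(); real responses are
# file-like in the same way, but this sketch does not touch the network.
urlhandler = io.BytesIO(b"<html>\n<body>hello</body>\n</html>\n")

first_line = urlhandler.readline()   # one line, including its newline
rest_lines = urlhandler.readlines()  # the remaining lines as a list
leftover = urlhandler.read()         # nothing is left to read

print(first_line)
print(rest_lines)
print(leftover)
```

To use more than one of the three calls on the same page, open the URL again or keep the result of the first call.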

2005-04-20

Parsing command-line arguments (args)

The basics of parsing command-line arguments with the optparse module.
[argsOp.py]

#!/bin/env python
# -*- coding: shift_jis -*-
import sys
from optparse import OptionParser

# Define the options
parser = OptionParser()
parser.add_option("-f", "--file", dest="fileName", help="ファイル名指定", metavar="FILE")
parser.add_option("-v", action="store_true", dest="ver")
parser.add_option("-q", action="store_false", dest="ver")

# Parse the command line
(options, args) = parser.parse_args()

# Print the parsed values
print "fileName:", options.fileName
print "ver     :", options.ver
print "args    :", args

Try running it with various arguments

C:\Python23jp>argsOp.py
fileName: None
ver     : None
args    : []

C:\Python23jp>argsOp.py -h
usage: argsOp.py [options]

options:
  -h, --help           show this help message and exit
  -fFILE, --file=FILE  ファイル名指定
  -v
  -q

C:\Python23jp>argsOp.py -f test.txt
fileName: test.txt
ver     : None
args    : []

C:\Python23jp>argsOp.py -q -ftest.txt
fileName: test.txt
ver     : False
args    : []

C:\Python23jp>argsOp.py -ftest.txt -v
fileName: test.txt
ver     : True
args    : []

C:\Python23jp>argsOp.py -ftest.txt -v aaa
fileName: test.txt
ver     : True
args    : ['aaa']

C:\Python23jp>argsOp.py -v aaa -ftest.txt ccc bbb
fileName: test.txt
ver     : True
args    : ['aaa', 'ccc', 'bbb']

The -h / --help output appears to be generated automatically from the information given when the options are defined.
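
The runs above can also be reproduced without a command line by passing a list to parse_args(), which then takes the place of sys.argv[1:]. A minimal sketch in Python 3 syntax; build_parser is a hypothetical helper name, and the help text is left out for brevity.

```python
from optparse import OptionParser

def build_parser():
    """Build the same three options as argsOp.py (hypothetical helper)."""
    parser = OptionParser()
    parser.add_option("-f", "--file", dest="fileName", metavar="FILE")
    parser.add_option("-v", action="store_true", dest="ver")
    parser.add_option("-q", action="store_false", dest="ver")
    return parser

# Passing a list to parse_args() replaces sys.argv[1:].
options, args = build_parser().parse_args(
    ["-v", "aaa", "-ftest.txt", "ccc", "bbb"])
print(options.fileName, options.ver, args)

# With no arguments, both options keep their default of None.
empty_options, empty_args = build_parser().parse_args([])
print(empty_options.fileName, empty_options.ver, empty_args)
```

Because -v and -q share dest="ver", whichever of the two comes last on the command line decides the value.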

2005-04-18

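Downloading files

Having written a script that expands serial-numbered URLs, I thought I might as well write one that downloads the files from the expanded URLs, but it is not going well.
The urllib reference says

urlretrieve( url[, filename[, reporthook[, data]]])
Copy a network object denoted by a URL to a local file, if necessary.

so I tried it, but it "downloads" (?) 404 pages as well.
It does not notice when the given URL is wrong, which is a problem.
For now, I will have it print everything it returns and see whether that is enough to tell if the file at the given URL exists.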

#!/bin/env python
# -*- coding: shift_jis -*-
import urllib

def feedback(count,size, total):
    print "count :%d" % count
    print "size  :%d" % size
    print "total :%d" % total

(file, header) = urllib.urlretrieve("http://www.google.co.jp/","test.html",feedback)
print file
print header

feedback  download progress report (block count, block size, total size)
file      the file name that was specified
header    the HTTP response headers

Maybe I should decide based on the response headers?
A quick search turned up this in someone's source code from overseas.

(tmp, headers) = urllib.urlretrieve("http://www.google.co.jp/","test.html")
if str(headers).count("Content-Length") == 0:
    print "ERROR: File not found (404 error)"

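It counts the occurrences of Content-Length in headers and treats a count of 0 as a 404 error.
But with this approach, a file that does not exist still gets downloaded as long as the response headers contain Content-Length.
I tried it anyway.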

#!/bin/env python
# -*- coding: shift_jis -*-
import urllib

(tmp, headers) = urllib.urlretrieve("http://www.google.co.jp/aa/bb/cc.gif","test.gif")
print str(headers).count("Content-Length")
print headers
if str(headers).count("Content-Length") == 0:
   print "ERROR: File not found (404 error)"
else:
   print "OK"

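It did print "ERROR: File not found (404 error)".
But on closer inspection, Google's response headers contain "Content-length: 1223".
The search for "Length" does not match "length", so the script reports a 404 only because of the difference in case. No good after all.
This needs rethinking.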
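
The case problem can be seen in isolation with a hand-built headers object. This is a sketch assuming Python 3, where the headers returned by urlretrieve() are a subclass of email.message.Message; the header value is the one from the response above. Note that it only removes the case sensitivity: whether Content-Length is present still says nothing about a 404.

```python
from email.message import Message

# Hand-built stand-in for the headers object returned by urlretrieve().
headers = Message()
headers["Content-length"] = "1223"

substring_hits = str(headers).count("Content-Length")  # case-sensitive text search
has_header = "Content-Length" in headers               # case-insensitive lookup
length = headers.get("content-length")                 # also case-insensitive

print(substring_hits, has_header, length)
```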

Addendum: the urllib2 module apparently does detect a 404 properly.

>>> import urllib
>>> urllib.urlopen("http://www.google.co.jp/aa/bb/cc.gif")
<addinfourl at ... whose fp = <socket._fileobject object at ...>>
>>> import urllib2
>>> urllib2.urlopen("http://www.google.co.jp/aa/bb/cc.gif")

Traceback (most recent call last):
  File "", line 1, in -toplevel-
    urllib2.urlopen("http://www.google.co.jp/aa/bb/cc.gif")
  File "c:\Python23\lib\urllib2.py", line 129, in urlopen
    return _opener.open(url, data)
  File "c:\Python23\lib\urllib2.py", line 326, in open
    '_open', req)
  File "c:\Python23\lib\urllib2.py", line 306, in _call_chain
    result = func(*args)
  File "c:\Python23\lib\urllib2.py", line 901, in http_open
    return self.do_open(httplib.HTTP, req)
  File "c:\Python23\lib\urllib2.py", line 895, in do_open
    return self.parent.error('http', req, fp, code, msg, hdrs)
  File "c:\Python23\lib\urllib2.py", line 352, in error
    return self._call_chain(*args)
  File "c:\Python23\lib\urllib2.py", line 306, in _call_chain
    result = func(*args)
  File "c:\Python23\lib\urllib2.py", line 412, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
>>>

It raises an exception as it should, so urllib2 looks like the better choice.
* Reference
urllib.urlopen() fails to raise exception
http://mail.python.org/pipermail/python-bugs-list/2004-July/023990.html
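
Building on that, a download helper can catch the exception and skip missing files. This is a minimal sketch assuming Python 3, where urllib2.urlopen and HTTPError live in urllib.request and urllib.error. The function fetch and the two fake openers are hypothetical names; the fakes exist only so the sketch runs without a network.

```python
import io
import urllib.error
import urllib.request
from email.message import Message

def fetch(url, opener=urllib.request.urlopen):
    """Return the body of url as bytes, or None on an HTTP error status."""
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        print("skipped %s: HTTP %d" % (url, err.code))
        return None
    return response.read()

# Offline stand-ins for urlopen(), so the sketch runs without a network.
def fake_ok(url):
    return io.BytesIO(b"GIF89a")

def fake_404(url):
    raise urllib.error.HTTPError(url, 404, "Not Found", Message(), io.BytesIO(b""))

found = fetch("http://www.google.co.jp/", fake_ok)
missing = fetch("http://www.google.co.jp/aa/bb/cc.gif", fake_404)
print(found, missing)
```

In real use the opener argument would be left at its default, and the returned bytes written to a file opened in "wb" mode only when the result is not None.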

2005-04-14

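Expanding serial-numbered URLs

As regular-expression practice, I wrote a script that expands a serial-numbered URL and prints the results.
Given
http://www.doqn.ne.jp/65-32/cg/[10-20].html
it expands to
http://www.doqn.ne.jp/65-32/cg/10.html
http://www.doqn.ne.jp/65-32/cg/11.html
http://www.doqn.ne.jp/65-32/cg/12.html
and so on up to 20.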

>>> import re
>>> url = "http://www.domain.ne.jp/65-32/cg/[10-20].html"
>>> splitUrl = re.split('\[|\]', url)		# Note 1
>>> print splitUrl
['http://www.domain.ne.jp/65-32/cg/', '10-20', '.html']
>>> number = re.split('-', splitUrl[1])		# Note 2
>>> print number
['10', '20']
>>> number = [int(element) for element in number]	# Note 3
>>> print number
[10, 20]
>>> for count in range(min(number), max(number)+1):	# Note 4
	print "%s%d%s" % (splitUrl[0], count, splitUrl[2])

	
http://www.domain.ne.jp/65-32/cg/10.html
http://www.domain.ne.jp/65-32/cg/11.html
http://www.domain.ne.jp/65-32/cg/12.html
http://www.domain.ne.jp/65-32/cg/13.html
http://www.domain.ne.jp/65-32/cg/14.html
http://www.domain.ne.jp/65-32/cg/15.html
http://www.domain.ne.jp/65-32/cg/16.html
http://www.domain.ne.jp/65-32/cg/17.html
http://www.domain.ne.jp/65-32/cg/18.html
http://www.domain.ne.jp/65-32/cg/19.html
http://www.domain.ne.jp/65-32/cg/20.html
    
        
    
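▼A very perfunctory explanation.
>>> splitUrl = re.split('\[|\]', url)
・Note 1
re.split is a regular-expression function that splits a string at every match of a pattern and returns the pieces as a list.
Here the contents of url are split at "[" and "]" and assigned to splitUrl.
"[" and "]" are special characters in a regular expression, so each is escaped with a backslash (\, displayed as a yen sign in many Japanese environments).
>>> number = re.split('-', splitUrl[1])
・Note 2
'10-20' (that is, splitUrl[1]) is split again at '-' and assigned to number.
>>> number = [int(element) for element in number]
・Note 3
This is a list comprehension, a compact Python notation for building a list.
The line above means the same as
result = []
for element in number:
	result.append(int(element))
number = result
>>> for count in range(min(number), max(number)+1):	# Note 4
・Note 4
min(number) returns the smallest value in the list number.
max(number) returns the largest value in the list number.
1 is added to max(number) because range() stops just before its end value.
So the script above is effectively looping with
for count in range(10, 20+1):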
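
The four steps can be folded into one reusable function. This is a sketch in Python 3 syntax; expand_serial_url is a hypothetical name, and it assumes the URL contains at most one "[start-end]" range. It uses a single pattern with capture groups instead of two re.split calls, which is a different technique from the session above but gives the same list.

```python
import re

def expand_serial_url(url):
    """Expand one "[start-end]" range in url into a list of URLs."""
    match = re.fullmatch(r"(.*?)\[(\d+)-(\d+)\](.*)", url)
    if match is None:
        return [url]  # no range to expand
    prefix, first, last, suffix = match.groups()
    # Sorting plays the role of min()/max(), so "[20-10]" also works.
    low, high = sorted((int(first), int(last)))
    return ["%s%d%s" % (prefix, n, suffix) for n in range(low, high + 1)]

urls = expand_serial_url("http://www.domain.ne.jp/65-32/cg/[10-20].html")
for u in urls:
    print(u)
```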