21一/100
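Scraping Company Information from LinkedIn with a Python Script
LinkedIn is a well-known business-oriented social networking service outside China. While looking into LinkedIn recently, I noticed that the company-name autocomplete shown when you add profile information is very powerful: through it you can obtain what is practically a directory of large companies. So I spent a few minutes writing a small Python script and successfully collected more than 6,000 company records.
The code is as follows: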
import urllib2
import json
import MySQLdb

# Connect to the local MySQL server (MAMP socket) and use UTF-8 on the connection.
db = MySQLdb.connect(host="localhost",unix_socket='/Applications/MAMP/tmp/mysql/mysql.sock',port=3306,user="root",passwd="000000",db="linkedin_spider")
db.set_character_set("utf8")


def getData(str):
    # Ask the company typeahead endpoint for suggestions matching the keyword.
    req = urllib2.Request(
        url = 'http://www.linkedin.com/typeaheadv3/company?query='+str+'&loc=P'
    )
    result = urllib2.urlopen(req).read()
    b = json.JSONDecoder().decode(result)
    for k in b["resultList"]:
        cursor = db.cursor()
        # imageUrl is optional; store the string "null" when it is missing.
        if k.has_key("imageUrl"):
            sql = u"insert into companys values(null,\""+k['displayName'].replace('"','\\"')+"\",\""+k['subLine'].replace('"','\\"')+"\",\""+k["headLine"].replace('"','\\"')+"\",\""+k['imageUrl']+"\",\""+k['url']+"\",\""+k['id']+"\")"
        else:
            sql = u"insert into companys values(null,\""+k['displayName'].replace('"','\\"')+"\",\""+k['subLine'].replace('"','\\"')+"\",\""+k["headLine"].replace('"','\\"')+"\",\"null\",\""+k['url']+"\",\""+k['id']+"\")"
        print sql
        cursor.execute(unicode(sql))
        print k['displayName']


# Try every two-letter keyword from "aa" to "zz".
a = "abcdefghijklmnopqrstuvwxyz"
i = 0
for x in a:
    for y in a:
        print i
        i = i + 1
        getData(x+y)
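The two INSERT branches in the loop differ only in whether imageUrl is present. As a minimal sketch (written for Python 3, and not part of the original script), the row extraction could be separated from the database work, with the optional field handled by dict.get. The function name to_row and the sample payload are made up for illustration; only the field names come from the script above.

```python
import json

# Hypothetical sample response; only the field names are taken from the script above.
SAMPLE = json.dumps({
    "resultList": [
        {"displayName": "Example Corp", "subLine": "Software", "headLine": "Example Corp",
         "imageUrl": "http://example.com/logo.png", "url": "http://example.com/a", "id": "1"},
        {"displayName": 'Acme "Tools"', "subLine": "Retail", "headLine": "Acme",
         "url": "http://example.com/b", "id": "2"},
    ]
})


def to_row(item):
    """Map one typeahead result to a tuple in the column order used by the INSERT."""
    return (
        item["displayName"],
        item["subLine"],
        item["headLine"],
        item.get("imageUrl", "null"),  # the original script stores the string "null" when missing
        item["url"],
        item["id"],
    )


rows = [to_row(item) for item in json.loads(SAMPLE)["resultList"]]
for row in rows:
    print(row)
```

Keeping the parsing in its own function makes it possible to test it without a network connection or a database.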
MySQL table structure:
SET NAMES utf8;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for `companys`
-- ----------------------------
DROP TABLE IF EXISTS `companys`;
CREATE TABLE `companys` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `displayName` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `subLine` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `headLine` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `imageUrl` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `url` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `link_id` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
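The keywords are very simple: the script loops from aa to zz, and each keyword returns 10 results. Ignoring duplicates, the program can therefore fetch at most 26*26*10 = 6760 company records. The actual run inserted 6518 rows, 34 of which were duplicates, which shows just how many companies LinkedIn has.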
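The arithmetic above can be checked with a short sketch (Python 3, illustrative only). It generates the same aa to zz keywords with itertools.product and computes the upper bound, assuming, as stated above, 10 results per keyword. The names keywords, RESULTS_PER_KEYWORD and dedupe are hypothetical, and the de-duplication by id is an addition that the original script does not perform.

```python
from itertools import product
from string import ascii_lowercase

RESULTS_PER_KEYWORD = 10  # per the post: each keyword returns 10 suggestions

# All two-letter keywords from "aa" to "zz", in the same order as the nested loops.
keywords = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
upper_bound = len(keywords) * RESULTS_PER_KEYWORD


def dedupe(rows, key_index=-1):
    """Keep the first row seen for each id (assumed to be the last tuple element)."""
    seen = set()
    unique = []
    for row in rows:
        if row[key_index] not in seen:
            seen.add(row[key_index])
            unique.append(row)
    return unique


print(len(keywords), upper_bound)  # 676 6760
```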
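A few issues to note:
- LinkedIn company descriptions contain non-ASCII European characters (such as ł in the traceback below), which by default cannot be inserted with the SQL statement and raise an error: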
Traceback (most recent call last):
  File "main.py", line 29, in <module>
    getData(x+y)
  File "main.py", line 21, in getData
    cursor.execute(sql)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/cursors.py", line 149, in execute
    query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0142' in position 89: ordinal not in range(256)
smbp:Linkedin_Spider scourgen$ python2.6 main.py
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated
  from sets import ImmutableSet
The fix is to build the SQL statement as a unicode string so that it can be sent as UTF-8, and to set the database connection's character set to utf8 as well:
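# Set the connection character set:
db.set_character_set("utf8")
# The u"" prefix makes the string unicode
sql = u"insert ..."
# Convert once more, just in case
cursor.execute(unicode(sql))
- Some company descriptions contain quotation marks, which breaks construction of the SQL statement
replace('"','\\"') replaces " with \" so that the SQL statement can still be built:
sql = "insert ...."+k['subLine'].replace('"','\\"')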
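Both issues can be reproduced, and sidestepped, in a few lines. The sketch below is written for Python 3 and uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs anywhere; it is not the fix used in this post. It first shows that ł (u'\u0142') cannot be encoded as Latin-1 but can be encoded as UTF-8, then inserts a value containing both ł and double quotes through a parameterized query, which leaves the quoting to the driver. The sample company name and the reduced two-column table are made up for illustration.

```python
import sqlite3

name = 'Przyk\u0142ad "Sp. z o.o."'  # made-up sample: contains ł and double quotes

# Issue 1: Latin-1 has no code point for ł, while UTF-8 can encode it.
try:
    name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
utf8_bytes = name.encode("utf-8")

# Issue 2: let the driver handle quoting via a placeholder instead of replace().
conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
conn.execute("CREATE TABLE companys (id INTEGER PRIMARY KEY, displayName TEXT)")
conn.execute("INSERT INTO companys (displayName) VALUES (?)", (name,))
stored = conn.execute("SELECT displayName FROM companys").fetchone()[0]

print(latin1_ok, stored == name)  # False True
```

With MySQLdb the placeholder is %s rather than ?, and the values are passed as the second argument to cursor.execute, so the manual replace() calls would no longer be needed.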