https://blog.csdn.net/Zdelta/article/details/104310984
https://blog.csdn.net/weixin_39401430/article/details/122396832
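Background
This is a crawler project. The target site serves CAPTCHAs frequently, so I put the crawler behind an IP proxy.
However:
1. Creating and disposing a C# HttpClient object per request does not release the underlying sockets promptly, which can exhaust the system's sockets, and memory usage keeps climbing!
2. A single global static HttpClient avoids that, but its handler is initialized with one fixed proxy address, so the proxy cannot be switched dynamically on the local side.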
// A new object is created for every request to the target address.
// Calling Dispose() in code does not release the underlying system socket.
HttpClient httpClient = new HttpClient(new HttpClientHandler
{
    AutomaticDecompression = DecompressionMethods.GZip,
    UseCookies = true,
    Proxy = new WebProxy(new Uri(GetProxy())) { Credentials = null, UseDefaultCredentials = false },
    // The handler's proxy cannot be changed once the client has been instantiated and used
    UseProxy = true,
    AllowAutoRedirect = true,
    ClientCertificateOptions = ClientCertificateOption.Automatic,
    // Accept every server certificate (ignores HTTPS certificate errors)
    ServerCertificateCustomValidationCallback = (message, cert, chain, error) => true
});
httpClient.Timeout = new TimeSpan(0, 0, 3);
httpClient.BaseAddress = uri;
var result = httpClient.GetAsync(uri).Result.Content.ReadAsStringAsync().Result;
So I decided to call Python's requests library instead.
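What makes requests convenient here is that the proxy is an argument of each call rather than a property fixed when the client is built. A minimal sketch of that idea follows; the helper names `build_proxies`, `proxy_rotation` and `fetch` and the proxy addresses are hypothetical, and requests is imported inside `fetch` only so the helpers can be tried without it installed.

```python
import itertools


def build_proxies(proxy_address):
    """Map both URL schemes to one HTTP proxy given as "host:port"."""
    proxy_url = "http://" + proxy_address
    return {"http": proxy_url, "https": proxy_url}


def proxy_rotation(addresses):
    """Cycle endlessly through a list of proxy addresses."""
    return itertools.cycle(addresses)


def fetch(url, proxy_address, timeout=3):
    """Fetch one URL through the given proxy and return the body text."""
    import requests  # third-party; imported lazily so the helpers above run without it

    response = requests.get(url, proxies=build_proxies(proxy_address), timeout=timeout)
    return response.text


if __name__ == "__main__":
    # Placeholder addresses (documentation range); replace with real proxies.
    rotation = proxy_rotation(["192.0.2.10:8080", "192.0.2.11:8080"])
    for _ in range(3):
        print(build_proxies(next(rotation)))
```

Each call to `fetch` can take `next(rotation)` as its proxy, so consecutive requests leave through different addresses without rebuilding any client object.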
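Implementation
IronPython can call Python directly from Visual Studio, but it does not support third-party libraries, so I chose to invoke the script from the command line instead.
1. Python: request.py
The script prints the response to standard output: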
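import ast
import sys
import time

import requests

FAIL_MESSAGE = "Request failed"


def get_proxy_api():
    # Stand-in for the original Getproxy(), which the post does not show:
    # return the URL of your proxy-pool API here.
    raise NotImplementedError("set the proxy API URL")


def send_request(**kwargs):
    url = kwargs.get('url')
    if not url:
        raise ValueError("Invalid url")
    headers = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
    }
    headers.update(kwargs.get('headers', dict()))
    timeout = kwargs.get('timeout', 3)
    # Resolve the default lazily so a caller-supplied proxy_api skips the lookup
    proxy_api = kwargs.get('proxy_api') or get_proxy_api()
    # Marker text of the target site's CAPTCHA page ("验证码" means "CAPTCHA")
    verify_text = kwargs.get('verify_text', '验证码')
    # Ask the proxy pool for a fresh address before every request
    proxy_address = requests.get(proxy_api).json()[0]['proxy_address']
    # print(f"proxy_address:{proxy_address}")
    time.sleep(0.1)
    try:
        # print(f"headers:{headers}")
        proxies = {"http": "http://" + proxy_address, "https": "http://" + proxy_address}
        response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
        if verify_text in response.text:
            # CAPTCHA page: try again, which also picks a new proxy
            send_request(**kwargs)
        else:
            print(response.text)
    except Exception as exception:
        print(FAIL_MESSAGE, exception)


if __name__ == '__main__':
    # argv[1] is a Python dict literal, e.g. "{'url':'http://example.com/'}"
    kwargs = ast.literal_eval(sys.argv[1])
    send_request(**kwargs)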
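The script retries by calling itself whenever the CAPTCHA marker shows up, with no upper bound on the number of attempts. A bounded loop is one possible alternative; this is only a sketch, and `fetch_until_clean`, its `fetch` parameter and `max_attempts` are hypothetical names, not part of the original script.

```python
def fetch_until_clean(fetch, verify_text="验证码", max_attempts=5):
    """Call fetch() until the result lacks verify_text or attempts run out.

    fetch is any zero-argument callable that returns page text; in the
    script above it would wrap the proxy lookup plus requests.get.
    Returns the page text, or None if every attempt hit the CAPTCHA marker.
    """
    for _ in range(max_attempts):
        text = fetch()
        if verify_text not in text:
            return text
    return None


if __name__ == "__main__":
    # Simulated responses: two CAPTCHA pages, then a real page.
    pages = iter(["请输入验证码", "请输入验证码", "<html>ok</html>"])
    print(fetch_until_clean(lambda: next(pages)))
```

Returning None after the last attempt lets the caller decide whether to print a failure message or give up silently.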
2. Read the output from C#
private static string RequestByPython(Uri uri)
{
    var cmdArgs = "{'url':'" + uri + "'}";
    // Path of the Python script
    string path = Directory.GetCurrentDirectory() + PythonRequestFile;
    using (Process process = new Process())
    {
        // Full path of the local python.exe
        process.StartInfo.FileName = PythonPath;
        // Command-line convention: <script path> <dict literal>,
        // each wrapped in double quotes so that spaces do not split them
        process.StartInfo.Arguments = "\"" + path + "\" \"" + cmdArgs + "\"";
        process.StartInfo.UseShellExecute = false;
        process.StartInfo.RedirectStandardOutput = true;
        process.StartInfo.RedirectStandardInput = true;
        process.StartInfo.RedirectStandardError = true;
        process.StartInfo.CreateNoWindow = true;
        process.Start();
        // Drain stderr asynchronously so a full stderr pipe cannot block the child
        process.ErrorDataReceived += (sender, e) => { };
        process.BeginErrorReadLine();
        StringBuilder stringBuilder = new StringBuilder();
        StreamReader streamReader = process.StandardOutput;
        while (!streamReader.EndOfStream)
        {
            // ReadLine() strips the line terminator, so lines are joined without newlines
            stringBuilder.Append(streamReader.ReadLine());
        }
        process.WaitForExit();
        return stringBuilder.ToString();
    }
}
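The calling convention (one dict literal as the only argument, result on standard output) can be exercised without the C# host. The sketch below does so from Python with subprocess; `CHILD_CODE` is a stand-in for request.py that only echoes the url, and `call_child` is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in for request.py: parse the dict literal from argv[1] and echo the url.
CHILD_CODE = (
    "import ast, sys\n"
    "kwargs = ast.literal_eval(sys.argv[1])\n"
    "print(kwargs['url'])\n"
)


def call_child(kwargs, python_path=sys.executable, timeout=30):
    """Run the stand-in script with repr(kwargs) as its only argument."""
    completed = subprocess.run(
        [python_path, "-c", CHILD_CODE, repr(kwargs)],
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode the output to str
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    print(call_child({"url": "http://example.com/"}))
```

Passing the arguments as a list avoids manual quoting, and `repr` of a dict of strings is a literal that `ast.literal_eval` can read back. To try the real script, replace `"-c", CHILD_CODE` with the path to request.py.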
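Results
1. Memory usage rises and falls but stays within a stable range. Perfect!
2. The page source read back from standard output differs in format from the original page, because ReadLine() drops the line terminators and the lines are appended without separators. Either reformat the HTML or adjust the regular expressions (for example, switch to lazy matching).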
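A small sketch of why the regular expressions may need to change once the line breaks are gone: a greedy pattern that worked line by line now spans the whole collapsed string, while a lazy pattern still isolates each item. The sample HTML is made up for illustration.

```python
import re

page = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>\n"

# Joining the lines with nothing in between mimics the ReadLine()/Append() loop.
collapsed = "".join(page.splitlines())

# On the original page "." stops at each newline, so greedy matching is harmless.
greedy_original = re.findall(r"<li>(.*)</li>", page)
# On the collapsed text the greedy group runs to the last closing tag.
greedy_collapsed = re.findall(r"<li>(.*)</li>", collapsed)
# The lazy quantifier stops at the first closing tag instead.
lazy_collapsed = re.findall(r"<li>(.*?)</li>", collapsed)

if __name__ == "__main__":
    print(greedy_original)   # ['a', 'b']
    print(greedy_collapsed)  # ['a</li><li>b']
    print(lazy_collapsed)    # ['a', 'b']
```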
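Note:
Whether you use Python or C#, HTTPS certificate errors have to be ignored when going through the proxy! In C# that is the ServerCertificateCustomValidationCallback that always returns true; in requests the equivalent is passing verify=False to the request call.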
Last updated: 2022-12-12 08:02 Author: admin