Regular expression to find URLs within a string

Question 1

Does anyone know of a regular expression I could use to find URLs within a string? I have found many regular expressions on Google for determining whether an entire string is a URL, but I need to be able to search an entire string for URLs. For example, I would like to be able to find www.google.com and http://yahoo.com in the following string:

Hello www.google.com World http://yahoo.com

I am not looking for specific URLs in the string. I am looking for ALL of the URLs in the string, which is why I need a regex.

Question 2

This is the one I use:

(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

It works for me, and it should work for you too.
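As a minimal sketch of how a pattern like this could be applied in Python (the sample text is the one from the question; the names URL_RE, text and urls are illustrative assumptions): because the pattern contains capturing groups, re.findall would return tuples of groups, so the sketch uses re.finditer and takes the whole match instead.

```python
import re

# Pattern copied from the answer above; it requires an explicit scheme.
URL_RE = re.compile(
    r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))"
    r"([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

text = "Hello www.google.com World http://yahoo.com"

# group(0) is the full match, regardless of the capturing groups inside.
urls = [m.group(0) for m in URL_RE.finditer(text)]
print(urls)
```

Note that www.google.com is not picked up here, since it has no http, https or ftp prefix.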

Question 3

No regex is perfect for this use. I found a pretty solid one here:

/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])/igm

Some differences / advantages compared to the others posted here:

It does not match email addresses
It does match localhost:12345
It won't detect something like moo.com without http or www

See here for examples
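The claims above can be checked with a small Python sketch. It assumes the JavaScript-style /.../igm wrapper is dropped and re.IGNORECASE stands in for the i flag; the sample sentence and the names URL_RE, sample and found are made up for illustration.

```python
import re

# Pattern from the answer above, without the /.../igm delimiters;
# re.IGNORECASE replaces the "i" flag.
URL_RE = re.compile(
    r"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])",
    re.IGNORECASE,
)

# Hypothetical sample: an email address, a localhost URL and a bare domain.
sample = "Mail me@example.com, open http://localhost:12345 or try moo.com"

# The pattern has no capturing groups, so findall returns whole matches.
found = URL_RE.findall(sample)
print(found)
```

In this sample only the localhost URL is returned; the email address and the bare moo.com are skipped because neither starts with a scheme, www. or ftp.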

Question 4

import re

text = """The link of this question: /programming/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
The code below catches all urls in text and returns urls in list."""

# Raw string so the backslashes reach the regex engine unchanged
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)

Output:

[
    '/programming/6038061/regular-expression-to-find-urls-within-a-string', 
    'www.google.com', 
    'facebook.com',
    'http://test.com/method?param=wasd'
]

Question 5

None of the solutions provided here solved the problems / use cases I had.

What I provide here is the best I have found / made so far. I will update it when I find new edge cases that it does not handle.

\b
  #Word cannot begin with special characters
  (?<![@.,%&#-])
  #Protocols are optional, but take them with us if they are present
  (?<protocol>\w{2,10}:\/\/)?
  #Domains have to be of a length of 1 chars or greater
  ((?:\w|\&\#\d{1,5};)[.-]?)+
  #The domain ending has to be between 2 to 15 characters
  (\.([a-z]{2,15})
       #If no domain ending we want a port, only if a protocol is specified
       |(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])

Question 6

I think this regex pattern handles precisely what you want

/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/

and this is a snippet example to extract URLs:

// The Regular Expression filter
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

// The Text you want to filter for urls
$text = "The text you want  /programming/6038061/regular-expression-to-find-urls-within-a-string to filter goes here.";

// Collect every URL found in the text into $matches
preg_match_all($reg_exUrl, $text, $matches);
var_dump($matches);

Question 7

None of the answers above match Unicode characters in the URL, for example: http://google.com?query=đức+filan+đã+search

As a solution, this one should work:

(ftp:\/\/|www\.|https?:\/\/){1}[a-zA-Z0-9\u00a1-\uffff0-]{2,}\.[a-zA-Z0-9\u00a1-\uffff0-]{2,}(\S*)
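A short Python sketch of this idea, assuming the character class is meant to cover ASCII letters and digits plus the range U+00A1 to U+FFFF (written as \u00a1-\uffff). The surrounding sentence and the names UNICODE_URL_RE, text and matches are illustrative only.

```python
import re

# Assumed intent of the class: ASCII letters/digits plus U+00A1..U+FFFF.
UNICODE_URL_RE = re.compile(
    r"(?:ftp://|www\.|https?://)"
    r"[a-zA-Z0-9\u00a1-\uffff0-]{2,}\.[a-zA-Z0-9\u00a1-\uffff0-]{2,}(\S*)"
)

text = "Try http://google.com?query=đức+filan+đã+search today"

# finditer + group(0) returns the whole URL rather than the (\S*) group.
matches = [m.group(0) for m in UNICODE_URL_RE.finditer(text)]
print(matches)
```

The trailing (\S*) is what picks up the non-ASCII query string here, since it accepts any run of non-whitespace characters.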

Question 8

If you have to be strict about selecting links, I would go for:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

For more info, read this:

An Improved Liberal, Accurate Regex Pattern for Matching URLs

Question 9

I found this one, which covers most sample links, including subdirectory parts.

The regex is:

(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))?

Question 10

If you have the URL pattern, you should be able to search for it in your string. Just make sure that the pattern does not have ^ and $ marking the beginning and end of the URL string. So if P is the pattern for a URL, look for matches for P.
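To illustrate the point about anchors, here is a minimal Python sketch. The pattern ANCHORED and the helper strip_anchors are hypothetical and deliberately naive; any whole-string URL validator could take their place.

```python
import re

# Hypothetical whole-string validator; any anchored URL pattern would do.
ANCHORED = r"^https?://\S+$"


def strip_anchors(pattern: str) -> str:
    """Naively drop a leading ^ and a trailing unescaped $."""
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        pattern = pattern[:-1]
    return pattern


text = "Hello http://yahoo.com World https://example.org/path"

# With anchors, the whole string would have to be a single URL.
anchored_hits = re.findall(ANCHORED, text)
# Without anchors, the same pattern finds URLs anywhere in the string.
search_hits = re.findall(strip_anchors(ANCHORED), text)
print(anchored_hits, search_hits)
```

The helper only handles anchors at the very ends of the pattern; anchors buried inside groups or alternations would need to be removed by hand.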

Question 11

I used the regular expression below to find URLs in a string:

/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/

Question 12

Here is a slightly more optimized regexp:

(?:(?:(https?|ftp|file):\/\/|www\.|ftp\.)|([\w\-_]+(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&:\/~\+#]*[A-Z\-\@?^=%&\/~\+#]){2,6}?

Here is a test with data: https://regex101.com/r/sFzzpY/6

Question 13

Short and simple. I have not tested it in JavaScript code yet, but it looks like it will work:

((http|ftp|https):\/\/)?(([\w.-]*)\.([\w]*))

Code on regex101.com

Question 14

I use this regex:

/((\w+:\/\/\S+)|(\w+[\.:]\w+\S+))[^\s,\.]/ig

It works fine for many URLs, such as: http://google.com, https://dev-site.io:8080/home?val=1&count=100, www.regexr.com, localhost:8080/path, ...

Question 15

This is a slight improvement on Rajeev's answer (depending on what you need):

([\w\-_]+(?:(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&:/~\+#]*[A-Z\-\@?^=%&/~\+#]){2,6}?

See here for an example of what it does and does not match.

I got rid of the check for "http" etc., since I wanted to catch URLs without it. I added slightly to the regex to catch some obfuscated URLs (i.e. where the user writes [dot] instead of "."). Finally, I replaced "\w" with "A-Z" and "{2,3}" to reduce false positives like v2.0 and "moo.0dd".

Any improvements on this are welcome.

Question 16

This is probably too simplistic, but a working method might be:

[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+

I tested it in Python, and as long as the string being parsed has a space before and after the URL and none inside the URL (which I have never seen before), it should be fine.

Here is an online ideone demonstrating it

However, here are some benefits of using it:

It recognizes file: and localhost as well as IP addresses
It will never match without them
It does not mind unusual characters such as # or - (see the URL of this post)
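One thing worth checking with a pattern of this shape is that the leading [...] is a character class, so it accepts any run of those letters rather than the scheme names themselves. The Python sketch below compares it with a hypothetical variant, GROUP_RE, that uses a group of whole scheme names; the sample text and the made-up tophat:// scheme are assumptions for illustration.

```python
import re

# Pattern from the answer above: the leading [...] is a character class.
CLASS_RE = re.compile(r"[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+")

# Hypothetical variant using a non-capturing group of whole scheme names.
GROUP_RE = re.compile(r"(?:https?|ftp|file)://\S+")

text = "open http://localhost:8080/x or tophat://example.test now"

class_hits = CLASS_RE.findall(text)
group_hits = GROUP_RE.findall(text)
print(class_hits)
print(group_hits)
```

In this sample the character-class version also accepts tophat://, because every letter of "tophat" appears somewhere in the class, while the group version only accepts the listed schemes.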

Question 17

The regex provided by @JustinLevene did not have proper escape sequences on the backslashes. It has now been updated to be correct, with a condition added to match the FTP protocol as well. It will match all URLs with or without a protocol, and without "www."

Code: ^((http|ftp|https):\/\/)?([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?

Example: https://regex101.com/r/uQ9aL4/65

Question 18

IMPROVED

Detects URLs like these:

https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
http://www.site.com:8008

Regex:

/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/gm

Question 19

I wrote this myself:

let regex = /([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#]?[\w-]+)*\/?/gm

It works on ALL of the following domains:

https://www.facebook.com
https://app-1.number123.com
http://facebook.com
ftp://facebook.com
http://localhost:3000
localhost:3000/
unitedkingdomurl.co.uk
this.is.a.url.com/its/still=going?wow
shop.facebook.org
app.number123.com
app1.number123.com
app-1.numbEr123.com
app.dashes-dash.com
www.facebook.com
facebook.com
fb.com/hello_123
fb.com/hel-lo
fb.com/hello/goodbye
fb.com/hello/goodbye?okay
fb.com/hello/goodbye?okay=alright
Hello www.google.com World http://yahoo.com
https://www.google.com.tr/admin/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
http://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
ftp://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
drive.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255

You can see how it performs here on regex101 and adjust it as needed.

Question 20

I am using the logic of finding text between two dots or periods.

The regex below works fine with Python:

(?<=\.)[^}]*(?=\.)
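A quick Python sketch of what this lookbehind/lookahead pattern returns on a simple host name. The input string and the names BETWEEN_DOTS and parts are assumptions for illustration; the pattern extracts the text between dots rather than a complete URL.

```python
import re

# (?<=\.) requires a dot just before the match, (?=\.) a dot just after it.
BETWEEN_DOTS = re.compile(r"(?<=\.)[^}]*(?=\.)")

parts = BETWEEN_DOTS.findall("www.google.com")
print(parts)
```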

Question 21

Matching a URL in a text should not be so complex

(?:(?:(?:ftp|http)[s]*:\/\/|www\.)[^\.]+\.[^ \n]+)

https://regex101.com/r/wewpP1/2

Question 22

I used this

^(https?:\\/\\/([a-zA-z0-9]+)(\\.[a-zA-z0-9]+)(\\.[a-zA-z0-9\\/\\=\\-\\_\\?]+)?)$

Question 23

(?:vnc|s3|ssh|scp|sftp|ftp|http|https)\:\/\/[\w\.]+(?:\:?\d{0,5})|(?:mailto|)\:[\w\.]+\@[\w\.]+

If you want an explanation of each part, try regexr[.]com, where you will get a great explanation of every character.

This is split by "|", or "OR", because not all usable URIs have "//", so this is where you can create a list of the schemes or conditions you are interested in matching.
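As a rough sketch of the "list of schemes" idea in Python, the alternation can be built from a plain list. The list SCHEMES mirrors the schemes in the pattern above; the host part is simplified, the mailto branch is left out, and the sample text and the name URI_RE are assumptions.

```python
import re

# Schemes taken from the pattern above; extend the list as needed.
SCHEMES = ["vnc", "s3", "ssh", "scp", "sftp", "ftp", "http", "https"]

# Longest names first, so that "https" is tried before "http".
alternation = "|".join(
    re.escape(s) for s in sorted(SCHEMES, key=len, reverse=True)
)

# Simplified host part: word characters and dots, plus an optional :port.
URI_RE = re.compile(rf"(?:{alternation})://[\w.]+(?::\d{{1,5}})?")

text = "Connect with ssh://host.example:22 or browse https://example.com"
uris = URI_RE.findall(text)
print(uris)
```

Adding or removing a scheme then only means editing the list, not the pattern itself.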

Question 24

I used the C# Uri class, and it works well with IP addresses and localhost

public static bool CheckURLIsValid(string url)
{
    Uri returnURL;

    // Accept only absolute URIs whose scheme is http or https
    return Uri.TryCreate(url, UriKind.Absolute, out returnURL)
        && (returnURL.Scheme == Uri.UriSchemeHttp || returnURL.Scheme == Uri.UriSchemeHttps);
}
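For comparison, a similar parser-based check can be sketched in Python with urllib.parse.urlsplit. This is an analogue of the approach, not a port of the Uri class: urlsplit is lenient, so the sketch only checks the scheme and the presence of a host part. The function name is_http_url, the whitespace-based tokenizing and the sample text are assumptions.

```python
from urllib.parse import urlsplit


def is_http_url(candidate: str) -> bool:
    """Return True for absolute http/https URLs that have a host part."""
    try:
        parts = urlsplit(candidate)
    except ValueError:  # raised for malformed input such as bad IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


text = "Hello www.google.com World http://yahoo.com http://127.0.0.1 http://localhost:8080"

# Split on whitespace and keep the tokens that parse as http/https URLs.
urls = [token for token in text.split() if is_http_url(token)]
print(urls)
```

As with the C# version, a bare www.google.com is rejected because it is not an absolute URL.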

Question 25

I liked Stefan Henze's solution, but it would pick up 34.56. It is too general, and I have unparsed HTML. There are 4 anchors for a URL:

www,

http:\ (and co),

. followed by letters and then /,

or letters . and one of these: https://ftp.isc.org/www/survey/reports/current/bynum.txt .

I used a lot of information from this thread. Thank you all.

"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"

The above solves pretty much everything except a string like "eurls:www.google.com,facebook.com,http://test.com/", which it returns as a single string. Tbh idk why I added gopher etc. R code as proof:

library(stringr) # provides str_extract_all

if(T){
  wierdurl<-vector()
  wierdurl[1]<-"https://JP納豆.例.jp/dir1/納豆 "
  wierdurl[2]<-"xn--jp-cd2fp15c.xn--fsq.jp "
  wierdurl[3]<-"http://52.221.161.242/2018/11/23/biofourmis-collab"
  wierdurl[4]<-"https://12000.org/ "
  wierdurl[5]<-"  https://vg-1.com/?page_id=1002 "
  wierdurl[6]<-"https://3dnews.ru/822878"
  wierdurl[7]<-"The link of this question: /programming/6038061/regular-expression-to-find-urls-within-a-string
  Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
  The code below catches all urls in text and returns urls in list. "
  wierdurl[8]<-"Thelinkofthisquestion:/programming/6038061/regular-expression-to-find-urls-within-a-string
  Alsotherearesomeurls:www.google.com,facebook.com,http://test.com/method?param=wasd
  Thecodebelowcatchesallurlsintextandreturnsurlsinlist. "
  wierdurl[9]<-"Thelinkofthisquestion:/programming/6038061/regular-expression-to-find-urls-within-a-stringAlsotherearesomeurlsZwww.google.com,facebook.com,http://test.com/method?param=wasdThecodebelowcatchesallurlsintextandreturnsurlsinlist."
  wierdurl[10]<-"1facebook.com/1res"
  wierdurl[11]<-"1facebook.com/1res/wat.txt"
  wierdurl[12]<-"www.e "
  wierdurl[13]<-"is this the file.txt i need"
  wierdurl[14]<-"xn--jp-cd2fp15c.xn--fsq.jpinspiredby "
  wierdurl[15]<-"[xn--jp-cd2fp15c.xn--fsq.jp/inspiredby "
  wierdurl[16]<-"xnto--jpto-cd2fp15c.xnto--fsq.jpinspiredby "
  wierdurl[17]<-"fsety--fwdvg-gertu56.ffuoiw--ffwsx.3dinspiredby "
  wierdurl[18]<-"://3dnews.ru/822878 "
  wierdurl[19]<-" http://mywebsite.com/msn.co.uk "
  wierdurl[20]<-" 2.0http://www.abe.hip "
  wierdurl[21]<-"www.abe.hip"
  wierdurl[22]<-"hardware/software/data"
  regexstring<-vector()
  regexstring[2]<-"(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[3]<-"/(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#\\/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#\\/%=~_|$])/igm"
  regexstring[4]<-"[a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]?"
  regexstring[5]<-"((http|ftp|https)\\:\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[6]<-"((http|ftp|https):\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?"
  regexstring[7]<-"(http|ftp|https)(:\\/\\/)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[8]<-"(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#/%=~_|$])"
  regexstring[10]<-"((http[s]?|ftp):\\/)?\\/?([^:\\/\\s]+)((\\/\\w+)*\\/)([\\w\\-\\.]+[^#?\\s]+)(.*)?(#[\\w\\-]+)?"
  regexstring[12]<-"http[s:/]+[[:alnum:]./]+"
  regexstring[9]<-"http[s:/]+[[:alnum:]./]+" #in DLpages 230
  regexstring[1]<-"[[:alnum:]-]+?[.][:alnum:]+?(?=[/ :])" #in link_graphs 50
  regexstring[13]<-"^(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?$"
  regexstring[14]<-"(((((http|ftp|https):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]+(?:(?:\\.[\\w_-]+)*))((\\.((org|com|net|edu|gov|mil|int)|(([:alpha:]{2})(?=[, ]))))|([\\/]([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
  regexstring[15]<-"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
    }

for(i in wierdurl){#c(7,22)
  for(c in regexstring[c(15)]) {
    print(paste(i,which(regexstring==c)))
    print(str_extract_all(i,c))
  }
}

Question 26

This is the best one.

NSString *urlRegex = @"(http|ftp|https|www|gopher|telnet|file)(://|.)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?";

Question 27

This is the simplest one. It works fine for me.

%(http|ftp|https|www)(://|\.)[A-Za-z0-9-_\.]*(\.)[a-z]*%

Question 28

This one is just simple.

Use this pattern: \b((ftp|https?)://)?([\w-\.]+\.(com|net|org|gov|mil|int|edu|info|me)|(\d+\.\d+\.\d+\.\d+))(:\d+)?(\/[\w-\/]*(\?\w*(=\w+)*[&\w-=]*)*(#[\w-]+)*)?

It matches any link that contains:

Allowed Protocols: http, https and ftp

Allowed Domains: *.com, *.net, *.org, *.gov, *.mil, *.int, *.edu, *.info and *.me OR IP

Allowed Ports: true

Allowed Parameters: true

Allowed Hashes: true